2016-08-24 123 views

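Constructing Zipf Distribution with matplotlib, FITTED-LINE

I have a list of paragraphs, and I want to plot a Zipf distribution over all of them combined.

My code is below: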

from collections import Counter

import matplotlib.pyplot as plt
import numpy as np

# targeted_paragraphs is assumed to be defined earlier as a list of strings.
# Join all paragraphs into one string and count every token once.
combined_text = " ".join(targeted_paragraphs)
frequency = Counter(combined_text.split())
tokens = list(frequency.keys())
counts = np.array(list(frequency.values()))

# Sort the counts in descending order so that rank 1 is the most frequent token
ranks = np.arange(1, len(counts) + 1)
indices = np.argsort(-counts)
frequencies = counts[indices]

plt.loglog(ranks, frequencies, marker=".")
plt.title("Zipf plot for Combined Article Paragraphs")
plt.xlabel("Frequency Rank of Token")
plt.ylabel("Absolute Frequency of Token")
plt.grid(True)

# Label about 20 logarithmically spaced ranks with their tokens
for n in np.logspace(-0.5, np.log10(len(counts) - 1), 20).astype(int):
    plt.text(ranks[n], frequencies[n], " " + tokens[indices[n]],
             verticalalignment="bottom",
             horizontalalignment="left")
plt.show()

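My goal is to draw a "fitted line" on this plot and assign its values to a variable. However, I do not know how to add that. Any help with both of these issues would be greatly appreciated.

Answer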


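I know this question was asked a while ago. However, I came across a possible solution to this problem on the scipy site.
I thought I would post it here in case anyone else needs it.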

I did not have the paragraph data, so here is a whipped-up dict called frequency whose values stand in for the paragraph occurrence counts.

Then we get its values and convert them to a numpy array. Next we define the Zipf distribution parameter, which has to be > 1.

Finally, we display the histogram of the samples, along with the probability density function.
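For reference, here is a small sketch of the probability function mentioned above. For the Zipf distribution it is p(k) = k**(-a) / zeta(a), where zeta is the Riemann zeta function; note that scipy.special.zetac(a) returns zeta(a) - 1, so it is not the normalizing constant. The variable names k and pmf are only illustrative, and the comparison with scipy.stats.zipf is a sanity check rather than part of the original recipe.

```python
import numpy as np
from scipy import special, stats

a = 2.0  # Zipf exponent; must be > 1 so that the normalizing sum converges
k = np.arange(1, 50, dtype=float)  # ranks 1..49

# Zipf probability function: p(k) = k**(-a) / zeta(a)
# special.zeta(a, 1) is the Hurwitz zeta with q=1, i.e. the Riemann zeta function
pmf = k ** (-a) / special.zeta(a, 1)

# Sanity check against SciPy's built-in Zipf distribution
matches_scipy = np.allclose(pmf, stats.zipf.pmf(k, a))
```

If the check holds, matches_scipy is True and pmf can be plotted directly against k.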

Working code:

import random

import matplotlib.pyplot as plt
import numpy as np
from scipy import special

# Generate a sample dict with random values to simulate paragraph data
frequency = {i: random.randint(1, 50) for i in range(50)}

counts = frequency.values()
tokens = frequency.keys()

# Convert the counts to a numpy array (list() is needed for Python 3 dict views)
s = np.array(list(counts))

# Define the Zipf distribution parameter. Has to be > 1
a = 2.

# Display the histogram of the samples,
# along with the probability density function.
# density=True replaces the normed argument removed in newer matplotlib releases.
count, bins, ignored = plt.hist(s, 50, density=True)
x = np.arange(1., 50.)
# zeta(a) is the normalizing constant; zetac(a) would be zeta(a) - 1
y = x**(-a) / special.zeta(a, 1)
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.title("Zipf plot for Combined Article Paragraphs")
plt.xlabel("Frequency Rank of Token")
plt.ylabel("Absolute Frequency of Token")
plt.show()

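Since the question asks for a fitted line whose values are stored in a variable, here is one possible sketch. It is not taken from the scipy example above: it swaps in an ordinary least-squares fit of a straight line in log-log space using np.polyfit, which is a simple approach but not the only way to estimate a Zipf exponent. The function name fit_zipf_line and the example data are hypothetical.

```python
import numpy as np


def fit_zipf_line(frequencies):
    """Least-squares fit of log10(frequency) = slope * log10(rank) + intercept.

    `frequencies` must be positive counts sorted in descending order.
    Returns (slope, intercept, fitted), where `fitted` holds the fitted
    frequency for each rank 1..len(frequencies).
    """
    frequencies = np.asarray(frequencies, dtype=float)
    ranks = np.arange(1, len(frequencies) + 1)
    log_ranks = np.log10(ranks)
    # polyfit with degree 1 returns the coefficients [slope, intercept]
    slope, intercept = np.polyfit(log_ranks, np.log10(frequencies), 1)
    # Transform the straight line back from log space to frequency space
    fitted = 10 ** (intercept + slope * log_ranks)
    return slope, intercept, fitted


# Hypothetical stand-in data: counts that follow an exact power law 1000 / rank
example_frequencies = 1000.0 / np.arange(1, 101)
slope, intercept, fitted_line = fit_zipf_line(example_frequencies)
```

With the sorted frequencies array from the question in place of example_frequencies, fitted_line holds the values of the line, and something like plt.loglog(ranks, fitted_line, color="r") should draw it on top of the existing plot.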