2016-11-26 9 views

I am trying to find the top 50 words that appear across three Shakespeare texts, and the proportion of each word's occurrences in macbeth.txt, allswell.txt, and othello.txt. This is the code I have so far: finding duplicates in a list and adding their values

def byFreq(pair):
    return pair[1]

def shakespeare():
    counts = {}
    A = []
    for words in ['macbeth.txt', 'allswell.txt', 'othello.txt']:
        text = open(words, 'r').read()
        test = text.lower()

        for ch in '!"$%&()*+,-./:;<=>?@[\\]^_`{|}~':
            text = text.replace(ch, ' ')
        words = text.split()

        for w in words:
            counts[w] = counts.get(w, 0) + 1

        items = list(counts.items())
        items.sort()
        items.sort(key=byFreq, reverse=True)

        for i in range(50):
            word, count = items[i]
            count = count / float(len(counts))
            A += [[word, count]]
    print A

which outputs:

 >>> shakespeare() 
[['the', 0.12929982922664066], ['and', 0.09148572822639668], ['I', 0.08075140278116613], ['of', 0.07684801171017322], ['to', 0.07562820200048792], ['a', 0.05220785557453037], ['you', 0.04415711149060746], ['in', 0.041717492071236886], ['And', 0.04147353012929983], ['my', 0.04147353012929983], ['is', 0.03927787265186631], ['not', 0.03781410100024396], ['that', 0.0358624054647475], ['it', 0.03366674798731398], ['Macb', 0.03342278604537692], ['with', 0.03269090021956575], ['his', 0.03147109050988046], ['be', 0.03025128080019517], ['The', 0.028787509148572824], ['haue', 0.028543547206635766], ['me', 0.027079775555013418], ['your', 0.02683581361307636], ['our', 0.025128080019516955], ['him', 0.021956574774335203], ['Enter', 0.019516955354964626], ['That', 0.019516955354964626], ['for', 0.01927299341302757], ['this', 0.01927299341302757], ['he', 0.018541107587216395], ['To', 0.01780922176140522], ['so', 0.017077335935594046], ['all', 0.0156135642839717], ['What', 0.015369602342034643], ['are', 0.015369602342034643], ['thou', 0.015369602342034643], ['will', 0.015125640400097584], ['Macbeth', 0.014881678458160527], ['thee', 0.014881678458160527], ['But', 0.014637716516223469], ['but', 0.014637716516223469], ['Macd', 0.014149792632349353], ['they', 0.014149792632349353], ['their', 0.013905830690412296], ['we', 0.013905830690412296], ['as', 0.01341790680653818], ['vs', 0.01341790680653818], ['King', 0.013173944864601122], ['on', 0.013173944864601122], ['yet', 0.012198097096852892], ['Rosse', 0.011954135154915833], ['the', 0.15813168261114238], ['I', 0.14279684862127182], ['and', 0.1231007315700619], ['to', 0.10875070343275182], ['of', 0.10481148002250985], ['a', 0.08581879572312887], ['you', 0.08581879572312887], ['my', 0.06992121553179516], ['in', 0.061902082160945414], ['is', 0.05852560495216657], ['not', 0.05486775464265616], ['it', 0.05472706809229038], ['that', 0.05472706809229038], ['his', 0.04727068092290377], ['your', 0.04389420371412493], ['me', 
0.043753517163759144], ['be', 0.04305008441193022], ['And', 0.04037703995498031], ['with', 0.038266741699493526], ['him', 0.037703995498030385], ['for', 0.03601575689364097], ['he', 0.03404614518851998], ['The', 0.03137310073157006], ['this', 0.030810354530106922], ['her', 0.029262802476083285], ['will', 0.0291221159257175], ['so', 0.027011817670230726], ['have', 0.02687113111986494], ['our', 0.02687113111986494], ['but', 0.024760832864378166], ['That', 0.02293190770962296], ['PAROLLES', 0.022791221159257174], ['To', 0.021384355655599326], ['all', 0.021384355655599326], ['shall', 0.021102982554867755], ['are', 0.02096229600450197], ['as', 0.02096229600450197], ['thou', 0.02039954980303883], ['Macb', 0.019274057400112548], ['thee', 0.019274057400112548], ['no', 0.01871131119864941], ['But', 0.01842993809791784], ['Enter', 0.01814856499718627], ['BERTRAM', 0.01758581879572313], ['HELENA', 0.01730444569499156], ['we', 0.01730444569499156], ['do', 0.017163759144625774], ['thy', 0.017163759144625774], ['was', 0.01674169949352842], ['haue', 0.016460326392796848], ['I', 0.19463784682531435], ['the', 0.17894627455055595], ['and', 0.1472513769094877], ['to', 0.12989712147978802], ['of', 0.12002494024732412], ['you', 0.1079704873739998], ['a', 0.10339810869791126], ['my', 0.0909279850358516], ['in', 0.07627558973293151], ['not', 0.07159929335965914], ['is', 0.0697287748103502], ['it', 0.0676504208666736], ['that', 0.06733866777512211], ['me', 0.06099968824690845], ['your', 0.0543489556271433], ['And', 0.053205860958121166], ['be', 0.05310194326093734], ['his', 0.05154317780317988], ['with', 0.04769822300737816], ['him', 0.04665904603553985], ['her', 0.04364543281720877], ['for', 0.04322976202847345], ['he', 0.042190585056635144], ['this', 0.04187883196508366], ['will', 0.035332017042502335], ['Iago', 0.03522809934531851], ['so', 0.03356541619037722], ['The', 0.03325366309882573], ['haue', 0.031902733035435935], ['do', 0.03138314454951678], ['but', 0.030240049880494647], 
['That', 0.02857736672555336], ['thou', 0.027642107450898887], ['as', 0.027434272056531227], ['To', 0.026810765873428243], ['our', 0.02504416502130313], ['are', 0.024628494232567806], ['But', 0.024420658838200146], ['all', 0.024316741141016316], ['What', 0.024212823443832486], ['shall', 0.024004988049464823], ['on', 0.02265405798607503], ['thee', 0.022134469500155875], ['Enter', 0.021822716408604385], ['thy', 0.021199210225501402], ['no', 0.020783539436766082], ['she', 0.02026395095084693], ['am', 0.02005611555647927], ['by', 0.019848280162111608], ['have', 0.019848280162111608]] 

Instead of outputting the top 50 words across the three texts combined, it outputs the top 50 words of each text in turn, 150 words in total. I'm trying to remove the duplicates and add their ratios together. For example, the word "the" has a ratio of 0.12929982922664066 in macbeth.txt, 0.15813168261114238 in allswell.txt, and 0.17894627455055595 in othello.txt. I want to combine the three ratios into one. I'm pretty sure I have to use a for loop, but I'm struggling to loop through a list of lists. I'm more of a Java person, so any help would be appreciated!


Are you looking for the rate at which a word occurs in each file, or the rate across all 3 files combined? In other words, should the ratio for "the" be how often it occurs in each work individually (so it has 3 different ratios), or should it be the ratio of how often "the" occurs across all three texts (one value)? – TheF1rstPancake


It should be how often "the" occurs across all three texts combined (one value). Sorry for the confusion. –


OK, then that changes your logic. You can't just add the three ratios together. You have to count the word's occurrences across all three files, and then divide by the sum of the total word counts of each file. You need to treat the three separate files as one big file and then do the math. Did @phynfo's solution work for you? @zmbq's solution also applies: all you need to do is move everything after 'items = list(counts.items())' out of the 'for' loop. – TheF1rstPancake
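The approach in the comment above can be sketched in code (a minimal Python 3 sketch; the inline sample strings are hypothetical stand-ins for the three files):

```python
from collections import Counter

# Stand-ins for the three files (hypothetical sample text).
texts = {
    'macbeth.txt': 'the king and the queen',
    'allswell.txt': 'the count and the countess',
    'othello.txt': 'the moor and the senator',
}

# Treat all three texts as one big file: one combined count,
# one combined total, one division at the very end.
counts = Counter()
total = 0
for name, text in texts.items():  # with real files: open(name).read()
    words = text.lower().split()
    counts.update(words)
    total += len(words)

ratios = {word: count / total for word, count in counts.items()}
print(ratios['the'])  # 6 occurrences / 15 words = 0.4
```

Adding the three per-file ratios of "the" instead would give a different (and wrong) number, because each ratio has a different denominator.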

Answers

1

You are summarizing the file counts inside the loop. Move the summarizing code outside the for loop.
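One way to apply this to the question's code (a hedged sketch, not the answerer's exact fix): keep the per-file counting inside the loop, but sort, slice the top 50, and compute the ratios once, after the loop. The filename parameter and the Python 3 division are assumptions added for illustration; note the ratio here divides by the total word count, as the comments suggest, rather than by `len(counts)` (the number of distinct words) as the original code did.

```python
def byFreq(pair):
    return pair[1]

def shakespeare(filenames):
    # Count words from ALL files into one dictionary first...
    counts = {}
    total = 0
    for name in filenames:
        text = open(name).read().lower()
        for ch in '!"$%&()*+,-./:;<=>?@[\\]^_`{|}~':
            text = text.replace(ch, ' ')
        words = text.split()
        total += len(words)
        for w in words:
            counts[w] = counts.get(w, 0) + 1

    # ...then sort and take the top 50 ONCE, outside the file loop.
    items = sorted(counts.items(), key=byFreq, reverse=True)
    return [[word, count / total] for word, count in items[:50]]
```

Calling `shakespeare(['macbeth.txt', 'allswell.txt', 'othello.txt'])` would then return one list of at most 50 `[word, ratio]` pairs instead of 150.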

3

You can use a list comprehension and a Counter:

from collections import Counter 

c = Counter([word for file in ['macbeth.txt','allswell.txt','othello.txt'] 
        for word in open(file).read().split()]) 

Then you get a dictionary mapping the words to their counts. You can sort them like this:

sorted([(i,v) for v,i in c.items()]) 

If you want relative counts, you can compute the total number of words:

numWords = sum([i for (v,i) in c.items()]) 

and adapt the dictionary c with a dict comprehension:

c = { v:(i/numWords) for (v,i) in c.items()} 
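`Counter.most_common` can also replace the manual sorting step in this answer; a small self-contained sketch (with an inline sample string as a hypothetical stand-in for the files, and Python 3 division):

```python
from collections import Counter

sample = 'the cat and the dog and the bird'
c = Counter(sample.split())

# Total number of words, then relative frequencies of the top 50,
# already sorted from most to least common by most_common().
numWords = sum(c.values())
top = {word: count / numWords for word, count in c.most_common(50)}
print(top['the'])  # 3 / 8 = 0.375
```

Under Python 2 the division `i/numWords` in the answer's code would truncate to 0 for every word unless one operand is converted with `float()`.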