Bigrams from many txt files in Python

I have a corpus of 70,429 files (296.5 MB). I am trying to find bigrams using the whole corpus. I wrote the following code:

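import codecs
import os

import nltk
from nltk.collocations import BigramCollocationFinder

allFiles = ""
for dirName in os.listdir(rootDirectory):
    dirPath = os.path.join(rootDirectory, dirName)
    for subDir in os.listdir(dirPath):
        subPath = os.path.join(dirPath, subDir)
        for fileN in os.listdir(subPath):
            # os.listdir returns bare names, so build the full path before opening
            with codecs.open(os.path.join(subPath, fileN), encoding="iso8859-9") as FText:
                allFiles += FText.read()
tokens = allFiles.split()
finder = BigramCollocationFinder.from_words(tokens, window_size=3)
finder.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()
for k, v in finder.ngram_fd.most_common(100):
    print(k, v)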

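There is a root directory. The root directory contains subdirectories, and each subdirectory contains a large number of files. What I do is:

I read all the files and append their content to a string named allFiles. Finally, I split the string into tokens and call the relevant bigram functions. The problem is:

I ran the program for a day and did not get any results. Is there a more efficient way to find bigrams in a corpus that contains a lot of files?

Any comments and suggestions would be greatly appreciated. Thanks in advance.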

Source

2016-03-13 yns

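One thing to do would be to process each file during the directory traversal in the loop and store the output of 'BigramCollocationFinder' as you go. It could be pretty intensive, but it may be faster? – avip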
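A minimal sketch of what this comment suggests, written with the standard library only rather than NLTK. The names `window_bigrams` and `count_corpus` are hypothetical, and the pairing rule (each word paired with the words that follow it inside the window) is an assumption meant to roughly mirror `window_size=3`, not a copy of NLTK's implementation. Only one file's text is held in memory at a time; the running totals live in a `Counter`.

```python
import os
from collections import Counter


def window_bigrams(tokens, window_size=3):
    """Yield (w1, w2) for every w2 that follows w1 within the window."""
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + window_size]:
            yield first, second


def count_corpus(root, encoding="iso8859-9", window_size=3):
    """Walk root and accumulate pair counts one file at a time."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, encoding=encoding) as handle:
                tokens = handle.read().split()
            # Counting per file also keeps pairs from spanning two documents.
            counts.update(window_bigrams(tokens, window_size))
    return counts
```

Calling `count_corpus(rootDirectory).most_common(100)` would then list the most frequent pairs, and a frequency filter can be applied to the finished `Counter` afterwards.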

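By trying to read a huge corpus into memory all at once, you are exhausting your memory, forcing heavy use of swap, and slowing everything down.

NLTK provides a variety of "corpus readers" that can return your words one at a time, so that the complete corpus is never held in memory at once.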

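from nltk.collocations import BigramCollocationFinder
from nltk.corpus.reader import PlaintextCorpusReader

# fileids is a regular expression (not a shell glob) matched against paths relative to the root
reader = PlaintextCorpusReader(rootDirectory, r".*/.*/.*", encoding="iso8859-9")
# reader.words() streams tokens lazily instead of loading the whole corpus
finder = BigramCollocationFinder.from_words(reader.words(), window_size=3)
finder.apply_freq_filter(2)  # Continue processing as before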
...

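Addendum: The above may work if I understand your corpus layout correctly. Your approach has a flaw, though: you are taking word windows that span from the end of one document to the beginning of the next... that is rubbish you want to get rid of. I recommend the following variant, which collects the n-grams from each document separately.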

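from nltk.collocations import BigramCollocationFinder

# One lazy word stream per file, so no window crosses a document boundary
document_streams = (reader.words(fname) for fname in reader.fileids())
BigramCollocationFinder.default_ws = 3
finder = BigramCollocationFinder.from_documents(document_streams)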
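To make the boundary problem concrete, here is a small standard-library illustration. It does not use NLTK; `window_pairs` is a hypothetical helper, and the two toy documents are invented for the example. It compares the pairs found in the concatenated token stream with the pairs found when each document is counted on its own.

```python
from collections import Counter


def window_pairs(tokens, window_size=3):
    """Yield (w1, w2) for every w2 that follows w1 within the window."""
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + window_size]:
            yield first, second


# Two toy documents (made up for illustration).
docs = [["new", "york"], ["city", "hall"]]

# Concatenating first lets windows run across the document boundary.
joined = Counter(window_pairs([word for doc in docs for word in doc]))

# Counting each document separately keeps every window inside one document.
separate = Counter()
for doc in docs:
    separate.update(window_pairs(doc))

# Pairs that exist only because the documents were glued together.
spurious = set(joined) - set(separate)
```

Here `spurious` contains pairs such as `("york", "city")`, which never occur inside either document.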

Source

2016-03-13 22:30:27 alexis

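Consider parallelizing your process with a worker pool from Python's "multiprocessing" module (https://docs.python.org/2/library/multiprocessing.html), emitting a dictionary of {word: count} for each file in the corpus into some shared list. After the worker pool completes, merge the dictionaries, then filter by the number of word occurrences.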
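A rough sketch of this idea, assuming whitespace tokenization and the `iso8859-9` encoding from the question. The function names `count_file`, `merge_counts`, and `count_corpus` are hypothetical. Instead of a shared list, the workers simply return their per-file `Counter`, and the parent process merges the results as they arrive.

```python
from collections import Counter
from multiprocessing import Pool


def count_file(path, encoding="iso8859-9"):
    """Return a {word: count} mapping for a single file."""
    with open(path, encoding=encoding) as handle:
        return Counter(handle.read().split())


def merge_counts(partials):
    """Sum an iterable of per-file counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_corpus(paths, processes=None):
    """Count words across paths using a pool of worker processes."""
    with Pool(processes) as pool:
        # imap_unordered yields results as workers finish, so the
        # per-file counters never have to be held in a list.
        return merge_counts(pool.imap_unordered(count_file, paths, chunksize=256))
```

On platforms that start workers with the spawn method (Windows, and macOS by default in recent Python versions), the call to `count_corpus` needs to sit under an `if __name__ == "__main__":` guard. The same structure should carry over to word pairs by having the worker return pair counts instead of word counts.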

Source

2016-03-13 20:12:08 manglano
