Creating a frequency table that captures prevalent substrings within strings of a certain length - Python

I'm trying to compute a frequency analysis of a Swahili corpus that I'm compiling. At the moment, this is what I have:

import os
import re
from collections import Counter

path = 'C:\\Python27\\corpus\\'
cnt = Counter()
for infile in os.listdir(path):
    print("Currently parsing: " + path + infile)
    with open(path + infile, "r") as corpus:
        for line in corpus:
            for word in line.split(' '):
                # Strip first so a trailing newline is not counted in the length,
                # then keep purely alphabetic words of two or more letters.
                word = word.strip()
                if len(word) >= 2 and re.match("^[A-Za-z]*$", word):
                    cnt[word] += 1
    print("Completed parsing: " + path + infile)

# Print the 1000 most common words with their rank.
for rank, content in enumerate(cnt.most_common(1000)):
    print(str(rank + 1) + " " + str(content))

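So this program iterates through all the files in the given path, reads the text of each file, and displays the 1000 most frequent words. Here's the problem: Swahili is an agglutinative language, which means that infixes, suffixes, and prefixes are added to words to convey things like tense, causation, subjunctive mood, prepositions, and so on.

So a verb root like '-fanya', meaning 'to do', could appear as nitakufanya - 'I am going to do you'. As a result, this frequency list is biased towards connecting words like 'for', 'in', 'out', which don't take said infixes.

Is there a simple way to look at words like 'nitakufanya' or 'tunafanya' and include the word 'fanya' in the count totals?

Some potential things to look at: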

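The verb root will be at the end of the word.
The subject marker at the beginning of a word can be one of the following: 'ni' (I), 'u' (you), 'a' (he/she), 'wa' (they), 'tu' (we), 'm' (you all).
The subject marker is immediately followed by the tense marker, which is one of: 'na' (present), 'li' (past), 'ta' (future), 'ji' (reflexive), 'nge' (conditional).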

Thanks

Source

2012-07-31 Parseltongue

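Do the frequency analysis first, without worrying about prefixes. Then fix up the prefixes in the frequency list. To do that, sort the list by word, so that words with the same prefix end up next to each other. That will make it quite easy to prune by hand.

Source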
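A minimal sketch of that sorting step, assuming the counts already live in a Counter like the `cnt` from the question. The sample words and counts below are made up for illustration, and `sorted_by_word` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sample counts standing in for the `cnt` built from the corpus.
cnt = Counter({"tunafanya": 3, "kwa": 9, "nitakufanya": 2, "tulifanya": 1, "na": 7})

def sorted_by_word(counter):
    """Return (word, count) pairs in alphabetical order of the word."""
    # Words are unique keys, so sorting the pairs orders them by word.
    return sorted(counter.items())

for word, count in sorted_by_word(cnt):
    print("%s %d" % (word, count))
```

With the list in this order, forms that share a subject marker, such as 'tulifanya' and 'tunafanya', sit on adjacent lines and can be merged by hand.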

2012-07-31 01:33:18

You could do this:

root_words = [re.sub(
    '^(ni|u|a|wa|tu|m)(na|li|ta|ji|nge)',
    '', word) for word in words]

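to strip those prefixes from each word, but there isn't much you can do if the root word itself also begins with one of those sequences.

Source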
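One way the substitution could be folded into the frequency count is sketched below. It assumes the counts are in a Counter like the question's `cnt`; `strip_prefix`, `count_roots`, and the sample data are hypothetical names and values, and the pattern only covers the subject and tense markers listed in the question.

```python
import re
from collections import Counter

# Subject marker followed by tense marker, as listed in the question.
# Other morphemes (for example an object marker like 'ku') are not handled.
PREFIX = re.compile(r'^(ni|u|a|wa|tu|m)(na|li|ta|ji|nge)')

def strip_prefix(word):
    """Remove one leading subject+tense marker pair, if present."""
    stripped = PREFIX.sub('', word)
    # Keep the original word if stripping would leave nothing behind.
    return stripped or word

def count_roots(word_counts):
    """Add the count of each word to the count of its stripped form."""
    roots = Counter()
    for word, count in word_counts.items():
        roots[strip_prefix(word)] += count
    return roots

# Hypothetical sample counts standing in for the corpus Counter.
cnt = Counter({"tunafanya": 3, "tulifanya": 1, "fanya": 2, "nitakufanya": 2, "una": 4})
roots = count_roots(cnt)
for word, count in roots.most_common():
    print("%s %d" % (word, count))
```

In this sample, 'tunafanya', 'tulifanya', and 'fanya' are all counted under 'fanya', while 'nitakufanya' ends up as 'kufanya' because the 'ku' after the tense marker is not part of the pattern. Any word that merely happens to start with one of these letter sequences would be stripped as well, which is the limitation noted above.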

2012-07-31 01:47:42
