2010-12-23 72 views
1

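Count the frequency of specific words in a text file

I have a text file stored in a string variable. The text has been preprocessed so that it contains only lowercase words and spaces. Now suppose I have a static dictionary, which is just a list of specific words, and I want to count how often each dictionary word occurs in the text. For example: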

Text file: 

i love love vb development although i m a total newbie 

Dictionary: 

love, development, fire, stone 

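The output I would like to see is shown below: each dictionary word with its count. If it makes the coding simpler, it could also list only the dictionary words that actually appear in the text.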

=========== 

WORD, COUNT 

love, 2 

development, 1 

fire, 0 

stone, 0 

============ 

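Using a regular expression (for example, "\w+") I can get all the word matches, but I don't know how to get the counts for only those words that are also in the dictionary, so I'm stuck. Efficiency is crucial, because the dictionary is very large (about 100,000 words) and the text files are not small either (about 200 KB each).

I would appreciate any help.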

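// Split on spaces, dropping empty entries caused by repeated or trailing spaces 
Dictionary<string, int> count = 
    theString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries) 
    .GroupBy(s => s) 
    .ToDictionary(g => g.Key, g => g.Count()); 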

Now you can check whether each word from your word list exists in this dictionary of counts and, if it does, show its count.
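The lookup step just described could be sketched roughly as follows. This is a minimal illustration rather than the answerer's own code: the `WordReport` class, its `Report` method, and the variable names are hypothetical. It uses `TryGetValue`, so words from the list that never occur in the text are reported with a count of 0.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, introduced only for illustration.
public static class WordReport
{
    // Returns one "word, count" line per dictionary word, in dictionary order.
    public static List<string> Report(string text, IEnumerable<string> dictionaryWords)
    {
        // Count every word in the text once by grouping identical words.
        Dictionary<string, int> counts = text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());

        var lines = new List<string>();
        foreach (string word in dictionaryWords)
        {
            // TryGetValue leaves n at 0 when the word never occurs in the text.
            int n;
            counts.TryGetValue(word, out n);
            lines.Add(word + ", " + n);
        }
        return lines;
    }

    public static void Main()
    {
        string text = "i love love vb development although i m a total newbie ";
        string[] dictionaryWords = { "love", "development", "fire", "stone" };

        Console.WriteLine("WORD, COUNT");
        foreach (string line in Report(text, dictionaryWords))
            Console.WriteLine(line);
    }
}
```

With the sample text and word list from the question, this prints `love, 2`, `development, 1`, `fire, 0` and `stone, 0` under the header. The text is scanned once, and each word in the list then costs a single hash lookup.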

+0

Maybe something like splitting the string into an `Array` or `List`, and then iterating over and processing the list? – 2010-12-23 17:08:52

+0

You have tagged this as both c# and vb.net. Which is it? – 2010-12-23 17:10:07

Answers

5
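var dict = new Dictionary<string, int>(); 

// `file` is assumed to be the sequence of words, e.g. the text split on spaces 
foreach (var word in file) 
{ 
    if (dict.ContainsKey(word)) 
        dict[word]++;      // seen before: increment its count 
    else 
        dict[word] = 1;    // first occurrence 
} 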
6

You can count the words in the string by grouping them and turning the groups into a dictionary.

0

Using Groovy's regex facility, I would do it as follows:

def input=""" 
    i love love vb development although i m a total newbie 
""" 

def dictionary=["love", "development", "fire", "stone"] 


dictionary.each{ 
    // \b anchors keep "love" from also matching inside longer words such as "glove" 
    def pattern= ~/\b${it}\b/ 
    def match = input =~ pattern 
    println "${it}" + "-"+ match.count 
} 
0

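Try this. The `words` variable is your text string, and the `keywords` array is the list of keywords you want to count.

This does not return 0 for dictionary words that are missing from the text, but you said that behavior is acceptable. It should give you reasonably good performance while meeting your application's requirements.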

string words = "i love love vb development although i m a total newbie"; 
string[] keywords = new[] { "love", "development", "fire", "stone" }; 

Regex regex = new Regex("\\w+"); 

var frequencyList = regex.Matches(words) 
    .Cast<Match>() 
    .Select(c => c.Value.ToLowerInvariant()) 
    .Where(c => keywords.Contains(c)) 
    .GroupBy(c => c) 
    .Select(g => new { Word = g.Key, Count = g.Count() }) 
    .OrderByDescending(g => g.Count) 
    .ThenBy(g => g.Word); 

//Convert to a dictionary 
Dictionary<string, int> dict = frequencyList.ToDictionary(d => d.Word, d => d.Count); 

//Or iterate through them as is 
foreach (var item in frequencyList) 
    Response.Write(String.Format("{0}, {1}", item.Word, item.Count)); 

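If you want to achieve the same thing without a regular expression, which is possible because you have said that everything is lowercase and separated by spaces, you can modify the code above as follows: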

string words = "i love love vb development although i m a total newbie"; 
string[] keywords = new[] { "love", "development", "fire", "stone" }; 

var frequencyList = words.Split(' ') 
    .Where(c => keywords.Contains(c)) 
    .GroupBy(c => c) 
    .Select(g => new { Word = g.Key, Count = g.Count() }) 
    .OrderByDescending(g => g.Count) 
    .ThenBy(g => g.Word); 

Dictionary<string, int> dict = frequencyList.ToDictionary(d => d.Word, d => d.Count);
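
One caveat for a dictionary of about 100,000 words: `keywords.Contains(c)` on an array is a linear scan, and it runs once for every word in the text. A possible variation, sketched here on the assumption that the whole keyword list fits in memory, loads the keywords into a `HashSet<string>` so that each membership test is a hash lookup on average. The `KeywordCounter` class and its `Count` method are hypothetical names, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, introduced only for illustration.
public static class KeywordCounter
{
    // Counts only those words of the text that are also in the keyword list.
    public static Dictionary<string, int> Count(string words, IEnumerable<string> keywords)
    {
        // Build the set once; Contains is then O(1) on average
        // instead of a linear scan over the keyword array.
        var keywordSet = new HashSet<string>(keywords);

        return words
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => keywordSet.Contains(w))
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        string words = "i love love vb development although i m a total newbie";
        string[] keywords = { "love", "development", "fire", "stone" };

        Dictionary<string, int> counts = Count(words, keywords);

        // Most frequent first, ties broken alphabetically.
        foreach (var pair in counts.OrderByDescending(p => p.Value).ThenBy(p => p.Key))
            Console.WriteLine("{0}, {1}", pair.Key, pair.Value);
    }
}
```

As in the code above, keywords that never occur in the text are simply absent from the result rather than listed with a count of 0.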