2013-03-01 71 views
1

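I want to compute the term frequency and inverse document frequency (TF-IDF) of words across the text files in a particular collection. How do I remove words from a text file and count them?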

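So in this case I just want to count the total number of words in a file and the number of occurrences of a particular word in that file, and to remove words such as "a", "an", "the", etc.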

Is there any parser for this in VB.NET?
Thanks in advance.

+0

Go through this [tutorial](http://www.codeproject.com/Questions/302262/How-to-search-specific-string-into-分离文本文件) and let me know whether it helps. – 2013-03-01 05:40:03

Answers

1

The simplest way I know is this:

Private Function CountWords(Filename As String) As Integer 
    ' Split on single spaces and count the resulting pieces 
    Return IO.File.ReadAllText(Filename).Split(" "c).Length 
End Function 

If you want to remove words, you can do this:

Private Sub RemoveWords(Filename As String, DeleteWords As List(Of String)) 
    Dim AllWords() As String = IO.File.ReadAllText(Filename).Split(" "c) 
    ' The LINQ query yields an IEnumerable(Of String), so ToArray() is needed to get an array 
    Dim RemainingWords() As String = (From Word As String In AllWords 
                                      Where DeleteWords.IndexOf(Word) = -1 
                                      Select Word).ToArray() 

    'Do something with RemainingWords, e.g.: 
    'IO.File.WriteAllText(Filename, String.Join(vbNewLine, RemainingWords)) 
End Sub  

This assumes that words are separated by spaces.
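Splitting on a single space treats tabs, line breaks, and punctuation as part of a word. A minimal sketch of a more tolerant split follows; the delimiter list and the name `CountWordsLoose` are assumptions made for illustration and are not part of the answer above.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module WordCountSketch
    ' Hypothetical delimiter set; extend it to match the text being processed.
    Private ReadOnly Delimiters As Char() = {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf,
                                             "."c, ","c, ";"c, ":"c, "!"c, "?"c}

    ' Counts words in a string, dropping the empty entries that adjacent delimiters produce.
    Public Function CountWordsLoose(text As String) As Integer
        Return text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries).Length
    End Function

    Sub Main()
        ' The sample text has six words: The, cat, the, dog, A, bird
        Console.WriteLine(CountWordsLoose("The cat, the dog." & vbCrLf & "A bird!"))
    End Sub
End Module
```

To count the words of a file, pass the result of `IO.File.ReadAllText(Filename)` to the function.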

0

Maybe regular expressions will help you:

Imports System.IO 
Imports System.Text.RegularExpressions 

... 

Dim anyWordPattern As String = "\b\w+\b" 
Dim myWordPattern As String = "\bMyWord\b" 
Dim replacePattern As String = "\b(?<sw>a|an|the)\b" 
Dim content As String = File.ReadAllText(<file name>) 
Dim coll = Regex.Matches(content, anyWordPattern) 
Console.WriteLine("Total words: {0}", coll.Count) 
coll = Regex.Matches(content, myWordPattern, RegexOptions.Multiline Or RegexOptions.IgnoreCase) 
Console.WriteLine("My word occurrences: {0}", coll.Count) 
Dim replacedContent = Regex.Replace(content, replacePattern, String.Empty, RegexOptions.Multiline Or RegexOptions.IgnoreCase) 
Console.WriteLine("Replaced content: {0}", replacedContent) 

Explanation of the regular expressions used:

  • \b - word boundary;
  • \w - any word character;
  • + - quantifier, one or more;
  • (?<sw>...) - named group called "sw" (stop word);
  • a|an|the - alternation: "a" or "an" or "the".
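The same `\b\w+\b` pattern can also produce a count for every word at once, which is the term-frequency half of TF-IDF. The following is a sketch under assumptions: the name `CountTerms`, the case-insensitive comparison, and the sample sentence are illustrative choices, not something the answer specifies.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TermFrequencySketch
    ' Returns a word -> occurrence count map, ignoring case and skipping stop words.
    Public Function CountTerms(content As String, stopWords As ICollection(Of String)) As Dictionary(Of String, Integer)
        Dim counts As New Dictionary(Of String, Integer)(StringComparer.OrdinalIgnoreCase)
        For Each m As Match In Regex.Matches(content, "\b\w+\b")
            If stopWords.Contains(m.Value) Then Continue For
            Dim current As Integer = 0
            counts.TryGetValue(m.Value, current) ' current stays 0 when the word is new
            counts(m.Value) = current + 1
        Next
        Return counts
    End Function

    Sub Main()
        ' The stop-word set uses a case-insensitive comparer so that "The" is skipped as well.
        Dim stopWords As New HashSet(Of String)(New String() {"a", "an", "the"}, StringComparer.OrdinalIgnoreCase)
        Dim counts = CountTerms("The cat saw a cat and an owl", stopWords)
        For Each pair In counts
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value)
        Next
    End Sub
End Module
```

Skipping stop words while counting avoids rewriting the file contents, unlike the `Regex.Replace` approach.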
1

The simplest way to do this is to read the text file into a string and then use the .NET Framework to find the match:

Dim text As String = File.ReadAllText("D:\Temp\MyFile.txt") 
Dim index As Integer = text.IndexOf("hello") 
If index >= 0 Then 
    ' String is in file, starting at character "index" 
End If 

Or, as a second solution, you can use a StreamReader together with a Regex:

//read file content in StreamReader 
StreamReader txtReader = new StreamReader(fName); 
string szReadAll = txtReader.ReadToEnd();//Reads the whole text file to the end 
txtReader.Close(); //Closes the text file after it is fully read. 
txtReader = null; 
//search word in file content 
if (Regex.IsMatch(szReadAll, "SearchME", RegexOptions.IgnoreCase))//If the match is found in allRead 
    MessageBox.Show("found"); 
else 
    MessageBox.Show("not found"); 

That's all; I hope this resolves your question. Regards

0

You could try something like this:

Dim text As String = IO.File.ReadAllText("C:\file.txt") 
Dim wordsToSearch() As String = New String() {"Hello", "World", "foo"} 
Dim words As New List(Of String)() 
Dim findings As Dictionary(Of String, List(Of Integer)) 

'Dividing into words' 
words.AddRange(text.Split(New String() {" ", Environment.NewLine()}, StringSplitOptions.RemoveEmptyEntries)) 
'Discarding all the words you don't want' 
words.RemoveAll(New Predicate(Of String)(AddressOf WordsDeleter)) 

findings = SearchWords(words, wordsToSearch) 

Console.WriteLine("Number of 'foo': " & findings("foo").Count) 

And the functions it uses:

Private Function WordsDeleter(ByVal obj As String) As Boolean 
    Dim wordsToDelete As New List(Of String)(New String() {"a", "an", "the"}) 
    Return wordsToDelete.Contains(obj.ToLower) 
End Function 

Private Function SearchWords(ByVal allWords As List(Of String), ByVal wordsToSearch() As String) As Dictionary(Of String, List(Of Integer)) 
    Dim dResult As New Dictionary(Of String, List(Of Integer))() 

    For Each s As String In wordsToSearch 
        dResult.Add(s, New List(Of Integer)) 

        ' Start a fresh scan for every word and stop once IndexOf finds no further match 
        Dim i As Integer = allWords.IndexOf(s) 
        While i >= 0 
            dResult(s).Add(i) 
            If i + 1 >= allWords.Count Then Exit While 
            i = allWords.IndexOf(s, i + 1) 
        End While 
    Next 

    Return dResult 
End Function
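Once each file has been reduced to a list of words, the per-word counts can be combined into a TF-IDF weight. The sketch below assumes one common variant of the formula (term count divided by document length, multiplied by the natural logarithm of the number of documents over the number of documents containing the term). The names `CountOf` and `TfIdf` and the sample word lists are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic

Module TfIdfSketch
    ' Number of case-insensitive occurrences of term in a word list.
    Private Function CountOf(term As String, words As List(Of String)) As Integer
        Dim n As Integer = 0
        For Each w As String In words
            If String.Equals(w, term, StringComparison.OrdinalIgnoreCase) Then n += 1
        Next
        Return n
    End Function

    ' Assumed variant: tf = count / words in doc, idf = ln(documents / documents containing term).
    Public Function TfIdf(term As String, doc As List(Of String), corpus As List(Of List(Of String))) As Double
        If doc.Count = 0 Then Return 0.0
        Dim docsWithTerm As Integer = 0
        For Each d As List(Of String) In corpus
            If CountOf(term, d) > 0 Then docsWithTerm += 1
        Next
        If docsWithTerm = 0 Then Return 0.0
        Dim tf As Double = CountOf(term, doc) / doc.Count       ' "/" yields a Double in VB
        Dim idf As Double = Math.Log(corpus.Count / docsWithTerm)
        Return tf * idf
    End Function

    Sub Main()
        Dim doc1 As New List(Of String) From {"cat", "saw", "cat", "owl"}
        Dim doc2 As New List(Of String) From {"dog", "saw", "owl"}
        Dim corpus As New List(Of List(Of String)) From {doc1, doc2}
        Console.WriteLine("cat: {0}", TfIdf("cat", doc1, corpus))
        Console.WriteLine("owl: {0}", TfIdf("owl", doc1, corpus))
    End Sub
End Module
```

A word that appears in every document, such as "owl" here, gets a weight of zero under this variant, which is why stop words are often removed before weighting rather than after.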