2014-12-29 102 views
6

Removing stop words from a Java string

I have a lot of strings, and a text file containing the stop words I need to remove from them. Say I have the string

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs." 

After removing the stop words, the string should look like this:

"love phone, super fast much cool jelly bean....but recently bugs." 

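I have been able to do this, but the problem I am now facing is that whenever stop words are adjacent in the string, only the first one is removed, and I get this result: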

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs" 

Here is my stopwordslist.txt file: Stopwords

How can I fix this? Here is what I have done so far:

int k=0,i,j; 
ArrayList<String> wordsList = new ArrayList<String>(); 
String sCurrentLine; 
String[] stopwords = new String[2000]; 
try{ 
     FileReader fr=new FileReader("F:\\stopwordslist.txt"); 
     BufferedReader br= new BufferedReader(fr); 
     while ((sCurrentLine = br.readLine()) != null){ 
      stopwords[k]=sCurrentLine; 
      k++; 
     } 
     String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."; 
     StringBuilder builder = new StringBuilder(s); 
     String[] words = builder.toString().split("\\s"); 
     for (String word : words){ 
      wordsList.add(word); 
     } 
     for(int ii = 0; ii < wordsList.size(); ii++){ 
      for(int jj = 0; jj < k; jj++){ 
       if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){ 
        wordsList.remove(ii); 
        break; 
       } 
      } 
     } 
     for (String str : wordsList){ 
      System.out.print(str+" "); 
     } 
    }catch(Exception ex){ 
     System.out.println(ex); 
    } 
+0

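Would splitting the string first help? Something like "phrase.split(delims);". You could filter out the unwanted parts first and then stitch the rest back together. That might solve your "this" and "his" problem. –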

+0

[A more specific question is here](http://stackoverflow.com/questions/22257598/best-way-to-remove-stop-words-from-files) – jsroyal

Answers

2

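The error occurs because you remove elements from the list you are iterating over. Say wordsList contains |word0|word1|word2|. If ii equals 1 and the if test is true, you call wordsList.remove(1);. After that, your list is |word0|word2|. ii is then incremented to 2, which is no longer less than the size of the list, so word2 is never tested.

From there, several solutions are possible. For example, instead of removing the value you can set it to "". Or you can build a separate "result" list.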
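
The "result list" idea could look roughly like the sketch below. The class and method names are made up for illustration, and the stop words are a small hypothetical sample rather than the contents of stopwordslist.txt. It also assumes an exact, case-insensitive match is wanted, which is a guess at the asker's intent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ResultListFilter {
    // Copies every word that is NOT a stop word into a new list,
    // so the list being iterated is never modified.
    static List<String> keepNonStopWords(List<String> words, Set<String> stopWords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample; the real stop words would be read from the file.
        Set<String> stopWords = new HashSet<>(Arrays.asList("this", "its", "and", "so"));
        List<String> words = Arrays.asList(
                "love", "this", "phone,", "its", "super", "fast", "and", "so", "much");
        // Prints: love phone, super fast much
        System.out.println(String.join(" ", keepNonStopWords(words, stopWords)));
    }
}
```

Because no index is skipped, adjacent stop words such as "and so" are both dropped.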

1

Try the following approach:

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."; 
    String stopWords[]={"love","this","cool"}; 
    for(int i=0;i<stopWords.length;i++){ 
     if(s.contains(stopWords[i])){ 
      s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end 
     } 
    } 
    System.out.println(s); 

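This way your final output no longer contains the words you don't want. Just load your stop word list into the array and replace each match in the string.
Output for my stop words: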

I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs. 
+1

After the for loop, s = s.replaceAll("  ", " "); to turn two spaces into a single space? –

+0

Also, as with Vimal's answer, you will remove substrings from the middle of other words (try adding "a" as a stop word). –

1

Why don't you use the approach below instead? It is easier to read and understand:

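for(String word : words){ 
    // replaceAll is needed here: replace() would treat "\\s*" as literal text, not as a regex 
    s = s.replaceAll(word+"\\s*", ""); 
} 
System.out.println(s); // prints the string with the words removed 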
+0

Do note that this implementation will result in double spaces. –

+0

The problem with this is that it also removes stop words that occur inside other words. For example, it will remove "his" from "this". – JavaLearner

+0

@AngelKoh thanks for pointing that out :) –

4

Here is a better solution (IMHO), using only regular expressions:

// instead of the ".....", add all your stopwords, separated by "|" 
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this" 
    // the "\\s?" is to suppress optional trailing white space 
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?"); 
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."); 
    String s = m.replaceAll(""); 
    System.out.println(s); 
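
If the stop words come from a list rather than being typed into the pattern by hand, the alternation can be assembled programmatically. This is only a sketch under assumptions: `StopWordRegex`, `compile` and `strip` are illustrative names, and making the match case-insensitive is my choice, not something the answer above specifies.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StopWordRegex {
    // Builds \b(?:w1|w2|...)\b\s? from the list. Pattern.quote keeps any
    // regex metacharacters inside a stop word from being interpreted.
    static Pattern compile(List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?",
                Pattern.CASE_INSENSITIVE);
    }

    // Removes every whole-word occurrence of a stop word from the text.
    static String strip(String text, List<String> stopWords) {
        return compile(stopWords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopWords = Arrays.asList("I", "this", "its");
        // Prints: love phone, super fast
        System.out.println(strip("I love this phone, its super fast", stopWords));
    }
}
```

One caveat: "\\b" treats an apostrophe as a word boundary, so with "I" as a stop word, "I've" would be reduced to "'ve".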
+0

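The break statement is not the problem at all. In the outer loop he takes the first word of the text. Then he checks whether it exists in the stop word list. If he finds the word in the list, he breaks out of the search loop. Then he takes the next word and searches the stop word list again. –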

+0

Yes, removing the break did solve the problem – JavaLearner

+0

Also, as with the other answers, you will remove stop words that are substrings of normal words. –

0

Try using the String replaceAll API, like this:

String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."; 
String stopWords = "I|its|with|but"; 
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", ""); 
System.out.println(afterStopWords); 

OUTPUT: 
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs. 
5

Please try the program below.

String s="I love this phone, its super fast and there's so" + 
      " much new and cool things with jelly bean....but of recently I've seen some bugs."; 
    String[] words = s.split(" "); 
    ArrayList<String> wordsList = new ArrayList<String>(); 
    Set<String> stopWordsSet = new HashSet<String>(); 
    stopWordsSet.add("I"); 
    stopWordsSet.add("THIS"); 
    stopWordsSet.add("AND"); 
    stopWordsSet.add("THERE'S"); 

    for(String word : words) 
    { 
     String wordCompare = word.toUpperCase(); 
     if(!stopWordsSet.contains(wordCompare)) 
     { 
      wordsList.add(word); 
     } 
    } 

    for (String str : wordsList){ 
     System.out.print(str+" "); 
    } 

OUTPUT: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.

+0

Nice catch: instead of removing the unwanted words, add the wanted ones! +1 – Charlie

0

Try storing the stop words in a Set and tokenizing your string into a List. After that you can simply use 'removeAll' to get the result.

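Set<String> stopwords = new HashSet<>(); // Set is an interface, so use a concrete class 
//fill in the set with your file 

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."; 
// Arrays.asList returns a fixed-size list, so copy it into an ArrayList before removing 
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" "))); 

listOfStrings.removeAll(stopwords); // exact, case-sensitive matches only 
String result = StringUtils.join(listOfStrings, " "); // StringUtils is presumably Apache Commons Lang 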

No loops needed. They usually mean trouble.
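
The "fill in the set with your file" step could be sketched as follows. I am assuming the file holds one stop word per line, which the question does not actually confirm. `StopWordFile`, `load` and `filter` are hypothetical names, and the demo writes a temporary file instead of reading F:\stopwordslist.txt.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordFile {
    // Reads one stop word per line (assumed format), lower-cased,
    // trimming whitespace and skipping blank lines.
    static Set<String> load(Path file) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    // Splits on whitespace and drops every token that is a stop word.
    static String filter(String text, Set<String> stopWords) {
        List<String> words = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        // removeIf lets us lower-case each token before the lookup.
        words.removeIf(w -> stopWords.contains(w.toLowerCase()));
        return String.join(" ", words);
    }

    public static void main(String[] args) throws IOException {
        // Temporary stand-in for the real stop word file.
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, Arrays.asList("this", "its", "and"));
        // Prints: love phone, super fast
        System.out.println(filter("love this phone, its super fast", load(file)));
        Files.delete(file);
    }
}
```

A HashSet lookup also avoids scanning a 2000-element array for every word.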

2

You can use the replaceAll function like this:

String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."; 
yourString=yourString.replaceAll("stop" ,""); 
0

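It seems that you stop as soon as one stop word has been removed from the sentence and then move on to another stop word: you need to remove all the stop words from each sentence.

You should try changing your code.

From: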

for(int ii = 0; ii < wordsList.size(); ii++){ 
    for(int jj = 0; jj < k; jj++){ 
     if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){ 
      wordsList.remove(ii); 
      break; 
     } 
    } 
} 

To something like:

for(int ii = 0; ii < wordsList.size(); ii++) 
{ 
    for(int jj = 0; jj < k; jj++) 
    { 
     if(wordsList.get(ii).toLowerCase().contains(stopwords[jj])) 
     { 
      wordsList.remove(ii); 
     } 
    } 
} 

Note that the break is removed and stopword.contains(word) is changed to word.contains(stopword).
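
A different way to delete while looping is Iterator.remove(), which keeps the iteration position consistent after each removal. The sketch below is a swapped-in technique, not what this answer proposes: it uses equals() instead of contains(), on the assumption that only whole words should match, and `IteratorRemoval` and `removeStopWords` are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class IteratorRemoval {
    // Removes stop words in place without skipping the element that
    // slides into the position of a removed one.
    static void removeStopWords(List<String> wordsList, String[] stopwords) {
        Iterator<String> it = wordsList.iterator();
        while (it.hasNext()) {
            String word = it.next().toLowerCase();
            for (String stopword : stopwords) {
                // equals() returns false for null, so unused array slots are harmless.
                if (word.equals(stopword)) {
                    it.remove();
                    break; // remove() may be called only once per next()
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> wordsList = new ArrayList<>(
                Arrays.asList("fast", "and", "so", "much", "new"));
        removeStopWords(wordsList, new String[] {"and", "so"});
        // Prints: fast much new
        System.out.println(String.join(" ", wordsList));
    }
}
```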

-1

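A recent project of mine needed functionality to filter stop/stemming words and swear words out of a given text or file. After going through several blogs and articles, I created a simple library for filtering data/files and made it available in Maven. Hope this helps someone.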

https://github.com/uttesh/exude

 <dependency> 
     <groupId>com.uttesh</groupId> 
     <artifactId>exude</artifactId> 
     <version>0.0.2</version> 
    </dependency> 
+0

This is a buggy library – MFARID

+0

@MFARID could you please explain on what basis you call it a buggy library? –