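Removing stopwords from a String in Java

I have a string with a lot of words, and I have a text file containing some stopwords that I need to remove from my string. Let's say I have the string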

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stopwords, the string should look like this:

"love phone, super fast much cool jelly bean....but recently bugs."

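I have been able to achieve this, but the problem I am facing now is that whenever there are adjacent stopwords in the string, only the first one is removed, and I get this result: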

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"

This is my stopwordslist.txt file: Stopwords

How can I fix this? This is what I have done so far:

int k=0,i,j; 
ArrayList<String> wordsList = new ArrayList<String>(); 
String sCurrentLine; 
String[] stopwords = new String[2000]; 
try{ 
     FileReader fr=new FileReader("F:\\stopwordslist.txt"); 
     BufferedReader br= new BufferedReader(fr); 
     while ((sCurrentLine = br.readLine()) != null){ 
      stopwords[k]=sCurrentLine; 
      k++; 
     } 
     String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."; 
     StringBuilder builder = new StringBuilder(s); 
     String[] words = builder.toString().split("\\s"); 
     for (String word : words){ 
      wordsList.add(word); 
     } 
     for(int ii = 0; ii < wordsList.size(); ii++){ 
      for(int jj = 0; jj < k; jj++){ 
       if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){ 
        wordsList.remove(ii); 
        break; 
       } 
      } 
     } 
     for (String str : wordsList){ 
      System.out.print(str+" "); 
     } 
    }catch(Exception ex){ 
     System.out.println(ex); 
    }

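Source

2014-12-29 JavaLearner

Would splitting the string first help? Something like "phrase.split(delims);". You could filter out the unwanted parts first and then stitch the rest back together. That might solve your "this" and "his" problem. –

[A more specific question is here](http://stackoverflow.com/questions/22257598/best-way-to-remove-stop-words-from-files) – jsroyal

There are several ways to go from there. For example, you can set the value to "" instead of removing it. Or build a separate "result" list.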
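To see why a separate "result" list helps, here is a minimal sketch, not the asker's exact code. The class and method names are made up for illustration. It reproduces the skipping on a five-word list: after remove(i) the next word slides into position i, and the loop's i++ then steps over it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; names are illustrative only.
class IndexShiftDemo {
    // Removes by index while counting upward, as in the question.
    // After remove(i) the following element moves into slot i and is never checked.
    static List<String> removeInPlace(List<String> words, List<String> stopwords) {
        List<String> copy = new ArrayList<>(words);
        for (int i = 0; i < copy.size(); i++) {
            if (stopwords.contains(copy.get(i).toLowerCase())) {
                copy.remove(i); // remove(int index): shifts later elements left
            }
        }
        return copy;
    }

    // Builds a separate result list instead, so no index ever shifts.
    static List<String> keepNonStopwords(List<String> words, List<String> stopwords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!stopwords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("fast", "and", "there's", "so", "much");
        List<String> stopwords = Arrays.asList("and", "there's", "so");
        System.out.println(removeInPlace(words, stopwords));    // [fast, there's, much]
        System.out.println(keepNonStopwords(words, stopwords)); // [fast, much]
    }
}
```

The first call leaves "there's" behind, which matches the symptom in the question. The second never modifies the list it is iterating over.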

Source

2014-12-29 09:11:31

Try the following approach:

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."; 
    String stopWords[]={"love","this","cool"}; 
    for(int i=0;i<stopWords.length;i++){ 
     if(s.contains(stopWords[i])){ 
      s=s.replaceAll(stopWords[i]+"\\s+", ""); //note this will remove spaces at the end 
     } 
    } 
    System.out.println(s);

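This way your final output will not contain the words you don't want. Just put the stopword list in an array and replace each entry in the target string.
Output for my stopwords: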

I phone, its super fast and there's so much new and things with jelly bean....but of recently I've seen some bugs.

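Source

2014-12-29 08:56:28

After the for loop, s = s.replaceAll("  ", "<single space>"); to turn two spaces into a single space? –

Also, just as with Vimal's answer, you would remove substrings from the middle of other words (try adding "a" as a stopword) –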
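A small sketch of what that comment describes, with "a" as the only stopword. The class and method names are hypothetical. The first method follows the replaceAll(stopword + "\\s+", "") idea from the answer above, and the second adds \\b word boundaries for comparison.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class SubstringPitfallDemo {
    // No word boundaries: also matches the stopword at the end of longer words.
    static String naive(String text, String stopword) {
        return text.replaceAll(stopword + "\\s+", "");
    }

    // \b on both sides restricts the match to whole words.
    static String bounded(String text, String stopword) {
        return text.replaceAll("\\b" + Pattern.quote(stopword) + "\\b\\s*", "");
    }

    public static void main(String[] args) {
        String s = "a sea of data";
        System.out.println(naive(s, "a"));   // seof data
        System.out.println(bounded(s, "a")); // sea of data
    }
}
```

Without the boundaries, the trailing "a " of "sea " is eaten along with the real stopword.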

Instead, why not use the approach below? It is easier to read and understand:

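// "words" here holds the stopwords to remove 
for(String word : words){ 
    // replaceAll takes a regex; replace() would search for the literal text "\s*" 
    s = s.replaceAll(word+"\\s*", ""); 
} 
System.out.println(s); // prints the string with the words removed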

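Source

2014-12-29 08:56:41

Do note that this implementation will result in two spaces. –

The problem with this is that it also removes stopwords inside other words. For example, it removes "his" from "this". – JavaLearner

@AngelKoh thanks for pointing that out :) –

Here is a better solution (IMHO), using only regular expressions: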

// instead of the ".....", add all your stopwords, separated by "|" 
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this" 
    // the "\\s?" is to suppress optional trailing white space 
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?"); 
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."); 
    String s = m.replaceAll(""); 
    System.out.println(s);
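If the stopwords come from a list rather than being typed into the pattern by hand, the alternation can be built programmatically. The following is only a sketch under that assumption. The class name, method names, and the CASE_INSENSITIVE flag are my additions, not part of the answer above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper; names are illustrative only.
class StopwordRegexDemo {
    // Joins the stopwords into \b(?:w1|w2|...)\b\s? ; Pattern.quote escapes
    // any regex metacharacters a stopword might contain.
    static Pattern buildPattern(List<String> stopwords) {
        String alternation = stopwords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?", Pattern.CASE_INSENSITIVE);
    }

    static String removeStopwords(String text, List<String> stopwords) {
        if (stopwords.isEmpty()) {
            return text; // an empty alternation would match at every word boundary
        }
        return buildPattern(stopwords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> stopwords = Arrays.asList("i", "this", "its", "and");
        System.out.println(removeStopwords("I love this phone, its super fast and cool", stopwords));
        // love phone, super fast cool
    }
}
```

One caveat: an apostrophe counts as a word boundary, so with "i" as a stopword "I've" becomes "'ve".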

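Source

2014-12-29 08:58:20 geert3

The break statement is not the problem at all. In the first loop he takes the first word of the text. Then he checks whether it exists in the stopword list. If he finds the word in the stopword list, he breaks out of the search loop. Then he takes the next word and searches the stopword list again. –

Yes, removing the break solved the problem again – JavaLearner

As with the other answers, you will remove stopwords that are substrings of normal words. –

Try using the String replaceAll API, like: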

String myString = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."; 
String stopWords = "I|its|with|but"; 
String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", ""); 
System.out.println(afterStopWords); 

OUTPUT: 
love this phone, super fast and there's so much new and cool things jelly bean....of recently 've seen some bugs.

Source

2014-12-29 09:05:13 SMA

Please try the program below.

String s="I love this phone, its super fast and there's so" + 
      " much new and cool things with jelly bean....but of recently I've seen some bugs."; 
    String[] words = s.split(" "); 
    ArrayList<String> wordsList = new ArrayList<String>(); 
    Set<String> stopWordsSet = new HashSet<String>(); 
    stopWordsSet.add("I"); 
    stopWordsSet.add("THIS"); 
    stopWordsSet.add("AND"); 
    stopWordsSet.add("THERE'S"); 

    for(String word : words) 
    { 
     String wordCompare = word.toUpperCase(); 
     if(!stopWordsSet.contains(wordCompare)) 
     { 
      wordsList.add(word); 
     } 
    } 

    for (String str : wordsList){ 
     System.out.print(str+" "); 
    }

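OUTPUT: love phone, its super fast so much new cool things with jelly bean....but of recently I've seen some bugs.

Source

2014-12-29 09:18:22 robin

Good catch, adding the wanted words instead of removing the unwanted ones! +1 – Charlie

Try storing the stopwords in a Set and tokenizing your string into a list. After that you can simply use 'removeAll' to get the result.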

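Set<String> stopwords = new HashSet<>(); // Set is an interface, so instantiate a HashSet 
//fill in the set with your file 

String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."; 
// Arrays.asList alone returns a fixed-size list; copy it so removeAll can modify it 
List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" "))); 

listOfStrings.removeAll(stopwords); 
String result = StringUtils.join(listOfStrings, " "); // StringUtils: presumably Apache Commons Lang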

No loops needed - they usually mean trouble.
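For the "fill in the set with your file" step, one possible sketch is shown here. It assumes the file holds one stopword per line, as the readLine loop in the question suggests. The class and method names are hypothetical, and String.join stands in for StringUtils.join to keep the example dependency-free.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; names are illustrative only.
class StopwordFileDemo {
    // Reads one stopword per line, trimmed and lower-cased; blank lines are skipped.
    static Set<String> loadStopwords(Path file) throws IOException {
        Set<String> stopwords = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    // removeAll compares whole tokens exactly, so "I" is not removed by a stopword "i".
    static String removeStopwords(String text, Set<String> stopwords) {
        List<String> words = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        words.removeAll(stopwords);
        return String.join(" ", words);
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for stopwordslist.txt.
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, Arrays.asList("this", "its", "and"));
        Set<String> stopwords = loadStopwords(file);
        System.out.println(removeStopwords("love this phone, its super fast and cool", stopwords));
        // love phone, super fast cool
        Files.delete(file);
    }
}
```

A Set has no fixed capacity, so it also avoids the 2000-element array from the question.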

Source

2014-12-29 09:31:39

You can use the replaceAll function like this

String yourString ="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."; 
yourString=yourString.replaceAll("stop" ,"");

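Source

2014-12-29 10:17:43

It looks like you stop after one stopword has been removed from the sentence instead of moving on to the next one: you need to remove every stopword in the sentence.

You should try changing your code:

From: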

for(int ii = 0; ii < wordsList.size(); ii++){ 
    for(int jj = 0; jj < k; jj++){ 
     if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){ 
      wordsList.remove(ii); 
      break; 
     } 
    } 
}

To something like:

for(int ii = 0; ii < wordsList.size(); ii++) 
{ 
    for(int jj = 0; jj < k; jj++) 
    { 
     if(wordsList.get(ii).toLowerCase().contains(stopwords[jj])) 
     { 
      wordsList.remove(ii); 
     } 
    } 
}

Note that the break is removed and stopword.contains(word) is changed to word.contains(stopword).
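Two things to watch with the loop above. word.contains(stopword) is a substring test, so "this" would be dropped if "his" were a stopword. And without the break, remove(ii) can run more than once for the same index. As a hedged alternative, not the answerer's code and assuming Java 8 or later, removeIf with an exact Set lookup avoids both. The names below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; names are illustrative only.
class RemoveIfDemo {
    // Drops every token whose lower-cased form is exactly a stopword.
    // removeIf handles the index bookkeeping internally, so nothing is skipped.
    static List<String> strip(String text, Set<String> stopwords) {
        List<String> words = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        words.removeIf(word -> stopwords.contains(word.toLowerCase()));
        return words;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("his", "and", "so"));
        System.out.println("this".contains("his"));               // true
        System.out.println(strip("this and so that", stopwords)); // [this, that]
    }
}
```

The adjacent stopwords "and" and "so" are both removed, while "this" survives because the lookup compares whole words.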

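Source

2015-10-13 00:50:35 Inquisitor

-1

A recent project of mine needed a way to filter stop/stemming and swear words from a given text or file. After going through a few blogs and write-ups, I created a simple library to filter data/files and made it available in Maven. Hope this helps someone.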

https://github.com/uttesh/exude

 <dependency> 
     <groupId>com.uttesh</groupId> 
     <artifactId>exude</artifactId> 
     <version>0.0.2</version> 
    </dependency>

Source

2016-01-07 15:23:24

This is a buggy library – MFARID

@MFARID could you please explain on what basis you call it a buggy library? –
