Removing stop words from a Java string
I have a lot of strings, and a text file containing the stop words that I need to remove from them. Say I have the string
s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."
After removing the stop words, the string should look like this:
"love phone, super fast much cool jelly bean....but recently bugs."
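I have been able to do this, but the problem I am facing now is that whenever stop words are adjacent in the string, only the first one is removed, and I get this result: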
"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"
This is my stopwordslist.txt file: Stopwords
How can I fix this? Here is what I have done so far:
int k=0,i,j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try{
    FileReader fr=new FileReader("F:\\stopwordslist.txt");
    BufferedReader br= new BufferedReader(fr);
    while ((sCurrentLine = br.readLine()) != null){
        stopwords[k]=sCurrentLine;
        k++;
    }
    String s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
    StringBuilder builder = new StringBuilder(s);
    String[] words = builder.toString().split("\\s");
    for (String word : words){
        wordsList.add(word);
    }
    for(int ii = 0; ii < wordsList.size(); ii++){
        for(int jj = 0; jj < k; jj++){
            if(stopwords[jj].contains(wordsList.get(ii).toLowerCase())){
                wordsList.remove(ii);
                break;
            }
        }
    }
    for (String str : wordsList){
        System.out.print(str+" ");
    }
}catch(Exception ex){
    System.out.println(ex);
}
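Would splitting the string first help? Something like "phrase.split(delims);" You could filter out the unwanted parts first and then stitch the rest back together. That might solve your "this" and "his" problem.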
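To make the split, filter and stitch suggestion concrete, here is a minimal sketch rather than a definitive fix. The class name `StopWordFilter`, the methods `loadStopWords` and `removeStopWords`, and the small inline stop-word set are all hypothetical, since the contents of stopwordslist.txt are not shown in the question. It assumes one stop word per line in the file and whole-token, case-insensitive matching. Collecting the kept tokens into a new list, instead of calling `remove(ii)` while counting upward, avoids skipping the element that shifts into the removed slot, and an exact `Set.contains` lookup avoids the substring matches that `String.contains` produces.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of the original question.
final class StopWordFilter {

    // Assumption: the stop-word file holds one word per line.
    static Set<String> loadStopWords(Path file) throws IOException {
        Set<String> stopWords = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                stopWords.add(word);
            }
        }
        return stopWords;
    }

    // Split on whitespace, keep every token that is not a stop word, join again.
    static String removeStopWords(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace yields one empty token
            }
            // Exact, case-insensitive match on the whole token.
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Illustrative stand-in for the contents of stopwordslist.txt.
        Set<String> stopWords = new HashSet<>(Arrays.asList(
                "i", "this", "its", "and", "there's", "so", "new",
                "things", "with", "of", "i've", "seen", "some"));
        String s = "I love this phone, its super fast and there's so much new and cool "
                + "things with jelly bean....but of recently I've seen some bugs.";
        System.out.println(removeStopWords(s, stopWords));
    }
}
```

Tokens keep their attached punctuation here, so "phone," and "bean....but" pass through unchanged; a stop word followed directly by punctuation would not match unless the token is stripped first.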
[A more specific question is here](http://stackoverflow.com/questions/22257598/best-way-to-remove-stop-words-from-files) – jsroyal