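Text tokenizer - extracting words and positions from text

I have a Set of character delimiters (DELIMITERS), such as '.', ',' and so on. Using it, I want to split a text and get the words together with their positions in the text. String.split() works fine if you only want the words, and the same goes for StringTokenizer. I wrote a simple method to handle this, but maybe there is a better way to achieve this result?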

public List<String> extractWords(String text){
    List<String> words = new ArrayList<>();
    List<WordPos> positions = new ArrayList<>();
    int wordStart = -1; // -1 means "not currently inside a word"
    for(int i = 0; i < text.length(); i++){
        if(DELIMITERS.contains(text.charAt(i))){
            if(wordStart >= 0){ // word just ended
                words.add(text.substring(wordStart, i));
                positions.add(new WordPos(wordStart, i));
            }
            wordStart = -1;
        }else if(wordStart < 0){ // not a delimiter: a word just started
            wordStart = i;
        }
    }
    if(wordStart >= 0){ // text ended inside a word, so flush the last one
        words.add(text.substring(wordStart));
        positions.add(new WordPos(wordStart, text.length()));
    }
    return words; // note: positions is filled but not returned
}

// inner static class for word positions (end is exclusive)
public static class WordPos{
    int start;
    int end;
    public WordPos(int start, int end){
        this.start = start;
        this.end = end;
    }
}
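One possible alternative, not proposed anywhere in this thread, is to let `java.util.regex.Matcher` report the positions: `find()` walks over every run of non-delimiter characters, and `start()`/`end()` give its offsets. The sketch below assumes the delimiter set is just `.`, `,` and space; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexWordSpans {
    // Assumed delimiter set for illustration only: '.', ',' and space.
    // A "word" is any maximal run of characters outside that set.
    private static final Pattern WORD = Pattern.compile("[^., ]+");

    /** Returns {start, end} pairs (end exclusive), one per word, in text order. */
    static List<int[]> wordSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }
}
```

Each word can be recovered with `text.substring(span[0], span[1])`, so the words and their positions stay in a single list. Consecutive delimiters and a trailing word without a closing delimiter need no special handling here.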

Source

2015-02-09 bartektartanus

I think you should post this on http://codereview.stackexchange.com/ – Matt 2015-02-09 10:14:51

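From an efficiency standpoint, I don't think your solution is bad. From an aesthetic standpoint (how the code looks), I would do something like this with Apache Commons StringUtils (I haven't tried it):

Split out all the tokens using splitPreserveAllTokens().
Iterate over the resulting array and store each token together with the position obtained from a lastIndexOf call.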
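The two steps above can be sketched roughly as follows. To keep the example free of dependencies, it swaps in the JDK's `String.split` with a limit of `-1` (which also keeps empty tokens) in place of `splitPreserveAllTokens()`, and it uses `indexOf` with a moving start offset rather than `lastIndexOf`, so that repeated words resolve to the right occurrence. The class name, the `Span` type and the delimiter regex are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitAndLocate {
    /** A word plus its half-open [start, end) range in the source text. */
    static final class Span {
        final String word;
        final int start;
        final int end;

        Span(String word, int start, int end) {
            this.word = word;
            this.start = start;
            this.end = end;
        }
    }

    /** delimiterRegex is a character class such as "[., ]" (an assumed example). */
    static List<Span> locate(String text, String delimiterRegex) {
        List<Span> spans = new ArrayList<>();
        // Limit -1 keeps empty tokens, including trailing ones.
        String[] tokens = text.split(delimiterRegex, -1);
        int from = 0; // search resumes where the previous word ended
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // produced by adjacent, leading or trailing delimiters
            }
            int start = text.indexOf(token, from);
            int end = start + token.length();
            spans.add(new Span(token, start, end));
            from = end;
        }
        return spans;
    }
}
```

Because every character between `from` and the next word is a delimiter, the first match found by `indexOf` is always the word's real position. The cost is an extra scan per token, which is the performance concern raised in the comments.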

Source

2015-02-09 10:19:04 aviad

But every call to 'lastIndexOf' will slow down the loop... – 2015-02-09 10:24:20

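@silvaran, did you read the first sentence of my answer? I am not saying it is the best performance-wise... But performance was not explicitly mentioned. I think that from a "clean code" perspective it is better for the code to be both slim and readable. – aviad 2015-02-09 12:55:50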

List<String> words = new ArrayList<>();
List<WordPos> positions = new ArrayList<>();
int index = 0;
// text is the input string; "., " lists the delimiter characters
StringTokenizer st = new StringTokenizer(text, "., ");

while(st.hasMoreTokens()) {
    String word = st.nextToken();
    words.add(word);
    positions.add(new WordPos(index, index + word.length()));
    // skip the word plus exactly one delimiter character
    index += word.length() + 1;
}

With the approach above, I am assuming there are never two delimiters in a row. If that can happen, the approach is still the same.
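One way to drop the single-delimiter assumption while keeping `StringTokenizer` is its three-argument constructor with `returnDelims` set to `true`: delimiters then come back as tokens too, so the running index can be advanced by the exact length of every piece. This is a minimal sketch, not the answer's own code; the class and `Token` names are hypothetical, and it assumes every delimiter is a single `char`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class DelimiterAwareTokenizer {
    /** A word plus its half-open [start, end) range in the source text. */
    static final class Token {
        final String word;
        final int start;
        final int end;

        Token(String word, int start, int end) {
            this.word = word;
            this.start = start;
            this.end = end;
        }
    }

    static List<Token> tokenize(String text, String delimiters) {
        List<Token> tokens = new ArrayList<>();
        // returnDelims = true: each delimiter is returned as its own token.
        StringTokenizer st = new StringTokenizer(text, delimiters, true);
        int index = 0; // offset of the current piece within text
        while (st.hasMoreTokens()) {
            String piece = st.nextToken();
            boolean isDelimiter = piece.length() == 1
                    && delimiters.indexOf(piece.charAt(0)) >= 0;
            if (!isDelimiter) {
                tokens.add(new Token(piece, index, index + piece.length()));
            }
            index += piece.length(); // advance past words and delimiters alike
        }
        return tokens;
    }
}
```

Since the index moves by the real length of each piece, runs such as a dot followed by a space, or delimiters at the very start of the text, no longer shift the recorded positions.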

Source

2015-02-09 10:20:03

But there may be two delimiters in a row. "John went home. The sky is blue." The dot and the space are together. – bartektartanus 2015-02-09 10:22:00

@bartektartanus Is there a fixed set of delimiters, or can it change? – 2015-02-09 10:34:10
