2016-11-18 99 views
0

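Removing duplicates in key-value pairs where the value is a list

Below is my code for detecting abbreviations and their long forms. The code loops over each line of the document, loops over each word in that line, and identifies abbreviation candidates. It then loops over every line of the document again to find a suitable long form for each candidate. My problem is that if an acronym occurs more than once in the document, my output contains it more than once. I want to print each abbreviation only once, together with all of its possible long forms. Here is my code: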

public static void main(String[] args) throws FileNotFoundException 
    { 
     BufferedReader in = new BufferedReader(new FileReader("D:\\Workspace\\resource\\SampleSentences.txt")); 
     String str=null; 
     ArrayList<String> lines = new ArrayList<String>(); 
     String matchingLongForm; 
     List <String> matchingLongForms = new ArrayList<String>() ; 
     List <String> shortForm = new ArrayList<String>() ; 
     Map<String, List<String>> abbreviationPairs = new HashMap<String, List<String>>(); 


     try 
     { 
      while((str = in.readLine()) != null){ 
       lines.add(str); 
      } 
     } 
     catch (IOException e) 
     { 
      // TODO Auto-generated catch block 
      e.printStackTrace(); 
     } 
     String[] linesArray = lines.toArray(new String[lines.size()]); 




     // document wide search for abbreviation long form and identifying several appropriate matches 
     for (String line : linesArray){ 
      for (String word : (Tokenizer.getTokenizer().tokenize(line))){ 
       if (isValidShortForm(word)){ 
        for (int i = 0; i < linesArray.length; i++){ 
         matchingLongForm = extractBestLongForm(word, linesArray[i]); 
         //shortForm.add(word); 
         if (matchingLongForm != null && !(matchingLongForms.contains(matchingLongForm))){ 
          matchingLongForms.add(matchingLongForm); 

          //System.out.println(matchingLongForm); 
          abbreviationPairs.put(word, matchingLongForms); 
          //matchingLongForms.clear(); 
         } 
        } 

        if (abbreviationPairs != null){ 
         //for(abbreviationPairs.) 
         System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairs); 
         abbreviationPairs.clear(); 
         matchingLongForms.clear(); 
         //System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairsNew); 
        } 


        else 
         continue; 
       } 
      } 
     } 
    } 

Here is the current output:

Abbreviation Pair: {GLBA=[Gramm Leach Bliley act]} 
Abbreviation Pair: {NCUA=[National credit union administration]} 
Abbreviation Pair: {FFIEC=[Federal Financial Institutions Examination Council]} 
Abbreviation Pair: {CFR=[comments for the Report]} 
Abbreviation Pair: {CFR=[comments for the Report]} 
Abbreviation Pair: {CFR=[comments for the Report]} 
Abbreviation Pair: {CFR=[comments for the Report]} 
Abbreviation Pair: {OFAC=[Office of Foreign Assets Control]} 
+0

Is `Map<String, Set<String>> abbreviationPairs` an option? – bradimus

+0

Note that [`Files.readAllLines`](https://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#readAllLines(java.nio.file.Path,%20java.nio.charset.Charset)) exists. By reinventing the wheel, you are wasting your time... Besides, you can simply write `for(String line: lines) {...` without copying the contents of the List into an array. – Holger
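A minimal sketch of what this comment suggests, assuming the input is a UTF-8 text file. The class name `ReadLinesSketch`, the helper `readLines`, and the fallback file name are made up for illustration; only `Files.readAllLines` comes from the comment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class ReadLinesSketch {
    // Reads every line of the file into a List in a single call.
    // Assumption: the file is encoded as UTF-8.
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real file as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "SampleSentences.txt");
        List<String> lines = readLines(file);
        // Iterate over the List directly; no copy into a String[] is needed.
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

This replaces the `BufferedReader` loop, the `try`/`catch` around it, and the `toArray` call in the question with one method call.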

Answers

1

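You want key-value pairs of abbreviation and text, so you should use a Map. A map cannot contain duplicate keys; each key can map to at most one value.

The problem is where you print the output, not the map. You print inside the loop, so the map is displayed multiple times.

Move this code outside the loop: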

if (abbreviationPairs != null){ 
    //for(abbreviationPairs.) 
    System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairs); 
    abbreviationPairs.clear(); 
    matchingLongForms.clear(); 
    //System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairsNew); 
} 
+2

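More importantly, the map is cleared on every loop iteration, which makes detecting duplicate keys impossible. Either way, moving the printing code out of the loop is the right fix. Take care to create a new `matchingLongForms` list for each mapping; the `clear()` calls then become obsolete. – Holger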
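The point about one list per mapping can be sketched as follows. This is an illustration under assumptions, not the asker's code: the class `FreshListPerKey`, the method `group`, and the idea of passing ready-made `{abbreviation, longForm}` pairs are hypothetical stand-ins for the `isValidShortForm` / `extractBestLongForm` loop in the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FreshListPerKey {
    // Groups {abbreviation, longForm} pairs so that every key owns its own list.
    static Map<String, List<String>> group(String[][] pairs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            String abbreviation = pair[0];
            String longForm = pair[1];
            List<String> longForms = result.get(abbreviation);
            if (longForms == null) {
                // First time this abbreviation is seen: give it a new list.
                longForms = new ArrayList<>();
                result.put(abbreviation, longForms);
            }
            // Skip long forms that were already recorded for this abbreviation.
            if (!longForms.contains(longForm)) {
                longForms.add(longForm);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"CFR", "comments for the Report"},
            {"OFAC", "Office of Foreign Assets Control"},
            {"CFR", "comments for the Report"},
        };
        // Print once, after all pairs have been collected.
        System.out.println("Abbreviation Pair:\t" + group(pairs));
    }
}
```

Because nothing is shared between keys, there is no need to call `clear()` on either the map or the lists, and repeated occurrences of an abbreviation collapse into a single entry.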

+0

Thanks a lot! I used a combination of your answers: I moved the printing code outside the loop and create a new list for matchingLongForms each time. – serendipity

4

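Try using java.util.Set to store your matching short forms and long forms. From that class's javadoc:

...If this set already contains the element, the call leaves the set unchanged and returns false. In combination with the restriction on constructors, this ensures that sets never contain duplicate elements...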
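A short sketch of how that `add` contract could be used here, combined with the `Map<String, Set<String>>` idea from the comments on the question. It assumes Java 8 or later for `Map.computeIfAbsent`; the class `SetBackedPairs`, the method `record`, and the second long form for CFR are invented for the example.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class SetBackedPairs {
    // Adds a long form under its abbreviation.
    // Returns false when that long form was already stored for the abbreviation.
    static boolean record(Map<String, Set<String>> pairs, String abbreviation, String longForm) {
        // computeIfAbsent creates the set only for a key that is not present yet.
        return pairs.computeIfAbsent(abbreviation, key -> new LinkedHashSet<>()).add(longForm);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> pairs = new TreeMap<>();
        record(pairs, "CFR", "comments for the Report");
        record(pairs, "CFR", "comments for the Report");      // duplicate, ignored by the set
        record(pairs, "CFR", "Code of Federal Regulations");  // made-up second long form
        record(pairs, "OFAC", "Office of Foreign Assets Control");
        System.out.println("Abbreviation Pair:\t" + pairs);
    }
}
```

With a set as the value, the explicit `contains` check disappears, and each abbreviation keeps every distinct long form that was found for it.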

0

Here is the solution.

Thanks to code_angel and Holger.

Move the printing code outside the loop and create a new list for each matching long form.

for (String line : linesArray) {
    for (String word : Tokenizer.getTokenizer().tokenize(line)) {
        if (isValidShortForm(word)) {
            for (int i = 0; i < linesArray.length; i++) {
                matchingLongForm = extractBestLongForm(word, linesArray[i]);
                // A fresh list per mapping, so entries never share one list instance.
                List<String> matchingLongForms = new ArrayList<String>();
                // containsKey(word) skips abbreviations that are already recorded,
                // so only the first long form found for each abbreviation is kept.
                if (matchingLongForm != null && !abbreviationPairs.containsKey(word)) {
                    matchingLongForms.add(matchingLongForm);
                    abbreviationPairs.put(word, matchingLongForms);
                }
            }
        }
    }
}
// Print once, after the whole document has been scanned.
if (abbreviationPairs != null) {
    System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairs);
}

}

New output:

Abbreviation Pair: {NCUA=[National credit union administration], FFIEC=[Federal Financial Institutions Examination Council], OFAC=[Office of Foreign Assets Control], MSSP=[Managed Security Service Providers], IS=[Information Systems], SLA=[Service level agreements], CFR=[comments for the Report], MIS=[Management Information Systems], IDS=[Intrusion detection systems], TSP=[Technology Service Providers], RFI=[risk that FIs], EIC=[Examples of in the cloud], TIER=[The institution should ensure], BCP=[Business continuity planning], GLBA=[Gramm Leach Bliley act], III=[It is important], FI=[Financial Institutions], RFP=[Request for proposal]}