2015-04-04
3

StringTokenizer is used to tokenize a POS-tagged string in Java. The string was tagged with Stanford's part-of-speech MaxentTagger. Substrings of each tagged token are then taken so that the POS tag and the word can be displayed separately, iterating token by token.

Here is the text before tagging:

Man has always had this notion that brave deeds are manifest in physical actions. While it is not entirely erroneous, there doesn't lie the singular path to valor. From of old, it is a sign of strength to fight back a wild animal. It is understandable if fought in defense; however, to go the extra mile and instigate an animal and fight it is the lowest degree of civilization man can exhibit. More so, in this age of reasoning and knowledge. Tradition may call it, but adhering blindly to it is idiocy, be it the famed Jallikattu in Tamil Nadu (The Indian equivalent to the Spanish Bullfighting) or the cock-fights. Pelting stones at a dog and relishing it howl in pain is dreadful. If one only gave as much as a trickle of thought and conscience the issue would surface as deplorable in every aspect. Animals play a part along with us in our ecosystem. And, some animals are dearer: the stray dogs that guard our street, the intelligent crow, the beast of burden and the everyday animals of pasture. Literature has voiced in its own way: In The Lord of the Rings the fellowship treated Bill Ferny's pony with utmost care; in Harry Potter when they didn't heed Hermione's advice on the treatment of house elves they learned the hard way that it caused their own undoing; and Jack London, writes all about animals. Indeed, Kindness to animals is a virtue.

Here is the POS-tagged text:

Man_NN has_VBZ always_RB had_VBN this_DT notion_NN that_IN brave_VBP deeds_NNS are_VBP manifest_JJ in_IN physical_JJ actions_NNS ._. While_IN it_PRP is_VBZ not_RB entirely_RB erroneous_JJ ,_, there_EX does_VBZ n't_RB lie_VB the_DT singular_JJ path_NN to_TO valor_NN ._. From_IN of_IN old_JJ ,_, it_PRP is_VBZ a_DT sign_NN of_IN strength_NN to_TO fight_VB back_RP a_DT wild_JJ animal_NN ._. It_PRP is_VBZ understandable_JJ if_IN fought_VBN in_IN defense_NN ;_: however_RB ,_, to_TO go_VB the_DT extra_JJ mile_NN and_CC instigate_VB an_DT animal_NN and_CC fight_VB it_PRP is_VBZ the_DT lowest_JJS degree_NN of_IN civilization_NN man_NN can_MD exhibit_VB ._. More_RBR so_RB ,_, in_IN this_DT age_NN of_IN reasoning_NN and_CC knowledge_NN ._. Tradition_NN may_MD call_VB it_PRP ,_, but_CC adhering_JJ blindly_RB to_TO it_PRP is_VBZ idiocy_NN ,_, be_VB it_PRP the_DT famed_JJ Jallikattu_NNP in_IN Tamil_NNP Nadu_NNP -LRB-_-LRB- The_DT Indian_JJ equivalent_NN to_TO the_DT Spanish_JJ Bullfighting_NN -RRB-_-RRB- or_CC the_DT cock-fights_NNS ._. Pelting_VBG stones_NNS at_IN a_DT dog_NN and_CC relishing_VBG it_PRP howl_NN in_IN pain_NN is_VBZ dreadful_JJ ._. If_IN one_CD only_RB gave_VBD as_RB much_JJ as_IN a_DT trickle_VB of_IN thought_NN and_CC conscience_NN the_DT issue_NN would_MD surface_VB as_IN deplorable_JJ in_IN every_DT aspect_NN ._. Animals_NNS play_VBP a_DT part_NN along_IN with_IN us_PRP in_IN our_PRP$ ecosystem_NN ._. And_CC ,_, some_DT animals_NNS are_VBP dearer_RBR :_: the_DT stray_JJ dogs_NNS that_WDT guard_VBP our_PRP$ street_NN ,_, the_DT intelligent_JJ crow_NN ,_, the_DT beast_NN of_IN burden_NN and_CC the_DT everyday_JJ animals_NNS of_IN pasture_NN ._. Literature_NN has_VBZ voiced_VBN in_IN its_PRP$ own_JJ way_NN :_: In_IN The_DT Lord_NN of_IN the_DT Rings_NNP the_DT fellowship_NN treated_VBN Bill_NNP Ferny_NNP 's_POS pony_NN with_IN utmost_JJ care_NN ;_: in_IN Harry_NNP Potter_NNP when_WRB they_PRP did_VBD n't_RB heed_VB Hermione_NNP 's_POS advice_NN on_IN the_DT treatment_NN of_IN house_NN elves_NNS they_PRP learned_VBD the_DT 
hard_JJ way_NN that_IN it_PRP caused_VBD their_PRP$ own_JJ undoing_NN ;_: and_CC Jack_NNP London_NNP ,_, writes_VBZ all_DT about_IN animals_NNS ._. Indeed_RB ,_, Kindness_NN to_TO animals_NNS is_VBZ a_DT virtue_NN ._.

Below is the code that attempts to obtain the above:

import java.io.*; 
import java.util.ArrayList; 
import java.util.List; 
import java.util.StringTokenizer; 
import edu.stanford.nlp.ling.HasWord; 
import edu.stanford.nlp.ling.Sentence; 
import edu.stanford.nlp.process.DocumentPreprocessor; 
import edu.stanford.nlp.tagger.maxent.MaxentTagger; 

String line; 
StringBuilder sb=new StringBuilder(); 
try(FileInputStream input = new FileInputStream("E:\\D.txt")) 
    { 
    int data = input.read(); 
    while(data != -1) 
     { 
     sb.append((char)data); 
     data = input.read(); 
     } 
    } 
catch(IOException e)//read() can also throw IOException, so catch that as well 
{ 
    System.err.println("I/O Exception : " + e.getMessage()); 
} 
line=sb.toString(); 
String line1=line;//Copy for Tagger 
line+=" T";  
List<String> sentenceList = new ArrayList<String>();//TAGGED DOCUMENT 
MaxentTagger tagger = new MaxentTagger("E:\\Installations\\Java\\Tagger\\english-left3words-distsim.tagger"); 
String tagged = tagger.tagString(line1); 
File file = new File("A.txt"); 
BufferedWriter output = new BufferedWriter(new FileWriter(file)); 
output.write(tagged); 
output.close(); 
DocumentPreprocessor dp = new DocumentPreprocessor("C:\\Users\\Admin\\workspace\\Project\\A.txt"); 
int largest=50; 
int m=0; 
StringTokenizer st1; 
for (List<HasWord> sentence : dp) 
{ 
    String sentenceString = Sentence.listToString(sentence); 
    sentenceList.add(sentenceString.toString()); 
} 
String[][] Gloss=new String[sentenceList.size()][largest]; 
String[] Adj=new String[largest]; 
String[] Adv=new String[largest]; 
String[] Noun=new String[largest]; 
String[] Verb=new String[largest]; 
int adj=0,adv=0,noun=0,verb=0; 
for(int i=0;i<sentenceList.size();i++) 
{ 
    st1= new StringTokenizer(sentenceList.get(i)," ,(){}[]/.;:&?!"); 
    m=0;//Count for Gloss 2nd dimension 
    //GETTING THE POS's COMPARTMENTALISED 
    while(st1.hasMoreTokens()) 
    { 
     String token=st1.nextToken(); 
     if(token.length()>1)//TO SKIP PAST TOKENS FOR PUNCTUATION MARKS 
     { 
     System.out.println(token); 
     String s=token.substring(token.lastIndexOf("_")+1,token.length()); 
     System.out.println(s); 
     if(s.equals("JJ")||s.equals("JJR")||s.equals("JJS")) 
     { 
      Adj[adj]=token.substring(0,token.lastIndexOf("_")); 
      System.out.println(Adj[adj]); 
      adj++; 
     } 
     if(s.equals("NN")||s.equals("NNS")) 
     { 
      Noun[noun]=token.substring(0, token.lastIndexOf("_")); 
      System.out.println(Noun[noun]); 
      noun++; 
     } 
     if(s.equals("RB")||s.equals("RBR")||s.equals("RBS")) 
     { 
      Adv[adv]=token.substring(0,token.lastIndexOf("_")); 
      System.out.println(Adv[adv]); 
      adv++; 
     } 
     if(s.equals("VB")||s.equals("VBD")||s.equals("VBG")||s.equals("VBN")||s.equals("VBP")||s.equals("VBZ")) 
     { 
      Verb[verb]=token.substring(0,token.lastIndexOf("_")); 
      System.out.println(Verb[verb]); 
      verb++; 
     } 
     } 
    } 
    i++;//TO SKIP PAST THE LINES WHERE AN EXTRA UNDERSCORE OCCURS FOR FULLSTOP 
} 

D.txt contains the plain text.

As for the problem:

Every word is tokenized at the spaces, except for 'n't_RB', which gets tokenized separately into n't and RB.

This is what the output looks like:

Man_NN 
NN 
Man 
has_VBZ 
VBZ 
has 
always_RB 
RB 
always 
had_VBN 
VBN 
had 
this_DT 
DT 
notion_NN 
NN 
notion 
that_IN 
IN 
brave_VBP 
VBP 
brave 
deeds_NNS 
NNS 
deeds 
are_VBP 
VBP 
are 
manifest_JJ 
JJ 
manifest 
in_IN 
IN 
physical_JJ 
JJ 
physical 
actions_NNS 
NNS 
actions 
While_IN 
IN 
it_PRP 
PRP 
is_VBZ 
VBZ 
is 
not_RB 
RB 
not 
entirely_RB 
RB 
entirely 
erroneous_JJ 
JJ 
erroneous 
there_EX 
EX 
does_VBZ 
VBZ 
does 
n't 
n't 
RB 
RB 

But if I run the tokenizer on just 'there_EX does_VBZ n't_RB lie_VB', then 'n't_RB' stays together as a single token. When I run the full program I get a StringIndexOutOfBoundsException, which is understandable since there is no '_' in 'n't' or 'RB'. Could anyone take a look at it? Thanks.
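For reference, that exception is easy to reproduce in isolation; the token value below is illustrative. When a token such as 'RB' contains no underscore, lastIndexOf returns -1, and substring(0, -1) throws:

```java
public class SubstringRepro {
    public static void main(String[] args) {
        String token = "RB"; // a bare tag with no underscore
        int idx = token.lastIndexOf("_");
        System.out.println(idx); // prints -1
        try {
            // substring(0, -1) has an invalid end index and throws
            String word = token.substring(0, idx);
            System.out.println(word);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("caught: " + e);
        }
    }
}
```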

+0

What are you trying to ask? – Rahul 2015-04-04 09:59:26

+0

The question is why only 'n't_RB' gets split into n't and RB, while every other word is split at the underscore? – 2015-04-04 10:07:41

+0

What is the reason for if(token.length() > 1) // TO SKIP PAST TOKENS FOR PUNCTUATION MARKS? – Rahul 2015-04-04 10:15:50

Answers

1

The DocumentPreprocessor documentation says:

NOTE: If a null argument is used, then the document is assumed to be already tokenized and DocumentPreprocessor performs no tokenization.

Since the document you load from the file was already tokenized in the first step of your program, you should do:

DocumentPreprocessor dp = new DocumentPreprocessor("./data/stanford-nlp/A.txt"); 
dp.setTokenizerFactory(null); 

Then it outputs the words correctly, e.g.

... 
did_VBD 
VBD 
did 
n't_RB 
RB 
n't 
heed_VB 
VB 
heed 
Hermione_NNP 
NNP 
's_POS 
POS 
... 
+0

Thank you so much. I just can't figure out what motivates you guys to answer random people's doubts :) – 2015-04-04 13:38:23

+0

The challenge, perhaps ;) – 2015-04-04 14:04:13

+0

Now another problem has surfaced. DocumentPreprocessor is doing more than just splitting the sentences. – 2015-04-05 18:38:04

0

I would try String.split() instead of StringTokenizer:

String str = "Man_NN has_VBZ always_RB had_VBN this_DT notion_NN that_IN brave_VBP deeds_NNS are_VBP manifest_JJ in_IN physical_JJ actions_NNS ._. While_IN it_PRP is_VBZ not_RB entirely_RB erroneous_JJ ,_, there_EX does_VBZ n't_RB lie_VB the_DT singular_JJ path_NN to_TO valor_NN ._. From_IN of_IN old_JJ ,_, it_PRP is_VBZ a_DT sign_NN of_IN strength_NN to_TO fight_VB back_RP a_DT wild_JJ animal_NN ._. It_PRP is_VBZ understandable_JJ if_IN fought_VBN in_IN defense_NN ;_: however_RB ,_, to_TO go_VB the_DT extra_JJ mile_NN and_CC instigate_VB an_DT animal_NN and_CC fight_VB it_PRP is_VBZ the_DT lowest_JJS degree_NN of_IN civilization_NN man_NN can_MD exhibit_VB ._. More_RBR so_RB ,_, in_IN this_DT age_NN of_IN reasoning_NN and_CC knowledge_NN ._. Tradition_NN may_MD call_VB it_PRP ,_, but_CC adhering_JJ blindly_RB to_TO it_PRP is_VBZ idiocy_NN ,_, be_VB it_PRP the_DT famed_JJ Jallikattu_NNP in_IN Tamil_NNP Nadu_NNP -LRB-_-LRB- The_DT Indian_JJ equivalent_NN to_TO the_DT Spanish_JJ Bullfighting_NN -RRB-_-RRB- or_CC the_DT cock-fights_NNS ._. Pelting_VBG stones_NNS at_IN a_DT dog_NN and_CC relishing_VBG it_PRP howl_NN in_IN pain_NN is_VBZ dreadful_JJ ._. If_IN one_CD only_RB gave_VBD as_RB much_JJ as_IN a_DT trickle_VB of_IN thought_NN and_CC conscience_NN the_DT issue_NN would_MD surface_VB as_IN deplorable_JJ in_IN every_DT aspect_NN ._. Animals_NNS play_VBP a_DT part_NN along_IN with_IN us_PRP in_IN our_PRP$ ecosystem_NN ._. And_CC ,_, some_DT animals_NNS are_VBP dearer_RBR :_: the_DT stray_JJ dogs_NNS that_WDT guard_VBP our_PRP$ street_NN ,_, the_DT intelligent_JJ crow_NN ,_, the_DT beast_NN of_IN burden_NN and_CC the_DT everyday_JJ animals_NNS of_IN pasture_NN ._. 
Literature_NN has_VBZ voiced_VBN in_IN its_PRP$ own_JJ way_NN :_: In_IN The_DT Lord_NN of_IN the_DT Rings_NNP the_DT fellowship_NN treated_VBN Bill_NNP Ferny_NNP 's_POS pony_NN with_IN utmost_JJ care_NN ;_: in_IN Harry_NNP Potter_NNP when_WRB they_PRP did_VBD n't_RB heed_VB Hermione_NNP 's_POS advice_NN on_IN the_DT treatment_NN of_IN house_NN elves_NNS they_PRP learned_VBD the_DT hard_JJ way_NN that_IN it_PRP caused_VBD their_PRP$ own_JJ undoing_NN ;_: and_CC Jack_NNP London_NNP ,_, writes_VBZ all_DT about_IN animals_NNS ._. Indeed_RB ,_, Kindness_NN to_TO animals_NNS is_VBZ a_DT virtue_NN ._. "; 

for(String word : str.split("\\s")){ 

    if(word.split("_").length==2){ 

     String filteredWord = word.split("_")[0]; 
     String wordType  = word.split("_")[1]; 

     System.out.println(word+" = "+filteredWord+ " - "+wordType); 

    } 

} 

The output looks like:

Man_NN = Man - NN 
has_VBZ = has - VBZ 
always_RB = always - RB 
had_VBN = had - VBN 
this_DT = this - DT 
notion_NN = notion - NN 
that_IN = that - IN 
brave_VBP = brave - VBP 
deeds_NNS = deeds - NNS 
are_VBP = are - VBP 
manifest_JJ = manifest - JJ 
in_IN = in - IN 
physical_JJ = physical - JJ 
actions_NNS = actions - NNS 
...... 

As for why only 'n't_RB' is getting split into n't and RB:

StringTokenizer stk = new StringTokenizer("n't_RB","_"); 

while(stk.hasMoreTokens()){ 
    System.out.println(stk.nextToken()); 
} 

This splits it correctly:

n't 
RB 
+0

Thanks, but why isn't 'n't_RB' kept together as n't_RB, instead of being split into n't and RB? That is what confuses me. – 2015-04-04 10:10:38

+0

String.split doesn't solve the problem. As can be inferred from the output, every other word, such as 'manifest_JJ', stays together, so why is n't_RB split into n't and RB? – 2015-04-04 12:13:37

1

The method lastIndexOf returns -1 when there is no match. The exception you receive is caused by the substring call, which you invoke with an index obtained from lastIndexOf that does not point at a valid character in the string.

I think what you can do is check whether that index differs from -1 before using it. With this check you can avoid the annoying error you are getting. Unfortunately, without the whole input text it is really hard to tell which strings do not contain the particular character you specified.
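A minimal sketch of that check, with illustrative token values:

```java
public class SafeSplit {
    public static void main(String[] args) {
        String[] tokens = {"manifest_JJ", "RB"}; // "RB" has no underscore
        for (String token : tokens) {
            int idx = token.lastIndexOf('_');
            if (idx != -1) {
                // Only split when an underscore is actually present
                String word = token.substring(0, idx);
                String tag = token.substring(idx + 1);
                System.out.println(word + " - " + tag); // e.g. "manifest - JJ"
            } else {
                System.out.println("skipped: " + token); // "skipped: RB"
            }
        }
    }
}
```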

For completeness, I think you also need to fix the way you gather all the POS elements. In my opinion the String matrix is error-prone (you have to work out how to manage the indices) and rather inefficient for this kind of task.

Perhaps you could use a Multimap that associates each POS type with all the elements belonging to it. I think that would make everything much easier to manage.
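A sketch of that idea using a plain JDK Map of lists (Guava's Multimap would read similarly); the tagged string here is abbreviated for illustration:

```java
import java.util.*;

public class PosGroups {
    public static void main(String[] args) {
        String tagged = "Man_NN has_VBZ always_RB had_VBN this_DT notion_NN";
        Map<String, List<String>> groups = new HashMap<>();
        for (String token : tagged.split("\\s+")) {
            int idx = token.lastIndexOf('_');
            if (idx == -1) continue; // skip tokens without a tag
            String word = token.substring(0, idx);
            String tag = token.substring(idx + 1);
            // One list per POS tag; no manual index bookkeeping needed
            groups.computeIfAbsent(tag, k -> new ArrayList<>()).add(word);
        }
        System.out.println(groups.get("NN")); // [Man, notion]
        System.out.println(groups.get("RB")); // [always]
    }
}
```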

+0

Thanks, I will look into your suggestions. I have also posted the full text now. I can understand the exception. The only thing I cannot understand is why n't_RB gets split at the underscore, unlike the other elements, which are split at the word gaps. – 2015-04-04 10:17:48