2015-03-13 50 views
1

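Regular expression to extract paragraphs from the following text

I have this text, which I extracted from a PDF using iText and placed in a String variable: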

(1) A a, — al'-fah; of Hebrew origin; the first letter of the alphabet; 
figurative only (from its use as a numeral) the first: — Alpha. 
Often used (usually ajn an, before a vowel) also in composition 
(as a contraction from (427) (a]neu,)) in the sense of privation; 
so in many words beginning with this letter; occasionally in the 
sense of union (as a contraction of (260) (a[ma)). 
(2) ÆAarw>n, — ah-ar-ohn'; of Hebrew origin [Hebrew {175} 
('Aharown)]; Aaron, the brother of Moses: — Aaron. 
(3) ÆAbaddw>n, — ab-ad-dohn'; of Hebrew origin [Hebrew {11} 
('abaddown)]; a destroying angel: — Abaddon. 
(4) ajbarh>v, — ab-ar-ace'; from (1) (a) (as a negative particle) and (922) 
(ba>rov); weightless, i.e. (figurative) not burdensome: — from 
being burdensome. 
(5) ÆAbba~, — ab-bah'; of Chaldee origin [Hebrew {2} ('ab (Chaldee))]; 
father (as a vocative): — Abba. 
(6) &Abel, — ab'-el; of Hebrew origin [Hebrew {1893} (Hebel)]; Abel, 
the son of Adam: — Abel. 
(7) ÆAbia>, — ab-ee-ah'; of Hebrew origin [Hebrew {29} ('Abiyah)]; 
Abijah, the name of two Israelites: — Abia. 
(8) ÆAbia>qar, — ab-ee-ath'-ar; of Hebrew origin [Hebrew {54} 
('Ebyathar)]; Abiathar, an Israelite: — Abiathar. 
(9) ÆAbilhnh>, — ab-ee-lay-nay'; of foreign origin [compare Hebrew {58} 
('abel)]; Abilene, a region of Syria: — Abilene. 
(10) ÆAbiou>d, — ab-ee-ood'; of Hebrew origin [Hebrew {31} 
('Abiyhuwd)]; Abihud, an Israelite: — Abiud. 

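Each paragraph in the string starts with a number in parentheses, such as (9) or (5). I would like to use pagestring.split("regex") to extract every paragraph that begins with this character sequence. Can anyone help?

Answers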

0

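This avoids splitting on a "(999)" embedded in the text. It relies on the assumption that a parenthesized number which starts a paragraph is always preceded by a line terminator (or sits at the very beginning of the text). Note also that the sample text produces an empty "paragraph", because there is no text before the first parenthesized number; hence the if statement.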

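    String text = ...;
    // Split on "(n)" only at the start of the input or right after a newline.
    String[] paras = text.split("(?<=(^|\\n))\\(\\d+\\)");
    for (String para : paras) {
        // The piece before the first "(1)" is empty, so skip it.
        if (para.length() > 0) {
            System.out.println("Para: " + para);
        }
    }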
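
For anyone who wants to try this without the PDF, here is a minimal self-contained sketch of the same split. The class name ParagraphSplitDemo, the helper splitParagraphs and the shortened sample string are made up for illustration; only the regex comes from the snippet above. The trim() call is an extra assumption, added to drop the space left after the removed number.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphSplitDemo {

    // Splits on a parenthesized number that sits at the start of the input
    // or directly after a newline; split() consumes the number itself.
    static List<String> splitParagraphs(String text) {
        List<String> result = new ArrayList<>();
        for (String para : text.split("(?<=(^|\\n))\\(\\d+\\)")) {
            if (!para.isEmpty()) {          // skip the empty piece before "(1)"
                result.add(para.trim());    // assumption: surrounding whitespace is not wanted
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened stand-in for the text extracted with iText.
        String text = "(1) A a, first letter; (as a contraction from (427) (a]neu,))\n"
                + "in the sense of privation.\n"
                + "(2) Aaron, the brother of Moses: Aaron.\n";
        for (String para : splitParagraphs(text)) {
            System.out.println("Para: " + para);
        }
    }
}
```

The embedded "(427)" stays inside the first paragraph because it is preceded by a space rather than a newline.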
+0

Awesome! Is there a tutorial or guide you can recommend? Regular expressions really trip me up. – Lema 2015-03-13 09:01:14

+0

I learned regular expressions a long time ago, so I can't really recommend a tutorial. But http://regexcrossword.com/ offers a fun way to learn. – laune 2015-03-13 09:21:59

0

You can use the regex "[\n|.]\\([0-9]{1,2}\\)" with the split method; it extracts all the paragraphs from your text (for numbers from 0 to 99):

String[] parts=st.split("[\n|.]\\([0-9]{1,2}\\)"); 

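[\n|.]: considers only numbers that start a new paragraph and ignores any (n) inside the paragraph text. (Inside a character class, | and . are literal, so this matches a newline, a pipe or a period directly before the opening parenthesis.)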

\\([0-9]{1,2}\\): matches any group of one or two digits enclosed in ().

Here is the working DEMO, which yields an array containing all the paragraphs.
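
A quick way to see what this pattern does is to run it on a small sample. The sketch below uses a made-up three-entry string and illustrative names (CharClassSplitDemo, split); only the regex is taken from the answer. With this input the first element keeps its "(1)" prefix, since nothing precedes it at the start of the string, while the later elements lose their numbers.

```java
public class CharClassSplitDemo {

    // Splits with the pattern from this answer. Inside [...] the characters
    // '|' and '.' are literals, so the class matches '\n', '|' or '.'.
    static String[] split(String text) {
        return text.split("[\n|.]\\([0-9]{1,2}\\)");
    }

    public static void main(String[] args) {
        // Hypothetical sample; "(427)" has three digits and follows a space,
        // so it is not treated as a separator.
        String text = "(1) first entry, see (427) for details\n"
                + "(2) second entry\n"
                + "(3) third entry";
        for (String part : split(text)) {
            System.out.println("[" + part + "]");
        }
    }
}
```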

For more information on using regular expressions, see Java Regex Pattern.
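
If the entry numbers need to be kept, one alternative to split is java.util.regex.Pattern with a Matcher. This is only a sketch under assumptions: the class NumberedEntries, the method parse and the choice of a map are hypothetical, and it assumes that every entry starts with "(n)" at the beginning of a line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberedEntries {

    // (?m) lets ^ match at each line start, (?s) lets . match newlines.
    // The lazy body stops before the next line-initial "(n)" or at the
    // end of the input (\z).
    private static final Pattern ENTRY =
            Pattern.compile("(?ms)^\\((\\d+)\\)\\s*(.*?)(?=^\\(\\d+\\)|\\z)");

    // Returns entry number -> entry text, in order of appearance.
    static Map<Integer, String> parse(String text) {
        Map<Integer, String> entries = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            entries.put(Integer.parseInt(m.group(1)), m.group(2).trim());
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened sample text.
        String text = "(1) first letter; see (427) for the contraction\n"
                + "continued on a second line.\n"
                + "(2) Aaron, the brother of Moses.\n";
        parse(text).forEach((n, body) -> System.out.println(n + " => " + body));
    }
}
```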