2016-04-29 111 views
-1

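Regular expression to select sentences of a specific length

I need to extract sentences that contain a specific word from a block of text. For that part I have this: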

[A-Z][^\\.;\\?\\!]*(word)[^\\.;\\?\\!]* 

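But I also need the sentence to be of a specific length, say 30 to 250 characters. I know it seems easy, but I can't figure out how to do it.

So the input could be: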

Welcome to RegExr v2.1 by gskinner.com, proudly **hosted** by Media Temple! A full Reference & Help is available in the Library, or watch the video Tutorial hosted by Media Temple which are so amazingly awesome that just looking at the name I get a boner instantly, and I am really serious right now, it's that exciting if you didn't get it. 

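The text above contains 2 sentences: one is 76 characters long and the other is 266. Both contain the word hosted, which is the word we are selecting on. So the regex should match only the first sentence. The output should be: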

Welcome to RegExr v2.1 by gskinner.com, proudly **hosted** by Media Temple 

Thanks in advance.

+1

What is the input? Are you just testing with your own data? – sweaver2112

+0

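This is a rather difficult problem, especially since we don't have any context. Please show what your block of text looks like. One example of the difficulty: abbreviations such as U.S. – lmo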

+0

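This is also difficult because the regex capabilities in R are quite limited. You would probably be better off checking the length of the matches it finds. – 4castle

Answers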

1

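I assume you are parsing English text.

You can use an NLP library to split the text into sentences, and then keep only those that contain the word and have the required length. I used an excerpt from a Wikipedia article on Hemingway, with "1940" as the word to look for, and then applied a second grep to restrict the length.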

> library(NLP)      # provides as.String() and annotate()
> library(openNLP)  # provides Maxent_Sent_Token_Annotator()
> text <- as.String("Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939. In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript. The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans. The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war. Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights. For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway. Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.") 
> sentence_token_annotator <- Maxent_Sent_Token_Annotator()  # openNLP sentence detector
> sentence.boundaries <- annotate(text, sentence_token_annotator) 
> sentences <- text[sentence.boundaries] 
> sentences 
[1] "Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939."                                 
[2] "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."                                          
[3] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans." 
[4] "The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war."                      
[5] "Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights."                             
[6] "For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway."               
[7] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."                                       
> with_word = grep("1940", sentences, fixed = TRUE, value = TRUE) 
> with_word 
[1] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans." 
[2] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."                                       
> with_word[grep("^.{30,100}$", with_word)] 
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75." 

In your case, use your own text and a {30,250} limiting quantifier to get just the sentences you need.
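If the length check does not have to live inside a regex, the same two-step filter can be sketched in base R with `grepl()` and `nchar()`. This is only a sketch under stated assumptions: the helper name `filter_sentences` and its parameters are hypothetical, and the sentences are typed in by hand rather than produced by openNLP.

```r
# Sentences as plain character strings; in the answer above they come from openNLP.
sentences <- c(
  "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript.",
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "It was 1940."
)

# Hypothetical helper: keep the sentences that contain `word` literally and
# whose length in characters lies within [min_len, max_len].
filter_sentences <- function(sentences, word, min_len = 30, max_len = 250) {
  has_word <- grepl(word, sentences, fixed = TRUE)  # literal match, no regex
  len <- nchar(sentences, type = "chars")           # length of each sentence
  sentences[has_word & len >= min_len & len <= max_len]
}

result <- filter_sentences(sentences, "1940")
print(result)
```

Because the word is matched with `fixed = TRUE`, it needs no escaping here, and the length limits stay ordinary numbers instead of being baked into a pattern.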

Note that it is possible to grep the sentences you need in a single operation, but you will need a more complex PCRE regex with a lookahead:

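> my_sent <- grep("(?s)^(?=.{30,100}$).*1940.*$", sentences, value = TRUE, perl = TRUE) 
> my_sent 
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75." 

The regex "(?s)^(?=.{30,100}$).*1940.*$" requires the string to be 30 to 100 characters long from start to end (set your own limits) and to contain 1940. The ^ anchor makes the lookahead measure the whole string rather than just a trailing part of it. Note that if your word contains special regex metacharacters, they must be escaped with \\.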
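To illustrate the escaping remark, here is a minimal sketch that escapes a literal word and assembles the length-plus-word pattern from it. The helper names `escape_regex` and `length_word_pattern` are hypothetical, and the test sentences are typed in by hand for the example.

```r
# Hypothetical helper: backslash-escape PCRE metacharacters in a literal word.
escape_regex <- function(word) {
  gsub("([.\\\\|()\\[\\]{}^$*+?-])", "\\\\\\1", word, perl = TRUE)
}

# Hypothetical helper: build a pattern that requires the whole string to be
# min_len..max_len characters long and to contain `word` literally.
length_word_pattern <- function(word, min_len = 30, max_len = 250) {
  paste0("(?s)^(?=.{", min_len, ",", max_len, "}$).*", escape_regex(word), ".*$")
}

sentences <- c(
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "It cost $2.75."
)

pattern <- length_word_pattern("$2.75")
matched <- grep(pattern, sentences, perl = TRUE, value = TRUE)
print(pattern)
print(matched)
```

Without the escaping step, the `$` in "$2.75" would be read as an end-of-string anchor and the `.` as "any character", so the pattern would not mean what the word says.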

> with_word = grep("(?s)^(?=.{30,250}$).*\\bhosted\\b.*$", sentences, perl = TRUE, value = TRUE) 
> with_word 
[1] "proudly hosted by Media Temple!" 

+1

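Dear Wiktor. This is fantastic! Several comments on this post had me convinced that R might not be the best tool for working with natural-language text. But ka-boom! You showed me a suitable package, and a complete solution to my problem. It shows that I wasn't even close to the result. Thank you very much! I beg your pardon for the vague explanation in the post, yet you understood it exactly. By the way, it took me 5 hours of workarounds to install openNLP; I had to downgrade Java and do a lot of other things, which is why this reply is late. Thank you, and have a great life :P – Denis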

0

You can use a positive lookahead:

(?=[\p{Any}]{30,250}.*) 
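A small sketch of how such a lookahead behaves in R with `perl = TRUE`. It assumes `.` in place of `[\p{Any}]` and uses made-up strings of known length. As far as I can tell, a lookahead followed by `.*` only enforces the minimum length, so the upper bound needs `^` and `$` anchors:

```r
# Made-up test strings of 10, 100 and 300 characters.
x <- c(strrep("a", 10), strrep("a", 100), strrep("a", 300))

# Unanchored lookahead followed by .*: only a minimum of 30 characters is enforced.
unanchored <- grepl("(?=.{30,250}.*)", x, perl = TRUE)

# Anchored at both ends: the whole string must be 30 to 250 characters long.
anchored <- grepl("^(?=.{30,250}$)", x, perl = TRUE)

print(unanchored)
print(anchored)
```

Here the 300-character string passes the unanchored check but is rejected by the anchored one, which is the behaviour the question asks for.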
+0

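I beg your pardon, but as you may have noticed, I'm not very good with regex. I don't quite understand how I can use a positive lookahead in this particular example. What would it look like in our case? – Denis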

+0

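A positive lookahead ensures that whatever comes next in the input must match the lookahead group. Let me take a look at your updated question. –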

+0

How do we know where the first sentence ends? Are the sentences on separate lines or on the same line? –