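I assume you are parsing English text.
You can use an NLP library to split the text into sentences and then keep only those that contain the word
and have a particular length. I took an excerpt from a Hemingway biography on Wikipedia and used the word "1940" to extract sentences, then applied a second grep
to restrict their length.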
> require(tm)
> require(openNLP)
> text <- as.String("Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939. In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript. The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans. The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war. Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights. For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway. Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.")
> sentence_token_annotator <- Maxent_Sent_Token_Annotator()
> sentence.boundaries <- annotate(text, sentence_token_annotator)
> sentences <- text[sentence.boundaries]
> sentences
[1] "Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939."
[2] "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."
[3] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans."
[4] "The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war."
[5] "Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights."
[6] "For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway."
[7] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
> with_word = grep("1940", sentences, fixed = TRUE, value = TRUE)
> with_word
[1] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans."
[2] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
> with_word[grep("^.{30,100}$", with_word)]
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
In your case, use your own text and the {30,250}
limiting quantifier to get just the sentences you need.
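The two-step filter above can also be wrapped in a small function so that the word and the length limits become parameters. The sketch below is only an illustration: `filter_sentences` and its argument names are hypothetical, and it measures length with `nchar()` instead of a regex quantifier, which is a swapped-in technique rather than the approach shown above. It uses base R only, so the sentences are given directly as a character vector.

```r
# Hypothetical helper (not part of the answer above): keep the sentences that
# contain `word` literally and whose length lies within [min_len, max_len].
filter_sentences <- function(sentences, word, min_len = 30, max_len = 250) {
  has_word <- grepl(word, sentences, fixed = TRUE)  # literal match, no regex escaping needed
  len <- nchar(sentences)                           # sentence length in characters
  sentences[has_word & len >= min_len & len <= max_len]
}

sentences <- c(
  "The novel was finished in July 1940 and published in October.",
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "In 1940.",
  "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."
)

# Keeps the sentences that mention 1940 and are 30 to 100 characters long
result <- filter_sentences(sentences, "1940", min_len = 30, max_len = 100)
print(result)
```

Because the word is matched with `fixed = TRUE`, it can contain characters such as `$` or `.` without any escaping.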
Note that it is possible to grep the sentences you need in a single operation, but you will need a more complex PCRE regex with a lookahead:
> my_sent <- grep("(?s)^(?=.{30,100}$).*1940.*$", sentences, value = TRUE, perl = TRUE)
> my_sent
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
The "(?s)^(?=.{30,100}$).*1940.*$"
regex requires the string to be 30 to 100 characters long from start to end (set your own limits), and the string must contain the word
1940 (note that if your word contains special regex metacharacters, they must be escaped with \\
).
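As a rough sketch of the escaping point, one option with `perl = TRUE` is to wrap the word in PCRE's `\Q...\E` quoting instead of escaping each metacharacter by hand. The helper name `build_pattern` and its arguments are hypothetical, and the sketch assumes the word itself never contains the sequence `\E`.

```r
# Hypothetical helper: build the single-pass pattern for an arbitrary literal word.
build_pattern <- function(word, min_len = 30, max_len = 100) {
  # \Q...\E makes PCRE treat everything in between as literal text
  # (assumption: `word` does not itself contain "\E").
  paste0("(?s)^(?=.{", min_len, ",", max_len, "}$).*\\Q", word, "\\E.*$")
}

sentences <- c(
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "Short $2.75.",
  "The price tag on this particular copy reads 2375 dollars, which is a lot."
)

# "$" and "." are matched literally here, not as an anchor and a wildcard
matched <- grep(build_pattern("$2.75"), sentences, perl = TRUE, value = TRUE)
print(matched)
```

Only sentences that contain the literal text `$2.75` and fall within the length limits are returned.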
> with_word = grep("(?s)^(?=.{30,250}$).*\\bhosted\\b.*$", sentences, perl = TRUE, value = TRUE)
> with_word
[1] "proudly hosted by Media Temple!"
What is the input?
Was this only tested with your data? – sweaver2112
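This is a fairly difficult problem, especially since we have no context. Please show what your block of text looks like. One example of the difficulty: abbreviations of United States, such as U.S. – lmo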
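This is also difficult because the regex capabilities in R are very limited. You might be better off checking the length of the matches it finds. – 4castle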