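Although it is generally hard to tell exactly where a sentence ends, in this case every sentence ends with a period, so we can use that to break the paragraph into sentences. You already have the code that splits it into words, but here it is anyway:
paragraph = "Lorem Ipsum ... "
sentences = []
# Keep cutting at the next period until none is left.
while paragraph.find('.') != -1:
    index = paragraph.find('.')
    # Keep the period with the sentence it closes.
    sentences.append(paragraph[:index+1])
    paragraph = paragraph[index+1:]
print(sentences)
Output: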
['Lorem Ipsum is simply dummy text of the printing and typesetting industry.',
"Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.",
'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.',
'It was popularised in the 1960s with the release of Letraset sheets containing.',
'Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.']
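As a possible alternative to the find() loop, the same split can be sketched with str.split. The function name split_sentences and the sample text are made up for illustration, and the sketch still assumes that a period only ever marks the end of a sentence:

```python
def split_sentences(paragraph):
    """Split on '.' and put the period back on each piece."""
    pieces = paragraph.split('.')
    # Whatever follows the final period (often empty or whitespace) is
    # dropped, which is also what the find()-based loop does.
    return [piece.strip() + '.' for piece in pieces[:-1]]


# Hypothetical sample text, not the paragraph from the question.
text = "First sentence. Second one. Third."
print(split_sentences(text))
```

Unlike the loop, this version strips the leading space from every sentence after the first, so no later strip() is needed.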
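Then we convert each of them into an array of words:
word_matrix = []
for sentence in sentences:
    # strip() removes the space left at the start of each sentence.
    word_matrix.append(sentence.strip().split(' '))
print(word_matrix)
Which outputs: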
[['Lorem', 'Ipsum', 'is', 'simply', 'dummy', 'text', 'of', 'the', 'printing', 'and', 'typesetting', 'industry.'],
['Lorem', 'Ipsum', 'has', 'been', 'the', "industry's", 'standard', 'dummy', 'text', 'ever', 'since', 'the', '1500s,', 'when', 'an', 'unknown', 'printer', 'took', 'a', 'galley', 'of', 'type', 'and', 'scrambled', 'it', 'to', 'make', 'a', 'type', 'specimen', 'book.'],
['It', 'has', 'survived', 'not', 'only', 'five', 'centuries,', 'but', 'also', 'the', 'leap', 'into', 'electronic', 'typesetting,', 'remaining', 'essentially', 'unchanged.'],
['It', 'was', 'popularised', 'in', 'the', '1960s', 'with', 'the', 'release', 'of', 'Letraset', 'sheets', 'containing.'],
['Lorem', 'Ipsum', 'passages,', 'and', 'more', 'recently', 'with', 'desktop', 'publishing', 'software', 'like', 'Aldus', 'PageMaker', 'including', 'versions', 'of', 'Lorem', 'Ipsum.']]
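Note that the words above keep their trailing punctuation ('industry.', '1500s,'). If that is unwanted, one way to drop it is sketched below. The helper sentence_to_words and its strip_punctuation flag are hypothetical names introduced here, and the sample sentences are shortened stand-ins for the real list:

```python
import string


def sentence_to_words(sentence, strip_punctuation=False):
    # split() with no argument splits on any run of whitespace, so leading
    # spaces and double spaces never produce empty strings.
    words = sentence.split()
    if strip_punctuation:
        # Strip punctuation from the ends of each word only, so an inner
        # apostrophe such as the one in "industry's" is kept.
        words = [w.strip(string.punctuation) for w in words]
        # Discard tokens that consisted of punctuation alone.
        words = [w for w in words if w]
    return words


# Shortened, made-up stand-ins for the sentences list.
sentences = ["Lorem Ipsum is simply dummy text.",
             " It has survived five centuries,  but also the leap."]
word_matrix = [sentence_to_words(s, strip_punctuation=True) for s in sentences]
print(word_matrix)
```

With strip_punctuation left at False the result should match the split(' ') version for single-spaced text.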
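You need to split the text into sentences first, and then into words. Deciding where a sentence ends can be difficult. Have you looked at the NLTK package for Python? – James
[i.split() for i in string.split('.') if i.strip()] will give a list of sentences, each holding a list of words. Hope this helps! –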