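Simple in-memory positional inverted index in Python
I am trying to build a simple positional index, but I am having trouble getting the correct output.
Given a list of strings (sentences), I want to use each string's position in the list as its document id, then iterate over the words in the sentence and use each word's index within the sentence as its position. Then I update a dictionary of words with a tuple of the document id and the word's position in that document.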
Code:
Main function:
def doc_pos_index(alist):
    inv_index = {}
    words = [word for line in alist for word in line.split(" ")]
    for word in words:
        if word not in inv_index:
            inv_index[word] = []
    for item, index in enumerate(alist):  # find item and its index in list
        for item2, index2 in enumerate(alist[item]):  # for words in string find word and its index
            if item2 in inv_index:
                inv_index[i].append(tuple(index, index2))  # if word in index update its list with tuple of doc index and position
    return inv_index
Sample list:
doc_list= [
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed'
]
Expected output:
{'Delivered': [(0,1),(1,1),(2,1),(3,1),(4,1)],
'necessary': [(0,3),(1,3),(2,3),(3,3),(4,3)],
'dejection': [(0,2),(1,2),(2,2),(3,2),(4,2)],
etc...}
Current output:
{'Delivered': [],
'necessary': [],
'dejection': [],
'do': [],
'objection': [],
'prevailed': [],
'mr': [],
'hello': []}
I know about the collections library and NLTK, but I am doing this mainly for learning/practice reasons.
You've got the order of 'enumerate' backwards. You want 'for index, item in enumerate(alist):'
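To spell that comment out, here is a minimal sketch of how the function might look once the unpacking order is fixed. It also addresses three other things that would likely bite next: the inner loop in the question enumerates the characters of the string rather than its words, `inv_index[i]` refers to an undefined name `i`, and `tuple(index, index2)` raises a `TypeError` because `tuple()` accepts a single iterable. The names `doc_id` and `position` are my own choice, and using `split()` instead of `split(" ")` is an assumption that runs of whitespace should not produce empty "words".

```python
def doc_pos_index(alist):
    """Map each word to a list of (doc_id, position) tuples."""
    inv_index = {}
    # enumerate yields (index, item), so the index comes first
    for doc_id, line in enumerate(alist):
        # split the sentence into words before enumerating;
        # enumerating the string itself would walk over characters
        for position, word in enumerate(line.split()):
            # setdefault creates the empty list the first time a word is seen
            inv_index.setdefault(word, []).append((doc_id, position))
    return inv_index


doc_list = [
    'hello Delivered dejection',
    'hello Delivered dejection',
]
print(doc_pos_index(doc_list))
# {'hello': [(0, 0), (1, 0)], 'Delivered': [(0, 1), (1, 1)], 'dejection': [(0, 2), (1, 2)]}
```

Building the dictionary in the same pass also removes the need for the separate loop that pre-fills every word with an empty list.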