2016-11-14

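How do I resolve a gensim KeyError when I try to get a document's vector?

I use the code below to train a doc2vec model. Each document is defined as the line of text between two marker lines: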

  • clueweb09-en0001-XX-XXXXX
  • end_clueweb09-en0001-XX-XXXXX

This is my code:

import os

import gensim
from gensim.models.doc2vec import Doc2Vec, LabeledSentence

path = '/home/work/Step2/test-input/html'

alldocs = []  # will hold all docs in original order

for fname in os.listdir(path):
    with open(path + '/' + fname) as alldata:
        for line in alldata:
            docId = line  # first line of each record: the document ID, used as the tag
            print docId
            context = alldata.next()  # second line: the document text
            #print context
            tokens = gensim.utils.to_unicode(context).split()
            end = alldata.next()  # third line: the end_... marker
            alldocs.append(LabeledSentence(tokens[:], [docId]))

model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(alldocs)
for epoch in range(10):
    model.train(alldocs)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

# store the model to mmap-able files
model.save(path + '/my_html_model.doc2vec')

But I get an error when I write model.docvecs['clueweb09-en0001-01-34238'], whereas model.docvecs[0] returns a result.

This is the error I get:

Traceback (most recent call last): 
    File "getLearingDoc.py", line 40, in <module> 
    print model.docvecs['clueweb09-en0001-01-34238'] 
    File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 341, in __getitem__ 
    return self.doctag_syn0[self._int_index(index)] 
    File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/doc2vec.py", line 315, in _int_index 
    return self.max_rawint + 1 + self.doctags[index].offset 
KeyError: 'clueweb09-en0001-01-34238' 

I am not experienced with Python or gensim; please tell me how I can solve this problem.

Answer

Are you sure that exactly the tag 'clueweb09-en0001-01-34238', with no stray newline characters or similar, was presented during training?
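
One way a stray newline can creep in: iterating over a file yields each line with its trailing '\n' still attached, so a tag taken straight from such a line is not equal to the bare ID. The following is only a minimal sketch, assuming the three-line record layout described in the question (ID line, text line, end line). The helper name read_docs and the sample text are hypothetical, and plain tuples stand in for LabeledSentence so the snippet runs without gensim.

```python
def read_docs(lines):
    """Yield (doc_id, tokens) pairs from ID / text / end-marker triples.

    Hypothetical helper for illustration only; not part of gensim.
    """
    it = iter(lines)
    for id_line in it:
        doc_id = id_line.strip()      # drop the trailing newline and spaces
        if not doc_id:
            continue                  # skip blank lines between records
        text_line = next(it, "")      # the document text
        next(it, None)                # discard the end_... marker line
        yield doc_id, text_line.split()


# Made-up sample record in the layout from the question.
sample = [
    "clueweb09-en0001-01-34238\n",
    "some example text\n",
    "end_clueweb09-en0001-01-34238\n",
]

raw_tag = sample[0]                   # what an unstripped line would store as the tag
docs = list(read_docs(sample))

print(repr(raw_tag))                                  # newline is still attached
print(raw_tag == "clueweb09-en0001-01-34238")         # unstripped tag vs. bare ID
print(docs[0][0] == "clueweb09-en0001-01-34238")      # stripped tag vs. bare ID
```

If this is indeed the cause, the equivalent change in the question's loop would be to strip the ID line (for example docId = line.strip()) before building each LabeledSentence, and then retrain.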

You can see all the string doctags known to the model in the keys of the model.docvecs.doctags dict, or in the list model.docvecs.offset2doctag.
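
As a rough sketch of how that inspection could be automated, the helpers below scan any collection of tags for ones carrying stray whitespace and find the stored key that matches a wanted ID once whitespace is ignored. The function names suspect_tags and lookup_key are hypothetical, and the list known is made-up stand-in data so the snippet runs without a trained model.

```python
def suspect_tags(tags):
    """Return, sorted, the tags that differ from their whitespace-stripped form."""
    return sorted(t for t in tags if t != t.strip())


def lookup_key(wanted, tags):
    """Return the stored tag that equals `wanted` once whitespace is ignored, else None."""
    target = wanted.strip()
    for t in tags:
        if t.strip() == target:
            return t
    return None


# Stand-in for the tags a model might have stored (illustrative values only).
known = ["clueweb09-en0001-01-34238\n", "clueweb09-en0001-01-34239\n"]

print(suspect_tags(known))
print(repr(lookup_key("clueweb09-en0001-01-34238", known)))
```

With a trained model, the same helpers could be given model.docvecs.doctags.keys() or model.docvecs.offset2doctag in place of known. If lookup_key returns a string that differs from the ID you asked for, that string is the key the model actually stored.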