2013-02-20 61 views
2

Indexing in Python

I want to index an Excel file using the whoosh package, but I get an error: list index out of range. Can anyone help me? My code is:

from whoosh import fields, index 
import os.path 
import csv 
import codecs 

# This list associates a name with each position in a row 
columns = ["juza","chapter","verse","analysis"] 

schema = fields.Schema(juza=fields.NUMERIC, 
         chapter=fields.NUMERIC, 
         verse=fields.NUMERIC, 
         analysis=fields.KEYWORD) 


# Create the Whoosh index 
indexname = "index" 
if not os.path.exists(indexname): 
    os.mkdir(indexname) 
ix = index.create_in(indexname, schema) 

# Open a writer for the index 
with ix.writer() as writer: 
    # Open the CSV file
    with codecs.open("yom.csv", "rb","utf8") as csvfile:
        # Create a csv reader object for the file
        csvreader = csv.reader(csvfile)

        # Read each row in the file
        for row in csvreader:

            # Create a dictionary to hold the document values for this row
            doc = {}

            # Read the values for the row enumerated like
            # (0, "juza"), (1, "chapter"), etc.
            for colnum, value in enumerate(row):

                # Get the field name from the "columns" list
                fieldname = columns[colnum]

                # Strip any whitespace and convert to unicode
                # NOTE: you need to pass the right encoding here!
                value = unicode(value.strip(), "utf-8")

                # Put the value in the dictionary
                doc[fieldname] = value

            # Pass the dictionary to the add_document method
            writer.add_document(**doc)
    writer.commit()

I get this error and I don't know why. The error:

Traceback (most recent call last): 
    File "C:\Python27\yarab.py", line 39, in <module> 
    fieldname = columns[colnum] 
IndexError: list index out of range 

And my CSV file:

1 3 1 Al+ POS:ADJ LEM:r~aHoma`n ROOT:rHm MS GEN 
1 3 2 Al+ POS:ADJ LEM:r~aHiym ROOT:rHm MS GEN 
1 4 1 POS:N ACT PCPL LEM:ma`lik ROOT:mlk M GEN 
1 4 2 POS:N LEM:yawom ROOT:ywm M GEN 
1 4 3 Al+ POS:N LEM:diyn ROOT:dyn M GEN 
1 5 1 POS:PRON LEM:<iy~aA 2MS 

Answer

0

csv.reader uses the comma (,) as its default delimiter.

You have to specify your delimiter explicitly:

csvreader = csv.reader(csvfile, delimiter=...) 
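
To see what the delimiter changes, here is a minimal sketch that counts the fields csv.reader produces per row. It assumes the sample rows from the question are separated by single spaces, and it is written for Python 3 (the question uses Python 2.7). The helper name field_counts is made up for this illustration.

```python
import csv
import io

# Three sample rows copied from the question.
# Assumption: the fields are separated by single spaces.
SAMPLE = (
    "1 3 1 Al+ POS:ADJ LEM:r~aHoma`n ROOT:rHm MS GEN\n"
    "1 4 1 POS:N ACT PCPL LEM:ma`lik ROOT:mlk M GEN\n"
    "1 5 1 POS:PRON LEM:<iy~aA 2MS\n"
)


def field_counts(text, delimiter):
    """Return how many fields csv.reader yields for each row (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]


comma_counts = field_counts(SAMPLE, ",")  # default delimiter: no commas, one field per row
space_counts = field_counts(SAMPLE, " ")  # space delimiter: a varying number of fields
print(comma_counts)
print(space_counts)
```

Under that assumption, the space-delimited rows have more than four fields and not all the same number, so looking up columns[colnum] in a four-name list would still run past the end of the list.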

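However, your CSV file is not homogeneous (the rows do not all have the same number of fields), so it would be better to read it without the csv module: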

columns = ["juza", "chapter", "verse", "analysis"]
with codecs.open("yom.csv", "rb", "utf8") as f:
    for line in f:
        a, b, c, rest = line.split(' ', 3)
        doc = {k: v.strip() for k, v in zip(columns, rest.split(':'))}
        # a, b, c are the first three integers
        # doc is a dictionary
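
As a variation on the snippet above (not the answerer's exact code), the four names in columns can be paired directly with the four whitespace-separated parts of a line, so that the whole remainder becomes the analysis value. This is a sketch for Python 3; parse_line is a hypothetical helper name, and it assumes the first three items on every line are the juza, chapter and verse numbers.

```python
columns = ["juza", "chapter", "verse", "analysis"]


def parse_line(line):
    """Map one line onto the four field names (hypothetical helper)."""
    # strip() drops the trailing newline; split(None, 3) splits on runs of
    # whitespace at most three times, so the rest of the line stays in one piece.
    parts = line.strip().split(None, 3)
    return dict(zip(columns, parts))


doc = parse_line("1 4 2 POS:N LEM:yawom ROOT:ywm M GEN\n")
print(doc)
```

The values are left as strings here. Whether the numeric fields need converting with int() before they are handed to the index is left open.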
+0

Do you mean I should remove "csvreader" and replace it with your recommended code? If I do that, how do I get the field names in the following line: "fieldname = columns[colnum]"? – user2091683 2013-02-20 15:26:01

+0

@user2091683 - You don't need `colnum` any more. `zip(columns, rest.split(':'))` pairs them up, and the `doc` dictionary contains the whole entry. – eumiro 2013-02-20 15:29:01

+0

This is my new code, please open this link: (http://pastebin.com/qWPZsiyd) I get this error: Traceback (most recent call last): File "C:\Python27\yarab.py", line 26, in <module> juza,chapter,verse,analysis = line.split(' ',3) ValueError: need more than 1 value to unpack – user2091683 2013-02-20 15:42:51
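
The "need more than 1 value to unpack" message means that split returned a single item for some line. That would happen with an empty line, or with a line that contains no single-space separators (for instance, tab-separated values). The post does not show which of these applies. A minimal defensive sketch for Python 3, using the hypothetical helper split_record, splits on any whitespace and skips lines that do not yield four parts:

```python
def split_record(line):
    """Return (juza, chapter, verse, analysis), or None for a malformed line.

    Hypothetical helper; assumes the first three items are integers.
    """
    parts = line.split(None, 3)  # None: split on any run of spaces or tabs
    if len(parts) < 4:
        return None  # blank or incomplete line: nothing to unpack
    juza, chapter, verse, analysis = parts
    return int(juza), int(chapter), int(verse), analysis.strip()


lines = [
    "1 4 3 Al+ POS:N LEM:diyn ROOT:dyn M GEN\n",
    "\n",                                   # blank line, skipped
    "1\t5\t1\tPOS:PRON LEM:<iy~aA 2MS\n",   # tab-separated variant
]
records = [r for r in map(split_record, lines) if r is not None]
print(records)
```

Skipping malformed lines silently is a design choice for this sketch. Logging or counting them may be preferable when the input is not trusted.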