2014-09-13 323 views

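read_csv using uncommon delimiters

I have a CSV file that uses þ as the quote character and the paragraph symbol as the field separator in place of the comma.

Subclassing csv.Dialect does not work, and pandas does not interpret the numeric character code as a string.

Any ideas?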

# This works when the delimiters are more standard (; ") 
# But really trying to make it work with the ASCII chars commented out below 

import csv 

f = open('./data/Test_Quote_SemiColon.dat') 

class my_dialect(csv.Dialect): 
    lineterminator = '\n' 
    delimiter = ';' # ASCII: 020 
    quotechar = '"' # ASCII: 254 

reader = csv.reader(f, dialect=my_dialect, quoting=1) 

for line in reader: 
    print line 

Here is the data (quote and semicolon version):

"BEGID";"ENDID";"Name";"To";"From";"CC";"BCC"
"ABC_001";"ABC_004";"Smith, John";"Doe, John";"Roe, Jane";"";""
"ABC_005";"ABC_007";"Smith, John";"Doe, John";"";"";""
"ABC_008";"ABC_012";"Doe, John";"Doe, John";"Smith, John";"";""


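Can you give a small example of your data (paste part of the CSV file, or something that looks like it and reproduces the problem), as well as the code you use to read it with pandas? – joris 2014-09-13 17:58:39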


What encoding does the CSV use? Have you tried changing the encoding? You know the ASCII codes of these symbols, so could you do sep='something' and quote='something'? – Inox 2014-09-13 20:53:18

Answer


I found that the literal text and chr(254) work for parsing this. Does this look right?

>>> import csv 
>>> import StringIO 
>>> txt = '''þBEGIDþþENDIDþþNameþþToþþFromþþCCþþBCCþ þABC_001þþaBC_004þþSmith, JohnþþDoe, JohnþRoe, Janeþþþþþ þABC_005þþaBC_007þþSmith, JohnþþDoe, Johnþþþþþþ þABC_008þþaBC_012þþDoe, JohnþþDoe, JohnþSmith, Johnþþþþþ''' 
>>> reader = csv.reader(StringIO.StringIO(txt), delimiter=',', quotechar=chr(254)) 
>>> for line in reader: 
...  for entry in line: 
...   print unicode(entry, 'utf8') 
... 
þBEGIDþþENDIDþþNameþþToþþFromþþCCþþBCCþ þABC_001þþaBC_004þþSmith 
JohnþþDoe 
JohnþRoe 
Janeþþþþþ þABC_005þþaBC_007þþSmith 
JohnþþDoe 
Johnþþþþþþ þABC_008þþaBC_012þþDoe 
JohnþþDoe 
JohnþSmith 
Johnþþþþþ 

txt echoes as:

>>> txt 
'\xc3\xbeBEGID\xc3\xbe\xc3\xbeENDID\xc3\xbe\xc3\xbeName\xc3\xbe\xc3\xbeTo\xc3\xbe\xc3\xbeFrom\xc3\xbe\xc3\xbeCC\xc3\xbe\xc3\xbeBCC\xc3\xbe \xc3\xbeABC_001\xc3\xbe\xc3\xbeaBC_004\xc3\xbe\xc3\xbeSmith, John\xc3\xbe\xc3\xbeDoe, John\xc3\xbeRoe, Jane\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe \xc3\xbeABC_005\xc3\xbe\xc3\xbeaBC_007\xc3\xbe\xc3\xbeSmith, John\xc3\xbe\xc3\xbeDoe, John\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe \xc3\xbeABC_008\xc3\xbe\xc3\xbeaBC_012\xc3\xbe\xc3\xbeDoe, John\xc3\xbe\xc3\xbeDoe, John\xc3\xbeSmith, John\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe\xc3\xbe' 
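
The echo above shows each þ stored as the two bytes \xc3\xbe, which is its UTF-8 form, while a quote character of value 254 is the single byte \xfe. A minimal sketch of that difference follows. It assumes Python 3 (the session above is Python 2) and uses a short made-up byte string rather than the real file:

```python
# Python 3 sketch: how the thorn character looks in two encodings.
thorn = "\u00fe"  # þ, code point 254

utf8_bytes = thorn.encode("utf-8")      # two bytes
latin1_bytes = thorn.encode("latin-1")  # one byte with value 254

print(utf8_bytes)    # b'\xc3\xbe'
print(latin1_bytes)  # b'\xfe'

# Illustrative bytes only: a UTF-8 encoded field wrapped in thorns.
raw = b"\xc3\xbeBEGID\xc3\xbe"

# After decoding, each thorn is a single character again, so it can be
# compared with (or used as) a one-character quotechar.
decoded = raw.decode("utf-8")
print(decoded == thorn + "BEGID" + thorn)  # True
```

If the data is UTF-8, decoding it to text before handing it to the parser is one way to make a one-character quotechar line up with what is actually in the file.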

FYI, I am using iPython Notebook 2.2 with Python 2.7.6 and I see an error with StringIO. What is the import? – CAtoDC 2014-09-14 13:08:28


Close, but not quite. I think it needs a lineterminator value. It should look like this (without the single quotes): ['BEGID','ENDID','Name','To','From','CC','BCC'] ['ABC_001','ABC_004','Smith, John','Doe, John','Roe, Jane','',''] ['ABC_005','ABC_007','Smith, John','Doe, John','','',''] ['ABC_008','ABC_012','Doe, John','Doe, John','Smith, John','',''] – CAtoDC 2014-09-15 02:16:25
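
For reference, here is a hedged sketch of a dialect that yields rows in the shape described in the comment above. It assumes Python 3, assumes the separator is the control character with decimal code 20 (from the "ASCII: 020" note in the question), and builds its own sample text because the real file and its encoding are not shown. The name ThornDialect is made up for illustration. Note that csv.reader recognises '\r' or '\n' as end of line on its own and ignores lineterminator, so that setting only affects writers:

```python
import csv
import io

DELIM = "\x14"    # assumed separator: control character, decimal code 20
QUOTE = "\u00fe"  # þ, code point 254


class ThornDialect(csv.Dialect):  # hypothetical name
    delimiter = DELIM
    quotechar = QUOTE
    doublequote = True
    skipinitialspace = False
    lineterminator = "\n"  # used when writing; the reader ignores it
    quoting = csv.QUOTE_ALL


expected = [
    ["BEGID", "ENDID", "Name", "To", "From", "CC", "BCC"],
    ["ABC_001", "ABC_004", "Smith, John", "Doe, John", "Roe, Jane", "", ""],
    ["ABC_005", "ABC_007", "Smith, John", "Doe, John", "", "", ""],
    ["ABC_008", "ABC_012", "Doe, John", "Doe, John", "Smith, John", "", ""],
]

# Build sample text in the assumed layout: every field wrapped in thorns,
# fields joined by the separator, one record per line.
sample = "".join(
    DELIM.join(QUOTE + value + QUOTE for value in row) + "\n"
    for row in expected
)

# newline="" leaves line endings untouched for the csv module to handle.
rows = list(csv.reader(io.StringIO(sample, newline=""), dialect=ThornDialect))

for row in rows:
    print(row)
```

With a real file, the same dialect could be passed to csv.reader over a handle opened with newline="" and an explicit encoding argument. Which encoding is right depends on how the file was written, which this thread does not establish.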