Pyparsing: CSV string with random quotes

I have a string like the following:

<118>date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from="[email protected]",mailer="mta",client_name="example.org,[194.177.17.24]",resolved=OK,to="[email protected]",direction="in",message_length=6832079,virus="",disposition="Accept",classifier="Not,Spam",subject="=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="

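I tried the csv module, but it doesn't fit, because I haven't found a way to make it ignore commas inside the quoted values. Pyparsing looks like a better answer, but I haven't found a way to declare the whole grammar.

For now I'm parsing it with my old Perl script, but I'd like to write this in Python. If you need my Perl snippet, I'll be glad to provide it.

Any help is appreciated.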

Source

2010-05-09 gtfx

I'm not sure what you're really looking for, but

import re 
data = "date=2010-05-09,time=16:41:27,device_id=FE-2KA3F09000049,log_id=0400147717,log_part=00,type=statistics,subtype=n/a,pri=information,session_id=o49CedRc021772,from=\"[email protected]\",mailer=\"mta\",client_name=\"example.org,[194.177.17.24]\",resolved=OK,to=\"[email protected]\",direction=\"in\",message_length=6832079,virus=\"\",disposition=\"Accept\",classifier=\"Not,Spam\",subject=\"=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?=\"" 
pattern = r"""(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)""" 
print(re.findall(pattern, data))

gives you

[('date', '2010-05-09'), ('time', '16:41:27'), ('device_id', 'FE-2KA3F09000049'), 
('log_id', '0400147717'), ('log_part', '00'), ('type', 'statistics'), 
('subtype', 'n/a'), ('pri', 'information'), ('session_id', 'o49CedRc021772'), 
('from', '"[email protected]"'), ('mailer', '"mta"'), 
('client_name', '"example.org,[194.177.17.24]"'), ('resolved', 'OK'), 
('to', '"[email protected]"'), ('direction', '"in"'), 
('message_length', '6832079'), ('virus', '""'), ('disposition', '"Accept"'), 
('classifier', '"Not,Spam"'), 
('subject', '"=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?="') 
]

You may want to clean up the quoted strings afterwards (using mystring.strip("'\"")).
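A minimal sketch of that cleanup step, assuming a shortened sample of the log line; the names `sample` and `fields` are only for illustration. It turns the list of pairs into a dict and strips the surrounding quotes from each value:

```python
import re

# Shortened sample of the log line (an assumption for illustration).
sample = 'type=statistics,mailer="mta",client_name="example.org,[194.177.17.24]",virus="",classifier="Not,Spam"'

pattern = r"""(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)"""

# Build a dict and strip leading/trailing quote characters from each value.
fields = {key: value.strip("'\"") for key, value in re.findall(pattern, sample)}
print(fields)
```

Note that strip() removes every leading and trailing quote character, so a value that ends in an escaped quote would lose that quote too. If that matters, slicing off exactly one character at each end is probably safer.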

Edit: This regex now also correctly handles escaped quotes inside quoted strings (a="She said \"Hi!\"").

Explanation of the regex:

(\w+)=((?:"(?:\\.|[^\\"])*"|'(?:\\.|[^\\'])*'|[^\\,"'])+)

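(\w+): Match the identifier and capture it into backreference no. 1.

=: Match a =.

(: Capture the following into backreference no. 2:

(?:: One of the following:

"(?:\\.|[^\\"])*": A double quote, followed by zero or more of the following: an escaped character or a non-quote/non-backslash character, followed by another double quote.

|: or

'(?:\\.|[^\\'])*': See above, just for single quotes.

|: or

[^\\,"']: One character that is neither a backslash, a comma, nor a quote.

)+: Repeat at least once, as many times as possible.

): End of capturing group no. 2.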
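The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. This is just a sketch of the pattern above with inline comments; the `sample` string is a made-up input that exercises the escaped-quote case:

```python
import re

# Same pattern as above, written with re.VERBOSE so each part carries its comment.
pattern = re.compile(r"""
    (\w+)                     # group 1: the identifier
    =                         # a literal equals sign
    (                         # group 2: the value
      (?:
          "(?:\\.|[^\\"])*"   # double-quoted string, may contain escaped characters
        | '(?:\\.|[^\\'])*'   # single-quoted string, same rules
        | [^\\,"']            # one character that is not a backslash, comma, or quote
      )+                      # one or more of the above, as many as possible
    )
""", re.VERBOSE)

# Hypothetical input with escaped quotes inside a quoted value.
sample = r'a="She said \"Hi!\"",b=2'
pairs = pattern.findall(sample)
print(pairs)
```

The escaped quotes stay inside the first value, and the comma after the closing quote ends it.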

Source

2010-05-09 13:55:12

Thanks, this did what I needed. – gtfx 2010-05-09 14:42:04

This is how you do regex! :) – jathanism 2010-05-14 06:58:07

It is probably better to use an existing parser than an ad-hoc regex.

parse_http_list(s) 
    Parse lists as described by RFC 2068 Section 2. 

    In particular, parse comma-separated lists where the elements of 
    the list may include quoted-strings. A quoted-string could 
    contain a comma. A non-quoted string could have quotes in the 
    middle. Neither commas nor quotes count if they are escaped. 
    Only double-quotes count, not single-quotes. 

parse_keqv_list(l) 
    Parse list of key=value strings where keys are not duplicated.

Example:

>>> pprint.pprint(urllib2.parse_keqv_list(urllib2.parse_http_list(s))) 
{'<118>date': '2010-05-09', 
'classifier': 'Not,Spam', 
'client_name': 'example.org,[194.177.17.24]', 
'device_id': 'FE-2KA3F09000049', 
'direction': 'in', 
'disposition': 'Accept', 
'from': '[email protected]', 
'log_id': '0400147717', 
'log_part': '00', 
'mailer': 'mta', 
'message_length': '6832079', 
'pri': 'information', 
'resolved': 'OK', 
'session_id': 'o49CedRc021772', 
'subject': '=?windows-1255?B?Rlc6IEZ3OiDg5fDp5fog+fno5fog7Pf46eHp7S3u4+Tp7SE=?=', 
'subtype': 'n/a', 
'time': '16:41:27', 
'to': '[email protected]', 
'type': 'statistics', 
'virus': ''}
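The example above uses Python 2's urllib2. As far as I can tell, both helpers also exist in Python 3 under urllib.request, although they do not appear in the documented API, so treat this as a sketch rather than a guaranteed interface. The input is a shortened sample of the log line (an assumption for illustration):

```python
from urllib.request import parse_http_list, parse_keqv_list

# Shortened sample of the log line (an assumption for illustration).
s = 'type=statistics,mailer="mta",client_name="example.org,[194.177.17.24]",virus="",classifier="Not,Spam"'

# Split on commas that are outside double quotes, then split each key=value item.
items = parse_http_list(s)
fields = parse_keqv_list(items)
print(fields)
```

parse_keqv_list drops the surrounding double quotes, so no separate cleanup pass is needed.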

Source

2010-05-14 06:43:08 jfs

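Credit goes to @Piotr Czapla http://stackoverflow.com/questions/1349367/parse-an-http-request-authorization-header-with-python/1349626#1349626 – jfs 2010-05-14 06:53:39

OK, wow, this is actually a gem of a solution. Thanks. – jathanism 2010-05-14 07:00:06

Very nice, especially since it already removes the unnecessary quotes. Well, that's the beauty of Python for you: batteries included. (Although my regex isn't bad either :)) – 2010-05-14 07:44:48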
