2013-03-27 33 views

Simple regular expression

I have a string that looks like this:

rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036 

What I want to do is:

extract timestamp: 1340496000
        event:     EP002960010145

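The issue is that tmsid is followed by %3D, and I don't even know what that is. Sometimes it is %3D%6D, and I think it can even be %16D, but I'm not sure.

Is there a robust way to extract these two fields from the string above?

Thanks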

Answer

You are looking at URL-quoted (percent-encoded) data:

>>> from urllib2 import unquote 
>>> unquote('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036') 
'rand_id:?tmsid=1340496000_EP002960010145_11_0_10050_1_2_10036' 
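
The session above is Python 2. On Python 3 the same function lives in `urllib.parse`. The sketch below decodes the sample string and a few individual escapes, to show that `%3D` is simply an encoded `=` and that `%6D` would decode to the letter `m`. The variable names `raw` and `decoded` are only illustrative.

```python
from urllib.parse import unquote

raw = 'rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036'

# Each %XX escape is one byte written as two hexadecimal digits.
colon = unquote('%3A')     # ':'
question = unquote('%3F')  # '?'
equals = unquote('%3D')    # '='
letter_m = unquote('%6D')  # 'm'

# Decoding the whole string turns the escapes back into plain characters.
decoded = unquote(raw)
print(decoded)
```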

You could then split on the first = and split the remainder on _:

>>> unquoted = unquote('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036') 
>>> unquoted.split('=', 1)[1].split('_') 
['1340496000', 'EP002960010145', '11', '0', '10050', '1', '2', '10036'] 
>>> timestamp, event = unquoted.split('=', 1)[1].split('_')[:2] 
>>> timestamp, event 
('1340496000', 'EP002960010145') 
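
Since the question asks for a regular expression, here is a minimal Python 3 sketch of that alternative, applied after unquoting. The pattern and the helper name `extract_timestamp_event` are assumptions for illustration: it assumes the timestamp is all digits and that the event is the run of characters up to the next `_` or `&`.

```python
import re
from urllib.parse import unquote

# Assumed layout: 'tmsid=' + digits + '_' + event id (no '_' or '&' inside it).
TMSID_RE = re.compile(r'tmsid=(\d+)_([^_&]+)')

def extract_timestamp_event(raw):
    """Return (timestamp, event) from a percent-encoded string, or None."""
    match = TMSID_RE.search(unquote(raw))
    if match is None:
        return None
    return match.group(1), match.group(2)

raw = 'rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036'
result = extract_timestamp_event(raw)
print(result)
```

Unlike the plain split, this returns `None` instead of raising when the input does not contain a `tmsid` value in the expected shape.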

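Alternatively, if the data can contain multiple fields (in which case you would also find & in there), you are better off parsing everything after the question mark as a URL query string, using urlparse.parse_qs():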

>>> from urlparse import parse_qs 
>>> parse_qs(unquoted.split('?', 1)[1]) 
{'tmsid': ['1340496000_EP002960010145_11_0_10050_1_2_10036']} 
>>> parsed = parse_qs(unquoted.split('?', 1)[1]) 
>>> timestamp, event = parsed['tmsid'][0].split('_', 2)[:2] 
>>> timestamp, event 
('1340496000', 'EP002960010145')
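
On Python 3, `parse_qs` is found in `urllib.parse`. The following sketch wraps the same steps in a function that also copes with extra fields and with a missing `tmsid`. The function name `parse_tmsid` and the second sample string containing a `foo` field are hypothetical, added only to exercise the multi-field case.

```python
from urllib.parse import parse_qs, unquote

def parse_tmsid(raw):
    """Return (timestamp, event) from the tmsid field, or None if unavailable."""
    decoded = unquote(raw)
    # Treat everything after the first '?' as the query string.
    _, _, query = decoded.partition('?')
    values = parse_qs(query).get('tmsid')
    if not values:
        return None
    parts = values[0].split('_')
    if len(parts) < 2:
        return None
    return parts[0], parts[1]

single = parse_tmsid(
    'rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036')
# Hypothetical input with an additional field separated by an encoded '&'.
multi = parse_tmsid(
    'rand_id%3A%3Ffoo%3Dbar%26tmsid%3D1340496000_EP002960010145_11')
print(single, multi)
```

Because the string is unquoted before it is parsed, an encoded `&` or `=` inside a value would be read as a separator. That matches the approach above, but it is worth keeping in mind if values can contain those characters.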