2014-09-02 74 views

Scrapy date capture regex

I have a working regular expression that parses dates:

(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000)))) 

It parses the following string:

The owners of this address received a permit on Wednesday, July 31, 2014 

The output of the Scrapy item is:

[u'', u'', u'', u'July', u'31', u'2014', u'', u'', u'', u''] 

I would like the Scrapy item to be:

[u'July 31, 2014'] 

Here is my Scrapy code:

date_scrape = response.css('#ctl00_MasterDiv > div.Divwidth100 td.content_panel_middle > div > p:contains("The owners of this address") > b ::text') 

permit_date = date_scrape.re(r'(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))') 

Any ideas on how to solve this?


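Note - I have already tried adding ^ and $ to the expression, but I can't seem to figure it out. I tested several possible uses of ^ and $ in regex101 and they all failed. – dfriestedt 2014-09-02 13:29:00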

Answers

import re 

s = 'The owners of this address received a permit on Wednesday, July 31, 2014' 

# One capturing group around the whole date, so findall returns whole strings.
words = re.findall(r'(\w+ \d+, \d+)', s) 
print(words) 

Result:

['July 31, 2014'] 
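
The ten-element list in the question comes from the ten capturing groups in the original pattern: findall-style extraction returns every group for each match, with empty strings for the groups that did not take part. If the stricter validation of the original pattern is still wanted, one option is to keep it unchanged and read the whole match instead of the groups. The following is a minimal sketch in plain Python, assuming the text has already been extracted from the page; the names `DATE_RE`, `groups`, and `dates` are illustrative only.

```python
import re

# The validating pattern from the question, unchanged, split across lines for readability.
DATE_RE = re.compile(
    r'(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))'
    r'|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))'
    r'|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))'
    r'|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))'
)

text = 'The owners of this address received a permit on Wednesday, July 31, 2014'

# findall() yields one tuple per match holding all ten groups,
# with '' for every group that did not participate.
groups = DATE_RE.findall(text)

# finditer() with group(0) yields the full matched text instead.
dates = [m.group(0) for m in DATE_RE.finditer(text)]
print(dates)
```

In a spider, the same idea would apply to the strings returned by `date_scrape.extract()` rather than to a hard-coded `text`.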

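I definitely wasted a lot of time trying to figure this out. I had seen this solution in other posts but never tried it. I thought it was too "simple". Thanks! – dfriestedt 2014-09-02 13:45:27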


My pleasure! You're welcome. – Kasramvd 2014-09-02 13:46:12


If you don't want to dive into the wonderful world of regular expressions, here is an alternative solution.

Use dateutil.parser.parse() with fuzzy=True. Demo from the Scrapy shell:

$ scrapy shell index.html 
>>> text = response.xpath('//body/b/text()').extract()[0] 
>>> text 
u'The owners of this address received a permit on Wednesday, July 31, 2014' 

>>> from dateutil.parser import parse 
>>> parse(text, fuzzy=True) 
datetime.datetime(2014, 7, 31, 0, 0) 

where index.html contains the HTML test data:

<body> 
    <b>The owners of this address received a permit on Wednesday, July 31, 2014</b> 
</body>
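
If dateutil is not installed, a similar result can be sketched with the standard library alone: a loose regex picks out candidates and `datetime.strptime` rejects anything that is not a real calendar date. This is only a sketch under assumptions: `find_dates` is a hypothetical helper, the date format is assumed to be "Month D, YYYY", and `%B` relies on an English locale for month names.

```python
import re
from datetime import datetime

text = 'The owners of this address received a permit on Wednesday, July 31, 2014'

def find_dates(s):
    """Return a datetime for every 'Month D, YYYY' candidate that is a real date."""
    found = []
    # No capturing groups here, so findall returns the whole matched strings.
    for candidate in re.findall(r'[A-Za-z]+ \d{1,2}, \d{4}', s):
        try:
            found.append(datetime.strptime(candidate, '%B %d, %Y'))
        except ValueError:
            # Not a valid month name or calendar date (e.g. 'June 31, 2014'); skip it.
            pass
    return found

dates = find_dates(text)
print(dates)
```

Unlike the fuzzy parser, this only recognizes the single format given to `strptime`, which may be acceptable when the page always uses the same wording.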