2012-03-04

Forgive me, I'm a total programming noob. Splitting a variable in a Scrapy spider

I'm trying to extract a record ID from a URL in the code below, and I'm having trouble. It seems to work fine in the shell (no errors), but when I run it through the Scrapy framework it produces an error.

Example:
If the URL is http://domain.com/path/to/record_id=1599
then record_link = /path/to/record_id=1599
and so record_id should = 1599
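A quick sketch of the split I have in mind, run on a plain string (the URL value is made up for illustration):

```python
# a record link as it would appear in the page's href attribute
record_link = "/path/to/record_id=1599"

# split on '=' and take everything after it
record_id = record_link.strip().split('=')[1]
print(record_id)  # 1599
```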

for site in sites:
    record_link = site.select('div[@class="description"]/h4/a/@href').extract()
    record_id = record_link.strip().split('=')[1]

    item['link'] = record_link
    item['id'] = record_id
    items.append(item)

Any help is greatly appreciated.

Edit:

The Scrapy error looks like this (apologies for the long paste):

[email protected]:/home/user/spiderdir/spiderdir/spiders# sudo scrapy crawl spider 
    2012-02-23 09:47:16+1100 [scrapy] INFO: Scrapy 0.13.0.2839 started (bot: spider) 
    2012-02-23 09:47:16+1100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState 
    2012-02-23 09:47:16+1100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats 
    2012-02-23 09:47:16+1100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
    2012-02-23 09:47:16+1100 [scrapy] DEBUG: Enabled item pipelines: 
    2012-02-23 09:47:16+1100 [spider] INFO: Spider opened 
    2012-02-23 09:47:16+1100 [spider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
    2012-02-23 09:47:16+1100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6031 
    2012-02-23 09:47:16+1100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6088 
    2012-02-23 09:47:19+1100 [spider] DEBUG: Crawled (200) <GET http://www.domain.com/path/to/> (referer: None) 
    2012-02-23 09:47:21+1100 [spider] DEBUG: Crawled (200) <GET http://www.domain.com/path/to/record_id=2> (referer: http://www.domain.com/path/to/) 
    2012-02-23 09:47:21+1100 [spider] ERROR: Spider error processing <GET http://www.domain.com/path/to/record_id=2> 
    Traceback (most recent call last): 
     File "/usr/lib/python2.6/dist-packages/twisted/internet/base.py", line 778, in runUntilCurrent 
     call.func(*call.args, **call.kw) 
     File "/usr/lib/python2.6/dist-packages/twisted/internet/task.py", line 577, in _tick 
     taskObj._oneWorkUnit() 
     File "/usr/lib/python2.6/dist-packages/twisted/internet/task.py", line 458, in _oneWorkUnit 
     result = self._iterator.next() 
     File "/usr/lib/pymodules/python2.6/scrapy/utils/defer.py", line 57, in <genexpr> 
     work = (callable(elem, *args, **named) for elem in iterable) 
    --- <exception caught here> --- 
     File "/usr/lib/pymodules/python2.6/scrapy/utils/defer.py", line 96, in iter_errback 
     yield it.next() 
     File "/usr/lib/pymodules/python2.6/scrapy/contrib/spidermiddleware/offsite.py", line 24, in process_spider_output 
     for x in result: 
     File "/usr/lib/pymodules/python2.6/scrapy/contrib/spidermiddleware/referer.py", line 14, in <genexpr> 
     return (_set_referer(r) for r in result or()) 
     File "/usr/lib/pymodules/python2.6/scrapy/contrib/spidermiddleware/urllength.py", line 32, in <genexpr> 
     return (r for r in result or() if _filter(r)) 
     File "/usr/lib/pymodules/python2.6/scrapy/contrib/spidermiddleware/depth.py", line 56, in <genexpr> 
     return (r for r in result or() if _filter(r)) 
     File "/usr/lib/pymodules/python2.6/scrapy/contrib/spiders/crawl.py", line 66, in _parse_response 
     cb_res = callback(response, **cb_kwargs) or() 
     File "/home/nick/googledir/googledir/spiders/google_directory.py", line 36, in parse_main 
     record_id = record_link.split("=")[1] 
    exceptions.AttributeError: 'list' object has no attribute 'split' 


You should post your error too – goh 2012-03-04 11:12:21

Answers

I think what I was after was this:

for site in sites:
    record_link = site.select('div[@class="description"]/h4/a/@href').extract()
    record_id = [i.split('=')[1] for i in record_link]

    item['link'] = record_link
    item['id'] = record_id
    items.append(item)

Seeing as you haven't posted your error, I'm guessing you'll have to change this line:

record_id = record_link.strip().split('=')[1]

to:

record_id = record_link[0].strip().split('=')[1]

This is because HtmlXPathSelector always returns a list of the selected items.
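The difference is easy to reproduce without Scrapy at all; `.extract()` hands back a list, and `split()` only exists on the strings inside it (the value below is made up to mirror the traceback):

```python
# what the selector's .extract() actually returns: a list of matched strings
record_link = ["/path/to/record_id=2"]

# calling record_link.split('=') here raises:
#   AttributeError: 'list' object has no attribute 'split'

# index into the list first, then split the string
record_id = record_link[0].strip().split('=')[1]
print(record_id)  # 2
```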

Could it be that the string inside record_link needs to be wrapped in quotes? i.e. '/path/to' instead of just /path/to? – skittles 2012-03-10 12:17:21

And if the above comment is correct, how do I go about quoting the scraped data? – skittles 2012-03-10 13:00:20

The error shows you passed an empty string as the URL for the spider. What do you mean by adding quotes? – 2012-03-12 08:34:17