We can use the following:
request = Request(url="http://example.com")
request.meta['proxy'] = "host:port"
yield request
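Here "host:port" stands in for a full proxy URL. If the proxy requires authentication, the credentials can be embedded in that URL. A minimal sketch (user, password, proxyhost and the port are all placeholder values) that simply checks the URL parses into the expected pieces:

```python
from urllib.parse import urlsplit

# Hypothetical authenticated proxy; every part of this URL is a placeholder.
proxy = "http://user:password@proxyhost:8080"

parts = urlsplit(proxy)
print(parts.hostname)  # proxyhost
print(parts.port)      # 8080
print(parts.username)  # user
```

A value in this form can be assigned to `request.meta['proxy']` directly.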
A simple implementation looks like this:
import scrapy

class MySpider(scrapy.Spider):
    name = "examplespider"
    allowed_domains = ["somewebsite.com"]
    start_urls = ['http://somewebsite.com/']

    def parse(self, response):
        # Here http://example.com is used. We usually get this URL by parsing the desired webpage
        request = scrapy.Request(url='http://example.com', callback=self.parse_url)
        request.meta['proxy'] = "host:port"
        yield request

    def parse_url(self, response):
        # Do the rest of the parsing work
        pass
To use a proxy for the initial requests, add the following field to the spider class:
class MySpider(scrapy.Spider):
    name = "examplespider"
    allowed_domains = ["somewebsite.com"]
    start_urls = ['http://somewebsite.com/']
    custom_settings = {
        'HTTPPROXY_ENABLED': True
    }
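If the flag should apply to every spider in the project rather than a single class, the same setting can live in the project's settings.py instead of custom_settings:

```python
# settings.py -- project-wide alternative to the per-spider custom_settings dict
HTTPPROXY_ENABLED = True
```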
Then use a start_requests() method like the following:
def start_requests(self):
    urls = ['http://example.com']
    for url in urls:
        proxy = 'some proxy'
        yield scrapy.Request(url=url, callback=self.parse, meta={'proxy': proxy})

def parse(self, response):
    item = StatusCheckerItem()  # a custom Item class defined elsewhere in the project
    item['url'] = response.url
    return item
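If every request should go through the same proxy, the per-request meta assignment can also be centralized in a small downloader middleware, so no environment variable is needed. A sketch, assuming a placeholder proxy address; the class path used to enable it is likewise an assumption:

```python
class ProxyMiddleware:
    """Sketch of a downloader middleware that routes all requests through one proxy."""

    PROXY = "http://host:port"  # placeholder proxy address

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy'],
        # so setting it here (without overwriting an explicit per-request value)
        # sends the request through the proxy.
        request.meta.setdefault('proxy', self.PROXY)
```

Enable it in the settings with a priority below HttpProxyMiddleware's default of 750, e.g. DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.ProxyMiddleware': 350} (the module path is a placeholder for your project's layout).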
I can't set environment variables; that would affect other services and jobs. Can I put it in the scrapy script instead? – ZivHus
See the second answer in the link above. – Nabin
Where can I set request.meta? – ZivHus