2017-06-05

Scrapy: merging output in a field

I have Scrapy output that looks like this:

[{'gender': 'women', 
    'name': 'NEW IN: CLOTHING', 
    'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 
       'price': {'currency': 'GBP', 
          'outlet': '40.0', 
          'retail': '58.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: CLOTHING', 
    'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 
       'price': {'currency': 'GBP', 
          'outlet': '40.0', 
          'retail': '58.0'}}, 
       {'name': 'N12H Joshua Tree Dress', 
       'price': {'currency': 'GBP', 
          'outlet': '140.0', 
          'retail': '249.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: CLOTHING', 
    'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 
       'price': {'currency': 'GBP', 
          'outlet': '40.0', 
          'retail': '58.0'}}, 
       {'name': 'N12H Joshua Tree Dress', 
       'price': {'currency': 'GBP', 
          'outlet': '140.0', 
          'retail': '249.0'}}, 
       {'name': 'Twiin Method Rib Mesh Flare Sleeve Top', 
       'price': {'currency': 'GBP', 
          'outlet': '22.0', 
          'retail': '32.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: CLOTHING', 
    'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 
       'price': {'currency': 'GBP', 
          'outlet': '40.0', 
          'retail': '58.0'}}, 
       {'name': 'N12H Joshua Tree Dress', 
       'price': {'currency': 'GBP', 
          'outlet': '140.0', 
          'retail': '249.0'}}, 
       {'name': 'Twiin Method Rib Mesh Flare Sleeve Top', 
       'price': {'currency': 'GBP', 
          'outlet': '22.0', 
          'retail': '32.0'}}, 
       {'name': 'Twiin End Game Varsity Denim Trucker Jacket', 
       'price': {'currency': 'GBP', 
          'outlet': '45.0', 
          'retail': '80.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: SHOES & ACCESSORIES ', 
    'products': [{'name': 'Melissa Ultragirl Triple Bow Ballerina', 
       'price': {'currency': 'GBP', 
          'outlet': '48.0', 
          'retail': '68.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: SHOES & ACCESSORIES ', 
    'products': [{'name': 'Melissa Ultragirl Triple Bow Ballerina', 
       'price': {'currency': 'GBP', 
          'outlet': '48.0', 
          'retail': '68.0'}}, 
       {'name': 'Zaxy Tbar Flip Flops', 
       'price': {'currency': 'GBP', 
          'outlet': '20.0', 
          'retail': '26.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: SHOES & ACCESSORIES ', 
    'products': [{'name': 'Melissa Ultragirl Triple Bow Ballerina', 
       'price': {'currency': 'GBP', 
          'outlet': '48.0', 
          'retail': '68.0'}}, 
       {'name': 'Zaxy Tbar Flip Flops', 
       'price': {'currency': 'GBP', 
          'outlet': '20.0', 
          'retail': '26.0'}}, 
       {'name': 'Estella Bartlet Silver Plated Heart Bracelet Duo Set', 
       'price': {'currency': 'GBP', 
          'outlet': '15.0', 
          'retail': '31.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: SHOES & ACCESSORIES ', 
    'products': [{'name': 'Melissa Ultragirl Triple Bow Ballerina', 
       'price': {'currency': 'GBP', 
          'outlet': '48.0', 
          'retail': '68.0'}}, 
       {'name': 'Zaxy Tbar Flip Flops', 
       'price': {'currency': 'GBP', 
          'outlet': '20.0', 
          'retail': '26.0'}}, 
       {'name': 'Estella Bartlet Silver Plated Heart Bracelet Duo Set', 
       'price': {'currency': 'GBP', 
          'outlet': '15.0', 
          'retail': '31.0'}}, 
       {'name': 'Ashiana Embroidered Large Toiletry Bag With Wateproof ' 
         'Lining', 
       'price': {'currency': 'GBP', 
          'outlet': '25.0', 
          'retail': '35.0'}}]}] 

This happens because I call Loader.load_item() every time a product is processed.
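
To make the cause concrete, here is a minimal sketch in plain Python, not Scrapy itself. FakeLoader is a hypothetical stand-in that only mimics the two calls used in the spider; the real ItemLoader works differently internally. What matters is that one loader is shared by the whole session, so every load_item() call returns the session with everything added so far.

```python
class FakeLoader:
    """Hypothetical stand-in for Scrapy's ItemLoader (illustration only)."""

    def __init__(self):
        self._values = {}

    def add_value(self, field, value):
        # Values accumulate in a list per field.
        self._values.setdefault(field, []).append(value)

    def load_item(self):
        # Return a copy of everything collected so far.
        return {field: list(values) for field, values in self._values.items()}


def process_products(names):
    """Mimic parse_price: add one product, then emit the whole session."""
    loader = FakeLoader()  # one loader shared by every product of the session
    emitted = []
    for name in names:
        loader.add_value("products", name)
        emitted.append(loader.load_item())
    return emitted


snapshots = process_products(["top", "dress", "jacket"])
for snapshot in snapshots:
    print(snapshot)
```

Each printed snapshot has one more product than the previous one, which is the same stacking pattern as in the output above.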

How can I build a pipeline or an output processor so that only the last processed item is returned, like below?

[{'gender': 'women', 
    'name': 'NEW IN: CLOTHING', 
    'products': [{'name': 'Free People Cocoon Multi Way Neck Top', 
       'price': {'currency': 'GBP', 
          'outlet': '40.0', 
          'retail': '58.0'}}, 
       {'name': 'N12H Joshua Tree Dress', 
       'price': {'currency': 'GBP', 
          'outlet': '140.0', 
          'retail': '249.0'}}, 
       {'name': 'Twiin Method Rib Mesh Flare Sleeve Top', 
       'price': {'currency': 'GBP', 
          'outlet': '22.0', 
          'retail': '32.0'}}, 
       {'name': 'Twiin End Game Varsity Denim Trucker Jacket', 
       'price': {'currency': 'GBP', 
          'outlet': '45.0', 
          'retail': '80.0'}}]}, 
{'gender': 'women', 
    'name': 'NEW IN: SHOES & ACCESSORIES ', 
    'products': [{'name': 'Melissa Ultragirl Triple Bow Ballerina', 
       'price': {'currency': 'GBP', 
          'outlet': '48.0', 
          'retail': '68.0'}}, 
       {'name': 'Zaxy Tbar Flip Flops', 
       'price': {'currency': 'GBP', 
          'outlet': '20.0', 
          'retail': '26.0'}}, 
       {'name': 'Estella Bartlet Silver Plated Heart Bracelet Duo Set', 
       'price': {'currency': 'GBP', 
          'outlet': '15.0', 
          'retail': '31.0'}}, 
       {'name': 'Ashiana Embroidered Large Toiletry Bag With Wateproof ' 
         'Lining', 
       'price': {'currency': 'GBP', 
          'outlet': '25.0', 
          'retail': '35.0'}}]}] 

The last item processed contains all the products of that session. I tried to handle this when the spider closes, but without success.
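
As a rough sketch of what I am after, assuming the stacked list has already been exported: the helper below keeps, for each (gender, name) pair, the snapshot with the most products. It is plain post-processing in Python, not a Scrapy pipeline, and keep_final_snapshots and the sample data are made-up names for illustration.

```python
def keep_final_snapshots(items):
    """Keep the fullest snapshot for each (gender, name) session."""
    latest = {}
    for item in items:
        key = (item["gender"], item["name"])
        kept = latest.get(key)
        # A later snapshot with at least as many products replaces the kept one.
        if kept is None or len(item["products"]) >= len(kept["products"]):
            latest[key] = item
    # Dicts preserve insertion order, so sessions keep their first-seen order.
    return list(latest.values())


# Made-up sample data with the same shape as the output above.
exported = [
    {"gender": "women", "name": "A", "products": ["p1"]},
    {"gender": "women", "name": "A", "products": ["p1", "p2"]},
    {"gender": "women", "name": "B", "products": ["q1"]},
    {"gender": "women", "name": "B", "products": ["q1", "q2"]},
]
final = keep_final_snapshots(exported)
print(final)
```

This cleans up after the fact; it does not stop the spider from yielding the intermediate items in the first place.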

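I am close to finishing this project. I have researched a lot, tried many things, and read many questions, but none of them deal with items stacking up inside a field.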

My items code:

from scrapy.item import Item, Field 
from scrapy.loader.processors import TakeFirst, Join, Compose, MapCompose 


class Session(Item): 
    name = Field() 
    gender = Field() 
    products = Field(
        # no idea what to put... tried Join, Compose and MapCompose
    )


class Product(Item): 
    name = Field() 
    price = Field() 


class Price(Item): 
    outlet = Field() 
    retail = Field() 
    currency = Field() 

My spider code:

import json
import re

from scrapy.loader import ItemLoader


def parse(self, response):
    sessions = response.css("article.feature:nth-of-type(-n+2)")
    for session in sessions:
        sessionlink = session.css("a.feature__link::attr(href)").extract_first()

        # One loader per session; it is passed along through request.meta
        lsession = ItemLoader(item=Session(), response=response)
        lsession.add_value("name", session.css("div.feature__title h3::text").extract_first())
        lsession.add_value("gender", re.split("[/]+", response.request.url)[2])

        requestsession = response.follow(sessionlink, callback=self.parse_session)
        requestsession.meta["lsession"] = lsession
        requestsession.meta["pages"] = 1
        yield requestsession

def parse_session(self, response):
    lsession = response.meta["lsession"]
    pages = response.meta["pages"]

    products = response.css("li.product-container:nth-of-type(-n+2)")

    for product in products:
        productlink = product.css("a.product-link::attr(href)").extract_first()
        requestproduct = response.follow(productlink, callback=self.parse_product)
        requestproduct.meta["lsession"] = lsession
        requestproduct.meta["productlink"] = productlink
        yield requestproduct

    nextpage = response.css("ul.pager li.next a::attr(href)").extract_first()
    # Only follow the pager link if one was actually found
    if nextpage and pages < 2:
        pages += 1
        requestnewpage = response.follow(nextpage, callback=self.parse_session)
        requestnewpage.meta["lsession"] = lsession
        requestnewpage.meta["pages"] = pages
        yield requestnewpage

def parse_product(self, response): 
    lsession = response.meta["lsession"] 
    productlink = response.meta["productlink"] 

    lproduct = ItemLoader(item=Product(), response=response) 

    name = response.css("div.product-hero>h1::text").extract_first() 

    lproduct.replace_value("name", str(name)) 

    pricelink = "AN AJAX LINK TO GET THE PRICE" 

    requestprice = response.follow(pricelink, callback=self.parse_price) 
    requestprice.meta["lsession"] = lsession 
    requestprice.meta["lproduct"] = lproduct 

    yield requestprice 

def parse_price(self, response):
    lsession = response.meta["lsession"]
    lproduct = response.meta["lproduct"]

    lprice = ItemLoader(item=Price(), response=response)

    pricejson = json.loads(response.body)
    outletprice = pricejson[0]["productPrice"]["current"]["value"]
    retailprice = pricejson[0]["productPrice"]["rrp"]["value"]
    currency = pricejson[0]["productPrice"]["currency"]

    lprice.replace_value("outlet", str(outletprice))
    lprice.replace_value("retail", str(retailprice))
    lprice.replace_value("currency", str(currency))
    lproduct.replace_value("price", lprice.load_item())
    lsession.add_value("products", dict(lproduct.load_item()))

    yield lsession.load_item()

Answer

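Yatta! Thinking back to my school days, I remembered closures. I didn't know Python supported this kind of functional behavior; I'm a beginner in this language.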

Since I got a lot of help here, I'm posting my solution so that others can use it if they need it.

I made a closure-based counter like this (just a basic one):

def counter():
    value = 0

    def count(op):
        nonlocal value
        if op == "add":
            value += 1
        elif op == "sub":
            value -= 1
        elif op == "get":
            return value

    return count
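
A quick usage sketch, repeating the function so the snippet runs on its own. The variable names pending and other are only for illustration. Each call to counter() creates its own independent value, which is why one counter per session works.

```python
def counter():
    value = 0

    def count(op):
        nonlocal value  # rebind the variable of the enclosing call
        if op == "add":
            value += 1
        elif op == "sub":
            value -= 1
        elif op == "get":
            return value

    return count


pending = counter()  # e.g. the counter for one session
other = counter()    # a second, unrelated counter

pending("add")
pending("add")
pending("sub")

print(pending("get"))  # 1
print(other("get"))    # 0
```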

Then I start a counter for each session:

requestsession = response.follow(sessionlink, callback=self.parse_session) 
requestsession.meta["lsession"] = lsession 
requestsession.meta["pcounter"] = counter() 
requestsession.meta["pages"] = 1 

When processing each product, I count up and keep passing the counter along until the price is processed:

pcounter = response.meta["pcounter"]  # the counter created for this session
for product in products:
    pcounter("add")
    productlink = product.css("a.product-link::attr(href)").extract_first() 
    requestproduct = response.follow(productlink, callback=self.parse_product) 
    requestproduct.meta["lsession"] = lsession 
    requestproduct.meta["pcounter"] = pcounter 
    requestproduct.meta["productlink"] = productlink 
    yield requestproduct 

After parsing the price, I count down, and before loading the "lsession" item loader I check whether all products have been processed:

pcounter = response.meta["pcounter"]
pcounter("sub")

if pcounter("get") == 0: 
    yield lsession.load_item() 
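
One thing to watch for with pagination: if every product of page 1 finishes before page 2 has been parsed, the count can reach zero early and the session gets yielded more than once. A possible guard, sketched below as a plain-Python simulation (not Scrapy), is to count each listing-page request as pending work too. crawl_session and the queue are hypothetical names standing in for the scheduler; this is an assumption about how it could be handled, not part of the snippets above.

```python
from collections import deque


def counter():
    value = 0

    def count(op):
        nonlocal value
        if op == "add":
            value += 1
        elif op == "sub":
            value -= 1
        elif op == "get":
            return value

    return count


def crawl_session(pages):
    """Simulate one session; pages is a list of lists of product names."""
    pending = counter()
    products = []
    emitted = []
    queue = deque()  # stands in for the request scheduler

    pending("add")  # the first listing page counts as pending work
    queue.append(("page", 0))

    while queue:
        kind, payload = queue.popleft()
        if kind == "page":
            for name in pages[payload]:
                pending("add")  # one pending unit per product request
                queue.append(("product", name))
            if payload + 1 < len(pages):
                pending("add")  # the next-page request is pending too
                queue.append(("page", payload + 1))
        else:
            products.append(payload)
        pending("sub")  # this request is done
        if pending("get") == 0:
            emitted.append(list(products))
    return emitted


result = crawl_session([["a", "b"], ["c", "d"]])
print(result)
```

In this simulation both products of the first page finish before the second page is handled, yet the session is emitted only once, with all four products.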

Hope this will be useful to someone.