Python: excluding a string with a regular expression

I am trying to build a web scraper to get the discounted prices from http://fetch.co.uk/dogs/dog-food?per-page=20

I have the following code:

import re 
from urllib.request import urlopen 
from bs4 import BeautifulSoup 

html = urlopen("http://fetch.co.uk/dogs/dog-food?per-page=20")
bsObj = BeautifulSoup(html,"html.parser") 

wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")}) 
for wrap in wrapList: 
    print(wrap.find("",{"itemprop": re.compile("shelf-product__price.*(?!cut).*")}).get_text()) 
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())

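Each wrap sometimes contains 2 different prices. I am trying to exclude the cut price and get the price below it (the promo price).

I can't figure out how to exclude the cut price; the expression above does not work.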

"shelf-product__price shelf-product__price--cut [ v2 ]" 
"shelf-product__price shelf-product__price--promo [ v2 ]"

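I used the workaround below, but I would like to understand what I got wrong in the regular expression. Sorry if the code isn't pretty, I'm still learning.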

import re 
from urllib.request import urlopen 
from bs4 import BeautifulSoup 

html = urlopen("http://fetch.co.uk/dogs/dog-food?per-page=20")
bsObj = BeautifulSoup(html,"html.parser") 

wrapList = bsObj.findAll("",{"class": re.compile("shelf-product__self.*")}) 
for wrap in wrapList: 
    print(wrap.find("",{"itemprop": re.compile("price.*")}).get_text()) 
    print(wrap.find("",{"class": re.compile("shelf-product__title.*")}).get_text())

2016-01-24 Elena ZdeG

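The URL mentioned does not seem to have any element with 'itemprop="shelf-product__price shelf-product__price--cut [ v2 ]"'. The values used for 'itemprop' are either 'title' or 'price'. That is why the second regex, "price.*", works. – mchackam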

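@mchackam: It is indeed the 'class' attribute and not the 'itemprop' attribute, but that is not the only problem. When an attribute has several space-separated values, the condition is tested on each value separately until one succeeds *(not on the whole attribute)*. In any case the regex is wrong, and a regex is not a good approach here; it is easier to use a function as the condition. Compiling the patterns inside the loop also slows the code down.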
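To illustrate the comment about space-separated values, here is a minimal sketch in plain Python. The helper matches_any_value is a hypothetical, simplified model of per-value testing (an assumption for illustration, not BeautifulSoup's actual implementation), and the pattern named whole is likewise an illustrative assumption that does not appear in the thread.

```python
import re

# Illustrative pattern (an assumption): on a whole attribute string it
# rejects anything containing "--cut" before looking for the base class.
whole = re.compile(r"^(?!.*--cut).*shelf-product__price")

cut_attr = "shelf-product__price shelf-product__price--cut [ v2 ]"
promo_attr = "shelf-product__price shelf-product__price--promo [ v2 ]"


def matches_any_value(pattern, attr):
    """Simplified model: test each space-separated value on its own."""
    return any(pattern.search(value) for value in attr.split())


def is_promo_price(attr):
    """Plain boolean logic on the list of values, no regex needed."""
    values = attr.split()
    return ("shelf-product__price" in values
            and "shelf-product__price--cut" not in values)


# On the whole string the pattern rejects the cut price as intended ...
print(whole.search(cut_attr) is None)
# ... but tested value by value, the bare "shelf-product__price" value
# passes on its own, so the cut element would still be accepted.
print(matches_any_value(whole, cut_attr))
# The boolean version separates the two cases.
print(is_promo_price(cut_attr), is_promo_price(promo_attr))
```

Under this model even a lookahead that is correct for the whole string cannot exclude the cut price, which is one reason a function condition may be the simpler choice.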

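There are several problems. The first is that ".*(?!cut).*" is equivalent to ".*". The first ".*" is greedy and consumes all the remaining characters. The "(?!cut)" check then passes, of course, because it is evaluated at the end of the string, where nothing follows. Finally, the last ".*" consumes 0 characters. So the pattern always matches, and this regex would give you false positives. The only reason it returns nothing at all is that you are searching in itemprop while the text you are looking for is in class.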
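A quick way to check this claim, sketched with the standard re module and the two class strings quoted in the question (the variable names are only for illustration):

```python
import re

original = re.compile(r"shelf-product__price.*(?!cut).*")
simplified = re.compile(r"shelf-product__price.*")

cut = "shelf-product__price shelf-product__price--cut [ v2 ]"
promo = "shelf-product__price shelf-product__price--promo [ v2 ]"

for text in (cut, promo):
    m_original = original.search(text)
    m_simplified = simplified.search(text)
    # Both patterns match, and both matches cover the entire string,
    # so the lookahead has no effect on the result.
    print(m_original.span(), m_simplified.span(), len(text))
```

Both patterns match both strings over the same span, so the negative lookahead never gets a chance to reject the cut price.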

Your workaround looks fine to me. But if you want to search on the class, this is how I would do it.

import re 
from urllib.request import urlopen 
from bs4 import BeautifulSoup 

html = urlopen('http://fetch.co.uk/dogs/dog-food?per-page=20') 
bsObj = BeautifulSoup(html,"html.parser") 

wrapList = bsObj.findAll("",{"class": "shelf-product__self"}) 

def is_price(tag):
    # Keep tags that carry the base price class but not the "cut" modifier.
    return (tag.has_attr('class') and
            'shelf-product__price' in tag['class'] and
            'shelf-product__price--cut' not in tag['class'])

for wrap in wrapList: 
    print(wrap.find(is_price).text) 
    print(wrap.find("",{"class": "shelf-product__title"}).get_text())

Regular expressions are great, but I think boolean logic is easier to do with booleans.

2016-01-24 15:40:04

You can also avoid the first regex.

Sure, edited for consistency.

Why use such complicated code? You can try the following: span[itemprop=price] means "select every span whose itemprop attribute is price".

import re 
from urllib.request import urlopen 
from bs4 import BeautifulSoup 

#get possible list of urls 
urls = ['http://fetch.co.uk/dogs/dog-food?per-page=%s'%n for n in range(1,100)] 

for url in urls: 
    html = urlopen(url) 
    bsObj = BeautifulSoup(html,"html.parser") 
    # Print the text of every span whose itemprop attribute is "price".
    for y in [i.text for i in bsObj.select("span[itemprop=price]")]:
        print(y)

2016-01-24 16:30:25 SIslam

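Using select seems reasonable, but the code has some problems. It uses Python 2, whereas the question uses Python 3. It tries different per-page values, and I don't know why (that is not a page number). 'respons.content' should be 'html'. The ''t' in ..]' does nothing. The prices should also be associated with the product names. That last point may prevent you from using select.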

OK, edited the typo and the copy-paste error. – SIslam
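On the last point in the comment above, one possible way to keep each price tied to its product name while still using select is to run the selector inside each product wrapper. This is only a sketch: SAMPLE_HTML is made-up markup that reuses the class names quoted in the question (the real page may differ), and the ":not(...)" selector assumes a Beautiful Soup version whose select() supports it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names in the question.
SAMPLE_HTML = """
<div class="shelf-product__self">
  <span class="shelf-product__title">Food A</span>
  <span class="shelf-product__price shelf-product__price--cut [ v2 ]">10.00</span>
  <span class="shelf-product__price shelf-product__price--promo [ v2 ]">8.00</span>
</div>
<div class="shelf-product__self">
  <span class="shelf-product__title">Food B</span>
  <span class="shelf-product__price [ v2 ]">5.00</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

results = []
for wrap in soup.select(".shelf-product__self"):
    title = wrap.select_one(".shelf-product__title")
    # First price element in this wrapper that is not the cut price.
    price = wrap.select_one(
        ".shelf-product__price:not(.shelf-product__price--cut)"
    )
    if title is not None and price is not None:
        results.append((title.get_text(strip=True), price.get_text(strip=True)))

for name, amount in results:
    print(name, amount)
```

Searching within each wrapper means a product with no matching price is simply skipped instead of shifting the prices of the products after it.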