Using urllib to count the number of images on a web page

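For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regular expression to try to find them. However, I keep getting a count that I know is wrong. What is wrong with my code?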

import re
import sys
import urllib.request

img_pat = re.compile('<img.*>', re.I)

def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return img_num

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Source

2013-08-18 kflaw

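Ahh, regular expressions.

Your regex pattern <img.*> says: "Find me something that starts with <img, followed by anything, and make sure it ends with >."

Regex is greedy, though. It will fill that .* with literally everything it can, leaving only a single > character afterwards to satisfy the pattern. In this case it goes all the way to the end of the document, to the closing </html> tag, and says "Look! I found a > right there!"

You should get the correct count by making .* non-greedy, like this: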

<img.*?>
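
To see the difference on a small scale, here is a minimal sketch that runs both patterns over a made-up, single-line HTML string. The `sample` markup is an assumption for illustration only, not the page from the question.

```python
import re

# Hypothetical one-line HTML snippet containing three <img> tags.
sample = '<p><img src="a.png"> text <IMG src="b.png"><img src="c.png"></p>'

greedy = re.compile('<img.*>', re.I)   # pattern from the question
lazy = re.compile('<img.*?>', re.I)    # non-greedy variant

greedy_matches = greedy.findall(sample)
lazy_matches = lazy.findall(sample)

# The greedy pattern swallows everything from the first <img
# up to the last > on the line, so it yields a single match.
print(len(greedy_matches))  # 1

# The non-greedy pattern stops at the first > after each <img.
print(len(lazy_matches))    # 3
```

Note that `.` does not match newlines unless `re.DOTALL` is set, so on multi-line input the greedy pattern over-matches per line rather than across the whole document.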

Source

2013-08-18 20:02:42

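Thanks, that worked. I don't understand what the '?' is doing, though. – kflaw

It tells the regex to stop searching at the first '>' it encounters, rather than the last one. So it will capture each '<img ...>' separately instead of one big '<img ...>' that contains everything else.

The '?' tells the regular expression to match the arbitrary '.*' pattern with as _few_ characters as possible, rather than as _many_ (which is the default). So if we personify regex a bit longer, it would see '<img' and try to end that match as soon as possible.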

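Your regex is greedy, so it matches more than you want. I suggest using an HTML parser.

img_pat = re.compile('<img.*?>', re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.

A good site for checking your regex matches on the fly: http://www.pyregex.com/
Learn more about regular expressions: http://docs.python.org/2/library/re.html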
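
If installing a third-party parser is not an option, the HTML parser suggestion can also be sketched with the standard library's `html.parser` module. This is only one possible approach, not necessarily what the answer had in mind. The names `ImgCounter` and `count_imgs` and the `sample` string are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> start tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> is counted as well.
        # Self-closing tags like <img ... /> also reach this method,
        # because the default handle_startendtag delegates to it.
        if tag == 'img':
            self.count += 1


def count_imgs(html):
    """Return the number of <img> tags in an HTML string."""
    parser = ImgCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up markup spanning two lines, with mixed case and a self-closing tag.
sample = '<p><img src="a.png">\n<IMG src="b.png"><img src="c.png"/></p>'
print(count_imgs(sample))
```

To use it on a real page, the decoded response body from `urllib.request.urlopen(url).read()` would be passed to `count_imgs`; the right encoding to decode with depends on the page.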

Source

2013-08-18 19:58:31

Thanks, great site – kflaw

Never use regex to parse HTML; use an HTML parser such as lxml or BeautifulSoup instead. Here is a working example of how to get the img tag count using BeautifulSoup and requests:

from bs4 import BeautifulSoup 
import requests 


def get_img_cnt(url): 
    response = requests.get(url) 
    soup = BeautifulSoup(response.content, 'html.parser')

    return len(soup.find_all('img')) 


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Here is a working example using lxml and requests:

from lxml import etree 
import requests 


def get_img_cnt(url): 
    response = requests.get(url) 
    parser = etree.HTMLParser() 
    root = etree.fromstring(response.content, parser=parser) 

    return int(root.xpath('count(//img)')) 


print(get_img_cnt('http://www.americascup.com/en/schedules/races'))

Both snippets print 106.

Hope that helps.

Source

2013-08-18 19:59:49 alecxe
