Parse an HTML file and add the images found to a zip file

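I want to parse an HTML file for all img tags, download every image that a src attribute points to, and then add those files to a zip file. I would rather do all of this in memory, since I can guarantee there won't be that many images.

Assume the images variable has already been populated from parsing the HTML. What I need help with is getting the images into the zip file.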

from zipfile import ZipFile
from StringIO import StringIO
from urllib2 import urlopen

s = StringIO()
zip_file = ZipFile(s, 'w')
try:
    for image in images:
        internet_image = urlopen(image)
        zip_file.writestr('some-image.jpg', internet_image.fp.read())
        # it is not obvious why I have to use writestr() instead of write()
finally:
    zip_file.close()


2009-12-22 Jason Christa

Use urllib2 / lxml / XPath / Google. – mykhal 2009-12-22 22:22:51

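Seconding Brian Agnew's remarks: it looks like you have almost everything sorted out. You have to use zip_file.writestr() because you are writing data from a byte buffer (i.e. a byte string), not from a file located on the file system (which is what zip_file.write() expects to receive). – 2009-12-22 23:29:37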
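The distinction in that comment can be checked with a small, self-contained sketch. It is written for Python 3, which is an assumption (the thread itself uses Python 2), and the entry names and payload bytes are made up for illustration. writestr() takes an archive name plus the data itself, while write() takes the path of a file that already exists on disk.

```python
import io
import os
import tempfile
from zipfile import ZipFile

# Hypothetical payload standing in for downloaded image bytes.
image_bytes = b"\x89PNG fake image data"

buf = io.BytesIO()
with ZipFile(buf, "w") as zf:
    # writestr(): archive name + bytes that are already in memory.
    zf.writestr("from-memory.png", image_bytes)

    # write(): needs a real file on disk, so create a temporary one first.
    with tempfile.TemporaryDirectory() as tmp_dir:
        disk_path = os.path.join(tmp_dir, "on-disk.png")
        with open(disk_path, "wb") as fh:
            fh.write(image_bytes)
        zf.write(disk_path, arcname="from-disk.png")

# Re-open the in-memory archive to confirm both entries were stored.
with ZipFile(io.BytesIO(buf.getvalue())) as zf:
    names = zf.namelist()
    contents = [zf.read(name) for name in names]
```

Since the images in the question are never saved to disk, writestr() is the call that fits.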

Don't forget stylesheets and the images referenced in them... – 2013-08-19 21:37:28

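To answer the other part of your question, how to create the ZIP archive (others here have discussed parsing the URLs), I tested your code. You are already very close to a finished product.

Here is how I would augment what you have to create the zip archive (in this example I write the archive to disk so that I can verify it was written correctly).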

from zipfile import ZipFile, ZIP_DEFLATED
import zlib  # ZIP_DEFLATED needs zlib; this import fails early if it is missing
from cStringIO import StringIO
from urllib2 import urlopen
from urlparse import urlparse
from os import path

images = ['http://sstatic.net/so/img/logo.png',
          'http://sstatic.net/so/Img/footer-cc-wiki-peak-internet.png']

buf = StringIO()
# By default, zip archives are not compressed... adding ZIP_DEFLATED
# to achieve that. If you don't want that, or don't have zlib on your
# system, delete the compression kwarg (and the zlib import)
zip_file = ZipFile(buf, mode='w', compression=ZIP_DEFLATED)

for image in images:
    internet_image = urlopen(image)
    # Name each archive entry after the last path segment of its URL
    fname = path.basename(urlparse(image).path)
    zip_file.writestr(fname, internet_image.read())

zip_file.close()

# Write the in-memory archive to disk so it can be inspected
output = open('images.zip', 'wb')
output.write(buf.getvalue())
output.close()
buf.close()
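For readers on Python 3, the same approach might look roughly like the sketch below. This is an assumption about how the answer would translate, not part of the original post: io.BytesIO replaces cStringIO, urllib.request and urllib.parse replace urllib2 and urlparse, and the zip_images function with its fetch parameter is a hypothetical wrapper added so the download step can be swapped out.

```python
import io
from os import path
from urllib.parse import urlparse
from urllib.request import urlopen
from zipfile import ZIP_DEFLATED, ZipFile


def download(url):
    """Fetch a URL and return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def zip_images(urls, fetch=download):
    """Return the bytes of a zip archive holding the image at each URL.

    `fetch` is a hypothetical hook (not in the original answer) so the
    download step can be replaced, for example in tests.
    """
    buf = io.BytesIO()
    with ZipFile(buf, mode="w", compression=ZIP_DEFLATED) as zip_file:
        for url in urls:
            # Name each entry after the last path segment of its URL.
            fname = path.basename(urlparse(url).path)
            zip_file.writestr(fname, fetch(url))
    return buf.getvalue()
```

Calling zip_images(images) with the list from the answer would return the archive as bytes, which can then be written to a file opened in 'wb' mode. Two URLs that share a base name would end up as two entries with the same name, so a real script may want to make the names unique.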


2009-12-22 23:53:29

I'm not quite sure what you're asking here, since you seem to have most of it sorted out.

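Have you looked into HtmlParser to do the actual HTML parsing? I wouldn't try to hand-roll a parser yourself; it's a major task with many edge cases. Don't even think about regular expressions for anything other than the most trivial cases.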
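As a rough illustration of the parser-based approach, here is a minimal sketch using html.parser.HTMLParser from the Python 3 standard library. Assuming that is the kind of parser meant above is a guess on my part, and the names ImgSrcCollector and get_img_srcs are hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases
        # tag and attribute names before calling this method.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def get_img_srcs(html):
    """Return the src values of all img tags in an HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.srcs
```

The returned values are whatever the page contains, so relative paths would still need to be resolved against the page URL (for example with urllib.parse.urljoin) before downloading.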

For each <img/> tag you can use HttpLib to actually fetch the image. Fetching the images in multiple threads may speed up building the zip file.
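The multi-threaded fetch could be sketched as follows. Note the substitutions: instead of HttpLib this uses urllib.request.urlopen and concurrent.futures.ThreadPoolExecutor from the Python 3 standard library, and fetch_all, fetch and max_workers are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def download(url):
    """Fetch a URL and return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def fetch_all(urls, fetch=download, max_workers=4):
    """Fetch every URL concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, not completion order.
        return list(pool.map(fetch, urls))
```

Only the downloads run in worker threads here. The resulting byte strings can then be added to the ZipFile one after another from a single thread, which avoids concurrent writes to the archive.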


2009-12-22 22:24:42

+1 for suggesting an HTML parser! – Mongoose 2009-12-22 22:31:34

Why the downvote? – 2009-12-22 22:50:30

The simplest way I can think of is to use the BeautifulSoup library.

Something along the lines of:

from BeautifulSoup import BeautifulSoup
from collections import defaultdict

def getImgSrces(html):
    srcs = []
    soup = BeautifulSoup(html)

    for tag in soup('img'):
        attrs = defaultdict(str)
        for attr in tag.attrs:
            attrs[attr[0]] = attr[1]
        attrs = dict(attrs)

        if 'src' in attrs.keys():
            srcs.append(attrs['src'])

    return srcs

That should give you a list of URLs, taken from your img tags, to loop through.


2009-12-22 22:31:05 KingRadical

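Why not just 'for attr in tag.attrs: if attr[0] == 'src': srcs.append(attr[1])' instead? Why bother with your attrs dictionary? – 2009-12-23 00:16:19

I just adapted this from a routine I had written where I wanted a dictionary of all the attributes, although you could do it that way. I'm not sure there is much to gain performance-wise, though. – KingRadical 2009-12-23 16:55:56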
