2017-03-08 285 views

How can I optimize the memory usage of my Python crawler?

I have been learning Python web scraping these days, and I wrote a simple crawler that fetches a picture from Pixiv given its Pixiv ID.

It works fine, but there is one big problem: while running, it takes up almost 1.2 GB of memory on my computer.

However, sometimes it uses only 10 MB, and I really don't know which part of the code causes such high memory usage.

I uploaded the script to my VPS (a Vultr server with only 768 MB of RAM) and tried to run it. As a result, I got a MemoryError.

So I want to know how to optimize the memory usage (even if it costs more time to run).

Here is my code:

(I have rewritten all the code so that it passes PEP 8; if anything is still unclear, please tell me which part confuses you.)

from lxml import etree
import re
import os
import requests


# Get a single picture.
def get_single(Pixiv_ID, Tag_img_src, Headers):
    Filter_Server = re.compile("[\d]+")
    Filter_Posttime = re.compile("img\/[^_]*_p0")
    Posttime = Filter_Posttime.findall(Tag_img_src)[0]
    Server = Filter_Server.findall(Tag_img_src)[0]
    Picture_Type = [".png", ".jpg", ".gif"]
    for i in range(len(Picture_Type)):
        Original_URL = "http://i" + str(Server) + ".pixiv.net/img-original/"\
            + Posttime + Picture_Type[i]
        Picture = requests.get(Original_URL, headers=Headers, stream=True)
        if Picture.status_code == 200:
            break
    if Picture.status_code != 200:
        return -1
    Filename = "./pic/"\
        + str(Pixiv_ID) + "_p0"\
        + Picture_Type[i]
    Picture_File = open(Filename, "wb+")
    for chunk in Picture.iter_content(None):
        Picture_File.write(chunk)
    Picture_File.close()
    Picture.close()
    return 200


# Get a manga, which is a bundle of pictures.
def get_manga(Pixiv_ID, Tag_a_href, Tag_img_src, Headers):
    os.mkdir("./pic/" + str(Pixiv_ID))
    Filter_Server = re.compile("[\d]+")
    Filter_Posttime = re.compile("img\/[^_]*_p")
    Manga_URL = "http://www.pixiv.net/" + Tag_a_href
    Manga_HTML = requests.get(Manga_URL, headers=Headers)
    Manga_XML = etree.HTML(Manga_HTML.content)
    Manga_Pages = Manga_XML.xpath('/html/body'
                                  '/nav[@class="page-menu"]'
                                  '/div[@class="page"]'
                                  '/span[@class="total"]/text()')[0]
    Posttime = Filter_Posttime.findall(Tag_img_src)[0]
    Server = Filter_Server.findall(Tag_img_src)[0]
    Manga_HTML.close()
    Picture_Type = [".png", ".jpg", ".gif"]
    for Number in range(int(Manga_Pages)):
        for i in range(len(Picture_Type)):
            Original_URL = "http://i" + str(Server) + \
                ".pixiv.net/img-original/"\
                + Posttime + str(Number) + Picture_Type[i]
            Picture = requests.get(Original_URL, headers=Headers, stream=True)
            if Picture.status_code == 200:
                break
        if Picture.status_code != 200:
            return -1
        Filename = "./pic/" + str(Pixiv_ID) + "/"\
            + str(Pixiv_ID) + "_p"\
            + str(Number) + Picture_Type[i]
        Picture_File = open(Filename, "wb+")
        for chunk in Picture.iter_content(None):
            Picture_File.write(chunk)
        Picture_File.close()
        Picture.close()
    return 200


# Main function.
def get_pic(Pixiv_ID):
    Index_URL = "http://www.pixiv.net/member_illust.php?"\
        "mode=medium&illust_id=" + str(Pixiv_ID)
    Headers = {'referer': Index_URL}
    Index_HTML = requests.get(Index_URL, headers=Headers, stream=True)
    if Index_HTML.status_code != 200:
        return Index_HTML.status_code
    Index_XML = etree.HTML(Index_HTML.content)
    Tag_a_href_List = Index_XML.xpath('/html/body'
                                      '/div[@id="wrapper"]'
                                      '/div[@class="newindex"]'
                                      '/div[@class="newindex-inner"]'
                                      '/div[@class="newindex-bg-container"]'
                                      '/div[@class="cool-work"]'
                                      '/div[@class="cool-work-main"]'
                                      '/div[@class="img-container"]'
                                      '/a/@href')
    Tag_img_src_List = Index_XML.xpath('/html/body'
                                       '/div[@id="wrapper"]'
                                       '/div[@class="newindex"]'
                                       '/div[@class="newindex-inner"]'
                                       '/div[@class="newindex-bg-container"]'
                                       '/div[@class="cool-work"]'
                                       '/div[@class="cool-work-main"]'
                                       '/div[@class="img-container"]'
                                       '/a/img/@src')
    if Tag_a_href_List == [] or Tag_img_src_List == []:
        return 404
    else:
        Tag_a_href = Tag_a_href_List[0]
        Tag_img_src = Tag_img_src_List[0]
    Index_HTML.close()
    if Tag_a_href.find("manga") != -1:
        return get_manga(Pixiv_ID, Tag_a_href, Tag_img_src, Headers)
    else:
        return get_single(Pixiv_ID, Tag_img_src, Headers)


# Check whether the picture already exists.
def check_exist(Pixiv_ID):
    if not os.path.isdir("Pic"):
        os.mkdir("Pic")
    if os.path.isdir("./Pic/" + str(Pixiv_ID)):
        return True
    Picture_Type = [".png", ".jpg", ".gif"]
    Picture_Exist = False
    for i in range(len(Picture_Type)):
        Path = "./Pic/" + str(Pixiv_ID)\
            + "_p0" + Picture_Type[i]
        if os.path.isfile(Path):
            return True
    return Picture_Exist


# The script starts here.
for i in range(0, 38849402):
    Pixiv_ID = 38849402 - i
    Picture_Exist = check_exist(Pixiv_ID)
    if not Picture_Exist:
        Return_Code = get_pic(Pixiv_ID)
        if Return_Code == 200:
            print str(Pixiv_ID), "finish!"
        elif Return_Code == -1:
            print str(Pixiv_ID), "got an unknown error."
        elif Return_Code == 404:
            print str(Pixiv_ID), "not found. Maybe deleted."
    else:
        print str(Pixiv_ID), "picture exists!"

This is too big a mess to wade through; you should try memory_profiler. At a glance, it looks like you are reading images in one go. Try writing an [MCVE] if possible; it's hard to follow with all the global variables, non-standard naming, and so on. – pvg


@pvg I have added comments explaining my script's variables and logic. Is it clear now? – Kon


That doesn't help much. Try streaming the requests, like `r = requests.get(url, stream=True)`. Set chunk_size to None in `iter_content`, since 5 is ridiculous. – pvg
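The streaming pattern this comment describes, reading a response body in bounded chunks instead of holding it in memory all at once, can be sketched generically. This is a minimal illustration only: it uses an in-memory `io.BytesIO` stream to stand in for a real `requests` response, and `save_stream` is a hypothetical helper name, not part of any library.

```python
import io


def save_stream(source, dest, chunk_size=64 * 1024):
    """Copy a binary stream to dest in fixed-size chunks, so only one
    chunk is ever held in memory at a time (what requests' iter_content
    does when you iterate a response opened with stream=True)."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        dest.write(chunk)


# Simulate a large response body with an in-memory stream.
body = io.BytesIO(b"x" * (256 * 1024))
out = io.BytesIO()
save_stream(body, out)
assert out.getvalue() == b"x" * (256 * 1024)
```

With a tiny chunk size (like 5 bytes), the loop overhead dominates; with `None`, requests hands back data as it arrives, which is usually the right default for saving files.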

Answer


OMG!

Finally, I found out what went wrong.

I used mem_top() to see what was taking up the memory.

Guess what?

It was `for i in range(0, 38849402):`

In memory, there was a list [0, 1, 2, 3 ... 38849401], and it was eating up my memory.

I changed it to:

Pixiv_ID = 38849402
while Pixiv_ID > 0:

    # some code here

    Pixiv_ID = Pixiv_ID - 1

Now the memory usage never goes above 20 MB.

Feels great!
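The same countdown can also be written as a generator, which, like the `while` loop (and like Python 2's `xrange` or Python 3's `range`), yields one ID at a time instead of materializing a 38-million-element list. A minimal sketch, with `id_countdown` being a hypothetical helper name:

```python
def id_countdown(start):
    """Yield IDs from start down to 1, one at a time, so only a single
    integer is ever held in memory (unlike Python 2's range(), which
    builds the whole list up front)."""
    pixiv_id = start
    while pixiv_id > 0:
        yield pixiv_id
        pixiv_id -= 1


# Memory stays flat no matter how large start is.
first_three = []
for pid in id_countdown(38849402):
    first_three.append(pid)
    if len(first_three) == 3:
        break
assert first_three == [38849402, 38849401, 38849400]
```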


Aha! Excellent. That's another really good reason to switch to Python 3. But now take another look at the code: compose your functions, use an actual HTML parser. Your suffering was not in vain. – pvg


Or use `xrange` – ColonelThirtyTwo


@ColonelThirtyTwo Killjoy。 – pvg