如何在使用Python机械化网页抓取后实现结果缓存

用Python编写，我的网页抓取脚本使用机械化。这是我的脚本看起来像：（替换为敏感信息）如何在使用Python机械化网页抓取后实现结果缓存

import mechanize 
import cookielib 
from bs4 import BeautifulSoup 
import html2text 
import json 

br = mechanize.Browser() 
cj = cookielib.LWPCookieJar() 
br.set_cookiejar(cj) 
br.set_handle_equiv(True) 
br.set_debug_responses(True) 
br.set_handle_gzip(True) 
br.set_handle_redirect(True) 
br.set_handle_referer(True) 
br.set_handle_robots(False) 
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1) 
br.addheaders = [('User-agent', 'Safari/8.0')] 
br.open('https://example.com/login.jsp') 
for f in br.forms(): 
    print f 
br.select_form(nr=0) 
br.form['u'] = 'abcd' 
br.form['p'] = '1234' 
br.submit() 

def get_information(): 
    locations=[] 
    data=json.load(br.open('https://example.com/iWantThisJson.jsp')) 
    for entry in data["stores"]: 
     location=entry["name"].split("(",1)[0] 
     locations.append(location) 
    return locations

登录后，我get_information（）方法检索商店位置的列表和切片他们到我想要的东西后，我将它们保存到字典locations.This方法被称为在我的网站建立与Flask，目前在本地运行。这里是它被称为在我的网站代码：

class reportDownload(Form): 
    locations={} 
    locations=get_information() 
    locations_names=list(enumerate(locations)) 
    location=SelectField(u'Location',choices=locations_names)

这份名单是不是显示在下拉菜单上我的网站，让用户选择一个选项。

我的问题在于如何去执行从我的get_information（）方法收到的结果的缓存，因为我不想每次访问网页（使用信息的地方）都要执行网页抓取由用户（这是相当频繁的，因为它是主页之一）。我试图寻找如何实现缓存，但由于我对此仍然很陌生，因此我无法理解需要做什么。如果有人能指点我一个相关的例子，将不胜感激！

谢谢！ :)

来源

2014-12-19 ctyneg

有很多简单的方法来实现缓存。

将数据写入文件。如果你只是处理少量的数据，这特别容易。 JSON blob也可以轻松写入文件。

with open("my_cache_file", "a+") as file_: file_.write(my_json_blob)
使用键值存储缓存像Redis的数据。 Installing redis很简单。使用它也非常简单。有一个非常良好的书面python client for redis

redis=redis.StrictRedis() redis.set(key, json_blob) json_blob = redis.get(key)
您还可以使用类似的Postgres，MongoDB的，MariaDB的，或任何常用的数据库，将数据存储到磁盘上，但只有当你的数据是如此之大，这是它极大地outscaling你有你的计算机上的内存量刮时（如果你使用requests）（Redis的写入内存）

来源

2014-12-19 08:47:30 Greg

谢谢@Greg会试试这个方法:) – ctyneg 2014-12-22 01:29:30

万一别人访问此线程，另一个很好的选择对于缓存requests-cache模块。

它是一个requests的插件，并且在配置几行后，将为您处理缓存。

import requests 
import requests_cache 

requests_cache.install_cache('name/of/cache' 
          backend='mongdb', 
          expire_after=3600) 

# use requests as usual

如在上面的例子中所示，模块允许我们很容易地定义一个高速缓存名，后端，和过期时间。

来源

2017-11-10 23:05:07

如何在使用Python机械化网页抓取后实现结果缓存

回答

相关问题