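Python BeautifulSoup - Scraping Google Finance historical data

I am trying to scrape Google Finance historical data. I need the total number of rows, which is shown alongside the pagination. This is the div tag responsible for displaying the total row count: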

<div class="tpsd">1 - 30 of 1634 rows</div>

I tried the following code to get the data, but it returns an empty list:

soup.find_all('div', 'tpsd')

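I also tried to fetch the whole table, but that did not work either. When I checked the page source, I found the value inside a JavaScript function. When I googled how to get values out of a script tag, the suggestion was to use a regular expression. So I tried a regex, and this is my code: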

import requests 
import re 
from bs4 import BeautifulSoup 
r = requests.get('https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ') 
soup = BeautifulSoup(r.content,'lxml') 
var = soup.find_all("script")[8].string 
a = re.compile('google.finance.applyPagination\((.*)\'http', re.DOTALL) 
b = a.search(var) 
num = b.group(1) 
print(num.replace(',','').split('\n')[3])

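I am able to get the value I want, but my doubt is whether the code above is the right way to get it, or whether there is a better approach. Please help.

Source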

2016-08-19 Jeril

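What does *my doubt is whether the code I used to get the value is correct* mean? Does it give you what you need? –

@PadraicCunningham Yes, I do get the required value from the script tag, but I could not get it from the div tag. Is there any way to get the value using the div tag? – Jeril

If you want to parse the page as you see it in the browser, you would need something like selenium that runs the JavaScript. Are you trying to parse the table, or something else? –

You can easily get the rows 30 at a time by passing an offset, i.e. start=.., in the URL, which is exactly what the pagination logic does: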

from bs4 import BeautifulSoup
import requests

# start={} is the row offset used by the pagination links.
url = ("https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&"
       "enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}")


with requests.session() as s:
    start = 0
    req = s.get(url.format(start))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    all_rows = table.find_all("tr")
    while True:
        # Move on to the next page of 30 rows.
        start += 30
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        # No table in the response means we have gone past the last page.
        if not table:
            break
        all_rows.extend(table.find_all("tr"))
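The stopping rule in the loop above can be looked at separately from the network calls. The following is a minimal sketch, not part of the original answer: `collect_pages` and `fetch_rows` are hypothetical names, and `fetch_rows(start)` stands in for "request the page at this offset and return its rows, or an empty list when there is no table". A fake data source replaces the HTTP request so the logic can be run offline.

```python
def collect_pages(fetch_rows, page_size=30):
    """Call fetch_rows(start) with growing offsets until a page comes back empty."""
    all_rows = []
    start = 0
    while True:
        rows = fetch_rows(start)
        if not rows:
            # An empty page means we have gone past the last offset.
            break
        all_rows.extend(rows)
        start += page_size
    return all_rows


# Fake data source standing in for the HTTP request: 65 rows in total.
fake_rows = list(range(65))


def fake_fetch(start, page_size=30):
    return fake_rows[start:start + page_size]


rows = collect_pages(fake_fetch)
print(len(rows))  # 65
```

Unlike the loop above, this version also checks the first page, so an empty first response does not raise an error.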

You can also read the total from the script tag and use it with range:

import re

with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    # Find the script that calls google.finance.applyPagination(...).
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    # The third comma-separated field of the script text is the total row count.
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")

    for start in range(30, total+1, 30):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
print(len(all_rows))
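The split-based extraction above depends on the exact layout of the script text. A slightly more defensive variant is sketched below. It assumes the call looks like `google.finance.applyPagination(<int>, <int>, <total>, 'http...` with the total as the third argument; that shape is inferred from the snippets in this thread, not confirmed against the page. `sample_script`, `PAGINATION_RE` and `parse_pagination` are hypothetical names, and the meaning of the first two integers is a guess.

```python
import re

# Assumed shape of the script text, inferred from the code in this thread.
sample_script = """
google.finance.applyPagination(
0,
30,
1634,
'http...
"""

# Three integer arguments, each followed by a comma, with optional whitespace.
PAGINATION_RE = re.compile(
    r"google\.finance\.applyPagination\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,"
)


def parse_pagination(script_text):
    """Return the first three integer arguments as a tuple, or None if absent."""
    match = PAGINATION_RE.search(script_text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


print(parse_pagination(sample_script))  # (0, 30, 1634)
```

Returning None when the call is missing makes it easier to notice that the page layout has changed, instead of failing with an IndexError or ValueError inside the split.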

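num=30 is the number of rows per page, so you can set it to 200, which seems to be the maximum, and use the same value as the step for the offset to make fewer requests.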

url = ("https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&"
       "enddate=Aug+18%2C+2016&num=200&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}")


with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    # Step through the offsets 200 rows at a time.
    for start in range(200, total+1, 200):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        print(url.format(start))
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))

If we run the code with num=200, you will see we get 1643 rows:

In [7]: with requests.session() as s: 
    ...:   req = s.get(url.format(0)) 
    ...:   soup = BeautifulSoup(req.content, "lxml") 
    ...:   table = soup.select_one("table.gf-table.historical_price") 
    ...:   scr = soup.find("script", text=re.compile('google.finance.applyPagination')) 
    ...:   total = int(scr.text.split(",", 3)[2]) 
    ...:   all_rows = table.find_all("tr") 
    ...:   for start in range(200, total+1, 200): 
    ...:     soup = BeautifulSoup(s.get(url.format(start)).content, "lxml") 
    ...:     table = soup.select_one("table.gf-table.historical_price") 
    ...:     all_rows.extend(table.find_all("tr")) 
    ...:   print(len(all_rows)) 
    ...:   

1643 

In [8]:
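The 1643 printed above is larger than the 1634 rows reported by the div in the question. One plausible explanation, which is an assumption and not something the thread confirms, is that find_all("tr") also picks up one header row from each page's table. The sketch below works through that arithmetic and the number of requests needed for each page size; `page_offsets` is a hypothetical helper.

```python
def page_offsets(total, per_page):
    """Offsets for start=... that cover `total` rows, `per_page` rows at a time."""
    return list(range(0, total, per_page))


total = 1634  # row count shown in the div in the question

print(len(page_offsets(total, 30)))   # 55 requests with num=30
print(len(page_offsets(total, 200)))  # 9 requests with num=200

# Assumption: every page's table contributes exactly one header <tr>.
pages = len(page_offsets(total, 200))
print(total + pages)  # 1643
```

Stopping the range at `total` rather than `total+1` avoids generating an offset equal to `total` when the row count is an exact multiple of the page size.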

Source

2016-08-19 11:01:01

Awesome... more than we expected. Thank you so much... – Jeril

@Jeril, no problem, you're welcome. –

You could just use the Python module: https://pypi.python.org/pypi/googlefinance

The API is simple:

# The Google Finance API wrapper that we need.
from googlefinance import getQuotes
# The json handler, used here to pretty-print and reload the result.
import json


intelJSON = (getQuotes('INTC')) 

intelDump = json.dumps(intelJSON, indent=2) 

intelInfo = json.loads(intelDump) 

intelPrice = intelInfo[0]['LastTradePrice'] 
intelTime = intelInfo[0]['LastTradeDateTimeLong'] 

print ("As of " + intelTime + ", Intel stock is trading at: " + intelPrice)

Source

2016-08-19 08:19:43 Rich

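Will I be able to get historical data as well? – Jeril

Their GitHub page says the following: "This module provides no delay, real time stock data in NYSE & NASDAQ." I guess it will not fetch historical data. – Jeril

Thank you very much @Rich – Jeril

I prefer having all the raw CSV files that are available for download from Google Finance. I wrote a quick Python script to automatically download all the historical price information for a list of companies. This is equivalent to how a person might manually use the "Download to Spreadsheet" link.

Here is the GitHub repo, with the downloaded CSV files for all S&P 500 stocks (in the rawCSV folder): https://github.com/liezl200/stockScraper

It uses this link: http://www.google.com/finance/historical?q=googl&startdate=May+3%2C+2012&enddate=Apr+30%2C+2017&output=csv where the key is the last parameter, output=csv. I use urllib.urlretrieve(download_url, local_csv_filename) to retrieve the CSV.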
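As a rough illustration of how such a download URL can be assembled, here is a minimal Python 3 sketch. It is not taken from the linked repo: `build_csv_url` and the local filename are hypothetical, and it uses urllib.request.urlretrieve, which is where urllib.urlretrieve lives in Python 3. The download call is commented out so the snippet runs without network access.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve  # Python 3 location of urllib.urlretrieve

BASE_URL = "http://www.google.com/finance/historical"


def build_csv_url(symbol, startdate, enddate):
    """Build the historical-data URL with output=csv as the last parameter."""
    params = [
        ("q", symbol),
        ("startdate", startdate),
        ("enddate", enddate),
        ("output", "csv"),
    ]
    # urlencode turns spaces into '+' and commas into '%2C'.
    return BASE_URL + "?" + urlencode(params)


download_url = build_csv_url("googl", "May 3, 2012", "Apr 30, 2017")
print(download_url)

# Hypothetical local filename; uncomment to actually download the file.
# urlretrieve(download_url, "googl.csv")
```

Building the query with urlencode keeps the date encoding consistent when looping over a list of ticker symbols.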

Source

2017-05-01 22:51:45 liezlp

I had this idea too, but it might take some time whenever I want to update the data. Thanks for your response. – Jeril
