2017-08-06 95 views

How to extract data from tags using Beautiful Soup

I am trying to retrieve data from a website. My code is as follows:

import re 
from urllib2 import urlopen 
from bs4 import BeautifulSoup 

# gets a file-like object using urllib2.urlopen 
url = 'http://ecal.forexpros.com/e_cal.php?duration=weekly' 
html = urlopen(url) 

soup = BeautifulSoup(html) 

# loops over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr' 
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}): 
    # finds desired data by looking up <td> elements with class names 

    event = tr.find('td', {'class': 'ec_td_event'}).text 
    currency = tr.find('td', {'class': 'ec_td_currency'}).text 
    actual = tr.find('td', {'class': 'ec_td_actual'}).text 
    forecast = tr.find('td', {'class': 'ec_td_forecast'}).text 
    previous = tr.find('td', {'class': 'ec_td_previous'}).text 
    time = tr.find('td', {'class': 'ec_td_time'}).text 
    importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt') 

    # the returned strings are unicode, so to print them we need to use a unicode string 
    if importance == 'High': 
        print(u'\t{:5}\t{}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, importance, currency, event, actual, forecast, previous))

The first few records in the result set are as follows:

05:00 High EUR CPI (YoY)         1.3%  1.3%  1.3%  
10:00 High USD Pending Home Sales (MoM)     1.5%  0.7%  -0.7% 
21:45 High CNY Caixin Manufacturing PMI     51.1  50.4  50.4  
00:30 High AUD RBA Interest Rate Decision     1.50%  1.50%  1.50% 
00:30 High AUD RBA Rate Statement               
03:55 High EUR German Manufacturing PMI     58.1  58.3  58.3  
03:55 High EUR German Unemployment Change     -9K   -5K   6K  

I would now like to retrieve similar data from the following website:

https://www.fxstreet.com/economic-calendar

To do this, I modified the code above as follows:

import re 
from urllib2 import urlopen 
from bs4 import BeautifulSoup 

# gets a file-like object using urllib2.urlopen 
url = 'https://www.fxstreet.com/economic-calendar' 
html = urlopen(url) 

soup = BeautifulSoup(html) 


for tr in soup.find_all('tr', {'class': re.compile('fxst-tr-event fxst-oddRow fxit-eventrow fxst-evenRow ')}): 
    # finds desired data by looking up <div> elements with class names 

    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text 
    currency = tr.find('div', {'class': 'fxit-event-name'}).text 
    actual = tr.find('div', {'class': ' fxit-actual'}).text 
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text 
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text 
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text 
# importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt') 

    # the returned strings are unicode, so to print them we need to use a unicode string 
    if importance == 'High': 
     print(u'\t{:5}\t{:3}\t{:40}\t{:8}\t{:8}\t{:8}'.format(time, currency, event, actual, forecast, previous)) 

This code does not return any results (presumably because I am referencing the wrong tags and/or classes). Can anyone see where my mistake is?

Thanks!


I took a look at the website, and there is no class named 'fxst-tr-event fxst-oddRow fxit-eventrow fxst-evenRow' – ksai

Answers


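You should use selenium with Chromedriver or PhantomJS, because the page content is generated dynamically by JavaScript, which urllib2 does not execute. I don't think a regex makes much sense here; you can use the lxml parser and pass the classes as a list, so that a row carrying any of them is matched. Here is an example using the tools mentioned: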

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.fxstreet.com/economic-calendar'

# Let a real browser execute the page's JavaScript, then grab the rendered HTML
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
soup = BeautifulSoup(html, 'lxml')

# A list of classes matches any <tr> that carries at least one of them
for tr in soup.find_all('tr', {'class': ['fxst-tr-event', 'fxst-oddRow', 'fxit-eventrow', 'fxst-evenRow', 'fxs_cal_nextEvent']}):
    # Variable names are kept from the question; note that 'event' and 'time'
    # both read the time cell, and 'currency' reads the event name
    event = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text
    currency = tr.find('div', {'class': 'fxit-event-name'}).text
    actual = tr.find('div', {'class': 'fxit-actual'}).text
    forecast = tr.find('div', {'class': 'fxit-consensus'}).text
    previous = tr.find('div', {'class': 'fxst-td-previous fxit-previous'}).text
    time = tr.find('div', {'class': 'fxit-eventInfo-time fxs_event_time'}).text

    print(time, currency, event, actual, forecast, previous)

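lxml is a separate library; you could also handle multiple classes with the standard html.parser, but in my opinion it is less intuitive. This code prints: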

14:00 
CAD          14:00 None 59.2 
61.6          
14:00 
CAD          14:00 52.9 
63.9          
17:00 
USD          17:00 765 
... 
... 

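I did not change any of the variables, because I don't really know what you want them to be, so further adjustment and formatting of the output would be desirable.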
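As a rough illustration of the list-of-classes lookup used above, here is a minimal sketch that runs on a small static snippet instead of the live page. The markup in SAMPLE_HTML and the helper cell_text are hypothetical and only loosely modelled on the class names in this thread; the real page structure may differ. The helper also returns a default when a cell is missing, so a row without a given div does not raise AttributeError on `.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not copied from the real page).
SAMPLE_HTML = """
<table>
  <tr class="fxst-tr-event fxst-oddRow">
    <td><div class="fxit-event-name">CPI (YoY)</div></td>
    <td><div class="fxit-actual">1.3%</div></td>
  </tr>
  <tr class="fxst-tr-event fxst-evenRow">
    <td><div class="fxit-event-name">Rate Statement</div></td>
  </tr>
  <tr class="fxst-header"><td>Header row</td></tr>
</table>
"""


def cell_text(row, css_class, default=""):
    """Return the stripped text of the first <div> carrying css_class, or default."""
    div = row.find("div", class_=css_class)
    return div.get_text(strip=True) if div is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A list matches every <tr> that carries at least one of the listed classes,
# so the header row (which has neither) is skipped.
rows = soup.find_all("tr", class_=["fxst-oddRow", "fxst-evenRow"])

# The second row has no 'fxit-actual' div, so its value falls back to "".
records = [(cell_text(r, "fxit-event-name"), cell_text(r, "fxit-actual")) for r in rows]

for name, actual in records:
    print("{:20} {}".format(name, actual))
```

The same helper could be applied to the rows parsed from `driver.page_source`, assuming the class names in the answer still match the rendered page.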


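Thank you. I tried to modify your code to include the 'expected volatility' by inserting `volatility = tr.find('div', {'class': 'fxit-eventInfo-vol-c fxit-event-info-desktop'}).text` as the last variable in the for loop. It doesn't seem to work. Any idea why? – equanimity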


It works for me: I get a bunch of 1s and 2s. What would the expected output be?


The expected output is: 1 = "low expected volatility", 2 = "moderate expected volatility" and 3 = "high expected volatility" – equanimity
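One possible way to turn the scraped codes into the labels described in the comment above is a simple lookup table. This is only a sketch: it assumes the volatility cell's text is the digit 1, 2 or 3, possibly surrounded by whitespace, and the names VOLATILITY_LABELS and describe_volatility are made up for illustration.

```python
# Labels taken from the comment above; the mapping itself is an assumption
# about what the scraped '1', '2' and '3' values stand for.
VOLATILITY_LABELS = {
    "1": "Low expected volatility",
    "2": "Moderate expected volatility",
    "3": "High expected volatility",
}


def describe_volatility(raw_text):
    """Map the scraped cell text to a label, keeping unknown values visible."""
    key = raw_text.strip()
    return VOLATILITY_LABELS.get(key, "Unknown ({})".format(key))


# Example with cell text as it might come back from .text, whitespace included.
for raw in ["1", " 2\n", "3", "7"]:
    print(describe_volatility(raw))
```

Inside the loop from the answer, this could be called as `describe_volatility(volatility)` once `volatility` holds the cell text.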