2017-08-02

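Extracting the text of an img alt attribute using Beautiful Soup

I am able to extract data from the website successfully, except for one field whose value is stored in an img alt attribute. Here is the code: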

#import pandas as pd 
import re 
from urllib2 import urlopen 
from bs4 import BeautifulSoup 

# gets a file-like object using urllib2.urlopen 
url = 'http://ecal.forexpros.com/e_cal.php?duration=daily' 
html = urlopen(url) 

soup = BeautifulSoup(html) 

# loops over all <tr> elements with class 'ec_bg1_tr' or 'ec_bg2_tr' 
for tr in soup.find_all('tr', {'class': re.compile('ec_bg[12]_tr')}): 
    # finds desired data by looking up <td> elements with class names 
    event = tr.find('td', {'class': 'ec_td_event'}).text 
    currency = tr.find('td', {'class': 'ec_td_currency'}).text 
    actual = tr.find('td', {'class': 'ec_td_actual'}).text 
    forecast = tr.find('td', {'class': 'ec_td_forecast'}).text 
    previous = tr.find('td', {'class': 'ec_td_previous'}).text 
    time = tr.find('td', {'class': 'ec_td_time'}).text 
    importance = tr.find('td', {'class': 'ec_td_importance'}).text 

    # the returned strings are unicode, so to print them we need a unicode string 
    print u'{:3}\t{}\t{:5}\t{:8}\t{:8}\t{:8}\t{}'.format(currency, importance, time, actual, forecast, previous, event) 

The first few records of the output are as follows:

JPY  01:00 43.8  43.6  43.3  Household Confidence 
CHF  01:45 -3   -3   -8   SECO Consumer Climate 
RON  02:00 2.50%     3.30%  PPI (YoY) 
EUR  03:00 -26.9K  -66.5K  -98.3K  Spanish Unemployment Change 
CHF  03:15 1.5%  1.3%  -0.8%  Retail Sales (YoY) 
CHF  03:30 60.9  58.9  60.1  SVME PMI 
GBP  04:30 51.9  54.5  54.8  Construction PMI 

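The importance field does not appear in the output above (presumably because the data is contained in the img alt attribute rather than in the cell's text).

Does anyone know how to solve this?

Thanks!

Edit:

The problem was solved by replacing: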

importance = tr.find('td', {'class': 'ec_td_importance'}).text 

with:

importance = tr.find('td', {'class': 'ec_td_importance'}).img.get('alt') 
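
To illustrate why `.text` came back empty, here is a minimal sketch on a hand-written row. The HTML fragment, the file name `high.gif`, and the alt value `High` are made up for illustration and are not copied from the real page. The assumption is that the importance cell contains only an `<img>` element, so it has no text node of its own.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page's markup may differ.
sample = '''
<table><tr class="ec_bg1_tr">
  <td class="ec_td_currency">JPY</td>
  <td class="ec_td_importance"><img src="high.gif" alt="High"></td>
</tr></table>
'''

# 'html.parser' is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(sample, 'html.parser')
cell = soup.find('td', {'class': 'ec_td_importance'})

text_value = cell.text            # the cell holds no text node, only an <img>
alt_value = cell.img.get('alt')   # read the attribute of the nested <img>

print(repr(text_value))
print(repr(alt_value))
```

Under that assumption, `.text` has nothing to collect from the cell, while the attribute lookup on the nested `<img>` returns the label.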

Answer

Replace your importance line with this:

importance = tr.find('td', {'class': 'ec_td_importance'}).img['alt']
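
One difference worth knowing: `img['alt']` raises `KeyError` when the attribute is missing, whereas `img.get('alt')` returns `None`, and `.img` itself is `None` when the cell has no image. A defensive variant might look like the sketch below. The helper name `importance_of` and the sample rows are hypothetical and only serve to show the three cases.

```python
from bs4 import BeautifulSoup

def importance_of(row, default=''):
    """Return the alt text of the <img> in the importance cell, or default."""
    cell = row.find('td', {'class': 'ec_td_importance'})
    img = cell.img if cell is not None else None
    if img is None:
        return default                 # no cell, or no <img> inside it
    return img.get('alt', default)     # no KeyError if alt is absent

# Hypothetical rows: alt present, alt missing, and no <img> at all.
sample = '''
<table>
<tr class="ec_bg1_tr"><td class="ec_td_importance"><img alt="Low"></td></tr>
<tr class="ec_bg2_tr"><td class="ec_td_importance"><img src="x.gif"></td></tr>
<tr class="ec_bg1_tr"><td class="ec_td_importance"></td></tr>
</table>
'''

soup = BeautifulSoup(sample, 'html.parser')
results = [importance_of(tr) for tr in soup.find_all('tr')]
print(results)
```

If every row on the page is known to carry an image with an alt attribute, the shorter `img['alt']` form is enough. The helper only matters when some rows might not.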