2017-09-03 196 views

I am trying to scrape the "Earnings Announcements" table from https://www.zacks.com/stock/research/amzn/earnings-announcements, but I cannot get the table data out of the HTML.

I have tried different BeautifulSoup calls, but none of them returns the table:

table = soup.find('table', attrs={'class': 'earnings_announcements_earnings_table'}) 

table = soup.find_all('table') 

When I inspect the table in the browser, the table elements are there.
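The browser's inspector shows the rendered DOM after JavaScript has run, which is not the same as the HTML the server sends. A minimal sketch (a toy stand-in for the page source, not the real Zacks HTML) of why `find` comes up empty:

```python
# Toy stand-in for what the server actually returns: the table data lives
# inside a <script> block, not in <table>/<tr> markup, so BeautifulSoup
# has no table rows to find.
source = ('<script>document.obj_data = '
          '{ "earnings_announcements_earnings_table" : [ ] };</script>')

print("earnings_announcements_earnings_table" in source)  # True: data is in JS
print("<tr>" in source)                                   # False: no rendered rows
```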

I am pasting part of the code I get back for the table (JS? JSON?):

document.obj_data = { 
"earnings_announcements_earnings_table" : 
     [ [ "10/26/2017", "9/2017", "$0.06", "--", "--", "--", "--" ] , [ "7/27/2017", "6/2017", "$1.40", "$0.40", "<div class=\"right neg negative neg_icon showinline down\">-1.00</div>", "<div class=\"right neg negative neg_icon showinline down\">-71.43%</div>", "After Close" ] , [ "4/27/2017", "3/2017", "$1.03", "$1.48", "<div class=\"right pos positive pos_icon showinline up\">+0.45</div>", "<div class=\"right pos positive pos_icon showinline up\">+43.69%</div>", "After Close" ] , [ "2/2/2017", "12/2016", "$1.40", "$1.54", "<div class=\"right pos positive pos_icon showinline up\">+0.14</div>", "<div class=\"right pos positive pos_icon showinline up\">+10.00%</div>", "After Close" ] , [ "10/27/2016", "9/2016", "$0.85", "$0.52", "<div class=\"right neg negative neg_icon showinline down\">-0.33</div>", "<div class=\"right neg negative neg_icon showinline down\">-38.82%</div>", "After Close" ] , [ "7/28/2016", "6/2016", "$1.14", "$1.78", "<div class=\"right pos positive pos_icon showinline up\">+0.64</div>", "<div class=\"right pos positive pos_icon showinline up\">+56.14%</div>", "After Close" ] , [ "4/28/2016", "3/2016", "$0.61", "$1.07", "<div class=\"right pos positive pos_icon showinline up\">+0.46</div>", "<div class=\"right pos positive pos_icon showinline up\">+75.41%</div>", "After Close" ] , [ "1/28/2016", "12/2015", "$1.61", "$1.00", "<div class=\"right neg negative neg_icon showinline down\">-0.61</div>", "<div class=\"right neg negative neg_icon showinline down\">-37.89%</div>", "After Close" ] , [ "10/22/2015", "9/2015", "-$0.1", "$0.17", "<div class=\"right pos positive pos_icon showinline up\">+0.27</div>", "<div class=\"right pos positive pos_icon showinline up\">+270.00%</div>", "After Close" ] , [ "7/23/2015", "6/2015", "-$0.15", "$0.19", "<div class=\"right pos positive pos_icon showinline up\">+0.34</div>", "<div class=\"right pos positive pos_icon showinline up\">+226.67%</div>", "After Close" ] , [ "4/23/2015", "3/2015", "-$0.13", "-$0.12", 
"<div class=\"right pos positive pos_icon showinline up\">+0.01</div>", "<div class=\"right pos positive pos_icon showinline up\">+7.69%</div>", "After Close" ] , [ "1/29/2015", "12/2014", "$0.24", "$0.45", "<div class=\"right pos positive pos_icon showinline up\">+0.21</div>", "<div class=\"right pos positive pos_icon showinline up\">+87.50%</div>", "After Close" ] , [ "10/23/2014", "9/2014", "-$0.73", "-$0.95", "<div class=\"right neg negative neg_icon showinline down\">-0.22</div>", "<div class=\"right neg negative neg_icon showinline down\">-30.14%</div>", "After Close" ] , [ "7/24/2014", "6/2014", "-$0.13", "-$0.27", "<div class=\"right neg negative neg_icon showinline down\">-0.14</div>", "<div class=\"right neg negative neg_icon showinline down\">-107.69%</div>", "After Close" ] , [ "4/24/2014", "3/2014", "$0.22", "$0.23", "<div class=\"right pos positive pos_icon showinline up\">+0.01</div>", "<div class=\"right pos positive pos_icon showinline up\">+4.55%</div>", "After Close" ] , [ "1/30/2014", "12/2013", "$0.68", "$0.51", "<div class=\"right neg negative neg_icon showinline down\">-0.17</div>", "<div class=\"right neg negative neg_icon showinline down\">-25.00%</div>", "After Close" ] , [ "10/24/2013", "9/2013", "-$0.09", "-$0.09", "<div class=\"right pos_na showinline\">0.00</div>", "<div class=\"right pos_na showinline\">0.00%</div>", "After Close" ] , [ "7/25/2013", "6/2013", "$0.04", "-$0.02", "<div class=\"right neg negative neg_icon showinline down\">-0.06</div>", "<div class=\"right neg negative neg_icon showinline down\">-150.00%</div>", "After Close" ] , [ "4/25/2013", "3/2013", "$0.10", "$0.18", "<div class=\"right pos positive pos_icon showinline up\">+0.08</div>", "<div class=\"right pos positive pos_icon showinline up\">+80.00%</div>", "After Close" ] , [ "1/29/2013", "12/2012", "$0.28", "$0.21", "<div class=\"right neg negative neg_icon showinline down\">-0.07</div>", "<div class=\"right neg negative neg_icon showinline 
down\">-25.00%</div>", "After Close" ] , [ "10/25/2012", "9/2012", "-$0.08", "-$0.23", "<div class=\"right neg negative neg_icon showinline down\">-0.15</div>", "<div class=\"right neg negative neg_icon showinline down\">-187.50%</div>", "After Close" ] , [ "7/26/2012", "6/2012", "--", "--", "--", "--", "After Close" ] , [ "4/26/2012", "3/2012", "--", "--", "--", "--", "After Close" ] , [ "1/31/2012", "12/2011", "--", "--", "--", "--", "After Close" ] , [ "10/25/2011", "9/2011", "--", "--", "--", "--", "After Close" ] , [ "7/26/2011", "6/2011", "--", "--", "--", "--", "After Close" ] , [ "4/26/2011", "3/2011", "--", "--", "--", "--", "--" ] , [ "1/27/2011", "12/2010", "--", "--", "--", "--", "After Close" ] , [ "10/21/2010", "9/2010", "--", "--", "--", "--", "After Close" ] , [ "7/22/2010", "6/2010", "--", "--", "--", "--", "After Close" ] , [ "4/22/2010", "3/2010", "--", "--", "--", "--", "After Close" ] , [ "1/28/2010", "12/2009", "--", "--", "--", "--", "After Close" ] , [ "10/22/2009", "9/2009", "--", "--", "--", "--", "After Close" ] , [ "7/23/2009", "6/2009", "--", "--", "--", "--", "After Close" ] ] 

How can I get this table? Thanks!

The data is loaded dynamically rather than being present in the static HTML, so you will have to parse it out of whatever you receive. –

Thanks! PhantomJS, Selenium? – Diego

I looked at the page source and it still looks the same, so I don't think that would help. Still, it may be worth a try. –

Answer

So the solution was to parse the whole HTML document with Python's string and regex functions rather than with BeautifulSoup, because we are not trying to pull the data out of HTML tags; we want to pull it out of the JS code embedded in the page.

This code basically grabs the JS array stored under "earnings_announcements_earnings_table". Since a JS array literal has the same structure as a Python list, I just parse it with ast. The result is a list you can loop over, containing the data from all pages of the table.
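As a quick illustration of that point (a trimmed-down toy string, not the live page): a JS array literal made of double-quoted strings is also valid Python syntax, so `ast.literal_eval` turns it straight into a list of lists.

```python
import ast

# A shortened stand-in for the text sliced out of document.obj_data.
js_array = '[ [ "10/26/2017", "9/2017", "$0.06" ] , [ "7/27/2017", "6/2017", "$1.40" ] ]'

rows = ast.literal_eval(js_array)
print(len(rows))    # 2
print(rows[0][0])   # 10/26/2017
```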

import urllib2 
import re 
import ast 

# Use a browser-like User-Agent header; sites like Zacks often block the default one.
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'} 
req = urllib2.Request('https://www.zacks.com/stock/research/amzn/earnings-announcements', None, user_agent) 
source = urllib2.urlopen(req).read() 

# Drop everything up to and including the key that opens the earnings array.
match = re.search(r'"earnings_announcements_earnings_table"\s*:', source, flags=re.IGNORECASE) 
if match: 
    source = source[match.end():] 

# Drop everything from the next key onward, leaving only the array literal.
match = re.search(r'"earnings_announcements_webcasts_table"', source, flags=re.IGNORECASE) 
if match: 
    source = source[:match.start()] 

# A JS array of double-quoted string literals is also valid Python syntax,
# so ast.literal_eval parses it safely once surrounding commas and whitespace
# are stripped.
result = ast.literal_eval(source.strip('\r\n\t, ')) 
print result 
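Each entry in `result` is one row of the table. A sketch of how the fields appear to line up (the column names below are my inference from the rendered Zacks page, not something the JSON itself declares):

```python
# One row copied from the obj_data blob in the question.
row = ["7/27/2017", "6/2017", "$1.40", "$0.40",
       '<div class="right neg negative neg_icon showinline down">-1.00</div>',
       '<div class="right neg negative neg_icon showinline down">-71.43%</div>',
       "After Close"]

# Hypothetical labels, guessed from the table headers shown on the page.
columns = ["date", "period_ending", "estimate", "reported",
           "surprise", "pct_surprise", "announce_time"]

record = dict(zip(columns, row))
print(record["date"])      # 7/27/2017
print(record["estimate"])  # $1.40
```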

Let me know if you need any clarification.

Thank you so much! It works great! But elements 4 and 5 of each list come wrapped in HTML code. – Diego

Great! If you want to clean up that HTML code, you can run those cells through BeautifulSoup. – chad