2017-10-17

Getting a table from a web page using Python

I need to get a table from this page:

http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF 

The table I'm interested in looks like this: [screenshot of the table] (ignore the chart above the table)

This is what I have so far:

from selenium import webdriver 
from bs4 import BeautifulSoup 

# load chrome driver 
driver = webdriver.Chrome('C:/.../chromedriver_win32/chromedriver') 

# load web page and get source html 
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF' 
driver.get(link) 
html = driver.page_source 

# make soup and get all tables 
soup = BeautifulSoup(html, 'html.parser') 
tables = soup.findAll('table',{'class':'r_table3'}) 
tbl = tables[1] # ideally we should select table by name 

Where do I go from here?

Is there a specific reason you are using both BeautifulSoup and Selenium? – Goralight

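I was told that when a page has embedded JavaScript, you need to load it in a browser first and then parse the rendered HTML with BeautifulSoup? –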

I'm not saying that is a problem, I'm just asking why you need it. Do you need the whole table, or one specific cell? – Goralight

Answers

1

To get the data from that page, you can do something like this:

from selenium import webdriver 
from bs4 import BeautifulSoup 
import time 

driver = webdriver.Chrome() 
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF' 
driver.get(link) 
time.sleep(3) 

soup = BeautifulSoup(driver.page_source, 'lxml') 
driver.quit() 

tab_data = soup.select('table')[1] 
for items in tab_data.select('tr'): 
    item = [elem.text for elem in items.select('th,td')] 
    print(' '.join(item)) 

Partial output:

Total Return %  1-Day 1-Week 1-Month 3-Month YTD 1-Year 3-Year 5-Year 10-Year 15-Year 
IWF (Price) 0.13 0.83 2.68 5.67 23.07 26.60 15.52 15.39 8.97 10.14 
IWF (NAV) 0.20 0.86 2.66 5.70 23.17 26.63 15.52 15.40 8.98 10.14 
S&P 500 TR USD (Price) 0.18 0.52 2.42 4.52 16.07 22.40 13.51 14.34 7.52 9.76 
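
Picking the table by position, as `soup.select('table')[1]` does above, can break if the page layout changes. Below is a minimal sketch of selecting it by its header text instead. The `html` string is a hypothetical stand-in for `driver.page_source` (values copied from the partial output above, trimmed to two columns), and `find_table_by_header` is an illustrative helper, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source: two tables sharing a class,
# the second one holding values copied from the partial output above.
html = """
<table class="r_table3">
<tr><th>Other</th><th>A</th></tr>
<tr><th>x</th><td>1</td></tr>
</table>
<table class="r_table3">
<tr><th>Total Return %</th><th>1-Day</th><th>1-Week</th></tr>
<tr><th>IWF (Price)</th><td>0.13</td><td>0.83</td></tr>
<tr><th>IWF (NAV)</th><td>0.20</td><td>0.86</td></tr>
</table>
"""


def find_table_by_header(soup, header_text):
    """Return the first <table> whose first <th> matches header_text (illustrative helper)."""
    for table in soup.find_all("table"):
        first_th = table.find("th")
        if first_th is not None and first_th.get_text(strip=True) == header_text:
            return table
    return None


soup = BeautifulSoup(html, "html.parser")
table = find_table_by_header(soup, "Total Return %")

# Collect the text of every header and data cell, row by row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
for row in rows:
    print(" ".join(row))
```

On the real page the header text may differ in whitespace or wording, so the string passed to the helper would need checking against the rendered HTML.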

Did you run the code? If so, what was the result? Did you not get the data from that table? – SIM

0

OK, so here is how I did it:

from selenium import webdriver 
from bs4 import BeautifulSoup 
import pandas as pd 

# load chrome driver 
driver = webdriver.Chrome('C:/.../chromedriver_win32/chromedriver') 

# load web page and get source html 
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF' 
driver.get(link) 
html = driver.page_source 

# make soup and get table 
soup = BeautifulSoup(html, 'html.parser') 
tables = soup.find_all('table',{'class':'r_table3'}) 
tbl = tables[1] # ideally we should select table by name 

# column and row names 
rows = tbl.find_all('tr') 
column_names = [x.get_text() for x in rows[0].find_all('th')[1:]] 
row_names = [x.find_all('th')[0].get_text() for x in rows[1:]] 

# table content 
df = pd.DataFrame(columns=column_names, index=row_names) 
for row in rows[1:]: 
    row_name = row.find_all('th')[0].get_text() 
    # .ix was removed in pandas 1.0; .loc assigns the row by label 
    df.loc[row_name] = [column.get_text() for column in row.find_all('td')] 
print(df) 

Is there a more elegant way, i.e. without looping over rows and columns, some ready-made method I can call?
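
One ready-made option is `pandas.read_html`, which parses the `<table>` elements in an HTML document into a list of DataFrames. Below is a minimal sketch, assuming a parser backend for `read_html` is installed (`lxml`, or `bs4` with `html5lib`). The `html` string is a hypothetical stand-in for `driver.page_source`, with values copied from the partial output earlier in the thread and trimmed to three columns. The class name `r_table3` is taken from the question.

```python
import io

import pandas as pd

# Hypothetical stand-in for driver.page_source; on the real page this string
# would come from Selenium after the JavaScript has rendered the table.
html = """
<table class="r_table3">
  <tr><th>Total Return %</th><th>1-Day</th><th>1-Week</th><th>1-Month</th></tr>
  <tr><th>IWF (Price)</th><td>0.13</td><td>0.83</td><td>2.68</td></tr>
  <tr><th>IWF (NAV)</th><td>0.20</td><td>0.86</td><td>2.66</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per matching <table>.
# attrs narrows the match to tables with this class attribute;
# index_col=0 uses the first column (the row labels) as the index.
tables = pd.read_html(io.StringIO(html), attrs={"class": "r_table3"}, index_col=0)
df = tables[0]
print(df)
```

On the real page more than one table may carry that class (the question uses `tables[1]`), so the right element of the returned list would still have to be chosen. The numeric cells are converted to floats here, unlike the strings produced by `get_text()`.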