Getting a table from a web page with Python


I need to get a table from this page:

http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF

The table I'm interested in is this one (ignore the chart above the table):

Here is what I have so far:

from selenium import webdriver 
from bs4 import BeautifulSoup 

# load chrome driver 
driver = webdriver.Chrome('C:/.../chromedriver_win32/chromedriver') 

# load web page and get source html 
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF' 
driver.get(link) 
html = driver.page_source 

# make soup and get all tables 
soup = BeautifulSoup(html, 'html.parser') 
tables = soup.findAll('table',{'class':'r_table3'}) 
tbl = tables[1] # ideally we should select table by name

Where do I go from here?

Source

2017-10-17 Ledger Yu

Is there a specific reason you are using both BeautifulSoup and Selenium? – Goralight

I was told that when a page has embedded JavaScript, you need to load it first and then parse it with Beautiful Soup? –

I'm not saying that is the problem, I'm just asking why you need it. Do you need the whole table, or one specific cell? – Goralight

To get the data from that page, you can do something like this:

from selenium import webdriver 
from bs4 import BeautifulSoup 
import time 

driver = webdriver.Chrome() 
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF' 
driver.get(link) 
time.sleep(3) 

soup = BeautifulSoup(driver.page_source, 'lxml') 
driver.quit() 

tab_data = soup.select('table')[1] 
for items in tab_data.select('tr'): 
    item = [elem.text for elem in items.select('th,td')] 
    print(' '.join(item))

Partial output:

Total Return %  1-Day 1-Week 1-Month 3-Month YTD 1-Year 3-Year 5-Year 10-Year 15-Year 
IWF (Price) 0.13 0.83 2.68 5.67 23.07 26.60 15.52 15.39 8.97 10.14 
IWF (NAV) 0.20 0.86 2.66 5.70 23.17 26.63 15.52 15.40 8.98 10.14 
S&P 500 TR USD (Price) 0.18 0.52 2.42 4.52 16.07 22.40 13.51 14.34 7.52 9.76

Source

2017-10-17 10:17:25 SIM

Did you run the code? If so, what was the result? Did you not get the data from that table? – SIM

OK, so here is how I did it:

from selenium import webdriver 
from bs4 import BeautifulSoup 
import pandas as pd  # needed for pd.DataFrame below

# load chrome driver 
driver = webdriver.Chrome('C:/.../chromedriver_win32/chromedriver') 

# load web page and get source html 
link = 'http://performance.morningstar.com/funds/etf/total-returns.action?t=IWF' 
driver.get(link) 
html = driver.page_source 

# make soup and get table 
soup = BeautifulSoup(html, 'html.parser') 
tables = soup.find_all('table',{'class':'r_table3'}) 
tbl = tables[1] # ideally we should select table by name 

# column and row names 
rows = tbl.find_all('tr') 
column_names = [x.get_text() for x in rows[0].find_all('th')[1:]] 
row_names = [x.find_all('th')[0].get_text() for x in rows[1:]] 

# table content 
df = pd.DataFrame(columns=column_names, index=row_names) 
for row in rows[1:]: 
    row_name = row.find_all('th')[0].get_text() 
    # .ix was removed in pandas 1.0; .loc does the same label-based assignment
    df.loc[row_name] = [column.get_text() for column in row.find_all('td')] 
print(df)

Is there a more elegant way, i.e. not looping over rows and columns myself, but some ready-made method I can call?
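
One ready-made option is `pandas.read_html`, which parses the `<table>` elements in an HTML document into DataFrames. The following is a minimal sketch rather than a verified solution for the Morningstar page: `sample_html` is a made-up stand-in for `driver.page_source` (with a few values copied from the partial output earlier in this thread), and it assumes the wanted table contains the text "Total Return" and keeps its row labels in the first column. `read_html` also needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Stand-in for driver.page_source; the real page's markup may differ.
sample_html = """
<table class="r_table3">
  <tr><th>Total Return %</th><th>1-Day</th><th>1-Week</th></tr>
  <tr><th>IWF (Price)</th><td>0.13</td><td>0.83</td></tr>
  <tr><th>IWF (NAV)</th><td>0.20</td><td>0.86</td></tr>
</table>
"""

# match keeps only tables whose text matches the given regex;
# index_col=0 turns the first column into the row labels.
tables = pd.read_html(StringIO(sample_html), match="Total Return", index_col=0)
df = tables[0]
print(df)
```

Selecting by `match` instead of a positional index such as `tables[1]` is one way to approach the "select table by name" wish in the code comment, as long as the matched text is unique to the table you want.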

Source

2017-10-17 10:03:01
