Using pandas to get multiple tables from a web page

I am using pandas to parse data from the following page: http://kenpom.com/index.php?y=2014

To get the data, I write:

dfs = pd.read_html(url)

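The data looks good and is parsed cleanly, except that it only picks up data starting from row 40. This seems to be caused by the way the table is split into separate sections, which keeps pandas from getting all of the information.

How can I get pandas to retrieve all of the data from all of the tables on that page?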
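One way to see what pd.read_html actually returned is to look at each DataFrame in the list it gives back. The sketch below is only an illustration: SAMPLE_PAGE and the team names are made up so the snippet runs offline, and the real page may be structured differently.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, used here in place of the real URL.
SAMPLE_PAGE = """
<table>
  <tr><th>Rank</th><th>Team</th></tr>
  <tr><td>1</td><td>Team A</td></tr>
  <tr><td>2</td><td>Team B</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Team</th></tr>
  <tr><td>3</td><td>Team C</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it could parse.
dfs = pd.read_html(StringIO(SAMPLE_PAGE))
for i, df in enumerate(dfs):
    print(i, df.shape, list(df.columns))

# If the pieces share the same columns, they can be stacked into one frame.
combined = pd.concat(dfs, ignore_index=True)
```

If the list holds fewer tables or fewer rows than the page shows, the markup itself is the likely culprit rather than the call.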

Source

2017-02-14 user7012893

The HTML of the page you posted contains multiple <thead> and <tbody> tags, which badly confuses pandas.read_html.

Following this SO thread, you can manually unwrap those tags:

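import urllib.request
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

url = "http://kenpom.com/index.php?y=2014"
html_table = urllib.request.urlopen(url).read()

# fix HTML
soup = BeautifulSoup(html_table, "html.parser")
# warning: the id "ratings-table" is specific to this page
for table in soup.find_all(attrs={'id': 'ratings-table'}):
    # copy the children into a list first, because unwrap() modifies the tree
    for c in list(table.children):
        if c.name in ['tbody', 'thead']:
            c.unwrap()

# wrap the HTML in StringIO; newer pandas versions deprecate literal strings
df = pd.read_html(StringIO(str(soup)), flavor="bs4")
len(df[0])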

This returns 369.
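To try the unwrap step without hitting the network, here is a minimal sketch on a made-up table. SAMPLE_HTML, the team names, and the helper read_split_table are hypothetical, not part of the original page or answer. The final filtering step is also an addition of mine, on the assumption that repeated header rows can come back as ordinary data rows once the sections are merged.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of a table split into several <thead>/<tbody> sections.
SAMPLE_HTML = """
<table id="ratings-table">
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Team A</td></tr>
    <tr><td>2</td><td>Team B</td></tr>
  </tbody>
  <thead><tr><th>Rank</th><th>Team</th></tr></thead>
  <tbody>
    <tr><td>3</td><td>Team C</td></tr>
  </tbody>
</table>
"""


def read_split_table(html, table_id):
    """Unwrap <thead>/<tbody> in the table with the given id, then parse it."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    # Copy the children first: unwrap() changes the tree being iterated.
    for child in list(table.children):
        if child.name in ("thead", "tbody"):
            child.unwrap()
    return pd.read_html(StringIO(str(table)), flavor="bs4")[0]


raw = read_split_table(SAMPLE_HTML, "ratings-table")

# Drop any row whose Rank cell just repeats the column name (a repeated
# header that was parsed as data), then restore the numeric type.
cleaned = raw[raw["Rank"] != "Rank"].reset_index(drop=True)
cleaned = cleaned.assign(Rank=pd.to_numeric(cleaned["Rank"]))
```

Note that flavor="bs4" needs both beautifulsoup4 and html5lib installed. On the real page it is worth checking whether the row count includes repeated header rows before relying on it.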

Source

2017-02-14 13:04:46 tworec
