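I am trying to scrape stats from this specific webpage, https://www.sports-reference.com/cfb/schools/louisville/2016/gamelog/, which means accessing commented-out HTML lines with BeautifulSoup.
However, the "Defensive Game Log" table appears to be commented out when I look at the HTML source (it sits between <!-- and -->). Because of this, when I use BeautifulSoup4, the following code only grabs the offensive data, which is not commented out, and misses the defensive data, which is.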
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link.read(), "lxml")
tables = soup.find_all(['th', 'tr'])
my_table = tables[0]
rows = my_table.findChildren(['tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print(value)
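I am curious whether there is any solution, inside or outside of BeautifulSoup4, that would add all of the defensive values to a list in the same way the offensive data is stored. Thanks!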
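One approach that works mostly outside of BeautifulSoup4 is to strip the comment delimiters from the raw HTML before parsing, so the hidden table becomes ordinary markup. The sketch below is only an illustration: the inline `html` string and the table ids `offense` and `defense` are made-up stand-ins for the real page, and it uses the built-in "html.parser" instead of "lxml" so that it runs without extra dependencies.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: one visible table and one
# table wrapped in an HTML comment (the ids are invented for illustration).
html = """
<div><table id="offense"><tbody>
<tr><th>1</th><td>2016-09-01</td><td>70</td></tr>
</tbody></table></div>
<div><!--
<table id="defense"><tbody>
<tr><th>1</th><td>2016-09-01</td><td>14</td></tr>
</tbody></table>
--></div>
"""

# Remove only the comment delimiters; the markup between them is kept.
uncommented = re.sub(r"<!--|-->", "", html)
soup = BeautifulSoup(uncommented, "html.parser")

# Both tables are now visible to an ordinary find_all call.
table_ids = [table.get("id") for table in soup.find_all("table")]
print(table_ids)
```

Note that this also turns any genuine comments on the page into visible text, so it is best suited to cases where only the tables are of interest.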
Note: I added the following to the solution given below; it is derived from here:
data = []
table = defensive_log
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values
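The snippet above assumes that `defensive_log` already holds the parsed table. A minimal sketch of one way it could be obtained inside BeautifulSoup4 is to collect the `Comment` nodes and parse each one as its own HTML fragment. The inline `html` string and the id `defense` are assumptions made for illustration, not the real page's markup, and "html.parser" is used here in place of "lxml" only to keep the example dependency-free.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the page: the defensive table lives in a comment.
html = """
<div><table id="offense"><tbody>
<tr><th>1</th><td>2016-09-01</td><td>70</td></tr>
</tbody></table></div>
<div><!--
<table id="defense"><tbody>
<tr><th>1</th><td>2016-09-01</td><td>14</td></tr>
<tr><th>2</th><td>2016-09-10</td><td>21</td></tr>
</tbody></table>
--></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Comments are stored as strings in the tree, so re-parse each one as HTML
# and look for the table inside it.
defensive_log = None
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    fragment = BeautifulSoup(comment, "html.parser")
    candidate = fragment.find("table", id="defense")  # assumed id
    if candidate is not None:
        defensive_log = candidate
        break

# Same row-collecting loop as above, applied to the recovered table.
data = []
if defensive_log is not None:
    for row in defensive_log.find("tbody").find_all("tr"):
        cols = [cell.text.strip() for cell in row.find_all("td")]
        data.append([value for value in cols if value])
print(data)
```

On the real page the table's id or position would have to be checked in the source first; the lookup by `id="defense"` is only a placeholder for that step.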
What do you mean by "commented out"? – snapcrack