2017-07-15 50 views

I am trying to use BeautifulSoup to scrape statistics from this page: https://www.sports-reference.com/cfb/schools/louisville/2016/gamelog/. In particular, I need to reach data that sits inside commented-out HTML.

However, the "Defensive Game Log" table is commented out in the page's HTML source, so when I use BeautifulSoup4 the code below only picks up the offensive data, which is not commented out.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link.read(), "lxml")

tables = soup.find_all(['th', 'tr'])
my_table = tables[0]
rows = my_table.findChildren(['tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print(value)

I am curious whether there is any solution, inside BeautifulSoup4 or outside it, that would let me collect all of the defensive values into a list the same way the offensive data is stored. Thanks!

Note: I added the solution below, derived from here.

data = []

table = defensive_log
table_body = table.find('tbody')

rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # get rid of empty values
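The row-to-list snippet above can be exercised without a live page. Here is a minimal, self-contained sketch using made-up table HTML (the cell values are hypothetical, and the stdlib `html.parser` backend is used so no lxml install is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a game-log table
html = """
<table>
  <tbody>
    <tr><td>Game 1</td><td>31</td><td></td></tr>
    <tr><td>Game 2</td><td>24</td><td>W</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
data = []
for row in soup.find("tbody").find_all("tr"):
    cols = [td.text.strip() for td in row.find_all("td")]
    data.append([c for c in cols if c])  # drop empty cells

print(data)  # [['Game 1', '31'], ['Game 2', '24', 'W']]
```

Note that the empty-value filter also drops cells that are legitimately blank, which changes the column alignment of affected rows.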

What do you mean by "commented out"? – snapcrack

Answer


The Comment object will give you what you want:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup, Comment

accessurl = 'https://www.sports-reference.com/cfb/schools/oklahoma-state/2016/gamelog/'
req = Request(accessurl)
link = urlopen(req)
soup = BeautifulSoup(link, "lxml")

comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment = BeautifulSoup(str(comment), 'lxml')
    defensive_log = comment.find('table')  # search as an ordinary tag
    if defensive_log:
        break
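To see why this works without hitting the live site: HTML comments survive in the raw source as `Comment` nodes, so you can filter for them and re-parse their text as ordinary markup. Here is a minimal self-contained sketch of the same technique on made-up HTML (the table ids and markup are hypothetical, mirroring the sports-reference layout where only the defensive table is commented out):

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page: the defensive table is wrapped in an HTML comment,
# the offensive one is not.
html = """
<div><table id="offense"><tr><td>O</td></tr></table></div>
<div><!-- <table id="defense"><tr><td>D</td></tr></table> --></div>
"""

soup = BeautifulSoup(html, "html.parser")

defensive_log = None
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    # Re-parse the comment's text so its contents become ordinary tags
    inner = BeautifulSoup(str(comment), "html.parser")
    table = inner.find("table")
    if table is not None:
        defensive_log = table
        break

print(defensive_log["id"])  # defense
```

A plain `soup.find("table", id="defense")` on the original soup would return nothing, because the commented-out table is a text node, not a tag, until it is re-parsed.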

@Storm, any feedback? Did my solution work for you? –


Sorry it took so long to get back to you. I have been moving and only recently got back to the project. I am working through it now and trying to incorporate your answer. – Storm


I added the code from [here](https://stackoverflow.com/questions/23377533/python-beautifulsoup-parsing-table). It let me turn the result into a table. I put the final code in the question above. – Storm