Python BeautifulSoup: iterating over a table

I want to convert table data to a CSV file. Unfortunately I've hit a roadblock: the code below simply repeats the TDs from the first TR for every subsequent TR.

import urllib.request 
from bs4 import BeautifulSoup 

f = open('out.txt','w') 

url = "http://www.international.gc.ca/about-a_propos/atip-aiprp/reports-rapports/2012/02-atip_aiprp.aspx" 
page = urllib.request.urlopen(url) 

soup = BeautifulSoup(page) 

soup.unicode 

table1 = soup.find("table", border=1) 
table2 = soup.find('tbody') 
table3 = soup.find_all('tr') 

for td in table3: 
    rn = soup.find_all("td")[0].get_text() 
    sr = soup.find_all("td")[1].get_text() 
    d = soup.find_all("td")[2].get_text() 
    n = soup.find_all("td")[3].get_text() 

    print(rn + "," + sr + "," + d + ",", file=f)

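This is my first Python script, so any help would be greatly appreciated! I have looked at the answers to other questions, but I can't figure out what I'm doing wrong here.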

Source

2012-04-25 Will

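Every time you call find() or find_all() on soup, you start again at the top level of the document. So when you ask for, say, all the "td" tags, you get every "td" tag in the document, not just those in the table and row you searched for. You might as well not search for the table and rows at all, because the way your code is written their results are never used.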

I think you want to do something like this:

table1 = soup.find("table", border=1) 
table2 = table1.find('tbody') 
table3 = table2.find_all('tr')

Or, you know, something more like this, with more descriptive variable names to boot:

rows = soup.find("table", border=1).find("tbody").find_all("tr") 

for row in rows: 
    cells = row.find_all("td") 
    rn = cells[0].get_text() 
    # and so on
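
To see the difference in scope concretely, here is a minimal sketch on a small made-up table. The HTML string and the variable names are illustrative and not taken from the page in the question, and `html.parser` (Python's built-in parser) is an assumption chosen so the snippet needs no extra parser library.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page.
html = """
<table border="1">
  <tbody>
    <tr><td>A-1</td><td>first</td></tr>
    <tr><td>A-2</td><td>second</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find("tbody").find_all("tr")

# Searching from the top-level soup always yields the document's first cell.
from_soup = [soup.find_all("td")[0].get_text() for row in rows]

# Searching from each row yields that row's own first cell.
from_row = [row.find_all("td")[0].get_text() for row in rows]

print(from_soup)  # ['A-1', 'A-1']
print(from_row)   # ['A-1', 'A-2']
```

The first list reproduces the repeated-row symptom from the question; the second is what the row-scoped search gives.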

Source

2012-04-25 05:08:39 kindall

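The problem is that every time you try to narrow down your search (get the first td of this tr, and so on) you are just calling back to soup. soup is the top-level object: it represents the entire document. You only need to call soup once, and from then on use the result of the previous step in place of soup.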

For example (with variable names changed to be clearer):

table = soup.find('table', border=1) 
rows = table.find_all('tr') 

for row in rows: 
    data = row.find_all("td") 
    rn = data[0].get_text() 
    sr = data[1].get_text() 
    d = data[2].get_text() 
    n = data[3].get_text() 

    print(rn + "," + sr + "," + d + ",", file=f)

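I'm not sure this print statement is the best way to do what you're trying to do here (at the very least, you should use string formatting instead of addition), but I'm leaving it as it is because it isn't the core issue.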
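
Since the stated goal is a CSV file, one alternative to both concatenation and string formatting is the standard library's `csv` module, which also quotes fields that contain commas. This is a different technique from the one the answer suggests, shown only as a minimal sketch; it assumes each row has already been reduced to a list of cell strings, and the sample values and the `write_rows` helper are made up for illustration.

```python
import csv
import io


def write_rows(rows, out):
    """Write each list of cell strings as one CSV record to the file-like `out`."""
    writer = csv.writer(out)
    for cells in rows:
        writer.writerow(cells)


# Hypothetical cell text; in the scraper each inner list would come from
# [td.get_text() for td in row.find_all("td")].
sample = [
    ["A-1", "Summary, with a comma", "Disclosed", "12"],
    ["A-2", "Plain summary", "Withheld", "3"],
]

buffer = io.StringIO()
write_rows(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file rather than a `StringIO`, the `csv` documentation recommends opening it with `newline=""`, for example `open("out.csv", "w", newline="")`.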

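Also, for completeness: soup.unicode doesn't do anything. You're not calling a method there, and there's no assignment. I don't remember BeautifulSoup having a method named unicode in the first place, but I'm used to BS 3.0, so it may be new in 4.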
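
As a footnote to that last point: in BeautifulSoup 4, attribute access with a name it doesn't otherwise recognise is treated as shorthand for `find()` on a tag of that name. So, as far as I can tell, `soup.unicode` just looks for a `<unicode>` tag and evaluates to `None` when there is none. A small sketch on a made-up snippet of HTML (the markup and the `html.parser` choice are assumptions for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")

# Shorthand for soup.find("unicode"); there is no such tag, so this is None.
# In the question's code the result is simply discarded.
result = soup.unicode
print(result)   # None

# The same shorthand with a tag that does exist.
print(soup.p)   # <p>hello</p>
```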

Source

2012-04-25 05:09:41
