2016-11-30

I am trying to scrape a table from a Wikipedia article, and the type of the table's elements alternates between <class 'bs4.element.Tag'> and <class 'bs4.element.NavigableString'>; in other words, the children BeautifulSoup gives me are a mix of bs4.element.Tag and bs4.element.NavigableString objects.

import requests 
import bs4 
import lxml 


resp = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts') 

soup = bs4.BeautifulSoup(resp.text, 'lxml') 

munis = soup.find(id='mw-content-text')('table')[1] 

for muni in munis: 
    print type(muni) 
    print '============' 

This produces the following output:

<class 'bs4.element.Tag'> 
============ 
<class 'bs4.element.NavigableString'> 
============ 
<class 'bs4.element.Tag'> 
============ 
<class 'bs4.element.NavigableString'> 
============ 
<class 'bs4.element.Tag'> 
============ 
<class 'bs4.element.NavigableString'> 
... 

When I try to access muni.contents I get AttributeError: 'NavigableString' object has no attribute 'contents'.

What am I doing wrong? How can I get each muni as a bs4.element.Tag object?

(Using Python 2.7.)


As you probably know, **munis** is the representation of the table in the Wikipedia page; if you print it you will see the table's HTML. If you want to see the tags of the children of **munis**, i.e. its rows, you can loop over munis.childGenerator() and print child.name, but you just get a series of 'tr' strings, which I doubt is what you want. Shouldn't you be asking how to scrape the contents of each row of the table, perhaps into a Python list? –
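
For reference, a minimal sketch of what the comment describes, reusing the same URL and the munis lookup from the question (names mirror the question's code):

import requests 
import bs4 

resp = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts') 
soup = bs4.BeautifulSoup(resp.text, 'lxml') 
munis = soup.find(id='mw-content-text')('table')[1] 

# .name is only meaningful on Tag children; the other children are whitespace NavigableStrings 
print([child.name for child in munis.childGenerator() if isinstance(child, bs4.element.Tag)]) 
# -> a list of 'tr' names, one per table row 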

Answers

0

If there is whitespace in the markup between nodes, BeautifulSoup turns that whitespace into NavigableString objects. Just wrap the access in a try/except and check whether the contents are being fetched the way you expect:

for muni in munis: 
    # print type(muni) 
    # NavigableString children (the whitespace between rows) have no .contents 
    try: 
        print muni.contents 
    except AttributeError: 
        pass 
    print '============' 
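
An alternative to swallowing the AttributeError (a sketch, not part of the original answer, reusing the munis variable and the bs4 import from the question) is to skip the NavigableString children up front:

for muni in munis: 
    # whitespace between rows parses as NavigableString; only Tags have .contents 
    if not isinstance(muni, bs4.element.Tag): 
        continue 
    print muni.contents 
    print '============' 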
0
from bs4 import BeautifulSoup 
import requests 

r = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts') 
soup = BeautifulSoup(r.text, 'lxml') 
rows = soup.find(class_="wikitable sortable").find_all('tr')[1:]  # [1:] skips the header row 

for row in rows: 
    cell = [i.text for i in row.find_all('td')]  # text of every <td> in the row 
    print(cell) 

Output:

['Abington', 'Town', 'Plymouth', 'Open town meeting', '15,985', '1712'] 
['Acton', 'Town', 'Middlesex', 'Open town meeting', '21,924', '1735'] 
['Acushnet', 'Town', 'Bristol', 'Open town meeting', '10,303', '1860'] 
['Adams', 'Town', 'Berkshire', 'Representative town meeting', '8,485', '1778'] 
['Agawam', 'City[4]', 'Hampden', 'Mayor-council', '28,438', '1855'] 
['Alford', 'Town', 'Berkshire', 'Open town meeting', '494', '1773'] 
['Amesbury', 'City', 'Essex', 'Mayor-council', '16,283', '1668'] 
['Amherst', 'Town', 'Hampshire', 'Representative town meeting', '37,819', '1775'] 
['Andover', 'Town', 'Essex', 'Open town meeting', '33,201', '1646'] 
['Aquinnah', 'Town', 'Dukes', 'Open town meeting', '311', '1870'] 
['Arlington', 'Town', 'Middlesex', 'Representative town meeting', '42,844', '1807'] 
['Ashburnham', 'Town', 'Worcester', 'Open town meeting', '6,081', '1765'] 
['Ashby', 'Town', 'Middlesex', 'Open town meeting', '3,074', '1767'] 
['Ashfield', 'Town', 'Franklin', 'Open town meeting', '1,737', '1765'] 
['Ashland', 'Town', 'Middlesex', 'Open town meeting', '16,593', '1846'] 
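
If, as the comment above suggests, the goal is each row as a structured record rather than a bare list, the approach in this answer can be extended along these lines (a sketch only; that the header cells are <th> elements in the first row is an assumption, not something shown in the original answer):

from bs4 import BeautifulSoup 
import requests 

r = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts') 
soup = BeautifulSoup(r.text, 'lxml') 
table = soup.find(class_="wikitable sortable") 

# column names from the header row (assumed to be <th> cells) 
headers = [th.get_text(strip=True) for th in table.find('tr').find_all('th')] 

for row in table.find_all('tr')[1:]: 
    cells = [td.get_text(strip=True) for td in row.find_all('td')] 
    if len(cells) == len(headers):  # guard against rows that don't line up with the header 
        print(dict(zip(headers, cells))) 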
2
#!/usr/bin/env python 
# coding:utf-8 
'''黄哥Python''' 

import requests 
import bs4 
from bs4 import BeautifulSoup 
# from urllib.request import urlopen 

html = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies') 
soup = BeautifulSoup(html.text, 'lxml') 

symbolslist = soup.find('table').tr.next_siblings 
for sec in symbolslist: 
    # print(type(sec)) 
    # the whitespace between rows parses as NavigableString; keep only the Tag rows 
    if type(sec) is not bs4.element.NavigableString: 
        print(sec.get_text()) 

[result screenshot]