2011-02-26
<a href="/browse.php?cat=67" class="bb_a">2057: Discovery<br><span>(2057: Discovery channel)</span></a> 
<a href="/browse.php?cat=36" class="bb_a">The 4400<br><span>(The 4400)</span></a> 

<a href="/browse.php?cat=47" class="bb_a">Aqua<br><span>(Aquaman)</span></a> 

How do I parse the string above with Python's BeautifulSoup to get the URLs and names?

I want the URLs and names in a list like this:

[["2057", "The 4400", "Aquaman"], 
["/browse.php?cat=67", "/browse.php?cat=36", "/browse.php?cat=47"]] 

With the code below, I can already parse the URLs:

# Iterate over the tags directly instead of re-querying soup('a') by index.
for link in soup.findAll('a'):
    if "browse.php?" in link['href']:
        print(link['href'])
        print(link['class'])

Answer

#!/usr/bin/env python 
from BeautifulSoup import BeautifulSoup 
body = """ 
<a href="/browse.php?cat=67" class="bb_a">2057: Discovery<br><span>(2057: Discovery channel)</span></a> 
<a href="/browse.php?cat=36" class="bb_a">The 4400<br><span>(The 4400)</span></a> 

<a href="/browse.php?cat=47" class="bb_a">Aqua<br><span>(Aquaman)</span></a> 
""" 

soup = BeautifulSoup(body) 
# Iterate over the tags directly instead of re-querying soup('a') by index.
for link in soup.findAll('a'):
    if "browse.php?" in link['href']:
        print(link['href'])
        print(link['class'])
        print(link.contents)  # Pick out contents of the tag.

Produces:

/browse.php?cat=67 
bb_a 
[u'2057: Discovery', <br />, <span>(2057: Discovery channel)</span>] 
/browse.php?cat=36 
bb_a 
[u'The 4400', <br />, <span>(The 4400)</span>] 
/browse.php?cat=47 
bb_a 
[u'Aqua', <br />, <span>(Aquaman)</span>] 

You should be able to massage each tag's .contents list into a form you can use.
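
As a rough sketch of that last step, the following builds the two-list structure the question asks for. It assumes BeautifulSoup 4 (the `bs4` package) rather than the BeautifulSoup 3 import used above, and it assumes the name you want is the text inside each `<span>` with the surrounding parentheses removed. The helper name `extract_names_and_urls` is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

body = """
<a href="/browse.php?cat=67" class="bb_a">2057: Discovery<br><span>(2057: Discovery channel)</span></a>
<a href="/browse.php?cat=36" class="bb_a">The 4400<br><span>(The 4400)</span></a>

<a href="/browse.php?cat=47" class="bb_a">Aqua<br><span>(Aquaman)</span></a>
"""


def extract_names_and_urls(html):
    """Hypothetical helper: return [names, urls] for links to browse.php."""
    soup = BeautifulSoup(html, "html.parser")
    names, urls = [], []
    for link in soup.find_all("a", href=True):
        if "browse.php?" not in link["href"]:
            continue
        span = link.find("span")
        if span is not None:
            # Assumed rule: the full name is the <span> text without parentheses.
            name = span.get_text().strip().strip("()")
        else:
            # Fall back to the link's own text when there is no <span>.
            name = link.get_text().strip()
        names.append(name)
        urls.append(link["href"])
    return [names, urls]


result = extract_names_and_urls(body)
print(result)
```

Note that with this rule the first name comes out as "2057: Discovery channel", not the "2057" shown in the question. No single rule applied to this markup yields all three names exactly as listed there, so adjust the extraction to whichever part of the tag you actually need.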