2013-09-24 29 views

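How to find elements that match specific conditions with Beautiful Soup

I am learning and experimenting with Python (2.7) and Beautiful Soup (3.2.0). I already got some help here with my first question (Beautiful Soup throws `IndexError`).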

This is the Python code so far:

# Import the classes that are needed 
import urllib2 
from BeautifulSoup import BeautifulSoup 

# URL to scrape; open it with urllib2
url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football' 
source = urllib2.urlopen(url) 

# Turn the saved source into a BeautifulSoup object
soup = BeautifulSoup(source) 

# From the source HTML page, search and store all <div class="date">...</div> and its content
datesDiv = soup.findAll('div', { "class" : "date" }) 
# Loop through the tag and store only the needed information, being the actual date 
dates = [tag.contents[0] for tag in datesDiv] 

# From the source HTML page, search and store all <span class="time">...</span> and its content
timesSpan = soup.findAll('span', { "class" : "time" }) 
# Loop through the tag and store only the needed information, being the actual times 
times = [tag.contents[0] for tag in timesSpan] 

# From the source HTML page, search and store all <td class="home">..</td> and its content
hometeamsTd = soup.findAll('td', { "class" : "home" }) 
# Loop through the tag and store only the needed information, being the home team 
# The test tag.contents[1] != 'Dutch KNVB Beker' decides directly whether an entry is needed or not
hometeams = [tag.contents[1] for tag in hometeamsTd if tag.contents[1] != 'Dutch KNVB Beker'] 

# From the source HTML page, search and store all <td class="away">..</td> and its content
# [1:] at the end means skip the first one found
awayteamsTd = soup.findAll('td', { "class" : "away" })[1:] 
# Loop through the tag and store only the needed information, being the away team 
awayteams = [tag.contents[1] for tag in awayteamsTd] 

# From the source HTML page, search and store all <a class="broadcast" href="...">..</a> and its content
broadcastsA = soup.findAll('a', { "class" : "broadcast" })
# Loop through the tag and store only the needed information, being the broadcast URL, where we can find the streams
broadcasts = [tag['href'] for tag in broadcastsA] 

My problem is that the arrays do not have the same length:

len(dates)  #9, should be 6 
len(times)  #18, should be 12 
len(hometeams) #6, is correct 
len(awayteams) #6, is correct 
len(broadcasts) #9, should be 6 
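
The mismatch matters as soon as the lists are combined by position. A small illustration, using the two dates and the one home team from the HTML excerpt in this question; the variable names mirror the code above, and pairing the lists with zip() is only an assumption about how they would be used later:

```python
# Values taken from the two table rows in the HTML excerpt of this question.
dates = ['Tuesday, September 24', 'Wednesday, September 25']  # one date per row
hometeams = ['PSV']  # the 'Dutch KNVB Beker' row was filtered out

# zip() pairs items by position and stops at the shorter list,
# so PSV ends up next to the date of the row that was filtered out.
paired = list(zip(dates, hometeams))
print(paired)  # [('Tuesday, September 24', 'PSV')]
```

PSV actually plays on Wednesday, September 25, so the date attached to it here is wrong.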

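The problem is that I use the following search to get the dates array: soup.findAll('div', { "class" : "date" }). This obviously gives me all <div> elements with class="date". But I only need a date when its row also contains a <td> element with class="away".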

See the following part of the HTML that I am scraping:

<tr class="odd"> 
    <td class="logo"> 
     <img src="/gfx/disciplines/football.gif" alt="football"/> 
    </td> 
    <td> 
     <a href="/competition.php?part=sports&amp;competitionid=92&amp;discipline=football">Dutch Cup</a> 
     <img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="comp-92"/> 
    </td> 
    <td> 
     <div class="date" rel="1380054900">Tuesday, September 24</div> <!-- This date is not needed, because within this <tr> there is no <td class="away"> --> 
     <span class="time" rel="1380054900">22:35</span> - <!-- This time is not needed, because within this <tr> there is no <td class="away"> --> 
    <span class="time" rel="1380058500">23:35</span> <!-- This time is not needed, because within this <tr> there is no <td class="away"> --> 
    </td> 
    <td class="home" colspan="3"> 
     <img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>Dutch KNVB Beker<img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758"/> 
    </td> 
    <td class="broadcast"> 
     <a class="broadcast" href="/broadcast.php?matchid=221554&amp;part=sports">Live</a> <!-- This href is not needed, because within this <tr> there is no <td class="away"> --> 
    </td> 
</tr> 
<tr class="even"> 
    <td class="logo"> 
     <img src="/gfx/disciplines/football.gif" alt="football"/> 
    </td> 
    <td> 
     <a href="/competition.php?part=sports&amp;competitionid=92&amp;discipline=football">Dutch Cup</a> 
     <img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="comp-92"/> 
    </td> 
    <td> 
     <div class="date" rel="1380127500">Wednesday, September 25</div> <!-- This date we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> --> 
     <span class="time" rel="1380127500">18:45</span> - <!-- This time we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> --> 
    <span class="time" rel="1380134700">20:45</span> <!-- This time we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> --> 
    </td> 
    <td class="home"> 
     <img class="flag" src="/gfx/flags/nl.gif" alt="nl"/>PSV<img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3"/> 
    </td> 
    <td>vs.</td> 
    <td class="away"> 
     <img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428"/>Stormvogels Telstar<img class="flag" src="/gfx/flags/nl.gif" alt="nl"/> 
    </td> 
    <td class="broadcast"> 
     <a class="broadcast" href="/broadcast.php?matchid=221555&amp;part=sports">Live</a> <!-- This href we would like to have, because now all records are complete, there is a <td class="away"> in this <tr> --> 
    </td> 
</tr> 

Answers


How about rethinking the way you scrape the data? You have a table of matches, so just iterate over its rows:

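# Each match is one table row, so read all fields from the same <tr>
for tr in soup.findAll('tr', {'class': ['odd', 'even']}):
    home_team = tr.find('td', {'class': 'home'}).text
    # Skip the competition placeholder row, which has no <td class="away">
    if home_team == 'Dutch KNVB Beker':
        continue

    away_team = tr.find('td', {'class': 'away'}).text
    # Despite its name, date holds the time range, e.g. '18:45 - 20:45'
    date = ' - '.join([span.text for span in tr.findAll('span', {'class': 'time'})])
    broadcast = tr.find('a', {'class': 'broadcast'})['href']

    print home_team, away_team, date, broadcast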

This prints 5 lines:

RKC Waalwijk Heracles 20:45 - 22:45 /broadcast.php?matchid=221553&part=sports 
PSV Stormvogels Telstar 18:45 - 20:45 /broadcast.php?matchid=221555&part=sports 
Ajax FC Volendam 20:45 - 22:45 /broadcast.php?matchid=221556&part=sports 
SC Heerenveen FC Twente 18:45 - 20:45 /broadcast.php?matchid=221558&part=sports 
Feyenoord FC Dordrecht 20:45 - 22:45 /broadcast.php?matchid=221559&part=sports 

Then you can collect the results into a list of dictionaries.
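
A minimal sketch of collecting the results into a list of dictionaries, assuming each row has already been reduced to the four values that the loop prints. The helper name `to_records` and the dictionary keys are illustrative and not part of the original answer; the two sample rows are copied from the printed output, split at the team names. It is written to run on both Python 2.7 and Python 3:

```python
# Hypothetical helper: turns (home, away, times, broadcast) tuples into dictionaries.
def to_records(rows):
    keys = ('home', 'away', 'times', 'broadcast')  # illustrative key names
    return [dict(zip(keys, row)) for row in rows]

# Two rows copied from the printed output; inside the scraping loop you
# would append (home_team, away_team, date, broadcast) instead of printing.
rows = [
    ('PSV', 'Stormvogels Telstar', '18:45 - 20:45',
     '/broadcast.php?matchid=221555&part=sports'),
    ('Feyenoord', 'FC Dordrecht', '20:45 - 22:45',
     '/broadcast.php?matchid=221559&part=sports'),
]

matches = to_records(rows)
print(matches[0]['away'])  # Stormvogels Telstar
```

Because every dictionary is built from a single row, the fields of one match can no longer drift apart the way the separate lists did.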


FYI, Feyenoord will win that match :) – alecxe


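You are completely right, but I have only just started experimenting and this is what I ended up with first. I will look deeper into your solution. Thanks for your help –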


LOL, normally they should, alecxe ;-) –