2011-02-28 61 views
0

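Using BeautifulSoup to scrape details from a remote site for display on a local website

I'm new to Python and BeautifulSoup, and I'm trying to scrape race details from a website so that I can display them on my local club's website.

Here is my code so far: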

import urllib2 
import sys 
import os 

sys.path.insert(0, os.path.abspath(os.path.dirname(__file__))) 
from BeautifulSoup import BeautifulSoup 

# Road 
#cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Road%20Events' 

# MTB 
cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Mountain%20Biking%20Events' 

response = urllib2.urlopen(cyclelab_url) 
html = response.read() 

soup = BeautifulSoup(html) 
event_names = soup.findAll(attrs= {"class" : "SpanEventName"}) 
for event in event_names: 
    txt = event.find(text=True) 
    print txt 

event_details = soup.findAll(attrs= {"class" : "TDText"}) 
for detail in event_details: 
    lines=[] 
    txt_details = detail.find(text=True) 
    print txt_details 

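This prints all the event names and then all the event details. What I want is to print each event name followed by the details of that event underneath it. It seems like it should be simple, but I'm stumped.

Answers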

0

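Update: Mark Longair has the correct/better answer. See the comments.

Code executes from top to bottom. So in your code, all the events are printed first, and then all the details. You have to "weave" the code together, meaning that for each event you print all of its details and then move on to the next event. Try something like this: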

[....]
event_names = soup.findAll(attrs={"class": "SpanEventName"})
event_details = soup.findAll(attrs={"class": "TDText"})
for event in event_names:
    txt = event.find(text=True)
    print txt
    for detail in event_details:
        txt_details = detail.find(text=True)
        print txt_details

A further improvement: you can remove leading and trailing whitespace and newlines with .strip(). For example: txt_details = detail.find(text=True).strip()
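To illustrate what `.strip()` does and does not remove, here is a small sketch in Python 3 syntax. The string `raw` is made up and only stands in for the text that `find(text=True)` might return; the helper name `safe_strip` is hypothetical and not part of BeautifulSoup.

```python
def safe_strip(text):
    """Strip leading/trailing whitespace, tolerating None.

    find(text=True) can return None when an element has no text,
    and calling .strip() on None would raise AttributeError.
    """
    return text.strip() if text is not None else ""


# Made-up sample standing in for scraped text.
raw = "\n   Columbia Grape  Escape 2011 \r\n"

# .strip() only trims the ends; the double space in the middle stays.
clean = safe_strip(raw)
print(repr(clean))

# To also collapse runs of whitespace inside the string, split and re-join.
collapsed = " ".join(raw.split())
print(repr(collapsed))
```

Whether the inner whitespace needs collapsing depends on how the page's markup is laid out, so treat the second step as optional.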

+0

For each event on the page, that will print the event name followed by the details of *all* events - I don't think that's what @user621024 wanted... – 2011-02-28 12:20:15

+1

You're right. I shouldn't rush out answers in between other things without testing them. Upvoted your answer. – dermatthias 2011-02-28 13:24:52

4

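If you look at the structure of the page, you'll see that each event name found in your first loop is enclosed by a table, and that table holds all the other useful details as pairs of cells in its rows. So I would use just one loop: each time you find an event name, look for the enclosing table and find all the details under it. This seems to work OK: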

soup = BeautifulSoup(html)
event_names = soup.findAll(attrs={"class": "SpanEventName"})
for event in event_names:
    txt = event.find(text=True)
    print "Event name: " + txt.strip()
    # Find each parent in turn until we find the table that encloses
    # the event details:
    parent = event.parent
    while parent and parent.name != "table":
        parent = parent.parent
    if not parent:
        raise Exception("Failed to find a <table> enclosing the event")
    # Now parent is the table element, so look for every
    # row under that table, and then the cells under that:
    for row in parent.findAll('tr'):
        cells = row.findAll('td')
        # We only care about the rows where there is a multiple of two
        # cells, since these are the key/value pairs:
        if len(cells) % 2 != 0:
            continue
        for i in xrange(0, len(cells), 2):
            key_text = cells[i].find(text=True)
            value_text = cells[i + 1].find(text=True)
            if key_text and value_text:
                print " Key:", key_text.strip()
                print " Value:", value_text.strip()

The output looks like:

Event name: Columbia Grape Escape 2011 
    Key: Category: 
    Value: Mountain Biking Events 
    Key: Event Date: 
    Value: 4 March 2011 to 6 March 2011 
    Key: Entries Close: 
    Value: 31 January 2011 at 23:00 
    Key: Venue: 
    Value: Eden on the Bay, Blouberg 
    Key: Province: 
    Value: Western Cape 
    Key: Distance: 
    Value: 3 Day, 3 Stage Race (228km) 
    Key: Starting Time: 
    Value: -1:-1 
    Key: Timed By: 
    Value: RaceTec 
Event name: Investpro MTB Race 2011 
    Key: Category: 
    Value: Mountain Biking Events 
    Key: Event Date: 
    Value: 5 March 2011 
    Key: Entries Close: 
    Value: 25 February 2011 at 23:00 

... and so on.
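For display on another site, it may be more convenient to collect the key/value pairs into a dictionary per event than to print them. The pairing step from the loop above can be separated from the parsing, roughly as follows. This is only a sketch in Python 3 syntax: it assumes the text of each cell in a row has already been extracted into a list (with None for cells that have no text), and the function name `pair_cells` and the sample row are hypothetical.

```python
def pair_cells(cell_texts):
    """Turn [key, value, key, value, ...] into a dict.

    Rows with an odd number of cells are skipped (an empty dict is
    returned), mirroring the `len(cells) % 2 != 0` check in the loop.
    """
    if len(cell_texts) % 2 != 0:
        return {}
    pairs = {}
    # Even positions hold the keys, odd positions the matching values.
    for key, value in zip(cell_texts[0::2], cell_texts[1::2]):
        key = (key or "").strip()
        value = (value or "").strip()
        # Skip pairs where either side has no text.
        if key and value:
            # Drop the trailing colon from labels such as "Category:".
            pairs[key.rstrip(":")] = value
    return pairs


# Made-up sample row, shaped like the cells on the events page.
row = [" Category: ", "Mountain Biking Events", "Event Date:", " 5 March 2011 "]
details = pair_cells(row)
print(details)
```

A dictionary per event can then be passed to whatever templating the local site uses, instead of parsing printed output again.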

+0

Thanks! This works great! I learned more from this answer than from any number of BeautifulSoup tutorials! – daemonza 2011-02-28 12:24:57