BeautifulSoup: scraping details from a remote site to display on a local website

I'm new to Python and BeautifulSoup, and I'm trying to scrape race details from a website so I can display them on my local club's website.

This is my code so far:

import urllib2 
import sys 
import os 

sys.path.insert(0, os.path.abspath(os.path.dirname(__file__))) 
from BeautifulSoup import BeautifulSoup 

# Road 
#cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Road%20Events' 

# MTB 
cyclelab_url='http://www.cyclelab.com/OnLine%20Entries.aspx?type=Mountain%20Biking%20Events' 

response = urllib2.urlopen(cyclelab_url) 
html = response.read() 

soup = BeautifulSoup(html) 
event_names = soup.findAll(attrs= {"class" : "SpanEventName"}) 
for event in event_names: 
    txt = event.find(text=True) 
    print txt 

event_details = soup.findAll(attrs= {"class" : "TDText"}) 
for detail in event_details: 
    lines=[] 
    txt_details = detail.find(text=True) 
    print txt_details

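This prints all the event names and then all the event details. What I want is to print each event name with that event's details directly beneath it. It seems like it should be simple, but I'm stumped.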

2011-02-28 daemonza

Update: Mark Longair has the correct/better answer. See the comments.

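Code executes from top to bottom. So your code first prints all the events and then all the details. You have to "weave" the two loops together: for each event, print its details, then move on to the next event. Try something like this: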

[....]
event_names = soup.findAll(attrs={"class": "SpanEventName"})
event_details = soup.findAll(attrs={"class": "TDText"})
for event in event_names:
    txt = event.find(text=True)
    print txt
    for detail in event_details:
        txt_details = detail.find(text=True)
        print txt_details

A further improvement: you can remove leading and trailing whitespace and newlines with .strip(). For example: txt_details = detail.find(text=True).strip().
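
As a small illustration of what .strip() does and does not do, here is a sketch that runs without BeautifulSoup. The sample strings are made up, and `clean_text` is a hypothetical helper, not part of any library. It also guards against `None`, since `find(text=True)` can return `None` when an element has no text.

```python
def clean_text(text):
    """Return text without leading/trailing whitespace, or "" for None."""
    if text is None:
        return ""
    return text.strip()


# Made-up samples that mimic text nodes padded with spaces and newlines.
raw_name = "\n   Columbia Grape Escape 2011  \r\n"
raw_venue = "  Eden on the   Bay, Blouberg\n"

clean_name = clean_text(raw_name)
clean_venue = clean_text(raw_venue)

# strip() only trims the ends; runs of whitespace inside the string stay.
# Splitting and re-joining collapses those as well.
collapsed_venue = " ".join(raw_venue.split())

print(repr(clean_name))
print(repr(clean_venue))
print(repr(collapsed_venue))
print(repr(clean_text(None)))
```

Whether inner whitespace should be collapsed depends on how the text will be displayed; for HTML output it usually makes no visible difference.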

2011-02-28 11:56:05 dermatthias

For every event on the page, this prints the event name and then the details of *all* events - I don't think that's what @user621024 wants... – 2011-02-28 12:20:15

You're right. I shouldn't rush out answers in between other things, or post them without testing. Upvoted your answer. – dermatthias 2011-02-28 13:24:52
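
To make the point in the comment above concrete, here is a sketch with plain lists instead of scraped elements. The event and detail strings are hypothetical sample data, and `nested_loop_lines` is an illustrative name. It mimics the nested loop and shows that every detail ends up under every event name.

```python
# Hypothetical sample data standing in for the two findAll() results.
event_names = ["Event A", "Event B"]
event_details = ["Detail of A", "Detail of B"]


def nested_loop_lines(names, details):
    """Mimic the nested loop: each name is followed by ALL details."""
    lines = []
    for name in names:
        lines.append(name)
        for detail in details:
            lines.append("  " + detail)
    return lines


for line in nested_loop_lines(event_names, event_details):
    print(line)
```

With two events and two details this produces six lines, and "Detail of B" appears under "Event A" even though it belongs to the other event. The two flat lists carry no information about which detail belongs to which event, so the grouping has to come from the page structure itself.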

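If you look at the structure of the page, you'll see that each event name found in your first loop is enclosed by a table, and that table holds all the other useful details as pairs of cells in its rows. So what I would do is use a single loop and, each time you find an event name, look for the enclosing table and find all the event details under it. This seems to work OK: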

soup = BeautifulSoup(html)
event_names = soup.findAll(attrs={"class": "SpanEventName"})
for event in event_names:
    txt = event.find(text=True)
    print "Event name: " + txt.strip()
    # Find each parent in turn until we find the table that encloses
    # the event details:
    parent = event.parent
    while parent and parent.name != "table":
        parent = parent.parent
    if not parent:
        raise Exception("Failed to find a <table> enclosing the event")
    # Now parent is the table element, so look for every
    # row under that table, and then the cells under that:
    for row in parent.findAll('tr'):
        cells = row.findAll('td')
        # We only care about the rows where there is a multiple of two
        # cells, since these are the key/value pairs:
        if len(cells) % 2 != 0:
            continue
        for i in xrange(0, len(cells), 2):
            key_text = cells[i].find(text=True)
            value_text = cells[i + 1].find(text=True)
            if key_text and value_text:
                print "    Key:", key_text.strip()
                print "    Value:", value_text.strip()

The output looks like:

Event name: Columbia Grape Escape 2011 
    Key: Category: 
    Value: Mountain Biking Events 
    Key: Event Date: 
    Value: 4 March 2011 to 6 March 2011 
    Key: Entries Close: 
    Value: 31 January 2011 at 23:00 
    Key: Venue: 
    Value: Eden on the Bay, Blouberg 
    Key: Province: 
    Value: Western Cape 
    Key: Distance: 
    Value: 3 Day, 3 Stage Race (228km) 
    Key: Starting Time: 
    Value: -1:-1 
    Key: Timed By: 
    Value: RaceTec 
Event name: Investpro MTB Race 2011 
    Key: Category: 
    Value: Mountain Biking Events 
    Key: Event Date: 
    Value: 5 March 2011 
    Key: Entries Close: 
    Value: 25 February 2011 at 23:00

... etc.
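
The key/value pairing step can also be looked at on its own, separate from the scraping. The following is only a sketch under assumptions: `pair_cells` is a hypothetical helper, and it takes the cell texts of one row as a plain list (strings or `None`) rather than BeautifulSoup elements. The sample values are copied from the output above.

```python
def pair_cells(cell_texts):
    """Group one row's cell texts into (key, value) tuples.

    Rows with an odd number of cells yield an empty list, and pairs
    where either side is None or an empty string are skipped.
    """
    if len(cell_texts) % 2 != 0:
        return []
    pairs = []
    for i in range(0, len(cell_texts), 2):
        key, value = cell_texts[i], cell_texts[i + 1]
        if key and value:
            pairs.append((key.strip(), value.strip()))
    return pairs


# One row holding two key/value pairs, with some stray whitespace.
row = [" Category: ", "Mountain Biking Events",
       "Event Date:", "4 March 2011 to 6 March 2011"]

event = dict(pair_cells(row))
for key in sorted(event):
    print(key, event[key])
```

Collecting the pairs into a dictionary per event, rather than printing them straight away, may make it easier to hand the data to whatever renders the club website.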

2011-02-28 12:13:01

Thanks! This works great! I learned more from this comment than from any number of BeautifulSoup tutorials! – daemonza 2011-02-28 12:24:57
