2011-04-29

I'm trying to find the first 30 TED videos (the name and URL of each talk) using the following BeautifulSoup script:

import urllib2 
from BeautifulSoup import BeautifulSoup 

total_pages = 3 
page_count = 1 
count = 1 

url = 'http://www.ted.com/talks?page=' 

while page_count < total_pages: 

    page = urllib2.urlopen("%s%d") %(url, page_count) 

    soup = BeautifulSoup(page) 

    link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail')) 

    outfile = open("test.html", "w") 

    print >> outfile, """<head> 
      <head> 
        <title>TED Talks Index</title> 
      </head> 

      <body> 

      <br><br><center> 

      <table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>""" 

    print >> outfile, "<tr><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #E16543;'>URL</th></tr>" 

    ted_link = 'http://www.ted.com/' 

    for anchor in link: 
      print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href']) 

    count = count + 1 

    print >> outfile, """</table> 
        </body> 
        </html>""" 

    page_count = page_count + 1 

The code seems to work fine minus two things:

  1. The count doesn't seem to increment. The script only goes through and finds the content of the first page, i.e. the first ten videos rather than all thirty. Why?

  2. This code gives me lots of errors. I don't know how to implement the logic I want here (with `urlopen("%s%d")`):

Code:

total_pages = 3 
page_count = 1 
count = 1 

url = 'http://www.ted.com/talks?page=' 

while page_count < total_pages: 

page = urllib2.urlopen("%s%d") %(url, page_count) 
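The error comes from operator order: as written, `urlopen("%s%d")` requests the literal string `"%s%d"` as a URL, and `%` is then applied to whatever `urlopen()` returns. A minimal sketch of the fix, building the URL string before opening it:

```python
url = 'http://www.ted.com/talks?page='
page_count = 1

# Wrong: % is applied to urlopen()'s return value, not the string.
# page = urllib2.urlopen("%s%d") % (url, page_count)

# Right: format the string first, then pass the result to urlopen.
request_url = "%s%d" % (url, page_count)
# page = urllib2.urlopen(request_url)
```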

It won't solve your problem, but you have two opening `<head>` tags instead of `<html>` and `<head>`: i.e. `print >> outfile, """<head>` should be `print >> outfile, """<html>` – 2011-04-29 06:08:12

Answer


First, simplify the loop and eliminate a couple of variables that amount, in this case, to boilerplate cruft:

for pagenum in xrange(1, 4): # The 4 is annoying, write it as 3+1 if you like. 
    url = "http://www.ted.com/talks?page=%d" % pagenum 
    # do stuff with url 
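(The stop value of `xrange` is exclusive, which is why the 4 is needed to cover pages 1 through 3; in Python 3 the same thing is spelled `range`:)

```python
# range/xrange excludes the stop value, so (1, 4) yields pages 1, 2, 3.
assert list(range(1, 4)) == [1, 2, 3]
```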

But let's open the file outside the loop instead of re-opening it on every iteration. (That's why you only saw 10 results: each pass overwrote the file, leaving talks 11-20 rather than the first ten as you thought. It would have been 21-30, except that you loop on page_count < total_pages and so only process the first two pages.)
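To see why re-opening matters: mode "w" truncates the file each time, so every iteration throws away the previous one's output. A minimal sketch (the file name `demo.txt` is just for illustration):

```python
# Re-opening with mode "w" truncates the file, so each loop pass
# erases the previous one; only the last page's rows survive.
for page in range(1, 4):
    out = open("demo.txt", "w")   # hypothetical file name
    out.write("rows for page %d\n" % page)
    out.close()

with open("demo.txt") as f:
    assert f.read() == "rows for page 3\n"  # pages 1 and 2 were overwritten
```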

And collect all the links first, then write the output at the end. I've also stripped out the HTML styling, which makes the code easier to follow; use CSS instead (perhaps an inline `<style>` element), or add the attributes back if you prefer.

import urllib2 
from cgi import escape # Important! 
from BeautifulSoup import BeautifulSoup 

def is_talk_anchor(tag): 
    return tag.name == "a" and tag.findParent("dt", "thumbnail") 

links = [] 
for pagenum in xrange(1, 4): 
    soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum)) 
    links.extend(soup.findAll(is_talk_anchor)) 

out = open("test.html", "w") 

print >>out, """<html><head><title>TED Talks Index</title></head> 
<body> 
<table> 
<tr><th>#</th><th>Name</th><th>URL</th></tr>""" 

for x, a in enumerate(links): 
    print >>out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>" % (x + 1, escape(a["title"]), escape(a["href"])) 

print >>out, "</table>" 

# Or, as an ordered list: 
print >>out, "<ol>" 
for a in links: 
    print >>out, """<li><a href="http://www.ted.com%s">%s</a></li>""" % (escape(a["href"], True), escape(a["title"])) 
print >>out, "</ol>" 

print >>out, "</body></html>" 

Thanks! Can you explain the `from cgi import escape` bit? – EGP 2011-04-29 06:48:58


@AdamC.: If one of the URLs or titles contains a character that's special in HTML, namely &, < or " (since I used double quotes in one of the outputs), escape() takes care of it so you don't generate invalid markup. You won't want anything else from the cgi module, but it happens to be in the stdlib and includes this handy function. http://docs.python.org/library/cgi.html#cgi.escape – 2011-04-29 06:52:47
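(In Python 3 the same helper now lives in the `html` module, since `cgi.escape` was removed; a quick illustration of what it does:)

```python
from html import escape  # Python 3 replacement for cgi.escape

# &, < and > are replaced with entities; quotes too, unless quote=False.
assert escape("Fun & Games <TED>") == "Fun &amp; Games &lt;TED&gt;"
assert escape('"quoted"', quote=True) == "&quot;quoted&quot;"
```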