2011-04-29

I'm trying to find the first 30 TED videos (the name and URL of each talk) using the following BeautifulSoup script:

import urllib2 
from BeautifulSoup import BeautifulSoup 

total_pages = 3 
page_count = 1 
count = 1 

url = 'http://www.ted.com/talks?page=' 

while page_count < total_pages: 

    page = urllib2.urlopen("%s%d") %(url, page_count) 

    soup = BeautifulSoup(page) 

    link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail')) 

    outfile = open("test.html", "w") 

    print >> outfile, """<head> 
      <head> 
        <title>TED Talks Index</title> 
      </head> 

      <body> 

      <br><br><center> 

      <table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>""" 

    print >> outfile, "<tr><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #E16543;'>URL</th></tr>" 

    ted_link = 'http://www.ted.com/' 

    for anchor in link: 
      print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href']) 

    count = count + 1 

    print >> outfile, """</table> 
        </body> 
        </html>""" 

    page_count = page_count + 1 

The code seems to work fine minus two things:

  1. The count doesn't seem to increment. The script only goes through and finds the content of the first page, i.e. the first ten videos rather than all thirty. Why?

  2. This code gives me lots of errors. I don't know how to implement the logic I want here (with `urlopen("%s%d")`):

Code:

total_pages = 3 
page_count = 1 
count = 1 

url = 'http://www.ted.com/talks?page=' 

while page_count < total_pages: 

page = urllib2.urlopen("%s%d") %(url, page_count) 
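The error comes from operator order: as written, `urlopen("%s%d")` requests the literal string `"%s%d"` as a URL, and `%` is then applied to whatever `urlopen()` returns. A minimal sketch of the fix, building the URL string before opening it:

```python
url = 'http://www.ted.com/talks?page='
page_count = 1

# Wrong: % is applied to urlopen()'s return value, not the string.
# page = urllib2.urlopen("%s%d") % (url, page_count)

# Right: format the string first, then pass the result to urlopen.
request_url = "%s%d" % (url, page_count)
# page = urllib2.urlopen(request_url)
```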

It won't solve your problem, but you have two opening `<head>` tags instead of `<html>` and `<head>`: i.e. `print >> outfile, """<head>` should be `print >> outfile, """<html>` – 2011-04-29 06:08:12

Answer


First, simplify the loop and eliminate a couple of variables that amount, in this case, to boilerplate cruft:

for pagenum in xrange(1, 4): # The 4 is annoying, write it as 3+1 if you like. 
    url = "http://www.ted.com/talks?page=%d" % pagenum 
    # do stuff with url 
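(The stop value of `xrange` is exclusive, which is why the 4 is needed to cover pages 1 through 3; in Python 3 the same thing is spelled `range`:)

```python
# range/xrange excludes the stop value, so (1, 4) yields pages 1, 2, 3.
assert list(range(1, 4)) == [1, 2, 3]
```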

But let's open the file outside the loop instead of re-opening it on every iteration. (That's why you only saw 10 results: each pass overwrote the file, leaving talks 11-20 rather than the first ten as you thought. It would have been 21-30, except that you loop on page_count < total_pages and so only process the first two pages.)
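To see why re-opening matters: mode "w" truncates the file each time, so every iteration throws away the previous one's output. A minimal sketch (the file name `demo.txt` is just for illustration):

```python
# Re-opening with mode "w" truncates the file, so each loop pass
# erases the previous one; only the last page's rows survive.
for page in range(1, 4):
    out = open("demo.txt", "w")   # hypothetical file name
    out.write("rows for page %d\n" % page)
    out.close()

with open("demo.txt") as f:
    assert f.read() == "rows for page 3\n"  # pages 1 and 2 were overwritten
```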

And collect all the links first, then write the output at the end. I've also stripped out the HTML styling, which makes the code easier to follow; use CSS instead (perhaps an inline `<style>` element), or add the attributes back if you prefer.

import urllib2 
from cgi import escape # Important! 
from BeautifulSoup import BeautifulSoup 

def is_talk_anchor(tag): 
    return tag.name == "a" and tag.findParent("dt", "thumbnail") 

links = [] 
for pagenum in xrange(1, 4): 
    soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum)) 
    links.extend(soup.findAll(is_talk_anchor)) 

out = open("test.html", "w") 

print >>out, """<html><head><title>TED Talks Index</title></head> 
<body> 
<table> 
<tr><th>#</th><th>Name</th><th>URL</th></tr>""" 

for x, a in enumerate(links): 
    print >>out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>" % (x + 1, escape(a["title"]), escape(a["href"])) 

print >>out, "</table>" 

# Or, as an ordered list: 
print >>out, "<ol>" 
for a in links: 
    print >>out, """<li><a href="http://www.ted.com%s">%s</a></li>""" % (escape(a["href"], True), escape(a["title"])) 
print >>out, "</ol>" 

print >>out, "</body></html>" 

Thanks! Can you explain the `from cgi import escape` bit? – EGP 2011-04-29 06:48:58


@AdamC.: If one of the URLs or titles contains a character that's special in HTML, namely &, < or " (since I used double quotes in one of the outputs), escape() takes care of it so you don't generate invalid markup. You won't want anything else from the cgi module, but it happens to be in the stdlib and includes this handy function. http://docs.python.org/library/cgi.html#cgi.escape – 2011-04-29 06:52:47
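(In Python 3 the same helper now lives in the `html` module, since `cgi.escape` was removed; a quick illustration of what it does:)

```python
from html import escape  # Python 3 replacement for cgi.escape

# &, < and > are replaced with entities; quotes too, unless quote=False.
assert escape("Fun & Games <TED>") == "Fun &amp; Games &lt;TED&gt;"
assert escape('"quoted"', quote=True) == "&quot;quoted&quot;"
```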