I'm trying to find the first 30 TED videos (the name and URL of each video) using the following BeautifulSoup script:
import urllib2
from BeautifulSoup import BeautifulSoup

total_pages = 3
page_count = 1
count = 1
url = 'http://www.ted.com/talks?page='

while page_count < total_pages:
    page = urllib2.urlopen("%s%d") %(url, page_count)
    soup = BeautifulSoup(page)
    link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail'))
    outfile = open("test.html", "w")
    print >> outfile, """<head>
<head>
<title>TED Talks Index</title>
</head>
<body>
<br><br><center>
<table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>"""
    print >> outfile, "<tr><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #E16543;'>URL</th></tr>"
    ted_link = 'http://www.ted.com/'
    for anchor in link:
        print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href'])
        count = count + 1
    print >> outfile, """</table>
</body>
</html>"""
    page_count = page_count + 1
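The code looks fine except for two things:
1. The count doesn't seem to increase. It only goes through and finds the contents of the first page, i.e. the first ten videos rather than thirty. Why?
2. This piece of code gives me a lot of errors. I don't know how to implement the logic I want here (using urlopen("%s%d")):
Code: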
total_pages = 3
page_count = 1
count = 1
url = 'http://www.ted.com/talks?page='
while page_count < total_pages:
    page = urllib2.urlopen("%s%d") %(url, page_count)
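A note on the snippet above: in `urllib2.urlopen("%s%d") %(url, page_count)` the `%` operator is applied to whatever `urlopen` returns, so `urlopen` itself receives the literal string "%s%d" rather than a formatted URL. In addition, `while page_count < total_pages` with `page_count` starting at 1 and `total_pages = 3` only visits pages 1 and 2. A minimal sketch of one way to build the page URLs first is shown here; `build_page_urls` is a hypothetical helper introduced for illustration, and no network request is made.

```python
# Sketch only: build_page_urls is a hypothetical helper, not part of the original script.
def build_page_urls(base_url, total_pages):
    """Return the URLs for pages 1..total_pages, inclusive."""
    # Format the string first; the formatted result is what would be
    # passed to urlopen(), e.g. page = urllib2.urlopen(page_url).
    return ["%s%d" % (base_url, n) for n in range(1, total_pages + 1)]


urls = build_page_urls('http://www.ted.com/talks?page=', 3)
for page_url in urls:
    print(page_url)
```

Using `range(1, total_pages + 1)` makes the upper bound inclusive, so all three pages are covered without a manual `page_count` counter.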
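It won't solve your problem, but you have two opening '<head>' tags instead of '<html>' and '<head>' tags (i.e. 'print >> outfile, """<head>' should be 'print >> outfile, """<html>'). – 2011-04-29 06:08:12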
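Related to the markup issue: if `open("test.html", "w")` sits inside the loop, the file is truncated on every iteration, so at most one page's rows would survive. One possible arrangement, sketched under that assumption, is to collect the talks from all pages first and build the document once. `render_index` and the sample data are hypothetical, and the table styling from the question is omitted for brevity.

```python
# Sketch only: render_index and the sample rows are hypothetical, for illustration.
def render_index(talks):
    """Build a single HTML document from a list of (title, href) pairs."""
    parts = [
        "<html>",  # one <html> and one <head>, instead of two <head> tags
        "<head>",
        "<title>TED Talks Index</title>",
        "</head>",
        "<body>",
        "<table>",
        "<tr><th>###</th><th>Name</th><th>URL</th></tr>",
    ]
    # enumerate(..., 1) numbers the rows continuously across all pages.
    for count, (title, href) in enumerate(talks, 1):
        parts.append(
            "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>"
            % (count, title, href)
        )
    parts.extend(["</table>", "</body>", "</html>"])
    return "\n".join(parts)


sample = [("Talk A", "/talks/a.html"), ("Talk B", "/talks/b.html")]
html = render_index(sample)
```

The resulting string could then be written to test.html with a single `open(..., "w")` after the page loop has finished.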