
Simple Python web crawler

I'm following a Python tutorial on YouTube and have reached the part where we build a basic web crawler. I tried to make something very simple on my own: go to the cars section of my city's Craigslist, print the title and link of each entry, then jump to the next page and repeat as needed. It works for the first page, but it never advances to the next page to grab more data. Can someone help explain what's wrong?

import requests
from bs4 import BeautifulSoup

def widow(max_pages):
    page = 0  # craigslist starts at page 0
    while page <= max_pages:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page)  # craigslist search url + current page number
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'lxml')  # my computer yelled at me if 'lxml' wasn't included. your mileage may vary
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            href = 'http://orlando.craigslist.org' + link.get('href')  # href = /cto/'number'.html
            title = link.string
            print(title)
            print(href)
            page += 100  # craigslist pages go 0, 100, 200, etc

widow(0)  # 0 gets the first page, replace with multiples of 100 for extra pages

Answer

It looks like you have an indentation problem: you need to do page += 100 inside the main while block, not inside the for loop.

def widow(max_pages):
    page = 0  # craigslist starts at page 0
    while page <= max_pages:
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page)  # craigslist search url + current page number
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'lxml')  # my computer yelled at me if 'lxml' wasn't included. your mileage may vary
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            href = 'http://orlando.craigslist.org' + link.get('href')  # href = /cto/'number'.html
            title = link.string
            print(title)
            print(href)
        page += 100  # craigslist pages go 0, 100, 200, etc
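
With the increment moved out of the for loop, each pass of the while loop scrapes one page and then advances by 100, so for example widow(200) would fetch the result pages at s=0, s=100, and s=200.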

Holy crap, wow. I feel really dumb now haha. Thank you. – v0dkuh


Isn't this only part of the solution? 'page' gets incremented, but in the example 'max_pages' is set to '0'. After the first page, '100 <= 0' evaluates to False, so the loop exits. –


The OP's comment suggests he would call widow(0) to get just the first page. If he calls widow(1000), it will keep scraping until page <= 1000. – sisanared
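
For reference, here is a minimal sketch of the same function with the while loop rewritten as a range loop, which makes the page arithmetic explicit (the URL, the 'hdrlnk' class, and the 100-results-per-page step are all taken from the question; nothing else is assumed):

import requests
from bs4 import BeautifulSoup

def widow(max_pages):
    # Craigslist paginates search results in steps of 100: s=0, s=100, s=200, ...
    for page in range(0, max_pages + 1, 100):
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page)
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            print(link.string)  # listing title
            print('http://orlando.craigslist.org' + link.get('href'))  # absolute link to the listing

widow(1000)  # fetches s=0 through s=1000, i.e. eleven pages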