
I created a website scraper that scrapes all the information from Yellow Pages (for educational purposes). Why does it skip the entire loop?

import re

import requests
import xlwt
from bs4 import BeautifulSoup

# REQUEST_HEADERS is a dict of HTTP headers defined elsewhere in the script.

def actual_yellow_pages_scrape(link, no, dir, gui, sel, ypfind, terminal,
                               user, password, port, type):
    print(link, no, dir, gui, sel, ypfind, terminal, user, password, port, type)
    r = requests.get(link, headers=REQUEST_HEADERS)
    soup = BeautifulSoup(r.content, "html.parser")
    workbook = xlwt.Workbook()
    sheet = workbook.add_sheet(str(ypfind))
    count = 0

    # One spreadsheet row per business listing on the results page.
    for i in soup.find_all(class_="business-name"):
        sheet.write(count, 0, str(i.text))
        sheet.write(count, 1, "http://www.yellowpages.com" + i.get("href"))
        r1 = requests.get("http://www.yellowpages.com" + i.get("href"))
        soup1 = BeautifulSoup(r1.content, "html.parser")

        # Business website; the element may be absent, so guard the lookup.
        website = soup1.find("a", class_="custom-link")
        try:
            print("Acquiring Website")
            sheet.write(count, 2, str(website.get("href")))
        except AttributeError:
            sheet.write(count, 2, "None")

        # Email address, with the "mailto:" prefix stripped.
        email = soup1.find("a", class_="email-business")
        try:
            print(email.get("href"))
            EMAIL = re.sub("mailto:", "", str(email.get("href")))
            sheet.write(count, 3, str(EMAIL))
        except AttributeError:
            sheet.write(count, 3, "None")

        # Phone number is the first <p> inside the contact block.
        phonetemp = soup1.find("div", class_="contact")
        try:
            phone = phonetemp.find("p")
            print(phone.text)
            sheet.write(count, 4, str(phone.text))
        except AttributeError:
            sheet.write(count, 4, "None")

        # Review count.
        reviews = soup1.find(class_="count")
        try:
            print(reviews.text)
            sheet.write(count, 5, str(reviews.text))
        except AttributeError:
            sheet.write(count, 5, "None")

        count += 1

    save = dir + "\\" + ypfind + str(no) + ".xls"
    workbook.save(save)
    no += 1

    # Follow the "next page" link and recurse into the next results page.
    for i in soup.find_all("a", class_="next ajax-page"):
        print(i.get("href"))
        actual_yellow_pages_scrape(
            "http://www.yellowpages.com" + str(i.get("href")),
            no, dir, gui, sel, ypfind, terminal, user, password, port, type)

The code above is the top portion of my scraper. I put breakpoints at the soup and inside the for loop, and not a single line of the for loop executes. No error is thrown. I even tried printing the numbers 1-10 from inside the loop and it doesn't work. Why?

Thanks


Perhaps because the result of 'find_all' is empty? Have you checked it? – Julien


Because whatever you are iterating over is probably empty. –


Use 'print()' to see what is actually in your variables. – furas
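
Following the commenters' suggestions, a minimal check along these lines would show whether the results page contains any 'business-name' elements at all ('link' and REQUEST_HEADERS are the same names used in the question's code):

    import requests
    from bs4 import BeautifulSoup

    r = requests.get(link, headers=REQUEST_HEADERS)
    soup = BeautifulSoup(r.content, "html.parser")
    print(r.status_code)                               # did the request succeed?
    print(len(soup.find_all(class_="business-name")))  # 0 means the for loop has nothing to iterate over
    print(r.content[:500])                             # peek at what the server actually returned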

Answer


The answer has been found.

I used a text visualizer to see what 'r.content' actually was. I soupified it, got clean HTML, and went through the HTML file. It turned out the page said the browser was not supported, so I removed the request headers, ran the code again, and finally got what I wanted.
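
In other words, the custom headers made the server return an "unsupported browser" page containing no 'business-name' elements, so 'find_all' yielded an empty list and the loop body never ran. A minimal sketch of the two debugging steps described above, assuming 'link' and REQUEST_HEADERS from the question's code:

    import requests
    from bs4 import BeautifulSoup

    r = requests.get(link, headers=REQUEST_HEADERS)

    # Dump the raw response to a file so it can be inspected in a text visualizer.
    with open("debug_response.html", "wb") as f:
        f.write(r.content)

    # Fetching without the custom headers returns the normal page,
    # so the business-name elements are present and the loop runs.
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "html.parser")
    print(len(soup.find_all(class_="business-name")))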