2017-02-25 85 views

Unable to display the content between span tags

This is my code so far: http://pastebin.com/CdUiXpdf

import requests 
from bs4 import BeautifulSoup 


def web_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.kupindo.com/Knjige/artikli/1_strana_" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        print("PAGE: " + str(page))
        for link in soup.find_all("a", class_="item_link"):
            href = link.get("href")
            # title = link.string
            print(href)
            # print(title)
            extended_crawler(href)
        page += 1


def extended_crawler(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for view_counter in soup.find_all("span", id="BrojPregleda"):
        print("View Count: ", view_counter.text)


web_crawler(1) 

The output is, for example:

PAGE: 1 
https://www.kupindo.com/showcontent/2143/Beletristika/37875219_VUK-DRASKOVIC-Izabrana-dela-1-7-Srpska-rec 
View Count: 

So the view count is empty: even though the extended_crawler function looks for the span with the id BrojPregleda, nothing is displayed.


@Arman What do you mean by code in PDF format? The pastebin link just happens to end in "pdf"; it is plain text – dovla

Answers

1

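That is because the span with the ID BrojPregleda is populated by an Ajax call, so it is still empty in the HTML that requests downloads. Either use Selenium to get the value, or follow these steps:

1) Get the product ID from the URL

2) POST to http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php with form data whose key is IDPredmet and whose value is the ID from step 1

3) Read the view count from the response

Example: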

def extended_crawler(item_url):
    # The product ID sits between the last '/' and the last '_' of the item URL
    item_id = item_url[item_url.rfind('/') + 1:item_url.rfind('_')]
    # The view count is filled in by Ajax, so ask the endpoint for it directly
    ViewCount = requests.post('http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php', data={'IDPredmet': item_id})
    print(ViewCount.text)
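
One caveat with the slice in the example: it cuts at the last underscore, so a URL whose title part contains another underscore would yield more than the ID. Here is a minimal sketch of a stricter extraction, assuming the ID is always the run of digits at the start of the last path segment (as in the URL from the question). The names extract_item_id, fetch_view_count and AJAX_URL are made up for illustration, and the timeout value is arbitrary.

```python
import re
from urllib.parse import urlparse

# Endpoint taken from the answer above
AJAX_URL = "http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php"


def extract_item_id(item_url):
    """Return the leading digits of the last path segment, or None.

    Assumption: item URLs end in '<digits>_<title>', as in the question.
    """
    last_segment = urlparse(item_url).path.rstrip("/").rsplit("/", 1)[-1]
    match = re.match(r"\d+", last_segment)
    return match.group(0) if match else None


def fetch_view_count(item_url, timeout=10):
    """POST the item ID to the Ajax endpoint and return the raw response text."""
    # Imported here so extract_item_id can be used without requests installed
    import requests

    item_id = extract_item_id(item_url)
    if item_id is None:
        return None
    response = requests.post(AJAX_URL, data={"IDPredmet": item_id}, timeout=timeout)
    response.raise_for_status()
    return response.text.strip()
```

Calling fetch_view_count(href) inside the loop of web_crawler would then replace extended_crawler. Whether the endpoint still answers in the same way is not something this sketch can guarantee.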

This works, thank you very much. I would never have thought of this – dovla