Web scraping with Selenium, BeautifulSoup and Python

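I'm currently scraping a real-estate website that uses JavaScript. My process starts with a results page containing many different href links, one per listing. I append those links to another list and then press the next button. I repeat this until the next button is no longer clickable.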

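My problem is that after collecting all the listings (~13,000 links), the scraper doesn't move on to the second part, where it opens the links and gets the information I need. Selenium doesn't even open the first element of the list of links.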

Here's my code:

wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass

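After this I have another simple scraper that goes through the list of listings, opens each one in Selenium, and collects the data for that listing.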

for links in houselinklist: 
    print(links) 
    newwebpage = links 
    driver.get(newwebpage) 
    html = driver.page_source 
    soup = bs.BeautifulSoup(html,'html.parser') 
    . 
    . 
    . 
    . more code here
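
Separately from the question itself, `driver.get` expects absolute URLs, and hrefs collected from a results page may be relative or repeated. A minimal sketch of cleaning the list before the second stage, assuming the hrefs can be resolved against the site's base URL; the helper name `normalize_links` is hypothetical and not part of the original code.

```python
from typing import Iterable, List
from urllib.parse import urljoin


def normalize_links(hrefs: Iterable[str], base_url: str) -> List[str]:
    """Resolve hrefs against base_url and drop empties and duplicates, keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip None or empty strings
        absolute = urljoin(base_url, href)  # absolute hrefs pass through unchanged
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

The second loop could then iterate over `normalize_links(houselinklist, base_url)` instead of `houselinklist`, with `base_url` set to the address of the site being scraped.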

Source

2017-07-31 bathtubandatoaster

What is the link you are scraping? – ksai

https://www.28hse.com/cn/rent/house-type-g1 – bathtubandatoaster

What error are you getting? – ksai

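The problem is that `while True:` creates a loop that runs forever. Your except clause contains a pass statement, which means that when an error occurs (for example, the wait for the next button timing out), the loop simply keeps running. Instead, it can be written as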

wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        break  # changed from pass so the loop exits

As soon as an error occurs, the loop breaks and execution moves on to the next line of code.
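
A bare `except:` also stops the loop on unrelated bugs (for example `table` being `None`), so one option is to exit only when the next link really is missing. A minimal sketch of that idea, relying on `wait.until` raising `TimeoutException` when the condition is not met in time; the names `collect_links`, `make_go_next`, `extract_links`, `go_next` and the `max_pages` cap are hypothetical and not from the original post.

```python
from typing import Callable, Iterable, List


def collect_links(
    extract_links: Callable[[], Iterable[str]],
    go_next: Callable[[], bool],
    max_pages: int = 1000,
) -> List[str]:
    """Gather links page by page until go_next() reports there is no next page."""
    links: List[str] = []
    for _ in range(max_pages):  # hard cap so the loop can never run forever
        links.extend(extract_links())
        if not go_next():
            break
    return links


def make_go_next(driver, timeout: float = 10) -> Callable[[], bool]:
    """Build a go_next() callable for a Selenium driver (hypothetical helper)."""
    # Imported lazily so collect_links() can be used and tested without Selenium.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)

    def go_next() -> bool:
        try:
            element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        except TimeoutException:
            return False  # no clickable "next" link within the timeout: last page
        element.click()
        return True

    return go_next
```

Here `extract_links` would be a small function that parses `driver.page_source` and returns the hrefs, and the call would look like `houselinklist = collect_links(extract_links, make_go_next(driver))`. Unlike the loop above, this sketch scrapes each page before looking for the next link, so the last page is included as well.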

Alternatively, you could drop the while loop and simply iterate over your list of href links with a for loop:

wait = WebDriverWait(driver, 10)
hrefLinks = ['link1', 'link2', 'link3']  # ... and so on
for link in hrefLinks:
    try:
        driver.get(link)
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass  # ignore the error and move on to the next link

Source

2017-07-31 05:29:34 DJK

Did this solve your problem? – DJK

Yo, thanks mate – bathtubandatoaster
