2017-06-13 72 views

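Python Selenium PhantomJS endless scroll only works for the first page

I want to read some news articles using Python and PhantomJS. The site I am working with uses endless scrolling: when you scroll to the bottom, the next article is loaded dynamically. A sample URL is the one used in the code below.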

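With the code below I managed to get one more article to load, but only one. Can anyone help me make it work indefinitely? Any hints about what is wrong or what could be improved are welcome. Thanks!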

from selenium import webdriver 
from bs4 import BeautifulSoup 
from time import sleep 
from selenium.webdriver.common.proxy import * 
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities 

# Pretend to be chrome 
dcap = dict(DesiredCapabilities.PHANTOMJS) 
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 " 
    "(KHTML, like Gecko) Chrome/15.0.87" 
) 

driver = webdriver.PhantomJS(desired_capabilities=dcap) 
driver.set_window_size(1120, 550) 

## GET 
driver.get("https://www.bloomberg.com/news/features/2017-06-08/no-one-has-ever-made-a-corruption-machine-like-this-one") 

# print current scrollTop 
driver.execute_script('return document.body.scrollTop') 
# out: 0 

# print current scrollHeight 
driver.execute_script('return document.body.scrollHeight') 
# out: 18255 

# scroll to bottom 
driver.execute_script("window.scrollTo(0, 18255)") 

# print current scrollTop 
driver.execute_script('return document.body.scrollTop') 
# out: 17705 

# print current scrollHeight 
driver.execute_script('return document.body.scrollHeight') 
# out: 29050 
# It works! Great! 

# Scroll to bottom again 
driver.execute_script("window.scrollTo(0, 29050)") 

# print current scrollTop 
driver.execute_script('return document.body.scrollTop') 
# out: 28500 

# print current scrollHeight 
driver.execute_script('return document.body.scrollHeight') 
# out: 29050 
# It's still the same, no matter how hard I try, it cannot load more... 


# According to tolmachofof's suggestion below, I tried to scroll very slowly, still no luck. :< 
top = driver.execute_script('return document.body.scrollTop') 
height = driver.execute_script('return document.body.scrollHeight') 
for i in range(top, height, 100): 
    driver.execute_script("window.scrollTo(0," + str(i) + ")") 
    print(driver.execute_script('return document.body.scrollTop')) 
    sleep(0.2) 
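
The manual steps above (read scrollHeight, scroll to it, read it again) can be written as a loop that stops once the height no longer grows. This is a minimal sketch, not a confirmed fix for this site: the helper name `scroll_until_stable` and its parameters are hypothetical, and it only assumes that `driver` has an `execute_script` method, as a Selenium WebDriver does. It polls for new content instead of relying on a single fixed sleep, and it reads the height from the page each round instead of hardcoding it.

```python
import time


def scroll_until_stable(driver, max_rounds=20, timeout=10.0, poll=0.5):
    """Scroll to the bottom repeatedly until the page stops growing.

    `driver` is any object with an `execute_script(js)` method (assumed to be
    a Selenium WebDriver). Returns the scrollHeight values that were observed.
    """
    heights = [driver.execute_script("return document.body.scrollHeight")]
    for _ in range(max_rounds):
        # Use the live height rather than a hardcoded number such as 18255.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")

        # Poll until the page grows or the timeout expires.
        deadline = time.monotonic() + timeout
        grew = False
        while time.monotonic() < deadline:
            time.sleep(poll)
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height > heights[-1]:
                heights.append(new_height)
                grew = True
                break

        if not grew:
            break  # nothing new was loaded within the timeout
    return heights
```

If the returned list stops after two entries, as in the session above, the page really did stop loading, and the cause lies in how the site triggers the next load rather than in the timing of the Python code.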

Answers

You can use this script:

from time import sleep

# JavaScript run inside the page: scroll in small steps on a timer.
SCROLL_TEMPLATE = """
    var scroll_interval = arguments[0];
    var scroll_time = arguments[1];
    var scroll_step = arguments[2];

    function scroll() {
        document.body.scrollTop += scroll_step;
    }

    var _scroll = setInterval(scroll, scroll_interval);
    setTimeout(function() { clearInterval(_scroll); }, scroll_time);
"""


def scroll_page(driver, scroll_interval=0.5, scroll_time=5000, scroll_step=50):
    driver.execute_script(SCROLL_TEMPLATE, scroll_interval, scroll_time, scroll_step)
    # execute_script returns as soon as the timers are set up, so wait here
    # until the scrolling has finished; without this sleep the caller
    # continues before the page has been scrolled.
    sleep((scroll_time / 1000) + 0.3)

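Note: scroll_interval is the delay between individual scroll steps (in milliseconds), scroll_time is the total time spent scrolling the page (in milliseconds), and scroll_step is the size of a single scroll step (in pixels).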
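
To see how the three parameters interact, the approximate distance covered by one `scroll_page` call can be estimated as the number of timer ticks times the step size. The helper below is a hypothetical illustration, not part of the answer's script. Its `min_interval` parameter is an assumption: browsers commonly enforce a minimum timer delay, so a sub-millisecond scroll_interval may not be honored, and the actual value for PhantomJS is not stated in this thread. The real scroll position is also capped by the page height.

```python
def nominal_scroll_distance(scroll_interval, scroll_time, scroll_step, min_interval=0.0):
    """Estimate how many pixels scroll_page would try to scroll.

    scroll_interval and scroll_time are in milliseconds, scroll_step in pixels.
    min_interval is an assumed lower bound on the timer delay (hypothetical).
    """
    effective_interval = max(scroll_interval, min_interval)
    ticks = int(scroll_time // effective_interval)  # approximate tick count
    return ticks * scroll_step


# Defaults from scroll_page, with no timer clamping assumed.
print(nominal_scroll_distance(0.5, 5000, 50))                  # 500000
# Same defaults if the timer were clamped to 4 ms (an assumption).
print(nominal_scroll_distance(0.5, 5000, 50, min_interval=4))  # 62500
```

A larger scroll_interval or a smaller scroll_step slows the scroll down, which is the adjustment this answer relies on.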

Please read my question. I can get it to scroll, but I don't know why it only works for the first page... – Student222

You may be scrolling too fast. I had the same problem once. This solution helped me with endless pagination by reducing the scroll speed. – tolmachofof

I tried scrolling slowly. It still doesn't work... – Student222