
Extracting Twitter follower data with the Selenium Chrome webdriver in Python: unable to load all followers

I want to scrape the follower data of every follower of a Twitter account with 80K followers, using Selenium with the Chrome webdriver and BeautifulSoup. I am running into two problems with my script:

1) When scrolling to the bottom of the page to grab the full page source after all followers have loaded, my script does not scroll all the way down. It stops scrolling after a random number of followers have loaded and then starts iterating over each follower profile to collect their data. I want it to load every follower on the page first and only then start iterating over the profiles.

2) My second problem is that every time I run the script, it scrolls one step at a time until all followers are loaded and then extracts the data by parsing one follower at a time. In my case (80K followers) this would take 4 to 5 days to collect all the follower data. Is there a better way to do this?

Here is my script:

from bs4 import BeautifulSoup 
 
import sys 
 
import os,re 
 
import time 
 
from selenium import webdriver 
 
from selenium.webdriver.support.ui import Select 
 
from selenium.webdriver.common.keys import Keys 
 
from os import listdir 
 
from os.path import isfile, join 
 

 
print "Running for chrome." 
 

 
chromedriver=sys.argv[1] 
 
download_path=sys.argv[2] 
 
os.system('killall -9 "Google Chrome"') 
 
try:
    os.environ["webdriver.chrome.driver"]=chromedriver
    chromeOptions = webdriver.ChromeOptions()
    prefs = {"download.default_directory" : download_path}
    chromeOptions.add_experimental_option("prefs",prefs)
    driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chromeOptions)
    driver.implicitly_wait(20)
    driver.maximize_window()
except Exception as err:
    print "Error:Failed to open chrome."
    print "Error: ",err
    driver.stop_client()
    driver.close()
 
#opening the web page 
 
try:
    driver.get('https://twitter.com/login')
except Exception as err:
    print "Error:Failed to open url."
    print "Error: ",err
    driver.stop_client()
    driver.close()
 

 
username = driver.find_element_by_xpath("//input[@name='session[username_or_email]' and @class='js-username-field email-input js-initial-focus']") 
 
password = driver.find_element_by_xpath("//input[@name='session[password]' and @class='js-password-field']") 
 

 
username.send_keys("###########") 
 
password.send_keys("###########") 
 
driver.find_element_by_xpath("//button[@type='submit']").click() 
 
#os.system('killall -9 "Google Chrome"') 
 
driver.get('https://twitter.com/sadserver/followers') 
 

 

 

 
followers_link=driver.page_source  # followers page, 18 profiles at a time
 
soup=BeautifulSoup(followers_link,'html.parser') 
 

 
output=open('twitter_follower_sadoperator.csv','a') 
 
output.write('Name,Twitter_Handle,Location,Bio,Join_Date,Link'+'\n') 
 
div = soup.find('div',{'class':'GridTimeline-items has-items'}) 
 
bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'}) 
 
name_list=[] 
 
lastHeight = driver.execute_script("return document.body.scrollHeight") 
 

 

 
followers_per_page = 18      # follower cards Twitter loads per scroll
followers_count = 80000      # approximate follower count of the target account

for _ in xrange(0, followers_count/followers_per_page + 1): 
 
     driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
 
     time.sleep(5) 
 
     newHeight = driver.execute_script("return document.body.scrollHeight") 
 
     if newHeight == lastHeight: 
 
       followers_link=driver.page_source  # followers page, 18 profiles at a time
 
       soup=BeautifulSoup(followers_link,'html.parser') 
 
       div = soup.find('div',{'class':'GridTimeline-items has-items'}) 
 
       bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'}) 
 
       for name in bref: 
 
         name_list.append(name['href']) 
 
       break 
 
     lastHeight = newHeight 
 
     followers_link='' 
 

 
print len(name_list) 
 

 

 
for x in range(0,len(name_list)): 
 
     #print name['href'] 
 
     #print name.text 
 
     driver.stop_client() 
 
     driver.get('https://twitter.com'+name_list[x]) 
 
     page_source=driver.page_source 
 
     each_soup=BeautifulSoup(page_source,'html.parser') 
 
     profile=each_soup.find('div',{'class':'ProfileHeaderCard'}) 
 
          
 
     try: 
 
       name = profile.find('h1',{'class':'ProfileHeaderCard-name'}).find('a').text 
 
       if name: 
 
         output.write('"'+name.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in name:',e 
 

 
     try: 
 
       handle=profile.find('h2',{'class':'ProfileHeaderCard-screenname u-inlineBlock u-dir'}).text 
 
       if handle: 
 
         output.write('"'+handle.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in handle:',e 
 

 
     try: 
 
       location = profile.find('div',{'class':'ProfileHeaderCard-location'}).text 
 
       if location: 
 
         output.write('"'+location.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in location:',e 
 

 
     try: 
 
       bio=profile.find('p',{'class':'ProfileHeaderCard-bio u-dir'}).text 
 
       if bio: 
 
         output.write('"'+bio.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in bio:',e 
 
         
 
     try: 
 
       joinDate = profile.find('div',{'class':'ProfileHeaderCard-joinDate'}).text 
 
       if joinDate: 
 
         output.write('"'+joinDate.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in joindate:',e 
 
     
 
     try: 
 
       url = [check.find('a') for check in profile.find('div',{'class':'ProfileHeaderCard-url'}).findAll('span')][1] 
 
       if url: 
 
         output.write('"'+url['href'].strip().encode('utf-8')+'"'+'\n') 
 
       else: 
 
         output.write(' '+'\n') 
 
     except Exception as e: 
 
       output.write(' '+'\n') 
 
       print 'Error in url:',e 
 
     
 

 

 
     
 
output.close() 
 

 

 
os.system("kill -9 `ps -deaf | grep chrome | awk '{print $2}'`")

Answers


There is a better way: use Twitter's API. Here is a quick script I found on GitHub: Github Script. Sorry, you may feel you have already sunk a lot of time into Selenium (and there are pros who don't use the API). A great post on automation and on understanding how things work: Twitter API
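For illustration only (this is not the linked GitHub script), here is a minimal sketch of pulling follower profiles through the official API with the tweepy library (tweepy 3.x assumed); the credential values and the target screen name are placeholders you would fill in yourself:

import csv 
import tweepy 

# Placeholder credentials from your registered Twitter app -- fill in your own. 
CONSUMER_KEY = 'xxx' 
CONSUMER_SECRET = 'xxx' 
ACCESS_TOKEN = 'xxx' 
ACCESS_SECRET = 'xxx' 
SCREEN_NAME = 'sadoperator'   # account whose followers you want 

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET) 
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET) 
api = tweepy.API(auth, wait_on_rate_limit=True)   # sleep through rate limits automatically 

with open('twitter_follower_sadoperator.csv', 'wb') as f: 
    writer = csv.writer(f) 
    writer.writerow(['Name', 'Twitter_Handle', 'Location', 'Bio', 'Join_Date', 'Link']) 
    # followers/list returns up to 200 full profiles per request, 
    # so 80K followers is roughly 400 requests instead of 80K page loads. 
    for user in tweepy.Cursor(api.followers, screen_name=SCREEN_NAME, count=200).items(): 
        writer.writerow([ 
            user.name.encode('utf-8'), 
            user.screen_name, 
            (user.location or '').encode('utf-8'), 
            (user.description or '').encode('utf-8'), 
            str(user.created_at), 
            user.url or '', 
        ]) 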

There are ways to scroll multiple times, but you have to do some math or set a condition to stop it.

driver.execute_script("window.scrollTo(0, 10000);") 

Let's say you have 10K followers, an initial batch is displayed, and each scroll loads about 10 more followers. You would then need to scroll roughly the remaining number of times.
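For example, a rough estimate of the scroll count under those example numbers (driver and time come from the question's script; the per-scroll figure is just the answer's illustration, not a real limit):

# Back-of-the-envelope scroll count; the numbers are the answer's example. 
followers_count = 10000     # total followers of the account 
initially_shown = 18        # cards rendered before any scrolling 
loaded_per_scroll = 10      # cards each scroll adds in the answer's example 

scrolls_needed = (followers_count - initially_shown) // loaded_per_scroll + 1 
print scrolls_needed        # roughly 999 scrolls 

for _ in xrange(scrolls_needed): 
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
    time.sleep(2)           # give the next batch of cards time to load 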

Here is the exact usage for your case, courtesy of alecxe :D (Quora answer by alecxe):

html = driver.page_source 

.page_source can be grabbed once you have revealed all of the followers (by scrolling), and then parsed with something like BeautifulSoup.
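Putting that together, a minimal sketch of the "set a condition" variant, assuming the driver from the question's script is already logged in and sitting on the followers page; the retry counter is there because the page height can briefly stop growing while the next batch is still loading, which is one likely reason the original loop exits early:

import time 
from bs4 import BeautifulSoup 

SCROLL_PAUSE = 3     # seconds to wait after each scroll 
MAX_RETRIES = 5      # consecutive "no growth" checks before accepting the bottom 

last_height = driver.execute_script("return document.body.scrollHeight") 
retries = 0 
while retries < MAX_RETRIES: 
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
    time.sleep(SCROLL_PAUSE) 
    new_height = driver.execute_script("return document.body.scrollHeight") 
    if new_height == last_height: 
        retries += 1     # might just be slow loading, so try a few more times 
    else: 
        retries = 0      # the page grew, keep scrolling 
        last_height = new_height 

# Only now grab the full source once and parse it with BeautifulSoup. 
soup = BeautifulSoup(driver.page_source, 'html.parser') 
name_list = [a['href'] for a in soup.select('a.ProfileCard-bg.js-nav')]   # same class as in the question 
print len(name_list) 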


Could we load all the followers by scrolling manually, save the page source to a text file, and then iterate over all of the follower data from that text file instead of going to the Twitter site? I don't know whether that would work. If it would, could you please provide the code to do it, because I have been trying and have not succeeded. Thanks. –


Yes, Selenium has a function for that: .page_source, e.g. html = driver.page_source –


I implemented what alecxe mentioned in his answer, but my script still does not parse all of the followers. It still loads only a random number of followers, and I can't seem to get to the bottom of it. Could someone try running it and see whether they can load all of the followers? Here is the modified script:

from bs4 import BeautifulSoup 
 
import sys 
 
import os,re 
 
import time 
 
from selenium import webdriver 
 
from selenium.webdriver.support.ui import Select 
 
from selenium.webdriver.common.keys import Keys 
 
from os import listdir 
 
from os.path import isfile, join 
 

 
print "Running for chrome." 
 

 
chromedriver=sys.argv[1] 
 
download_path=sys.argv[2] 
 
os.system('killall -9 "Google Chrome"') 
 
try:
    os.environ["webdriver.chrome.driver"]=chromedriver
    chromeOptions = webdriver.ChromeOptions()
    prefs = {"download.default_directory" : download_path}
    chromeOptions.add_experimental_option("prefs",prefs)
    driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chromeOptions)
    driver.implicitly_wait(20)
    driver.maximize_window()
except Exception as err:
    print "Error:Failed to open chrome."
    print "Error: ",err
    driver.stop_client()
    driver.close()
 
#opening the web page 
 
try:
    driver.get('https://twitter.com/login')
except Exception as err:
    print "Error:Failed to open url."
    print "Error: ",err
    driver.stop_client()
    driver.close()
 

 
username = driver.find_element_by_xpath("//input[@name='session[username_or_email]' and @class='js-username-field email-input js-initial-focus']") 
 
password = driver.find_element_by_xpath("//input[@name='session[password]' and @class='js-password-field']") 
 

 
username.send_keys("*****************") 
 
password.send_keys("*****************") 
 
driver.find_element_by_xpath("//button[@type='submit']").click() 
 
#os.system('killall -9 "Google Chrome"') 
 
driver.get('https://twitter.com/sadoperator/followers') 
 

 

 

 
followers_link=driver.page_source  # followers page, 18 profiles at a time
 
soup=BeautifulSoup(followers_link,'html.parser') 
 

 
output=open('twitter_follower_sadoperator.csv','a') 
 
output.write('Name,Twitter_Handle,Location,Bio,Join_Date,Link'+'\n') 
 
div = soup.find('div',{'class':'GridTimeline-items has-items'}) 
 
bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'}) 
 
name_list=[] 
 
lastHeight = driver.execute_script("return document.body.scrollHeight") 
 

 
followers_link=driver.page_source  # followers page, 18 profiles at a time
 
soup=BeautifulSoup(followers_link,'html.parser') 
 

 
followers_per_page = 18 
 
followers_count = 15777 
 

 

 
for _ in xrange(0, followers_count/followers_per_page + 1): 
 
     driver.execute_script("window.scrollTo(0, 7755000);") 
 
     time.sleep(2) 
 
     newHeight = driver.execute_script("return document.body.scrollHeight") 
 
     if newHeight == lastHeight: 
 
       followers_link=driver.page_source  # followers page, 18 profiles at a time
 
       soup=BeautifulSoup(followers_link,'html.parser') 
 
       div = soup.find('div',{'class':'GridTimeline-items has-items'}) 
 
       bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'}) 
 
       for name in bref: 
 
         name_list.append(name['href']) 
 
       break 
 
     lastHeight = newHeight 
 
     followers_link='' 
 

 
print len(name_list) 
 

 
''' 
 
for x in range(0,len(name_list)): 
 
     #print name['href'] 
 
     #print name.text 
 
     driver.stop_client() 
 
     driver.get('https://twitter.com'+name_list[x]) 
 
     page_source=driver.page_source 
 
     each_soup=BeautifulSoup(page_source,'html.parser') 
 
     profile=each_soup.find('div',{'class':'ProfileHeaderCard'}) 
 
          
 
     try: 
 
       name = profile.find('h1',{'class':'ProfileHeaderCard-name'}).find('a').text 
 
       if name: 
 
         output.write('"'+name.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in name:',e 
 

 
     try: 
 
       handle=profile.find('h2',{'class':'ProfileHeaderCard-screenname u-inlineBlock u-dir'}).text 
 
       if handle: 
 
         output.write('"'+handle.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in handle:',e 
 

 
     try: 
 
       location = profile.find('div',{'class':'ProfileHeaderCard-location'}).text 
 
       if location: 
 
         output.write('"'+location.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in location:',e 
 

 
     try: 
 
       bio=profile.find('p',{'class':'ProfileHeaderCard-bio u-dir'}).text 
 
       if bio: 
 
         output.write('"'+bio.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in bio:',e 
 
         
 
     try: 
 
       joinDate = profile.find('div',{'class':'ProfileHeaderCard-joinDate'}).text 
 
       if joinDate: 
 
         output.write('"'+joinDate.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in joindate:',e 
 
     
 
     try: 
 
       url = [check.find('a') for check in profile.find('div',{'class':'ProfileHeaderCard-url'}).findAll('span')][1] 
 
       if url: 
 
         output.write('"'+url['href'].strip().encode('utf-8')+'"'+'\n') 
 
       else: 
 
         output.write(' '+'\n') 
 
     except Exception as e: 
 
       output.write(' '+'\n') 
 
       print 'Error in url:',e 
 
     
 

 

 
     
 
output.close() 
 
''' 
 

 
os.system("kill -9 `ps -deaf | grep chrome | awk '{print $2}'`")

  1. Open the developer console in Firefox (or another browser) and, while the followers page is scrolling/paginating, write down (copy) the request that gets issued; you will use it to build your own requests. The request looks something like this: https://twitter.com/DiaryofaMadeMan/followers/users?include_available_features=1&include_entities=1&max_position=1584951385597824282&reset_error_state=false. Also search the HTML source for the min position, which looks like data-min-position="1584938620170076301".
  2. Load the HTML and parse it with BeautifulSoup. You need to get the first batch of followers and the data-min-position value. Save the followers to a list and data-min-position to a variable.
  3. Build a new request from the one saved in step 1: only replace the request's max_position number with the saved data-min-position.
  4. Use Python requests (no webdriver any more) to send the request and receive the JSON response.
  5. Get the new followers and the new data-min-position from the response JSON.
  6. Repeat steps 2, 3 and 4 until data-min-position is 0.

This approach is much better than the API, because you can load large amounts of data without any limits. A rough sketch of the request loop is shown below.
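For illustration only, a minimal sketch of steps 2 through 6, assuming the old followers/users endpoint behaves as described above, that the grid div carries the cursor in a data-min-position attribute, and that the JSON response exposes items_html and min_position fields (all of these are assumptions to verify in the developer console). It reuses the cookies from the logged-in Selenium session in the script above, and the sadoperator account from that script:

import requests 
from bs4 import BeautifulSoup 

# Template of the request copied from the developer console (step 1). 
BASE = ('https://twitter.com/sadoperator/followers/users' 
        '?include_available_features=1&include_entities=1' 
        '&max_position={pos}&reset_error_state=false') 

# Reuse the cookies from the logged-in Selenium session so Twitter accepts the requests. 
session = requests.Session() 
for cookie in driver.get_cookies(): 
    session.cookies.set(cookie['name'], cookie['value']) 
session.headers['x-requested-with'] = 'XMLHttpRequest' 

# Step 2: first batch of followers and the initial cursor come from the page already open. 
first_soup = BeautifulSoup(driver.page_source, 'html.parser') 
grid = first_soup.find('div', {'class': 'GridTimeline-items has-items'}) 
name_list = [a['href'] for a in grid.findAll('a', {'class': 'ProfileCard-bg js-nav'})] 
position = grid.get('data-min-position')   # assumed location of the cursor attribute 

# Steps 3-6: keep requesting the next batch until the cursor runs out. 
while position and str(position) != '0': 
    data = session.get(BASE.format(pos=position)).json() 
    cards = BeautifulSoup(data['items_html'], 'html.parser')   # assumed JSON field 
    for a in cards.select('a.ProfileCard-bg.js-nav'): 
        name_list.append(a['href']) 
    position = data.get('min_position')                        # assumed JSON field 

print len(name_list) 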
