I want to pull the follower data for all followers of a Twitter account that has 80K followers, using the Selenium Chrome webdriver and BeautifulSoup. I am facing two problems with my script:
1) When scrolling to the bottom of the page to get the full page source once all followers have loaded, my script does not scroll all the way to the bottom. It stops scrolling after a seemingly random number of followers has loaded and then starts iterating through each follower profile to get their data. I want it to load every follower on the page first and only then start iterating through the profiles.
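The only workaround I have come up with so far is a more patient scroll loop that only gives up after the page height stays unchanged for several consecutive checks, instead of breaking on the first one. A sketch of what I mean (the pause length and retry count are guesses, not tuned values):

import time

SCROLL_PAUSE = 5   # seconds to wait for new followers to load (a guess)
MAX_RETRIES = 3    # consecutive unchanged-height checks before giving up (a guess)

def scroll_to_bottom(driver):
    last_height = driver.execute_script("return document.body.scrollHeight")
    retries = 0
    while retries < MAX_RETRIES:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(SCROLL_PAUSE)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            retries += 1   # height unchanged; content may still be loading, so retry
        else:
            retries = 0    # page grew, keep scrolling
        last_height = new_height

I have not verified that this actually fixes the early stop.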
2) My second problem: every time I run the script, it scrolls down one step at a time until all followers are loaded, and then extracts data by parsing one follower profile at a time. In my case (80K followers) that would take 4 to 5 days to fetch all the follower data. Is there a better way to do this?
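One idea I had for speeding this up: the follower grid seems to carry each user's screen name, bio and location inside its ProfileCard markup, so perhaps the one-page-visit-per-profile step could be skipped entirely. A sketch of what I mean (the ProfileCard-* class names are my guesses based on the ProfileCard-bg anchors my script below already finds; data-screen-name and fullname are assumptions about the card markup):

from bs4 import BeautifulSoup

# Parse follower data straight off the (fully scrolled) followers page,
# avoiding one driver.get() per profile.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for card in soup.findAll('div', {'class': 'ProfileCard'}):
    handle = card.get('data-screen-name', '')                # assumed attribute
    name_tag = card.find('a', {'class': 'fullname'})         # assumed class
    name = name_tag.text.strip() if name_tag else ''
    bio_tag = card.find('p', {'class': 'ProfileCard-bio'})   # assumed class
    bio = bio_tag.text.strip() if bio_tag else ''
    print handle, '|', name.encode('utf-8'), '|', bio.encode('utf-8')

Join date and the website link are probably not in the cards, so this would trade some fields for speed.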
Here is my script:
from bs4 import BeautifulSoup
import sys
import os,re
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from os import listdir
from os.path import isfile, join
print "Running for chrome."
chromedriver=sys.argv[1]
download_path=sys.argv[2]
os.system('killall -9 "Google Chrome"')
try:
    os.environ["webdriver.chrome.driver"] = chromedriver
    chromeOptions = webdriver.ChromeOptions()
    prefs = {"download.default_directory": download_path}
    chromeOptions.add_experimental_option("prefs", prefs)
    driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chromeOptions)
    driver.implicitly_wait(20)
    driver.maximize_window()
except Exception as err:
    print "Error:Failed to open chrome."
    print "Error: ", err
    sys.exit(1)  # driver may not exist at this point, so just exit
#opening the web page
try:
    driver.get('https://twitter.com/login')
except Exception as err:
    print "Error:Failed to open url."
    print "Error: ", err
    driver.stop_client()
    driver.close()
    sys.exit(1)  # no point continuing if the login page did not load
username = driver.find_element_by_xpath("//input[@name='session[username_or_email]' and @class='js-username-field email-input js-initial-focus']")
password = driver.find_element_by_xpath("//input[@name='session[password]' and @class='js-password-field']")
username.send_keys("###########")
password.send_keys("###########")
driver.find_element_by_xpath("//button[@type='submit']").click()
#os.system('killall -9 "Google Chrome"')
driver.get('https://twitter.com/sadserver/followers')
followers_link = driver.page_source  # followers page source (the grid loads ~18 followers at a time)
soup=BeautifulSoup(followers_link,'html.parser')
output=open('twitter_follower_sadoperator.csv','a')
output.write('Name,Twitter_Handle,Location,Bio,Join_Date,Link'+'\n')
div = soup.find('div',{'class':'GridTimeline-items has-items'})
bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'})
name_list = []
followers_count = 80000   # the account has ~80K followers
followers_per_page = 18   # the grid loads roughly 18 followers per scroll
lastHeight = driver.execute_script("return document.body.scrollHeight")
for _ in xrange(0, followers_count / followers_per_page + 1):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        # height stopped changing: assume everything is loaded and collect the profile links
        followers_link = driver.page_source
        soup = BeautifulSoup(followers_link, 'html.parser')
        div = soup.find('div', {'class': 'GridTimeline-items has-items'})
        bref = div.findAll('a', {'class': 'ProfileCard-bg js-nav'})
        for name in bref:
            name_list.append(name['href'])
        break
    lastHeight = newHeight
followers_link=''
print len(name_list)
for x in range(0, len(name_list)):
    # print name['href']
    # print name.text
    driver.stop_client()
    driver.get('https://twitter.com' + name_list[x])
    page_source = driver.page_source
    each_soup = BeautifulSoup(page_source, 'html.parser')
    profile = each_soup.find('div', {'class': 'ProfileHeaderCard'})
    try:
        name = profile.find('h1', {'class': 'ProfileHeaderCard-name'}).find('a').text
        if name:
            output.write('"' + name.strip().encode('utf-8') + '"' + ',')
        else:
            output.write(' ' + ',')
    except Exception as e:
        output.write(' ' + ',')
        print 'Error in name:', e
    try:
        handle = profile.find('h2', {'class': 'ProfileHeaderCard-screenname u-inlineBlock u-dir'}).text
        if handle:
            output.write('"' + handle.strip().encode('utf-8') + '"' + ',')
        else:
            output.write(' ' + ',')
    except Exception as e:
        output.write(' ' + ',')
        print 'Error in handle:', e
    try:
        location = profile.find('div', {'class': 'ProfileHeaderCard-location'}).text
        if location:
            output.write('"' + location.strip().encode('utf-8') + '"' + ',')
        else:
            output.write(' ' + ',')
    except Exception as e:
        output.write(' ' + ',')
        print 'Error in location:', e
    try:
        bio = profile.find('p', {'class': 'ProfileHeaderCard-bio u-dir'}).text
        if bio:
            output.write('"' + bio.strip().encode('utf-8') + '"' + ',')
        else:
            output.write(' ' + ',')
    except Exception as e:
        output.write(' ' + ',')
        print 'Error in bio:', e
    try:
        joinDate = profile.find('div', {'class': 'ProfileHeaderCard-joinDate'}).text
        if joinDate:
            output.write('"' + joinDate.strip().encode('utf-8') + '"' + ',')
        else:
            output.write(' ' + ',')
    except Exception as e:
        output.write(' ' + ',')
        print 'Error in joindate:', e
    try:
        url = [check.find('a') for check in profile.find('div', {'class': 'ProfileHeaderCard-url'}).findAll('span')][1]
        if url:
            output.write('"' + url['href'].strip().encode('utf-8') + '"' + '\n')
        else:
            output.write(' ' + '\n')
    except Exception as e:
        output.write(' ' + '\n')
        print 'Error in url:', e
output.close()
os.system("kill -9 `ps -deaf | grep chrome | awk '{print $2}'`")
Could we load all the followers by scrolling manually, save the page source to a text file, and then iterate over all the followers' data from that text file instead of going to the Twitter site? I don't know whether that would work. If it would, could you please provide the code to do it? I have been trying to do this and have not succeeded. Thanks. –
Yes, Selenium has a function for that: .page_source, e.g. html = driver.page_source –
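Roughly like this, saving the source once after scrolling and then parsing it offline later (the file name is just an example):

from bs4 import BeautifulSoup

# Save the fully scrolled page once...
html = driver.page_source
with open('followers_page.html', 'w') as f:
    f.write(html.encode('utf-8'))

# ...then parse it offline, without touching Twitter again:
soup = BeautifulSoup(open('followers_page.html').read(), 'html.parser')
links = soup.findAll('a', {'class': 'ProfileCard-bg js-nav'})
print len(links)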