I want to crawl multiple websites (listed in a CSV file) and extract certain keywords from Chrome's "inspect element" source code (right-click the web page, then choose Inspect Element). At the moment I am fetching the URLs with the Selenium webdriver.
I can already extract those keywords from a site's "view source" code (right-click the web page, then choose View Page Source in the browser) with this script:
import urllib2
import socket
import ssl
import csv

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'EXPORT_Results!.csv'

# LIST OF KEYWORDS (MATCHING THE FIELD NAMES ABOVE)
keywords = ['@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']

csv_writerheader(csv_output_file)

with open('top1m-edited.csv', 'r') as f:
    for line in f:
        strdomain = line.strip()
        # INITIALIZE DICT
        data = {'Website': strdomain}
        if '.nl' in strdomain:
            try:
                req = urllib2.Request(strdomain)
                response = urllib2.urlopen(req)
                html_content = response.read()
                # ITERATE THROUGH EACH KEYWORD AND UPDATE DICT
                for searchstring in keywords:
                    if searchstring.lower() in str(html_content).lower():
                        print(strdomain, searchstring, 'found')
                        data[searchstring] = 'found'
                    else:
                        print(strdomain, searchstring, 'not found')
                        data[searchstring] = 'not found'
                # WRITE THE ROW FOR THIS DOMAIN TO THE OUTPUT FILE
                csv_writer(data, csv_output_file)
            except urllib2.HTTPError:
                print(strdomain, 'HTTP ERROR')
            except urllib2.URLError:
                print(strdomain, 'URL ERROR')
            except socket.error:
                print(strdomain, 'SOCKET ERROR')
            except ssl.CertificateError:
                print(strdomain, 'SSL Certificate ERROR')
Below is the code I wrote to fetch a site's "inspect element" source code, so that the keywords can later be extracted with the script above (for each of the websites in the CSV file). The code:
from selenium import webdriver

# Path to the local chromedriver binary
driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')
driver.get('https://www.rocmn.nl/')

# Grab the root element and read its rendered HTML (what "inspect element" shows)
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
print(source_code)
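(Selenium can also return the rendered HTML directly; driver.page_source should give essentially the same result as reading outerHTML from the root element:)

source_code = driver.page_source  # rendered DOM, i.e. what "inspect element" shows
print(source_code)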
I now want to merge the first script with the second, so that only the "inspect element" source code is crawled (for every website in the CSV) and the results are exported to a CSV file (as in the first script).
I have no idea where to start to get this working. Please help.
SO is not a code-writing service. We are here to help with programming problems, but you need to put in some effort first. Try combining the two yourself, read some basic programming tutorials, blogs, or books, and give it a shot. If you can't get it working, come back and edit this question with more specifics about the problem you ran into. – JeffC
I know. I'm just asking for someone to point me in the right direction. At this point I really don't know where to start. – jakeT888
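As a rough pointer, here is a minimal sketch of how the two scripts could be merged: read the domains from the CSV, load each page with Selenium so the rendered ("inspect element") source is available, check the keywords, and write one row per domain to the output CSV. The file names and chromedriver path are carried over from the question; prepending http:// to bare domains is an assumption, as is catching WebDriverException for failed page loads.

import csv
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']
keywords = fieldnames[1:]
csv_output_file = 'EXPORT_Results!.csv'

driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')

with open(csv_output_file, 'w') as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    with open('top1m-edited.csv', 'r') as f:
        for line in f:
            strdomain = line.strip()
            if '.nl' not in strdomain:
                continue
            data = {'Website': strdomain}
            try:
                # Selenium needs a full URL; prepending a scheme here is an assumption
                url = strdomain if strdomain.startswith('http') else 'http://' + strdomain
                driver.get(url)
                # page_source is the rendered DOM, i.e. what "inspect element" shows
                html_content = driver.page_source
                for searchstring in keywords:
                    found = searchstring.lower() in html_content.lower()
                    data[searchstring] = 'found' if found else 'not found'
                    print(strdomain, searchstring, data[searchstring])
                writer.writerow(data)
            except WebDriverException as e:
                print(strdomain, 'ERROR', e)

driver.quit()

This is only a starting point: it loads pages sequentially in a single browser session, so for a large CSV you would likely want to add timeouts and restart the driver periodically.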