2017-02-28

I want to crawl multiple websites (from a CSV file) and extract certain keywords from Chrome's "Inspect Element" source code (right-click the web page, then choose Inspect Element). Right now I crawl multiple URLs with Selenium WebDriver.

I can already extract certain keywords from their "View Page Source" code (right-click the web page, then choose View Page Source in the browser) with this script:

import urllib2
import socket
import ssl
import csv

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'EXPORT_Results!.csv'
# LIST OF KEYWORDS (MUST MATCH THE FIELD NAMES ABOVE)
keywords = ['@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']

csv_writerheader(csv_output_file)

with open('top1m-edited.csv', 'r') as f:
    for line in f:
        strdomain = line.strip()
        # INITIALIZE DICT
        data = {'Website': strdomain}

        if '.nl' in strdomain:
            try:
                req = urllib2.Request(strdomain)
                response = urllib2.urlopen(req)
                html_content = response.read()

                # ITERATE THROUGH EACH KEYWORD AND UPDATE DICT
                for searchstring in keywords:
                    if searchstring.lower() in html_content.lower():
                        print (strdomain, searchstring, 'found')
                        data[searchstring] = 'found'
                    else:
                        print (strdomain, searchstring, 'not found')
                        data[searchstring] = 'not found'

                # CALL METHOD PASSING DICT AND OUTPUT FILE
                csv_writer(data, csv_output_file)

            except urllib2.HTTPError:
                print (strdomain, 'HTTP ERROR')

            except urllib2.URLError:
                print (strdomain, 'URL ERROR')

            except socket.error:
                print (strdomain, 'SOCKET ERROR')

            except ssl.CertificateError:
                print (strdomain, 'SSL Certificate ERROR')

Below that, I wrote the following code to fetch the desired "Inspect Element" source code from a website, so that I can later extract the keywords from it (for the multiple websites in the CSV file) with the script above:

from selenium import webdriver 

driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe') 
driver.get('https://www.rocmn.nl/') 

elem = driver.find_element_by_xpath("//*") 
source_code = elem.get_attribute("outerHTML") 

print(source_code) 

I now want to merge the first script with the second, so that it crawls the "Inspect Element" source code of all the websites in the CSV and exports the results to a CSV file (as in the first script).

I have absolutely no idea where to start to get this working. Please help.
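One way to combine the two scripts (a rough sketch, not a definitive solution; the chromedriver path, the `.nl` filter, and the keyword list are taken from the question's own code) is to keep the keyword scan and CSV writing from the first script, but replace the `urllib2` fetch with Selenium's rendered source. `driver.page_source` returns the current rendered DOM, which is what "Inspect Element" shows:

```python
import csv

# Third-party dependency from the question (pip install selenium);
# guarded so the pure keyword-scan function below works without it.
try:
    from selenium import webdriver
except ImportError:
    webdriver = None

keywords = ['@media',
            'googleadservices.com/pagead/conversion.js',
            'googleadservices.com/pagead/conversion_async.js']

def scan_keywords(html_content, kws):
    # Same case-insensitive substring check as the first script,
    # factored out so it works with any source of HTML.
    return {kw: ('found' if kw.lower() in html_content.lower() else 'not found')
            for kw in kws}

def crawl(input_csv, output_csv, driver_path):
    driver = webdriver.Chrome(executable_path=driver_path)
    fieldnames = ['Website'] + keywords
    with open(output_csv, 'w') as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()
        with open(input_csv) as f:
            for line in f:
                strdomain = line.strip()
                if '.nl' not in strdomain:
                    continue
                try:
                    driver.get(strdomain)
                    # Rendered DOM, i.e. what "Inspect Element" shows
                    html = driver.page_source
                except Exception as e:  # WebDriverException and friends
                    print(strdomain, 'ERROR', e)
                    continue
                row = {'Website': strdomain}
                row.update(scan_keywords(html, keywords))
                writer.writerow(row)
    driver.quit()
```

Note that Selenium expects full URLs (e.g. `http://example.nl`), so if the CSV contains bare domains they need a scheme prefixed before `driver.get()`.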


SO is not a code-writing service. We are here to help with programming problems, but you need to put in some effort first. Try combining the two, read some basic programming tutorials, blogs, and books, and give it a shot. If you can't get it working, come back and edit this question to be more specific about the problem you ran into. – JeffC


I know. I'm just asking someone to point me in the right direction. At this point I really don't know where to start. – jakeT888

Answer


Collecting keywords from the raw source is not the right approach. The keywords that matter come from the body section and the meta tags. Whatever count you get back, just decrement it by 1:

private Object getTotalCount(String strKeyword) {
    // Get the total count for the given keyword.
    // Set up a JavascriptExecutor for running JavaScript on the page.
    // Make sure the driver (HtmlUnitDriver or any other) has JavaScript enabled.
    JavascriptExecutor jsExecutor = wdHTMLUnitDriver;
    // Count the keyword in the body of the web page only.
    Object objCount = null;
    try {
        // Splitting the body text on the keyword yields one more
        // piece than there are occurrences (hence "decrement by 1").
        objCount = jsExecutor.executeScript(
            "var temp = document.getElementsByTagName('body')[0].innerText;"
                + "var substrings = temp.split(arguments[0]);"
                + "return (substrings.length);",
            strKeyword);
    } catch (Exception e) {
        e.printStackTrace();
    }
    if (objCount == null) // objCount.equals(null) would throw an NPE here
        return null;
    // Return the total count found by the JavaScript executor.
    return objCount.toString();
}
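The same counting trick in Python (the language of the question's scripts), shown on a plain string rather than a live page: `split()` returns one more piece than there are occurrences of the separator, so subtracting 1 gives the count.

```python
def keyword_count(body_text, keyword):
    # str.split(sep) returns len(pieces) == occurrences + 1,
    # so subtract 1 to get the number of occurrences.
    return len(body_text.split(keyword)) - 1
```

For example, `keyword_count('ads and more ads', 'ads')` returns 2, and a keyword that never appears returns 0.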