2017-10-05

I am trying to build a general-purpose scraper with Scrapy, though it seems a bit buggy. The idea is that it should take a URL as input and scrape only pages under that particular URL, but it keeps wandering off to sites such as YouTube. Ideally it would also have a depth option, so that 1, 2, 3, etc. could be passed as the number of links to follow away from the initial page. Any idea how to achieve this?

from bs4 import BeautifulSoup 
from bs4.element import Comment 
import urllib 
from route import urls 
import pickle 
import os 
import urllib2 
import urlparse 

def tag_visible(element):
    # Skip text nodes inside non-rendered tags and HTML comments.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body): 
    soup = BeautifulSoup(body, 'html.parser') 
    texts = soup.findAll(text=True) 
    visible_texts = filter(tag_visible, texts) 
    return u" ".join(t.strip() for t in visible_texts) 

def getAllUrl(url):
    # Collect every link on the page, resolving relative hrefs against the page URL.
    try:
        page = urllib2.urlopen(url).read()
    except Exception:
        return []
    urlList = []
    try:
        soup = BeautifulSoup(page, 'html.parser')
        for anchor in soup.findAll('a', href=True):
            if 'http://' not in anchor['href']:
                absolute = urlparse.urljoin(url, anchor['href'])
                if absolute not in urlList:
                    urlList.append(absolute)
            else:
                if anchor['href'] not in urlList:
                    urlList.append(anchor['href'])
        return urlList
    except urllib2.HTTPError as e:
        print e

def listAllUrl(url): 
    urls_new = list(set(url)) 
    return urls_new 
count = 0

main_url = str(raw_input('Enter the url : '))
# Derive an output folder name and a text-file name from the domain in the URL.
url_split = main_url.split('.', 1)
folder_name = url_split[1]
txtfile_split = folder_name.split('.', 1)
txtfile_name = txtfile_split[0]
urls = getAllUrl(main_url)
urls_new = listAllUrl(urls)

os.makedirs('c:/Scrapy/Extracted/' + folder_name + "/")
path = 'c:/Scrapy/Extracted/' + folder_name + "/"
for url in urls_new:
    # Turn relative links into absolute ones based on the start URL.
    if not (url.startswith("http") or url.startswith(" ")):
        url = main_url + url
    # Treat '#' fragments as path separators so they do not break the request.
    new_url = str(url).replace('#', '/') if '#' in url else url
    count = count + 1
    if new_url:
        print "" + str(count) + ">>", new_url
        html = urllib.urlopen(new_url).read()
        page_text_data = text_from_html(html)
        # Append every page to one combined text file for the whole site...
        with open(path + txtfile_name + ".txt", "a") as myfile:
            myfile.writelines("\n\n" + new_url.encode('utf-8') + "\n\n" + page_text_data.encode('utf-8'))
        # ...and also write each page's text to its own numbered file.
        filename = "url" + str(count) + ".txt"
        with open(os.path.join(path, filename), 'wb') as temp_file:
            temp_file.write(page_text_data.encode('utf-8'))

Answers

Answer 1 (score: 1)

Your current solution does not involve Scrapy at all. But since you explicitly asked for Scrapy, here you go.

Base your spider on the CrawlSpider class. It lets you crawl a given website and specify rules that the navigation must follow.

To block offsite requests, use the allowed_domains spider attribute. Alternatively, if you use the CrawlSpider class, you can set the allow_domains (or, conversely, deny_domains) argument of the LinkExtractor passed to the Rule constructor; a minimal sketch follows.
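
For illustration only, here is a minimal sketch of such a spider, assuming example.com stands in for the site you want to crawl; the spider name, selectors and item fields are placeholders, not something taken from the original answer.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiteSpider(CrawlSpider):
    name = "site_spider"  # placeholder name
    # Requests to any other domain are filtered out by the offsite middleware.
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    rules = (
        # allow_domains on the LinkExtractor restricts which links are
        # extracted at all; follow=True keeps crawling the pages it finds.
        Rule(LinkExtractor(allow_domains=["example.com"]),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Placeholder extraction: just the URL and the page title.
        yield {
            "url": response.url,
            "title": response.xpath("//title/text()").extract_first(),
        }

Inside a Scrapy project this would be run with something like scrapy crawl site_spider -o pages.json, which dumps the yielded items to a JSON file.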

To limit the crawl depth, use the DEPTH_LIMIT setting in settings.py (an example is sketched below).
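
A sketch of that setting, assuming a standard project-level settings.py; the value 2 is only an example, meaning links at most two hops away from the start URLs are followed.

# settings.py (project-wide)
# Limit how deep the crawl may go; 0, the default, means no limit.
DEPTH_LIMIT = 2

The same key can also be set for a single spider via its custom_settings dict if you prefer not to change the project settings.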

Answer 2 (score: 0)

You tagged this scrapy, but you are not using it at all. I recommend you give it a try - it is easy, and much easier than trying to build this yourself. It already has an option to restrict requests to a specific domain.