在特定网站上刮掉问题

这是我在堆栈溢出问题上的第一个问题，请耐心等待。在特定网站上刮掉问题

我试图从网站上自动下载（即刮）的一些意大利的法律文本：http://www.normattiva.it/

我使用下面这段代码（以及类似排列）：

import requests, sys 

debug = {'verbose': sys.stderr} 
user_agent = {'User-agent': 'Mozilla/5.0', 'Connection':'keep-alive'} 

url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art' 

r = requests.session() 
s = r.get(url, headers=user_agent) 
#print(s.text) 
print(s.url) 
print(s.headers) 
print(s.request.headers)

正如你可以看到我正在尝试加载“caricaArticolo”查询。

然而，输出的是一个网页说，我的搜索是无效的（“会话无效或过期”）

看来，网页承认我不使用浏览器和负载一个“突破”的JavaScript功能。

<body onload="javascript:breakout();">

我试图用“浏览器”模拟器Python脚本，例如硒和robobrowser但结果是一样的。

有没有人愿意花10分钟看页面输出并给予帮助？

来源

2016-09-15 terzim

@terzin要访问您首先需要授权的用户的页面。你没有有效的会话。 –

我试了相同的代码，我得到所需的输出 –

尝试使用“beautifulsoup”它真正强大的库，将做你所需要的。 –

一旦你点击网络下使用开发工具打开的网页上的任何链接，该文档选项卡下：

你可以看到三个环节，第一个就是我们点击，第二个返回该HTML允许您跳转到特定的文章，最后一页包含文章文字。

在从首联返回的源代码，你可以看到两个IFRAME标签：

<div id="alberoTesto"> 
     <iframe 
      src="/atto/caricaAlberoArticoli?atto.dataPubblicazioneGazzetta=2016-08-31&atto.codiceRedazionale=16G00182&atto.tipoProvvedimento=DECRETO LEGISLATIVO" 
      name="leftFrame" scrolling="auto" id="leftFrame" title="leftFrame" height="100%" style="width: 285px; float:left;" frameborder="0"> 
     </iframe> 

     <iframe 
      src="/atto/caricaArticoloDefault?atto.dataPubblicazioneGazzetta=2016-08-31&atto.codiceRedazionale=16G00182&atto.tipoProvvedimento=DECRETO LEGISLATIVO" 
      name="mainFrame" id="mainFrame" title="mainFrame" height="100%" style="width: 800px; float:left;" scrolling="auto" frameborder="0"> 
     </iframe>

首先是项目，后者与/caricaArticoloDefault和IDmainFrame中是我们想要的。

你需要从最初的请求使用Cookie，这样您可以与会议对象，并通过分析网页使用bs4做到这一点：

import requests, sys 
import os 
from urlparse import urljoin 
import io 
user_agent = {'User-agent': 'Mozilla/5.0', 'Connection': 'keep-alive'} 

url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art' 

with requests.session() as s: 
    s.headers.update(user_agent) 
    r = s.get("http://www.normattiva.it/") 
    soup = BeautifulSoup(r.content, "lxml") 
    # get all the links from the initial page 
    for a in soup.select("div.testo p a[href^=http]"): 
     soup = BeautifulSoup(s.get(a["href"]).content) 
     # The link to the text is in a iframe tag retuened from the previous get. 

     text_src_link = soup.select_one("#mainFrame")["src"] 

     # Pick something to make the names unique 
     with io.open(os.path.basename(text_src_link), "w", encoding="utf-8") as f: 
      # The text is in pre tag that is in the div with the pre class 
      text = BeautifulSoup(s.get(urljoin("http://www.normattiva.it", text_src_link)).content, "html.parser")\ 
       .select_one("div.wrapper_pre pre").text 
      f.write(text)

第一个文本文件中的一个片段：

   IL PRESIDENTE DELLA REPUBBLICA 
    Visti gli articoli 76, 87 e 117, secondo comma, lettera d), della 
Costituzione; 
    Vistala legge 28 novembre 2005, n. 246 e, in particolare, 
l'articolo 14: 
    comma 14, cosi' come sostituito dall'articolo 4, comma 1, lettera 
a), della legge 18 giugno 2009, n. 69, con il quale e' stata 
conferita al Governo la delega ad adottare, con le modalita' di cui 
all'articolo 20 della legge 15 marzo 1997, n. 59, decreti legislativi 
che individuano le disposizioni legislative statali, pubblicate 
anteriormente al 1° gennaio 1970, anche se modificate con 
provvedimenti successivi, delle quali si ritiene indispensabile la 
permanenza in vigore, secondo i principi e criteri direttivi fissati 
nello stesso comma 14, dalla lettera a) alla lettera h); 
    comma 15, con cui si stabilisce che i decreti legislativi di cui 
al citato comma 14, provvedono, altresi', alla semplificazione o al 
riassetto della materia che ne e' oggetto, nel rispetto dei principi 
e criteri direttivi di cui all'articolo 20 della legge 15 marzo 1997, 
n. 59, anche al fine di armonizzare le disposizioni mantenute in 
vigore con quelle pubblicate successivamente alla data del 1° gennaio 
1970; 
    comma 22, con cui si stabiliscono i termini per l'acquisizione del 
prescritto parere da parte della Commissione parlamentare per la 
semplificazione; 
    Visto il decreto legislativo 30 luglio 1999, n. 300, recante 
riforma dell'organizzazione del Governo, a norma dell'articolo 11 
della legge 15 marzo 1997, n. 59 e, in particolare, gli articoli da 
20 a 22;

来源

2016-09-15 13:13:19

谢谢!!!!见下文！ – terzim

精彩，美妙，精彩的Padraic。有用。刚刚不得不略微编辑清除进口，但它的作品奇妙。非常感谢。我只是发现了python的潜力，并且通过这个特定的任务，你使我的旅程变得更加轻松。我不会一个人解决它。

import requests, sys 
import os 
from urllib.parse import urljoin 
from bs4 import BeautifulSoup 
import io 
user_agent = {'User-agent': 'Mozilla/5.0', 'Connection': 'keep-alive'} 

url = 'http://www.normattiva.it/atto/caricaArticolo?art.progressivo=0&art.idArticolo=1&art.versione=1&art.codiceRedazionale=047U0001&art.dataPubblicazioneGazzetta=1947-12-27&atto.tipoProvvedimento=COSTITUZIONE&art.idGruppo=1&art.idSottoArticolo1=10&art.idSottoArticolo=1&art.flagTipoArticolo=0#art' 

with requests.session() as s: 
    s.headers.update(user_agent) 
    r = s.get("http://www.normattiva.it/") 
    soup = BeautifulSoup(r.content, "lxml") 
    # get all the links from the initial page 
    for a in soup.select("div.testo p a[href^=http]"): 
     soup = BeautifulSoup(s.get(a["href"]).content) 
     # The link to the text is in a iframe tag retuened from the previous get. 

     text_src_link = soup.select_one("#mainFrame")["src"] 

     # Pick something to make the names unique 
     with io.open(os.path.basename(text_src_link), "w", encoding="utf-8") as f: 
      # The text is in pre tag that is in the div with the pre class 
      text = BeautifulSoup(s.get(urljoin("http://www.normattiva.it", text_src_link)).content, "html.parser")\ 
       .select_one("div.wrapper_pre pre").text 
      f.write(text)

来源

2016-09-15 19:39:14 terzim

在特定网站上刮掉问题

回答

相关问题