2016-07-28 69 views

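Trying to log in to and scrape a website through ASP.NET

I wrote a program that logs in to one of my company's websites and then scrapes data, with the aim of making data collection faster. It uses requests and BeautifulSoup.

I can get it to print the HTML of a page, but I can't get it to log in past the ASPX login page and then print the HTML of the page behind it.

Below is the code I'm using, along with my headers and params. Any help would be appreciated.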

import requests
from bs4 import BeautifulSoup

URL="http://mycompanywebsiteloginpage.co.uk/Login.aspx"
headers={"User-Agent":"Mozilla/5.0 (X11; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0 Iceweasel/44.0.2"}

username="myusername"
password="mypassword"

s=requests.Session()
s.headers.update(headers)
r=s.get(URL)
soup=BeautifulSoup(r.content)

VIEWSTATE=soup.find(id="__VIEWSTATE")['value']
EVENTVALIDATION=soup.find(id="__EVENTVALIDATION")['value']
EVENTTARGET=soup.find(id="__EVENTTARGET")['value']
EVENTARGUEMENT=soup.find(id="__EVENTARGUMENT")['value']

login_data={"__VIEWSTATE":VIEWSTATE,
"ctl00$ContentPlaceHolder1$_tbEngineerUsername":username,
"ctl00$ContentPlaceHolder1$_tbEngineerPassword":password,
"ctl00$ContentPlaceHolder1$_tbSiteOwnerEmail":"",
"ctl00$ContentPlaceHolder1$_tbSiteOwnerPassword":"",
"ctl00$ContentPlaceHolder1$tbAdminName":username,
"ctl00$ContentPlaceHolder1$tbAdminPassword":password,
"__EVENTVALIDATION":EVENTVALIDATION,
"__EVENTTARGET":EVENTTARGET,
"--EVENTARGUEMENT":EVENTARGUEMENT}

r = s.post(URL, data=login_data)
r = requests.get("http://mycompanywebsitespageafterthelogin.co.uk/Secure/")
print (r.url)
print (r.text)

Form data

__VIEWSTATE:"DAwNEAIAAA4BBQAOAQ0QAgAADgEFAw4BDRACDwEBBm9ubG9hZAFkU2hvd1BhbmVsKCdjdGwwMF9Db250ZW50UGxhY2VIb2xkZXIxX19wbkFkbWluaXN0cmF0b3JzJywgZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoJ2FkbWluTG9naW5MaW5rJykpOwAOAQUBDgENEAIAAA4DBQEFBwULDgMNEAIMDwEBDUFsdGVybmF0ZVRleHQBDldEU0kgRGFzaGJvYXJkAAAAAA0QAgAADgIFAAUBDgINEAIPAQEEVGV4dAEEV0RTSQAAAA0QAgwPAQEHVmlzaWJsZQgAAAAADRACDwECBAABBFdEU2kAAAAAAABCX8QugS7ztoUJMfDmZ0s20ZNQfQ=="
ctl00$ContentPlaceHolder1$_tbEngineerUsername:"myusername"
ctl00$ContentPlaceHolder1$_tbEngineerPassword:"mypassword"
ctl00$ContentPlaceHolder1$_tbSiteOwnerEmail:""
ctl00$ContentPlaceHolder1$_tbSiteOwnerPassword:""
ctl00$ContentPlaceHolder1$tbAdminName:"myusername"
ctl00$ContentPlaceHolder1$tbAdminPassword:"mypassword"
__EVENTVALIDATION:"HQABAAAA/////wEAAAAAAAAADwEAAAAKAAAACBzHEFXh+HCtf3vdl8crWr6QZnmaeK7pMzThEoU2hwqJxnlkQDX2XLkLAOuKEnW/qBMtNK2cdpQgNxoGtq65"
__EVENTTARGET:"ctl00$ContentPlaceHolder1$_btAdminLogin"
__EVENTARGUMENT:""
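
For illustration only, the hidden field names in a captured form like the one above could be gathered from the page instead of being typed by hand. The sketch below uses the standard-library `html.parser` rather than BeautifulSoup; `HiddenFieldParser`, `hidden_fields`, and the sample markup are hypothetical and not taken from the real login page. The captured request names the field `__EVENTARGUMENT`, while the script posts `--EVENTARGUEMENT`; collecting names from the page itself would avoid that kind of mismatch, though whether it matters for this site is not known.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # A hidden input with no value is treated as an empty string.
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of every hidden input found in the given HTML."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Made-up markup standing in for the real login page.
sample = """
<form method="post" action="Login.aspx">
  <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="text" name="ctl00$ContentPlaceHolder1$tbAdminName" />
</form>
"""

payload = hidden_fields(sample)
# Visible fields such as the user name are added by hand.
payload["ctl00$ContentPlaceHolder1$tbAdminName"] = "myusername"
print(sorted(payload))
```

With the real page, the HTML passed to `hidden_fields` would be `r.text` from the session's GET of the login URL.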

Request cookies

ASP.NET_SessionId:"11513CDDE31AF267CCD87BAB"

Response headers

Cache-Control:"private"
Connection:"Keep-Alive"
Content-Length:"123"
Content-Type:"text/html; charset=utf-8"
Date:"Thu, 28 Jul 2016 13:37:45 GMT"
Keep-Alive:"timeout=15, max=91"
Location:"/Secure/"
Server:"Apache/2.2.14 (Ubuntu)"
x-aspnet-version:"2.0.50727"

Request headers

Host:"mycompanywebsite.co.uk"
User-Agent:"Mozilla/5.0 (X11; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0 Iceweasel/44.0.2"
Accept:"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
Accept-Language:"en-US,en;q=0.5"
Accept-Encoding:"gzip, deflate"
Referer:"http://mycompanywebsiteloginpage/Login.aspx"
Cookie:"ASP.NET_SessionId=F11CB47B137ADB66D2274758"
Connection:"keep-alive"


That is a lot of code for one question. Can you specify which parts aren't working for you?

Answer


Change the line

r = requests.get("http://mycompanywebsitespageafterthelogin.co.uk/Secure/")

so that it uses the session object:

r = s.get("http://mycompanywebsitespageafterthelogin.co.uk/Secure/") 
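
To see why this matters, here is a minimal offline sketch, assuming the login succeeded and the server set an `ASP.NET_SessionId` cookie on the session. The cookie value `PLACEHOLDER` and the names `SECURE_URL`, `via_session`, and `standalone` are made up for illustration; no request is actually sent.

```python
import requests

SECURE_URL = "http://mycompanywebsitespageafterthelogin.co.uk/Secure/"

s = requests.Session()
# Stand-in for the cookie the server would set in response to s.post(...).
s.cookies.set("ASP.NET_SessionId", "PLACEHOLDER")

# Prepared through the session: the session's cookie jar is attached.
via_session = s.prepare_request(requests.Request("GET", SECURE_URL))

# Prepared on its own, much like a bare requests.get(...), which starts
# from an empty cookie jar and so carries no login cookie.
standalone = requests.Request("GET", SECURE_URL).prepare()

print(via_session.headers.get("Cookie"))
print(standalone.headers.get("Cookie"))
```

Only the request prepared through the session carries the session cookie, which is presumably why the standalone request was treated as not logged in.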

Thanks mate, that worked – ipmev12