Use an already-open web page (opened with Selenium) with BeautifulSoup?

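I have a web page that was opened and logged into using webdriver code. I am using webdriver for this because the page requires a login and various other actions before I am set up to scrape.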

The aim is to scrape data from this open page. I need to find links and open them, so there will be a lot of back-and-forth between selenium webdriver and BeautifulSoup.

I looked at the bs4 documentation, but the BeautifulSoup(open("ccc.html")) pattern throws an error when I give it my URL:

soup = bs4.BeautifulSoup(open("https://m/search.mp?ss=Pr+Dn+Ts"))

OSError: [Errno 22] Invalid argument: ' https://m/search.mp?ss=Pr+Dn+Ts '

I assume this is because it is not a .html file?

2017-01-23 Sid

See "How to get innerHTML of whole page in selenium driver?" (https://stackoverflow.com/questions/35905517/how-to-get-innerhtml-of-whole-page-in-selenium-driver) – robyschek

You are trying to open a page by its web address. open() will not do that, since it only opens local files; use urlopen() instead:

import bs4
from urllib.request import urlopen  # Python 3
# from urllib2 import urlopen  # Python 2

url = "your target url here"
soup = bs4.BeautifulSoup(urlopen(url), "html.parser")

Or use the "HTTP for Humans" requests library:

import bs4
import requests

url = "your target url here"
response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, "html.parser")

Also note that it is strongly advisable to specify a parser explicitly. I've used html.parser in this case; other parsers are available.
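To make the parser point concrete, here is a minimal sketch that parses a small hard-coded HTML string with the built-in html.parser. The markup and the variable names are made up for illustration; other parsers such as lxml or html5lib have to be installed separately before they can be named here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for a downloaded page.
html = (
    "<html><head><title>Search results</title></head>"
    "<body><a href='/item/1'>First</a></body></html>"
)

# Naming the parser explicitly keeps the result consistent across machines;
# without it, bs4 guesses from whatever parsers are installed and warns about it.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string        # text inside <title>
first_link = soup.find("a")["href"]   # href of the first <a> tag
```

The same call works whether the markup comes from a string, a file handle, or a response body.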

I want to use the exact same page(same instance)

A common way to do this is to get driver.page_source and pass it to BeautifulSoup for further parsing:

from bs4 import BeautifulSoup
from selenium import webdriver

url = "your target url here"

driver = webdriver.Firefox()
driver.get(url)

# wait for page to load..

source = driver.page_source  # HTML of the page currently open in the browser
driver.quit()  # remove this line to leave the browser open

soup = BeautifulSoup(source, "html.parser")
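Since the stated goal is to find links on the already-open page and visit them, the same idea could be extended roughly as follows. This is only a sketch under assumptions: `driver` is taken to be your existing, logged-in WebDriver instance, and the helper names `links_on_current_page` and `scrape_linked_pages` are invented here for illustration. It relies only on `driver.page_source`, `driver.current_url` and `driver.get()`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def links_on_current_page(driver):
    """Return absolute URLs for every <a href> on the page the driver is showing."""
    soup = BeautifulSoup(driver.page_source, "html.parser")
    base = driver.current_url
    # href=True skips anchors without an href; urljoin resolves relative links.
    return [urljoin(base, a["href"]) for a in soup.find_all("a", href=True)]


def scrape_linked_pages(driver, limit=5):
    """Visit up to `limit` links in the same browser and return {url: page title}."""
    titles = {}
    # Collect the links first, because page_source changes after each driver.get().
    for link in links_on_current_page(driver)[:limit]:
        driver.get(link)  # same browser instance, so login cookies stay available
        soup = BeautifulSoup(driver.page_source, "html.parser")
        titles[link] = soup.title.string if soup.title else None
    return titles
```

With a driver that is still open, you would call `scrape_linked_pages(driver)` before any `driver.quit()`. Pages that load content via JavaScript may need an explicit wait before `page_source` is read.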


2017-01-23 17:17:38 alecxe

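I guess I didn't explain this correctly: the page is already open. :( I want to use the exact same page (same instance) that selenium opened. In both examples, I assume a new URL-based request is opening/fetching the data. – Sid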

@Sid okay, I've updated the answer - please see if this is what you meant. Thanks. – alecxe

The third one is exactly what I was looking for. :) Thanks – Sid
