How can I make sure I am on the "About Us" page of a particular website?

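Below is a snippet of code I am using to retrieve all the links from a website, given the URL of its homepage. How can I make sure I end up on the "About Us" page of that particular website?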

import requests
from BeautifulSoup import BeautifulSoup

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content))


def getURL(page):
    # Find the next 'a href="..."' and return its URL and the end position.
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

# Keep extracting links from the remaining text until none are left.
while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print(url)
    else:
        break

The result is:

/uconnect 
# 
/
/
/
/nanodegree 
/courses/all 
# 
/legal/tos 
/nanodegree 
/courses/all 
/nanodegree 
uconnect 
/
/course/machine-learning-engineer-nanodegree--nd009 
/course/data-analyst-nanodegree--nd002 
/course/ios-developer-nanodegree--nd003 
/course/full-stack-web-developer-nanodegree--nd004 
/course/senior-web-developer-nanodegree--nd802 
/course/front-end-web-developer-nanodegree--nd001 
/course/tech-entrepreneur-nanodegree--nd007 
http://blog.udacity.com 
http://support.udacity.com 
/courses/all 
/veterans 
https://play.google.com/store/apps/details?id=com.udacity.android 
https://itunes.apple.com/us/app/id819700933?mt=8 
/us 
/press 
/jobs 
/georgia-tech 
/business 
/employers 
/success 
# 
/contact 
/catalog-api 
/legal 
http://status.udacity.com 
/sitemap/guides 
/sitemap 
https://twitter.com/udacity 
https://www.facebook.com/Udacity 
https://plus.google.com/+Udacity/posts 
https://www.linkedin.com/company/udacity 

Process finished with exit code 0

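What I want is just the URL of a website's "About Us" page, which differs from site to site. For example:

For Udacity it is https://www.udacity.com/us

For artscape-inc it is https://www.artscape-inc.com/about-decorative-window-film/

I could try searching the URLs for a keyword such as "about", but as shown above, that approach would miss Udacity. Can anyone suggest a good approach?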

Source

2016-04-23 x0v

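It's unlikely that there *is* a good way - websites are free to put their "about me" equivalent wherever they want (or nowhere at all) and to call it whatever they like. – jonrsharpe

@jonrsharpe: yes, fair enough, but how can I at least reduce the number of false positives? – x0v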

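It is not easy to cover every possible variation of an "About us" page link, but here is an initial idea that should work in both of the cases you mention - check for "about" in the href attribute or in the text of each a element: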

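def about_links(elm):
    # Match <a> elements with "about" in the href or in the link text.
    # elm.get() avoids a KeyError on anchors that have no href attribute.
    return elm.name == "a" and ("about" in elm.get("href", "").lower() or
                                "about" in elm.get_text().lower())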

Usage:

soup.find_all(about_links) # or soup.find(about_links)
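
To illustrate why checking the link text as well as the href matters for the Udacity case, here is a minimal sketch that applies the same test to plain (href, text) pairs and resolves relative links against the homepage URL. The function names looks_like_about and about_urls are hypothetical, and the link texts in the sample data are assumed for illustration rather than scraped from the site.

```python
from urllib.parse import urljoin


def looks_like_about(href, text):
    """Same test as about_links: "about" in the href or in the link text."""
    return "about" in href.lower() or "about" in text.lower()


def about_urls(base_url, links):
    """Return absolute URLs for the (href, text) pairs that pass the test."""
    return [urljoin(base_url, href) for href, text in links
            if looks_like_about(href, text)]


# Hypothetical (href, text) pairs; the link texts are assumptions.
links = [
    ("/courses/all", "Catalog"),
    ("/us", "About"),
    ("/jobs", "Jobs"),
]
print(about_urls("https://www.udacity.com", links))
# prints ['https://www.udacity.com/us']
```

The "/us" link is only caught because its (assumed) text contains "about"; a check on the href alone would miss it.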

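To reduce the number of false positives, you can also check only the "footer" part of the page. For example, find the footer element, or an element with id="footer" or with a footer class.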
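
The footer restriction could be sketched roughly as follows. To keep the example runnable without extra installs it uses html.parser from the standard library rather than BeautifulSoup, and it assumes reasonably well-nested HTML (unclosed tags such as a bare <li> would throw off the depth tracking). The class name FooterLinkParser and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FooterLinkParser(HTMLParser):
    """Collect (href, text) for every <a> inside a footer-like element."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # current element nesting depth
        self.footer_depth = None  # depth at which the footer element opened
        self.href = None          # href of the <a> currently being read
        self.text = []            # text chunks of that <a>
        self.links = []           # collected (href, text) pairs

    @staticmethod
    def _is_footer(tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        return (tag == "footer" or attrs.get("id") == "footer"
                or "footer" in classes)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.depth += 1
        if self.footer_depth is None and self._is_footer(tag, attrs):
            self.footer_depth = self.depth
        if tag == "a" and self.footer_depth is not None:
            self.href = dict(attrs).get("href") or ""
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "a" and self.href is not None:
            self.links.append((self.href, "".join(self.text).strip()))
            self.href = None
        if self.footer_depth == self.depth:
            self.footer_depth = None  # the footer element just closed
        self.depth -= 1


# Hypothetical page: one misleading nav link, the real link in the footer.
html = """
<body>
  <nav><a href="/about-our-courses">About our courses</a></nav>
  <div id="footer">
    <img src="logo.png">
    <a href="/us">About</a>
    <a href="/jobs">Jobs</a>
  </div>
</body>
"""
parser = FooterLinkParser()
parser.feed(html)
about = [href for href, text in parser.links
         if "about" in href.lower() or "about" in text.lower()]
print(about)
# prints ['/us']
```

The navigation link also contains "about", but it is never collected because it sits outside the footer element.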

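Another idea, which sort of "outsources" the definition of an "about us" page, would be to google (from your script, of course) "about" + "webpage url" and grab the first search result.

As a side note, I've noticed you are still using BeautifulSoup version 3 - it is no longer developed or maintained, and you should switch to BeautifulSoup 4 as soon as possible. Install it with: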

pip install --upgrade beautifulsoup4

and change your import to:

from bs4 import BeautifulSoup

Source

2016-04-23 20:59:18 alecxe

Also, here is a related thread: http://stackoverflow.com/a/28145856/771848. – alecxe
