
Creating a list of URLs from a specific website

This is my first attempt at using programming for something useful, so please bear with me. Constructive feedback is very much appreciated :)

I am building a database of all the press releases from the European Parliament. So far I have built a scraper that can retrieve the data I want from one specific URL. However, after working through several tutorials, I still cannot figure out how to create a list of URLs covering all the press releases on this particular site.

Maybe it has to do with the way the website is built, or I am (probably) just missing something obvious that an experienced programmer would spot right away, but I really do not know how to proceed from here.

This is the start URL: http://www.europarl.europa.eu/news/en/press-room

This is my code:

import re 
import time 
from random import randint 
from urllib.request import urlopen as uReq 
from bs4 import BeautifulSoup as soup 

links = [] # Until now I have just manually pasted a few links 
           # into this list, but I need it to contain all the URLs to scrape 

# Function for removing html tags from text 
TAG_RE = re.compile(r'<[^>]+>') 
def remove_tags(text): 
    return TAG_RE.sub('', text) 

# Regex to match dates with pattern DD-MM-YYYY 
date_match = re.compile(r'\d\d-\d\d-\d\d\d\d') 

# Output file for the scraped data (file name assumed; the original snippet never opens f) 
f = open("press_releases.csv", "w", encoding="utf-8") 

# For-loop to scrape variables from site 
for link in links: 

    # Opening up connection and grabbing page 
    uClient = uReq(link) 

    # Saves content of page in new variable (still in HTML!!) 
    page_html = uClient.read() 

    # Close connection 
    uClient.close() 

    # Parsing page with soup 
    page_soup = soup(page_html, "html.parser") 

    # Grabs page 
    pr_container = page_soup.findAll("div",{"id":"website"}) 

    # Scrape date 
    date_container = pr_container[0].time 
    date = date_container.text 
    date = date_match.search(date) 
    date = date.group() 

    # Scrape title 
    title = page_soup.h1.text 
    title_clean = title.replace("\n", " ") 
    title_clean = title_clean.replace("\xa0", "") 
    title_clean = ' '.join(title_clean.split()) 
    title = title_clean 

    # Scrape institutions involved 
    type_of_question_container = pr_container[0].findAll("div", {"class":"ep_subtitle"}) 
    text = type_of_question_container[0].text 
    question_clean = text.replace("\n", " ") 
    question_clean = question_clean.replace("\xa0", " ") 
    question_clean = re.sub("\d+", "", question_clean) # Redundant? 
    question_clean = question_clean.replace("-", "") 
    question_clean = question_clean.replace(":", "") 
    question_clean = question_clean.replace("Press Releases"," ") 
    question_clean = ' '.join(question_clean.split()) 
    institutions_mentioned = question_clean 

    # Scrape text 
    text_container = pr_container[0].findAll("div", {"class":"ep-a_text"}) 
    text_with_tags = str(text_container) 
    text_clean = remove_tags(text_with_tags) 
    text_clean = text_clean.replace("\n", " ") 
    text_clean = text_clean.replace(",", " ") # Removing commas to avoid trouble with .csv-format later on 
    text_clean = text_clean.replace("\xa0", " ") 
    text_clean = ' '.join(text_clean.split()) 

    # Calculate word count 
    word_count = len(text_clean.split()) 
    word_count = str(word_count) 

    print("Finished scraping: " + link) 

    time.sleep(randint(1,5)) 

    f.write(date + "," + title + ","+ institutions_mentioned + "," + word_count + "," + text_clean + "\n") 

# Close the output file once all links have been processed 
f.close() 

In HTML there is a standard way of holding URLs: the src, href and action attributes. src belongs to tags like ('script', 'img', 'source', 'video'), href to ('a', 'link', 'area', 'base') and action to ('form', 'input'). First you extract those tags, then you pull the src, href or action attribute out of each of them (no need to parse anything else or clean up dirty strings). With this approach you can extract all the standard HTML URLs, and you can do it with the BeautifulSoup module and two for loops! – DRPK
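A minimal sketch of that idea (assuming the requests and BeautifulSoup packages are available; the tag groups below simply mirror the ones listed in the comment):

import requests 
from bs4 import BeautifulSoup 

url = "http://www.europarl.europa.eu/news/en/press-room"  # start URL from the question 
page_soup = BeautifulSoup(requests.get(url).content, "html.parser") 

# Map each URL-carrying attribute to the tags that use it 
attr_to_tags = {"src": ["script", "img", "source", "video"], 
                "href": ["a", "link", "area", "base"], 
                "action": ["form"]} 

found_urls = [] 
for attr, tags in attr_to_tags.items():      # first loop: attribute groups 
    for tag in page_soup.find_all(tags):     # second loop: matching tags 
        if tag.has_attr(attr): 
            found_urls.append(tag[attr]) 

print(found_urls) 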

Answers


Here is a simple way to get the required list of links with python-requests and lxml:

from lxml import html 
import requests 
url = "http://www.europarl.europa.eu/news/en/press-room/page/" 
list_of_links = [] 
for page in range(10):  # first 10 result pages; increase the range for more links 
    r = requests.get(url + str(page)) 
    source = r.content 
    page_source = html.fromstring(source) 
    list_of_links.extend(page_source.xpath('//a[@title="Read more"]/@href')) 
print(list_of_links) 

Thank you very much for your feedback. I was wondering if you could clarify how I can tell whether a website is dynamic? Your method works for the first 15 links on the initial URL, but do I need the Selenium module to "click" the Load More button? –


If the content is present in the page source, it is static content; if it is generated by JavaScript, it is dynamic content. Simply put, you can check the page source by right-clicking the web page in your browser: if you can find the required content there, it is static; if not, it is dynamic. – Andersson
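For reference, the same check can be run programmatically (a sketch, not part of the original comment), by looking for the 'Read more' anchors from the answer above in the raw HTML:

import requests 

raw_html = requests.get("http://www.europarl.europa.eu/news/en/press-room").text 

# If the marker is already in the raw HTML, the content is static; 
# if it only appears after JavaScript runs in the browser, it is dynamic. 
print('title="Read more"' in raw_html) 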


@DanielHansen, you can have a look at the updated answer, which works for the first 10 pages (150 links). You can set a larger range or replace the 'for' loop with a 'while' loop – Andersson
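A rough sketch of that 'while' variant (assuming the site simply stops returning "Read more" links once you request a page past the last one):

from lxml import html 
import requests 

url = "http://www.europarl.europa.eu/news/en/press-room/page/" 
list_of_links = [] 
page = 0 
while True: 
    source = requests.get(url + str(page)).content 
    links = html.fromstring(source).xpath('//a[@title="Read more"]/@href') 
    if not links:   # no more results: stop paginating 
        break 
    list_of_links.extend(links) 
    page += 1 

print(len(list_of_links)) 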


Edit: the first 15 URLs can be obtained without using the Selenium module.


You cannot use urllib.request (which I assume is what you are using) to get the press release URLs, because the content of this site is loaded dynamically.

You can try the Selenium module instead:

from bs4 import BeautifulSoup 
from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.webdriver.support.ui import WebDriverWait 

driver = webdriver.Firefox() 
driver.get('http://www.europarl.europa.eu/news/en/press-room') 

# Click "Load More", repeat these as you like 
WebDriverWait(driver, 50).until(EC.visibility_of_element_located((By.ID, "continuesLoading_button"))) 
driver.find_element_by_id("continuesLoading_button").click() 

# Get urls 
soup = BeautifulSoup(driver.page_source, "html.parser") 
urls = [a["href"] for a in soup.select(".ep_gridrow-content .ep_title a")] 
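If you need more than one extra batch, the wait-and-click step can be wrapped in a loop (a sketch; the element id "continuesLoading_button" is taken from the answer above and the number of clicks is arbitrary):

# Click "Load More" several times before collecting the links 
for _ in range(5):   # number of extra batches to load, adjust as needed 
    WebDriverWait(driver, 50).until(EC.visibility_of_element_located((By.ID, "continuesLoading_button"))) 
    driver.find_element_by_id("continuesLoading_button").click() 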

Nope. This content is not dynamic – Andersson


You can read the official BeautifulSoup documentation to get better at scraping. You should also check out Scrapy.

Below is a simple snippet that grabs the required links from that page.
In the following example I use the Requests library. Let me know if you have any other questions.

This script, however, will not click "Load More" and load the additional releases.
I will leave that to you ;) (hint: use Selenium or Scrapy)

import requests 
from bs4 import BeautifulSoup 

def scrape_press(url): 
    page = requests.get(url) 

    if page.status_code == 200: 
        urls = list() 
        soup = BeautifulSoup(page.content, "html.parser") 
        body = soup.find_all("h3", {"class": ["ep-a_heading", "ep-layout_level2"]}) 
        for b in body: 
            links = b.find_all("a", {"title": "Read more"}) 
            if len(links) == 1: 
                link = links[0]["href"] 
                urls.append(link) 

        # Printing the scraped links 
        for _ in urls: 
            print(_) 
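A possible way to call it, using the start URL from the question:

scrape_press("http://www.europarl.europa.eu/news/en/press-room") 

If you want to reuse the links instead of just printing them, you could make the function return urls and assign the result to the links list from the question.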

Note: you should only scrape data from a website if it is legal to do so.


You can grab the links with requests and BeautifulSoup in just six lines of code. Although the script is almost the same as Sir Andersson's, the library and the way it is used here are slightly different.

import requests ; from bs4 import BeautifulSoup 

base_url = "http://www.europarl.europa.eu/news/en/press-room/page/{}" 
for url in [base_url.format(page) for page in range(10)]: 
    soup = BeautifulSoup(requests.get(url).text,"lxml") 
    for link in soup.select('[title="Read more"]'): 
        print(link['href'])