How to extract a URL by its title with Beautiful Soup?

I have a list of links that I am interested in scraping:

lis = ['https://example1.com', 'https://example2.com', ..., 'https://exampleN.com']

Each of these pages contains multiple URLs, and I want to extract only some specific ones. The URLs I am after have this form:

<a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>

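How can I check every element of lis and return, in a pandas DataFrame, the visited link from lis together with only the URL whose title is Url to news? Like this (**):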

visited_link, extracted_link 
https://www.example1.com, NaN 
https://www.example2.com, NaN 
https://www.example3.com, https://interesting-linkN.com

Note that for the elements of lis that do not contain any <a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>, I want to return NaN.

I tried this, and:

import requests
from lxml import html

def extract_jpg_url(a_link):
    page = requests.get(a_link)
    tree = html.fromstring(page.content)
    # here is the problem... not all interesting links have this xpath, how can I select by title?
    # (apparently all the jpg urls have this form: title="Url to news")
    interesting_link = tree.xpath(".//*[@id='object']//tbody//tr//td//span//a/@href")
    if len(interesting_link) == 0:
        return 'NaN'
    else:
        return 'image link ', interesting_link

Then:

    df['news_link'] = df['urls_from_lis'].apply(extract_jpg_url)

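However, the latter approach takes too long, and not all elements of lis match the given XPath (see the comments in the code). Any idea what I could do to get (**)?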
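
The question raised in the code comment (selecting by title rather than by position in the page) can be written directly in XPath. Here is a minimal sketch, assuming the wanted anchors always carry exactly title="Url to news". The helper name extract_news_links and the sample markup are made up for illustration and are not part of the original post:

```python
from lxml import html

def extract_news_links(page_source):
    """Return the href of every <a> whose title attribute is exactly 'Url to news'."""
    tree = html.fromstring(page_source)
    # Match on the title attribute instead of the element's position in the page.
    return tree.xpath("//a[@title='Url to news']/@href")

# Hypothetical markup standing in for a downloaded page.
sample = """
<html><body>
  <a href="https://junk.example.com" title="Something else">Junk</a>
  <a href="https://interesting-linkN.com" target="_blank" title="Url to news"> News JPG </a>
</body></html>
"""

print(extract_news_links(sample))
```

An empty list means the page has no matching anchor, which is the case that should become NaN in the final table.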

Source

2017-04-13 J.Do

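Any ideas on how to use multiprocessing with apply? –

Try using 'soup.find_all("a", href=True)' to extract the links –

I already tried that; the problem is that many of the links are junk... no luck. I only want the link whose title is 'Url to news' @tmadam –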

This won't return exactly what you want (NaN), but it should give you a general idea of how to get this done simply and efficiently.

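from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
import requests

def extract_urls(link):
    # Download the page and collect the href of every <a> titled "Url to news".
    r = requests.get(link)
    html = r.text
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href, so x['href'] cannot raise KeyError.
    results = soup.find_all('a', {'title': 'Url to news'}, href=True)
    results = [x['href'] for x in results]
    return (link, results)

links = [
    "https://example1.com",
    "https://example2.com",
    "https://exampleN.com",
]

# Fetch up to 10 pages concurrently; map() keeps the results in the order of links.
p = ThreadPool(10)
r = p.map(extract_urls, links)

for url, results in r:
    print(url, results)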
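
To get closer to the table in the question, the (link, results) pairs could be turned into a pandas DataFrame with NaN for pages that have no match. This is only a sketch: the helper name results_to_frame is hypothetical, the sample pairs stand in for what p.map might return, and emitting one row per extracted link is an assumption about how multiple matches should be handled.

```python
import numpy as np
import pandas as pd

def results_to_frame(pairs):
    """Build a visited_link / extracted_link table from (link, hrefs) pairs."""
    rows = []
    for link, hrefs in pairs:
        if hrefs:
            # One row per matching link found on the page.
            rows.extend((link, href) for href in hrefs)
        else:
            # No matching link on this page: record NaN.
            rows.append((link, np.nan))
    return pd.DataFrame(rows, columns=["visited_link", "extracted_link"])

# Hypothetical output of p.map(extract_urls, links).
r = [
    ("https://example1.com", []),
    ("https://example2.com", []),
    ("https://example3.com", ["https://interesting-linkN.com"]),
]
df = results_to_frame(r)
print(df)
```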

Source

2017-04-13 18:15:21 Alden

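OK, let me take a look... thanks! –

This really helped me. Is there a way to do this with 'multiprocessing' on a pandas DataFrame? ... I'll update with another idea... thanks! –

I've never used pandas, so I can't tell you. Maybe someone else can offer some insight? – Alden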
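
On the pandas follow-up, one possible approach (a sketch, not something confirmed in this thread) is to skip apply, map the column through the thread pool, and assign the result back as a new column. The column names urls_from_lis and news_link come from the question; fake_extract is a hypothetical stand-in for a network-bound function such as extract_urls, used here so the example runs offline:

```python
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd

def fake_extract(link):
    """Hypothetical stand-in for a function that downloads and parses a page."""
    known = {"https://example3.com": "https://interesting-linkN.com"}
    return known.get(link, np.nan)

df = pd.DataFrame({"urls_from_lis": [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com",
]})

# pool.map preserves input order, so the results line up with the rows.
with ThreadPool(10) as pool:
    df["news_link"] = pool.map(fake_extract, df["urls_from_lis"].tolist())

print(df)
```

Threads are used rather than processes because the work is dominated by waiting on HTTP responses.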
