
I'm writing a web scraper to pull some information off the JW Pepper website for a sheet-music database, using BeautifulSoup and Python. I need to filter specific items out of the Python list of links that BeautifulSoup generates.

Here is my code:

# a barebones program I created to scrape the description and audio file off the JW Pepper website, will eventually be used in a music database
import urllib2
import re
from bs4 import BeautifulSoup

linkgot = 0

def linkget():
    search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords="  # this is the url without the keyword that comes up when searching something
    print("enter the name of the desired piece")
    keyword = raw_input("> ")  # this will add the keyword to the url
    url = search + keyword
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    all_links = soup.findAll("a")
    link_dict = []
    item_dict = []
    for link in all_links:
        link_dict.append(link.get('href'))  # adds each link found on the page to link_dict
    item_dict.append(x for x in link_dict if '.item' in x)  # filters them according to .item
    print item_dict

linkget()

The print command returns this: [<generator object <genexpr> at 0x10ec6dc80>], and when I Google it nothing comes up.

Answer


Your filtering of the list has gone wrong. Rather than filtering with a separate loop, you can just build the list as you go, appending a link only if '.item' is present, like this:

from bs4 import BeautifulSoup
import urllib2

def linkget():
    search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords="  # this is the url without the keyword that comes up when searching something
    print("enter the name of the desired piece")
    keyword = raw_input("> ")  # this will add the keyword to the url
    url = search + keyword
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")

    link_dict = []
    item_dict = []

    for link in soup.findAll("a", href=True):
        href = link.get('href')
        link_dict.append(href)  # adds each link found on the page to link_dict

        if '.item' in href:
            item_dict.append(href)

    for href in item_dict:
        print href

linkget()

Giving you something like this:

/Festival-of-Carols/4929683.item 
/Festival-of-Carols/4929683.item 
/Festival-of-Carols/4929683.item 
... 
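As a side note, the [<generator object <genexpr> at 0x10ec6dc80>] output in the question comes from item_dict.append(x for x in link_dict if '.item' in x): that line appends the generator expression itself as a single list element rather than the filtered links. If you would rather keep the original two-step structure, a list comprehension does the filtering in one go. The snippet below is only a minimal sketch, using a hypothetical hard-coded link_dict to illustrate the difference:

# minimal sketch (not from the answer above): filter an already-built list of hrefs
# link_dict here is a hypothetical stand-in for the list built from link.get('href')
link_dict = ["/Festival-of-Carols/4929683.item", "/feedback.html", None]

# a list comprehension evaluates immediately and returns a plain list,
# unlike the bare generator expression passed to append() in the question
item_dict = [href for href in link_dict if href and '.item' in href]

print item_dict  # prints: ['/Festival-of-Carols/4929683.item']

The if href guard plays the same role as href=True in the answer's findAll call, since link.get('href') returns None for anchors that have no href attribute.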

Thanks for your help. It seems I still have a lot to learn about filtering. –


You're welcome! Don't forget to click the grey tick on the answer to accept it as the solution (and earn a badge). –