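Filtering specific items from a Python list of links generated by BeautifulSoup
I am writing a web scraper to extract some information from the JW Pepper website for a sheet music database. I am using BeautifulSoup and Python to do this.
Here is my code: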
# a barebones program I created to scrape the description and audio file off the JW Pepper website, will eventually be used in a music database
import urllib2
import re
from bs4 import BeautifulSoup
linkgot = 0
def linkget():
    search = "http://www.jwpepper.com/sheet-music/search.jsp?keywords=" # the search url without the keyword
    print("enter the name of the desired piece")
    keyword = raw_input("> ") # the keyword that gets added to the url
    url = search + keyword
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page)
    all_links = soup.findAll("a")
    link_dict = []
    item_dict = []
    for link in all_links:
        link_dict.append(link.get('href')) # adds each link found on the page to link_dict
    item_dict.append(x for x in link_dict if '.item' in x) # meant to keep only the links containing .item
    print item_dict
linkget()
The print statement returns this: [<generator object <genexpr> at 0x10ec6dc80>], and googling it turned up nothing.
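The printed value is a generator object: `item_dict.append(x for x in ...)` appends the generator expression itself as a single element instead of the items it would produce. Below is a minimal sketch of one way to get an actual list, using a list comprehension. The helper name `filter_item_links` and the sample hrefs are made up for illustration, and the snippet is written for Python 3, although the comprehension itself behaves the same on Python 2.

```python
def filter_item_links(hrefs):
    """Return only the hrefs that contain '.item'."""
    # Square brackets build the list immediately; parentheses would create
    # a lazy generator object, which is what the print statement showed.
    # The `h and` guard skips None, which link.get('href') returns for
    # <a> tags that have no href attribute.
    return [h for h in hrefs if h and '.item' in h]


# Hypothetical hrefs standing in for the link.get('href') results.
sample_hrefs = [
    "/sheet-music/search.jsp",
    "/Example-Piece/12345.item",
    None,
    "/Another-Piece/67890.item",
]

item_links = filter_item_links(sample_hrefs)
print(item_links)
```

In the function from the question, this would probably mean replacing the `item_dict.append(...)` line with `item_dict = [x for x in link_dict if x and '.item' in x]`.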
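Thanks for your help. It seems I still have a lot to learn about filtering.
You're welcome! Don't forget to click the gray check mark next to the answer to accept it as the solution (and earn a badge).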