I'm new to Python. I'm building a crawler for the company I work for. When it crawls the company's site, some of the internal links are not in full URL form, and I only get the path instead of the whole link. In short: how can I get the full link from BeautifulSoup rather than just the internal link? If I'm not being clear, run the code I wrote:

import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    # Download the page and parse it with BeautifulSoup
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page)
    # Print every href exactly as it appears in the page
    for link in soup.find_all('a'):
        print (link.get('href'))
    print soup


print get_first_page('http://www.fashionroom.com.br')
print web_page_string
What do you mean by the whole link? – 2015-04-05 14:38:00

print seed + '/' + link.get('href')? – Selcuk 2015-04-05 14:38:22

In the example above I want to get http://www.fashionroom.com.br/indexnew.html, but instead I only get indexnew.html – michelfashionroom 2015-04-05 14:41:20

Answer

Following up on everyone's answers, I tried putting an if in the script. If anyone sees a potential problem I might run into later, please let me know:

import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page)
    final_page_string = soup.get_text()
    for link in soup.find_all('a'):
        # Links that already start with 'http' are printed as-is;
        # everything else is treated as relative and joined onto the seed
        if (link.get('href'))[0:4] == 'http':
            print (link.get('href'))
        else:
            print seed+'/'+(link.get('href'))
    print final_page_string


print get_first_page('http://www.fashionroom.com.br')
print web_page_string
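
A note on the prefix check above: link.get('href') returns None for anchor tags that have no href attribute, and seed + '/' + href can build wrong URLs for root-relative or parent-relative paths. The standard-library urlparse.urljoin handles those cases. Below is a minimal sketch in the same Python 2 style as the code above (same seed URL and function name, kept only for comparison; the example paths in the comments are hypothetical):

import urllib2
from urlparse import urljoin  # standard library in Python 2
from bs4 import BeautifulSoup

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    soup = BeautifulSoup(response.read())
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is None:
            # skip <a> tags that have no href attribute
            continue
        # urljoin resolves relative paths (e.g. 'indexnew.html', '/some/dir',
        # '../page') against the seed and leaves absolute URLs untouched
        print urljoin(seed, href)

get_first_page('http://www.fashionroom.com.br')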