
I want to go to http://www.medhelp.org/forums/list, which has a long list of links to different diseases. Inside each of those links there are several pages, and each page contains some links that I want to extract from the site.

I want to collect all of those links, so I am using this code:

import urllib.request
from bs4 import BeautifulSoup as bs

myArray = []
html_page = urllib.request.urlopen("http://www.medhelp.org/forums/list")
soup = bs(html_page, 'html.parser')
temp = soup.findAll('div', attrs={'class': 'forums_link'})
for div in temp:
    myArray.append('http://www.medhelp.org' + div.a['href'])

myArray_for_questions = []
myPages = []

# This loop goes over every link on the main page; in this case, all diseases.
for link in myArray:

    # "link" is the URL of one disease forum from the main page
    html_page = urllib.request.urlopen(link)
    soup1 = bs(html_page, 'html.parser')

    # Get the question links on the first page of this forum
    temp = soup1.findAll('div', attrs={'class': 'subject_summary'})
    for div in temp:
        myArray_for_questions.append('http://www.medhelp.org' + div.a['href'])

    # Now get the URLs of all the next pages of this forum
    pages = soup1.findAll('a', href=True, attrs={'class': 'page_nav'})
    for l in pages:
        html_page_t = urllib.request.urlopen('http://www.medhelp.org' + l.get('href'))
        soup_t = bs(html_page_t, 'html.parser')
        other_pages = soup_t.findAll('a', href=True, attrs={'class': 'page_nav'})
        for p in other_pages:
            mystr = 'http://www.medhelp.org' + p.get('href')
            if mystr not in myPages:
                myPages.append(mystr)
            if p not in pages:
                # extends the list being iterated, so newly found
                # pagination links are visited too
                pages.append(p)

    # Get all links inside these pages, which are people's questions.
    # Note that myPages is never reset, so every page collected so far
    # is fetched again on each pass of the outer loop.
    for page in myPages:
        html_page1 = urllib.request.urlopen(page)
        soup2 = bs(html_page1, 'html.parser')
        temp = soup2.findAll('div', attrs={'class': 'subject_summary'})
        for div in temp:
            myArray_for_questions.append('http://www.medhelp.org' + div.a['href'])

But it takes forever to get all the links I want from all the pages. Any ideas?
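I suspect part of the slowness is that myPages is never reset between diseases, so every page collected so far is downloaded again on each pass of the outer loop, and each pagination page is fetched once just to read its nav links and then again to scrape it. Below is a rough sketch of how I think the crawl could be restructured so that each page is fetched and parsed exactly once. The get_soup and question_links helpers are just names I made up; the 'forums_link', 'subject_summary' and 'page_nav' selectors are the same ones as in the code above. Is this the right direction?

import urllib.request
from bs4 import BeautifulSoup as bs

BASE = 'http://www.medhelp.org'

def get_soup(url):
    # Fetch and parse a page in one place.
    return bs(urllib.request.urlopen(url), 'html.parser')

def question_links(soup):
    # All question links found on a single forum page.
    return [BASE + div.a['href']
            for div in soup.findAll('div', attrs={'class': 'subject_summary'})]

questions = []
main = get_soup(BASE + '/forums/list')
for div in main.findAll('div', attrs={'class': 'forums_link'}):
    forum_url = BASE + div.a['href']

    # Breadth-first walk over this forum's pagination; the 'seen' set
    # guarantees each page is downloaded exactly once.
    seen = {forum_url}
    queue = [forum_url]
    while queue:
        url = queue.pop(0)
        soup = get_soup(url)
        questions.extend(question_links(soup))
        for a in soup.findAll('a', href=True, attrs={'class': 'page_nav'}):
            next_url = BASE + a.get('href')
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)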

Thanks


This is too broad. Please show us what you have tried so far and narrow down your question. – rowana


When asking a question, you would normally include the code you have attempted and are stuck on, or ask for help understanding code (with example snippets) that you found while researching the topic. – gavsta707


I haven't started yet. I just want to write a purpose-built web crawler; the reason is that this forum holds a lot of questions, and we need to save all of them, for all these diseases, in a file. – Sanaz

Answers