Difficulty extracting data from an HTML document with BeautifulSoup

I am trying to extract data from a web page and finding it quite difficult. I tried soup.get_text(), but it does not help much, because it only gives me single characters rather than whole string objects.

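Extracting the name is easy, because it can be reached through the 'b' tag, but extracting, for example, the street ("Am Vogelwäldchen 2") proves rather difficult. I could try to assemble the address from the individual characters, but that seems overly complicated, and I feel there must be a simpler way to do this. Maybe someone has a better idea. Oh, and never mind the odd function: I return the soup because I was trying out different approaches.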

import urllib.request 
import time 

from bs4 import BeautifulSoup 


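# Sends the HTTP request (a GET, since no data is attached), passes the response to BeautifulSoup and returns the result
def doRequest(request):
    requestResult = urllib.request.urlopen(request)
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(requestResult, "html.parser")
    return soup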

def getContactInfoFromPage(page): 
    name = '' 
    straße = '' 
    plz = '' 
    stadt = '' 
    telefon = '' 
    mail = '' 
    url = '' 

    data = [ 
      #'Name', 
      #'Straße', 
      #'PLZ', 
      #'Stadt', 
      #'Telefon', 
      #'E-Mail', 
      #'Homepage' 
      ] 

    request = urllib.request.Request("http://www.altenheim-adressen.de/schnellsuche/" + page) 
    request.add_header("Content-Type", "application/x-www-form-urlencoded;charset=utf-8") 
    request.add_header("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0") 
    soup = doRequest(request) 

    #Save Name to data structure 
    findeName = soup.findAll('b') 
    name = findeName[2] 
    name = name.string.split('>') 

    data.append(name) 


    return soup 


soup = getContactInfoFromPage("suche2.cfm?id=267a0749e983c7edfeef43ef8e1c7422") 

print(soup.getText())

Source

2014-11-23 Fresh Prince

Thanks, I will try that when I get home. – 2014-11-23 18:43:47

You can rely on the field labels and get the text of each label's next sibling.
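As a rough illustration of that idea, here is a minimal sketch run against a small inline table. The markup is an assumption made up for this example (the real page was not inspected here); it only presumes that each label sits in a 'td' directly followed by a 'td' holding the value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; its actual structure is assumed.
html = """
<table>
  <tr><td>Name:</td><td><b>AWO-Seniorenzentrum Kenten</b></td></tr>
  <tr><td>Land:</td><td>Deutschland</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the cell whose only content is the label text ...
label = soup.find("td", string="Name:")
# ... then step to the next <td> on the same level, which holds the value.
value_cell = label.find_next_sibling("td")
# get_text(strip=True) flattens nested tags such as <b> and trims whitespace.
name_value = value_cell.get_text(strip=True)
print(name_value)
```

The same two steps work for any label/value pair laid out this way, which is why the name no longer needs the 'b' tag lookup.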

Wrapping this in a small reusable function makes it more transparent and easier to use:

def get_field_value(soup, field): 
    field_label = soup.find('td', text=field + ':') 
    return field_label.find_next_sibling('td').get_text(strip=True)

Usage:

print(get_field_value(soup, 'Name')) # prints 'AWO-Seniorenzentrum Kenten' 
print(get_field_value(soup, 'Land')) # prints 'Deutschland'
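Since the question wants several fields (name, street, and so on) stored in one data structure, the same lookup could be applied to a list of labels. The following is only a sketch under assumptions: the sample markup, the 'Straße' and 'Telefon' labels, and the helper names get_field_value_or_none and collect_fields are hypothetical and not taken from the real page. It also returns None when a label is missing, where the function above would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page may use different labels or nesting.
SAMPLE_HTML = """
<table>
  <tr><td>Name:</td><td><b>AWO-Seniorenzentrum Kenten</b></td></tr>
  <tr><td>Straße:</td><td>Am Vogelwäldchen 2</td></tr>
  <tr><td>Land:</td><td>Deutschland</td></tr>
</table>
"""


def get_field_value_or_none(soup, field):
    """Return the text of the cell after the '<field>:' label, or None if absent."""
    label = soup.find("td", string=field + ":")
    if label is None:
        return None
    value_cell = label.find_next_sibling("td")
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


def collect_fields(soup, fields):
    """Build a dict mapping each requested label to its value (None when missing)."""
    return {field: get_field_value_or_none(soup, field) for field in fields}


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
contact = collect_fields(sample_soup, ["Name", "Straße", "Land", "Telefon"])
print(contact)
```

A dict keyed by label keeps each value addressable by name, which may be easier to work with than appending values to a list in a fixed order.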

Source

2014-11-23 18:32:47 alecxe

Thank you very much, that works perfectly. – 2014-11-23 21:32:00

@FreshPrince glad to help, thanks. – alecxe 2014-11-23 21:36:15
