2014-12-05 102 views

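Extracting data with BeautifulSoup

This is probably a simple question, but I can't figure it out. I am unable to extract the email address and the website URL from this part of a web page with BeautifulSoup: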

<!-- ENDE telefonnummer.jsp --></li> 
     <li class="email "> 
       <a 
        class="link" 
        href="mailto:info@taxi-ac.de" 
        data-role="email-layer" 
        data-template-replacements='{ 
         "name": "Aachener-Airport-Taxi Blum", 
         "subscriberId": "128027562762", 
         "captchaBase64": "data:image/jpg;base64,/9j/4AAQSkZJRgABAgAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/wAARCAAvAG4DASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwD02iivPLm58L6d4x1i4nsdLGpRNCsUcsSIFcASG5eUjEQLXCqWPJMfy72IWvyDD4d13JK90r6K/VLurb7nuzlynodcfqvxJ0PTda/se3ivtU1AMyPBp0HmsjKMkHJGTjOducbTnGK1rDw7awanJrF7FaXOsy4DXcdsI9oClQEBLEcE5JYk5xnaFVfGXTxP8JPElxqVxbQ6jaXshja8lGTOMhz8+S0bnng5BIJw+0GvRy3A4fETnHm5pJe6r8vM+uuui+TflqY1akoJO1l9567oPjKz13VJtMOnapp17HCLgQ6hbeUzx7tpZeTwDgc468ZwcM/4WD4ZOp/2bHfTTXh+7DDZzyFxt3ArtQ7lK/MCMgjkcVZ8LeLdK8X6e93pkjgxttlglAWSM9twBPBAyCCR17ggeWaHfw3fx11nU9QS53WTTrEtnbSTZKYgG5UVmxsySeBux06VVDAU6s6yqQlHkjeyet+2q2YSqOKjZp3Z61pnibSdX1C50+0uX+22yh5baeCSGRVPQ7XUEjkdPUeorz/xt8SPEPgzxWthJbaXdWUircR7UkSQxFiNpO4gN8pGcEdDjsKvhy1k8U/GOfxdpjQtpEWcu0yeYf3JhH7sEuu4gsNwXKjPXiuw1zw/Z+J9Y13S7xEIk0y0MUjLuMMm+62uORyCfUZGQeCa1jRwmDxKVVc0eVOSe8W2k1p1W/Tt5icp1Ie7o76eZu2epf294ft9R0a4hj+1RrJE80fmhPVWVXHzDlSA3BHfGKg8N3ep31pcT6jPaSbbmaCNbe3aLHlSvGSdztnO0HHGOnNeH6Dr2ufCfxPNpWqwvJYOwaaBTlXU8CaInHOB7ZxtOCPl9t8IzRXGhPPBIksUl/eukiMGVlN1KQQR1BFY5jl7wcG4tShJrllptrpf7v6uOlV9o9dGtzdooorxToCiiigCG7uo7K2e4lWZkTGRDC8rcnHCoCx69hWF4ZeHUNN1GG5s7ndNd3DTre2ckfnRvK4jz5ijePKCLjnChQccCujorWNRRpuNtW1rft/XfsJq7uclot5qem+EDY2+mTX2p6V/oiQtG1otwiSGNHV5AVOY1DnBI5xxkVn6j45vL7T57Ow8Ea/NdXC+THH
qFhstyW4/eHcflwec4B6EjqO9orojiqXO5zpptu+7Xy06fc/MhwdrJnmXgjw7efDnwvqV9f2013qd5tKWdmrzfdQlEOxDtYsWBblR8vPrmfCS3k8LWWpy6vZavb3F1JGqwf2TcPhUBw25UI5LkY7bfevYKK6Z5rKrGqqsbuo1dp2+HZLRkqik1boeWfDvw7q6eOtc8UXdhNY2N95/kR3Q2THfMGGU5K4C85x1GMjmupstYgk8X3c4tdUWK5tLWCOR9MuUUusk5YEmMbQBIvJwOevBrqqKxxGO+sTlOpHdJKz2S+++w40+RJJnK+PfBsXjPQRarIkN7A3mW0zKCA2MFWOMhW4zjuFPOME+HGnXmk+A9PsL+3eC6gaZZI36g+c/5gjkEcEEEV1VFYvGVXhvqr+FO68t/wDMr2a5+fqFFFFcpYUUUUAFFFFABRRRQAUUUUAFFFFABRRRQAUUUUAFFFFAH//Z", 
         "captchaWidth": "110", 
         "captchaHeight": "47", 
         "captchaEncryptedAnswer": "767338fffffff8ffffffd6ffffff8d3038ffffffba1971ffffffdfffffffe3f6c9" 
        }' 
        data-wipe='{"listener":"click","name":"Detailseite E-Mail","id":"128027562762"}' 
       > 
        <i class="icon-mail"></i> 
        <span class="text" >info@taxi-ac.de</span> 
       </a> 
      </li> 
     <li class="website "> 
       <a class="link" href="http://www.aachener-airport-taxi.de" rel="follow" target="_blank" title="http://www.aachener-airport-taxi.de" 
        data-wipe='{"listener":"click","name":"Detailseite Webadresse","id":"128027562762"}'> 
        <i class="icon-website"></i> 
        <span class="text">Zur Website</span> 
       </a> 
      </li> 
     </ul> 
</div> 

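I am trying to get info@taxi-ac.de and http://www.aachener-airport-taxi.de out of there. soup.find(class='email') obviously doesn't work, because class is a reserved word in Python, so the interpreter thinks I want to declare a class inside the parentheses. I can use for link in soup.find_all('a'): print(link.get('href')) to get all the links, but I want these specific ones. The links are different on every page, so I can't write a regular expression for them, and I guess we have to walk through the HTML body by hand.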

Answers

print(soup.find("span", {"class": "text"}).text)
print(soup.find(attrs={"class": "website"}).a["href"])

Output:

info@taxi-ac.de
http://www.aachener-airport-taxi.de
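
A slightly more targeted variant, offered as a sketch rather than the one right way: since class is a reserved word in Python, Beautiful Soup 4 accepts the keyword argument class_ instead, and limiting the search to the matching li element avoids picking up some earlier span with class "text" elsewhere on the page. The trimmed HTML, the placeholder phone entry, and the helper name extract_contact are made up for illustration; the address info@taxi-ac.de is the one quoted in the comment on this answer.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the fragment in the question (illustrative only).
# The "phone" entry is a placeholder showing why an unscoped span lookup
# can return the wrong element.
HTML = """
<ul>
  <li class="phone "><span class="text">PHONE PLACEHOLDER</span></li>
  <li class="email ">
    <a class="link" href="mailto:info@taxi-ac.de" data-role="email-layer">
      <i class="icon-mail"></i>
      <span class="text">info@taxi-ac.de</span>
    </a>
  </li>
  <li class="website ">
    <a class="link" href="http://www.aachener-airport-taxi.de" rel="follow">
      <i class="icon-website"></i>
      <span class="text">Zur Website</span>
    </a>
  </li>
</ul>
"""


def extract_contact(html):
    """Return (email, website) from the listing fragment; None where missing."""
    soup = BeautifulSoup(html, "html.parser")

    # class_ (with trailing underscore) sidesteps the reserved word "class".
    email_item = soup.find("li", class_="email")
    website_item = soup.find("li", class_="website")

    email = None
    if email_item is not None:
        span = email_item.find("span", class_="text")
        if span is not None:
            email = span.get_text(strip=True)

    website = None
    if website_item is not None and website_item.a is not None:
        website = website_item.a.get("href")

    return email, website


email, website = extract_contact(HTML)
print(email)
print(website)
```

The None checks are there because find returns None when nothing matches, which would otherwise surface as an AttributeError on pages that lack an email or website entry.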

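Nice, thanks. The first approach actually returns the phone number rather than the email, because the phone number is nested in a similar structure a few lines above the part of the HTML body I posted, but I was able to adapt it to extract the mail :) I use 'mail = soup.find(attrs={"class": "email"}).a["href"]', which returns 'mailto:info@taxi-ac.de'. You just need to split the resulting string and there you go. Probably not the textbook way, but hey, it works. – 2014-12-06 03:03:13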
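
For the splitting step mentioned in the comment, one possible sketch using only plain string operations. The helper name email_from_mailto is hypothetical, and the handling of a ?subject=... suffix is an extra assumption about what a mailto link may carry, not something the page in the question is known to include.

```python
def email_from_mailto(href):
    """Return the address part of a mailto: link, or None if href is not one."""
    prefix = "mailto:"
    if not href.lower().startswith(prefix):
        return None
    # Drop the scheme, then cut off any query part such as ?subject=...
    return href[len(prefix):].split("?", 1)[0]


print(email_from_mailto("mailto:info@taxi-ac.de"))
```

Returning None for anything that is not a mailto link keeps the caller from silently treating an ordinary URL as an address.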
