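Scraping a messy source page with Beautiful Soup

I'm trying to scrape a web page using Python and Beautiful Soup, but the page's source is not the prettiest. The snippet below is a small part of the source page: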

...717301758],"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0,...

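I want to get the value "2" that follows the string "birthdayFriends", but I can't figure out how to get it. So far I have written the code below, but it only prints an empty list.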

import urllib2 
from bs4 import BeautifulSoup 

# Create an OpenerDirector with support for Basic HTTP Authentication... 
auth_handler = urllib2.HTTPBasicAuthHandler() 
auth_handler.add_password(realm='PDQ Application', 
          uri='myWebpage', 
          user='myUsername', 
          passwd='myPassword') 
opener = urllib2.build_opener(auth_handler) 
# ...and install it globally so it can be used with urlopen. 
urllib2.install_opener(opener) 
page = urllib2.urlopen('myWebpage') 

soup = BeautifulSoup(page.read()) 

bf = soup.findAll('birthdayFriends') 

print bf 

>> []

Source

2014-01-18 Christoffer

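BeautifulSoup is an HTML parser, and your snippet doesn't look like HTML. Is it inside a "script" tag? – alecxe

Yes, it is inside a script tag. So what should I do? Maybe use a library other than Beautiful Soup? – Christoffer

Well, one way to get data out of a script tag is to use regular expressions: for example, use BS to locate the script element, then parse the contents of the script tag with a regular expression. – alecxe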
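The regular-expression half of that suggestion might look roughly like the sketch below. It assumes the script's text has already been pulled out of the page; `script_text` is a hypothetical stand-in for that text, and `extract_birthday_friends` is a name made up for illustration. It is written for Python 3, unlike the Python 2 code in the question.

```python
import re

# Hypothetical stand-in for the text of the <script> element; in practice this
# would come from the tag that Beautiful Soup located.
script_text = 'var x = {"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0}}'


def extract_birthday_friends(text):
    """Return the integer that follows "birthdayFriends": or None if absent."""
    match = re.search(r'"birthdayFriends"\s*:\s*(\d+)', text)
    return int(match.group(1)) if match else None


birthday_friends = extract_birthday_friends(script_text)
print(birthday_friends)
```

A pattern like this only works while the value is a plain integer; if the surrounding data is needed as well, parsing the JSON is likely the sturdier route.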

Suppose that somewhere in the HTML there is a script tag like the following:

<script> 
var x = {"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0}} 
</script>

Then your code might look like:

import json

script = soup.findAll('script')[0]  # or whichever index it has in the file
# take the JSON part: everything after the first '='
j = script.text.split('=', 1)[1]

# load the JSON string into a dictionary
d = json.loads(j, strict=False)
print d["birthdayFriends"]

If the content of the script tag is more complex, consider looping over its lines, or see How can I parse Javascript variables using python?
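As a rough illustration of the line-looping idea, the sketch below scans each line of a script for a JSON object containing the wanted key. The script contents, the `unrelated` and `doSomething` names, and the helper `find_json_with_key` are all hypothetical, and the approach assumes the object sits on a single line. It uses Python 3 and only the standard library.

```python
import json

# Hypothetical script contents: extra statements surround the line we want.
script_text = '''
var unrelated = 1;
var x = {"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0}};
doSomething(x);
'''


def find_json_with_key(text, key):
    """Return the first JSON object on any line that contains `key`, else None."""
    decoder = json.JSONDecoder()
    for line in text.splitlines():
        start = line.find('{')
        if start == -1 or key not in line:
            continue
        try:
            # raw_decode parses one JSON value and ignores whatever follows it,
            # such as the trailing semicolon.
            obj, _end = decoder.raw_decode(line[start:])
        except ValueError:
            continue
        if isinstance(obj, dict) and key in obj:
            return obj
    return None


data = find_json_with_key(script_text, "birthdayFriends")
print(data["birthdayFriends"])
```

Using `raw_decode` instead of `json.loads` avoids having to trim the JavaScript that trails the object, which is one reason splitting on `=` alone can fail on real pages.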

Also, for parsing JavaScript in Python, see pynoceros.

Source

2014-01-19 03:20:09
