2017-04-03
3

I want to use Python's BeautifulSoup library to parse some data about the current moon phase, but BeautifulSoup does not return all of the data.

from bs4 import BeautifulSoup 
import urllib2 

moon_url = "http://www.moongiant.com/phase/today/" 


try: 
    rqest = urllib2.urlopen(moon_url) 
    moon_Soup = BeautifulSoup(rqest, 'lxml') 
    moon_angle = 0 
    moon_illumination = 0 
    main_data = moon_Soup.find('div', {'id' : 'moonDetails'}) 
    print main_data 

except urllib2.URLError: 
    print "Error" 

But instead of output like this:

<div id="moonDetails">   
     Phase: <span>Waxing Crescent</span><br>Illumination: <span>36% 
</span><br>Moon Age: <span>6.00 days</span><br>Moon Angle: <span>0.55</span><br>Moon Distance: <span>364,</span>434.78 km<br>Sun Angle: <span>0.53</span><br>Sun Distance: <span>149,</span>571,918.47 km<br> 
</div> 

I only get this:

<div id="moonDetails"> 
</div> 

Any ideas?

+0

This data is not in '<div id="moonDetails">' – RaminNietzsche

+0

'var mArray'? Actually, it is in var jArray. How can I parse jArray with Python? – Costis94

+2

Read http://stackoverflow.com/questions/24118337/fetch-data-of-variables-inside-script-tag-in-python-or-content-added-from-js – RaminNietzsche

Answers

3

Actually, after RaminNietzsche's comment, I used the dryscrape library.

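from bs4 import BeautifulSoup
import urllib2
import dryscrape

moon_url = "http://www.moongiant.com/phase/today/"

try:
    # Raises URLError early if the site cannot be reached
    rqest = urllib2.urlopen(moon_url)
    # dryscrape renders the page, including content added by JavaScript
    session = dryscrape.Session()
    session.visit(moon_url)
    response = session.body()
    soup = BeautifulSoup(response, 'lxml')

    # find() returns the single div with this id
    moon_data = soup.find('div', {'id': 'moonDetails'})
    print moon_data

except urllib2.URLError:
    print "Error"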

The resulting output is now:

<div id="moonDetails">   
     Phase: <span>Waxing Crescent</span><br>Illumination: <span>36% 
</span><br>Moon Age: <span>6.00 days</span><br>Moon Angle: <span>0.55</span><br>Moon Distance: <span>364,</span>434.78 km<br>Sun Angle: <span>0.53</span><br>Sun Distance: <span>149,</span>571,918.47 km<br> 
</div> 

Thanks, everyone, for your answers!

3

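As RaminNietzsche said in the comments, you should extract the text of the script inside that specific script tag. You can use regex or built-in methods such as split(), strip(), and replace(), for example:

Code: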

from bs4 import BeautifulSoup 
import requests 
import re 
import json 

moon_url = "http://www.moongiant.com/phase/today/" 
html_source = requests.get(moon_url).text 

moon_soup = BeautifulSoup(html_source, 'html.parser') 

data = moon_soup.find_all('script', {'type' : 'text/javascript'}) 

for d in data:
    d = d.text
    if 'var jArray=' in d:
        # Non-greedy match: from the first '{' to the first '}' in the script text
        jArray = re.search(r'\{(.*?)\}', d).group()
        moon_data = json.loads(jArray)
        print(moon_data)

        # If you want the mArray data too, you just have to:
        # 1. add `'var mArray=' in d` to the if clause, and
        # 2. uncomment the following lines
        #mArray = re.search(r'\[+(.*?)\];', d).group()
        #print(mArray)

Output:

{'3': ['<b>April 4</b>', '58%\n', 'Sun Angle: 0.53291621763825', 'Sun Distance: 149657950.85286', 'Moon Distance: 369697.55153449', 'Moon Age: 8.1316595947356', 'Moon Angle: 0.53870564539409', 'Waxing Gibbous', 'April 4'], '2': ["<span style='color:#c7b699'><b>April 3</b></span>", 'Illumination: <span>47%\n</span>', 'Sun Angle: <span>0.53', 'Sun Distance: <span>149,</span>614,</span>943.28', 'Moon Distance: <span>366,</span>585.35', 'Moon Age: <span>7.08', 'Moon Angle: <span>0.54', 'First Quarter', '<b>Monday, April 3, 2017</b>', 'April', 'Phase: <span>First Quarter</span>', 'April 3'], '1': ['<b>April 2</b>', '36%\n', 'Sun Angle: 0.53322274612254', 'Sun Distance: 149571918.46739', 'Moon Distance: 364434.77975454', 'Moon Age: 6.002888839693', 'Moon Angle: 0.54648504798072', 'Waxing Crescent', 'April 2'], '4': ['<b>April 5</b>', '69%\n', 'Sun Angle: 0.53276322269153', 'Sun Distance: 149700928.5008', 'Moon Distance: 373577.14506795', 'Moon Age: 9.1657967733025', 'Moon Angle: 0.53311119464703', 'Waxing Gibbous', 'April 5'], '0': ['<b>April 1</b>', '25%\n', 'Sun Angle: 0.53337618944887', 'Sun Distance: 149528889.15122', 'Moon Distance: 363387.67496992', 'Moon Age: 4.9078487808877', 'Moon Angle: 0.54805974945761', 'Waxing Crescent', 'April 1']} 
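As a complement to the regex, here is a minimal sketch of the built-in-methods route (split() and strip()) for pulling the jArray object out of the script text. The script_text value is a shortened, made-up stand-in for the real script content, and extract_jarray is a hypothetical helper name. The sketch assumes the object literal ends at the first '};' and that no string value contains that sequence.

```python
import json

# Shortened, hypothetical stand-in for the text of the <script> tag.
script_text = r'''
var jArray={"0":["<b>April 1</b>","25%\n","Waxing Crescent"],"1":["<b>April 2</b>","36%\n","Waxing Crescent"]};
var mArray=[["April 1","25"],["April 2","36"]];
'''


def extract_jarray(text):
    """Return the jArray object from the script text as a dict."""
    # Keep everything after the assignment marker...
    after_marker = text.split('var jArray=', 1)[1]
    # ...up to the first '};', then restore the closing brace.
    json_text = after_marker.split('};', 1)[0].strip() + '}'
    return json.loads(json_text)


sample_data = extract_jarray(script_text)
print(sample_data['1'][0])  # <b>April 2</b>
```

The result has the same shape as the regex version: a dict keyed by strings, each holding a list of strings.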

Because it is loaded as JSON, you can navigate through it like this:

Example code:

print(moon_data['4'])
print('-' * 5)
print(moon_data['4'][2])

Output:

['<b>April 5</b>', '69%\n', 'Sun Angle: 0.53276322269153', 'Sun Distance: 149700928.5008', 'Moon Distance: 373577.14506795', 'Moon Age: 9.1657967733025', 'Moon Angle: 0.53311119464703', 'Waxing Gibbous', 'April 5'] 
----- 
Sun Angle: 0.53276322269153 
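If you need numbers rather than strings, the "Label: value" entries can be converted with a small helper. This is only a sketch: numeric_fields and FIELD are hypothetical names, the row is copied from the output above, and the pattern deliberately skips entries that contain markup (such as the span-wrapped values in the '2' row).

```python
import re

# One row copied from the output above (moon_data['4']).
row = ['<b>April 5</b>', '69%\n', 'Sun Angle: 0.53276322269153',
       'Sun Distance: 149700928.5008', 'Moon Distance: 373577.14506795',
       'Moon Age: 9.1657967733025', 'Moon Angle: 0.53311119464703',
       'Waxing Gibbous', 'April 5']

# Matches entries of the form "Label: 123.45" and nothing else.
FIELD = re.compile(r'([A-Za-z ]+):\s*(-?\d+(?:\.\d+)?)')


def numeric_fields(entries):
    """Map each "Label: number" entry to a float; skip everything else."""
    fields = {}
    for entry in entries:
        match = FIELD.fullmatch(entry.strip())
        if match:
            fields[match.group(1)] = float(match.group(2))
    return fields


values = numeric_fields(row)
print(values['Moon Age'])  # 9.1657967733025
```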

2

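Here is another way, the gist of which I borrowed from root's answer to "access Chrome DOM".

The idea is that you can use selenium together with lxml to access the DOM of a page after it has been loaded and processed by its JavaScript.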

>>> moon_url = "http://www.moongiant.com/phase/today/" 
>>> import selenium.webdriver as webdriver 
>>> import lxml.html as html 
>>> import lxml.html.clean as clean 
>>> 
>>> browser = webdriver.Chrome() 
>>> browser.get(moon_url) 
>>> content = browser.page_source 
>>> cleaner = clean.Cleaner() 
>>> content = cleaner.clean_html(content) 
>>> doc = html.fromstring(content) 
>>> type(doc) 
<class 'lxml.html.HtmlElement'> 
>>> type(content) 
<class 'str'> 
>>> open('c:/scratch/content.htm','w').write(content) 
27070 

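Once you have done this, as the last few statements above show, you can access the DOM either as HTML or as a tree suitable for processing with lxml. In your case, you would probably prefer to make soup from the HTML; that means applying BeautifulSoup to content.

Incidentally, when I saved content I did find the following structure in the HTML, as one would expect.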

<div id="moonDetails"> 
    Phase: <span>First Quarter</span><br> 
    Illumination: <span>47%</span><br> 
    Moon Age: <span>7.08 days</span><br> 
    Moon Angle: <span>0.54</span><br> 
    Moon Distance: <span>366,</span>585.35 km<br> 
    Sun Angle: <span>0.53</span><br> 
    Sun Distance: <span>149,</span>614,943.28 km<br> 
</div>
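Given markup like the above, a sketch of turning it into a dictionary with BeautifulSoup could look like this. Here content is a literal copy of the structure shown rather than the live page source, and details is a name chosen for illustration.

```python
from bs4 import BeautifulSoup

# The rendered markup shown above, standing in for browser.page_source.
content = """
<div id="moonDetails">
    Phase: <span>First Quarter</span><br>
    Illumination: <span>47%</span><br>
    Moon Age: <span>7.08 days</span><br>
    Moon Angle: <span>0.54</span><br>
    Moon Distance: <span>366,</span>585.35 km<br>
    Sun Angle: <span>0.53</span><br>
    Sun Distance: <span>149,</span>614,943.28 km<br>
</div>
"""

soup = BeautifulSoup(content, 'html.parser')
div = soup.find('div', {'id': 'moonDetails'})

# Turn each <br> into a newline so every field ends up on its own line.
for br in div.find_all('br'):
    br.replace_with('\n')

details = {}
for line in div.get_text().splitlines():
    label, sep, value = line.partition(':')
    if sep:  # skip blank lines, which contain no colon
        details[label.strip()] = value.strip()

print(details['Moon Distance'])  # 366,585.35 km
```

Calling get_text() on the div joins the text inside and outside each span, which is why the split "366," and "585.35 km" come back as a single value.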