2017-08-30

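Please help me with this small problem. I am trying to extract the lat and lng values from the code below, which sits in a SCRIPT tag (not the body), using Beautiful Soup or plain Python. I am new to Python, and blogs recommend Beautiful Soup for extraction. How can I extract text from a file (inside a script tag) using Python or BeautifulSoup?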

I want these two values: lat: 21.25335, lng: 81.649445. I am using a regular expression for this. My regular expression is "^[l]([a-t])([a-t])(\:) ([0-9])([^,]+)".

See this link for the regex and the HTML file: http://regexr.com/3glde

This regex does find both lines, but I only want the numeric part of the lat and lng values, stored in variables.

Here is the Python code I used:

import re

# Raw strings keep the backslashes in the pattern and the Windows path literal.
pattern = re.compile(r"^[l]([a-t])([a-t])(\:) ([0-9])([^,]+)")

with open(r'C:\hile_text.html') as f:
    for i, line in enumerate(f):
        for match in pattern.finditer(line):
            print('Found on line %s: %s' % (i + 1, match.groups()))

Output:

  • Found on line 3218: ('a', 't', ':', '2', '1.244791')
  • Found on line 3219: ('n', 'g', ':', '8', '1.643486')

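I want only the numeric values, such as 21.25335 and 81.649445, as output, and I want to store them in variables. Alternatively, you could suggest different code for this.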

Please help me soon. Thanks in advance.

This is the script tag in the HTML file:

<script type="text/javascript"> 
    window.mapDivId = 'map0Div'; 
    window.map0Div = { 
    lat: 21.25335, 
    lng: 81.649445, 
    zoom: null, 
    locId: 5897747, 
    geoId: 297595, 
    isAttraction: false, 
    isEatery: true, 
    isLodging: false, 
    isNeighborhood: false, 
    title: "Aman Age Roll & Chicken ", 
    homeIcon: true, 
    url: "/Restaurant_Review-g297595-d5897747-Reviews-Aman_Age_Roll_Chicken-Raipur_Raipur_District_Chhattisgarh.html", 
    minPins: [ 
    ['hotel', 20], 
    ['restaurant', 20], 
    ['attraction', 20], 
    ['vacation_rental', 0]  ], 
    units: 'km', 
    geoMap: false, 
    tabletFullSite: false, 
    reuseHoverDivs: false, 
    noSponsors: true }; 
    ta.store('infobox_js', 'https://static.tacdn.com/js3/infobox-c-v21051733989b.js'); 
    ta.store("ta.maps.apiKey", ""); 
    (function() { 
    var onload = function() { 
    if (window.location.hash == "#MAPVIEW") { 
    ta.run("ta.mapsv2.Factory.handleHashLocation", {}, true); 
    } 
    } 
    if (window.addEventListener) { 
    if (window.history && window.history.pushState) { 
    window.addEventListener("popstate", function(e) { 
    ta.run("ta.mapsv2.Factory.handleHashLocation", {}, false); 
    }, false); 
    } 
    window.addEventListener('load', onload, false); 
    } 
    else if (window.attachEvent) { 
    window.attachEvent('onload', onload); 
    } 
    })(); 
    ta.store("mapsv2.show_sidebar", true); 
    ta.store('mapsv2_restaurant_reservation_js', ["https://static.tacdn.com/js3/ta-mapsv2-restaurant-reservation-c-v2430632369b.js"]); 
    ta.store('mapsv2.typeahead_css', "https://static.tacdn.com/css2/maps_typeahead-v21940478230b.css"); 
    // Feature gate VR price pins on SRP map. VRC-14803 
    ta.store('mapsv2.vr_srp_map_price_enabled', true); 
    ta.store('mapsv2.geoName', 'Raipur'); 
    ta.store('mapsv2.map_addressnotfound', "Address not found");  ta.store('mapsv2.map_addressnotfound3', "We couldn\'t find that location near {0}. Please try another search.");  ta.store('mapsv2.directions', "Directions from {0} to {1}");  ta.store('mapsv2.enter_dates', "Enter dates for best prices");  ta.store('mapsv2.best_prices', "Best prices for your stay");  ta.store('mapsv2.list_accom', "List of accommodations");  ta.store('mapsv2.list_hotels', "List of hotels");  ta.store('mapsv2.list_vrs', "List of holiday rentals");  ta.store('mapsv2.more_accom', "More accommodations");  ta.store('mapsv2.more_hotels', "More hotels");  ta.store('mapsv2.more_vrs', "More Holiday Homes");  ta.store('mapsv2.sold_out_on_1', "SOLD OUT on 1 site");  ta.store('mapsv2.sold_out_on_y', "SOLD OUT on 2 sites"); </script> 

Answer


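Your regex is a bit off. ^l means you are trying to match an 'l' that is the first character on the line, but in the script shown above lat and lng are preceded by whitespace.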
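A quick way to see the effect of the anchor, as a minimal sketch: the two sample strings below are assumptions modeled on the lat line of the script in the question, one indented and one not.

```python
import re

# The pattern from the question, written as a raw string.
pattern = re.compile(r"^[l]([a-t])([a-t])(\:) ([0-9])([^,]+)")

indented = "    lat: 21.25335, "   # indented, as in the <script> block shown
flush_left = "lat: 21.25335, "     # same text without leading whitespace

# '^' anchors at the start of the string, so leading spaces block the match.
indented_match = pattern.search(indented)
flush_match = pattern.search(flush_left)

print(indented_match)        # None
print(flush_match.groups())  # ('a', 't', ':', '2', '1.25335')
```

Note that even when the pattern matches, the number is split across two groups ('2' and '1.25335'), which is why the output in the question looks chopped up.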

^\s+(l[an][gt])(:\s+)(\d+\.\d+) would work better. Check out a regex analyzer tool such as http://www.myezapp.com/apps/dev/regexp/show.ws to see what is going on.

Here is a breakdown:

Sequence: match all of the followings in order
    BeginOfLine
    Repeat
        WhiteSpaceCharacter
        one or more times
    CapturingGroup
    GroupNumber:1
        Sequence: match all of the followings in order
            l
            AnyCharIn[ a n]
            AnyCharIn[ g t]
    CapturingGroup
    GroupNumber:2
        Sequence: match all of the followings in order
            :
            Repeat
                WhiteSpaceCharacter
                one or more times
    CapturingGroup
    GroupNumber:3
        Sequence: match all of the followings in order
            Repeat
                Digit
                one or more times
            .
            Repeat
                Digit
                one or more times
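To get the numbers into variables, something along these lines should work. This is only a sketch: the sample lines are copied from the script in the question, and the names coords, lat, and lng are just illustrative. Group 1 holds the key and group 3 holds the number.

```python
import re

# Pattern from this answer: group 1 is the key, group 3 is the number.
pattern = re.compile(r"^\s+(l[an][gt])(:\s+)(\d+\.\d+)")

# Sample lines copied from the script in the question.
lines = [
    "    window.map0Div = { ",
    "    lat: 21.25335, ",
    "    lng: 81.649445, ",
    "    zoom: null, ",
]

coords = {}
for line in lines:
    match = pattern.search(line)
    if match:
        # Convert the captured text to a float before storing it.
        coords[match.group(1)] = float(match.group(3))

lat = coords.get("lat")
lng = coords.get("lng")
print(lat, lng)  # 21.25335 81.649445
```

To run it against the real file, iterate over the open file object instead of the sample list. Be aware that l[an][gt] would also accept keys such as lag or lnt; (lat|lng) is stricter if that matters.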


"(l[an][gt])(:\s+)(\d+\.\d+)" This expression works now.


Great. Could you mark my answer as accepted? – WombatPM
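Since the question asked about limiting the search to SCRIPT tags, here is one possible sketch of that. It does not use Beautiful Soup; it swaps in the standard library's html.parser so that it runs without extra installs. The ScriptCollector class and the shortened html string are hypothetical stand-ins, not part of the original page, and the pattern is the unanchored one from the comment above.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self.in_script:
            self.scripts[-1] += data


# Shortened stand-in for the page in the question (an assumption).
html = """<html><head><script type="text/javascript">
    window.map0Div = {
    lat: 21.25335,
    lng: 81.649445,
    zoom: null };
</script></head><body><p>lat: 0.0</p></body></html>"""

collector = ScriptCollector()
collector.feed(html)
collector.close()

# Unanchored pattern, searched only inside the collected script text.
pattern = re.compile(r"(l[an][gt])(:\s+)(\d+\.\d+)")
coords = {}
for script in collector.scripts:
    for key, _sep, number in pattern.findall(script):
        coords[key] = float(number)

lat, lng = coords["lat"], coords["lng"]
print(lat, lng)  # 21.25335 81.649445
```

The "lat: 0.0" text in the body is ignored because only script content is searched. Without the ^\s+ anchor the pattern can also match inside longer identifiers that end in lat or lng, so it is worth checking the results against the real page.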
