2015-10-18 82 views
2

How do I extract text from a <script> tag using Nokogiri and Mechanize?

This is part of the source code of a booking website:

<script> 
booking.ensureNamespaceExists('env'); 
booking.env.b_map_center_latitude = 53.36480155016638; 
booking.env.b_map_center_longitude = -2.2752803564071655; 
booking.env.b_hotel_id = '35523'; 
booking.env.b_query_params_no_ext = '?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaFCIAQGYAS64AQTIAQTYAQHoAQH4AQs;sid=e1c9e4c7a000518d8a3725b9bb6e5306;dcid=1'; 
</script> 

I want to extract booking.env.b_hotel_id, so that I get the value '35523'. How can I do this with Nokogiri and Mechanize?

I hope someone can help. Thanks! :)

Answer

6
require 'mechanize' 

agent = Mechanize.new 
page = agent.get('http://www.booking.com/hotel/us/solera-by-stay-alfred.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmcgV1c19ueYgBAZgBMbgBBMgBBNgBAegBAfgBAg;sid=695d6598485cb1a8fd9e39c5de3878ba;dcid=4;checkin=2015-10-20;checkout=2015-10-21;dist=0;group_adults=2;room1=A%2CA;sb_price_type=total;srfid=cf5d76283b73d34a1d7e0d61cad6974e38a94351X1;type=total;ucfs=1&') 

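# Join the text of every <script> element, then scan for the hotel-ID assignment.
# The dots are escaped so they match literally; .*? stops at the first closing quote.
match = page.search("script").text.scan(/^booking\.env\.b_hotel_id = '.*?'/)
puts match
# Take the text between the single quotes of the first match.
puts match[0].split("'")[1]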

Output:

booking.env.b_hotel_id = '1202411' 
1202411 
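
As a variation on the same idea, a capture group can return the ID directly, so the split on quotes is not needed. This is only a sketch: `script_text` is a stand-in for `page.search("script").text`, filled here with the lines from the question so that it runs without a network request.

```ruby
# Stand-in for page.search("script").text (an assumption; copied from the question).
script_text = <<~JS
  booking.ensureNamespaceExists('env');
  booking.env.b_map_center_latitude = 53.36480155016638;
  booking.env.b_map_center_longitude = -2.2752803564071655;
  booking.env.b_hotel_id = '35523';
JS

# String#[] with a regex and a group index returns that capture,
# or nil when nothing matched.
hotel_id = script_text[/^booking\.env\.b_hotel_id = '([^']*)'/, 1]
puts hotel_id
```

Getting nil back instead of an exception makes it easier to notice when the page layout has changed.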

Pages that helped me figure this out:

http://robdodson.me/crawling-pages-with-mechanize-and-nokogiri/

Parsing javascript function elements with nokogiri

Regular expression - starting and ending with a character string

http://www.rubular.com

+0

Hi, thanks! But what if I want to extract other information, such as b_map_center_latitude or b_map_center_longitude? Would it work the same way, like this: .scan(/^booking.env.b_map_center_latitude = \'.*\'/)?

+0

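Since the values of those variables are not strings, they are not wrapped in quotes. You will probably want to take out each \' and replace the second one with \; so that the regex stops at the semicolon, like this: '/^booking.env.b_map_center_latitude = .*\;/'. However, the page I found on booking.com ended each variable declaration line with a comma, so you may need '/^booking.env.b_map_center_latitude = .*\,/' instead. Play around with Rubular, it's fun and educational! – Jason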

+0

I tried /^booking.env.b_map_center_latitude = .*\,/ but got nothing back, just an empty string... I still find this hard to understand :(
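
For the unquoted values, a pattern that accepts either terminator may be easier to work with than guessing between ; and ,. The sketch below rests on a few assumptions: `booking_env_value` is a hypothetical helper (not part of Nokogiri or Mechanize), `sample` stands in for `page.search("script").text`, and the comma-terminated line is made up to mirror the variant Jason describes. The pattern also tolerates leading indentation and spacing around =, which a strict `^booking... = ` would not; whether that is why the attempt above came back empty is only a guess.

```ruby
# Hypothetical helper: returns the raw value assigned to booking.env.<name>
# as a String, or nil if no such assignment is found.
def booking_env_value(script_text, name)
  # [ \t]* tolerates indentation and spacing; the value is either a
  # single-quoted string or any run of characters up to the first ';' or ','.
  pattern = /^[ \t]*booking\.env\.#{Regexp.escape(name)}[ \t]*=[ \t]*(?:'([^']*)'|([^;,\n]+))/
  m = script_text.match(pattern)
  m && (m[1] || m[2].strip)
end

# Stand-in for page.search("script").text (an assumption for this sketch).
sample = <<~JS
  booking.env.b_map_center_latitude = 53.36480155016638;
    booking.env.b_map_center_longitude = -2.2752803564071655,
  booking.env.b_hotel_id = '35523';
JS

latitude  = booking_env_value(sample, 'b_map_center_latitude')
longitude = booking_env_value(sample, 'b_map_center_longitude')
puts latitude.to_f
puts longitude.to_f
```

The helper returns strings, so call to_f (or Float(...) for strict parsing) when a number is needed.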
