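Screen scraping HTML head content?
I'm comfortable scraping HTML content by using CSS selectors to identify the parts I want, but I need to scrape part of a web page's head: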
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- saved from url=(0028)http://www.peoplesafe.co.uk/ -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>PeopleSafe</title>
<link href="css/screen.css" media="screen" rel="stylesheet" type="text/css" />
<!--[if lte IE 6]>
<link href="http://www.peoplesafe.co.uk/styles/default/screen_ie6.css" media="screen" rel="stylesheet" type="text/css" />
<![endif]-->
<link rel="icon" href="http://www.peoplesafe.co.uk/styles/default/favicon.ico" />
<script type="text/javascript" src="js/tabpane.js"></script>
<link type="text/css" rel="StyleSheet" href="css/tab.webfx.css?v=2" />
<meta http-equiv="Author" content="Rare Creative Group" />
<meta http-equiv="Description" content="Experts in lone worker safety" />
<meta http-equiv="Keywords" content="lone, worker, safety" />
<script type="text/javascript" src="js/spotlight.js"></script>
<script type="text/javascript" src="js/promo.js"></script>
<script src="http://maps.google.com/maps?ile=api&v=2&sensor=true&key=ABQIAAAA04SCF3o4CZghg6c0Qqgd-RQxzn3bXKr_TQ6C8c2CiIf8-vjJhBS3endtVbbJ1vftXL4Wbb2PwuJ8ag" type="text/javascript"></script>
<script type="text/javascript">
//<![CDATA[
function load()
{
// required for original Peoplesafe layout:
start();
if (GBrowserIsCompatible())
{
// codice setcenter:
var map = new GMap2(document.getElementById("map"));
var customUI = map.getDefaultUI();
// Remove MapType.G_HYBRID_MAP
//customUI.maptypes.hybrid = false;
map.setUI(customUI);
//map.addControl(new GSmallMapControl());
//map.addControl(new GMapTypeControl());
map.setCenter(new GLatLng(51.612308, -1.239453), 11);
// Crea un nuovo marker nel punto specificato con una descrizione HTML associata:
function createMarker(point, description, primary_contact_id)
{
//var icon = new GIcon();
////icon.shadow = "/images/nuvola.png";
//icon.iconSize = new GSize(87, 38);
////icon.shadowSize = new GSize(107, 38);
//icon.iconAnchor = new GPoint(6, 20);
//icon.infoWindowAnchor = new GPoint(5, 1);
//icon.image = "/img/.";
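I need some way to parse the latitude and longitude out of this line:
map.setCenter(new GLatLng(51.612308, -1.239453), 11);
So in one column of my table I want the first part:
51.612308
and in a second column I want the second part:
-1.239453
Is this possible when no CSS selector is available to target it?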
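One way this might be approached without any selector is to treat the page source as plain text and run a regular expression over it. The following is only a minimal sketch, assuming the raw HTML is already available as a String; `extract_lat_long` and the sample `source` are hypothetical names used for illustration, not part of any library.

```ruby
# Hypothetical helper: find the first `new GLatLng(lat, long)` call in raw
# page source and return the two numbers as strings, or nil if none is found.
def extract_lat_long(source)
  number = /-?\d+(?:\.\d+)?/
  match = source.match(/new GLatLng\(\s*(?<lat>#{number})\s*,\s*(?<long>#{number})\s*\)/)
  match && [match[:lat], match[:long]]
end

# Sample input copied from the line in the question.
source = 'map.setCenter(new GLatLng(51.612308, -1.239453), 11);'

lat, long = extract_lat_long(source)
puts lat   # first table column
puts long  # second table column
```

The helper returns strings so the values can be stored exactly as they appear on the page; calling `to_f` on each would convert them to numbers if that suits the table better.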
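Edit
Thanks for the help so far, it is much appreciated!
The initial problem was the redirect that happens once you log in to the site, which I have now sorted out. Now, when I do:
puts page.root
I get the full source of the page, as I expected. So my code (after logging in) is now: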
html_doc = page.root
# Find the first <script> in the head that does not have src="..."
#script = html.at_xpath('/html/head/script[not(@src)]')
# Use a regex to find the correct code parts in the JS, using named captures
parts = script.text.match(/new GLatLng\(\s*(?<lat>.+?)\s*,\s*(?<long>.+?)\s*\)/)
p parts[:lat], parts[:long]
#=> "51.612308"
#=> "-1.239453"
When I run the above I get an error:
undefined local variable or method `script' for main:Object
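The error appears to come from the snippet itself: the line that would assign `script` is commented out, so the name is never defined, and that line also calls `at_xpath` on `html` although the document was stored in `html_doc`. Uncommenting it and changing it to `script = html_doc.at_xpath('/html/head/script[not(@src)]')` is probably the direct fix. As an alternative that avoids XPath altogether, here is a rough sketch using only the Ruby standard library. It swaps in a regular expression to pick out inline scripts, which is more fragile than a real parser, and the `html` sample is a made-up stand-in for the page source.

```ruby
# Made-up stand-in for the page source returned after logging in.
html = <<~HTML
  <html><head>
  <script type="text/javascript" src="js/promo.js"></script>
  <script type="text/javascript">
  //<![CDATA[
  map.setCenter(new GLatLng(51.612308, -1.239453), 11);
  //]]>
  </script>
  </head><body></body></html>
HTML

# Collect the bodies of <script> elements that have no src attribute,
# mirroring what script[not(@src)] selects in the XPath version.
inline_scripts = html.scan(/<script\b(?![^>]*\bsrc\s*=)[^>]*>(.*?)<\/script>/mi).flatten

# Apply the regex from the question to each body and keep the first match.
pattern = /new GLatLng\(\s*(?<lat>.+?)\s*,\s*(?<long>.+?)\s*\)/
parts = inline_scripts.map { |js| js.match(pattern) }.compact.first

if parts
  p parts[:lat], parts[:long]
else
  warn "No GLatLng call found in an inline script"
end
```

Checking `parts` before indexing it matters because `match` returns `nil` when the pattern is absent, which would otherwise raise a `NoMethodError` on `parts[:lat]`.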
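Couldn't you just put them in the HTML? Are they static? – RyanS 2012-04-05 13:07:54
This is a service provider's website with no API. We have permission to scrape it, but we can't change the HTML code. The values are the latitude and longitude of the phones we use for lone workers, updated in almost real time. – dannymcc 2012-04-05 13:08:50
Write a web method in your hosting environment using a server-side technology (ASP.NET, PHP, JSP) to fetch the web page, then extract the data from the returned HTML content. – 2012-04-05 13:09:46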