Scraping JSON with PHP

2014-10-20 118 views
0

I've done a lot of HTML scraping with XPath, but now I have to scrape some JSON and I don't know how. The source I want to scrape is:

 { 
      "ASIN" : "B00DR4LYHY", 
      "FeatureName" : "price_feature_div", 
      "Type" : "JSON", 
      "Value" : 
      { 
       "content" : 
       {"price_feature_div":"<div id=\"price\" class=\"a-section a-spacing-small\">\n<table class=\"a-lineitem\">\n \n\t\t\n\t\t\n\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t\n\n\n\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t  \n\t\t    \n\t\t       \n\t\t\t\t  \n\t\t    \n\t\t\t\t  \n\n\n\n\n\n\t\n<tr>\n <td class=\"a-color-secondary a-size-base a-text-right a-nowrap\">Price:<\/td>\n <td class=\"a-span12\">\n  <span id=\"priceblock_ourprice\" 

class=\"a-size-medium a-color-price\">$37.60<\/span>\n  \n\n\n\n  \n\n\n\n\n\n\n  \n\n  <span id=\"ourprice_shippingmessage\">\t\n  \t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n  \n  \n  \n\n\t \n\t\t\n\t\t\n  \n   <span class=\"a-size-base a-color-base\">& <b>FREE Shipping<\/b><\/span>\n  \n  \n \n\n\n\n  <\/span>\n  \n  \n  \n  \n <\/td>\n<\/tr>\n\n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\n\t\t   \n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\n\t\t\t\n\t\t\t\n\n\n\n\n\t\t\t\n\t\t\t\n\n\t\t\t\n\t\t\t\n\t\t\t\t\n\n\n\n\n\n\n\t\t\t\n\n\t\t\t\n\t\t\t\n\n\t\t\t\n\t\t\t\n\n\n\n\n\n\t\t\t\n\t\t\t\n\n\t\t\n\t\n\t\n\t\n\n \n \n\t\n<\/table>\n<\/div>"} 

     } 
    } 

I get it with this code:

$URL = 'http://www.amazon.com/gp/twister/ajaxv2?sid=188-4344403-7969026&ptd=OUTERWEAR&json=1&dpxAjaxFlag=1&sCac=1&isUDPFlag=1&twisterView=glance&ee=2&pgid=apparel_display_on_website&sr=1-3&nodeID=1036592&rid=0Q05FXGQJSA20X44DJVG&parentAsin=B00DR4LUQY&enPre=1&qid=1413775191&dStr=size_name%2Ccolor_name&auiAjax=1&storeID=apparel&psc=1&asinList=B00DR4LYHY&isFlushing=2&id=B00DR4LYHY&prefetchParam=0&mType=full&dpEnvironment=softlines'; 

What I need to get is the price ($37.60).

The code I'm using, provided by Venkata, is:

$URL = 'http://www.amazon.com/gp/twister/ajaxv2?sid=188-4344403-7969026&ptd=OUTERWEAR&json=1&dpxAjaxFlag=1&sCac=1&isUDPFlag=1&twisterView=glance&ee=2&pgid=apparel_display_on_website&sr=1-3&nodeID=1036592&rid=0Q05FXGQJSA20X44DJVG&parentAsin=B00DR4LUQY&enPre=1&qid=1413775191&dStr=size_name%2Ccolor_name&auiAjax=1&storeID=apparel&psc=1&asinList=B00DR4LYHY&isFlushing=2&id=B00DR4LYHY&prefetchParam=0&mType=full&dpEnvironment=softlines'; 



    $page = file_get_contents($URL); 
    $decoded = json_decode($page); 

    $html = $decoded->Value->content->price_feature_div; 


$dom = new DOMDocument(); 
$dom->loadHTML($html); 

$xpath = new DOMXPath($dom); 

//frem dom method 
$elements = $dom->getElementById("priceblock_ourprice")->item(0); 

//OR use extract it from xpath like below line 
$priceNode = $xpath->query("//*[@id='priceblock_ourprice']"); 

if (!is_null($elements)) { 
    //$priceNode = $elements->item(0); 
    $ourPrice = $priceNode; 
    echo $ourPrice; 
} 

I think the best approach is to use a regex, but what should the expression look like?

+5

Decode the JSON, extract the HTML, then feed it into the DOM as usual. And no, the "best" way would **not** be a regex. – 2014-10-20 17:11:04

+0

@MarcB Thanks, but could you explain how to do that? – Emilios1995 2014-10-20 17:21:57

+0

http://php.net/json_decode – 2014-10-20 17:31:53

Answers

0

Extraction in PHP

$json_string = '{"ASIN" : "B00DR4LYHY","FeatureName" : "price_feature_div","Type" : "JSON","Value" : {"content" : {"price_feature_div":"<div id=\"price\" class=\"a-section a-spacing-small\">\n<table class=\"a-lineitem\">\n \n\t\t\n\t\t\n\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t\n\n\n\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t  \n\t\t    \n\t\t       \n\t\t\t\t  \n\t\t    \n\t\t\t\t  \n\n\n\n\n\n\t\n<tr>\n <td class=\"a-color-secondary a-size-base a-text-right a-nowrap\">Price:<\/td>\n <td class=\"a-span12\">\n  <span id=\"priceblock_ourprice\" class=\"a-size-medium a-color-price\">$37.60<\/span>\n  \n\n\n\n  \n\n\n\n\n\n\n  \n\n  <span id=\"ourprice_shippingmessage\">\t\n  \t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n  \n  \n  \n\n\t \n\t\t\n\t\t\n  \n   <span class=\"a-size-base a-color-base\">& <b>FREE Shipping<\/b><\/span>\n  \n  \n \n\n\n\n  <\/span>\n  \n  \n  \n  \n <\/td>\n<\/tr>\n\n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\n\t\t   \n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\n\t\t\t\n\t\t\t\n\n\n\n\n\t\t\t\n\t\t\t\n\n\t\t\t\n\t\t\t\n\t\t\t\t\n\n\n\n\n\n\n\t\t\t\n\n\t\t\t\n\t\t\t\n\n\t\t\t\n\t\t\t\n\n\n\n\n\n\t\t\t\n\t\t\t\n\n\t\t\n\t\n\t\n\t\n\n \n \n\t\n<\/table>\n<\/div>"}}}'; 

$decoded = json_decode($json_string); 
$html = $decoded->Value->content->price_feature_div; 

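// Keep libxml from printing warnings about imperfect markup (such as the bare & in this snippet)
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// From the DOM directly: getElementById() returns a single element or null,
// not a node list, so there is no ->item(0) to call on it
$priceNode = $dom->getElementById("priceblock_ourprice");

// OR extract it with XPath like the line below
//$priceNode = $xpath->query("//*[@id='priceblock_ourprice']")->item(0);

if (!is_null($priceNode)) {
    // textContent holds the text inside the element, here "$37.60"
    $ourPrice = trim($priceNode->textContent);
    echo $ourPrice;
}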
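The steps above can be wrapped into one function with basic error handling. This is only a minimal sketch: the name `extractPrice` and the shortened sample JSON are made up for illustration, and it assumes the response keeps the `Value->content->price_feature_div` layout shown in the question.

```php
<?php
// Hypothetical helper (not part of the original answer):
// returns the price text, or null if the JSON or the element is missing.
function extractPrice(string $json): ?string
{
    $decoded = json_decode($json);

    // ?? yields null if json_decode() failed or any level of the path is absent
    $html = $decoded->Value->content->price_feature_div ?? null;
    if (!is_string($html) || $html === '') {
        return null;
    }

    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $node = $xpath->query("//*[@id='priceblock_ourprice']")->item(0);

    return $node === null ? null : trim($node->textContent);
}

// Shortened sample in the same shape as the response in the question
$sample = '{"Value":{"content":{"price_feature_div":"<div id=\"price\"><span id=\"priceblock_ourprice\">$37.60<\/span><\/div>"}}}';
echo extractPrice($sample), "\n";
```

With a live request, the string returned by `file_get_contents($URL)` would be passed in place of `$sample`; a `null` result then means either the JSON did not decode or the element was not found.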

Extraction on the front end (I use jQuery in the solution below)

var jsonObj={ 
      "ASIN" : "B00DR4LYHY", 
      "FeatureName" : "price_feature_div", 
      "Type" : "JSON", 
      "Value" : 
      { 
       "content" : 
       {"price_feature_div":"<div id=\"price\" class=\"a-section a-spacing-small\">\n<table class=\"a-lineitem\">\n \n\t\t\n\t\t\n\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t\n\n\n\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t  \n\t\t    \n\t\t       \n\t\t\t\t  \n\t\t    \n\t\t\t\t  \n\n\n\n\n\n\t\n<tr>\n <td class=\"a-color-secondary a-size-base a-text-right a-nowrap\">Price:<\/td>\n <td class=\"a-span12\">\n  <span id=\"priceblock_ourprice\" class=\"a-size-medium a-color-price\">$37.60<\/span>\n  \n\n\n\n  \n\n\n\n\n\n\n  \n\n  <span id=\"ourprice_shippingmessage\">\t\n  \t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n  \n  \n  \n\n\t \n\t\t\n\t\t\n  \n   <span class=\"a-size-base a-color-base\">& <b>FREE Shipping<\/b><\/span>\n  \n  \n \n\n\n\n  <\/span>\n  \n  \n  \n  \n <\/td>\n<\/tr>\n\n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\t\t \n\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\n\t\t   \n\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\n\t\t\t\n\t\t\t\n\n\n\n\n\t\t\t\n\t\t\t\n\n\t\t\t\n\t\t\t\n\t\t\t\t\n\n\n\n\n\n\n\t\t\t\n\n\t\t\t\n\t\t\t\n\n\t\t\t\n\t\t\t\n\n\n\n\n\n\t\t\t\n\t\t\t\n\n\t\t\n\t\n\t\n\t\n\n \n \n\t\n<\/table>\n<\/div>"} 

     } 
    }; 
//using jQuery we extracted the price 
var ourPrice = $(jsonObj.Value.content.price_feature_div).find("#priceblock_ourprice").text(); 

console.log(ourPrice);//"$37.60" is the value you can see in the browser-console 

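Note: I found a syntax error in the "price_feature_div" HTML value (inside JSON a string value must stay on a single line, even when it holds HTML). Notice the two line breaks in the HTML.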
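To see why raw line breaks matter: JSON does not allow unescaped control characters inside a string, so `json_decode` returns `null` for such input. A minimal sketch with made-up sample strings (`$broken` and `$valid` are illustrative and not taken from the Amazon response):

```php
<?php
// A raw (unescaped) newline inside a JSON string value is invalid JSON.
$broken = "{\"html\":\"<span>\n</span>\"}"; // double quotes: \n becomes a real newline
$valid  = '{"html":"<span>\n</span>"}';     // single quotes: \n stays as the characters \ and n

$brokenResult = json_decode($broken);
$brokenError  = json_last_error();
var_dump($brokenResult === null);            // decoding failed
var_dump($brokenError !== JSON_ERROR_NONE);  // an error was recorded

$validResult = json_decode($valid);
var_dump($validResult->html === "<span>\n</span>"); // decoded; the escape became a newline
```

If the line breaks only crept in when the response was pasted into the question, the live response may still decode fine; checking `json_last_error()` after decoding settles it.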

+0

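Thanks for your answer! But I can't get any output. What does 'val' mean in 'val ourPrice = $(jsonObj ...'? I'm working in PHP and don't know what it means. Given the URL (I'll post it in my question), what would the exact code be? – Emilios1995 2014-10-20 17:33:08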

+0

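You're welcome. Sorry, that was a typo; it should be 'var'. I've corrected it now, please take a look. Where are you doing the extraction: on the server side (back end) or on the client side (front end)? – 2014-10-20 17:43:58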

+0

Updated the answer with PHP code; the extraction is done with the DOM after json_decode (as @MarcB already hinted). – 2014-10-20 19:39:02

0

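"I think the best approach is to use a regex, but what should the expression look like?"

In some cases a regex works better than XPath (for small, unstructured fragments of HTML text).

So you can take the raw response and anchor on the $ sign to get the value you want.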

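$page = file_get_contents($URL); 
$pattern = '/\$[\d.]+/'; 
// preg_match() is a plain function (no leading $) and takes the pattern first, then the subject
if (preg_match($pattern, $page, $matches)) { 
    echo 'price = ', $matches[0]; 
} 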

See the working demo.
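If a regex is used, it is safer to match the price next to its element id than to take the first `$` amount anywhere in the response. A minimal sketch, assuming the JSON has already been decoded (in the raw response the quotes are still escaped as `\"`); the pattern and the shortened `$html` sample, including the made-up list price, are illustrative assumptions:

```php
<?php
// Shortened, already-decoded HTML in the same shape as price_feature_div.
// The "List: $50.00" cell is invented to show why matching the first $ amount is fragile.
$html = '<td>List: $50.00</td><span id="priceblock_ourprice" class="a-color-price">$37.60</span>';

// Find the element with the expected id, then capture the amount after the $ sign
$pattern = '/id="priceblock_ourprice"[^>]*>\s*\$([\d.,]+)/';

if (preg_match($pattern, $html, $matches)) {
    echo 'price = $', $matches[1], "\n";
} else {
    echo "price not found\n";
}
```

This still depends on the attribute order and quoting of the markup, which is why the DOM approach in the other answer is the more robust choice when the HTML may change.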