2017-09-27 124 views
1
require(httr) 
require(XML) 
basePage <- "http://bet.hkjc.com/" 
h <- handle(basePage) 
GET(handle = h) 
res <- GET(handle = h, path = "racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2") 
resXML <- htmlParse(content(res, as = "text")) 

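I use the code above to scrape an .aspx website. It returns a large amount of text, but I only want "var infoDivideByRace" and "var scratchList". How can I extract these two variables and convert them into column data? Thanks! Part of the returned text is shown below: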

var poolSellStatus = '[email protected]@@@@@;WIN;PLA;W-P;QIN;QPL;QQP;TRI;DBL;TCE;F-F;QTT;CWA;'.split('@@@'); 
var poolSellStatus_bak = '[email protected]@@@@@;WIN;PLA;W-P;QIN;QPL;QQP;TRI;DBL;TCE;F-F;QTT;CWA;'.split('@@@'); 
var winOddsByRace = '[email protected]@@@@@WIN;1=3.6=1;2=4.7=0;3=43=0;4=11=0;5=29=0;6=9.4=0;7=4.6=0;8=11=0;9=52=0;10=82=0;11=52=0;12=8.6=0#PLA;1=1.4=1;2=2.0=0;3=6.0=0;4=3.5=0;5=6.2=0;6=2.6=0;7=2.0=0;8=4.2=0;9=7.9=0;10=11=0;11=8.4=0;12=2.5=0'.split('@@@'); 
var multiRacePoolsStr = '@@@DBL#;1,2;2,3;3,4;4,5;5,6;6,7;7,[email protected]@@TBL#;6,7,[email protected]@@D-T#;3,4;6,[email protected]@@T-T#;4,5,[email protected]@@6UP#;3,4,5,6,7,8'; 
var fieldSize = 'HV;12;12;12;12;12;12;12;12'; 
var fieldSizeWithReserve = 'HV;12;12;12;12;12;12;12;12'; 
var reserveList = 'HV'; 
var scratchList = 'HV'; 

Answers

0

The easiest or most appropriate approach is to use PhantomJS or Selenium. Failing that, a regex combined with rvest works as a workaround.

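library(rvest) 
library(stringr)  # provides str_extract(), str_replace_all() and regex() 

basePage <- "http://bet.hkjc.com/" 

# Define the path before it is used to build the full URL 
path <- "racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2" 

ss <- paste0(basePage, path) 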

scripts <- read_html(ss, encoding = 'utf8') %>% 
    html_nodes("script") %>% html_text(trim=TRUE) 

new <- scripts[grepl('var scratchList =|var infoDivideByRace = ',scripts)] 

value1 <- str_replace_all(strsplit(str_extract(new,regex('var scratchList = (.*?);')), split=' ')[[1]][4],";|'",'')  
value2 <- str_replace_all(strsplit(str_extract(new,regex('var infoDivideByRace = (.*?);')),split=' ')[[1]][4],";|'",'') 

value1 
#[1] "HV" 

value2 
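
The split-on-spaces step above relies on the value being the fourth space-separated token, which fails if the value itself contains a space. A minimal base R sketch of the same idea with a capture group follows. It runs on a sample string copied from the question's output rather than the live page, `extract_js_var` is a hypothetical helper name, and it assumes the value is a single-quoted literal with no embedded quotes.

```r
# Sample script text, copied from the output shown in the question.
script_txt <- paste(
  "var fieldSize = 'HV;12;12;12;12;12;12;12;12';",
  "var reserveList = 'HV';",
  "var scratchList = 'HV';",
  sep = "\n"
)

# Hypothetical helper (not part of any package): return the single-quoted
# string assigned to the JavaScript variable `name`, or NA if it is absent.
# Assumes `name` contains no regex metacharacters.
extract_js_var <- function(txt, name) {
  pattern <- paste0("var ", name, " = '([^']*)'")
  m <- regmatches(txt, regexec(pattern, txt))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

extract_js_var(script_txt, "scratchList")
extract_js_var(script_txt, "fieldSize")
```

With the scraped page, `script_txt` would be the matching element of `scripts` instead of the sample string.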

0

An alternative option, using the V8 package:

library(rvest) 
library(stringi) 
library(purrr) 
library(V8) 

Fetch the page you specified, which contains your target variables:

pg <- read_html("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2", encoding = "UTF-8") 

Extract the script tag, convert it to text, split it into a character vector of lines, and keep only the lines that start with var:

html_nodes(pg, xpath=".//script[contains(., 'infoDivideByRace')]") %>% 
    html_text() %>% 
    stri_split_lines() %>% 
    flatten_chr() %>% 
    keep(stri_detect_regex, "^var") -> script_txt 

Initialize the V8 JavaScript engine:

ctx <- v8() 

Let it parse the JavaScript and create the data:

ctx$eval(script_txt) 
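
To see what this step does without contacting the site, here is a small offline sketch. It uses a separate context (`demo_ctx`, a name chosen for this illustration) so it does not interfere with `ctx`. The first declaration is copied from the question's output; `demoParts` and its value are made up to mimic the page's `.split('@@@')` pattern.

```r
library(V8)

# Separate context so this demo does not touch `ctx` from the answer.
demo_ctx <- v8()

# Two declarations in the same style as the page's script. The second one
# is an invented value, used only to show how a split() result comes back.
demo_ctx$eval(paste(
  "var scratchList = 'HV';",
  "var demoParts = 'a@@@@@@b'.split('@@@');",
  sep = "\n"
))

demo_ctx$get("scratchList")  # a plain JavaScript string becomes a character value
demo_ctx$get("demoParts")    # a JavaScript array of strings becomes a character vector
```

Because two `@@@` separators sit next to each other, the array contains an empty element between `a` and `b`, which is the same kind of blank entry that gets filtered out of `infoDivideByRace`.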

Retrieve the data (infoDivideByRace has 2 blank array elements, so we ignore them):

grep("^$", ctx$get('infoDivideByRace'), value=TRUE, invert=TRUE) 
## [1] STACKOVERFLOW'S SPAM PROTECTION WON'T LET ME PASTE THIS CONTENT 

ctx$get('scratchList') 
[1] "HV" 
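
The question also asks about turning the retrieved strings into column data. The contents of `infoDivideByRace` are not shown in this thread, so the sketch below works on a shortened copy of the `winOddsByRace` text from the question instead (the part after the last `@@@`). The helper name `parse_pool` and the column meanings (runner number, odds, a 0/1 marker) are assumptions made for illustration; the same split-and-bind pattern should adapt to other delimited variables once their layout is known.

```r
# Shortened copy of the winOddsByRace text after the last '@@@' in the question.
odds_str <- "WIN;1=3.6=1;2=4.7=0;3=43=0#PLA;1=1.4=1;2=2.0=0;3=6.0=0"

# Hypothetical helper: turn one pool block ("WIN;1=3.6=1;...") into a data frame.
parse_pool <- function(block) {
  parts <- strsplit(block, ";", fixed = TRUE)[[1]]
  # Every entry after the pool name has the form a=b=c; bind them into a matrix.
  fields <- do.call(rbind, strsplit(parts[-1], "=", fixed = TRUE))
  data.frame(
    pool   = parts[1],
    number = as.integer(fields[, 1]),  # assumed: runner number
    odds   = as.numeric(fields[, 2]),  # assumed: odds
    flag   = as.integer(fields[, 3]),  # assumed: a 0/1 marker
    stringsAsFactors = FALSE
  )
}

# Pools are separated by '#'; parse each one and stack the results.
blocks <- strsplit(odds_str, "#", fixed = TRUE)[[1]]
odds_df <- do.call(rbind, lapply(blocks, parse_pool))
odds_df
```

With live data, `odds_str` would come from the last non-empty element returned by `ctx$get('winOddsByRace')`.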
+0

The above doesn't work... It returns: Error in flatten_chr(.) : could not find function "flatten_chr" –

+0

I forgot `library(purrr)` (I've now added it to the post) – hrbrmstr