2016-12-04 74 views
0

How do I clean up and split HTML tags in R?

My parser creates a data frame that looks like this:

name   html 
1 John   <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span> 
2 Steve   <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span> 

So how can I extract the useful information from the HTML? For example, I would like to use some of the HTML attributes as features:

name minute second  id 
1 John  68  37 8028 
2 Steve  69  4 132205 

Answers

1

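A regex is possible, but I prefer the rvest package for this.

This would be easier with data.table or dplyr, but let's do it in base R (on the off chance that those are new concepts).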

# Example data 

df <- structure(list(name = c("John", "Steve"), html = c("<span class=\"incident-icon\" data-minute=\"68\" data-second=\"37\" data-id=\"8028\"></span><span class=\"name-meta-data\">68</span>", 
"<span class=\"incident-icon\" data-minute=\"69\" data-second=\"4\" data-id=\"132205\"></span><span class=\"name-meta-data\">69</span>" 
)), .Names = c("name", "html"), row.names = c(NA, -2L), class = "data.frame") 

rvest lets us work with the DOM, which can be much nicer than using a regex to split this kind of thing apart.

library(rvest) 

# Get span attributes from each row: 
spanattrs <- 
    lapply(df$html, 
      function(y) read_html(y) %>% html_node('span') %>% html_attrs) 

# rbind to get a data.frame with all attributes 
final <- data.frame(df, do.call(rbind,spanattrs)) 

> final 
    name                              html   class 
1 John <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span> incident-icon 
2 Steve <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span> incident-icon 
    data.minute data.second data.id 
1   68   37 8028 
2   69   4 132205 

Let's drop the html column so the output is a bit easier to read here:

> final$html <- NULL 
> final 
    name   class data.minute data.second data.id 
1 John incident-icon   68   37 8028 
2 Steve incident-icon   69   4 132205 
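
To get from there to the exact layout asked for in the question (name, minute, second, id), something like the following base R sketch might do. It assumes `final` looks like the printout above; the data frame is rebuilt by hand here only so the snippet runs without rvest, and `num_cols` is just an illustrative name.

```r
# Stand-in for `final` as printed above (rebuilt by hand, an assumption for this sketch)
final <- data.frame(
  name        = c("John", "Steve"),
  class       = c("incident-icon", "incident-icon"),
  data.minute = c("68", "69"),
  data.second = c("37", "4"),
  data.id     = c("8028", "132205"),
  stringsAsFactors = FALSE
)

# Drop the class column, which is not one of the requested features
final$class <- NULL

# Strip the "data." prefix that data.frame() produced from the "data-" attributes
names(final) <- sub("^data\\.", "", names(final))

# The attribute values come back as text (or factors on older R), so convert them
num_cols <- c("minute", "second", "id")
final[num_cols] <- lapply(final[num_cols],
                          function(x) as.integer(as.character(x)))

final
```

Going through as.character() first keeps the conversion safe if the columns happen to be factors.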

3

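If you already have the data frame from your question, you can try the following. Your data frame is called mydf here. You can use stri_extract_all_regex() to extract all the numbers. Then follow the classic approach for converting a list into a data frame. Finally, assign new column names and bind the result to the name column from the original data frame.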

library(stringi) 
library(dplyr) 

stri_extract_all_regex(str = mydf$url, pattern = "[0-9]+") %>% 
unlist %>% 
matrix(ncol = 4, byrow = TRUE) %>% 
data.frame %>% 
setNames(c("minute", "second", "ID", "data")) %>% 
bind_cols(mydf["name"], .) 

# name minute second  ID data 
#1 John  68  37 8028 68 
#2 Steve  69  4 132205 69 
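
Note that the approach above depends on the numbers appearing in the same order in every row. If that is not guaranteed, a variant that matches each attribute by name may be safer. The following is only a base R sketch of that idea; `get_attr` is a hypothetical helper name, and it assumes the attribute values are plain digits wrapped in double quotes, as in the sample data.

```r
# Sample data copied from the question
name <- c("John", "Steve")
html <- c(
  '<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>',
  '<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>'
)

# Hypothetical helper: pull the numeric value of one data-* attribute per string
get_attr <- function(x, attr) {
  pattern <- paste0('data-', attr, '="([0-9]+)"')
  hits <- regmatches(x, regexec(pattern, x))
  # Each hit is c(full match, capture group), or character(0) when nothing matched
  values <- vapply(hits,
                   function(h) if (length(h) >= 2) h[2] else NA_character_,
                   character(1))
  as.integer(values)
}

result <- data.frame(
  name   = name,
  minute = get_attr(html, "minute"),
  second = get_attr(html, "second"),
  id     = get_attr(html, "id"),
  stringsAsFactors = FALSE
)

result
```

Rows where an attribute is missing end up as NA instead of shifting the remaining values into the wrong column.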

DATA

mydf <- structure(list(name = c("John", "Steve"), url = c("<span class=\"incident-icon\" data-minute=\"68\" data-second=\"37\" data-id=\"8028\"></span><span class=\"name-meta-data\">68</span>", 
"<span class=\"incident-icon\" data-minute=\"69\" data-second=\"4\" data-id=\"132205\"></span><span class=\"name-meta-data\">69</span>" 
)), .Names = c("name", "url"), row.names = c(NA, -2L), class = "data.frame") 

1

An alternative rvest approach using purrr and dplyr:

library(rvest) 
library(purrr) 
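library(dplyr) 
# Note: by_row() was later moved out of purrr into the purrrlyr package; 
# with a recent purrr, library(purrrlyr) is needed as well. 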

df <- read.table(stringsAsFactors=FALSE, header=TRUE, sep=",", text='name,html 
John,<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span> 
Steve,<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>') 

by_row(df, .collate="cols", 
     ~read_html(.$html) %>% 
     html_nodes("span:first-of-type") %>% 
     html_attrs() %>% 
     flatten_chr() %>% 
     as.list() %>% 
     flatten_df()) %>% 
    select(-html, -class1) %>% 
    setNames(gsub("^data-|1$", "", colnames(.))) 
## # A tibble: 2 × 4 
## name minute second  id 
## <chr> <chr> <chr> <chr> 
## 1 John  68  37 8028 
## 2 Steve  69  4 132205