2017-01-22
4

R - reorder columns based on match (template)

So I have a large dataset that looks like this:

     V1       V2   V3         V4
1 Sleep Domestic  Eat Child Care
2 Sleep Domestic  Eat       Paid
3 Sleep Domestic  Eat Child Care
4 Sleep      Eat Paid       <NA>

What I would like to do is reorder the columns based on a "template":

["Sleep", "Eat", "Domestic", "Paid", "Child care"] 

in order to get this output:

   V1  V2       V3   V4         V5
Sleep Eat Domestic   NA Child Care
Sleep Eat Domestic Paid         NA
Sleep Eat Domestic   NA Child Care
Sleep Eat       NA Paid         NA

So column 1 holds Sleep, column 2 holds Eat, and so on.

I don't know where to start. Any ideas?

Data

x = structure(list(V1 = c("Sleep", "Sleep", "Sleep", "Sleep"), V2 = c("Domestic", 
"Domestic", "Domestic", "Eat"), V3 = c("Eat", "Eat", "Eat", "Paid" 
), V4 = c("Child Care", "Paid", "Child Care", NA)), .Names = c("V1", 
"V2", "V3", "V4"), row.names = c(NA, 4L), class = "data.frame") 

template = c('Sleep', 'Eat', 'Domestic', 'Paid', 'Child care') 
+0

One note: there is a case mismatch, "Child care" in the template versus "Child Care" in the data. – thelatemail
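The case mismatch raised in this comment can be dealt with before any reordering. The following is a minimal sketch, assuming the template should adopt the data's spelling whenever the two differ only by case. The name template_fixed is illustrative and does not appear in the original post.

```r
x <- data.frame(
  V1 = c("Sleep", "Sleep", "Sleep", "Sleep"),
  V2 = c("Domestic", "Domestic", "Domestic", "Eat"),
  V3 = c("Eat", "Eat", "Eat", "Paid"),
  V4 = c("Child Care", "Paid", "Child Care", NA),
  stringsAsFactors = FALSE
)
template <- c("Sleep", "Eat", "Domestic", "Paid", "Child care")

# distinct non-missing values that actually occur in the data
vals <- unlist(x, use.names = FALSE)
vals <- unique(vals[!is.na(vals)])

# case-insensitive lookup of each template entry among the data values
idx <- match(tolower(template), tolower(vals))

# use the data's spelling where a match exists, otherwise keep the template entry
# (template_fixed is a hypothetical name used only for this sketch)
template_fixed <- ifelse(is.na(idx), template, vals[idx])
template_fixed
```

Template entries that never occur in the data are left unchanged, so they still produce an all-NA column later on.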

+1

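I couldn't follow your question, so let me state what I think you are asking, and then you tell me if I'm wrong, okay? Basically each column *should* represent the presence or absence of a value, for example: [4, 'V5'] should be either "Child Care" (meaning "yes" for child care) or NA (meaning "no" for child care). The order of these yes/no values should follow the template in every row. Is that right? –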

+0

@TravisHeeter Hi, that is actually another way of looking at it. I hadn't thought of it that way, but yes. – giacomo

Answers
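The presence/absence reading discussed in these comments can be written down directly. This is a rough sketch only, assuming the template uses the same capitalization as the data ("Child Care"). The names present and out are made up for the illustration.

```r
x <- data.frame(
  V1 = c("Sleep", "Sleep", "Sleep", "Sleep"),
  V2 = c("Domestic", "Domestic", "Domestic", "Eat"),
  V3 = c("Eat", "Eat", "Eat", "Paid"),
  V4 = c("Child Care", "Paid", "Child Care", NA),
  stringsAsFactors = FALSE
)
# assumption: template spelled like the data
template <- c("Sleep", "Eat", "Domestic", "Paid", "Child Care")

# one row per observation, one column per template entry:
# TRUE where that template value occurs anywhere in the row
present <- t(apply(x, 1, function(r) template %in% r))

# turn each TRUE into the template value of its column, each FALSE into NA
out <- ifelse(present, template[col(present)], NA)
out
```

Here out is a character matrix; as.data.frame(out) would turn it into a data frame if one is needed.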

2

Here is a tidyverse approach:

library(dplyr) 
library(tidyr) 
library(tibble) 
rownames_to_column(x, 'id') %>% 
     gather(Var, Val, -id, na.rm = TRUE) %>% 
     mutate(Var = factor(Val, levels = template)) %>% 
     spread(Var, Val) %>% 
     select(-id) %>% 
     setNames(., paste0("V", seq_along(template))) 
#     V1  V2       V3   V4         V5
#1 Sleep Eat Domestic <NA> Child Care
#2 Sleep Eat Domestic Paid       <NA>
#3 Sleep Eat Domestic <NA> Child Care
#4 Sleep Eat     <NA> Paid       <NA>
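The step that drives the column order in the pipeline above is the conversion to a factor whose levels are the template. A small base R sketch of just that step, using the first row of the data as illustrative input (vals and f are names chosen for this example):

```r
template <- c("Sleep", "Eat", "Domestic", "Paid", "Child Care")

# values of the first row, in their original column order
vals <- c("Sleep", "Domestic", "Eat", "Child Care")

# the integer code of each value is its position in the template,
# i.e. the column it should end up in
f <- factor(vals, levels = template)
as.integer(f)

# a value that does not match any level exactly (here: different case) becomes NA
mismatch <- factor("Child Care", levels = c("Paid", "Child care"))
is.na(mismatch)
```

Because factor matching is exact, the capitalization of the template entries matters at this step.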
+1

Awesome, works very well, thank you! – giacomo

+1

@giacomoV Thanks for the comments. – akrun

3

Check the rowSums for each template value, then piece the result back together:

template <- c("Sleep", "Eat", "Domestic", "Paid", "Child Care") 
# I've fixed this template so that the case of 'Child Care' matches the values in the data

data.frame(lapply(
    setNames(template, seq_along(template)), 
    function(v) c(NA,v)[(rowSums(x==v,na.rm=TRUE)>0)+1] 
)) 

#     X1  X2       X3   X4         X5
#1 Sleep Eat Domestic <NA> Child Care
#2 Sleep Eat Domestic Paid       <NA>
#3 Sleep Eat Domestic <NA> Child Care
#4 Sleep Eat     <NA> Paid       <NA>

Or alternatively with pmax:

data.frame(
    lapply(
        setNames(template, seq_along(template)),
        function(v) do.call(pmax, c(replace(x, x != v, NA), na.rm = TRUE))
    )
)
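To make the indexing trick in the rowSums version easier to follow, here is the same expression unrolled for a single template value. It is only an illustration; the intermediate names hit and picked are not part of the answer.

```r
x <- data.frame(
  V1 = c("Sleep", "Sleep", "Sleep", "Sleep"),
  V2 = c("Domestic", "Domestic", "Domestic", "Eat"),
  V3 = c("Eat", "Eat", "Eat", "Paid"),
  V4 = c("Child Care", "Paid", "Child Care", NA),
  stringsAsFactors = FALSE
)
v <- "Paid"

# x == v is a logical matrix; a row sum above 0 means the row contains v
# (na.rm = TRUE keeps the NA cell in row 4 from poisoning the sum)
hit <- rowSums(x == v, na.rm = TRUE) > 0

# FALSE + 1 is 1 and selects NA; TRUE + 1 is 2 and selects v
picked <- c(NA, v)[hit + 1]
picked
```

Running this for every template value via lapply yields one column per value, which data.frame then binds together.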
+0

Thank you very much, very cool. – giacomo

2

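A reshape2 and dplyr solution. Admittedly not as compact as the others. The idea is to melt (make the data tall), turn the values into a factor ordered by the template, and cast back to wide.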

library(reshape2) 
library(dplyr) 

# make an id column
x$id <- row.names(x) 

# make a tall result id, var, value 
tall <- x %>% 
    melt(id.vars="id") %>% 
    select(id, value) 

# make an ordered factor with the template 
tall$value <- factor(tall$value, levels=template, ordered = TRUE) 

# make wide result with dcast 
result <- tall %>% 
    filter(!is.na(value)) %>% # drop the NAs 
    mutate(var = value) %>% # name the column the same as the value 
    dcast(id ~ var)   # make into wide format 

result 
#  id Sleep Eat Domestic Paid Child Care
#1  1 Sleep Eat Domestic <NA> Child Care
#2  2 Sleep Eat Domestic Paid       <NA>
#3  3 Sleep Eat Domestic <NA> Child Care
#4  4 Sleep Eat     <NA> Paid       <NA>
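The same tall-then-wide idea can also be sketched without extra packages, by writing each value into the matrix cell given by its row id and its position in the template. This is a hedged alternative, not the answer's method. The names long and out are illustrative, and the template is assumed to match the data's capitalization.

```r
x <- data.frame(
  V1 = c("Sleep", "Sleep", "Sleep", "Sleep"),
  V2 = c("Domestic", "Domestic", "Domestic", "Eat"),
  V3 = c("Eat", "Eat", "Eat", "Paid"),
  V4 = c("Child Care", "Paid", "Child Care", NA),
  stringsAsFactors = FALSE
)
# assumption: template spelled like the data
template <- c("Sleep", "Eat", "Domestic", "Paid", "Child Care")

# tall form: one line per cell, read column by column
long <- data.frame(
  id    = rep(seq_len(nrow(x)), times = ncol(x)),
  value = unlist(x, use.names = FALSE),
  stringsAsFactors = FALSE
)

# target column of each value = its position in the template
long$pos <- match(long$value, template)

# drop missing cells and values that are not in the template
long <- long[!is.na(long$pos), ]

# wide form: start from an all-NA matrix and fill the (id, pos) cells
out <- matrix(NA_character_, nrow = nrow(x), ncol = length(template))
out[cbind(long$id, long$pos)] <- long$value
colnames(out) <- paste0("V", seq_along(template))

as.data.frame(out, stringsAsFactors = FALSE)
```

Unlike dcast, this keeps a column for every template entry even when no row contains it.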