2017-09-06 72 views
1

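Collapse only some variables, long to wide format in R

I'm relatively new to R, and every time I need to "reshape" data I get very confused. I have data that looks like this: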

Have:

ID ever_smoked alcoholic  medication dosage 
1 1   no  no humira/adalimumab 40mg 
2 1   no  no  prednisone 15mg 
3 1   no  no  azathioprine 30mg 
4 1   no  no   rowasa 9mg 
5 2   yes  no   lialda 20mg 
6 2   yes  no mercaptopurine  1g 
7 2   yes  no   asacol 1600mg 

Want:

ID ever_smoked alcoholic medication 
1 1   no  no humira/adalimumab, prednisone, azathioprine, rowasa 
2 2   yes  no lialda, mercaptopurine, asacol 

    dosage     most_recent_med  most_recent_dose 
1 40mg, 15mg, 30mg, 9mg rowasa    9mg 
2 20mg, 1g, 1600mg  asacol    1600mg 

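(Note that it should have 2 observations and 7 variables.) Essentially, I want to (1) collapse only some of the variables, (2) keep only the first row for the other variables, and (3) create 2 new variables from the last entry of certain variables.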

Code to reproduce:

have <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 2), 
    ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"), 
    alcoholic = c("no", "no", "no", "no", "no", "no", "no"), 
    medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa", "lialda", "mercaptopurine", "asacol"), 
    dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"), stringsAsFactors = FALSE) 

want <- data.frame(ID = c(1, 2), 
    ever_smoked = c("no", "yes"), 
    alcoholic = c("no", "no"), 
    medication = c("humira/adalimumab, prednisone, azathioprine, rowasa", "lialda, mercaptopurine, asacol"), 
    dosage = c("40mg, 15mg, 30mg, 9mg", "20mg, 1g, 1600mg"), 
    most_recent_med = c("rowasa", "asacol"), 
    most_recent_dose = c("9mg", "1600mg"), stringsAsFactors = FALSE) 

Thanks.

Answers

3

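This is a summarising operation. You can use summarise_all and pass two functions to apply to each non-grouping column: toString to collapse the column into one string, and last to take the value from the last row.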

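library(dplyr)

# Note: funs() is deprecated as of dplyr 0.8.0; it is kept here as originally posted.
have %>% 
    group_by(ID, ever_smoked, alcoholic) %>% 
    summarise_all(funs(toString(.), most_recent = last(.))) 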

# A tibble: 2 x 7 
# Groups: ID, ever_smoked [?] 
#  ID ever_smoked alcoholic         medication_toString  dosage_toString medication_most_recent dosage_most_recent 
# <dbl>  <chr>  <chr>            <chr>     <chr>     <chr>    <chr> 
#1  1   no  no humira/adalimumab, prednisone, azathioprine, rowasa 40mg, 15mg, 30mg, 9mg     rowasa    9mg 
#2  2   yes  no      lialda, mercaptopurine, asacol  20mg, 1g, 1600mg     asacol    1600mg 

This assumes that ever_smoked and alcoholic each take a single value within every ID here.
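For readers on a newer dplyr, the same idea can be sketched without funs(). This is only an illustration, assuming dplyr 1.0.0 or later (for the .groups argument); the name result is made up for the example. The "last" columns are computed first because later expressions in summarise() see columns created earlier in the same call.

```r
library(dplyr)

have <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"),
  alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
  medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa",
                 "lialda", "mercaptopurine", "asacol"),
  dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"),
  stringsAsFactors = FALSE
)

result <- have %>%
  group_by(ID, ever_smoked, alcoholic) %>%
  summarise(
    # Take the last entries before medication/dosage are overwritten below
    most_recent_med  = last(medication),
    most_recent_dose = last(dosage),
    # Collapse each group's values into one comma-separated string
    medication = toString(medication),
    dosage     = toString(dosage),
    .groups = "drop"
  ) %>%
  # Reorder the columns to match the layout asked for in the question
  select(ID, ever_smoked, alcoholic, medication, dosage,
         most_recent_med, most_recent_dose)
```

This names the columns exactly as in the question's want, at the cost of listing each column by hand.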

4

Here are a few different approaches:

1) sqldf

library(sqldf) 
# The bare columns (medication, dosage) rely on SQLite picking one row per group,
# here the last one read; SQLite documents that choice of row as arbitrary.
sqldf("select ID, 
       ever_smoked, 
       alcoholic, 
       group_concat(medication) as medication, 
       group_concat(dosage) as dosage, 
       medication as last_medication, 
       dosage as last_dosage 
     from have 
     group by ID") 

giving:

ID ever_smoked alcoholic          medication    dosage last_medication last_dosage 
1 1   no  no humira/adalimumab,prednisone,azathioprine,rowasa 40mg,15mg,30mg,9mg   rowasa  9mg 
2 2   yes  no      lialda,mercaptopurine,asacol  20mg,1g,1600mg   asacol  1600mg 
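The sqldf output separates values with a bare comma, whereas the question asks for ", ". A minimal sketch of one way to get that, assuming the default SQLite backend of sqldf, is to pass a separator as the second argument of group_concat. The name res is only illustrative, and SQLite documents the order of the concatenated elements as arbitrary, so the ordering is not guaranteed.

```r
library(sqldf)

have <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"),
  alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
  medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa",
                 "lialda", "mercaptopurine", "asacol"),
  dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"),
  stringsAsFactors = FALSE
)

# group_concat(X, Y) uses Y as the separator between concatenated values
res <- sqldf("select ID,
                     ever_smoked,
                     alcoholic,
                     group_concat(medication, ', ') as medication,
                     group_concat(dosage, ', ') as dosage
              from have
              group by ID, ever_smoked, alcoholic
              order by ID")
```

Grouping by all three identifier columns, rather than by ID alone, avoids relying on bare columns for ever_smoked and alcoholic.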

2) data.table

library(data.table) 
have_dt <- data.table(have) 
have_dt[, list(medication = toString(medication), 
       dosage = toString(dosage), 
       last_medication = medication[.N], 
       last_dosage = dosage[.N]), 
      by = "ID,ever_smoked,alcoholic"] 

giving:

ID ever_smoked alcoholic           medication    dosage last_medication last_dosage 
1: 1   no  no humira/adalimumab, prednisone, azathioprine, rowasa 40mg, 15mg, 30mg, 9mg   rowasa   9mg 
2: 2   yes  no      lialda, mercaptopurine, asacol  20mg, 1g, 1600mg   asacol  1600mg 
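If there are many columns to collapse, typing each one out becomes tedious. The following is a hedged sketch of a more generic variant using .SD and .SDcols; the names cols, collapsed, last_vals and res are made up for the illustration and are not part of the original answer.

```r
library(data.table)

have <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"),
  alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
  medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa",
                 "lialda", "mercaptopurine", "asacol"),
  dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"),
  stringsAsFactors = FALSE
)
have_dt <- as.data.table(have)

# Columns to collapse; everything else is treated as a grouping variable
cols <- c("medication", "dosage")

res <- have_dt[, {
  # One comma-separated string per column within the group
  collapsed <- lapply(.SD, toString)
  # Last element of each column within the group
  last_vals <- lapply(.SD, function(x) x[length(x)])
  names(last_vals) <- paste0("last_", cols)
  c(collapsed, last_vals)
}, by = .(ID, ever_smoked, alcoholic), .SDcols = cols]
```

Adding another column to cols then produces both its collapsed and its last_ version without further changes.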

3) base - by

do.call("rbind", by(have, have$ID, with, data.frame(
    ID = ID[1], 
    ever_smoked = ever_smoked[1], 
    alcoholic = alcoholic[1], 
    medication = toString(medication), 
    dosage = toString(dosage), 
    last_medication = tail(medication, 1), 
    last_dosage = tail(dosage, 1)))) 

giving:

ID ever_smoked alcoholic           medication    dosage last_medication last_dosage 
1 1   no  no humira/adalimumab, prednisone, azathioprine, rowasa 40mg, 15mg, 30mg, 9mg   rowasa   9mg 
2 2   yes  no      lialda, mercaptopurine, asacol  20mg, 1g, 1600mg   asacol  1600mg 

Note that this could alternatively be written as:

do.call("rbind", by(have, have$ID, function(x) with(x, data.frame(
    ID = ID[1], 
    ever_smoked = ever_smoked[1], 
    alcoholic = alcoholic[1], 
    medication = toString(medication), 
    dosage = toString(dosage), 
    last_medication = tail(medication, 1), 
    last_dosage = tail(dosage, 1))))) 

4) base - aggregate

aggregate(. ~ ID + ever_smoked + alcoholic, have, 
    function(x) c(values = toString(x), last = as.character(tail(x, 1)))) 

giving:

ID ever_smoked alcoholic         medication.values medication.last   dosage.values dosage.last 
1 1   no  no humira/adalimumab, prednisone, azathioprine, rowasa   rowasa 40mg, 15mg, 30mg, 9mg   9mg 
2 2   yes  no      lialda, mercaptopurine, asacol   asacol  20mg, 1g, 1600mg  1600mg 

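Note that this returns a 2 x 5 data frame in which each of the last two columns is a 2-column matrix. That form can be more convenient to index than the flattened one, but if a flat data frame is preferred, use do.call("data.frame", DF), where DF is the result of the aggregate call above.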
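To make the difference between the two shapes concrete, here is a small sketch that stores the aggregate result and then flattens it. The names DF and flat are illustrative; DF is assumed to be the aggregate result referred to in the note.

```r
have <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  ever_smoked = c("no", "no", "no", "no", "yes", "yes", "yes"),
  alcoholic = c("no", "no", "no", "no", "no", "no", "no"),
  medication = c("humira/adalimumab", "prednisone", "azathioprine", "rowasa",
                 "lialda", "mercaptopurine", "asacol"),
  dosage = c("40mg", "15mg", "30mg", "9mg", "20mg", "1g", "1600mg"),
  stringsAsFactors = FALSE
)

# Each summarised column becomes a 2-column matrix (values, last)
DF <- aggregate(. ~ ID + ever_smoked + alcoholic, have,
                function(x) c(values = toString(x), last = as.character(tail(x, 1))))
dim(DF)                      # 5 columns: 3 identifiers plus 2 matrix columns
DF$medication[, "last"]      # index into a matrix column by name

# Flatten the matrix columns into ordinary columns
flat <- do.call("data.frame", DF)
names(flat)
```

After flattening, the matrix columns appear as medication.values, medication.last, dosage.values and dosage.last.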