2014-09-02 94 views
0

R: Organizing, restructuring and subsetting a data frame

I have the following data frame in R (sample data only):

data <- data.frame(NAME=c("NAME1", "NAME1", "NAME1","NAME2","NAME2","NAME2"), 
        ID=c(47,47,47,259,259,259), 
        SURVEY_YEAR=c(1960,1961,1965,2007,2010,2014), 
        REFERENCE_YEAR=c(1959,1960,1963,2004,2009,2011), 
        CUMULATIVE_SUM=c(-6,-10,-23,-9,NA,-40)) 

In tabular form, it looks like this:

NAME ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM 
1 NAME1 47  1960   1959    -6 
2 NAME1 47  1961   1960   -10 
3 NAME1 47  1965   1963   -23 
4 NAME2 259  2007   2004    -9 
5 NAME2 259  2010   2009    NA 
6 NAME2 259  2014   2011   -40 

What I am trying to do is restructure my data frame so that it ends up looking like this:

NAME ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR 
1 NAME1 47  1960   1959    -6      0 
2 NAME1 47  1961   1960   -10      -6 
3 NAME1 47  1965   1963   -23     -10 
4 NAME2 259  2007   2004    -9      0 
5 NAME2 259  2010   2009    NA      NA 
6 NAME2 259  2014   2011   -40      -9 

I tried to achieve this with the following code:

# loop through elements in data$CUMULATIVE_SUM 
for (i in 1:length(data$CUMULATIVE_SUM)) { 
    # take value of upper row, but take 0 if the upper row has another NAME or this is the first row 
    if (i==1) { 
    value=0 # If first row 
    } else { 
    if (data$NAME[i-1]==data$NAME[i]) { 
     value=data$CUMULATIVE_SUM[i-1] # Normal case: take upper value 
    } else { 
     value=0 # If other NAME 
    } 
    } 
    data$CUMULATIVE_SUM_REFYEAR[i] <- value # Write new value in new column 
} 

With this code, the result is:

NAME ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR 
1 NAME1 47  1960   1959    -6      0 
2 NAME1 47  1961   1960   -10      -6 
3 NAME1 47  1965   1963   -23     -10 
4 NAME2 259  2007   2004    -9      0 
5 NAME2 259  2010   2009    NA      **-9** 
6 NAME2 259  2014   2011   -40      NA 

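Comparing this with my desired result, you will notice that the -9 is in the wrong place (marked in bold). Is there a way to handle this when NA values occur in the sequence? I am stuck. Thanks for your help!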

+0

data2 <- na.omit(data) – 2014-09-02 22:50:54

+0

Thanks for your answer, but this won't work; I still need to keep the NAs! – kurdtc 2014-09-03 00:06:27

+0

They are kept. An attribute is assigned and the values can be put back in. Check it out, it's useful. '?na.omit' – 2014-09-03 00:10:19
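The na.omit suggestion from the comments can be sketched roughly as follows. This is only one possible reading of that comment, not the commenter's own code: na.omit() records the dropped row indices in the "na.action" attribute, so the lagged values can be computed on the complete rows and written back into the full data frame. The use of ave() and the helper name `keep` are assumptions made for illustration.

```r
# Sample data from the question.
data <- data.frame(NAME = c("NAME1", "NAME1", "NAME1", "NAME2", "NAME2", "NAME2"),
                   ID = c(47, 47, 47, 259, 259, 259),
                   SURVEY_YEAR = c(1960, 1961, 1965, 2007, 2010, 2014),
                   REFERENCE_YEAR = c(1959, 1960, 1963, 2004, 2009, 2011),
                   CUMULATIVE_SUM = c(-6, -10, -23, -9, NA, -40))

# Drop incomplete rows; the dropped row indices are stored in "na.action".
data2 <- na.omit(data)
dropped <- as.integer(attr(data2, "na.action"))  # here: 5

# Within each NAME, shift CUMULATIVE_SUM down by one and start with 0.
data2$CUMULATIVE_SUM_REFYEAR <- ave(data2$CUMULATIVE_SUM, data2$NAME,
                                    FUN = function(x) c(0, head(x, -1)))

# Write the result back into the full data frame; dropped rows stay NA.
# `keep` is a hypothetical helper name for the indices of the complete rows.
keep <- setdiff(seq_len(nrow(data)), dropped)
data$CUMULATIVE_SUM_REFYEAR <- NA_real_
data$CUMULATIVE_SUM_REFYEAR[keep] <- data2$CUMULATIVE_SUM_REFYEAR
```

Using setdiff() rather than negative indexing keeps the sketch working when no row contains an NA, in which case the "na.action" attribute is absent.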

Answers

2

Try

library(data.table) 
setDT(data)[!is.na(CUMULATIVE_SUM), 
      CUMULATIVE_SUM_REFYEAR := c(0, CUMULATIVE_SUM[-.N]), 
      by = NAME] 
data 
#  NAME ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR 
# 1: NAME1 47  1960   1959    -6      0 
# 2: NAME1 47  1961   1960   -10      -6 
# 3: NAME1 47  1965   1963   -23     -10 
# 4: NAME2 259  2007   2004    -9      0 
# 5: NAME2 259  2010   2009    NA      NA 
# 6: NAME2 259  2014   2011   -40      -9 
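
The expression c(0, CUMULATIVE_SUM[-.N]) is what does the shifting: inside a data.table group, .N is the number of rows in that group, so indexing with -.N drops the last value, and prepending 0 moves everything down by one row. A minimal base R sketch of the same idea on a plain vector, where `n` stands in for .N (the variable names are illustrative only):

```r
# Non-NA CUMULATIVE_SUM values of one group (NAME1 in the sample data).
x <- c(-6, -10, -23)
n <- length(x)            # plays the role of .N within a group

# Drop the last element and prepend 0: each row gets the previous row's value.
lagged <- c(0, x[-n])
lagged
# [1]   0  -6 -10

# A group with a single non-NA value simply gets 0.
y <- c(-9)
lagged_single <- c(0, y[-length(y)])
lagged_single
# [1] 0
```

Because the answer filters with !is.na(CUMULATIVE_SUM) before grouping, rows with NA are never part of `x`, which is why their new column stays NA.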
1

Using dplyr

library(dplyr) 
left_join(data, 
          data %>% 
            group_by(NAME) %>% 
            filter(!is.na(CUMULATIVE_SUM)) %>% 
            mutate(CUMULATIVE_SUM_REFYEAR = lag(CUMULATIVE_SUM, 1, 0))) 
#  NAME ID SURVEY_YEAR REFERENCE_YEAR CUMULATIVE_SUM CUMULATIVE_SUM_REFYEAR 
#1 NAME1 47  1960   1959    -6      0 
#2 NAME1 47  1961   1960   -10      -6 
#3 NAME1 47  1965   1963   -23     -10 
#4 NAME2 259  2007   2004    -9      0 
#5 NAME2 259  2010   2009    NA      NA 
#6 NAME2 259  2014   2011   -40      -9
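
For comparison with the loop in the question, here is a base R sketch that keeps the loop structure but remembers the last non-NA value within each NAME instead of always looking at the row directly above. It assumes the rows are already sorted by NAME and then by year, as in the sample data; `last_name` and `last_value` are names introduced here for illustration.

```r
# Sample data from the question.
data <- data.frame(NAME = c("NAME1", "NAME1", "NAME1", "NAME2", "NAME2", "NAME2"),
                   ID = c(47, 47, 47, 259, 259, 259),
                   SURVEY_YEAR = c(1960, 1961, 1965, 2007, 2010, 2014),
                   REFERENCE_YEAR = c(1959, 1960, 1963, 2004, 2009, 2011),
                   CUMULATIVE_SUM = c(-6, -10, -23, -9, NA, -40))

data$CUMULATIVE_SUM_REFYEAR <- NA_real_  # rows with NA keep this default
last_name <- NULL                        # NAME of the group currently being processed
last_value <- 0                          # last non-NA CUMULATIVE_SUM seen in that group

for (i in seq_len(nrow(data))) {
  current_name <- as.character(data$NAME[i])
  # Reset the carried value at the first row and whenever NAME changes.
  if (is.null(last_name) || current_name != last_name) {
    last_name <- current_name
    last_value <- 0
  }
  # Only non-NA rows receive a value and update the carried value.
  if (!is.na(data$CUMULATIVE_SUM[i])) {
    data$CUMULATIVE_SUM_REFYEAR[i] <- last_value
    last_value <- data$CUMULATIVE_SUM[i]
  }
}

data$CUMULATIVE_SUM_REFYEAR
# [1]   0  -6 -10   0  NA  -9
```

The grouped data.table and dplyr versions above are shorter and will scale better; the loop is mainly useful for seeing where the original code went wrong.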