2016-09-14 38 views
1

数据dplyr:获得区域和面积的不同层次的分布

df <- read.csv(url("https://www.dropbox.com/s/uaivja22czx2pe8/df_stats_question.csv?raw=1")) 

创建不同层次的EVT

#for example "0-15", "15-30", "30-60", ">60" 
library(dplyr) 
df <- df %>% 
    mutate(EVT_mod = ifelse (EVT <= 15, "0-15", 
          ifelse(EVT <= 30, "15-30", 
            ifelse(EVT <= 60, "30-60", ">60")))) 

我要做些什么?

每个区域(Zone1Zone5),我想要得到的param1param2 不同组合的总区域面积的百分比,这percent_area的每个级别中EVT_mod

例分布输出

#I want the output to be as below 
#ID  param1 param2 percent_area  0-15 15-30 30-60 >60 
#zone1 High  High  10    2  3  4  1 
#zone1 High  Medium 5     0.5 2  0.5  2 
#zone1 High  Low  15    3  4  5  3 
#zone1 Medium High  9     3  2  3  1 
#zone1 Medium Medium 11    2  3  4  2 
#zone1 Medium Low  8     0.7 0.3 3  4 
#zone1 Low  High  7     0.9 1.1 3  2 
#zone1 Low  Medium 23    8  7  5  3 
#zone1 Low  Low  12    7  2  1  2 

我干了什么?

#I got the percent of area for each zone like below 
df1 <- df %>% 
    dplyr::select(ID, param1, param2, area) %>% 
    dplyr::arrange(ID, param1, param2) %>% 
    dplyr::group_by(ID, param1, param2) %>% 
    dplyr::summarise(area = sum(area)) %>% 
    dplyr::group_by(ID) %>% 
    dplyr::mutate(percent_area = area/sum(area) * 100) 

head(df1) 
#  ID param1 param2  area percent_area 
# <fctr> <fctr> <fctr>  <dbl>  <dbl> 
#1 Zone1 High High 1247.26891 1.60636374 
#2 Zone1 High Low 4725.57502 6.08609125 
#3 Zone1 High Medium 10.06087 0.01295744 
#4 Zone1 Low High 1432.38859 1.84478029 
#5 Zone1 Medium High 44907.15570 57.83614608 
#6 Zone1 Medium Low 22036.19702 28.38052622 

问题

任何建议如何获得percent_area每个EVT_mod水平的分布可以理解的?

回答

1

这个怎么样?首先按EVT_mod分组,然后分散在各列上,然后我们以类似的方式结束。

首先,我改变这一行:

df <- df %>% 
    mutate(EVT_mod = ifelse (EVT <= 15, 'cat1', 
          ifelse(EVT <= 30, 'cat2', 
            ifelse(EVT <= 60, 'cat3', 'cat4')))) 

因为这些将成为列名,并有像0-15事情列名是一种痛苦,特别是dplyr的NSE。

df %>% 
    select(ID, param1, param2, area, EVT_mod) %>% 
    group_by(ID, param1, param2, EVT_mod) %>% 
    summarise(area = sum(area)) %>% 
    tidyr::spread(EVT_mod, area, fill = 0) %>% 
    mutate(area = sum(c(cat1, cat2, cat3, cat4))) %>% 
    group_by(ID) %>% 
    mutate(cat1 = cat1/sum(area) * 100, 
     cat2 = cat2/sum(area) * 100, 
     cat3 = cat3/sum(area) * 100, 
     cat4 = cat4/sum(area) * 100, 
     percent_area = area/sum(area) * 100) %>% 
    arrange(ID, param1, param2) 

# Source: local data frame [61 x 9] 
# Groups: ID [5] 
# 
#  ID param1 param2  cat1  cat2  cat3 cat4  area percent_area 
# <fctr> <fctr> <fctr>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl> 
# 1 Zone1 High High 1.34705031 0.25931343 0.00000000  0 1247.26891 1.60636374 
# 2 Zone1 High Low 5.59184841 0.49424283 0.00000000  0 4725.57502 6.08609125 
# 3 Zone1 High Medium 0.01262533 0.00033211 0.00000000  0 10.06087 0.01295744 
# 4 Zone1 Low High 1.84478029 0.00000000 0.00000000  0 1432.38859 1.84478029 
# 5 Zone1 Medium High 56.31313681 1.52300927 0.00000000  0 44907.15570 57.83614608 
# 6 Zone1 Medium Low 18.64165645 9.73886978 0.00000000  0 22036.19702 28.38052622 
# 7 Zone1 Medium Medium 4.06436687 0.16876810 0.00000000  0 3286.83815 4.23313497 
# 8 Zone2 High High 30.03120766 10.13084134 0.01099552  0 11522.80578 40.17304453 
# 9 Zone2 High Low 6.91574950 1.58340654 0.04628919  0 2451.08397 8.54544522 
# 10 Zone2 High Medium 0.88955660 0.05981439 0.00000000  0 272.30741 0.94937100 
# # ... with 51 more rows 
+0

非常感谢您的时间和帮助。 – aelwan