2014-10-09 38 views
0

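I have a data frame from which I want to compute the percentage treated for each group, where % treated = Treated / Total visits. In other words, I want to automate element-wise division within a data frame.

For example, % treated for acute maxillary sinusitis = 93470/93470 = 100%.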

dput(droplevels(head(magma))) 

structure(list(DIAG_CODE_1 = structure(c(1L, 1L, 2L, 2L, 2L, 
2L), .Label = c("4610 SINUSITIS MAXILLARY ACUT", "4619 SINUSITIS ACUTE UNSP" 
), class = "factor"), GENDER = structure(c(1L, 1L, 1L, 1L, 1L, 
1L), .Label = "FEMALE", class = "factor"), AGE = structure(c(1L, 
1L, 1L, 1L, 1L, 1L), .Label = "0-2", class = "factor"), Mention_DRGU = c(5460L, 
5460L, 17790L, 17790L, 9400L, 9400L), treatment_status = structure(c(1L, 
2L, 1L, 2L, 1L, 2L), .Label = c("Total visits", "Treated"), class = "factor"), 
    diag_class_1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Acute sinusitis", class = "factor"), 
    year = c(2007L, 2007L, 2007L, 2007L, 2008L, 2008L)), .Names = c("DIAG_CODE_1", 
"GENDER", "AGE", "Mention_DRGU", "treatment_status", "diag_class_1", 
"year"), row.names = c(1285L, 1286L, 1407L, 1410L, 1408L, 1411L 
), class = "data.frame") 

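With 432 rows it would be possible to calculate everything by hand, but that would be incredibly time-consuming. Isn't that what computers are for? :p If you could help me find a way to automate this task in R, I would greatly appreciate it.

Is there a way to create a result data frame that gives me DIAG_CODE_1, gender, age, percentage treated, and year? I created (in Excel) what I would like the output to look like, so you can see what I mean.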

[Image: desired output]

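I will be doing this kind of calculation for other respiratory diseases as well, so I want to learn how to do it now to make my life easier in the long run.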

@akrun I have added some dput output (hopefully correctly) – user3900661 2014-10-09 15:40:02

Answers

1

You can use dplyr together with tidyr:

library(dplyr) 
library(tidyr) 

magma %>% 
     spread(treatment_status, Mention_DRGU) %>% 
     mutate(PercentageTreated=100*(Treated/`Total visits`)) %>% 
     select(-diag_class_1, -`Total visits`, -Treated) 
#     DIAG_CODE_1 GENDER AGE year PercentageTreated 
#1 4610 SINUSITIS MAXILLARY ACUT FEMALE 0-2 2007    100 
#2  4619 SINUSITIS ACUTE UNSP FEMALE 0-2 2007    100 
#3  4619 SINUSITIS ACUTE UNSP FEMALE 0-2 2008    100 
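
In later releases of tidyr, `spread()` has been superseded by `pivot_wider()`. The following is a minimal sketch of the same pipeline written with `pivot_wider()`, assuming tidyr 1.0.0 or newer is installed. The small `magma` data frame here is a stand-in rebuilt from the `dput()` output in the question (with character columns instead of factors), and the name `result` is only illustrative.

```r
library(dplyr)
library(tidyr)

# Stand-in for `magma`, rebuilt from the dput() output in the question
# (assumption: plain character columns instead of factors)
magma <- data.frame(
  DIAG_CODE_1 = c(rep("4610 SINUSITIS MAXILLARY ACUT", 2),
                  rep("4619 SINUSITIS ACUTE UNSP", 4)),
  GENDER = "FEMALE",
  AGE = "0-2",
  Mention_DRGU = c(5460L, 5460L, 17790L, 17790L, 9400L, 9400L),
  treatment_status = rep(c("Total visits", "Treated"), 3),
  diag_class_1 = "Acute sinusitis",
  year = c(2007L, 2007L, 2007L, 2007L, 2008L, 2008L)
)

result <- magma %>%
  # One row per group, with one column per level of treatment_status
  pivot_wider(names_from = treatment_status, values_from = Mention_DRGU) %>%
  # Percentage treated = 100 * Treated / Total visits
  mutate(PercentageTreated = 100 * Treated / `Total visits`) %>%
  # Keep only the columns asked for in the question
  select(DIAG_CODE_1, GENDER, AGE, year, PercentageTreated)

print(result)
```

Because the columns are selected by name, the result does not depend on the order in which the `Total visits` and `Treated` columns are created.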

Thanks for your help! Can you recommend any resources that would help me understand the dplyr and tidyr packages? – user3900661 2014-10-09 16:14:45

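These packages are still new, so I am not sure whether many resources are available. There is an R-day training in New York City on the 15th: http://blog.rstudio.org/2014/07/08/r-day-at-strata-nyc/ I would also search `stackoverflow` for tags related to `dplyr/tidyr`. – akrun 2014-10-09 16:22:26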

1

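Try this:

# Reshape to wide format: one row per group, one column per treatment_status value
magma2<-reshape(magma, idvar = c("DIAG_CODE_1","GENDER","AGE","diag_class_1","year"), timevar = "treatment_status", direction = "wide") 

# The wide columns follow the order in which the statuses appear in the data
# (here "Total visits" first, then "Treated"), so the names must be given in that order
colnames(magma2)<-c("DIAG_CODE_1","GENDER","AGE","diag_class_1","year","TotVisits","Treated") 

# Multiply by 100 so the value is a percentage rather than a fraction
magma2$PercentageTreated<-100*as.numeric(as.character(magma2$Treated))/as.numeric(as.character(magma2$TotVisits)) 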

head(magma2) 
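
Renaming columns by position is easy to get wrong, as the comments on this answer show. The following is a minimal base R sketch that avoids positional renaming by splitting the data by `treatment_status` and joining the two halves with `merge()`. The `magma` data frame is a stand-in rebuilt from the `dput()` output in the question, and the names `keys`, `treated`, `total`, and `magma3` are illustrative rather than part of the original answer.

```r
# Stand-in for `magma`, rebuilt from the dput() output in the question
# (assumption: plain character columns instead of factors)
magma <- data.frame(
  DIAG_CODE_1 = c(rep("4610 SINUSITIS MAXILLARY ACUT", 2),
                  rep("4619 SINUSITIS ACUTE UNSP", 4)),
  GENDER = "FEMALE",
  AGE = "0-2",
  Mention_DRGU = c(5460L, 5460L, 17790L, 17790L, 9400L, 9400L),
  treatment_status = rep(c("Total visits", "Treated"), 3),
  diag_class_1 = "Acute sinusitis",
  year = c(2007L, 2007L, 2007L, 2007L, 2008L, 2008L)
)

# Columns that identify one group
keys <- c("DIAG_CODE_1", "GENDER", "AGE", "diag_class_1", "year")

# Split the counts by status, keeping only the keys and the count column
treated <- magma[magma$treatment_status == "Treated", c(keys, "Mention_DRGU")]
total <- magma[magma$treatment_status == "Total visits", c(keys, "Mention_DRGU")]

# Rename the count column by name, not by position
names(treated)[names(treated) == "Mention_DRGU"] <- "Treated"
names(total)[names(total) == "Mention_DRGU"] <- "TotVisits"

# Join the two halves on the group keys and compute the percentage
magma3 <- merge(treated, total, by = keys)
magma3$PercentageTreated <- 100 * magma3$Treated / magma3$TotVisits

print(magma3)
```

Each count is labelled from the `treatment_status` value it was filtered on, so the numerator and denominator cannot be swapped by a change in column order.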

This gives errors in two lines – 2014-10-09 15:51:28

What are the errors? Without reproducible code, I cannot verify – Ujjwal 2014-10-09 15:53:16

Before, it was the names; you fixed that, but now there is a factor error for `magma2$Treated/magma2$TotVisits`, and that is because the names are on the wrong columns – 2014-10-09 15:55:53