2017-04-18

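R: Is there an equivalent of Stata's codebookout command?

In Stata, I can use the codebookout command to create an Excel workbook that saves the name, label, and storage type of every variable in the existing dataset, together with the corresponding values and value labels.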

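I would like to find an equivalent function in R. So far I have come across the memisc package, which has a function called codebook, but it does not do the same thing as the Stata command.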

For example, in Stata the codebook output looks like this (see below; this is what I want):

Variable Name  Variable Label  Answer Label  Answer Code  Variable Type
hhid           hhid            Open ended                 String
inter_month    inter_month     Open ended                 long
year           year            Open ended                 long
org_unit       org_unit                                   long
                               Balaka        1
                               Blantyre      2
                               Chikwawa      3
                               Chiradzulu    4

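That is, each column of the data frame is evaluated to produce values for 5 different columns:

  • Variable Name: the name of the column
  • Variable Label: the name of the column
  • Answer Label: the unique values in the column. If there are no unique values, the variable is treated as open ended
  • Answer Code: the number assigned to each category in Answer Label. Blank if the answer label is not categorical.
  • Variable Type: int, str, long (date), ...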
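The pairing of answer labels and answer codes described in the list can be illustrated on a single column. This is a minimal sketch in base R, assuming categorical columns are stored as factors; the vector org_unit below is made-up example data that reuses district names from the table above.

```r
# Hypothetical example data: a factor column holding district names
org_unit <- factor(c("Blantyre", "Balaka", "Chikwawa", "Balaka"))

# Answer labels: the distinct categories, in factor-level order
answer_label <- levels(org_unit)

# Answer codes: one integer per label, matching the level order
answer_code <- seq_along(answer_label)

data.frame(answer_label, answer_code)
```

Factor levels are sorted alphabetically by default, which is why Balaka receives code 1 here even though Blantyre appears first in the data.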

Here is my attempt:

CreateCodebook <- function(dF){ 
    numbercols <- length(colnames(dF)) 

    table <- data.frame() 

    for (i in 1:length(colnames(dF))){ 
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else "" 
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended" 
    VariableName <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i], 
                rep("",length(AnswerCode) - 1)) else colnames(dF)[i] 
    VariableLabel <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i], 
                rep("",length(AnswerCode) - 1)) else colnames(dF)[i] 
    VariableType <- if (length(AnswerCode) - 1 > 1) c(sapply(dF, class)[i], 
                rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i] 

    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType) 
    names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label") 
    table <- rbind(table, df) 

    } 
    return(table) 
} 

Unfortunately, I get the following warning messages:

Warning messages: 
1: In `[<-.factor`(`*tmp*`, ri, value = 1:3) : 
    invalid factor level, NA generated 
2: In `[<-.factor`(`*tmp*`, ri, value = 1:2) : 
    invalid factor level, NA generated 
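This warning is consistent with integer answer codes being written into a column that was earlier stored as a factor. That reading is an inference from the message text, not something checked against the asker's data. A minimal base-R sketch of the mechanism, with made-up vectors code_fac and code_chr:

```r
# A column of blank answer codes that has been stored as a factor, as
# data.frame() did by default in R versions before 4.0
code_fac <- factor(c("", "", ""))
code_fac[2:3] <- 1:2   # integers are not existing levels: warning, NA stored

# The same assignment into a character column is coerced silently
code_chr <- c("", "", "")
code_chr[2:3] <- 1:2   # the integers become the strings "1" and "2"
```

Keeping the columns as character vectors (for example with stringsAsFactors = FALSE) avoids the NA values.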

The output I get has the answer codes and labels mixed up:

   Variable Name Variable Label Variable Type Answer Code Answer Label 
hhid     hhid   hhid Open ended    character 
month     month   month Open ended     integer 
year     year   year Open ended     integer 
org_unit   org_unit  org_unit Open ended    character 
v000     v000   v000 Open ended    character 
v001     v001   v001 Open ended     integer 
v002     v002   v002 Open ended     integer 
v003     v003   v003 Open ended     integer 
v005     v005   v005 Open ended     integer 
v006     v006   v006 Open ended     integer 
v007     v007   v007 Open ended     integer 
v021     v021   v021 Open ended     numeric 
2285     v024   v024  central  <NA>  factor 
1             north  <NA>    
7119            south  <NA>    
11      v025   v025   rural  <NA>  factor 
1048     v025   v025   urban  <NA>  factor 
district_name district_name district_name Open ended    character 
coords_x1   coords_x1  coords_x1 Open ended     numeric 
coords_x2   coords_x2  coords_x2 Open ended     numeric 
itn_color   itn_color  itn_color Open ended     numeric 
piped     piped   piped Open ended     numeric 
sanit     sanit   sanit Open ended     numeric 
sanit_cd   sanit_cd  sanit_cd Open ended     numeric 
water     water   water Open ended     numeric 
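Can you show how you have tried to answer this question so far? You could start by writing some code ... (otherwise this is either "find an off-site resource" (off-topic) or "write the code for me" (off-topic) ...)

I basically have a data frame (it can be any data frame, it doesn't matter) and I apply the codebook function to it. But the output is not what I want.

Sorry, I read too quickly and didn't see that you mentioned memisc::codebook in the original version of the question. Even so, I'm afraid that (if you can't make more progress yourself) this question may not be a good fit for SO, because you essentially need a custom, very specific output.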

Answers

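I decided to have a go at this for my own amusement, using the built-in Titanic dataset. I do have a question about one of your definitions, though: you say "if there are no unique values, it is considered open ended". But any variable of length > 0 has some unique values; did you mean "if every value is unique"? Even that definition would not necessarily work as intended: in the Titanic dataset the response is integer-valued, and only 22 of the 32 values are unique. I don't think anyone would really want those enumerated, so I test for type factor instead (but if you really want the other behaviour, you can use the commented-out length(u)==length(x) line below in its place).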

## utility function: pad vector with blanks to specified length 
pad <- function(x,n,p="") { 
    return(c(x,rep(p,n-length(x)))) 
} 
## process a single column 
proc_col <- function(x,nm) { 
    u <- unique(x) 
    ## if (length(u)==length(x)) { 
    if (!is.factor(x)) { 
     n <- 1 
     u <- "open ended" 
     cc <- "" 
    } else { 
     cc <- as.numeric(u) 
     n <- length(u) 
    } 
    dd <- data.frame(`Variable Name`=pad(nm,n), 
       `Variable Label`=pad(nm,n), 
       `Answer Label`=u, 
       `Answer Code`=cc, 
       `Variable Type`=pad(class(x),n), 
       stringsAsFactors=FALSE) 
    return(dd) 
} 
## process all columns 
proc_df <- function(x) { 
    L <- Map(proc_col,x,names(x)) 
    dd <- do.call(rbind,L) 
    rownames(dd) <- NULL 
    return(dd) 
} 

Example:

xx <- as.data.frame.table(Titanic) 
proc_df(xx) 

## Variable.Name Variable.Label Answer.Label Answer.Code Variable.Type 
## 1   Class   Class   1st   1  factor 
## 2          2nd   2    
## 3          3rd   3    
## 4          Crew   4    
## 5   Sex   Sex   Male   1  factor 
## 6          Female   2    
## 7   Age   Age  Child   1  factor 
## 8          Adult   2    
## 9  Survived  Survived   No   1  factor 
## 10          Yes   2    
## 11   Freq   Freq open ended     numeric 
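The point made in this answer about unique values (only some of the Freq values are distinct, so "every value is unique" is a fragile test for open-ended variables) can be checked directly. A minimal sketch using the built-in Titanic table; the names n_total and n_unique are introduced here for illustration only.

```r
# Flatten the built-in Titanic contingency table into a data frame
xx <- as.data.frame.table(Titanic)

n_total  <- length(xx$Freq)          # number of values in the Freq column
n_unique <- length(unique(xx$Freq))  # number of distinct values among them

# Freq would count as "open ended" only if every value were unique
n_unique == n_total

# The type-based test used in proc_col(): which columns are factors?
sapply(xx, is.factor)
```

Testing the column type does not depend on whether the data happen to contain repeated values, which is the design choice the answer makes.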

I haven't left blank spaces before the list of code values and so on, but perhaps you can make those adjustments yourself?
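Stata's codebookout writes its result to an Excel workbook, while the code above only returns a data frame. One base-R stand-in is to write the table to a CSV file, which Excel can open. This is a sketch under that assumption; the data frame cb and the file name codebook.csv are hypothetical, and cb simply mimics two rows of the proc_df() output.

```r
# Hypothetical codebook rows, shaped like the proc_df() output above
cb <- data.frame(
  Variable.Name  = c("Sex", ""),
  Variable.Label = c("Sex", ""),
  Answer.Label   = c("Male", "Female"),
  Answer.Code    = c(1, 2),
  Variable.Type  = c("factor", ""),
  stringsAsFactors = FALSE
)

# Write to a temporary location; the file name is an arbitrary choice
out_file <- file.path(tempdir(), "codebook.csv")
write.csv(cb, out_file, row.names = FALSE)

# Read the file back to confirm the round trip
back <- read.csv(out_file, stringsAsFactors = FALSE)
```

A true .xlsx file would need an add-on package, which is outside what this thread covers.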

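Thank you very much, Ben! I will definitely upvote and accept this answer. For my own benefit I also came up with a solution. I was very close, but I was getting a warning message.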

Here is my own solution, a bit of a hack:

CreateCodebook <- function(dF){
    table <- data.frame()

    for (i in seq_along(dF)){
        col <- dF[[i]]
        name <- colnames(dF)[i]
        # factors are enumerated in level order; everything else is "Open ended"
        AnswerLabel <- if (is.factor(col)) as.character(unique(col[order(col)])) else "Open ended"
        AnswerCode <- if (is.factor(col)) seq_along(AnswerLabel) else ""
        # name, label and type appear only on the first row of each variable
        blanks <- rep("", length(AnswerCode) - 1)
        VariableName <- c(name, blanks)
        VariableLabel <- c(name, blanks)
        VariableType <- c(class(col)[1], blanks)

        # stringsAsFactors = FALSE keeps the columns as character, which
        # avoids the "invalid factor level" warning when rows are bound
        df <- data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType, stringsAsFactors = FALSE)
        # the names must follow the column order used in data.frame() above
        names(df) <- c("Variable Name", "Variable Label", "Answer Label", "Answer Code", "Variable Type")
        table <- rbind(table, df)
    }
    rownames(table) <- seq_len(nrow(table))
    return(table)
}

Output:

Variable Name Variable Label Answer Label Answer Code Variable Type
1   brid   brid Open ended    character 
2   month   month Open ended     integer 
3   year   year Open ended     integer 
4  org_unit  org_unit Open ended    character 
5   v000   v000 Open ended    character 
6   v001   v001 Open ended     integer 
7   v002   v002 Open ended     integer 
8   v003   v003 Open ended     integer 
9   v005   v005 Open ended     integer 
10   v006   v006 Open ended     integer 
11   v007   v007 Open ended     integer 
12   v021   v021 Open ended     numeric 
13   v024   v024  central   1  factor 
14          north   2    
15          south   3    
16   v025   v025   rural   1  factor 
17          urban   2    
18   bidx   bidx Open ended     integer 
19 district_name district_name Open ended    character 
20  coords_x1  coords_x1 Open ended     numeric 
21  coords_x2  coords_x2 Open ended     numeric 
22   anc4   anc4 Open ended     numeric 
23 antimal_48  antimal_48 Open ended     numeric 
24   carep   carep Open ended     numeric 
25   csec   csec Open ended     numeric 
26   dptv   dptv Open ended     numeric 
27  ebreast  ebreast Open ended     numeric 
28  fans_48  fans_48 Open ended     numeric 
29  ideliv   ideliv Open ended     numeric 
30   iptp   iptp Open ended     numeric 
31  iron90   iron90 Open ended     numeric 
32  measlesv  measlesv Open ended     numeric 
33   ors   ors Open ended     numeric 
34   ort   ort Open ended     numeric 
35   pncwm   pncwm Open ended     numeric 
36  sstools  sstools Open ended     numeric 
37   tt    tt Open ended     numeric 
38   vita   vita Open ended     numeric 