2016-08-12 56 views
3

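Using R - condensing multiple columns into a new column without repeating content

I am a botanist and a beginner R user. I wonder if you could help me find a solution for a script I am writing. I have been using R to streamline the process of creating text from spreadsheets. For that I use the MonographaR package, which works well for me. The problem itself is in handling the data.frame. My spreadsheet (a CSV file) basically consists of species columns, character rows, and the cells where they intersect. I would like a final script that lets me merge two or more columns into a new column of the original spreadsheet. When the cells have different contents, the new cell must hold the distinct contents separated by a comma plus a space (", "). When the cells have the same content, the new cell must contain that content only once, without repeating it. The scripts I tried to write with concatenation, cbind, etc. repeated the cell contents, and I was not happy with that.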

My initial CSV looks like this:

 cattleya.minor cattleya.maxima cattleya.pumila 
colour red   red    red 
surface sharp   smooth   sharp 
leaves 1    3    4 

and I would like a final result like this:

 cattleya  cattleya.minor cattleya.maxima cattleya.pumila 
colour red   red   red    red 
surface sharp, smooth sharp   smooth   sharp 
leaves 1, 3, 4  1    3    4 

Thank you very much indeed.

+3

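Your data is not [tidy](http://vita.had.co.nz/papers/tidy-data.pdf), because you have different types of data (strings, integers) within the same column. It would be better to reshape the data so that each column is a variable and each row is an observation. – alistaire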
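To make the "one column per variable, one row per observation" idea concrete, here is a minimal base R sketch. It assumes the spreadsheet has been read with the characters as row names and every column stored as character; the object names `dat` and `tidy` are only illustrative.

```r
# Example data rebuilt from the question. Every column is character because
# each species column mixes words ("red") and numbers ("1").
dat <- data.frame(
  cattleya.minor  = c("red", "sharp", "1"),
  cattleya.maxima = c("red", "smooth", "3"),
  cattleya.pumila = c("red", "sharp", "4"),
  row.names = c("colour", "surface", "leaves"),
  stringsAsFactors = FALSE
)

# t() swaps rows and columns, so each species becomes a row and each
# character (colour, surface, leaves) becomes a column.
tidy <- as.data.frame(t(dat), stringsAsFactors = FALSE)

# Now that "leaves" has a column of its own, it can hold a proper integer type.
tidy$leaves <- as.integer(tidy$leaves)

# Move the species names out of the row names into an ordinary column.
tidy <- data.frame(Species = rownames(tidy), tidy,
                   row.names = NULL, stringsAsFactors = FALSE)

str(tidy)
```

In this layout each column has a single type, which is what makes the grouping and summarising steps straightforward.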

Answers

1

As @alistaire commented, things get easier if you start from "tidy" data.

# Starting data (which I've called "dat") 
dat 
 cattleya.minor cattleya.maxima cattleya.pumila 
colour    red    red    red 
surface   sharp   smooth   sharp 
leaves    1    3    4 
library(reshape2) 
library(tibble) 
library(dplyr) 

# Make data tidy 
dat.tidy = dat %>% 
    rownames_to_column(var="Characteristic") %>%    # Turn rownames into a data column 
    melt(id.var="Characteristic", variable.name="Species") %>% # Reshape to "long" format 
    dcast(Species ~ Characteristic)        # Cast back to wide so that each characteristic gets its own column 

dat.tidy  
  Species colour leaves surface 
1 cattleya.minor red  1 sharp 
2 cattleya.maxima red  3 smooth 
3 cattleya.pumila red  4 sharp 
# Summarize by genus 
dat.tidy %>% 
    group_by(Genus=gsub("(.*)\\..*","\\1",Species)) %>%  # Collapse to genus (remove species designation) 
    summarise_all(funs(paste(unique(.), collapse=", "))) %>% # For each characteristic, paste together each unique value for a given genus 
    select(-Species) 
 Genus colour leaves  surface 
1 cattleya red 1, 3, 4 sharp, smooth 
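If the goal is to keep the original wide layout and simply add the combined column to it, as in the question's desired output, the same `paste(unique(...), collapse = ", ")` idea can be applied row by row. The following is only a sketch in base R, not part of the answer above: it assumes all columns are character, that the genus columns share the `cattleya.` prefix, and `collapse_unique` is a hypothetical helper name.

```r
# Example data rebuilt from the question (all columns character).
dat <- data.frame(
  cattleya.minor  = c("red", "sharp", "1"),
  cattleya.maxima = c("red", "smooth", "3"),
  cattleya.pumila = c("red", "sharp", "4"),
  row.names = c("colour", "surface", "leaves"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep each distinct value once, joined by ", ".
collapse_unique <- function(x) paste(unique(x), collapse = ", ")

# Columns assumed to belong to the genus: those starting with "cattleya.".
genus_cols <- grep("^cattleya\\.", names(dat), value = TRUE)

# Apply the helper to every row (MARGIN = 1) of the selected columns.
combined <- apply(dat[, genus_cols, drop = FALSE], 1, collapse_unique)

# Put the new column first and keep the original columns unchanged.
result <- data.frame(cattleya = combined, dat,
                     stringsAsFactors = FALSE, check.names = FALSE)

print(result)
```

Because `apply()` converts the data frame to a matrix first, this sketch is safest when all columns are already character; numeric columns may be formatted with padding during that conversion.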
0

Thank you @alistaire & @eipi10!

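eipi10, I'm glad to be getting close to my goal. I ran the script exactly as you suggested, with the same dataset. It works well, but there seems to be a small problem in the last block of commands, or on the line select(-Species). Could you check it? R gives me the following: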

> dat <- read.csv("dat.csv") 
> dat 
     cattleya.minor cattleya.maxima cattleya.pumila 
color    red    red    red 
surface   sharp   smooth   sharp 
leaves    1    3    4 
> 
> # Make data tidy 
> dat.tidy = dat %>% 
+ rownames_to_column(var="Characteristic") %>%    # Turn  rownames into a data column 
+ melt(id.var="Characteristic", variable.name="Species") %>% # Reshape to "long" format 
+ dcast(Species ~ Characteristic)        # Cast back to wide so that each characteristic gets its own column 
Warning message: 
attributes are not identical across measure variables; they will be dropped 
> 
> dat.tidy 
      Species color leaves surface 
1 cattleya.minor red  1 sharp 
2 cattleya.maxima red  3 smooth 
3 cattleya.pumila red  4 sharp 
> 
> # Summarize by genus 
> dat.tidy %>% 
+ group_by(Genus=gsub("(.*)\\..*","\\1",Species)) %>% # Collapse to genus (remove species designation) 
+ summarise_all(funs(paste(unique(.), collapse=", "))) # For each charactreristic, paste together each unique value for a given genus 
# A tibble: 1 x 5 
    Genus           Species color leaves   surface 
    <chr>           <chr> <chr> <chr>   <chr> 
1 cattleya cattleya.minor, cattleya.maxima, cattleya.pumila red 1, 3, 4 sharp, smooth 
> select(-Species) 
Error in select_(.data, .dots = lazyeval::lazy_dots(...)) : 
    objeto 'Species' não encontrado (my free translation: object 'Species' not found) 
> 
+0

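That's because, while editing my answer, I accidentally deleted the '%>%' on the line before 'select(-Species)'. Sorry about that. I've fixed it now. Without the '%>%' at the end of the previous line, R treats 'select(-Species)' as a separate statement, which causes the error. 'select(-Species)' just removes the 'Species' column, so if you want to keep the 'Species' column in the summarised output, you can delete that line. – eipi10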
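The effect of the missing '%>%' can be reproduced in isolation. This is a small sketch that assumes the dplyr package is installed; the cut-down data frame and the names `piped` and `msg` are illustrative only.

```r
library(dplyr)

# A cut-down version of the tidy data, just enough to show the behaviour.
dat.tidy <- data.frame(
  Species = c("cattleya.minor", "cattleya.maxima", "cattleya.pumila"),
  colour  = c("red", "red", "red"),
  stringsAsFactors = FALSE
)

# Connected: the trailing %>% hands the data frame to select() as its
# first argument, so the Species column is dropped.
piped <- dat.tidy %>%
  select(-Species)

# Disconnected: with no %>% before it, select() receives no data frame.
# R tries to evaluate -Species as an ordinary expression and fails.
msg <- tryCatch(
  select(-Species),
  error = function(e) conditionMessage(e)
)

print(names(piped))
print(msg)
```

The exact wording of the error message depends on the dplyr version and the session language, which is why the transcript above shows it in Portuguese.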

+0

Fantastic solution! Thank you very much. –