2016-07-26

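Creating variables using multiple variables

I am trying to clean a dataset and create 3 variables named Adventure, Action and Comedy. The original dataset has 3000 observations (imported file name: dat). I am showing only a few of the observations: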

id Runtime  Genres          
37  75  animation, adventure, family, fantasy, musical 
1  162  action, adventure, fantasy, sci_fi  
95  126  action, fantasy 
100  101  comedy, drama, fantasy 
82  136  action, adventure, sci-fi  
99  117  animation, adventure, comedy, family, sport 
91  95  animation, comedy, crime, family 

After importing the dataset into R, I split the genres into 5 columns using the following R code:

dat1 <- dat %>% separate (Genres, c("Genres1","Genres2" ,"Genres3" ,"Genres4" ,"Genres5"), sep=",", extra = "drop", fill = "right") 


id Runtime Genres1 Genres2 Genres3 Genres4 Genres5          
37  75  animation adventure family fantasy musical 
1  162  action  adventure fantasy sci_fi  
95  126  action  fantasy 
100  101  comedy  drama  fantasy 
82  136  action  adventure sci-fi  
99  117  animation adventure comedy family sport 
91  95  animation comedy  crime family 

How can I collapse all the genres into one column each for action, adventure and comedy?

I tried the following code.

Creating an empty column for adventure and filling it:

dat1 ["adventure"] <- NA 

dat1$adventure <- ifelse(dat1$Genres1=="adventure",1,(ifelse(dat1$Genres2=="adventure",1,0))) 

After a suggestion, I shortened the code:

dat1$adventure <- ifelse((dat1$Genres1=="adventure" | dat1$Genres2=="adventure" | dat1$Genres3=="adventure" | dat1$Genres4=="adventure"),1, 0) 


id Runtime Genres1 Genres2 Genres3 Genres4 Genres5 Adventure          
37  75  animation adventure family fantasy musical 0 
1  162  action  adventure fantasy sci_fi   0 
95  126  action  fantasy        0 
100  101  comedy  drama  fantasy     0 
82  136  action  adventure sci-fi      0 
99  117  animation adventure comedy family sport 0 
91  95  animation comedy  crime family   0 

The code is able to extract adventure from Genres1, but it returns zero for Genres2.
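
One likely cause, assuming the split was done with `sep=","` as in the `separate()` call above: every genre after the first keeps its leading space, so `Genres2` holds `" adventure"` rather than `"adventure"` and the exact comparison is never true. Missing values in the later columns also propagate through `==`. A minimal base-R sketch on a made-up data frame (`toy` and `genre_cols` are hypothetical names, not part of the real `dat1`):

```r
# Toy stand-in for dat1; note the leading spaces left behind by sep = ","
toy <- data.frame(
  Genres1 = c("animation", "action", "comedy"),
  Genres2 = c(" adventure", " fantasy", " drama"),
  Genres3 = c(" family", NA, " fantasy"),
  stringsAsFactors = FALSE
)

# Because of the leading space, the exact comparison never matches
raw_match <- toy$Genres2 == "adventure"
raw_match

# Trim whitespace in every genre column
genre_cols <- c("Genres1", "Genres2", "Genres3")
toy[genre_cols] <- lapply(toy[genre_cols], trimws)

# Flag rows where any genre column equals "adventure";
# na.rm = TRUE treats the NA cells as "no match"
toy$adventure <- as.integer(
  rowSums(toy[genre_cols] == "adventure", na.rm = TRUE) > 0
)
toy$adventure
```

Passing `sep = ", "` to `separate()` should avoid the stray spaces in the first place, since the separator then consumes the space as well.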

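I have revised the question. I tried some of the suggestions, but I am not sure how to proceed because there are 3000 observations.

Running the suggestion: I formed a vector from the list of genres and assigned it to dat2: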

dat2 <- c("adventure", "comedy", "action", "drama", "animation", "fantasy", "mystery", "family", "sci-fi", "thriller", "romance", "horror", "musical","history", "war", "documentary", "biography") 

table(factor(dat2))

action adventure animation biography  comedy documentary   drama 
     1   1   1   1   1   1   1 
family  fantasy  history  horror  musical  mystery  romance 
     1   1   1   1   1   1   1 
sci-fi thriller   war 
     1   1   1                 

Creating the function:

fun1 <- function("adventure", "comedy", "action", "drama", "animation", 
"fantasy", "mystery", "family", "sci-fi", "thriller", "romance", "horror", 
"musical","history", "war", "documentary", "biography")) { 
vector_of_cur_genres <- seperate(i, sep = ", ") 
result <- table(factor(vector_of_cur_genres, dat2)) 
return(result) 
} 

    # Results   

fun1 <- function("adventure", "comedy", "action", "drama", 
"animation", "fantasy", "mystery", "family", "sci-fi", "thriller", 
"romance", "horror", "musical","history", "war", "documentary", 
"biography")) { 
    Error: unexpected string constant in "fun1 <- function("adventure"" 
    > vector_of_cur_genres <- separate(i, sep = ", ") 
    Error: Please supply column name 
    > result <- table(factor(vector_of_cur_genres, dat2)) 
    Error in factor(vector_of_cur_genres, dat2) : 
    object 'vector_of_cur_genres' not found 
    > return(result) 
    Error: no function to return from, jumping to top level 
    > } 
    Error: unexpected '}' in "}" 

    mat <- mapply(fun1,dat2$Genres) 
     Error in match.fun(FUN) : object 'fun1' not found                                                   

FYI, there is no need to create an empty new column before assigning to it: the assignment creates the column as well.

Welcome to Stack Overflow! [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269) – zx8754

Possibly, convert the data from wide to long, then summarise with table. – zx8754
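
The wide-to-long suggestion could be sketched roughly as follows. This is only one reading of the comment: it uses base R's `strsplit()` on the unsplit `Genres` column, and `toy`, `parts`, `long` and `indicator` are illustrative names rather than anything from the question.

```r
# Toy data in the shape of the original (unsplit) dat
toy <- data.frame(
  id = c(37, 95, 100),
  Genres = c("animation, adventure, family",
             "action, fantasy",
             "comedy, drama, fantasy"),
  stringsAsFactors = FALSE
)

# Long format: one row per (id, genre) pair
parts <- strsplit(toy$Genres, ",\\s*")
long <- data.frame(
  id = rep(toy$id, lengths(parts)),
  genre = unlist(parts),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are films, columns are genres, cells are counts
indicator <- table(long$id, long$genre)

# Keep only the genres of interest
indicator[, c("action", "adventure", "comedy")]
```

Splitting on `",\\s*"` drops any whitespace after the comma, so the genre labels come out without leading spaces.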

Answer

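You can use a mix of table and factor to get what you want. First, make sure that every genre is spelled exactly the same way each time ("Adventure" != "adventure"). Then create a vector containing all possible genres: c("Adventure", "Comedy", "Drama", ...)

Then, for each row, you call table(factor(genres, list_of_possible_genres)), which returns a table of counts. You can then build a matrix like this: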

mat <- mapply(
    function(i) { 
     table(factor(separate(i, ...),list_of_possible_genres)) 
    },df$Genres) 
#you want to use the original Data.Frame after import 

new.df <- cbind(df,mat) #they should both have the same number of rows here 

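Make the ... in the separate call the same as in your original function call. If you have questions about any individual function or step, I can explain in the comments.

I define a function inline here by writing function(i) ... inside the mapply call, which is similar to defining a lambda in Python. The function takes a string of genres and returns a named vector giving the number of times each possible genre occurs.

Edit: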
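
To illustrate what such a function returns for a single row, here is a small sketch. The list of genres is shortened to four entries, the genre string is taken from the sample data, and `strsplit()` is used for the splitting, which is an assumption and not necessarily the call the answer has in mind.

```r
list_of_possible_genres <- c("action", "adventure", "comedy", "fantasy")

# One film's genre string, split into a character vector
genres <- strsplit("action, fantasy", ", ", fixed = TRUE)[[1]]

# Fixing the factor levels forces one count per possible genre,
# including a 0 for genres this film does not have
counts <- table(factor(genres, levels = list_of_possible_genres))
counts
```

Because the levels are fixed up front, every row yields a vector of the same length and order, which is what allows the per-row results to be stacked into a matrix.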

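fun1 <- function(string_of_genres) {
    # tidyr::separate() works on data frame columns, so base strsplit() is used here instead
    vector_of_cur_genres <- strsplit(string_of_genres, ", ")[[1]]
    result <- table(factor(vector_of_cur_genres, list_of_possible_genres))
    return(result)
}
# as.character() guards against Genres having been imported as a factor
mat <- mapply(fun1, as.character(df$Genres))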

@Adam: I am a beginner in R. Do you mean that this step should work on the original imported data frame? Could you explain the matrix function and cbind?

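'cbind' is the simplest. What it does is take a bunch of matrices or data frames and attach their columns to one another. So in the call 'cbind(df, mat)', the data.frame gets the columns of the matrix tacked onto the end. 'mapply' is a vectorised function, meaning it takes a vector, matrix or list, applies the given function to each element, and returns the result of each function call. 'mapply' is part of the '*apply' family of functions, and there is a good deal of literature explaining the nuances. – Adam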
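
A small sketch of how the two calls fit together, on made-up data. One detail worth checking on real data: when each call returns a vector of counts, `mapply` simplifies the results into a matrix with one column per input, so the sketch transposes it before `cbind`. The names `count_genres` and the two-row `df` are illustrative only.

```r
df <- data.frame(
  id = c(95, 100),
  Genres = c("action, fantasy", "comedy, drama, fantasy"),
  stringsAsFactors = FALSE
)
list_of_possible_genres <- c("action", "comedy", "fantasy")

# Count each possible genre in one comma-separated string
count_genres <- function(string_of_genres) {
  pieces <- strsplit(string_of_genres, ", ", fixed = TRUE)[[1]]
  table(factor(pieces, levels = list_of_possible_genres))
}

# One column per film, one row per genre
mat <- mapply(count_genres, df$Genres, USE.NAMES = FALSE)
dim(mat)

# Transpose so that each film is a row, then attach the columns to df
new.df <- cbind(df, t(mat))
new.df
```

Genres that are not in `list_of_possible_genres` (here "drama") become `NA` in the factor and are left out of the counts.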

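Check my edit. You do want to call it on the data.frame from the first step, before the data is split. If you look, the splitting of the data now happens elsewhere in the code. – Adam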