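Creating a variable from multiple variables

I am trying to clean a dataset and create 3 variables named Adventure, Action and Comedy. The original dataset has 3000 observations (imported as: dat). I am showing only a few observations: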

id Runtime  Genres          
37  75  animation, adventure, family, fantasy, musical 
1  162  action, adventure, fantasy, sci_fi  
95  126  action, fantasy 
100  101  comedy, drama, fantasy 
82  136  action, adventure, sci-fi  
99  117  animation, adventure, comedy, family, sport 
91  95  animation, comedy, crime, family

After importing the dataset into R, I split the genres into 5 columns using the R code below:

dat1 <- dat %>% separate (Genres, c("Genres1","Genres2" ,"Genres3" ,"Genres4" ,"Genres5"), sep=",", extra = "drop", fill = "right") 


id Runtime Genres1 Genres2 Genres3 Genres4 Genres5          
37  75  animation adventure family fantasy musical 
1  162  action  adventure fantasy sci_fi  
95  126  action  fantasy 
100  101  comedy  drama  fantasy 
82  136  action  adventure sci-fi  
99  117  animation adventure comedy family sport 
91  95  animation comedy  crime family

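How can I collapse all the genre columns into one indicator column each for action, adventure and comedy?

I tried the code below.

First I created an empty column for adventure: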

dat1 ["adventure"] <- NA 

dat1$adventure <- ifelse(dat1$Genres1=="adventure",1,(ifelse(dat1$Genres2=="adventure",1,0)))

After a suggestion, I shortened the code:

dat1$adventure <- ifelse((dat1$Genres1=="adventure" | dat1$Genres2=="adventure" | dat1$Genres3=="adventure" | dat1$Genres4=="adventure"),1, 0) 


id Runtime Genres1 Genres2 Genres3 Genres4 Genres5 Adventure          
37  75  animation adventure family fantasy musical 0 
1  162  action  adventure fantasy sci_fi   0 
95  126  action  fantasy        0 
100  101  comedy  drama  fantasy     0 
82  136  action  adventure sci-fi      0 
99  117  animation adventure comedy family sport 0 
91  95  animation comedy  crime family   0

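The code picks up adventure in Genres1 but returns zero for Genres2.

I have revised the question. I tried some of the suggestions but I am not sure how to proceed, because there are 3000 observations.

Following the suggestion, I formed a vector of the genres and assigned it to dat2: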

dat2 <- c("adventure", "comedy", "action", "drama", "animation", "fantasy", "mystery", "family", "sci-fi", "thriller", "romance", "horror", "musical","history", "war", "documentary", "biography")

table(factor(dat2))

action adventure animation biography  comedy documentary   drama 
     1   1   1   1   1   1   1 
family  fantasy  history  horror  musical  mystery  romance 
     1   1   1   1   1   1   1 
sci-fi thriller   war 
     1   1   1

After that, I created the function:

fun1 <- function("adventure", "comedy", "action", "drama", "animation", 
"fantasy", "mystery", "family", "sci-fi", "thriller", "romance", "horror", 
"musical","history", "war", "documentary", "biography")) { 
vector_of_cur_genres <- seperate(i, sep = ", ") 
result <- table(factor(vector_of_cur_genres, dat2)) 
return(result) 
} 

    # Results   

fun1 <- function("adventure", "comedy", "action", "drama", 
"animation", "fantasy", "mystery", "family", "sci-fi", "thriller", 
"romance", "horror", "musical","history", "war", "documentary", 
"biography")) { 
    Error: unexpected string constant in "fun1 <- function("adventure"" 
    > vector_of_cur_genres <- separate(i, sep = ", ") 
    Error: Please supply column name 
    > result <- table(factor(vector_of_cur_genres, dat2)) 
    Error in factor(vector_of_cur_genres, dat2) : 
    object 'vector_of_cur_genres' not found 
    > return(result) 
    Error: no function to return from, jumping to top level 
    > } 
    Error: unexpected '}' in "}" 

    mat <- mapply(fun1,dat2$Genres) 
     Error in match.fun(FUN) : object 'fun1' not found

Source

2016-07-26 Suchit Kumbhare

FYI, there is no need to create an empty new column before assigning to it: the assignment creates the column anyway. –
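As a small illustration of that comment, here is a sketch on a hypothetical three-row stand-in for dat1 (the rows and the `action` column name are assumptions chosen for the example). It shows that assigning to a column that does not exist yet is enough to create it.

```r
# Hypothetical three-row stand-in for dat1 (not the real 3000-row data)
dat1 <- data.frame(id = c(37, 1, 95),
                   Genres1 = c("animation", "action", "action"),
                   stringsAsFactors = FALSE)

# No prior `dat1["action"] <- NA` is needed: assigning to a column
# that does not exist yet creates it.
dat1$action <- ifelse(dat1$Genres1 == "action", 1, 0)
print(dat1)
```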

Welcome to Stack Overflow! [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269) – zx8754

Possibly, convert the data from wide to long, then summarise it with table. – zx8754
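One way the wide-to-long suggestion could look, as a base-R sketch rather than the commenter's exact method. It assumes the genres are still in a single comma-separated `Genres` column, uses four sample rows from the question in place of the full data, and introduces the names `genre_list`, `long` and `counts` purely for illustration.

```r
# Four sample rows from the question; `dat` stands in for the imported data
dat <- data.frame(
  id = c(37, 1, 95, 100),
  Genres = c("animation, adventure, family, fantasy, musical",
             "action, adventure, fantasy, sci_fi",
             "action, fantasy",
             "comedy, drama, fantasy"),
  stringsAsFactors = FALSE
)

# Split each genre string; trimws() drops stray spaces around the names
genre_list <- lapply(strsplit(dat$Genres, ","), trimws)

# Long format: one row per (id, genre) pair
long <- data.frame(id = rep(dat$id, lengths(genre_list)),
                   genre = unlist(genre_list),
                   stringsAsFactors = FALSE)

# Cross-tabulate: one row per id, one column per genre
counts <- table(long$id, long$genre)
print(counts[, c("action", "adventure", "comedy")])
```

The rows of `counts` are labelled by id, so the three columns of interest can be merged back onto the original data by id if needed.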

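You can use a mix of table and factor to get what you want. First, make sure every genre is spelled exactly the same each time ("Adventure" != "adventure"). Then create a vector containing all possible genres, c("Adventure", "Comedy", "Drama", ...).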
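A minimal sketch of that spelling check, using a made-up genre string. Stray whitespace counts as a spelling difference too: splitting on "," alone leaves a leading space on every genre after the first, which may be why the comparison against Genres2 in the question returns zero.

```r
# A made-up genre string with mixed case
genres <- "Animation, adventure, Family"

# Splitting on "," alone leaves a leading space on later entries
raw <- strsplit(genres, ",")[[1]]
print(raw == "adventure")    # " adventure" is not equal to "adventure"

# Trim whitespace and lower-case so spellings become comparable
clean <- tolower(trimws(raw))
print(clean == "adventure")
```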

For each row, you then call table(factor(genres, list_of_possible_genres)), which returns a table of counts. You can then build a matrix like this:

mat <- t(mapply(
    function(i) {
        # count how often each possible genre occurs in one row's genre string
        table(factor(separate(i, ...), list_of_possible_genres))
    },
    df$Genres))  # use the original data.frame, as it was right after import

# t() puts one observation per row, so both objects have the same number of rows
new.df <- cbind(df, mat)

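Make the ... in the separate call the same as in your original function. If you have questions about any single function or step, I can explain in the comments.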

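I defined a function inline here: the function (i) ... is written directly inside the call that applies it, which is similar to defining a lambda in Python. The function takes a string of genres and returns a named vector with the number of occurrences of each possible genre.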
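To make the return value concrete, here is a sketch of the same idea applied to a single string. The function is given a name (`count_genres`, an assumption for this example) so it can be called directly, the genre list is shortened to four entries, and base R's strsplit() stands in for the splitting step.

```r
# Assumed, shortened list of possible genres
list_of_possible_genres <- c("action", "adventure", "comedy", "fantasy")

# Genre string in, named counts out
count_genres <- function(i) {
  table(factor(strsplit(i, ", ")[[1]], levels = list_of_possible_genres))
}

res <- count_genres("action, fantasy")
print(res)
```

Genres that are listed as possible but missing from the string still appear in the result with a count of 0, which is what makes the rows line up when they are stacked into a matrix.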

Edit:

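fun1 <- function(string_of_genres) {
    # split one comma-separated string into a character vector of genres
    vector_of_cur_genres <- strsplit(string_of_genres, ", ")[[1]]
    # count each possible genre; genres absent from this row count as 0
    result <- table(factor(vector_of_cur_genres, list_of_possible_genres))
    return(result)
}
# t() gives one row per observation and one column per genre
mat <- t(mapply(fun1, as.character(df$Genres), USE.NAMES = FALSE))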
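Putting the steps together on five sample rows from the question, as a hedged end-to-end sketch rather than a definitive implementation. The names `count_genres` and `new.df`, the use of sapply(), and the restriction to three genres are assumptions made for this example; the same code should scale to all 3000 rows, since the function is simply applied once per row.

```r
# Five sample rows from the question, genres still in one column
df <- data.frame(
  id = c(37, 1, 95, 100, 82),
  Runtime = c(75, 162, 126, 101, 136),
  Genres = c("animation, adventure, family, fantasy, musical",
             "action, adventure, fantasy, sci_fi",
             "action, fantasy",
             "comedy, drama, fantasy",
             "action, adventure, sci-fi"),
  stringsAsFactors = FALSE
)

# Only the three genres the question asks about
list_of_possible_genres <- c("adventure", "action", "comedy")

# Count the genres of interest in one string; other genres are ignored
count_genres <- function(string_of_genres) {
  parts <- trimws(strsplit(string_of_genres, ",")[[1]])
  table(factor(parts, levels = list_of_possible_genres))
}

# sapply gives a genres-by-rows matrix; t() flips it to rows-by-genres
mat <- t(sapply(df$Genres, count_genres, USE.NAMES = FALSE))
colnames(mat) <- list_of_possible_genres  # make the column names explicit

# Attach the indicator columns to the original data
new.df <- cbind(df, mat)
print(new.df[, c("id", "adventure", "action", "comedy")])
```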

Source

2016-07-26 14:56:18 Adam

@Adam: I am a beginner in R. Do you mean that this step should work on the original imported data frame? Can you explain the matrix function and cbind? –

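`cbind` is the simplest. It takes a bunch of matrices or data frames and attaches their columns to each other. So in the call `cbind(df, mat)`, the data.frame gets the columns of the matrix tacked onto its end. `mapply` is a vectorized function, meaning it takes a vector, matrix or list, applies the given function to each element, and returns the result of each function call. `mapply` is part of the `*apply` family of functions, and there is plenty of literature explaining the nuances. – Adam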
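A toy sketch of both functions, with made-up values that are not from the question's data. The `df` and `mat` here are three-row placeholders, and `combined` and `squares` are names introduced only for the example.

```r
# Placeholder data frame and matrix with the same number of rows
df <- data.frame(id = 1:3)
mat <- matrix(c(1, 0, 1,     # values fill the matrix column by column,
                0, 1, 1),    # so this line ends up as the "comedy" column
              nrow = 3,
              dimnames = list(NULL, c("action", "comedy")))

# cbind tacks the matrix columns onto the end of the data frame
combined <- cbind(df, mat)
print(names(combined))

# mapply calls the function once per element and collects the results
squares <- mapply(function(x) x^2, 1:3)
print(squares)
```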

Check my edit. You do want to call it on the data.frame from the first step, before the data is split. If you look, you split the data somewhere else in your code. – Adam
