2017-04-17
2

R: searching for information within a column

I have two tables. One of them, table A, looks like this:

students|Test Score 
A  | 100 
B  | 81 
C  | 92 
D  | 88 

The other table, B, looks like this:

Class | Students 
1  | {A,D} 
2  | {B,C} 

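In R, I would like to look up the students listed in the array in table B's Students column, take their scores from table A, and tabulate the result in the following format: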

Class | Students | Mean Score 
    1  | {A,D} | 94 
    2  | {B,C} | 86.5 

Is there a function or some operation in R that I can use to do this lookup and then merge the results?

+0

What class is the "Students" column in the second table? A vector or a list? – www

+0

It is actually a factor, because the source file formats that column as "{A,D}" and so on. – user7729135

Answers

0

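There may be more creative ways to do this, but here is a solution using the dplyr package in R. It assumes tables A and B as defined in the step-by-step section, with the braces already stripped from B$Students.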

library(dplyr) 
lapply(B$Class, function(x) { 
    mask <- B$Class == x 
    data.frame(Class = x, 
      Students = unlist(strsplit(B$Students[mask], ',')), 
      stringsAsFactors = F) 
}) %>% 
    bind_rows() %>% 
    full_join(A, by = 'Students') %>% 
    group_by(Class) %>% 
    summarize(`Mean Score` = mean(Test.Score)) %>% 
    full_join(B, by = 'Class') 

Step by step

The dplyr package helps with the data-manipulation steps. Here is a reproducible example.

library(dplyr) 

A <- read.csv(text = 'Students,Test Score 
A, 100 
B, 81 
C, 92 
D, 88', stringsAsFactors = F) 

B <- read.csv(text = 'Class, Students 
1,"{A,D}" 
2,"{B,C}"', stringsAsFactors = F) %>% 
    mutate(Students = gsub('\\{|\\}', '', Students)) 

str(A) 
# 'data.frame': 4 obs. of 2 variables: 
# $ Students : chr "A" "B" "C" "D" 
# $ Test.Score: int 100 81 92 88 

str(B) 
# 'data.frame': 2 obs. of 2 variables: 
# $ Class : int 1 2 
# $ Students: chr "A,D" "B,C" 

Do some string manipulation to convert your table B to "long" format, with one row per class and student.

C <- lapply(B$Class, function(x) { 
    mask <- B$Class == x 
    data.frame(Class = x, 
      Students = unlist(strsplit(B$Students[mask], ',')), 
      stringsAsFactors = F) 
}) %>% 
    bind_rows() 

str(C) 
# 'data.frame': 4 obs. of 2 variables: 
# $ Class : int 1 1 2 2 
# $ Students: chr "A" "D" "B" "C" 

Add the students' scores to our "long" table.

D <- full_join(A, C, by = 'Students') 

str(D) 
# 'data.frame': 4 obs. of 3 variables: 
# $ Students : chr "A" "B" "C" "D" 
# $ Test.Score: int 100 81 92 88 
# $ Class  : int 1 2 2 1 

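Group the students by class and compute the mean score for each class. Then join back to B to add the column listing which students are in each class.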

E <- D %>% 
    group_by(Class) %>% 
    summarize(`Mean Score` = mean(Test.Score)) %>% 
    full_join(B, by = 'Class') 

str(E) 
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables: 
# $ Class  : int 1 2 
# $ Mean Score: num 94 86.5 
# $ Students : chr "A,D" "B,C" 
4

A simple approach using base R:

df2$mean_score <- sapply(df2$Students, function(x, df) { 
         students_vec <- unlist(strsplit(gsub("[{}]","", x), split=",")) 
         mean(df[which(df$students %in% students_vec), "Test Score"]) 
     }, df = df1) 

df2 
# Class Students mean_score 
#1  1 {A,D}  94.0 
#2  2 {B,C}  86.5 

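We take each entry of the Students column in df2 and build a vector of the students we want. Then we simply subset df1 to those students and take the mean. Note that this assumes your df2$Students data comes in as character strings.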

Data:

df1 <- structure(list(students = c("A", "B", "C", "D"), `Test Score` = c(100L, 
81L, 92L, 88L)), .Names = c("students", "Test Score"), row.names = c(NA, 
-4L), class = "data.frame") 

df2 <- structure(list(Class = 1:2, Students = c("{A,D}", "{B,C}")), .Names = c("Class", 
"Students"), row.names = c(NA, -2L), class = "data.frame") 
+0

Thank you very much. This actually works :) :) – user7729135
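Since the asker mentioned that the Students column is read in as a factor, while the base R answer assumes character strings, one cautious option is to convert the column explicitly before splitting it. The following is only a sketch under that assumption: the data frames are rebuilt here for illustration, and the use of vapply in place of sapply is my own choice, not part of the original answer.

```r
# Illustrative copies of the answer's data; Students is deliberately a factor
# here to mimic what the asker described (an assumption for this sketch).
df1 <- data.frame(students = c("A", "B", "C", "D"),
                  `Test Score` = c(100L, 81L, 92L, 88L),
                  check.names = FALSE, stringsAsFactors = FALSE)
df2 <- data.frame(Class = 1:2,
                  Students = factor(c("{A,D}", "{B,C}")))

# Convert the factor to character before any string handling.
df2$Students <- as.character(df2$Students)

# For each class: strip the braces, split on commas, average the matching scores.
df2$mean_score <- vapply(df2$Students, function(x) {
  students_vec <- strsplit(gsub("[{}]", "", x), split = ",", fixed = TRUE)[[1]]
  mean(df1[df1$students %in% students_vec, "Test Score"])
}, numeric(1), USE.NAMES = FALSE)

print(df2)
```

Reading the file with stringsAsFactors = FALSE in the first place, as the other answers do, would avoid the conversion step altogether.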

4

A solution similar to @MikeH's:

B$MeanScore <- sapply(strsplit(gsub("[{}]","", B$Students), split=","), 
     function(x) mean(A$Test.Score[A$Students %in% x])) 

Which gives:

# Class Students MeanScore 
#1  1 {A,D}  94.0 
#2  2 {B,C}  86.5 
+1

Nice, much more condensed! Since 'strsplit' returns a list, the 'sapply' over the data in my answer is redundant... –

+1

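This is a nice approach. If you change 'df1' to have only one column, with 'row.names' equal to 'c("A", "B", "C", "D")', you can use subsetting like this: with 'df1 <- structure(list(Test.Score = c(100L, 81L, 92L, 88L)), .Names = "Test.Score", row.names = c("A", "B", "C", "D"), class = "data.frame")' we have 'sapply(strsplit(gsub("[{}]", "", df2$Students), ","), function(x) mean(df1[x, ]))'. This avoids using '%in%'. –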
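The row-name idea in the comment above can be sketched as follows. The object names scores and groups are made up for this illustration, and the score column is selected by name rather than with the comment's 'df1[x, ]', which is a small variation of mine.

```r
# Score table keyed by student name through its row names (hypothetical name: scores).
scores <- data.frame(Test.Score = c(100L, 81L, 92L, 88L),
                     row.names = c("A", "B", "C", "D"))

df2 <- data.frame(Class = 1:2,
                  Students = c("{A,D}", "{B,C}"),
                  stringsAsFactors = FALSE)

# Strip the braces and split each entry into a character vector of names.
groups <- strsplit(gsub("[{}]", "", df2$Students), ",", fixed = TRUE)

# Index the score table directly by row name; no %in% is needed.
df2$MeanScore <- sapply(groups, function(x) mean(scores[x, "Test.Score"]))

print(df2)
```

One difference from the '%in%' version worth keeping in mind: a student name that is absent from the score table produces an NA when indexed by row name, so the class mean becomes NA instead of being computed over the students that were found.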

+0

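Hi, just a follow-up on the function above: is there another way to do this where I restrict the mean of the test scores to the students in x and add a second condition, for example that the year of the test matches the year of the class? – user7729135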
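For the follow-up about a second condition, one possible sketch is to iterate over the student groups and the class years in parallel with mapply. Everything about the Year columns here is an assumption made for illustration: the original tables have no such column, and the names scores, classes, and groups are hypothetical.

```r
# Hypothetical data: both tables carry a Year column (not in the original question).
scores <- data.frame(Students = c("A", "A", "B", "C", "D"),
                     Year = c(2016L, 2017L, 2017L, 2017L, 2017L),
                     Test.Score = c(70L, 100L, 81L, 92L, 88L),
                     stringsAsFactors = FALSE)
classes <- data.frame(Class = 1:2,
                      Year = c(2017L, 2017L),
                      Students = c("{A,D}", "{B,C}"),
                      stringsAsFactors = FALSE)

# One character vector of student names per class.
groups <- strsplit(gsub("[{}]", "", classes$Students), ",", fixed = TRUE)

# Walk over (members, year) pairs; keep only scores that match both conditions.
classes$MeanScore <- mapply(function(members, yr) {
  keep <- scores$Students %in% members & scores$Year == yr
  mean(scores$Test.Score[keep])
}, groups, classes$Year)

print(classes)
```

In this made-up data, student A's 2016 score is excluded from class 1 because the class year is 2017. A join on both Students and Year after reshaping B to long format would be an alternative for larger tables.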

2

A dplyr/tidyr solution that uses unnest to split the students apart and paste with the collapse option to reassemble them. Data from @Ben Fasoli.

# Load the packages first: %>% and mutate are needed to build B
library(dplyr) 
library(tidyr) 

A <- read.csv(text = 'Students,Test Score 
A, 100 
B, 81 
C, 92 
D, 88', stringsAsFactors = FALSE) 

B <- read.csv(text = 'Class, Students 
1,"{A,D}" 
2,"{B,C}"', stringsAsFactors = FALSE) %>% 
    mutate(Students = gsub('\\{|\\}', '', Students)) 

B %>% 
    unnest(Students = strsplit(Students, ",")) %>% 
    inner_join(A, by = "Students") %>% 
    group_by(Class) %>% 
    summarize(Students = paste0("{", paste(Students, collapse=","), "}"), mean_score = mean(Test.Score)) 

    #  Class Students mean_score 
    #  <int> <chr>  <dbl> 
    # 1  1 {A,D}  94.0 
    # 2  2 {B,C}  86.5 
0

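Another solution using dplyr and tidyr. The separate_rows function splits a delimited string into separate rows. data_frame is a function similar to data.frame, but it does not coerce character columns to factors (newer versions of tibble deprecate it in favor of tibble()).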

# Load packages 
library(dplyr) 
library(tidyr) 

# Create example data frames 
df1 <- data_frame(Students = c("A", "B", "C", "D"), 
        `Test Score` = c(100, 81, 92, 88)) 

df2 <- data_frame(Class = c(1, 2), 
        Students = c("{A,D}", "{B,C}")) 

# Create the output 
df3 <- df2 %>% 
    mutate(Students = gsub("\\{|\\}", "", Students)) %>% 
    separate_rows(Students) %>% 
    left_join(df1, by = "Students") %>% 
    group_by(Class) %>% 
    summarise(`Mean Score` = mean(`Test Score`)) %>% 
    right_join(df2, by = "Class") %>% 
    select(Class, Students, `Mean Score`) 

df3 
# A tibble: 2 × 3 
#   Class Students `Mean Score` 
#   <dbl> <chr>  <dbl> 
# 1  1 {A,D}   94.0 
# 2  2 {B,C}   86.5