R: searching for information inside a column

I have two tables. The first one, table A, has the following format:

students|Test Score 
A  | 100 
B  | 81 
C  | 92 
D  | 88

The other table, B, looks like this:

Class | Students 
1  | {A,D} 
2  | {B,C}

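In R, I would like to look up in table A the students listed in the array under the Students column of table B, and tabulate their scores in the following format: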

Class | Students | Mean Score 
    1  | {A,D} | 94 
    2  | {B,C} | 86.5

Is there a formula or some operation in R that I can use to do this lookup and then merge the results?

Source

2017-04-17 user7729135

What class is the "Students" column in the second table? A vector or a list? – www

It is actually a factor, because the source file formats that column as "{A,D}" and so on. – user7729135
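Since the column turns out to be a factor, it may help to see what that implies before any splitting can happen. The following is only a minimal sketch: the data frame is a hypothetical stand-in for table B, and the names students_chr and students_list are made up for illustration.

```r
# Hypothetical stand-in for table B with Students read in as a factor
B <- data.frame(Class = 1:2,
                Students = factor(c("{A,D}", "{B,C}")))

# strsplit() only accepts character vectors, so convert the factor first
students_chr <- as.character(B$Students)

# Drop the braces, then split on the comma: one character vector per class
students_list <- strsplit(gsub("[{}]", "", students_chr), ",", fixed = TRUE)
students_list
```

Each element of students_list can then be matched against the Students column of table A.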

There may be more creative ways to do this, but here is a solution in R using the dplyr package.

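library(dplyr)

# A and B are the data frames built in the step-by-step example below
# (the braces have already been stripped from B$Students).
lapply(B$Class, function(x) {
    mask <- B$Class == x
    # one row per student in class x
    data.frame(Class = x,
      Students = unlist(strsplit(B$Students[mask], ',')),
      stringsAsFactors = FALSE)
}) %>%
    bind_rows() %>%
    full_join(A, by = 'Students') %>%
    group_by(Class) %>%
    summarize(`Mean Score` = mean(Test.Score)) %>%
    full_join(B, by = 'Class')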

Step by step

The dplyr package helps with the data manipulation steps. Here is a reproducible example.

library(dplyr) 

A <- read.csv(text = 'Students,Test Score
A, 100
B, 81
C, 92
D, 88', stringsAsFactors = FALSE)

# Read table B and strip the curly braces from the Students column
B <- read.csv(text = 'Class, Students
1,"{A,D}"
2,"{B,C}"', stringsAsFactors = FALSE) %>%
    mutate(Students = gsub('\\{|\\}', '', Students))

str(A) 
# 'data.frame': 4 obs. of 2 variables: 
# $ Students : chr "A" "B" "C" "D" 
# $ Test.Score: int 100 81 92 88 

str(B) 
# 'data.frame': 2 obs. of 2 variables: 
# $ Class : int 1 2 
# $ Students: chr "A,D" "B,C"

Do some string manipulation to convert your table B into "long" format.

C <- lapply(B$Class, function(x) {
    mask <- B$Class == x
    # one row per student in class x
    data.frame(Class = x,
      Students = unlist(strsplit(B$Students[mask], ',')),
      stringsAsFactors = FALSE)
}) %>%
    bind_rows()

str(C) 
# 'data.frame': 4 obs. of 2 variables: 
# $ Class : int 1 1 2 2 
# $ Students: chr "A" "D" "B" "C"

Add the students' scores to our "long" table.

D <- full_join(A, C, by = 'Students') 

str(D) 
# 'data.frame': 4 obs. of 3 variables: 
# $ Students : chr "A" "B" "C" "D" 
# $ Test.Score: int 100 81 92 88 
# $ Class  : int 1 2 2 1

Group the students by class and compute the mean score for each class. Then add a column listing which students are in each class.

E <- D %>% 
    group_by(Class) %>% 
    summarize(`Mean Score` = mean(Test.Score)) %>% 
    full_join(B, by = 'Class') 

str(E) 
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables: 
# $ Class  : int 1 2 
# $ Mean Score: num 94 86.5 
# $ Students : chr "A,D" "B,C"

Source

2017-04-17 21:43:57

A simple approach using base R:

df2$mean_score <- sapply(df2$Students, function(x, df) { 
         students_vec <- unlist(strsplit(gsub("[{}]","", x), split=",")) 
         mean(df[which(df$students %in% students_vec), "Test Score"]) 
     }, df = df1) 

df2 
# Class Students mean_score 
#1  1 {A,D}  94.0 
#2  2 {B,C}  86.5

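We take the Students column in df2 and build a vector of the students we want. Then we simply subset df1 to those students and take the mean. Note that this assumes your df2$Students data comes in as character strings.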
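To make the description above concrete, here is a sketch of what happens inside the function for a single row of df2. It rebuilds df1 locally so that it runs on its own; the names x, keep and class_mean are only illustrative.

```r
# Local copy of df1 so the sketch is self-contained
df1 <- data.frame(students = c("A", "B", "C", "D"),
                  `Test Score` = c(100L, 81L, 92L, 88L),
                  check.names = FALSE, stringsAsFactors = FALSE)

x <- "{A,D}"  # the Students entry of one row of df2

# Strip the braces and split on the comma to get the student names
students_vec <- unlist(strsplit(gsub("[{}]", "", x), split = ","))

# Logical mask of the rows of df1 that belong to this class
keep <- df1$students %in% students_vec

# Mean of the scores for just those rows
class_mean <- mean(df1[keep, "Test Score"])
class_mean
```

If none of the listed students appear in df1, the subset is empty and mean() returns NaN, so it may be worth checking for that case on real data.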

Data:

df1 <- structure(list(students = c("A", "B", "C", "D"), `Test Score` = c(100L, 
81L, 92L, 88L)), .Names = c("students", "Test Score"), row.names = c(NA, 
-4L), class = "data.frame") 

df2 <- structure(list(Class = 1:2, Students = c("{A,D}", "{B,C}")), .Names = c("Class", 
"Students"), row.names = c(NA, -2L), class = "data.frame")

Source

2017-04-17 21:49:22

Thank you very much. This actually works :) :) – user7729135

A similar solution to @MikeH's:

B$MeanScore <- sapply(strsplit(gsub("[{}]","", B$Students), split=","), 
     function(x) mean(A$Test.Score[A$Students %in% x]))

which gives:

# Class Students MeanScore 
#1  1 {A,D}  94.0 
#2  2 {B,C}  86.5

Source

2017-04-17 22:11:45

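Nice, more condensed! Since 'strsplit' returns a list, 'sapply'ing over the data as in my answer is redundant... –

This is a nice approach. If you change 'df1' to have only one column, with 'row.names' equal to 'c("A", "B", "C", "D")', you can use subsetting like this: 'df1 <- structure(list(Test.Score = c(100L, 81L, 92L, 88L)), .Names = "Test.Score", row.names = c("A", "B", "C", "D"), class = "data.frame")'. Then we have 'sapply(strsplit(gsub("[{}]", "", df2$Students), ","), function(x) mean(df1[x, ]))'. This avoids using '%in%'. –

Hi, just a follow-up on the function above: is there another way in which I restrict the mean of the students' test scores by adding a second condition, for example that the year of the test matches the year of the class? – user7729135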
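One possible way to handle the follow-up question, sketched under assumptions: the question's data has no year information, so the Year columns below are hypothetical, as are the extra 2016 row and the name members. The idea is to pair each class's student list with that class's year using mapply and add the year to the subsetting condition.

```r
# Hypothetical data: both tables carry a Year column (not in the original question)
A <- data.frame(Students   = c("A", "B", "C", "D", "A"),
                Test.Score = c(100, 81, 92, 88, 70),
                Year       = c(2017, 2017, 2017, 2017, 2016),
                stringsAsFactors = FALSE)

B <- data.frame(Class    = 1:2,
                Students = c("{A,D}", "{B,C}"),
                Year     = c(2017, 2017),
                stringsAsFactors = FALSE)

# One character vector of student names per class
members <- strsplit(gsub("[{}]", "", B$Students), ",", fixed = TRUE)

# Average only the scores whose student is in the class AND whose year matches
B$MeanScore <- mapply(function(x, yr) {
    mean(A$Test.Score[A$Students %in% x & A$Year == yr])
}, members, B$Year)

B
```

Here the 2016 score for student A is left out of class 1's mean because its year does not match the class year.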

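A dplyr and tidyr solution that uses unnest to split the students apart and paste with the collapse option to put them back together. Test data from @Ben FASOLI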

# Load the packages first: %>% and mutate() are needed to build B
library(dplyr)
library(tidyr)

A <- read.csv(text = 'Students,Test Score
A, 100
B, 81
C, 92
D, 88', stringsAsFactors = FALSE)

B <- read.csv(text = 'Class, Students
1,"{A,D}"
2,"{B,C}"', stringsAsFactors = FALSE) %>%
    mutate(Students = gsub('\\{|\\}', '', Students))

B %>%
    # split "A,D" into a list column, then give each student its own row
    mutate(Students = strsplit(Students, ",")) %>%
    unnest(Students) %>%
    inner_join(A, by = "Students") %>%
    group_by(Class) %>%
    summarize(Students = paste0("{", paste(Students, collapse = ","), "}"),
              mean_score = mean(Test.Score))

    #  Class Students mean_score 
    #  <int> <chr>  <dbl> 
    # 1  1 {A,D}  94.0 
    # 2  2 {B,C}  86.5

Source

2017-04-17 23:09:25 epi99

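Another solution using dplyr and tidyr. The separate_rows function splits the delimited characters into separate rows. data_frame (superseded by tibble() in current releases) is a function similar to data.frame, but it does not coerce character columns to factors.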

# Load packages 
library(dplyr) 
library(tidyr) 

# Create example data frames 
df1 <- data_frame(Students = c("A", "B", "C", "D"), 
        `Test Score` = c(100, 81, 92, 88)) 

df2 <- data_frame(Class = c(1, 2), 
        Students = c("{A,D}", "{B,C}")) 

# Create the output 
df3 <- df2 %>% 
    mutate(Students = gsub("\\{|\\}", "", Students)) %>% 
    separate_rows(Students) %>% 
    left_join(df1, by = "Students") %>% 
    group_by(Class) %>% 
    summarise(`Mean Score` = mean(`Test Score`)) %>% 
    right_join(df2, by = "Class") %>% 
    select(Class, Students, `Mean Score`) 

df3 
# A tibble: 2 × 3
#   Class Students `Mean Score`
#   <dbl> <chr>           <dbl>
# 1     1 {A,D}            94.0
# 2     2 {B,C}            86.5

Source

2017-04-17 23:21:56 www
