2017-08-15 64 views
1

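This question is a follow-up to an earlier one (Filter values from list in R). I have a long list similar to the one below. One of the names, "issues.fields.customfield_10400", occurs fewer times than all the others. Checking whether a value exists for this name is one of the tasks I am trying to handle; NULL values are perfectly fine. How do I "unstack" a list in R when some records lack entries that other records have?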

DF = structure(list(name = structure(c(7L, 3L, 1L, 6L, 4L, 2L, 5L, 
             7L, 3L, 1L, 6L, 4L, 2L, 5L, 7L, 3L, 1L, 6L, 4L, 5L, 7L, 3L, 1L, 
             6L, 4L, 5L), .Label = c("issues.fields.created", "issues.fields.customfield_10400", 
                   "issues.fields.issuetype.name", "issues.fields.status.name", 
                   "issues.fields.summary", "issues.fields.updated", "issues.key" 
            ), class = "factor"), value = structure(c(18L, 13L, 4L, 4L, 11L, 
                       7L, 10L, 17L, 14L, 3L, 6L, 11L, 7L, 9L, 16L, 13L, 2L, 2L, 11L, 
                       8L, 15L, 14L, 1L, 5L, 11L, 12L), .Label = c("2017-05-05T13:09:12.381-0700", 
                                  "2017-06-07T07:03:11.155-0700", "2017-07-26T11:15:03.074-0700", 
                                  "2017-08-01T09:00:44.956-0700", "2017-08-14T13:47:21.612-0700", 
                                  "2017-08-14T13:47:30.419-0700", "AA1234567", "Acquire replacement files from XYZ", 
                                  "Add measurement ", "Ingest changed file location ", "Open", 
                                  "Re-classify \"Generic Assays\" (n=24)", "Sub-task", "Task", 
                                  "TEST-1030", "TEST-1192", "TEST-1357", "TEST-1358"), class = "factor")), .Names = c("name", 
                                                       "value"), row.names = c(NA, 26L), class = "data.frame") 

           name        value 
1      issues.key       TEST-1358 
2  issues.fields.issuetype.name       Sub-task 
3   issues.fields.created  2017-08-01T09:00:44.956-0700 
4   issues.fields.updated  2017-08-01T09:00:44.956-0700 
5  issues.fields.status.name        Open 
6 issues.fields.customfield_10400       AA1234567 
7   issues.fields.summary  Ingest changed file location 
8      issues.key       TEST-1357 
9  issues.fields.issuetype.name        Task 
10   issues.fields.created  2017-07-26T11:15:03.074-0700 
11   issues.fields.updated  2017-08-14T13:47:30.419-0700 
12  issues.fields.status.name        Open 
13 issues.fields.customfield_10400       AA1234567 
14   issues.fields.summary     Add measurement 
15      issues.key       TEST-1192 
16 issues.fields.issuetype.name       Sub-task 
17   issues.fields.created  2017-06-07T07:03:11.155-0700 
18   issues.fields.updated  2017-06-07T07:03:11.155-0700 
19  issues.fields.status.name        Open 
20   issues.fields.summary Acquire replacement files from XYZ 
21      issues.key       TEST-1030 
22 issues.fields.issuetype.name        Task 
23   issues.fields.created  2017-05-05T13:09:12.381-0700 
24   issues.fields.updated  2017-08-14T13:47:21.612-0700 
25  issues.fields.status.name        Open 
26   issues.fields.summary Re-classify "Generic Assays" (n=24) 

When I unstack the list, I get the following error message.

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : 
    arguments imply differing number of rows: 

Can someone suggest how to handle this situation?

I need to create a data frame like the one shown below.

res = structure(list(issues.fields.created = structure(c(4L, 3L, 2L, 
                1L), .Label = c("2017-05-05T13:09:12.381-0700", "2017-06-07T07:03:11.155-0700", 
                    "2017-07-26T11:15:03.074-0700", "2017-08-01T09:00:44.956-0700" 
                ), class = "factor"), issues.fields.issuetype.name = structure(c(1L, 
                                2L, 1L, 2L), .Label = c("Sub-task", "Task"), class = "factor"), 
       issues.fields.status.name = structure(c(1L, 1L, 1L, 1L), .Label = "Open", class = "factor"), 
       issues.fields.customfield_10400 = structure(c(2L, 2L, 1L, 
                  1L), .Label = c("", "AA1234567"), class = "factor"), issues.fields.summary = structure(c(3L, 
                                         2L, 1L, 4L), .Label = c("Acquire replacement files from XYZ", 
                                               "Add measurement ", "Ingest changed file location", "Re-classify \"Generic Assays\" (n=24)" 
                                        ), class = "factor"), issues.fields.updated = structure(c(2L, 
                                                       4L, 1L, 3L), .Label = c("2017-06-07T07:03:11.155-0700", "2017-08-01T09:00:44.956-0700", 
                                                             "2017-08-14T13:47:21.612-0700", "2017-08-14T13:47:30.419-0700" 
                                                       ), class = "factor"), issues.key = structure(c(4L, 3L, 2L, 
                                                                   1L), .Label = c("TEST-1030", "TEST-1192", "TEST-1357", "TEST-1358" 
                                                                   ), class = "factor")), .Names = c("issues.fields.created", 
                                                                           "issues.fields.issuetype.name", "issues.fields.status.name", 
                                                                           "issues.fields.customfield_10400", "issues.fields.summary", "issues.fields.updated", 
                                                                           "issues.key"), row.names = c(NA, 4L), class = "data.frame") 

     issues.fields.created issues.fields.issuetype.name issues.fields.status.name 
1 2017-08-01T09:00:44.956-0700      Sub-task      Open 
2 2017-07-26T11:15:03.074-0700       Task      Open 
3 2017-06-07T07:03:11.155-0700      Sub-task      Open 
4 2017-05-05T13:09:12.381-0700       Task      Open 
    issues.fields.customfield_10400    issues.fields.summary 
1      AA1234567  Ingest changed file location 
2      AA1234567     Add measurement 
3         Acquire replacement files from XYZ 
4         Re-classify "Generic Assays" (n=24) 
     issues.fields.updated issues.key 
1 2017-08-01T09:00:44.956-0700 TEST-1358 
2 2017-08-14T13:47:30.419-0700 TEST-1357 
3 2017-06-07T07:03:11.155-0700 TEST-1192 
4 2017-08-14T13:47:21.612-0700 TEST-1030 

Answers

5

Use the unstack function mentioned in the title:

us = unstack(DF, value ~ name) 
data.frame(lapply(us, `length<-`, max(lengths(us)))) 

This gives

  issues.fields.created issues.fields.customfield_10400 issues.fields.issuetype.name issues.fields.status.name 
1 2017-08-01T09:00:44.956-0700      AA1234567      Sub-task      Open 
2 2017-07-26T11:15:03.074-0700      AA1234567       Task      Open 
3 2017-06-07T07:03:11.155-0700       <NA>      Sub-task      Open 
4 2017-05-05T13:09:12.381-0700       <NA>       Task      Open 
       issues.fields.summary  issues.fields.updated issues.key 
1  Ingest changed file location 2017-08-01T09:00:44.956-0700 TEST-1358 
2     Add measurement 2017-08-14T13:47:30.419-0700 TEST-1357 
3 Acquire replacement files from XYZ 2017-06-07T07:03:11.155-0700 TEST-1192 
4 Re-classify "Generic Assays" (n=24) 2017-08-14T13:47:21.612-0700 TEST-1030 

The missing values are filled with NA, the standard code for missing data in R, rather than with blanks.
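
For readers wondering why the second line is needed: when the groups have unequal sizes, unstack returns a plain list instead of a data frame, and passing that list straight to data.frame raises the "differing number of rows" error whenever the lengths are not whole multiples of each other (otherwise data.frame silently recycles the shorter vector). The sketch below reproduces this on a small made-up data frame; `toy` and its values are illustrative assumptions, not data from the question. It also shows a caveat of the padding trick: `length<-` appends NA at the end of the short column, so the rows only line up when the records missing the field come last, as they do in the question's sample.

```r
# Toy long-format data (hypothetical values): the FIRST record has no "custom" field.
toy <- data.frame(
  name  = c("key", "summary",
            "key", "custom", "summary",
            "key", "custom", "summary"),
  value = c("T-3", "first",
            "T-2", "AA1", "second",
            "T-1", "AA2", "third"),
  stringsAsFactors = FALSE
)

# Group sizes differ (2 vs 3), so unstack() returns a named list.
us <- unstack(toy, value ~ name)
lengths(us)

# Converting the ragged list directly fails; keep the error message instead of stopping.
err <- tryCatch(data.frame(us), error = function(e) conditionMessage(e))

# Padding every element to the longest length makes the conversion work ...
padded <- data.frame(lapply(us, `length<-`, max(lengths(us))),
                     stringsAsFactors = FALSE)

# ... but the NA lands in the last row: "AA1" belongs to T-2,
# yet it is printed next to T-3.
padded
```

If the missing field can occur in any record, a reshape keyed on an explicit record id avoids this misalignment.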

1

This is just a conversion from 'long' to 'wide' format. Using dplyr and tidyr ...

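library(dplyr) 
library(tidyr) 

# DF is the long-format data frame from the question.
# Number the records: each "issues.key" row starts a new one.
df2 <- DF %>% mutate(case = cumsum(name == "issues.key")) %>% 
       spread(key = name, value = value) %>%   # one column per field name
       select(-case)                           # drop the helper id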

df2 
     issues.fields.created issues.fields.customfield_10400 issues.fields.issuetype.name issues.fields.status.name    issues.fields.summary  issues.fields.updated issues.key 
1 2017-08-01T09:00:44.956-0700      AA1234567      Sub-task      Open  Ingest changed file location 2017-08-01T09:00:44.956-0700 TEST-1358 
2 2017-07-26T11:15:03.074-0700      AA1234567       Task      Open     Add measurement 2017-08-14T13:47:30.419-0700 TEST-1357 
3 2017-06-07T07:03:11.155-0700       <NA>      Sub-task      Open Acquire replacement files from XYZ 2017-06-07T07:03:11.155-0700 TEST-1192 
4 2017-05-05T13:09:12.381-0700       <NA>       Task      Open Re-classify "Generic Assays" (n=24) 2017-08-14T13:47:21.612-0700 TEST-1030 
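
In tidyr 1.0.0 and later, spread is superseded by pivot_wider. A minimal sketch of the same reshape with the newer function follows, assuming a recent tidyr is installed; `toy` and its values are made up for illustration and are not the question's data.

```r
library(dplyr)
library(tidyr)

# Toy long-format data (hypothetical values): the second record has no custom field.
toy <- data.frame(
  name  = c("issues.key", "issues.fields.customfield_10400", "issues.fields.summary",
            "issues.key", "issues.fields.summary"),
  value = c("TEST-2", "AA1234567", "first",
            "TEST-1", "second"),
  stringsAsFactors = FALSE
)

wide <- toy %>%
  # each "issues.key" row starts a new record
  mutate(case = cumsum(name == "issues.key")) %>%
  # one column per field name; fields absent from a record become NA
  pivot_wider(id_cols = case, names_from = name, values_from = value) %>%
  select(-case)

wide
```

Because the rows are matched on the record id rather than on position, the NA ends up in the record that actually lacks the field.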
1

With the dcast function from data.table (or reshape2), you can do the following:

# create ID variable (dat is the long-format data frame, called DF in the question) 
dat <- DF 
dat$id <- cumsum(grepl("TEST-", dat$value, fixed=TRUE)) 

Now reshape by ID and name:

library(data.table) # or library(reshape2) 
dcast(dat, id~name, value.var="value", fill=NA) 

This returns the desired result shown below.

id  issues.fields.created issues.fields.customfield_10400 issues.fields.issuetype.name 
1 1 2017-08-01T09:00:44.956-0700      AA1234567      Sub-task 
2 2 2017-07-26T11:15:03.074-0700      AA1234567       Task 
3 3 2017-06-07T07:03:11.155-0700       <NA>      Sub-task 
4 4 2017-05-05T13:09:12.381-0700       <NA>       Task 
    issues.fields.status.name    issues.fields.summary  issues.fields.updated issues.key 
1      Open  Ingest changed file location 2017-08-01T09:00:44.956-0700 TEST-1358 
2      Open     Add measurement 2017-08-14T13:47:30.419-0700 TEST-1357 
3      Open Acquire replacement files from XYZ 2017-06-07T07:03:11.155-0700 TEST-1192 
4      Open Re-classify "Generic Assays" (n=24) 2017-08-14T13:47:21.612-0700 TEST-1030 
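
Both reshaping answers depend on a record id built with cumsum: the dplyr answer keys it on the field name (name == "issues.key"), while this answer keys it on the value text ("TEST-"). A small base R sketch of the difference, on made-up data (`toy`, `id_by_name` and `id_by_value` are illustrative names): the two ids agree on the question's sample, but they could diverge if some other field's value happened to contain the key prefix.

```r
# Toy long-format data (hypothetical values); the second summary mentions another issue key.
toy <- data.frame(
  name  = c("issues.key", "issues.fields.summary",
            "issues.key", "issues.fields.summary"),
  value = c("TEST-2", "Add measurement",
            "TEST-1", "Duplicate of TEST-2"),
  stringsAsFactors = FALSE
)

# Id from the field name: increments once per "issues.key" row.
id_by_name <- cumsum(toy$name == "issues.key")

# Id from the value text: also increments on the summary that contains "TEST-".
id_by_value <- cumsum(grepl("TEST-", toy$value, fixed = TRUE))

id_by_name   # 1 1 2 2
id_by_value  # 1 1 2 3
```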
+1

Surprisingly, this seems to work: 'dcast(dat, cumsum(name == "issues.key") ~ name)' – Frank

+0

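That is neat. I had not seen complicated formulas used in one of these functions before. I guess that if this kind of expression works in an 'lm' call, it should work here too. – lmo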

+0

lmo's strategy works for the given example. With several thousand entries, however, Frank's strategy worked without trouble. – RanonKahn
