2015-03-30 57 views

How do I remove duplicate rows that have less data in R?

Say I have the following data.table (call it data):

row,or,d,ddate,rdate,changes,class,price,fdate,company,number,minutes,added,source 
1,VA1,VA2,2014-05-24,,0,0,2124,2014-05-22 15:50:16,,,,2014-05-22 12:20:03,tp 
2,VA1,VA2,2014-05-26,,0,0,2124,2014-05-22 15:03:44,,,,2014-05-22 12:20:03,tp 
3,VA1,VA2,2014-05-26,,0,0,2124,2014-05-22 15:03:44,A1,,,2014-05-22 12:20:03,tp 
4,VA1,VA2,2014-06-05,,0,0,2124,2014-05-22 15:48:24,,,,2014-05-22 12:20:03,tp 
5,VA1,VA2,2014-06-09,,0,0,2124,2014-05-22 15:37:35,,,,2014-05-22 12:20:03,tp 
6,VA1,VA2,2014-06-16,,0,0,2124,2014-05-22 14:17:33,,,,2014-05-22 12:20:03,tp 
7,VA1,VA2,2014-06-16,,0,0,2124,2014-05-22 14:17:33,,,,2014-05-22 12:20:03,tp 

I want to remove the duplicate rows. If I run data <- unique(data, by = NULL), only the last row (row 7) is removed, but I also want row 2 removed. I can define a key with setkey():

setkey(data, row,or,d,ddate,rdate,changes,class,price,fdate,number,minutes,added,source) 

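With that key, either row 2 or row 3 gets removed, but I want to remove the row that has less data and keep the row that has more. In the case above, row 2 should be removed and row 3 kept, because row 3 has an additional value in the company column. How can I do this?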


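The problem I see with this question is that if one row has less data than another, it _isn't_ a duplicate. At least not if you use every column as the uniqueness key. – 2015-03-30 19:59:46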
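To make the comment above concrete, here is a minimal sketch. The three-row table `toy` is made up for illustration and is not the asker's data. With every column acting as the uniqueness key, a row that differs only by an empty field is not a duplicate; restricting `by` to a subset of columns turns it into one, and `unique()` then keeps the first row of each group.

```r
library(data.table)

# Hypothetical toy table: rows 1 and 2 differ only in `company`
toy <- data.table(
  ddate   = c("2014-05-26", "2014-05-26", "2014-06-16"),
  price   = c(2124L, 2124L, 2124L),
  company = c("", "A1", "")
)

# Every column is part of the key: no row counts as a duplicate
all_cols <- unique(toy)

# Only ddate and price are part of the key: the first row of each
# group is kept, so the row with the empty company survives
subset_cols <- unique(toy, by = c("ddate", "price"))
```

Because the first row wins, the order of the rows decides which one is kept, which is why a plain `unique()` call cannot prefer the fuller row on its own.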

Answers


How about:

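# dt is the data.table from the question (called data there)
# whatever the important columns are for your uniqueness criterion
important.cols <- c('or', 'd', 'ddate', 'rdate', 'changes', 'class', 'price', 'fdate')

# within each group of rows that match on important.cols,
# pick the row with the max number of non-empty elements
dt[, .SD[which.max(rowSums(.SD != "", na.rm = TRUE))], by = important.cols]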
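As a rough end-to-end check, the sketch below rebuilds the sample table from the question by hand and applies the same expression. It assumes the empty fields are stored as empty strings (`""`) rather than `NA`; how they arrive in practice depends on how the file was read. The names `dt` and `result` are only illustrative.

```r
library(data.table)

# Sample data from the question, with empty fields stored as "" (assumption)
dt <- data.table(
  row     = 1:7,
  or      = "VA1",
  d       = "VA2",
  ddate   = c("2014-05-24", "2014-05-26", "2014-05-26", "2014-06-05",
              "2014-06-09", "2014-06-16", "2014-06-16"),
  rdate   = "",
  changes = 0L,
  class   = 0L,
  price   = 2124L,
  fdate   = c("2014-05-22 15:50:16", "2014-05-22 15:03:44",
              "2014-05-22 15:03:44", "2014-05-22 15:48:24",
              "2014-05-22 15:37:35", "2014-05-22 14:17:33",
              "2014-05-22 14:17:33"),
  company = c("", "", "A1", "", "", "", ""),
  number  = "",
  minutes = "",
  added   = "2014-05-22 12:20:03",
  source  = "tp"
)

important.cols <- c("or", "d", "ddate", "rdate", "changes", "class", "price", "fdate")

# For each group, count the non-empty cells per row in the remaining
# columns and keep the row with the highest count (first one on ties)
result <- dt[, .SD[which.max(rowSums(.SD != "", na.rm = TRUE))], by = important.cols]

result$row
```

Rows 2 and 3 fall into the same group, and row 3 is kept because of its `company` value. Rows 6 and 7 tie, so `which.max()` keeps the first of them.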

Thanks. Could you clarify what '.SD' means? – 2015-03-30 20:12:53


It stands for Subset of Data; see the [introduction](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro-vignette.html) for more details – eddi 2015-03-30 20:27:47
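A small sketch of what `.SD` holds may help here. The table `toy` and its columns are hypothetical. Inside `j`, `.SD` is itself a data.table containing the current group's rows, with the grouping columns left out.

```r
library(data.table)

# Hypothetical toy table, grouped by `g`
toy <- data.table(g = c("a", "a", "b"), x = 1:3, y = c("", "p", "q"))

# For each group, report how many rows .SD has and which columns it contains
per_group <- toy[, .(n_rows = nrow(.SD),
                     cols   = paste(names(.SD), collapse = ", ")),
                 by = g]
per_group
```

Group `a` sees a two-row `.SD` and group `b` a one-row `.SD`, both with columns `x` and `y` only. This is why `.SD != ""` in the answer counts non-empty cells only in the columns outside `important.cols`.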