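Fast NA replacement for strings using data.table or Rcpp in R
I have a large table: 10M rows by 33 columns, of which 28 columns contain some NA values. These NA values need to be filled using locf().
I have read some threads on this topic ("efficiently locf by groups in a single R data.table" and "na.locf and inverse.rle in Rcpp"). However, those threads replace NAs in numeric vectors. I am not very familiar with Rcpp,
so I don't know how to adapt their code to handle strings, and my data are all strings.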
Here is my sample data:
Input data
Sample_File = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113),
Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456",
"GH567", "HI678", "IJ789"), Name = c(NA, NA, NA, "Human Being",
NA, "Lion", NA, "Bird"), Family = c(NA, NA, NA, "Homo Sapiens",
NA, NA, NA, "Passeridae"), SL1_Continent = c("Asia", NA,
"Asia", "Asia", NA, NA, NA, "Australia"), SL2_Country = c("China",
"China", NA, NA, NA, NA, NA, "Australia"), SL3_Direction = c("East",
NA, "East", "East", NA, NA, NA, "West"), Expiration_FY = c(2021,
NA, 2018, NA, 2012, 2012, NA, 2012), Flag = c("Y", NA, "N",
"N", NA, NA, NA, "TBD"), Insured = c("No", NA, NA, NA, NA,
NA, NA, "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4,
44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO",
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country",
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue",
"Quantity"), row.names = c(NA, 8L), class = "data.frame")
Below is my code, using data.table:
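library(zoo)  # provides na.locf()
data.table::setDT(Sample_File)
cols <- c("Name","Family","SL1_Continent","SL2_Country","SL3_Direction","Expiration_FY","Flag","Insured")
# Within each SO group, fill each NA from the next non-NA value below it (fromLast = TRUE)
Sample_File[, (cols) := lapply(.SD, function(x) na.locf(x, fromLast = TRUE, na.rm = TRUE)), by = SO, .SDcols = cols]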
Expected output:
Output = structure(list(SO = c(112, 112, 112, 112, 113, 113, 113, 113),
Product.ID = c("AB123", "CD234", "DE345", "EF456", "FG456",
"GH567", "HI678", "IJ789"), Name = c("Human Being", "Human Being",
"Human Being", "Human Being", "Lion", "Lion", "Bird", "Bird"
), Family = c("Homo Sapiens", "Homo Sapiens", "Homo Sapiens",
"Homo Sapiens", "Passeridae", "Passeridae", "Passeridae",
"Passeridae"), SL1_Continent = c("Asia", "Asia", "Asia",
"Asia", "Australia", "Australia", "Australia", "Australia"
), SL2_Country = c("China", "China", "China", "China", "Australia",
"Australia", "Australia", "Australia"), SL3_Direction = c("East",
"East", "East", "East", "West", "West", "West", "West"),
Expiration_FY = c(2021, 2018, 2018, 2021, 2012, 2012, 2012,
2012), Flag = c("Y", "N", "N", "N", "TBD", "TBD", "TBD",
"TBD"), Insured = c("No", "No", "No", "No", "Yes", "Yes",
"Yes", "Yes"), Revenue = c(0, 478227.44, 0, 0, 0, 0, 125550.4,
44314.51), Quantity = c(1000, 100, 100, 4, 6, 6, 4, 6)), .Names = c("SO",
"Product.ID", "Name", "Family", "SL1_Continent", "SL2_Country",
"SL3_Direction", "Expiration_FY", "Flag", "Insured", "Revenue",
"Quantity"), row.names = c(NA, -8L), class = "data.frame")
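While the above code takes a fraction of a second on the sample, it takes about 10 minutes to process a single column of my original data set, which translates to ~280 minutes for all 28 columns, even with data.table.
I assume I am not really leveraging the power of data.table above, but I am not sure. I would sincerely appreciate any help in speeding up the na.locf() call.
Is there a more efficient way to replace the NAs above?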
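For illustration only, here is a minimal base R sketch of one way the per-group na.locf() calls might be avoided. It computes, for the whole column at once, the index of the next non-NA row and then blanks out any fill that would cross a group boundary. The function name nocb_by_group is hypothetical, and the sketch assumes that the rows of each group are contiguous (as SO is in the sample) and that the grouping column has no NAs. Because it only indexes the vector, it should work for character columns as well as numeric ones. Unlike na.rm = TRUE, it leaves a trailing NA in a group as NA instead of dropping it.

```r
# Hypothetical helper: next observation carried backward within groups.
# Assumes each group in `g` occupies a contiguous block of rows and `g` has no NAs.
nocb_by_group <- function(x, g) {
  rx <- rev(x)   # reverse so that "next value" becomes "previous value"
  rg <- rev(g)
  # Index of the most recent non-NA element seen so far (0 if none yet)
  idx <- cummax(seq_along(rx) * !is.na(rx))
  idx[idx == 0L] <- NA_integer_
  out <- rx[idx]
  # Do not let a value leak in from a different group
  crossed <- !is.na(idx) & rg[idx] != rg
  out[crossed] <- NA
  rev(out)
}

# Small check using two columns from the sample data
so   <- c(112, 112, 112, 112, 113, 113, 113, 113)
name <- c(NA, NA, NA, "Human Being", NA, "Lion", NA, "Bird")
fy   <- c(2021, NA, 2018, NA, 2012, 2012, NA, 2012)

print(nocb_by_group(name, so))
print(nocb_by_group(fy, so))   # the 4th value stays NA: nothing follows it in group 112

# With data.table (assumed installed), a single ungrouped call could look like:
# Sample_File[, (cols) := lapply(.SD, nocb_by_group, g = SO), .SDcols = cols]
```

Whether this is actually faster on the 10M-row table has not been measured here; the idea is simply that one vectorised pass per column replaces one na.locf() call per group per column.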
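Nice answer. Too bad we can't really benchmark it because of the reference semantics.
Found a way by making a copy first. With that, your Rcpp/C++11 solution is 3 to 4 times faster than data.table and 5 times faster than dplyr.
@DirkEddelbuettel There is nothing from data.table or dplyr in it, IIRC. It is just the Rcpp na.locf versus zoo::na.locf. I remember seeing a data.table version of na.locf, written by eddie, that uses a rolling join. I think it is [this one](http://stackoverflow.com/a/26181795/3001626) (untested).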