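Generating synthetic data for unsupervised learning

I want to prepare data for unsupervised learning with random forests. The procedure is as follows:
- Take the data and add the attribute 'class' with value 1 to all instances
- Generate synthetic data from the original data:
  - While you do not have the same number of examples as in the original data, construct an example:
    - Sample a new value of an attribute from all values of that attribute in the original data
    - Do this for all attributes and combine the values into a new example
- Assign the value 2 to the attribute 'class' of the synthetic data
- Bind both data sets together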
In the end it looks like this:
          ...  Class
              |1
Original      |1
Data          |1
              |1
--------------
              |2
Synthetic     |2
Data          |2
              |2
My R code looks like this:
library(gtools) #for smartbind()
sample1 <- function(X) { sample(X, replace=T) }
g1 <- function(dat) { apply(dat,2,sample1) }
data$class <- rep(1, times=nrow(data)) #add attribute 'class' with value 1
synthData<-data.frame(g1(data[,1:ncol(data)])) #generate synthetic data with sampling from data
synthData$class <- rep(2, times=nrow(synthData)) #attribute 'class' is 2
colnames(synthData) <- colnames(data)
newData <- smartbind(data, synthData) #bind the data together
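It is probably obvious that I am really new to R, but it works, with one problem: the types of the attributes in the synthetic data differ from those in the original data. If they were numeric originally, they are now factors. How can I keep the same types when generating the synthetic data?
Thanks!
Data 1 (nums become factors):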
structure(list(V2 = c(1.51793, 1.51711, 1.51645, 1.51916, 1.51131), V3 = c(13.21, 12.89, 13.44, 14.15, 13.69), V4 = c(3.48, 3.62, 0.3.61, 0.3.2), V5 = c(1.41, 1.57, 1.54, 2.09, 1.81), V6 = c(72.64, 72.96, 72.39, 72.74, 72.81), V7 = c(0.59, 0.61, 0.66, 0, 1.76), V8 = c(8.43, 8.11, 8.03, 10.88, 5.43), V9 = c(0, 0, 0, 0, 1.19), V10 = c(0, 0, 0, 0), realClass = structure(c(1L, 2L, 2L, 5L, 6L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")), .Names = c("V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "realClass"), row.names = c(…, 183L, 186L), class = "data.frame")
Data 2 (factors become chrs):
structure(list(realClass = structure(c(2L, 2L, 2L, 1L, 2L), .Label = c("e", "p"), class = "factor"), V2 = structure(c(6L, 3L, 4L, 6L, 6L), .Label = c("b", "c", "f", "k", "s", "x"), class = "factor"), V3 = structure(c(4L, 4L, 3L, 1L, 1L), .Label = c("f", "g", "s", "y"), class = "factor"), V4 = structure(c(5L, 5L, 5L, 3L, 4L), .Label = c("b", "c", "e", "g", "n", "p", "r", …), class = "factor"), V5 = structure(c(1L, 1L, 1L, 2L, 1L), .Label = c("f", "t"), class = "factor"), V6 = structure(c(3L, 9L, 3L, 6L, 3L), .Label = c("a", "c", "f", "l", "m", "n", "p", "s", "y"), class = "factor"), V7 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("a", "f"), class = "factor"), V8 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("c", "w"), class = "factor"), V9 = structure(c(2L, 2L, 2L, 1L, 1L), .Label = c("b", "n"), class = "factor"), V10 = structure(c(1L, 1L, 1L, 10L, 4L), .Label = c("b", "e", "g", "h", "k", "n", "o", "p", "r", "u", "w", "y"), class = "factor"), V11 = structure(c(2L, 2L, 2L, 2L, 1L), .Label = c("e", "t"), class = "factor"), V12 = structure(c(NA, NA, NA, 1L, 1L), .Label = c("b", "c", "e", "r"), class = "factor"), V13 = structure(c(3L, 2L, 3L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"), V14 = structure(c(3L, 3L, 2L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"), V15 = structure(c(7L, 8L, 7L, 4L, 7L), .Label = c("b", "c", "e", "g", "n", "o", "p", "w", "y"), class = "factor"), V16 = structure(c(7L, 7L, 8L, 4L, 1L), .Label = c("b", "c", "e", "g", "n", "o", "p", …), class = "factor"), V17 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "p", class = "factor"), V18 = structure(c(3L, 3L, 3L, 3L, 3L), .Label = c("n", "o", "w", "y"), class = "factor"), V19 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("n", "o", "t"), class = "factor"), V20 = structure(c(1L, 1L, 1L, 5L, 3L), .Label = c("e", "f", "l", "n", "p"), class = "factor"), V21 = structure(c(…, 8L, 8L, 4L, 2L), .Label = c("b", "h", "k", "n", "o", "r", "u", "w", "y"), class = "factor"), V22 = structure(c(5L, 5L, 5L, 5L, 6L), .Label = c("a", "c", "n", "s", "v", "y"), class = "factor"), V23 = structure(c(3L, 3L, 5L, 1L, 2L), .Label = c("d", "g", …, "m", "p", "u"), class = "factor")), .Names = c("realClass", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20", "V21", "V22", "V23"), row.names = c(4105L, 6207L, 6696L, 2736L, 3756L), class = "data.frame")
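One likely explanation for the type change can be sketched on a small hypothetical data frame, `toy` (not the data above). `apply()` converts its input to a matrix before doing anything else, and a matrix holds a single type, so a data frame with mixed columns comes back as character. `data.frame()` may then turn those character columns into factors, depending on `stringsAsFactors`.

```r
# Hypothetical toy data frame (not the data above): one numeric, one factor column
toy <- data.frame(num = c(1.5, 2.5, 3.5),
                  fac = factor(c("a", "b", "a")))

# apply() first converts the data frame with as.matrix();
# with mixed column types the resulting matrix is character
m <- apply(toy, 2, function(x) sample(x, replace = TRUE))
class(m[, "num"])   # the numeric column is already character here

# data.frame() turns character columns into factors when
# stringsAsFactors = TRUE (the default before R 4.0.0)
str(data.frame(m, stringsAsFactors = TRUE))
```

Checking `str()` after each step of the original code should show at which call the types actually change.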
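Since you don't show your data, it is not obvious why you get factors in place of numerics, but you can do 'numcol <- as.numeric(as.character(factcol))' – dickoa 2012-08-05 22:28:42
Yes, this works. Is there a more general solution, so that whatever the types of the attributes are, they stay the same after the process? – 2012-08-05 22:31:21
It is easier to find an answer with a reproducible example. In this case, we don't know much about the data ('str(data)' or, better, 'dput(data)'). – dickoa 2012-08-05 22:36:50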
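As for a more general approach, one possible sketch is to resample column by column with `lapply()` instead of `apply()`, so the data frame is never converted to a matrix. The helper name `resample_columns` and the `toy` data frame are hypothetical, and base `rbind()` stands in for `smartbind()` on the assumption that both data frames have identical columns.

```r
# Hypothetical helper: resample every column independently, with replacement.
# Indexing each column by sampled positions keeps its type
# (numeric stays numeric, factor stays factor, and so on).
resample_columns <- function(dat) {
  n <- nrow(dat)
  dat[] <- lapply(dat, function(col) col[sample.int(n, replace = TRUE)])
  rownames(dat) <- NULL   # drop the original row names
  dat
}

# Hypothetical toy data, not the data from the question
toy <- data.frame(num = c(1.5, 2.5, 3.5, 4.5),
                  fac = factor(c("a", "b", "a", "c")))

synth <- resample_columns(toy)  # synthetic data, same column types
toy$class <- 1                  # original data gets class 1
synth$class <- 2                # synthetic data gets class 2
newData <- rbind(toy, synth)    # bind both data sets together
str(newData)
```

Sampling positions with `sample.int()` rather than calling `sample()` on the column itself also avoids the special case where `sample(x)` on a single number draws from `1:x`.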