2

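I want to prepare data for unsupervised learning with a random forest. The procedure for generating the synthetic data is as follows:

  • Take the data and add an attribute "class" with value 1 for all instances
  • Generate synthetic data from the original data:
    • until you have the same number of examples as in the original data, build an example:
      • sample a new value for an attribute from all values of that attribute in the original data
      • do this for every attribute and combine the values into a new example
  • Assign value 2 to the attribute "class" of the synthetic data
  • Bind the two data sets together

With the 'class' attribute, the end result looks like this: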

 ...  Class 
       |1 
    Original |1 
    Data  |1 
       |1 
    -------------- 
       |2 
    Synthetic |2 
    Data  |2 
       |2 

My R code looks like this:

library(gtools) #for smartbind() 

sample1 <- function(X) { sample(X, replace=T) } 
g1  <- function(dat) { apply(dat,2,sample1) } 

data$class <- rep(1, times=nrow(data)) #add attribute 'class' with value 1 

synthData<-data.frame(g1(data[,1:ncol(data)])) #generate synthetic data with sampling from data 
synthData$class <- rep(2, times=nrow(synthData)) #attribute 'class' is 2 
colnames(synthData) <- colnames(data) 
newData <- smartbind(data, synthData) #bind the data together 

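It is probably obvious that I am really new to R, but the code works, with one problem: the types of the attributes in the synthetic data differ from those in the original data. If they were numeric originally, they are now factors. How can I keep the same types when generating the synthetic data?

Thanks!

Data 1 (nums become factors):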

structure(list(V2 = c(1.51793, 1.51711, 1.51645, 1.51916, 1.51131),
    V3 = c(13.21, 12.89, 13.44, 14.15, 13.69), V4 = c(3.48,3.62,0.3.61,0.3.2),
    V5 = c(1.41, 1.57, 1.54, 2.09, 1.81), V6 = c(72.64, 72.96, 72.39, 72.74, 72.81),
    V7 = c(0.59, 0.61, 0.66, 0, 1.76), V8 = c(8.43, 8.11, 8.03, 10.88, 5.43),
    V9 = c(0, 0, 0, 0, 1.19), V10 = c(0, 0, 0, 0),
    realClass = structure(c(1L, 2L, 2L, 5L, 6L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")),
    .Names = c("V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "realClass"),
    row.names = c(..., 183L, 186L), class = "data.frame")

Data 2 (factors become chrs):

structure(list(realClass = structure(c(2L, 2L, 2L, 1L, 2L), .Label = c("e", "p"), class = "factor"),
    V2 = structure(c(6L, 3L, 4L, 6L, 6L), .Label = c("b", " ", "c", "f", "k", "s", "x"), class = "factor"),
    V3 = structure(c(4L, 4L, 3L, 1L, 1L), .Label = c("f", "g", "s", "y"), class = "factor"),
    V4 = structure(c(5L, 5L, 5L, 3L, 4L), .Label = c("b", "c", "e", "g", "n", "p", "r" ...
    ... (1L, 1L, 1L, 2L, 1L), .Label = c("f", "t"), class = "factor"),
    V6 = structure(c(3L, 9L, 3L, 6L, 3L), .Label = c("a", "c", "f", "l", "m", "n", "p", "s", "y"), class = "factor"),
    V7 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("a", "f"), class = "factor"),
    V8 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("c", "w"), class = "factor"),
    V9 = structure(c(2L, 2L, 2L, 1L, 1L), .Label = c("b", "n"), class = "factor"),
    V10 = structure(c(1L, 1L, 1L, 10L, 4L), .Label = c("b", "e", "g", "h", "k", "n", "o", "p", "r", "u", "w", "y"), class = "factor"),
    V11 = structure(c(2L, 2L, 2L, 2L, 1L), .Label = c("e", "t"), class = "factor"),
    V12 = structure(c(NA, NA, NA, 1L, 1L), .Label = c("b", "c", "e", "r"), class = "factor"),
    V13 = structure(c(3L, 2L, 3L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"),
    V14 = structure(c(3L, 3L, 2L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"),
    V15 = structure(c(7L, 8L, 7L, 4L, 7L), .Label = c("b", "c", "e", "g", "n", "o", "p", "w", "y"), class = "factor"),
    V16 = structure(c(7L, 7L, 8L, 4L, 1L), .Label = c("b", "c", "e", "g", "n", "o", "p" ...
    V17 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "p", class = "factor"),
    V18 = structure(c(3L, 3L, 3L, 3L, 3L), .Label = c("n", "o", "w", "y"), class = "factor"),
    V19 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("n", "o", "t"), class = "factor"),
    V20 = structure(c(1L, 1L, 1L, 5L, 3L), .Label = c("e", "f", "l", "n", "p"), class = "factor"),
    V21 = structure(c(..., 8L, 8L, 4L, 2L), .Label = c("b", "h", "k", "n", "o", "r", "u", "w", "y"), class = "factor"),
    V22 = structure(c(5L, 5L, 5L, 5L, 6L), .Label = c("a", "c", "n", "s", "v", "y"), class = "factor"),
    V23 = structure(c(3L, 3L, 5L, 1L, 2L), .Label = c("d", "g", ..., "m", "p", "u"), class = "factor")),
    .Names = c("realClass", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12",
    "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20", "V21", "V22", "V23"),
    row.names = c(4105L, 6207L, 6696L, 2736L, 3756L), class = "data.frame")

+1

Since you don't show your data, it is not obvious why you get factors in place of numerics, but you can do 'numcol <- as.numeric(as.character(factcol))' – dickoa 2012-08-05 22:28:42

+0

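Yes, that works. Is there a more general solution, so that the attributes keep their types after the procedure regardless of what those types are? – 2012-08-05 22:31:21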

+0

It is easier to find an answer with a reproducible example. In this case we don't know much about the data (post 'str(data)' or, better, 'dput(data)'). – dickoa 2012-08-05 22:36:50

Answer

3

You can always use this trick to turn a factor column back into a numeric one

numcol <- as.numeric(as.character(factcol)) 
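
The `as.character()` step in that conversion matters: calling `as.numeric()` directly on a factor returns the internal integer level codes rather than the values. A minimal illustration (the vector `f` is made up for this example):

```r
# A factor whose labels happen to look like numbers
f <- factor(c("10", "2.5", "10"))

codes  <- as.numeric(f)                # level codes: 1 2 1
values <- as.numeric(as.character(f))  # the actual numbers: 10 2.5 10

print(codes)
print(values)
```
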

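But I suspect you have factor variables in your data.frame. apply first converts the data frame to a matrix, and a matrix can hold only one type, so a single factor column turns everything, including the numeric columns, into character. data.frame() then converts those character columns to factors (the default behaviour before R 4.0.0).

Here is an example using a toy data set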

set.seed(123) 
toydat <- data.frame(A = 1:10, B = rnorm(10), C = LETTERS[1:10]) 
str(toydat) 

## 'data.frame': 10 obs. of 3 variables: 
## $ A: int 1 2 3 4 5 6 7 8 9 10 
## $ B: num -0.5605 -0.2302 1.5587 0.0705 0.1293 ... 
## $ C: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 

set.seed(1) 
str(data.frame(apply(toydat[,1:2], 2, sample, replace = TRUE))) 

## 'data.frame': 10 obs. of 2 variables: 
## $ A: num 3 4 6 10 3 9 10 7 7 1 
## $ B: num 1.5587 -0.2302 0.4609 0.0705 -1.2651 ... 

# with the factor column C  
set.seed(2) 
str(data.frame(apply(toydat[,1:3], 2, sample, replace = TRUE))) 

## 'data.frame': 10 obs. of 3 variables: 
## $ A: Factor w/ 6 levels "10"," 2"," 5",..: 2 5 4 2 1 1 2 6 3 4 
## $ B: Factor w/ 8 levels " 0.129288","-0.230177",..: 8 7 6 2 1 5 3 7 1 4 
## $ C: Factor w/ 6 levels "B","D","E","G",..: 4 2 5 1 2 3 1 2 6 1 
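
The coercion can also be observed directly, without any sampling, by looking at what `as.matrix()` (which `apply` uses on a data frame) produces. This is a small sketch with a made-up data frame `df`; the factor is created explicitly so the result does not depend on the R version:

```r
df <- data.frame(A = 1:3, C = factor(c("x", "y", "z")))

m_all <- as.matrix(df)        # numeric + factor columns
m_num <- as.matrix(df["A"])   # numeric column only

print(typeof(m_all))  # "character": the numbers were turned into strings
print(is.numeric(m_num))
```
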

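This is where the plyr package becomes useful, because you can control the type of the output (using the **ply functions). In this case, however, the colwise function is enough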

require(plyr) 
set.seed(2) 
mysamplingfun <- colwise(function(x) sample(x, replace = TRUE)) 
str(mysamplingfun(toydat[,1:3])) 

## 'data.frame': 10 obs. of 3 variables: 
## $ A: int 2 8 6 2 10 10 2 9 5 6 
## $ B: num 1.715 1.559 -1.265 -0.23 0.129 ... 
## $ C: Factor w/ 10 levels "A","B","C","D",..: 7 4 9 2 4 5 2 4 10 2 
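
For readers who prefer not to depend on plyr, the same effect can probably be had in base R with `lapply`, which works column by column and never goes through a matrix. The following is only a sketch of the whole procedure from the question under that assumption; `make_synthetic` and the data frame `orig` are hypothetical names introduced for illustration, and base `rbind` stands in for `smartbind`:

```r
# Hypothetical helper (not from the original post): resample every column
# independently with replacement, keeping each column's class.
make_synthetic <- function(dat) {
  dat[] <- lapply(dat, function(col) {
    # index explicitly to avoid sample()'s special case for length-1 numerics
    col[sample.int(length(col), replace = TRUE)]
  })
  dat
}

set.seed(42)
orig <- data.frame(A = 1:10, B = rnorm(10), C = factor(LETTERS[1:10]))

synth <- make_synthetic(orig)   # same column types as orig
orig$class  <- 1                # original data gets class 1
synth$class <- 2                # synthetic data gets class 2
newData <- rbind(orig, synth)   # bind the two data sets together

str(newData)
```

Assigning into `dat[]` keeps the data frame structure and column names, so no `colnames()` fix-up is needed afterwards.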
+0

Yes, colwise does what I need. Thanks, I really appreciate the effort you put into helping me. – 2012-08-06 14:57:48