Generating synthetic data for unsupervised learning

I want to prepare data for unsupervised learning with random forests. The procedure is as follows:

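Take the data and add an attribute 'class' with value 1 to all instances
Generate synthetic data from the original data:
- until you have the same number of examples as in the original data, construct examples:
  - sample a new value of an attribute from all the values of that attribute in the original data
  - do this for all attributes and combine them into a new example
Assign value 2 of the attribute 'class' to the synthetic data
Bind both data sets together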

In the end, the 'class' attribute looks like this:

 ...  Class 
       |1 
    Original |1 
    Data  |1 
       |1 
    -------------- 
       |2 
    Synthetic |2 
    Data  |2 
       |2

My R code looks like this:

library(gtools) #for smartbind() 

sample1 <- function(X) { sample(X, replace=T) } 
g1  <- function(dat) { apply(dat,2,sample1) } 

data$class <- rep(1, times=nrow(data)) #add attribute 'class' with value 1 

synthData<-data.frame(g1(data[,1:ncol(data)])) #generate synthetic data with sampling from data 
synthData$class <- rep(2, times=nrow(synthData)) #attribute 'class' is 2 
colnames(synthData) <- colnames(data) 
newData <- smartbind(data, synthData) #bind the data together

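As is probably obvious, I'm really new to R, but it works, with just one problem: the types of the attributes in the synthetic data are not the same as in the original data. If they were numeric originally, they are now factors. How can I keep the same types when generating the synthetic data?

Thanks!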

Data 1 (nums become factors):

structure(list(V2 = c(1.51793, 1.51711, 1.51645, 1.51916, 1.51131), V3 = c(13.21, 12.89, 13.44, 14.15, 13.69), V4 = c(3.48,3.62,0.3.61,0.3.2), V5 = c(1.41, 1.57, 1.54, 2.09, 1.81), V6 = c(72.64, 72.96, 72.39, 72.74, 72.81), V7 = c(0.59, 0.61, 0.66, 0, 1.76), V8 = c(8.43, 8.11, 8.03, 10.88, 5.43), V9 = c(0, 0, 0, 0, 1.19), V10 = c(0, 0, 0, 0), realClass = structure(c(1L, 2L, 2L, 5L, 6L), .Label = c("1", "2", "3", "5", "6", "7"), class = "factor")), .Names = c("V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "realClass"), row.names = c(…, 183L, 186L), class = "data.frame")

Data 2 (factors become chrs):

structure(list(realClass = structure(c(2L, 2L, 2L, 1L, 2L), .Label = c("e", "p"), class = "factor"), V2 = structure(c(6L, 3L, 4L, 6L, 6L), .Label = c("b", "c", "f", "k", "s", "x"), class = "factor"), V3 = structure(c(4L, 4L, 3L, 1L, 1L), .Label = c("f", "g", "s", "y"), class = "factor"), V4 = structure(c(5L, 5L, 5L, 3L, 4L), .Label = c("b", "c", "e", "g", "n", "p", "r", …), class = "factor"), V5 = structure(c(1L, 1L, 1L, 2L, 1L), .Label = c("f", "t"), class = "factor"), V6 = structure(c(3L, 9L, 3L, 6L, 3L), .Label = c("a", "c", "f", "l", "m", "n", "p", "s", "y"), class = "factor"), V7 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("a", "f"), class = "factor"), V8 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("c", "w"), class = "factor"), V9 = structure(c(2L, 2L, 2L, 1L, 1L), .Label = c("b", "n"), class = "factor"), V10 = structure(c(1L, 1L, 1L, 10L, 4L), .Label = c("b", "e", "g", "h", "k", "n", "o", "p", "r", "u", "w", "y"), class = "factor"), V11 = structure(c(2L, 2L, 2L, 2L, 1L), .Label = c("e", "t"), class = "factor"), V12 = structure(c(NA, NA, NA, 1L, 1L), .Label = c("b", "c", "e", "r"), class = "factor"), V13 = structure(c(3L, 2L, 3L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"), V14 = structure(c(3L, 3L, 2L, 3L, 2L), .Label = c("f", "k", "s", "y"), class = "factor"), V15 = structure(c(7L, 8L, 7L, 4L, 7L), .Label = c("b", "c", "e", "g", "n", "o", "p", "w", "y"), class = "factor"), V16 = structure(c(7L, 7L, 8L, 4L, 1L), .Label = c("b", "c", "e", "g", "n", "o", "p", …), class = "factor"), V17 = structure(c(1L, 1L, 1L, 1L, 1L), .Label = "p", class = "factor"), V18 = structure(c(3L, 3L, 3L, 3L, 3L), .Label = c("n", "o", "w", "y"), class = "factor"), V19 = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("n", "o", "t"), class = "factor"), V20 = structure(c(1L, 1L, 1L, 5L, 3L), .Label = c("e", "f", "l", "n", "p"), class = "factor"), V21 = structure(c(…, 8L, 8L, 4L, 2L), .Label = c("b", "h", "k", "n", "o", "r", "u", "w", "y"), class = "factor"), V22 = structure(c(5L, 5L, 5L, 5L, 6L), .Label = c("a", "c", "n", "s", "v", "y"), class = "factor"), V23 = structure(c(3L, 3L, 5L, 1L, 2L), .Label = c("d", "g", …, "m", "p", "u", …), class = "factor")), .Names = c("realClass", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10", "V11", "V12", "V13", "V14", "V15", "V16", "V17", "V18", "V19", "V20", "V21", "V22", "V23"), row.names = c(4105L, 6207L, 6696L, 2736L, 3756L), class = "data.frame")


2012-08-05 Uros K

Since you don't show your data, it's not obvious why you get factors in place of numerics, but you can do 'numcol <- as.numeric(as.character(factcol))' – dickoa 2012-08-05 22:28:42

Yes, this works. Is there a more general solution, so that whatever the types of the attributes are, they stay the same after the procedure? – 2012-08-05 22:31:21

It's easier to find an answer with a reproducible example. In this case we don't know much about the data ('str(data)' or, better, 'dput(data)'). – dickoa 2012-08-05 22:36:50

You can always use this trick to get numeric columns back:

numcol <- as.numeric(as.character(factcol))
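
As a small illustration of why the as.character() step in this trick matters, here is a sketch with a made-up factor `f` (the name and values are assumptions chosen for the example, not the asker's data). Calling as.numeric() directly on a factor returns the internal level codes rather than the numbers the labels spell out.

```r
# Illustrative factor whose labels happen to look like numbers
f <- factor(c("13.21", "12.89", "13.44"))

# Direct conversion gives the integer level codes (levels are sorted as strings)
as.numeric(f)                # 2 1 3

# Going through character first recovers the values written in the labels
as.numeric(as.character(f))  # 13.21 12.89 13.44
```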

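But I suspect you have factor variables in your data.frame. apply first converts its input to a matrix, and a matrix can hold only one type, so if there is a factor in your data every column becomes character, and data.frame() then turns those strings into factors (its default before R 4.0.0). That is how the numeric variables end up as factors too.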

Here is an example, using a toy dataset:

set.seed(123) 
toydat <- data.frame(A = 1:10, B = rnorm(10), C = LETTERS[1:10]) 
str(toydat) 

## 'data.frame': 10 obs. of 3 variables: 
## $ A: int 1 2 3 4 5 6 7 8 9 10 
## $ B: num -0.5605 -0.2302 1.5587 0.0705 0.1293 ... 
## $ C: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 

set.seed(1) 
str(data.frame(apply(toydat[,1:2], 2, sample, replace = TRUE))) 

## 'data.frame': 10 obs. of 2 variables: 
## $ A: num 3 4 6 10 3 9 10 7 7 1 
## $ B: num 1.5587 -0.2302 0.4609 0.0705 -1.2651 ... 

# with the factor column C  
set.seed(2) 
str(data.frame(apply(toydat[,1:3], 2, sample, replace = TRUE))) 

## 'data.frame': 10 obs. of 3 variables: 
## $ A: Factor w/ 6 levels "10"," 2"," 5",..: 2 5 4 2 1 1 2 6 3 4 
## $ B: Factor w/ 8 levels " 0.129288","-0.230177",..: 8 7 6 2 1 5 3 7 1 4 
## $ C: Factor w/ 6 levels "B","D","E","G",..: 4 2 5 1 2 3 1 2 6 1
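
To make the intermediate step visible, the following sketch inspects the matrix that a data.frame is converted to before apply iterates over it. The small data frame `toydat_small` and the names `m_num` and `m_mix` are introduced here only for illustration.

```r
# A tiny data frame with an integer, a numeric and a factor column
toydat_small <- data.frame(A = 1:3,
                           B = c(0.5, 1.5, 2.5),
                           C = factor(c("x", "y", "z")))

m_num <- as.matrix(toydat_small[, 1:2])  # numeric columns only
m_mix <- as.matrix(toydat_small[, 1:3])  # includes the factor column

typeof(m_num)  # "double": integers and doubles share a numeric matrix
typeof(m_mix)  # "character": one factor forces every cell to a string
```

Once every cell is a string, the original column types are no longer available to whatever rebuilds the data frame afterwards.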

This is where the plyr package becomes useful, since you can control the output (using **ply). But in this case, the colwise function is enough:

require(plyr) 
set.seed(2) 
mysamplingfun <- colwise(function(x) sample(x, replace = TRUE)) 
str(mysamplingfun(toydat[,1:3])) 

## 'data.frame': 10 obs. of 3 variables: 
## $ A: int 2 8 6 2 10 10 2 9 5 6 
## $ B: num 1.715 1.559 -1.265 -0.23 0.129 ... 
## $ C: Factor w/ 10 levels "A","B","C","D",..: 7 4 9 2 4 5 2 4 10 2
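
For readers who prefer not to load plyr, the same idea can be sketched in base R by resampling column by column with lapply, which never goes through a matrix. The helper name `make_synthetic` and the objects `original`, `synthetic` and `newData` below are hypothetical, and the whole block is only a sketch of the procedure described in the question, not a tested replacement for it.

```r
# Hypothetical helper: resample every column independently, keeping its type
make_synthetic <- function(dat) {
  synth <- dat
  # Assigning into synth[] keeps the data.frame structure, so each column
  # stays integer, numeric or factor as in the input. Indexing with
  # sample.int() avoids sample()'s special case for a single number.
  synth[] <- lapply(dat, function(x) x[sample.int(length(x), replace = TRUE)])
  synth
}

set.seed(2)
toydat <- data.frame(A = 1:10, B = rnorm(10), C = factor(LETTERS[1:10]))

original  <- cbind(toydat, class = 1)                  # class 1: original data
synthetic <- cbind(make_synthetic(toydat), class = 2)  # class 2: synthetic data
newData   <- rbind(original, synthetic)                # bind both together

str(newData)
```

Because both pieces have identical column names and types, plain rbind is enough here and smartbind is not needed.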


2012-08-05 22:55:35 dickoa

Yes, colwise does what I need. Thanks, I really appreciate your effort to help me. – 2012-08-06 14:57:48
