2016-09-20 71 views

LDA: different sample sizes across classes

I am trying to fit an LDA model on a dataset whose classes have different sample sizes.

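TL;DR

lda.predict() does not work properly if I train the classifier with classes that do not all have the same number of samples.

Long explanation

I have 7 classes with 3 samples each, and one class with only 2 samples: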

tortle -14,6379 -17,3731 
tortle -14,9339 -17,4379 
bull  -11,7777 -13,1383 
bull  -11,6207 -13,4596 
bull  -11,4616 -12,9811 
hawk  -9,01229 -12,777 
hawk  -8,88177 -12,4383 
hawk  -8,93559 -13,0143 
pikachu -6,50024 -7,92564 
pikachu -6,00418 -8,59305 
pikachu -6,0769 -6,00419 
pizza  2,02872 3,07972 
pizza  2,084  2,73762 
pizza  2,20269 2,90577 
sangoku -3,14428 -3,14415 
sangoku -4,02675 -3,55358 
sangoku -3,26119 -2,95265 
charizard -0,159746 0,434694 
charizard 0,0191964 0,514596 
charizard 0,0422884 0,512207 
tomatoe -1,15295 -2,09673 
tomatoe -0,562748 -1,80215 
tomatoe -0,716941 -1,83503 
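As a quick sanity check on the class sizes, the labels can be counted with NumPy. This is only a sketch: the `labels` list simply repeats the class names from the table above, and `class_sizes` is a name chosen here for illustration.

```python
import numpy as np

# Class labels in the same order as the table above
labels = (['tortle'] * 2 + ['bull'] * 3 + ['hawk'] * 3 + ['pikachu'] * 3
          + ['pizza'] * 3 + ['sangoku'] * 3 + ['charizard'] * 3
          + ['tomatoe'] * 3)

# Count how many samples each class has
names, counts = np.unique(labels, return_counts=True)
class_sizes = dict(zip(names.tolist(), counts.tolist()))

for name, count in class_sizes.items():
    print(name, count)
```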

Here is a working example:

#!/usr/bin/python 
# coding: utf-8 

from matplotlib import pyplot as plt 
import numpy as np 
from sklearn import preprocessing 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA 

analytes = ['tortle', 'tortle', 'bull', 'bull', 'bull', 'hawk', 'hawk', 'hawk', 'pikachu', 'pikachu', 'pikachu', 'pizza', 'pizza', 'pizza', 'sangoku', 'sangoku', 'sangoku', 'charizard', 'charizard', 'charizard', 'tomatoe', 'tomatoe', 'tomatoe'] 

# Transform the names of the samples into integers 
lb = preprocessing.LabelEncoder().fit(analytes) 
analytes = lb.transform(analytes) 


# Create an array w/ the measurements 
dimensions = [[-14.6379, -14.9339, -11.7777, -11.6207, -11.4616, -9.01229, -8.88177, -8.93559, -6.50024, -6.00418, -6.0769, 2.02872, 2.084, 2.20269, -3.14428, -4.02675, -3.26119, -0.159746, 0.0191964, 0.0422884, -1.15295, -0.562748, -0.716941], [-17.3731, -17.4379, -13.1383, -13.4596, -12.9811, -12.777, -12.4383, -13.0143, -7.92564, -8.59305, -6.00419, 3.07972, 2.73762, 2.90577, -3.14415, -3.55358, -2.95265, 0.434694, 0.514596, 0.512207, -2.09673, -1.80215, -1.83503]] 

# Transform the array of the results 
all_samples = np.array(dimensions).T 

# Normalize the data 
preprocessing.scale(all_samples, axis=0, with_mean=True, with_std=True, 
        copy=False) 

# Train the LDA classifier. Use the eigen solver 
lda = LDA(solver='eigen', n_components=2) 
transformed = lda.fit_transform(all_samples, analytes) 


# Fit the LDA classifier on the new subspace 
lda.fit(transformed, analytes) 

fig = plt.figure() 

plt.plot(transformed[:, 0], transformed[:, 1], 'o') 

# Get the limits of the graph. Used for adapted color areas 
x_min, x_max = fig.axes[0].get_xlim() 
y_min, y_max = fig.axes[0].get_ylim() 

# Step size of the mesh over [x_min, x_max] x [y_min, y_max]. 
# Decrease it to increase the resolution of the colored areas. 
# h = 0.01 
h = 0.001 

# Create a grid for the upcoming plots 
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) 

# Predict the class for each unit of the grid 
Z = lda.predict(np.c_[xx.ravel(), yy.ravel()]) 

Z = Z.reshape(xx.shape) 

# Plot the areas 
plt.imshow(Z, extent=(x_min, x_max, y_min, y_max), aspect='auto', origin='lower', alpha=0.6) 

plt.show() 

This is the output:

enter image description here

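As you can see, the two points on the right are assigned to the purple class, which they should not be. They should belong to the yellow class, which becomes visible if I increase the limits of the graph: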

enter image description here

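Basically, my problem is that lda.predict() does not work properly if I train the classifier with classes that do not all have the same number of samples.

Is there a workaround?

Answer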


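It took me a while to figure this out. The preprocessing step is responsible for the misclassification. Changing

preprocessing.scale(all_samples, axis=0, with_mean=True, with_std=True, 
       copy=False) 

to

preprocessing.scale(all_samples, axis=0, with_mean=True, with_std=True) 

solved my problem. However, my data is now not scaled in the same way.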
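A possible explanation for that last remark, offered as a sketch rather than a confirmed diagnosis: `preprocessing.scale` returns the scaled array, and with the default `copy=True` the input itself is left untouched. Since the script in the question discards the return value, dropping `copy=False` likely means the data is no longer scaled at all. The toy array `X` below is made up for illustration and is not the data from the question.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data, not the measurements from the question
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
original = X.copy()

# Default copy=True: the input is left untouched, the result is returned
X_scaled = preprocessing.scale(X, axis=0, with_mean=True, with_std=True)

print(np.array_equal(X, original))  # whether the input array is unchanged
print(X_scaled.mean(axis=0))        # per-column mean of the returned array
print(X_scaled.std(axis=0))         # per-column std of the returned array
```

Assigning the result, for example `all_samples = preprocessing.scale(all_samples)`, keeps the scaling explicit and does not rely on in-place modification, which the scikit-learn documentation describes as not guaranteed even with `copy=False`.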
