2017-06-03 76 views
6

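Loading the scikit-learn breast cancer dataset into a pandas DataFrame

I want to load a dataset from sklearn.datasets, but one column is missing compared with the keys (target_names, target & DESCR). I have tried various ways to include the last column, but I get errors.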

import numpy as np 
import pandas as pd 
from sklearn.datasets import load_breast_cancer 

cancer = load_breast_cancer() 
print(cancer.keys())

The keys are ['target_names', 'data', 'target', 'DESCR', 'feature_names']

data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())

With the code above, I only get 30 columns, when I need 31. What is the best way to load scikit-learn datasets into a pandas DataFrame?

+0

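Can you explain why there should be 31 columns? If you use `cancer.data.shape` or check the [dataset description](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html), there appear to be only 30 columns in the dataset. Which column are you missing? –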

+1

I am missing the target/target_names column listed in dataset.keys(), because it has not been loaded into the DataFrame. – pythonhunter
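The point raised in the comments can be checked directly by comparing the shapes of the feature matrix and the target vector. This is only a minimal sketch, assuming scikit-learn is installed; it shows that the class codes live in a separate array rather than in `cancer.data`.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# The feature matrix and the target vector are stored separately in the Bunch.
print(cancer.data.shape)     # (n_samples, 30): the 30 feature columns
print(cancer.target.shape)   # (n_samples,): one class code per row
print(cancer.target_names)   # the labels that the 0/1 codes refer to
```

So the "31st column" has to be attached by hand, which is what the answers below do.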

Answers

2

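If you want a target column, you need to add it yourself, because it is not part of cancer.data. cancer.target holds the column of 0 and 1 values, and cancer.target_names holds the labels. I hope the following is what you want: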

import numpy as np 
import pandas as pd 
from sklearn.datasets import load_breast_cancer 

cancer = load_breast_cancer() 
print(cancer.keys())

# Pass feature_names directly; wrapping it in a list creates MultiIndex columns.
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data.describe())

data = data.assign(target=pd.Series(cancer.target))
print(data.describe())

# In case you want labels instead of numbers.
data.replace(to_replace={'target': {0: cancer.target_names[0]}}, inplace=True)
data.replace(to_replace={'target': {1: cancer.target_names[1]}}, inplace=True)
print(data.shape)  # data.describe() won't show the "target" column here because I converted its values to strings.
+0

Yes, I just figured out that data['Target'] = pd.Series(data=cancer.target, index=data.index) also works. Thanks. – pythonhunter
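If both the numeric codes and the readable labels are useful, one possible variation is to keep `target` as integers and add a second column built with `pd.Categorical.from_codes`. This is a sketch rather than the answerer's method; the column name `label` is a hypothetical choice made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Keep the numeric class codes as their own column.
data["target"] = cancer.target

# "label" is a hypothetical column name: each 0/1 code is mapped to the
# entry at that position in cancer.target_names.
data["label"] = pd.Categorical.from_codes(cancer.target, categories=cancer.target_names)

print(data[["target", "label"]].head())
print(data.shape)
```

With this layout, `data.describe()` still summarises the numeric `target` column, while `label` can be used for grouping or plotting.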

1

This also works, again using pd.Series.

import numpy as np 
import pandas as pd 
from sklearn.datasets import load_breast_cancer 

cancer = load_breast_cancer() 
print(cancer.keys())

# Pass feature_names directly; wrapping it in a list creates MultiIndex columns.
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)
data['Target'] = pd.Series(data=cancer.target, index=data.index)

print(data.keys())
print(data.shape)
3

Another option, this time a one-liner, that creates the DataFrame with both the features and the target variable is:

import pandas as pd 
import numpy as np 
from sklearn.datasets import load_breast_cancer 

cancer = load_breast_cancer() 
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']], 
        columns= np.append(cancer['feature_names'], ['target']))
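For readers on a newer scikit-learn, the loader can also build the combined frame itself. The sketch below assumes scikit-learn 0.23 or later, where `load_breast_cancer` accepts `as_frame=True`; that option did not exist when this question was asked.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks the loader to return pandas objects instead of arrays.
cancer = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a final "target" column.
df = cancer.frame

print(df.shape)
print(df.columns[-1])
```

One difference from the `np.c_` one-liner above: stacking the features and the target into a single NumPy array converts the target to float, whereas here the target column keeps its integer dtype.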