2016-06-21 29 views

SFrame KMeans - convert to int, float, dict

I am preparing data to run KMeans in GraphLab, and I ran into the following error:

tmp = data.select_columns(['a.item_id']) 
tmp['sku'] = tmp['a.item_id'].apply(lambda x: x.split(',')) 
tmp = tmp.unpack('sku') 

kmeans_model = gl.kmeans.create(tmp, num_clusters=K) 

Feature 'sku.0' excluded because of its type. Kmeans features must be int, float, dict, or array.array type. 
Feature 'sku.1' excluded because of its type. Kmeans features must be int, float, dict, or array.array type. 

Here are the current data types of each column:

a.item_id str 
sku.0 str 
sku.1 str 

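If I could convert the data type from str to int, I think it should work. However, working with SFrames is trickier than with the standard Python libraries. Any help getting there is appreciated.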

Answer


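The kmeans model does accept features in dictionary form, just not in list form. This is slightly different from what you have now, because a dictionary loses the order of the SKUs, but in terms of model quality I suspect it actually makes more sense. The key function is count_words in the text analytics toolkit: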

https://dato.com/products/create/docs/generated/graphlab.text_analytics.count_words.html

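import graphlab as gl

# Toy data: each row holds a comma-separated list of SKUs.
sf = gl.SFrame({'item_id': ['abc,xyz,cat', 'rst', 'abc,dog']})
# Turn each string into a dict of {sku: count}, splitting only on commas.
sf['sku_count'] = gl.text_analytics.count_words(sf['item_id'], delimiters=[','])

# Cluster on the dict-typed column only.
model = gl.kmeans.create(sf, num_clusters=2, features=['sku_count'])
print(model.cluster_id)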

+--------+------------+----------------+
| row_id | cluster_id |    distance    |
+--------+------------+----------------+
|   0    |     1      | 0.866025388241 |
|   1    |     0      |      0.0       |
|   2    |     1      | 0.866025388241 |
+--------+------------+----------------+
[3 rows x 3 columns]
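
To see why a dictionary column works as a k-means feature, here is a plain-Python sketch (no GraphLab needed) of roughly what the comma-delimited count_words call produces, plus the Euclidean distance from rows 0 and 2 to their shared centroid. The helpers sku_counts, centroid, and euclidean are hypothetical names made up for this illustration, and treating a missing key as 0 is my assumption about how the dict features are compared. Under that assumption it reproduces the 0.866 distance shown in the output.

```python
import math
from collections import Counter


def sku_counts(item_id, delimiter=","):
    """Hypothetical helper: split a delimited string into a {token: count} dict."""
    tokens = (tok.strip() for tok in item_id.split(delimiter))
    return dict(Counter(tok for tok in tokens if tok))


def centroid(rows):
    """Mean of several sparse dict vectors; a missing key counts as 0 (assumption)."""
    keys = set().union(*rows)
    return {k: sum(r.get(k, 0) for r in rows) / float(len(rows)) for k in keys}


def euclidean(a, b):
    """Euclidean distance between two sparse dict vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))


item_ids = ['abc,xyz,cat', 'rst', 'abc,dog']
features = [sku_counts(s) for s in item_ids]

# Rows 0 and 2 ended up in the same cluster in the output above,
# so their centroid is the mean of those two dict vectors.
center = centroid([features[0], features[2]])
distances = [euclidean(features[0], center), euclidean(features[2], center)]

print(features)
print([round(d, 6) for d in distances])
```

Each SKU becomes its own dimension, so two rows are close when they share SKUs, regardless of the position each SKU had in the original string. That is the order loss mentioned in the answer.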