PySpark - OneHotEncoding

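This may be a naive question, but I am just getting started with PySpark and Spark. Please help me understand one-hot encoding in PySpark. I am trying to apply OneHotEncoding to one of my columns. After one-hot encoding, the DataFrame schema has a new vector column. To apply machine learning algorithms, though, I expected individual columns to be added to the existing DataFrame, one per category, rather than a single vector-type column. How can I verify the OneHotEncoding?

My code: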

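from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Map each business_type string to a numeric index
stringIndexer = StringIndexer(inputCol="business_type", outputCol="business_type_Index")
model = stringIndexer.fit(df)
indexed = model.transform(df)
# One-hot encode the index into a vector column (dropLast=False keeps every category)
encoder = OneHotEncoder(dropLast=False, inputCol="business_type_Index", outputCol="business_type_Vec")
# On Spark 3.0+, OneHotEncoder is an estimator: use encoder.fit(indexed).transform(indexed)
encoded = encoder.transform(indexed)
encoded.select("business_type_Vec").show()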

This shows:

+-----------------+ 
|business_type_Vec| 
+-----------------+ 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
| (2,[0],[1.0])| 
+-----------------+ 
only showing top 20 rows
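
For reference, Spark prints a sparse vector as (size, [indices], [values]). So (2,[0],[1.0]) is a vector of length 2 (two categories, since dropLast=False) with the value 1.0 at position 0 and 0.0 everywhere else. A minimal plain-Python sketch of that expansion follows; the helper name sparse_to_dense is made up for illustration and is not part of PySpark.

```python
def sparse_to_dense(size, indices, values):
    """Expand Spark's (size, [indices], [values]) notation into a plain list."""
    dense = [0.0] * size              # every position defaults to 0.0
    for i, v in zip(indices, values):
        dense[i] = v                  # fill in only the stored entries
    return dense

# The row shown above: length 2, a single 1.0 at index 0
print(sparse_to_dense(2, [0], [1.0]))  # [1.0, 0.0]
```

Read this way, every one of the 20 rows displayed falls into the category with index 0, which is one way to check the encoding against the business_type_Index column.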

The newly added column is of vector type. How can I convert it into individual columns, one per category?

Source

2016-09-29 Jack Daniel

This is the expected behavior. You do not need to convert to individual columns, because Spark ML can work with feature vectors directly. – mtoto