2016-09-17

pyspark: convert RDD[DenseVector] to DataFrame

I have the following RDD:

rdd.take(5) gives me:

[DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]), 
DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]), 
DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0]), 
DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]), 
DenseVector([9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699])] 

I would like to turn it into a DataFrame that looks like:

+------------------------------------------------------------------+
| features                                                         |
+------------------------------------------------------------------+
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]  |
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]  |
| [5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0]              |
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]  |
| [9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]  |
+------------------------------------------------------------------+

Is this possible? I tried df_new = sqlContext.createDataFrame(rdd, ['features']), but it didn't work. Does anyone have any suggestions? Thanks!

Answer

Map to tuples first:

rdd.map(lambda x: (x,)).toDF(["features"]) 

Keep in mind that as of Spark 2.0 there are two different Vector implementations; the ml algorithms require pyspark.ml.Vector.


Thanks! map(lambda x: (x,)) looks cryptic, could you elaborate? Thanks! – Edamame


`(x,)` is a single-element `tuple`. The mapping is required because only [some objects can be converted to `Row`](http://stackoverflow.com/a/32742294/1560062) – zero323
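To unpack that comment in plain Python: the trailing comma, not the parentheses, is what makes `(x,)` a one-element tuple. A small illustration (variable names are just for the example):

```python
x = [9.2463, 1.0, 0.392]

# Parentheses alone do nothing: this is still the list x
not_a_tuple = (x)

# The trailing comma creates a tuple with x as its only element
one_tuple = (x,)

print(type(not_a_tuple))  # <class 'list'>
print(type(one_tuple))    # <class 'tuple'>
print(len(one_tuple))     # 1
```

In the answer's `rdd.map(lambda x: (x,))`, this wrapping turns each bare DenseVector into a one-field record, which Spark can then convert into a Row of a one-column DataFrame.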