2016-11-08

Pyspark mllib LDA error: object cannot be cast to java.util.List

I am currently trying to run LDA on a Spark cluster. I have an RDD like this:

>>> myRdd.take(2) 
[(218603, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]), (95680, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])] 

but calling

model = LDA.train(myRdd, k=5, seed=42) 

gives the following error on the workers:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5874.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5874.0): java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.List

I am not sure how to interpret this error at face value, so any advice would be appreciated; the documentation for LDA in mllib is fairly sparse.

I obtained the RDD from the following procedure, starting with a DataFrame, document_instances, which has columns "doc_label" and "terms":

# HashingTF here is the DataFrame-based API from pyspark.ml.feature 
from pyspark.ml.feature import HashingTF 

hashingTF = HashingTF(inputCol="terms", outputCol="term_frequencies", numFeatures=10) 
tf_matrix = hashingTF.transform(document_instances) 
myRdd = tf_matrix.select("doc_label", "term_frequencies").rdd 

Using this DataFrame directly gives the same error. Now, this uses HashingTF from pyspark.ml.feature, so I suspected there might be a conflict caused by the difference between mllib vectors and ml vectors, but mapping directly with the Vectors.fromML() conversion function gives the same error, as do all of the following:

myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \ 
            (old_row.term, old_row.term_frequencies.toArray().tolist())) 
myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \ 
            (old_row.term, old_row.term_frequencies.toArray())) 
myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \ 
            (old_row.term, Vectors.fromML(old_row.term_frequencies))) 
myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \ 
            (old_row.term, old_row.term_frequencies)) 

Answer


So, it turns out that the Spark documentation is a little misleading when it says the input should be an "RDD of documents, which are tuples of document IDs and term (word) count vectors." Perhaps I misunderstood, but when the tuple is changed to a list, this error seems to go away (although it appears to have been replaced by a different error).

Changing

myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \ 
            (old_row.term, old_row.term_frequencies)) 

to

myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \ 
            [old_row.term, Vectors.fromML(old_row.term_frequencies)]) 

appears to alleviate the problem, after comparing with the example code in their documentation:

http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA
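The reshaping itself can be illustrated without a cluster. Below is a minimal pure-Python sketch (the `rows` data and the `to_document` helper are illustrative, not from the original code); a plausible explanation for the behavior is that Spark's Python-to-JVM pickle bridge (Pyrolite) unpickles a Python list as a java.util.List, whereas a tuple arrives on the JVM side as a plain Object[] array, which matches the ClassCastException above.

```python
# Sample rows shaped like the myRdd.take(2) output above:
# (doc_id, term_frequency_vector) tuples.
rows = [
    (218603, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]),
    (95680,  [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]),
]

def to_document(row):
    """Reshape an (id, vector) tuple into the two-element list form,
    mirroring the lambda in the fixed map() call above."""
    doc_id, term_frequencies = row
    return [doc_id, term_frequencies]

corpus = [to_document(r) for r in rows]
```

Each element of `corpus` is now a list rather than a tuple, which is the only change the fix above makes apart from the Vectors.fromML() conversion.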