Pyspark mllib LDA error: Object cannot be cast to java.util.List

I am currently trying to run LDA on a Spark cluster. I have an RDD that looks like this:
>>> myRdd.take(2)
[(218603, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]), (95680, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])]
but calling
model = LDA.train(myRdd, k=5, seed=42)
gives the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5874.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5874.0): java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.List
I don't know how to interpret this error beyond the obvious, so any advice would be appreciated; the documentation for LDA in mllib is fairly sparse.
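One tentative reading of the exception, assuming PySpark's usual Pickle bridge (Pyrolite) on the JVM side: Python tuples are unpickled as `Object[]` while Python lists become `java.util.ArrayList`, so an RDD of `(id, vector)` tuples would arrive as `Object[]` exactly where the Scala code tries to cast to `java.util.List`. A minimal pure-Python sketch of the per-row reshaping that would sidestep this (no Spark required; the rows are copied from the `take(2)` output above):

```python
# Hypothetical rows mirroring myRdd.take(2) above (not real cluster data).
rows = [
    (218603, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]),
    (95680, [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]),
]

# Per-row effect of `myRdd.map(list)`: each (id, vector) tuple becomes a
# [id, vector] list, which (my assumption) Pyrolite unpickles on the JVM
# side as java.util.List rather than Object[].
as_lists = [list(row) for row in rows]
```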
I obtain the RDD from the following process, with a DataFrame document_instances that has columns "doc_label" and "terms":
hashingTF = HashingTF(inputCol="terms", outputCol="term_frequencies", numFeatures=10)
tf_matrix = hashingTF.transform(document_instances)
myRdd = tf_matrix.select("doc_label", "term_frequencies").rdd
Using this directly gives the same error. Now, this uses HashingTF from pyspark.ml.feature, so I suspected there might be a conflict caused by the difference between mllib vectors and ml vectors, but mapping directly with the Vectors.fromML() function gives the same error, as do the following variants:
myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \
    (old_row.term, old_row.term_frequencies.toArray().tolist()))

myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \
    (old_row.term, old_row.term_frequencies.toArray()))

myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \
    (old_row.term, Vectors.fromML(old_row.term_frequencies)))

myRdd = tf_matrix.select(...).rdd.map(lambda old_row: \
    (old_row.term, old_row.term_frequencies))
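For reference, this is the per-document shape I currently expect the working transform to need: the ml-to-mllib vector conversion combined with an explicit list (rather than tuple) per document. It is sketched here with a plain-Python stand-in for Row so it runs without a cluster; `to_document` and the `Row` stand-in are hypothetical names, and the list-vs-tuple point is an assumption drawn from the ClassCastException message:

```python
from collections import namedtuple

# Stand-in for a Spark Row; in the real pipeline these come from
# tf_matrix.select("doc_label", "term_frequencies").rdd.
Row = namedtuple("Row", ["doc_label", "term_frequencies"])

def to_document(old_row, from_ml=lambda v: v):
    # Emit a list, not a tuple: lists unpickle to java.util.List on the
    # JVM side, which appears to be what the Scala LDA wrapper casts to
    # (assumption based on the ClassCastException above).
    # In the real job, pass from_ml=pyspark.mllib.linalg.Vectors.fromML
    # to convert the ml.linalg vector produced by HashingTF.
    return [old_row.doc_label, from_ml(old_row.term_frequencies)]

doc = to_document(Row(218603, [0.0, 0.0, 1.0]))
```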