2017-10-12 119 views

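How do I extract the vocabulary from a Pipeline?

I can extract the vocabulary from a CountVectorizerModel in the following way: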

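from pyspark.ml.feature import CountVectorizer, StopWordsRemover

# df must already contain a tokenized "words" column
fl = StopWordsRemover(inputCol="words", outputCol="filtered")
df = fl.transform(df)
cv = CountVectorizer(inputCol="filtered", outputCol="rawFeatures")
model = cv.fit(df)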

print(model.vocabulary) 

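The code above prints the vocabulary as a list, where each term's position in the list is its index (id), because the vocabulary is read directly from the CountVectorizerModel.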
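To make the index/term relationship concrete, here is a minimal sketch in plain Python. The `vocabulary` list is a made-up stand-in for `model.vocabulary` (not real output), and `term_to_index` and `term_for` are hypothetical names used only for illustration.

```python
# Hypothetical stand-in for model.vocabulary (illustrative values only).
vocabulary = ["spark", "pipeline", "model"]

# A term's position in the list is the index used in the count vectors.
term_to_index = {term: idx for idx, term in enumerate(vocabulary)}


def term_for(index):
    """Return the term stored at the given feature index."""
    return vocabulary[index]
```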

Now I have built a pipeline from the code above, as follows:

from pyspark.ml import Pipeline

rm_stop_words = StopWordsRemover(inputCol="words", outputCol="filtered")
count_freq = CountVectorizer(inputCol=rm_stop_words.getOutputCol(), outputCol="rawFeatures")

pipeline = Pipeline(stages=[rm_stop_words, count_freq]) 
model = pipeline.fit(dfm) 
df = model.transform(dfm) 

print(model.vocabulary)  # This won't work: model is a PipelineModel, not a CountVectorizerModel

It raises the following error:

print(len(model.vocabulary)) 

AttributeError: 'PipelineModel' object has no attribute 'vocabulary'

So how do I extract an attribute of a model that sits inside a pipeline?

Answer


The same way as for any other stage attribute. Extract the stages:

stages = model.stages 

Find the one(s) you are interested in:

from pyspark.ml.feature import CountVectorizerModel 

vectorizers = [s for s in stages if isinstance(s, CountVectorizerModel)] 

and get the required field:

[v.vocabulary for v in vectorizers]
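The three steps above can be wrapped in a small helper. This is only a sketch: `find_stages` and `first_stage_attr` are hypothetical names, not part of PySpark, and the code assumes nothing beyond the object having a `stages` list, as a fitted `PipelineModel` does.

```python
def find_stages(pipeline_model, stage_type):
    """Return every stage of pipeline_model that is an instance of stage_type."""
    return [s for s in pipeline_model.stages if isinstance(s, stage_type)]


def first_stage_attr(pipeline_model, stage_type, attr):
    """Return attr from the first matching stage; raise if no stage matches."""
    matches = find_stages(pipeline_model, stage_type)
    if not matches:
        raise ValueError(f"no stage of type {stage_type.__name__} in pipeline")
    return getattr(matches[0], attr)
```

With the pipeline from the question, `first_stage_attr(model, CountVectorizerModel, "vocabulary")` should give the vocabulary list. Since the vectorizer is the last stage there, `model.stages[-1].vocabulary` is a shorter alternative when the stage order is known.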