I am new to Spark 2. I am working through the Spark TF-IDF example to understand how Spark's HashingTF works:
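from pyspark.ml.feature import HashingTF, Tokenizer

# Assumes `spark` is an existing SparkSession (e.g. the one the pyspark shell provides)
sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])
# Split the sentence into lowercase words
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
# Map the words to a 32-dimensional raw feature vector
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)
for each in featurizedData.collect():
    print(each)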
This prints:
Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))
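I expected rawFeatures to contain term frequencies like {0:0.2, 1:0.2, 2:0.2, 3:0.2, 4:0.2}, because term frequency is defined as:
tf(w) = (Number of times the word appears in a document)/(Total number of words in the document)
In our case that is tf(w) = 1/5 = 0.2 for every word, because each word appears exactly once in the document.
Even if we assume that rawFeatures is a dictionary with the word index as the key and the number of occurrences of that word in the document as the value, why is key 1 equal to 3.0? No word appears 3 times in the document. This confuses me. What am I missing?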
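To make the expectation concrete, here is a minimal plain-Python sketch (no Spark involved) of the normalized term frequencies I had in mind, next to the values HashingTF actually returned. The variable names `expected_tf` and `raw_features` are only illustrative, and `raw_features` is copied by hand from the output above.

```python
from collections import Counter

# Tokens as they appear in the `words` column of the output above.
words = ["hi", "i", "heard", "about", "spark"]

# tf(w) = (occurrences of w in the document) / (total words in the document)
counts = Counter(words)
total = float(len(words))
expected_tf = {w: counts[w] / total for w in words}

# Values copied from the rawFeatures SparseVector above (index -> value).
raw_features = {1: 3.0, 13: 1.0, 24: 1.0}
raw_total = sum(raw_features.values())

print(expected_tf)  # every word maps to 0.2
print(raw_total)    # sum of the values HashingTF returned
```

The values in rawFeatures add up to 5.0, the number of words in the sentence, so they look like unnormalized counts spread over only three indices rather than five per-word frequencies.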