Scikit-Learn TfidfVectorizer

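I am working on a text classification problem, parsing news stories from RSS feeds, and I suspect that a lot of HTML elements and gibberish are being counted as tokens. I know Beautiful Soup provides methods for cleaning HTML, but I wanted to try passing in a dictionary to get more control over what counts as a token.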

The concept seems simple enough, but I am getting results I don't understand.

from sklearn.feature_extraction.text import TfidfVectorizer 

eng_dictionary = [] 
with open("C:\\Data\\words_alpha.txt") as f: 
    eng_dictionary = f.read().splitlines() 

short_dic = [] 
short_dic.append(("short")) 
short_dic.append(("story")) 

stories = [] 
stories.append("This is a short story about the color red red red red blue blue blue i am in a car") 
stories.append("This is a novel about the color blue red red red red i am in a boot") 
stories.append("I like the color green, but prefer blue blue blue blue blue red red red red i am on a bike") 

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True) 
pos_vector = vec.fit_transform(stories).toarray() 

print(vec.get_feature_names()) 

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True, vocabulary=short_dic) 
pos_vector = vec.fit_transform(stories).toarray() 

print(vec.get_feature_names()) 

vec = TfidfVectorizer(decode_error=u'ignore', stop_words='english', analyzer='word', lowercase=True, vocabulary=eng_dictionary) 
pos_vector = vec.fit_transform(stories).toarray() 

print(vec.get_feature_names())

The output of the program is as follows:

['bike', 'blue', 'boot', 'car', 'color', 'green', 'like', 'novel', 'prefer', 'red', 'short', 'story'] 
['short', 'story'] 
ptic', 'skeptical', 'skeptically', 'skepticalness', 'skepticism', 'skepticize', 'skepticized', 'skepticizing'...

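The output of the third print goes on and on, so I intentionally cut it short. What is odd is that it starts mid-word, exactly as I have pasted it above. The results of the first two print statements make sense to me: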

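Omitting a vocabulary means the features are built directly from the corpus.
Supplying a vocabulary means the features are built from tokens that are in both the corpus and the vocabulary.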

However, the features in the third printout are not part of my corpus, so why are they showing up?

2017-08-16 Nibroc A Rehpotsirhc

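The `vocabulary` parameter creates a TF-IDF matrix with one column for every word in the vocabulary, whether or not that word occurs in the corpus. The values are then filled in wherever a word is actually present in a document; everywhere else they stay at zero.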

For example, suppose "color" is in your "words_alpha.txt" file:

            skeptical  skeptically  ...  color
stories[2]  0          0            ...  TF-IDF value

That is why they show up.
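The behavior described above can be reproduced on a tiny scale. This is only a sketch: the three-word `vocab` list is a hypothetical stand-in for the contents of words_alpha.txt, and the column order is read from the fitted `vocabulary_` attribute so that it does not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

stories = [
    "This is a short story about the color red red red red blue blue blue i am in a car",
    "This is a novel about the color blue red red red red i am in a boot",
]

# Hypothetical stand-in for the full word list: only "color" occurs in the stories.
vocab = ["skeptical", "skeptically", "color"]

vec = TfidfVectorizer(stop_words="english", vocabulary=vocab)
matrix = vec.fit_transform(stories).toarray()

# vocabulary_ maps each term to its column index; sort terms by that index.
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(columns)  # every vocabulary word is a feature, present in the corpus or not
print(matrix)   # columns for absent words contain only zeros
```

Every word in `vocab` becomes a feature, so the feature list mirrors the supplied vocabulary rather than the corpus. Only the "color" column holds nonzero weights.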

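The fact that it starts mid-word probably has to do with your file. You are using splitlines(), so my guess is that your file has a bunch of words, hits some limit, and then wraps to the next line in the middle of the word "skeptic", which is where your vocabulary (eng_dictionary) starts.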
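One way to check this guess is to inspect the first few entries of the list directly instead of reading the full printout. The sketch below uses an in-memory `io.StringIO` object with made-up contents as a stand-in for words_alpha.txt; with the real file, the same two print calls would show whether the first entry really is a word fragment.

```python
import io

# Hypothetical stand-in for words_alpha.txt (assumed to hold one word per line).
fake_file = io.StringIO("skeptic\nskeptical\nskeptically\n")

# Same loading step as in the question: one list entry per line of the file.
eng_dictionary = fake_file.read().splitlines()

print(len(eng_dictionary))       # how many entries were loaded
print(repr(eng_dictionary[:3]))  # the first entries, exactly as stored
```

If the first entries are complete words, the mid-word start is more likely an artifact of how the long printout was displayed or copied than of the list itself.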

2017-08-16 02:05:22 AMC

How can I extract only the features that come from the vocabulary? –

Do you mean the overlap between eng_dictionary and the words in the third story? Or just the words in the third story? – AMC

I am looking for the overlap between eng_dictionary and the words in each story –
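One possible way to get that overlap, offered as a sketch and not as the answerer's method: fit the vectorizer with the dictionary as its vocabulary, then keep, for each story, only the terms whose column is nonzero. The short `eng_dictionary` list here is a hypothetical stand-in for the real word list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

stories = [
    "This is a short story about the color red red red red blue blue blue i am in a car",
    "This is a novel about the color blue red red red red i am in a boot",
    "I like the color green, but prefer blue blue blue blue blue red red red red i am on a bike",
]

# Hypothetical stand-in for the word list loaded from words_alpha.txt.
eng_dictionary = ["blue", "car", "color", "red", "skeptical", "story"]

vec = TfidfVectorizer(stop_words="english", vocabulary=eng_dictionary)
matrix = vec.fit_transform(stories).toarray()

# Terms in column order, taken from the fitted term -> index mapping.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# For each story, keep the dictionary words that received a nonzero weight.
overlap_per_story = [
    [term for term, weight in zip(terms, row) if weight > 0]
    for row in matrix
]

for i, overlap in enumerate(overlap_per_story):
    print(i, overlap)
```

Note that stop words are removed before tokens are matched against the vocabulary, so a dictionary word that is also an English stop word will not appear in the overlap while `stop_words='english'` is set.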
