我有一个NLP任务,我正在使用scikit-learn。阅读tutorials我发现必须对文本进行矢量化,以及如何使用此矢量化模型来提供分类算法。假设我有一些文字,我想如下向量化它:scipy中的这个稀疏矩阵是什么意思?
from sklearn.feature_extraction.text import CountVectorizer
corpus =['''Computer science is the scientific and
practical approach to computation and its applications.'''
#this is another opinion
'''It is the systematic study of the feasibility, structure,
expression, and mechanization of the methodical
procedures that underlie the acquisition,
representation, processing, storage, communication of,
and access to information, whether such information is encoded
as bits in a computer memory or transcribed in genes and
protein structures in a biological cell.'''
#anotherone
'''A computer scientist specializes in the theory of
computation and the design of computational systems''']
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(corpus)
print X
的问题是,我不明白输出的意思,我没有看到有文字和返回的矩阵任何关系通过向量化:
(0, 12) 3
(0, 33) 1
(0, 20) 3
(0, 45) 7
(0, 34) 1
(0, 2) 6
(0, 28) 1
(0, 4) 1
(0, 47) 2
(0, 10) 2
(0, 22) 1
(0, 3) 1
(0, 21) 1
(0, 42) 1
(0, 40) 1
(0, 26) 5
(0, 16) 1
(0, 38) 1
(0, 15) 1
(0, 23) 1
(0, 25) 1
(0, 29) 1
(0, 44) 1
(0, 49) 1
(0, 1) 1
: :
(0, 30) 1
(0, 37) 1
(0, 9) 1
(0, 0) 1
(0, 19) 2
(0, 50) 1
(0, 41) 1
(0, 14) 1
(0, 5) 1
(0, 7) 1
(0, 18) 4
(0, 24) 1
(0, 27) 1
(0, 48) 1
(0, 17) 1
(0, 31) 1
(0, 39) 1
(0, 6) 1
(0, 8) 1
(0, 35) 1
(0, 36) 1
(0, 46) 1
(0, 13) 1
(0, 11) 1
(0, 43) 1
而且我不明白什么是与输出发生的事情时,我使用toarray()
方法:
print X.toarray()
究竟手段输出什么关系与胼?:
[[1 1 6 1 1 1 1 1 1 1 2 1 3 1 1 1 1 1 4 2 3 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 7 1 2 1 1 1]]
您可能想了解Manning&Schuetze书中的向量空间模型:http://nlp.stanford.edu/IR-book/pdf/06vect.pdf – mbatchkarov 2014-12-02 13:51:05