Combining text stemming and punctuation removal in NLTK and scikit-learn

I am using a combination of NLTK and scikit-learn's CountVectorizer for stemming words and tokenization. Below is an example of the plain usage of CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vocab = ['The swimmer likes swimming so he swims.']
vec = CountVectorizer().fit(vocab)
sentence1 = vec.transform(['The swimmer likes swimming.'])
sentence2 = vec.transform(['The swimmer swims.'])
print('Vocabulary: %s' %vec.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())
which will print:
Vocabulary: ['he', 'likes', 'so', 'swimmer', 'swimming', 'swims', 'the']
Sentence 1: [[0 1 0 1 1 0 1]]
Sentence 2: [[0 0 0 1 0 1 1]]
Now, let's say that I want to remove stop words and stem the words. One option would be to do it like this:
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

#######
# based on http://www.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems
########
########
vect = CountVectorizer(tokenizer=tokenize, stop_words='english')
vect.fit(vocab)
sentence1 = vect.transform(['The swimmer likes swimming.'])
sentence2 = vect.transform(['The swimmer swims.'])
print('Vocabulary: %s' %vect.get_feature_names())
print('Sentence 1: %s' %sentence1.toarray())
print('Sentence 2: %s' %sentence2.toarray())
which prints:
Vocabulary: ['.', 'like', 'swim', 'swimmer']
Sentence 1: [[1 1 1 1]]
Sentence 2: [[1 0 1 1]]
But how can I best get rid of the punctuation in this second version?
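One possibility (not part of the original question; a standard-library-only sketch, with a hypothetical helper name) is to strip punctuation from the raw text before it ever reaches the tokenizer, so no punctuation tokens are produced in the first place:

```python
import string

def strip_punct(text):
    # Delete every character listed in string.punctuation;
    # this also removes multi-character runs like '...'.
    return text.translate(str.maketrans('', '', string.punctuation))

print(strip_punct('The swimmer likes swimming...!'))
# The cleaned string could then be fed to the tokenize() function above.
```

Because the cleanup happens on the string itself, the existing tokenizer and stemmer need no changes.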
Simple and effective. Thanks! – Sebastian 2014-10-01 03:57:23
Note that the second one won't catch '...' or other multi-character punctuation. – 2014-10-01 19:04:43
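A regex-based tokenizer sidesteps that problem: keeping only word characters means punctuation runs of any length are simply skipped. A minimal sketch (the function name is hypothetical, not from the post):

```python
import re

def word_tokens(text):
    # Extract maximal runs of word characters; '...' and '!!' never
    # become tokens because they match no \w+ run.
    return re.findall(r'\w+', text.lower())

print(word_tokens('The swimmer swims... really!!'))
```

This is close in spirit to CountVectorizer's default behavior, which also tokenizes with a word-character pattern rather than splitting on whitespace.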
@FredFoo and others: how would you rate gensim versus scikit-learn for extracting keywords rather than plain documents? Could you answer me? http://stackoverflow.com/questions/40436110/rake-with-gensim – 2016-11-05 08:53:00