2017-08-21 26 views

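A more efficient way to get various token count statistics from an array and a list

I am classifying spam from a list of email texts (stored in CSV format), but before I do that I want to get some simple count statistics from the output. As a first step I used CountVectorizer from sklearn, implemented with the following code: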

import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split 
from sklearn.feature_extraction.text import CountVectorizer 

#import data from csv 

spam_data = pd.read_csv('spam.csv')
# encode the label column as 1 (spam) / 0 (not spam)
spam_data['target'] = np.where(spam_data['target'] == 'spam', 1, 0)

#split data 

X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['target'], random_state=0) 

#convert 'features' to numeric and then to matrix or list 
cv = CountVectorizer() 
x_traincv = cv.fit_transform(X_train) 
a = x_traincv.toarray() 
a_list = cv.inverse_transform(a) 

The output is stored either as a matrix (named 'a') or as a list of arrays (named 'a_list'), which looks like this:

[array(['do', 'I', 'off', 'text', 'where', 'you'], 
     dtype='<U32'), 
array(['ages', 'will', 'did', 'driving', 'have', 'hello', 'hi', 'hol', 'in', 'its', 'just', 'mate', 'message', 'nice', 'off', 'roads', 'say', 'sent', 'so', 'started', 'stay'], dtype='<U32'),  
     ... 
array(['biz', 'for', 'free', 'is', '1991', 'network', 'operator', 'service', 'the', 'visit'], dtype='<U32')] 

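However, I find it somewhat difficult to get simple count statistics from these outputs, such as the longest/shortest token, the average token length, and so on. How can I get these simple count statistics from the matrix or list output that I generated?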


Is this what you are looking for? https://stackoverflow.com/a/16078639/2491761


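No. CountVectorizer().vocabulary_ automatically compiles (maybe I shouldn't use that term) the frequency of each term. I want to get the longest and shortest tokens. I currently use 'max_len = len(max(cv.vocabulary_, key=len))' and '[word for word in cv.vocabulary_ if len(word) == max_len]'. Is there a better solution?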

Answers


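You can load the tokens, token counts, and token lengths into a new Pandas DataFrame and then run custom queries against it.

Here is a simple example on a toy dataset.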

import pandas as pd 
import numpy as np 
from sklearn.feature_extraction.text import CountVectorizer 

texts = ["dog cat fish","dog cat cat","fish bird walrus monkey","bird lizard"] 

cv = CountVectorizer() 
cv_fit = cv.fit_transform(texts) 
# https://stackoverflow.com/a/16078639/2491761 
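# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
tokens_and_counts = list(zip(cv.get_feature_names_out(), np.asarray(cv_fit.sum(axis=0)).ravel()))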

df = pd.DataFrame(tokens_and_counts, columns=['token', 'count']) 

df['length'] = df.token.str.len() # https://stackoverflow.com/a/29869577/2491761 

# all the tokens with length equal to min token length: 
df.loc[df['length'] == df['length'].min(), 'token'] 

# all the tokens with length equal to max token length: 
df.loc[df['length'] == df['length'].max(), 'token'] 

# all tokens with length less than mean token length: 
df.loc[df['length'] < df['length'].mean(), 'token'] 

# all tokens with length more than 1 standard deviation above the mean:
df.loc[df['length'] > df['length'].mean() + df['length'].std(), 'token'] 

This can easily be extended if you want to run queries based on the counts.


@Chris T. Is this still not what you were looking for? Please advise.