Removing stop words from a file

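I want to remove stop words from the Data column of my file. I filtered out the rows where the end user is speaking, but usertext.apply(lambda x: [word for word in x if word not in stop_words]) does not filter out the stop words. What am I doing wrong?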

import pandas as pd 
from stop_words import get_stop_words 
df = pd.read_csv("F:/textclustering/data/cleandata.csv", encoding="iso-8859-1") 
usertext = df[df.Role.str.contains("End-user",na=False)][['Data','chatid']] 
stop_words = get_stop_words('dutch') 
clean = usertext.apply(lambda x: [word for word in x if word not in stop_words]) 
print(clean)


2017-03-08 DataNewB

First, can you 1) print 'stop_words', and 2) try 'clean = usertext.apply(lambda x: [])' to see whether it removes all the words? (Just as a test.)

Data [] chatid [] dtype: object ['aan', 'al', 'alles', 'als', 'altijd', 'andere', 'ben', 'bij', 'dar', 'dan', 'dat', 'de', 'der', 'deze', 'die', 'dit', 'doch', 'doen', 'door', 'een', 'eens', 'en', 'er', 'ge', 'geen', 'geweest', 'haar', 'had', 'heb', 'hebben', 'heeft', 'het', 'hier', 'hij', 'hoe', 'hun', 'iemand', 'iets', 'ik', 'in', 'is', 'ja', 'je', 'kan', 'kon', 'kunnen', 'maar', 'me', 'meer', 'men', 'met', 'mij', 'mijn', 'moet', 'na', 'naar', 'niet', 'niets', 'nog', 'nu', 'of', 'om', 'omdat', ...] That is the output. – DataNewB

clean = usertext.apply(lambda x: x if x not in stop_words else '')


2017-03-08 14:40:22 galaxyan

If you can, I suggest making 'stop_words' a set for efficiency.

I get NameError: ("name 'word' is not defined", 'occurred at index Data') when I run it. – DataNewB

@DataNewB Sorry, it should be x. – galaxyan

You can build a regular expression from your stop words and call the vectorized str.replace to remove them:

In [124]: 
stop_words = ['a','not','the'] 
stop_words_pat = '|'.join(['\\b' + stop + '\\b' for stop in stop_words]) 
stop_words_pat 

Out[124]: 
'\\ba\\b|\\bnot\\b|\\bthe\\b' 

In [125]:  
df = pd.DataFrame({'text':['a to the b', 'the knot ace a']}) 
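# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
df['text'].str.replace(stop_words_pat, '', regex=True)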

Out[125]: 
0   to b 
1  knot ace 
Name: text, dtype: object

Here we use a list comprehension to wrap each stop word in '\b', which is a word boundary, and then we or all of the words together using '|'.
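The same pattern can be sketched with the standard re module. This version assumes that stop words may contain regex metacharacters (hence re.escape) and that the double spaces left behind should be collapsed. The helper name strip_stop_words is hypothetical and not part of the answer.

```python
import re

stop_words = ['a', 'not', 'the']

# Escape each word and wrap the whole alternation in one pair of word boundaries.
stop_words_pat = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in stop_words) + r')\b'
)

def strip_stop_words(text):
    """Remove whole-word stop words, then collapse the leftover whitespace."""
    without_stops = stop_words_pat.sub('', text)
    return ' '.join(without_stops.split())

print(strip_stop_words('a to the b'))      # to b
print(strip_stop_words('the knot ace a'))  # knot ace
```

The word boundaries are what keep 'knot' and 'ace' intact even though they contain 'not' and 'a'. The string in stop_words_pat.pattern can also be passed to df['text'].str.replace(..., '', regex=True).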


2017-03-08 14:55:42 EdChum

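Two problems:

First, you have a module named stop_words, and later you create a variable named stop_words. This is bad form.

Second, you are passing .apply a lambda function that expects its x parameter to be a list, rather than a single value from the list.

That is, instead of doing df.apply(sqrt) you are doing df.apply(lambda x: [sqrt(val) for val in x]).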
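To see what the function actually receives, here is a small sketch with a made-up two-column frame. Only the column names come from the question; the rows and the helper record are invented for illustration. DataFrame.apply calls the function once per column and passes that whole column as a Series.

```python
import pandas as pd

# Invented sample rows; only the column names mirror the question.
usertext = pd.DataFrame({
    'Data': ['ik heb een vraag', 'dat is goed'],
    'chatid': [1, 2],
})

seen = []

def record(column):
    # Each call receives one entire column (a Series), not a single cell or word.
    seen.append((column.name, type(column).__name__, column.tolist()))
    return column

usertext.apply(record)

for name, kind, values in seen:
    print(name, kind, values)
```

Iterating over such a column yields complete cell values such as 'ik heb een vraag'. A whole sentence is never equal to a single stop word, so a test like "word not in stop_words" on those values filters nothing.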

You should either do the list processing yourself:

clean = [x for x in usertext if x not in stop_words]

Or you should use apply with a function that accepts just one word at a time:

clean = usertext.apply(lambda x: x if x not in stop_words else '')

As @Jean-François Fabre suggested in the comments, if your stop_words is a set rather than a list, you can speed things up:

from stop_words import get_stop_words 

nl_stop_words = set(get_stop_words('dutch')) # NOTE: set 

usertext = ... 
clean = usertext.apply(lambda word: word if word not in nl_stop_words else '')
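Each cell in the Data column holds a whole sentence, so a membership test on the cell itself will rarely match a stop word. A minimal sketch of per-word filtering follows, assuming whitespace tokenisation and case-insensitive matching are acceptable. The helper name remove_stop_words and the three-word stop list are placeholders; in the question the set would come from set(get_stop_words('dutch')).

```python
# Placeholder stop list standing in for set(get_stop_words('dutch')).
nl_stop_words = {'ik', 'een', 'de'}

def remove_stop_words(text, stop_words=nl_stop_words):
    """Split a sentence on whitespace, drop the stop words, and rejoin the rest."""
    if not isinstance(text, str):
        return text  # leave NaN and other non-string cells untouched
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(kept)

print(remove_stop_words('Ik heb een vraag over de factuur'))  # heb vraag over factuur
```

With pandas this would be applied to the single text column rather than to the whole frame, for example usertext['Data'].apply(remove_stop_words).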


2017-03-08 15:10:39
