Improving the performance of text cleaning on a DataFrame

I have a DataFrame:

id text 
1  This is a good sentence 
2  This is a sentence with a number: 2015 
3  This is a third sentence

I have a text cleaning function:

import re
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

def clean(text):
    lettersOnly = re.sub('[^a-zA-Z]', ' ', text)
    tokens = word_tokenize(lettersOnly.lower())
    stops = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stops]
    tokensPOS = pos_tag(tokens)
    tokensLemmatized = []
    for w in tokensPOS:
        tokensLemmatized.append(WordNetLemmatizer().lemmatize(w[0], get_wordnet_pos(w[1])))
    clean = " ".join(tokensLemmatized)
    return clean

get_wordnet_pos() looks like this:

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

I am applying clean() to a pandas column and creating a new column with the results:

df['cleanText'] = df['text'].apply(clean)

The resulting DataFrame:

id cleanText 
1  good sentence 
2  sentence number 
3  third sentence

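The run time adds up quickly as the number of rows grows. For example, using %%timeit, applying it to five rows takes 17 ms per loop, 300 rows takes 800 ms per loop, and 500 rows takes 1.26 s per loop.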

I changed it by instantiating stops and WordNetLemmatizer() outside the function, since they only need to be created once.

stops = set(stopwords.words('english'))
lem = WordNetLemmatizer()
def clean(text):
    lettersOnly = re.sub('[^a-zA-Z]', ' ', text)
    tokens = word_tokenize(lettersOnly.lower())
    tokens = [w for w in tokens if w not in stops]
    tokensPOS = pos_tag(tokens)
    tokensLemmatized = []
    for w in tokensPOS:
        tokensLemmatized.append(lem.lemmatize(w[0], get_wordnet_pos(w[1])))
    clean = " ".join(tokensLemmatized)
    return clean

Running %prun -l 10 on the apply line produces this table:

  672542 function calls (672538 primitive calls) in 2.798 seconds 

    Ordered by: internal time 
    List reduced from 211 to 10 due to restriction <10> 

    ncalls tottime percall cumtime percall filename:lineno(function) 
    4097 0.727 0.000 0.942 0.000 perceptron.py:48(predict) 
    4500 0.584 0.000 0.584 0.000 {built-in method nt.stat} 
    3500 0.243 0.000 0.243 0.000 {built-in method nt._isdir} 
    14971 0.157 0.000 0.178 0.000 {method 'sub' of '_sre.SRE_Pattern' objects} 
    57358 0.129 0.000 0.155 0.000 perceptron.py:250(add) 
    4105 0.117 0.000 0.201 0.000 {built-in method builtins.max} 
    184365 0.084 0.000 0.084 0.000 perceptron.py:58(<lambda>) 
    4097 0.057 0.000 0.213 0.000 perceptron.py:245(_get_features) 
     500 0.038 0.000 1.220 0.002 perceptron.py:143(tag) 
    2000 0.034 0.000 0.068 0.000 ntpath.py:471(normpath)

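It looks like the perceptron tagger is, predictably, taking a lot of resources, but I'm not sure how to streamline it. Also, I'm not sure where nt.stat or nt._isdir is being called.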

How should I change the function or the apply method to improve performance? Is this function a candidate for Cython or Numba?

Source

2017-08-28 Cameron Taylor

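Can't say without your data and expected output.

Added sample input data and the result of the cleaning function. I am getting the correct output; the question is more about how to get the proper output faster.

Interesting. Does the order of the words matter? I'm guessing yes?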

The first obvious point of improvement I see here is that the whole get_wordnet_pos function should be reduced to a dictionary lookup:

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

Instead, initialize a defaultdict from the collections package:

import collections
from nltk.corpus import wordnet

# Any key that is not listed falls back to wordnet.NOUN.
get_wordnet_pos = collections.defaultdict(lambda: wordnet.NOUN)
get_wordnet_pos.update({'J': wordnet.ADJ,
                        'V': wordnet.VERB,
                        'N': wordnet.NOUN,
                        'R': wordnet.ADV})

You would then access the lookup like this, indexing with the first letter of the tag:

get_wordnet_pos[w[1][0]]
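A minimal sketch of that lookup, assuming the single-letter strings 'a', 'v', 'n' and 'r' as stand-ins for wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV so that it runs without NLTK installed. The tagged pairs are hand-written for illustration rather than produced by pos_tag. Because the key is only the first character of the tag, multi-letter tags such as 'VBG' should resolve the same way the startswith chain did, and unknown tags fall back to the noun default:

```python
import collections

# Stand-ins for wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV (assumed values).
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'

# Unknown keys fall back to NOUN, like the final else branch of the function.
get_wordnet_pos = collections.defaultdict(lambda: NOUN)
get_wordnet_pos.update({'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV})

# (token, Treebank tag) pairs shaped like the output of pos_tag; hand-written here.
tagged = [('good', 'JJ'), ('running', 'VBG'), ('sentences', 'NNS'),
          ('quickly', 'RB'), ('the', 'DT')]

# tag[0] is the first letter of the tag, so 'VBG' is looked up as 'V'.
lookups = [(word, get_wordnet_pos[tag[0]]) for word, tag in tagged]
print(lookups)
```

One thing to keep in mind with this design: a defaultdict stores every missing key it is asked for, so the table grows by one entry per unseen first letter. With Treebank tags that is a small, bounded set.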

Next, consider precompiling your regex pattern if it is going to be used in multiple places. The speedup you get isn't that much, but it all counts.

pattern = re.compile('[^a-zA-Z]')

Inside the function, you would call:

pattern.sub(' ', text)
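Put together, the precompiled pattern might be used as in the sketch below. The helper name letters_only is hypothetical and only there for illustration; the sample text is one of the rows from the question.

```python
import re

# Compiled once at module level instead of on every call.
pattern = re.compile('[^a-zA-Z]')

def letters_only(text):
    # Same substitution as re.sub('[^a-zA-Z]', ' ', text), reusing the compiled pattern.
    return pattern.sub(' ', text)

sample = 'This is a sentence with a number: 2015'
cleaned = letters_only(sample)
print(repr(cleaned))
```

Each non-letter becomes a single space, so the result has the same length as the input and the colon and digits turn into trailing whitespace.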

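OTOH, if you know where your text is coming from and have some idea of what you may and may not see, you can precompile a list of characters and use str.translate instead, which is much, much faster than the clunky regex-based substitution: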

tab = str.maketrans(dict.fromkeys("!@#$%^&*()_+-={}[]|\'\":;,<.>/?\\~`", '')) # substitution table, built once (keep this outside the function)

text = 'hello., hi! lol, what\'s up' 
new_text = text.translate(tab) # this would run inside your function 

print(new_text) 

'hello hi lol whats up'
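As a variation, the table could be built from the standard library's string.punctuation and string.digits instead of a hand-typed character list. This is a sketch under the assumption that those two constants cover everything you want removed; the helper name strip_chars is hypothetical.

```python
import string

# Three-argument maketrans: every character in the third argument is deleted.
# Built once, outside the function.
tab = str.maketrans('', '', string.punctuation + string.digits)

def strip_chars(text):
    return text.translate(tab)

print(strip_chars("hello., hi! lol, what's up"))
print(strip_chars('This is a sentence with a number: 2015').split())
```

Note the behavioral difference from the regex version: str.translate here deletes characters rather than replacing them with spaces, so "a,b" becomes "ab" instead of "a b". Whether that matters depends on how your text is punctuated.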

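Furthermore, I'd say word_tokenize is overkill: all you are doing is getting rid of special characters anyway, so you lose all the benefits of word_tokenize, which only really makes a difference when punctuation and the like are present. You can fall back to text.split().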

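Finally, skip the clean = " ".join(tokensLemmatized) step. Just return the list, and then join in a final step with df['cleanText'].apply(" ".join) (df.applymap(" ".join) also works when every column holds token lists).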
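The last two suggestions can be sketched in plain Python, with a list of strings standing in for the DataFrame column so the example runs without pandas or NLTK. The function name clean_tokens is hypothetical, the tiny stops set is a stand-in for set(stopwords.words('english')), and the tagging and lemmatizing steps are left out:

```python
import re

pattern = re.compile('[^a-zA-Z]')
# Tiny stand-in for set(stopwords.words('english')); not the real NLTK list.
stops = {'this', 'is', 'a', 'with'}

def clean_tokens(text):
    # split() replaces word_tokenize; the token list is returned without joining.
    tokens = pattern.sub(' ', text).lower().split()
    return [w for w in tokens if w not in stops]

texts = ['This is a good sentence',
         'This is a sentence with a number: 2015',
         'This is a third sentence']

token_lists = [clean_tokens(t) for t in texts]
# The joining step, done once at the end.
joined = [' '.join(tokens) for tokens in token_lists]
print(joined)
```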

I'll leave the benchmarking to you.

Source

2017-08-28 14:07:42

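Thank you so much, very helpful. For the defaultdict, it throws an error saying "TypeError: 'collections.defaultdict' object is not callable". Other than that, your points about the substitution and the splitting make a lot of sense.

@CameronTaylor There was a small mistake. You index a dictionary with get_wordnet_pos[...], not with (...). Will edit my answer.

Another quirk may be that in the original function the tag is matched with startswith. Is there a way to implement that with the defaultdict? Currently I believe it treats most things as nouns, since many tags are more than one letter.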
