
Splitting words using the nltk module in Python

I am trying to find a way to split words in Python using the nltk module. Given the raw data I have, I am not sure how to reach my goal. As you can see, many of the words are stuck together (i.e. 'to' and 'produce' are stuck in one string, 'toproduce'). This is an artifact of scraping the data from a PDF file, and I would like to find a way, using the nltk module in Python, to split the words that are stuck together (i.e. split 'toproduce' into two words: 'to' and 'produce'; split 'standardoperatingprocedures' into three words: 'standard', 'operating', 'procedures').

I appreciate any help!

Answer


I believe you want word segmentation here, and I am not aware of any word segmentation feature in NLTK that will handle English sentences without spaces. You could use pyenchant instead. I offer the following code only by way of example. (It works for a modest number of relatively short strings, such as those in your example list, but would be very inefficient for longer strings or many more of them.) It needs modification, and it will not successfully segment every string in all cases.

import enchant # pip install pyenchant 
eng_dict = enchant.Dict("en_US") 

def segment_str(chars, exclude=None):
    """
    Segment a string of chars using the pyenchant vocabulary.
    Keeps longest possible words that account for all characters,
    and returns a list of segmented words.

    :param chars: (str) The character string to segment.
    :param exclude: (set) A set of strings to exclude from consideration.
        (These have been found previously to lead to dead ends.)
        If an excluded word occurs later in the string, this
        function will fail.
    """
    words = []

    if not chars.isalpha():  # don't check punctuation etc.; needs more work
        return [chars]

    if not exclude:
        exclude = set()

    working_chars = chars
    while working_chars:
        # iterate through segments of the chars starting with the longest segment possible
        for i in range(len(working_chars), 1, -1):
            segment = working_chars[:i]
            if eng_dict.check(segment) and segment not in exclude:
                words.append(segment)
                working_chars = working_chars[i:]
                break
        else:  # no matching segments were found
            if words:
                exclude.add(words[-1])
                return segment_str(chars, exclude=exclude)
            # let the user know a word was missing from the dictionary,
            # but keep the word
            print('"{chars}" not in dictionary (so just keeping as one segment)!'
                  .format(chars=chars))
            return [chars]
    # return a list of words based on the segmentation
    return words

As you can see, this approach (presumably) mis-segments only one of your strings:

>>> t = ['usingvariousmolecularbiology', 'techniques', 'toproduce', 'genotypes', 'following', 'standardoperatingprocedures', '.', 'Operateandmaintainautomatedequipment', '.', 'Updatesampletrackingsystemsandprocess', 'documentation', 'toallowaccurate', 'monitoring', 'andrapid', 'progression', 'ofcasework'] 
>>> [segment_str(chars) for chars in t] 
"genotypes" not in dictionary (so just keeping as one segment)!
[['using', 'various', 'molecular', 'biology'], ['techniques'], ['to', 'produce'], ['genotypes'], ['following'], ['standard', 'operating', 'procedures'], ['.'], ['Operate', 'and', 'maintain', 'automated', 'equipment'], ['.'], ['Updates', 'ample', 'tracking', 'systems', 'and', 'process'], ['documentation'], ['to', 'allow', 'accurate'], ['monitoring'], ['and', 'rapid'], ['progression'], ['of', 'casework']] 

You can then flatten this list of lists using chain:

>>> from itertools import chain 
>>> list(chain.from_iterable(segment_str(chars) for chars in t)) 
"genotypes" not in dictionary (so just keeping as one segment)! 
['using', 'various', 'molecular', 'biology', 'techniques', 'to', 'produce', 'genotypes', 'following', 'standard', 'operating', 'procedures', '.', 'Operate', 'and', 'maintain', 'automated', 'equipment', '.', 'Updates', 'ample', 'tracking', 'systems', 'and', 'process', 'documentation', 'to', 'allow', 'accurate', 'monitoring', 'and', 'rapid', 'progression', 'of', 'casework'] 

Awesome, thanks! This is what I was looking for. I thought this could be done with an nltk corpus, but I'm happy to work with pyenchant! – Kookaburra
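For reference, the same longest-match idea works with any vocabulary lookup, not just pyenchant. Below is a minimal sketch that swaps the `eng_dict.check` call for membership in a plain Python set; the set here is a hypothetical toy vocabulary standing in for a real word list such as `set(nltk.corpus.words.words())`. Note this sketch has no backtracking via `exclude`, so a greedy mismatch (like the 'Updates' + 'ample' split above) would simply be kept.

```python
# Hypothetical toy vocabulary standing in for a real dictionary,
# e.g. set(nltk.corpus.words.words()) after nltk.download('words').
VOCAB = {"to", "produce", "standard", "operating", "procedures"}

def segment_with_set(chars, vocab=VOCAB):
    """Greedily split chars into the longest vocabulary words, left to right."""
    words = []
    working = chars
    while working:
        # try the longest remaining prefix first, down to single characters
        for i in range(len(working), 0, -1):
            if working[:i] in vocab:
                words.append(working[:i])
                working = working[i:]
                break
        else:  # no prefix matched; keep the remainder as one chunk
            words.append(working)
            break
    return words

print(segment_with_set("toproduce"))                    # ['to', 'produce']
print(segment_with_set("standardoperatingprocedures"))  # ['standard', 'operating', 'procedures']
```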


Hey, I know this answer is a bit old, but one thing to be wary of is a `set()` default argument, which leads to some strange behavior if you try: `In [6]: segment_str("tookapill")` `Out [6]: ['to', 'okapi', 'll']` `In [7]: segment_str("tookapillinibiza")` `"tookapillinibiza" not in dictionary (so just keeping as one segment)!` `Out [7]: ['tookapillinibiza']` `In [8]: segment_str("tookapill")` `"tookapill" not in dictionary (so just keeping as one segment)!` `Out [8]: ['tookapill']` I added a default of `None` with a check at use time: http://effbot.org/zone/default-values.htm –
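The pitfall this commenter describes is Python's mutable default argument: a default like `exclude=set()` is evaluated once at function definition time, so every call shares the same set, and words excluded in one call leak into later calls. A minimal sketch (independent of pyenchant; the function names are illustrative only) showing the behavior and the `None` fix:

```python
def bad(word, seen=set()):
    """One set, created at definition time, shared by every call."""
    seen.add(word)
    return sorted(seen)

def good(word, seen=None):
    """Fresh set per call unless the caller passes one in."""
    if seen is None:
        seen = set()
    seen.add(word)
    return sorted(seen)

print(bad("a"))   # ['a']
print(bad("b"))   # ['a', 'b']  <- 'a' leaked in from the earlier call
print(good("a"))  # ['a']
print(good("b"))  # ['b']
```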