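Python - splitting words in a txt file

I want to write a program that splits a txt file into individual words and returns a list of those words with no duplicates. I converted my PDF book to txt and ran my program on it, but it fails completely. I don't know what I'm doing wrong. Here is my code: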

def split(file):
    lines = open(file, 'r').readlines()
    words = []
    word = ''
    for line in lines:
        for letter in line:
            if letter not in [' ', '\n', '.', ',']:
                word += letter
            elif letter in [' ', '\n', '.', ',']:
                if word not in words:
                    words.append(word)
                    word = ''

    words.sort()
    return words


for word in split('AKiss.txt'):
    print(word, end=' ')

I've also attached AKiss.txt and the original PDF in case they are useful.

PDF - http://1drv.ms/b/s!AtZrd19H_8oyabhAx-NZvIQD_Ug

TXT - http://1drv.ms/t/s!AtZrd19H_8oyapvBvAo27rNJSwQ

Source

2017-10-17 F_Zimny

*no duplicates* ... why not use a set instead of a list? – Mangohero1
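
To illustrate the suggestion above, a minimal sketch: a set drops duplicates on insertion, so the `if word not in words` check is no longer needed. The sample list is made up for illustration and is not taken from AKiss.txt.

```python
# Hypothetical sample data, not taken from AKiss.txt.
tokens = ["kiss", "the", "kiss", "a", "the"]

unique_words = set(tokens)           # duplicates collapse automatically
sorted_words = sorted(unique_words)  # sets are unordered, so sort for display

print(sorted_words)  # ['a', 'kiss', 'the']
```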

Can you describe how it fails? – glibdud

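@glibdud In theory it returns the words, but some words still appear more than once, and what's really weird is that it returns words that don't exist in the file: "Do", "Don't", "Don't twist", "Don't twist Dorothy" –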
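
The odd entries reported in this comment are consistent with the question's code, where `word = ''` sits inside the `if word not in words:` block: when a word has already been seen, the buffer is not cleared and the next word gets appended to it. A small sketch that reproduces this on a made-up string (the name `buggy_split` and the sample text are illustrative only):

```python
def buggy_split(text):
    """Same loop as in the question, applied to a string instead of a file."""
    separators = [' ', '\n', '.', ',']
    words = []
    word = ''
    for letter in text:
        if letter not in separators:
            word += letter
        else:
            if word not in words:
                words.append(word)
                word = ''  # only reset for NEW words: this is the bug
    return words

print(buggy_split("do do not twist\n"))  # ['do', 'donot', 'twist']
```

Moving `word = ''` one level out, so that it runs after every separator, would avoid the glued words.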

You can try this:

import itertools 
words = list(set(itertools.chain.from_iterable([[''.join(c for c in b if c.isalpha()) for b in i.strip('\n').split()] for i in open('filename.txt') if i != "\n"])))

Source

2017-10-17 19:52:38 Ajax1234

It worked, but I get the same words again with a '?' or a period attached. Is there a way to strip not only newlines but also question marks, commas, etc.? –
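
One way to handle question marks, commas and similar punctuation in a single step is to extract the words rather than split on separators. A minimal sketch, assuming words consist of ASCII letters and apostrophes only; the function name `unique_words` and the sample sentence are illustrative:

```python
import re

def unique_words(text):
    """Return the sorted, de-duplicated words found in text."""
    # [A-Za-z']+ matches runs of ASCII letters and apostrophes, so
    # '?', ',', '.', and whitespace never end up inside a word.
    return sorted(set(re.findall(r"[A-Za-z']+", text)))

print(unique_words("Don't twist, Dorothy? Don't twist."))
# ["Don't", 'Dorothy', 'twist']
```

Under this assumption accented letters and digits would be dropped, so the character class may need widening for other texts.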

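@F_Zimny Please try again with the code above – Ajax1234

It works, thank you very much. Sitting in a lecture, I found 100 words I didn't know (English is not my native language) :D Thanks again. –

You may want to take a different approach: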

def split_file(file):
    all_words = set()
    with open(file) as f:
        for ln in f:
            words = ln.strip().split()

            # split each token further on '.'
            dot_split = []
            for w in words:
                dot_split.extend(w.split('.'))
            # ...and then on ','
            comma_split = []
            for w in dot_split:
                comma_split.extend(w.split(','))

            all_words = all_words.union(set(comma_split))

    all_words.discard('')  # splitting "kiss." leaves an empty string behind
    print(sorted(all_words))

split_file('test_file.txt')

Or, more simply, use a regular expression:

import re

def split_file2(file):
    all_words2 = set()
    with open(file) as f:
        for ln in f:
            words2 = re.split(r'[ \t\n\.,]', ln.strip())  # note the escaped '.'!
            all_words2 = all_words2.union(set(words2))
    all_words2.discard('')  # consecutive separators produce empty strings
    print(sorted(all_words2))

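As a side note, I wouldn't use split as a function name, because it hides functionality you may want to use from the standard library / string library.

Source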

2017-10-17 19:50:58 sophros

I did it this way, but I get an empty list in the output. –

The line 'all_words.union(set(words.split('.').split(',')))' should be 'all_words = all_words.union(set(words.split('.').split(',')))' for the union to be used as intended – Arunmozhi

@sophros This code has multiple errors; I tried to improve it and gave up – Arunmozhi

Using the strip() and split() methods should help you here.

Source
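
A minimal sketch of what that could look like, assuming punctuation only needs to be removed from the ends of each token; `words_from_lines` and the sample lines are made up for illustration:

```python
import string

def words_from_lines(lines):
    """Collect unique words from an iterable of text lines."""
    found = set()
    for line in lines:
        # split() with no argument splits on any run of whitespace and
        # never yields empty strings.
        for token in line.split():
            # strip() with an argument removes those characters from
            # both ends only, so the apostrophe in "don't" survives.
            word = token.strip(string.punctuation)
            if word:
                found.add(word)
    return sorted(found)

print(words_from_lines(["A kiss, a kiss.\n", "Why not?\n"]))
# ['A', 'Why', 'a', 'kiss', 'not']
```

An open file object can be passed directly, since iterating over it yields lines.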

2017-10-17 19:55:18
