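Reading a file without cutting words off

I have a very large file that I want to read and run some operations on. In my code I read 1024 bytes at a time and loop until everything has been read. But sometimes this cuts one of my words in half.

Even when I specify a read size, I want to make sure only whole words are read. All of my words are separated by spaces.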

with open('test.txt', mode='r', encoding='utf-8') as f:
    chunk_size = 1024
    f_chunk = f.read(chunk_size)
    while len(f_chunk) > 0:
        for word in f_chunk.split():
            # do something
            print(word)
        f_chunk = f.read(chunk_size)

2016-12-05 choman

I don't know whether there is a built-in way, but you could try something like this:

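chunk_size = 1024
data = ''
with open('test.txt', mode='r', encoding='utf-8') as f:
    while True:
        data += f.read(chunk_size)
        if not data:
            break
        last_sp = data.rfind(' ')
        if last_sp == -1:  # no space in the buffer: treat all of it as one block
            last_sp = len(data)
        block = data[:last_sp]     # everything up to the last space
        data = data[last_sp + 1:]  # leftover partial word, kept for the next read

        for word in block.split():
            print(word)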

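Basically, you carry the end of the last chunk over to the next one. This will not work if a word is longer than your chunk size, and it may not work if your delimiter is something other than a space (for example a tab or a newline).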
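The two limitations just mentioned can be worked around by splitting on any whitespace and holding back the last piece only when a chunk ends in the middle of a word. What follows is a minimal sketch under those assumptions, not part of the original answer; the name `iter_words` and the in-memory `io.StringIO` sample are made up for illustration.

```python
import io


def iter_words(f, chunk_size=1024):
    """Yield whitespace-separated words from f without splitting any word."""
    tail = ''  # possibly incomplete word carried over from the previous chunk
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        data = tail + chunk
        words = data.split()
        if not data[-1].isspace() and words:
            # The chunk ended mid-word: hold the last piece back.
            tail = words.pop()
        else:
            tail = ''
        yield from words
    if tail:
        yield tail  # last word of the file


# Usage with an in-memory file and a deliberately tiny chunk size.
sample = io.StringIO("alpha beta\tgamma\nsupercalifragilistic end")
result = list(iter_words(sample, chunk_size=4))
print(result)
```

A word longer than `chunk_size` simply accumulates in `tail` until its closing whitespace (or the end of the file) is reached, so memory use grows with the longest word rather than with the file.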

2016-12-05 07:37:01 Holt

As an alternative, you could create a word generator as follows:

def read_word(f):
    word = []
    c = '.'  # any non-empty value, so the loop starts

    while c:
        c = f.read(1)  # returns '' at end of file

        if c.isalnum():
            word.append(c)
        elif word:
            # A non-alphanumeric character (or end of file) ends the word.
            yield ''.join(word)
            word = []

with open('input.txt') as f_input:
    for word in read_word(f_input):
        print(word)

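This yields whole words, splitting wherever a character is not alphanumeric according to isalnum(). So read_word() also strips out all whitespace and punctuation.

For example, if input.txt contains: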

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Hoc loco tenere se Triarius non potuit.

The output will be:

Lorem 
ipsum 
dolor 
sit 
amet 
consectetur 
adipiscing 
elit 
Hoc 
loco 
tenere 
se 
Triarius 
non 
potuit
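Calling `read(1)` once per character may be slow on a very large file. As a rough sketch only, the same isalnum() rule can be applied to larger chunks while the partial word is kept across reads; `read_words_chunked` and the `io.StringIO` sample are hypothetical names used for illustration.

```python
import io


def read_words_chunked(f, chunk_size=1024):
    """Yield runs of alphanumeric characters, reading chunk_size characters per call."""
    word = []  # characters of the word being built; survives across chunks
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        for c in chunk:
            if c.isalnum():
                word.append(c)
            elif word:
                yield ''.join(word)
                word = []
    if word:
        yield ''.join(word)  # flush a word that ends exactly at end of file


# Usage with an in-memory file and a small chunk size that cuts through words.
sample = io.StringIO("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
words = list(read_words_chunked(sample, chunk_size=5))
print(words)
```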

2016-12-05 08:45:16
