Limited-size text chunks split at newlines

I have a string in Python that holds the contents of a large text file (over 1 MiB). I need to split it into chunks.

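Constraints:

chunks may only be split at newline characters, and
len(chunk) must be as large as possible but smaller than LIMIT (i.e. 100 KiB).

Lines longer than LIMIT can be ignored.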

Any ideas on how to implement this nicely in Python?

Thank you in advance.

Source

2017-03-31 Michał Šrajer

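Do you want to split it into new files? – RomanPerekhrest

No time to write it out, but the best solution is probably to jump ahead to LIMIT, work backwards until you find a newline, add a chunk, then jump ahead by LIMIT from there and repeat. – Linuxios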
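The idea in the comment above can be sketched roughly as follows. This is only one possible reading of it, not the commenter's own code: the name `iter_chunks` and its parameters are made up for illustration, LIMIT is treated as the maximum allowed length, and lines longer than the limit are skipped as the question permits. It tracks an index into the string instead of repeatedly slicing off the remainder.

```python
def iter_chunks(text, limit):
    """Yield newline-terminated chunks of at most `limit` characters.

    Hypothetical helper: lines longer than `limit` are skipped.
    """
    start = 0
    n = len(text)
    while start < n:
        if n - start <= limit:
            # Whatever is left fits into a single chunk.
            yield text[start:]
            return
        # Jump ahead by `limit`, then search backwards for a newline.
        cut = text.rfind("\n", start, start + limit)
        if cut == -1:
            # The current line is longer than `limit`: skip past it.
            next_newline = text.find("\n", start)
            if next_newline == -1:
                return  # no further newline, nothing more to chunk
            start = next_newline + 1
        else:
            # Include the newline in the chunk and continue after it.
            yield text[start:cut + 1]
            start = cut + 1


sample = "aaa\nbb\n" + "x" * 20 + "\ncc\n"
result = list(iter_chunks(sample, 8))
```

With the sample above, the 20-character line of `x` does not fit into 8 characters and is left out, while the short lines are grouped as tightly as the limit allows.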

Here is my not-so-Pythonic solution:

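def line_chunks(lines, chunk_limit):
    """Group lines into chunks whose total length stays below chunk_limit."""
    chunks = []
    chunk = []
    chunk_len = 0
    for line in lines:
        if len(line) + chunk_len < chunk_limit:
            chunk.append(line)
            chunk_len += len(line)
        else:
            # Note: a line that is itself too long ends up in a chunk of its own
            if chunk:  # avoid an empty chunk when the very first line does not fit
                chunks.append(chunk)
            chunk = [line]
            chunk_len = len(line)
    if chunk:
        chunks.append(chunk)
    return chunks

# `data` is the input string; keepends=True keeps each line ending so it counts
# towards the chunk length (splitlines also splits on "\r\n" and similar boundaries)
chunks = line_chunks(data.splitlines(keepends=True), 150)
print('---new-chunk---\n'.join(''.join(chunk) for chunk in chunks))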

Source

2017-03-31 21:46:29

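Following Linuxios's suggestion, you can use rfind to find the last newline within the limit and split at that point. If no newline is found, the chunk is too big and can be discarded.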

LIMIT = 100 * 1024  # assumed value: maximum chunk length, following the 100 KiB in the question

chunks = []

not_chunked_text = input_text  # input_text is the string to be split

while not_chunked_text:
    if len(not_chunked_text) <= LIMIT:
        # The remainder fits into one chunk
        chunks.append(not_chunked_text)
        break
    split_index = not_chunked_text.rfind("\n", 0, LIMIT)
    if split_index == -1:
        # The chunk is too big, so everything until the next newline is deleted
        try:
            not_chunked_text = not_chunked_text.split("\n", 1)[1]
        except IndexError:
            # No "\n" in not_chunked_text, i.e. the end of the input text was reached
            break
    else:
        chunks.append(not_chunked_text[:split_index+1])
        not_chunked_text = not_chunked_text[split_index+1:]

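rfind("\n", 0, LIMIT) returns the highest index at which a newline is found within the bounds of your limit.
not_chunked_text[:split_index+1] is needed so that the newline is included in the chunk.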
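To make the two points above concrete, here is a small illustration on a toy string. The variable names and the limit of 10 are chosen only for this example.

```python
text = "one\ntwo\nthree\n"   # newlines sit at indices 3, 7 and 13
limit = 10

# Search only text[0:limit]; the newline at index 13 is out of range.
split_index = text.rfind("\n", 0, limit)   # 7

first_chunk = text[:split_index + 1]       # "one\ntwo\n", newline included
rest = text[split_index + 1:]              # "three\n"

# When there is no newline inside the window, rfind returns -1.
missing = "abcdefghijkl".rfind("\n", 0, limit)   # -1
```

Slicing up to `split_index + 1` keeps the newline at the end of the chunk, so joining all chunks back together reproduces the original text (apart from any dropped over-long lines).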

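I interpreted LIMIT as the maximum allowed chunk length. If chunks of length LIMIT should not be allowed, you have to add -1 after LIMIT in this code, that is, use LIMIT - 1 both in the length check and in the call to rfind.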
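The difference between the two interpretations only shows up when a chunk would be exactly LIMIT characters long. A minimal sketch, where `cut_position` and its `inclusive` flag are hypothetical names introduced just for this illustration:

```python
def cut_position(text, limit, inclusive=True):
    """Return the index just past the last usable newline, or -1 if none.

    inclusive=True  allows chunks of exactly `limit` characters.
    inclusive=False requires chunks to be strictly shorter than `limit`.
    """
    max_len = limit if inclusive else limit - 1
    split_index = text.rfind("\n", 0, max_len)
    return -1 if split_index == -1 else split_index + 1


sample = "ab\ncd\n"

# "ab\n" is exactly 3 characters long.
inclusive_cut = cut_position(sample, 3)                 # 3: the chunk "ab\n" is allowed
strict_cut = cut_position(sample, 3, inclusive=False)   # -1: length 3 is not allowed
```

Since the question asks for chunks smaller than LIMIT, the strict variant is arguably the closer match, but the practical difference is a single character.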

Source

2017-03-31 22:04:22 BurningKarl
