2013-03-14 56 views
2

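I'm new to Python and need help with my program. I have code that takes an unformatted text document, applies some formatting (setting the page width and margins), and outputs a new text document. My whole code works fine except for the function that produces the final output. How do I use text.split() and retain blank (empty) lines?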

Here is the problem segment of the code:

def process(document, pagewidth, margins, formats):
    res = []
    onlypw = []
    pwmarg = []
    count = 0
    marg = 0

    for segment in margins:
        for i in range(count, segment[0]):
            res.append(document[i])
        text = ''

    foundmargin = -1
    for i in range(segment[0], segment[1]+1):
        marg = segment[2]
        text = text + '\n' + document[i].strip(' ')

    words = text.split()

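Note: segment[0] means the start of the document and segment[1] just means the end of the document, in case you are wondering about the ranges. My problem is that when I copy text into words (words = text.split()), it does not retain my blank lines. The output I should be getting is: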

 This is my substitute for pistol and ball. With a 
     philosophical flourish Cato throws himself upon his sword; I 
     quietly take to the ship. There is nothing surprising in 
     this. If they but knew it, almost all men in their degree, 
     some time or other, cherish very nearly the same feelings 
     towards the ocean with me. 

     There now is your insular city of the Manhattoes, belted 
     round by wharves as Indian isles by coral reefs--commerce 
     surrounds it with her surf. 

What my current output looks like:

 This is my substitute for pistol and ball. With a 
     philosophical flourish Cato throws himself upon his sword; I 
     quietly take to the ship. There is nothing surprising in 
     this. If they but knew it, almost all men in their degree, 
     some time or other, cherish very nearly the same feelings 
     towards the ocean with me. There now is your insular city of 
     the Manhattoes, belted round by wharves as Indian isles by 
     coral reefs--commerce surrounds it with her surf. 

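I know the problem happens when I copy text into words, since that step doesn't keep the blank lines. How can I make sure it copies the blank lines as well as the words? Please let me know if I should add more code or more detail!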

+0

You could try splitting into paragraphs first and then processing each paragraph: first `text.split('\n\n')`, then `split()` on each paragraph. – dmg 2013-03-14 20:27:11

Answers

4

Split on at least 2 newlines first, then split each piece into words:

import re 

paragraphs = re.split('\n\n+', text) 
words = [paragraph.split() for paragraph in paragraphs] 

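You now have a list of lists, one per paragraph. Process these paragraph by paragraph, after which you can rejoin the whole thing into new text with the double newlines inserted back in.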
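As a rough sketch of that round trip: the sample text and the per-paragraph step here (which simply puts each paragraph's words back on one line) are assumptions for illustration, not the asker's real formatting logic.

```python
import re

# Hypothetical sample input: two paragraphs separated by a blank line.
text = "This is my substitute\nfor pistol and ball.\n\nThere now is your\ninsular city."

paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]

# Stand-in for the real per-paragraph formatting: join the words with spaces.
processed = [' '.join(paragraph_words) for paragraph_words in words]

# Rejoin the paragraphs, putting the blank line back between them.
result = '\n\n'.join(processed)
print(result)
```

In the real program, the list comprehension that builds `processed` is where the width and margin handling would go.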

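I used re.split() to support paragraphs being delimited by more than 2 newlines. If there are only ever exactly 2 newlines between paragraphs, you could use a simple text.split('\n\n') instead.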
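To make the difference concrete, here is a small comparison on a made-up string (an assumption for illustration) that has three newlines between its paragraphs:

```python
import re

# Hypothetical input with THREE newlines between the paragraphs.
text = "first paragraph\n\n\nsecond paragraph"

# str.split('\n\n') consumes exactly two newlines, so one is left over
# at the start of the second piece.
simple_split = text.split('\n\n')

# re.split('\n\n+') consumes the whole run of newlines.
regex_split = re.split('\n\n+', text)

print(simple_split)  # ['first paragraph', '\nsecond paragraph']
print(regex_split)   # ['first paragraph', 'second paragraph']
```

The stray leading newline is harmless if each piece is later passed through split(), but it matters if the paragraphs are used as-is.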

+0

`\n{2,}` is a good notation for "2 or more newlines" that is easily adaptable to 2, 3 or more, and so on. – kindall 2013-03-14 20:58:52

+1

@kindall: I'm aware of the notation; in this case I picked the `\n\n+` version for symmetry with the `text.split('\n\n')` alternative. – 2013-03-14 20:59:51

1

Use a regular expression to find the words and the blank lines, instead of splitting:

import re

m = re.compile(r'(\S+|\n\n)')  # a word, or a blank-line marker
words = m.findall(text)
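A possible way to use the resulting token list, sketched on a made-up sample string: the variable names and the rebuilding step are assumptions for illustration, not part of the answer itself.

```python
import re

# Hypothetical sample input with one blank line between two paragraphs.
text = "With a flourish\nCato throws.\n\nThere now is\nyour city."

# Each token is either a word or the two-newline paragraph marker.
tokens = re.findall(r'\S+|\n\n', text)

# Group the words into paragraphs, starting a new one at every marker.
paragraphs = [[]]
for token in tokens:
    if token == '\n\n':
        paragraphs.append([])
    else:
        paragraphs[-1].append(token)

result = '\n\n'.join(' '.join(words) for words in paragraphs)
print(tokens)
print(result)
```

Note that this pattern matches newlines strictly in pairs, so a run of four newlines would produce two markers (and an empty paragraph in this sketch), while the third newline in a run of three is silently skipped.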