2017-05-07
3

How do I read a text file in which some entries contain newlines?

I have a text file of this form:

06/01/2016, 10:40 pm - abcde 
07/01/2016, 12:04 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 6:14 pm - abcde 

fghe 
07/01/2016, 6:20 pm - abcde 
07/01/2016, 7:58 pm - abcde 

fghe 

ijkl 
07/01/2016, 7:58 pm - abcde 

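As you can see, the entries are separated by newlines, but some entries also contain newlines in their content. So simply splitting on lines does not parse each entry correctly.

For example, for the 5th entry, I want my output to be 07/01/2016, 6:14 pm - abcde fghe

Here is my current code: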

with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        data.append(row)
+0

Are the entries that contain line breaks enclosed in double quotes, by any chance? –

+0

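Can you show what the data should look like? Your description is unclear. I can see the input, but it is not clear what the result should be. – TitanFighter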

+0

I want every element of 'data' to start with a date. – Imran

Answers

1

Given your example input, you can use a regex with a positive lookahead:

import re
from pprint import pprint

fn = 'file.txt'  # path to the input file
pat=re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

with open(fn) as f:
    pprint([m.group(1) for m in pat.finditer(f.read())])

Prints:

['06/01/2016, 10:40 pm - abcde\n', 
'07/01/2016, 12:04 pm - abcde\n', 
'07/01/2016, 12:05 pm - abcde\n', 
'07/01/2016, 12:05 pm - abcde\n', 
'07/01/2016, 6:14 pm - abcde\n\nfghe\n', 
'07/01/2016, 6:20 pm - abcde\n', 
'07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n', 
'07/01/2016, 7:58 pm - abcde\n'] 

With the Dropbox example, it prints:

['11/11/2015, 3:16 pm - IK: 12\n', 
'13/11/2015, 12:10 pm - IK: Hi.\n\nBut this is not about me.\n\nA donation, however small, will go a long way.\n\nThank you.\n', 
'13/11/2015, 12:11 pm - IK: Boo\n', 
'15/11/2015, 8:36 pm - IR: Root\n', 
'15/11/2015, 8:36 pm - IR: LaTeX?\n', 
'15/11/2015, 8:43 pm - IK: Ws\n'] 

If you want to remove the \n characters from the captured content, just use m.group(1).strip().replace('\n', '') in place of m.group(1) in the list comprehension above.
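A minimal sketch of that clean-up, assuming you want the continuation lines joined with a single space (as in the output requested in the question) rather than concatenated directly. The helper name `flatten` and the inline sample text are illustrative only and are not part of the answer above; the pattern is the same one written with a single `^` in the lookahead.

```python
import re

pat = re.compile(r'^(\d\d/\d\d/\d\d\d\d.*?)(?=^\d\d/\d\d/\d\d\d\d|\Z)', re.S | re.M)

def flatten(record):
    # str.split() with no arguments splits on any run of whitespace
    # (newlines included), so joining with ' ' collapses each run to one space.
    return ' '.join(record.split())

# Hypothetical sample standing in for the contents of the file.
sample = (
    "07/01/2016, 6:14 pm - abcde \n"
    "\n"
    "fghe \n"
    "07/01/2016, 6:20 pm - abcde \n"
)

records = [flatten(m.group(1)) for m in pat.finditer(sample)]
print(records)
# ['07/01/2016, 6:14 pm - abcde fghe', '07/01/2016, 6:20 pm - abcde']
```

Note that this also collapses repeated spaces inside a message, which may or may not be what you want.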


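Explanation of the regex:

^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)

^                          start of a line (re.M)
(\d\d\/\d\d\/\d\d\d\d      pattern for a date, opening the capture group
.*?)                       capture the rest, lazily (re.S lets . match newlines)...
(?=                        until (lookahead, nothing is consumed)
^^\d\d\/\d\d\/\d\d\d\d     another date at the start of a line (the second ^ is redundant)
|                          or
\Z)                        end of string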
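The same breakdown can be written into the pattern itself with the re.X (verbose) flag. This is only a sketch of an equivalent pattern, not the original author's code: it uses a single `^` in the lookahead and leaves `/` unescaped, and the short `text` sample is made up for illustration.

```python
import re

pat = re.compile(r'''
    ^                          # start of a line (needs re.M)
    (                          # group 1: the whole record
      \d\d/\d\d/\d\d\d\d       #   a date such as 07/01/2016
      .*?                      #   the rest, as little as possible (re.S: dot matches newlines)
    )
    (?=                        # lookahead: stop here without consuming...
      ^\d\d/\d\d/\d\d\d\d      #   ...before another date at the start of a line
      | \Z                     #   ...or at the end of the string
    )
''', re.S | re.M | re.X)

# Hypothetical two-record sample; the second record spans several lines.
text = "06/01/2016, 10:40 pm - a\n07/01/2016, 6:14 pm - b\n\nc\n"
records = [m.group(1) for m in pat.finditer(text)]
print(records)
# ['06/01/2016, 10:40 pm - a\n', '07/01/2016, 6:14 pm - b\n\nc\n']
```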
+0

This works perfectly, thanks! Can you explain the code inside 're.compile'? – Imran

0

You can use a regular expression (with the re module) to check for the date, like this:

import re

with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        if re.match(r'\d{2}/\d{2}/\d{4}.*', row):
            data.append(row)  # date: new record
        elif data:  # guard: skip stray lines before the first record
            data[-1] += '\n' + row  # no date: append to last record

# '\d{2}': two digits
# '.*': any character, zero or more times
+0

Like the other approaches so far: this breaks if the data itself contains the delimiter sequence (a date in this format). – handle
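One way to reduce (though not eliminate) such false positives is to match the whole record header instead of only the date. The sketch below assumes every header looks like the sample, "dd/mm/yyyy, h:mm am/pm -"; the names `HEADER` and `group_records` are hypothetical, and unlike the answer above it drops blank lines instead of appending them.

```python
import re

# Assumed header layout, taken from the sample: "07/01/2016, 6:14 pm -"
HEADER = re.compile(r'\d{2}/\d{2}/\d{4}, \d{1,2}:\d{2} [ap]m -')

def group_records(lines):
    """Group lines into records; a line matching HEADER starts a new record."""
    records = []
    for line in lines:
        row = line.strip()
        if HEADER.match(row):
            records.append(row)
        elif records and row:
            records[-1] += '\n' + row  # continuation of the previous record
    return records

lines = [
    "07/01/2016, 6:14 pm - abcde",
    "",
    "12/12/2016 was a good day",  # starts with a date, but is not a header
    "07/01/2016, 6:20 pm - abcde",
]
result = group_records(lines)
print(result)
# ['07/01/2016, 6:14 pm - abcde\n12/12/2016 was a good day', '07/01/2016, 6:20 pm - abcde']
```

A message line that happens to reproduce the full header would still be split incorrectly, so the objection in the comment still holds in principle.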

1

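Assuming ',' can only appear as a separator, we may check whether the line contains a comma and, if it does not, concatenate it to the previous entry: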

data = []

with open('file.txt', 'r') as text_file:
    for line in text_file:
        row = line.strip()
        if ',' not in row:
            data[-1] += '\n' + row
        else:
            data.append(row)
+0

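This breaks as soon as a comma appears in the data (and in fact there are several in the data file linked in the question comments). Reliable separation is not possible. – handle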

+0

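When I posted, there was only the example in the question, and my code was meant as "the simplest thing that could possibly work". But you are right: with the data linked in the comments, this does not work... –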

0

A simple test on the length:

#!python3 
#coding=utf-8 

data = """06/01/2016, 10:40 pm - abcde 
07/01/2016, 12:04 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 6:14 pm - abcde 

fghe 
07/01/2016, 6:20 pm - abcde 
07/01/2016, 7:58 pm - abcde 

fghe 

ijkl 
07/01/2016, 7:58 pm - abcde""" 

lines = data.split("\n") 
out = [] 
for l in lines:
    c = l.strip()
    if c:
        if len(c) < 10:
            out[-1] += c
        else:
            out.append(c)
    # skip empty lines

for o in out: 
    print(o) 

Result:

06/01/2016, 10:40 pm - abcde 
07/01/2016, 12:04 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 6:14 pm - abcdefghe 
07/01/2016, 6:20 pm - abcde 
07/01/2016, 7:58 pm - abcdefgheijkl 
07/01/2016, 7:58 pm - abcde 

Note that the newlines within the data are not preserved!


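But this one-liner regex should do it (it splits at a line break followed by a digit), at least for the sample data (it breaks when the data itself contains a line break followed by a digit):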

#!python3 
#coding=utf-8 

text_file = """06/01/2016, 10:40 pm - abcde 
07/01/2016, 12:04 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 6:14 pm - abcde 

fghe 
07/01/2016, 6:20 pm - abcde 
07/01/2016, 7:58 pm - abcde 

fghe 

ijkl 
07/01/2016, 7:58 pm - abcde""" 

import re
# raw string, so \d is passed to the regex engine without an invalid-escape warning
data = re.split(r"\n(?=\d)", text_file)

print(data) 

for d in data: 
    print(d) 

Output:

['06/01/2016, 10:40 pm - abcde', '07/01/2016, 12:04 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 6:14 pm - abcde\n\nfghe', '07/01/2016, 6:20 pm - abcde', '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl', '07/01/2016, 7:58 pm - abcde']
06/01/2016, 10:40 pm - abcde 
07/01/2016, 12:04 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 12:05 pm - abcde 
07/01/2016, 6:14 pm - abcde 

fghe 
07/01/2016, 6:20 pm - abcde 
07/01/2016, 7:58 pm - abcde 

fghe 

ijkl 
07/01/2016, 7:58 pm - abcde 

(Fixed by using a lookahead.)

+0

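This fails if the data contains a newline followed by a digit, so the regex would need to be extended. On the other hand, any approach like this [unsanitized data, no delimiters] breaks easily if the data contains a new line that looks like a record header... – handle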

+0

What if one of the dates is '12/21/2016'? If you use 're.split(r'\n\d', txt)' your date becomes '2/21/2016' ... – dawg

+0

Oops, I had not noticed that it would consume the digit. – handle
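A small sketch of the difference discussed in these comments, using a made-up two-record string `txt`: when the digit is part of the split pattern it is consumed along with the newline, whereas a lookahead only checks that a digit follows.

```python
import re

# Hypothetical sample; the first record contains embedded newlines.
txt = "12/21/2016, 6:14 pm - abcde\n\nfghe\n12/22/2016, 6:20 pm - abcde"

# The digit is part of the separator, so re.split() drops it.
consumed = re.split(r'\n\d', txt)

# The lookahead only asserts that a digit follows; just the newline is removed.
kept = re.split(r'\n(?=\d)', txt)

print(consumed)
# ['12/21/2016, 6:14 pm - abcde\n\nfghe', '2/22/2016, 6:20 pm - abcde']
print(kept)
# ['12/21/2016, 6:14 pm - abcde\n\nfghe', '12/22/2016, 6:20 pm - abcde']
```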