I need to parse a bunch of huge text files, each 100MB+. They are poorly formatted, CSV-style log files, but each record spans multiple lines, so I can't just read line by line and split on a delimiter. The line count per record isn't fixed either: a line is sometimes skipped when its value is blank, and some values overflow onto the next line. On top of that, the record separator varies within a single file, from a bare "" line to "*****", and sometimes a line like "end of log #". All three forms show up in the sample below.
Sample log excerpt:
"Date:","6/23/2015","","Location:","Kol","","Target Name:","ILO.sed.908"
"ID:","ke.lo.213"
"User:","EDU\namo"
"Done:","Edit File"
"Comment","File saved successfully"
""
"Date:","6/27/2015","","Location:","Los Angeles","","Target Name:","MAL.21.ol.lil"
"ID:","uf.903.124.56"
"Done:","dirt emptied and driven to locations without issue, yet to do anyt"
"hing with the steel pipes, no planks "
"Comment"," l"
""
"end of log 1"
"Date:","5/16/2015","","Location:","Springfield","","Target Name:","ile.s.ol.le"
"ID:","84l.df.345"
"User:","EDU\bob2"
"Done:","emptied successfully"
"Comment","File saved successfully"
" ******* "
How should I approach this? It needs to be efficient so that I can process the files quickly, so fewer file I/O operations would be better. At the moment I just read the whole file into memory at once:
with open('Path/to/file', 'r') as content_file:
    content = content_file.read()
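
If a file does fit in memory, I'm thinking the whole string could then be split into records with a multiline regex built from the same three separator forms (again, just my reading of the sample):

import re

# Split the in-memory content on separator lines. [ \t]* (rather
# than \s*) keeps the match from swallowing neighbouring lines, and
# the trailing \n? consumes the separator line's own newline.
SEP = r'(?m)^[ \t]*(?:""|"end of log \d+"|\*+)[ \t]*\r?$\n?'
records = [chunk for chunk in re.split(SEP, content) if chunk.strip()]

Consecutive separators (like "" directly followed by "end of log 1" in the sample) just produce empty chunks, which the strip() filter drops.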
I'm also fairly new to Python. I do know how to handle reading multiple files and running the code over each one, and I have a to-string routine that writes the output to a new CSV file.
The other problem is that some of the log files are a few GB in size, so reading everything into memory at once won't work for those, and I don't know how to break the file into chunks. I can't just read X lines at a time, since there's no set number of lines per record.
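
From the lazy-reading answer linked at the end, it sounds like iterating over a file object line by line is already lazy and buffered, so maybe a generator like this would only ever hold one record in memory (a sketch, reusing the separator patterns guessed above):

import re

# Same separator patterns as above (guesses from the sample).
SEPARATOR_RE = re.compile(r'^\s*(?:""|"end of log \d+"|\*+)\s*$')

def iter_records(path):
    # Yield one record at a time as a list of lines. Iterating the
    # file object reads it in buffered chunks under the hood, so only
    # the current record is held in memory, whatever the file size.
    with open(path, 'r') as f:
        current = []
        for line in f:
            line = line.rstrip('\n')
            if SEPARATOR_RE.match(line):
                if current:
                    yield current
                current = []
            else:
                current.append(line)
        if current:  # the last record may lack a trailing separator
            yield current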
The comments need to be kept together and concatenated into a single string (like the "Done:" value that overflows onto a second line in the sample).
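
For that joining step, here's the kind of per-record parsing I have in mind. It's a rough sketch that assumes the field names visible in the sample are the complete set; KNOWN_KEYS would need extending for the real files:

import csv

# Field names seen in the sample; an assumption, not a complete list.
KNOWN_KEYS = {'Date:', 'Location:', 'Target Name:', 'ID:', 'User:', 'Done:', 'Comment'}

def record_to_dict(record_lines):
    # Parse each line of one record as CSV. Lines starting with a
    # known key hold key/value pairs (with "" spacer fields between
    # them); any other line is treated as overflow from the previous
    # value and is concatenated onto it.
    result = {}
    last_key = None
    for row in csv.reader(record_lines):
        if not row:
            continue
        if row[0] in KNOWN_KEYS:
            i = 0
            while i < len(row):
                if row[i] in KNOWN_KEYS:
                    last_key = row[i]
                    result[last_key] = row[i + 1] if i + 1 < len(row) else ''
                    i += 2
                else:
                    i += 1  # skip the "" spacer fields
        elif last_key is not None:
            result[last_key] += ''.join(row)  # overflow line
    return result

Combined with iter_records above, each file would then be a single pass: for rec in iter_records(path): print(record_to_dict(rec)).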
Any help would be much appreciated!
An example of how to read a file in chunks: http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python