Here is a Python script you can use to split big files with subprocess:
"""
Splits the file into the same directory and
deletes the original file
"""
import subprocess
import sys
import os
SPLIT_FILE_CHUNK_SIZE = '5000'
SPLIT_PREFIX_LENGTH = '2' # subprocess expects a string, i.e. 2 = aa, ab, ac etc..
if __name__ == "__main__":
file_path = sys.argv[1]
# i.e. split -a 2 -l 5000 t/some_file.txt ~/tmp/t/
subprocess.call(["split", "-a", SPLIT_PREFIX_LENGTH, "-l", SPLIT_FILE_CHUNK_SIZE, file_path,
os.path.dirname(file_path) + '/'])
# Remove the original file once done splitting
try:
os.remove(file_path)
except OSError:
pass
You can call it externally:
import os
fs_result = os.system("python file_splitter.py {}".format(local_file_path))
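If you prefer subprocess over os.system, a minimal sketch of the equivalent call (the script name and path are hypothetical, matching the example above):

```python
import sys
import subprocess

# Hypothetical input path, matching the example above
local_file_path = "t/some_file.txt"

# subprocess.call returns the child's exit status directly,
# whereas os.system returns a platform-encoded wait status
fs_result = subprocess.call([sys.executable, "file_splitter.py", local_file_path])
```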
You can also import subprocess and run it directly from within your program.

The problem with this approach is high memory usage: subprocess creates a fork whose memory footprint is the same size as your process, and if your process's memory is already large, it doubles it for the duration of the call. The same thing happens with os.system.

Here is another, pure-Python way of doing this. I haven't tested it on huge files, and it will be slower, but it is leaner on memory:
CHUNK_SIZE = 5000

def yield_csv_rows(reader, chunk_size):
    """
    Reads each row from the reader and yields lists of rows
    Expects the header is already removed
    Replacement for ingest_csv
    :param reader: DictReader
    :param chunk_size: int, chunk size
    """
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            # Start a fresh list rather than clearing in place, so the
            # yielded chunk stays intact if the caller holds on to it
            chunk = []
        chunk.append(row)
    yield chunk
import unicodecsv

with open(local_file_path, 'rb') as f:
    header = f.readline().strip().replace('"', '')
    reader = unicodecsv.DictReader(f, fieldnames=header.split(','), delimiter=',', quotechar='"')
    chunks = yield_csv_rows(reader, CHUNK_SIZE)
    for chunk in chunks:
        if not chunk:
            break
        # Do something with your chunk here
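To sanity-check the chunking logic without unicodecsv, here is a self-contained sketch using the stdlib csv module on a small in-memory file (the sample data is made up):

```python
import csv
import io

CHUNK_SIZE = 2  # small chunk size, just for demonstration

def yield_csv_rows(reader, chunk_size):
    """Yield lists of up to chunk_size rows from a csv reader."""
    chunk = []
    for i, row in enumerate(reader):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            chunk = []
        chunk.append(row)
    yield chunk

data = io.StringIO('id,name\n1,a\n2,b\n3,c\n')
header = data.readline().strip()
reader = csv.DictReader(data, fieldnames=header.split(','))
chunks = list(yield_csv_rows(reader, CHUNK_SIZE))
print([len(c) for c in chunks])  # [2, 1]
```

Three rows with a chunk size of 2 produce one full chunk and one partial chunk, which is why the consuming loop above checks for an empty chunk before processing.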
Unpopular suggestion: get a better text editor. :-) If you're on Windows, EmEditor is the one I know of that can edit files seamlessly without loading them entirely into memory. – bobince 2008-11-15 13:00:35