How to stream in and manipulate a large data file in Python by summing across categories

I have a relatively large (1 GB) text file that I want to reduce in size by summing across categories:

Geography AgeGroup Gender Race Count 
County1 1  M  1 12 
County1 2  M  1 3 
County1 2  M  2 0

To:

Geography Count 
County1 15 
County2 23

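This would be a simple matter if the whole file could fit into memory, but using pandas.read_csv() gives a MemoryError. So I have been looking into other methods, and there appear to be many options: HDF5? Using itertools (which seems complicated - generators?) Or just using the standard file methods to read in the first geography (70 lines), sum the count column, and write it out before loading in another 70 lines.

Does anyone have any suggestions on the best way to do this? I especially like the idea of streaming data in, particularly because I can think of a lot of other places where this would be useful. I am most interested in this method, or one that similarly uses the most basic functionality possible.

Edit: In this small case I only want the sums of count by geography. However, it would be ideal if I could read in a chunk, specify any function (say, add 2 columns together, or take the max of a column by geography), apply the function, and write the output before reading in a new chunk.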
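For reference, the "most basic functionality" route can be sketched with only the standard library. This is a minimal illustration, not a definitive implementation: it assumes a comma-separated file with a header row, the helper name `sum_by_key` is hypothetical, and the County2 row in the sample is made up to match the desired output above.

```python
import csv
import io
from collections import defaultdict


def sum_by_key(lines, key="Geography", value="Count"):
    """Stream rows one at a time, keeping only one running total per key."""
    totals = defaultdict(int)
    for row in csv.DictReader(lines):
        totals[row[key]] += int(row[value])
    return dict(totals)


# Tiny in-memory stand-in for the real file (assumed data). With a file on
# disk you would pass an open file object, e.g. open("my_file.csv", newline="").
sample = io.StringIO(
    "Geography,AgeGroup,Gender,Race,Count\n"
    "County1,1,M,1,12\n"
    "County1,2,M,1,3\n"
    "County1,2,M,2,0\n"
    "County2,1,M,1,23\n"
)
totals = sum_by_key(sample)
print(totals)
```

Only the dictionary of totals lives in memory, so the size of the input file does not matter as long as the number of distinct geographies is manageable.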

Source

2016-07-05 HFBrowning

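So you don't want the 3 columns in the middle? – ayhan

I edited the question to clarify, thanks – HFBrowning

Are you aware of [pandas' chunked reading](http://pandas.pydata.org/pandas-docs/stable/io.html#iterating-through-files-chunk-by-chunk)? `pd.read_csv('myfile.csv', chunksize=1000)`. Then you can operate on the pieces inside a loop. – chrisaycock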

You can use dask.dataframe, which is syntactically similar to pandas but performs manipulations out-of-core, so memory shouldn't be an issue:

import dask.dataframe as dd 

df = dd.read_csv('my_file.csv') 
df = df.groupby('Geography')['Count'].sum().to_frame() 
df.to_csv('my_output.csv')

Alternatively, if pandas is a requirement, you can use chunked reading, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.

import pandas as pd

# Operate on chunks.
data = []
for chunk in pd.read_csv('my_file.csv', chunksize=10**5): 
    chunk = chunk.groupby('Geography', as_index=False)['Count'].sum() 
    data.append(chunk) 

# Combine the chunked data. 
df = pd.concat(data, ignore_index=True) 
df = df.groupby('Geography')['Count'].sum().to_frame() 
df.to_csv('my_output.csv')
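The same chunked pattern can be extended to other aggregations, such as the per-geography maximum mentioned in the question's edit. The following is a rough sketch under stated assumptions: the helper `aggregate_in_chunks` and the sample data are hypothetical, and the approach is only valid for aggregations that can be re-applied to their own partial results (sum, max, min), not for something like a mean.

```python
import io

import pandas as pd


def aggregate_in_chunks(source, key, value, how, chunksize):
    """Apply `how` within each chunk, then again across the per-chunk results."""
    partials = [
        chunk.groupby(key)[value].agg(how)
        for chunk in pd.read_csv(source, usecols=[key, value], chunksize=chunksize)
    ]
    # The same key can appear in several chunks, so aggregate once more.
    return pd.concat(partials).groupby(level=0).agg(how)


# Assumed sample data standing in for the large file.
csv_text = (
    "Geography,Count\n"
    "County1,12\n"
    "County2,3\n"
    "County1,12\n"
    "County2,33\n"
    "County1,5\n"
)
maxima = aggregate_in_chunks(io.StringIO(csv_text), "Geography", "Count", "max", chunksize=2)
sums = aggregate_in_chunks(io.StringIO(csv_text), "Geography", "Count", "sum", chunksize=2)
print(maxima)
print(sums)
```

A tiny chunksize is used here only so that the sample spans several chunks; for a real file a value such as 10**5 is a more reasonable starting point.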

Source

2016-07-05 16:43:17 root

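I like @root's solution, but I would go a bit further in optimizing memory usage: keep only the aggregated DF in memory, and read only the columns that you really need: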

import pandas as pd

filename = 'my_file.csv'  # path to the input CSV
cols = ['Geography','Count']
df = pd.DataFrame()

chunksize = 2 # adjust it! for example --> 10**5 
for chunk in (pd.read_csv(filename, 
          usecols=cols, 
          chunksize=chunksize) 
      ): 
    # merge previously aggregated DF with a new portion of data and aggregate it again 
    df = (pd.concat([df, 
        chunk.groupby('Geography')['Count'].sum().to_frame()]) 
      .groupby(level=0)['Count'] 
      .sum() 
      .to_frame() 
     ) 

df.reset_index().to_csv('c:/temp/result.csv', index=False)

Test data:

Geography,AgeGroup,Gender,Race,Count 
County1,1,M,1,12 
County2,2,M,1,3 
County3,2,M,2,0 
County1,1,M,1,12 
County2,2,M,1,33 
County3,2,M,2,11 
County1,1,M,1,12 
County2,2,M,1,111 
County3,2,M,2,1111 
County5,1,M,1,12 
County6,2,M,1,33 
County7,2,M,2,11 
County5,1,M,1,12 
County8,2,M,1,111 
County9,2,M,2,1111

output.csv:

Geography,Count 
County1,36 
County2,147 
County3,1122 
County5,24 
County6,33 
County7,11 
County8,111 
County9,1111

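PS: using this approach you can process huge files.

PPS: the chunked approach should work unless you need to sort your data. In that case I would use classic UNIX tools, such as awk, sort, etc., to sort your data first.

I would also recommend using PyTables (HDF5 storage) instead of CSV files. It is very fast and allows you to read data conditionally (using the where parameter), so it is very handy, saves a lot of resources, and is usually much faster compared to CSV.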
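To illustrate the conditional read, here is a small sketch using the pandas HDF5 interface. It assumes the PyTables package (`tables`) is installed; the file name, key, and sample data are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Assumed sample data standing in for the real dataset.
df = pd.DataFrame({
    "Geography": ["County1", "County1", "County2"],
    "Count": [12, 3, 23],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "counts.h5")
    # format="table" makes the store queryable; data_columns lists the
    # columns that may be used in a `where` condition.
    df.to_hdf(path, key="counts", format="table", data_columns=["Geography"])
    # Only the rows matching the condition are read from disk.
    county1 = pd.read_hdf(path, key="counts", where="Geography == 'County1'")

county1_total = int(county1["Count"].sum())
print(county1_total)
```

The filtering happens inside the HDF5 layer, so rows for other geographies never have to be loaded into memory.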

Source

2016-07-05 17:11:36 MaxU
