Reading files in Python

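I have a bunch of data files (almost 100) that contain data in the format: (number of people) \t (average age)

These files were generated from a random walk conducted on a population of a certain demographic. Each file has 100,000 lines, corresponding to the average age of populations of sizes from 1 to 100,000. Each file corresponds to a different locality in a third world country. We will be comparing these values to the average ages of similar-sized localities in a developed country.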

What I want to do is,

for each i (i ranges from 1 to 100,000): 
    Read in the first 'i' values of average-age 
    perform some statistics on these values

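This means that on each run i (where i ranges from 1 to 100,000), I read in the first i values of average age, add them to a list, and run a few tests (such as Kolmogorov-Smirnov or chi-square).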

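In order to open all these files in parallel, I figured the best way would be a dictionary of file objects. But I am stuck trying to do the above operations.

Is my method the best possible one (complexity-wise)?

Is there a better method?

Source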

2011-06-02 Craig

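"Read in all the files (the first *i* average ages) and put them into a list or something"? What does this mean? Does it mean 'for i in range(100): read i lines from the file'? If so, please update your algorithm. – 2011-06-02 21:17:59

If the files are small, accessing all of them at the same time adds overhead, because of the GIL and because the files sit on the same hard disk – JBernardo 2011-06-02 21:19:47

There are 100,000 lines in each file, and I want to read the first i lines, where i ranges from 1 to 100,000 – Craig 2011-06-02 21:23:19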

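Why not take a simple approach:

Open each file in turn
and read its lines to fill an in-memory data structure
Perform the statistics on the in-memory data structure

Here is a self-contained example with 3 "files", each containing 3 lines. It uses StringIO for convenience instead of actual files: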

#!/usr/bin/env python3
# coding: utf-8

from io import StringIO

# for this example, each "file" has 3 lines instead of 100000
f1 = '1\t10\n2\t11\n3\t12'
f2 = '1\t13\n2\t14\n3\t15'
f3 = '1\t16\n2\t17\n3\t18'

files = [f1, f2, f3]

# data is a list of dictionaries mapping population to average age
# i.e. data[0][10000] contains the average age in location 0 (files[0]) with
# population of 10000.
data = []

for i, filename in enumerate(files):
    f = StringIO(filename)
    # f = open(filename, 'r')
    data.append(dict())

    for line in f:
        population, average_age = (int(s) for s in line.split('\t'))
        data[i][population] = average_age

print(data)

# gather custom statistics on the data

# i.e. here's how to calculate the average age across all locations where
# population is 2:
num_locations = len(data)
# integer division keeps the result a whole number of years
pop2_avg = sum(data[loc][2] for loc in range(num_locations)) // num_locations
print('Average age with population 2 is', pop2_avg, 'years old')

The output is:

[{1: 10, 2: 11, 3: 12}, {1: 13, 2: 14, 3: 15}, {1: 16, 2: 17, 3: 18}] 
Average age with population 2 is 14 years old
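
To tie this back to the loop in the question, the same in-memory data can be sliced so that step i sees only the first i average ages of one location. The following is a minimal sketch rather than part of the original answer: `prefix_stats` is a hypothetical helper, and `statistics.mean` merely stands in for whatever test (Kolmogorov-Smirnov, chi-square) is actually wanted.

```python
from statistics import mean


def prefix_stats(ages, stat=mean):
    """Return [stat(ages[:1]), stat(ages[:2]), ..., stat(ages)].

    `ages` holds one location's average ages, ordered by population size.
    `stat` is a placeholder for the real test; it receives the first i values.
    """
    return [stat(ages[:i]) for i in range(1, len(ages) + 1)]


# average ages of location 0 from the example above, ordered by population
ages = [10, 11, 12]
results = prefix_stats(ages)
print(results)
```

Slicing copies i values at step i, so the total work grows quadratically with the number of lines (roughly 5 × 10^9 copied values per file for 100,000 lines). If the statistic can be updated incrementally, that is probably worth doing instead.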

Source

2011-06-02 22:55:59 Gregg

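Actually, it would be possible to hold 10,000,000 lines in memory.

Make a dictionary where the keys are number of people and the values are lists of average age, where each element of a list comes from a different file. So if there are 100 files, each list will contain 100 elements.

This way, you don't need to store the file objects in a dict.

Hope this helps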
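
A minimal sketch of that dictionary, under some assumptions: `StringIO` objects stand in for the real files, every line is taken to be `<number of people>\t<average age>` with integer values, and the name `ages_by_population` is made up for illustration.

```python
from collections import defaultdict
from io import StringIO

# Stand-ins for real files; with files on disk, use open(path) instead.
# (Assumption: every line is "<number of people>\t<average age>".)
sources = [
    StringIO("1\t10\n2\t11\n3\t12\n"),
    StringIO("1\t13\n2\t14\n3\t15\n"),
]

# ages_by_population[n] lists the average age for population n, one per file
ages_by_population = defaultdict(list)

for source in sources:
    for line in source:
        population, average_age = line.split("\t")
        ages_by_population[int(population)].append(int(average_age))

print(dict(ages_by_population))
```

Each file is read once from start to finish, so only one file needs to be open at a time.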

Source

2011-06-02 21:28:00 inspectorG4dget

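It sounds like a lot of data, but really, it isn't. Assuming the numbers you're storing are just integers, you're not going to be looking at more than a few tens of megabytes – 2011-06-02 22:21:14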
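
The arithmetic behind that estimate can be checked quickly. This is a rough sketch that assumes the values are packed as C ints in an `array('i')`; the item size is typically 4 bytes but depends on the platform.

```python
from array import array

files = 100               # "almost 100" files in the question
lines_per_file = 100_000  # lines per file in the question

total_values = files * lines_per_file

# array('i') stores C ints; itemsize is typically 4 bytes, platform-dependent
bytes_per_value = array("i").itemsize
megabytes = total_values * bytes_per_value / 1_000_000

print(total_values, "values, about", megabytes, "MB in an array('i')")
```

A plain list of Python int objects will take several times more than this, since each element is a full object plus a pointer to it.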

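I'm... not sure I like this approach, but it's possible that it could work for you. It has the potential to consume large amounts of memory, but it may do what you need. I'm assuming that your data files are numbered. If that isn't the case, this may need adapting.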

# open the files. 
handles = [open('file-%d.txt' % i) for i in range(1, 101)] 

# loop for the number of lines. 
for line in range(100000): 
    lines = [fh.readline() for fh in handles] 

    # Some sort of processing for the list of lines.

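This may get close to what you need, but again, I don't know that I like it. If any of your files don't have the same number of lines, this could run into trouble.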
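
One possible way around unequal file lengths is to swap the `readline()` loop for `itertools.zip_longest`, which pads exhausted files with `None`. This is only a sketch of that alternative: the `StringIO` objects stand in for open file handles, and the tab-separated integer format is assumed from the question.

```python
from io import StringIO
from itertools import zip_longest

# Stand-ins for open file handles; the second "file" is one line shorter.
handles = [
    StringIO("1\t10\n2\t11\n3\t12\n"),
    StringIO("1\t13\n2\t14\n"),
]

rows = []
# zip_longest keeps going until the longest file ends, padding with None
for lines in zip_longest(*handles):
    # keep only the files that still had a line at this position
    ages = [int(line.split("\t")[1]) for line in lines if line is not None]
    rows.append(ages)

print(rows)
```

Each entry of `rows` then holds the average ages for one population size, with fewer elements once the shorter files run out.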

Source

2011-06-02 21:35:35

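Yes, the files have different numbers of lines. This happens because the random walk failed in some runs – Craig 2011-06-02 22:16:32

@Craig - I just ran a quick test, and it looks like readline() returns an empty string when it reaches the end of a file. That should make your test simple, and it doesn't look like it raises an exception. – 2011-06-02 22:19:56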
