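How can I optimize reading and processing a large file?

I have a script that caches data returned from some poor man's API as a flat file of JSON objects, one result (one JSON object) per line.

The caching workflow is as follows:

Read in the entire cache file -> check, line by line, whether each line is too old -> save the lines that are not too old to a new list -> print the new fresh-cache list to the file, and also use the new list as a filter so that incoming data already in the cache is not sent to the API again.

By far the longest part of this process is the read-and-check portion above. Here is the code: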

import json
from dateutil import parser

# meta_dict and cache_timeout (a datetime.timedelta) are defined elsewhere in the script.
fresh_cache = []

print("Reading cache file into memory ---")
with open('cache', 'r') as f:
    cache_lines = f.readlines()

print("Turning cache lines into json and checking if they are stale or not ---")
for line in cache_lines:
    # Load the line back up as a json object
    try:
        json_line = json.loads(line)
    except Exception as e:
        print(e)
        continue  # skip lines that are not valid JSON

    # Get the delta to determine if data is stale.
    delta = meta_dict["timestamp_start"] - parser.parse(json_line['timestamp_start'])

    # If the data is still fresh then hold onto it
    if cache_timeout >= delta:
        fresh_cache.append(json_line)

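Depending on the size of the cache file, this can take several minutes. Is there a faster way to do this? I understand that reading in the entire file is not ideal, but it was the easiest to implement.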

2015-12-21 Thisisstackoverflow

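Depending on the size of your file, reading it all at once may cause memory problems. I don't know whether that is the problem you are running into. The code above can be rewritten as follows: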

delta = meta_dict['timestamp_start']
with open('cache', 'r') as f:
    # Iterate over the file lazily so only one line is held in memory at a time
    for line in f:
        json_line = json.loads(line)
        if delta - parser.parse(json_line['timestamp_start']) <= cache_timeout:
            fresh_cache.append(json_line)

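Additionally:

If you use dateutil to parse the dates, each call can be expensive. If your format is known, you may want to use the standard conversion tools provided by datetime or dateutil.

If your file is really big and fresh_cache would have to be really big as well, you can write the fresh entries to an intermediate file using another with statement.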
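
As a rough sketch of both suggestions together, the snippet below streams the cache line by line, parses the timestamp with datetime.strptime instead of dateutil, and writes the fresh lines straight to a second file. The names filter_cache, src_path, dst_path and TIMESTAMP_FORMAT are hypothetical, and the format string assumes timestamps shaped like the one reported later in this thread; adjust it if your data differs.

```python
import json
from datetime import datetime, timedelta

# Assumed timestamp layout, e.g. "2015-12-21 21:40:34.123456".
TIMESTAMP_FORMAT = '%Y-%m-%d %H:%M:%S.%f'


def filter_cache(src_path, dst_path, now, cache_timeout):
    """Copy still-fresh lines from src_path to dst_path; return how many were kept."""
    kept = 0
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for line in src:
            try:
                record = json.loads(line)
                started = datetime.strptime(record['timestamp_start'], TIMESTAMP_FORMAT)
            except (ValueError, KeyError, TypeError):
                continue  # skip blank, malformed, or incomplete lines
            if now - started <= cache_timeout:
                dst.write(line)  # reuse the original text; no re-serialization
                kept += 1
    return kept
```

A call might look like filter_cache('cache', 'cache.new', meta_dict['timestamp_start'], cache_timeout), after which the new file can replace the old one. Memory use stays flat because no list of records is built; if the fresh entries are still needed as an in-memory filter, they would have to be collected separately.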

2015-12-21 21:40:34 ohe

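Thanks for the input. I was hoping for some black magic, but it looks like I'm out of luck. I'll try not calling parser.parse on every line and see if that helps. – Thisisstackoverflow

You could also try the 'simplejson' library, which is faster than the standard 'json' library... – ohe

Good point. I'll give that a shot too. – Thisisstackoverflow

Reporting back: 1. simplejson had almost no effect. 2. Doing the datetime extraction manually had a big effect: the run time dropped from 8m11.578s to 2m55.681s. This replaced the parser.parse call above: datetime.datetime.strptime(json_line['timestamp_start'], '%Y-%m-%d %H:%M:%S.%f')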

2015-12-21 23:31:07 Thisisstackoverflow
