2017-07-03 114 views
0

How to parse a BIG JSON file in Python

I am working with a very large dataset and have run into a problem I cannot find an answer to anywhere. I am trying to parse JSON data. Here is what I do on a small chunk of the whole dataset, and it works:

import json

s = set()

with open("data.raw", "r") as f:
    for line in f:
        d = json.loads(line)

The confusing part is that when I apply the same code to my main data (about 200 GB in size), it shows the following error (it is not an out-of-memory error):

d = json.loads(line)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

type(f) is TextIOWrapper, if that helps... but it is the same type for the small dataset as well...

Here are a few lines of my data, so you can see the format:

{"MessageType": "SALES.CONTRACTS.SALESTATUSCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "OldStatus": {"Status": 3, "AutoRemoveInfo": null}, "NewStatus": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T13:39:57", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}} 
{"MessageType": "SALES.CONTRACTS.SALESHIPPINGINFOCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "OldShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "NewShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "0001-01-01T00:00:00", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}} 
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-4851828-6514632"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.1402505", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL Blanket Seahawks"}, "Quantity": 1, "UnitPrice": {"amount": 22.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T15:51:12", "Kits": null, "Products": null, "AdditionalSaleInfo": null}} 
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-102-3824485-2270645"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.3436109", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL CD Wallet Chargers"}, "Quantity": 1, "UnitPrice": {"amount": 12.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-12T02:49:58", "Kits": null, "Products": null, "AdditionalSaleInfo": null}} 

It is JSON: I have already parsed the first 2,000 lines and it works perfectly. But when I try the same process on the big file, it reports an error in the very first lines of the data.

+0

What changes should be made to that JSON data? – RomanPerekhrest

+0

Is 'data.raw' a single JSON file, or a file with a JSON object on each line? If the former, use ['json.load'](https://docs.python.org/3.5/library/json.html#json.load) – Will

+0

Your file is not valid JSON. However, it does appear to contain valid JSON text on each line. My suggestion is to fix whatever produces this "JSON" (which is not actually JSON). Barring that, I suppose you could accumulate the deserialized objects line by line into a list or something. –

Answers

2

Here is some simple code to see which of the data is not valid JSON, and where it is:

for i, line in enumerate(f):
    try:
        d = json.loads(line)
    except json.decoder.JSONDecodeError:
        print('Error on line', i + 1, ':\n', repr(line))
+0

Thanks @alex. I used this code and the result is very strange! According to the results, I get an error on every even-numbered line! But I used the first 2,000 lines of my big file and it did not show any errors... This is so confusing... – Mina

+0

@Mina Can you show us one of the error messages? In particular, I would like to see a line that fails. –

+0

You won't believe it, but that was the key: I had extra newlines in my main big file, and that was the cause of the error messages! By the way, your suggestion was very helpful in finding the source of the error. Thank you. – Mina
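For reference, a minimal sketch of how such blank lines can be skipped before decoding (the file name is the one from the question; the rest is just an illustration):

import json

with open("data.raw", "r") as f:
    for i, line in enumerate(f):
        if not line.strip():  # skip the blank lines that triggered "Expecting value"
            continue
        d = json.loads(line)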

1

A good way to read a large JSON dataset in Python is with a generator (using yield), because 200 GB is far too big for your memory if your JSON parser stores the whole file at once; processing it step by step with an iterator keeps memory usage under control.

You can use an iterative JSON parser with a Pythonic interface: http://pypi.python.org/pypi/ijson/
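A minimal sketch of how ijson could be used, assuming the data were one large JSON array rather than one object per line (the "item" prefix and the process() helper are illustrative assumptions):

import ijson  # pip install ijson

def process(obj):
    # illustrative placeholder for whatever you do with each record
    print(obj.get("MessageType"))

with open("big_array.json", "rb") as f:
    # stream one element of the top-level array at a time,
    # instead of loading the whole document into memory
    for obj in ijson.items(f, "item"):
        process(obj)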

But here your file has a .raw extension, so it is not a JSON file.

To read that kind of file:

import numpy as np

# interpret the raw file as a flat array of 16-bit integers
content = np.fromfile("data.raw", dtype=np.int16, sep="")

But this solution can crash for large files.

If in fact the .raw file turns out to be a .csv file, then you can create your reader like this:

import csv

def read_big_file(filename):
    # Python 3: csv.reader expects a text-mode file opened with newline=""
    with open(filename, "r", newline="") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            yield row

Or like this for a text file:

def read_big_file(filename):
    with open(filename, "r") as _file:
        for line in _file:
            yield line

Use "rb" only when your file is binary (note that in Python 3, csv.reader needs a text-mode file, hence the newline="" open above).

Then use it like this:

for line in read_big_file(filename):
    <treatment>
    <free memory after a size of chunk>
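For example, here is a hedged sketch of that loop for the newline-delimited JSON shown in the question, where the Counter aggregation is just an illustrative stand-in for <treatment>:

import json
from collections import Counter

message_counts = Counter()

for line in read_big_file("data.raw"):
    if not line.strip():  # tolerate stray blank lines
        continue
    record = json.loads(line)
    # aggregate instead of keeping every record, so memory stays bounded
    message_counts[record["MessageType"]] += 1

print(message_counts.most_common(5))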

I can make my answer more precise if you post the first lines of your file.