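Optimizing the parsing of a file of JSON objects into a pandas DataFrame

I have the piece of code described below. It takes about 5 seconds to run on a 1000-line file, which is quite long, so I am looking for a way to optimize it, but I don't know how to improve the current version.

I have a large file with one valid JSON object per line. Each object looks like this (the real data is larger and more deeply nested, so this fragment is for illustration only):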

{"location":{"town":"Rome","groupe":"Advanced", 
    "school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}}, 
    "id":"145", 
    "Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2, 
    "Father":{"FatherName":"Peter","FatherAge":"51"}, 
    "Teacher":["MrCrock","MrDaniel"],"Field":"Marketing", 
    "season":["summer","spring"]}

I need to parse this file and extract only some of the key values from each JSON object. The result should be a DataFrame like this:

Groupe    Id  MotherName  FatherName
Advanced  56  Laure       James
Middle    11  Ann         Nicolas
Advanced  6   Helen       Franc

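However, some of the keys I need in the DataFrame are missing from some of the JSON objects, so I have to check whether each key exists and otherwise fill the corresponding value with a null. I proceed as follows: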

import json
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('path/to/file') as f:
    for chunk in f:
        # Each line of the file is one JSON object.
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        # Append one row per line (note: DataFrame.append was removed in pandas 2.0).
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)

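I need to bring the execution time for the whole 1000-line file down to 2 seconds or less. In Perl, the same parsing function takes less than 1 second, but I need to implement it in Python.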

2016-02-26 Amanda

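You will get the best performance if you can build the DataFrame in a single step, at initialization. DataFrame.from_records takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can also parse the data faster with get, which returns a default value when the item is not found. I created an empty dict called dummy to pass as the default for the intermediate get calls, so that the chained get is guaranteed to work.

I created a 1000-record dataset, and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.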
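The answer does not show how the 1000-record test file was produced. As a rough sketch for reproducing the timing, something like the following could generate a comparable file. The helper names make_record and write_test_file, the 10% drop rate, and the field values are all my own assumptions, not part of the original answer.

```python
import json
import random


def make_record(i, rng):
    """Build one record shaped like the sample in the question (hypothetical helper)."""
    record = {
        "location": {"town": "Rome", "groupe": rng.choice(["Advanced", "Middle"])},
        "id": str(i),
        "Mother": {"MotherName": "Helen", "MotherAge": "46"},
        "Father": {"FatherName": "Peter", "FatherAge": "51"},
    }
    # Randomly drop some keys so the missing-key path is exercised too.
    if rng.random() < 0.1:
        del record["location"]["groupe"]
    if rng.random() < 0.1:
        del record["Mother"]["MotherName"]
    return record


def write_test_file(path, n=1000, seed=0):
    """Write n JSON objects to path, one object per line."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for i in range(n):
            f.write(json.dumps(make_record(i, rng)) + "\n")
```

Calling write_test_file('file.json') would create the file that the timing script reads. Absolute timings will of course differ from machine to machine.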

import numpy as np 
import pandas as pd 
import json 
import time 

def extract_data(data):
    """Convert one line of JSON into a tuple of the fields we want."""
    dummy = {}  # default for intermediate lookups, so the chained get() calls work
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
# Column names follow the order of the tuple returned by extract_data.
with open('file.json') as f:
    df = pd.DataFrame.from_records(map(extract_data, f),
                                   columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time() - start)

# 
# The original way 
# 

start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        # Row-by-row append (note: DataFrame.append was removed in pandas 2.0).
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
print('original', time.time() - start)

2016-02-26 07:21:05 tdelaney

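I get "AttributeError: 'list' object has no attribute 'get'" with this approach! Don't forget that my file has one JSON object per line; maybe that is the problem. So I need to iterate over the lines to parse each JSON. – Amanda

That is, the file as a whole is not itself JSON, but each line of the file is valid JSON. – Amanda

It works, except where the value is not a dictionary but nested JSON! How do I use the .get method in that case? @tdelaney – Amanda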
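The AttributeError in the comments above suggests that some intermediate value is a list rather than a dict, so a chained .get fails on it. One possible way around this, sketched here as my own assumption rather than as tdelaney's method, is a small helper that walks the keys and falls back to a default whenever a step is missing or is not a dict. The name get_nested is hypothetical, and math.nan is used in place of np.nan only to keep the sketch free of dependencies (both are the same float NaN, which pandas treats as missing).

```python
import json
import math


def get_nested(obj, *keys, default=math.nan):
    """Follow keys through nested dicts; return default if a step is missing
    or if the current value is not a dict (for example, a list)."""
    current = obj
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def extract_data(line):
    """Convert one line of JSON into a tuple of the wanted fields."""
    jfile = json.loads(line)
    return (
        get_nested(jfile, 'location', 'groupe'),
        get_nested(jfile, 'id'),
        get_nested(jfile, 'Mother', 'MotherName'),
        get_nested(jfile, 'Father', 'FatherName'),
    )
```

This version of extract_data could be passed to DataFrame.from_records in the same way as the one in the answer. Whether silently mapping list values to NaN is what you want depends on the real data; if the list holds the dict you need, you would have to index into it instead.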

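The key point is not to append each row to the DataFrame inside the loop. You want to collect the values in a list or dictionary container and then build the DataFrame from them in one go. You can also simplify your if/else structure with a simple get, which returns a default value (e.g. np.nan) if the item is not found in the dictionary.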

import json
import numpy as np
import pandas as pd

# One list per output column; the keys must match the ones appended to below.
d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
with open('path/to/file') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))

# Build the DataFrame once, after the loop.
df = pd.DataFrame(d)

2016-02-26 06:39:58 Alexander

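Your answer looks good, but I get an error, "TypeError: list indices must be integers, not str", when converting the dictionary to a pandas DataFrame. – Amanda

Sounds like it could be a problem with the data. Try creating a DataFrame from each column and see if you can pinpoint the problem. – Alexander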
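A TypeError about list indices usually means that something expected to be a dict is actually a list. As a rough diagnostic along the lines suggested above, the following sketch scans the lines and reports where the shape differs from what the parsing code assumes. The function name find_unexpected_shapes and the choice of keys to check are my own assumptions, based on the sample record in the question.

```python
import json


def find_unexpected_shapes(lines, nested_keys=('location', 'Mother', 'Father')):
    """Return (line_number, key, type_name) tuples for values that are not dicts.

    key is None when the whole line does not parse to a JSON object.
    """
    problems = []
    for lineno, line in enumerate(lines, start=1):
        record = json.loads(line)
        if not isinstance(record, dict):
            problems.append((lineno, None, type(record).__name__))
            continue
        for key in nested_keys:
            if key in record and not isinstance(record[key], dict):
                problems.append((lineno, key, type(record[key]).__name__))
    return problems
```

Since an open file iterates line by line, passing the file object itself (inside a with open(...) block) should be enough to list the offending line numbers, which can then be inspected by hand.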
