2016-02-29 57 views
0

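Python: parsing nested key-value data

I want to create a Python script that can parse log entries of the following type, which consist of keys and values. Each key may or may not have another set of nested keys and values. An example is below. The nesting depth can vary depending on the log I get, so the parsing has to be dynamic. Each nesting level, however, is enclosed in curly braces.

The string with keys and values looks like this: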

Countries =  { 
    "USA" = 0; 
    "Spain" = 0; 
    Connections = 1; 
    Flights =   { 
     "KLM" = 11; 
     "Air America" = 15; 
     "Emirates" = 2; 
     "Delta" = 3; 
    }; 
    "Belgium" = 1; 
    "Czech Republic" = 0; 
    "Netherlands" = 1; 
    "Hungary" = 0; 
    "Luxembourg" = 0; 
    "Italy" = 0; 

}; 

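The data above can have multiple levels of nesting as well. I want to write a function that parses this and puts it into a data structure (or similar), so that I can get the value of a specific key like this: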

print countries.belgium 
      value should be printed as 1 

Similarly,

print countries.flights.delta 
      value should be printed as 3. 

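Note that not every key in the input is quoted (for example, Connections and Flights are not).

Any pointers on where I can start? Is there a Python library that can parse something like this?

Answers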

1

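I have created a sample Python script that does the job; just tweak it as you like. It converts your format into a nested dictionary, and it handles any nesting depth.

Take a look here: Paste bin. Code: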

import ast 

data = """ { Countries = { USA = 1; "Connections" = { "1 Flights" = 0; "10 Flights" = 0; "11 Flights" = 0; "12 Flights" = 0; "13 Flights" = 0; "14 Flights" = 0; "15 Flights" = 0; "16 Flights" = 0; "17 Flights" = 0; "18 Flights" = 0; "More than 25 Flights" = 0; }; "Single Connections" = 0; "No Connections" = 0; "Delayed" = 0; "Technical Fault" = 0; "Others" = 0; }; }""" 


def arrify(string):
    # Rewrite the "key = value;" syntax as a Python dict literal.
    string = string.replace("=", " : ")
    string = string.replace(";", " , ")
    string = string.replace("\"", "")
    tokens = string.split()
    newArr = []
    quoteClosed = True
    for i, splitStr in enumerate(tokens):
        if i > 0 and not isDelim(splitStr):
            # Open a quote on the first word after a delimiter.
            if isDelim(newArr[i - 1]) and quoteClosed:
                splitStr = "\"" + splitStr
                quoteClosed = False
            # Close the quote on the last word before a delimiter.
            if i + 1 < len(tokens) and isDelim(tokens[i + 1]) and not quoteClosed:
                splitStr += "\""
                quoteClosed = True
        newArr.append(splitStr)

    newString = " ".join(newArr)
    newDict = ast.literal_eval(newString)
    return normalizeDict(newDict)


def isDelim(string):
    # Exact match against the four delimiter tokens.
    return str(string) in ("{", ":", ",", "}")


def normalizeDict(dic):
    # Convert numeric strings to int at every nesting level.
    for key, value in dic.items():
        if isinstance(value, dict):
            dic[key] = normalizeDict(value)
            continue
        dic[key] = normalize(value)
    return dic


def normalize(string):
    try:
        return int(string)
    except (TypeError, ValueError):
        return string


print(arrify(data))

Result from the sample data:

{'Countries': {'USA': 1, 'Technical Fault': 0, 'No Connections': 0, 'Delayed': 0, 'Connections': {'17 Flights': 0, '10 Flights': 0, '11 Flights': 0, 'More than 25 Flights': 0, '14 Flights': 0, '15 Flights': 0, '12 Flights': 0, '18 Flights': 0, '16 Flights': 0, '1 Flights': 0, '13 Flights': 0}, 'Single Connections': 0, 'Others': 0}} 

You can then get values from it just as you would from a normal dictionary :) Hope it helps...
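As an alternative to rewriting the text into a dict literal, here is a minimal sketch of a small recursive-descent parser for the same format. It assumes entries have the form `key = value;`, that keys may be quoted or bare, and that the whole input may or may not be wrapped in an outer pair of braces. The names `tokenize`, `to_scalar`, `parse_entries` and `parse` are hypothetical helpers made up for this illustration, and stray characters that match no token (such as an unbalanced quote) are silently skipped.

```python
import re

# One token is a quoted string, a punctuation mark, or a bare word/number.
TOKEN_RE = re.compile(r'"([^"]*)"|([{}=;])|([^\s{}=;"]+)')


def tokenize(text):
    """Return a list of (kind, value) pairs; kind is 'str', 'punct' or 'word'."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        quoted, punct, word = match.groups()
        if quoted is not None:
            tokens.append(("str", quoted))
        elif punct is not None:
            tokens.append(("punct", punct))
        else:
            tokens.append(("word", word))
    return tokens


def to_scalar(kind, value):
    """Convert bare integer words to int; leave everything else as a string."""
    if kind == "word":
        try:
            return int(value)
        except ValueError:
            pass
    return value


def parse_entries(tokens, pos, top_level):
    """Parse `key = value;` entries from pos; return (dict, next position)."""
    result = {}
    while pos < len(tokens):
        kind, value = tokens[pos]
        if (kind, value) == ("punct", "}"):
            if top_level:
                raise ValueError("unexpected '}'")
            return result, pos + 1
        if kind == "punct":
            raise ValueError("expected a key, got %r" % value)
        key = value
        if pos + 1 >= len(tokens) or tokens[pos + 1] != ("punct", "="):
            raise ValueError("expected '=' after key %r" % key)
        pos += 2
        if pos >= len(tokens):
            raise ValueError("missing value for key %r" % key)
        kind, value = tokens[pos]
        if (kind, value) == ("punct", "{"):
            # A nested block: recurse, whatever the depth.
            parsed, pos = parse_entries(tokens, pos + 1, False)
        elif kind == "punct":
            raise ValueError("expected a value for key %r" % key)
        else:
            parsed = to_scalar(kind, value)
            pos += 1
        # The ';' terminator is treated as optional.
        if pos < len(tokens) and tokens[pos] == ("punct", ";"):
            pos += 1
        result[key] = parsed
    if not top_level:
        raise ValueError("missing closing '}'")
    return result, pos


def parse(text):
    """Parse the whole text into nested dictionaries."""
    tokens = tokenize(text)
    if tokens and tokens[0] == ("punct", "{"):
        # Input wrapped in an outer pair of braces.
        result, _ = parse_entries(tokens, 1, False)
    else:
        result, _ = parse_entries(tokens, 0, True)
    return result


sample = '''
Countries = {
    "USA" = 0;
    Connections = 1;
    Flights = {
        "KLM" = 11;
        "Delta" = 3;
    };
    "Czech Republic" = 0;
};
'''
countries = parse(sample)["Countries"]
print(countries["Flights"]["Delta"])    # prints 3
print(countries["Czech Republic"])      # prints 0
```

Because the tokenizer tracks whether a token was quoted, a quoted key that contains `=` or `;` is not confused with the punctuation itself.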

+0

You really need to include the code in your answer. Just linking to it is not enough. – Blckknght

+0

@richmondwang, exactly what I was looking for. However, this time my dynamic string is as follows, and it gives me a syntax error: – user2605278

+0

What data did you pass? @user2605278 – rrw

1

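Iterate over the data and check whether each element is itself a set of key-value pairs; if it is, call the function recursively. Something like this: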

def parseNestedData(data):
    # Recurse into dictionaries; print every leaf value.
    if isinstance(data, dict):
        for k in data.keys():
            parseNestedData(data.get(k))
    else:
        print(data)

Output:

>>> Countries =  { 
"USA" : 0, 
"Spain" : 0, 
"Connections" : 1, 
"Flights" :   { 
    "KLM" : 11, 
    "Air America" : 15, 
    "Emirates" : 2, 
    "Delta" : 3, 
}, 
"Belgium" : 1, 
"Czech Republic" : 0, 
"Netherlands" : 1, 
"Hungary" : 0, 
"Luxembourg" : 0, 
"Italy" :0 
}; 

>>> Countries 
{'Connections': 1, 
'Flights': {'KLM': 11, 'Air America': 15, 'Emirates': 2, 'Delta': 3}, 
'Netherlands': 1, 
'Italy': 0, 
'Czech Republic': 0, 
'USA': 0, 
'Belgium': 1, 
'Hungary': 0, 
'Luxembourg': 0, 'Spain': 0} 
>>> parseNestedData(Countries) 
1 
11 
15 
2 
3 
1 
0 
0 
0 
1 
0 
0 
0 
+0

Thanks Himanshu. How can I get the value for, say, Czech Republic (it should just return 0)? – user2605278

+0

Doesn't this also need some preprocessing? Not all keys are enclosed in double quotes, e.g. Connections – user2605278

+0

If you know the Czech Republic key exists at the first level, just do `data.get('Czech Republic')` – Himanshu
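For keys below the first level, the same idea can be applied one level at a time. A minimal sketch, assuming the data is already a nested dictionary; `lookup` is a hypothetical helper name, and the case-insensitive matching is an assumption based on the lowercase names used in the question:

```python
def lookup(data, *path):
    """Follow path through nested dicts, matching keys case-insensitively."""
    current = data
    for name in path:
        if not isinstance(current, dict):
            raise KeyError(name)
        for key, value in current.items():
            if key.lower() == name.lower():
                current = value
                break
        else:
            # No key at this level matched the requested name.
            raise KeyError(name)
    return current


countries = {
    "USA": 0,
    "Flights": {"KLM": 11, "Delta": 3},
    "Czech Republic": 0,
    "Belgium": 1,
}
print(lookup(countries, "Czech Republic"))    # prints 0
print(lookup(countries, "flights", "delta"))  # prints 3
```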

1

Defining a class structure to process and store the information could give you something like this:

import re 

class datastruct():
    def __init__(self, data_in):
        # Capture the body of the "Flights = { ... };" block.
        flights = re.findall(r'(?:Flights\s=\s*\{)([\s"A-Z=0-9;a-z]*)};', data_in)
        flight_dict = {}
        for flight in flights[0].split(';')[0:-1]:
            key, val = self.split_data(flight)
            flight_dict[key] = val

        # Every quoted key matches here, so skip the ones already stored as flights.
        countries = re.findall(r'("[A-Za-z]+\s?[A-Za-z]*"\s=\s[0-9]{1,2})', data_in)
        countries_dict = {}
        for country in countries:
            key, val = self.split_data(country)
            if key not in flight_dict:
                countries_dict[key] = val

        connections = re.findall(r'(?:Connections\s=\s)([0-9]*);', data_in)
        self.country = countries_dict
        self.flight = flight_dict
        self.connections = int(connections[0])

    def split_data(self, data2):
        # Split '"key" = value' into a bare key and an integer value.
        item = data2.split('=')
        key = item[0].strip().strip('"')
        val = int(item[1].strip())
        return key, val

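Note that the regexes may need adjusting if the data does not exactly match what I have assumed below. The data can be set up and accessed as follows: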

raw_data = 'Countries =  { "USA" = 0; "Spain" = 0; Connections = 1; Flights =   {  "KLM" = 11;  "Air America" = 15;  "Emirates" = 2;  "Delta" = 3; }; "Belgium" = 1; "Czech Republic" = 0; "Netherlands" = 1; "Hungary" = 0; "Luxembourg" = 0; "Italy" = 0;};' 

flight_data = datastruct(raw_data) 
print("No. Connections:",flight_data.connections) 
print("Country 'USA':",flight_data.country['USA'],'\n')
print("Flight 'KLM':",flight_data.flight['KLM'],'\n') 

for country in flight_data.country.keys(): 
    print("Country: {0} -> {1}".format(country,flight_data.country[country]))
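If the data has already been turned into nested dictionaries, the attribute-style access from the question (`countries.flights.delta`) can be approximated without hard-coding any key names. This is only a sketch under assumptions: `AttrView` is a hypothetical wrapper, keys are matched case-insensitively, and an underscore in the attribute name is treated as a space so that "Czech Republic" is reachable.

```python
class AttrView:
    """Read-only attribute access over a nested dict (hypothetical helper)."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        wanted = name.replace("_", " ").lower()
        for key, value in self._data.items():
            if key.lower() == wanted:
                # Wrap nested dicts so that chained access keeps working.
                return AttrView(value) if isinstance(value, dict) else value
        raise AttributeError(name)


countries = AttrView({
    "Belgium": 1,
    "Czech Republic": 0,
    "Flights": {"KLM": 11, "Delta": 3},
})
print(countries.belgium)          # prints 1
print(countries.flights.delta)    # prints 3
print(countries.czech_republic)   # prints 0
```

Keys that differ only by case, or that contain characters not allowed in identifiers, would still need plain dictionary access.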