2016-08-19 105 views
1

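Python: Merge two CSV files into multi-level JSON

I'm very new to Python/JSON, so please bear with me. I can do this in R, but we need to use Python so that it can be moved to Python/Spark/MongoDB. Also, I'm only posting a minimal subset. I have more file types, so if someone can help me with this, I can build on it to integrate more files and file types.

Back to my question:

I have two TSV input files that I need to merge and convert to JSON. Both files have gene and sample columns plus some additional columns. However, the gene and sample values may or may not overlap, as shown below: f2.tsv has all the genes in f1.tsv but also an extra gene, g3. Likewise, the two files have both overlapping and non-overlapping values in the sample column.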

# f1.tsv – has gene, sample and additional column other1 

$ cat f1.tsv 
gene sample other1 
g1  s1  a1 
g1  s2  b1 
g1  s3a  c1 
g2  s4  d1 

# f2.tsv – has gene, sample and additional columns other21, other22 

$ cat f2.tsv 
gene sample other21 other22 
g1  s1  a21  a22 
g1  s2  b21  b22 
g1  s3b  c21  c22 
g2  s4  d21  d22 
g3  s5  f21  f22 

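The gene forms the top level. Each gene has multiple samples, which form the second level, and the remaining columns form extras, which is the third level. The extras are split into two parts because one file has other1 and the second file has other21 and other22. Other files that I will include later will have further fields such as other31, other32 and so on, but they will still have the gene and sample columns.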
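
To make the three levels concrete, here is a minimal sketch of how a single row maps onto them. It assumes the row has already been split into fields; the helper name `split_row` is hypothetical and not part of the original post.

```python
def split_row(headers, row):
    """Split one row into (gene, sample, extras).

    `split_row` is a hypothetical helper used for illustration only.
    """
    record = dict(zip(headers, row))
    gene = record.pop("gene")      # level 1
    sample = record.pop("sample")  # level 2
    # Whatever columns remain become one entry of "extras" (level 3).
    return gene, sample, record


gene, sample, extras = split_row(
    ["gene", "sample", "other21", "other22"],
    ["g1", "s1", "a21", "a22"],
)
print(gene, sample, extras)
```

Each input file contributes at most one such extras dictionary per gene/sample pair, which is why a sample's "extras" in the expected output is a list.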

# expected output – JSON by combining both tsv files. 
$ cat output.json 
[{ 
    "gene":"g1", 
    "samples":[ 
    { 
     "sample":"s2", 
     "extras":[ 
     { 
      "other1":"b1" 
     }, 
     { 
      "other21":"b21", 
      "other22":"b22" 
     } 
     ] 
    }, 
    { 
     "sample":"s1", 
     "extras":[ 
     { 
      "other1":"a1" 
     }, 
     { 
      "other21":"a21", 
      "other22":"a22" 
     } 
     ] 
    }, 
    { 
     "sample":"s3b", 
     "extras":[ 
     { 
      "other21":"c21", 
      "other22":"c22" 
     } 
     ] 
    }, 
    { 
     "sample":"s3a", 
     "extras":[ 
     { 
      "other1":"c1" 
     } 
     ] 
    } 
    ] 
},{ 
    "gene":"g2", 
    "samples":[ 
    { 
     "sample":"s4", 
     "extras":[ 
     { 
      "other1":"d1" 
     }, 
     { 
      "other21":"d21", 
      "other22":"d22" 
     } 
     ] 
    } 
    ] 
},{ 
    "gene":"g3", 
    "samples":[ 
    { 
     "sample":"s5", 
     "extras":[ 
     { 
      "other21":"f21", 
      "other22":"f22" 
     } 
     ] 
    } 
    ] 
}] 

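How do I convert two CSV files into a single multi-level JSON based on two common columns?

I'd really appreciate any help I can get with this.

Thanks!

Answers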

2

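Here's another way to do it. I tried to keep it manageable for when you start adding more files. You can run it from the command line and pass one argument for each file you want to add. The gene/sample names are stored in dictionaries for efficiency. The shape of the JSON object you want is produced in each class's format() method. Hope this helps.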

import csv, json, sys

class Sample(object):
    def __init__(self, name, extras):
        self.name = name
        self.extras = [extras]

    def format(self):
        result = {}
        result['sample'] = self.name
        result['extras'] = self.extras
        return result

    def add_extras(self, extras):
        # edit 8/20
        # always just add the new extras to the list
        for extra in extras:
            self.extras.append(extra)

class Gene(object):
    def __init__(self, name, samples):
        self.name = name
        self.samples = samples

    def format(self):
        result = {}
        result['gene'] = self.name
        result['samples'] = sorted([self.samples[sample_key].format() for sample_key in self.samples], key=lambda sample: sample['sample'])
        return result

    def create_or_add_samples(self, new_samples):
        # loop through new samples, seeing if they already exist in the gene object
        for sample_name in new_samples:
            sample = new_samples[sample_name]
            if sample.name in self.samples:
                self.samples[sample.name].add_extras(sample.extras)
            else:
                self.samples[sample.name] = sample

class Genes(object):
    def __init__(self):
        self.genes = {}

    def format(self):
        return sorted([self.genes[gene_name].format() for gene_name in self.genes], key=lambda gene: gene['gene'])

    def create_or_add_gene(self, gene):
        if gene.name not in self.genes:
            self.genes[gene.name] = gene
        else:
            self.genes[gene.name].create_or_add_samples(gene.samples)

def row_to_gene(headers, row):
    gene_name = ""
    sample_name = ""
    extras = {}
    for index, value in enumerate(row):
        if headers[index] == "gene":
            gene_name = value
        elif headers[index] == "sample":
            sample_name = value
        else:
            extras[headers[index]] = value
    sample_dict = {}
    sample_dict[sample_name] = Sample(sample_name, extras)
    return Gene(gene_name, sample_dict)

if __name__ == '__main__':
    delim = "\t"
    genes = Genes()
    files = sys.argv[1:]

    for file in files:
        print("Reading " + str(file))
        with open(file, 'r') as f1:
            reader = csv.reader(f1, delimiter=delim)
            headers = []
            for row in reader:
                if len(headers) == 0:
                    headers = row
                else:
                    genes.create_or_add_gene(row_to_gene(headers, row))

    result = json.dumps(genes.format(), indent=4)
    print(result)
    with open('json_output.txt', 'w') as output:
        output.write(result)
+0

It works great. I really like that you made it so general: I can specify the delimiter as well as any number of files. This is incredible! –

+0

I have just one problem: for g1/s1 it shows `"extras": [ { "other1": "a1" }, [ { "other22": "a22", "other21": "a21" } ] ]` and I want to remove the extra inner square brackets. –

+0

@KomalRathi Oops, sorry. I've edited the answer to fix that. – gregbert
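
The nested brackets described in the comment above are what Python produces when a whole list is appended to another list as a single element. Here is a minimal sketch of the difference, using made-up values that mirror the g1/s1 case. It illustrates the general list behavior only; the answer's code from before the edit is not shown in this thread, so this is not a reconstruction of it.

```python
first = {"other1": "a1"}
second = [{"other21": "a21", "other22": "a22"}]  # extras arriving from the second file

nested = [first]
nested.append(second)  # the whole list becomes one element: inner brackets appear
print(nested)

flat = [first]
flat.extend(second)    # each dict is added individually: no inner brackets
print(flat)
```

Appending the items one at a time in a loop has the same effect as `extend`.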

2

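This looks like a job for pandas! Unfortunately, pandas can only take us so far, and after that we have to do some of the manipulation ourselves. This code is neither fast nor particularly efficient, but it gets the job done.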

import pandas as pd
import json
from collections import defaultdict

# here we import the tsv files as pandas df
# (sep=r'\s+' splits on any run of whitespace, like delim_whitespace=True)
f1 = pd.read_csv('f1.tsv', sep=r'\s+')
f2 = pd.read_csv('f2.tsv', sep=r'\s+')

# we then let pandas merge them
newframe = f1.merge(f2, how='outer', on=['gene', 'sample'])

# have pandas write them out to a json, and then read them back in as a
# python object (a list of dicts); missing values come back as None
pythonList = json.loads(newframe.to_json(orient='records'))


newDict = {}
for d in pythonList:
    gene = d['gene']
    sample = d['sample']
    sampleDict = {'sample': sample,
                  'extras': []}

    extrasdict = defaultdict(dict)

    if gene not in newDict:
        newDict[gene] = {'gene': gene, 'samples': []}

    for key, value in d.items():
        if 'other' not in key or value is None:
            continue
        else:
            # group the columns by source file: 'other1' -> '1', 'other21' -> '2'
            suffix = key.split('other')[-1]
            if len(suffix) == 1:
                extrasdict['1'][key] = value
            else:
                extrasdict['{}'.format(suffix[0])][key] = value

    for value in extrasdict.values():
        sampleDict['extras'].append(value)

    newDict[gene]['samples'].append(sampleDict)

newList = list(newDict.values())

print(json.dumps(newList))

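If this looks like a solution that works for you, I'd be happy to spend some time cleaning it up to make it a bit more readable and efficient.

PS: If you like R, then pandas is the way to go (it was written to give Python an R-like interface to data).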
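
The least obvious part of the code above is how the `other*` columns are grouped into separate `extras` entries. Here is a standalone sketch of that step in plain Python. It assumes every extra column is named `other` followed by digits and that the first digit identifies the source file. That assumption is drawn from the sample data, where `other1` comes from f1.tsv and `other21`/`other22` come from f2.tsv. The helper name `group_extras` is hypothetical. Unlike the code above, which files every single-digit suffix under group '1', this sketch always keys on the first digit.

```python
from collections import defaultdict


def group_extras(record):
    """Group 'otherNN' columns by the first character after 'other'.

    `group_extras` is a hypothetical helper; the naming scheme is an assumption.
    """
    groups = defaultdict(dict)
    for key, value in record.items():
        if not key.startswith("other") or value is None:
            continue  # skip gene/sample and values missing after an outer merge
        suffix = key[len("other"):]
        groups[suffix[:1]][key] = value
    # Sort by group id so the order of the extras is stable.
    return [groups[group_id] for group_id in sorted(groups)]


row = {"gene": "g1", "sample": "s1",
       "other1": "a1", "other21": "a21", "other22": "a22"}
print(group_extras(row))
```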

+0

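This solution is perfect! I only accepted gregbert's answer because his code lets me specify as many input files as I need as well as the delimiter. Thank you very much. –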

1

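Do it in steps:

  1. Read the incoming TSV files and aggregate the information for the different genes into a dictionary.
  2. Process that dictionary to match your desired format.
  3. Write the result to a JSON file.

Here is the code: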

import csv
import json
from collections import defaultdict

input_files = ['f1.tsv', 'f2.tsv']
output_file = 'genes.json'

# Step 1
gene_dict = defaultdict(lambda: defaultdict(list))
for file in input_files:
    with open(file, 'r') as f:
        reader = csv.DictReader(f, delimiter='\t')
        for line in reader:
            gene = line.pop('gene')
            sample = line.pop('sample')
            gene_dict[gene][sample].append(line)

# Step 2
out = [{'gene': gene,
        'samples': [{'sample': sample, 'extras': extras}
                    for sample, extras in samples.items()]}
       for gene, samples in gene_dict.items()]

# Step 3
with open(output_file, 'w') as f:
    json.dump(out, f)
+0

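This solution is perfect! I only accepted gregbert's answer because his code lets me specify as many input files as I need as well as the delimiter. Thank you very much. –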

+0

Note that both of those are also easy to do with my code: to add more input files, append their names to the `input_files` list; to change the delimiter, edit line 12. –
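
One way to make both of those knobs explicit is to wrap the three steps in a function. The following is only a sketch under a few assumptions: the name `merge_tsv` and its parameters are hypothetical, and the sorting of genes and samples is an addition here (the answer's code keeps first-seen order) so that the output order is predictable.

```python
import csv
import json
from collections import defaultdict


def merge_tsv(input_files, delimiter="\t"):
    """Merge delimited files on (gene, sample) into the nested structure.

    `merge_tsv` is a hypothetical wrapper around the steps shown above.
    """
    gene_dict = defaultdict(lambda: defaultdict(list))
    for path in input_files:
        with open(path, newline="") as f:
            for line in csv.DictReader(f, delimiter=delimiter):
                gene = line.pop("gene")
                sample = line.pop("sample")
                # The remaining columns form one extras entry for this file.
                gene_dict[gene][sample].append(dict(line))
    return [
        {"gene": gene,
         "samples": [{"sample": sample, "extras": extras}
                     for sample, extras in sorted(samples.items())]}
        for gene, samples in sorted(gene_dict.items())
    ]
```

Calling `merge_tsv(['f1.tsv', 'f2.tsv'])` and passing the result to `json.dump` would then cover steps 1 to 3, and a comma-separated input would only need `delimiter=','`.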