2017-04-24 57 views
0

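I'm not sure the question title is appropriate, but I've tried to explain the problem below. If you can think of a better title, please suggest one. How do I find the matches between two lists and write output based on those matches?

Say I have two kinds of list data: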

list_headers = ['gene_id', 'gene_name', 'trans_id'] 
# these are the features to be mined from each line of `attri_values` 

attri_values = 

['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'] 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'] 
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"'] 

I'm trying to build a table by matching each entry in list_headers against the attributes in attri_values.

output = open('gtf_table', 'w') 
output.write('\t'.join(list_headers) + '\n') # this will first write the header 

# then I want to read each line 
for values in attri_values:
    for list in list_headers:
        if values.startswith(list):
            attr_id = ''.join([x for x in attri_values if list in x])
            attr_id = attr_id.replace('"', '').split(' ')[1]
            output.write('\t' + '\t'.join([attr_id]))

        elif not values.startswith(list):
            attr_id = 'NA'
            output.write('\t' + '\t'.join([attr_id]))

        output.write('\n')

Problem: when a string from list_headers matches a value in attri_values, everything works well, but when there is no match, a lot of repeated 'NA' values are written.

Final expected result:

gene_id gene_name trans_id 
scaffold_200001.1 NA NA 
scaffold_200001.1 NA scaffold_200001.1 
scaffold_200002.1 NA scaffold_200002.1 

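Post edit: The problem is in how I wrote my elif (every non-match writes 'NA'). I tried moving the NA condition around in different ways, without success. If I remove the elif, I get the following output (the NAs are missing):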

gene_id gene_name trans_id 
scaffold_200001.1 
scaffold_200001.1 scaffold_200001.1 
scaffold_200002.1 scaffold_200002.1 

Answers

1

Python strings have a find method that you can use to look for each list header in each entry of attri_values. Try this function:

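def Get_Match(search_space, search_string):
    # str.find returns -1 when search_string does not occur in search_space
    start_character = search_space.find(search_string)

    if start_character == -1:
        return "N/A"
    else:
        # return everything that follows the matched header
        return search_space[(start_character + len(search_string)):]

# attri_values_1 is assumed to be one row of attri_values (a list of strings)
for value in attri_values_1:
    for header in list_headers:
        print(Get_Match(value, header))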
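
To turn a per-string lookup like this into table rows, one option is to search each row once per header and fall back to a placeholder only when no entry in the row matches. The sketch below is an assumption about how the pieces could fit together: `first_match` and `build_row` are hypothetical helper names, the sample row is shortened, and it uses `startswith` rather than `find` so that a header cannot match in the middle of another key.

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']

# One row of the question's data (a list of 'key "value"' strings), shortened.
row = ['gene_id "scaffold_200001.1"', 'gene_version "1"',
       'trans_id "scaffold_200001.1"', 'gene_source "jgi"']


def first_match(row, header, missing='NA'):
    """Return the value of the first entry in row whose key is header."""
    for entry in row:
        # Require a trailing space so 'gene_id' cannot match a longer key.
        if entry.startswith(header + ' '):
            return entry[len(header):].strip().replace('"', '')
    return missing  # reached only when no entry in the row matched


def build_row(row, headers):
    """Return one value per header, in header order."""
    return [first_match(row, header) for header in headers]


print('\t'.join(build_row(row, list_headers)))
```

Because the placeholder is returned once per header rather than once per non-matching string, each output line has exactly as many columns as there are headers.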
1

My answer uses pandas.

import pandas as pd 

# input data 
list_headers = ['gene_id', 'gene_name', 'trans_id'] 

attri_values = [ 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'], 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'], 
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']] 

# process input data 
attri_values_X = [dict([tuple(b.split())[:2] for b in a]) for a in attri_values] 

# Create DataFrame with the desired columns 
df = pd.DataFrame(attri_values_X, columns=list_headers) 

# print dataframe 
print(df)

Output

               gene_id  gene_name             trans_id
0  "scaffold_200001.1"        NaN                  NaN
1  "scaffold_200001.1"        NaN  "scaffold_200001.1"
2  "scaffold_200002.1"        NaN  "scaffold_200002.1"

It is easy without pandas as well. I have already given you attri_values_X, so you are almost there: just drop the keys you don't want from each dictionary.
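
A minimal sketch of that pandas-free route, assuming `attri_values` is a list of lists as above (shortened here) and that missing columns should be written as `NA`. Instead of deleting unwanted keys, it looks up only the wanted headers with `dict.get`; the name `to_table` is hypothetical.

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']

attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200001.1"', 'trans_id "scaffold_200001.1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"'],
]


def to_table(rows, headers, missing='NA'):
    """Return tab-separated lines: a header line, then one line per row."""
    lines = ['\t'.join(headers)]
    for row in rows:
        # 'key "value"' -> {key: value}, with the surrounding quotes stripped
        record = {}
        for entry in row:
            key, value = entry.split(None, 1)
            record[key] = value.strip('"')
        # Look up only the wanted headers; absent ones become the placeholder.
        lines.append('\t'.join(record.get(h, missing) for h in headers))
    return lines


table = to_table(attri_values, list_headers)
print('\n'.join(table))
```

The resulting lines can be written to a file opened with `open('gtf_table', 'w')`, one per line, in place of the `print` call.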

1

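I managed to write a function that will help parse your data. I tried to modify the original code you posted. What complicates things here is the way you store the data that needs to be parsed; anyway, I'm not in a position to judge. Here is my code: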

def searchHeader(title, values):
    r"""
    searchHeader(title, values) --> list

    Return all the words of the strings in an iterable object in which title is a substring,
    without including title. Else write 'N\A' for strings in which title is not a substring.
    Example:
      >>> seq = ['spam and ham', 'spam is awesome', 'Ham is...!', 'eat cake but not pizza']
      >>> searchHeader('spam', seq)
      ['and', 'ham', 'is', 'awesome', 'N\\A', 'N\\A']
    """
    res = []
    for x in values:
        if title in x:
            res.append(x)
        else:
            res.append('N\\A')      # If no match is found, append N\A for that string

    res = ' '.join(res)
    # res = res.replace('"', '')     You can use this for your code or use it after you call the function on res
    res = res.split(' ')
    res = [x for x in res if x != title]   # Remove title string from res
    return res

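Regular expressions can also come in handy in this case. Use this function to parse the data, then format the result to write it to the file as a table. This function uses only one for loop and one list comprehension, whereas your code uses two nested for loops and a list comprehension.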
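
As a rough illustration of the regular-expression route, the sketch below assumes every entry has the form `key "value"`. The pattern and the name `parse_entries` are assumptions made for this example, not part of the function above.

```python
import re

# A key, some whitespace, then a double-quoted value
ENTRY_RE = re.compile(r'^(\w+)\s+"([^"]*)"$')


def parse_entries(row):
    """Map each key to its unquoted value; entries that do not match are skipped."""
    parsed = {}
    for entry in row:
        match = ENTRY_RE.match(entry.strip())
        if match:
            parsed[match.group(1)] = match.group(2)
    return parsed


row = ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'malformed']
parsed = parse_entries(row)
print([parsed.get(h, 'NA') for h in ['gene_id', 'gene_name', 'trans_id']])
```

Capturing the key and the value in separate groups removes the need to strip quotes or split on spaces afterwards.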

Pass each header string to the function separately, as follows:

for title in list_headers:
    result = searchHeader(title, attri_values)
    # ...format as table...
    # ...write to file...

If possible, consider moving your attri_values from a plain list to a dictionary, so that you can group the strings by their headers:

attri_values = {'header': ('data1', 'data2',...)} 

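In my opinion, this is better than using a list. Also note that the name list in your code shadows a built-in, which is not a good thing, because list is actually the built-in class that creates lists.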
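
One way to read the dictionary suggestion, sketched under the assumption that values should be grouped per header across all rows. The name `group_by_header` is hypothetical, the sample rows are shortened, and the loop variable is called `header` rather than `list` so the built-in is not shadowed.

```python
from collections import defaultdict


def group_by_header(rows):
    """Collect the values of every key across rows: {key: (value, ...)}."""
    grouped = defaultdict(list)
    for row in rows:
        for entry in row:
            header, value = entry.split(None, 1)
            grouped[header].append(value.strip('"'))
    # Freeze the lists into tuples, matching the layout suggested above.
    return {header: tuple(values) for header, values in grouped.items()}


rows = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"'],
]
grouped = group_by_header(rows)
print(grouped['gene_id'])
```

Note that grouping this way discards which row each value came from, so it suits per-header summaries better than a row-by-row table.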

+0

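Thanks for the answer. Using a dictionary would be complicated, because this is only a small part of a large dataset. I think a simple nested for loop will solve it. By the way, I get a 'TypeError' from 'result = searchHeader(list_headers, attri_values)' – everestial007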

+0

@everestial007 My bad! I should have passed 'title' rather than 'list_headers' to the function: 'result = searchHeader(title, attri_values)'. This is probably what happens when you write code late at night :P – direprobs

+0

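I understand the consequences of too much computer time and/or sleepiness. By the way, the code still doesn't solve the problem for me. I tried changing a few things: for example, instead of 'if title in x:' I think it should be 'if x.startswith(title)', because there won't be a hit when comparing against a list unless the whole string matches. I also tried changing other things, but no luck. Can you give me a complete working example, if possible? Please upvote the question so that it gets more attention. – everestial007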
