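How to find matches between two lists and write output based on the matches?

I'm not sure I worded the question title appropriately, but I have tried to explain the problem below. If you understand the problem, please suggest a more suitable title.

Say I have two kinds of list data: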

list_headers = ['gene_id', 'gene_name', 'trans_id'] 
# these are the features to be mined from each line of `attri_values` 

attri_values = 

['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'] 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'] 
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']

I am trying to build a table based on the matches between the items in list_headers and the attributes in attri_values.

output = open('gtf_table', 'w') 
output.write('\t'.join(list_headers) + '\n') # this will first write the header 

# then I want to read each line 
for values in attri_values: 
    for list in list_headers: 
     if values.startswith(list): 
      attr_id = ''.join([x for x in attri_values if list in x]) 
      attr_id = attr_id.replace('"', '').split(' ')[1] 
      output.write('\t' + '\t'.join([attr_id])) 

     elif not values.startswith(list): 
      attr_id = 'NA' 
      output.write('\t' + '\t'.join([attr_id])) 

     output.write('\n')

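Problem: when a string from list_headers matches the values in attri_values, everything works fine, but when there is no match, a lot of duplicate "NA" entries are written.

Final expected result: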

gene_id gene_name trans_id 
scaffold_200001.1 NA NA 
scaffold_200001.1 NA scaffold_200001.1 
scaffold_200002.1 NA scaffold_200002.1

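Post edit: The problem is how I wrote my elif (because every non-match writes "NA"). I tried moving the NA condition around in different ways, but without success. If I remove the elif, I get the following output (the NA values are missing):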

gene_id gene_name trans_id 
scaffold_200001.1 
scaffold_200001.1 scaffold_200001.1 
scaffold_200002.1 scaffold_200002.1

Source

2017-04-24 everestial007

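Python strings have a find method, which you can use to search each entry of attri_values for each list header. Try this function:

def Get_Match(search_space, search_string):
    # str.find returns the index of the first match, or -1 if there is none
    start_character = search_space.find(search_string)

    if start_character == -1:
        return "N/A"
    else:
        # everything after the matched header (still includes the space and the quotes)
        return search_space[(start_character + len(search_string)):]

# attri_values_1 is assumed to be one row of attri_values (a list of strings)
for i in range(len(attri_values_1)):
    for j in range(len(list_headers)):
        print(Get_Match(attri_values_1[i], list_headers[j]))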

Source

2017-04-24 20:01:18

My answer uses pandas:

import pandas as pd 

# input data 
list_headers = ['gene_id', 'gene_name', 'trans_id'] 

attri_values = [ 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'], 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'], 
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']] 

# process input data 
attri_values_X = [dict([tuple(b.split())[:2] for b in a]) for a in attri_values] 

# Create DataFrame with the desired columns 
df = pd.DataFrame(attri_values_X, columns=list_headers) 

# print dataframe 
print(df)

Output

   gene_id gene_name    trans_id 
0 "scaffold_200001.1"  NaN     NaN 
1 "scaffold_200001.1"  NaN "scaffold_200001.1" 
2 "scaffold_200002.1"  NaN "scaffold_200002.1"

It is also easy without pandas. I have already given you attri_values_X, so you are almost there: just drop the keys you don't want from each dictionary.
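A minimal sketch of the pandas-free route, under some assumptions: the sample rows are shortened, every attribute has the form `key "value"`, and the helper names `row_to_dict` and `build_table` are made up for illustration (they are not part of the answer above). Each row becomes a dictionary, and `dict.get` supplies 'NA' exactly once per missing header.

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']

# Shortened sample rows in the same 'key "value"' format as the question
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200001.1"', 'trans_id "scaffold_200001.1"', 'exon_number "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"', 'exon_number "3"'],
]

def row_to_dict(row):
    """Turn ['key "value"', ...] into {'key': 'value', ...} (hypothetical helper)."""
    parsed = {}
    for item in row:
        key, value = item.split(None, 1)   # split on the first run of whitespace
        parsed[key] = value.strip('"')     # drop the surrounding quotes
    return parsed

def build_table(rows, headers, missing='NA'):
    """Return a tab-separated table with one line per row (hypothetical helper)."""
    lines = ['\t'.join(headers)]
    for row in rows:
        parsed = row_to_dict(row)
        # One cell per header: the parsed value, or `missing` if the key is absent
        lines.append('\t'.join(parsed.get(h, missing) for h in headers))
    return '\n'.join(lines) + '\n'

table = build_table(attri_values, list_headers)
print(table)
```

Writing the result to disk would then be a single `write(table)` call on a file opened with `open('gtf_table', 'w')`, ideally inside a `with` block so the file is closed.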

Source

2017-04-24 21:11:16 Elmex80s

I managed to write a function that will help you parse your data. I tried to modify the original code you posted; what complicates things here is the way you store the data that needs to be parsed. Anyway, I'm not in a position to judge. Here is my code:

def searchHeader(title, values):
    r"""
    searchHeader(title, values) --> list

    Return all the words of the strings in an iterable object in which title is a substring,
    without including title. Otherwise write 'N\A' for strings in which title is not a substring.
    Example:
      >>> seq = ['spam and ham', 'spam is awesome', 'Ham is...!', 'eat cake but not pizza']
      >>> searchHeader('spam', seq)
      ['and', 'ham', 'is', 'awesome', 'N\\A', 'N\\A']
    """
    res = []
    for x in values:
        if title in x:
            res.append(x)
        else:
            res.append('N\\A')      # If no match found, append N\A for every string in values

    res = ' '.join(res)
    # res = res.replace('"', '')     You can use this for your code or use it after you call the function on res
    res = res.split(' ')
    res = [x for x in res if x != title]   # Remove title string from res
    return res

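Regular expressions can also be handy in this case. Use this function to parse the data, then format the result and write it to the file as a table. This function uses only one for loop and one list comprehension, whereas your code uses two nested for loops and a list comprehension.

Pass each header string to the function individually, like this: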

for title in list_headers: 
    result = searchHeader(title, attri_values) 
    ...format as table... 
    ...write to file...
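The regular-expression idea mentioned above is not spelled out in the answer, so the following is only a sketch of one way it might look. The pattern and the names `ATTR_PATTERN`, `parse_attribute`, and `extract` are assumptions for illustration, and it assumes each attribute is a key followed by a double-quoted value.

```python
import re

# Hypothetical pattern: a key, whitespace, then a double-quoted value
ATTR_PATTERN = re.compile(r'^(\w+)\s+"([^"]*)"$')

def parse_attribute(item):
    """Return (key, value) for a string like 'gene_id "x"', or None if it does not match."""
    match = ATTR_PATTERN.match(item.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

def extract(row, headers, missing='NA'):
    """Return the value for each header in order, or `missing` when the header is absent."""
    pairs = (parse_attribute(item) for item in row)
    found = dict(p for p in pairs if p is not None)
    return [found.get(h, missing) for h in headers]

row = ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"']
print(extract(row, ['gene_id', 'gene_name', 'trans_id']))
# prints ['scaffold_200001.1', 'NA', 'scaffold_200001.1']
```

Matching the whole `key "value"` shape, rather than testing for a substring, avoids accidental hits such as a header name that appears inside another attribute's value.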

If possible, consider changing attri_values from a plain list to a dictionary, so that you can group the strings under their headers:

attri_values = {'header': ('data1', 'data2',...)}

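In my opinion, this is better than using a list. Also note that the name list in your code shadows the built-in, which is not a good thing, because list is actually the built-in class that creates lists.

Source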
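As a rough illustration of both points, here is one way the grouping could be done, assuming shortened sample rows in the `key "value"` format; the variable names `grouped` and `item` are illustrative choices, not taken from the answer. The loop variable is deliberately not called `list`, so the built-in remains usable.

```python
from collections import defaultdict

# Shortened sample rows in the same format as the question
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"'],
]

grouped = defaultdict(list)               # header -> list of values seen for it
for row in attri_values:
    for item in row:                      # `item`, not `list`: the built-in is not shadowed
        header, value = item.split(None, 1)
        grouped[header].append(value.strip('"'))

print(dict(grouped))
# prints {'gene_id': ['scaffold_200001.1', 'scaffold_200002.1'], 'gene_version': ['1'], 'trans_id': ['scaffold_200002.1']}
```

One caveat with this layout: grouping by header alone loses track of which row each value came from, so for the table in the question a dictionary per row is probably the better fit.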

2017-04-24 21:27:32 direprobs

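Thanks for the answer. Using a dictionary would be complicated, because this is only a small part of a large data set. I think a simple nested for loop will solve it. By the way, I get a TypeError on `result = searchHeader(list_headers, attri_values)` – everestial007

@everestial007 My bad! I should have passed `title` rather than `list_headers` to the function: `result = searchHeader(title, attri_values)`. That's probably the result of writing code late at night :P – direprobs

I understand the consequences of too much computer time and/or being sleepy. By the way, the code still doesn't solve the problem for me. I tried changing a few things: for example, instead of `if title in x:` I think it should be `if x.startswith(title)`, because there won't be a hit when comparing against a list unless the whole string matches. I tried changing other things too, but no luck. Could you give me a complete working example, if that is possible? And please upvote the question so that it gets more attention. – everestial007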
