2013-05-12 103 views

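I am trying to adapt an existing script that uses Biopython to retrieve information about a species' phylum. The script was written to look up one species at a time, and I would like to modify it so that I can process 100 organisms in one run. Here is the original code: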

import sys 
from Bio import Entrez 

def get_tax_id(species): 
    """to get data from ncbi taxonomy, we need to have the taxid. we can 
    get that by passing the species name to esearch, which will return 
    the tax id""" 
    species = species.replace(" ", "+").strip() 
    search = Entrez.esearch(term = species, db = "taxonomy", retmode = "xml") 
    record = Entrez.read(search) 
    return record['IdList'][0] 

def get_tax_data(taxid): 
    """once we have the taxid, we can fetch the record""" 
    search = Entrez.efetch(id = taxid, db = "taxonomy", retmode = "xml") 
    return Entrez.read(search) 

Entrez.email = "" 
if not Entrez.email: 
    print "you must add your email address" 
    sys.exit(2) 
taxid = get_tax_id("Erodium carvifolium") 
data = get_tax_data(taxid) 
lineage = {d['Rank']:d['ScientificName'] for d in 
    data[0]['LineageEx'] if d['Rank'] in ['family', 'order']} 

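I have managed to modify the script so that it accepts a local file containing one of the organisms I am working with, but I need to extend it to 100 organisms. The idea is to build a list from my organisms file and somehow feed each item of that list, one at a time, into the line taxid = get_tax_id("Erodium carvifolium"), replacing "Erodium carvifolium" with my organism name. I don't know how to do that.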

Here is a sample version of the code with some of my changes:

import sys 
from Bio import Entrez 


def get_tax_id(species): 
    """to get data from ncbi taxonomy, we need to have the taxid. we can 
    get that by passing the species name to esearch, which will return 
    the tax id""" 
    species = species.replace(' ', "+").strip() 
    search = Entrez.esearch(term = species, db = "taxonomy", retmode = "xml") 
    record = Entrez.read(search) 
    return record['IdList'][0] 

def get_tax_data(taxid): 
    """once we have the taxid, we can fetch the record""" 
    search = Entrez.efetch(id = taxid, db = "taxonomy", retmode = "xml") 
    return Entrez.read(search) 

Entrez.email = "" 
if not Entrez.email: 
    print "you must add your email address" 
    sys.exit(2) 
list = ['Helicobacter pylori 26695', 'Thermotoga maritima MSB8', 'Deinococcus radiodurans R1', 'Treponema pallidum subsp. pallidum str. Nichols', 'Aquifex aeolicus VF5', 'Archaeoglobus fulgidus DSM 4304'] 
i = iter(list) 
item = i.next() 
for item in list: 
    ??? 
taxid = get_tax_id(?) 
data = get_tax_data(taxid) 
lineage = {d['Rank']:d['ScientificName'] for d in 
    data[0]['LineageEx'] if d['Rank'] in ['phylum']} 
print lineage, taxid 

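The question marks show where I am stuck. I don't understand how to connect my loop so that it replaces the ? in get_tax_id(?). Or do I need to somehow rewrite each item in the list so that it becomes a call such as get_tax_id(Helicobacter pylori 26695), and then find a way to place those calls in the line that starts with taxid = ?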

You should ask on biostars: http://www.biostars.org/ – Pierre 2013-05-12 17:51:17

Thanks for the advice – user2374216 2013-05-12 23:09:46

Answers

Here is what you need. Put it below your function definitions, that is, after the line that says sys.exit(2):

species_list = ['Helicobacter pylori 26695', 'Thermotoga maritima MSB8', 'Deinococcus radiodurans R1', 'Treponema pallidum subsp. pallidum str. Nichols', 'Aquifex aeolicus VF5', 'Archaeoglobus fulgidus DSM 4304'] 

taxid_list = [] # Initialize the lists that will hold the results 
data_list = [] 
lineage_list = [] 

print('parsing taxonomic data...') # message declaring the parser has begun 

for species in species_list: 
    print ('\t'+species) # progress messages 

    taxid = get_tax_id(species) # Apply your functions 
    data = get_tax_data(taxid) 
    lineage = {d['Rank']:d['ScientificName'] for d in data[0]['LineageEx'] if d['Rank'] in ['phylum']} 

    taxid_list.append(taxid) # Append the data to lists already initiated 
    data_list.append(data) 
    lineage_list.append(lineage) 

print('complete!')
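
The question also mentions building the list from a local file of organism names. A minimal sketch of that step follows, assuming the file holds one name per line. The helper names read_species and collect_results are hypothetical and are not part of Biopython. The lookup is passed in as a function so that the loop can be tried out without contacting NCBI. Results are kept in a dict keyed by species name instead of three parallel lists, and a name that yields no result (for example an empty IdList, which makes record['IdList'][0] raise IndexError) is stored as None rather than stopping the run.

```python
def read_species(path):
    """Return the non-empty lines of a text file, one organism name per line.

    Hypothetical helper: assumes one organism name per line.
    """
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def collect_results(species_list, lookup):
    """Call lookup(name) for every name and return {name: result}.

    Hypothetical helper. `lookup` is any function that takes a species name.
    An IndexError or KeyError raised by the lookup is recorded as None,
    so one unmatched name does not abort the whole run.
    """
    results = {}
    for species in species_list:
        try:
            results[species] = lookup(species)
        except (IndexError, KeyError):
            results[species] = None
    return results


# Possible use with the functions from the question (not run here, since it
# needs network access and Entrez.email to be set; "organisms.txt" is a
# made-up file name):
#
# def phylum_lookup(species):
#     taxid = get_tax_id(species)
#     data = get_tax_data(taxid)
#     return {d['Rank']: d['ScientificName']
#             for d in data[0]['LineageEx'] if d['Rank'] == 'phylum'}
#
# results = collect_results(read_species("organisms.txt"), phylum_lookup)
```

With a dict, each lineage stays attached to the name it was looked up under, which can be easier to check than matching positions across several lists.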