urllib2.HTTPError Python

I have a file with GI numbers and would like to get the FASTA sequences from NCBI.

from Bio import Entrez 
import time 
Entrez.email ="[email protected]" 
f = open("C:\\bioinformatics\\gilist.txt") 
for line in iter(f): 
    handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml") 
    records = Entrez.read(handle) 
    print ">GI "+line.rstrip()+" "+records[0]["GBSeq_primary-accession"]+" "+records[0]["GBSeq_definition"]+"\n"+records[0]["GBSeq_sequence"] 
    time.sleep(1) # to make sure not many requests go per second to ncbi 
f.close()

This script runs fine, but after a few sequences I suddenly get this error message.

Traceback (most recent call last): 
    File "C:/Users/Ankur/PycharmProjects/ncbiseq/getncbiSeq.py", line 7, in <module> 
    handle = Entrez.efetch(db="nucleotide", id=line, retmode="xml") 
    File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 139, in efetch 
    return _open(cgi, variables) 
    File "C:\Python27\lib\site-packages\Bio\Entrez\__init__.py", line 455, in _open 
    raise exception 
urllib2.HTTPError: HTTP Error 500: Internal Server Error

Of course I could use http://www.ncbi.nlm.nih.gov/sites/batchentrez, but I am trying to build a pipeline and would like something automated.

How can I prevent NCBI from "kicking me out"?

Source

2013-02-12 Ank

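I'm not familiar with the NCBI API, but my guess is that you are violating some kind of rate-limiting rule (even with "sleep(1)"). Your first requests work, but after a few of them the server notices that you are hitting it too often and blocks you. This is a problem for you because your code has no error handling.

I'd recommend wrapping the data fetch in a try/except block so that your script waits longer and then tries again when it runs into trouble. If all else fails, write the id that caused the error to a file and move on (in case the id itself is somehow the culprit, perhaps by causing the Entrez library to generate a bad URL).

Try changing your code to something like this (untested):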

from urllib2 import HTTPError
from Bio import Entrez
import time

def get_record(_id):
    handle = Entrez.efetch(db="nucleotide", id=_id, retmode="xml")
    records = Entrez.read(handle)
    print ">GI "+_id+" "+records[0]["GBSeq_primary-accession"]+" "+records[0]["GBSeq_definition"]+"\n"+records[0]["GBSeq_sequence"]
    time.sleep(1) # to make sure not many requests go per second to ncbi

Entrez.email ="[email protected]"
f = open("C:\\bioinformatics\\gilist.txt")
for line in f:
    gi = line.strip()
    if not gi:
        continue # skip blank lines
    try:
        get_record(gi)
    except HTTPError:
        print "Error fetching", gi
        time.sleep(5) # we have angered the API! Try waiting longer?
        try:
            get_record(gi)
        except Exception:
            # second failure: record the id and move on
            # (use a separate name so the input file handle f is not overwritten)
            with open('error_records.bad','a') as bad:
                bad.write(gi+'\n')
            continue
f.close()
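
The wait-and-retry idea can also be written as a small reusable helper. The following is only a sketch in Python 3 syntax, not part of the answer above: `call_with_retries` and all of its parameters are hypothetical names, and the doubling delay is an assumption about what "waiting longer" could look like. `IOError` is used as the default because `HTTPError` is a subclass of it.

```python
import time


def call_with_retries(func, arg, attempts=3, first_delay=5.0,
                      retry_on=(IOError,), sleep=time.sleep):
    """Call func(arg); on a retryable error, wait and try again.

    The delay doubles after every failure. Once all attempts are used
    up, the last error is re-raised so the caller can log the id.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return func(arg)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(delay)
            delay *= 2
```

With a helper like this, the loop body would reduce to a single `try` around `call_with_retries(get_record, gi)`, with the `except` branch writing the id to the error file.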

Source

2013-02-12 07:33:51

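There is a workaround that uses efetch. You can split your list into batches of 200 (a gut feeling that this is a reasonable batch size) and use efetch to send all of those ids in a single request.

First, this is much faster than sending 200 individual queries. Second, it also effectively complies with the "3 queries per second" rule, because each query takes longer than 0.33 seconds to process, but not too long.

However, you do need a mechanism to catch the "bad apples". NCBI will return 0 results even if just one of your 200 ids is bad. In other words, NCBI returns results if and only if all 200 ids are valid.

In the bad-apple case, I go through the 200 ids one by one and ignore the bad one. This scenario is also a reason not to make the batch too big. First, the larger the batch, the higher the chance that it contains a bad apple, so you will have to fall back to iterating over the whole batch more often. Second, the larger the batch, the more individual items you have to iterate over when that happens.

I use the following code to download CAZy proteins and it works well: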
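
As a rough illustration of the batching step, here is a minimal sketch in Python 3 syntax. The name `batches` and its `size` parameter are hypothetical; the function only builds the comma-separated id strings and does not contact NCBI.

```python
def batches(ids, size=200):
    """Yield comma-separated id strings with at most `size` ids each."""
    # Strip newlines and drop blank entries, e.g. when ids come from a file.
    cleaned = [i.strip() for i in ids if i.strip()]
    for start in range(0, len(cleaned), size):
        yield ",".join(cleaned[start:start + size])
```

Each yielded string can then be passed as the `id` value of one efetch request, for example by looping over `batches(open("gilist.txt"))`.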
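
The fallback just described can be sketched independently of the network code. This is a Python 3 illustration under stated assumptions: `fetch_with_fallback` is a hypothetical name, and `fetch` stands for any callable that takes a comma-separated id string and raises an exception when the request fails. If a failed batch came back as an empty response instead of an error, the check would have to be adapted.

```python
def fetch_with_fallback(ids, fetch):
    """Fetch all ids in one request; on failure, retry them one at a time.

    Returns a tuple (text, bad_ids), where bad_ids lists the ids that
    still failed when requested individually.
    """
    try:
        return fetch(",".join(ids)), []
    except Exception:
        pass  # at least one bad apple (or a server error): go one by one

    parts, bad_ids = [], []
    for single in ids:
        try:
            parts.append(fetch(single))
        except Exception:
            bad_ids.append(single)  # remember the bad apple and carry on
    return "".join(parts), bad_ids
```

Returning the failed ids, rather than silently dropping them, makes it possible to write them to a file for later inspection.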

import urllib2


prefix = "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&id="
id_per_request = 200


def getSeq(id_list):
    # id_list is a comma-separated string that ends with a trailing comma
    ids = id_list.rstrip(",")
    url = prefix + ids

    temp_content = ""
    try:
        temp_content += urllib2.urlopen(url).read()
    except Exception:
        # if there is a bad apple, try the ids one by one
        for id in ids.split(","):
            url = prefix + id
            try:
                temp_content += urllib2.urlopen(url).read()
            except Exception:
                # skip the bad id
                pass
    return temp_content


content = ""
counter = 0
id_list = ""

# define your accession numbers first, here it is just an example!!
accs = ["ADL19140.1", "ABW01768.1", "CCQ33656.1"]
for acc in accs:
    id_list += acc + ","
    counter += 1

    # a full batch has been collected: fetch it and start a new one
    if counter == id_per_request:
        counter = 0
        content += getSeq(id_list)
        id_list = ""

# fetch whatever is left over in the last, partial batch
if id_list != "":
    content += getSeq(id_list)
    id_list = ""


print content

Source

2014-06-20 21:49:50 dgg32
