Extracting data from inconsistent HTML pages using BeautifulSoup4 and Python

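I am trying to extract data from this webpage, and I am running into trouble because of inconsistencies in the page's HTML formatting. I have a list of OGAP IDs, and for each one I want to extract the gene name and any literature information (PMID #). Thanks to other questions here and the BeautifulSoup documentation, I have been able to get the gene name consistently for each ID, but I am having trouble with the literature part. Below are a couple of search terms that highlight the inconsistencies.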

HTML sample that works

Search term: OG00131

<tr> 
 
    <td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation: 
 
    <br>&nbsp;&nbsp;PMID: 
 
    <a href="http://www.ncbi.nlm.nih.gov/pubmed/20068230">20068230</a> 
 
    [CAD, ETD MS/MS]; <br> 
 
    <br> 
 
    </td> 
 
</tr>

HTML sample that does not work

Search term: OG00020

<td align="top" bgcolor="#FBFFCC"> 
 
    <div class="STYLE28">Literature describing O-GlcNAcylation: </div> 
 
    <div class="STYLE28"> 
 
    <div class="STYLE28">PMID: 
 
     <a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation">16408927</a> 
 
     [Azide-tag, nano-HPLC/tandem MS] 
 
    </div> 
 
    <br> 
 
    Site has not yet been determined. Use 
 
    <a href="parser2.cgi?ACLY_HUMAN" target="_blank">OGlcNAcScan</a> 
 
    to predict the O-GlcNAc site. </div> 
 
</td>

Here is my code so far

import urllib2 
from bs4 import BeautifulSoup 

#define list of genes 

#initialize variables 
gene_list = [] 
literature = [] 
# Test list 
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"] 


for i in range(len(gene_listID)): 
    print gene_listID[i] 
    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided 
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i] 
    # Opens the URL as a page 
    page = urllib2.urlopen(dbOGAP) 
    # Reads the page and parses it through "lxml" format 
    soup = BeautifulSoup(page, "lxml") 

    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text 
    print gene_name[1:] 
    gene_list.append(gene_name[1:]) 

    # PubMed IDs are located near the <td> tag with the term "Data and Source" 
    pmid = soup.find("span", text="Data and Source") 

    # Based on inspection of the website, need to move up to the parent <td> tag 
    pmid_p = pmid.parent 

    # Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag) 
    pmid_s = pmid_p.next_sibling 
    #for child in pmid_s.descendants: 
    # print child 
    # Now we search down the tree to find the next table data (<td>) tag 
    pmid_c = pmid_s.find("td") 
    temp_lit = [] 
    # Next we print the text of the data 
    #print pmid_c.text 
    if "No literature is available" in pmid_c.text: 
        temp_lit.append("No literature is available") 
        print "Not available" 
    else: 
        # and then print out a list of urls for each pubmed ID we have 
        print "The following is available" 
        for link in pmid_c.find_all('a'): 
            # the <a> tag includes more than just the link address. 
            # for each <a> tag found, print the address (href attribute) and extra bits 
            # link.string provides the string that appears to be hyperlinked. 
            # In this case, it is the pubmedID 
            print link.string 
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href')) 
    literature.append(temp_lit) 
    print "\n"

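So it seems that the differing element structure is what throws my code for a loop. Is there a way to search for any element containing the text "PMID" and return the text that follows it (plus the URL, if there is a PMID number)? If not, would I need to check each child, looking for the text I am interested in?

I am using Python 2.7.10.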

Source

2016-12-05 Peter M.

import re 

import requests 
from bs4 import BeautifulSoup 

gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"] 
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID) 

# Compile once, outside the loop; the dots are escaped so they match literally 
regex = re.compile(r'http://www\.ncbi\.nlm\.nih\.gov/pubmed/\d+') 

for url in urls: 
    r = requests.get(url) 
    soup = BeautifulSoup(r.text, 'lxml') 

    # First <a> tag whose href points at a PubMed record (None if there is none) 
    a_tag = soup.find('a', href=regex) 
    # Accept the link only when the element just before it mentions "PMID" 
    has_pmid = a_tag is not None and 'PMID' in a_tag.previous_element 

    if has_pmid: 
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href")) 
    else: 
        print("Not available")

Output:

18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734 
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230 
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230 
Not available 
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927 
Not available 
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation

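This finds the first a tag whose href matches the target URL (one ending in digits), then checks whether 'PMID' is in the element just before it. The site is so inconsistent that it took me many tries; I hope this helps.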
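To collect every PubMed link on a page rather than only the first, the same href match can be passed to find_all. What follows is a minimal sketch, not the code that produced the output above. The function name extract_pmids and the two trimmed HTML strings (cut down from the question's samples) are illustrative assumptions. It parses with the built-in html.parser instead of lxml to avoid an extra dependency, and it assumes Python 3.

```python
import re

from bs4 import BeautifulSoup

# Matches PubMed links with or without a trailing query string.
PUBMED_HREF = re.compile(r"ncbi\.nlm\.nih\.gov/pubmed/(\d+)")


def extract_pmids(html):
    """Return a list of (pmid, method_note, url) tuples found in html."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for a_tag in soup.find_all("a", href=PUBMED_HREF):
        pmid = PUBMED_HREF.search(a_tag["href"]).group(1)
        # The method note, e.g. "[CAD, ETD MS/MS];", is the text node
        # directly after the link, when there is one.
        sibling = a_tag.next_sibling
        note = sibling.strip() if isinstance(sibling, str) else ""
        results.append((pmid, note, a_tag["href"]))
    return results


# Trimmed copies of the two samples from the question (hypothetical test data).
SAMPLE_TD = """
<tr><td colspan="4" class="STYLE28">Literature describing O-GlcNAcylation:
<br>&nbsp;&nbsp;PMID:
<a href="http://www.ncbi.nlm.nih.gov/pubmed/20068230">20068230</a>
[CAD, ETD MS/MS]; <br><br></td></tr>
"""

SAMPLE_DIV = """
<td><div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28"><div class="STYLE28">PMID:
<a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation">16408927</a>
[Azide-tag, nano-HPLC/tandem MS]
</div><br>Site has not yet been determined.</div></td>
"""

for sample in (SAMPLE_TD, SAMPLE_DIV):
    for pmid, note, url in extract_pmids(sample):
        print(pmid, note, url)
```

Because the match is made on the link's href rather than on the surrounding td or div tags, the same function handles both layouts, and a page with no PubMed links simply yields an empty list.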

Source

2016-12-06 01:26:45

Hey, thanks for your help. I should be able to play around with this and see whether I can get all of the literature using this approach.
