2017-04-04 193 views
2

Python - Formatting output

For the following binary file (which can be downloaded here):

*NEWRECORD 
RECTYPE = D 
MH = Calcimycin 
AQ = AA AD AE AG AI AN BI BL CF CH CL CS CT EC HI IM IP ME PD PK PO RE SD ST TO TU UR 
ENTRY = A-23187|T109|T195|LAB|NRW|NLM (1991)|900308|abbcdef 
ENTRY = A23187|T109|T195|LAB|NRW|UNK (19XX)|741111|abbcdef 
ENTRY = Antibiotic A23187|T109|T195|NON|NRW|NLM (1991)|900308|abbcdef 
ENTRY = A 23187 
ENTRY = A23187, Antibiotic 
MN = D03.633.100.221.173 
PA = Anti-Bacterial Agents 
PA = Calcium Ionophores 
MH_TH = FDA SRS (2014) 
MH_TH = NLM (1975) 
ST = T109 
ST = T195 
N1 = 4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrrol-2-yl)ethyl)-1,7-dioxaspiro(5.5)undec-2-yl)methyl)-, (6S-(6alpha(2S*,3S*),8beta(R*),9beta,11alpha))- 
RN = 37H9VM9WZL 
RR = 52665-69-7 (Calcimycin) 
PI = Antibiotics (1973-1974) 
PI = Carboxylic Acids (1973-1974) 
MS = An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports CALCIUM and other divalent cations across membranes and uncouples oxidative phosphorylation while inhibiting ATPase of rat liver mitochondria. The substance is used mostly as a biochemical tool to study the role of divalent cations in various biological systems. 
OL = use CALCIMYCIN to search A 23187 1975-90 
PM = 91; was A 23187 1975-90 (see under ANTIBIOTICS 1975-83) 
HN = 91(75); was A 23187 1975-90 (see under ANTIBIOTICS 1975-83) 
MR = 20160527 
DA = 19741119 
DC = 1 
DX = 19840101 
UI = D000001 

*NEWRECORD 
RECTYPE = D 
MH = Temefos 
AQ = AA AD AE AG AI AN BL CF CH CL CS CT EC HI IM IP ME PD PK RE SD ST TO TU UR 
ENTRY = Abate|T109|T131|TRD|NRW|NLM (1996)|941114|abbcdef 
ENTRY = Difos|T109|T131|TRD|NRW|UNK (19XX)|861007|abbcdef 
ENTRY = Temephos|T109|T131|TRD|EQV|NLM (1996)|941201|abbcdef 
MN = D02.705.400.625.800 
MN = D02.705.539.345.800 
MN = D02.886.300.692.800 
PA = Insecticides 
MH_TH = FDA SRS (2014) 
MH_TH = INN (19XX) 
MH_TH = USAN (1974) 
ST = T109 
ST = T131 
N1 = Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester 
RN = ONP3ME32DL 
RR = 3383-96-8 (Temefos) 
AN = for use to kill or control insects, use no qualifiers on the insecticide or the insect; appropriate qualifiers may be used when other aspects of the insecticide are discussed such as the effect on a physiologic process or behavioral aspect of the insect; for poisoning, coordinate with ORGANOPHOSPHATE POISONING 
PI = Insecticides (1966-1971) 
MS = An organothiophosphate insecticide. 
PM = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90) 
HN = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90) 
MR = 20130708 
DA = 19990101 
DC = 1 
DX = 19910101 
UI = D000002 

I have the following Python code:

import re 

terms = {} 
numbers = {} 

meshFile = 'd2017.bin' 
with open(meshFile, mode='rb') as file: 
    mesh = file.readlines() 

outputFile = open('mesh.txt', 'w') 

for line in mesh:
    meshTerm = re.search(b'MH = (.+)$', line)
    if meshTerm:
        term = meshTerm.group(1)
    meshNumber = re.search(b'MN = (.+)$', line)
    if meshNumber:
        number = meshNumber.group(1)
        numbers[str(number)] = term
        if term in terms:
            terms[term] = terms[term] + ' ' + str(number)
        else:
            terms[term] = str(number)

cumlist = []
keylist = terms.keys()
for key in keylist:
    #print('THE ORIGIN FOR ', key, file=outputFile)

    item_list = terms[key].split(" ")
    for phrase in item_list:
        cumlist.append(phrase)

print(cumlist)

for item in cumlist:
    print(numbers[str(item)], '\n', item, file=outputFile)

The output is as follows:

b'Calcimycin\r' 
b'D03.633.100.221.173\r' 
b'Temefos\r' 
b'D02.705.400.625.800\r' 
b'Temefos\r' 
b'D02.705.539.345.800\r' 
b'Temefos\r' 
b'D02.886.300.692.800\r' 

How can I reformat the output so that it looks like this:

Calcimycin 
D03.633.100.221.173 
Temefos 
D02.705.400.625.800 
D02.705.539.345.800 
D02.886.300.692.800 

Thanks.

+0

Is there a reason you are using binary strings? – TidB

+0

str.decode('utf-8').strip() – RaminNietzsche
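A minimal sketch of what this comment suggests, assuming the file content is UTF-8 (or plain ASCII): decoding turns each bytes value into a str, and strip() then removes the trailing carriage return. The helper name `clean` is hypothetical and only used for illustration.

```python
def clean(raw: bytes) -> str:
    # Decode the bytes value (assumed UTF-8) and drop surrounding
    # whitespace, including the trailing carriage return.
    return raw.decode('utf-8').strip()


print(clean(b'Calcimycin\r'))
print(clean(b'D03.633.100.221.173\r'))
```

Applied to the values stored in `terms` and `numbers`, this would replace the `str(number)` calls, which are what produce the `b'...'` wrapper in the output.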

+0

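@TidB If you are referring to the regular expressions here, and to using "b" instead of "r", that is because I am reading a binary file, a MeSH file. When I use "r", the regular expression does not work. Did I answer your question? – Simplicity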
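For illustration, the two prefixes are unrelated: `b` makes a bytes pattern, which `re` requires when the data being searched is bytes (as it is when the file is opened with mode='rb'), while `r` only makes a raw string literal so that backslashes are not treated as escapes. The sketch below uses a sample line shaped like the ones in the question and assumes it decodes as UTF-8.

```python
import re

line_bytes = b'MH = Calcimycin\r\n'

# A bytes pattern works on bytes data; the prefixes can be combined as rb''.
m = re.search(rb'MH = (.+)$', line_bytes)
raw_term = m.group(1)  # still bytes, still carries the carriage return

# A str pattern on bytes data is rejected by the re module.
caught = False
try:
    re.search(r'MH = (.+)$', line_bytes)
except TypeError:
    caught = True

# After decoding, an ordinary str pattern works and the result is a str.
line_text = line_bytes.decode('utf-8')
term = re.search(r'MH = (.+)$', line_text).group(1).strip()

print(raw_term, caught, term)
```

Opening the file in text mode, for example with `open(meshFile, encoding='utf-8')`, would give str lines directly, under the same encoding assumption.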

Answer

0
UPDATE: I simplified the source a bit 

You can try this regular expression:

MH\s*=\s*(\w+)\s*|MN\s*= \s*([^\s]*) 

Demo

Sample code (Run it here):

import re 

regex = r"MH\s*=\s*(\w+)\s*|MN\s*= \s*([^\s]*)" 

test_str = ("*NEWRECORD\n" 
    "RECTYPE = D\n" 
    "MH = Calcimycin\n" 
    "AQ = AA AD AE AG AI AN BI BL CF CH CL CS CT EC HI IM IP ME PD PK PO RE SD ST TO TU UR\n" 
    "ENTRY = A-23187|T109|T195|LAB|NRW|NLM (1991)|900308|abbcdef\n" 
    "ENTRY = A23187|T109|T195|LAB|NRW|UNK (19XX)|741111|abbcdef\n" 
    "ENTRY = Antibiotic A23187|T109|T195|NON|NRW|NLM (1991)|900308|abbcdef\n" 
    "ENTRY = A 23187\n" 
    "ENTRY = A23187, Antibiotic\n" 
    "MN = D03.633.100.221.173\n" 
    "PA = Anti-Bacterial Agents\n" 
    "PA = Calcium Ionophores\n" 
    "MH_TH = FDA SRS (2014)\n" 
    "MH_TH = NLM (1975)\n" 
    "ST = T109\n" 
    "ST = T195\n" 
    "N1 = 4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrrol-2-yl)ethyl)-1,7-dioxaspiro(5.5)undec-2-yl)methyl)-, (6S-(6alpha(2S*,3S*),8beta(R*),9beta,11alpha))-\n" 
    "RN = 37H9VM9WZL\n" 
    "RR = 52665-69-7 (Calcimycin)\n" 
    "PI = Antibiotics (1973-1974)\n" 
    "PI = Carboxylic Acids (1973-1974)\n" 
    "MS = An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports CALCIUM and other divalent cations across membranes and uncouples oxidative phosphorylation while inhibiting ATPase of rat liver mitochondria. The substance is used mostly as a biochemical tool to study the role of divalent cations in various biological systems.\n" 
    "OL = use CALCIMYCIN to search A 23187 1975-90\n" 
    "PM = 91; was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n" 
    "HN = 91(75); was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)\n" 
    "MR = 20160527\n" 
    "DA = 19741119\n" 
    "DC = 1\n" 
    "DX = 19840101\n" 
    "UI = D000001\n\n" 
    "*NEWRECORD\n" 
    "RECTYPE = D\n" 
    "MH = Temefos\n" 
    "AQ = AA AD AE AG AI AN BL CF CH CL CS CT EC HI IM IP ME PD PK RE SD ST TO TU UR\n" 
    "ENTRY = Abate|T109|T131|TRD|NRW|NLM (1996)|941114|abbcdef\n" 
    "ENTRY = Difos|T109|T131|TRD|NRW|UNK (19XX)|861007|abbcdef\n" 
    "ENTRY = Temephos|T109|T131|TRD|EQV|NLM (1996)|941201|abbcdef\n" 
    "MN = D02.705.400.625.800\n" 
    "MN = D02.705.539.345.800\n" 
    "MN = D02.886.300.692.800\n" 
    "PA = Insecticides\n" 
    "MH_TH = FDA SRS (2014)\n" 
    "MH_TH = INN (19XX)\n" 
    "MH_TH = USAN (1974)\n" 
    "ST = T109\n" 
    "ST = T131\n" 
    "N1 = Phosphorothioic acid, O,O'-(thiodi-4,1-phenylene) O,O,O',O'-tetramethyl ester\n" 
    "RN = ONP3ME32DL\n" 
    "RR = 3383-96-8 (Temefos)\n" 
    "AN = for use to kill or control insects, use no qualifiers on the insecticide or the insect; appropriate qualifiers may be used when other aspects of the insecticide are discussed such as the effect on a physiologic process or behavioral aspect of the insect; for poisoning, coordinate with ORGANOPHOSPHATE POISONING\n" 
    "PI = Insecticides (1966-1971)\n" 
    "MS = An organothiophosphate insecticide.\n" 
    "PM = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90)\n" 
    "HN = 96; was ABATE 1972-95 (see under INSECTICIDES, ORGANOTHIOPHOSPHATE 1972-90)\n" 
    "MR = 20130708\n" 
    "DA = 19990101\n" 
    "DC = 1\n" 
    "DX = 19910101\n" 
    "UI = D000002\n\n\n\n\n\n\n" 
    "Calcimycin \n" 
    "D03.633.100.221.173\n" 
    "Temefos \n" 
    "D02.705.400.625.800\n" 
    "D02.705.539.345.800\n" 
    "D02.886.300.692.800") 

matches = re.finditer(regex, test_str, re.MULTILINE | re.IGNORECASE)

for match in matches:
    # Each match fills exactly one of the two groups:
    # group 1 for an MH line, group 2 for an MN line.
    for group in match.groups():
        if group is not None:
            print(group)

Sample output:

Calcimycin 
D03.633.100.221.173 
Temefos 
D02.705.400.625.800 
D02.705.539.345.800 
D02.886.300.692.800 
+0

How can I use this as Python code? – Simplicity

+0

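@Simplicity The code above gives you everything you want with a single regular expression... you can decide from the output how you want to process it.. updated it a bit.. you can't test it right now, it is more synthetic –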
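One possible way to turn the idea into a complete script is sketched below. It is not the answerer's code: the pattern `FIELD` and the function names `group_mesh_numbers` and `format_grouped` are hypothetical, and the pattern differs from the one in the answer in that it captures the whole value, so multi-word headings are not cut off at the first space. It assumes the lines are already str (a file opened in text mode) and Python 3.7+ so that dict insertion order is preserved.

```python
import re

# Assumed pattern: matches only "MH = ..." and "MN = ..." lines
# (not "MH_TH = ...") and captures the full value.
FIELD = re.compile(r'^(MH|MN) = (.*)$')


def group_mesh_numbers(lines):
    """Map each MH term to the list of MN tree numbers that follow it."""
    grouped = {}
    term = None
    for line in lines:
        match = FIELD.match(line.strip())  # strip() drops '\r\n' and padding
        if not match:
            continue
        key, value = match.groups()
        if key == 'MH':
            term = value
            grouped.setdefault(term, [])
        elif term is not None:
            grouped[term].append(value)
    return grouped


def format_grouped(grouped):
    """Print-ready text: each term once, followed by its tree numbers."""
    out = []
    for term, tree_numbers in grouped.items():
        if tree_numbers:  # skip terms without MN, like the question's code
            out.append(term)
            out.extend(tree_numbers)
    return '\n'.join(out)


sample = [
    '*NEWRECORD \r\n', 'MH = Calcimycin \r\n', 'MN = D03.633.100.221.173 \r\n',
    'MH_TH = NLM (1975) \r\n', '*NEWRECORD \r\n', 'MH = Temefos \r\n',
    'MN = D02.705.400.625.800 \r\n', 'MN = D02.705.539.345.800 \r\n',
    'MN = D02.886.300.692.800 \r\n',
]
print(format_grouped(group_mesh_numbers(sample)))
```

To run it on the real file, passing the open file object should work, for example `with open('d2017.bin', encoding='utf-8') as f: grouped = group_mesh_numbers(f)`, assuming the file decodes as UTF-8; the result of `format_grouped(grouped)` can then be written to mesh.txt.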