2013-04-11 80 views
1

Problem writing a regular expression

I have a data file in the following format:

1 AA/BB     0C89JG 
    2 ABANO/ANA VICTORIA  F12LFJ 
    3 ABBOUDLASTNAME/ABBOUDF DWPTHC 
    4 ABDALLAH/SIJAM   H0ZDM9 
    5 ABDEL MESSIH/DINA  T0SF8N 
    6 ABHISHEK/PRAMANIK  7SLKXV 
    7 ABHYANKAR/DHANANJAY 7SM0BV 
    8 ABOUSALAMA/FEMKE  LTTRQC 
    9 ABRAMOVA/NATALIA  77LCPZ 
    10 ABRANTES/JOAO   KXZC7Q 
    11 ABRATH/LUC    D5J99J 
    12 ABREO/HECTOR   CXDH4G 
    13 ABREU/ANDREA   242GRC 
    14 ABREU/MARCELO   2436R7 
    15 ABREU/VANDA   3HDNQQ 
    16 ABTS/NATHALIE   DSK9TN 
    17 ABTS/NATHALIE   FZ0LN4 

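I want to extract the last 6 characters of each line; for example, from line 17 I want to get FZ0LN4. The regular expression I came up with is:

([0-9]{1,5})([A-Z /]) ([0-9A-Z]{6})

But it doesn't work. Can anyone point out what the problem is?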

Answers

2

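There are a couple of problems:

  • You're not matching some of the spaces.
  • [A-Z /] is missing a repetition operator.

I have rewritten the regular expression like so: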

In [8]: re.match(r'\s*(\d+)\s*([A-Z /]+?)\s*(\w+)$', ' 15 ABREU/VANDA   3HDNQQ').groups() 
Out[8]: ('15', 'ABREU/VANDA', '3HDNQQ') 
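
To see why the original pattern fails, here is a small sketch that runs it against one record and then applies the two fixes listed above. The variable names and the repaired pattern are illustrative assumptions, not the only way to write it; re.search is used so the leading spaces do not matter.

```python
import re

line = "    15 ABREU/VANDA   3HDNQQ"

# Original pattern: [A-Z /] matches exactly one character, and only a
# single space is allowed before the six-character code.
original = r"([0-9]{1,5})([A-Z /]) ([0-9A-Z]{6})"
print(re.search(original, line))  # None

# Repaired pattern: '+?' repeats the name class, \s+ absorbs runs of spaces.
repaired = r"([0-9]{1,5})\s+([A-Z /]+?)\s+([0-9A-Z]{6})\s*$"
match = re.search(repaired, line)
print(match.groups())  # ('15', 'ABREU/VANDA', '3HDNQQ')
```

The lazy quantifier keeps the name group from swallowing the spaces that separate it from the code.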

If you only need the last six characters, there's no need for a regular expression:

In [15]: s = ' 15 ABREU/VANDA   3HDNQQ' 

In [16]: s[-6:] 
Out[16]: '3HDNQQ' 
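
Slicing counts every character, so it assumes the code is the very last thing on the line. If a line might end in spaces or a newline (for example when it is read from a file), stripping first is safer. A minimal sketch, with a made-up helper name last_six:

```python
def last_six(line):
    """Return the last six characters of line, ignoring trailing whitespace."""
    return line.rstrip()[-6:]

print(last_six(" 15 ABREU/VANDA   3HDNQQ"))     # 3HDNQQ
print(last_six(" 15 ABREU/VANDA   3HDNQQ \n"))  # 3HDNQQ
```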
+1

Much better than mine :) But it fails on the second record :( – RAB 2013-04-11 16:30:39

+0

@RaheelAliBaloch: Good point, I overlooked the space. Fixed now. – NPE 2013-04-11 16:56:33

0

Use $ to match at the end of each line and \S to match non-whitespace characters:

>>> import re
>>> s = ''' 1 AA/BB     0C89JG
    2 ABANO/ANA VICTORIA  F12LFJ 
    3 ABBOUDLASTNAME/ABBOUDF DWPTHC 
    4 ABDALLAH/SIJAM   H0ZDM9 
    5 ABDEL MESSIH/DINA  T0SF8N 
    6 ABHISHEK/PRAMANIK  7SLKXV 
    7 ABHYANKAR/DHANANJAY 7SM0BV 
    8 ABOUSALAMA/FEMKE  LTTRQC 
    9 ABRAMOVA/NATALIA  77LCPZ 
    10 ABRANTES/JOAO   KXZC7Q 
    11 ABRATH/LUC    D5J99J 
    12 ABREO/HECTOR   CXDH4G 
    13 ABREU/ANDREA   242GRC 
    14 ABREU/MARCELO   2436R7 
    15 ABREU/VANDA   3HDNQQ 
    16 ABTS/NATHALIE   DSK9TN 
    17 ABTS/NATHALIE   FZ0LN4''' 

>>> re.findall('\\S{6}$', s, re.MULTILINE) 
['0C89JG', 'F12LFJ', 'DWPTHC', 'H0ZDM9', 'T0SF8N', '7SLKXV', '7SM0BV', 'LTTRQC', '77LCPZ', 'KXZC7Q', 'D5J99J', 'CXDH4G', '242GRC', '2436R7', '3HDNQQ', 'DSK9TN', 'FZ0LN4'] 
2

If you only need the string at the end of the line, you can use a simpler regular expression, such as: \b\w{6}\b$
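
As a rough illustration of that pattern, the sketch below applies it to three of the sample records. The variable names are arbitrary, and the re.MULTILINE flag is an addition that the multi-line input needs so that $ matches at the end of every line.

```python
import re

text = """\
    15 ABREU/VANDA   3HDNQQ
    16 ABTS/NATHALIE   DSK9TN
    17 ABTS/NATHALIE   FZ0LN4"""

# re.MULTILINE makes $ match at the end of every line, not just the string.
codes = re.findall(r"\b\w{6}\b$", text, re.MULTILINE)
print(codes)  # ['3HDNQQ', 'DSK9TN', 'FZ0LN4']
```

Because of the word boundaries, a line whose final word is not exactly six characters long yields no match at all.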

1

Are you only looking for the last line (17)? If so, run re.search on the whole string:

import re 
myString=""" 
    1 AA/BB     0C89JG 
    2 ABANO/ANA VICTORIA  F12LFJ 
    3 ABBOUDLASTNAME/ABBOUDF DWPTHC 
    4 ABDALLAH/SIJAM   H0ZDM9 
    5 ABDEL MESSIH/DINA  T0SF8N 
    6 ABHISHEK/PRAMANIK  7SLKXV 
    7 ABHYANKAR/DHANANJAY 7SM0BV 
    8 ABOUSALAMA/FEMKE  LTTRQC 
    9 ABRAMOVA/NATALIA  77LCPZ 
    10 ABRANTES/JOAO   KXZC7Q 
    11 ABRATH/LUC    D5J99J 
    12 ABREO/HECTOR   CXDH4G 
    13 ABREU/ANDREA   242GRC 
    14 ABREU/MARCELO   2436R7 
    15 ABREU/VANDA   3HDNQQ 
    16 ABTS/NATHALIE   DSK9TN 
    17 ABTS/NATHALIE   FZ0LN4 
""" 

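# Without re.MULTILINE, $ matches only at the end of the string
# (or just before its final newline), so this finds the last code.
m = re.search(r"(\S{6})$", myString)
if m:
    print(m.group(1))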

If you need to find a specific line, you should iterate over the individual lines:

for line in myString.split("\n"):
    # \s+ after 17 keeps this from also matching a record numbered 170, 171, ...
    m = re.search(r"^\s*17\s+.*(\S{6})$", line)
    if m:
        print(m.group(1))
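
The same lookup can be sketched without a regular expression by comparing the first whitespace-separated field with the record number. The function name code_for_record and the shortened sample string are hypothetical.

```python
def code_for_record(text, number):
    """Return the last field of the line whose first field equals number."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == str(number):
            return fields[-1]
    return None  # no such record

sample = """
    1 AA/BB     0C89JG
    16 ABTS/NATHALIE   DSK9TN
    17 ABTS/NATHALIE   FZ0LN4
"""
print(code_for_record(sample, 17))  # FZ0LN4
print(code_for_record(sample, 99))  # None
```

Comparing the whole first field means that asking for record 1 cannot accidentally pick up record 16 or 17.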
+0

+1, same as mine – User 2013-04-11 16:34:51

1

This is easy to do without a regular expression:

st='''\ 
    1 AA/BB     0C89JG 
    2 ABANO/ANA VICTORIA  F12LFJ 
    3 ABBOUDLASTNAME/ABBOUDF DWPTHC 
    4 ABDALLAH/SIJAM   H0ZDM9 
    5 ABDEL MESSIH/DINA  T0SF8N 
    6 ABHISHEK/PRAMANIK  7SLKXV 
    7 ABHYANKAR/DHANANJAY 7SM0BV 
    8 ABOUSALAMA/FEMKE  LTTRQC 
    9 ABRAMOVA/NATALIA  77LCPZ 
    10 ABRANTES/JOAO   KXZC7Q 
    11 ABRATH/LUC    D5J99J 
    12 ABREO/HECTOR   CXDH4G 
    13 ABREU/ANDREA   242GRC 
    14 ABREU/MARCELO   2436R7 
    15 ABREU/VANDA   3HDNQQ 
    16 ABTS/NATHALIE   DSK9TN 
    17 ABTS/NATHALIE   FZ0LN4''' 

for line in st.splitlines():
    # split() with no arguments splits on any run of whitespace
    print(line.split()[-1])

Prints:

0C89JG 
F12LFJ 
DWPTHC 
H0ZDM9 
T0SF8N 
7SLKXV 
7SM0BV 
LTTRQC 
77LCPZ 
KXZC7Q 
D5J99J 
CXDH4G 
242GRC 
2436R7 
3HDNQQ 
DSK9TN 
FZ0LN4 

Or, if you just want the 'nth' one, something like this:

>>> li=[line.split()[-1] for line in st.splitlines()] 
>>> li[-1] 
'FZ0LN4' 
>>> li[-2] 
'DSK9TN' # etc etc 

Or, if you really want a regular expression:

>>> re.findall(r'\s(\S{6})$',st,re.MULTILINE) 
['0C89JG', 'F12LFJ', 'DWPTHC', 'H0ZDM9', 'T0SF8N', '7SLKXV', '7SM0BV', 'LTTRQC', '77LCPZ', 'KXZC7Q', 'D5J99J', 'CXDH4G', '242GRC', '2436R7', '3HDNQQ', 'DSK9TN', 'FZ0LN4'] 
>>> re.findall(r'\s(\S{6})$',st,re.MULTILINE)[-1] 
'FZ0LN4'
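
Since the question mentions a data file, a streaming variant might look like the sketch below. The name iter_codes and the file name records.txt are invented for illustration, and the pattern adds \s* before $ on the assumption that lines read from a file may end in spaces or a newline.

```python
import io
import re

CODE_AT_END = re.compile(r"(\S{6})\s*$")

def iter_codes(lines):
    """Yield the last six non-space characters of each line that has them."""
    for line in lines:
        m = CODE_AT_END.search(line)
        if m:
            yield m.group(1)

# io.StringIO stands in for an open file such as open("records.txt").
fake_file = io.StringIO(
    "    16 ABTS/NATHALIE   DSK9TN \n"
    "    17 ABTS/NATHALIE   FZ0LN4\n"
)
print(list(iter_codes(fake_file)))  # ['DSK9TN', 'FZ0LN4']
```

Iterating over the file object reads one line at a time, so the whole file never has to be held in memory.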