2016-12-14 76 views

Python regex: capturing a group N times

I am parsing /proc/PID/stat. The file contains input like this:

25473 (firefox) S 25468 25465 25465 0 -1 4194304 149151169 108282 32 15 2791321 436115 846 86 20 0 84 0 9648305 2937786368 209665 18446744073709551615 93875088982016 93875089099888 140722931705632 140722931699424 140660842079373 0 0 4102 33572009 0 0 0 17 1 0 0 175 0 0 93875089107104 93875089109128 93875116752896 140722931707410 140722931707418 140722931707418 140722931707879 0 

I came up with:

import re 

def get_stats(pid): 
    with open('/proc/{}/stat'.format(pid)) as fh:
        stats_raw = fh.read()
    # Raw string so the backslashes reach the regex engine unchanged.
    stat_pattern = r'(\d+\s)(\(.+\)\s)(\w+\s)(-?\d+\s?)'
    return re.findall(stat_pattern, stats_raw)

This matches the first three groups, but returns only a single field for the last group, (-?\d+\s?):

[('25473 ', '(firefox) ', 'S ', '25468 ')] 

I have been looking for a way to repeat only the last group a set number of times:

'(\d+\s)(\(.+\)\s)(\w+\s)(-?\d+\s?){49}' 

Can you use the PyPI regex module? Then your approach will work. Otherwise, you need two steps.


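@WiktorStribiżew Good to know about that module, but this code is part of another module, and adding another dependency would not be ideal. Still, in case someone else runs into this, showing the difference in an answer would not be a bad idea. – tijko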


OK, then match with '(\d+\s)(\(.+\)\s)(\w+\s)((?:-?\d+\s?){49})' and split the fourth group on whitespace.

Answer


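With Python's re module you cannot access each capture of a repeated group; a quantified capturing group keeps only its last iteration. You can capture the whole rest of the string into Group 4 and then split it on whitespace: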

import re 
s = r'25473 (firefox) S 25468 25465 25465 0 -1 4194304 149151169 108282 32 15 2791321 436115 846 86 20 0 84 0 9648305 2937786368 209665 18446744073709551615 93875088982016 93875089099888 140722931705632 140722931699424 140660842079373 0 0 4102 33572009 0 0 0 17 1 0 0 175 0 0 93875089107104 93875089109128 93875116752896 140722931707410 140722931707418 140722931707418 140722931707879 0' 
stat_pattern = r'(\d+)\s+(\([^)]+\))\s+(\w+)\s*(.*)' 
res = [] 
for m in re.finditer(stat_pattern, s): 
    res.append(m.group(1)) 
    res.append(m.group(2)) 
    res.append(m.group(3)) 
    res.extend(m.group(4).split()) 
print(res) 

Output:

['25473', '(firefox)', 'S', '25468', '25465', '25465', '0', '-1', '4194304', '149151169', '108282', '32', '15', '2791321', '436115', '846', '86', '20', '0', '84', '0', '9648305', '2937786368', '209665', '18446744073709551615', '93875088982016', '93875089099888', '140722931705632', '140722931699424', '140660842079373', '0', '0', '4102', '33572009', '0', '0', '0', '17', '1', '0', '0', '175', '0', '0', '93875089107104', '93875089109128', '93875116752896', '140722931707410', '140722931707418', '140722931707418', '140722931707879', '0'] 
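As a minimal sketch of why the split is needed: the shortened sample line and the helper name `parse_stat` below are assumptions made for illustration, not part of the original post. The first match shows that a repeated capturing group in `re` retains only its final iteration; the helper then captures the tail once and splits it, as above.

```python
import re

# Shortened, made-up stat line for illustration only.
sample = '25473 (firefox) S 25468 25465 25465 0 -1'

# A repeated capturing group in `re` keeps only its final iteration.
m = re.match(r'(\d+)\s+(\([^)]+\))\s+(\w+)\s+(-?\d+\s?){5}', sample)
last_only = m.group(4)  # only the fifth number survives

def parse_stat(line):
    """Hypothetical helper: return all fields of a stat line as strings."""
    m = re.match(r'(\d+)\s+(\([^)]+\))\s+(\w+)\s*(.*)', line)
    if m is None:
        raise ValueError('unexpected stat format')
    # Groups 1-3 are single fields; group 4 is the tail, split on whitespace.
    return [m.group(1), m.group(2), m.group(3)] + m.group(4).split()

fields = parse_stat(sample)
print(last_only)
print(fields)
```

Raising on a failed match is one possible design choice here; returning an empty list would work equally well if the caller prefers to skip unreadable entries.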

If you literally need to get exactly 49 numbers into Group 4, use

r'(\d+)\s+(\([^)]+\))\s+(\w+)\s*((?:-?\d+\s?){49})' 
                                 ^^^^^^^^^^^^^^^^^^
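A short sketch of how this pattern might be used, under stated assumptions: the input line is synthetic (49 made-up numbers) and the names `STAT_RE`, `line`, and `tail` are illustrative. Because `\s?` is optional, `{49}` counts iterations of the group rather than whitespace-separated fields, so the sketch also checks the count after splitting.

```python
import re

# Pattern from the answer: group 4 holds the 49 numeric fields as one string.
STAT_RE = re.compile(r'(\d+)\s+(\([^)]+\))\s+(\w+)\s*((?:-?\d+\s?){49})')

# Synthetic line for illustration: 49 made-up numeric fields after the state.
line = '1 (demo) S ' + ' '.join(str(n) for n in range(100, 149))

m = STAT_RE.search(line)
tail = m.group(4).split()

# \s? is optional, so {49} counts iterations, not whitespace-separated
# fields; checking the length after the split guards against short input.
if len(tail) != 49:
    raise ValueError('expected 49 numeric fields, got {}'.format(len(tail)))

fields = [m.group(1), m.group(2), m.group(3)] + tail
print(len(fields))
```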

With the PyPI regex module, you can use r'(?P<o>\d+)\s+(?P<o>\([^)]+\))\s+(?P<o>\w+)\s+(?P<o>-?\d+\s?){49}' and, after running regex.search(pattern, s), access the .captures("o") stack, which holds the values you need.

>>> import regex 
>>> s = '25473 (firefox) S 25468 25465 25465 0 -1 4194304 149151169 108282 32 15 2791321 436115 846 86 20 0 84 0 9648305 2937786368 209665 18446744073709551615 93875088982016 93875089099888 140722931705632 140722931699424 140660842079373 0 0 4102 33572009 0 0 0 17 1 0 0 175 0 0 93875089107104 93875089109128 93875116752896 140722931707410 140722931707418 140722931707418 140722931707879 0' 
>>> stat_pattern = r'(?P<o>\d+)\s+(?P<o>\([^)]+\))\s+(?P<o>\w+)\s+(?P<o>-?\d+\s?){49}' 
>>> m = regex.search(stat_pattern, s) 
>>> if m: 
...     print(m.captures("o"))

Output:

['25473', '(firefox)', 'S', '25468 ', '25465 ', '25465 ', '0 ', '-1 ', '4194304 ', '149151169 ', '108282 ', '32 ', '15 ', '2791321 ', '436115 ', '846 ', '86 ', '20 ', '0 ', '84 ', '0 ', '9648305 ', '2937786368 ', '209665 ', '18446744073709551615 ', '93875088982016 ', '93875089099888 ', '140722931705632 ', '140722931699424 ', '140660842079373 ', '0 ', '0 ', '4102 ', '33572009 ', '0 ', '0 ', '0 ', '17 ', '1 ', '0 ', '0 ', '175 ', '0 ', '0 ', '93875089107104 ', '93875089109128 ', '93875116752896 ', '140722931707410 ', '140722931707418 ', '140722931707418 ', '140722931707879 ', '0']