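Match the same number of characters as repetitions of a captured group

I want to clean up some input recorded from my keyboard, using Python and regex, especially where backspace was used to fix a mistake.

Example 1: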

[in]: 'Helloo<BckSp> world' 
[out]: 'Hello world'

This can be done with

re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')

Example 2:
However, when there are several backspaces, I don't know how to delete exactly the same number of characters before them:

[in]: 'Helllo<BckSp><BckSp>o world' 
[out]: 'Hello world'

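(Here I want to delete the 'l' and the 'o' before the two backspaces.)

I could simply apply re.sub(r'[^>]<BckSp>', '', line) several times until no <BckSp> is left, but I would like to find a more elegant/faster solution.

Does anyone know how to do this?

Source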

2016-12-27 Louis M

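I don't think you can count with a regex, and your suggestion of looping over the regex is ... – Fallenhero

Is using regex a requirement (i.e. you are learning regex), or are you just asking for the best way to solve this? –

Yes, I am trying to use regex in order to learn it, since I am not familiar with it yet. –

Since recursion/subroutine calls are not supported by Python's re, and atomic groups and possessive quantifiers were not supported either when this was written (they were added in Python 3.11), you can remove the characters that are followed by backspaces in a loop: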

import re 
s = "Helllo\b\bo world" 
r = re.compile("^\b+|[^\b]\b") 
while r.search(s): 
    s = r.sub("", s) 
print(s)

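See the Python demo

The "^\b+|[^\b]\b" pattern finds one or more backspace characters at the start of the string (with ^\b+), and [^\b]\b finds all non-overlapping occurrences of any character other than a backspace followed by a backspace.

The same approach works in case a backspace is represented as some entity/tag, such as a literal <BckSp>: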

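import re
s = "Helllo<BckSp><BckSp>o world"
# The lookbehind keeps "." from matching the ">" that ends a preceding <BckSp>,
# which would otherwise corrupt runs of three or more consecutive markers
r = re.compile("^(?:<BckSp>)+|.(?<!<BckSp>)<BckSp>", flags=re.S)
while r.search(s):
    s = r.sub("", s)
print(s)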

See another Python demo
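
As a rough sketch, the loop can be wrapped in a function that accepts any marker string. The name strip_backspaces and its marker parameter are illustrative and not part of the original answer. It assumes the marker cannot overlap with itself, uses re.subn so each pass both substitutes and reports whether anything changed, and uses a lookbehind so the deleted character is never the tail of a preceding marker.

```python
import re

def strip_backspaces(text, marker="<BckSp>"):
    """Repeatedly delete 'character + marker' pairs until none remain."""
    m = re.escape(marker)
    # Markers at the very start have nothing to delete. Otherwise remove one
    # character (unless it is the last character of a marker) together with
    # the marker that follows it.
    pattern = re.compile(rf"^(?:{m})+|.(?<!{m}){m}", flags=re.S)
    while True:
        text, count = pattern.subn("", text)
        if count == 0:
            return text

print(strip_backspaces("Helllo<BckSp><BckSp>o world"))  # Hello world
```

Each pass removes at least one marker, so the number of passes is bounded by the longest run of consecutive markers.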

Source

2016-12-27 10:39:54

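The OP has already considered a loop and is looking for a better solution. –

It looks like Python does not support recursive regex. If you can use another language, you could try this: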

.(?R)?<BckSp>

See: https://regex101.com/r/OirPNn/1

Source

2016-12-27 10:41:08 Fallenhero

Well, we can install the PyPI regex module and use this approach in Python. –

This is not very efficient, but you can do it with the re module:

(?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1

demo

This way you don't have to count; the pattern only uses repetition.

(?: 
    [^<] # a character to remove 
    (?= # lookahead to reach the corresponding <BckSp> 
     [^<]* # skip characters until the first <BckSp> 
     ( # capture group 1: contains the <BckSp>s 
      (?=(\1?))\2 # emulate an atomic group in place of \1?+ 
         # The idea is to add the <BckSp>s already matched in the 
         # previous repetitions if any to be sure that the following 
         # <BckSp> isn't already associated with a character 
      <BckSp> # corresponding <BckSp> 
     ) 
    ) 
)+ # each time the group is repeated, the capture group 1 is growing with a new <BckSp> 

\1 # matches all the consecutive <BckSp> and ensures that there's no more character 
    # between the last character to remove and the first <BckSp>

You can do the same with the regex module, but this time you don't need to emulate the possessive quantifier:

(?:[^<](?=[^<]*(\1?+<BckSp>)))+\1

demo

But with the regex module, you can also use recursion (as @Fallenhero noticed):

[^<](?R)?<BckSp>

demo

Source

2016-12-27 10:44:22

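I can't upvote this without any explanation beyond the demos. –

In case the marker is a single character, you can just use a stack, which gives you the result in a single pass: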

s = "Helllo\b\bo world"
res = []

for c in s:
    if c == '\b':
        if res:
            del res[-1]
    else:
        res.append(c)

print(''.join(res)) # Hello world

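In case the marker is literally '<BckSp>' or some other string of length greater than 1, you can use replace to substitute it with '\b' and then use the solution above. This only works if you know that '\b' does not occur in the input. If you can't designate a replacement character, you can use split and process the result: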

s = 'Helllo<BckSp><BckSp>o world'
res = []

for part in s.split('<BckSp>'):
    if res:
        del res[-1]
    res.extend(part)

print(''.join(res)) # Hello world
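
For the replace-based variant described above, a minimal sketch could look like the following. The function name apply_backspaces and the placeholder parameter are illustrative, not part of the original answer. The explicit check encodes the stated assumption that the placeholder character never occurs in the input.

```python
def apply_backspaces(text, marker="<BckSp>", placeholder="\b"):
    """Replace the multi-character marker with one character, then use a stack."""
    if placeholder in text:
        # Assumption from the answer: the placeholder must not appear in the input
        raise ValueError("placeholder already occurs in the input")
    stack = []
    for ch in text.replace(marker, placeholder):
        if ch == placeholder:
            if stack:          # a backspace on empty output deletes nothing
                stack.pop()
        else:
            stack.append(ch)
    return "".join(stack)

print(apply_backspaces("Helllo<BckSp><BckSp>o world"))  # Hello world
```

If no safe placeholder exists, the split-based version avoids the problem entirely.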

Source

2016-12-27 10:55:25 niemmi

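Simple and effective, but not what the OP wants, which is to learn regex. –

Nice approach. If the marker is '<BckSp>', would you have a workaround? Maybe replacing it with '\b' would be the simplest... –

@LouisM If you know a character that does not occur in the input, replacing would be the simplest option. I have added an alternative solution for when you can't designate any single character as a replacement. – niemmi

Slightly verbose, but you can use this lambda function to count the number of <BckSp> occurrences and use substring operations to get your final output.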

>>> import re
>>> bk = '<BckSp>'

>>> s = 'Helllo<BckSp><BckSp>o world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello world

>>> s = 'Helloo<BckSp> world'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello world

>>> s = 'Helloo<BckSp> worl<BckSp>d'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello word

>>> s = 'Helllo<BckSp><BckSp>o world<BckSp><BckSp>k'
>>> print(re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))//len(bk)], s))
Hello work
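
Note that the lambda only trims the text captured immediately before each run of markers, so a run that reaches back past an earlier run (for example 'ab<BckSp>c<BckSp><BckSp>') is not fully handled. One possible way around this, sketched here as an assumption rather than as part of the original answer, is to let the regex only tokenize the input and keep the counting in a stack. The name apply_backspaces_tokens is hypothetical.

```python
import re

def apply_backspaces_tokens(text, marker="<BckSp>"):
    """Single pass: the regex splits the input into markers and single characters."""
    # The marker is listed first so it wins over "." at the same position
    token = re.compile(re.escape(marker) + r"|.", flags=re.S)
    out = []
    for match in token.finditer(text):
        if match.group() == marker:
            if out:            # ignore a backspace when nothing is left to delete
                out.pop()
        else:
            out.append(match.group())
    return "".join(out)

print(apply_backspaces_tokens("Helllo<BckSp><BckSp>o world<BckSp><BckSp>k"))  # Hello work
```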

Source

2016-12-27 11:01:21 anubhava
