2017-06-19 51 views
0

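Python: regex matching anything inside parentheses (with other parentheses inside)

I am working with Python and regular expressions, and I want to transform a string like the following: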

(1694439,805577453641105408,'\"@Bessemerband not reverse gear simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked.\"',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'\"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :(\"',2911510,NULL,NULL,NULL), 

into a list like the following:

[ 
    [1694439, 805577453641105408, '\"@Bessemerband not reverse gear simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked.\"', 2887640, NULL, NULL, NULL], 
    [1649240, 805577446758158336, '\"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :(\"', 2911510, NULL, NULL, NULL] 
] 

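The main problem here is that, as you can see, the text itself also contains parentheses, and I do not want to split on those. I have tried things like \([^)]+\), but obviously that stops at the first ) it finds.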
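To make the failure concrete, here is a minimal sketch of that attempt. The `sample` string is a shortened, hypothetical stand-in for the real data, not the original rows:

```python
import re

# Hypothetical, shortened sample with parentheses inside the quoted text.
sample = "(1,'riot if (Brexit) is blocked',2),(3,'sad :(',4),"

# [^)]+ cannot tell a ")" inside the quoted text from the one closing the row.
matches = re.findall(r"\([^)]+\)", sample)
print(matches)
```

The first row is cut off right after `(Brexit)`, because the pattern treats every `)` the same way.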

Any clue how to solve this?

+3

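This is not what regular expressions were designed for. There are extensions that allow balanced parentheses, but without those extensions the *pumping lemma* shows that a regular expression cannot do this. –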

+2

Regular expressions cannot count, so they cannot extract text with matching quotes and parentheses. You need a parser. See PLY, PyParsing, Lark, etc. – phd

+0

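Although, as @WillemVanOnsem said, regular expressions are not designed for this, if you know that the first column is always a run of digits, you can use that as an anchor. Take a look at [regex lookbehind](http://www.rexegg.com/regex-lookarounds.html) – EndermanAPM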

Answers

0

Is this the output you are looking for?

big = """(1694439,805577453641105408,'\"@Bessemerband not reverse gear simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked.\"',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'\"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :(\"',2911510,NULL,NULL,NULL),""" 
small = big.split('),') 
print(small) 

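What I am doing is splitting on ), and then simply looping through the pieces and splitting each one on commas as usual. I will show the basic approach, which can of course be optimized: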

new_list = [] 

for x in small: 
    new_list.append(x.split(',')) 
print(new_list) 

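The downside of doing it this way is that the result ends with an entry that holds only an empty string, but you can remove it afterwards. Each row's first field also keeps its leading (.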
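A minimal sketch of that cleanup, assuming the same split-based approach. The `big` string here is a shortened, hypothetical sample, and the sketch still breaks if the quoted text contains a comma or the sequence `),`:

```python
# Shortened, hypothetical sample in the same shape as the question's data.
big = "(1,2,'a (b) c',3,NULL),(4,5,'d :(',6,NULL),"

rows = []
for chunk in big.split('),'):
    chunk = chunk.lstrip('(')   # drop the opening parenthesis of the row
    if not chunk:               # skip the empty piece after the final "),"
        continue
    rows.append(chunk.split(','))

print(rows)
```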

+0

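The only problem with your solution is that the string inside the parentheses might itself contain "),". Anyway, I found a website that does exactly what I need: http://www.csvjson.com/sql2json – ParKein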

0

Here is a simple regex solution that captures each comma-separated value in its own group:

\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*) 

Usage:

input_string = r"""(1694439,805577453641105408,'\"@Bessemerband not reverse gear simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked.\"',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'\"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :(\"',2911510,NULL,NULL,NULL),""" 

import re 
result = re.findall(r"\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)", input_string) 
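As a rough illustration of what this returns, the sketch below runs the same pattern on a shortened, hypothetical sample and converts each match to a list. Note that `re.findall` yields one tuple of strings per row, the quotes around the text column are not part of the captured group, and the pattern only fits rows with exactly this seven-column layout:

```python
import re

# Hypothetical, shortened sample; a raw string keeps the backslash escape literal.
sample = r"(1,2,'it\'s (ok), really',3,NULL,NULL,NULL),(4,5,'sad :(',6,NULL,NULL,NULL),"

pattern = re.compile(
    r"\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)"
)

# findall returns tuples of captured groups; turn each one into a list.
rows = [list(groups) for groups in pattern.findall(sample)]
for row in rows:
    print(row)
```

All values come back as strings, so converting the numeric columns or mapping NULL to None would be a separate step.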
0

The nested parentheses are not a problem here, because they are enclosed in quotes. All you need to do is match the quoted parts separately:

import re 

pat = re.compile(r"[^()',]+|'[^'\\]*(?:\\.[^'\\]*)*'|(\()|(\))", re.DOTALL) 

s = r'''(1694439,805577453641105408,'\"@Bessemerband not reverse gear simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked.\"',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'\"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :(\"',2911510,NULL,NULL,NULL),''' 

result = [] 

for m in pat.finditer(s): 
    if m.group(1):
        tmplst = []
    elif m.group(2):
        result.append(tmplst)
    else:
        tmplst.append(m.group(0))

print(result) 

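If your string can also contain parentheses that are not enclosed in quotes, you can solve the problem with a recursive pattern using the regex module (combining it with the csv module is a good idea), or by building a state machine.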
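The state-machine option could look roughly like the sketch below. It is only an illustration under stated assumptions: the input is well formed, string literals use single quotes with backslash escapes, and the function name `parse_rows` and the sample data are hypothetical.

```python
def parse_rows(text):
    """Split "(a,'b'),(c,'d')," into a list of rows of raw field strings."""
    rows, row, field = [], None, []
    in_quote = False   # currently inside a '...' literal
    escaped = False    # previous character was a backslash inside a literal
    depth = 0          # nesting depth of unquoted parentheses

    for ch in text:
        if in_quote:
            field.append(ch)
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == "'":
                in_quote = False
        elif ch == "'":
            in_quote = True
            field.append(ch)
        elif ch == '(':
            depth += 1
            if depth == 1:
                row, field = [], []      # a new row starts
            else:
                field.append(ch)         # nested, unquoted parenthesis
        elif ch == ')':
            depth -= 1
            if depth == 0:               # the row is complete
                row.append(''.join(field))
                rows.append(row)
                row, field = None, []
            else:
                field.append(ch)
        elif ch == ',' and depth == 1:   # field separator inside a row
            row.append(''.join(field))
            field = []
        elif depth >= 1:
            field.append(ch)
        # Anything at depth 0 (the commas between rows) is ignored.
    return rows


# Hypothetical sample: quoted parentheses plus an unquoted nested pair.
sample = r"(1,'it\'s (ok), really',f(2,3),NULL),(4,'sad :(',5,NULL),"
parsed = parse_rows(sample)
for r in parsed:
    print(r)
```

Fields are returned as raw strings, quotes included, so that a later step can decide how to convert numbers, NULL, and escaped characters.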