Unable to process this regular expression

I have the following "greekSymbols.txt":

Α α alpha 
Β β beta 
Γ γ gamma 
Δ δ delta 
Ε ε epsilon 
Ζ ζ zeta 
Η η eta 
Θ θ theta 
Ι ι iota 
Κ κ kappa 
Λ λ lambda 
Μ μ mu 
Ν ν nu 
Ξ ξ xi 
Ο ο omicron 
Π π pi 
Ρ ρ rho 
Σ σ sigma 
Τ τ tau 
Υ υ upsilon 
Φ φ phi 
Χ χ chi 
Ψ ψ psi 
Ω ω omega

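I am trying to convert this into a plain-text import file for Anki, with tab as the separator. Each line should become two cards, where the front is the symbol (uppercase or lowercase) and the back is the name. This is what I have so far.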

#!/usr/local/bin/python 

import re 

pattern = re.compile(r"(.)\s+(.)\s+(.+)", re.UNICODE) 

input = open("./greekSymbols.txt", "r") 

output = open("./greekSymbolsFormated.txt", "w+") 

line = input.readline() 

while line: 

    string = line.rstrip() 

    m = pattern.match(string) 

    if m: 
        output.write(m.group(1) + "\t" + m.group(3) + "\n") 
        output.write(m.group(2) + "\t" + m.group(3) + "\n") 
    else: 
        print("I was unable to process line '" + string + "' [" + str(m) + "]") 

    line = input.readline() 

input.close(); 
output.close();

Unfortunately, I currently get the "I was unable to process..." message for every line, and the value of str(m) is None. What am I doing wrong?

> localhost:Anki stephen$ python ./convertGreekSymbols.py 
I was unable to process line 'Α α alpha' [None] 
I was unable to process line 'Β β beta' [None] 
...

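Source

2013-04-09 Stephen Cagle

I updated the regex with the changes suggested in the answers, but I still get no matches. I also removed the line breaks, in case they were causing something. – 2013-04-09 06:13:49

Do you know the encoding of the file? – 2013-04-09 06:22:56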

You don't really need a regex for this:

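# Split each line on whitespace instead of using a regex.
with open("./greekSymbols.txt") as infile, \
     open("./greekSymbolsFormated.txt", "w+") as outfile:
    for line in infile:
        up, low, name = line.split()
        # One tab-separated card per symbol; each card ends with a newline.
        outfile.write("{0}\t{1}\n".format(up, name))
        outfile.write("{0}\t{1}\n".format(low, name))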

If you want to stick with a regex, try the following one instead of yours (yours should work IMO, but perhaps it is not explicit enough):

pattern = re.compile(r"(\S+)\s+(\S+)\s+(.+)", re.UNICODE)
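
As a quick check of what this pattern captures, here is a minimal sketch that runs it on one already-decoded line and builds the two tab-separated cards. The helper name line_to_cards is made up for illustration, and the snippet assumes Python 3, where string literals are Unicode text.

```python
import re

# Same pattern as above; in Python 3, str patterns are Unicode-aware by default.
pattern = re.compile(r"(\S+)\s+(\S+)\s+(.+)")

def line_to_cards(line):
    """Return the two 'symbol<TAB>name' cards for one line (hypothetical helper)."""
    m = pattern.match(line.rstrip())
    if m is None:
        return []  # the line did not have three whitespace-separated fields
    upper, lower, name = m.groups()
    return [upper + "\t" + name, lower + "\t" + name]

print(line_to_cards("Φ φ phi \n"))
```

Using \S+ rather than a single . means the symbol groups accept any run of non-whitespace, so the match does not depend on a symbol being exactly one unit long.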

Source

2013-04-09 05:43:41

Thanks. Part of the whole point here is to learn REs, though I do appreciate the help. – 2013-04-09 06:13:00

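@StephenCagle: I added a regex suggestion. I would have expected your regex to work. Possibly the problem is that some characters are represented by multi-byte sequences in UTF-8 (which I assume you are using), so they are not matched by a single '.', although I would have expected matching to work at the character level rather than the byte level. Since I am not in a Unix environment, I cannot test it here. – 2013-04-09 06:26:20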
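
The byte-versus-character hypothesis in this comment can be checked directly. The sketch below assumes Python 3, where bytes and text are separate types, so it emulates an undecoded read by encoding the line to UTF-8 first. The variable names are illustrative only.

```python
import re

line = "Φ φ phi"
raw = line.encode("utf-8")  # what a byte-oriented read of the file would return

# The same regex, once as a bytes pattern and once as a text pattern.
byte_pattern = re.compile(rb"(.)\s+(.)\s+(.+)")
text_pattern = re.compile(r"(.)\s+(.)\s+(.+)")

# Each Greek letter takes two bytes in UTF-8, so the byte string is longer.
print(len(raw), len(line))

# On bytes, (.) consumes only the first byte of the letter and \s+ then fails.
print(byte_pattern.match(raw))

# On decoded text, (.) consumes the whole letter and the match succeeds.
print(text_pattern.match(raw.decode("utf-8")).groups())
```

This is consistent with the question's output: a pattern that expects exactly one unit per symbol fails on undecoded UTF-8 input, whatever whitespace separates the fields.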

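It looks to me like it is the whitespace parsing that is wrong. Shouldn't it be (.)\s(.)\s(.+), rather than \t? There do not appear to be any tabs in your input.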

Source

2013-04-09 05:42:14 Dolda2000

I believe I do have tabs; it seems that pasting into HTML removed them? – 2013-04-09 06:08:33

You have \t where there is no tab; it should be \s:

>>> matcher = re.compile(r"(.)\s(.)\t(.+)", re.UNICODE) 
>>> phi = "Φ φ phi" 
>>> matcher.match(phi) 
>>> matcher = re.compile(r"(.)\s(.)\s+(.+)", re.UNICODE) 
>>> matcher.match(phi) 
<_sre.SRE_Match object at 0x1018d8290> 
>>>

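Source

2013-04-09 05:43:26

I can't argue with your logic, but I still get the error? – 2013-04-09 06:12:28

You could try \s+ (I updated the answer above). It may be that your tab is actually several whitespace characters. If that does not work, could you paste a line to pastebin or something similar? – 2013-04-09 06:22:18

This is the code that finally worked. It turned out that my original file was UTF-8, and that was the cause of the problem. Below is the working solution, which lets me create a delimited import file for Anki.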

#!/usr/local/bin/python 

import re 
import codecs 

pattern = re.compile(r"(\S+)\s+(\S+)\s+(.+)", re.UNICODE) 

input = codecs.open("./greekSymbols.txt", "r", encoding="utf-8") 

output = codecs.open("./greekSymbolsFormated.txt", "w+", encoding="utf-8") 

line = input.readline() 

while line: 

    string = line.rstrip() 

    m = pattern.match(string) 

    if m: 
        output.write(unicode(m.group(1) + "\t" + m.group(3) + "\n")) 
        output.write(unicode(m.group(2) + "\t" + m.group(3) + "\n")) 
    else: 
        print("I was unable to process line '" + string + "' [" + str(m) + "]") 

    line = input.readline() 

input.close(); 
output.close();
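
For readers on Python 3, a roughly equivalent sketch is shown below. It is not the code from this answer: it assumes Python 3, where the built-in open accepts an encoding argument, so codecs and unicode() are not needed. The function name convert is made up for illustration.

```python
import re

pattern = re.compile(r"(\S+)\s+(\S+)\s+(.+)")

def convert(src_path, dst_path):
    """Write two 'symbol<TAB>name' lines to dst_path per line of src_path."""
    # Decoding on read makes the regex see characters rather than raw bytes.
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = line.rstrip()
            m = pattern.match(text)
            if m:
                dst.write(m.group(1) + "\t" + m.group(3) + "\n")
                dst.write(m.group(2) + "\t" + m.group(3) + "\n")
            else:
                print("I was unable to process line '" + text + "'")

# Hypothetical usage with the file names from the question:
# convert("./greekSymbols.txt", "./greekSymbolsFormated.txt")
```

The with statement closes both files even if a line raises an error, which replaces the explicit close() calls.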

Source

2013-04-09 08:17:28
