Python regex - getting items from an org-mode file

I have the following org-mode syntax:

** Hardware [0/1] 
- [ ] adapt a programmable motor to a tripod to be used for panning 
** Reading - Technology [1/6] 
- [X] Introduction to Networking - Charles Severance 
- [ ] A Tour of C++ - Bjarne Stroustrup 
- [ ] C++ How to Program - Paul Deitel 
- [X] Computer Systems - Randal Bryant 
- [ ] The C programming language - Brian Kernighan 
- [ ] Beginning Linux Programming -Matthew and Stones 
** Reading - Health [3/4] 
- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2

and I want to extract the items, like this:

getitems "Hardware"

I should get:

- [ ] adapt a programmable motor to a tripod to be used for panning

If I ask for "Reading - Health", I should get:

- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2

I am currently using the following pattern:

pattern = re.compile("\*\* "+ head + " (.+?)\*?$", re.DOTALL)

The output when asking for "Reading - Technology" is:

- [X] Introduction to Networking - Charles Severance 
    - [ ] A Tour of C++ - Bjarne Stroustrup 
    - [ ] C++ How to Program - Paul Deitel 
    - [X] Computer Systems - Randal Bryant 
    - [ ] The C programming language - Brian Kernighan 
    - [ ] Beginning Linux Programming -Matthew and Stones 
    ** Reading - Health [3/4] 
    - [ ] Patrick McKeown - The Oxygen Advantage 
    - [X] Total Knee Health - Martin Koban 
    - [X] Supple Leopard - Kelly Starrett 
    - [X] Convict Conditioning 1 and 2

I also tried:

pattern = re.compile("\*\* "+ head + " (.+?)[\*|\z]", re.DOTALL)

This works fine for every section except the last one.

The output when asking for "Reading - Health" is:

- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett

As you can see, it does not match the last line.

I am using Python 2.7 and findall.

Asked 2017-03-01 by daleonpz

'\*\* Reading - Health(.*?)(?:\*\*.|$)' – JazZ
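For context on why the first attempt overshoots: with re.DOTALL alone, `$` matches only at the end of the whole text, so the lazy group keeps going through every following section. One way to keep the lazy-match idea is to stop at either the next `**` heading or the end of the input. The following is only a minimal sketch under that assumption. The function name `items_lazy` and the shortened sample text are made up for illustration and are not from the thread.

```python
import re

# Shortened sample, assumed for illustration only
ORG_TEXT = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Convict Conditioning 1 and 2
"""

def items_lazy(head, text=ORG_TEXT):
    # Heading line, then a lazy capture that stops right before the next
    # line starting with "** " or at the very end of the text (\Z).
    pattern = re.compile(
        r"^\*\* " + re.escape(head) + r"[^\n]*\n(.*?)(?=^\*\* |\Z)",
        re.DOTALL | re.MULTILINE,
    )
    return pattern.findall(text)

print(items_lazy("Reading - Health"))
```

Because the end of the section is a lookahead rather than a consumed character, the last section is captured in full, including its final line.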

You can achieve this with:

import re 

string = """ 
** Hardware [0/1] 
- [ ] adapt a programmable motor to a tripod to be used for panning 
** Reading - Technology [1/6] 
- [X] Introduction to Networking - Charles Severance 
- [ ] A Tour of C++ - Bjarne Stroustrup 
- [ ] C++ How to Program - Paul Deitel 
- [X] Computer Systems - Randal Bryant 
- [ ] The C programming language - Brian Kernighan 
- [ ] Beginning Linux Programming -Matthew and Stones 
** Reading - Health [3/4] 
- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2 
""" 

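def getitems(section):
    # Match the "** <section> ..." heading line, then capture everything
    # up to (but not including) the next line that starts with "**".
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    items = rx.search(string)
    # search() returns None when the section is not found
    return items.group('block') if items else None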

items = getitems('Reading - Technology') 
print(items)

See it working on ideone.com.

The heart of the code is the (condensed) expression:

^\*{2}.+[\n\r]  # match the beginning of the line, followed by two stars, anything else in between and a newline 
(?P<block>   # open group "block" 
    (?:    # non-capturing group 
     (?!^\*{2}) # a neg. lookahead, making sure no ** follows at the beginning of a line 
     [\s\S]  # any character... 
    )+    # ...at least once 
)     # close group "block"

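In the actual code, your search string is inserted after the **. See a demo for Reading - Technology on regex101.com.

As a follow-up, you can also return only the selected (checked) values, like this: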
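The annotated expression can also be compiled as written by using re.VERBOSE, which ignores unescaped whitespace and `#` comments inside the pattern. This is a sketch rather than the answer's own code. The helper name `build_section_regex` and the sample text are assumptions. Note that re.escape() also escapes spaces, so the section name survives verbose mode.

```python
import re

def build_section_regex(section):
    # The pattern is assembled from two raw strings with the escaped
    # section name spliced in between them.
    return re.compile(
        r"""
        ^\*{2}[ ]          # two stars and a space at the start of a line
        """ + re.escape(section) + r"""
        .+[\n\r]           # rest of the heading line and its newline
        (?P<block>         # open group "block"
            (?:
                (?!^\*{2}) # stop before a line that starts with two stars
                [\s\S]     # any character, newlines included
            )+
        )
        """,
        re.MULTILINE | re.VERBOSE,
    )

# Shortened sample, assumed for illustration only
sample = (
    "** Hardware [0/1]\n"
    "- [ ] motor\n"
    "** Reading - Health [3/4]\n"
    "- [X] Supple Leopard\n"
)

match = build_section_regex("Reading - Health").search(sample)
print(match.group("block") if match else None)
```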

def getitems(section, selected=None):
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE)
    match = rx.search(string)
    if match is None:
        # section heading not found
        return None
    items = match.group('block')
    if selected:
        # keep only the checked entries, i.e. lines starting with "- [X] "
        rxi = re.compile(r'^- \[X\] (.+)', re.MULTILINE)
        return rxi.findall(items)
    return items

items = getitems('Reading - Health', selected=True) 
print(items)

Answered 2017-03-01 22:01:55 by Jan

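Thanks, it improved the overall code. Also, regex101.com is a great site. – daleonpz

@daleonpz: Added a version that returns only the selected values. – Jan

I'm not sure you need a regex for the whole match. I would just use a regex to match the ** line, and then return lines until the next ** line shows up.

Something like: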

import re

# `head` is the section name and `my_file` an iterable of lines
pattern = re.compile(r"\*\* " + re.escape(head))

start = False
output = []
for line in my_file:
    if pattern.match(line):
        start = True
        continue
    elif start and line.startswith("**"):  # next heading after ours: stop
        break

    if start:
        output.append(line)

# now `output` should have the lines you want

Answered 2017-03-01 21:15:46 by turbulencetoo

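Regex is great at matching structured data, like a line that always has a specific format. When you have to match a bunch of arbitrary text between the lines you care about, it gets really complicated, which is why I would generally avoid the approach you are attempting. – turbulencetoo

On second glance, the 'pattern.match' in my answer could also just be a 'line.startswith("** " + head)' – turbulencetoo

If you are sure that the character * does not appear in your items, you can use: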
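The regex-free variant from that comment could look roughly like the sketch below. The function name `items_by_lines` and the sample text are assumptions for illustration. It takes any iterable of lines, such as an open file.

```python
def items_by_lines(lines, head):
    prefix = "** " + head
    collecting = False
    output = []
    for line in lines:
        if line.startswith("**"):
            if collecting:
                break  # reached the heading after the requested section
            # start collecting only if this heading is the requested one
            collecting = line.startswith(prefix)
            continue
        if collecting:
            output.append(line.rstrip("\n"))
    return output

# Shortened sample, assumed for illustration only
sample_lines = [
    "** Hardware [0/1]\n",
    "- [ ] motor\n",
    "** Reading - Health [3/4]\n",
    "- [ ] The Oxygen Advantage\n",
    "- [X] Supple Leopard\n",
]
print(items_by_lines(sample_lines, "Reading - Health"))
```

Since the match is a plain prefix test, a head such as "Reading" would select the first heading that starts with "** Reading", so it is safest to pass the full section name.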

re.compile(r"\*\* "+head+r" \[\d+/\d+\]\n([^*]+)\*?")
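A short sketch of how this pattern might be used, along with the limitation mentioned above. The wrapper name `items_no_star`, the sample text, and the added re.escape() call are assumptions, not part of the answer.

```python
import re

def items_no_star(head, text):
    # [^*]+ consumes everything up to the next "*", so the capture ends
    # at the next heading or at the end of the text.
    pattern = re.compile(
        r"\*\* " + re.escape(head) + r" \[\d+/\d+\]\n([^*]+)\*?"
    )
    return pattern.findall(text)

# Shortened sample, assumed for illustration only
text = (
    "** Hardware [0/1]\n"
    "- [ ] motor\n"
    "** Reading - Health [3/4]\n"
    "- [X] Supple Leopard\n"
)
print(items_no_star("Reading - Health", text))

# An item that contains "*" is cut off at the first star
print(items_no_star("Notes", "** Notes [0/1]\n- [ ] read *this* book\n"))
```

The second call shows why the approach only holds when no item contains the * character: the capture stops at the first star it meets.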

Answered 2017-03-01 21:26:32

Thanks, it works :) – daleonpz
