2017-03-01 37 views
2

Python regex - get items from an org-mode file

I have the following org-mode syntax:

** Hardware [0/1] 
- [ ] adapt a programmable motor to a tripod to be used for panning 
** Reading - Technology [1/6] 
- [X] Introduction to Networking - Charles Severance 
- [ ] A Tour of C++ - Bjarne Stroustrup 
- [ ] C++ How to Program - Paul Deitel 
- [X] Computer Systems - Randal Bryant 
- [ ] The C programming language - Brian Kernighan 
- [ ] Beginning Linux Programming -Matthew and Stones 
** Reading - Health [3/4] 
- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2 

and I want to extract the items of a given section, for example:

getitems "Hardware" 

I should get:

- [ ] adapt a programmable motor to a tripod to be used for panning 

If I ask for "Reading - Health", I should get:

- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2 

I am currently using the following pattern:

pattern = re.compile("\*\* "+ head + " (.+?)\*?$", re.DOTALL) 

The output when asking for "Reading - Technology" is:

- [X] Introduction to Networking - Charles Severance 
    - [ ] A Tour of C++ - Bjarne Stroustrup 
    - [ ] C++ How to Program - Paul Deitel 
    - [X] Computer Systems - Randal Bryant 
    - [ ] The C programming language - Brian Kernighan 
    - [ ] Beginning Linux Programming -Matthew and Stones 
    ** Reading - Health [3/4] 
    - [ ] Patrick McKeown - The Oxygen Advantage 
    - [X] Total Knee Health - Martin Koban 
    - [X] Supple Leopard - Kelly Starrett 
    - [X] Convict Conditioning 1 and 2 

I also tried:

pattern = re.compile("\*\* "+ head + " (.+?)[\*|\z]", re.DOTALL) 

This works fine for every section except the last one.

The output when asking for "Reading - Health" is:

- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 

As you can see, it does not match the last line.

I am using Python 2.7 and findall.


'\*\* Reading - Health (.*?)(?:\*\*.|$)' – JazZ

Answers

1

You can achieve it with:

import re 

string = """ 
** Hardware [0/1] 
- [ ] adapt a programmable motor to a tripod to be used for panning 
** Reading - Technology [1/6] 
- [X] Introduction to Networking - Charles Severance 
- [ ] A Tour of C++ - Bjarne Stroustrup 
- [ ] C++ How to Program - Paul Deitel 
- [X] Computer Systems - Randal Bryant 
- [ ] The C programming language - Brian Kernighan 
- [ ] Beginning Linux Programming -Matthew and Stones 
** Reading - Health [3/4] 
- [ ] Patrick McKeown - The Oxygen Advantage 
- [X] Total Knee Health - Martin Koban 
- [X] Supple Leopard - Kelly Starrett 
- [X] Convict Conditioning 1 and 2 
""" 

def getitems(section): 
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE) 
    match = rx.search(string) 
    # search() returns None when the section heading is not found 
    return match.group('block') if match else None 

items = getitems('Reading - Technology') 
print(items) 

See it working on ideone.com.


The heart of the code is the (condensed) expression:

^\*{2}.+[\n\r]  # match the beginning of the line, followed by two stars, anything else in between and a newline 
(?P<block>   # open group "block" 
    (?:    # non-capturing group 
     (?!^\*{2}) # a neg. lookahead, making sure no ** follows at the beginning of a line 
     [\s\S]  # any character... 
    )+    # ...at least once 
)     # close group "block" 

In the actual code, your search string is inserted after the **. See a demo for Reading - Technology on regex101.com.
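The commented expression above can also be compiled as written by using `re.VERBOSE`, which ignores unescaped whitespace and `#` comments inside the pattern. The following is only a sketch: `build_section_regex` and `ORG_TEXT` are names made up for this illustration, and the literal space after the stars is written as `[ ]` because verbose mode would otherwise drop it.

```python
import re

# Shortened sample data, assumed for this illustration only.
ORG_TEXT = """\
** Hardware [0/1]
- [ ] adapt a programmable motor to a tripod to be used for panning
** Reading - Health [3/4]
- [ ] Patrick McKeown - The Oxygen Advantage
- [X] Total Knee Health - Martin Koban
"""

def build_section_regex(section):
    """Return a compiled regex whose 'block' group holds the section body."""
    return re.compile(
        r"""
        ^\*{2}[ ]            # two stars at the start of a line, then a space
        %s                   # the escaped section title
        .+[\n\r]             # rest of the heading line and its line break
        (?P<block>           # open group block
            (?:              # non-capturing group
                (?!^\*{2})   # stop before a line that starts a new heading
                [\s\S]       # any character, including newlines
            )+               # ...at least once
        )                    # close group block
        """ % re.escape(section),
        re.MULTILINE | re.VERBOSE,
    )

match = build_section_regex("Reading - Health").search(ORG_TEXT)
print(match.group("block") if match else None)
```

`re.escape` also escapes the spaces in the title, which is what keeps them significant in verbose mode.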


As a follow-up, you can also return only the selected (checked) values, like so:

def getitems(section, selected=None): 
    rx = re.compile(r'^\*{2} ' + re.escape(section) + r'.+[\n\r](?P<block>(?:(?!^\*{2})[\s\S])+)', re.MULTILINE) 
    match = rx.search(string) 
    if match is None: 
        # section heading not found 
        return None 
    items = match.group('block') 
    if selected: 
        # keep only the checked items; optional indentation before "- [X]" is allowed 
        rxi = re.compile(r'^[ \t]*- \[X\] (.+)', re.MULTILINE) 
        return rxi.findall(items) 
    return items 

items = getitems('Reading - Health', selected=True) 
print(items) 
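If both checked and unchecked items are needed together with their state, the same idea can be extended to capture the checkbox mark. This is a sketch under assumptions: `ITEM_RX` and `parse_items` are hypothetical names, and it assumes every item sits on one line of the form `- [ ] title` or `- [X] title`.

```python
import re

# One checkbox line: optional indent, "- [ ]" or "- [X]", then the title.
# Trailing spaces or tabs are excluded from the captured title.
ITEM_RX = re.compile(
    r'^[ \t]*- \[(?P<mark>[ X])\] (?P<title>.+?)[ \t]*$',
    re.MULTILINE,
)

def parse_items(block):
    """Return a list of (checked, title) tuples for each checkbox line in block."""
    return [
        (m.group('mark') == 'X', m.group('title'))
        for m in ITEM_RX.finditer(block)
    ]

block = (
    "- [ ] Patrick McKeown - The Oxygen Advantage\n"
    "- [X] Total Knee Health - Martin Koban\n"
)
for checked, title in parse_items(block):
    print(checked, title)
```

The `block` argument is meant to be the text returned by `getitems(section)` without the `selected` flag.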

Thanks, it improved the overall code. Also, regex101.com is a great site. – daleonpz


@daleonpz: Added a version that returns only the selected values. – Jan

0

I'm not sure you need a regex for the whole match. I would just use a regex to match the ** line, then return lines until the next ** line is seen.

Something like:

import re 

# escape the heading so characters such as "+" are matched literally 
pattern = re.compile(r"\*\* " + re.escape(head)) 

start = False 
output = [] 
for line in my_file: 
    if pattern.match(line): 
        start = True 
        continue 
    elif start and line.startswith("**"): # next heading after ours: stop 
        break 

    if start: 
        output.append(line) 

# now `output` should have the lines you want 
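Wrapped in a function, the line-by-line approach might look like the sketch below. `get_items` and the `sample` list are illustrative names, not part of the original answer, and the heading test uses plain `str.startswith` instead of a regex. Headings seen before the requested one are skipped, and the loop stops only at the first heading after it.

```python
def get_items(lines, head):
    """Collect the lines between the heading '** <head>' and the next '**' heading."""
    prefix = "** " + head
    collecting = False
    output = []
    for line in lines:
        if line.startswith("**"):
            if collecting:
                break  # reached the heading that follows our section
            # start collecting only if this is the heading that was asked for
            collecting = line.startswith(prefix)
            continue
        if collecting:
            output.append(line)
    return output

# Any iterable of lines works, e.g. an open file object.
sample = [
    "** Hardware [0/1]\n",
    "- [ ] adapt a programmable motor\n",
    "** Reading - Health [3/4]\n",
    "- [X] Total Knee Health - Martin Koban\n",
]
print(get_items(sample, "Reading - Health"))
```

Because the test is a prefix match, a short `head` such as "Reading" selects the first heading that starts with it.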

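Regex is great at matching structured data, such as a line that always has a specific format. It gets very complicated when you have to match a bunch of arbitrary text between the lines you care about, which is why I would generally avoid the approach you are attempting. – turbulencetoo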


On second look, the `pattern.match` in my answer could also just be a `line.startswith("** " + head)`. – turbulencetoo

1

If you are sure that the character * does not appear in your items, you can use:

re.compile(r"\*\* "+head+r" \[\d+/\d+\]\n([^*]+)\*?") 
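A minimal usage sketch of this pattern with `findall` follows. The `text` sample and the `getitems` wrapper are assumptions for illustration, `re.escape` is added so that titles containing regex metacharacters are matched literally, and the heading lines are assumed to end directly after the `[n/m]` counter with no trailing spaces.

```python
import re

# Sample data assumed for this illustration only.
text = (
    "** Hardware [0/1]\n"
    "- [ ] adapt a programmable motor\n"
    "** Reading - Health [3/4]\n"
    "- [X] Supple Leopard - Kelly Starrett\n"
    "- [X] Convict Conditioning 1 and 2\n"
)

def getitems(head):
    """Return a list with the body of every section whose heading is head."""
    # [^*]+ consumes everything up to the next "*" or the end of the text,
    # so the last section is captured as well.
    rx = re.compile(r"\*\* " + re.escape(head) + r" \[\d+/\d+\]\n([^*]+)\*?")
    return rx.findall(text)

print(getitems("Reading - Health"))
```

Since `[^*]+` stops at the first `*` anywhere, an item such as `- [ ] C*` would cut the result short, which is the caveat stated above.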

Thanks, it works :) – daleonpz