2017-02-24 56 views
6

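I am writing a class RecurringInterval which, based on the dateutil.rrule object, represents a recurring time interval. I have defined a custom, human-readable __str__ method for it, and I would also like to define a parse method that (similar to the rrulestr() function) parses the string back into an object. In Python, how can I parse a string representing a set of keyword arguments such that the order does not matter?

Here is the parse method and some test cases to go with it: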

import re 
from dateutil.rrule import FREQNAMES 
import pytest 

class RecurringInterval(object): 
    freq_fmt = "{freq}" 
    start_fmt = "from {start}" 
    end_fmt = "till {end}" 
    byweekday_fmt = "by weekday {byweekday}" 
    bymonth_fmt = "by month {bymonth}" 

    @classmethod
    def match_pattern(cls, string):
        SPACES = r'\s*'

        # The frequencies may be either lowercase or start with a capital letter
        freq_names = [freq.lower() for freq in FREQNAMES] + [freq.title() for freq in FREQNAMES]
        FREQ_PATTERN = '(?P<freq>{})?'.format("|".join(freq_names))

        # Start and end are required (their regular expressions match 1 repetition)
        START_PATTERN = cls.start_fmt.format(start=SPACES + r'(?P<start>.+?)')
        END_PATTERN = cls.end_fmt.format(end=SPACES + r'(?P<end>.+?)')

        # The remaining tokens are optional (their regular expressions match 0 or 1 repetitions)
        BYWEEKDAY_PATTERN = cls.optional(cls.byweekday_fmt.format(byweekday=SPACES + r'(?P<byweekday>.+?)'))
        BYMONTH_PATTERN = cls.optional(cls.bymonth_fmt.format(bymonth=SPACES + r'(?P<bymonth>.+?)'))

        # The trailing '$' is needed to make the non-greedy regular expressions parse till the end of the string
        PATTERN = (SPACES + FREQ_PATTERN
                   + SPACES + START_PATTERN
                   + SPACES + END_PATTERN
                   + SPACES + BYWEEKDAY_PATTERN
                   + SPACES + BYMONTH_PATTERN
                   + SPACES + "$")

        return re.match(PATTERN, string).groupdict()

    @staticmethod
    def optional(pattern):
        '''Encloses the given regular expression in an optional group (i.e., one that matches 0 or 1 repetitions of the original regular expression).'''
        return '({})?'.format(pattern)


'''Tests''' 
def test_match_pattern_with_byweekday_and_bymonth(): 
    string = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 by weekday Monday, Tuesday by month January, February" 

    groups = RecurringInterval.match_pattern(string) 
    assert groups['freq'] == "Weekly" 
    assert groups['start'].strip() == "2017-11-03 15:00:00" 
    assert groups['end'].strip() == "2017-11-03 16:00:00" 
    assert groups['byweekday'].strip() == "Monday, Tuesday" 
    assert groups['bymonth'].strip() == "January, February" 

def test_match_pattern_with_bymonth_and_byweekday(): 
    string = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 by month January, February by weekday Monday, Tuesday " 

    groups = RecurringInterval.match_pattern(string) 
    assert groups['freq'] == "Weekly" 
    assert groups['start'].strip() == "2017-11-03 15:00:00" 
    assert groups['end'].strip() == "2017-11-03 16:00:00" 
    assert groups['byweekday'].strip() == "Monday, Tuesday" 
    assert groups['bymonth'].strip() == "January, February" 


if __name__ == "__main__": 
    # pytest.main([__file__]) 
    pytest.main([__file__+"::test_match_pattern_with_byweekday_and_bymonth"])  # This passes 
    # pytest.main([__file__+"::test_match_pattern_with_bymonth_and_byweekday"])  # This fails 

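The parser works if you specify the arguments in the "right" order, but it is "inflexible" in that it does not allow the optional arguments to be given in arbitrary order. This is why the second test fails.

What would be a way to make the parser accept the "optional" fields in any order, so that both tests pass? (I was thinking of building an iterator over all permutations of the regular expressions and trying re.match on each one, but that does not seem like an elegant solution.)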
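
To make the failure concrete, here is a minimal sketch of the same fixed-order pattern with the dateutil dependency removed. The name `FIXED_ORDER` is made up for illustration, and the frequency alternation is reduced to `Weekly` as a simplifying assumption. With the clauses swapped, the regex still matches, but the non-greedy `bymonth` group swallows the trailing `by weekday` clause and `byweekday` stays unset:

```python
import re

# Hypothetical, simplified version of the fixed-order pattern built in match_pattern()
FIXED_ORDER = (
    r'\s*(?P<freq>Weekly)?'
    r'\s*from \s*(?P<start>.+?)'
    r'\s*till \s*(?P<end>.+?)'
    r'\s*(by weekday \s*(?P<byweekday>.+?))?'
    r'\s*(by month \s*(?P<bymonth>.+?))?'
    r'\s*$'
)

swapped = ("Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00 "
           "by month January, February by weekday Monday, Tuesday")

groups = re.match(FIXED_ORDER, swapped).groupdict()

# The "by weekday" group is skipped, so its capture is None ...
print(groups['byweekday'])
# ... and the lazy "by month" capture runs on to the end of the string.
print(groups['bymonth'])
```

Under these assumptions, calling `.strip()` on `groups['byweekday']` raises an AttributeError because the value is None, which is how the second test would fail.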

+2

Could you reduce your code? It looks like an interesting problem, but at the moment it is just a "wall of code" to me. A [mcve] would be nice. –

+0

Sure. The original snippet covered almost all of the [dateutil.rrule](http://dateutil.readthedocs.io/en/stable/rrule.html) parameters, but I removed the ones not used in the tests to reduce the number of lines. –

+0

I don't have the time or the insight to answer, but have an upvote. –

Answers

3

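At this point your language is getting complex enough that it is time to drop regular expressions and learn how to use a proper parsing library. I put this together using pyparsing and have commented it heavily to explain what is going on, but if anything is unclear, ask and I will try to explain.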

from pyparsing import Regex, oneOf, OneOrMore 

# Boring old constants, I'm sure you know how to fill these out... 
months  = ['January', 'February'] 
weekdays = ['Monday', 'Tuesday'] 
frequencies = ['Daily', 'Weekly'] 

# A datetime expression is anything matching this regex. We could split it down 
# even further to get day, month, year attributes in our results object if we felt 
# like it 
datetime_expr = Regex(r'(\d{4})-(\d\d?)-(\d\d?) (\d{2}):(\d{2}):(\d{2})') 

# A from or till expression is the word "from" or "till" followed by any valid datetime 
from_expr = 'from' + datetime_expr.setResultsName('from_') 
till_expr = 'till' + datetime_expr.setResultsName('till') 

# A range expression is a from expression followed by a till expression 
range_expr = from_expr + till_expr 

# A weekday is any old weekday 
weekday_expr = oneOf(weekdays) 
month_expr = oneOf(months) 
frequency_expr = oneOf(frequencies) 

# A by weekday expression is the words "by weekday" followed by one or more weekdays 
by_weekday_expr = 'by weekday' + OneOrMore(weekday_expr).setResultsName('weekdays') 
by_month_expr = 'by month' + OneOrMore(month_expr).setResultsName('months') 

# A recurring interval, then, is a frequency, followed by a range, followed by 
# a weekday and a month, in any order 
recurring_interval = frequency_expr + range_expr + (by_weekday_expr & by_month_expr) 

# Let's parse! 
if __name__ == '__main__': 
    res = recurring_interval.parseString('Daily from 1111-11-11 11:00:00 till 1111-11-11 12:00:00 by weekday Monday by month January February') 

    # Note that setResultsName causes everything to get packed neatly into 
    # attributes for us, so we can pluck all the bits and pieces out with no 
    # difficulty at all 
    print(res)
    print(res.from_)
    print(res.till)
    print(res.weekdays)
    print(res.months)
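
The strings in the question separate list items with commas and treat both "by" clauses as optional, which the grammar above does not cover. A minimal sketch of one way to extend it, assuming pyparsing's `Optional`, `Group`, and `delimitedList` helpers (the `parse` wrapper and the `freq` results name are made up for illustration):

```python
from pyparsing import Group, Optional, Regex, delimitedList, oneOf

months = ['January', 'February']
weekdays = ['Monday', 'Tuesday']
frequencies = ['Daily', 'Weekly']

datetime_expr = Regex(r'\d{4}-\d\d?-\d\d? \d{2}:\d{2}:\d{2}')

# delimitedList accepts comma-separated items and drops the commas;
# Group keeps all items together under a single results name
by_weekday_expr = 'by weekday' + Group(delimitedList(oneOf(weekdays)))('weekdays')
by_month_expr = 'by month' + Group(delimitedList(oneOf(months)))('months')

# '&' builds an Each expression: its operands may appear in any order,
# and operands wrapped in Optional may be left out entirely
recurring_interval = (
    oneOf(frequencies)('freq')
    + 'from' + datetime_expr('from_')
    + 'till' + datetime_expr('till')
    + (Optional(by_weekday_expr) & Optional(by_month_expr))
)


def parse(text):
    """Hypothetical helper: parse the whole string or raise ParseException."""
    return recurring_interval.parseString(text, parseAll=True)


base = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00"
month_first = parse(base + " by month January, February by weekday Monday, Tuesday")
weekday_first = parse(base + " by weekday Monday, Tuesday by month January, February")
bare = parse(base)

print(month_first.weekdays.asList(), month_first.months.asList())
```

Because `&` handles the ordering, adding a tenth optional clause means adding one more `Optional(...)` operand rather than enumerating permutations.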
+1

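I agree. Regular expressions are a great tool and very useful for simple parsing. But once things get too complex, a regex only adds to the complexity. *We have a parsing problem. Just use a regex. Now we have two problems* –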

+0

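From the term '((by_weekday_expr + by_month_expr) | (by_month_expr + by_weekday_expr))' it seems that this approach still requires you to list the possible permutations of the optional arguments. In this simple example there are only two, but in the real application there are more like 10, which would lead to 10! = 3,628,800 permutations. I suppose these could be generated and "joined", but that is still not very elegant and would probably make the code slow, wouldn't it? –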

+1

@KurtPeek Indeed, I suspected as much, and it took some digging through the documentation. Take a look at the edit: '&' is what we need. – ymbirtt

1

There are a number of options here, each with different drawbacks.

One approach is to use a repeated alternation, like (by weekday|by month)*:

(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?)(?:\s+by weekday (?P<byweekday>.+?)|\s+by month (?P<bymonth>.+?))*$ 

This will match strings of the form "week month" and "month week", but also "week week" or "month week month".

Another option is to use lookaheads, like (?=.*by weekday)?(?=.*by month)?:
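
As a quick check of that behaviour, the regex above can be run against both orderings and against a repeated clause. This is only a sketch; the variable names are made up for illustration:

```python
import re

# The repeated-alternation regex from above, split across two lines
ALTERNATION = re.compile(
    r'(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?)'
    r'(?:\s+by weekday (?P<byweekday>.+?)|\s+by month (?P<bymonth>.+?))*$'
)

base = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00"

weekday_first = ALTERNATION.match(
    base + " by weekday Monday, Tuesday by month January, February").groupdict()
month_first = ALTERNATION.match(
    base + " by month January, February by weekday Monday, Tuesday").groupdict()

# Both orderings produce the same captures
print(weekday_first == month_first)

# A repeated clause is accepted as well; a group that matched several
# times only keeps its last match, so "Monday" is lost here
repeated = ALTERNATION.match(
    base + " by weekday Monday by weekday Tuesday").groupdict()
print(repeated['byweekday'])
```

If duplicates should be rejected, that check has to happen outside the regex, for example by counting the keyword occurrences before matching.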

(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?(?=$| by))(?=.*\s+by weekday (?P<byweekday>.+?(?=$| by))|)(?=.*\s+by month (?P<month>.+?(?=$| by))|) 

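However, this requires a known delimiter (I used " by") to know how far each match extends. It will also silently ignore any extra characters (meaning it will match strings of the form by weekday [some garbage] by month).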
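
A short sketch of both effects with the lookahead regex above, copied as written (so the month group is named `month`, not `bymonth`); the variable names and the `by foo bar` clause are made up for illustration:

```python
import re

# The lookahead regex from above, split across three lines
LOOKAHEAD = re.compile(
    r'(?P<freq>Weekly)?\s+from (?P<start>.+?)\s+till (?P<end>.+?(?=$| by))'
    r'(?=.*\s+by weekday (?P<byweekday>.+?(?=$| by))|)'
    r'(?=.*\s+by month (?P<month>.+?(?=$| by))|)'
)

base = "Weekly from 2017-11-03 15:00:00 till 2017-11-03 16:00:00"

# Clause order does not matter: each lookahead scans ahead from the same position
swapped = LOOKAHEAD.match(
    base + " by month January, February by weekday Monday, Tuesday").groupdict()
print(swapped['byweekday'], '|', swapped['month'])

# An unknown clause is skipped without any error
sloppy = LOOKAHEAD.match(base + " by foo bar by month January").groupdict()
print(sloppy['byweekday'], '|', sloppy['month'])
```

Since the lookaheads consume nothing, the overall match ends right after the `end` group, so a separate validation step is needed if unknown clauses should be rejected.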