2016-12-03
0

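I need to extract the date/time and the location from this string. Is there a more efficient way? This approach is also error-prone: for example, the word in front of the time may not always be "from". How do you extract certain characters from a string in Python?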

text = ('Join us for a guided tour of the Campus given by the '
        'Admissions staff. The tour will take place from 3:15-4:00 PM EST '
        'and leaves from the Admissions Office in x House. No registration required.')

length = len(text)

# Scan for the literal word "from" one character at a time, then print
# the 17 characters that start there.
for x in range(length - 3):
    if text[x] == 'f':
        if text[x + 1] == 'r':
            if text[x + 2] == 'o':
                if text[x + 3] == 'm':
                    print(text[x:x + 17])
                    break

= from 3:15-4:00 PM

+1

Use a regular expression. – xbonez

+0

Oh yes, just compare the simple regex suggestion with the NNN-line solution in one of the answers below. –

Answers

3

To extract the start time from a time range, use this regex:

(?i)\b(\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?(\s*[pa]m)\b 

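See the regex demo.

Details

  • (?i) - turns case-insensitive matching on
  • \b - a leading word boundary
  • (\d{1,2}:\d{2}) - Group 1, capturing 1 or 2 digits, a colon, and 2 digits
  • (?:-\d{1,2}:\d{2})? - an optional non-capturing group that matches 1 or 0 occurrences of:
    • - - a hyphen
    • \d{1,2} - 1 or 2 digits
    • : - a colon
    • \d{2} - 2 digits
  • (\s*[pa]m) - Group 2, capturing the sequence:
    • \s* - 0 or more whitespace characters
    • [pa] - p or a (or P or A, since matching is case-insensitive)
    • m - m (or M)
  • \b - a trailing word boundary.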

Python demo

import re

rx = r"(?i)\b(\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?(\s*[pa]m)\b"
s = "Join us for a guided tour of the Campus given by the \nAdmissions staff. The tour will take place from 3:15-4:00 PM EST or from 7:30 AM EST \nand leaves from the Admissions Office in x House. No registration required."
# Join group 1 (start time) and group 2 (AM/PM marker) for every match
matches = ["{}{}".format(x.group(1), x.group(2)) for x in re.finditer(rx, s)]
print(matches)  # ['3:15 PM', '7:30 AM']

Since the result is in 2 separate groups, we need to iterate over all the matches and concatenate the two group values.
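The group-by-group breakdown above can also be kept next to the pattern itself. The following is a minimal sketch, not part of the original answer, that writes the same expression with `re.VERBOSE` so each part carries its own comment. The names `TIME_RX`, `start_times` and `sample` are made up for illustration.

```python
import re

# Same pattern as in the answer, spelled out in verbose mode.
# TIME_RX and start_times are hypothetical names used only for this sketch.
TIME_RX = re.compile(
    r"""
    \b                     # leading word boundary
    (\d{1,2}:\d{2})        # group 1: start time, e.g. 3:15
    (?:-\d{1,2}:\d{2})?    # optional end time, e.g. -4:00 (not captured)
    (\s*[pa]m)             # group 2: optional whitespace plus AM/PM
    \b                     # trailing word boundary
    """,
    re.IGNORECASE | re.VERBOSE,
)


def start_times(text):
    """Return every start time joined with its AM/PM marker."""
    return [m.group(1) + m.group(2) for m in TIME_RX.finditer(text)]


sample = "The tour runs from 3:15-4:00 PM EST or from 7:30 AM EST."
print(start_times(sample))  # ['3:15 PM', '7:30 AM']
```

Verbose mode ignores whitespace inside the pattern, which is why the literal space before AM/PM is still expressed as `\s*`.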

+0

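Thanks! How would you modify rx to include the 7:30 AM time, since sometimes there is only a start time? Also, how would you extract the location of the event, for example 'x house', or is that not possible because of how much a location can vary? – Tank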

+0

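Use an *optional non-capturing group*, '(?:...)?'. I modified the expression, added an explanation, and updated the demo link and the code in the answer. –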

+0

Thanks again. I only need the start time, so for 7:30-8:30 AM, can I create an rx that gives me only 7:30? I can't seem to create one that leaves out the 8:30. – Tank
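One possible way to get only the start time, sketched here as an assumption rather than as the answerer's reply: keep a single capturing group around the start time and leave the AM/PM part uncaptured, so `re.findall` returns just group 1. The names `START_ONLY_RX` and `start_only` are hypothetical.

```python
import re

# Only the start time is captured; the optional end time and the AM/PM
# marker still have to be present for a match, but they are not returned.
START_ONLY_RX = re.compile(
    r"\b(\d{1,2}:\d{2})(?:-\d{1,2}:\d{2})?\s*[pa]m\b",
    re.IGNORECASE,
)


def start_only(text):
    """Return the start time of every 'h:mm[-h:mm] AM/PM' expression."""
    # With exactly one capturing group, findall returns that group's text.
    return START_ONLY_RX.findall(text)


print(start_only("Open 7:30-8:30 AM, then again from 3:15 PM."))  # ['7:30', '3:15']
```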

0

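You can use the regex:

r"from [^A-Za-z]+(?:AM|PM)"

This looks for text that starts with "from" and is not followed by any letters (other than AM or PM). For your text it returns:

from 3:15-4:00 PM

You can use it as follows: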

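import re

# re.search returns a match object (or None), so print the matched text itself
match = re.search(r"from [^A-Za-z]+(?:AM|PM)", text)
if match:
    print(match.group(0))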
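If the leading "from " is not wanted in the result, a capturing group can isolate the time range. This is a small sketch under that assumption; the sample `text` is shortened from the question and `time_range` is a made-up name.

```python
import re

text = ("The tour will take place from 3:15-4:00 PM EST "
        "and leaves from the Admissions Office in x House.")

# The parentheses capture everything after "from ", so group(1) omits the prefix.
match = re.search(r"from ([^A-Za-z]+(?:AM|PM))", text)
time_range = match.group(1) if match else None
print(time_range)  # 3:15-4:00 PM
```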

0

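You are not limited to regular expressions when parsing string content.

Instead of regular expressions, you can use the parsing technique described below. It is similar to the technique used in compilers.

To start, take a look at this example.


Simple example. It only finds the times in the text.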

TEXT = ('Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place '
        'from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.\n'
        'The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.\n'
        'The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.\n'
        'The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.\n'
        'The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.\n'
        'No registration required. ')

TIME_SEPARATORS = ':-'

time_text_start = None
time_text_end = None
time_text = ''

for index, char in enumerate(TEXT):
    # The first digit marks the start of a time expression
    if time_text_start is None:
        if char.isdigit():
            time_text_start = index
    # The first character that is not a digit, whitespace or separator ends it
    if (time_text_start is not None) and (time_text_end is None):
        if (not char.isdigit()) and (not char.isspace()) and (char not in TIME_SEPARATORS):
            time_text_end = index
            time_text = TEXT[time_text_start: time_text_end].strip()

            print(time_text)

            # Now we clear our variables to be able to find the next time_text in the text
            time_text_start = None
            time_text_end = None
            time_text = ''

This code prints the following:

3:15-4:00 
7:30 
17:30 
9:30-11:00 
15:00-16:25 

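Real code

Now you can look at the real code. It finds all the data you need: time, period, time standard, and location.

The location in the text must come after the time, between the words "in" and "house".

To add more search conditions, you can modify the def find(self, text_to_process) method of the EventsDataFinder class.

To change the format (for example, to return the full time range or only the end time), you can modify the def _prepare_event_data(time_text, time_period, time_standard, event_place) method of the EventsDataFinder class. PS: I understand that classes can be hard for beginners to follow, so I tried to keep this code as simple as possible. But without classes the code would be hard to understand, so there are a few.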

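class TextUnit:
    """A piece of found text together with its start and end indexes."""
    text = ''
    start = None
    end = None
    absent = False  # True once we know this unit does not occur in the text

    def fill_from_text(self, text):
        self.text = text[self.start: self.end].strip()

    def clear(self):
        self.text = ''
        self.start = None
        self.end = None
        self.absent = False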


class EventsDataFinder:
    time_standards = {
        'est',
        'utc',
        'dst',
        'edt'
    }
    time_standard_text_len = 3

    period = {
        'am',
        'pm'
    }
    period_text_len = 2

    time_separators = ':-'

    event_place_start_indicator = ' in '
    event_place_end_indicator = ' house'

    fake_text_end = '.'

    def find(self, text_to_process):
        '''
        This method parses the given text and returns a list of tuples. Each tuple contains the time of the event
        in the desired format and the location of the event.
        :param text_to_process: text to parse
        :return: list of tuples. For example [('3:15 PM EST', 'AA A AAA'), ('7:30 AM UTC', 'B BBB')]
        '''
        text = text_to_process.replace('\n', '')
        text += self.fake_text_end

        time_text = TextUnit()
        time_period = TextUnit()
        time_standard = TextUnit()
        event_place = TextUnit()

        result_events = list()

        for index, char in enumerate(text):
            # Time text
            if time_text.start is None:
                if char.isdigit():
                    time_text.start = index
            if (time_text.start is not None) and (time_text.end is None):
                if (not char.isdigit()) and (not char.isspace()) and (char not in self.time_separators):
                    time_text.end = index
                    time_text.fill_from_text(text)

            # Time period
            # If time_text is already found:
            if (time_text.end is not None) and \
                    (time_period.end is None) and (not time_period.absent) and \
                    (not char.isspace()):
                potential_period = text[index: index + self.period_text_len].lower()
                if potential_period in self.period:
                    time_period.start = index
                    time_period.end = index + self.period_text_len
                    time_period.fill_from_text(text)
                else:
                    time_period.absent = True

            # Time standard
            # If time_period is already found or does not exist:
            if (time_period.absent or ((time_period.end is not None) and (index >= time_period.end))) and \
                    (time_standard.end is None) and (not time_standard.absent) and \
                    (not char.isspace()):
                potential_standard = text[index: index + self.time_standard_text_len].lower()
                if potential_standard in self.time_standards:
                    time_standard.start = index
                    time_standard.end = index + self.time_standard_text_len
                    time_standard.fill_from_text(text)
                else:
                    time_standard.absent = True

            # Event place
            # If time_standard is already found or does not exist:
            if (time_standard.absent or ((time_standard.end is not None) and (index >= time_standard.end))) and \
                    (event_place.end is None) and (not event_place.absent):
                if self.event_place_end_indicator.startswith(char.lower()):
                    potential_event_place = text[index: index + len(self.event_place_end_indicator)].lower()
                    if potential_event_place == self.event_place_end_indicator:
                        event_place.end = index
                        potential_event_place_start = text.rfind(self.event_place_start_indicator,
                                                                 time_text.end,
                                                                 event_place.end)
                        if potential_event_place_start > 0:
                            event_place.start = potential_event_place_start + len(self.event_place_start_indicator)
                            event_place.fill_from_text(text)
                        else:
                            event_place.absent = True

            # Saving result and clearing temporary data holders
            # If event_place is already found or does not exist:
            if event_place.absent or (event_place.end is not None):
                result_events.append(self._prepare_event_data(time_text,
                                                              time_period,
                                                              time_standard,
                                                              event_place))
                time_text.clear()
                time_period.clear()
                time_standard.clear()
                event_place.clear()

        # This code saves the data of the last incomplete event (all that was found), if there is one.
        if (time_text.end is not None) and (event_place.end is None):
            result_events.append(self._prepare_event_data(time_text,
                                                          time_period,
                                                          time_standard,
                                                          event_place))

        return result_events

    @staticmethod
    def _prepare_event_data(time_text, time_period, time_standard, event_place):
        '''
        This method prepares the found data to be saved in the desired format
        :param time_text: text of time
        :param time_period: text of period
        :param time_standard: text of time standard
        :param event_place: location of the event
        :return: a ready-to-save tuple. For example ('3:15 PM EST', 'AA A AAA')
        '''
        event_time = time_text.text  # '3:15-4:00'
        split_time = event_time.split('-')  # ['3:15', '4:00']
        if len(split_time) > 1:
            # If it was, for example, '3:15-4:00 PM EST' in the text
            start_time = split_time[0].strip()  # '3:15'
            end_time = split_time[1].strip()  # '4:00' (unused by default; available for other formats)
        else:
            # If it was, for example, '3:15 PM EST' in the text
            start_time = event_time  # '3:15'
            end_time = ''  # ''
        period = time_period.text.upper()  # 'PM'
        standard = time_standard.text.upper()  # 'EST'
        event_place = event_place.text  # 'AA A AAA'

        # Removing empty time fields (for example if there is no period or time standard in the text)
        time_data_separated = [item for item in (start_time, period, standard) if item]

        event_time_interval = ' '.join(time_data_separated)
        result = (event_time_interval, event_place)

        return result


TEXT = ('Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place '
        'from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.\n'
        'The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.\n'
        'The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.\n'
        'The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.\n'
        'The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.\n'
        'No registration required. ')

edf = EventsDataFinder()

print(edf.find(TEXT))

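Let's say we have the following text:

Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.

The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.

The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.

The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.

The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.

No registration required.

Then this code prints: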

[('3:15 PM EST', 'AA A AAA'), ('7:30 AM UTC', 'B BBB'), ('17:30 UTC', 'C CCC C'), ('9:30 AM', 'DDD'), ('15:00', 'EE EE')] 
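The answer mentions changing the output format by editing _prepare_event_data. As a rough, standalone sketch of what such a change might look like (it is not the author's code, and the function name `format_event_time` and the `part` option are assumptions made for illustration), the formatting step could take a switch that selects the start time, the end time, or the full range:

```python
def format_event_time(time_text, period="", standard="", part="start"):
    """Join the chosen part of a time range with its period and time standard.

    part is "start", "end" or "full" (hypothetical option names).
    """
    start, _, end = time_text.partition("-")
    start, end = start.strip(), end.strip()
    pieces = {
        "start": start,
        "end": end or start,        # a single time has no separate end
        "full": time_text.strip(),
    }
    chosen = pieces[part]
    # Skip empty fields, e.g. when the text has no period or time standard
    return " ".join(item for item in (chosen, period.upper(), standard.upper()) if item)


print(format_event_time("3:15-4:00", "pm", "est"))               # 3:15 PM EST
print(format_event_time("3:15-4:00", "pm", "est", part="end"))   # 4:00 PM EST
print(format_event_time("3:15-4:00", "pm", "est", part="full"))  # 3:15-4:00 PM EST
```

Inside the class, the same idea would mean building the first tuple element from `end_time` or from the unsplit `event_time` instead of `start_time`.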
+0

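Thank you very much! This is very interesting. How do I add more values for event_place_start_indicator = ' in ' and event_place_end_indicator = ' house'? I tried adding an "or" between or after 'in' and 'house', but it didn't seem to work. – Tank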

+0

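The current code is written for single indicators (as you can see, 'in' and 'house'). To use several options, you need something like the following: event_place_start_indicator = {'in', 'on', 'at'} and event_place_end_indicator = {'house', 'city', 'country'}. But in that case you need to change the behavior of the def find(self, text_to_process) method: you need to change the block of code that starts with the "# Event place" comment. You will have to do that yourself (or ask someone else). I have spent a lot of time answering this question, and I have other things to do. Good luck! – KromviellBlack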
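As a hedged illustration of the multi-indicator idea in the comment above, here is a standalone sketch. It is not a drop-in replacement for the "# Event place" block, it uses regular expressions rather than the character-by-character scan, and the function name `find_event_place` and its parameters are hypothetical. It follows the same order as the class: find the first end indicator, then the last start indicator before it.

```python
import re


def find_event_place(text, start_words=("in", "on", "at"),
                     end_words=("house", "city", "country")):
    """Return the text between the last start word and the first end word, or None."""
    end_rx = re.compile(
        r"\s(?:%s)\b" % "|".join(map(re.escape, end_words)), re.IGNORECASE)
    end_match = end_rx.search(text)
    if end_match is None:
        return None

    # Only look for start words in the part before the end indicator
    head = text[:end_match.start()]
    start_rx = re.compile(
        r"\s(?:%s)\s" % "|".join(map(re.escape, start_words)), re.IGNORECASE)
    starts = list(start_rx.finditer(head))
    if not starts:
        return None

    # Take the start word closest to the end indicator, like rfind() in the class
    return head[starts[-1].end():].strip() or None


print(find_event_place("7:30 AM UTC and leaves from the Admissions Office in B BBB House."))
# B BBB
print(find_event_place("Meet at 9:00 on Main Street in Old Town City"))
# Old Town
```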

+0

Perfect! Thank you. – Tank
