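You are not limited to regular expressions for parsing string content.
Instead of a regular expression, you can use the parsing technique described below. It is similar to the technique used in compilers.
To start, look at this simple example of the technique.
It only finds the times in the text.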
TEXT = 'Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place ' \
'from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.\n' \
'The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.\n' \
'The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.\n' \
'The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.\n' \
'The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.\n' \
'No registration required. '
TIME_SEPARATORS = ':-'
time_text_start = None
time_text_end = None
time_text = ''
index = 0
for char in TEXT:
    if time_text_start is None:
        if char.isdigit():
            time_text_start = index
    if (time_text_start is not None) and (time_text_end is None):
        if (not char.isdigit()) and (not char.isspace()) and (char not in TIME_SEPARATORS):
            time_text_end = index
            time_text = TEXT[time_text_start: time_text_end].strip()
            print(time_text)
            # Now we will clear our variables to be able to find next time_text data in the text
            time_text_start = None
            time_text_end = None
            time_text = ''
    index += 1
This code will print:
3:15-4:00
7:30
17:30
9:30-11:00
15:00-16:25
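The same scan can be wrapped in a function, which makes it easier to reuse and test. This is only a sketch: the name `find_times` and the trailing `'.'` sentinel are assumptions added here for illustration. The sentinel closes a time that sits at the very end of the text, which the loop above would otherwise leave unreported.

```python
def find_times(text, separators=':-'):
    """Return every run of digits, spaces and separators that starts with a digit."""
    found = []
    start = None
    # The appended '.' is a sentinel: it ends a time that runs to the end of the text.
    for index, char in enumerate(text + '.'):
        if start is None:
            if char.isdigit():
                start = index
        elif not (char.isdigit() or char.isspace() or char in separators):
            # First character that cannot belong to a time: cut the slice here.
            found.append(text[start:index].strip())
            start = None
    return found


print(find_times('Open 9:30-11:00 AM, closed after 17:30'))
# ['9:30-11:00', '17:30']
```

Using `enumerate` removes the manual `index` counter, and returning a list instead of printing lets the caller decide what to do with the matches.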
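Real code
Now you can look at the real code. It finds all the data you need: time, period, time standard and location.
The location must appear in the text after the time, between the words "in" and "house".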
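The "in ... house" rule can be tried on its own before reading the full class. The helper below is a hypothetical sketch, not part of the code in this answer: `find_place` and its parameters are names invented here, and it assumes plain ASCII text so that `lower()` does not change the string length.

```python
def find_place(text, search_from=0, start_marker=' in ', end_marker=' house'):
    """Return the text between the last start_marker and the first end_marker, or ''."""
    # Assumption: lower() keeps the length unchanged, which holds for ASCII text.
    lowered = text.lower()
    end = lowered.find(end_marker, search_from)
    if end == -1:
        return ''
    # Search backwards so that the ' in ' closest to ' house' wins.
    start = lowered.rfind(start_marker, search_from, end)
    if start == -1:
        return ''
    return text[start + len(start_marker):end].strip()


print(find_place('The tour leaves from the Admissions Office in B BBB House.'))
# B BBB
```

Passing the index where the time ends as `search_from` keeps the search from picking up a place that belongs to an earlier event.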
To add more search conditions, you can modify the find(self, text_to_process) method of the EventsDataFinder class.
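To change the format (for example, to return the full time range including the end time), you can modify the _prepare_event_data(time_text, time_period, time_standard, event_place) method of the EventsDataFinder class.
PS: I understand that classes may be hard for beginners to follow, so I tried to keep this code as simple as possible. But without classes the code would be even harder to understand, so there are a few.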
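As an illustration of the format change mentioned above, here is one way the end time could be kept in the result. This is a sketch under assumptions: `format_event_time` is a hypothetical standalone helper that takes plain strings rather than the TextUnit objects used in this answer, so it would need adapting before it could replace the body of _prepare_event_data.

```python
def format_event_time(time_text, period='', standard=''):
    """Build a string such as '3:15-4:00 PM EST', keeping the end time when present."""
    # Split once on '-' and drop stray spaces around each part.
    parts = [part.strip() for part in time_text.split('-', 1)]
    time_range = '-'.join(part for part in parts if part)
    # Skip empty fields so a missing period or standard leaves no double spaces.
    fields = [time_range, period.upper(), standard.upper()]
    return ' '.join(field for field in fields if field)


print(format_event_time('3:15 - 4:00', 'pm', 'est'))  # 3:15-4:00 PM EST
print(format_event_time('17:30', standard='utc'))     # 17:30 UTC
```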
class TextUnit:
    text = ''
    start = None
    end = None
    absent = False

    def fill_from_text(self, text):
        self.text = text[self.start: self.end].strip()

    def clear(self):
        self.text = ''
        self.start = None
        self.end = None
        self.absent = False
class EventsDataFinder:
    time_standards = {
        'est',
        'utc',
        'dst',
        'edt'
    }
    time_standard_text_len = 3
    period = {
        'am',
        'pm'
    }
    period_text_len = 2
    time_separators = ':-'
    event_place_start_indicator = ' in '
    event_place_end_indicator = ' house'
    fake_text_end = '.'

    def find(self, text_to_process):
        '''
        This method will parse given text and will return list of tuples. Each tuple will contain time of the event
        in the desired format and location of the event.
        :param text_to_process: text to parse
        :return: list of tuples. For example [('3:15 PM EST', 'AA A AAA'), ('7:30 AM UTC', 'B BBB')]
        '''
        text = text_to_process.replace('\n', '')
        text += self.fake_text_end
        time_text = TextUnit()
        time_period = TextUnit()
        time_standard = TextUnit()
        event_place = TextUnit()
        result_events = list()
        index = -1
        for char in text:
            index += 1
            # Time text
            if time_text.start is None:
                if char.isdigit():
                    time_text.start = index
            if (time_text.start is not None) and (time_text.end is None):
                if (not char.isdigit()) and (not char.isspace()) and (char not in self.time_separators):
                    time_text.end = index
                    time_text.fill_from_text(text)
            # Time period
            # If time_text is already found:
            if (time_text.end is not None) and \
                    (time_period.end is None) and (not time_period.absent) and \
                    (not char.isspace()):
                potential_period = text[index: index + self.period_text_len].lower()
                if potential_period in self.period:
                    time_period.start = index
                    time_period.end = index + self.period_text_len
                    time_period.fill_from_text(text)
                else:
                    time_period.absent = True
            # Time standard
            # If time_period is already found or does not exist:
            if (time_period.absent or ((time_period.end is not None) and (index >= time_period.end))) and \
                    (time_standard.end is None) and (not time_standard.absent) and \
                    (not char.isspace()):
                potential_standard = text[index: index + self.time_standard_text_len].lower()
                if potential_standard in self.time_standards:
                    time_standard.start = index
                    time_standard.end = index + self.time_standard_text_len
                    time_standard.fill_from_text(text)
                else:
                    time_standard.absent = True
            # Event place
            # If time_standard is already found or does not exist:
            if (time_standard.absent or ((time_standard.end is not None) and (index >= time_standard.end))) and \
                    (event_place.end is None) and (not event_place.absent):
                if self.event_place_end_indicator.startswith(char.lower()):
                    potential_event_place = text[index: index + len(self.event_place_end_indicator)].lower()
                    if potential_event_place == self.event_place_end_indicator:
                        event_place.end = index
                        potential_event_place_start = text.rfind(self.event_place_start_indicator,
                                                                 time_text.end,
                                                                 event_place.end)
                        if potential_event_place_start > 0:
                            event_place.start = potential_event_place_start + len(self.event_place_start_indicator)
                            event_place.fill_from_text(text)
                        else:
                            event_place.absent = True
            # Saving result and clearing temporary data holders
            # If event_place is already found or does not exist:
            if event_place.absent or (event_place.end is not None):
                result_events.append(self._prepare_event_data(time_text,
                                                              time_period,
                                                              time_standard,
                                                              event_place))
                time_text.clear()
                time_period.clear()
                time_standard.clear()
                event_place.clear()
        # This code will save data of the last incomplete event (all that was found). If it exists of course.
        if (time_text.end is not None) and (event_place.end is None):
            result_events.append(self._prepare_event_data(time_text,
                                                          time_period,
                                                          time_standard,
                                                          event_place))
        return result_events

    @staticmethod
    def _prepare_event_data(time_text, time_period, time_standard, event_place):
        '''
        This method will prepare found data to be saved in a desired format
        :param time_text: text of time
        :param time_period: text of period
        :param time_standard: text of time standard
        :param event_place: location of the event
        :return: will return ready to save tuple. For example ('3:15 PM EST', 'AA A AAA')
        '''
        event_time = time_text.text  # '3:15-4:00'
        split_time = event_time.split('-')  # ['3:15', '4:00']
        if 1 < len(split_time):
            # If it was, for example, '3:15-4:00 PM EST' in the text
            start_time = split_time[0].strip()  # '3:15'
            end_time = split_time[1].strip()  # '4:00'
        else:
            # If it was, for example, '3:15 PM EST' in the text
            start_time = event_time  # '3:15'
            end_time = ''  # ''
        period = time_period.text.upper()  # 'PM'
        standard = time_standard.text.upper()  # 'EST'
        event_place = event_place.text  # 'AA A AAA'
        # Removing empty time fields (for example if there is no period or time standard in the text)
        time_data_separated = [start_time, period, standard]
        new_time_data_separated = list()
        for item in time_data_separated:
            if item:
                new_time_data_separated.append(item)
        time_data_separated = new_time_data_separated
        event_time_interval = ' '.join(time_data_separated)
        result = (event_time_interval, event_place)
        return result
TEXT = 'Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place ' \
'from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House. No registration required.\n' \
'The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.\n' \
'The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.\n' \
'The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.\n' \
'The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.\n' \
'No registration required. '
edf = EventsDataFinder()
print(edf.find(TEXT))
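Let's say we have the following text:
Join us for a guided tour of the Campus given by the Admissions staff. The tour will take place from 3:15-4:00 PM EST and leaves from the Admissions Office in AA A AAA House.
The tour will take place from 7:30 AM UTC and leaves from the Admissions Office in B BBB House.
The tour will take place 17:30 UTC and leaves from the Admissions Office in C CCC C House.
The tour will take place 9:30-11:00 AM and leaves from the Admissions Office in DDD House.
The tour will take place 15:00-16:25 and leaves from the Admissions Office in EE EE House.
No registration required.
So this code will print: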
[('3:15 PM EST', 'AA A AAA'), ('7:30 AM UTC', 'B BBB'), ('17:30 UTC', 'C CCC C'), ('9:30 AM', 'DDD'), ('15:00', 'EE EE')]
Use regular expressions – xbonez
Oh yes, just compare the simple regex solution suggested in one of the answers below with these NNN lines of code. –