2017-07-03 79 views
-1
Using regular expressions to extract all relevant parts of a text string between a start and an end match

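I have already posted a similar question about character extraction with regular expressions in Python, but I have another problem involving the non-greedy quantifier, so I am asking a new question with a different example. I need to use regular expressions in Python to extract all relevant parts of a text string that lie between two specific matches. Specifically, here is an example text: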

example = """ 
    The Bank does offer a hybrid loan. Hybrid loans are loans that start as a 
    fixed rate mortgage but after a set number of years automatically adjust 
    to an adjustable rate mortgage. The Bank offers a three year fixed rate mortgage 
    after which the interest rate will adjust annually. Item 1. Business 3-13 Item 1a. 
    Risk Factors 13-15 Item 1b. Unresolved Staff Comments 15 Item 2. Properties 15-16 
    The forward-looking statements are made as of the date of this report, 
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a\n community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio. 
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area. 
    """ 

I want to extract the parts of the text "between" the start match "Item 1." and the end match "Item 2.", so the final result should look like this:

final_result_1 = """ 
    ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a\n community bank operating 
    in Northwest Ohio since 1897. 
    """ 

final_result_2 = """ 
    Item 1. Business 3-13 Item 1a. 
    Risk Factors 13-15 Item 1b. Unresolved Staff Comments 15 
    """ 

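The final results should be ordered by text length, so 'final_result_1' is the longer of the two sections and 'final_result_2' is the shorter one. You can refer to the answers to my previous question. Thank you in advance!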

+0

I would love to help, but this question is very confusing. Could you create a short example text and explain what output you want? –

+0

@krcoder, you need to exclude "ITEM 2" from the text, right? –

+0

@code_byter, that is right, and 'Item 2' is excluded from 'final_result_2' as well. – krcoder

Answers

1

I believe you need to use

import re
example = """ 
    The forward-looking statements are made as of the date of this report, 
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio. 
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area. 
""" 
# Raw string avoids invalid-escape warnings; [\s\S]* is greedy and also spans newlines.
matches = re.findall(r'(ITEM 1[\s\S]*)ITEM 2', example, re.IGNORECASE)
# matches is a list of every captured section (the closing "ITEM 2" is not captured).
matches.sort(key=len, reverse=True)
# matches is now ordered by length, longest first.
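
As a side note, here is a minimal sketch of how the greedy `[\s\S]*` behaves once a text contains more than one start/end pair. The `text` string is a shortened, hypothetical stand-in for the filing, not the original example.

```python
import re

# Hypothetical, shortened text with two "Item 1 ... Item 2" pairs.
text = ("Item 1. Business 3 Item 2. Properties 15 "
        "ITEM 1. BUSINESS General ITEM 2. PROPERTIES")

# Greedy: [\s\S]* extends to the LAST "ITEM 2", merging both sections.
greedy = re.findall(r'(ITEM 1[\s\S]*)ITEM 2', text, re.IGNORECASE)

# Non-greedy: [\s\S]*? stops at the FIRST "ITEM 2" after each start.
lazy = re.findall(r'(ITEM 1[\s\S]*?)ITEM 2', text, re.IGNORECASE)

print(len(greedy))  # one merged match
print(lazy)         # one entry per start/end pair
```

With a single pair in the text both patterns return the same section, which is why the difference only shows up on the longer example.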

Edit: (what the OP wanted)

import re
example = """ 
    The forward-looking statements are made as of the date of this report, 
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio. 
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area. 
""" 
# Compile once; raw string avoids invalid-escape warnings. The quantifier is still greedy here.
pat = re.compile(r'(ITEM 1[\s\S]*)ITEM 2', re.IGNORECASE)
matches = pat.findall(example)
print(matches)
# matches is a list of every captured section; sort it by string length.
matches.sort(key=len, reverse=True)
# matches is now ordered by length, longest first.
print(matches)

Code tested.

Final edit:

import re
example = """ 
    The Bank does offer a hybrid loan. Hybrid loans are loans that start as a 
    fixed rate mortgage but after a set number of years automatically adjust 
    to an adjustable rate mortgage. The Bank offers a three year fixed rate mortgage 
    after which the interest rate will adjust annually. Item 1. Business 3-13 Item 1a. 
    Risk Factors 13-15 Item 1b. Unresolved Staff Comments 15 Item 2. Properties 15-16 
    The forward-looking statements are made as of the date of this report, 
    and the Company assumes no obligation to update the forward-looking statements 
    or to update the reasons why actual results could differ from those projected 
    in the forward-looking statements. PART 1. ITEM 1. BUSINESS 
    General Farmers & Merchants Bancorp, Inc. (Company) is a bank holding company 
    incorporated under the laws of Ohio in 1985 and elected to become a financial 
    holding company under the Federal Reserve in 2014. Our primary subsidiary, 
    The Farmers & Merchants State Bank (Bank) is a\n community bank operating 
    in Northwest Ohio since 1897.ITEM 2. PROPERTIES Our principal office is located in Archbold, Ohio. 
    The Bank operates from the facilities at 307 North Defiance Street. 
    In addition, the Bank owns the property from 200 to 208 Ditto Street, 
    Archbold, Ohio, which it uses for Bank parking and a community mini-park area. 
""" 
# Non-greedy [\s\S]*? stops at the first "ITEM 2" after each "ITEM 1".
pat = re.compile(r'(ITEM 1[\s\S]*?)ITEM 2', re.IGNORECASE)
matches = pat.findall(example)
print(matches)
# matches is a list of every captured section; sort it by string length.
matches.sort(key=len, reverse=True)
# matches is now ordered by length, longest first.

# To check that it works:
for match in matches:
    print(match)
    print('\n')

Why don't you try it now? :)
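
For reuse, the same idea could be wrapped in a small function. This is only a sketch under a few assumptions: `extract_sections` is a hypothetical helper name, the markers are passed as literal strings (escaped with `re.escape`), and a lookahead is used instead of a capture group so that the end marker is never part of the result. The `sample` text is a shortened stand-in for the filing.

```python
import re

def extract_sections(text, start, end):
    """Return each span from `start` up to (not including) the next `end`,
    longest first. Hypothetical helper, markers are treated literally."""
    pattern = re.compile(
        re.escape(start) + r'[\s\S]*?(?=' + re.escape(end) + r')',
        re.IGNORECASE,
    )
    # strip() drops the whitespace left in front of the end marker.
    sections = [m.group(0).strip() for m in pattern.finditer(text)]
    return sorted(sections, key=len, reverse=True)

sample = ("intro Item 1. Business 3-13 Item 2. Properties "
          "PART 1. ITEM 1. BUSINESS General text.ITEM 2. PROPERTIES")

for section in extract_sections(sample, "Item 1.", "Item 2."):
    print(repr(section))
```

Because the markers are matched literally, a line break between "ITEM" and "1." in the real text would not match; in that case the space in the marker would need to become `\s+` in the pattern.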

+0

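Thank you for the answer, but this is not what I expected. As I mentioned above, the specific issue with the example text is that non-greedy matching has to be used, because there are multiple start ('item 1') and end ('item 2') matches throughout the text. – krcoder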

+0

I see. Let me see what I can do. –

+0

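Specifically, in the example text above, the first start match 'Item 1. Business 3-13 ...' is on the fourth line, and the first end match 'Item 2. Properties 15-16 ...' is on the fifth line. – krcoder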
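
One way to check where each start and end marker sits is to list the line number of every hit. The sketch below assumes a hypothetical helper named `locate` and a shortened `sample` text; neither comes from the thread.

```python
import re

def locate(text, marker):
    """Return (line_number, matched_text) for every case-insensitive hit
    of the literal `marker`. Hypothetical helper for inspection only."""
    hits = []
    for m in re.finditer(re.escape(marker), text, re.IGNORECASE):
        # Count the newlines before the hit to get a 1-based line number.
        line_no = text.count("\n", 0, m.start()) + 1
        hits.append((line_no, m.group(0)))
    return hits

sample = ("loans adjust annually. Item 1. Business 3-13 Item 1a.\n"
          "Risk Factors 13-15 Item 2. Properties 15-16\n"
          "PART 1. ITEM 1. BUSINESS\n"
          "since 1897.ITEM 2. PROPERTIES")

print(locate(sample, "Item 1."))
print(locate(sample, "Item 2."))
```

Including the trailing period in the marker keeps "Item 1a." from being counted as a start.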