2014-11-05 60 views

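Parsing paragraphs with Perl

I have a large number of PDFs. These are publications generated every month, and I would like to automate extracting and parsing these documents to obtain contact information for import into a database.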

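Assume each block of text has a START and an END tag. After the START tag I need to grab the "Company", skip "(Parantheses)" and the PARAgraph, and then grab the PARTNER_COMPANY, the "Title" and the various forms of contact information, up to the next END tag. The contact information may vary. Some blocks have more information than others, but I still need to map everything onto a uniform set of headers. As one variation, the state, country and ZIP code may sit on the same line, separated by a delimiter; in other variations they are separated by newlines (\n). When the program reaches the "Dated" part of a block, the date needs to be converted to a specific format (see below). Some text blocks provide all of this contact information, while others do not. I want to parse up to the END tag.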

Sample data

START 

Company_1_ANY type of character 

(Parantheses) 

PARAgraph 

DATE: Dated this 5 day of NOvermber 2014 - parse date to yyyy-mm-dd format(2014-11-05) 


PARTNER_COMPANY_1 

Title - title_1 

Contact for enquiries: - CONTACT PERSON 

HOMER Simpson 

Telephone: (123) 123-1234 

FAX: (111) 346-0000 

Address: 

P.O. Box 123454, ANYTown, 12345-1234 

STATE, USA 

END 



START 

COMPANY_2_ANY type of character 

(Parantheses) 


PARAGRAPH of random text 

Dated this 5 day of November 2014 - 2014-11-05 

PARTNER_COMPANY_2 

Title - Title_2 






address: 

190 RAndom Avenue, Any town 

STATE_2 12345-0987 

Country - USA 

Contact: 

JOsh E 

Telephone: (234) 111-1111 

END 

Code

my @name; 

while (<>) { 
    if (/START/gism) { 
    while (<>) { 
     last if /END/; 
     chomp; 
     push @name, $_; 

    } 
    print "\n@name\n"; 
    @name =() 
    } 
    else { 
    print ''; 
    } 
} 

My result

Company_1_ANY type of character (Parantheses) PARAgraph DATE: Dated this 5 day of NOvermber 2014 - parse date to yyyy-mm-dd format(2014-11-05) PARTNER_COMPANY_1 Title - title_1 Contact for enquiries: - CONTACT PERSON HOMER Simpson Telephone: (123) 123-1234 FAX: (111) 346-0000 Address: P.O. Box 123454, ANYTown, 12345-1234 STATE, USA 
COMPANY_2_ANY type of character (Parantheses)  PARAGRAPH of random text Dated this 5 day of November 2014 - 2014-11-05 PARTNER_COMPANY_2 Title - Title_2   address: 190 RAndom Avenue, Any town STATE_2 12345-0987 Country - USA Contact: JOsh E Telephone: (234) 111-1111 

Desired output

Company,DATE,PARTNER_COMPANY,Title,CONTACT PERSON,Telephone,FAX,Address,City,STATE,ZIP,Country 

Company_1,2014-11-05,PARTNER_COMPANY_1,title_1,HOMER Simpson,(123) 123-1234,(111) 346-0000,P.O. Box 123454,ANYTown,12345-1234,USA 

COMPANY_2,2014-11-05,PARTNER_COMPANY_2,Title_2,JOsh E,(234) 111-1111,,190 RAndom Avenue,Any town,STATE_2,12345-0987,USA 

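I get what I want between START and END, but I don't know how to delimit the elements in my array. I also can't figure out how to filter out the parts I don't need, i.e. the PARAGRAPH, and I would like to modify the content between the delimiters. I know a module might be useful here, but since I want to understand better how to build hashes and/or keys, is there a better way?

Also, in the desired output lines, please disregard the line breaks shown. Each record should continue on one line, comma-separated. This thread only lets a line of text reach a certain length before it wraps.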

Please format your sample input properly as code, the same way you did for your code! – 2014-11-05 22:14:39

@sputnick Does that work? – JDE876 2014-11-05 23:04:58

Yes. Are the blank lines really blank lines, and not an artifact of badly formatted input? – 2014-11-05 23:07:44

Answer

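Here is a script to use as a starting point; it needs more work to fully meet your needs. It stores the information in a Perl data structure (DS): an ARRAY holding one HASH per START/END block. Once the processing is done, you only have to iterate over the DS to produce the output you want: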

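#!/usr/bin/env perl

use strict; use warnings; # always put this in your scripts
use Data::Dumper; # to print the data structure (DS)

my $h = []; # $h is a reference to an empty ARRAY; it will hold one HASH per block
my $witness1 = my $witness2 = 0; # two flags, both switched off to begin with
my $key = -1; # index of the current block in the ARRAY

# loop through the lines of the __DATA__ section at the end of this script
# (use the 'diamond operator' <> instead to read files named on the command line)
while (<DATA>) {
    s/\s+$//; # remove the newline and any trailing whitespace
    next if /^$/; # skip this line if it's a blank line

    # a new block begins: increment $key to move to the next ARRAY element
    if (/^START/) {
        $key++;
        next;
    }

    # the block is over: switch both flags off
    if (/^END/) {
        $witness1 = $witness2 = 0;
        next;
    }

    next if $key < 0; # ignore anything that comes before the first START

    # setting HASH values, $& is the matching part (the part after \K when \K is used)
    $h->[$key]->{Company} = $& if /^Company_.*/i;
    $h->[$key]->{Partner_Company} = $& if /^PARTNER_COMPANY.*/i;
    $h->[$key]->{Title} = $& if /^TITLE\s+-\s+\K.*/i;

    # a line starting with 'Contact' ('Contact for enquiries: - CONTACT PERSON'
    # or just 'Contact:') announces the name; it also ends the 'ADDRESS' part
    if (/^Contact\b/i) {
        $witness1 = 1;
        $witness2 = 0;
        next;
    }

    # witness1 tells us that this line is the name that follows the 'Contact' line
    if ($witness1) {
        $h->[$key]->{Name} = $_; # not chomp($_): chomp returns a count, not the text
        $witness1 = 0;
        next;
    }

    # a labelled field also ends the 'ADDRESS' part
    if (/^Telephone:\s*\K.*/i) { $h->[$key]->{Tel} = $&; $witness2 = 0; next; }
    if (/^FAX:\s*\K.*/i) { $h->[$key]->{Fax} = $&; $witness2 = 0; next; }

    if (/^Address:/i) {
        $witness2 = 1;
        next;
    }

    # witness2 tells us that we are still in the 'ADDRESS' part
    if ($witness2) {
        $h->[$key]->{Address} .= "$_\n";
    }
}

print Dumper $h;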

__DATA__ 
START 

Company_1_ANY type of character 

(Parantheses) 

PARAgraph 

DATE: Dated this 5 day of NOvermber 2014 - parse date to yyyy-mm-dd format(2014-11-05) 


PARTNER_COMPANY_1 

Title - title_1 

Contact for enquiries: - CONTACT PERSON 

HOMER Simpson 

Telephone: (123) 123-1234 

FAX: (111) 346-0000 

Address: 

P.O. Box 123454, ANYTown, 12345-1234 

STATE, USA 

END 



START 

COMPANY_2_ANY type of character 

(Parantheses) 


PARAGRAPH of random text 

Dated this 5 day of November 2014 - 2014-11-05 

PARTNER_COMPANY_2 

Title - Title_2 






address: 

190 RAndom Avenue, Any town 

STATE_2 12345-0987 

Country - USA 

Contact: 

JOsh E 

Telephone: (234) 111-1111 

END 
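
The script above stops at the Dumper output, so two things the question asks for are still open: converting the "Dated this ..." line to yyyy-mm-dd and writing one comma-separated line per record. The following is only a sketch of one way to do both, not part of the original answer. The helper names parse_dated, csv_field and csv_line are hypothetical, and so are the Date key and the column list; the record uses shortened values from the sample data. Matching the month on its first three letters is an assumption that happens to tolerate the misspelt "NOvermber" in the sample.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Month numbers keyed by the first three letters of the month name.
my %month_num;
@month_num{qw(jan feb mar apr may jun jul aug sep oct nov dec)}
    = map { sprintf '%02d', $_ } 1 .. 12;

# Convert a line such as "Dated this 5 day of November 2014" to "2014-11-05".
# Returns nothing (undef in scalar context) if the line is not a date line.
sub parse_dated {
    my ($line) = @_;
    return unless $line =~ /Dated\s+this\s+(\d{1,2})\w*\s+day\s+of\s+([A-Za-z]+),?\s+(\d{4})/i;
    my ($day, $month, $year) = ($1, lc substr($2, 0, 3), $3);
    return unless exists $month_num{$month};
    return sprintf '%04d-%s-%02d', $year, $month_num{$month}, $day;
}

# Quote a field only if it contains a comma, a double quote or a newline.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;    # missing fields become empty columns
    if ($value =~ /[",\n]/) {
        $value =~ s/"/""/g;               # double any embedded quotes
        return qq("$value");
    }
    return $value;
}

# Join the chosen keys of one record (a HASH reference) into one CSV line.
sub csv_line {
    my ($record, @columns) = @_;
    return join ',', map { csv_field($record->{$_}) } @columns;
}

my @columns = qw(Company Date Partner_Company Title Name Tel Fax Address);

# Call in scalar context so a failed match yields undef, not an empty list.
my $date = parse_dated('DATE: Dated this 5 day of NOvermber 2014');

# One record shaped like the ones the script above collects.
my $records = [
    {
        Company         => 'Company_1',
        Date            => $date,
        Partner_Company => 'PARTNER_COMPANY_1',
        Title           => 'title_1',
        Name            => 'HOMER Simpson',
        Tel             => '(123) 123-1234',
        Fax             => '(111) 346-0000',
        Address         => 'P.O. Box 123454, ANYTown, 12345-1234',
    },
];

print join(',', @columns), "\n";
print csv_line($_, @columns), "\n" for @$records;
```

Because the address contains commas, it comes out as a single quoted column. Splitting it into City, STATE, ZIP and Country is not attempted here, since the sample shows at least two different layouts. For anything beyond simple data, the Text::CSV module from CPAN is probably a safer choice than hand-written quoting.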

Documentation:

To understand references, I suggest a few pointers:
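
As a quick illustration of the reference syntax the script relies on, here is a minimal sketch of an array of hashes. It is not taken from the original answer; the values come from the sample data, and the loop is just one way to walk the structure.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $h = [];    # reference to an empty array

# Autovivification: assigning through $h->[0]->{...} creates element 0
# as a hash reference the first time it is used.
$h->[0]->{Company} = 'Company_1';
$h->[0]{Title}     = 'title_1';    # the arrow between subscripts is optional

# A whole hash can also be added at once as an anonymous hash reference.
push @$h, { Company => 'COMPANY_2', Title => 'Title_2' };

# @$h dereferences the array; each element is itself a hash reference.
for my $record (@$h) {
    for my $field (sort keys %$record) {
        print "$field = $record->{$field}\n";
    }
    print "--\n";
}

print scalar(@$h), " records\n";
```

Each START/END block in the script corresponds to one element of @$h, and each label (Company, Title and so on) to one key in that element's hash.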

Could you add comments to your code so that I can understand it better? I'm having some trouble working through this. – JDE876 2014-11-13 22:02:02

Post edited accordingly. – 2014-11-13 22:40:54