2014-11-21 90 views
0

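Perl - Parsing a text file with tags into new text files

I have been given data in a .txt file that I need to format so it can be uploaded into a database. The text is anchored by tags. Depending on the tag, the data needs to be dumped into a specific txt file, tab delimited. I have done very little Perl in my life, but I know Perl can handle this type of application easily; I'm just lost as to where to start. Outside of Java, SQL and R, I'm useless. Here is an example of one entry (I have close to 1000 of these to process):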

<PaperTitle>True incidence of all complications following immediate and delayed breast reconstruction.</PaperTitle> 
<Abstract>BACKGROUND: Improved self-image and psychological well-being after breast reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p &lt; 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome.</Abstract> 
<BookTitle>Book1</BookTitle> 
<Publisher>Publisher01, Boston</Publisher> 
<Edition>1st</Edition> 
<EditorList> 
    <Editor> 
     <LastName>Lewis</LastName> 
     <ForeName>Philip M</ForeName> 
     <Initials>PM</Initials> 
    </Editor> 
    <Editor> 
     <LastName>Kiffer</LastName> 
     <ForeName>Michael</ForeName> 
     <Initials>M</Initials> 
    </Editor> 
</EditorList> 
<Page>19-28</Page> 
<Year>2008</Year> 
<AuthorList> 
       <Author ValidYN="Y"> 
        <LastName>Sullivan</LastName> 
        <ForeName>Stephen R</ForeName> 
        <Initials>SR</Initials> 
       </Author> 
       <Author ValidYN="Y"> 
        <LastName>Fletcher</LastName> 
        <ForeName>Derek R D</ForeName> 
        <Initials>DR</Initials> 
       </Author> 
       <Author ValidYN="Y"> 
        <LastName>Isom</LastName> 
        <ForeName>Casey D</ForeName> 
        <Initials>CD</Initials> 
       </Author> 
       <Author ValidYN="Y"> 
        <LastName>Isik</LastName> 
        <ForeName>F Frank</ForeName> 
        <Initials>FF</Initials> 
       </Author> 
</AuthorList> 
// 

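PaperTitle, Abstract, and Page need to go into the Papers.txt file.

PaperTitle, BookTitle, Edition, Publisher, and Year need to go into the Book.txt file.

PaperTitle and all of the editor data (LastName, ForeName, Initials) need to go into Editors.txt.

PaperTitle and all of the author data (LastName, ForeName, Initials) need to go into Authors.txt.

// marks the end of an entry. All files need to be tab delimited. While I wouldn't turn down finished code, I'm hoping for at least some pointers in the right direction for code that parses out one of the files (such as Book.txt); I can most likely figure out the rest from there. Many thanks.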

+0

I would start by looking at the Config::General module to handle the parsing and the Text::CSV_XS module to generate the output files. – 2014-11-21 22:57:11

+1

It sounds like you need `XML::Twig`. Please show the file contents that this data should produce. – Borodin 2014-11-21 22:58:34

Answers

-1

Please check this one:

use strict;
use warnings;
use Cwd;

#Get Directory 
my $dir = getcwd(); 

#Grep files from the directory 
opendir(DIR, $dir) || die "Couldn't open/read the $dir: $!"; 
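# Skip this script's own output files so a rerun does not treat them as input
my @AllFiles = grep { /\.txt$/i && !/^(?:Papers|Book|Editors|Authors)\.txt$/i } readdir(DIR);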
closedir(DIR); 

#Check files are available 
if (@AllFiles)   # true only when at least one input file was found
{ 
    #Create Text Files as per Requirement 
    open(PAP, '>', "$dir/Papers.txt") || die "Couldn't create Papers.txt: $!";
    open(BOOK, '>', "$dir/Book.txt") || die "Couldn't create Book.txt: $!";
    open(EDT, '>', "$dir/Editors.txt") || die "Couldn't create Editors.txt: $!";
    open(AUT, '>', "$dir/Authors.txt") || die "Couldn't create Authors.txt: $!";
} 
else { die "No .txt files found in $dir\n"; } # Stop if there is nothing to process
foreach my $input (@AllFiles) 
{ 
    print "Processing file $input\n"; 
    open(IN, '<', "$dir/$input") || die "Couldn't open $input: $!";
    local $/;            # slurp mode: read the whole file in one go
    my $tmp = <IN>;
    close(IN); 
    #Loop from <PaperTitle> to // end slash 
    while($tmp=~m/(<PaperTitle>((?:(?!\/\/).)*)\/\/)/gs) 
    { 
     my $LoopCnt = $1; 
     my ($pptle) = $LoopCnt=~m/<PaperTitle>([^<>]*)<\/PaperTitle>/g; 
     my ($abstr) = $LoopCnt=~m/<Abstract>([^<>]*)<\/Abstract>/gs; 
     my ($pgrng) = $LoopCnt=~m/<Page>([^<>]*)<\/Page>/g; 
     my ($bktle) = $LoopCnt=~m/<BookTitle>([^<>]*)<\/BookTitle>/g; 
     my ($edtns) = $LoopCnt=~m/<Edition>([^<>]*)<\/Edition>/g; 
     my ($publr) = $LoopCnt=~m/<Publisher>([^<>]*)<\/Publisher>/g; 
     my ($years) = $LoopCnt=~m/<Year>([^<>]*)<\/Year>/g; 

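     my ($EditorNames, $AuthorNames) = ("", "");
     # The two s#...#...#esg statements below use /e to run the replacement as code,
     # walking each list line by line and collecting the names.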
     $LoopCnt=~s#<EditorList>((?:(?!<\/EditorList>).)*)</EditorList># 
     my $edtList = $1; my @Edlines = split/\n/, $edtList; 
     my $i ='1'; \#Editor Count to check 
     foreach my $EdsngLine(@Edlines) 
     { 
      if($EdsngLine=~m/<LastName>([^<>]*)<\/LastName>/) 
      { $EditorNames .= $i."".$1."\t"; $i++; } 
      elsif($EdsngLine=~m/<ForeName>([^<>]*)<\/ForeName>/) 
      { $EditorNames .= $1."\t"; } 
      elsif($EdsngLine=~m/<Initials>([^<>]*)<\/Initials>/) 
      { $EditorNames .= $1."\t"; } 
     } 
     #esg; 
     $LoopCnt=~s#<AuthorList>((?:(?!<\/AuthorList>).)*)</AuthorList># 
     my $autList = $1; my @Autlines = split/\n/, $autList; 
     my $j ='1'; \#Author Count to check 
     foreach my $AutsngLine(@Autlines) 
     { 
      if($AutsngLine=~m/<LastName>([^<>]*)<\/LastName>/) 
      { $AuthorNames .= $j."".$1."\t"; $j++; } 
      elsif($AutsngLine=~m/<ForeName>([^<>]*)<\/ForeName>/) 
      { $AuthorNames .= $1."\t"; } 
      elsif($AutsngLine=~m/<Initials>([^<>]*)<\/Initials>/) 
      { $AuthorNames .= $1."\t"; } 
     } 
     #esg; 

     #Print the output to the corresponding text files
     print PAP "$pptle\t$abstr\t$pgrng\t//\n"; 
     print BOOK "$pptle\t$bktle\t$edtns\t$publr\t$years\t//\n"; 
     print EDT "$pptle\t$EditorNames//\n"; 
     print AUT "$pptle\t$AuthorNames//\n"; 
    } 
} 

print "Process Completed...\n"; 

#Don't forget to close the files 
close(PAP); 
close(BOOK); 
close(EDT); 
close(AUT); 
#End 
+1

There is no excuse for parsing XML with regular expressions. – Borodin 2014-11-22 17:20:42

+0

@Borodin: I would be interested in using an XML module. Could you complete the code so that I can take it further in my program? Thanks in advance. – ssr1012 2014-11-23 07:10:48

+0

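Thanks @Borodin and ssr1012 for the help here. I should have specified one more thing. I will have to run this script on a number of files (e.g. BC_Book, EC_Book, CC_Book, etc.), 15 files in total. I would like to concatenate the data, or append to the output files each time the script runs, but here new files are created every time. I should be able to work through the code myself, but I'm being lazy / bogged down in other aspects of this project. Extra help here would be much appreciated! – BigData 2014-11-25 19:44:22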
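The append behaviour asked about in the comment above could be handled roughly as follows. This is only a sketch using core Perl: `append_rows` and `read_records` are hypothetical helper names, the input files are assumed to be passed on the command line, and the single-field regex extraction is just a stand-in for whichever extraction approach you settle on.

```perl
use strict;
use warnings;
use autodie;    # open/close die with a useful message on failure

# Append rows to a tab-delimited file, creating it on first use.
# '>>' opens for appending, so repeated runs accumulate instead of overwriting.
sub append_rows {
    my ($path, @rows) = @_;
    open my $out, '>>', $path;
    print {$out} join("\t", @$_), "\n" for @rows;
    close $out;
}

# Read every "//"-terminated record from the given files.
sub read_records {
    my @files = @_;
    my @records;
    for my $file (@files) {
        open my $in, '<', $file;
        local $/ = "//\n";              # one record per read
        while (my $record = <$in>) {
            chomp $record;              # strips the trailing "//\n"
            push @records, $record if $record =~ /\S/;
        }
        close $in;
    }
    return @records;
}

# Example run (script name is hypothetical): perl parse.pl BC_Book EC_Book CC_Book
for my $record (read_records(@ARGV)) {
    # Stand-in extraction of a single field, for illustration only
    my ($title) = $record =~ m{<PaperTitle>(.*?)</PaperTitle>}s;
    append_rows('Papers.txt', [ $title // '' ]);
}
```

Because the output files are opened with `>>`, running the script once per input file, or once with all 15 names, ends up with the same accumulated rows. Delete the output files first if you need a clean rebuild.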

0

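This example may help you. It uses XML::Twig, as I suggested, to extract the fields for the Papers.txt output file. The record separator is set to "//\n" so that a whole block of data is read at once, and before each block is parsed it is wrapped in <Paper>...</Paper> tags to make it valid XML.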

use strict; 
use warnings; 
use 5.010; 
use autodie; 

use XML::Twig; 

my $twig = XML::Twig->new; 

open my $fh, '<', 'papers.txt'; 
local $/ = "//\n"; 

while (<$fh>) { 
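    chomp;                  # strip the "//\n" record terminator
    next unless /\S/;       # skip a trailing blank chunk at end of file
    $twig->parse("<Paper>\n$_\n</Paper>\n"); 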
    my $root = $twig->root; 
    say $root->field($_) for qw/ PaperTitle Abstract Page/; 
    say '---'; 
} 

Output

True incidence of all complications following immediate and delayed breast reconstruction. 
BACKGROUND: Improved self-image and psychological well-being after breast reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p < 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome. 
19-28 
--- 
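The same XML::Twig approach can be stretched to the tab-delimited Book.txt and Editors.txt rows the question asks for. The following is a minimal sketch, not a finished program: `record_to_rows` is a hypothetical helper, the returned hash layout is an assumption made for illustration, and it expects a record that is well-formed once wrapped in a root element.

```perl
use strict;
use warnings;
use XML::Twig;

# Turn one "//"-terminated record into tab-separated rows.
# Returns a hash ref mapping an output file name to a list of row strings.
sub record_to_rows {
    my ($record) = @_;
    $record =~ s{^\s*//\s*$}{}mg;                    # drop the "//" terminator line

    my $twig = XML::Twig->new;
    $twig->parse("<Paper>\n$record\n</Paper>\n");    # wrap to get a single root
    my $root  = $twig->root;
    my $title = $root->field('PaperTitle');

    my %rows = (
        'Book.txt' => [
            join("\t", $title, map { $root->field($_) } qw/ BookTitle Edition Publisher Year /),
        ],
        'Editors.txt' => [],
    );

    # One row per <Editor>: title, last name, fore name, initials
    my $list = $root->first_child('EditorList');
    for my $editor ($list ? $list->children('Editor') : ()) {
        push @{ $rows{'Editors.txt'} },
            join("\t", $title, map { $editor->field($_) } qw/ LastName ForeName Initials /);
    }
    return \%rows;
}

# Usage with a shortened copy of the sample entry
my $rows = record_to_rows(<<'END');
<PaperTitle>True incidence of all complications following immediate and delayed breast reconstruction.</PaperTitle>
<BookTitle>Book1</BookTitle>
<Publisher>Publisher01, Boston</Publisher>
<Edition>1st</Edition>
<EditorList>
    <Editor>
     <LastName>Lewis</LastName>
     <ForeName>Philip M</ForeName>
     <Initials>PM</Initials>
    </Editor>
</EditorList>
<Year>2008</Year>
//
END
print "$_\n" for @{ $rows->{'Book.txt'} }, @{ $rows->{'Editors.txt'} };
```

Authors.txt would follow the same pattern as the editor loop, using `AuthorList` and `Author` instead. Each row can then be printed to the matching output file handle.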
+0

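Thanks @Borodin for the help here. This is still a long way from the point where I could use the code to build my own complete program. Still, I understand what you did here, and I appreciate the help. – BigData 2014-11-25 19:29:42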