2010-07-17
1

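I have an odd problem using an awk regex match to replace some text in an XML file. What is wrong with this awk regex replacement?

The XML files are simple. Each one has a piece of text in its narrative node, and the awk program replaces that text with another piece of text picked from a text file, rtxt. But for some reason, the text in rtxt (labelled '42') that should replace the text in 42.xml does not produce a proper replacement.

toxml.awk writes to standard output. It first prints the XML as it was read, and then the final result after replacement.

I actually have a collection of these XML files, and I replace the text in each with a longer text from rtxt. It just happens that this particular replacement (for 42.xml) doesn't work. Instead of the text in the element being replaced, another tag ends up nested inside the existing tag.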


toxml.awk

BEGIN{
    srcfile = "rtxt"
    FS = "|"

    while (getline <srcfile) {
        xmlfile = $1 ".xml"
        rep = "<narrative>" $2 "</narrative>"

        ## read in the xml file in one go.
        ## (the last tag will be missing.)
        RS = "</topic>"
        FS = "</topic>"

        getline <xmlfile
        #print $0
        close(xmlfile)

        ## replace
        subs = gsub(/<narrative>.*<\/narrative>/, rep, $0)

        ## append the closing tag
        subs = gsub(/[ \n\r\t]+$/, "\n</topic>", $0)
        print $0

        ## restore them before reading rtxt.
        RS = "\n"
        FS = "|"
    }

    close(srcfile)
}

rtxt

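42 |Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it. To be relevant, a result should give information on history of Java & on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. I like to find articles that discuss this programming language and various concepts and versions of it.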


42.xml

<?xml version="1.0" encoding="ISO-8859-1"?> 
<!DOCTYPE topic SYSTEM "topic.dtd"> 
<topic id="2009042" ct_no="227"> 

    <title>sun java</title> 

    <castitle>//article[about(.//language, java) or about(.,sun)]//sec[about(.//language, java)]</castitle> 

    <phrasetitle>"sun java"</phrasetitle> 

    <description>Find information about Sun Microsystem's Java language</description> 

    <narrative>Java is a popular programming language developed at Sun Microsystems. I am interested to know about this programming language, and also to learn programming with it. To be relevant, a result should give information on history of Java &amp; on different versions of Java, and on different concepts in Java. Its good if I find tutorials for learning Java. Results related only to Sun Microsystems but not Java are considered non-relevant. Results showing details of training institutes for Java, and IT companies which provide Java solutions are also considered non-relevant. I like to find articles that discuss this programming language and various concepts &amp; versions of it. </narrative> 

</topic> 

+4

I don't know about others, but I'm not up for downloading and unpacking files just to answer a question. Use a pastebin service instead. – 2010-07-17 06:11:42

+0

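Even I think this isn't the right way to ask on Stackoverflow, but I had no other way to demonstrate the problem. Providing links to 4 files would be another mess. I need someone to look at the text files and run the program. I'll have to find some other way. – rup 2010-07-17 06:27:10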

+0

One thing that seems obvious is that your getline needs to be in a loop. – 2010-07-17 09:50:20
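The point about looping over getline can be sketched as follows. This is only a minimal illustration, assuming a POSIX awk; the helper name count_lines is made up for the example. Comparing the return value with > 0 matters because getline returns -1 on error (for instance a missing file), and a bare `while (getline < file)` would then never terminate.

```shell
#!/bin/bash
# count_lines is a hypothetical helper: prints how many lines were read from $1.
count_lines() {
    awk -v file="$1" 'BEGIN {
        n = 0
        # getline returns 1 on success, 0 at end of file, -1 on error,
        # so "> 0" ends the loop in both of the latter cases.
        while ((getline line < file) > 0)
            n++
        close(file)
        print n
    }'
}
```

Called on a file that does not exist, this version prints 0 instead of looping forever.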

Answers

0

Just a start:

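#!/bin/bash

awk 'BEGIN{FS="|"}
# first file (rtxt): remember each narrative, keyed by topic number
FNR==NR{ nar[$1]=$2; next }
END{
    # the remaining arguments are the xml files
    for(i=2;i<ARGC;i++){
        xmlfile=ARGV[i]
        split(xmlfile,fname,".")
        print "Doing file: "xmlfile
        print "---------------------------------"
        while((getline line < xmlfile) > 0) {
            # swap the whole narrative line for the stored text
            if (line ~ /<narrative>/){
                line="<narrative>"nar[fname[1]]"</narrative>"
            }
            print line
        }
        close(xmlfile)
    }
}' rtxt 42.xml 71.xml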
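A likely reason the gsub-based script in the question misbehaves only for 42: in the replacement string of sub and gsub, an unescaped & stands for the matched text, and the rtxt entry for 42 contains a literal &. The code above builds the line by plain concatenation, where & is not special. The sketch below shows the effect on a made-up string and one way to escape & before calling gsub; the name demo_amp and the sample text are illustrative only.

```shell
#!/bin/bash
# demo_amp is a hypothetical helper; the sample strings are invented for the demo.
demo_amp() {
    awk 'BEGIN {
        text = "<narrative>old</narrative>"
        rep  = "<narrative>A & B</narrative>"

        # Unescaped: "&" in the replacement expands to the matched text,
        # so the old element ends up nested inside the new one.
        bad = text
        gsub(/<narrative>.*<\/narrative>/, rep, bad)
        print "unescaped: " bad

        # Escaped: turn every "&" in the replacement into "\&" first.
        safe = rep
        gsub(/&/, "\\\\&", safe)
        good = text
        gsub(/<narrative>.*<\/narrative>/, safe, good)
        print "escaped:   " good
    }'
}
demo_amp
```

The first output line shows the old element nested inside the new one, which matches the symptom described in the question; the second shows the literal & preserved. Backslashes in the replacement text would need similar care, which this sketch does not handle.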
+0

I modified it. Take a look. – rup 2010-07-17 08:31:47
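For the whole-file approach of the original script, another option is to avoid a gsub replacement string entirely: find the element with match() and splice the new text in with substr(), so that no character in the new text is interpreted. This is a sketch under assumptions, not a drop-in fix: replace_narrative and the NEW_TEXT environment variable are hypothetical names, and the new text is passed through ENVIRON so that awk does not process escape sequences in it.

```shell
#!/bin/bash
# replace_narrative is a hypothetical helper: $1 = XML file, $2 = new narrative text.
replace_narrative() {
    NEW_TEXT="$2" awk '
        # Collect the whole file into one string, newline-separated.
        { doc = doc $0 "\n" }
        END {
            rep = ENVIRON["NEW_TEXT"]
            # match() sets RSTART and RLENGTH to the span of the element.
            if (match(doc, /<narrative>.*<\/narrative>/)) {
                head = substr(doc, 1, RSTART - 1)
                tail = substr(doc, RSTART + RLENGTH)
                doc = head "<narrative>" rep "</narrative>" tail
            }
            printf "%s", doc
        }' "$1"
}
```

Usage would look like `replace_narrative 42.xml "history of Java & its versions" > 42.new.xml`. Whether a bare & is acceptable in the resulting XML is a separate question; the 42.xml shown above uses &amp; for it.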