2016-03-08 70 views

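Grails - Tika content manipulation

I am trying to parse a .docx file to XML. I can parse it and render the XML as a standalone page, but what I really want is to show only the <body> inside a template, leaving out the metadata. How can I do that? I tried using BodyContentHandler, but it strips the XML tags.

Thanks.

EDIT

I wrote some simple code in the controller, but I messed it up. This is what I had before. I take the file from my temp folder and send it to my tikaService (a service I copied from GitHub).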

Controller

def parse(Document documentInstance) { 
    def file = new File(documentInstance.fullPath) 
    def result = tikaService.parseFile(file) 
    render(view:"parse", text: result, contentType: "text/xml", encoding: "UTF-8") 
} 

Service

class TikaService { 

    static transactional = false 

    String parseFile(File file, TikaConfig tikaConfig, Metadata metadata){ 
     SAXTransformerFactory factory = SAXTransformerFactory.newInstance() 
     TransformerHandler handler = factory.newTransformerHandler() 
     handler.transformer.setOutputProperty(OutputKeys.METHOD, "xml") 
     handler.transformer.setOutputProperty(OutputKeys.INDENT, "yes") 

     StringWriter sw = new StringWriter() 
     handler.result = new StreamResult(sw) 

     Parser parser = new AutoDetectParser(tikaConfig) 
     ParseContext pc = new ParseContext() 
     try { 
      parser.parse(new FileInputStream(file), handler, metadata, pc) 
      return sw.toString() 
     } catch (Exception e) { 
      log.error("Failed to parse file ${file.absolutePath}", e) 
      throw e 
     } 
    } 

    String parseFile(File file){ 
     TikaConfig tikaConfig = new TikaConfig() 
     Metadata tikaMeta = new Metadata() 
     return parseFile(file, tikaConfig, tikaMeta) 
    } 
} 

If I use render, I get

this

When I output the result from parse.gsp with ${result}, I get

this

I hope my explanation makes sense. Thanks.

EDIT 2

XML

<?xml version="1.0" encoding="UTF-8"?> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<meta name="Revision-Number" content="0"/> 
<meta name="Last-Printed" content="1601-01-01T00:00:00Z"/> 
<meta name="cp:revision" content="0"/> 
<meta name="meta:print-date" content="1601-01-01T00:00:00Z"/> 
<meta name="meta:creation-date" content="2013-03-20T15:29:13Z"/> 
<meta name="dcterms:modified" content="1601-01-01T00:00:00Z"/> 
<meta name="meta:save-date" content="1601-01-01T00:00:00Z"/> 
<meta name="dc:creator" content="ingo "/> 
<meta name="Last-Modified" content="1601-01-01T00:00:00Z"/> 
<meta name="Author" content="ingo "/> 
<meta name="dcterms:created" content="2013-03-20T15:29:13Z"/> 
<meta name="date" content="1601-01-01T00:00:00Z"/> 
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/> 
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/> 
<meta name="modified" content="1601-01-01T00:00:00Z"/> 
<meta name="creator" content="ingo "/> 
<meta name="Creation-Date" content="2013-03-20T15:29:13Z"/> 
<meta name="meta:author" content="ingo "/> 
<meta name="Content-Type" content="application/msword"/> 
<meta name="Last-Save-Date" content="1601-01-01T00:00:00Z"/> 
<title/> 
</head> 
<body> 
<p class="überschrift_1"><b>Tika Parser Test </b></p> 
<p class="standard">This is a simple test document</p> 
</body> 
</html> 

EDIT 3

Controller

import javax.xml.transform.OutputKeys 
import javax.xml.transform.sax.SAXTransformerFactory 
import javax.xml.transform.sax.TransformerHandler 
import javax.xml.transform.stream.StreamResult 

import org.apache.tika.config.TikaConfig 
import org.apache.tika.metadata.Metadata 
import org.apache.tika.parser.AutoDetectParser 
import org.apache.tika.parser.ParseContext 
import org.apache.tika.parser.Parser 
import org.apache.tika.sax.BodyContentHandler 
import org.apache.tika.sax.ToXMLContentHandler 
import org.apache.tika.sax.ToHTMLContentHandler 

    def parse(Document documentInstance) { 
     def file = new File(documentInstance.fullPath) 
     BodyContentHandler handler = new BodyContentHandler(new ToHTMLContentHandler()) 
     AutoDetectParser parser = new AutoDetectParser() 
     FileInputStream inputstream = new FileInputStream(file) 

     Metadata metadata = new Metadata() 
     parser.parse(inputstream, handler, metadata) 
    } 

Error

Namespace http://www.w3.org/1999/xhtml not declared

What code do you have so far?

@EmmanuelRosa, I edited my post. Thanks.

Can you post the XML as text so it can be copied and pasted?

Answer

First, it looks like the example given in the Tika documentation is wrong:

link to the bug ticket

Here is a fix for the problem:

link to solution

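// outputStream is any java.io.OutputStream that should receive the HTML (declared elsewhere)
ToHTMLContentHandler toHtmlContentHandler = new ToHTMLContentHandler(outputStream, "UTF-8");
// Cap the written output at 4,000,000 characters
WriteOutContentHandler handler = new WriteOutContentHandler(toHtmlContentHandler, 4000000);
// Forward only the content of <body> to the handlers above
ContentHandler bodyHandler = new BodyContentHandler(handler);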
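
If the handler chain above is not an option, a possible alternative is to post-process the XHTML string that the question's parseFile(file) already returns and keep only what sits inside <body>. The following is a minimal sketch using only JDK classes, not Tika's own mechanism; BodyExtractor and extractBody are hypothetical names introduced for illustration, and the sketch assumes the input is well-formed XHTML like the sample in EDIT 2.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika or Grails.
class BodyExtractor {

    /** Returns the markup inside <body>, with tags kept; empty string if there is no body. */
    static String extractBody(String xhtml) {
        try {
            // Parse the XHTML string (the default factory is not namespace-aware,
            // so elements are matched by their plain tag name).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            NodeList bodies = doc.getElementsByTagName("body");
            if (bodies.getLength() == 0) {
                return "";
            }

            // Identity transformer used purely as a serializer; no XML declaration
            // so the fragment can be embedded in another page.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // Serialize each child of <body> in order, skipping the <body> tag itself.
            StringWriter out = new StringWriter();
            NodeList children = bodies.item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                transformer.transform(new DOMSource(children.item(i)), new StreamResult(out));
            }
            return out.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

In the controller this could be used along the lines of render(view: "parse", model: [body: BodyExtractor.extractBody(result)]). Depending on the Grails version, the GSP may HTML-escape ${body} by default, in which case the value has to be output unescaped for the tags to take effect.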

Hope this helps!