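Grails - Tika content manipulation
I am trying to parse a .docx file to XML. I can parse it and render the XML as a standalone page, but what I really want is to display only the <body> in a template, without the metadata. How can I do that? I tried using BodyContentHandler, but it strips the XML tags.
Thanks.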
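Edit
I wrote some simple code in the controller, but I messed it up. This is what I had before: I take the file from my temp folder and pass it to my tikaService (a service I copied from GitHub).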
Controller
def parse(Document documentInstance) {
    def file = new File(documentInstance.fullPath)
    def result = tikaService.parseFile(file)
    render(view:"parse", text: result, contentType: "text/xml", encoding: "UTF-8")
}
Service
class TikaService {
    static transactional = false

    // Parses the file with Tika and returns the complete XHTML output as a string.
    String parseFile(File file, TikaConfig tikaConfig, Metadata metadata) {
        // Identity transformer that serializes the SAX events emitted by Tika
        SAXTransformerFactory factory = SAXTransformerFactory.newInstance()
        TransformerHandler handler = factory.newTransformerHandler()
        handler.transformer.setOutputProperty(OutputKeys.METHOD, "xml")
        handler.transformer.setOutputProperty(OutputKeys.INDENT, "yes")
        StringWriter sw = new StringWriter()
        handler.result = new StreamResult(sw)
        Parser parser = new AutoDetectParser(tikaConfig)
        ParseContext pc = new ParseContext()
        try {
            // withInputStream closes the stream once parsing has finished
            file.withInputStream { stream ->
                parser.parse(stream, handler, metadata, pc)
            }
            return sw.toString()
        } catch (Exception e) {
            log.error("Failed to parse file ${file.absolutePath}", e)
            throw e
        }
    }

    // Convenience overload that uses the default Tika configuration.
    String parseFile(File file) {
        TikaConfig tikaConfig = new TikaConfig()
        Metadata tikaMeta = new Metadata()
        return parseFile(file, tikaConfig, tikaMeta)
    }
}
If I use render, I get:
When I output the result in parse.gsp with ${result}, I get:
I hope my explanation makes sense. Thanks.
Edit 2
XML
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Revision-Number" content="0"/>
<meta name="Last-Printed" content="1601-01-01T00:00:00Z"/>
<meta name="cp:revision" content="0"/>
<meta name="meta:print-date" content="1601-01-01T00:00:00Z"/>
<meta name="meta:creation-date" content="2013-03-20T15:29:13Z"/>
<meta name="dcterms:modified" content="1601-01-01T00:00:00Z"/>
<meta name="meta:save-date" content="1601-01-01T00:00:00Z"/>
<meta name="dc:creator" content="ingo "/>
<meta name="Last-Modified" content="1601-01-01T00:00:00Z"/>
<meta name="Author" content="ingo "/>
<meta name="dcterms:created" content="2013-03-20T15:29:13Z"/>
<meta name="date" content="1601-01-01T00:00:00Z"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.OfficeParser"/>
<meta name="modified" content="1601-01-01T00:00:00Z"/>
<meta name="creator" content="ingo "/>
<meta name="Creation-Date" content="2013-03-20T15:29:13Z"/>
<meta name="meta:author" content="ingo "/>
<meta name="Content-Type" content="application/msword"/>
<meta name="Last-Save-Date" content="1601-01-01T00:00:00Z"/>
<title/>
</head>
<body>
<p class="überschrift_1"><b>Tika Parser Test </b></p>
<p class="standard">This is a simple test document</p>
</body>
</html>
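Given XML shaped like the sample above, one possible way to get only the content of <body> is to post-process the string that the service already returns. The following is a minimal sketch using only JDK classes, not a Tika feature; the class name BodyExtractor and the method extractBody are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Tika): pulls the markup inside <body>
// out of an XHTML string such as the one shown in the question.
class BodyExtractor {

    /** Returns the serialized children of the first <body>, or "" if there is none. */
    static String extractBody(String xhtml) throws Exception {
        // The factory is deliberately left non-namespace-aware, so elements are
        // matched by their plain tag name ("body").
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        NodeList bodies = doc.getElementsByTagName("body");
        if (bodies.getLength() == 0) {
            return "";
        }

        // Identity transformer, used here only as a serializer.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        // Serialize each child of <body> in order; <body> itself is skipped.
        StringWriter out = new StringWriter();
        NodeList children = bodies.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            serializer.transform(new DOMSource(children.item(i)), new StreamResult(out));
        }
        return out.toString();
    }
}
```

In the controller, the returned string could then be placed in the model for the template instead of the full document, for example as `BodyExtractor.extractBody(result)`. This parses the XML a second time, which is simple but not the most efficient route for large documents.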
Edit 3
Controller
import javax.xml.transform.OutputKeys
import javax.xml.transform.sax.SAXTransformerFactory
import javax.xml.transform.sax.TransformerHandler
import javax.xml.transform.stream.StreamResult
import org.apache.tika.config.TikaConfig
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.Parser
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.sax.ToXMLContentHandler
import org.apache.tika.sax.ToHTMLContentHandler

def parse(Document documentInstance) {
    def file = new File(documentInstance.fullPath)
    BodyContentHandler handler = new BodyContentHandler(new ToHTMLContentHandler())
    AutoDetectParser parser = new AutoDetectParser()
    FileInputStream inputstream = new FileInputStream(file)
    Metadata metadata = new Metadata()
    parser.parse(inputstream, handler, metadata)
}
Error
Namespace http://www.w3.org/1999/xhtml not declared
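A likely cause, though not confirmed here, is that the wrapped serializing handler is asked to write XHTML elements without ever having seen the namespace declaration, because only the events inside <body> are forwarded to it. One way around this might be a small handler that writes the body markup itself and relies only on local names. The sketch below uses JDK SAX classes only; BodyMarkupHandler and bodyOf are hypothetical names, and passing such a handler to Tika's parser.parse in place of the BodyContentHandler is an assumption that has not been tested against Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects the markup found inside
// <body> as a string, using local names only, so no namespace lookup is needed.
class BodyMarkupHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0; // 0 = outside <body>, 1 = directly inside it, >1 = nested

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (depth > 0) {
            out.append('<').append(localName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getLocalName(i))
                   .append("=\"").append(escape(atts.getValue(i))).append('"');
            }
            out.append('>');
            depth++;
        } else if ("body".equals(localName)) {
            depth = 1; // start collecting, but do not write <body> itself
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 1) {
            out.append("</").append(localName).append('>');
        }
        if (depth > 0) {
            depth--; // reaching 0 means </body> was just closed
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            out.append(escape(new String(ch, start, length)));
        }
    }

    // Escapes the characters that are unsafe in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    @Override
    public String toString() {
        return out.toString();
    }

    // Demonstration with the JDK SAX parser standing in for Tika.
    static String bodyOf(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required so that localName is populated
        BodyMarkupHandler handler = new BodyMarkupHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }
}
```

Empty elements come out as an open and close pair (for example `<br></br>`), which is fine for XML but may need adjusting if the result is embedded in an HTML page.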
What code do you have so far?
@EmmanuelRosa, I edited my post. Thanks.
Could you post the XML as text so that it can be copied and pasted?