2016-11-16

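Extracting content from an embedded paragraph in a docx file with POI

I am using Apache POI to extract content from a docx file. When the file is processed, all of the pictures are lost. I inspected the file and found that its structure is unusual: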

<w:r> 
<w:p xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"> 
<w:r> 
<w:drawing> 
<wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251658240" behindDoc="0" locked="0" layoutInCell="1" allowOverlap="1"> 
<wp:simplePos x="0" y="0"/> 
<wp:positionH relativeFrom="column"> 
<wp:align>center</wp:align> 
</wp:positionH> 
<wp:positionV relativeFrom="paragraph"> 
<wp:posOffset>2540</wp:posOffset> 
</wp:positionV> 
<wp:extent cx="5352176" cy="1837188"/> 
<wp:wrapTopAndBottom/> 
<wp:docPr id="9" name="media/GIUACAFYtDB.png"/> 
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"> 
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture"> 
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"> 
<pic:nvPicPr> 
<pic:cNvPr id="0" name="media/GIUACAFYtDB.png"/> 
<pic:cNvPicPr/> 
</pic:nvPicPr> 
<pic:blipFill> 
<a:blip r:embed="rId9"/> 
<a:stretch> 
<a:fillRect/> 
</a:stretch> 
</pic:blipFill> 
<pic:spPr> 
<a:xfrm> 
<a:off x="0" y="0"/> 
<a:ext cx="5352176" cy="1837188"/> 
</a:xfrm> 
<a:prstGeom prst="rect"/> 
</pic:spPr> 
</pic:pic> 
</a:graphicData> 
</a:graphic> 
</wp:anchor> 
</w:drawing> 
</w:r> 
</w:p> 
</w:r> 

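The paragraph element (w:p) sits inside a run element (w:r). I call this an embedded paragraph, and I cannot find a way to parse an embedded paragraph with POI. How should I handle this data?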


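https://brattahlid.wordpress.com/2012/05/08/is-docx-really-an-open-standard/ This article says that Microsoft Word does not fully support Open XML, yet POI is built on the Open XML schema. Is there any other way to handle docx files produced by Microsoft Word? – TimYi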

Answer

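// Imports as of POI 3.x; in POI 4.0 and later, POIXMLDocumentPart is in org.apache.poi.ooxml.
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.POIXMLDocumentPart;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPicture;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.xmlbeans.SimpleValue;
import org.apache.xmlbeans.XmlObject;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;

public static List<XWPFPictureData> extractPictureData(XWPFRun wrun) {
    List<XWPFPictureData> result = new ArrayList<>();

    // Normal case: POI already found the pictures embedded in this run.
    List<XWPFPicture> pictures = wrun.getEmbeddedPictures();
    if (pictures != null && !pictures.isEmpty()) {
        for (XWPFPicture picture : pictures) {
            XWPFPictureData data = picture.getPictureData();
            if (data != null) {
                result.add(data);
            }
        }
        return result;
    }

    // A schema-valid run with no pictures really has none.
    CTR ctr = wrun.getCTR();
    if (ctr.validate()) {
        return result;
    }

    // This run does not obey the Open XML schema (for example, a w:p nested
    // inside the w:r), so search the raw XML for drawings at any depth.
    XWPFDocument document = wrun.getDocument();
    String drawingPath =
        "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
        + ".//w:drawing";
    String blipPath =
        "declare namespace a='http://schemas.openxmlformats.org/drawingml/2006/main' "
        + ".//a:blip";
    for (XmlObject drawing : ctr.selectPath(drawingPath)) {
        XmlObject[] blips = drawing.selectPath(blipPath);
        if (blips.length == 0) {
            continue;
        }
        // The r:embed attribute holds the relationship id of the image part.
        XmlObject blipId = blips[0].selectAttribute(
            "http://schemas.openxmlformats.org/officeDocument/2006/relationships", "embed");
        if (blipId == null) {
            continue;
        }
        String id = ((SimpleValue) blipId).getStringValue();
        // Resolve the relationship id against the main document part.
        POIXMLDocumentPart relatedPart = document.getRelationById(id);
        if (relatedPart instanceof XWPFPictureData) {
            result.add((XWPFPictureData) relatedPart);
        }
    }
    return result;
}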

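This does not solve every problem, but it solves mine for now. I tried to access the low-level XmlObject and build an XWPFParagraph object for the embedded paragraph, but that failed. So I just use low-level XML processing code.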
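
For anyone who wants to try the low-level idea in isolation, here is a minimal sketch that uses only the JDK's DOM parser instead of POI and XMLBeans. It searches a run for a:blip elements at any depth, so a paragraph nested inside a run does not matter, and it collects their r:embed relationship ids. The class name BlipIdFinder, the method findEmbedIds, and the SAMPLE string are hypothetical names made up for this illustration. SAMPLE is a trimmed version of the XML in the question, with the namespace declarations written out on the outer run. This is not how POI does it internally; it only illustrates the search step.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of POI.
public class BlipIdFinder {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Trimmed sample: a run that wraps a paragraph that wraps the real picture run.
    static final String SAMPLE =
        "<w:r xmlns:w='" + W_NS + "' xmlns:a='" + A_NS + "' xmlns:r='" + R_NS + "'>"
        + "<w:p><w:r><w:drawing><a:graphic><a:graphicData>"
        + "<a:blip r:embed='rId9'/>"
        + "</a:graphicData></a:graphic></w:drawing></w:r></w:p></w:r>";

    // Returns the r:embed value of every a:blip found anywhere in the XML.
    public static List<String> findEmbedIds(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups below
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList blips = doc.getElementsByTagNameNS(A_NS, "blip");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < blips.getLength(); i++) {
                Element blip = (Element) blips.item(i);
                String id = blip.getAttributeNS(R_NS, "embed"); // "" when absent
                if (!id.isEmpty()) {
                    ids.add(id);
                }
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse run XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findEmbedIds(SAMPLE)); // prints [rId9]
    }
}
```

Each id found this way still has to be resolved to an image part, which is what the getRelationById call in the answer does.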