2017-04-10

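Splitting an XML file by predicted byte size

I have an XML message, xmlStr, that has to be split into smaller XML messages, each less than or equal to maxSizeBytes. This is done by taking the root and the first child of the document as the basis for each smaller XML message and putting some number of elements into the newly formed (smaller) message.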

<?xml version="1.0"?> 
<Bas> 
    <Hdr> 
    <Smt>...</Smt> 
    <Smt>...</Smt> 
    <Smt>...</Smt> 
    </Hdr> 
</Bas> 

Currently, I measure the size of the whole message with int smtNodesPerMessage = (int)Math.Ceiling((double)ASCIIEncoding.ASCII.GetByteCount(xmlStr)/(double)maxSizeBytes); and then put smtNodesPerMessage nodes into each smaller XML message:

//doc is original XDocument message 
XDocument splitXML = new XDocument(new XElement(doc.Root.Name,            
            doc.Root.Descendants("Hdr"))); 
splitXML.Root.Add(batchOfSmt); 

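I quickly found out that the byte size of the smaller XML messages is larger than maxSizeBytes, because XDocument adds extra characters to each message, which increases the byte size.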

Interesting. Let us know how you go. – MickyD

The code is probably adding the XML declaration to each message: <?xml version="1.0"?> – jdweng

@jdweng, I do have 'splitXML.Declaration = doc.Declaration;', but it is not in the code above. – newprint

Answers


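The basic algorithm is:

  • Get the size of the document with an empty Hdr element. Note that the default encoding is UTF-8, so I use Encoding.Default.GetByteCount to calculate the size of the document and of its elements.
  • Clone this empty-Hdr document for each sub-document.
  • For each Smt element, check before adding it whether the sub-document size would exceed the maximum.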

Code with comments:

using System.Linq; 
using System.Text; 
using System.Xml; 
using System.Xml.Linq; 

var doc = XDocument.Load("data.xml"); 
var hdr = doc.Root.Element("Hdr"); 
var elements = hdr.Elements().ToList(); 
hdr.RemoveAll(); // we can remove child elements, because they are stored in a list 
hdr.Value = ""; // otherwise the document will compact the empty element to <Hdr/> 

// calculating size of sub-document 'template' 
var sb = new StringBuilder(); 
using (XmlWriter writer = XmlWriter.Create(sb)) 
    doc.Save(writer); 
var outerSizeInBytes = Encoding.Default.GetByteCount(sb.ToString()); 

var maxSizeInBytes = 100; 
var subDocumentIndex = 0; // used just for naming sub-document files 
var subDocumentSizeBytes = outerSizeInBytes; // initial size of any sub-document 
var subDocument = new XDocument(doc); // clone 'template' 

foreach (var smt in elements) 
{ 
    var currentElementSizeBytes = Encoding.Default.GetByteCount(smt.ToString()); 

    // Start a new sub-document when adding this element would exceed the limit, 
    // unless the current one is still empty (a single element that is too big on its own) 
    if (maxSizeInBytes < subDocumentSizeBytes + currentElementSizeBytes 
        && subDocumentSizeBytes != outerSizeInBytes) 
    { 
        subDocument.Save($"doc{++subDocumentIndex}.xml"); 
        subDocument = new XDocument(doc); 
        subDocumentSizeBytes = outerSizeInBytes; 
    } 

    subDocument.Root.Element("Hdr").Add(smt); 
    subDocumentSizeBytes += currentElementSizeBytes; 
} 

// if current sub-document has elements added, save it too 
if (outerSizeInBytes < subDocumentSizeBytes) 
    subDocument.Save($"doc{++subDocumentIndex}.xml"); 
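
The sizes above are estimated from string lengths, so they can differ slightly from what ends up on disk (declaration text, indentation, line endings, byte order mark). If an exact figure is needed, one option is to serialize the document into memory and count the bytes. The following is a minimal sketch, not part of the code above: GetSerializedSizeInBytes is a hypothetical helper name, and it assumes UTF-8 output without a byte order mark and with indentation enabled, which may not match how the files are actually saved.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper (an assumption for illustration): returns the number of
// bytes the document occupies when serialized as indented UTF-8 without a BOM.
static long GetSerializedSizeInBytes(XDocument document)
{
    var settings = new XmlWriterSettings
    {
        Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false),
        Indent = true
    };

    using (var stream = new MemoryStream())
    {
        // Disposing the writer flushes it; the stream stays open because
        // XmlWriterSettings.CloseOutput is false by default.
        using (var writer = XmlWriter.Create(stream, settings))
            document.Save(writer);
        return stream.Length;
    }
}

var sample = XDocument.Parse("<Bas><Hdr><Smt>caf\u00e9</Smt></Hdr></Bas>");
Console.WriteLine(GetSerializedSizeInBytes(sample));
```

Serializing every candidate sub-document costs more than adding up string lengths, so this is probably best used as a final check before saving rather than inside the loop.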

When the source is the following and the maximum size is 250 bytes, you will get three documents:

<?xml version="1.0"?> 
<Bas> 
    <Hdr> 
    <Smt>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</Smt> 
    <Smt>Contrary to popular belief, Lorem Ipsum is not simply random text.</Smt> 
    <Smt>It has survived not only five centuries, 
but also the leap into electronic typesetting, remaining essentially unchanged.</Smt> 
    <Smt>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</Smt> 
    </Hdr> 
</Bas> 

doc1 (223 bytes):

<?xml version="1.0" encoding="utf-8"?> 
<Bas> 
    <Hdr> 
    <Smt>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</Smt> 
    <Smt>Contrary to popular belief, Lorem Ipsum is not simply random text.</Smt> 
    </Hdr> 
</Bas> 

doc2 (259 bytes, a single element):

<?xml version="1.0" encoding="utf-8"?> 
<Bas> 
    <Hdr> 
    <Smt>It has survived not only five centuries, 
but also the leap into electronic typesetting, remaining essentially unchanged.</Smt> 
    </Hdr> 
</Bas> 

doc3 (128 bytes, the last one):

<?xml version="1.0" encoding="utf-8"?> 
<Bas> 
    <Hdr> 
    <Smt>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</Smt> 
    </Hdr> 
</Bas> 
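
As doc2 shows, a single element that is too large still produces a file above the limit. A quick way to spot such cases after splitting is to compare the file sizes on disk with the maximum. This is only a sketch: FindOversizedFiles is a hypothetical helper, and it assumes the output files follow the doc*.xml naming used above and sit in one directory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (an assumption for illustration): returns the names of
// the doc*.xml files in a directory whose size on disk exceeds maxSizeInBytes.
static List<string> FindOversizedFiles(string directory, long maxSizeInBytes)
{
    return Directory.EnumerateFiles(directory, "doc*.xml")
        .Where(path => new FileInfo(path).Length > maxSizeInBytes)
        .Select(path => Path.GetFileName(path))
        .OrderBy(name => name, StringComparer.Ordinal)
        .ToList();
}

foreach (var name in FindOversizedFiles(Directory.GetCurrentDirectory(), 250))
    Console.WriteLine($"{name} exceeds the limit");
```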

If you use ASCII.GetByteCount, it is better to declare the XML encoding as ASCII (in the XML declaration). – Evk

@Evk agreed, I just copied the byte-counting method from the question. Actually, I believe Unicode should be used there. –

Yes, I think UTF-8 should be used (it is the XML default if no other encoding is specified). – Evk
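
To illustrate why the choice of encoding matters for the size calculation, here is a small sketch comparing the two byte counts for text that contains non-ASCII characters. The sample string is made up for illustration. Encoding.ASCII counts one byte per character (unsupported characters are replaced), while UTF-8 needs two or three bytes for the accented letter and the euro sign.

```csharp
using System;
using System.Text;

// Sample text with two non-ASCII characters: e-acute (U+00E9) and the euro sign (U+20AC).
var text = "<Smt>caf\u00e9 \u20ac</Smt>";

int asciiBytes = Encoding.ASCII.GetByteCount(text); // one byte per character
int utf8Bytes = Encoding.UTF8.GetByteCount(text);   // multi-byte for non-ASCII characters

Console.WriteLine($"ASCII: {asciiBytes}, UTF-8: {utf8Bytes}"); // prints "ASCII: 17, UTF-8: 20"
```

For pure ASCII content the two counts are identical, so the difference only shows up once the messages contain characters outside that range.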