2014-09-30 66 views
1

XML file: How to convert UTF-16 to ASCII/ANSI using XML/SGML entities?

<?xml version="1.0" encoding="utf-8"?> 
<response> 
<center> 
<b>Need to decode this -> 😉</b>
</center> 
</response> 

My current code:

procedure TForm1.Button1Click(Sender: TObject); 
var 
    Doc: IXMLDocument; 
    S: AnsiString; 
    SW: WideString; 
    I: Integer; 
begin 
    Doc := TXMLDocument.Create(nil); 
    Doc.LoadFromFile('example.xml'); 
    SW := Doc.DocumentElement.ChildNodes['center'].ChildNodes['b'].NodeValue; 
    S := ''; 
    for I := 1 to Length(SW) do
        if Ord(SW[I]) > $04FF then
            S := S + IntToHex(Ord(SW[I]), 4) + ' '
        else
            S := S + SW[I];
    Memo1.Text := S;
end; 

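SW is encoded in UTF-16 (it is a WideString) and contains the character sequence #$D83D#$DE09, but I need it as an XML/SGML entity such as '&#128521;'. How can I encode it?

The character in question is this one: http://www.fileformat.info/info/unicode/char/1f609/index.htm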

+1

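Not really clear. So basically, you are unhappy with how the XML DOM implementation decodes a character outside the Basic Multilingual Plane and encodes it as two UTF-16 code units? And you want to re-encode it as an SGML character entity? – 2014-09-30 02:29:16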

+0

It really wasn't clear, I forgot to add the XML document... I'll add it now – user3802199 2014-09-30 02:32:04

+0

Added the XML document – user3802199 2014-09-30 02:33:53

Answers

0

With ANSI Delphi you have to handle UTF-16 surrogate pairs manually (or use a third-party library).
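
To make the manual handling concrete, here is the pair from the question, #$D83D #$DE09, worked through by hand. This is only a sketch of the standard UTF-16 surrogate decoding arithmetic, with H standing for the high surrogate and L for the low surrogate (both names are introduced here for illustration):

```latex
\begin{aligned}
H &= \mathtt{0xD83D}, \qquad L = \mathtt{0xDE09} \\
\text{code point} &= \mathtt{0x10000} + (H - \mathtt{0xD800}) \cdot 2^{10} + (L - \mathtt{0xDC00}) \\
&= \mathtt{0x10000} + \mathtt{0x3D} \cdot 1024 + \mathtt{0x209} \\
&= 65536 + 62464 + 521 \\
&= 128521 = \mathtt{0x1F609}
\end{aligned}
```

The decimal result, 128521, is exactly the number that appears in the desired entity '&#128521;'.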

This should work in both ANSI and Unicode Delphi:

uses 
    {$IFDEF UNICODE} 
    Xml.XMLDoc, Xml.XMLIntf, System.AnsiStrings, System.Character; 
    {$ELSE} 
    XMLDoc, XMLIntf; 
    {$ENDIF} 

{$R *.dfm} 

type 
{$IFDEF UNICODE} 
    ValueString = UnicodeString; 
{$ELSE} 
    ValueString = WideString; 
{$ENDIF} 

procedure Check(ATrue: Boolean; const AMessage: string);
begin
    if not ATrue then
        raise Exception.Create(AMessage);
end;

function IsHighSurrogate(AChar: WideChar): Boolean; 
begin 
{$IFDEF UNICODE} 
    Result := TCharacter.IsHighSurrogate(AChar); 
{$ELSE} 
    Result := (AChar >= #$D800) and (AChar <= #$DBFF); 
{$ENDIF} 
end; 

function ConvertToUtf32(AHigh, ALow: WideChar): Integer; 
begin 
    {$IFDEF UNICODE} 
    Result := Ord(TCharacter.ConvertToUtf32(AHigh, ALow)); 
    {$ELSE} 
    Check(AHigh >= #$D800, 'Invalid high surrogate code point'); 
    Check(AHigh <= #$DBFF, 'Invalid high surrogate code point'); 
    Check(ALow >= #$DC00, 'Invalid low surrogate code point'); 
    Check(ALow <= #$DFFF, 'Invalid low surrogate code point'); 
    // This will return the ordinal value of the Unicode character represented by the two surrogate code points 
    Result := $010000 + ((Ord(AHigh) - $D800) shl 10) or (Ord(ALow) - $DC00); 
    {$ENDIF} 
end; 

function MakeEntity(AValue: Integer): AnsiString; 
begin 
    Result := Format(AnsiString('&#%d;'), [AValue]); 
end; 

function UnicodeToAsciiWithEntities(const AInput: ValueString): AnsiString;
var
    C: WideChar;
    I: Integer;
begin
    Result := '';
    I := 1;
    while I <= Length(AInput) do
    begin
        C := AInput[I];
        if C < #$0080 then
            // Plain 7-bit ASCII is copied through unchanged
            Result := Result + AnsiChar(C)
        else if IsHighSurrogate(C) then
        begin
            Check((I + 1) <= Length(AInput), 'String truncated after high surrogate');
            Result := Result + MakeEntity(ConvertToUtf32(C, AInput[I + 1]));
            // Skip low surrogate
            Inc(I);
        end
        else
            // Any other non-ASCII character in the BMP becomes a numeric entity
            Result := Result + MakeEntity(Ord(C));
        Inc(I);
    end;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
    Memo1.Lines.Text := string(UnicodeToAsciiWithEntities(
        LoadXMLDocument('example.xml').DocumentElement
            .ChildNodes['center'].ChildNodes['b'].NodeValue));
end;

I don't have Delphi 7 here, so some small adjustments may be necessary; the code works in XE2 and 2007.

+0

The XML document declares its encoding as UTF-8 – 2014-09-30 16:13:31

+0

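Instead of converting the entire XML content to a 'UCS4String' and wasting 2-4x the memory, I would leave it as a 'UnicodeString' and simply loop through it looking for surrogates, converting them to entities where needed. Look at the 'System.Character' functions such as 'IsSurrogatePair()' and 'ConvertToUtf32()'. – 2014-09-30 17:51:12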

+0

@DavidHeffernan That's true, but it doesn't matter, because the XML parser converts the content to Delphi's internal representation anyway (WideString for Delphi 7), doesn't it? – 2014-10-01 09:02:38