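Loading a file with two separate encodings in Ruby

2012-07-09

I have a large file that contains two different encodings. The "main" part of the file is UTF-8, but some characters, such as <80> or <9F>, are encoded in ISO-8859-1. I can replace the invalid characters with this: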

string.encode("iso8859-1", "utf-8", {:invalid => :replace, :replace => "-"}).encode("utf-8") 

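The problem is that I need these wrongly encoded characters, so replacing them with "-" is useless to me. How can I fix the wrongly encoded characters in the document with Ruby?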

Edit: I have already tried the :fallback option, but without success (nothing was replaced):

string.encode("iso8859-1", "utf-8", 
    :fallback => {"\x80" => "123"} 
) 
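
A likely reason :fallback has no effect here: according to the String#encode documentation, it supplies replacements for undefined conversions (a valid character with no equivalent in the target encoding), not for invalid byte sequences. The following is a minimal sketch of that difference for Ruby 1.9 or later; the method names are made up for illustration.

```ruby
# Illustrative only; the method names are hypothetical.

# The euro sign is valid UTF-8 but has no ISO-8859-1 equivalent.
# That is an *undefined* conversion, which :fallback can handle.
def encode_undefined_char
  "\u20AC".encode('ISO-8859-1', fallback: { "\u20AC" => 'EUR' })
end

# A lone 0xE3 byte is not valid UTF-8. That is an *invalid* byte
# sequence: :fallback is not consulted and encode raises instead.
def encode_invalid_byte
  broken = [0x62, 0xE3, 0x62].pack('C*').force_encoding('UTF-8')
  broken.encode('ISO-8859-1', fallback: { "\xE3" => '?' })
rescue Encoding::InvalidByteSequenceError
  :invalid_byte_sequence
end

p encode_undefined_char  # "EUR"
p encode_invalid_byte    # :invalid_byte_sequence
```

Invalid bytes are governed by the :invalid option instead, which only offers replacement, so keeping the original character needs a different approach.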
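Fallback only works when no other options are given. See the link I posted earlier. – phoet 2012-07-10 07:45:32

No, I have already tried it without any additional options, and it did not work :( – Fu86 2012-07-10 13:28:33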

Answers

1

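I use the following code (Ruby 1.8.7). It tests every byte >= 128 to check whether it is the start of a valid UTF-8 sequence. If it is not, the byte is assumed to be ISO-8859-1 and is converted to UTF-8.

Because your file is large, this process may be very slow!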

class String
  # Ensures each char in the returned string is valid UTF-8.
  # Based on http://php.net/manual/en/function.utf8-encode.php#39986
  def utf8
    ret = ''

    # Lead-byte masks: a lead byte b that announces n continuation
    # bytes satisfies (b & m[n]) == m[n-1].
    m = [0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe]

    # Scan the bytes with an explicit index (rather than each_byte),
    # because the index has to skip over multi-byte sequences.
    a = unpack('C*')
    i = 0
    l = a.length
    while i < l
      b = a[i]
      i += 1

      # ASCII bytes are copied unchanged.
      if b < 0x80
        ret << b.chr
        next
      end

      # Is b the lead byte of a UTF-8 sequence with n continuation bytes?
      # Stop before n runs past the end of m (m[n] would be nil).
      n = 1
      n += 1 until n >= m.length || (b & m[n]) == m[n - 1]

      # If not, treat it as ISO-8859-1 and convert it to UTF-8.
      if n >= m.length
        ret << [b].pack('U')
        next
      end

      # If so, check that n bytes matching 10bbbbbb follow.
      r = [b]
      u = false
      n.times do
        if i < l
          r << a[i]
          u = (a[i] & 0xc0) == 0x80
          i += 1
        else
          u = false
        end
        break unless u
      end

      # Copy a valid sequence as is; otherwise convert the bytes read.
      ret << r.pack(u ? 'C*' : 'U*')
    end

    ret
  end

  def utf8!
    replace utf8
  end
end

# let s be the string containing your file. 
s2 = s.utf8 

# or 
s.utf8! 
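OK, this may work, but seriously? Is this the only solution to the problem? That is a lot of "overhead" just to fix a few bad characters. – Fu86 2012-07-10 07:36:42

Unfortunately, it is not possible without testing every bad character, because such bytes can also be part of a legitimate UTF-8 sequence. By the way, the code above does not work on 1.9.3; I am thinking about fixing it. – 2012-07-11 20:31:48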

1

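This is a much faster version of my previous code, compatible with Ruby 1.8 and 1.9.

I identify invalid UTF-8 characters with a regular expression and convert only those.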

class String

  # Regexp for invalid UTF8 chars.
  # $1 will be valid utf8 sequence;
  # $3 will be the invalid utf8 char.
  INVALID_UTF8 = Regexp.new(
    '(([\xc0-\xdf][\x80-\xbf]{1}|' +
    '[\xe0-\xef][\x80-\xbf]{2}|' +
    '[\xf0-\xf7][\x80-\xbf]{3}|' +
    '[\xf8-\xfb][\x80-\xbf]{4}|' +
    '[\xfc-\xfd][\x80-\xbf]{5})*)' +
    '([\x80-\xff]?)', nil, 'n')

  if RUBY_VERSION >= '1.9'
    # ensure each char is utf8, assuming that
    # bad characters are in the +encoding+ encoding
    def utf8_ignore!(encoding)

      # avoid bad characters errors and encoding incompatibilities
      force_encoding('ascii-8bit')

      # encode only invalid utf8 chars within string
      gsub!(INVALID_UTF8) do |s|
        $1 + $3.force_encoding(encoding).encode('utf-8').force_encoding('ascii-8bit')
      end

      # final string is in utf-8
      force_encoding('utf-8')
    end

  else
    require 'iconv'

    # ensure each char is utf8, assuming that
    # bad characters are in the +encoding+ encoding
    def utf8_ignore!(encoding)

      # encode only invalid utf8 chars within string
      gsub!(INVALID_UTF8) do |s|
        $1 + Iconv.conv('utf-8', encoding, $3)
      end

    end
  end

end

# "\xe3" = "ã" in iso-8859-1 
# mix valid with invalid utf8 chars, which is in iso-8859-1 
a = "ãb\xe3" 

a.utf8_ignore!('iso-8859-1') 

puts a #=> ãbã 
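
On Ruby 2.1 and later, String#scrub offers a shorter route to the same idea as the regular-expression version above: it walks the string and hands only the invalid byte runs to a block. The following is a minimal sketch, not part of the original answer. The method name repair_mixed_utf8 and its fallback parameter are made up for illustration, and it assumes every invalid byte really is a character in the fallback encoding.

```ruby
# Hypothetical helper (not from the original answer): returns a copy of +str+
# in which every byte run that is not valid UTF-8 has been reinterpreted as
# +fallback+ (ISO-8859-1 by default) and transcoded to UTF-8.
# Requires Ruby 2.1+ for String#scrub.
def repair_mixed_utf8(str, fallback = Encoding::ISO_8859_1)
  # Tag a copy as UTF-8 without changing its bytes.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)

  # scrub yields only the invalid byte runs; valid UTF-8 is left untouched.
  utf8.scrub do |bad_bytes|
    bad_bytes.dup.force_encoding(fallback).encode(Encoding::UTF_8)
  end
end

# "ã" as valid UTF-8 (C3 A3), "b", then a stray ISO-8859-1 "ã" (E3).
mixed = [0xC3, 0xA3, 0x62, 0xE3].pack('C*')
fixed = repair_mixed_utf8(mixed)
puts fixed                  # prints: ãbã
puts fixed.valid_encoding?  # prints: true
```

For a large file, the helper could be applied line by line (for example with File.foreach) rather than to the whole content at once, since a newline byte never occurs inside a multi-byte UTF-8 sequence. Note that in ISO-8859-1 the bytes 0x80 to 0x9F are control characters; if the stray bytes actually come from Windows-1252, Encoding::Windows_1252 can be passed as the fallback, though bytes that are undefined in that encoding may then raise an error.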