How can I detect the width of a Unicode string in a terminal?

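I'm working on a terminal-based program that has Unicode support. In some cases I need to determine how many terminal columns a string will consume before I print it. Unfortunately, some characters are two columns wide (Chinese, etc.), but I found this answer, which indicates that a good way to detect fullwidth characters is to call u_getIntPropertyValue() from the ICU library.

Now I'm trying to parse the characters of my UTF-8 string and pass them to this function. The problem I'm having is that u_getIntPropertyValue() requires a UTF-32 code point.

What is the best way to get this from a UTF-8 string? I'm currently trying to do it with boost::locale (which is used elsewhere in my program), but I can't get a clean conversion. The UTF-32 strings that come out of boost::locale are prefixed with a zero-width character that indicates the byte order. Obviously I can just skip the first four bytes of the string, but is there a cleaner way to do this?

Here is my current ugly solution: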

#include <cstring>          // std::memcpy
#include <locale>
#include <string>
#include <boost/locale.hpp>
#include <unicode/uchar.h>  // u_getIntPropertyValue, UChar32

inline size_t utf8PrintableSize(const std::string &str, std::locale loc)
{
    namespace ba = boost::locale::boundary;
    ba::ssegment_index map(ba::character, str.begin(), str.end(), loc);
    size_t widthCount = 0;
    for (ba::ssegment_index::iterator it = map.begin(); it != map.end(); ++it)
    {
        ++widthCount;
        std::string utf32Char = boost::locale::conv::from_utf(it->str(), std::string("utf-32"));

        // Skip the 4-byte byte-order mark that precedes the first code point.
        UChar32 utf32Codepoint = 0;
        std::memcpy(&utf32Codepoint, utf32Char.c_str() + 4, sizeof(UChar32));

        int width = u_getIntPropertyValue(utf32Codepoint, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }
    }
    return widthCount;
}


2016-05-23 KyleL

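If you're already using ICU, why not use its UTF-8 to UTF-32 conversion? –

I'm not familiar with ICU. I was trying to use boost::locale to insulate myself from most of the complexity. Is there a simple way to get this UTF-32 code point directly from ICU? – KyleL

I'm not familiar with it either, but I know it has everything anyone could want from a Unicode library. Spend some time with Google and you'll find it. –

UTF-32 is a direct representation of the "code point" of each individual character. So you only need to extract the code points from the UTF-8 characters and feed them to u_getIntPropertyValue.

I took your code and modified it to use u8_to_u32_iterator, which seems to be made for exactly this: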

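#include <locale>
#include <string>
#include <boost/regex/pending/unicode_iterator.hpp>
#include <unicode/uchar.h>  // u_getIntPropertyValue

// loc is not used by this version; it is kept so the signature stays the same.
inline size_t utf8PrintableSize(const std::string &str, std::locale loc)
{
    // str is const, so the adapter wraps a const_iterator.
    typedef boost::u8_to_u32_iterator<std::string::const_iterator> Utf32Iterator;

    size_t widthCount = 0;
    for (Utf32Iterator it(str.begin()), end(str.end()); it != end; ++it)
    {
        ++widthCount;

        // *it is the decoded UTF-32 code point.
        int width = u_getIntPropertyValue(*it, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }
    }
    return widthCount;
}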
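
For anyone curious what the UTF-8 to UTF-32 step involves, here is a rough hand-written sketch of the decoding, using only the standard library. The name decodeUtf8 is made up for this illustration and is not part of Boost or ICU, and replacing malformed input with U+FFFD is an assumption about the desired error handling (the Boost iterator may behave differently). In real code a library decoder is probably the safer choice.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a Boost or ICU function): decodes a UTF-8 string
// into UTF-32 code points. Malformed sequences become U+FFFD (an assumption).
inline std::vector<char32_t> decodeUtf8(const std::string &str)
{
    std::vector<char32_t> out;
    const std::size_t n = str.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char lead = static_cast<unsigned char>(str[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // continuation bytes expected after the lead byte
        char32_t minValue = 0;   // smallest code point allowed for this length

        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; minValue = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; minValue = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; minValue = 0x10000; }
        else
        {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k)
        {
            if (i >= n || (static_cast<unsigned char>(str[i]) & 0xC0) != 0x80)
            {
                ok = false;  // truncated sequence; the offending byte is re-read next
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(str[i]) & 0x3F);
            ++i;
        }

        // Reject truncated, overlong, surrogate and out-of-range results.
        if (!ok || cp < minValue || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        {
            cp = 0xFFFD;
        }
        out.push_back(cp);
    }
    return out;
}
```

Each element of the returned vector could then be passed to u_getIntPropertyValue in place of *it in the loop above.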


2016-05-23 19:10:20

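Thanks for the Boost implementation. Interesting that it's part of the regex library rather than locale. – KyleL

@n.m. was right: there is a simple way to do this directly with ICU. The updated code is below. I suspect I could probably just use UnicodeString and bypass the boost::locale usage entirely.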

#include <locale>
#include <string>
#include <boost/locale.hpp>
#include <unicode/stringpiece.h>
#include <unicode/uchar.h>   // u_getIntPropertyValue, UChar32
#include <unicode/unistr.h>  // icu::UnicodeString

inline size_t utf8PrintableSize(const std::string &str, std::locale loc)
{
    namespace ba = boost::locale::boundary;
    ba::ssegment_index map(ba::character, str.begin(), str.end(), loc);
    size_t widthCount = 0;
    for (ba::ssegment_index::iterator it = map.begin(); it != map.end(); ++it)
    {
        ++widthCount;

        // Note: Some Unicode characters are 'full width' and consume more than one
        // column on output. We will increment widthCount one extra time for
        // these characters to ensure that space is properly allocated.
        // Only the first code point of each character segment is inspected.
        icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(it->str()));
        UChar32 codePoint = ucs.char32At(0);

        int width = u_getIntPropertyValue(codePoint, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }
    }
    return widthCount;
}


2016-05-23 18:51:58 KyleL

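Don't forget to handle zero-width characters! – o11c

@o11c Do you know how to check for that? My possibly misguided Google searches are turning up blanks. – KyleL

Something like 'General_Category' in {"Mn", "Me"} or 'Default_Ignorable_Code_Point'; the latter includes format characters, the soft hyphen and so on. However, you also have to do something more complicated for Hangul composition, which depends on what the preceding character is. – o11c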
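
To make the zero-width suggestion concrete, here is a minimal sketch of a width count with three cases (0, 1 or 2 columns). Everything in it is an assumption for illustration: the names Range, isZeroWidthSample, isWideSample and printableWidth are invented, and the small hard-coded tables are deliberately incomplete stand-ins for a real Unicode property lookup. With ICU, the zero-width test would presumably go through u_charType (checking for U_NON_SPACING_MARK and U_ENCLOSING_MARK) and u_hasBinaryProperty with UCHAR_DEFAULT_IGNORABLE_CODE_POINT, and the wide test through UCHAR_EAST_ASIAN_WIDTH as in the code above. Hangul jamo composition is not handled here at all.

```cpp
#include <cstddef>
#include <vector>

// Inclusive code point range used by the sample tables below.
struct Range { char32_t first; char32_t last; };

inline bool inAnyRange(char32_t c, const Range *ranges, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        if (c >= ranges[i].first && c <= ranges[i].last) return true;
    }
    return false;
}

// Illustrative, incomplete sample of zero-width code points.
inline bool isZeroWidthSample(char32_t c)
{
    static const Range kZero[] = {
        {0x00AD, 0x00AD},  // soft hyphen (Default_Ignorable_Code_Point)
        {0x0300, 0x036F},  // combining diacritical marks (Mn)
        {0x200B, 0x200F},  // zero width space, joiners, direction marks
        {0x20DD, 0x20E0},  // combining enclosing marks (Me)
        {0xFEFF, 0xFEFF},  // zero width no-break space / byte-order mark
    };
    return inAnyRange(c, kZero, sizeof kZero / sizeof kZero[0]);
}

// Illustrative, incomplete sample of wide and fullwidth code points.
inline bool isWideSample(char32_t c)
{
    static const Range kWide[] = {
        {0x4E00, 0x9FFF},  // CJK unified ideographs
        {0xAC00, 0xD7A3},  // precomposed Hangul syllables
        {0xFF01, 0xFF60},  // fullwidth ASCII variants
    };
    return inAnyRange(c, kWide, sizeof kWide / sizeof kWide[0]);
}

// Sums the column widths of already-decoded code points:
// zero-width code points add 0, wide ones add 2, everything else adds 1.
inline std::size_t printableWidth(const std::vector<char32_t> &codePoints)
{
    std::size_t width = 0;
    for (std::size_t i = 0; i < codePoints.size(); ++i)
    {
        const char32_t c = codePoints[i];
        if (isZeroWidthSample(c)) continue;
        width += isWideSample(c) ? 2 : 1;
    }
    return width;
}
```

The zero-width check comes first so that a combining mark never adds a column, whatever its East Asian Width value is.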
