Substring based on bytes rather than character count

I'm building an input system in which a field can hold at most 200 bytes. I count the remaining bytes as follows (the method itself may be open to debate, too!):

var totalBytes = 200; 
var $newVal = $(this).val(); 
var m = encodeURIComponent($newVal).match(/%[89ABab]/g); 
var bytesLeft = totalBytes - ($newVal.length + (m ? m.length : 0));
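For comparison, here is a minimal sketch that wraps the formula above in a function and checks it against `TextEncoder`, assuming an environment that provides it (current browsers and Node.js). The names `countBytesRegex` and `countBytesEncoder` are made up for this illustration. The two should agree for characters in the Basic Multilingual Plane; for characters stored as surrogate pairs the formula counts one byte too many, because `length` reports two code units while UTF-8 uses a single lead byte.

```javascript
// Hypothetical helper: the formula from the question, as a function.
function countBytesRegex(str) {
    // Each match is one UTF-8 continuation byte (0x80-0xBF).
    var m = encodeURIComponent(str).match(/%[89ABab]/g);
    return str.length + (m ? m.length : 0);
}

// Hypothetical helper: encode to UTF-8 and count the bytes directly.
function countBytesEncoder(str) {
    return new TextEncoder().encode(str).length;
}

// Print both counts side by side for a few sample strings.
var samples = ["abc", "héllo", "日本語", "😀"];
samples.forEach(function (s) {
    console.log(JSON.stringify(s), countBytesRegex(s), countBytesEncoder(s));
});
```

If the field may contain emoji or other characters outside the BMP, the encoder-based count is the safer of the two.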

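This seems to work well, but if someone pastes in a large chunk of data, I'd like to be able to cut the input and show only the first 200 bytes. In pseudocode, I imagine it would look something like this: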

$newText = substrBytes($string, 0, 200);

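Any help or guidance would be much appreciated.

Edit: everything here will be UTF-8, BTW :)

Edit 2: I know I can loop over every character and evaluate it; I was hoping there might be something a little more elegant to take care of this.

Thanks!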

Source

2012-04-18 Slazlaa

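What is the reason for treating your input as bytes rather than text? See the string methods at http://www.w3schools.com/jsref/jsref_obj_string.asp – jazzytomato 2012-04-18 11:38:32

I may be wrong, but my impression is that this needs some kind of iconv-like conversion between character encodings. It doesn't sound easy. – 2012-04-18 11:42:35

The system the text is inserted into requires a text payload no larger than 200 bytes. – Slazlaa 2012-04-18 11:43:12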

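A Google search turns up a blog article, complete with a try-it-yourself input box. I'm copying the code here because this site prefers definitive answers over links, but credit goes to McDowell.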

/** 
* codePoint - an integer containing a Unicode code point 
* return - the number of bytes required to store the code point in UTF-8 
*/ 
function utf8Len(codePoint) { 
    if(codePoint >= 0xD800 && codePoint <= 0xDFFF) 
        throw new Error("Illegal argument: "+codePoint); 
    if(codePoint < 0) throw new Error("Illegal argument: "+codePoint); 
    if(codePoint <= 0x7F) return 1; 
    if(codePoint <= 0x7FF) return 2; 
    if(codePoint <= 0xFFFF) return 3; 
    if(codePoint <= 0x1FFFFF) return 4; 
    if(codePoint <= 0x3FFFFFF) return 5; 
    if(codePoint <= 0x7FFFFFFF) return 6; 
    throw new Error("Illegal argument: "+codePoint); 
} 

function isHighSurrogate(codeUnit) { 
    return codeUnit >= 0xD800 && codeUnit <= 0xDBFF; 
} 

function isLowSurrogate(codeUnit) { 
    return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF; 
} 

/** 
* Transforms UTF-16 surrogate pairs to a code point. 
* See RFC2781 
*/ 
function toCodepoint(highCodeUnit, lowCodeUnit) { 
    if(!isHighSurrogate(highCodeUnit)) throw new Error("Illegal argument: "+highCodeUnit); 
    if(!isLowSurrogate(lowCodeUnit)) throw new Error("Illegal argument: "+lowCodeUnit); 
    highCodeUnit = (0x3FF & highCodeUnit) << 10; 
    var u = highCodeUnit | (0x3FF & lowCodeUnit); 
    return u + 0x10000; 
} 

/** 
* Counts the length in bytes of a string when encoded as UTF-8. 
* str - a string 
* return - the length as an integer 
*/ 
function utf8ByteCount(str) { 
    var count = 0; 
    for(var i=0; i<str.length; i++) { 
        var ch = str.charCodeAt(i); 
        if(isHighSurrogate(ch)) { 
            var high = ch; 
            var low = str.charCodeAt(++i); 
            count += utf8Len(toCodepoint(high, low)); 
        } else { 
            count += utf8Len(ch); 
        } 
    } 
    return count; 
}
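The counting logic above can be turned into the slicing function the question asks for. What follows is only a sketch under some assumptions: it needs an ES2015 environment (where `for...of` iterates a string by code point), the names `utf8LenOf` and `substrBytes` are illustrative and not part of McDowell's code, and it always starts at offset zero, unlike the three-argument pseudocode in the question. Lone surrogates are counted as 3 bytes instead of throwing.

```javascript
// Hypothetical helper: UTF-8 length of one code point (range per RFC 3629).
function utf8LenOf(codePoint) {
    if (codePoint <= 0x7F) return 1;
    if (codePoint <= 0x7FF) return 2;
    if (codePoint <= 0xFFFF) return 3;
    return 4;
}

// Hypothetical slicing function: the longest prefix of str that fits in
// maxBytes UTF-8 bytes without splitting a character.
function substrBytes(str, maxBytes) {
    var bytes = 0;
    var end = 0;                     // position in UTF-16 code units
    for (var ch of str) {            // iterates by code point, not code unit
        var size = utf8LenOf(ch.codePointAt(0));
        if (bytes + size > maxBytes) break;
        bytes += size;
        end += ch.length;            // 1 or 2 code units
    }
    return str.slice(0, end);
}

console.log(substrBytes("héllo wörld", 7));
console.log(substrBytes("日本語", 7));
```

Because the loop stops before the character that would exceed the limit, the result can be a byte or more shorter than `maxBytes`.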

Source

2012-04-18 12:03:22

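This code is very interesting, but the result seems to change depending on the encoding of the file the source code is saved in. It is probably fine for a form, though. – 2012-04-19 09:41:25

Although this code doesn't actually include the slicing function I was looking for, it handles the byte counting perfectly. Thank you! :) – Slazlaa 2012-04-20 09:34:36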

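Strings in JavaScript are represented internally as UTF-16, so each code unit takes two bytes (and a character outside the Basic Multilingual Plane takes two code units). So your question is really "get the byte length of str in UTF-8".

You would hardly want half a symbol, so the cut may land at 198 or 199 bytes.

Here are two different solutions: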

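// direct byte size counting 
function cutInUTF8(str, n) { 
    var i = 0, c, next, cs, pair, bytes = 0; 
    while (i < str.length) { 
        c = str.charCodeAt(i); 
        next = str.charCodeAt(i + 1); // NaN past the end of the string 
        // a surrogate pair is one code point that takes 4 bytes in UTF-8 
        pair = c >= 0xD800 && c <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF; 
        cs = 1; 
        if (c >= 128) cs++; 
        if (c >= 2048) cs++; 
        if (pair) cs = 4; 
        if (n < (bytes += cs)) break; 
        i += pair ? 2 : 1; // never split a surrogate pair 
    } 
    return str.substr(0, i); 
} 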

// using internal functions, but is not very fast due to try/catch 
function cutInUTF8(str, n) { 
    var encoded = unescape(encodeURIComponent(str)).substr(0, n); 
    while (true) { 
        try { 
            str = decodeURIComponent(escape(encoded)); 
            return str; 
        } catch(e) { 
            encoded = encoded.substr(0, encoded.length-1); 
        } 
    } 
}
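Where `TextEncoder` and `TextDecoder` are available (current browsers and Node.js), the same encode, cut, decode idea can be sketched without the deprecated `escape`/`unescape` and without try/catch. This is an assumption about the target environment rather than part of the answer above, and `cutBytes` is a made-up name. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`, so backing up over them finds a character boundary.

```javascript
// Hypothetical alternative: cut at n bytes, then back up to a character boundary.
function cutBytes(str, n) {
    var bytes = new TextEncoder().encode(str);
    if (bytes.length <= n) return str;
    var end = n;
    // If the first excluded byte is a continuation byte (10xxxxxx), the cut
    // falls inside a character: back up to that character's lead byte.
    while (end > 0 && (bytes[end] & 0xC0) === 0x80) end--;
    return new TextDecoder().decode(bytes.subarray(0, end));
}

console.log(cutBytes("日本語", 7));
console.log(cutBytes("a😀b", 3));
```

The loop runs at most three times, since a UTF-8 character has at most three continuation bytes.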

Source

2012-04-18 12:04:28 kirilloid

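The first function at if (n 2012-04-19 09:32:58

Fixed both functions – kirilloid 2012-04-19 11:57:02

This is a great answer, just not as in-depth as the answer above. If I could mark both as correct, I would! Thanks! – Slazlaa 2012-04-20 09:35:33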
