FileUpload server control and Unicode characters

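I'm using a FileUpload server control to upload an HTML document that was previously saved from MS Word (as "Web Page, Filtered"). The character set is windows-1252. The document has smart (curly) quotes as well as regular quotes. It also has some blank spaces that, on closer inspection, turn out to be characters other than a normal TAB or SPACE.

When I capture the file contents in a StreamReader, these special characters are converted to question marks. I assume this is because the default encoding is UTF-8 and the file is Unicode.

I went ahead and created the StreamReader with the Unicode encoding, then replaced all the unwanted characters with the correct ones (using code I actually found on Stack Overflow). This seems to work, except that I can't convert the string back to UTF-8 to display it in an asp:Literal. The code is there and it should work, but the output of ConvertToASCII is unreadable. Please take a look at the code below: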

    protected void btnUpload_Click(object sender, EventArgs e)
    {
        StreamReader sreader;
        if (uplSOWDoc.HasFile)
        {
            try
            {
                if (uplSOWDoc.PostedFile.ContentType == "text/html" || uplSOWDoc.PostedFile.ContentType == "text/plain")
                {
                    sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode);
                    string sowText = sreader.ReadToEnd();
                    sowLiteral.Text = ConvertToASCII(sowText);
                    lblUploadResults.Text = "File loaded successfully.";
                }
                else
                    lblUploadResults.Text = "Upload failed. Just text or html files are allowed.";
            }
            catch (Exception ex)
            {
                lblUploadResults.Text = ex.Message;
            }
        }
    }

    private string ConvertToASCII(string source)
    {
        if (source.IndexOf('\u2013') > -1) source = source.Replace('\u2013', '-');
        if (source.IndexOf('\u2014') > -1) source = source.Replace('\u2014', '-');
        if (source.IndexOf('\u2015') > -1) source = source.Replace('\u2015', '-');
        if (source.IndexOf('\u2017') > -1) source = source.Replace('\u2017', '_');
        if (source.IndexOf('\u2018') > -1) source = source.Replace('\u2018', '\'');
        if (source.IndexOf('\u2019') > -1) source = source.Replace('\u2019', '\'');
        if (source.IndexOf('\u201a') > -1) source = source.Replace('\u201a', ',');
        if (source.IndexOf('\u201b') > -1) source = source.Replace('\u201b', '\'');
        if (source.IndexOf('\u201c') > -1) source = source.Replace('\u201c', '\"');
        if (source.IndexOf('\u201d') > -1) source = source.Replace('\u201d', '\"');
        if (source.IndexOf('\u201e') > -1) source = source.Replace('\u201e', '\"');
        if (source.IndexOf('\u2026') > -1) source = source.Replace("\u2026", "...");
        if (source.IndexOf('\u2032') > -1) source = source.Replace('\u2032', '\'');
        if (source.IndexOf('\u2033') > -1) source = source.Replace('\u2033', '\"');

        byte[] sourceBytes = Encoding.Unicode.GetBytes(source);
        byte[] targetBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, sourceBytes);
        char[] asciiChars = new char[Encoding.ASCII.GetCharCount(targetBytes, 0, targetBytes.Length)];
        Encoding.ASCII.GetChars(targetBytes, 0, targetBytes.Length, asciiChars, 0);

        string result = new string(asciiChars);

        return result;
    }

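Also, as I said before, there are some more "transparent" characters that seem to correspond to the places where the Word document has numbered indentation, and I have no idea how to capture their Unicode values in order to replace them. If you have any tips, please let me know.

Thanks a lot!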


2011-03-15 allendehl

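According to the StreamReader documentation on MSDN:

The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used.

So, if the character set of your uploaded file is windows-1252, then your line: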

sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode);

is incorrect, because the file's contents are not encoded in UTF-16 (which is what Encoding.Unicode means). Instead, use:

sreader = new StreamReader(uplSOWDoc.FileContent, 
        Encoding.GetEncoding("Windows-1252"), true);

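where the final boolean parameter tells the reader to detect a BOM (byte order mark) and, if one is present, to use the encoding it indicates instead of Windows-1252.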
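To make the effect concrete, here is a minimal sketch that decodes a few Windows-1252 bytes through the same StreamReader constructor. The class name `Cp1252ReadSketch`, the helper `ReadAsCp1252`, and the `MemoryStream` standing in for `uplSOWDoc.FileContent` are illustrative assumptions, not part of the original code. The `Encoding.RegisterProvider` call is assumed to be needed only on .NET Core / .NET 5+, where code page 1252 is not available by default; on the classic .NET Framework that ASP.NET Web Forms usually runs on, that line should be removed.

```csharp
using System;
using System.IO;
using System.Text;

public static class Cp1252ReadSketch
{
    // Hypothetical helper: decodes the stream as Windows-1252 unless it
    // starts with a byte order mark, in which case the BOM's encoding wins.
    public static string ReadAsCp1252(Stream input)
    {
        // Assumption: required on .NET Core / .NET 5+ only; remove on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Note: disposing the reader also disposes the underlying stream.
        using (var reader = new StreamReader(input, Encoding.GetEncoding("Windows-1252"), true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // In Windows-1252, 0x93 and 0x94 are the left and right curly double quotes.
        byte[] fileBytes = { 0x93, 0x48, 0x69, 0x94 };
        string text = ReadAsCp1252(new MemoryStream(fileBytes));

        // The decoded string holds U+201C, 'H', 'i', U+201D.
        Console.WriteLine(text == "\u201CHi\u201D");
    }
}
```

In the upload handler, the same idea would apply with `uplSOWDoc.FileContent` passed in place of the `MemoryStream`.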


2011-03-15 22:21:12

Thanks, man!!! ...that did it! – allendehl 2011-03-16 17:21:24

You're welcome. – 2011-03-16 17:33:47

sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode);

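Congratulations, you are the one millionth coder to be bitten by "Encoding.Unicode".

There is no such thing as "the Unicode encoding". Unicode is a character set, and it has many different encodings.

Encoding.Unicode is actually the specific encoding UTF-16LE, in which characters are encoded as UTF-16 "code units" and each 16-bit code unit is then written out as bytes in little-endian order. This is the native in-memory Unicode string format of Windows NT, but you almost never want to use it for reading or writing files. Because it encodes in 2-byte units it is not ASCII-compatible, and it is not very efficient for storage or on the wire.

These days UTF-8 is a much more common encoding for Unicode text. But Microsoft's misnaming of UTF-16LE as "Unicode" continues to confuse and mislead users who just want to "support Unicode". Because Encoding.Unicode is not ASCII-compatible, using it to read a file stored in an ASCII-superset encoding (such as UTF-8, or a Windows default code page like 1252 Western European) mangles everything, not just the non-ASCII characters.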
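A small sketch may help illustrate why Encoding.Unicode mangles even plain ASCII text. The class name `Utf16LeSketch` and its helper methods are made up for this illustration; only the `System.Text.Encoding` members come from .NET itself.

```csharp
using System;
using System.Text;

public static class Utf16LeSketch
{
    // Hypothetical helper: renders bytes as hex pairs separated by dashes.
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    // Hypothetical helper: decodes single-byte text as if it were UTF-16LE,
    // which is the mistake made in the question.
    public static string MisreadAsUtf16Le(byte[] singleByteText)
    {
        return Encoding.Unicode.GetString(singleByteText);
    }

    public static void Main()
    {
        // UTF-8 keeps ASCII as one byte per character...
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes("<p>")));    // 3C-70-3E
        // ...while UTF-16LE writes two bytes per character.
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes("<p>"))); // 3C-00-70-00-3E-00

        // Six ASCII bytes read as UTF-16LE are paired up into three
        // unrelated 16-bit code units, so even the markup is lost.
        byte[] asciiBytes = Encoding.ASCII.GetBytes("<html>");
        string misread = MisreadAsUtf16Le(asciiBytes);
        Console.WriteLine(misread.Length); // 3
    }
}
```

The six bytes of `<html>` collapse into three code units (0x683C, 0x6D74, 0x3E6C), none of which is an ASCII character, so the tags themselves become unreadable along with the smart quotes.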

In this case, your file is stored in Windows code page 1252, so read it with:

sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.GetEncoding(1252));

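And I would leave it at that. Don't try to "convert to ASCII". Those smart quotes are perfectly good characters and should be supported like any other Unicode character; if you are having trouble displaying smart quotes, you are probably mangling all other non-ASCII characters as well. It is better to fix the problem that causes this than to work around it for just a few common cases.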
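As a rough illustration of the point, the sketch below round-trips a string containing smart quotes through ASCII and through UTF-8. The class name `RoundTripSketch` and the `RoundTrip` helper are hypothetical names chosen for this example.

```csharp
using System;
using System.Text;

public static class RoundTripSketch
{
    // Hypothetical helper: encodes text to bytes and decodes it again
    // using the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        string quoted = "\u201CHi\u201D"; // curly quotes around "Hi"

        // ASCII has no curly quotes, so the encoder substitutes '?' for each one.
        Console.WriteLine(RoundTrip(quoted, Encoding.ASCII) == "?Hi?");

        // UTF-8 can represent every Unicode character, so nothing is lost.
        Console.WriteLine(RoundTrip(quoted, Encoding.UTF8) == quoted);
    }
}
```

The ASCII path loses any character that was not replaced by hand beforehand, which is why a lookup table like the one in ConvertToASCII can only ever cover a few common cases.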


2011-03-17 01:23:35 bobince
