Cleaning a string of non-alphabetic characters

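I want to clean a string in C++. I want to remove all non-alphabetic characters while keeping every kind of letter, English and non-English. One of my test programs looks like this: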

#include <cctype>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    string test = "Danish letters: Æ Ø Å !!!!!!??||~";
    cout << "Test = " << test << endl;

    for(int l = 0;l<test.size();l++)
    {
        if(!isalpha(test.at(l)) && test.at(l) != ' ')
        {
            test.replace(l,1," nope");
        }
    }

    cout << "Test = " << test << endl;

    return 0;
}

This gives me the output:

Test = Danish letters: Æ Ø Å !!!!!!??||~ 
Test = Danish letters nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope

So my question is: how do I remove the "!!!!!!??||~" but not the "Æ Ø Å"?

I also tried a test such as

test.at(l)!='Å'

but it will not compile if I declare 'Å' as a char.

I have read about Unicode and UTF-8, but I do not really understand them.

Please help me :)

Source

2016-10-01 user2994461

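Well, you need to keep reading about Unicode and UTF-8 until you understand them, and then everything should become clear.

You may want to look at the SO question titled [How to strip all non-alphanumeric characters from a string](http://stackoverflow.com/questions/6319872/how-to-strip-all-non-alphanumeric-characters-from-a-string-in-c). I would also be interested to see whether [std::isalnum](http://en.cppreference.com/w/cpp/string/byte/isalnum) works in your case. – 2016-10-01 20:49:29

@RawN: Both of those links only apply to ASCII, and this question is (implicitly) about non-ASCII.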

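A char holds a single byte, which is enough for the ASCII character set, but you are trying to operate on a string that contains non-ASCII characters.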

You are operating on Unicode characters, so you need to use the wide-string operations:

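#include <clocale>
#include <cwctype>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    // iswalpha follows the current C locale. The default "C" locale typically
    // classifies only ASCII letters, so switch to the environment's locale.
    setlocale(LC_ALL, "");

    wstring test = L"Danish letters: Æ Ø Å !!!!!!??||~";
    wcout << L"Test = " << test << endl;

    for(size_t i = 0; i < test.size(); i++) {
        // Replace every character that is neither a letter nor a space.
        if(!iswalpha(test.at(i)) && test.at(i) != L' ') {
            test.replace(i, 1, L" nope");
        }
    }

    wcout << L"Test = " << test << endl;

    return 0;
}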

You could also use Qt and QString, in which case the same piece of code becomes:

#include <QDebug>
#include <QString>

QString test = "Danish letters: Æ Ø Å !!!!!!??||~";
qDebug() << "Test =" << test;

for(int i = 0; i < test.size(); i++) {
    // isLetterOrNumber() also keeps digits; use isLetter() to keep letters only.
    if(!test.at(i).isLetterOrNumber() && test.at(i) != ' ') {
        test.replace(i, 1, " nope");
    }
}

qDebug() << "Test =" << test;

Source

2016-10-01 22:13:37

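Yes, this code keeps only English and non-English letters, because we are using iswalpha.

Wow, my emoji was a terrible idea. Starting over: the C++ wide-character functions and classes only work on the Basic Multilingual Plane and fail when given characters from the supplementary planes, which currently contain 73000 characters, some of which must be alphabetic. iswalpha is _broken_. https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Multilingual_Plane

@MooingDuck The wide-character API works with an *implementation-defined* fixed-width encoding that may have nothing to do with Unicode. It can be based on UTF-16, as on Windows, with the effect that characters outside the BMP are not handled correctly, or it can use something like UTF-32, as on Linux, which makes full Unicode support possible. Or it can use a completely different character set. – nwellnhof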

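Here is a code sample you can play with, experimenting with different locales until you get what you want. You can also try u16string, u32string and so on. Working with locales is a bit confusing at first. Most people program in ASCII.

Call the function I wrote from main: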

#include <iostream>
#include <locale>
#include <string>

using namespace std;

// Returns a copy of input holding only the characters that the
// en_US.UTF-8 locale classifies as letters or whitespace.
wstring removeNonAlpha(const wstring &input) {
    // Throws std::runtime_error if this locale is not installed.
    locale loc2("en_US.UTF-8");
    wstring res;
    for(wstring::size_type l = 0; l < input.size(); l++) {
        if(isalpha(input[l], loc2) || isspace(input[l], loc2)) {
            res += input[l];
        }
    }
    return res;
}

int main() {
    // Set the global locale (this also sets the C locale) and imbue wcout,
    // so that wide characters are converted to UTF-8 on output.
    locale utf8locale("en_US.UTF-8");
    locale::global(utf8locale);
    wcout.imbue(utf8locale);

    wstring test = L"Danish letters: Æ Ø Å !!!!!!??||~ Πυθαγόρας ὁ Σάμιος";
    wcout << test << endl;
    wcout << removeNonAlpha(test) << endl;
    wcout << isalpha(L'Я', utf8locale) << endl;
    return 0;
}

Source

2016-10-01 23:31:05

'wchar_t' is not guaranteed to be wide enough to represent a Unicode code point. On Windows it is 16 bits and represents a UTF-16 code unit, not a code point. – roeland
