Tokenize NSString in Objective-C in two passes

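I don't have much experience with Objective-C, so apologies if this is really obvious.

What I need is to split an NSString into tokens. Tokens are separated by whitespace or by another symbol (anything that is not a letter). The catch is that I want to keep the separators, unless they are whitespace.

Example phrase: "a b c,d's, e f." From this I want to get: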

"a" 
"b" 
"c" 
"," 
"d" 
"'" 
"s" 
"," 
"e" 
"f" 
"."

With this code:

NSMutableCharacterSet *separators = [NSMutableCharacterSet punctuationCharacterSet]; 
[separators formUnionWithCharacterSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]; 

NSArray *parse_array = [intext componentsSeparatedByCharactersInSet:separators];

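I only get the letters. If I split on whitespace and newlines only, I get letters and symbols stuck together. What I need is to run two parses in sequence (first on whitespace and newlines, then on punctuation), but I don't really know how to do that kind of parsing in Objective-C. Can anyone give me a hint?

Thanks!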

Source

2011-06-27 Miguel E

Well, you could do something like this to remove all whitespace from a string:

NSArray * t = [string componentsSeparatedByCharactersInSet:[NSCharacterSet whitespaceCharacterSet]]; 
string = [t componentsJoinedByString:@""];

Then you can simply iterate over the characters and turn each one into an NSString:

NSMutableArray *tokens = [NSMutableArray array]; 
for (NSUInteger i = 0; i < [string length]; ++i) { 
    unichar character = [string characterAtIndex:i]; 
    NSString *token = [NSString stringWithFormat:@"%C", character]; 
    [tokens addObject:token]; 
} 
NSLog(@"%@", tokens);

Or, if you don't want to strip the whitespace beforehand, you can skip it inside the loop:

NSMutableArray *tokens = [NSMutableArray array]; 
for (NSUInteger i = 0; i < [string length]; ++i) { 
    unichar character = [string characterAtIndex:i]; 
    if ([[NSCharacterSet whitespaceCharacterSet] characterIsMember:character]) { 
        continue; 
    } 
    NSString *token = [NSString stringWithFormat:@"%C", character]; 
    [tokens addObject:token]; 
} 
NSLog(@"%@", tokens);
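
One caveat with the loops above: characterAtIndex: returns a single UTF-16 code unit, so a character outside the Basic Multilingual Plane (an emoji, for instance) would be split into two unusable halves. Below is a minimal sketch of a variant that walks composed character sequences instead. The helper name CharacterTokens is made up for illustration and is not part of the answer:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the thread): one token per user-visible
// character, with whitespace dropped.
static NSArray<NSString *> *CharacterTokens(NSString *string) {
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceCharacterSet];
    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    [string enumerateSubstringsInRange:NSMakeRange(0, string.length)
                               options:NSStringEnumerationByComposedCharacterSequences
                            usingBlock:^(NSString *substring, NSRange substringRange,
                                         NSRange enclosingRange, BOOL *stop) {
        // Skip substrings that consist only of whitespace.
        if ([substring stringByTrimmingCharactersInSet:whitespace].length == 0) {
            return;
        }
        [tokens addObject:substring];
    }];
    return tokens;
}
```

Like the original loops, this still produces one token per character, so it only fits input where every token is a single character.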

Source

2011-06-27 18:32:15

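Sorry for misleading you: my example sentence only had single letters, but the goal is to use this to parse words. I'll add some buffering and adapt the solution. Thanks!

I got it working with the code below. It works for single letters as well as whole words: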

//parse the phrase into tokens. Punctuation will be tokenized too. 
NSMutableArray *tokens = [NSMutableArray array]; 
NSInteger last_word_start = -1; 
// 
for (NSUInteger i = 0; i < [myPhrase length]; ++i) 
{ 
    unichar character = [myPhrase characterAtIndex:i]; 
    if ([[NSCharacterSet whitespaceCharacterSet] characterIsMember:character]) 
    { 
     if (last_word_start >= 0) 
      [tokens addObject:[myPhrase substringWithRange:NSMakeRange(last_word_start, i-last_word_start)]]; 
     last_word_start = -1; 
    } 
    else 
    { 
     if ([[NSCharacterSet punctuationCharacterSet] characterIsMember:character]) 
     { 
      if (last_word_start >= 0) 
       [tokens addObject:[myPhrase substringWithRange:NSMakeRange(last_word_start, i-last_word_start)]]; 
      [tokens addObject:[NSString stringWithFormat:@"%C", character]]; 
      last_word_start = -1; 
     } 
     else 
     { 
      if (last_word_start == -1) 
       last_word_start = i; 
     } 
    } 
} 
//save pending letters 
if (last_word_start >= 0) 
    [tokens addObject:[myPhrase substringWithRange:NSMakeRange(last_word_start, [myPhrase length]-last_word_start)]]; 
NSLog(@"Tokens for phrase '%@':",myPhrase); 
NSLog(@"%@", tokens);
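
For comparison, roughly the same tokenization can be sketched with NSRegularExpression. This is an alternative to the loop above, not the code used in this thread. The helper name TokenizeWithRegex is hypothetical, and the pattern assumes that the ICU class \p{P} is a close enough stand-in for punctuationCharacterSet and that \s (which also matches newlines) is acceptable as the whitespace test:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the thread): words and single punctuation
// marks become separate tokens; whitespace is dropped.
static NSArray<NSString *> *TokenizeWithRegex(NSString *phrase) {
    // Either a run of characters that are neither whitespace nor punctuation,
    // or exactly one punctuation character.
    NSString *pattern = @"[^\\s\\p{P}]+|\\p{P}";
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        return @[]; // The pattern failed to compile.
    }
    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    [regex enumerateMatchesInString:phrase
                            options:0
                              range:NSMakeRange(0, phrase.length)
                         usingBlock:^(NSTextCheckingResult *match,
                                      NSMatchingFlags flags, BOOL *stop) {
        [tokens addObject:[phrase substringWithRange:match.range]];
    }];
    return tokens;
}
```

Called with @"a b c,d's, e f." this should yield the eleven tokens listed in the question. The explicit loop is easier to step through in a debugger; the regex version is shorter but hides the rules inside the pattern.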

Thanks!

Source

2011-06-28 09:57:18

Take a look at my open-source Cocoa string tokenization/parsing toolkit, ParseKit:

http://parsekit.com

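ParseKit includes a very powerful and flexible tokenizer class, PKTokenizer. By default, PKTokenizer silently consumes whitespace tokens without reporting them. (That is what you want in this case, but the behavior is configurable if it isn't.)

Here is how you could use PKTokenizer for this particular task: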

// create the tokenizer with your string 
NSString *inStr = @"a b c,d's, e f."; 
PKTokenizer *t = [PKTokenizer tokenizerWithString:inStr]; 

// configure the tokenizer to not allow apostrophes inside words (that's the default) 
[t.wordState setWordChars:NO from:'\'' to:'\'']; 

// loop thru the input and concat the non-whitespace chars 
PKToken *eof = [PKToken EOFToken]; 
PKToken *tok = nil; 

NSMutableArray *outStrs = [NSMutableArray array]; 
while ((tok = [t nextToken]) != eof) { 
    [outStrs addObject:tok.stringValue]; 
}

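outStrs contains:

"a" "b" "c" "," "d" "'" "s" "," "e" "f" "."

For this particular task, ParseKit may be overkill. But if you have several similar tasks, it may be worth checking out, since it could save you time and pain.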

Source

2011-10-13 16:36:56
