2017-10-13 81 views

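How do I keep non-alphanumeric symbols when tokenizing words in R?

I am using the tokenizers package in R to tokenize text, but non-alphanumeric symbols such as "@" or "&" are lost, and I need to keep them. Here is the call I am using: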

tokenize_ngrams("My number & email address [email protected]", lowercase = FALSE, n = 3, n_min = 1, stopwords = character(), ngram_delim = " ", simplify = FALSE)

I know that tokenize_character_shingles has a strip_non_alphanum argument that keeps punctuation, but it tokenizes by characters rather than by words.

Does anyone know how to handle this?

Answer


If you are open to using a different package, ngram has two useful functions that retain those non-alphanumeric characters:

> library(ngram) 
> print(ngram("My number & email address [email protected]",n = 2), output = 'full') 
number & | 1 
email {1} | 

My number | 1 
& {1} | 

address [email protected] | 1 
NULL {1} | 

& email | 1 
address {1} | 

email address | 1 
[email protected] {1} | 

> print(ngram_asweka("My number & email address [email protected]",1,3), output = 'full') 
[1] "My number &"     "number & email"     
[3] "& email address"    "email address [email protected]" 
[5] "My number"      "number &"      
[7] "& email"      "email address"     
[9] "address [email protected]"  "My"        
[11] "number"       "&"        
[13] "email"       "address"      
[15] "[email protected]"    
> 
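For readers who would rather avoid an extra dependency, the same idea (split on whitespace only, so symbols are never stripped) can be sketched in base R. This is a minimal illustration and not how ngram is implemented; the function name word_ngrams and its arguments are hypothetical, and user@example.com is a placeholder because the address in the question is redacted.

```r
# Hypothetical helper: build word n-grams by splitting on whitespace only,
# so symbols such as "&" and "@" remain part of the tokens.
word_ngrams <- function(text, n_min = 1, n_max = 3, delim = " ") {
  # Tokenize on runs of whitespace; no punctuation is removed.
  words <- strsplit(trimws(text), "\\s+")[[1]]
  out <- character()
  for (n in n_min:n_max) {
    # Stop once n exceeds the number of available words.
    if (n > length(words)) break
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = delim),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Placeholder input; the real address in the question is not known.
res <- word_ngrams("My number & email address user@example.com")
print(res)
```

Because the only separator is whitespace, a standalone "&" becomes its own token and an address containing "@" stays in one piece.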

Another nice package, quanteda, offers more flexibility through its remove_punct parameter.

> library(quanteda) 
> text <- "My number & email address [email protected]" 
> tokenize(text, ngrams = 1:3) 
tokenizedTexts from 1 document. 
Component 1 : 
[1] "My"        "number"       
[3] "&"        "email"       
[5] "address"      "[email protected]"    
[7] "My_number"      "number_&"      
[9] "&_email"      "email_address"     
[11] "[email protected]"  "My_number_&"     
[13] "number_&_email"     "&_email_address"    
[15] "[email protected]" 

> 
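The call above never spells out remove_punct. In more recent quanteda releases the tokenizer is, as far as I know, tokens(), with n-grams built separately by tokens_ngrams(). The sketch below assumes such a version is installed and uses the placeholder address user@example.com, since the original one is redacted. Exact tokenization can differ between versions, so no output is shown.

```r
library(quanteda)

# Placeholder input; the address in the question is redacted.
txt <- "My number & email address user@example.com"

# Keep punctuation and symbols explicitly (both are FALSE by default).
toks <- tokens(txt, remove_punct = FALSE, remove_symbols = FALSE)

# Build 1- to 3-grams; "_" is the default concatenator.
grams <- tokens_ngrams(toks, n = 1:3, concatenator = "_")
print(grams)
```

Setting remove_punct = TRUE instead is how you would drop tokens such as "&", which is the flexibility mentioned above.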