2017-10-11 58 views
1

Getting tokens from the same line in R

Using R, I need to extract tokens with ngram = 2 (bigrams) from a file.

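The problem is that it combines the lines, so some tokens have one part at the end of a line and the other part at the start of the next line.

Req_tok <- jobs %>% unnest_tokens(ngram, POSITION, token = "ngrams", n = 2)

where jobs is the file.

I have these first two lines: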

it architect 

it helpdesk support agents 

I get tokens like this:

it architect 
architect it 
it helpdesk 
and so on .... 

What can I do to avoid getting tokens like "architect it"?

I want to tokenize each line separately.

Answer

0

Just add collapse = FALSE to unnest_tokens:

library(tidytext) 
library(dplyr) 

jobs %>% 
    unnest_tokens(ngram, POSITION, token = "ngrams", n = 2, collapse = FALSE) 

Result:

   ngram 
1  it architect 
2  it helpdesk 
2.1 helpdesk support 
2.2 support agents 
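
To illustrate what tokenizing each line separately means, here is a minimal base R sketch that builds bigrams row by row without tidytext. The helper name bigrams_per_line is hypothetical and not part of any package; it only mimics the idea and is not how unnest_tokens is implemented.

```r
# Hypothetical helper: build bigrams within each line only,
# so no bigram ever spans two lines.
bigrams_per_line <- function(lines) {
  # Lowercase, trim, and split each line on whitespace
  words <- strsplit(trimws(tolower(lines)), "\\s+")
  lapply(words, function(w) {
    # A line with fewer than two words has no bigrams
    if (length(w) < 2) return(character(0))
    # Pair each word with the word that follows it
    paste(head(w, -1), tail(w, -1))
  })
}

positions <- c("it architect", "it helpdesk support agents")
result <- bigrams_per_line(positions)
print(result)
```

Because each line is split on its own, the pair "architect it" can never be produced.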

Remember to convert your string column to character if it is a factor variable; otherwise unnest_tokens will throw an error.
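
A minimal sketch of that conversion, assuming the column arrived as a factor (for example because the data frame was built with stringsAsFactors = TRUE). The object name jobs_factor is hypothetical and used only for illustration.

```r
# Hypothetical data frame whose POSITION column is a factor
jobs_factor <- data.frame(
  POSITION = c("it architect", "it helpdesk support agents"),
  stringsAsFactors = TRUE
)

# Convert to character only if the column really is a factor
if (is.factor(jobs_factor$POSITION)) {
  jobs_factor$POSITION <- as.character(jobs_factor$POSITION)
}

# Check the column type before passing it to unnest_tokens
print(is.character(jobs_factor$POSITION))
```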

Data:

jobs = data.frame(POSITION = c("it architect", "it helpdesk support agents"), stringsAsFactors = FALSE)