2017-10-13

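Opposite of unnest_tokens

This is quite likely a stupid question, but I have googled and googled and can't find a solution. I think that's because I don't know the right way to word my question for a search.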

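I have a data frame in R that I converted to tidy text format in order to get rid of stop words. I would now like to 'untidy' that data frame back to its original format.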

What is the opposite/inverse command of unnest_tokens?

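Edit: here is what the data I'm working with looks like. I'm trying to replicate the analyses from Silge and Robinson's Tidy Text book, but using Italian opera librettos.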

character = c("FIGARO", "SUSANNA", "CONTE", "CHERUBINO") 
line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre", "Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.", "Susanna, mi sembri agitata e confusa.", "Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!") 
sample_df = data.frame(character, line, stringsAsFactors = FALSE) 
sample_df 

character line 
FIGARO Cinque... dieci.... venti... trenta... trentasei...quarantatre 
SUSANNA Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello. 
CONTE  Susanna, mi sembri agitata e confusa. 
CHERUBINO Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia! 

I turned it into tidy text so I could get rid of the stop words:

library(dplyr) 
library(tidytext) 

tribble <- sample_df %>% 
      unnest_tokens(word, line) 
# Get rid of stop words 
# I had to make my own list of stop words for 18th century Italian opera 
# (mystopwords is that character vector; it is not shown here) 
itstopwords <- tibble(word = mystopwords) 
tribble2 <- tribble %>% 
      anti_join(itstopwords, by = "word") 

Now I have something like this:

text word 
FIGARO cinque 
FIGARO dieci 
FIGARO venti 
FIGARO trenta 
... 

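I'd like to get it back into the format of character name plus the associated line so I can look at other things. Basically, I want the text in the same format it was in before, but with the stop words removed.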


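Hi, please read [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and edit your question. Knowing more about what your data looks like and what you have done will make it possible for other users to help you. – shea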

Answer


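Not a stupid question! The answer depends on exactly what you are trying to do, but this is my typical approach when I want to get text back to something like its original form after some processing in tidy form: use the map functions from purrr.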

First, let's go from raw text to the tidied format.

library(tidyverse) 
library(tidytext) 


tidy_austen <- janeaustenr::austen_books() %>% 
    group_by(book) %>% 
    mutate(linenumber = row_number()) %>% 
    ungroup() %>% 
    unnest_tokens(word, text) 

tidy_austen 
#> # A tibble: 725,055 x 3 
#>     book linenumber  word 
#>     <fctr>  <int>  <chr> 
#> 1 Sense & Sensibility   1  sense 
#> 2 Sense & Sensibility   1   and 
#> 3 Sense & Sensibility   1 sensibility 
#> 4 Sense & Sensibility   3   by 
#> 5 Sense & Sensibility   3  jane 
#> 6 Sense & Sensibility   3  austen 
#> 7 Sense & Sensibility   5  1811 
#> 8 Sense & Sensibility   10  chapter 
#> 9 Sense & Sensibility   10   1 
#> 10 Sense & Sensibility   13   the 
#> # ... with 725,045 more rows 

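The text is now tidy! But we can untidy it and get back to something resembling its original form. I typically approach this with nest from tidyr, followed by some of the map functions from purrr.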

nested_austen <- tidy_austen %>% 
    # tidyr >= 1.0 syntax; older versions used nest(word) 
    nest(data = c(word)) %>% 
    mutate(text = map(data, unlist), 
           text = map_chr(text, paste, collapse = " ")) 

nested_austen 
#> # A tibble: 62,272 x 4 
#>     book linenumber    data 
#>     <fctr>  <int>   <list> 
#> 1 Sense & Sensibility   1 <tibble [3 x 1]> 
#> 2 Sense & Sensibility   3 <tibble [3 x 1]> 
#> 3 Sense & Sensibility   5 <tibble [1 x 1]> 
#> 4 Sense & Sensibility   10 <tibble [2 x 1]> 
#> 5 Sense & Sensibility   13 <tibble [12 x 1]> 
#> 6 Sense & Sensibility   14 <tibble [13 x 1]> 
#> 7 Sense & Sensibility   15 <tibble [11 x 1]> 
#> 8 Sense & Sensibility   16 <tibble [12 x 1]> 
#> 9 Sense & Sensibility   17 <tibble [11 x 1]> 
#> 10 Sense & Sensibility   18 <tibble [15 x 1]> 
#> # ... with 62,262 more rows, and 1 more variables: text <chr> 
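
As an aside, the same collapse can be sketched without a list column by grouping on the identifying columns and pasting the words together with summarise. This is an alternative to the nest/map approach above, not a replacement for it. The toy tibble `tidy_toy` and its values are made up for illustration, and the `.groups` argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Toy tidy data standing in for tidy_austen (hypothetical values)
tidy_toy <- tibble(
  book       = c("A", "A", "A", "B", "B"),
  linenumber = c(1L, 1L, 2L, 1L, 1L),
  word       = c("sense", "and", "sensibility", "pride", "prejudice")
)

# One row per (book, linenumber); words are pasted in their existing row order
collapsed_toy <- tidy_toy %>%
  group_by(book, linenumber) %>%
  summarise(text = paste(word, collapse = " "), .groups = "drop")

collapsed_toy
```

Which form to prefer is mostly a matter of taste: nest keeps the individual words around in the `data` column for later use, while summarise discards them.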

What does the text in nested_austen look like at the end, in this particular case?

nested_austen %>% 
    select(text) 
#> # A tibble: 62,272 x 1 
#>                 text 
#>                 <chr> 
#> 1            sense and sensibility 
#> 2              by jane austen 
#> 3                1811 
#> 4               chapter 1 
#> 5 the family of dashwood had long been settled in sussex their estate 
#> 6 was large and their residence was at norland park in the centre of 
#> 7  their property where for many generations they had lived in so 
#> 8 respectable a manner as to engage the general good opinion of their 
#> 9 surrounding acquaintance the late owner of this estate was a single 
#> 10 man who lived to a very advanced age and who for many years of his 
#> # ... with 62,262 more rows
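
Applied to data shaped like the question's, the whole round trip might look like the sketch below. It makes several assumptions: the two-row `sample_df` is a shortened stand-in for the libretto data, the stop-word vector is a hypothetical placeholder (the asker's `mystopwords` is not shown), and `line_id` is a helper column introduced here so that each original line stays distinct and in order. It uses group_by and summarise rather than nest and map, purely for brevity.

```r
library(dplyr)
library(tidytext)

# Shortened stand-in for the question's data
sample_df <- tibble(
  character = c("FIGARO", "CONTE"),
  line = c("Cinque... dieci.... venti... trenta...",
           "Susanna, mi sembri agitata e confusa.")
)

# Hypothetical stop-word list, for illustration only
itstopwords <- tibble(word = c("mi", "e", "il", "la"))

cleaned_df <- sample_df %>%
  mutate(line_id = row_number()) %>%            # remember which line each word came from
  unnest_tokens(word, line) %>%                 # one lowercase word per row
  anti_join(itstopwords, by = "word") %>%       # drop the stop words
  group_by(line_id, character) %>%
  summarise(line = paste(word, collapse = " "), .groups = "drop")

cleaned_df
```

Note that this does not restore the text exactly: by default unnest_tokens lowercases the words and strips punctuation, so the rebuilt lines contain only the surviving words separated by single spaces. Passing `to_lower = FALSE` to unnest_tokens keeps the original case, but then the stop-word list has to match that case as well.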