2016-11-22 54 views

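Looping through text to create a data frame

Thanks in advance! I have been trying this for a few days and I am a bit stuck. I am trying to loop through a text file (imported as a list) and create a data frame from it. The data frame should start a new row whenever a list item contains a day of the week, and that item should go in the first column (V1). I want to put the rest of the comments in the second column (V2), which probably means concatenating strings together. I tried a conditional with grepl(), but after setting up the initial data frame I got lost in the logic.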

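Here is sample text that I am bringing into R (it is Facebook data from a text file). The [] indicates the list index. It is a long file (50K+ lines), but I have the date column set up.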

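[1] Thursday, August 25, 2016 at 3:57pm EDT

[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking???

[3]Sunday, August 14, 2016 at 9:17am EDT

[4]Michael shared Jason post.

[5]This bird is a lot smarter than the majority of political posts I have read recently here

[6]Sunday, August 14, 2016 at 8:44am EDT

[7]Michael and Kurt are now friends.

The end result would be a data frame in which a day of the week starts a new row and the rest of the list items are concatenated into the second column. So the final data frame would be:

Row 1 ([1] in V1 and [2] in V2)

Row 2 ([3] in V1 and [4], [5] in V2)

Row 3 ([6] in V1 and [7] in V2)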

Here is the start of my code. I can get V1 to populate correctly, but not the second column of the data frame.

### Read in the text file 
temp <- readLines("C:/Program Files/R/Text Mining/testa.txt") 

### Remove empty lines from the text file 
temp <- temp[temp!=""] 

### Create the temp char file as a list file 
tmp <- as.list(temp) 

### A days vector for searching through the list of days. 
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday","Friday", "Saturday") 
df <- {} 

### Loop through the list 
for (n in 1:length(tmp)){ 

    ### Search to see if there is a day in the list item 
    for(i in 1:length(days)){ 
      if(grepl(days[i], tmp[n])==1){ 
    ### Bind the row to the df if there is a day in the list item 
        df<- rbind(df, tmp[n]) 
      } 
    } 
### I know this is wrong, I am trying to create a vector to concatenate and add to the data frame, but I am struggling here.  
d <- c(d, tmp[n]) 
} 

Please share your data using `dput`.

Answers


Here is one option using the tidyverse:

library(tidyverse) 

text <- "[1] Thursday, August 25, 2016 at 3:57pm EDT 

[2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking??? 

[3]Sunday, August 14, 2016 at 9:17am EDT 

[4]Michael shared Jason post. 

[5]This bird is a lot smarter than the majority of political posts I have read recently here 

[6]Sunday, August 14, 2016 at 8:44am EDT 

[7]Michael and Kurt are now friends." 

df <- tibble(lines = read_lines(text)) %>% # read data, set up tibble 
    filter(lines != '') %>% # filter out empty lines 
    # set grouping by cumulative number of rows with weekdays in them; 
    # weekdays() needs dates, so pass seven consecutive days (names follow the session locale) 
    group_by(grp = cumsum(grepl(paste(weekdays(Sys.Date() + 1:7), collapse = '|'), lines))) %>% 
    # collapse each group to two columns 
    summarise(V1 = lines[1], V2 = list(lines[-1])) 

df 
## # A tibble: 3 × 3 
##  grp           V1  V2 
## <int>          <chr> <list> 
## 1  1 [1] Thursday, August 25, 2016 at 3:57pm EDT <chr [1]> 
## 2  2 [3]Sunday, August 14, 2016 at 9:17am EDT <chr [2]> 
## 3  3 [6]Sunday, August 14, 2016 at 8:44am EDT <chr [1]> 

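This approach uses a list column for V2, which is probably the best option in terms of preserving your data, but you can use `paste` or `toString` if you need a single string.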
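If a plain character column is easier to work with than a list column, each group can be collapsed into one string. The following is a minimal base R sketch, assuming a small hand-built list `v2` that stands in for the `V2` list column above; the names `v2`, `flat` and `flat_space` are illustrative only and do not come from the original code.

```r
# Hypothetical stand-in for the V2 list column: one character vector per row
v2 <- list(
  "[2] Football time!!",
  c("[4]Michael shared Jason post.", "[5]This bird is a lot smarter"),
  "[7]Michael and Kurt are now friends."
)

# toString() joins the elements of each vector with ", "
flat <- vapply(v2, toString, character(1))

# paste() with collapse does the same with a separator of your choice
flat_space <- vapply(v2, paste, character(1), collapse = " ")

flat
```

With the tibble version, the same idea could probably be applied in place with something like `mutate(V2 = map_chr(V2, toString))`, at the cost of losing the boundaries between the original lines.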


A roughly equivalent version in base R:

df <- data.frame(V2 = readLines(textConnection(text)), stringsAsFactors = FALSE) 

df <- df[df$V2 != '', , drop = FALSE] 

df$grp <- cumsum(grepl(paste(weekdays(Sys.Date() + 1:7), collapse = '|'), df$V2)) 

df$V1 <- ave(df$V2, df$grp, FUN = function(x){x[1]}) 

df <- aggregate(V2 ~ grp + V1, df, FUN = function(x){x[-1]}) 

df 
## grp           V1 
## 1 1 [1] Thursday, August 25, 2016 at 3:57pm EDT 
## 2 2 [3]Sunday, August 14, 2016 at 9:17am EDT 
## 3 3 [6]Sunday, August 14, 2016 at 8:44am EDT 
##                                         V2 
## 1 [2] Football time!! We need to make plans!!!! I texted my guy, though haven't been in touch sense last year. So we'll see on my end!!! What do you have cooking??? 
## 2          [4]Michael shared Jason post., [5]This bird is a lot smarter than the majority of political posts I have read recently here 
## 3                                [7]Michael and Kurt are now friends. 

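Thank you so much for the response! I am getting an error from the weekdays function: Error in grepl(paste(weekdays(1:7, abbreviate = FALSE), collapse = "|")) : argument "x" is missing, with no default. If I try weekdays on its own I get a similar error; I am reading up on it now.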


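The `paste` call creates a string for `grepl`: `"Friday|Saturday|Sunday|Monday|Tuesday|Wednesday|Thursday"`. You can create that string any way you like. `weekdays(Sys.Date() + 1:7)` should work; honestly, it is a little ambiguous which method gets called when you pass it a number. Also make sure the `x` argument of `grepl` (the column of lines) is there too; leaving it out would throw that same error. – alistaire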
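As a sketch of that suggestion, the pattern can be written out by hand so it does not depend on `weekdays()` or the session locale. The sample lines are taken from the question; the variable names `pattern`, `is_day` and `grp` are illustrative assumptions.

```r
# Spell out the day names instead of calling weekdays()
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
pattern <- paste(days, collapse = "|")

# A few lines from the sample data
lines <- c("[1] Thursday, August 25, 2016 at 3:57pm EDT",
           "[2] Football time!!",
           "[3]Sunday, August 14, 2016 at 9:17am EDT",
           "[4]Michael shared Jason post.")

# grepl() needs both the pattern and the vector to search (its x argument)
is_day <- grepl(pattern, lines)

# cumsum() turns the TRUE/FALSE flags into a running group id,
# so every line gets the id of the most recent date line
grp <- cumsum(is_day)
```

Each date line bumps the counter, which is what lets the following comment lines be grouped with it.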


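Yes! Thanks a ton! I am just learning R, but this code is great! I like how you used summarise and collected the second column into a list. I am reading up on tibbles, cumsum, and summarise to fully understand the code. Thanks again, this is awesome!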