2016-05-17

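Extracting text from a variable in a data frame and creating new vectors

I have a dataset from a survey questionnaire. It contains some long, complex question texts, which I also need to use as variables later in my analysis.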

The kind of data frame I am analyzing looks like the following example:

cnt <- as.factor(c("Country 1", "Country 2", "Country 3", "Country 1", "Country 2", "Country 3"))
bnk <- as.factor(c("bank 1", "bank 2", "bank 3", "bank 1", "bank 2", "bank 3"))
qst <- as.factor(c(" Q.1 - some long question?", " Q.1 - some long question?", " Q.1 - some long question?", "Q.27 <U+FFFD> another long question?", "Q.27 <U+FFFD> another long question?", "Q.27 <U+FFFD> another long question?"))
ans <- as.numeric(c(1, 1, 2, 1, 2, 3))
df <- data.frame(cnt, bnk, qst, ans)
names(df) <- c("Country", "Institute", "Question", "Answer")
head(df)

    Country Institute        Question Answer 
1 Country 1 bank 1   Q.1 - some long question?  1 
2 Country 2 bank 2   Q.1 - some long question?  1 
3 Country 3 bank 3   Q.1 - some long question?  2 
4 Country 1 bank 1 Q.27 <U+FFFD> another long question?  1 
5 Country 2 bank 2 Q.27 <U+FFFD> another long question?  2 
6 Country 3 bank 3 Q.27 <U+FFFD> another long question?  3 

As you can see in the variable "Question", whatever the question is, there is a pattern to follow: every text starts with Q.number.

Just for your information, there are 49 distinct questions.

There are a few things (or steps) I would like to do here:

  1. First, I want to create a new vector that I can use to index the question. So, for example, my data frame would look like this:

df <- mutate(df, qs = c("q1", "q1", "q1", "q27", "q27", "q27"))

Country Institute        Question Answer qs 
1 Country 1 bank 1   Q.1 - some long question?  1 q1 
2 Country 2 bank 2   Q.1 - some long question?  1 q1 
3 Country 3 bank 3   Q.1 - some long question?  2 q1 
4 Country 1 bank 1 Q.27 <U+FFFD> another long question?  1 q27 
5 Country 2 bank 2 Q.27 <U+FFFD> another long question?  2 q27 
6 Country 3 bank 3 Q.27 <U+FFFD> another long question?  3 q27 
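  2. Then, I want to create another new vector similar to the one in step 1, but whose index contains only the number. I want to treat this extra vector as a factor, using the part of each question text that comes after the "Q" prefix as its labels. For this, I think I need to search within the variable "Question" and extract the relevant parts.
  3. So, in the end, the data frame should look like this: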

    Country Institute        Question Answer qs qs_inx     labels 
    1 Country 1 bank 1   Q.1 - some long question?  1 q1  1 some long question? 
    2 Country 2 bank 2   Q.1 - some long question?  1 q1  1 some long question? 
    3 Country 3 bank 3   Q.1 - some long question?  2 q1  1 some long question? 
    4 Country 1 bank 1 Q.27 <U+FFFD> another long question?  1 q2  2 another long question? 
    5 Country 2 bank 2 Q.27 <U+FFFD> another long question?  2 q2  2 another long question? 
    6 Country 3 bank 3 Q.27 <U+FFFD> another long question?  3 q2  2 another long question? 
    

Use `sub`, e.g. `df %>% mutate(qs_idx = sub('(Q.\\d+).*', '\\1', Question), qs = as.integer(sub('Q.', '', qs_idx)))`. If you prefer, you can use `tidyr::extract_numeric` or `stringr::str_extract`. – alistaire
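A minimal base-R sketch of the `sub` idea from the comment above, without dplyr. The two-row data frame is a stand-in for the question's data, and the `^` anchors and the `trimws()` call are additions that are not in the comment (they cope with the leading space in some question texts), so treat this as one possible reading of the suggestion rather than the commenter's exact code.

```r
# Stand-in for the question's data; only the Question column matters here.
df <- data.frame(
  Question = c(" Q.1 - some long question?",
               "Q.27 <U+FFFD> another long question?"),
  stringsAsFactors = FALSE
)

# Keep only the "Q.<digits>" prefix; trimws() drops the stray leading space first.
df$qs_idx <- sub("^(Q\\.\\d+).*", "\\1", trimws(df$Question))

# Strip the literal "Q." and convert what is left to an integer.
df$qs <- as.integer(sub("^Q\\.", "", df$qs_idx))

df[, c("qs_idx", "qs")]
```

Escaping the dot as `\\.` makes the pattern match a literal full stop; an unescaped `.` would match any character.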

Answers


If I understand correctly, you want two copies of df$Question, with each copy using different labels.

    df$qs_inx <- df$Question 
    df$labels <- df$Question 
    
    levels(df$qs_inx) <- sub('[ ]*Q\\.([0-9]+).*', 'q\\1', levels(df$Question)) 
    levels(df$labels) <- sub('[ ]*Q\\.(.*)', '\\1', levels(df$Question)) 
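The relabeling above can be checked on the example data with a self-contained sketch like the one below. The second substitution is a variant, not part of the answer: it assumes the separator between the question number and the text is a single whitespace-delimited token (such as `-`), so that the label keeps only the question text. The names `qs`, `qs_inx` and `lbl` are illustrative.

```r
# Rebuild only the Question column from the example data.
qst <- factor(c(rep(" Q.1 - some long question?", 3),
                rep("Q.27 <U+FFFD> another long question?", 3)))

qs  <- qst
lbl <- qst

# Relabel the factor levels: one substitution per distinct question, not per row.
levels(qs) <- sub("[ ]*Q\\.([0-9]+).*", "q\\1", levels(qst))

# Variant (assumption): drop "Q.<n>" plus one whitespace-delimited separator token.
levels(lbl) <- sub("^\\s*Q\\.[0-9]+\\s+\\S+\\s+", "", levels(qst))

# Numeric question index, taken from the digits captured after "Q.".
qs_inx <- as.integer(sub("[ ]*Q\\.([0-9]+).*", "\\1", as.character(qst)))

data.frame(Question = qst, qs = qs, qs_inx = qs_inx, labels = lbl)
```

Working on `levels()` rather than on the column itself means each of the 49 distinct questions is rewritten once, however many rows the survey has.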
    

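Thank you, that works. But could you please give some explanation of `'[ ]*Q\\.([0-9]+).*', 'q\\1'` and `'[ ]*Q\\.(.*)', '\\1'`? It would be good to know. Otherwise, any reference would also be great. – msh855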


@msh855 These are regular expressions: https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
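For readers who want the patterns unpacked: in `'[ ]*Q\\.([0-9]+).*'`, `[ ]*` matches any leading spaces, `Q\\.` matches a literal `Q.` (the backslash is doubled because it sits inside an R string), `([0-9]+)` captures the run of digits, and `.*` consumes the rest of the text. In the replacement, `\\1` stands for whatever the first parenthesised group captured. A small sketch, using two strings taken from the question as input:

```r
pattern <- "[ ]*Q\\.([0-9]+).*"
x <- c(" Q.1 - some long question?",
       "Q.27 <U+FFFD> another long question?")

# regexec() reports the full match and each captured group;
# element 2 of every result is the first group, i.e. the digits.
m <- regmatches(x, regexec(pattern, x))
digits <- vapply(m, `[`, character(1), 2)

# sub() replaces the whole match, so only the replacement text survives.
with_prefix <- sub(pattern, "q\\1", x)
```

Because the pattern ends in `.*`, the match covers the entire string, which is why `sub()` leaves nothing but the replacement.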
