2012-01-11

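R: dividing text in the tm package - identifying speakers

I am trying to identify the most frequently used words in congressional speeches, and I need to separate them by congressperson. I am just starting to learn R and the tm package. I have code that can find the most frequent words, but what kind of code can I use to automatically identify and store the speaker of each speech?

The text looks like this: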

OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN 

    The Chairman. Good afternoon to everybody, and thank you 
very much for coming to this hearing this afternoon. 
    In today's tough economic climate, millions of seniors have 
lost a big part of their retirement and investments in only a 
matter of months. Unlike younger Americans, they do not have 
time to wait for the markets to rebound in order to recoup a 
lifetime of savings. 
[....] 

    STATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER 
[....] 

I would like to be able either to extract these names or to separate the text by person. I hope you can help me. Thanks a lot.

Answers


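Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speakers' names.

If so, you might say x is your text, then use strsplit(x, "STATEMENT OF") to split on the words STATEMENT OF, then use grep() or str_extract() to return the 2 or 3 words after SENATOR (do they always have only two names, as in your example?).

Have a look here for more on using these functions, and on text manipulation in R in general: http://en.wikibooks.org/wiki/R_Programming/Text_Processing

UPDATE: Here's a more complete answer...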

#create object containing all text 
x <- c("OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN 

    The Chairman. Good afternoon to everybody, and thank you 
very much for coming to this hearing this afternoon. 
    In today's tough economic climate, millions of seniors have 
lost a big part of their retirement and investments in only a 
matter of months. Unlike younger Americans, they do not have 
time to wait for the markets to rebound in order to recoup a 
lifetime of savings. 

STATEMENT OF SENATOR BIG APPLE KOHL, CHAIRMAN 

I am trying to identify the most frequently used words in the 
congress speeches, and have to separate them by the congressperson. 
I am just starting to learn about R and the tm package. I have a code 
that can find the most frequent words, but what kind of a code can I 
use to automatically identify and store the speaker of the speech 

STATEMENT OF SENATOR LITTLE ORANGE, CHAIRMAN 

Would it be correct to say that you want 
to split the file so you have one text object 
per speaker? And then use a regular expression 
to grab the speaker's name for each object? Then 
you can write a function to collect word frequencies, 
etc. on each object and put them in a table where the 
row or column names are the speaker's names.") 

# split the text wherever the phrase "STATEMENT OF" occurs
y <- unlist(strsplit(x, "STATEMENT OF")) 

#load library containing handy function 
library(stringr) 

# use word() to return the words in positions 3 to 4 of each string, which is where the first and last names are
# note that the first element of y holds only the single word "OPENING", and
# word() gives an error if there are not enough words in a string, so it is skipped here
z <- word(y[2:4], 3, 4)
z # have a look at the result...
## [1] "HERB KOHL,"  "BIG APPLE"  "LITTLE ORANGE,"

No doubt a regular expression wizard could come up with something quicker and more concise!
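One possible regular-expression version, as a rough sketch in base R rather than a definitive solution. It assumes every header has the form "STATEMENT OF SENATOR <NAME>," with the name ending at the first comma (or at the end of the line), which holds for the sample in the question but may not hold for a full transcript. The variable txt is a shortened, made-up stand-in for x.

```r
# Shortened stand-in for the transcript (hypothetical; use your own text here).
txt <- "OPENING STATEMENT OF SENATOR HERB KOHL, CHAIRMAN\n\nThe Chairman. Good afternoon.\n\nSTATEMENT OF SENATOR MEL MARTINEZ, RANKING MEMBER\n\nThank you."

# Assumption: the name runs from after "SENATOR " up to the first comma or newline.
pattern <- "STATEMENT OF SENATOR [^,\n]+"

# gregexpr() finds every match; regmatches() pulls out the matched text.
headers <- regmatches(txt, gregexpr(pattern, txt))[[1]]

# Strip the fixed prefix so only the name remains.
speakers <- sub("^STATEMENT OF SENATOR ", "", headers)
speakers
```

Unlike the word() approach, this does not depend on how many words the name has, so a three-word name would be kept whole. Headers for speakers who are not senators would need a broader pattern.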

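In any case, from here you can run a function to calculate word frequencies on each element of the vector y (i.e. each speaker's speech) and then make another object that combines the word frequency results with the names for further analysis.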
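A minimal sketch of that last step in base R, offered as an illustration only. The vectors speeches and speakers are made-up stand-ins for the split text and the extracted names, and word_freq is a hypothetical helper, not a function from tm or stringr. The tokenisation is deliberately crude (lower-case, split on anything that is not a letter or apostrophe).

```r
# Hypothetical stand-ins for the per-speaker text and the speaker names.
speeches <- c(
  "Good afternoon to everybody, and thank you very much for coming this afternoon.",
  "Thank you, and thank you all for the hearing."
)
speakers <- c("HERB KOHL", "MEL MARTINEZ")

# Count the words in one speech, most frequent first.
word_freq <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]          # drop empty strings left by the split
  sort(table(words), decreasing = TRUE)
}

# One frequency table per speech, named by speaker.
freqs <- setNames(lapply(speeches, word_freq), speakers)

# Look at the most frequent words for one speaker.
head(freqs[["HERB KOHL"]], 3)
```

Keeping the results in a named list means each speaker's table can be looked up by name. For stop-word removal or stemming, the tm package's own tools are probably a better fit than this hand-rolled counter.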

Thanks, I think this might work. – appletree 2012-01-11 06:28:33

@appletree, I've expanded my answer; I hope that helps. I had a regular expression solution but couldn't get it to work. Maybe someone else will show us how it's done... – Ben 2012-01-11 07:40:32


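Here's how I would approach it using Ben's example (use qdap to parse the text and create a data frame, then convert it to a Corpus with 3 documents; note that qdap was designed for transcript data like this, and a Corpus may not be the best data format):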

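library(qdap) 
# split the text into individual lines 
dat <- unlist(strsplit(x, "\\n")) 

# find the header lines and pull the speaker's name out of each one 
locs <- grep("STATEMENT OF ", dat) 
nms <- sapply(strsplit(dat[locs], "STATEMENT OF |,"), "[", 2) 
# replace each header with a marker, then re-join the lines and split on that marker 
dat[locs] <- "SPLIT_HERE" 
corp <- with(data.frame(person=nms, dialogue = 
    Trim(unlist(strsplit(paste(dat[-1], collapse=" "), "SPLIT_HERE")))), 
    df2tm_corpus(dialogue, person)) 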

tm::inspect(corp) 

## A corpus with 3 text documents 
## 
## The metadata consists of 2 tag-value pairs and a data frame 
## Available tags are: 
## create_date creator 
## Available variables in the data frame are: 
## MetaID 
## 
## $`SENATOR BIG APPLE KOHL` 
## I am trying to identify the most frequently used words in the congress speeches, and have to separate them by the congressperson. I am just starting to learn about R and the tm package. I have a code that can find the most frequent words, but what kind of a code can I use to automatically identify and store the speaker of the speech 
## 
## $`SENATOR HERB KOHL` 
## The Chairman. Good afternoon to everybody, and thank you very much for coming to this hearing this afternoon.  In today's tough economic climate, millions of seniors have lost a big part of their retirement and investments in only a matter of months. Unlike younger Americans, they do not have time to wait for the markets to rebound in order to recoup a lifetime of savings. 
## 
## $`SENATOR LITTLE ORANGE` 
## Would it be correct to say that you want to split the file so you have one text object per speaker? And then use a regular expression to grab the speaker's name for each object? Then you can write a function to collect word frequencies, etc. on each object and put them in a table where the row or column names are the speaker's names.