2017-07-26 32 views
1

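Parsing XML of ancient Greek plays with speakers and dialogue

I am currently trying to read Greek plays, which are available online as XML files, into a data frame with dialogue and speaker columns. I run the following commands to download the XML and parse out the dialogue and the speakers.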

library(XML) 
library(RCurl) 
url <- "http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A1999.01.0186" 
html <- getURL(url, followlocation = TRUE) 
doc <- htmlParse(html, asText=TRUE) 
plain.text <- xpathSApply(doc, "//p", xmlValue) 
speakersc <- xpathSApply(doc, "//speaker", xmlValue) 
dialogue <- data.frame(text = plain.text, stringsAsFactors = FALSE) 
speakers <- data.frame(text = speakersc, stringsAsFactors = FALSE) 

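However, I then ran into a problem. The dialogue yields 300 rows (for the 300 distinct lines in the play), but the speakers yield only 297. The cause is the structure of the XML, reproduced below: the <speaker> tag is not repeated when a speech continues after being interrupted by a stage direction. Because I have to split the dialogue on the <p> tag, this produces two dialogue rows but only one speaker row, and the speaker is not duplicated to match.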

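<sp>
<speaker>Creon</speaker>
<stage>To the Guard.</stage>
<p>
You can take yourself wherever you please,
<milestone n="445" unit="line" ed="p"/>
free and clear of a heavy charge.
<stage>Exit Guard.</stage>
</p>
</sp>
<sp>
<stage>To Antigone.</stage>
<p>You, however, tell me - not at length, but briefly - did you know that an edict had forbidden this?</p>
</sp>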

How can I parse the XML so that the data yields the same number of dialogue rows as corresponding speaker rows?

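For the example above, I would like the resulting data frame either to contain two rows for Creon, one for the dialogue before the stage direction and one for the dialogue after it, or to contain a single row that treats Creon's dialogue as one line and ignores the split caused by the stage direction.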
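
The first of these two layouts (one row per <p>, with the speaker repeated) could be sketched roughly as follows. This is not taken from either answer below; it is a minimal illustration that assumes every <sp> without a <speaker> continues the previous speech. The inline sample `xml_text` and the names `spk`, `paras` and `result` are hypothetical.

```r
library(XML)

# Hypothetical three-speech sample shaped like the fragment above
xml_text <- paste0(
  "<body>",
  "<sp><speaker>Creon</speaker><p>You can take yourself wherever you please.</p></sp>",
  "<sp><stage>To Antigone.</stage><p>You, however, tell me.</p></sp>",
  "<sp><speaker>Antigone</speaker><p>I knew it.</p></sp>",
  "</body>"
)
doc <- xmlParse(xml_text, asText = TRUE)
nodes <- getNodeSet(doc, "//sp")

# Speaker of each <sp>, or NA when the tag is absent
spk <- sapply(nodes, function(n) {
  s <- xpathSApply(n, "./speaker", xmlValue)
  if (length(s) > 0) s[[1]] else NA_character_
})

# Carry the last seen speaker forward into speakerless <sp> nodes
for (i in seq_along(spk)) {
  if (is.na(spk[i]) && i > 1) spk[i] <- spk[i - 1]
}

# All <p> values per <sp>, then one row per <p>
paras <- lapply(nodes, function(n) as.character(xpathSApply(n, ".//p", xmlValue)))
result <- data.frame(
  speaker = rep(spk, lengths(paras)),
  dialogue = unlist(paras),
  stringsAsFactors = FALSE
)
print(result)
```

On the sample this gives three rows, with Creon listed for both the first and the second.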

Thank you very much for your help.

Answers

1

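Consider using XPath's forward-looking following-sibling axis to find the next <p> tag when the speaker is empty, all while iterating over the <sp> nodes, which are the parents of <speaker> and <p>: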

# ALL SP NODES 
sp <- xpathSApply(doc, "//body/descendant::sp", xmlValue) 

# ITERATE THROUGH EACH SP BY NODE INDEX TO CREATE LIST OF DFs 
dfList <- lapply(seq_along(sp), function(i){ 
    data.frame(
    speakers = xpathSApply(doc, paste0("concat(//body/descendant::sp[",i,"]/speaker,'')"), xmlValue), 
    dialogue = xpathSApply(doc, paste0("concat(//body/descendant::sp[",i,"]/speaker/following-sibling::p[1], ' ', 
               //body/descendant::sp[position()=",i+1," and not(speaker)]/p[1])"), xmlValue) 
    ) 
}) 

# ROW BIND LIST OF DFs AND SUBSET EMPTY SPEAKER/DIALOGUE 
# (runs after the lapply() call above has been closed)
finaldf <- subset(do.call(rbind, dfList), speakers!="" & dialogue!="") 

# SPECIFIC ROWS IN OP'S HIGHLIGHT 
finaldf[85,] 
# speakers 
# 85 Creon 
# 
# dialogue 
# 85 You can take yourself wherever you please,free and clear of a heavy 
# charge.Exit Guard. You, however, tell me—not at length, but 
# briefly—did you know that an edict had forbidden this? 

finaldf[86,] 
# speakers          dialogue 
# 87 Antigone I knew it. How could I not? It was public. 
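
To try the same idea without downloading the play, the approach can be sketched on a small inline document. The sample `xml_text` and the helper names `this_sp` and `next_sp` are hypothetical; the XPath expressions follow the ones above, using `string()` in place of `concat(..., '')` for the speaker.

```r
library(XML)

# Hypothetical sample: the second <sp> has no <speaker> and continues Creon's speech
xml_text <- paste0(
  "<text><body>",
  "<sp><speaker>Creon</speaker><stage>To the Guard.</stage>",
  "<p>You can take yourself wherever you please.</p></sp>",
  "<sp><stage>To Antigone.</stage><p>You, however, tell me.</p></sp>",
  "<sp><speaker>Antigone</speaker><p>I knew it.</p></sp>",
  "</body></text>"
)
doc <- xmlParse(xml_text, asText = TRUE)
n_sp <- length(getNodeSet(doc, "//body/descendant::sp"))

rows <- lapply(seq_len(n_sp), function(i) {
  this_sp <- paste0("//body/descendant::sp[", i, "]")
  # Selects the following <sp> only when it has no <speaker> child
  next_sp <- paste0("//body/descendant::sp[position()=", i + 1, " and not(speaker)]")
  data.frame(
    speakers = xpathSApply(doc, paste0("string(", this_sp, "/speaker)")),
    dialogue = xpathSApply(doc, paste0(
      "concat(", this_sp, "/speaker/following-sibling::p[1], ' ', ", next_sp, "/p[1])"
    )),
    stringsAsFactors = FALSE
  )
})

# Drop the rows produced by speakerless <sp> nodes
finaldf <- subset(do.call(rbind, rows), speakers != "" & dialogue != "")
finaldf$dialogue <- trimws(finaldf$dialogue)
print(finaldf)
```

Note that the expression only looks one <sp> ahead, so a speech interrupted by two consecutive speakerless <sp> nodes would lose its third part under this sketch.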

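Thank you very much for your help. The code works perfectly and produces exactly what I need, with one small modification: the }) has to be moved above the line that creates the finaldf object. Thank you very much for your work! – jmlawler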

0

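Another option is to read the URL and make some changes before parsing the XML. In this case: replace the milestone tags with a space so that words do not run together, remove the stage tags, and then fix the sp nodes that have no speaker.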

x <- readLines(url) 
x <- gsub("<milestone[^>]*>", " ", x) # add space 
x <- gsub("<stage>[^>]*stage>", "", x) # no space 
x <- paste(x, collapse = "") 
x <- gsub("</p></sp><sp><p>", "", x) # fix sp without speaker 

Now the XML has the same number of sp and speaker tags.

doc <- xmlParse(x) 
summary(doc) 
        p        sp   speaker      div2 placeName 
      299       297       297        51        25 ... 

Finally, get the sp nodes and parse out the speakers and paragraphs.

sp <- getNodeSet(doc, "//sp") 
s1 <- sapply(sp, xpathSApply, ".//speaker", xmlValue) 
# collapse the 1 node with 2 <p> 
p1 <- lapply(sp, xpathSApply, ".//p", xmlValue) 
p1 <- trimws(sapply(p1, paste, collapse= " ")) 
speakers <- data.frame(speaker=s1, dialogue = p1) 

    speaker                 dialogue 
1 Antigone Ismene, my sister, true child of my own mother, do you know any evil o... 
2 Ismene To me no word of our friends, Antigone, either bringing joy or bringin... 
3 Antigone I knew it well, so I was trying to bring you outside the courtyard gat... 
4 Ismene Hear what? It is clear that you are brooding on some dark news.   
5 Antigone Why not? Has not Creon destined our brothers, the one to honored buri... 
6 Ismene Poor sister, if things have come to this, what would I profit by loose... 
7 Antigone Consider whether you will share the toil and the task.     
8 Ismene What are you hazarding? What do you intend?        
9 Antigone Will you join your hand to mine in order to lift his corpse?    
10 Ismene You plan to bury him—when it is forbidden to the city?  
... 
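
The clean-up steps can also be checked offline on a few hand-written lines. In this sketch the vector `x` is a hypothetical stand-in for the output of readLines(url); the substitutions are the ones shown above.

```r
library(XML)

# Hypothetical raw lines shaped like the fragment in the question
x <- c(
  "<body>",
  "<sp><speaker>Creon</speaker><stage>To the Guard.</stage>",
  "<p>You can take yourself wherever you please,<milestone n=\"445\" unit=\"line\" ed=\"p\"/>free and clear of a heavy charge.<stage>Exit Guard.</stage></p></sp>",
  "<sp><stage>To Antigone.</stage>",
  "<p>You, however, tell me.</p></sp>",
  "<sp><speaker>Antigone</speaker><p>I knew it.</p></sp>",
  "</body>"
)

x <- gsub("<milestone[^>]*>", " ", x)    # milestone -> space
x <- gsub("<stage>[^>]*stage>", "", x)   # drop stage directions
x <- paste(x, collapse = "")
x <- gsub("</p></sp><sp><p>", "", x)     # merge speakerless <sp> into the previous one

doc <- xmlParse(x, asText = TRUE)
sp <- getNodeSet(doc, "//sp")
speaker <- sapply(sp, xpathSApply, ".//speaker", xmlValue)
dialogue <- lapply(sp, xpathSApply, ".//p", xmlValue)
dialogue <- trimws(sapply(dialogue, paste, collapse = " "))
result <- data.frame(speaker = speaker, dialogue = dialogue, stringsAsFactors = FALSE)
print(result)
```

Because the last substitution replaces the tag boundary with an empty string, the two halves of Creon's speech are joined without a separator; substituting a single space instead would keep the sentences apart.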

Your code works very well for me. Thank you for the solution and for your help. – jmlawler