2016-11-06 113 views
4

NLP - Information extraction in Python (spaCy)

I am trying to extract this type of information from the paragraph structure below:

women_ran  men_ran  kids_ran  walked
1          2        1         3
2          4        3         1
3          6        5         2

text = ["On Tuesday, one women ran on the street while 2 men ran and 1 child ran on the sidewalk. Also, there were 3 people walking.", "One person was walking yesterday, but there were 2 women running as well as 4 men and 3 kids running.", "The other day, there were three women running and also 6 men and 5 kids running on the sidewalk. Also, there were 2 people walking in the park."] 

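I am using Python's spaCy as my NLP library. I am fairly new to NLP work and would appreciate some guidance on the best way to extract this tabular information from sentences like these.

If it were only a matter of identifying whether individuals were running or walking, I would just use sklearn to fit a classification model, but the information I need to extract is clearly more granular than that (I am trying to retrieve each subcategory and its value). Any guidance would be greatly appreciated.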

Answers

7

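You'll want to use the dependency parse for this. You can see a visualisation of your example sentences using the displaCy visualiser.

You could implement the rules you need in a few different ways, much as there are always several ways to write an XPath query, a DOM selector, and so on.

Something like this should work: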

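import spacy

# The original answer used the 'en' shortcut; spaCy 3+ needs a full model
# package name such as 'en_core_web_sm' (install it separately).
nlp = spacy.load('en_core_web_sm')
docs = [nlp(t) for t in text]
for i, doc in enumerate(docs):
    for j, sent in enumerate(doc.sents):
        # Nominal subjects found in this sentence
        subjects = [w for w in sent if w.dep_ == 'nsubj']
        for subject in subjects:
            # Numeric modifiers attached to the left of the subject
            numbers = [w for w in subject.lefts if w.dep_ == 'nummod']
            if len(numbers) == 1:
                print('document.sentence: {}.{}, subject: {}, action: {}, numbers: {}'.format(i, j, subject.text, subject.head.text, numbers[0].text))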

For the text in your example you should get something like this (the exact matches depend on the model version):

document.sentence: 0.0, subject: men, action: ran, numbers: 2 
document.sentence: 0.0, subject: child, action: ran, numbers: 1 
document.sentence: 0.1, subject: people, action: walking, numbers: 3 
document.sentence: 1.0, subject: person, action: walking, numbers: One 
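
To get from printed matches like these to the table in the question, the matches still need to be mapped to columns and the number words converted to integers. The following is only a rough sketch of one way to do that in plain Python. The helper names (`to_int`, `column_for`, `build_table`), the word lists, and the noun-to-column mapping are all assumptions made for illustration and are not part of spaCy. The rows are typed in by hand from the output above.

```python
from collections import defaultdict

# Assumed lookup tables; extend them for your own data.
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}
SUBJECT_GROUPS = {
    "woman": "women", "women": "women",
    "man": "men", "men": "men",
    "kid": "kids", "kids": "kids", "child": "kids", "children": "kids",
}
RUN_FORMS = {"run", "runs", "ran", "running"}


def to_int(token_text):
    """Convert '2' or 'One' to an int; return None if unrecognised."""
    lowered = token_text.lower()
    if lowered.isdigit():
        return int(lowered)
    return WORD_NUMBERS.get(lowered)


def column_for(subject, action):
    """Pick a table column from a subject noun and its verb, or None."""
    action = action.lower()
    if action.startswith("walk"):
        return "walked"
    group = SUBJECT_GROUPS.get(subject.lower())
    if group is not None and action in RUN_FORMS:
        return group + "_ran"
    return None


def build_table(rows):
    """Collect (doc_index, subject, action, number) rows per document."""
    table = defaultdict(dict)
    for doc_index, subject, action, number in rows:
        column = column_for(subject, action)
        value = to_int(number)
        if column is not None and value is not None:
            table[doc_index][column] = value
    return dict(table)


# Rows copied by hand from the printed output above.
rows = [
    (0, "men", "ran", "2"),
    (0, "child", "ran", "1"),
    (0, "people", "walking", "3"),
    (1, "person", "walking", "One"),
]
table = build_table(rows)
print(table)
```

Columns missing from a document's dictionary are the cases the simple nsubj/nummod rule did not catch, so they show where further rules would be needed.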
+0

I have never written an XPath query or a DOM selector. Could you explain? – kathystehl

+1

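@kathystehl XPath specifies locations within an XML (or HTML) document, so an XPath query is a way of finding particular elements in XML or HTML. See [wikipedia](https://en.wikipedia.org/wiki/XPath). A DOM selector is any CSS element 'id' or 'class' in an HTML document (the DOM is the data structure for the HTML/XML document tree that you work with in JavaScript and the like), so you can filter elements by id and class. In NLP, a dependency parser turns unstructured text into a tree data structure similar to HTML, whose tokens can be queried much like DOM selector filters and XPath queries. – hobs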
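
To make the analogy in the comment above concrete, here is a minimal sketch that needs no NLP library. The `Node` class and `select` function are hypothetical names invented for this illustration, and the tree for "2 men ran" is built by hand rather than produced by a parser. It filters a tree by dependency label in the same spirit as selecting elements by class.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """A toy tree node: a word, its dependency label, and its children."""
    text: str
    dep: str
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        # Depth-first traversal: this node first, then each subtree.
        yield self
        for child in self.children:
            yield from child.walk()


def select(root: Node, dep: str) -> List[Node]:
    """Return every node whose label matches, like a class selector."""
    return [node for node in root.walk() if node.dep == dep]


# Hand-built tree for "2 men ran": ran -> men (nsubj) -> 2 (nummod)
tree = Node("ran", "ROOT", [Node("men", "nsubj", [Node("2", "nummod")])])

for subject in select(tree, "nsubj"):
    counts = [c.text for c in subject.children if c.dep == "nummod"]
    print(subject.text, counts)
```

The answer's loop does the same thing on a real parse: it selects tokens labelled nsubj, then looks among their left children for one labelled nummod.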