2017-06-13
0

How do I parse an XML file in Python?

I have an XML file that looks like this:

<?xml version='1.0' encoding='UTF8'?> 
<Reviews> 
    <Review rid="0" book_title="O-Apanhador-no-Campo-de-Centeio" score="4.0"> 
    <sentences> 
     <sentence id="0:0:0" place="title" polarity="neutral"> 
     <text>Está provado:</text> 
     <tokens> 
      <word id="1" form="Está" base="estar" postag="v-fin" morf="PR 3S IND VFIN" extra="fmc * vK mv" head="0" deprel="STA" srl="PRED" obj="O" opinion="O" from="0" to="4"/> 
      <word id="2" form="provado" base="provar" postag="v-fin" morf="PCP M S" extra="vH jh" head="1" deprel="Cs" srl="ATR" obj="O" opinion="O" from="5" to="12"/> 
      <word id="3" form=":" base="--" postag="pu" morf="--" extra="--" head="0" deprel="PU" srl="" obj="O" opinion="O" from="12" to="13"/> 
     </tokens> 
     </sentence> 
     <sentence id="0:0:1" place="title" polarity="neutral"> 
     <text>Pode existir um livro bom sem uma história boa.</text> 
     <tokens> 
      <word id="1" form="Pode" base="poder" postag="v-fin" morf="PR 3S IND VFIN" extra="fmc * aux" head="0" deprel="STA" srl="" obj="O" opinion="O" from="0" to="4"/> 
      <word id="2" form="existir" base="existir" postag="v-inf" morf="--" extra="mv" head="1" deprel="Oaux" srl="PRED" obj="O" opinion="O" from="5" to="12"/> 
      <word id="3" form="um" base="um" postag="pron-indef" morf="M S" extra="--" head="4" deprel="DN" srl="" obj="O" opinion="O" from="13" to="15"/> 
      <word id="4" form="livro" base="livro" postag="n" morf="M S" sem="sem-r" extra="--" head="1" deprel="S" srl="TH" obj="O" opinion="O" from="16" to="21"/> 
      <word id="5" form="bom" base="bom" postag="adj" morf="M S" extra="np-close" head="4" deprel="DN" srl="" obj="O" opinion="O" from="22" to="25"/> 
      <word id="6" form="sem" base="sem" postag="prp" morf="--" extra="--" head="2" deprel="fA" srl="" obj="O" opinion="O" from="26" to="29"/> 
      <word id="7" form="uma" base="um" postag="pron-indef" morf="F S" extra="--" head="8" deprel="DN" srl="" obj="O" opinion="O" from="30" to="33"/> 
      <word id="8" form="história" base="história" postag="n" morf="F S" sem="per domain sem-r" extra="--" head="6" deprel="DP" srl="COM-ADV" obj="O" opinion="O" from="34" to="42"/> 
      <word id="9" form="boa" base="bom" postag="adj" morf="F S" extra="jh np-close" head="8" deprel="DN" srl="" obj="O" opinion="O" from="43" to="46"/> 
      <word id="10" form="." base="--" postag="pu" morf="--" extra="--" head="0" deprel="PU" srl="" from="46" to="47"/> 
     </tokens> 

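I want to extract the text field and the polarity into a separate CSV file.

I managed to extract the polarity with the code below, but I cannot extract the text: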

import csv
import lxml.etree

with open('output1.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(('text', 'polarity'))
    root = lxml.etree.fromstring(xmlstr)
    for sent in root.iter('sentence'):
        row = sent.get('text'), sent.get('polarity')
        writer.writerow(row)

Here xmlstr is a string holding the contents of the XML file.

How can I extract the text field from the file?

Note: this is a link to the file I am working with: sentiment analysis in portuguese

Can anyone help?

Thanks

Answers

0

Try this approach:

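import csv
import xml.etree.ElementTree as ET

root = ET.parse('ReLiPalavras.xml').getroot()
# Python 3: newline='' avoids blank rows on Windows, and the utf-8 encoding
# keeps accented characters intact without calling .encode() on each value.
with open('output1.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(('text', 'polarity'))
    for sent in root.iter('sentence'):
        # <text> is the first child of <sentence>; polarity is an attribute.
        row = sent[0].text, sent.get('polarity')
        writer.writerow(row)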

You will then get the content of each text element, together with its polarity, in output1.csv.
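The reason the code in the question returns nothing for the text is that get() only reads attributes, while text is a child element of sentence. Below is a minimal sketch of that distinction, assuming a shortened sample that mirrors the structure in the question. The SAMPLE string, the sentences_to_rows helper, and the in-memory buffer are illustrative names and are not part of the original post.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample mirroring the structure shown in the question.
SAMPLE = """<Reviews>
  <Review rid="0">
    <sentences>
      <sentence id="0:0:0" place="title" polarity="neutral">
        <text>Está provado:</text>
      </sentence>
      <sentence id="0:0:1" place="title" polarity="neutral">
        <text>Pode existir um livro bom sem uma história boa.</text>
      </sentence>
    </sentences>
  </Review>
</Reviews>"""


def sentences_to_rows(xml_text):
    """Return (text, polarity) pairs for every <sentence> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for sent in root.iter('sentence'):
        # get() reads attributes only, so sent.get('text') would be None here.
        # findtext() returns the text of the first matching child element.
        rows.append((sent.findtext('text'), sent.get('polarity')))
    return rows


rows = sentences_to_rows(SAMPLE)

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(('text', 'polarity'))
writer.writerows(rows)
csv_output = buffer.getvalue()
```

Looking the child up by name with findtext('text') instead of by position with sent[0] keeps working if another element is ever placed before text.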

+0

It gives me this error: ParseError: not well-formed (invalid token): line 6, column 17. What is wrong with it?
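The cause of that error cannot be determined from the excerpt alone, but the line and column that ParseError reports can be used to look at the offending spot in the file. The sketch below is one way to do that with the standard library. The find_parse_error helper and the BROKEN string are hypothetical, and the bare ampersand is only one example of an invalid token, not a diagnosis of the file in the question.

```python
import xml.etree.ElementTree as ET


def find_parse_error(xml_text):
    """Return None if xml_text parses, else (line, column, offending source line)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # line numbers are 1-based
        source_line = xml_text.splitlines()[line - 1]
        return line, column, source_line
    return None


# Hypothetical broken input: a bare '&' must be written as '&amp;' in XML.
BROKEN = "<Reviews>\n<text>livro & filme</text>\n</Reviews>"

result = find_parse_error(BROKEN)
if result is not None:
    line, column, source_line = result
    print(f"line {line}, column {column}: {source_line}")
```

Running the same check against the real file content shows exactly which line the parser rejects, which is usually enough to spot an unescaped character or a broken tag.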

0

To get the text together with the polarity attribute, I followed this solution:

from lxml import etree  # assumed import; the original snippet does not show it

trainset = list()
xmldoc = etree.parse('ReLiPalavras.xml')

for sentence_node in xmldoc.iter('sentence'):
    sentence = list()
    #for word_node in sentence_node.iter('word'):
    #    tag = 'O'
    #    if word_node.get('obj') != 'O':
    #        tag = 'OBJ'
    # sentence_node[0] is the <text> child; polarity is an attribute of <sentence>.
    sentence.append({
        'sent': sentence_node[0].text,
        'polarity': sentence_node.get('polarity')})
    if len(sentence) != 0:
        trainset.append(sentence)

This creates a list of dictionaries (each one wrapped in its own single-element list).

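import csv

# newline='' prevents blank rows between records on Windows.
with open('names.csv', 'w', encoding='utf-8', newline='') as csvfile:
    fieldnames = ['sent', 'polarity']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=',')

    writer.writeheader()
    for d in trainset:
        # Each entry of trainset is a one-element list holding the dictionary.
        writer.writerow(d[0])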

The code above then writes that list to a CSV file, which is exactly what I wanted.
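The same idea can be written more compactly by collecting plain dictionaries and handing them to DictWriter.writerows in one call. This is only a sketch under assumptions: the SAMPLE string and the build_trainset and write_trainset helpers are hypothetical names, and it uses the standard library parser with an in-memory buffer rather than the real ReLiPalavras.xml file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical one-sentence sample with the same layout as the question's file.
SAMPLE = """<Reviews>
  <Review rid="0">
    <sentences>
      <sentence id="0:0:0" place="title" polarity="neutral">
        <text>Está provado:</text>
      </sentence>
    </sentences>
  </Review>
</Reviews>"""


def build_trainset(xml_text):
    """Return one {'sent': ..., 'polarity': ...} dict per <sentence> element."""
    root = ET.fromstring(xml_text)
    return [
        {'sent': node.findtext('text'), 'polarity': node.get('polarity')}
        for node in root.iter('sentence')
    ]


def write_trainset(trainset, csvfile):
    """Write the dictionaries to an open text file as CSV with a header row."""
    writer = csv.DictWriter(csvfile, fieldnames=['sent', 'polarity'])
    writer.writeheader()
    writer.writerows(trainset)


trainset = build_trainset(SAMPLE)
out = io.StringIO()  # stands in for open('names.csv', 'w', encoding='utf-8', newline='')
write_trainset(trainset, out)
```

Keeping the dictionaries in a flat list removes the need for the d[0] indexing when writing the rows.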