2015-10-12

Parsing HTML text with JS: why is there an extra node?

I am building a piece of software that does some text parsing on a given HTML text. When I collect all the paragraphs from the HTML, I find one extra node.

I created

<p id="original_content_js"> Original content via JS:<br> </p> 

to hold the data returned by the parsing, so that I can compare it with the data being parsed (the original text).

This is the HTML code:

<p id="original_content_js"> 
Original content via JS:<br> 
</p> 

<div id="original_text">  

     <h3>Molly's Sheep</h3> 
     <p> 
      Molly had a little sheep. <br> 
      Molly didn't like her sheep. Ir was too hairy.<br> 
      So Molly took a big knife, and cut all of her sheep's fur.<br> 
      Now Molly's sheep is cold.<br> 
     </p> 
     <p> 
      But what Molly did not know, was that her sheep is a magical sheep;<br> 
      Molly's sheep grows hair instantly, magically!<br> 
      Oh, how wonderful, Molly's sheep,<br> 
      Making hair, each and each<br> 
      Hair grows quickly after cut,<br> 
      That's what the story's all about. 
     <p>  
    </div> 

This is the parsing code:

var html_text_name = "original_text"; 
var html_text = document.getElementById(html_text_name); 
var text_paragaphs = html_text.getElementsByTagName("p"); 
for (var x=0; x<text_paragaphs.length; x++){ 
    document.getElementById("original_content_js").innerHTML += "ABC" + 
    text_paragaphs[x].innerHTML + "CBA <br>"; 
} 

And this is the result I get in the original_content_js paragraph:

Original content via JS: 
ABC Molly had a little sheep. 
Molly didn't like her sheep. Ir was too hairy. 
So Molly took a big knife, and cut all of her sheep's fur. 
Now Molly's sheep is cold. 
CBA 
ABC But what Molly did not know, was that her sheep is a magical sheep; 
Molly's sheep grows hair instantly, magically!  
Oh, how wonderful, Molly's sheep, 
Making hair, each and each 
Hair grows quickly after cut, 
That's what the story's all about. CBA 
ABC CBA 

As you can see, I get what I expected: two paragraphs wrapped in "ABC" and "CBA". The exception is one empty node at the end. Why is there an additional node?

Answer


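You are not checking whether the paragraphs are closed correctly. The markup contains three opening p tags, so the browser builds three p elements and your code finds three paragraphs. The last p tag (the one just before the closing div tag) should have been a closing p tag. As written, it implicitly ends the second paragraph and starts a third, empty one. That is the problem, because it makes text_paragaphs.length 3 instead of 2. To catch this you would need to write a regular expression that checks the markup... but be careful... writing regular expressions to parse HTML is a dreadful job... and it usually cannot be done accurately 100% of the time.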
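A minimal sketch of such a check, assuming the raw markup is available as a string. The helper name countParagraphTags and the shortened sample markup are made up for illustration. It only counts opening and closing p tags; it does not parse HTML, and because a closing p tag is optional in valid HTML, a mismatch is a hint rather than proof of an error.

```javascript
// Hypothetical helper: counts opening and closing <p> tags in a markup string.
function countParagraphTags(markup) {
    // Matches "<p>" or "<p ...>", but not "<pre>" or "<param>"; case-insensitive.
    var opening = markup.match(/<p(\s[^>]*)?>/gi) || [];
    // Matches "</p>", allowing whitespace before ">".
    var closing = markup.match(/<\/p\s*>/gi) || [];
    return { opening: opening.length, closing: closing.length };
}

// Shortened version of the markup from the question, including the stray <p>.
var markup =
    '<div id="original_text">' +
    "<h3>Molly's Sheep</h3>" +
    "<p>Molly had a little sheep.<br></p>" +
    "<p>That's what the story's all about.<p>" +
    "</div>";

var counts = countParagraphTags(markup);
console.log(counts.opening, counts.closing); // prints: 3 1
if (counts.opening !== counts.closing) {
    console.log("Unbalanced <p> tags: look for a <p> that should be </p>");
}
```

The check has to run on the original source text. Reading innerHTML from the page would not work, because the browser serializes the DOM it already built, with the missing closing tags filled in.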

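Edit: I am not saying you should not write a regular expression to check whether the tags are closed correctly in your case... I am only saying, be careful.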
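If the goal is only to keep the empty node out of the output, a regex-free alternative is to skip paragraphs whose content is blank. The sketch below is one possible approach, not part of the original code: the helper name wrapNonEmptyParagraphs is hypothetical, and it takes plain strings so that it can run outside a browser.

```javascript
// Hypothetical helper: wraps each non-empty paragraph body in "ABC" ... "CBA <br>".
function wrapNonEmptyParagraphs(paragraphBodies) {
    var output = "";
    for (var x = 0; x < paragraphBodies.length; x++) {
        // Skip bodies that hold only whitespace, such as the one created by a stray <p>.
        if (paragraphBodies[x].trim() === "") {
            continue;
        }
        output += "ABC" + paragraphBodies[x] + "CBA <br>";
    }
    return output;
}

// Stand-ins for the innerHTML of the three p elements the browser creates.
var bodies = [
    "Molly had a little sheep.<br>",
    "That's what the story's all about.",
    "  \n"
];
console.log(wrapNonEmptyParagraphs(bodies));
```

In the page itself, the strings could be collected from the existing text_paragaphs collection by reading innerHTML from each element, and the returned string appended to the innerHTML of original_content_js. This hides the symptom only; the stray p tag in the markup is still worth fixing.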