Parsing a mix of structured and unstructured text

I need to parse blocks of text that are formatted like this:

Today the weather is excellent bla bla bla. 
<temperature>35</temperature>. 
I'm in a great mood today. 
<item>Desk</item>

I want to parse text like this and turn it into an array like the following:

$array[0]['text'] = 'Today the weather is excellent bla bla bla. '; 
$array[0]['type'] = 'normalText'; 

$array[1]['text'] = '35'; 
$array[1]['type'] = 'temperature'; 

$array[2]['text'] = ". I'm in a great mood today."; 
$array[2]['type'] = 'normalText'; 

$array[3]['text'] = 'Desk'; 
$array[3]['type'] = 'item';

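Essentially, I want the array to contain all of the text in the same order as the original, but split into types: normal text (anything that is not between tags) and other types, such as temperature or item, which are determined by the tags the text sits between.

Is there a way to do this with regular expressions (i.e. separate the text into normal text and the other types), or should I convert the text behind the scenes into properly structured text, such as: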

<normal>Today the weather is excellent bla bla bla.</normal> 
<temperature>35</temperature>. 
<normal> I'm in a great mood today.</normal><item>Desk</item>

before attempting to parse the text?

2012-10-19 Click Upvote

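Without converting it, but should the array contain one entry per line of text? – cerealy

@cerealy Some tags may span multiple lines; for example, '' may contain several lines of description as well as the item name. –

Edit: it now works as expected!

Solution: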

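<?php

$code = <<<'CODE'
Today the weather is excellent bla bla bla.
<temperature>35</temperature>.
I'm in a great mood today.
<item>Desk</item>
CODE;

// Split on complete <tag>content</tag> elements. PREG_SPLIT_DELIM_CAPTURE keeps
// the matched elements in the result, so the original order is preserved.
$pieces = preg_split(
    '/(<[^>]+>[^<]+<\/[^>]+>)/',
    $code,
    -1, // no limit (passing null here is deprecated as of PHP 8.1)
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

$result = array_map(
    function ($element) {
        // A tagged piece: the tag name becomes the type, the inner text the text.
        if (preg_match('/^<([^>]+)>([^<]+)</', $element, $matches)) {
            return array('text' => $matches[2], 'type' => $matches[1]);
        }
        // Everything else is normal text.
        return array('text' => $element, 'type' => 'normal');
    },
    $pieces
);

print_r($result);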

Output:

Array
(
    [0] => Array
        (
            [text] => Today the weather is excellent bla bla bla.

            [type] => normal
        )

    [1] => Array
        (
            [text] => 35
            [type] => temperature
        )

    [2] => Array
        (
            [text] => .
I'm in a great mood today.

            [type] => normal
        )

    [3] => Array
        (
            [text] => Desk
            [type] => item
        )

)
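
The split pattern in this answer accepts any closing tag and does not allow a `<` inside the tagged text. As a possible variation (a sketch only, not part of the original answer), the closing tag can be forced to match the opening one with a backreference, and the `s` modifier lets tagged content span several lines, which the asker mentions as a requirement. The function name `splitTaggedText` is made up for illustration, and the sketch assumes tag names consist of word characters and are not nested.

```php
<?php

// Hypothetical helper: splits $text into 'normal' pieces and tagged pieces,
// keeping the original order. Assumes simple, non-nested <name>...</name> tags.
function splitTaggedText(string $text): array
{
    $result = [];
    $pos = 0; // offset of the first character not yet added to $result

    // \1 requires the closing tag to repeat the opening tag's name;
    // the "s" modifier lets "." match newlines (multi-line tag content).
    preg_match_all(
        '/<(\w+)>(.*?)<\/\1>/s',
        $text,
        $matches,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    foreach ($matches as $m) {
        [$whole, $start] = $m[0];
        // Text between the previous tag and this one is normal text.
        if ($start > $pos) {
            $result[] = ['text' => substr($text, $pos, $start - $pos), 'type' => 'normal'];
        }
        $result[] = ['text' => $m[2][0], 'type' => $m[1][0]];
        $pos = $start + strlen($whole);
    }

    // Trailing normal text after the last tag, if any.
    if ($pos < strlen($text)) {
        $result[] = ['text' => substr($text, $pos), 'type' => 'normal'];
    }

    return $result;
}
```

Calling `splitTaggedText($code)` on the sample text should give the same four entries as the output above. A tag that is never closed is simply left inside the surrounding normal text.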

2012-10-19 07:17:02 Carlos

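This only parses the text inside the tags; the content outside the tags is left unparsed. –

@ClickUpvote Sorry! I didn't realize that. – Carlos

@ClickUpvote Does the order matter? – Carlos

Try reading the text line by line. You have two cases: normal text, and text inside special tags. While adding normal text to a variable, use a regular expression to look for a tag.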

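preg_match('/<(\w+)>/', $line_from_text, $matches)

This matches an opening tag, and the parentheses save the tag name in the $matches array. Now just keep adding text to the variable until you reach the closing tag. Hope this helps.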
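
The line-by-line idea described here could look roughly like the sketch below. It is only one way to read the suggestion: the function name `parseLineByLine` and the `'normal'` type label are assumptions, and the sketch also assumes tags are not nested. Text is buffered until an opening or closing tag is found, at which point the buffer is stored with its type.

```php
<?php

// Hypothetical helper: walks the text line by line, buffering text until a
// tag opens or closes. Assumes non-nested tags made of word characters.
function parseLineByLine(string $text): array
{
    $result = [];
    $type = 'normal'; // type of the piece currently being buffered
    $buffer = '';

    // Store the buffered text (if any) under the current type.
    $flush = function () use (&$result, &$type, &$buffer) {
        if ($buffer !== '') {
            $result[] = ['text' => $buffer, 'type' => $type];
        }
        $buffer = '';
    };

    // Split after each newline so every line keeps its line ending.
    foreach (preg_split('/(?<=\n)/', $text) as $line) {
        while ($line !== '') {
            // Outside a tag, look for the next opening tag;
            // inside a tag, look for its closing tag.
            $pattern = $type === 'normal'
                ? '/<(\w+)>/'
                : '/<\/' . preg_quote($type, '/') . '>/';

            if (!preg_match($pattern, $line, $m, PREG_OFFSET_CAPTURE)) {
                $buffer .= $line; // no tag boundary in the rest of this line
                break;
            }

            // Text before the boundary belongs to the current piece.
            $buffer .= substr($line, 0, $m[0][1]);
            $flush();
            $type = $type === 'normal' ? $m[1][0] : 'normal';
            $line = (string) substr($line, $m[0][1] + strlen($m[0][0]));
        }
    }
    $flush(); // an unclosed tag ends up stored under its tag name

    return $result;
}
```

Because the buffer carries over from one line to the next, a tag whose content spans several lines still ends up as a single entry.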

2012-10-19 05:36:13 cerealy
