Minifying final HTML output using regular expressions with CodeIgniter

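Google's page-speed guidance suggests minifying HTML, that is, removing all unnecessary whitespace. CodeIgniter can gzip its output, or that can be done via .htaccess. I would still like to remove unnecessary whitespace from the final HTML output as well.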

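I played around with this piece of code and it seems to work. It does produce HTML without excess whitespace, and it also strips the tab formatting.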

class Welcome extends CI_Controller 
{ 
    // CodeIgniter passes the finalized output to _output() as its argument.
    function _output($output) 
    { 
        echo preg_replace('!\s+!', ' ', $output); 
    } 

    function index(){ 
    ... 
    } 
}

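The problem is that tags such as <pre> and <textarea> may contain whitespace that matters, and this regular expression strips it too. So how do I remove excess whitespace from the final HTML with a regular expression, without affecting the whitespace or formatting inside these particular tags?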
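To make the problem concrete, here is a minimal sketch that runs the naive pattern from the question on a small sample. The markup in `$html` is a made-up example, not taken from the original post.

```php
<?php
// Hypothetical sample markup; the <pre> block relies on its line break and indentation.
$html = "<div>\n    <p>Hello   world</p>\n    <pre>line 1\n    line 2</pre>\n</div>";

// The naive approach from the question: collapse every whitespace run to one space.
$minified = preg_replace('!\s+!', ' ', $html);

// The line break inside <pre> is gone, so the preformatted text is damaged.
echo $minified, "\n";
// prints: <div> <p>Hello world</p> <pre>line 1 line 2</pre> </div>
```

The two lines inside <pre> end up on one line, which is exactly the side effect the question wants to avoid.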

I got the answer thanks to @Alan Moore. This worked for me:

echo preg_replace('#(?ix)(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)(?:<(?>textarea|pre)\b|\z))#', ' ', $output);

ridgerunner did a very good job of analyzing this regular expression. I ended up using his solution. Cheers to ridgerunner.
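As a quick check of what the pattern above does, the following sketch applies it to a small sample. The sample markup and variable names are assumptions for illustration only; the pattern itself is copied unchanged from the line above.

```php
<?php
// Hypothetical sample: whitespace runs outside and inside a <pre> element.
$output = "<p>a\n\n  b</p>\n<pre>x\n  y</pre>\n<p>c</p>";

// Pattern copied verbatim from the question's update.
$re = '#(?ix)(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)(?:<(?>textarea|pre)\b|\z))#';

$result = preg_replace($re, ' ', $output);

// Whitespace between ordinary tags is collapsed; the content of <pre> is left alone.
echo $result, "\n";
```

For this input the runs outside <pre> collapse to single spaces, while the newline and indentation between "x" and "y" stay as they were.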

Source

2011-03-15 Aman

+12

Don't use regular expressions to process HTML. – SLaks 2011-03-15 13:23:53

Infinite upvotes to you, SLaks. – 2011-03-15 13:24:26

OK, so what is a good way to reformat the final HTML output then? – Aman 2011-03-15 13:30:54

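For those curious about how Alan Moore's regex works (and yes, it does work), I've taken the liberty of commenting it so that it can be read by mere mortals: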

function process_data_alan($text) // 
{ 
    $re = '%# Collapse ws everywhere but in blacklisted elements. 
     (?>    # Match all whitespans other than single space. 
      [^\S ]\s*  # Either one [\t\r\n\f\v] and zero or more ws, 
     | \s{2,}  # or two or more consecutive-any-whitespace. 
     ) # Note: The remaining regex consumes no text at all... 
     (?=    # Ensure we are not in a blacklist tag. 
      (?:   # Begin (unnecessary) group. 
      (?:   # Zero or more of... 
       [^<]++ # Either one or more non-"<" 
      | <   # or a < starting a non-blacklist tag. 
       (?!/?(?:textarea|pre)\b) 
      )*+   # (This could be "unroll-the-loop"ified.) 
     )    # End (unnecessary) group. 
      (?:   # Begin alternation group. 
      <   # Either a blacklist start tag. 
      (?>textarea|pre)\b 
      | \z   # or end of file. 
     )    # End alternation group. 
     ) # If we made it here, we are not in a blacklist tag. 
     %ix'; 
    $text = preg_replace($re, " ", $text); 
    return $text; 
}

I'm new around here, but I can see right away that Alan is very good at regex. I would only add the following suggestions.

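There is an unnecessary (non-capturing) group that can be removed.
Although the OP did not say so, the <SCRIPT> element should be added to the <PRE> and <TEXTAREA> blacklist.
Adding the 'S' PCRE "study" modifier speeds up this regex by about 20%.
There is an alternation group in the lookahead that is ripe for Friedl's "unrolling-the-loop" efficiency construct.
On a more serious note, this same alternation group, i.e. (?:[^<]++|<(?!/?(?:textarea|pre)\b))*+, is susceptible to excessive PCRE recursion on large target strings. That can overflow the stack and make the Apache/PHP executable seg-fault and crash silently, with no warning. (The Win32 build of Apache httpd.exe is particularly susceptible because it has only a 256KB stack, whereas the *nix executables are typically built with a stack of 8MB or more.) Philip Hazel (the author of the PCRE regex engine used in PHP) discusses this issue in the documentation: PCRE DISCUSSION OF STACK USAGE. Although Alan has correctly applied the same fix that Philip shows in that document (making the first alternative possessive), there will still be a lot of recursion if the HTML file is large and contains many non-blacklisted tags. For example, on my Win32 box (with an executable that has a 256KB stack), the script blows up on a test file of only 60KB. Note also that PHP unfortunately does not follow these recommendations and sets the default recursion limit far too high (100000). (According to the PCRE docs, it should be set to the stack size divided by 500.)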
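The rule of thumb in the last point (recursion limit = stack size / 500) can be worked out as a short sketch. The helper name `recursionLimitForStack` is hypothetical, and the stack sizes are the two figures mentioned above.

```php
<?php
// Hypothetical helper: apply the "stack size divided by 500" rule of thumb.
function recursionLimitForStack(int $stackBytes): int
{
    return intdiv($stackBytes, 500);
}

$win32Limit = recursionLimitForStack(256 * 1024);      // 256KB stack
$nixLimit   = recursionLimitForStack(8 * 1024 * 1024); // 8MB stack

echo $win32Limit, "\n"; // prints 524
echo $nixLimit, "\n";   // prints 16777
```

Both values are well below PHP's default of 100000, which is why the default offers no protection against a stack overflow on either platform.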

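Here is an improved version that is faster than the original, handles larger input, and fails gracefully with a message if the input string is too large to process: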

// Set PCRE recursion limit to sane value = STACKSIZE/500 
// ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache 
ini_set("pcre.recursion_limit", "16777"); // 8MB stack. *nix 
function process_data_jmr1($text) // 
{ 
    $re = '%# Collapse whitespace everywhere but in blacklisted elements. 
     (?>    # Match all whitespans other than single space. 
      [^\S ]\s*  # Either one [\t\r\n\f\v] and zero or more ws, 
     | \s{2,}  # or two or more consecutive-any-whitespace. 
     ) # Note: The remaining regex consumes no text at all... 
     (?=    # Ensure we are not in a blacklist tag. 
      [^<]*+  # Either zero or more non-"<" {normal*} 
      (?:   # Begin {(special normal*)*} construct 
      <   # or a < starting a non-blacklist tag. 
      (?!/?(?:textarea|pre|script)\b) 
      [^<]*+  # more non-"<" {normal*} 
     )*+   # Finish "unrolling-the-loop" 
      (?:   # Begin alternation group. 
      <   # Either a blacklist start tag. 
      (?>textarea|pre|script)\b 
      | \z   # or end of file. 
     )    # End alternation group. 
     ) # If we made it here, we are not in a blacklist tag. 
     %Six'; 
    $text = preg_replace($re, " ", $text); 
    if ($text === null) exit("PCRE Error! File too big.\n"); 
    return $text; 
}
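Instead of calling exit(), a caller might prefer to fall back to the unminified markup when PCRE gives up. The following is only a sketch under that assumption: the helper name `safeReplace` is hypothetical, and it relies on the documented behavior that preg_replace() returns null on error and that preg_last_error() returns the error code.

```php
<?php
// Hypothetical helper: run a replacement and fall back to the unmodified
// input when the PCRE engine reports an error (preg_replace() returns null).
function safeReplace(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // preg_last_error() returns a code such as PREG_RECURSION_LIMIT_ERROR
        // or PREG_BACKTRACK_LIMIT_ERROR; log it and serve the page as it was.
        error_log('PCRE error code: ' . preg_last_error());
        return $subject;
    }
    return $result;
}

$collapsed = safeReplace('/\s{2,}/', ' ', "a   b\n\nc");
echo $collapsed, "\n"; // prints "a b c"
```

Serving slightly larger HTML is usually a smaller problem than serving an error page, which is the reasoning behind returning the input unchanged here.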

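P.S. I am intimately familiar with this PHP/Apache seg-fault problem, as I was involved in helping the Drupal community while they were wrestling with it. See: Optimize CSS option causes php cgi to segfault in pcre function "match". We also ran into it with the BBCode parser in the FluxBB forum software project.

Hope this helps.

Source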

2011-03-16 10:31:38 ridgerunner

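Wow, that is quite an in-depth analysis. I did not know all these details. Thanks a lot, I will try your regex. – Aman 2011-03-17 04:41:08

May I have the test file you used? – Aman 2011-03-17 10:28:27

@Aman Yes, but it will be a while before I publish it (the file is an article in progress (in HTML)...) – ridgerunner 2011-03-18 05:02:38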

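I implemented @ridgerunner's answer in two projects and ended up hitting severe slowdowns (10-30 second request times) in one of them. I found that I had to set both pcre.recursion_limit and pcre.backtrack_limit quite low for it to work at all, and even then it would give up after about 2 seconds of processing and return false. For that reason I replaced it with the solution below (with easier-to-grasp regexes), which is inspired by the outputfilter.trimwhitespace function from Smarty 2. It does no backtracking or recursion, and it works every time (instead of failing catastrophically once in a blue moon):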

function filterHtml($input) { 
    // Remove HTML comments, but not SSI 
    $input = preg_replace('/<!--[^#](.*?)-->/s', '', $input); 

    // The content inside these tags will be spared: 
    $doNotCompressTags = ['script', 'pre', 'textarea']; 
    $matches = []; 

    foreach ($doNotCompressTags as $tag) { 
     $regex = "!<{$tag}[^>]*?>.*?</{$tag}>!is"; 

     // It is assumed that this placeholder could not appear organically in your 
     // output. If it can, you may have an XSS problem. 
     $placeholder = "@@<'-placeholder-$tag'>@@"; 

     // Replace all the tags (including their content) with a placeholder, and keep their contents for later. 
     $input = preg_replace_callback(
      $regex, 
      function ($match) use ($tag, &$matches, $placeholder) { 
       $matches[$tag][] = $match[0]; 
       return $placeholder; 
      }, 
      $input 
     ); 
    } 

    // Remove whitespace (spaces, newlines and tabs) 
    $input = trim(preg_replace('/[ \n\t]+/m', ' ', $input)); 

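    // Restore the saved blocks in place of their placeholders. Tags are restored in reverse
    // order of extraction, because a later-extracted block (e.g. <pre>) can itself contain
    // placeholders of an earlier tag (e.g. <script>) that only reappear once it is restored.
    foreach (array_reverse($matches, true) as $tag => $blocks) { 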
     $placeholder = "@@<'-placeholder-$tag'>@@"; 
     $placeholderLength = strlen($placeholder); 
     $position = 0; 

     foreach ($blocks as $block) { 
      $position = strpos($input, $placeholder, $position); 
      if ($position === false) { 
       throw new \RuntimeException("Found too many placeholders of type $tag in input string"); 
      } 
      $input = substr_replace($input, $block, $position, $placeholderLength); 
     } 
    } 

    return $input; 
}

Source

2016-08-10 22:07:16 olemartinorg
