A simplistic start:
<?php
// source text
$paragraph = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Proin congue, quam nec tincidunt congue, massa ipsum sodales tellus,
in rhoncus sem quam quis ante. Nam condimentum pellentesque libero at
blandit. Suspendisse felis sem, interdum pulvinar ultricies a, auctor
vel leo. Curabitur congue mi nec purus placerat sit amet mollis magna
laoreet. Duis eu purus non turpis lacinia sagittis. Aliquam tristique
nulla volutpat neque posuere faucibus. Aenean tempus diam quis sem
convallis id cursus lorem sagittis. Nam feugiat, felis nec tincidunt
aliquet, felis lectus bibendum mi, ut tincidunt purus urna ac felis.
Quisque ut lectus dolor. Duis ipsum arcu, adipiscing id vestibulum
fringilla, euismod non augue. Nullam quis ipsum nec tortor tristique
egestas sed nec leo. Pellentesque tempus velit lacus, sit amet rhoncus
mi. Curabitur justo ipsum, consectetur ac vestibulum sed, porttitor
eget dui. Vivamus nisi lorem, porta vel gravida quis, varius et elit.
Nulla eros metus, congue sit amet interdum at, porta eget ligula.";
// replace newlines with spaces so words on adjacent lines are not glued together
$paragraph = str_replace(array("\r","\n"), ' ', $paragraph);
// convert to lowercase
$paragraph = strtolower($paragraph);
// remove non-alphanumeric characters
$paragraph = preg_replace('/[^A-Za-z0-9\s]/', '', $paragraph);
// convert into array
$words = explode(' ', $paragraph);
// remove empty strings (left behind by consecutive spaces)
$words = array_filter($words, 'strlen');
// remove duplicate values
$words = array_unique($words);
// sort array alphabetically (optional)
natsort($words);
// reindex array
$words = array_values($words);
// display array
print_r($words);
?>
Update: newlines are now handled, and each modification is split out into its own statement.
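If you don't need every step spelled out, the same pipeline can probably be condensed into one function. This is only a minimal sketch, assuming PHP 7 or later; `unique_words` is a made-up name, and splitting with `preg_split` on whitespace runs is a substitution for the `explode`/`array_filter` pair above, not something from the original code.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the
// sorted, de-duplicated list of lowercase words found in $text.
function unique_words(string $text): array
{
    // lowercase, then strip everything except letters, digits and whitespace
    $text = preg_replace('/[^a-z0-9\s]/', '', strtolower($text));
    // split on runs of whitespace (spaces, tabs, newlines), dropping empty pieces
    $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // remove duplicates, then sort in natural order; sort() also reindexes
    $words = array_unique($words);
    sort($words, SORT_NATURAL);
    return $words;
}

print_r(unique_words("Lorem ipsum,\ndolor sit amet. Lorem IPSUM!"));
```

Because the split is on any whitespace, newlines need no separate handling here. Like the code above, this only copes with ASCII text; accented or non-Latin words would need a Unicode-aware pattern.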
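What is the specific problem? Please don't tell us you need to know how to read a file and split the text into strings using simple split operations. If that is all it is, the question deserves a low-quality flag. – 2011-03-31 18:02:19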
Maybe you should install a search engine, such as [ElasticSearch](http://www.elasticsearch.org/). Unless you really *want* to reinvent it? – bart 2011-04-01 12:33:32
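Thanks for the ideas. I'll work from these. I wonder whether, in the long run, performance issues and more complex parsing/highlighting will force me onto some kind of Java- or Python-based backend such as Apache Solr. – markwk 2011-04-01 15:37:49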