PHP: finding n-grams in an array

-2

$excerpts = array(
    'I love cheap red apples', 
    'Cheap red apples are what I love', 
    'Do you sell cheap red apples?', 
    'I want red apples', 
    'Give me my red apples', 
    'OK now where are my apples?' 
);

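I want to find all the n-grams in these lines, to get a result like this:

cheap red apples: 3
red apples: 5
apples: 6

I tried to implode the array and then parse the result, but that is a poor approach: it can produce spurious n-grams, because nothing marks the boundary between one string and the next.

How would you proceed?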

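Source

2014-10-19 mattspain

To proceed, I would look up n-gram algorithms and then decide which one is suitable to implement on this data set. First port of call: [Wikipedia on N-grams](http://en.wikipedia.org/wiki/N-gram). – 2014-10-19 22:14:58

Thanks for the suggestion. That is what I did, but I need a solution, or at least a concrete example, that gives me the final output I described. – mattspain 2014-10-20 11:42:22

Hello, this library does that for you: https://packagist.org/packages/drupol/phpngrams Let me know how it goes! – 2018-02-05 20:53:04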

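"I want to find groups of words without knowing them beforehand, whereas with your function I need to provide them before anything else."

Try this: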

mb_internal_encoding('UTF-8');

// Join the excerpts, then split into sentences on anything that is neither
// whitespace nor a letter, so that n-grams never span two sentences.
$joinedExcerpts = implode(".\n", $excerpts);
$sentences = preg_split('/[^\s\pL]+/u', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);

$wordsSequencesCount = array();
foreach ($sentences as $sentence) {
    // Lower-cased words of this sentence.
    $words = array_map('mb_strtolower',
        preg_split('/[^\pL]+/u', $sentence, -1, PREG_SPLIT_NO_EMPTY));
    foreach ($words as $index => $word) {
        // Count every word sequence that starts at $index.
        $wordsSequence = '';
        foreach (array_slice($words, $index) as $nextWord) {
            $wordsSequence .= ($wordsSequence !== '') ? (' ' . $nextWord) : $nextWord;
            if (!isset($wordsSequencesCount[$wordsSequence])) {
                $wordsSequencesCount[$wordsSequence] = 0;
            }
            ++$wordsSequencesCount[$wordsSequence];
        }
    }
}

// Keep only the sequences that occur more than once.
$ngramsCount = array_filter($wordsSequencesCount,
    function ($count) { return $count > 1; });

I assume you only want groups of words that are repeated. The output of var_dump($ngramsCount); is:

array (size=11) 
    'i' => int 3 
    'i love' => int 2 
    'love' => int 2 
    'cheap' => int 3 
    'cheap red' => int 3 
    'cheap red apples' => int 3 
    'red' => int 5 
    'red apples' => int 5 
    'apples' => int 6 
    'are' => int 2 
    'my' => int 2

The code can be optimized, for example to use less memory.
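One possible way to reduce memory, sketched here under assumptions, is to cap the n-gram length so that each sentence no longer contributes every possible word sequence. The function name countNgrams and the $maxN parameter are hypothetical and not part of the answer above; the sketch also tokenizes each excerpt separately instead of joining them first, and it needs PHP 7 or later for the ?? operator.

```php
<?php
// Hypothetical helper (not from the original answer): counts word n-grams of
// at most $maxN words and returns those that occur more than once.
function countNgrams(array $excerpts, int $maxN = 3): array
{
    $counts = [];
    foreach ($excerpts as $excerpt) {
        // Split on anything that is neither whitespace nor a letter, so
        // n-grams never span two sentences (or two excerpts).
        $sentences = preg_split('/[^\s\pL]+/u', $excerpt, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($sentences as $sentence) {
            $words = preg_split('/[^\pL]+/u', mb_strtolower($sentence, 'UTF-8'),
                -1, PREG_SPLIT_NO_EMPTY);
            $total = count($words);
            for ($start = 0; $start < $total; $start++) {
                // Never build sequences longer than $maxN words.
                $limit = min($maxN, $total - $start);
                for ($n = 1; $n <= $limit; $n++) {
                    $ngram = implode(' ', array_slice($words, $start, $n));
                    $counts[$ngram] = ($counts[$ngram] ?? 0) + 1;
                }
            }
        }
    }
    // Keep only n-grams that occur more than once.
    return array_filter($counts, function ($count) { return $count > 1; });
}

$excerpts = [
    'I love cheap red apples',
    'Cheap red apples are what I love',
    'Do you sell cheap red apples?',
    'I want red apples',
    'Give me my red apples',
    'OK now where are my apples?',
];
$ngramsCount = countNgrams($excerpts, 3);
```

For the excerpts in the question no repeated group is longer than three words, so a cap of 3 should give the same eleven entries as the full version. With a smaller cap, longer repeated phrases would be missed.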

Source

2014-10-20 13:38:08

This is perfect, exactly what I asked for. Thank you very much! – mattspain 2014-10-20 18:07:04

-1

Assuming you just want to count the number of times a string occurs:

$cheapRedAppleCount = 0;
$redAppleCount = 0;
$appleCount = 0;
foreach ($excerpts as $excerpt) {
    // The patterns need delimiters; the i flag also matches "Cheap".
    $cheapRedAppleCount += preg_match_all('/cheap red apples/i', $excerpt);
    $redAppleCount += preg_match_all('/red apples/i', $excerpt);
    $appleCount += preg_match_all('/apples/i', $excerpt);
}

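preg_match_all returns the number of matches found in the given string, so you can simply add that number to a counter.

See the preg_match_all documentation for more information.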
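As a small illustration of that return value, here is a sketch that wraps the counting in a function. The name countPhrase is hypothetical; preg_quote is used so that a phrase containing regex metacharacters is matched literally, and a false return (which preg_match_all gives on failure) is treated as zero matches, which is an assumption of this sketch.

```php
<?php
// Hypothetical helper: total case-insensitive occurrences of $phrase in $texts.
function countPhrase(string $phrase, array $texts): int
{
    // preg_quote() escapes regex metacharacters and the '/' delimiter.
    $pattern = '/' . preg_quote($phrase, '/') . '/iu';
    $total = 0;
    foreach ($texts as $text) {
        $matches = preg_match_all($pattern, $text);
        // preg_match_all() returns false on failure; count that as zero here.
        $total += ($matches === false) ? 0 : $matches;
    }
    return $total;
}

$texts = [
    'I love cheap red apples',
    'Cheap red apples are what I love',
    'OK now where are my apples?',
];
$applesTotal = countPhrase('apples', $texts);
$cheapTotal = countPhrase('cheap red apples', $texts);
```

This matches substrings, so "apples" would also be counted inside a longer word such as "pineapples".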

Apologies if I have misunderstood.

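Source

2014-10-19 22:24:18 user1849060

I think the OP probably wants to find all the n-grams in any collection of strings, not just those three in these particular strings. :\ – 2014-10-19 22:27:13

I want to find groups of words without knowing them beforehand, so unfortunately this does not meet my needs. Thanks for your help anyway. – mattspain 2014-10-20 11:41:16

Try this (using implode, since that is the attempt you mentioned):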

$ngrams = array(
    'cheap red apples',
    'red apples',
    'apples',
);

$joinedExcerpts = implode("\n", $excerpts);
$nGramsCount = array_fill_keys($ngrams, 0);
foreach ($ngrams as $ngram) {
    // Match the phrase only where it is neither preceded nor followed by a
    // letter. Lookarounds do not consume the neighbouring character, so
    // back-to-back occurrences are all counted.
    $regex = '/(?<!\pL)' . preg_quote($ngram, '/') . '(?!\pL)/ui';
    $nGramsCount[$ngram] = preg_match_all($regex, $joinedExcerpts);
}

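Source

2014-10-19 23:06:51

The point is: I want to find groups of words without knowing them beforehand, whereas with your function I need to provide them before anything else. Thanks for your help anyway. – mattspain 2014-10-20 11:44:15

Sorry, I misunderstood the question. So groups of words such as "I", "I love" and "are" count as n-grams, and groups of words that are not repeated ("Do", "Do you", etc.) should be ignored? – 2014-10-20 12:05:46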

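The code provided by Pedro Amaral Couto above is very good. Since I use it for French, I modified the regular expression as follows:

$sentences = preg_split('/[^\s\pL\'’-]/u', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);

This way we can analyse words containing hyphens and apostrophes ("est-ce que", "j'ai", etc.). The hyphen is placed last in the character class so that it is read as a literal hyphen rather than as a range operator.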
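To see the effect on a French sentence, here is a minimal sketch. The sample text is made up for illustration, and the pattern assumes the hyphen is written last in the character class, with a + quantifier added so that runs of punctuation are treated as one separator.

```php
<?php
// Made-up sample text containing a hyphenated expression and apostrophes.
$text = "Est-ce que j'ai raison ? Oui, c'est vrai.";

// Split on anything that is not whitespace, a letter, an apostrophe
// (straight or typographic) or a hyphen.
$sentences = preg_split('/[^\s\pL\'’-]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

// Trim the whitespace left around each piece.
$sentences = array_map('trim', $sentences);
```

Here "Est-ce" and "j'ai" stay inside their pieces, and the text is cut only at "?", "," and ".". With the original character class, the hyphen and the apostrophes would also act as separators.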

Source

2016-04-07 19:49:36 easypronunciation
