2012-08-09 63 views
<?php

$filename = "largefile.txt";

/* get content of $filename in $content */
$content = strtolower(file_get_contents($filename));

/* split $content into array of substrings of $content i.e wordwise */
$wordArray = preg_split('/[^a-z]/', $content, -1, PREG_SPLIT_NO_EMPTY);

/* "stop words", filter them */
$filteredArray = array_filter($wordArray, function($x){
    return !preg_match("/^(.|a|an|and|the|this|at|in|or|of|is|for|to)$/",$x);
});

/* get associative array of values from $filteredArray as keys and their frequency count as value */
$wordFrequencyArray = array_count_values($filteredArray);

/* Sort array from higher to lower, keeping keys */
arsort($wordFrequencyArray);

This is my code, which finds the frequency of each distinct word in a file. It works.

Calculating word frequency across multiple files

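Now suppose there are 10 text files. I want to count the frequency of a word across all 10 files, i.e. if I look up the word "stack", I want the number of times it occurs in all the files combined. I then want to do the same for every distinct word.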

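I have done this for a single file, but I cannot work out how to extend it to multiple files. Thanks for your help, and sorry for my poor English.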


Have you tried wrapping the whole thing in a loop over each file? – Scuzzy 2012-08-09 06:16:31

Answers


Put what you already have into a function and call it in a foreach loop for each filename in an array:

<?php 

$wordFrequencyArray = array(); 

/* Count the words in $file and add them to $wordFrequencyArray.
   The array is passed by reference, because "use" is only valid on anonymous functions. */
function countWords($file, array &$wordFrequencyArray) {
    /* get content of $file in $content */
    $content = strtolower(file_get_contents($file));

    /* split $content into array of substrings of $content i.e wordwise */
    $wordArray = preg_split('/[^a-z]/', $content, -1, PREG_SPLIT_NO_EMPTY);

    /* "stop words", filter them */
    $filteredArray = array_filter($wordArray, function($x){
        return !preg_match("/^(.|a|an|and|the|this|at|in|or|of|is|for|to)$/",$x);
    });

    /* add this file's count for each word to the running total across all files */
    foreach (array_count_values($filteredArray) as $word => $count) {
        if (!isset($wordFrequencyArray[$word])) $wordFrequencyArray[$word] = 0;
        $wordFrequencyArray[$word] += $count;
    }
}
$filenames = array('file1.txt', 'file2.txt', 'file3.txt', 'file4.txt' /* ... */);
foreach ($filenames as $file) {
    countWords($file, $wordFrequencyArray);
} 

print_r($wordFrequencyArray);
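
To make the merging step easier to follow, here is a minimal sketch of the same idea that works on strings in memory instead of files, so it can be run without any input files. The helper names `countWordsInText` and `mergeCounts` and the sample texts are assumptions made up for illustration; they are not part of the original code. It assumes PHP 7 or later (for `??` and scalar type hints).

```php
<?php

/* Hypothetical helper: count the non-stop-words in one string of text. */
function countWordsInText(string $text): array {
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $filtered = array_filter($words, function ($w) {
        /* same stop-word pattern as the question; the "." also drops single letters */
        return !preg_match('/^(.|a|an|and|the|this|at|in|or|of|is|for|to)$/', $w);
    });
    return array_count_values($filtered);
}

/* Hypothetical helper: add one text's counts to the running totals. */
function mergeCounts(array $totals, array $counts): array {
    foreach ($counts as $word => $count) {
        $totals[$word] = ($totals[$word] ?? 0) + $count;
    }
    return $totals;
}

/* Made-up sample text standing in for the contents of two files. */
$texts = array(
    'The stack is full and the stack grows',
    'A stack of books',
);

$totals = array();
foreach ($texts as $text) {
    $totals = mergeCounts($totals, countWordsInText($text));
}

/* Sort from higher to lower, keeping keys */
arsort($totals);

/* Frequency of one word across all texts; 0 if it never occurs */
$stackCount = $totals['stack'] ?? 0;

print_r($totals);
echo "stack: $stackCount\n";
```

To use this with real files, each entry of `$texts` could be replaced by the result of `file_get_contents()` for one filename. Returning the totals from `mergeCounts` rather than modifying a global keeps the function easy to test, at the cost of copying the array on each call.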