2011-08-27 54 views
2

我想要循环超过200,000个用户数据集来过滤30,000个产品,我如何优化这个嵌套的大循环以获得最佳性能?重构循环?

//settings , 5 max per user, can up to 200,000 
    $settings = array(...); 

    //all prods, up to 30,000 
    $prods = array(...); 

    //all prods category relation map, up to 2 * 30,000 
    $prods_cate_ref_all = array(...); 

    //msgs filtered by settings saved yesterday , more then 100 * 200,000 
    $msg_all = array(...); 

    //filter counter 
    $j = 0; 

    //filter result 
    $res = array(); 

    foreach($settings as $set){ 

     foreach($prods as $k=>$p){ 

      //filter prods by site_id 
      if ($set['site_id'] != $p['site_id']) continue; 

       //filter prods by city_id , city_id == 0 is all over the country 
      if ($set['city_id'] != $p['city_id'] && $p['city_id'] > 0) continue; 

      //muti settings of a user may get same prods 
       if (prod_in($p['id'], $set['uuid'], $res)) continue; 

      //prods filtered by settings saved to msg table yesterday 
      if (msg_in($p['id'], $set['uuid'], $msg_all)) continue; 

       //filter prods by category id 
      if (!prod_cate_in($p['id'], $set['cate_id'], $prods_cate_ref_all)) continue; 

      //filter prods by tags of set not in prod title, website ... 
       $arr = array($p['title'], $p['website'], $p['detail'], $p['shop'], $p['tags']); 
      if (!tags_in($set['tags'], $arr)) continue; 

       $res[$j]['name'] = $v['name']; 
      $res[$j]['prod_id'] = $p['id']; 
       $res[$j]['uuid'] = $v['uuid']; 
       $res[$j]['msg'] = '...'; 
       $j++; 
     } 

    } 

    save_to_msg($res); 

function prod_in($prod_id, $uuid, $prod_all){ 
    foreach($prod_all as $v){ 
    if ($v['prod_id'] == $prod_id && $v['uuid'] == $uuid) 
     return true; 
    } 
    return false; 
} 

function prod_cate_in($prod_id, $cate_id, $prod_cate_all){ 
    foreach($prod_cate_all as $v){ 
    if ($v['prod_id'] == $prod_id && $v['cate_id'] == $cate_id) 
     return true; 
    } 
    return false; 
} 

function tags_in($tags, $arr){ 
    $tag_arr = explode(',', str_replace(',', ',', $tags)); 
    foreach($tag_arr as $v){ 
    foreach($arr as $a){ 
     if(strpos($a, strtolower($v)) !== false){ 
     return true; 
     } 
    } 
    } 
    return false; 
} 

function msg_in($prod_id, $uuid, $msg_all){ 
    foreach($msg_all as $v){ 
    if ($v['prod_id'] == $prod_id && $v['uuid'] == $uuid) 
     return true; 
    } 
    return false; 
} 

更新: 非常感谢。 是的,数据是在MySQL中,以下是主要结构:

-- user settings to filter prods, 5 max per user 
CREATE TABLE setting(
    id INT NOT NULL AUTO_INCREMENT, 
    uuid VARCHAR(100) NOT NULL DEFAULT '', 
    tags VARCHAR(100) NOT NULL DEFAULT '', 
    site_id SMALLINT UNSIGNED NOT NULL DEFAULT 0, 
    city_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0, 
    cate_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0, 
    addtime INT UNSIGNED NOT NULL DEFAULT 0, 
    PRIMARY KEY (`id`), 
    KEY `idx_setting_uuid` (`uuid`), 
    KEY `idx_setting_tags` (`tags`), 
    KEY `idx_setting_city_id` (`city_id`), 
    KEY `idx_setting_cate_id` (`cate_id`) 
) DEFAULT CHARSET=utf8; 


CREATE TABLE users(
    id INT NOT NULL AUTO_INCREMENT, 
    uuid VARCHAR(100) NOT NULL DEFAULT '', 
    PRIMARY KEY (`id`), 
    UNIQUE KEY `idx_unique_uuid` (`uuid`) 
) DEFAULT CHARSET=utf8; 


-- filtered prods 
CREATE TABLE msg_list(
    id INT NOT NULL AUTO_INCREMENT, 
    uuid VARCHAR(100) NOT NULL DEFAULT '', 
    prod_id INT UNSIGNED NOT NULL DEFAULT 0, 
    msg TEXT NOT NULL DEFAULT '', 
    PRIMARY KEY (`id`), 
    KEY `idx_ml_uuid` (`uuid`) 
) DEFAULT CHARSET=utf8; 



-- prods and prod_cate_ref table in another database, so can not join it 


CREATE TABLE prod(
    id INT NOT NULL AUTO_INCREMENT, 
    website VARCHAR(100) NOT NULL DEFAULT '' COMMENT ' site name ', 
    site_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0, 
    city_id MEDIUMINT UNSIGNED NOT NULL DEFAULT 0, 
    title VARCHAR(50) NOT NULL DEFAULT '', 
    tags VARCHAR(50) NOT NULL DEFAULT '', 
    detail VARCHAR(500) NOT NULL DEFAULT '', 
    shop VARCHAR(300) NOT NULL DEFAULT '', 
    PRIMARY KEY (`id`), 
    KEY `idx_prod_tags` (`tags`), 
    KEY `idx_prod_site_id` (`site_id`), 
    KEY `idx_prod_city_id` (`city_id`), 
    KEY `idx_prod_mix` (`site_id`,`city_id`,`tags`) 
) DEFAULT CHARSET=utf8; 

CREATE TABLE prod_cate_ref(
    id MEDIUMINT NOT NULL AUTO_INCREMENT, 
    prod_id INT NOT NULL NULL DEFAULT 0, 
    cate_id MEDIUMINT NOT NULL NULL DEFAULT 0, 
    PRIMARY KEY (`id`), 
    KEY `idx_pcr_mix` (`prod_id`,`cate_id`) 
) DEFAULT CHARSET=utf8; 


-- ENGINE all is myisam 

我不知道如何只使用一个SQL来获取所有。

+3

我假设你从数据库中获取这个数据。我还假设你正在使用SQL数据库。那为什么不使用'JOINs'和'WHERE'子句呢? – NullUserException

+0

你从哪里得到设置?数据库? – svick

+0

这是一个很重要的数据存储在PHP数组中。这是从数据库或其他东西出来的吗? –

回答

2

谢谢大家对我的启发,我终于得到了它,它的确是一个很简单的方法,但一个巨大的进步!

我重组在$ prods_cate_ref_all数据和$ msg_all(使用在最后两个功能), 也结果数组$ RES, 然后使用strpos和in_array而不是三个迭代函数(prod_in msg_in prod_cate_in)

我得到了惊人的50倍加速!随着数据变大,效果变得更加有效。

//settings , 5 max per user, can up to 200,000 
    $settings = array(...); 

    //all prods, up to 30,000 
    $prods = array(...); 

    //all prods category relation map, up to 2 * 30,000 
    $prods_cate_ref_all = get_cate_ref_all(); 

    //msgs filtered by settings saved yesterday , more then 100 * 200,000 
    $msg_all = get_msg_all(); 

    //filter counter 
    $j = 0; 

    //filter result 
    $res = array(); 


    foreach($settings as $set){ 

     foreach($prods as $p){ 

     $res_uuid_setted = false; 

     $uuid = $set['uuid']; 

     if (isset($res[$uuid])){ 
      $res_uuid_setted = true; 
     } 

     //filter prods by site_id 
     if ($set['site_id'] != $p['site_id']) 
       continue; 

     //filter prods by city_id , city_id == 0 is all over the country 
     if ($set['city_id'] != $p['city_id'] && $p['city_id'] > 0) 
       continue; 


     //muti settings of a user may get same prods 
     if ($res_uuid_setted) 
      //in_array faster than strpos if item < 1000 
      if (in_array($p['id'], $res[$uuid]['prod_ids'])) 
      continue; 

     //prods filtered by settings saved to msg table yesterday 
     if (isset($msg_all[$uuid])) 
      //strpos faster than in_array in large data 
      if (false !== strpos($msg_all[$uuid], ' ' . $p['id'] . ' ')) 
      continue; 

     //filter prods by category id 
     if (false === strpos($prods_cate_ref_all[$p['id']], ' ' . $set['cate_id'] . ' ')) 
      continue; 

     $arr = array($p['title'], $p['website'], $p['detail'], $p['shop'], $p['tags']); 
     if (!tags_in($set['tags'], $arr)) 
      continue; 


     $res[$uuid]['prod_ids'][] = $p['id']; 

     $res[$uuid][] = array(
     'name' => $set['name'], 
     'prod_id' => $p['id'], 
     'msg' => '', 
     ); 

     } 

    } 


function get_msg_all(){ 

    $temp = array(); 
    $msg_all = array(
     array('uuid' => 312, 'prod_id' => 211), 
     array('uuid' => 1227, 'prod_id' => 31), 
     array('uuid' => 1, 'prod_id' => 72), 
     array('uuid' => 993, 'prod_id' => 332), 
     ... 
    ); 

    foreach($msg_all as $k=>$v){ 
    if (!isset($temp[$v['uuid']])) 
     $temp[$v['uuid']] = ' '; 

    $temp[$v['uuid']] .= $v['prod_id'] . ' '; 
    } 

    $msg_all = $temp; 
    unset($temp); 

    return $msg_all; 
} 


function get_cate_ref_all(){ 

    $temp = array(); 
    $cate_ref = array(
     array('prod_id' => 3, 'cate_id' => 21), 
     array('prod_id' => 27, 'cate_id' => 1), 
     array('prod_id' => 1, 'cate_id' => 232), 
     array('prod_id' => 3, 'cate_id' => 232), 
     ... 
    ); 

    foreach($cate_ref as $k=>$v){ 
    if (!isset($temp[$v['prod_id']])) 
     $temp[$v['prod_id']] = ' '; 

    $temp[$v['prod_id']] .= $v['cate_id'] . ' '; 
    } 
    $cate_ref = $temp; 
    unset($temp); 

    return $cate_ref; 
} 
0

正如你已经有了较大的集外循环,很难说在那里你可以优化这一点。你可以内联函数代码来省去函数调用,或者部分从你的许多foreach中展开。

例如,该功能

function tags_in($tags, $arr){ 
    $tag_arr = explode(',', str_replace(',', ',', $tags)); 
    foreach($tag_arr as $v){ 
    foreach($arr as $a){ 
     if(strpos($a, strtolower($v)) !== false){ 
     return true; 
     } 
    } 
    } 
    return false; 
} 

内你基本上使用字符串访问的阵列。做弦更直接的(注:完整的标签匹配,你做了部分匹配):

function tags_in($tags, $arr) 
{ 
    $tags = ', '.strtolower($tags).', '; 

    foreach($arr as $tag) 
    { 
     if (false !== strpos($tags, ', '.$tag.', ') 
      return true; 
    } 
    return false; 
} 

但是当你有一个大的数据量,事情就只是需要很长时间。

  • 只有优化你的发布版本。
  • 只做小而体面的变化。
  • 每次更改后运行您的测试。
  • 简介针对代表真实世界测试数据的每个变化。

因此,下至小码走势,你就可能寻找尺度的变化。如果你之前可以解决问题,你可以做一个地图和减少策略。也许你已经在使用基于文档的数据库,为此提供了一个接口。

+0

您在tags_in中的重构非常好。 – hamlet