TCLAP makes multithreaded program slower

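TCLAP is a templatized, header-only C++ library for parsing command-line arguments. I use TCLAP to handle the command-line arguments of a multithreaded program: the arguments are read in the main function, and then several threads are launched to process a task defined by those arguments (some of them are parameters for an NLP task).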

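I started printing the number of words per second processed by the threads, and found that throughput is 6 times higher if I hard-code the arguments in main instead of reading them from the command line with TCLAP!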

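I compile with gcc and the -O2 flag. Without TCLAP, the optimized build is roughly 10 times faster than an unoptimized one, so it looks as if TCLAP somehow cancels out part of the benefit of compiler optimization.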

Here is the main function, which is the only place I use TCLAP:

int main(int argc, char** argv)
{
    uint32_t mincount;
    uint32_t dim;
    uint32_t contexthalfwidth;
    uint32_t negsamples;
    uint32_t numthreads;
    uint32_t randomseed;
    string corpus_fname;
    string output_basefname;
    string vocab_fname;

    Eigen::initParallel();

    try {
        TCLAP::CmdLine cmd("Driver for various word embedding models", ' ', "0.1");
        TCLAP::ValueArg<uint32_t> dimArg("d", "dimension", "dimension of word representations", false, 300, "uint32_t");
        TCLAP::ValueArg<uint32_t> mincountArg("m", "mincount", "required minimum occurrence count to be added to vocabulary", false, 5, "uint32_t");
        TCLAP::ValueArg<uint32_t> contexthalfwidthArg("c", "contexthalfwidth", "half window size of a context frame", false, 15, "uint32_t");
        TCLAP::ValueArg<uint32_t> numthreadsArg("t", "numthreads", "number of threads", false, 12, "uint32_t");
        TCLAP::ValueArg<uint32_t> negsamplesArg("n", "negsamples", "number of negative samples for skipgram model", false, 15, "uint32_t");
        TCLAP::ValueArg<uint32_t> randomseedArg("s", "randomseed", "seed for random number generator", false, 2014, "uint32_t");
        TCLAP::UnlabeledValueArg<string> corpus_fnameArg("corpusfname", "file containing the training corpus, one paragraph or sentence per line", true, "corpus", "corpusfname");
        TCLAP::UnlabeledValueArg<string> output_basefnameArg("outputbasefname", "base filename for the learnt word embeddings", true, "wordreps-", "outputbasefname");
        TCLAP::ValueArg<string> vocab_fnameArg("v", "vocabfname", "filename for the vocabulary and word counts", false, "wordsandcounts.txt", "filename");

        // Register every argument with the parser.
        cmd.add(dimArg);
        cmd.add(mincountArg);
        cmd.add(contexthalfwidthArg);
        cmd.add(numthreadsArg);
        cmd.add(negsamplesArg);  // was missing, so -n could not be passed on the command line
        cmd.add(randomseedArg);
        cmd.add(corpus_fnameArg);
        cmd.add(output_basefnameArg);
        cmd.add(vocab_fnameArg);
        cmd.parse(argc, argv);

        // Copy the parsed (or default) values into plain variables.
        mincount = mincountArg.getValue();
        dim = dimArg.getValue();
        contexthalfwidth = contexthalfwidthArg.getValue();
        negsamples = negsamplesArg.getValue();
        numthreads = numthreadsArg.getValue();
        randomseed = randomseedArg.getValue();
        corpus_fname = corpus_fnameArg.getValue();
        output_basefname = output_basefnameArg.getValue();
        vocab_fname = vocab_fnameArg.getValue();
    }
    catch (TCLAP::ArgException &e) {
        // Report the problem instead of continuing with uninitialized values.
        std::cerr << "error: " << e.error() << " for arg " << e.argId() << std::endl;
        return 1;
    }

    /* Hard-coded alternative used for the timing comparison:
    uint32_t mincount = 5;
    uint32_t dim = 50;
    uint32_t contexthalfwidth = 15;
    uint32_t negsamples = 15;
    uint32_t numthreads = 10;
    uint32_t randomseed = 2014;
    string corpus_fname = "imdbtrain.txt";
    string output_basefname = "wordreps-";
    string vocab_fname = "wordsandcounts.txt";
    */

    string test_fname = "imdbtest.txt";
    string output_fname = "parreps.txt";
    string countmat_fname = "counts.hdf5";
    Vocabulary * vocab;

    vocab = determineVocabulary(corpus_fname, mincount);
    vocab->dump(vocab_fname);

    Par2VecModel p2vm = Par2VecModel(corpus_fname, vocab, dim, contexthalfwidth, negsamples, randomseed);
    p2vm.learn(numthreads);
    p2vm.save(output_basefname);
    p2vm.learnparreps(test_fname, output_fname, numthreads);

}

The only place where multithreading is used is the Par2VecModel::learn function:

void Par2VecModel::learn(uint32_t numthreads) {
    thread* workers;
    workers = new thread[numthreads];
    uint64_t numwords = 0;   // shared by reference with every worker and the monitor
    bool killflag = false;   // shared by reference with the monitor
    uint32_t randseed;

    // Open at the end of the file so that tellg() returns the file size.
    ifstream filein(corpus_fname.c_str(), ifstream::ate | ifstream::binary);
    uint64_t filesize = filein.tellg();

    fprintf(stderr, "Total number of in vocab words to train over: %u\n", vocab->gettotalinvocabwords());

    // Start one training thread per worker, each with its own random seed.
    for(uint32_t idx = 0; idx < numthreads; idx++) {
        randseed = eng();
        workers[idx] = thread(skipgram_training_thread, this, numthreads, idx, filesize, randseed, std::ref(numwords));
    }

    thread monitor(monitor_training_thread, this, numthreads, std::ref(numwords), std::ref(killflag));

    for(uint32_t idx = 0; idx < numthreads; idx++)
        workers[idx].join();

    // All workers are done: tell the monitor to stop, then wait for it.
    killflag = true;
    monitor.join();

    delete[] workers;   // release the thread array allocated above
}
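
For reference, the worker/monitor pattern used in learn can be sketched in isolation as below. This is only an illustration under assumptions, not the code from the project: the names count_words_worker, report_throughput and run_workers are hypothetical, the "work" is a dummy loop, and the shared counter and stop flag are std::atomic here (the listing above shares a plain uint64_t and a plain bool by reference).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical worker: pretends to process `words` words and bumps the shared counter.
void count_words_worker(std::atomic<uint64_t>& numwords, uint64_t words) {
    for (uint64_t i = 0; i < words; ++i)
        numwords.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical monitor: prints the average throughput until it is told to stop.
void report_throughput(const std::atomic<uint64_t>& numwords,
                       const std::atomic<bool>& killflag) {
    const auto start = std::chrono::steady_clock::now();
    while (!killflag.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        const double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        std::fprintf(stderr, "%.0f words/sec\n", numwords.load() / secs);
    }
}

// Starts numthreads workers plus one monitor and returns the total word count.
uint64_t run_workers(uint32_t numthreads, uint64_t words_per_thread) {
    std::atomic<uint64_t> numwords{0};
    std::atomic<bool> killflag{false};

    std::vector<std::thread> workers;
    for (uint32_t idx = 0; idx < numthreads; ++idx)
        workers.emplace_back(count_words_worker, std::ref(numwords), words_per_thread);

    std::thread monitor(report_throughput, std::cref(numwords), std::cref(killflag));

    for (auto& w : workers)
        w.join();

    killflag.store(true);  // all workers finished: stop the monitor
    monitor.join();

    // Sanity check: no increment was lost.
    assert(numwords.load() == numthreads * words_per_thread);
    return numwords.load();
}
```

Incrementing a shared atomic once per word is itself a source of contention, so a real training loop would more likely accumulate a local count and add it to the shared counter in batches.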

This part does not involve TCLAP at all, so what is going on? (I also use C++11 features, so I compile with the -std=c++11 flag, in case that matters.)


2014-09-05 AatG

Without seeing any of your code, this is impossible to answer. – nvoigt 2014-09-05 21:14:21

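This has been open for a long time, so this suggestion may no longer be useful, but I would first check what happens if you replace TCLAP with a "simple" parser (that is, just pass the arguments on the command line in a specific fixed order and convert them to the right types). It is highly unlikely that TCLAP itself causes the problem (I can't imagine any mechanism for this behaviour). However, it is conceivable that with hard-coded values the compiler can perform some compile-time optimizations that are impossible when those values have to be variables. Still, the size of the performance difference seems somewhat pathological, so I still suspect that something else is going on.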
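
A minimal sketch of such a "simple" parser might look like the following. The Params struct, the to_u32 helper and parse_positional are hypothetical names that do not appear in the question, and the fixed argument order is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical container for the run parameters; field names mirror the question.
struct Params {
    std::string corpus_fname;
    std::string output_basefname;
    uint32_t dim;
    uint32_t mincount;
    uint32_t contexthalfwidth;
    uint32_t negsamples;
    uint32_t numthreads;
    uint32_t randomseed;
};

// Converts one argument to uint32_t; rejects negative numbers, trailing junk
// and values that do not fit into 32 bits.
uint32_t to_u32(const char* text) {
    const std::string s(text);
    if (!s.empty() && s[0] == '-')
        throw std::invalid_argument("negative value: " + s);
    std::size_t used = 0;
    const unsigned long long v = std::stoull(s, &used);  // throws on non-numeric input
    if (used != s.size() || v > UINT32_MAX)
        throw std::invalid_argument("bad unsigned value: " + s);
    return static_cast<uint32_t>(v);
}

// Assumed order:
//   prog corpus output dim mincount contexthalfwidth negsamples numthreads randomseed
Params parse_positional(int argc, const char* const* argv) {
    assert(argv != nullptr);
    if (argc != 9)
        throw std::invalid_argument("expected 8 positional arguments");

    Params p;
    p.corpus_fname     = argv[1];
    p.output_basefname = argv[2];
    p.dim              = to_u32(argv[3]);
    p.mincount         = to_u32(argv[4]);
    p.contexthalfwidth = to_u32(argv[5]);
    p.negsamples       = to_u32(argv[6]);
    p.numthreads       = to_u32(argv[7]);
    p.randomseed       = to_u32(argv[8]);
    return p;
}
```

Calling parse_positional(argc, argv) from main in place of the TCLAP block would give a third timing to compare against. For the comparison to be fair, all builds have to run with the same values: in the listing in the question the TCLAP defaults are a dimension of 300 and 12 threads, while the hard-coded block uses a dimension of 50 and 10 threads, so those defaults would need to be overridden on the command line.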


2015-07-13 02:54:35 nomad
