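I am writing a long program that takes too long to run. I profiled it with cProfile and found that the function below is called 150 times at about 1.3 seconds per call, so this function alone accounts for roughly 200 seconds. Can it be optimized for speed? The function is: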

def makeGsList(sentences, org):
    gs_list2 = []  # accumulates matches across all sentences
    for s in sentences:
        if s.startswith(tuple(StartWords)):
            s = s.lower()
            if org == 'm':
                gs_list1 = [k for k in m_words if k in s]
            if org == 'h':
                gs_list1 = [k for k in h_words if k in s]
            for gs_element in gs_list1:
                gs_list2.append(gs_element)
    gs_list3 = list(set(gs_list2))  # remove duplicates
    return gs_list3



Note: the entries in h_words/m_words are not all single words; many of them are phrases of 3-4 words.

Some examples:

StartWords = ['!Series_title','!Series_summary','!Series_overall_design','!Sample_title','!Sample_source_name_ch1','!Sample_characteristics_ch1'] 

sentences = [u'!Series_title\t"Transcript profiles of DCs of PLOSL patients show abnormalities in pathways of actin bundling and immune response"\n', u'!Series_summary\t"This study was aimed to identify pathways associated with loss-of-function of the DAP12/TREM2 receptor complex and thus gain insight into pathogenesis of PLOSL (polycystic lipomembranous osteodysplasia with sclerosing leukoencephalopathy). Transcript profiles of PLOSL patients\' DCs showed differential expression of genes involved in actin bundling and immune response, but also for the stability of myelin and bone remodeling."\n', u'!Series_summary\t"Keywords: PLOSL patient samples vs. control samples"\n', u'!Series_overall_design\t"Transcript profiles of in vitro differentiated DCs of three controls and five PLOSL patients were analyzed."\n', u'!Series_type\t"Expression profiling by array"\n', u'!Sample_title\t"potilas_DC_A"\t"potilas_DC_B"\t"potilas_DC_C"\t"kontrolli_DC_A"\t"kontrolli_DC_C"\t"kontrolli_DC_D"\t"potilas_DC_E"\t"potilas_DC_D"\n', u'!Sample_characteristics_ch1\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\n', u'!Sample_description\t"DAP12mut"\t"DAP12mut"\t"DAP12mut"\t"control"\t"control"\t"control"\t"TREM2mut"\t"TREM2mut"\n'] 

h_words = ['pp1665', 'glycerophosphodiester phosphodiesterase domain containing 5', 'gde2', 'PLOSL patients', 'actin bundling', 'glycerophosphodiester phosphodiesterase 2', 'glycerophosphodiester phosphodiesterase domain-containing protein 5'] 


Regarding sizes: the lists 'm_words' and 'h_words' are both large (about 250,000 entries each).



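This belongs on Code Review Stack Exchange if your code already works fine.


1. Think about what is in 'gs_list1' after each iteration. 2. Why not start with a *set*? – jonrsharpe


Can you guarantee that 'org' will always be 'm' or 'h'?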






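First split each sentence into tokens:

sentences = tuple(s.split(" ") for s in sentences)

Then, instead of using startswith, take your StartWords and put them in a set:

sw_set = {w for w in StartWords}

Then iterate over your sentences and do:

if s[0] in sw_set: # continue with your logic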
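
The idea above can be sketched as follows. This is only an illustration: the helper name `filter_by_first_token` is hypothetical, and because the sample lines in the question separate the start word from the rest with a tab rather than a space, the sketch splits on any whitespace instead of on " ".

```python
def filter_by_first_token(sentences, start_words):
    """Keep only sentences whose first whitespace-delimited token is a start word."""
    sw_set = set(start_words)  # set membership is O(1) on average
    kept = []
    for s in sentences:
        parts = s.split(None, 1)  # split once on any whitespace (tab or space)
        if parts and parts[0] in sw_set:
            kept.append(s)
    return kept


start_words = ['!Series_title', '!Series_summary']
sentences = [
    '!Series_title\t"Transcript profiles of DCs"\n',
    '!Series_type\t"Expression profiling by array"\n',
]
selected = filter_by_first_token(sentences, start_words)
```

Unlike `startswith`, this requires the first token to equal a start word exactly, so a line beginning with a longer token that merely has a start word as a prefix would not be selected.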




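  1. Do not use the global variables m_words and h_words.
  2. Move the if statements outside the for loop.
  3. Cast tuple(StartWords) once and for all.
  4. Use a programmatically built regular expression instead of a list comprehension.
  5. Precompile everything you can.
  6. Extend your list directly rather than iterating to append() each element.
  7. Use a set from the start instead of a list.
  8. Use a set comprehension instead of an explicit for loop.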

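import re

# Compile each alternation once, outside the function.
m_reg = re.compile("|".join(re.escape(w) for w in m_words))
h_reg = re.compile("|".join(re.escape(w) for w in h_words))

def make_gs_list(sentences, start_words, m_reg, h_reg, org):
    # start_words must be a tuple: str.startswith() does not accept a list
    if org == 'm':
        reg = m_reg
    elif org == 'h':
        reg = h_reg
    else:
        raise ValueError("org must be 'm' or 'h'")

    matched = {w for s in sentences if s.startswith(start_words)
               for w in reg.findall(s.lower())}

    return matched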
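
One caveat with the alternation approach is worth checking on real data: `findall` returns non-overlapping matches, and the `re` module tries alternatives from left to right, so a short word listed before a longer phrase that starts with it will hide the phrase. The sketch below uses a hypothetical helper, `build_phrase_regex`, that sorts the phrases longest first. Even then the result can differ from the original `k in s` test, which reports every phrase that occurs as a substring.

```python
import re


def build_phrase_regex(phrases):
    """Compile one alternation, longest phrases first so they win over their prefixes."""
    ordered = sorted(set(phrases), key=len, reverse=True)
    return re.compile("|".join(re.escape(p) for p in ordered))


words = ['actin', 'actin bundling']
text = 'pathways of actin bundling and immune response'

# Alternatives in list order: the shorter word matches first and consumes the text.
naive_hits = re.compile("|".join(re.escape(w) for w in words)).findall(text)

# Longest-first order: the full phrase is found instead.
ordered_hits = build_phrase_regex(words).findall(text)

# Original approach from the question: every phrase contained in the text.
substring_hits = [w for w in words if w in text]
```

Whether the longest-match result or the every-substring result is the desired one depends on how the returned list is used later, which the question does not say.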

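Thank you for the detailed answer. As I mentioned in my edit, the lists 'm_words' and 'h_words' are both large (250,000 entries each), so would it be best to compile both once in the main function and pass 'reg_m' and 'reg_h' to this function? Also, why do you think passing 'start_words', 'm_words' and 'h_words' as function arguments is faster than using global variables? – user1993


@user1993 If you are able to precompile the regular expressions, then yes, you should! Local variables are generally faster than global variables. – Delgan


@user1993 Also, let me know how much faster this is than your first function, I'm curious. – Delgan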
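
The remark about local versus global variables can be checked with `timeit`. The following is a rough sketch with made-up data and hypothetical function names. Timings vary by machine and Python version, and the difference is usually small next to the cost of the matching itself.

```python
import timeit

WORDS = frozenset(['gde2', 'pp1665'])  # module-level (global) name


def count_hits_global(tokens):
    hits = 0
    for t in tokens:
        if t in WORDS:  # WORDS is looked up in the global namespace each time
            hits += 1
    return hits


def count_hits_local(tokens, words):
    hits = 0
    for t in tokens:
        if t in words:  # words is a parameter, i.e. a local variable
            hits += 1
    return hits


tokens = ['gde2', 'foo', 'bar', 'pp1665'] * 1000

t_global = timeit.timeit(lambda: count_hits_global(tokens), number=200)
t_local = timeit.timeit(lambda: count_hits_local(tokens, WORDS), number=200)
print("global: %.4fs  local: %.4fs" % (t_global, t_local))
```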



import re

# optionally change these regexes
# (e.g. the sample start words begin with "!" and contain "_" and digits)
FIRST_WORD_RE = re.compile(r"^[a-zA-Z]+")
LOWER_WORD_RE = re.compile(r"[a-z]+")
m_or_h_words = {'m': set(m_words), 'h': set(h_words)}
startwords_set = set(StartWords)

def makeGsList(sentences, org):
    words = m_or_h_words[org]
    gs_set2 = set()
    for s in sentences:
        mo = FIRST_WORD_RE.match(s)
        if mo and mo.group(0) in startwords_set:
            gs_set2 |= set(LOWER_WORD_RE.findall(s.lower())) & words
    return list(gs_set2)
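
As the comment in the code says, the two regexes have to be adapted to the data. With the sample lines from the question, the start words begin with `!` and contain underscores and digits, so `^[a-zA-Z]+` would not match them. The sketch below uses assumed replacement patterns (`^!\w+` and `[a-z0-9]+`) and a hypothetical function name. It also shows a limitation of token-based lookup: only single-word entries can be found, not the multi-word phrases mentioned in the question.

```python
import re

FIRST_TOKEN_RE = re.compile(r"^!\w+")     # assumed: '!' followed by word characters
LOWER_WORD_RE = re.compile(r"[a-z0-9]+")  # assumed: digits allowed so 'gde2' stays one token

start_words = {'!Series_title', '!Sample_source_name_ch1'}
words = {'gde2', 'actin bundling'}


def match_single_words(sentences):
    found = set()
    for s in sentences:
        mo = FIRST_TOKEN_RE.match(s)
        if mo and mo.group(0) in start_words:
            # Intersect the sentence's tokens with the vocabulary.
            found |= set(LOWER_WORD_RE.findall(s.lower())) & words
    return found


result = match_single_words([
    '!Series_title\t"GDE2 and actin bundling"\n',
    '!Series_type\t"gde2"\n',  # skipped: '!Series_type' is not a start word
])
```

Here `result` contains 'gde2' but not 'actin bundling', because the phrase is never produced as a single token.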

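Thanks for your answer. I have a couple of questions: 1. Why do you think using a regular expression is better than '.startswith'? 2. What does '|=' do? – user1993


I don't know the algorithm behind '.startswith()'. I suspect it iterates over all the words, which may be inefficient if there are many of them. Also, I know from programming-language parsing techniques that the most efficient way to recognize keywords in a text is to use a regular expression together with a lookup in a set or dictionary. You can use timeit to compare the two approaches. '|=' performs a set union in place ('|' is the binary union operator for the set type). – Gribouillis


I think there is an error in this line - 'gs_set2 |= set(LOWER_WORD_RE.findall(s.lower())&words' - because my compiler gives the error 'SyntaxError: invalid syntax' for any line I put after it – user1993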
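
A small illustration of the two set operators mentioned above, using made-up tokens: `&` is intersection and `|=` is in-place union.

```python
found = set()
vocabulary = {'gde2', 'pp1665', 'actin'}

for tokens in (['gde2', 'and', 'actin'], ['pp1665', 'gde2']):
    # & keeps only the tokens present in the vocabulary;
    # |= then merges them into found without creating a new set.
    found |= set(tokens) & vocabulary

same_object = found
found |= {'extra'}  # in place: same_object sees the update as well
```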