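Can this function be optimized for speed?

I am writing a fairly long program that takes too long to run. I ran cProfile on it and found that the following function is called 150 times and each call takes about 1.3 seconds, so this function alone accounts for roughly 200 seconds. The function is: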
def makeGsList(sentences,org):
    gs_list1=[]
    gs_list2=[]
    for s in sentences:
        if s.startswith(tuple(StartWords)):
            s = s.lower()
            if org=='m':
                gs_list1 = [k for k in m_words if k in s]
            if org=='h':
                gs_list1 = [k for k in h_words if k in s]
            for gs_element in gs_list1:
                gs_list2.append(gs_element)
    gs_list3 = list(set(gs_list2))
    return gs_list3
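The function takes a list of sentences and a flag org. It goes through each line, checks whether it starts with any of the words in the list StartWords, and if so lowercases it. Then, depending on the value of org, it builds a list of all entries of m_words or h_words that occur in the current sentence. It keeps appending these entries to another list, gs_list2. Finally it builds a set from gs_list2 to drop duplicates and returns the result as a list.
Can anyone give me suggestions on how to optimize this function to reduce its execution time?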
Note: the entries in h_words/m_words are not all single words; many of them are phrases of 3-4 words.
Some examples:
StartWords = ['!Series_title','!Series_summary','!Series_overall_design','!Sample_title','!Sample_source_name_ch1','!Sample_characteristics_ch1']
sentences = [u'!Series_title\t"Transcript profiles of DCs of PLOSL patients show abnormalities in pathways of actin bundling and immune response"\n', u'!Series_summary\t"This study was aimed to identify pathways associated with loss-of-function of the DAP12/TREM2 receptor complex and thus gain insight into pathogenesis of PLOSL (polycystic lipomembranous osteodysplasia with sclerosing leukoencephalopathy). Transcript profiles of PLOSL patients\' DCs showed differential expression of genes involved in actin bundling and immune response, but also for the stability of myelin and bone remodeling."\n', u'!Series_summary\t"Keywords: PLOSL patient samples vs. control samples"\n', u'!Series_overall_design\t"Transcript profiles of in vitro differentiated DCs of three controls and five PLOSL patients were analyzed."\n', u'!Series_type\t"Expression profiling by array"\n', u'!Sample_title\t"potilas_DC_A"\t"potilas_DC_B"\t"potilas_DC_C"\t"kontrolli_DC_A"\t"kontrolli_DC_C"\t"kontrolli_DC_D"\t"potilas_DC_E"\t"potilas_DC_D"\n', u'!Sample_characteristics_ch1\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\t"in vitro differentiated DCs"\n', u'!Sample_description\t"DAP12mut"\t"DAP12mut"\t"DAP12mut"\t"control"\t"control"\t"control"\t"TREM2mut"\t"TREM2mut"\n']
h_words = ['pp1665', 'glycerophosphodiester phosphodiesterase domain containing 5', 'gde2', 'PLOSL patients', 'actin bundling', 'glycerophosphodiester phosphodiesterase 2', 'glycerophosphodiester phosphodiesterase domain-containing protein 5']
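m_words is similar.
Regarding sizes:
The two lists h_words and m_words each contain about 250,000 entries. Each entry is about 2 words long on average. The sentences list holds roughly 10-20 sentences; the example list above should give an idea of how long each sentence is.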
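This should go on Code Review Stack Exchange if your code is working fine. –
1. Think about what is in 'gs_list1' after each iteration. 2. Why not start with a set? – jonrsharpe
Can you guarantee that 'org' will be 'm' or 'h'? –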
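One possible direction, following the comment about starting from a set: rather than testing all 250,000 entries against every sentence, index the entries once and look up the word n-grams of each sentence in that index. The sketch below is not a drop-in replacement for makeGsList. The helper names (tokenize, build_index, make_gs_list) are hypothetical, and it assumes that matching on whole words, case-insensitively on both sides, is acceptable. The original does a raw substring test against the lowercased sentence, so results can differ: an entry such as 'PLOSL patients' never matches in the original but does match here.

```python
import re

# Assumption: a "word" is a run of letters/digits, optionally joined by hyphens.
_TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return _TOKEN.findall(text.lower())

def build_index(phrases):
    """Map each normalized phrase to its original spelling.

    Build this once per word list (h_words, m_words) and reuse it
    across all calls, instead of rescanning the list every time.
    """
    index = {}
    for phrase in phrases:
        tokens = tokenize(phrase)
        if tokens:
            index[" ".join(tokens)] = phrase
    # Longest phrase length in tokens; bounds the n-gram size below.
    max_len = max((key.count(" ") + 1 for key in index), default=0)
    return index, max_len

def make_gs_list(sentences, index, max_len, start_words):
    """Return the indexed phrases found in sentences that start with a start word."""
    prefixes = tuple(start_words)  # built once, not once per sentence
    found = set()                  # accumulate straight into a set
    for s in sentences:
        if not s.startswith(prefixes):
            continue
        tokens = tokenize(s)
        for i in range(len(tokens)):
            # Try every n-gram starting at token i, up to the longest phrase.
            for n in range(1, min(max_len, len(tokens) - i) + 1):
                key = " ".join(tokens[i:i + n])
                if key in index:
                    found.add(index[key])
    return sorted(found)
```

Usage would look like `h_index, h_max = build_index(h_words)` done once up front, then `make_gs_list(sentences, h_index, h_max, StartWords)` per call, with the caller choosing the h or m index based on org. The work per sentence then scales with the sentence length times the longest phrase length, not with the 250,000 entries in the word list.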