2013-03-23 54 views
0

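Separating strings that are contained in other strings

I'm setting up a script to merge PDFs based on text contained in their file names. My problem is that "Violin I" is also contained in "Violin II", and "Alto Saxophone I" is also contained in "Alto Saxophone II". How can I set this up so that tempList contains only the entries for "Violin I" and excludes "Violin II", and vice versa?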

pdfList = ["01 Violin I.pdf", "02 Violin I.pdf", "01 Violin II.pdf", "02 Violin II.pdf"]
instruments = ["Soprano", "Tenor", "Violin I", "Violin II", "Viola", "Cello", "Contrabass", "Alto Saxophone I", "Alto Saxophone II", "Tenor Saxophone", "Baritone Saxophone"]


# create lists for each instrument that can be used for merging/organization
def organizer():
    for fileName in pdfList:
        for instrument in instruments:
            tempList = []
            # substring test: "Violin I" is also found in "01 Violin II.pdf"
            if instrument in fileName:
                tempList.append(fileName)
        print(tempList)


print(pdfList)
organizer()
+0

Are the PDFs always named like this, i.e. 'number + instrument + .pdf'? Or should we assume that a PDF can have any name that contains the instrument? – woemler 2013-03-23 16:22:39

+0

Yes, the PDFs will always be in the format "(initial number) + (some text) + (instrument) + .pdf". – jumbopap 2013-03-23 16:23:37

Answers

1

Try making this change:

... 
if instrument+'.pdf' in fileName: 
... 

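Would this cover all cases? The idea is to avoid matching an instrument name that is only a substring of a longer one.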
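The same idea can be made slightly stricter with `str.endswith`, which only looks at the end of the file name. The following is a minimal sketch, not the answerer's code: it assumes the naming format confirmed in the comments (the instrument directly precedes `.pdf`), and the function name `group_by_instrument` and the dictionary result are illustrative choices.

```python
from collections import defaultdict

pdf_list = ["01 Violin I.pdf", "02 Violin I.pdf",
            "01 Violin II.pdf", "02 Violin II.pdf"]
instruments = ["Soprano", "Tenor", "Violin I", "Violin II", "Viola", "Cello",
               "Contrabass", "Alto Saxophone I", "Alto Saxophone II",
               "Tenor Saxophone", "Baritone Saxophone"]


def group_by_instrument(file_names, instrument_names):
    """Map each instrument to the files whose names end in '<instrument>.pdf'.

    Hypothetical helper: assumes the instrument is the last part of the name.
    """
    groups = defaultdict(list)
    for file_name in file_names:
        for instrument in instrument_names:
            # "Violin I.pdf" is not a suffix of "01 Violin II.pdf",
            # so the two parts end up in separate groups.
            if file_name.endswith(instrument + ".pdf"):
                groups[instrument].append(file_name)
    return dict(groups)


groups = group_by_instrument(pdf_list, instruments)
for instrument, files in groups.items():
    print(instrument, files)
```

Collecting the results in one dictionary, rather than printing a temporary list per file, gives one list per instrument that a later merge step could iterate over.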

+0

Simple and effective, thanks. – jumbopap 2013-03-23 17:57:40

3

One way is to use a regular expression, like this:

import re

pdfList = ["01 Violin I.pdf", "02 Violin I.pdf", "01 Violin II.pdf", "02 Violin II.pdf"]
instruments = ["Soprano", "Tenor", "Violin I", "Violin II", "Viola", "Cello", "Contrabass", "Alto Saxophone I", "Alto Saxophone II", "Tenor Saxophone", "Baritone Saxophone"]


# create lists for each instrument that can be used for merging/organization
def organizer():
    for fileName in pdfList:
        tempList = []
        for instrument in instruments:
            # \b on both sides: the name must start and end at a word boundary
            if re.search(r'\b{}\b'.format(instrument), fileName):
                tempList.append(fileName)
        print(tempList)


print(pdfList)
organizer()

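This wraps the search term in \b so that it only matches when there is a word boundary at both its start and its end. Also, perhaps obvious but worth pointing out: this makes your instrument names part of the regular expression, so be aware that if you use any characters that are also regex metacharacters, they will be interpreted as regex syntax rather than matched literally (right now you aren't using any). A more general solution would need some code to find and properly escape those characters.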
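For the escaping step, the standard library already provides `re.escape`. The sketch below is one possible generalization, not the answerer's code. It also anchors the pattern to the `.pdf` extension, because a space counts as a word boundary and a plain `\bTenor\b` would still match inside "Tenor Saxophone". The helper name `matches_instrument` and the instrument "Horn (F)" are hypothetical, and the anchoring assumes the instrument is the last part of the file name, as confirmed in the comments on the question.

```python
import re


def matches_instrument(instrument, file_name):
    """Return True if file_name ends with '<instrument>.pdf' at a word boundary.

    Hypothetical helper; assumes the instrument directly precedes '.pdf'.
    """
    # re.escape makes characters such as '(' and ')' match literally.
    pattern = r"\b" + re.escape(instrument) + r"\.pdf$"
    return re.search(pattern, file_name) is not None


print(matches_instrument("Violin I", "01 Violin I.pdf"))
print(matches_instrument("Violin I", "01 Violin II.pdf"))
print(matches_instrument("Tenor", "01 Tenor Saxophone.pdf"))
print(matches_instrument("Horn (F)", "03 Horn (F).pdf"))
```

Only the first and last calls should report a match: the trailing `\.pdf$` rejects both "Violin II" and "Tenor Saxophone", and the escaped parentheses match the literal characters in the file name.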