2011-05-20 205 views
1

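Extracting compressed files

The following code lets me extract .tgz files. However, it stops extracting after about two levels; there are further subfolders containing .tgz files that still need to be extracted. Also, when I extract a file, I have to move it to another path by hand, otherwise it gets overwritten by the other .tgz files extracted to the same location (all the .tgz files I work with have the same file structure / folder names once extracted). Any help is appreciated. Thanks!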

import os, sys, tarfile

def extract(tar_url, extract_path='.'):
    print(tar_url)
    tar = tarfile.open(tar_url, 'r')
    for item in tar:
        tar.extract(item, extract_path)
        if item.name.find(".tgz") != -1 or item.name.find(".tar") != -1:
            extract(item.name, "./" + item.name[:item.name.rfind('/')])

try:
    extract(sys.argv[1] + '.tgz')
    print('Done.')
except:
    name = os.path.basename(sys.argv[0])
    print('%s <filename>' % name[:name.rfind('.')])
+4

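The first thing that jumps out is that you call extract() recursively without closing the tar file you opened, so you could end up with too many open files. I would rewrite it to use a list as a stack: push the tar files you discover onto it, and close each tar file before popping the next one off the stack and processing it. – 2011-05-20 20:32:13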
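
A minimal sketch of the stack idea from the comment above, not the commenter's own code. The function name `extract_nested`, the choice to extract each nested archive into a folder named after it, and the `.tar`/`.tgz` suffix check are all assumptions made for illustration. It does not guard against name collisions or hostile archives.

```python
import os
import tarfile


def extract_nested(archive_path, dest_dir):
    """Extract archive_path into dest_dir, then every archive found inside.

    Hypothetical helper: returns the list of archives that were extracted.
    """
    # Each stack entry is (archive to open, directory to extract it into).
    stack = [(archive_path, dest_dir)]
    extracted = []
    while stack:
        path, target = stack.pop()
        found = []
        # The with-block closes this archive before the next one is opened,
        # so only one tar file is open at any time.
        with tarfile.open(path) as tar:
            tar.extractall(target)
            for member in tar.getmembers():
                if member.isfile() and member.name.endswith(('.tar', '.tgz')):
                    found.append(os.path.join(target, member.name))
        extracted.append(path)
        for inner in found:
            # Assumption: each nested archive gets its own folder next to it,
            # named after the archive without its extension.
            stack.append((inner, os.path.splitext(inner)[0]))
    return extracted
```

Because the work list is explicit, nesting depth is limited only by the archives themselves rather than by recursion or by the number of open file handles.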

+0

The second thing is that you are passing the wrong 'extract_path'. Use 'os.path.join(extract_path, item.name ....)'. – khachik 2011-05-20 20:36:41
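
One possible reading of this comment, sketched below: after `tar.extract(item, extract_path)` the nested archive lives under `extract_path`, so the recursive call has to open that joined path rather than the bare `item.name`. The helper name `nested_locations` is hypothetical, and the choice of the archive's own directory as the next extraction target is an assumption carried over from the question's code.

```python
import os


def nested_locations(extract_path, member_name):
    """Return (path of the extracted archive on disk, where to extract it)."""
    # The member was written below extract_path, not below the current dir.
    archive_on_disk = os.path.join(extract_path, member_name)
    # Assumption: extract the nested archive next to where it was found.
    nested_target = os.path.dirname(archive_on_disk)
    return archive_on_disk, nested_target
```

With the question's code, `extract(item.name, ...)` only finds the nested file when `extract_path` happens to be the current directory, which is consistent with extraction stopping after a couple of levels.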

+2

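The third thing is that you are using a bare 'except', so even when an exception is raised to tell you something went wrong, it never gets a chance to be reported. Use try ... except, and be specific about which exception you are catching. – MRAB 2011-05-20 22:50:32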
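
A small sketch of what catching specific exceptions could look like here. The function name `try_open` and the message wording are made up for illustration; `tarfile.TarError` and `OSError` are the real exception classes involved.

```python
import tarfile


def try_open(path):
    """Return (ok, message) so the caller can report what went wrong."""
    try:
        with tarfile.open(path) as tar:
            return True, '%d members' % len(tar.getnames())
    except tarfile.TarError as exc:
        # Raised for files that exist but are not readable tar archives.
        return False, 'not a readable tar archive: %s' % exc
    except OSError as exc:
        # Raised for missing files, permission problems and similar.
        return False, 'could not open file: %s' % exc
```

Anything else, such as an IndexError from a missing command-line argument, now propagates with a traceback instead of being mistaken for a usage error.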

Answers

3

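If I have not misunderstood your question, this is what you want to do -

  • Extract a .tgz file that may contain more .tgz files, which in turn need to be extracted (and so on).
  • While extracting, take care not to replace a directory that already exists in the folder.

If I have understood your problem correctly, then...
here is what my code does -

  • Extracts every .tgz file (recursively) into a separate folder with the same name as the .tgz file (without the extension), placed in the same directory.
  • While extracting, makes sure it does not overwrite/replace any file or folder that already exists.

So, if this is the directory structure of the .tgz file -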

parent/
    xyz.tgz/
        a
        b
        c
        d.tgz/
            x
            y
            z
        a.tgz/    # note: if I extract this directly, it will replace/overwrite the contents of the folder 'a'
            m
            n
            o
            p

After extraction, the directory structure will be -

parent/
    xyz.tgz
    xyz/
        a
        b
        c
        d/
            x
            y
            z
        a 1/    # 'a.tgz' is extracted to the folder 'a 1' because a folder 'a' already exists in the same folder.
            m
            n
            o
            p

Although my code below is heavily documented, I will briefly describe the structure of the program. These are the functions I have defined -

FileExtension --> returns the extension of a file
AppropriateFolderName --> helps prevent overwriting/replacing already existing folders (how? you will see in the program)
Extract --> extracts a .tgz file (safely)
WalkTreeAndExtract --> walks down a directory (passed as a parameter) and extracts all .tgz files (recursively) on the way down.

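I can't suggest changes to your code, because my approach is a bit different. I have used the extractall method of the tarfile module rather than the more involved extract method you used. (Have a look at http://docs.python.org/library/tarfile.html#tarfile.TarFile.extractall and read the warning attached to extractall. I don't think we will normally run into that problem, but keep it in mind.)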
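
The extractall warning is about archives whose member names point outside the destination (absolute paths or paths containing '..'). A hedged sketch of a pre-check follows; `unsafe_members` is a hypothetical helper, not part of tarfile, and it only inspects member names, so link members that point elsewhere are not covered. Recent Python versions also accept a `filter` argument to extractall for this purpose.

```python
import os
import tarfile


def unsafe_members(tar, dest_dir):
    """Return the names of members that would land outside dest_dir."""
    root = os.path.realpath(dest_dir)
    bad = []
    for member in tar.getmembers():
        # Resolve where this member would be written if extracted.
        target = os.path.realpath(os.path.join(root, member.name))
        if target != root and not target.startswith(root + os.sep):
            bad.append(member.name)
    return bad
```

One way to use it is to call it right after `tarfile.open(...)` and skip the archive (or raise) when the returned list is not empty.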

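So here is the code that worked for me -
(I tried it on .tar files nested 5 levels deep (i.e. a .tar within a .tar within a .tar ... 5 times), but it should work for any depth*, and for .tgz files too.)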

# extracting_nested_tars.py 

import os 
import re 
import tarfile 

file_extensions = ('tar', 'tgz') 
# Edit this according to the archive types you want to extract. Keep in 
# mind that these should be extractable by the tarfile module. 

def FileExtension(file_name):
    """Return the file extension of file_name.

    file_name should be a string. It can be either the full path of
    the file or just its name (or any string, as long as it contains
    the file extension).

    Example:
    input (file_name) --> 'abc.tar'
    return value      --> 'tar'

    """
    match = re.match(r"^.*[.](?P<ext>\w+)$", file_name)

    if match:
        return match.group('ext')
    return ''  # file_name has no extension

def AppropriateFolderName(folder_name, parent_fullpath):
    """Return a folder name such that it can be safely created in
    parent_fullpath without replacing any existing folder in it.

    Check if a folder named folder_name exists in parent_fullpath. If no,
    return folder_name (unchanged, because it can be safely created
    without replacing any already existing folder). If yes, append an
    appropriate number to folder_name such that the new name can be
    safely created in the folder parent_fullpath.

    Examples:
    folder_name = 'untitled folder'
    return value = 'untitled folder' (if no such folder already exists
                                      in parent_fullpath.)

    folder_name = 'untitled folder'
    return value = 'untitled folder 1' (if a folder named 'untitled folder'
                                        already exists but no folder named
                                        'untitled folder 1' exists in
                                        parent_fullpath.)

    folder_name = 'untitled folder'
    return value = 'untitled folder 2' (if folders named 'untitled folder'
                                        and 'untitled folder 1' both
                                        already exist but no folder named
                                        'untitled folder 2' exists in
                                        parent_fullpath.)

    """
    if os.path.exists(os.path.join(parent_fullpath, folder_name)):
        match = re.match(r'^(?P<name>.*)[ ](?P<num>\d+)$', folder_name)
        if match:
            # folder_name already ends in a number: try the next one.
            new_folder_name = '%s %d' % (match.group('name'),
                                         int(match.group('num')) + 1)
        else:
            new_folder_name = '%s 1' % folder_name
        # Recurse to check whether new_folder_name is itself already
        # taken in parent_fullpath.
        return AppropriateFolderName(new_folder_name, parent_fullpath)
    return folder_name

def Extract(tarfile_fullpath, delete_tar_file=True):
    """Extract the tarfile_fullpath to an appropriate* folder of the same
    name as the tar file (without an extension) and return the name
    of this folder (None if the extraction failed).

    If delete_tar_file is True, it will delete the tar file after
    its extraction; if False, it won't. Default value is True as you
    would normally want to delete the (nested) tar files after
    extraction. Pass False if you don't want to delete the
    tar file (after its extraction) you are passing.

    """
    tarfile_name = os.path.basename(tarfile_fullpath)
    parent_dir = os.path.dirname(tarfile_fullpath)

    # Strip the extension (e.g. '.tar') from the file name, then get a
    # folder name (from AppropriateFolderName) into which the contents
    # of the tar file can be extracted without replacing an already
    # existing folder.
    extension = FileExtension(tarfile_name)
    base_name = tarfile_name[:-len(extension) - 1] if extension else tarfile_name
    extract_folder_name = AppropriateFolderName(base_name, parent_dir)
    # The full path to this new folder.
    extract_folder_fullpath = os.path.join(parent_dir, extract_folder_name)

    try:
        # The with-block closes the archive even if extraction fails.
        with tarfile.open(tarfile_fullpath) as tar:
            tar.extractall(extract_folder_fullpath)
        if delete_tar_file:
            os.remove(tarfile_fullpath)
        return extract_folder_name
    except Exception as e:
        # Exceptions can occur while opening a damaged tar file.
        print('Error occurred while extracting %s\nReason: %s'
              % (tarfile_fullpath, e))
        return None

def WalkTreeAndExtract(parent_dir):
    """Recursively descend the directory tree rooted at parent_dir
    and extract each tar file on the way down (recursively).
    """
    try:
        dir_contents = os.listdir(parent_dir)
    except OSError as e:
        # Can occur if this program lacks permission to open the folder.
        print('Error occurred. Could not open folder %s\nReason: %s'
              % (parent_dir, e))
        return

    for content in dir_contents:
        content_fullpath = os.path.join(parent_dir, content)
        if os.path.isdir(content_fullpath):
            # If content is a folder, walk it down completely.
            WalkTreeAndExtract(content_fullpath)
        elif os.path.isfile(content_fullpath):
            # If content is a file, check if it is a tar file.
            # If so, extract its contents to a new folder.
            if FileExtension(content_fullpath) in file_extensions:
                extract_folder_name = Extract(content_fullpath)
                if extract_folder_name:
                    # Append the newly extracted folder to dir_contents
                    # so that this loop later searches it for more tar
                    # files to extract.
                    dir_contents.append(extract_folder_name)
        else:
            # Unknown file type.
            print('Skipping %s. <Neither file nor folder>' % content_fullpath)

if __name__ == '__main__':
    tarfile_fullpath = 'full_path_of_your_tarfile'  # pass the path of your tar file here.
    extract_folder_name = Extract(tarfile_fullpath, False)

    # tarfile_fullpath is extracted to extract_folder_name. Now descend
    # down its directory structure and extract all other tar files
    # (recursively).
    if extract_folder_name:  # None means the extraction failed
        extract_folder_fullpath = os.path.join(
            os.path.dirname(tarfile_fullpath), extract_folder_name)
        WalkTreeAndExtract(extract_folder_fullpath)
    # If you want to extract all tar files in a dir, just call
    # WalkTreeAndExtract on that dir and nothing else.

I have not added a command-line interface. You can add one if you find it useful.
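
For anyone who wants that command-line interface, here is one way it could start, using argparse. This is only a sketch: the function name `parse_args` and the argument name `archive` are assumptions, not part of the program above.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description='Extract a tar archive and any tar archives nested in it.')
    # Hypothetical positional argument: the outermost archive to extract.
    parser.add_argument('archive', help='path to the outer .tar/.tgz file')
    return parser.parse_args(argv)
```

In the `__main__` block the hard-coded path would then become `tarfile_fullpath = parse_args().archive`, with the rest left unchanged.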

Here is a slightly better version of the above program -
http://guanidene.blogspot.com/2011/06/nested-tar-archives-extractor.html

+0

Instead of 'os.listdir()', use 'os.walk()' to traverse the directory tree. Re: "(the slicing is to remove the extension (.tar) from the file name.)" use 'os.path.splitext(tarfile_name)[0]' – hughdbrown 2011-06-25 15:00:45
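
A sketch of what the 'os.walk()' variant might look like. It relies on the documented behaviour that, with topdown=True, the dirnames list may be modified in place to tell walk() about directories created during the walk. The function name `walk_and_extract` and the suffix check are assumptions; unlike the answer's code, this sketch neither deletes archives nor avoids existing folder names.

```python
import os
import tarfile


def walk_and_extract(root):
    """Extract every .tar/.tgz under root, including newly extracted ones."""
    extracted = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        for name in filenames:
            if not name.endswith(('.tar', '.tgz')):
                continue
            archive = os.path.join(dirpath, name)
            folder = os.path.splitext(name)[0]
            with tarfile.open(archive) as tar:
                tar.extractall(os.path.join(dirpath, folder))
            extracted.append(archive)
            # Tell os.walk about the folder just created so that it
            # descends into it and finds any nested archives.
            if folder not in dirnames:
                dirnames.append(folder)
    return extracted
```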

+0

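I can't do this -- 'dir_contents.append(extract_folder_name)' (and customise it further) if I use 'os.walk()'. As for 'os.path.splitext', it is correct for '.tgz' files, but I wrote this with a more general purpose in mind: extracting '.tar.gz' and '.tar.bz2' files as well (for which the extension would wrongly come out as '.gz' and '.bz2'). – 2011-06-25 15:08:22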
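
To illustrate the point about compound extensions: `os.path.splitext('a.tar.gz')` returns `('a.tar', '.gz')`, so the folder name would keep a stray '.tar'. A minimal sketch of one alternative is below; the name `strip_archive_suffix` and the suffix list are assumptions for illustration, not taken from the linked improved version.

```python
# Longer suffixes come first so '.tar.gz' wins over a shorter match.
ARCHIVE_SUFFIXES = ('.tar.gz', '.tar.bz2', '.tgz', '.tar')


def strip_archive_suffix(file_name):
    """Return file_name without a known archive suffix (case-insensitive)."""
    lowered = file_name.lower()
    for suffix in ARCHIVE_SUFFIXES:
        if lowered.endswith(suffix):
            return file_name[:-len(suffix)]
    return file_name  # not a recognised archive name; leave it alone
```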

+0

This is exactly what I was trying to do... thanks! – suffa 2011-06-28 01:31:33