2010-10-21 89 views
25

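I'm trying to use Python to process some PDF forms that were filled out and signed with Adobe Acrobat Reader. How can I extract the field values from a filled-in PDF form in Python?

I've tried:

  • The pdfminer demo: it didn't dump any of the filled-in data.
  • pyPdf: when I tried to load the file with PdfFileReader(f), it maxed out a core for 2 minutes, so I gave up and killed it.
  • Jython and PDFBox: this worked well, but the startup time is excessive. If that were my only option, I'd just write an external utility directly in Java.

I can keep hunting for libraries and trying them, but I'm hoping someone already has a working solution.


Update: Based on Steven's answer, I looked into pdfminer and it did the trick nicely.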

from argparse import ArgumentParser 
import pickle 
import pprint 
from pdfminer.pdfparser import PDFParser, PDFDocument 
from pdfminer.pdftypes import resolve1, PDFObjRef 

def load_form(filename):
    """Load pdf form contents into a nested list of name/value tuples"""
    with open(filename, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument()
        parser.set_document(doc)
        doc.set_parser(parser)
        doc.initialize()
        return [load_fields(resolve1(f)) for f in
                resolve1(doc.catalog['AcroForm'])['Fields']]

def load_fields(field):
    """Recursively load form fields"""
    form = field.get('Kids', None)
    if form:
        return [load_fields(resolve1(f)) for f in form]
    else:
        # Some field types, like signatures, need extra resolving
        return (field.get('T').decode('utf-16'), resolve1(field.get('V')))

def parse_cli(): 
    """Load command line arguments""" 
    parser = ArgumentParser(description='Dump the form contents of a PDF.') 
    parser.add_argument('file', metavar='pdf_form', 
        help='PDF Form to dump the contents of') 
    parser.add_argument('-o', '--out', help='Write output to file', 
         default=None, metavar='FILE') 
    parser.add_argument('-p', '--pickle', action='store_true', default=False, 
         help='Format output for python consumption') 
    return parser.parse_args() 

def main():
    args = parse_cli()
    form = load_form(args.file)
    if args.out:
        with open(args.out, 'w') as outfile:
            if args.pickle:
                pickle.dump(form, outfile)
            else:
                pp = pprint.PrettyPrinter(indent=2)
                # was file.write(...), which does not refer to the opened handle
                outfile.write(pp.pformat(form))
    else:
        if args.pickle:
            print(pickle.dumps(form))
        else:
            pp = pprint.PrettyPrinter(indent=2)
            pp.pprint(form)

if __name__ == '__main__': 
    main() 
+0

As a note, I also tried using pdftk as an external tool; it couldn't get past the owner password. – Olson 2010-10-21 03:09:47

Answers

25

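You should be able to do this with pdfminer, but it will require some digging into pdfminer's internals and some knowledge of the PDF format (of forms in particular, of course, but also of PDF's internal structures such as "dictionaries" and "indirect objects").

This example might help you on your way (I think it will only work for simple cases, with no nested fields etc.):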

import sys 
from pdfminer.pdfparser import PDFParser 
from pdfminer.pdfdocument import PDFDocument 
from pdfminer.pdftypes import resolve1 

filename = sys.argv[1] 
fp = open(filename, 'rb') 

parser = PDFParser(fp) 
doc = PDFDocument(parser) 
fields = resolve1(doc.catalog['AcroForm'])['Fields'] 
for i in fields: 
    field = resolve1(i) 
    name, value = field.get('T'), field.get('V') 
    print('{0}: {1}'.format(name, value))

Edit: forgot to mention: if you need to supply a password, pass it to doc.initialize().
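For forms that do have nested fields, the walk could be sketched roughly as below. This is not pdfminer code: `flatten_fields` and its `resolve` parameter are hypothetical names, `resolve` stands in for `resolve1`, and the sample data is made up so the sketch runs on plain dicts. It joins partial names with periods, which is how PDF forms build fully qualified field names, and it assumes the names are already text (with pdfminer the 'T' entries are raw strings that would need decoding first).

```python
def flatten_fields(fields, resolve=lambda obj: obj, prefix=''):
    """Return {fully_qualified_name: value} for a list of field dictionaries.

    `resolve` stands in for pdfminer's resolve1; the default identity
    function lets this sketch run on plain dicts.
    """
    result = {}
    for ref in fields:
        field = resolve(ref)
        partial = field.get('T')
        # Partial names are joined with '.' to form the fully qualified name.
        if prefix and partial:
            name = prefix + '.' + partial
        else:
            name = partial or prefix
        kids = [resolve(k) for k in field.get('Kids', [])]
        # Kids with their own 'T' are child fields; kids without one are
        # treated as widget annotations of this same field.
        child_fields = [k for k in kids if 'T' in k]
        if child_fields:
            result.update(flatten_fields(child_fields, resolve, name))
        else:
            result[name] = resolve(field.get('V'))
    return result


# Plain dicts standing in for resolved PDF field dictionaries (made-up data).
sample = [
    {'T': 'applicant', 'Kids': [
        {'T': 'first', 'V': 'Ada'},
        {'T': 'last', 'V': 'Lovelace'},
    ]},
    {'T': 'agree', 'V': 'Yes', 'Kids': [{'Rect': [0, 0, 10, 10]}]},
]
print(flatten_fields(sample))
```

The sketch ignores values inherited from a parent field, so a real form may need extra handling there.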

+0

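That did it, thanks. I had looked at the web demo and assumed that if I couldn't see what I wanted there, I could skip it. Not only does it work the way I want, it even handles the signature fields that PdfBox couldn't. – Olson 2010-10-22 02:25:14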

+1

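I have an encoding problem. Using field.get('V') doesn't handle special characters such as 'ü' or 'ä' correctly. Does anyone have a solution for this? Converting the string to unicode raises a decode error. – Basil 2012-08-20 09:20:52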

+2

In the current version of pdfminer, the PDFDocument.initialize method has been removed. If you simply delete that line, this code works. – joshua 2014-11-05 22:07:24
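On the encoding question in the comments: PDF text strings are normally stored either as UTF-16BE with a leading byte order mark or in PDFDocEncoding, so decoding everything with one codec tends to fail on one of the two. A minimal sketch of a decoder follows. `decode_pdf_text` is a hypothetical helper, and Latin-1 is only an approximation of PDFDocEncoding (it matches for letters such as 'ü' and 'ä' but not for every code point). pdfminer's utils module appears to ship a `decode_text` helper for the same job, so check your installed version before rolling your own.

```python
def decode_pdf_text(raw):
    """Decode the raw bytes of a PDF text string to Unicode."""
    if raw.startswith(b'\xfe\xff'):
        # A leading byte order mark signals UTF-16BE.
        return raw[2:].decode('utf-16-be')
    # Otherwise assume PDFDocEncoding, approximated here by Latin-1.
    return raw.decode('latin-1')


# The same name stored both ways (made-up sample values).
print(decode_pdf_text(b'\xfe\xff\x00M\x00\xfc\x00l\x00l\x00e\x00r'))
print(decode_pdf_text(b'M\xfcller'))
```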

3

A quick and dirty two-minute job: just use PDFMiner to convert the PDF to XML, then grab all the fields.

from xml.etree import ElementTree 
from pprint import pprint 
import os 

def main():
    print("Calling PDFDUMP.py")
    os.system("dumppdf.py -a FILE.pdf > out.xml")

    # Preprocess the file to eliminate bad XML.
    print("Screening the file")
    o = open("output.xml", "w")  # open for writing (truncates any existing file)
    for line in open("out.xml"):
        line = line.replace("&#", "Invalid_XML")  # some bad data in xml for formatting info.
        o.write(line)
    o.close()

    print("Opening XML output")
    tree = ElementTree.parse('output.xml')
    lastnode = ""
    lastnode2 = ""
    fields = {}  # field name -> value (renamed from `list`, which shadowed the builtin)
    entry = {}

    for node in tree.iter():  # Run through the tree..
        # Check if New node
        if node.tag == "key" and node.text == "T":
            lastnode = node.tag + node.text
        elif lastnode == "keyT":
            for child in node.iter():
                entry["ID"] = child.text
            lastnode = ""

        if node.tag == "key" and node.text == "V":
            lastnode2 = node.tag + node.text
        elif lastnode2 == "keyV":
            for child in node.iter():
                if child.tag == "string":
                    if "ID" in entry:
                        entry["Value"] = child.text
                        fields[entry["ID"]] = entry["Value"]
                        entry = {}
            lastnode2 = ""

    pprint(fields)

if __name__ == '__main__': 
    main() 

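It isn't pretty, just a simple proof of concept. I need to implement this for a system I'm working on, so I'll be cleaning it up, but I thought I'd post it in case anyone finds it useful.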
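One possible cleanup, offered only as a sketch: rather than tracking the last seen node, pair each key with the value that follows it inside the same dict element. The sample XML below is hand-written to imitate the dict/key/value layout the script above relies on, the field names and values are made up, and `fields_from_xml` is a hypothetical helper. It only picks up values that are plain strings.

```python
from xml.etree import ElementTree

# Hand-written sample imitating the dump layout; not real dumppdf.py output.
SAMPLE = """
<pdf>
  <object id="1"><dict size="2">
    <key>T</key><value><string size="4">name</string></value>
    <key>V</key><value><string size="3">Ada</string></value>
  </dict></object>
  <object id="2"><dict size="2">
    <key>T</key><value><string size="4">city</string></value>
    <key>V</key><value><string size="6">London</string></value>
  </dict></object>
</pdf>
"""


def fields_from_xml(root):
    """Collect {field name: value} from every dict that has both T and V."""
    fields = {}
    for d in root.iter('dict'):
        entries = {}
        pending = None
        for child in d:
            if child.tag == 'key':
                pending = child.text
            elif child.tag == 'value' and pending is not None:
                # Pair this value with the key seen just before it.
                entries[pending] = child
                pending = None
        if 'T' in entries and 'V' in entries:
            name = entries['T'].findtext('string')
            value = entries['V'].findtext('string')
            if name is not None and value is not None:
                fields[name] = value
    return fields


print(fields_from_xml(ElementTree.fromstring(SAMPLE.strip())))
```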

3

Updated for the latest version of PDFMiner (changed the imports and the parser/document setup in the first function):

from argparse import ArgumentParser 
import pickle 
import pprint 
from pdfminer.pdfparser import PDFParser 
from pdfminer.pdfdocument import PDFDocument 
from pdfminer.pdftypes import resolve1 
from pdfminer.pdftypes import PDFObjRef 

def load_form(filename):
    """Load pdf form contents into a nested list of name/value tuples"""
    with open(filename, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        parser.set_document(doc)
        #doc.set_parser(parser)
        # A comment earlier in this thread reports that initialize() was removed
        # in later pdfminer releases; delete this line if it raises AttributeError.
        doc.initialize()
        return [load_fields(resolve1(f)) for f in
                resolve1(doc.catalog['AcroForm'])['Fields']]

def load_fields(field):
    """Recursively load form fields"""
    form = field.get('Kids', None)
    if form:
        return [load_fields(resolve1(f)) for f in form]
    else:
        # Some field types, like signatures, need extra resolving
        return (field.get('T').decode('utf-8'), resolve1(field.get('V')))

def parse_cli(): 
    """Load command line arguments""" 
    parser = ArgumentParser(description='Dump the form contents of a PDF.') 
    parser.add_argument('file', metavar='pdf_form', 
     help='PDF Form to dump the contents of') 
    parser.add_argument('-o', '--out', help='Write output to file', 
     default=None, metavar='FILE') 
    parser.add_argument('-p', '--pickle', action='store_true', default=False, 
     help='Format output for python consumption') 
    return parser.parse_args() 

def main():
    args = parse_cli()
    form = load_form(args.file)
    if args.out:
        with open(args.out, 'w') as outfile:
            if args.pickle:
                pickle.dump(form, outfile)
            else:
                pp = pprint.PrettyPrinter(indent=2)
                # was file.write(...), which does not refer to the opened handle
                outfile.write(pp.pformat(form))
    else:
        if args.pickle:
            print(pickle.dumps(form))
        else:
            pp = pprint.PrettyPrinter(indent=2)
            pp.pprint(form)

if __name__ == '__main__': 
    main() 
+0

Where do you put the filename so that the script can run? – user2067030 2016-12-22 15:31:16
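On the filename question: judging from parse_cli() above, the PDF path is the positional command-line argument, and -o/--out and -p/--pickle are optional. The sketch below rebuilds the same parser and feeds it an argument list to show where each value ends up. The script name dump_form.py and both file names are made up for illustration.

```python
from argparse import ArgumentParser


def build_parser():
    """Same arguments as parse_cli() above, minus the help strings."""
    parser = ArgumentParser(description='Dump the form contents of a PDF.')
    parser.add_argument('file', metavar='pdf_form')
    parser.add_argument('-o', '--out', default=None, metavar='FILE')
    parser.add_argument('-p', '--pickle', action='store_true', default=False)
    return parser


# Equivalent to running: python dump_form.py FormExample.pdf -o fields.txt
args = build_parser().parse_args(['FormExample.pdf', '-o', 'fields.txt'])
print(args.file, args.out, args.pickle)
```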

0

Watch out for a typo in the scripts above. This line:

file.write(pp.pformat(form))

should be:

outfile.write(pp.pformat(form)) 
3

Python 3.6+:

pip install PyPDF2

# -*- coding: utf-8 -*- 

from collections import OrderedDict 
from PyPDF2 import PdfFileWriter, PdfFileReader 


def _getFields(obj, tree=None, retval=None, fileobj=None):
    """
    Extracts field data if this PDF contains interactive form fields.
    The *tree* and *retval* parameters are for recursive use.

    :param fileobj: A file object (usually a text file) to write
        a report to on all interactive form fields found.
    :return: A dictionary where each key is a field name, and each
        value is a :class:`Field<PyPDF2.generic.Field>` object. By
        default, the mapping name is used for keys.
    :rtype: dict, or ``None`` if form data could not be located.
    """
    fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                       '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
    if retval is None:
        retval = OrderedDict()
        catalog = obj.trailer["/Root"]
        # get the AcroForm tree
        if "/AcroForm" in catalog:
            tree = catalog["/AcroForm"]
        else:
            return None
    if tree is None:
        return retval

    obj._checkKids(tree, retval, fileobj)
    for attr in fieldAttributes:
        if attr in tree:
            # Tree is a field
            obj._buildField(tree, retval, fileobj, fieldAttributes)
            break

    if "/Fields" in tree:
        fields = tree["/Fields"]
        for f in fields:
            field = f.getObject()
            obj._buildField(field, retval, fileobj, fieldAttributes)

    return retval


def get_form_fields(infile):
    infile = PdfFileReader(open(infile, 'rb'))
    # _getFields returns None when the PDF has no AcroForm; fall back to an empty dict
    fields = _getFields(infile) or {}
    return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())



if __name__ == '__main__': 
    from pprint import pprint 

    pdf_file_name = 'FormExample.pdf' 

    pprint(get_form_fields(pdf_file_name)) 
0

Python's PyPDF2 package (the successor to pyPdf) is very convenient:

import PyPDF2 
f = PyPDF2.PdfFileReader('form.pdf') 
ff = f.getFields() 

Then ff is a dict containing all the relevant form information.
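To reduce that dict to plain name/value pairs, something like the sketch below may be enough. It runs on ordinary dicts standing in for the field objects, with made-up data, and `field_values` is a hypothetical helper. It also covers the case noted in the docstring quoted earlier in this thread, where the result is None because no form data was found.

```python
def field_values(fields):
    """Map each field name to its '/V' entry; `fields` may be None."""
    if not fields:
        return {}
    return {name: field.get('/V') for name, field in fields.items()}


# Plain dicts standing in for the result of getFields() (made-up data).
ff = {
    'name': {'/FT': '/Tx', '/T': 'name', '/V': 'Ada'},
    'agree': {'/FT': '/Btn', '/T': 'agree'},
}
print(field_values(ff))
```

Fields that were left blank come back as None here, which keeps them distinguishable from fields that hold an empty string.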