2012-03-24 141 views
2

Python - fetch a URL, parse it and print the result as a PDF

I am trying to fetch the HTML source of a URL, parse it, and then print the result as a PDF.

For this I want to rely on BeautifulSoup, urllib2 and reportlab, but I don't know how to combine them correctly.

When I run the Django 1.3.1 dev server and access the view, I get this error: 'module' object is not callable

Here is my code:

from reportlab.pdfgen import canvas
from cStringIO import StringIO
from django.http import HttpResponse
from django.shortcuts import render_to_response
from django.template import RequestContext
# Fetching the URL
import urllib2

# Parsing the HTML
from BeautifulSoup import BeautifulSoup

# The ConverterForm
from django import forms

class ConverterForm(forms.Form):
    # Use a Textarea instead of the default TextInput.
    html_files = forms.CharField(widget=forms.Textarea)
    filename = forms.CharField()

# Create your views here.
def create_pdf(request):
    # If the form has been submitted
    if request.method == 'POST':
        # A form bound to the POST data
        form = ConverterForm(request.POST)
        # All validation rules pass
        if form.is_valid():
            # PDF creation process
            # Assign variables
            html_files = form.cleaned_data['html_files']
            filename = form.cleaned_data['filename']

            # Create the HttpResponse object with the appropriate PDF headers.
            response = HttpResponse(mimetype='application/pdf')
            # The use of attachment forces the Save as dialog to open.
            response['Content-Disposition'] = 'attachment; filename=%s.pdf' % filename

            buffer = StringIO()

            # Get the page source
            page = urllib2.urlopen(html_files)
            html = page.read()

            # Parse the page source
            soup = BeautifulSoup(html)

            # Create the PDF object, using the StringIO() object as its "file".
            p = canvas.Canvas(buffer)

            # Draw things on the PDF and generate the PDF.
            # See ReportLab documentation for full list of functions.
            p.drawString(100, 100, soup)

            # Close the PDF object cleanly.
            p.showPage()
            p.save()

            # Get the value of the StringIO buffer and write it to the response.
            pdf = buffer.getvalue()
            buffer.close()
            response.write(pdf)
            return response

    else:
        # An unbound form
        form = ConverterForm()

    # For RequestContext in relation to csrf see more here:
    # https://docs.djangoproject.com/en/1.3/intro/tutorial04/
    return render_to_response('converter/index.html', {
        'form': form,
    }, context_instance=RequestContext(request))

+1

Where do you get the error? Please show it in full. You have not even shown your whole code. – Marcin 2012-03-24 13:08:04

+0

I edited the code. Sorry, initially I thought the rest might not be relevant. Regards – orschiro 2012-03-24 13:13:10

+0

Your exact error is 'buffer = StringIO()', which should be 'buffer = StringIO.StringIO()', but I have provided a simpler solution as an answer. – 2012-03-24 14:27:57

Answers

1

Here is a simple way:

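import cStringIO as StringIO

import ho.pisa as pisa
import requests
from django.http import HttpResponse
from django.shortcuts import render

def pdf_maker(request):
    # Fetch the page; .text is the decoded (unicode) body
    browser = requests.get('http://www.google.com/')
    html = browser.text

    # In-memory buffers for the HTML input and the PDF output
    result = StringIO.StringIO()
    source = StringIO.StringIO(html.encode('UTF-8'))  # adjust as required

    # Convert the HTML to PDF
    pdf = pisa.pisaDocument(source, dest=result)

    if not pdf.err:
        response = HttpResponse(result.getvalue(), mimetype='application/pdf')
        response['Content-Disposition'] = 'attachment; filename=the_file.pdf'
        return response

    # Conversion failed: show an error page instead
    return render(request, 'error.html')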

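This uses requests and pisa. However, this approach (like other solutions of this kind) has some limitations. In particular, you will need to find a way to fetch and embed images yourself, because the PDF conversion process cannot load images directly from the Internet.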
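One possible way to handle the image limitation is to download each image yourself and inline it as a base64 data: URI before conversion. The following is only a minimal sketch under several assumptions: it is written for Python 3 (unlike the Python 2 code in this thread), `inline_images` and the `fetch` callable are hypothetical names introduced here, the regular expression only handles simple quoted `src` attributes, and whether your converter version accepts data: URIs should be verified. pisa also has a `link_callback` hook for resolving resource URIs, which may be the better fit.

```python
import base64
import re

# Matches <img ... src="URL"> with single or double quotes (simple cases only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=)(["\'])(.*?)\2', re.IGNORECASE)


def inline_images(html, fetch):
    """Replace each <img> URL with a base64 data: URI.

    `fetch` is a hypothetical callable: url -> (image_bytes, mime_type).
    """
    def replace(match):
        prefix, quote, url = match.groups()
        if url.startswith('data:'):
            return match.group(0)  # already embedded, leave untouched
        data, mime_type = fetch(url)
        encoded = base64.b64encode(data).decode('ascii')
        return '%s%sdata:%s;base64,%s%s' % (prefix, quote, mime_type, encoded, quote)

    return IMG_SRC.sub(replace, html)
```

In a real view, `fetch` could wrap `requests.get(url)` and return the response body together with its Content-Type header; the result of `inline_images` would then be handed to the converter in place of the raw HTML.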

+1

Thanks. But according to PyPI, pisa is no longer being developed. Does the XHTML2PDF library behave the same as pisa? – orschiro 2012-03-24 13:28:04

+0

Yes, almost exactly the same. – 2012-03-24 14:14:50

+0

Still, I would like to understand why my approach fails. – orschiro 2012-03-24 14:18:58
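Apart from the import error, one likely obstacle in the question's approach is that `p.drawString(100, 100, soup)` passes the parsed document object rather than a string, and `drawString` draws a single line of text without wrapping or interpreting markup. A rough sketch of the missing step (turning HTML into wrapped plain-text lines) might look like the following. It assumes Python 3 and uses the standard library's `html.parser` in place of BeautifulSoup to stay self-contained; `TextExtractor` and `html_to_lines` are hypothetical names.

```python
import textwrap
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_lines(html, width=80):
    """Return the page text as a list of lines no wider than `width`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return textwrap.wrap(' '.join(parser.chunks), width=width)
```

Each returned line could then be drawn with its own `drawString` call while decreasing the y coordinate, starting a new page with `showPage()` when the bottom margin is reached.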

4

You need to import the BeautifulSoup class:

from BeautifulSoup import BeautifulSoup 

This is easy to get wrong because the module and the class share the same name.
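The "'module' object is not callable" error can be reproduced with any module that contains a class of the same name. The sketch below uses the standard library's `datetime` as a stand-in, on the assumption that the situation is analogous to Python 2's `StringIO` module and the BeautifulSoup 3 `BeautifulSoup` module; the alias `DateTime` is introduced here only for illustration.

```python
import datetime  # binds the name `datetime` to the *module*

try:
    datetime(2012, 3, 24)  # calls the module itself, not the class inside it
except TypeError as exc:
    message = str(exc)  # contains "'module' object is not callable"

# Two equivalent ways to reach the class:
first = datetime.datetime(2012, 3, 24)          # qualify it with the module name
from datetime import datetime as DateTime       # or import the class directly
second = DateTime(2012, 3, 24)
```

By the same pattern, `import StringIO` requires `StringIO.StringIO()`, while `from cStringIO import StringIO` allows a plain `StringIO()` call.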

+0

Apparently not on my system. 'Could not import converter.views. Error was: No module named BeutifulSoup'. I use the ActivePython distribution and installed beautifulsoup via pypm. http://code.activestate.com/pypm/beautifulsoup/ – orschiro 2012-03-24 15:03:01

+2

@orschiro: Check the spelling. – jfs 2012-03-24 15:36:41

+0

That did not solve the problem. I updated the code above. The exact error output: http://dpaste.com/721558/ – orschiro 2012-03-26 14:40:28