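While there are numerous ways to handle PDF documents with Python, I find generating or editing HTML far easier and more reliable than trying to figure out the intricacies of the PDF format. Sure, there is the venerable ReportLab, and if HTML is not your cup of tea, I encourage you to look into that option. There is also PyPDF2. Or maybe PyPDF3? No, perhaps PyPDF4! Hmmm... see the problem? My best guess is PyPDF3, for what that is worth.
So many choices...
But there is an easy choice if you are comfortable with HTML.
Enter WeasyPrint. It takes HTML and CSS and converts them to a usable and potentially beautiful PDF document.
The code samples in this article can be accessed in the associated GitHub repo. Feel free to clone and adapt.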
github.com/bowmanjd/pyweasyprintdemo
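Installation
To install WeasyPrint, I recommend you first set up a virtual environment with the tool of your choice.
Then, installation is as simple as running something like the following in an activated virtual environment:
pip install weasyprint
Alternatives to the above, depending on your tooling: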
poetry add weasyprint
conda install -c conda-forge weasyprint
pipenv install weasyprint
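You get the idea.
If you only want the weasyprint command-line tool, you could even use pipx and install with pipx install weasyprint. While that would not make it very convenient to access as a Python library, if you just want to convert web pages to PDFs, that may be all you need.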
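A command-line tool (Python usage optional)
Once installed, the weasyprint command-line tool is available. You can convert an HTML file or a web page to PDF. For instance, you could try the following:
weasyprint "https://en.wikipedia.org/wiki/Python_(programming_language)" python.pdf
The above command will save a file python.pdf in the current working directory, converted from the HTML of the English Wikipedia article on the Python programming language. It isn't perfect, but hopefully it gives you an idea.
You don't have to specify a web address, of course. Local HTML files work fine, and they give you the necessary control over content and styling.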
weasyprint sample.html out/sample.pdf
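If you would rather drive the command-line tool from a script, a minimal Python sketch might look like the following. It assumes the weasyprint executable is on your PATH; the helper names build_command() and convert() are made up for this illustration and are not part of WeasyPrint.

```python
import shutil
import subprocess


def build_command(source, output):
    """Return the argument list for: weasyprint SOURCE OUTPUT."""
    return ["weasyprint", str(source), str(output)]


def convert(source, output):
    """Run the weasyprint command-line tool (assumed to be on PATH)."""
    if shutil.which("weasyprint") is None:
        raise RuntimeError("weasyprint command not found; is it installed?")
    # check=True raises CalledProcessError if weasyprint exits with an error.
    subprocess.run(build_command(source, output), check=True)
```

Calling convert("sample.html", "out/sample.pdf") should then behave like the command above, provided the out directory already exists.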
Feel free to download a sample.html and an associated sample.css stylesheet with the contents of this article.
CSS
body {font-family: sans-serif;}
code {
  font-family: monospace;
  background: #ccc;
  padding: 2px;
}
pre code {display: block;}
img {
  display: block;
  margin-left: auto;
  margin-right: auto;
  width: 90%;
}
@media print {
  a::after {content: "(" attr(href) ")";}
  pre {white-space: pre-wrap;}
  @page {
    margin: 0.75in;
    size: Letter;
    @top-right {content: counter(page);}
  }
  @page :first {
    @top-right {content: "";}
  }
}
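See the WeasyPrint docs for further examples and instructions regarding the standalone weasyprint command-line tool.
(https://weasyprint.readthedocs.io/en/latest/tutorial.html#as-a-standalone-program)
Using WeasyPrint as a Python library
The Python API for WeasyPrint is quite versatile. It can load HTML from file pointers, file names, or the text of the HTML itself.
Here is an example of a simple makepdf() function that accepts an HTML string and returns the binary PDF data.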
from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()
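The main workhorse here is the HTML class. When instantiating it, I found I needed to pass a base_url parameter in order for it to load images and other assets from relative URLs.
Using HTML and write_pdf(), not only will the HTML be parsed, but also the associated CSS, whether it is embedded in the head of the HTML (in a <style> tag) or included as a stylesheet (with a <link> tag).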
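As a rough sketch of both points, the hypothetical helpers below wrap an HTML fragment in a full document with its CSS embedded in a <style> tag, then render it with base_url pointing at the directory that relative asset paths should resolve against. The names wrap_html() and render() are mine, not WeasyPrint's, and WeasyPrint is imported inside render() only so that wrap_html() can be tried without it installed.

```python
from pathlib import Path


def wrap_html(body, css=""):
    """Wrap an HTML fragment in a document with CSS embedded in the head."""
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<style>{css}</style></head>"
        f"<body>{body}</body></html>"
    )


def render(html, asset_dir="."):
    """Render an HTML string to PDF bytes, resolving relative URLs from asset_dir."""
    from weasyprint import HTML  # imported lazily; requires WeasyPrint

    # The trailing slash makes relative URLs resolve inside asset_dir
    # rather than next to it.
    base_url = Path(asset_dir).resolve().as_uri() + "/"
    return HTML(string=html, base_url=base_url).write_pdf()
```

With this, render(wrap_html('<img src="logo.png">'), asset_dir="assets") would look for the (made-up) file assets/logo.png.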
I should note that HTML can load straight from files, and write_pdf() can write to a file, by specifying filenames or file pointers. See the docs for more detail.
(https://weasyprint.readthedocs.io/)
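A minimal sketch of the file-based variant, assuming the filename keyword of HTML and the target argument of write_pdf() behave as described in the WeasyPrint docs. The helpers default_output() and convert_file() are hypothetical names for this example.

```python
from pathlib import Path


def default_output(infile):
    """Suggest an output path: same name as the input, with a .pdf suffix."""
    return Path(infile).with_suffix(".pdf")


def convert_file(infile, outfile=None):
    """Read HTML from infile and write the PDF straight to outfile."""
    from weasyprint import HTML  # imported lazily; requires WeasyPrint

    outfile = Path(outfile) if outfile else default_output(infile)
    outfile.parent.mkdir(parents=True, exist_ok=True)
    # No intermediate bytes object: write_pdf() writes to the target path.
    HTML(filename=str(infile)).write_pdf(str(outfile))
    return outfile
```

Because the document is loaded by filename here, there is no HTML string or PDF bytes to shuttle around in your own code.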
Here is a more full-fledged version of the above, with primitive command-line handling added:
from pathlib import Path
import sys

from weasyprint import HTML


def makepdf(html):
    """Generate a PDF file from a string of HTML."""
    htmldoc = HTML(string=html, base_url="")
    return htmldoc.write_pdf()


def run():
    """Command runner."""
    infile = sys.argv[1]
    outfile = sys.argv[2]
    html = Path(infile).read_text()
    pdf = makepdf(html)
    Path(outfile).write_bytes(pdf)


if __name__ == "__main__":
    run()
You may download the above file directly, or browse the GitHub repo. (https://github.com/bowmanjd/pyweasyprintdemo)
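A note about Python types: the string parameter when instantiating HTML is a normal (Unicode) str, but makepdf() outputs bytes.
Assuming the above file is in your working directory as weasyprintdemo.py, and that a sample.html file and an out directory are also there, the following should work well: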
python weasyprintdemo.py sample.html out/sample.pdf
Try it out, then open out/sample.pdf with your PDF reader. Are we close?
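The sys.argv handling in the script is deliberately primitive. One possible refinement, sketched here with the standard library's argparse, adds usage help and an explicit encoding, and creates the output directory if needed. The --encoding option and the base_url choice are my own assumptions, not something the original script defines.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse command-line arguments: input HTML file and output PDF path."""
    parser = argparse.ArgumentParser(description="Generate a PDF from an HTML file.")
    parser.add_argument("infile", type=Path, help="path to the HTML source")
    parser.add_argument("outfile", type=Path, help="path of the PDF to write")
    parser.add_argument("--encoding", default="utf-8", help="encoding of the HTML file")
    return parser.parse_args(argv)


def run(argv=None):
    """Read HTML text (str), render it, and write the PDF (bytes)."""
    from weasyprint import HTML  # imported lazily; requires WeasyPrint

    args = parse_args(argv)
    html = args.infile.read_text(encoding=args.encoding)
    # Resolve relative assets against the directory containing the input file.
    base_url = args.infile.resolve().parent.as_uri() + "/"
    pdf = HTML(string=html, base_url=base_url).write_pdf()
    args.outfile.parent.mkdir(parents=True, exist_ok=True)
    args.outfile.write_bytes(pdf)
```

To use it as a script, call run() from an if __name__ == "__main__": guard, as in the original file.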
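Styling HTML for print
As is probably apparent, using WeasyPrint is easy. The real work with HTML to PDF conversion, however, is in the styling. Thankfully, CSS has pretty good support for printing.
Some useful CSS print resources:
- Various articles on CSS-Tricks: https://css-tricks.com/tag/print-stylesheet/
- A nice summary on flaviocopes: https://flaviocopes.com/css-printing/#print-css
- MDN web docs: https://developer.mozilla.org/en-US/docs/Web/Guide/Printing
This simple stylesheet demonstrates a few basic tricks: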
body {font-family: sans-serif;}
@media print {
  a::after {content: "(" attr(href) ")";}
  pre {white-space: pre-wrap;}
  @page {
    margin: 0.75in;
    size: Letter;
    @top-right {content: counter(page);}
  }
  @page :first {
    @top-right {content: "";}
  }
}
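First, use media queries. This allows you to use the same stylesheet for both print and screen, using @media print and @media screen respectively. In the example stylesheet, I assume that the defaults (such as seen in the body declaration) apply to all formats, and that @media print provides overrides. Alternatively, you could include separate stylesheets for print and screen, using the media attribute of the <link> tag.
Second, use @page CSS rules. While browser support is pretty abysmal in 2020, WeasyPrint does a pretty good job of supporting what you need. Note the margin and size adjustments above, and the page numbering, in which we first define a counter in the top-right, then override it with :first to make it blank on the first page only. In other words, page numbers only show from page 2 onward.
Also note the a::after trick to explicitly display the href attribute when printing. This is either clever or annoying, depending on your goals.
Another hint, not demonstrated above: within the @media print block, set display: none on any elements that don't need to be printed, and set background: none where you don't want backgrounds printed.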
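To illustrate that last hint, here is a small Python sketch that builds such a print-only stylesheet as a string and applies it at render time. The selectors (nav, .no-print, .hero) and both function names are invented for the example; the CSS class and the stylesheets argument to write_pdf() come from WeasyPrint's Python API.

```python
def print_overrides(hidden=("nav", ".no-print"), no_background=(".hero",)):
    """Build an @media print block hiding some elements and clearing backgrounds."""
    rules = []
    if hidden:
        rules.append("  " + ", ".join(hidden) + " { display: none; }")
    if no_background:
        rules.append("  " + ", ".join(no_background) + " { background: none; }")
    return "@media print {\n" + "\n".join(rules) + "\n}\n"


def render_with_overrides(html, css_text):
    """Render HTML to PDF bytes with an extra stylesheet applied."""
    from weasyprint import CSS, HTML  # imported lazily; requires WeasyPrint

    # Extra stylesheets are applied in addition to those the HTML references.
    return HTML(string=html).write_pdf(stylesheets=[CSS(string=css_text)])
```

Keeping the overrides in a separate stylesheet like this leaves the on-screen styling of the page untouched.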
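Django and Flask support
If you write Django or Flask apps, you may benefit from the convenience of the respective libraries for generating PDFs within these frameworks:
- django-weasyprint provides a WeasyTemplateView view base class or a WeasyTemplateResponseMixin mixin on a TemplateView.
- Flask-WeasyPrint provides a special HTML class that works just like WeasyPrint's, but respects Flask routes and WSGI. Also provided is a render_pdf function that can be called on a template or on the url_for() of another view, setting the correct MIME type.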
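Generate HTML the way you like
WeasyPrint encourages the developer to make HTML and CSS, and the PDF just happens. If that fits your skill set, then you may enjoy experimenting with and utilizing this library.
How you generate HTML is entirely up to you. You might:
- Write HTML from scratch, and use Jinja templates for variables and logic.
- Write Markdown and convert it to HTML with cmarkgfm or another CommonMark implementation.
- Generate HTML Pythonically, with Dominate or lxml's E factory.
- Parse, modify, and prettify your HTML (or HTML written by others) with BeautifulSoup.
Then generate the PDF using WeasyPrint.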
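As one small illustration of the first approach, the sketch below fills a template with values before handing the result to makepdf(). To stay dependency-free it uses the standard library's string.Template as a stand-in for Jinja, so it only handles simple variable substitution, with no logic; the template text and field names are invented for this example.

```python
from html import escape
from string import Template

# A tiny page template; $title and $message are placeholders.
PAGE = Template(
    "<!DOCTYPE html>\n"
    "<html><head><title>$title</title></head>\n"
    "<body><h1>$title</h1><p>$message</p></body></html>"
)


def build_page(title, message):
    """Fill the template, escaping values so they are safe to embed in HTML."""
    return PAGE.substitute(title=escape(title), message=escape(message))


html = build_page("Invoice <42>", "Thanks & see you soon")
# The resulting string can be passed to makepdf(html) from the earlier script.
```

Jinja would add loops, conditionals, and autoescaping on top of this, but the hand-off to WeasyPrint stays the same: a str of HTML in, bytes of PDF out.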
Anything I missed? Feel free to leave comments!
Article source: https://www.toymoban.com/diary/python/309.html