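Notes on Pandoc and LaTeX

Markdown and LaTeX are both old friends. This post works through a few simple examples that turn Markdown files into a book via LaTeX, as a way of getting familiar with pandoc.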

pandoc book

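About Pandoc

Pandoc does far too much to cover here; this post only cares about its Markdown-to-LaTeX conversion.

Pandoc is a widely used document converter that translates between markup languages and document formats. It was developed by John MacFarlane, a professor of philosophy at the University of California, Berkeley.
The first version of Pandoc was released in 2006, with an initial focus on converting Markdown to LaTeX, HTML, and DocBook.
Pandoc's design is built around a shared abstract syntax tree (AST): a reader for each input format parses the source into the AST, and a writer for each output format generates the target from it. Supporting a new format therefore only takes a new reader or writer, not a converter for every pair of formats.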
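
To make the reader/AST/writer idea concrete, here is a toy sketch in Python. It is not how pandoc is implemented (pandoc is written in Haskell and its AST is far richer); the node shapes and the function names read_markdown, write_latex, write_html, and convert are made up for illustration.

```python
def read_markdown(text):
    """Toy reader: '# ' lines become headers, other non-blank lines paragraphs."""
    ast = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# "):
            ast.append({"t": "Header", "text": line[2:]})
        else:
            ast.append({"t": "Para", "text": line})
    return ast


def write_latex(ast):
    """Toy writer: render the shared AST as LaTeX."""
    parts = []
    for node in ast:
        if node["t"] == "Header":
            parts.append("\\section{" + node["text"] + "}")
        else:
            parts.append(node["text"])
    return "\n\n".join(parts)


def write_html(ast):
    """Toy writer: render the same AST as HTML."""
    tags = {"Header": "h1", "Para": "p"}
    return "\n".join(
        "<{0}>{1}</{0}>".format(tags[node["t"]], node["text"]) for node in ast
    )


# One table of readers, one table of writers; any pair can be combined.
READERS = {"markdown": read_markdown}
WRITERS = {"latex": write_latex, "html": write_html}


def convert(text, source, target):
    return WRITERS[target](READERS[source](text))


print(convert("# Hello\n\nSome text.", "markdown", "latex"))
```

Adding a third output format here means writing one more writer function; the reader does not change.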

Command line

Let's take a quick tour by example: how to turn a Markdown file containing Chinese into LaTeX, and from there into a PDF.

Example 1

To convert hello.md into hello.tex, run:

pandoc -f markdown -t latex hello.md -o hello.tex

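where:

-f, --from: input format
-t, --to: output format
-o, --output: output file name

Pandoc can actually guess the formats from the file extensions, so the command above can be shortened to: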

pandoc hello.md -o hello.tex
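
The guessing step can be pictured as a lookup from file extension to format name. The sketch below is only an illustration: EXTENSION_FORMATS is a small hypothetical table (pandoc's real one covers many more extensions), and guess_format and explicit_command are made-up helper names. The fallbacks, markdown for input and html for output, follow pandoc's documented defaults.

```python
import os

# Hypothetical, partial table; pandoc knows many more extensions.
EXTENSION_FORMATS = {
    ".md": "markdown",
    ".markdown": "markdown",
    ".tex": "latex",
    ".html": "html",
    ".epub": "epub",
    ".docx": "docx",
}


def guess_format(filename, default):
    """Map a file name to a format name, falling back to a default."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_FORMATS.get(extension, default)


def explicit_command(src, dst):
    """Spell out the -f/-t options that the short command leaves implicit."""
    return [
        "pandoc",
        "-f", guess_format(src, "markdown"),
        "-t", guess_format(dst, "html"),
        src,
        "-o", dst,
    ]


print(" ".join(explicit_command("hello.md", "hello.tex")))
```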

Note: pandoc supports a dizzying number of Markdown flavors!

markdown: Pandoc's Markdown
markdown_mmd: MultiMarkdown
markdown_phpextra: PHP Markdown Extra
markdown_strict: the original Markdown
gfm: GitHub-Flavored Markdown
...

The input file hello.md contains:

# Hello 1+1=2

我是测试文本。

我是公式：

$$
E=mc^2
$$

The output file hello.tex contains:

\section{Hello 1+1=2}\label{hello-112}

我是测试文本。

我是公式：

\[
E=mc^2
\]

The conversion worked, although the output is only a TeX fragment, not a complete document.
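
The \label{hello-112} in the output comes from pandoc's auto_identifiers extension, which derives an identifier from the heading text. The Python sketch below follows the steps listed in the pandoc manual as I understand them; it works on plain heading text only (it does not strip formatting, links, or footnotes), and the function name auto_identifier is my own.

```python
def auto_identifier(heading):
    """Approximate pandoc's automatic heading identifier for plain text."""
    # Keep letters, digits, underscores, hyphens, periods, and whitespace.
    kept = "".join(
        ch for ch in heading if ch.isalnum() or ch in "_-." or ch.isspace()
    )
    # Replace runs of whitespace with single hyphens, then lowercase.
    identifier = "-".join(kept.split()).lower()
    # Drop everything before the first letter.
    for index, ch in enumerate(identifier):
        if ch.isalpha():
            return identifier[index:]
    # Nothing usable is left: fall back to the generic identifier.
    return "section"


print(auto_identifier("Hello 1+1=2"))
```

For the heading in hello.md, the + and = are dropped and the space becomes a hyphen, which gives the hello-112 seen in the generated \label.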

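Example 2

To generate a complete tex file, we need one more command-line option:

-s, --standalone: produce a document with header and footer (a complete tex, html, etc. file)

So we run: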

pandoc hello.md -s -o hello.tex

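Now the tex file is complete, and we can call LaTeX to produce a PDF:

xelatex hello.tex

A PDF does come out, but because of the Chinese text it is not what we expect: the Chinese text is missing, since no Chinese font is set up.

Before compiling, we have to manually change the document class in the .tex file from article to ctexart: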

%\documentclass[]{article}
\documentclass[]{ctexart}

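Example 3

Earlier we enabled -s and pandoc added the header and footer of the tex file. How does it do that?

It uses a template. For LaTeX, the default template begins like this: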

$passoptions.latex()$
\documentclass[
$if(fontsize)$
  $fontsize$,
$endif$
$if(papersize)$
  $papersize$paper,
$endif$
$for(classoption)$
  $classoption$$sep$,
$endfor$
]{$documentclass$}
$if(beamerarticle)$
\usepackage{beamerarticle} % needs to be loaded first
$endif$
\usepackage{xcolor}
$.......

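The template can be printed with:

pandoc -D latex

where the command-line option

-D, --print-default-template prints the default template for the given format.

Look at the template above: it clearly has a variable we care about, $documentclass$. In the previous example we already found that its default value does not suit us. So how do we change it?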

pandoc hello.md -s -V documentclass=ctexart -o hello.tex
xelatex hello.tex

This directly produces a LaTeX file we can use.

Template variables are defined with the (uppercase) -V option:

-V, --variable: define a template variable
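
To see what setting a variable does to a template, here is a deliberately tiny stand-in for the template engine, written in Python. It only understands $name$ and non-nested $if(name)$ ... $endif$, which is a small subset of pandoc's template language; the render function and the shortened TEMPLATE are my own illustration, not pandoc code.

```python
import re

# A shortened version of the top of the default LaTeX template.
TEMPLATE = "\n".join([
    r"\documentclass[",
    "$if(fontsize)$",
    "  $fontsize$,",
    "$endif$",
    "]{$documentclass$}",
])


def render(template, variables):
    """Expand $if(x)$...$endif$ blocks, then plain $x$ variables."""
    def conditional(match):
        # Keep the block body only if the variable is set and non-empty.
        return match.group(2) if variables.get(match.group(1)) else ""

    out = re.sub(
        r"\$if\((\w+)\)\$\n?(.*?)\$endif\$\n?", conditional, template, flags=re.S
    )
    # Unset variables expand to the empty string.
    return re.sub(r"\$(\w+)\$", lambda m: str(variables.get(m.group(1), "")), out)


print(render(TEMPLATE, {"documentclass": "ctexart"}))
print(render(TEMPLATE, {"documentclass": "ctexart", "fontsize": "12pt"}))
```

The first call mirrors -V documentclass=ctexart: the fontsize block disappears because that variable is not set, and the class name lands inside the braces.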

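Example 4

The document class is really part of the document, though. Is there a way to specify it other than on the command line?

Fortunately, Pandoc's Markdown supports YAML front matter (a metadata block). With it, the Markdown file above can be written as: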

---
documentclass: ctexart
---

# Hello 1+1=2

我是测试文本。

我是公式：

$$
E=mc^2
$$

The front matter now defines the template variable, so the following commands work without -V:

pandoc hello.md -s -o hello.tex
xelatex hello.tex

Example 5

The examples work. But if pandoc is so good at converting between formats directly, can it produce the PDF directly?

Let's try:

pandoc hello.md -o hello.pdf

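This works for our example. However, the Markdown contains Chinese, and pdflatex, the engine pandoc uses by default, handles Chinese poorly. If it fails, how do we switch to the xelatex we used earlier?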

pandoc hello.md --pdf-engine xelatex -o hello.pdf

The --pdf-engine option selects a specific LaTeX engine. Problem solved!
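
When a build script has to run on machines with different TeX installations, the engine can be chosen at run time. The sketch below is one possible approach, not something pandoc does itself: pick_pdf_engine and pdf_command are hypothetical helpers, and the preference order is my own assumption (Unicode-aware engines first).

```python
import shutil

# Assumed preference order: engines with native Unicode support come first.
PREFERRED_ENGINES = ["xelatex", "lualatex", "pdflatex"]


def pick_pdf_engine(which=shutil.which):
    """Return the first preferred engine found on PATH, or None."""
    for engine in PREFERRED_ENGINES:
        if which(engine):
            return engine
    return None


def pdf_command(src, dst, engine):
    """Build the pandoc command line for a direct Markdown-to-PDF run."""
    return ["pandoc", src, "--pdf-engine", engine, "-o", dst]


engine = pick_pdf_engine()
if engine is None:
    print("No LaTeX engine found on PATH")
else:
    print(" ".join(pdf_command("hello.md", "hello.pdf", engine)))
```

The which parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which is what searches PATH.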

Example 6

If we don't want to add YAML at the top of the Markdown file, the metadata can go into a separate file, say hello.yaml:

---
documentclass: ctexart
---

Then we only need:

pandoc hello.md --metadata-file=hello.yaml --pdf-engine xelatex -o hello.pdf

or

pandoc hello.md hello.yaml --pdf-engine xelatex -s -o hello.pdf

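In the second form hello.yaml is simply passed as another input file: pandoc concatenates its inputs, so the YAML block is read as if it were part of the Markdown. Either way, this file should come in handy.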
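
Once metadata can come from several places, it helps to know which one wins. As I read the pandoc manual, values from --metadata-file are overridden by metadata inside the document, which in turn is overridden by -M/--metadata on the command line. The Python sketch below models only that lookup order; effective_metadata is a made-up name and the dictionaries stand in for already-parsed YAML.

```python
from collections import ChainMap


def effective_metadata(metadata_file, document, command_line):
    """Merge three metadata sources; sources listed first take precedence."""
    # ChainMap looks keys up left to right, so earlier maps win.
    return dict(ChainMap(command_line, document, metadata_file))


merged = effective_metadata(
    metadata_file={"documentclass": "ctexart", "fontsize": "12pt"},
    document={"documentclass": "ctexbook"},
    command_line={},
)
print(merged)
```

Here the front matter's documentclass overrides the one from the metadata file, while fontsize, which only the file defines, is kept.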

Templates

Replacing the default template is simple: we only need to write our own.

Example 8

A simple template doesn't need to be fancy:

\documentclass{ctexart}
\begin{document}
$body$
\end{document}

Save it as debao_template.tex and specify it on the command line:

pandoc hello.md -s --template=debao_template.tex -o hello.tex
xelatex hello.tex

or go straight to PDF in one step:

pandoc hello.md -s --template=debao_template.tex --pdf-engine xelatex -o hello.pdf

Variables

Template variables are tied directly to the template. For example, given the following YAML:

---
title: The document title
author:
- name: Author One
  affiliation: University of Somewhere
- name: Author Two
  affiliation: University of Nowhere
...

the template file can use:

$for(author)$
$if(author.name)$
$author.name$$if(author.affiliation)$ ($author.affiliation$)$endif$
$else$
$author$
$endif$
$endfor$
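
For readers more at home in Python than in the template language, the loop above does roughly what the following function does. It is only an analogy under my own naming (render_authors), and it does not reproduce the template engine's exact whitespace handling.

```python
def render_authors(authors):
    """Mimic the $for(author)$ loop: one output line per author."""
    lines = []
    for author in authors:
        if isinstance(author, dict) and author.get("name"):
            # $if(author.name)$ branch: name, plus affiliation when present.
            line = author["name"]
            if author.get("affiliation"):
                line += " (" + author["affiliation"] + ")"
        else:
            # $else$ branch: the author is a plain string.
            line = str(author)
        lines.append(line)
    return "\n".join(lines)


authors = [
    {"name": "Author One", "affiliation": "University of Somewhere"},
    {"name": "Author Two", "affiliation": "University of Nowhere"},
]
print(render_authors(authors))
```

The $else$ branch is what lets the same template accept both the structured form shown in the YAML and a plain list of author names.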

For the template's conditionals, loops, and the variables defined by default, consult the manual.

Starting to write a book

Suppose a book has 10 chapters, each in its own Markdown file.

Example 1

To simulate this, a Python script can generate the 10 Markdown files:

import os

# Number of chapters to generate
num_chapters = 10

output_dir = "chapters"

# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

# Generate one Markdown file per chapter
for i in range(1, num_chapters + 1):
    # Zero-padded numbers (01, 02, ...) keep the files sorted in chapter order
    filename = f"{output_dir}/chapter{i:02}.md"

    with open(filename, "w", encoding="utf-8") as f:
        # The sample text stays in Chinese so that the ctexbook setup is exercised
        f.write(f"# 第{i}章: 章节标题\n\n")

        f.write(f"这是第 {i} 章的内容。\n")
        f.write("\n## 小节 1\n")
        f.write(f"这是第 {i} 章的小节 1 的内容。\n")
        f.write("\n## 小节 2\n")
        f.write(f"这是第 {i} 章的小节 2 的内容。\n")

    print(f"Generated {filename}")

print(f"\nGenerated {num_chapters} chapter Markdown files.")

Then pandoc can generate the PDF directly:

pandoc chapters/chapter01.md chapters/chapter02.md chapters/chapter03.md chapters/chapter04.md chapters/chapter05.md chapters/chapter06.md chapters/chapter07.md chapters/chapter08.md chapters/chapter09.md chapters/chapter10.md -V documentclass=ctexbook --pdf-engine xelatex -o book.pdf

This works, but the pile of file names is a nuisance.

Example 2

Put the file names in a text file, say booklist.txt:

chapters/chapter01.md
chapters/chapter02.md
chapters/chapter03.md
chapters/chapter04.md
chapters/chapter05.md
chapters/chapter06.md
chapters/chapter07.md
chapters/chapter08.md
chapters/chapter09.md
chapters/chapter10.md

The command is then much cleaner:

pandoc $(cat booklist.txt) -o book.pdf --pdf-engine=xelatex -V documentclass=ctexbook
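
booklist.txt does not have to be maintained by hand. Assuming the chapter files use zero-padded names like chapter01.md, so that alphabetical order equals reading order, a few lines of Python can regenerate it. The function name write_booklist is my own.

```python
import glob
import os


def write_booklist(chapter_dir="chapters", booklist="booklist.txt"):
    """Write the sorted list of chapter files, one path per line."""
    paths = sorted(glob.glob(os.path.join(chapter_dir, "*.md")))
    # Use forward slashes so the list reads the same on every platform.
    lines = [path.replace(os.sep, "/") for path in paths]
    with open(booklist, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return lines


if os.path.isdir("chapters"):
    for line in write_booklist():
        print(line)
```

Without the zero padding, chapter10.md would sort before chapter2.md, which is why the generator script pads the chapter numbers.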

Example 3

We can also write a YAML file, metadata.yaml, to hold the template variables:

---
documentclass: ctexbook
title: "我的书籍"
author: "1+1=10"
date: "2024-10-29"
fontsize: 12pt
geometry: margin=1in
toc: true
numbersections: true
---

With that in place (the -V documentclass=ctexbook flag is now redundant but harmless):

pandoc --metadata-file=metadata.yaml $(cat booklist.txt) -o book.pdf --pdf-engine=xelatex -V documentclass=ctexbook

Extending

Extension

https://pandoc.org/chunkedhtml-demo/8.21-non-default-extensions.html
https://pandoc.org/chunkedhtml-demo/7.3-math-input.html

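An extension is enabled by appending +EXTENSION to the format name and disabled by appending -EXTENSION. For example:

--from markdown_strict+footnotes is strict Markdown with footnotes enabled, while
--from markdown-footnotes-pipe_tables is Pandoc's Markdown without footnotes or pipe tables.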
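
In a build script it can be tidier to assemble such format strings from lists than to concatenate them by hand. A minimal sketch follows; format_spec and parse_format_spec are hypothetical helpers, and the parser relies on my assumption that format and extension names contain no + or - themselves.

```python
import re


def format_spec(base, enable=(), disable=()):
    """Build a pandoc format string such as 'markdown-footnotes-pipe_tables'."""
    return (
        base
        + "".join("+" + name for name in enable)
        + "".join("-" + name for name in disable)
    )


def parse_format_spec(spec):
    """Split a format string into (base, enabled, disabled)."""
    parts = re.split(r"([+-])", spec)
    signs, names = parts[1::2], parts[2::2]
    enabled = [name for sign, name in zip(signs, names) if sign == "+"]
    disabled = [name for sign, name in zip(signs, names) if sign == "-"]
    return parts[0], enabled, disabled


print(format_spec("markdown_strict", enable=["footnotes"]))
print(parse_format_spec("markdown-footnotes-pipe_tables"))
```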

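The list of available extensions can be printed with

pandoc --list-extensions=markdown

or

pandoc --list-extensions=gfm

Each line of the output starts with + or -, showing whether the extension is enabled by default for that format: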

-abbreviations
+all_symbols_escapable
-angle_brackets_escapable
-ascii_identifiers
+auto_identifiers
-autolink_bare_uris
+backtick_code_blocks
+blank_before_blockquote
+blank_before_header
+bracketed_spans
+citations
-compact_definition_lists
+definition_lists
-east_asian_line_breaks
-emoji
+escaped_line_breaks
+example_lists
+fancy_lists
+fenced_code_attributes
+fenced_code_blocks
+fenced_divs
+footnotes
-four_space_rule
...
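
A listing like the one above is easy to turn into a lookup table, for example to check whether an extension is on by default before adding +EXTENSION to a command. The sketch below parses the text format shown; parse_extension_list and list_extensions are my own names, and list_extensions assumes pandoc is installed and on PATH.

```python
import subprocess


def parse_extension_list(output):
    """Map each extension name to True (+, on by default) or False (-)."""
    defaults = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and anything that is not a +name / -name entry.
        if len(line) > 1 and line[0] in "+-":
            defaults[line[1:]] = line[0] == "+"
    return defaults


def list_extensions(fmt="markdown"):
    """Ask pandoc for the extension list of a format (requires pandoc)."""
    result = subprocess.run(
        ["pandoc", "--list-extensions=" + fmt],
        capture_output=True, text=True, check=True,
    )
    return parse_extension_list(result.stdout)


sample = "-abbreviations\n+all_symbols_escapable\n+footnotes\n-four_space_rule\n...\n"
print(parse_extension_list(sample))
```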

Filter

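Pandoc filters operate on Pandoc's abstract syntax tree (AST) during conversion. This means we can modify the document's structure, content, and formatting: adding, removing, or rearranging elements.

Filters can be written in various scripting languages; the interpreter is chosen by file extension:

File extension	Interpreter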
.py	python
.hs	runhaskell
.pl	perl
.rb	ruby
.php	php
.js	node
.r	Rscript

For filters written in Python, use pandocfilters; see:

https://pypi.org/project/pandocfilters/
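
To show what a filter actually does, here is a minimal sketch that uses only the standard library instead of pandocfilters. A JSON filter receives the AST as JSON on standard input and writes the modified AST to standard output; each element is a JSON object whose "t" key names its type and whose "c" key holds its contents. The helper names walk, caps, and run_filter are my own, and the sample document (including its API version) is hand-written for illustration.

```python
import json
import sys


def walk(node, action):
    """Apply action to every AST element (a dict with a "t" key), top down."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replacement = action(node)
            if replacement is not None:
                return replacement
        return {key: walk(value, action) for key, value in node.items()}
    return node


def caps(element):
    """Upper-case plain text; returning None leaves an element unchanged."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return None


def run_filter(stdin=sys.stdin, stdout=sys.stdout):
    """Read a JSON AST from stdin, transform it, and write it to stdout."""
    document = json.load(stdin)
    json.dump(walk(document, caps), stdout)


# Hand-written sample AST for a paragraph containing "hello world".
sample = {
    "pandoc-api-version": [1, 23],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "hello"}, {"t": "Space"}, {"t": "Str", "c": "world"},
    ]}],
}
print(json.dumps(walk(sample, caps)["blocks"]))
```

To use it as a real filter, end the script with a call to run_filter() instead of the sample, save it as, say, caps.py, and pass it with pandoc hello.md --filter caps.py -o hello.tex. The pandocfilters package wraps the same pattern in its toJSONFilter helper.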

Going further

Writing something serious and book-like takes more learning.

File and code layout

my-book/                  # project root
├── chapters/             # Markdown files, one per chapter
│   ├── 01-introduction.md
│   ├── 02-chapter1.md
│   ├── 03-chapter2.md
│   └── ...               # more chapters
├── images/               # image assets
│   └── cover.png         # book cover image
├── templates/            # custom templates and style files
│   ├── custom-template.tex   # custom LaTeX template
│   └── custom-style.css      # custom CSS stylesheet
├── metadata.yaml         # metadata file: global book information such as title and author
├── booklist.txt          # list of all chapter files
├── Makefile              # Makefile for building the book (optional)
└── README.md             # project README

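We have already seen most of these files; the Makefile simplifies the pandoc invocations.

Makefile

Writing a really good Makefile is not easy.

But we can stick to the simplest, most naive rules (ignoring build dependencies, cleanup, and so on):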

# These targets are commands, not files
.PHONY: all pdf epub html

# Build all formats
all: pdf epub html

# Build the PDF (recipe lines must start with a tab; $$ passes a literal $ to the shell)
pdf:
	pandoc --metadata-file=metadata.yaml --toc --number-sections --template=templates/custom-template.tex --pdf-engine=xelatex -o mybook.pdf $$(cat booklist.txt)

# Build the EPUB
epub:
	pandoc --metadata-file=metadata.yaml --toc --number-sections --css=templates/custom-style.css --epub-cover-image=images/cover.png -o mybook.epub $$(cat booklist.txt)

# Build the HTML (-s is needed for --toc and --css to take effect)
html:
	pandoc -s --metadata-file=metadata.yaml --toc --number-sections --css=templates/custom-style.css -o mybook.html $$(cat booklist.txt)

Or just use Python:

import os
import subprocess

metadata_file = "metadata.yaml"
booklist_file = "booklist.txt"
output_dir = "build"
output_pdf = os.path.join(output_dir, "mybook.pdf")
output_epub = os.path.join(output_dir, "mybook.epub")
output_html = os.path.join(output_dir, "mybook.html")
cover_image = "images/cover.png"
latex_template = "templates/pdf.latex"
css_file = "templates/style.css"
pdf_engine = "xelatex"
md_format = "markdown+tex_math_single_backslash"

with open(booklist_file, encoding="utf-8") as f:
    # One chapter path per line; skip blank lines
    booklist = [line.strip() for line in f if line.strip()]

command_comm = [
        "pandoc",
        "--metadata-file", metadata_file,
        "--from", md_format,
        "--toc",
        "--number-sections",
    ] + booklist

def check_file_exists(file_path):
    if not os.path.exists(file_path):
        print(f"Error: Cannot find {file_path}")
        return False
    return True

def create_output_directory():
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        print(f"Build directory: {output_dir}")

def generate_pdf():
    if not check_file_exists(latex_template):
        return
    pdf_command = command_comm + [
        "--template", latex_template,
        "--pdf-engine", pdf_engine,
        "-o", output_pdf,
    ]
    print(f"Generate PDF: {output_pdf}")
    try:
        subprocess.run(pdf_command, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Error generating PDF: {e}")

def generate_epub():
    epub_command = command_comm + [
        "--css", css_file,
        "--epub-cover-image", cover_image,
        "-o", output_epub,
    ]
    print(f"Generate EPUB: {output_epub}")
    try:
        subprocess.run(epub_command, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Error generating EPUB: {e}")

def generate_html():
    html_command = command_comm + [
        "--standalone",  # needed for --toc and --css to take effect in HTML output
        "--css", css_file,
        "-o", output_html,
    ]
    print(f"Generate HTML: {output_html}")
    try:
        subprocess.run(html_command, check=True)
    except subprocess.CalledProcessError as e:
        print(f"Error generating HTML: {e}")

def main():
    print("..start...")
    create_output_directory()
    generate_pdf()
    generate_epub()
    generate_html()
    print("..finished...")

if __name__ == "__main__":
    main()
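
The script above rebuilds everything on every run. If that becomes slow, one simple way to approximate the dependency tracking a Makefile would give is to compare modification times. The sketch below is an assumption about how that could be wired in, not part of the original script; needs_rebuild is a hypothetical helper.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(source) > target_time for source in sources)


# Example: decide whether a PDF is stale relative to an existing chapter list.
chapters = [path for path in ["chapters/chapter01.md"] if os.path.exists(path)]
print(needs_rebuild("build/mybook.pdf", chapters))
```

In the build script, generate_pdf() could then be skipped when needs_rebuild(output_pdf, booklist + [metadata_file, latex_template]) is false, so that editing any chapter, the metadata, or the template triggers a rebuild.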

Starting from a ready-made template might be an even better way to begin...

1+1=10

Just jotting down some notes and unwinding...