Notes on Using Python Markdown

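After I enabled the markdown-katex extension for python-markdown under Pelican (getpelican), Markdown-to-HTML conversion became painfully slow. On Windows a build went from 10 seconds to 30 minutes. On Ubuntu under GitHub Actions it was less dramatic, but the workflow still went from under 1 minute to a little over 4 minutes. Where does the problem come from? Let's start from the basic usage of python-markdown...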

python-markdown is a Python library that converts Markdown text to HTML. It offers a simple yet flexible way to process Markdown-formatted text and turn it into HTML that can be displayed on a web page.

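Note: python-markdown is one of the older Markdown parsers. It follows the original Markdown syntax, so it does not fully conform to the CommonMark standard. In Python, parsers that target CommonMark include markdown-it-py (CommonMark-compliant) and mistune (which aims to be compatible with sane CommonMark rules).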

Using python-markdown
markdown-katex source code?
Miscellaneous: using VS Code
- Formula delimiters
References

Using python-markdown

Installation

`pip install markdown`

Basic usage

Pass the Markdown text to markdown.markdown() and it returns the HTML.

import markdown

md_text = """
# This is a heading

This is a paragraph with **bold** and *italic* text.

- List item 1
- List item 2
"""

# Convert to HTML
html = markdown.markdown(md_text)

print(html)

Output:

<h1>This is a heading</h1>
<p>This is a paragraph with <strong>bold</strong> and <em>italic</em> text.</p>
<ul>
<li>List item 1</li>
<li>List item 2</li>
</ul>

If the Markdown content lives in a file, say a.md, a short script does the job (and is quite flexible):

import markdown
import pathlib
import argparse

# Parse command line
parser = argparse.ArgumentParser(description="Convert a Markdown file to HTML.")
parser.add_argument(
    "md_file",
    nargs="?",
    default="a.md",
    help="The Markdown file to convert. Defaults to 'a.md'."
)
args = parser.parse_args()
md_name = args.md_file

# Read the md file
md_path = pathlib.Path(md_name)
md_text = md_path.read_text(encoding='utf-8')

# Convert .md to .html
html_text = markdown.markdown(md_text)

# Write to html file
html_path = md_path.with_suffix('.html')
html_path.write_text(html_text, encoding='utf-8')

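python-markdown also ships with its own command-line interface:

python -m markdown a.md -f b.html -e utf-8

or

markdown_py a.md -f b.html -e utf-8

If the .md file contains non-ASCII characters, it is important to specify the output file name and the encoding explicitly. That is also why shell output redirection is not used here: it gives no control over the encoding.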

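Using extensions (1): a single built-in extension

Python-Markdown is built around extensions, which makes it very flexible to use.

For example, suppose the Markdown from the earlier example contains a table like this: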

First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

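In the generated HTML the table is treated as plain text and is not converted to an HTML table. To get a table, the call has to be written as follows ('tables' is the registered entry-point name, normally declared in the package's setup.py).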

html = markdown.markdown(md_text, extensions=['tables'])

Or pass an extension instance (for third-party extensions, this form does not require a registered entry point):

from markdown.extensions.tables import TableExtension
html = markdown.markdown(md_text, extensions=[TableExtension()])

It can also be written as a "module:Class" string (again, no registration needed):

html = markdown.markdown(md_text, extensions=["markdown.extensions.tables:TableExtension"])

If the part after ':' is omitted, it becomes:

html = markdown.markdown(md_text, extensions=['markdown.extensions.tables'])

This works only if the extension module implements a module-level makeExtension(**kwargs) function.
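The difference between the two string forms can be sketched with the standard library alone. This is a simplified illustration of the lookup idea, not python-markdown's actual code (which, as far as I understand, also checks registered entry points first); `resolve_extension` is a hypothetical helper name.

```python
import importlib


def resolve_extension(spec, **configs):
    """Simplified illustration of the 'module:Class' vs. 'module' lookup."""
    if ":" in spec:
        # "package.module:ClassName": import the module, instantiate the class.
        module_name, class_name = spec.split(":", 1)
        module = importlib.import_module(module_name)
        return getattr(module, class_name)(**configs)
    # "package.module": the module itself must provide makeExtension().
    module = importlib.import_module(spec)
    factory = getattr(module, "makeExtension", None)
    if factory is None:
        raise AttributeError(f"module {spec!r} does not define makeExtension()")
    return factory(**configs)


if __name__ == "__main__":
    # A stdlib class stands in for an extension class here.
    print(resolve_extension("collections:OrderedDict", a=1))
```

Calling `resolve_extension("collections")` raises `AttributeError`, which mirrors why the short form fails for modules that lack `makeExtension`.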

Because python-markdown bundles several commonly used extensions into a single extra extension, this works too:

html = markdown.markdown(md_text, extensions=['extra'])

On the command line, use -x to specify which extensions to enable:

markdown_py .\debaodemo.md -f a.html -e utf-8 -x fenced_code

Using extensions (2): the built-in extensions

Python-Markdown ships with the following extensions:

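Extension	Entry point	Notes
Extra	`extra`	A bundle of commonly used extensions.
├── Abbreviations	`abbr`	Adds abbreviation syntax.
├── Attribute Lists	`attr_list`	Lets you add HTML attributes to Markdown elements.
├── Definition Lists	`def_list`	Adds definition list syntax.
├── Fenced Code Blocks	`fenced_code`	Adds fenced code block syntax.
├── Footnotes	`footnotes`	Adds footnote syntax.
├── Markdown in HTML	`md_in_html`	Allows Markdown content inside HTML tags.
└── Tables	`tables`	Adds table syntax.
Admonition	`admonition`	Adds admonition (call-out box) syntax.
CodeHilite	`codehilite`	Adds syntax highlighting to code blocks.
Legacy Attributes	`legacy_attrs`	Supports the legacy attribute syntax.
Legacy Emphasis	`legacy_em`	Supports the legacy emphasis syntax.
Meta-Data	`meta`	Allows metadata at the top of a Markdown document.
New Line to Break	`nl2br`	Converts newline characters `\n` into `<br />` tags.
Sane Lists	`sane_lists`	Makes list parsing behave more sanely.
SmartyPants	`smarty`	Converts straight quotes, dashes, and the like into their typographic equivalents.
Table of Contents	`toc`	Generates a table of contents automatically.
WikiLinks	`wikilinks`	Adds wiki-style link syntax.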

Note the entry point column in the table. An extension usually registers its entry point in its own setup.py, normally in the markdown.extensions group, like this:

from setuptools import setup

setup(
    # ...
    entry_points={
        'markdown.extensions': [
            'markdown_katex = markdown_katex.extension:KatexExtension',
        ]
    }
)
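To see which names are actually registered in the current environment, the entry points can be listed with the standard library. A minimal sketch, assuming only that extensions register under the `markdown.extensions` group as described above; `list_markdown_extensions` is a hypothetical helper name.

```python
from importlib.metadata import entry_points


def list_markdown_extensions():
    """Return {name: 'module:attr'} for the 'markdown.extensions' group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions: selectable entry-point collection.
        group = eps.select(group="markdown.extensions")
    else:
        # Python 3.8/3.9: plain dict keyed by group name.
        group = eps.get("markdown.extensions", [])
    return {ep.name: ep.value for ep in group}


if __name__ == "__main__":
    for name, target in sorted(list_markdown_extensions().items()):
        print(f"{name:20} -> {target}")
```

If markdown-katex is installed, a `markdown_katex` entry should show up in this listing next to the built-in names.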

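The dotted-module form shown earlier works only if the extension module implements a module-level makeExtension(**kwargs) function, like this:

from markdown.extensions import Extension

class MyExtension(Extension):
    def extendMarkdown(self, md):
        # Define extension here...
        pass

def makeExtension(**kwargs):
    return MyExtension(**kwargs)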

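Note: Tables and Fenced Code Blocks from Extra, as well as CodeHilite, are very commonly used extensions. Meta-Data is required by Pelican. TOC and Footnotes are also useful at times.

To enable the common extensions in code, this is all it takes: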

html_text = markdown.markdown(md_text, extensions=['extra', 'meta', 'codehilite', 'toc'])

From the command line:

markdown_py .\debaodemo.md -f a.html -e utf-8 -x extra -x codehilite -x toc -x meta

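Using extensions (3): configuring extensions

The previous two sections enabled extensions, but how are they configured?

For example, how do you set the heading levels used by toc? Pass the options directly when constructing the extension: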

from markdown.extensions.toc import TocExtension
html = markdown.markdown(md_text, extensions=[TocExtension(baselevel=1, toc_depth='2-3')])

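If you do not construct the extension yourself, use extension_configs instead (each extension's configuration is a dict, and all of those dicts are collected in one outer dict keyed by extension name):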

html = markdown.markdown(md_text, extensions=['toc'], extension_configs={
    'toc': {'baselevel': 2, 'toc_depth': '2-3'},
    })

On the command line, you need to write a configuration file in YAML or JSON format and pass it with -c:

markdown_py .\debaodemo.md -f a.html -e utf-8 -x extra -x codehilite -x toc -x meta -c config.yml

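Note that the -x options are still required: the configuration file only configures extensions, it does not enable them.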
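As a sketch of what such a configuration file might contain, the snippet below writes the same toc options used earlier to a JSON file. It assumes the file's top-level keys are the extension names given with -x, mirroring extension_configs; `write_config` and `EXTENSION_CONFIGS` are hypothetical names.

```python
import json
import pathlib

# Keys are extension names (as passed with -x); values are that extension's options.
EXTENSION_CONFIGS = {
    "toc": {"baselevel": 2, "toc_depth": "2-3"},
}


def write_config(path):
    """Serialize EXTENSION_CONFIGS as JSON and return the path written."""
    path = pathlib.Path(path)
    path.write_text(json.dumps(EXTENSION_CONFIGS, indent=2), encoding="utf-8")
    return path
```

After `write_config("config.json")`, the file would be passed along with the matching -x flag, for example `markdown_py a.md -f a.html -e utf-8 -x toc -c config.json`.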

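Using extensions (4): the third-party katex extension

Finally we reach the actual problem: how to use katex.

First, installation is simple:

`pip install markdown-katex`

Next, write some Markdown text that contains a formula: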

` ` `math
\int_{a}^{b} x^2 \,dx
` ` `

Finally, enable katex for the conversion:

html_text = markdown.markdown(md_text, extensions=['extra', 'markdown_katex'], extension_configs={'markdown_katex': {'no_inline_svg' : False, 'insert_fonts_css': False}, })

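Or, together with other extensions:

html_text = markdown.markdown(md_text, extensions=['extra', 'meta', 'codehilite', 'toc', 'markdown_katex'], extension_configs={'toc': {}, 'markdown_katex': {'no_inline_svg': False, 'insert_fonts_css': False}, })

Everything works, but...

The problem is reproducible: it is simply very slow! Conversion time goes from seconds to minutes. Why?

On the project's old GitLab site someone reported a similar performance problem, though it seems few others have run into it: https://gitlab.com/mbarkhau/markdown-katex/-/issues/17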

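markdown-katex source code?

To track the problem down, the only option is to read the source. Here is a brief record of that walk-through...

The setup.py file

Trimmed down, it looks roughly like this: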

import os
import setuptools

def read_file(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

setuptools.setup(
    name="markdown-katex",
    version="202406.1035",
    author="Manuel Barkhau",
    author_email="mbarkhau@gmail.com",
    url="https://github.com/mbarkhau/markdown-katex",
    description="KaTeX extension for Python Markdown",
    long_description=read_file("README.md"),
    long_description_content_type="text/markdown",
    license="MIT",
    packages=["markdown_katex"],
    package_data={"markdown_katex": [os.path.join("bin", "katex*")]},
    install_requires=[
        line.strip() for line in read_file("requirements/pypi.txt").splitlines()
        if line.strip() and not line.startswith("#")
    ],
    python_requires=">=2.7",
    entry_points={
        'markdown.extensions': [
            'markdown_katex = markdown_katex.extension:KatexExtension',
        ]
    },
    classifiers=[
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 2.7",
        "Programming Language :: Python :: 3",
    ],
)

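Key points:

package_data: the katex binary executables are bundled into the package
entry_points: registers the entry point markdown_katex

The __init__.py file

Trimmed down, it looks roughly like this: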

__version__ = "v202406.1035"

from markdown_katex.wrapper import tex2html, get_bin_cmd
from markdown_katex.extension import KatexExtension

makeExtension = lambda **kwargs: KatexExtension(**kwargs)

__all__ = ['makeExtension', '__version__', 'get_bin_cmd', 'tex2html']

The main thing to note is makeExtension. Its presence is what makes the second of the following forms usable:

markdown_katex.extension:KatexExtension
markdown_katex.extension

The extension.py file

This is the main file of the katex extension. The central class is KatexExtension, roughly as follows:

from markdown.extensions import Extension
from markdown.preprocessors import Preprocessor
from markdown.postprocessors import Postprocessor

class KatexExtension(Extension):
    def __init__(self, **kwargs) -> None:
        self.config = {
            'no_inline_svg': ["", "Replace inline <svg> with <img> tags."],
            'insert_fonts_css': ["", "Insert font loading stylesheet."],
            **{name: ["", options_text] for name, options_text in wrapper.parse_options().items()},
        }
        self.options = {name: kwargs.get(name, self.getConfig(name, "")) for name in self.config if kwargs.get(name, "")}
        self.math_html: typ.Dict[str, str] = {}
        super().__init__(**kwargs)

    def reset(self) -> None:
        self.math_html.clear()

    def extendMarkdown(self, md) -> None:
        md.preprocessors.register(KatexPreprocessor(md, self), name='katex_fenced_code_block', priority=50)
        md.postprocessors.register(KatexPostprocessor(md, self), name='katex_fenced_code_block', priority=0)
        md.registerExtension(self)

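As you can see, it registers two processor classes:

KatexPreprocessor: the preprocessor class, which handles LaTeX math before the Markdown document is parsed. It replaces each formula with a placeholder marker and caches the rendered HTML on the KatexExtension.
KatexPostprocessor: the postprocessor class, which replaces the placeholder markers with the actual HTML after the Markdown document has been parsed.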
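The placeholder technique used by these two classes can be illustrated without python-markdown at all. The following is a minimal sketch of the idea, not the extension's code: the marker format imitates the one in the source, while `INLINE_RE`, `preprocess`, `postprocess`, and the fake `<span>` rendering are assumptions made for this example.

```python
import hashlib
import re

# GitLab-style inline math, $`...`$, is assumed for this sketch.
INLINE_RE = re.compile(r"\$`(.+?)`\$")


def preprocess(lines, math_html):
    """Swap each inline formula for a marker and remember its HTML."""

    def to_marker(match):
        tex = match.group(1)
        digest = hashlib.sha1(tex.encode("utf-8")).hexdigest()[:12]
        marker = f"tmp_inline_md_katex_{digest}"
        # Stand-in for the real KaTeX rendering step.
        math_html[marker] = f'<span class="katex">{tex}</span>'
        return marker

    return [INLINE_RE.sub(to_marker, line) for line in lines]


def postprocess(html, math_html):
    """Put the rendered HTML back in place of the markers."""
    for marker, rendered in math_html.items():
        html = html.replace(marker, rendered)
    return html


if __name__ == "__main__":
    cache = {}
    lines = preprocess(["Energy: $`E = mc^2`$"], cache)
    print(lines[0])  # the formula is now an opaque marker
    print(postprocess("<p>" + lines[0] + "</p>", cache))
```

Because the Markdown parser only ever sees the opaque marker, characters such as `_` or `*` inside a formula cannot be misread as emphasis.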

Most of the work happens in the preprocessor class:

class KatexPreprocessor(Preprocessor):
    def __init__(self, md, ext: KatexExtension) -> None:
        super().__init__(md)
        self.ext = ext

    def _make_tag_for_block(self, block_lines: typ.List[str]) -> str:
        block_text = "\n".join(line[len(block_lines[0]) - len(block_lines[0].lstrip()):] for line in block_lines).rstrip()
        marker_tag = f"tmp_block_md_katex_{make_marker_id('block' + block_text)}"
        self.ext.math_html[marker_tag] = f"<p>{md_block2html(block_text, self.ext.options)}</p>"
        return block_lines[0][:len(block_lines[0]) - len(block_lines[0].lstrip())] + marker_tag

    def _make_tag_for_inline(self, inline_text: str) -> str:
        marker_tag = f"tmp_inline_md_katex_{make_marker_id('inline' + inline_text)}"
        self.ext.math_html[marker_tag] = md_inline2html(inline_text, self.ext.options)
        return marker_tag

    def _iter_out_lines(self, lines: typ.List[str]) -> typ.Iterable[str]:
        is_in_math_fence, is_in_fence, block_lines = False, False, []
        expected_close_fence = "```"

        for line in lines:
            if is_in_fence or is_in_math_fence:
                yield line
                if line.rstrip() == expected_close_fence:
                    if is_in_math_fence:
                        yield self._make_tag_for_block(block_lines)
                        block_lines.clear()
                    is_in_fence = is_in_math_fence = False
            else:
                if BLOCK_START_RE.match(line):
                    is_in_math_fence = True
                    expected_close_fence = line[:BLOCK_START_RE.match(line).end(1)] + BLOCK_START_RE.match(line).group(2)
                    block_lines.append(line)
                elif FENCE_RE.match(line):
                    is_in_fence = True
                    expected_close_fence = line[:FENCE_RE.match(line).end(1)] + FENCE_RE.match(line).group(2)
                    yield line
                else:
                    for code in reversed(list(iter_inline_katex(line))):
                        line = line[:code.start] + self._make_tag_for_inline(code.inline_text) + line[code.end :]
                    yield line

        if block_lines:
            yield from block_lines

    def run(self, lines: typ.List[str]) -> typ.List[str]:
        return list(self._iter_out_lines(lines))

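Its entry point is run(), which walks the lines one by one, recognizes block formulas and inline formulas, and then calls:

md_block2html(): handles block-level math and converts it to HTML.
md_inline2html(): handles inline math and converts it to HTML.

The wrapper.py file

The actual TeX-to-HTML conversion happens in this file.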

def tex2html(tex: str, options: MaybeOptions = None) -> str:
    cmd_parts         = list(_iter_cmd_parts(options))
    digest            = _cmd_digest(tex, cmd_parts)
    cache_filename    = digest + ".html"
    cache_output_file = CACHE_DIR / cache_filename

    try:
        if cache_output_file.exists():
            # give cached file a life extension (update mtime)
            cache_output_file.touch()
        else:
            with _atomic_writable_path(cache_output_file) as tmp_output_file:
                _write_tex2html(cmd_parts, tex, tmp_output_file)

        with cache_output_file.open(mode="r", encoding=KATEX_OUTPUT_ENCODING) as fobj:
            result: str = fobj.read()
            return result.strip()
    finally:
        _cleanup_cache_dir()

Surprisingly, it relies on a large number of cache files. Why? Only on a cache miss does it call katex to do the conversion.
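One plausible reason for the cache is that each conversion is expensive, so results are stored in files keyed by a digest of the input. The sketch below shows that general pattern under simplifying assumptions: it hashes only the TeX string (the real code also includes the command parts in the digest), and `cached_render` and `fake_render` are hypothetical names.

```python
import hashlib
import pathlib
import tempfile


def cached_render(tex, render, cache_dir):
    """Return render(tex), reusing a file in cache_dir when one exists."""
    cache_dir = pathlib.Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(tex.encode("utf-8")).hexdigest()
    cache_file = cache_dir / (digest + ".html")
    if cache_file.exists():
        cache_file.touch()  # refresh mtime so a cleanup pass keeps it
    else:
        cache_file.write_text(render(tex), encoding="utf-8")
    return cache_file.read_text(encoding="utf-8")


if __name__ == "__main__":
    calls = []

    def fake_render(tex):
        # Stand-in for the expensive external conversion.
        calls.append(tex)
        return "<span>" + tex + "</span>"

    with tempfile.TemporaryDirectory() as tmp:
        cached_render("x^2", fake_render, tmp)
        cached_render("x^2", fake_render, tmp)
    print(len(calls))  # the second call is served from the cache file
```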

The conversion itself is done in the _write_tex2html() function:

def _write_tex2html(cmd_parts: typ.List[str], tex: str, tmp_output_file: Path) -> None:
    # pylint: disable=consider-using-with ; not supported on py27
    tmp_input_file = CACHE_DIR / tmp_output_file.name.replace(".html", ".tex")
    input_data     = tex.encode(KATEX_INPUT_ENCODING)

    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    with _atomic_writable_path(tmp_input_file) as tmp_path:
        with tmp_path.open(mode="wb") as fobj:
            fobj.write(input_data)

    cmd_parts.extend(["--input", str(tmp_input_file), "--output", str(tmp_output_file)])
    proc = None
    try:
        proc     = sp.Popen(cmd_parts, stdout=sp.PIPE, stderr=sp.PIPE)
        ret_code = proc.wait()

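Every single formula is converted by launching katex once as a subprocess!

Isolating the problem

Root cause: markdown-katex always prefers a katex installed by the user, or the katex package from Node.js. It walks the system PATH looking for katex or npx. If npx exists but the katex package is not installed for it, the call blocks for a long time, and it blocks once per formula.

Try tex2html on its own: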

from markdown_katex import tex2html

latex_string = r"c = \pm\sqrt{a^2 + b^2}"

html_output = tex2html(latex_string)

print(html_output)

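On my own PC it is indeed very slow!

Going one step further and trying the following, it is still slow. So the time is not spent on the conversion itself but on locating the katex program: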

from markdown_katex import tex2html, get_bin_cmd

print(get_bin_cmd())

On Windows, it walks every directory in the PATH environment variable and looks in each one for katex.cmd, katex.exe, katex.ps1, npx.cmd, npx.exe, and npx.ps1:

CMD_NAME = "katex"

def _get_local_bin_candidates() -> typ.List[str]:
    if OSNAME == 'Windows':
        # whackamole
        return [
            f"{CMD_NAME}.cmd",
            f"{CMD_NAME}.exe",
            f"npx.cmd --no-install {CMD_NAME}",
            f"npx.exe --no-install {CMD_NAME}",
            f"{CMD_NAME}.ps1",
            f"npx.ps1 --no-install {CMD_NAME}",
        ]
    else:
        return [CMD_NAME, f"npx --no-install {CMD_NAME}"]
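A quick way to check which of these candidates a machine would actually pick up is `shutil.which`. This is only a diagnostic sketch, not the package's own search code; `find_candidates` is a hypothetical helper, and on Windows `shutil.which` resolves extensions through PATHEXT, so `.ps1` scripts are typically not found this way.

```python
import shutil

CANDIDATES = ("katex", "npx")


def find_candidates(names=CANDIDATES):
    """Map each executable name to its resolved path, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_candidates().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

The situation described in this article corresponds to `katex` being absent while `npx` is found.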

It then runs each candidate with --version appended to confirm that it works:

            try:
                output_data = sp.check_output(local_cmd_parts + ['--version'], stderr=sp.STDOUT)
                output_text = output_data.decode("utf-8")
                if re.match(r"\d+\.\d+\.\d+", output_text.strip()) is None:
                    continue
            except sp.CalledProcessError:
                continue
            except OSError:
                continue

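Immediate cause: Node.js is on my system PATH, but the katex module was never installed through npm. As a result, the following command blocks for a long time and then raises CalledProcessError.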

import subprocess as sp
local_cmd_parts = ['D:\\Program Files\\nodejs\\npx.cmd', '--no-install', 'katex']

output_data = sp.check_output(local_cmd_parts + [
                    '--version'], stderr=sp.STDOUT)
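To put a number on the delay, the same probe can be wrapped in a timer. A minimal sketch: `probe` is a hypothetical helper, and the `timeout` argument is my addition so that a hanging npx cannot stall the measurement indefinitely.

```python
import subprocess
import time


def probe(cmd_parts, timeout=60.0):
    """Run `cmd_parts --version`; return (elapsed seconds, output or None)."""
    start = time.perf_counter()
    try:
        output = subprocess.check_output(
            list(cmd_parts) + ["--version"],
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
        text = output.decode("utf-8", errors="replace").strip()
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired, OSError):
        # Non-zero exit, timeout, or missing executable all count as "not usable".
        text = None
    return time.perf_counter() - start, text
```

For example, `probe([r"D:\Program Files\nodejs\npx.cmd", "--no-install", "katex"])` returns the elapsed time together with the version string, or `None` when the probe fails.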

The fix is to install katex:

npm install -g katex

Or, when using a mirror registry in China:

npm --registry https://registry.npm.taobao.org --strict-ssl=false install -g katex

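GitHub Actions

Testing shows that on GitHub Actions, too, installing the Node.js katex package speeds up conversion considerably (from 4 minutes back to under 1 minute, the same level as before markdown-katex was enabled).

The GitHub Actions workflow file is attached below: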

name: pelican CI for debao blog

on:
  # Trigger the workflow on push on main branch,
  push:
    branches:
      - main

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
      with: 
        submodules: 'true'
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.12'
    - name: Set up Nodejs for katex
      uses: actions/setup-node@v4
      with:
        node-version: 20
    - name: Install packages needed by pelican
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        npm install -g katex
    - name: Run pelican
      run: |
        pelican -s publishconf.py
    - name: deploy to gh pages

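Final solution

Update 2024-10-24:

The blog now uses a KaTeX plugin for Markdown that I wrote myself, which does no offline conversion. A build takes only a few seconds.

Miscellaneous: using VS Code

A few quick notes.

Formula delimiters

VS Code has several extensions that support the Markdown + KaTeX combination, and they accept a wide variety of formula delimiters. See https://github.com/goessner/markdown-it-texmath for details.

The Python package discussed in this article, however, supports only GitLab-style formula delimiters.

Some regular expressions:

Convert dollar-style inline formulas (single dollar signs) to GitLab style

Search expression:

(?<!`)\$(?!`)([^$`]+)(?<!`)\$(?!`)

Replacement expression: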

$`$1`$
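The same search and replace can be scripted with Python's `re` module for batch conversion. A sketch under the assumption that the editor's `$1` back-reference maps to `\1` in Python; `dollar_to_gitlab` is a hypothetical helper name.

```python
import re

# Same pattern as the editor search expression above.
DOLLAR_INLINE = re.compile(r"(?<!`)\$(?!`)([^$`]+)(?<!`)\$(?!`)")


def dollar_to_gitlab(text):
    """Rewrite $...$ inline math as GitLab-style $`...`$."""
    return DOLLAR_INLINE.sub(r"$`\1`$", text)


if __name__ == "__main__":
    print(dollar_to_gitlab("Let $x^2$ be given."))  # Let $`x^2`$ be given.
    print(dollar_to_gitlab("Already done: $`y`$"))  # unchanged
```

Note that the pattern also matches the inner part of display math written as `$$...$$`, so such blocks are best converted or excluded separately.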

Convert bracket-style inline formulas to GitLab style

Search expression:

`\\\((.+?)\\\)`
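A Python version of this conversion might look as follows. It assumes that "bracket style" means inline math delimited by `\(` and `\)`, and it uses its own pattern written for that assumption; `bracket_to_gitlab` is a hypothetical helper name.

```python
import re

# Assumed bracket-style delimiters: \( ... \)
BRACKET_INLINE = re.compile(r"\\\((.+?)\\\)")


def bracket_to_gitlab(text):
    """Rewrite \\(...\\) inline math as GitLab-style $`...`$."""
    return BRACKET_INLINE.sub(r"$`\1`$", text)


if __name__ == "__main__":
    print(bracket_to_gitlab(r"Euler: \(e^{i\pi} + 1 = 0\)"))
```

The lazy `.+?` keeps two formulas on the same line from being merged into one match.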