Writing Extensions for Python-Markdown
======================================

Overview
--------

Python-Markdown includes an API for extension writers to plug their own
custom functionality and/or syntax into the parser. There are preprocessors,
which allow you to alter the source before it is passed to the parser;
inline patterns, which allow you to add, remove or override the syntax of
any inline elements; and postprocessors, which allow munging of the
output of the parser before it is returned. If you really want to dive in,
there are also blockprocessors, which are part of the core BlockParser.

As the parser builds an [ElementTree][] object which is later rendered
as Unicode text, there are also some helpers provided to ease manipulation of
the tree. Each part of the API is discussed in its respective section below.
Additionally, reading the source of some [[Available Extensions]] may be
helpful. For example, the [[Footnotes]] extension uses most of the features
documented here.

* [Preprocessors][]
* [InlinePatterns][]
* [Treeprocessors][]
* [Postprocessors][]
* [BlockParser][]
* [Working with the ElementTree][]
* [Integrating your code into Markdown][]
    * [extendMarkdown][]
    * [OrderedDict][]
    * [registerExtension][]
    * [Config Settings][]
    * [makeExtension][]

<h3 id="preprocessors">Preprocessors</h3>

Preprocessors munge the source text before it is passed into the Markdown
core. This is an excellent place to clean up bad syntax, extract things the
parser may otherwise choke on and perhaps even store them for later retrieval.

Preprocessors should inherit from ``markdown.preprocessors.Preprocessor`` and
implement a ``run`` method with one argument ``lines``. The ``run`` method of
each Preprocessor will be passed the entire source text as a list of Unicode
strings. Each string will contain one line of text.
The ``run`` method should
return either that list, or an altered list of Unicode strings.

A pseudo example:

    class MyPreprocessor(markdown.preprocessors.Preprocessor):
        def run(self, lines):
            new_lines = []
            for line in lines:
                # MYREGEX is a compiled regular expression defined elsewhere
                m = MYREGEX.match(line)
                if m:
                    # do stuff with the match; the matched line itself is
                    # not copied to the output
                    pass
                else:
                    new_lines.append(line)
            return new_lines

<h3 id="inlinepatterns">Inline Patterns</h3>

Inline Patterns implement the inline HTML element syntax for Markdown such as
``*emphasis*`` or ``[links](http://example.com)``. Pattern objects should be
instances of classes that inherit from ``markdown.inlinepatterns.Pattern`` or
one of its children. Each pattern object uses a single regular expression and
must have the following methods:

* **``getCompiledRegExp()``**:

    Returns a compiled regular expression.

* **``handleMatch(m)``**:

    Accepts a match object and returns an ElementTree element or a plain
    Unicode string.

Note that any regular expression returned by ``getCompiledRegExp`` must capture
the whole block. Therefore, they should all start with ``r'^(.*?)'`` and end
with ``r'(.*?)$'``. When using the default ``getCompiledRegExp()`` method
provided in the ``Pattern`` class, you can pass in a regular expression without
those parts and ``getCompiledRegExp`` will wrap your expression for you. This
means that the first group of your match will be ``m.group(2)``, as
``m.group(1)`` will match everything before the pattern.

For an example, consider this simplified emphasis pattern:

    class EmphasisPattern(markdown.inlinepatterns.Pattern):
        def handleMatch(self, m):
            el = markdown.etree.Element('em')
            # group(1) is the text before the match, so the first group
            # of the pattern itself is group(2)
            el.text = m.group(2)
            return el

As discussed in [Integrating Your Code Into Markdown][], an instance of this
class will need to be provided to Markdown.
That instance would be created
like so:

    # an oversimplified regex
    MYPATTERN = r'\*([^*]+)\*'
    # pass in pattern and create instance
    emphasis = EmphasisPattern(MYPATTERN)

Actually it would not be necessary to create that pattern (and not just because
a more sophisticated emphasis pattern already exists in Markdown). The fact is,
that example pattern is not very DRY. A pattern for `**strong**` text would
be almost identical, with the exception that it would create a 'strong' element.
Therefore, Markdown provides a number of generic pattern classes that can
provide some common functionality. For example, both emphasis and strong are
implemented with separate instances of the ``SimpleTagPattern`` listed below.
Feel free to use or extend any of these Pattern classes.

**Generic Pattern Classes**

* **``SimpleTextPattern(pattern)``**:

    Returns simple text of ``group(2)`` of a ``pattern``.

* **``SimpleTagPattern(pattern, tag)``**:

    Returns an element of type "`tag`" with a text attribute of ``group(3)``
    of a ``pattern``. ``tag`` should be a string naming an HTML element
    (e.g. 'em').

* **``SubstituteTagPattern(pattern, tag)``**:

    Returns an element of type "`tag`" with no children or text (e.g. 'br').

There may be other Pattern classes in the Markdown source that you could extend
or use as well. Read through the source and see if there is anything you can
use. You might even get a few ideas for different approaches to your specific
situation.

<h3 id="treeprocessors">Treeprocessors</h3>

Treeprocessors manipulate an ElementTree object after it has passed through the
core BlockParser. This is where additional manipulation of the tree takes
place. Additionally, the InlineProcessor is a Treeprocessor which steps through
the tree and runs the InlinePatterns on the text of each Element in the tree.

A Treeprocessor should inherit from ``markdown.treeprocessors.Treeprocessor``
and override the ``run`` method, which takes one argument ``root`` (an
ElementTree object) and returns either that root element or a modified root
element.

A pseudo example:

    class MyTreeprocessor(markdown.treeprocessors.Treeprocessor):
        def run(self, root):
            # do stuff
            return my_modified_root

For specifics on manipulating the ElementTree, see
[Working with the ElementTree][] below.

<h3 id="postprocessors">Postprocessors</h3>

Postprocessors manipulate the document after the ElementTree has been
serialized into a string. Postprocessors should be used to work with the
text just before output.

A Postprocessor should inherit from ``markdown.postprocessors.Postprocessor``
and override the ``run`` method, which takes one argument ``text`` and returns
a Unicode string.

Because Postprocessors run after the ElementTree has been serialized back into
Unicode text, this may be an appropriate place to add a table of contents to a
document, for example:

    class TocPostprocessor(markdown.postprocessors.Postprocessor):
        def run(self, text):
            return MYMARKERRE.sub(MyToc, text)

<h3 id="blockparser">BlockParser</h3>

Sometimes, pre/tree/postprocessors and Inline Patterns aren't going to do what
you need. Perhaps you want a new type of block that needs to be integrated
into the core parsing. In such a situation, you can add/change/remove
functionality of the core ``BlockParser``. The BlockParser is composed of a
number of Blockprocessors. The BlockParser steps through each block of text
(split by blank lines) and passes each block to the appropriate Blockprocessor.
That Blockprocessor parses the block and adds it to the ElementTree. The
[[Definition Lists]] extension would be a good example of an extension that
adds/modifies Blockprocessors.
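
To make the dispatch described above concrete, here is a minimal standalone
sketch. It uses only the standard library rather than Markdown's own classes,
so it is an illustration of the idea and not the BlockParser's actual code.
The names ``RuleProcessor``, ``ParagraphProcessor`` and ``parse_blocks`` are
hypothetical; each processor exposes the ``test`` and ``run`` methods that a
real Blockprocessor implements.

```python
import xml.etree.ElementTree as etree


class RuleProcessor:
    """Hypothetical processor: a block consisting of '---' becomes <hr>."""

    def test(self, parent, block):
        return block.strip() == "---"

    def run(self, parent, blocks):
        blocks.pop(0)  # consume the block this processor accepted
        etree.SubElement(parent, "hr")


class ParagraphProcessor:
    """Hypothetical fallback processor: any block becomes a <p>."""

    def test(self, parent, block):
        return True

    def run(self, parent, blocks):
        block = blocks.pop(0)
        p = etree.SubElement(parent, "p")
        p.text = block.strip()


def parse_blocks(parent, blocks, processors):
    """Hand the first remaining block to the first processor that accepts it."""
    while blocks:
        for processor in processors:
            if processor.test(parent, blocks[0]):
                processor.run(parent, blocks)
                break
        else:
            # No processor accepted the block; drop it to avoid looping forever.
            blocks.pop(0)


root = etree.Element("div")
blocks = "First paragraph.\n\n---\n\nSecond paragraph.".split("\n\n")
parse_blocks(root, blocks, [RuleProcessor(), ParagraphProcessor()])
tags = [child.tag for child in root]
```

Note that the order of the processors matters in this sketch: the catch-all
``ParagraphProcessor`` must come last, or it would claim every block.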

A Blockprocessor should inherit from ``markdown.blockprocessors.BlockProcessor``
and implement both the ``test`` and ``run`` methods.

The ``test`` method is used by BlockParser to identify the type of block.
Therefore the ``test`` method must return a boolean value. If the test returns
``True``, then the BlockParser will call that Blockprocessor's ``run`` method.
If it returns ``False``, the BlockParser will move on to the next
BlockProcessor.

The **``test``** method takes two arguments:

* **``parent``**: The parent etree Element of the block. This can be useful as
  the block may need to be treated differently if it is inside a list, for
  example.

* **``block``**: A string of the current block of text. The test may be a
  simple string method (such as ``block.startswith(some_text)``) or a complex
  regular expression.

The **``run``** method takes two arguments:

* **``parent``**: A pointer to the parent etree Element of the block. The run
  method will most likely attach additional nodes to this parent. Note that
  nothing is returned by the method. The ElementTree object is altered in place.

* **``blocks``**: A list of all remaining blocks of the document. Your run
  method must remove (pop) the first block from the list (which is altered in
  place, not returned) and parse that block. You may find that a block of text
  legitimately contains multiple block types. Therefore, after processing the
  first type, your processor can insert the remaining text into the beginning
  of the ``blocks`` list for future parsing.

Please be aware that a single block can span multiple text blocks. For example,
the official Markdown syntax rules state that a blank line does not end a
Code Block. If the next block of text is also indented, then it is part of
the previous block.
Therefore, the BlockParser was specifically designed to
address these types of situations. If you look at the ``CodeBlockProcessor``
in the core, you will note that it checks the last child of the ``parent``.
If the last child is a code block (``<pre><code>...</code></pre>``), then it
appends that block to the previous code block rather than creating a new
code block.

Each BlockProcessor has the following utility methods available:

* **``lastChild(parent)``**:

    Returns the last child of the given etree Element or ``None`` if it has no
    children.

* **``detab(text)``**:

    Removes one level of indent (four spaces by default) from the front of each
    line of the given text string.

* **``looseDetab(text, level)``**:

    Removes "level" levels of indent (defaults to 1) from the front of each line
    of the given text string. However, this method allows secondary lines to
    not be indented, as do some parts of the Markdown syntax.

Each BlockProcessor also has a pointer to the containing BlockParser instance at
``self.parser``, which can be used to check or alter the state of the parser.
The BlockParser tracks its state in a stack at ``parser.state``. The state
stack is an instance of the ``State`` class.

**``State``** is a subclass of ``list`` and has the additional methods:

* **``set(state)``**:

    Set a new state to string ``state``. The new state is appended to the end
    of the stack.

* **``reset()``**:

    Step back one step in the stack. The last state at the end is removed from
    the stack.

* **``isstate(state)``**:

    Test that the top (current) level of the stack is of the given string
    ``state``.

Note that to ensure that the state stack doesn't become corrupted, each time a
state is set for a block, that state *must* be reset when the parser finishes
parsing that block.
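
The pairing of ``set`` and ``reset`` can be sketched as follows. This is a
standalone stand-in written from the description above, not the class shipped
with Markdown, and ``parse_list_block`` is a hypothetical helper. Wrapping the
work in ``try``/``finally`` is one way to guarantee the reset happens even if
parsing raises an exception.

```python
class State(list):
    """Stand-in for the state stack described above (an assumption, not
    Markdown's own class)."""

    def set(self, state):
        # Push a new state onto the end of the stack.
        self.append(state)

    def reset(self):
        # Step back one level by removing the most recent state.
        self.pop()

    def isstate(self, state):
        # True only if the stack is non-empty and its top equals `state`.
        return bool(self) and self[-1] == state


def parse_list_block(state, block):
    """Hypothetical helper: parse a block while the 'list' state is set."""
    state.set("list")
    try:
        # Code running here can ask which context it is being parsed in.
        return state.isstate("list"), block.strip()
    finally:
        # Always undo the state change so the stack is not corrupted.
        state.reset()


state = State()
inside_list, text = parse_list_block(state, "  * item  ")
```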

An instance of the **``BlockParser``** is found at ``Markdown.parser``.
``BlockParser`` has the following methods:

* **``parseDocument(lines)``**:

    Given a list of lines, an ElementTree object is returned. This should be
    passed an entire document and is the only method the ``Markdown`` class
    calls directly.

* **``parseChunk(parent, text)``**:

    Parses a chunk of markdown text composed of multiple blocks and attaches
    those blocks to the ``parent`` Element. The ``parent`` is altered in place
    and nothing is returned. Extensions would most likely use this method for
    block parsing.

* **``parseBlocks(parent, blocks)``**:

    Parses a list of blocks of text and attaches those blocks to the ``parent``
    Element. The ``parent`` is altered in place and nothing is returned. This
    method will generally only be used internally to recursively parse nested
    blocks of text.

While it is not recommended, an extension could subclass or completely replace
the ``BlockParser``. The new class would have to provide the same public API.
However, be aware that other extensions may expect the core parser provided
and will not work with such a drastically different parser.

<h3 id="working_with_et">Working with the ElementTree</h3>

As mentioned, the Markdown parser converts a source document to an
[ElementTree][] object before serializing that back to Unicode text.
Markdown provides some helpers to ease that manipulation within the context
of the Markdown module.

First, to get access to the ElementTree module, import ElementTree from
``markdown`` rather than importing it directly. This will ensure you are using
the same version of ElementTree as markdown. The module is named ``etree``
within Markdown.

    from markdown import etree

``markdown.etree`` tries to import ElementTree from any known location, first
as a standard library module (from ``xml.etree`` in Python 2.5), then as a third
party package (``elementtree``). In each instance, ``cElementTree`` is tried
first, then ``ElementTree`` if the faster C implementation is not available on
your system.

Sometimes you may want text inserted into an element to be parsed by
[InlinePatterns][]. In such a situation, simply insert the text as you normally
would and the text will be automatically run through the InlinePatterns.
However, if you do *not* want some text to be parsed by InlinePatterns,
then insert the text as an ``AtomicString``.

    some_element.text = markdown.AtomicString(some_text)

Here's a basic example which creates an HTML table (note that the contents of
the second cell (``td2``) will be run through InlinePatterns later):

    table = etree.Element("table")
    table.set("cellpadding", "2")                      # Set cellpadding to 2
    tr = etree.SubElement(table, "tr")                 # Add child tr to table
    td1 = etree.SubElement(tr, "td")                   # Add child td1 to tr
    td1.text = markdown.AtomicString("Cell content")   # Add plain text content
    td2 = etree.SubElement(tr, "td")                   # Add second td to tr
    td2.text = "*text* with **inline** formatting."    # Add markup text
    table.tail = "Text after table"                    # Add text after table

You can also manipulate an existing tree.
Consider the following example which
adds a ``class`` attribute to ``<a>`` elements:

    def set_link_class(element):
        for child in element:
            if child.tag == "a":
                child.set("class", "myclass")  # set the class attribute
            set_link_class(child)  # run recursively on children

For more information about working with ElementTree see the ElementTree
[Documentation](http://effbot.org/zone/element-index.htm)
([Python Docs](http://docs.python.org/lib/module-xml.etree.ElementTree.html)).

<h3 id="integrating_into_markdown">Integrating Your Code Into Markdown</h3>

Once you have the various pieces of your extension built, you need to tell
Markdown about them and ensure that they are run in the proper sequence.
Markdown accepts an ``Extension`` instance for each extension. Therefore, you
will need to define a class that extends ``markdown.Extension`` and overrides
the ``extendMarkdown`` method. Within this class you will manage configuration
options for your extension and attach the various processors and patterns to
the Markdown instance.

It is important to note that the order of the various processors and patterns
matters. For example, if we replace ``http://...`` links with ``<a>`` elements,
and *then* try to deal with inline HTML, we will end up with a mess. Therefore,
the various types of processors and patterns are stored within an instance of
the Markdown class in [OrderedDict][]s. Your ``Extension`` class will need to
manipulate those OrderedDicts appropriately. You may insert instances of your
processors and patterns into the appropriate location in an OrderedDict, remove
a built-in instance, or replace a built-in instance with your own.
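
As a toy illustration of why the order matters, the following standalone
sketch applies two text transformations in both orders. It uses only the
standard library and is not how Markdown's own pipeline is implemented; the
functions ``autolink`` and ``escape_html`` are hypothetical stand-ins for a
link pattern and for HTML handling.

```python
import re
from html import escape


def autolink(text):
    """Hypothetical step: wrap bare URLs in <a> elements."""
    return re.sub(r'(https?://\S+)', r'<a href="\1">\1</a>', text)


def escape_html(text):
    """Hypothetical step: escape characters that are special in HTML."""
    return escape(text, quote=False)


source = "See http://example.com & more"

# Escaping first, then linking, leaves the generated <a> element intact.
good = autolink(escape_html(source))

# Linking first, then escaping, mangles the markup that was just generated.
bad = escape_html(autolink(source))
```

Here ``good`` still contains a usable ``<a>`` element, while in ``bad`` the
angle brackets of that element have been escaped along with everything else.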

<h4 id="extendmarkdown">extendMarkdown</h4>

The ``extendMarkdown`` method of a ``markdown.Extension`` class accepts two
arguments:

* **``md``**:

    A pointer to the instance of the Markdown class. You should use this to
    access the [OrderedDict][]s of processors and patterns. They are found
    under the following attributes:

    * ``md.preprocessors``
    * ``md.inlinePatterns``
    * ``md.parser.blockprocessors``
    * ``md.treeprocessors``
    * ``md.postprocessors``

    Some other things you may want to access in the markdown instance are:

    * ``md.htmlStash``
    * ``md.output_formats``
    * ``md.set_output_format()``
    * ``md.registerExtension()``

* **``md_globals``**:

    Contains all the various global variables within the markdown module.

Of course, with access to those items, theoretically you have the option of
changing anything through various [monkey_patching][] techniques. However, you
should be aware that the various undocumented or private parts of markdown
may change without notice and your monkey_patches may break with a new release.
Therefore, what you really should be doing is inserting processors and patterns
into the markdown pipeline. Consider yourself warned.

[monkey_patching]: http://en.wikipedia.org/wiki/Monkey_patch

A simple example:

    class MyExtension(markdown.Extension):
        def extendMarkdown(self, md, md_globals):
            # Insert instance of 'mypattern' before 'references' pattern
            md.inlinePatterns.add('mypattern', MyPattern(md), '<references')

<h4 id="ordereddict">OrderedDict</h4>

An OrderedDict is a dictionary-like object that retains the order of its
items. The items are ordered in the order in which they were appended to
the OrderedDict. However, an item can also be inserted into the OrderedDict
in a specific location in relation to the existing items.

Think of OrderedDict as a combination of a list and a dictionary, as it has
methods common to both. For example, you can get and set items using the
``od[key] = value`` syntax, and the methods ``keys()``, ``values()``, and
``items()`` work as expected with the keys, values and items returned in the
proper order. At the same time, you can use ``insert()``, ``append()``, and
``index()`` as you would with a list.

Generally speaking, within Markdown extensions you will be using the special
helper method ``add()`` to add additional items to an existing OrderedDict.

The ``add()`` method accepts three arguments:

* **``key``**: A string. The key is used for later reference to the item.

* **``value``**: The object instance stored in this item.

* **``location``**: Optional. The item's location in relation to other items.

    Note that the location can consist of a few different values:

    * The special strings ``"_begin"`` and ``"_end"`` insert that item at the
      beginning or end of the OrderedDict respectively.

    * A less-than sign (``<``) followed by an existing key (e.g.
      ``"<somekey"``) inserts that item before the existing key.

    * A greater-than sign (``>``) followed by an existing key (e.g.
      ``">somekey"``) inserts that item after the existing key.

Consider the following example:

    >>> import markdown
    >>> od = markdown.OrderedDict()
    >>> od['one'] = 1      # The same as: od.add('one', 1, '_begin')
    >>> od['three'] = 3    # The same as: od.add('three', 3, '>one')
    >>> od['four'] = 4     # The same as: od.add('four', 4, '_end')
    >>> od.items()
    [('one', 1), ('three', 3), ('four', 4)]

Note that when building an OrderedDict in order, the extra features of the
``add`` method offer no real value and are not necessary. However, when
manipulating an existing OrderedDict, ``add`` can be very helpful.
So let's
insert another item into the OrderedDict.

    >>> od.add('two', 2, '>one')          # Insert after 'one'
    >>> od.values()
    [1, 2, 3, 4]

Now let's insert another item.

    >>> od.add('twohalf', 2.5, '<three')  # Insert before 'three'
    >>> od.keys()
    ['one', 'two', 'twohalf', 'three', 'four']

Note that we also could have set the location of "twohalf" to be 'after two'
(i.e.: ``'>two'``). However, it's unlikely that you will have control over the
order in which extensions will be loaded, and this could affect the final
sorted order of an OrderedDict. For example, suppose an extension adding
'twohalf' in the above examples was loaded before a separate extension which
adds 'two'. You may need to take this into consideration when adding your
extension components to the various markdown OrderedDicts.

Once an OrderedDict is created, the items are available via key:

    MyNode = od['somekey']

Therefore, to delete an existing item:

    del od['somekey']

To change the value of an existing item (leaving location unchanged):

    od['somekey'] = MyNewObject()

To change the location of an existing item:

    od.link('somekey', '<otherkey')

<h4 id="registerextension">registerExtension</h4>

Some extensions may need to have their state reset between multiple runs of the
Markdown class. For example, consider the following use of the [[Footnotes]]
extension:

    md = markdown.Markdown(extensions=['footnotes'])
    html1 = md.convert(text_with_footnote)
    md.reset()
    html2 = md.convert(text_without_footnote)

Without calling ``reset``, the footnote definitions from the first document will
be inserted into the second document as they are still stored within the class
instance. Therefore the ``Extension`` class needs to define a ``reset`` method
that will reset the state of the extension (e.g. ``self.footnotes = {}``).
However, as many extensions do not have a need for ``reset``, ``reset`` is only
called on extensions that are registered.

To register an extension, call ``md.registerExtension`` from within your
``extendMarkdown`` method:

    def extendMarkdown(self, md, md_globals):
        md.registerExtension(self)
        # insert processors and patterns here

Then, each time ``reset`` is called on the Markdown instance, the ``reset``
method of each registered extension will be called as well. You should also
note that ``reset`` will be called on each registered extension after it is
initialized the first time. Keep that in mind when overriding the extension's
``reset`` method.

<h4 id="configsettings">Config Settings</h4>

If an extension uses any parameters that the user may want to change,
those parameters should be stored in ``self.config`` of your
``markdown.Extension`` class in the following format:

    self.config = {parameter_1_name : [value1, description1],
                   parameter_2_name : [value2, description2] }

When stored this way, the config parameters can be overridden from the
command line or at the time Markdown is instantiated:

    markdown.py -x myextension(SOME_PARAM=2) inputfile.txt > output.txt

Note that parameters should always be assumed to be set to string
values, and should be converted at run time. For example:

    i = int(self.getConfig("SOME_PARAM"))

<h4 id="makeextension">makeExtension</h4>

Each extension should ideally be placed in its own module starting
with the ``mdx_`` prefix (e.g. ``mdx_footnotes.py``). The module must
provide a module-level function called ``makeExtension`` that takes
an optional parameter consisting of a dictionary of configuration overrides
and returns an instance of the extension.
An example from the footnote
extension:

    def makeExtension(configs=None):
        return FootnoteExtension(configs=configs)

By following the above example, when Markdown is passed the name of your
extension as a string (e.g. ``'footnotes'``), it will automatically import
the module and call the ``makeExtension`` function to instantiate your
extension.

You may have noted that the extensions packaged with Python-Markdown do not
use the ``mdx_`` prefix in their module names. This is because they are all
part of the ``markdown.extensions`` package. Markdown will first try to import
from ``markdown.extensions.extname`` and upon failure, ``mdx_extname``. If both
fail, Markdown will continue without the extension.

However, Markdown will also accept an already existing instance of an extension.
For example:

    import markdown
    import myextension
    configs = {...}
    myext = myextension.MyExtension(configs=configs)
    md = markdown.Markdown(extensions=[myext])

This is useful if you need to implement a large number of extensions with more
than one residing in a module.

[Preprocessors]: #preprocessors
[InlinePatterns]: #inlinepatterns
[Treeprocessors]: #treeprocessors
[Postprocessors]: #postprocessors
[BlockParser]: #blockparser
[Working with the ElementTree]: #working_with_et
[Integrating your code into Markdown]: #integrating_into_markdown
[extendMarkdown]: #extendmarkdown
[OrderedDict]: #ordereddict
[registerExtension]: #registerextension
[Config Settings]: #configsettings
[makeExtension]: #makeextension
[ElementTree]: http://effbot.org/zone/element-index.htm