Using external filter commands to reformat HTML


This is a transcript of screencast 64.

We can use pandoc as a filter to clean up WYSIWYG-generated HTML. Pandoc is a command-line program, but we can call it from inside Vim either by using the bang Ex command or by configuring the formatprg option so that the gq operator invokes pandoc.

Here I’ve got an HTML document that contains some crufty markup. This source code was generated by a WYSIWYG editor and I’d like to clean it up. I recently saw a neat trick on Twitter from Stephen Hay, who says that:

Few things clean up CMS-input HTML better than running it through Pandoc to convert to Markdown and then back to HTML again. 1 sec, big win.

In case you don’t know, pandoc is a Swiss Army knife for converting between all sorts of markup formats. I’ll demonstrate first at the command line, then we’ll look at how to integrate this tool with our Vim setup.

We can pipe the contents of this file into pandoc, instructing it to convert from html to markdown:

cat tea-dance.tinymce.html | pandoc --from=html --to=markdown

We can see the results on standard out: it’s the same content, but now in markdown format. We could then pipe this document back into pandoc, and ask it to convert from markdown back to html:

cat tea-dance.tinymce.html | pandoc --from=html --to=markdown | pandoc --from=markdown --to=html

That gives us the same content in HTML, minus all of the crufty markup that the WYSIWYG editor generated. Pretty neat!
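If you run this round trip often, it can be wrapped in a small shell function. This is only a sketch: html_roundtrip is a hypothetical name chosen for illustration, and it assumes pandoc is installed and on your PATH.

```shell
# html_roundtrip is a hypothetical helper name, not something pandoc provides.
# It reads HTML on standard input, converts it to markdown, then converts
# that markdown back to HTML on standard output.
html_roundtrip() {
  pandoc --from=html --to=markdown | pandoc --from=markdown --to=html
}
```

Because the function reads standard input, the file can be redirected into it instead of going through cat, for example: html_roundtrip < tea-dance.tinymce.html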

Using filter commands in Vim

In this example, we’re using pandoc as a filter.

:h filter

That is: a program “that accepts text at standard input, changes it in some way, and sends it to standard output”.
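To make that definition concrete, here is about the smallest filter I can think of, written as a shell function. The name shout is made up for this illustration; tr does the real work.

```shell
# shout is a hypothetical filter used only for illustration:
# it reads standard input and writes an upper-cased copy to standard output.
shout() {
  tr '[:lower:]' '[:upper:]'
}

printf 'tea dance\n' | shout   # prints: TEA DANCE
```

Anything with this shape (text in on standard input, changed text out on standard output) qualifies as a filter, and that includes the pandoc pipeline above.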

:h :!

The bang Ex command lets us send a range of lines from our current buffer to an external filter program. The original text from the buffer will be replaced by the output from the external command.
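One way to picture the effect is to imitate it outside Vim. The sketch below is not how Vim implements the command. It is a hypothetical helper, filter_range, that prints a file with one range of lines swapped for the output of a filter, which is roughly what :2,4!sort does to a buffer.

```shell
# filter_range FILE FIRST LAST COMMAND [ARGS...]
# Hypothetical helper that mimics the effect of :FIRST,LAST!COMMAND on a file.
# The body runs in a subshell so its variables do not leak out.
filter_range() (
  file=$1 first=$2 last=$3
  shift 3
  # Lines before the range pass through untouched.
  if [ "$first" -gt 1 ]; then
    sed -n "1,$((first - 1))p" "$file"
  fi
  # Lines inside the range are replaced by the filter's output.
  sed -n "${first},${last}p" "$file" | "$@"
  # Lines after the range pass through untouched.
  sed -n "$((last + 1)),\$p" "$file"
)

demo=$(mktemp)
printf 'top\ncherry\nbanana\napple\nbottom\n' > "$demo"
filter_range "$demo" 2 4 sort   # prints: top, apple, banana, cherry, bottom (one per line)
rm -f "$demo"
```

Only lines 2 to 4 go through sort. The first and last lines come out exactly as they went in, just as lines outside the range are left alone in Vim.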

Let’s try that out (I’ve saved the pandoc command in register ‘a’, so I’ll just paste it):

:%!pandoc -f html -t markdown | pandoc -f markdown -t html

Boom! The entire buffer has been overwritten with the output from our pandoc pipeline.

In a followup tweet, Stephen suggests mapping this Ex command to a key so we can run it more easily. For example, you could add a mapping for normal mode and another for visual mode:

" <Bar> stands in for the pipe character, since a literal | would end the map command
nnoremap <leader>gq :%!pandoc -f html -t markdown <Bar> pandoc -f markdown -t html<CR>
vnoremap <leader>gq :!pandoc -f html -t markdown <Bar> pandoc -f markdown -t html<CR>

That’ll work, but I want to suggest a way of doing it without leader mappings.

:h formatprg

The formatprg option lets us specify an external program that will be triggered by the gq operator. In episode 18 of Vimcasts, I demonstrated how the external par command could be used for the task of formatting plain text files with hard-wrapping. We could use a similar technique here.

Let’s set the formatprg option to our pandoc pipeline:

let &formatprg="pandoc --from=html --to=markdown | pandoc --from=markdown --to=html"

Now when we use the gq operator, Vim passes the selected text to pandoc for processing.

That means I can operate on the current line by pressing gqq. Or, with the cursor on the first line, I can filter the entire buffer through pandoc by pressing gqG (“gee-queue-shift-gee”). Or I can switch to visual mode, where gq filters only the selected lines.

If you like this approach, I would recommend using this autocommand:

if has("autocmd")
  let pandoc_pipeline  = "pandoc --from=html --to=markdown"
  let pandoc_pipeline .= " | pandoc --from=markdown --to=html"
  " &l: sets the buffer-local value, so other filetypes keep their own formatprg
  autocmd FileType html let &l:formatprg=pandoc_pipeline
endif

Which sets up pandoc as the formatprg for HTML files only. If you can think of other filter commands that could be used in this fashion, you can always use this autocommand as a template.