Using external filter commands to reformat HTML
Here I’ve got an HTML document that contains some crufty markup. This source code was generated by a WYSIWYG editor and I’d like to clean it up. I recently saw a neat trick on Twitter from Stephen Hay, who says that:
Few things clean up CMS-input HTML better than running it through Pandoc to convert to Markdown and then back to HTML again. 1 sec, big win.
In case you don’t know, pandoc is a Swiss Army knife for converting between all sorts of markup formats. I’ll demonstrate first at the command line, then we’ll look at how to integrate this tool with our Vim setup.
We can pipe the contents of this file into pandoc, instructing it to convert from html to markdown:
cat tea-dance.tinymce.html | pandoc --from=html --to=markdown
We can see the results on standard out: it’s the same content, but now in markdown format. We could then pipe this document back into pandoc, and ask it to convert from markdown back to html:
cat tea-dance.tinymce.html | pandoc --from=html --to=markdown | pandoc --from=markdown --to=html
That gives us the same content in HTML, minus all of the crufty markup that the WYSIWYG editor generated. Pretty neat!
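The round trip above can be wrapped in a small shell function so the pipeline does not have to be retyped. This is only a sketch: the name clean_html is made up for illustration and is not part of pandoc, and it assumes pandoc is installed and on the PATH. It uses the same --from and --to options as the commands above.

```shell
# clean_html: hypothetical helper name, not something pandoc provides.
# Reads HTML from the files given as arguments (or from standard input
# when no files are given), converts it to markdown and back again,
# and writes the resulting HTML to standard output.
clean_html() {
  cat -- "$@" |
    pandoc --from=html --to=markdown |
    pandoc --from=markdown --to=html
}
```

With that defined, running clean_html tea-dance.tinymce.html prints the cleaned markup, and clean_html < tea-dance.tinymce.html does the same while reading from standard input.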
Using filter commands in Vim
In this example, we’re using pandoc as a filter.
That is: a program “that accepts text at standard input, changes it in some way, and sends it to standard output”.
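To make that definition concrete, here is a sketch of a tiny filter built from standard Unix tools instead of pandoc. The function name shout is invented for this example; tr and sed are the only programs it relies on.

```shell
# shout: hypothetical example filter (not from the original article).
# It reads text on standard input, upper-cases it, and writes the
# result to standard output, which is all a filter has to do.
shout() {
  tr '[:lower:]' '[:upper:]'
}

# Filters chain with pipes, the same way the two pandoc calls do.
# Here sed appends an exclamation mark to every line that shout emits.
printf 'tea dance\n' | shout | sed 's/$/!/'
# prints: TEA DANCE!
```

Any program with this shape, pandoc included, can sit in the middle of a pipeline, because each stage only cares about what arrives on standard input and what it writes to standard output.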
The bang Ex command lets us send a range of lines from our current buffer to an external filter program. The original text from the buffer will be replaced by the output from the external command.
Let’s try that out (I’ve saved the pandoc command in register ‘a’, so I’ll just paste it):
:%!pandoc -f html -t markdown | pandoc -f markdown -t html
Boom! The entire buffer has been overwritten with the output from our pandoc pipeline.
In a follow-up tweet, Stephen suggests mapping this Ex command to a key so we can run it more easily. For example, you could add a mapping for normal mode and another for visual mode:
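nnoremap <leader>gq :%!pandoc -f html -t markdown <Bar> pandoc -f markdown -t html<CR>
vnoremap <leader>gq :!pandoc -f html -t markdown <Bar> pandoc -f markdown -t html<CR>
(Inside a mapping the pipe has to be written as <Bar>, because a literal | would end the mapping command early.)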
That’ll work, but I want to suggest a way of doing it without leader mappings.
The formatprg option lets us specify an external program that will be triggered by the gq operator. In episode 18 of Vimcasts, I demonstrated how the external par command could be used for the task of formatting plain text files with hard-wrapping. We could use a similar technique here.
Let’s set the formatprg option to our pandoc pipeline:
let &formatprg="pandoc --from=html --to=markdown | pandoc --from=markdown --to=html"
Now when we use the gq command, Vim passes the selected text to pandoc for processing.
That means I can operate on the current line by pressing gqq.
Or I can filter the entire buffer through pandoc by pressing gggqG.
Or I can switch to visual mode, and gq filters the selected lines only.
If you like this approach, I would recommend using this autocommand:
if has("autocmd")
  let pandoc_pipeline  = "pandoc --from=html --to=markdown"
  let pandoc_pipeline .= " | pandoc --from=markdown --to=html"
  autocmd FileType html let &formatprg=pandoc_pipeline
endif
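This sets up pandoc as the formatprg when an HTML file is opened. Note that let &formatprg changes the global value, so the setting stays in effect for other buffers afterwards; in recent versions of Vim, where formatprg can also be set per buffer, let &l:formatprg would confine it to HTML buffers. If you can think of other filter commands that could be used in this fashion, you can always use this autocommand as a template.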