I recently saw a neat trick on Twitter from Stephen Hay, who says:
“Few things clean up CMS-input HTML better than running it through Pandoc to convert to Markdown and then back to HTML again. 1 sec, big win.”
pandoc is a Swiss Army knife for converting between all sorts of markup formats. You can find installation instructions on the pandoc site.
Suppose we have a tea-dance.html file that contains crufty markup, because it was generated by a WYSIWYG editor. We could clean it up by running this at the command line:
cat tea-dance.html | pandoc --from=html --to=markdown | pandoc --from=markdown --to=html
This emits a cleaned-up version of tea-dance.html on standard output.
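If you run this round trip often, it may be convenient to wrap it in a shell function. The sketch below assumes pandoc is installed and on your PATH; the name clean_html is made up for illustration and is not part of pandoc.

```shell
# Hypothetical helper: round-trip HTML through Markdown and back.
# Reads HTML on standard input, writes cleaned-up HTML to standard output.
# Assumes pandoc is installed and on the PATH.
clean_html() {
  pandoc --from=html --to=markdown | pandoc --from=markdown --to=html
}
```

With that defined, `clean_html < tea-dance.html > tea-dance.clean.html` would write the result to a new file (the output filename here is only an example) and leave the original untouched.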
We’re using pandoc as a filter: that is, a program “that accepts text at standard input, changes it in some way, and sends it to standard output”.
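To make the idea of a filter concrete, here is a minimal sketch that uses only standard Unix tools rather than pandoc. The squeeze_blank function is a made-up example: it reads standard input, collapses runs of blank lines into a single blank line, and writes the result to standard output, so it can sit in a pipeline in the same way pandoc does.

```shell
# Hypothetical filter: collapse consecutive blank lines into one.
# It reads standard input and writes standard output; no files are touched.
squeeze_blank() {
  # Non-blank lines are printed and reset the flag;
  # a blank line is printed only if the previous line was not blank.
  awk 'NF { blank = 0; print; next } !blank { blank = 1; print }'
}

# Use it in a pipeline like any other filter.
printf 'one\n\n\n\ntwo\n' | squeeze_blank
```

Any program that behaves this way, reading from standard input and writing to standard output, can be used from Vim with the techniques described in the rest of this post.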
Running text from a Vim buffer through an external filter
Suppose that we open the tea-dance.html file in Vim. We can use the bang Ex command to filter the contents of the current buffer through our pandoc pipeline:
:%!pandoc --from=html --to=markdown | pandoc --from=markdown --to=html
Vim will take the output from that pipeline and use it to overwrite the original text from the buffer.
In a followup tweet, Stephen suggests mapping this Ex command to a key so we can run it more easily. For example, you could add a mapping for normal mode and another for visual mode:
nnoremap <leader>gq :%!pandoc -f html -t markdown | pandoc -f markdown -t html<CR>
vnoremap <leader>gq :!pandoc -f html -t markdown | pandoc -f markdown -t html<CR>
That’ll work, but I want to suggest a way of doing it without leader mappings.
Set up formatprg to filter selection through pandoc
In episode 18 of Vimcasts, I demonstrated how the external par command could be used to format plain text files with hard-wrapping.
As long as we’re using Vim version 8.0.0179 (or newer), we can use a similar technique here.
The gq operation runs the selected text through the filter specified by formatprg.
This autocommand sets formatprg for HTML files to use our pandoc pipeline:
if has("autocmd")
  let pandoc_pipeline  = "pandoc --from=html --to=markdown"
  let pandoc_pipeline .= " | pandoc --from=markdown --to=html"
  autocmd FileType html let &l:formatprg=pandoc_pipeline
endif
That means we can filter the current line through pandoc by pressing gqq.
Or we can filter the entire buffer by pressing gg then gqG.
Or we can switch to visual mode, and gq will filter only the selected lines.
Update: When I originally published this episode, I assumed that the formatprg option could be set for each buffer independently.
I was wrong then, but this is now possible since this patch by Sung Pae was accepted into Vim core.
Further reading
- Stephen Hay’s tweets: one and two
- pandoc
- Installing pandoc
- :h filter
- :h :range!
- :h formatprg
- :h gq