Blog archive format

As I’ve been improving the import and export functionality in Micro.blog, I’ve done a lot of work with WordPress’s WXR format, which is based on RSS. While there’s nothing particularly wrong with WXR, it’s more complicated than it needs to be for non-WordPress sites, especially when you start to tackle image uploads that exist outside of the post text.

Micro.blog can also push an entire site’s Markdown, HTML, and images to GitHub, which is the most complete mirror and perfect for migrating to another Jekyll server. It introduces so many extra files, though, that it’s not reasonable to expect other blog platforms to support the same level of detail.

I’d be happy to ignore the WordPress-centric nature of WXR and use it as a common blog archive format if WXR provided a mechanism to store image uploads. Helping people migrate from WordPress to Micro.blog-hosted blogs has only emphasized to me that a better format is needed.

In chatting with the IndieWeb community, the idea was proposed that an HTML file using h-feed would provide portability and also an added bonus: it could be opened in any web browser to view your archived site. Images could be stored as files with relative references in the HTML file. (I’d throw in a JSON Feed file, too, so that importers could choose between using a Microformats parser or JSON parser.)
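As a rough sketch of the JSON side, here is how an exporter might build feed.json using the JSON Feed version 1 field names. The `build_json_feed` helper and the sample post are hypothetical and only for illustration. The image is referenced with a relative path so it resolves inside the archive.

```python
import json


def build_json_feed(title, posts):
    """Return a JSON Feed (version 1) document as a plain dict.

    `posts` is a hypothetical list of dicts with "url", "content_html",
    and "date_published" keys; a real exporter would read these from
    the blog's own storage.
    """
    return {
        "version": "https://jsonfeed.org/version/1",
        "title": title,
        "items": [
            {
                "id": post["url"],
                "url": post["url"],
                "content_html": post["content_html"],
                "date_published": post["date_published"],
            }
            for post in posts
        ],
    }


# Sample data, invented for this example.
example_posts = [
    {
        "url": "https://example.com/2017/11/25/test.html",
        "content_html": '<p>Hello <img src="uploads/2017/test.jpg"></p>',
        "date_published": "2017-11-25T10:00:00-06:00",
    }
]

feed_text = json.dumps(build_json_feed("Example blog", example_posts), indent=2)
```

An importer that prefers JSON can then read `items` directly instead of running a Microformats parser over index.html.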

The files would look something like this:

  • index.html
  • feed.json
  • uploads
    • 2017
      • test.jpg

The basics from h-feed would follow this structure:

  • h-feed
    • h-entry
      • p-name
      • e-content
      • dt-published
      • u-url
    • h-entry
      • ...
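To make the structure above concrete, here is a minimal sketch that renders posts into h-feed markup. The function names, the choice of `article`, `h2`, and `time` elements, and the shape of the `post` dict are assumptions for illustration. Only the class names come from the outline above.

```python
from html import escape


def render_entry(post):
    """Render one post as an h-entry.

    `post` is a hypothetical dict with "name", "content_html",
    "published", and "url" keys. The content is assumed to be
    trusted HTML already, so it is not escaped.
    """
    published = escape(post["published"])
    return (
        '<article class="h-entry">\n'
        f'  <h2 class="p-name">{escape(post["name"])}</h2>\n'
        f'  <div class="e-content">{post["content_html"]}</div>\n'
        f'  <time class="dt-published" datetime="{published}">{published}</time>\n'
        f'  <a class="u-url" href="{escape(post["url"])}">Permalink</a>\n'
        "</article>"
    )


def render_feed(posts):
    """Wrap all entries in a single h-feed container inside a bare HTML page."""
    entries = "\n".join(render_entry(post) for post in posts)
    return (
        "<!DOCTYPE html>\n"
        '<html>\n<head><meta charset="utf-8"><title>Archive</title></head>\n'
        f'<body>\n<div class="h-feed">\n{entries}\n</div>\n</body>\n</html>\n'
    )
```

Calling `render_feed(posts)` and writing the result to index.html gives a page that opens in any browser and that a Microformats parser can read back.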

Only index.html and feed.json would be required. Any other paths in the archive would be determined by the contents of the HTML. (I’m using “uploads” in this example, but it could just as easily be “archive”, “audio”, or any other set of folders.)
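Since the other paths are determined by the HTML, an importer could discover them by scanning index.html for relative references. This is a minimal sketch using Python's standard `html.parser`. The `ReferenceCollector` class and `relative_references` function are hypothetical names, and looking only at `src` and `href` attributes is an assumption about what an archive would contain.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ReferenceCollector(HTMLParser):
    """Collect relative paths found in src and href attributes."""

    def __init__(self):
        super().__init__()
        self.paths = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in ("src", "href") or not value:
                continue
            parsed = urlparse(value)
            # Keep only references that stay inside the archive:
            # no scheme, no host, and a path that is not root-relative.
            is_relative = (
                not parsed.scheme
                and not parsed.netloc
                and parsed.path
                and not parsed.path.startswith("/")
            )
            if is_relative:
                self.paths.append(parsed.path)


def relative_references(html_text):
    """Return relative paths referenced by the HTML, in document order."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    return collector.paths
```

An importer could then check that each returned path exists in the unzipped folder before copying the file to the new blog.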

For large sites, the HTML could be split into multiple files with appropriate tags in the header to page through the additional files. While it could contain CSS and your full blog’s design, I’m imagining that the HTML would be extremely lightweight: just enough to capture the posts, not a way to transfer templates and themes between blogs.
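The paging tags are not pinned down here, so the following is only one possible reading: split the posts into fixed-size pages and point each page at the next with a `<link rel="next">` tag in the header. The file naming scheme (index.html, index2.html, and so on), the `paginate` and `next_link_tag` helpers, and the use of `rel="next"` are all assumptions for illustration.

```python
def paginate(posts, per_page):
    """Split posts into pages and record each page's file name and successor.

    The index.html / index2.html naming is a hypothetical convention.
    """
    if per_page < 1:
        raise ValueError("per_page must be at least 1")
    chunks = [posts[i:i + per_page] for i in range(0, len(posts), per_page)]
    pages = []
    for number, chunk in enumerate(chunks, start=1):
        filename = "index.html" if number == 1 else f"index{number}.html"
        # The last page has no successor.
        next_name = f"index{number + 1}.html" if number < len(chunks) else None
        pages.append({"filename": filename, "next": next_name, "posts": chunk})
    return pages


def next_link_tag(page):
    """Return the header tag pointing at the next page, or an empty string."""
    if page["next"] is None:
        return ""
    return f'<link rel="next" href="{page["next"]}">'
```

An importer would start at index.html and keep following the `next` reference until a page has none.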

The whole folder is zipped and renamed with a .bar extension, which makes it easy to move around and upload all at once. I’ve created an example file here (rename it to .zip to open it).
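The packaging step can be sketched with Python's standard `zipfile` module. The `write_bar` function is a hypothetical helper, and the check for index.html and feed.json simply mirrors the two required files described earlier. It is not part of any published specification.

```python
import zipfile
from pathlib import Path


def write_bar(folder, bar_path):
    """Zip the contents of `folder` into a .bar file and return its path."""
    folder = Path(folder)
    bar_path = Path(bar_path)

    # The two files every archive is expected to contain.
    for required in ("index.html", "feed.json"):
        if not (folder / required).is_file():
            raise FileNotFoundError(f"missing required file: {required}")

    with zipfile.ZipFile(bar_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            # Skip directories, and skip the archive itself in case it is
            # being written inside the folder that is being zipped.
            if not path.is_file() or path.resolve() == bar_path.resolve():
                continue
            # Store paths relative to the folder so index.html sits at the
            # top level and relative image references still resolve.
            archive.write(path, path.relative_to(folder).as_posix())
    return bar_path
```

Because a .bar file is an ordinary zip under a different name, any zip tool can open it once it is renamed.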

I’d love to hear what you think. I talked about this on a recent episode of Timetable as well. Might be a nice topic to follow up on at IndieWebCamp Austin in 2 weeks.

Manton Reece @manton