Porting Content Between Site Versions
Over the years I’ve had a number of different personal websites (not even counting the number of abortive half-built efforts that got abandoned along the way), but each one has started from the same basic content as the previous iteration. Here are some of the scripts and tricks I’ve used for migrating content between different systems. This involved a large degree of fucking around && finding out, so let’s consolidate it from all the repos on my machine into one place for future reference.
Caveats
Many of the implementation details in these scripts are hyper-specific to my use case, but the principles should generalise. I think it’s useful to show how I’ve solved some of these problems by writing one-off scripts; not everything needs to be a clean, reusable abstraction. It’s fine to hack together a quick script to get the job done, even if it doesn’t handle errors or edge cases correctly. My Ruby, in particular, is very much hack level.
1a: Stacey to ActiveRecord
At first I had a Stacey site for my portfolio, with a separate WordPress site for my blog (see next section). Stacey uses a YAML file for each post, with a content key that you put Markdown or HTML in (similar to the Markdown + YAML frontmatter format that things like Eleventy use), e.g.
title: Post Title
date: 2022
description: Quick description of the post
tag: tag1, tag2
content: |
  Some markdown or HTML content
I was porting to a Padrino site using ActiveRecord as an ORM, so using a rake task seemed like a reasonable solution. The Post model would store HTML in its content field, and could also have and belong to many tags. Once the Post and Tag models are defined, and the Stacey content copied to /app/data/portfolio, we can add an import.rake task to loop over the folder and add a new post to the database for each item, as well as any tags required.
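For reference, the models can be as minimal as this (a rough sketch; the real ones have a few more fields, and the has_and_belongs_to_many association assumes a posts_tags join table exists):

class Post < ActiveRecord::Base
  # joined to tags through a posts_tags table
  has_and_belongs_to_many :tags
end

class Tag < ActiveRecord::Base
  has_and_belongs_to_many :posts
end

And the import task itself: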
require 'yaml'
require 'kramdown'

# define the import task
task :import => :environment do
  portfolio_import
end

def portfolio_import
  Dir.foreach('./app/data/portfolio') do |item|
    next if item == '.' or item == '..'
    # load the yml file
    p = YAML.load_file("./app/data/portfolio/#{item}/project.yml")
    # get the content field
    raw_content = p['content']
    # Everything was inside ULs in the content, so we could tell if
    # the content was markdown or HTML by checking the first character
    if raw_content.first == '-'
      # it's Markdown, convert to HTML
      content = Kramdown::Document.new(raw_content, entity_output: :as_char).to_html
    else
      # just use the HTML as is
      content = raw_content
    end
    # get the list of tags, split on , and filter any empty items
    tags = p['tag'].downcase.gsub(',', ' ').split(' ').reject(&:blank?)
    # build an active record collection of tags
    tag_collection = []
    tags.each do |tag|
      # find or create the tag, and add it to the collection
      t = Tag.find_or_create_by(title: tag)
      tag_collection << t
    end
    # create a new Post
    post = Post.new(
      title: p['title'],
      slug: item.to_s.split('.').last,
      publish_date: (Date.new(p['date']) rescue nil),
      content: content,
      post_type: 'project',
      status: 'publish',
      tags: tag_collection
    )
    # save it
    post.save
  end
end
Running bundle exec padrino rake import would run the task. Job half done.
1b: WordPress XML to ActiveRecord
Next we want to take a WordPress XML export and do the same process. Turns out nokogiri is useful for something other than making life impossible for people trying to use Ruby on Windows.
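For reference, each post in the WordPress export file looks something like this (heavily abridged, with placeholder values; the namespace declarations are omitted):

<item>
  <title>Post Title</title>
  <pubDate>Mon, 01 Jan 2022 00:00:00 +0000</pubDate>
  <category domain="post_tag" nicename="tag1"><![CDATA[tag1]]></category>
  <content:encoded><![CDATA[<p>Some HTML content</p>]]></content:encoded>
  <wp:post_type>post</wp:post_type>
  <wp:status>publish</wp:status>
</item>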
# add to the earlier requires
require 'nokogiri'

# add to the import task
task :import => :environment do
  portfolio_import
  blog_import
end

def blog_import
  # import and parse the xml
  archive = File.open("./app/data/archive.xml") { |f| Nokogiri::XML(f) }
  # get each item (page or post)
  archive.css('item').each do |item|
    # we only want posts
    if (item.css('wp|post_type').first.content rescue nil) == 'post'
      # parse the post's tag names into an array
      tags = []
      item.css('category').each do |tag|
        if tag.attributes['domain'].to_s == 'post_tag'
          # these are Nokogiri attribute nodes, hence tag.value below
          tags << tag.attributes['nicename']
        end
      end
      # build an active record collection of tags
      tag_collection = []
      tags.each do |tag|
        t = Tag.find_or_create_by(title: tag.value)
        tag_collection << t
      end
      # create a new Post and populate with data from the XML.
      post = Post.new(
        title: (item.css('title').first.content rescue nil),
        slug: (item.css('title').first.content.parameterize.dasherize rescue nil),
        publish_date: (Date.parse(item.css('pubDate').first.content) rescue nil),
        # WP stores the content as HTML
        content: (item.css('content|encoded').first.content rescue nil),
        # published/draft
        status: (item.css('wp|status').first.content rescue nil),
        post_type: 'blog',
        tags: tag_collection
      )
      # save it to the DB
      post.save
    end
  end
end
Now running bundle exec padrino rake import will import the blog entries into the database.
2: Active Record to Markdown files
Fast forward a few years, and I’ve decided I’d like to move to a static site. At various times this was going to be in Next.js, Eleventy and eventually Astro, all of which can build pages from Markdown files. So we want to export all the posts from the database into correctly formatted Markdown, and also grab their associated poster images from the Asset model (which was using paperclip under the hood for file uploading).
This is largely the same process as the import, but in reverse.
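For context, the Asset model is little more than a paperclip attachment; a sketch, assuming the default paperclip setup (has_attached_file is what provides asset.file.url below):

class Asset < ActiveRecord::Base
  # paperclip attachment; provides asset.file and asset.file.url
  has_attached_file :file
end

And the export task itself: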
require 'kramdown'
require 'open-uri'

task :export => :environment do
  posts_export
  assets_export
end

def assets_export
  # for each Asset in the database
  Asset.all.each do |asset|
    # create a local file with the right name
    File.open("images/#{asset.file.url.split('/').last}", 'wb') do |fo|
      # read the file from the server and write to the local file
      # (Kernel#open accepts URLs here because open-uri is required;
      # Ruby 3+ needs URI.open instead)
      fo.write open("https://kylemacquarrie.co.uk#{asset.file.url}").read
    end
  end
end

def posts_export
  Post.all.each do |post|
    # ignore contacts
    next if post.post_type == 'contact'
    slug = post.slug.chomp('/')
    post_type = "#{post.post_type}#{post.post_type == 'blog' ? '' : 's'}"
    # parse html content back to markdown
    markdown = Kramdown::Document.new(post.content, input: 'html').to_kramdown rescue ''
    # remove attributes that the janky wysiwyg editor added
    markdown = markdown.gsub('{: target="_blank"}', '')
    markdown = markdown.gsub('{: .ql-syntax spellcheck="false"}', '')
    # construct a string in the correct markdown + frontmatter format
    # (kept flush-left so the frontmatter stays valid)
    string = "---
title: '#{post.title}'
abstract: \"#{post.abstract}\"
status: #{post.status}
published: #{post.publish_date}
tags: #{post.tags.all.map { |t| t.title }.join(',')}
image: #{post.main_asset.file.url.split('/').last rescue ''}
position: #{post.position}
---
#{markdown}
"
    # write to a new markdown file with the correct path
    File.open("posts/#{post_type}/#{slug}.md", 'wb') do |fo|
      fo.write string
    end
  end
end
Running bundle exec padrino rake export gives us a couple of folders of Markdown files, and a folder of all the images that we ever uploaded in the Padrino Admin CMS, ready to dump into the fancy new site’s git repo.
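Each exported file ends up in the familiar frontmatter format; with hypothetical values, something like:

---
title: 'Post Title'
abstract: "Quick description of the post"
status: publish
published: 2022-01-01
tags: tag1,tag2
image: poster.jpg
position: 1
---
Some markdown content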
3: Rewrite Flickr hotlinks to local files
For historical reasons (because my original shared hosting didn’t have much space) most of the images that were in the site content were hotlinked from my Flickr account (remember them?). I was keen to remove that external dependency. Fortunately Flickr allows you to download your archive as a zip file, so we just need to reverse-engineer a way of mapping the CDN link to the original file.
The Flickr URLs looked like this:
https://farm6.staticflickr.com/5537/14063836058_5a91d05c09_b.jpg
The matching image in the zip file would be something like
/path/to/archive/original_filename_14063845389_o.jpg
Unfortunately that still leaves us some work to do to map the CDN link back to the original image.
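The usable part is the photo ID: the first long digit run in the CDN filename is also embedded in the archive filename, just before the _o suffix. As a minimal Ruby sketch of the idea (the actual script below is Node, but the logic is the same):

url = 'https://farm6.staticflickr.com/5537/14063836058_5a91d05c09_b.jpg'
id  = File.basename(url).split('_').first # => "14063836058"
# find the archive file whose name contains that ID
original = Dir.children('./flickr_archive').find { |f| f.include?(id) }

The full script: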
// import.js (uses ESM imports, so run in a project with "type": "module")
import { readdirSync, readFileSync, copyFileSync, writeFileSync } from 'fs'

function main() {
  // identify Flickr CDN links. The first 10+ digit section appears to be the original image ID
  // e.g. https://farm6.staticflickr.com/5537/14063836058_5a91d05c09_b.jpg
  const regex =
    /https:\/\/farm\d.static.?flickr.com\/\d{4}\/\d{10,}_.{10,}.jpg/g
  // get a list of all the images in the flickr archive
  const images = readdirSync('./flickr_archive')
    // filter out WSL guff - MacOS may need to remove .DS_Store etc
    .filter((file) => !file.includes('Zone.Identifier'))

  function doFolder(type) {
    // in this case we have two folders, /blog and /projects.
    // get a list of all posts from that folder
    const posts = readdirSync(`./posts/${type}`)
    posts.forEach((postName) => {
      const postPath = `./posts/${type}/${postName}`
      // read the post content as a string
      let post = readFileSync(postPath).toString()
      // find any flickr cdn URLs
      const flickrLinks = post.matchAll(regex)
      // for each flickr link
      for (const match of flickrLinks) {
        // extract the ID
        const url = match[0]
        const parts = url.replace('https://', '').split('/')
        const id = parts[parts.length - 1].split('_')[0]
        // find the image with a matching ID
        const img = images.find((i) => i.match(id))
        if (!img) {
          console.log(`couldn't find an image matching ${id} in ${postPath}`)
          // skip this link rather than trying to copy a nonexistent file
          continue
        }
        // copy file to new location
        const from = `./flickr_archive/${img}`
        const to = `./public/images/${img}`
        console.log(`copying ${from} to ${to}`)
        copyFileSync(from, to)
        // update URL in file
        const newUrl = `/images/${img}`
        console.log(`replacing ${url} with ${newUrl}`)
        post = post.replace(url, newUrl)
      }
      // write the post back to disk
      writeFileSync(postPath, post)
    })
  }

  doFolder('blog')
  doFolder('projects')
}

main()
Running node import.js gives us a folder of images that are linked from the content, and updates the links to point to the new paths.