Porting Content Between Site Versions
Over the years I’ve had a number of different personal websites (not even counting the number of abortive half-built efforts that got abandoned along the way), but each one has started from the same basic content as the previous iteration. Here are some of the scripts and tricks I’ve used for migrating content between different systems. This involved a large degree of fucking around && finding out, so let’s consolidate it from all the repos on my machine into one place for future reference.
Caveats
Many of the implementation details in these scripts are hyper-specific to my use case, but the principles should generalise. I think it’s useful to show how I’ve solved some of these problems by writing one-off scripts; not everything needs to be a clean, reusable abstraction. It’s fine to hack together a quick script to get the job done, even if it doesn’t handle errors or edge cases correctly. My Ruby, in particular, is very much hack level.
1a: Stacey to ActiveRecord
At first I had a Stacey site for my portfolio, with a separate WordPress site for my blog (see next section). Stacey uses a YAML file for each post, with a content key that you put Markdown or HTML in (similar to the Markdown + YAML frontmatter format that things like Eleventy use), e.g.
title: Post Title
date: 2022
description: Quick description of the post
tag: tag1, tag2
content: |
  Some markdown or HTML content
I was porting to a Padrino site using ActiveRecord as an ORM, so using a rake task seemed like a reasonable solution. The Post model would store HTML in its content field, and could also have and belong to many tags. Once the Post and Tag models are defined, and the Stacey content copied to /app/data/portfolio, we can add an import.rake task to loop over the folder and add a new post to the database for each item, as well as any tags required.
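For reference, the models can be as minimal as this (a rough sketch; the real ones have a few more fields, and the has_and_belongs_to_many association assumes a posts_tags join table exists):

class Post < ActiveRecord::Base
  # joined to tags through a posts_tags table
  has_and_belongs_to_many :tags
end

class Tag < ActiveRecord::Base
  has_and_belongs_to_many :posts
end

And the import task itself: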
require 'yaml'
require 'kramdown'

# define the import task
task :import => :environment do
  portfolio_import
end

def portfolio_import
  Dir.foreach('./app/data/portfolio') do |item|
    next if item == '.' or item == '..'
    # load the yml file
    p = YAML.load_file("./app/data/portfolio/#{item}/project.yml")
    # get the content field
    raw_content = p['content']
    # Everything was inside ULs in the content, so we could tell if
    # the content was markdown or HTML by checking the first character
    if raw_content.first == '-'
      # it's Markdown, convert to HTML
      content = Kramdown::Document.new(raw_content, entity_output: :as_char).to_html
    else
      # just use the HTML as is
      content = raw_content
    end
    # get the list of tags, split on , and filter any empty items
    tags = p['tag'].downcase.gsub(',', ' ').split(' ').reject(&:blank?)
    # build an active record collection of tags
    tag_collection = []
    tags.each do |tag|
      # find or create the tag, and add it to the collection
      t = Tag.find_or_create_by(title: tag)
      tag_collection << t
    end
    # create a new Post
    post = Post.new(
      title: p['title'],
      slug: item.to_s.split('.').last,
      publish_date: (Date.new(p['date']) rescue nil),
      content: content,
      post_type: 'project',
      status: 'publish',
      tags: tag_collection
    )
    # save it
    post.save
  end
end
Running bundle exec padrino rake import would run the task. Job half done.
1b: WordPress XML to ActiveRecord
Next we want to take a WordPress XML export and do the same process. Turns out nokogiri is useful for something other than making life impossible for people trying to use Ruby on Windows.
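For reference, each post in the WordPress export file looks something like this (heavily abridged, with placeholder values; the namespace declarations are omitted):

<item>
  <title>Post Title</title>
  <pubDate>Mon, 01 Jan 2022 00:00:00 +0000</pubDate>
  <category domain="post_tag" nicename="tag1"><![CDATA[tag1]]></category>
  <content:encoded><![CDATA[<p>Some HTML content</p>]]></content:encoded>
  <wp:post_type>post</wp:post_type>
  <wp:status>publish</wp:status>
</item>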
# add to the earlier requires
require 'nokogiri'

# add to the import task
task :import => :environment do
  portfolio_import
  blog_import
end

def blog_import
  # import and parse the xml
  archive = File.open("./app/data/archive.xml") { |f| Nokogiri::XML(f) }
  # get each item (page or post)
  archive.css('item').each do |item|
    # we only want posts
    if (item.css('wp|post_type').first.content rescue nil) == 'post'
      # parse the post's tag names into an array
      tags = []
      item.css('category').each do |tag|
        if tag.attributes['domain'].to_s == 'post_tag'
          # these are Nokogiri attribute nodes, hence tag.value below
          tags << tag.attributes['nicename']
        end
      end
      # build an active record collection of tags
      tag_collection = []
      tags.each do |tag|
        t = Tag.find_or_create_by(title: tag.value)
        tag_collection << t
      end
      # create a new Post and populate with data from the XML.
      post = Post.new(
        title: (item.css('title').first.content rescue nil),
        slug: (item.css('title').first.content.parameterize.dasherize rescue nil),
        publish_date: (Date.parse(item.css('pubDate').first.content) rescue nil),
        # WP stores the content as HTML
        content: (item.css('content|encoded').first.content rescue nil),
        # published/draft
        status: (item.css('wp|status').first.content rescue nil),
        post_type: 'blog',
        tags: tag_collection
      )
      # save it to the DB
      post.save
    end
  end
end
Now running bundle exec padrino rake import will import the blog entries into the database.
2: Active Record to Markdown files
Fast forward a few years, and I’ve decided I’d like to move to a static site. At various times this was going to be in Next.js, Eleventy and eventually Astro, all of which can build pages from Markdown files. So we want to export all the posts from the database into correctly formatted Markdown, and also grab their associated poster images from the Asset model (which was using paperclip under the hood for file uploading).
This is largely the same process as the import, but in reverse.
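For context, the Asset model is little more than a paperclip attachment; a sketch, assuming the default paperclip setup (has_attached_file is what provides asset.file.url below):

class Asset < ActiveRecord::Base
  # paperclip attachment; provides asset.file and asset.file.url
  has_attached_file :file
end

And the export task itself: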
require 'kramdown'
require 'open-uri'

task :export => :environment do
  posts_export
  assets_export
end

def assets_export
  # for each Asset in the database
  Asset.all.each do |asset|
    # create a local file with the right name
    File.open("images/#{asset.file.url.split('/').last}", 'wb') do |fo|
      # read the file from the server and write to the local file
      # (Kernel#open accepts URLs here because open-uri is required;
      # Ruby 3+ needs URI.open instead)
      fo.write open("https://kylemacquarrie.co.uk#{asset.file.url}").read
    end
  end
end

def posts_export
  Post.all.each do |post|
    # ignore contacts
    next if post.post_type == 'contact'
    slug = post.slug.chomp('/')
    post_type = "#{post.post_type}#{post.post_type == 'blog' ? '' : 's'}"
    # parse html content back to markdown
    markdown = Kramdown::Document.new(post.content, input: 'html').to_kramdown rescue ''
    # remove attributes that the janky wysiwyg editor added
    markdown = markdown.gsub('{: target="_blank"}', '')
    markdown = markdown.gsub('{: .ql-syntax spellcheck="false"}', '')
    # construct a string in the correct markdown + frontmatter format
    # (kept flush-left so the frontmatter stays valid)
    string = "---
title: '#{post.title}'
abstract: \"#{post.abstract}\"
status: #{post.status}
published: #{post.publish_date}
tags: #{post.tags.all.map { |t| t.title }.join(',')}
image: #{post.main_asset.file.url.split('/').last rescue ''}
position: #{post.position}
---
#{markdown}
"
    # write to a new markdown file with the correct path
    File.open("posts/#{post_type}/#{slug}.md", 'wb') do |fo|
      fo.write string
    end
  end
end
Running bundle exec padrino rake export gives us a couple of folders of Markdown files, and a folder of all the images that we ever uploaded in the Padrino Admin CMS, ready to dump into the fancy new site’s git repo.
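Each exported file ends up in the familiar frontmatter format; with hypothetical values, something like:

---
title: 'Post Title'
abstract: "Quick description of the post"
status: publish
published: 2022-01-01
tags: tag1,tag2
image: poster.jpg
position: 1
---
Some markdown content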
3: Rewrite Flickr hotlinks to local files
For historical reasons (because my original shared hosting didn’t have much space) most of the images that were in the site content were hotlinked from my Flickr account (remember them?). I was keen to remove that external dependency. Fortunately Flickr allows you to download your archive as a zip file, so we just need to reverse-engineer a way of mapping the CDN link to the original file.
The Flickr URLs looked like this:
https://farm6.staticflickr.com/5537/14063836058_5a91d05c09_b.jpg
The matching image in the zip file would be something like
/path/to/archive/original_filename_14063845389_o.jpg
Unfortunately that still leaves us some work to do to map the CDN link back to the original image.
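The usable part is the photo ID: the first long digit run in the CDN filename is also embedded in the archive filename, just before the _o suffix. As a minimal Ruby sketch of the idea (the actual script below is Node, but the logic is the same):

url = 'https://farm6.staticflickr.com/5537/14063836058_5a91d05c09_b.jpg'
id  = File.basename(url).split('_').first # => "14063836058"
# find the archive file whose name contains that ID
original = Dir.children('./flickr_archive').find { |f| f.include?(id) }

The full script: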
// import.js (uses ESM imports, so run in a project with "type": "module")
import { readdirSync, readFileSync, copyFileSync, writeFileSync } from 'fs'

function main() {
  // identify Flickr CDN links. The first 10+ digit section appears to be the original image ID
  // e.g. https://farm6.staticflickr.com/5537/14063836058_5a91d05c09_b.jpg
  const regex =
    /https:\/\/farm\d.static.?flickr.com\/\d{4}\/\d{10,}_.{10,}.jpg/g
  // get a list of all the images in the flickr archive
  const images = readdirSync('./flickr_archive')
    // filter out WSL guff - MacOS may need to remove .DS_Store etc
    .filter((file) => !file.includes('Zone.Identifier'))

  function doFolder(type) {
    // in this case we have two folders, /blog and /projects.
    // get a list of all posts from that folder
    const posts = readdirSync(`./posts/${type}`)
    posts.forEach((postName) => {
      const postPath = `./posts/${type}/${postName}`
      // read the post content as a string
      let post = readFileSync(postPath).toString()
      // find any flickr cdn URLs
      const flickrLinks = post.matchAll(regex)
      // for each flickr link
      for (const match of flickrLinks) {
        // extract the ID
        const url = match[0]
        const parts = url.replace('https://', '').split('/')
        const id = parts[parts.length - 1].split('_')[0]
        // find the image with a matching ID
        const img = images.find((i) => i.match(id))
        if (!img) {
          console.log(`couldn't find an image matching ${id} in ${postPath}`)
          // skip this link rather than trying to copy a nonexistent file
          continue
        }
        // copy file to new location
        const from = `./flickr_archive/${img}`
        const to = `./public/images/${img}`
        console.log(`copying ${from} to ${to}`)
        copyFileSync(from, to)
        // update URL in file
        const newUrl = `/images/${img}`
        console.log(`replacing ${url} with ${newUrl}`)
        post = post.replace(url, newUrl)
      }
      // write the post back to disk
      writeFileSync(postPath, post)
    })
  }

  doFolder('blog')
  doFolder('projects')
}

main()
Running node import.js gives us a folder of images that are linked from the content, and updates the links to point to the new paths.