Dumpster diving & retrieval of Wordpress content

Backing things up

Background

For reasons that are stupid, I’ve lost my hosted Wordpress website. The TLDR version is that the webhost keeps doing something. The PHP version keeps changing, the MySQL database corrupts, the Wordpress backend auto-updates either via a webhost script or within Wordpress itself and then corrupts.

Either way, enough is enough. I don’t actually use Wordpress for all of the bells and whistles. I don’t actually use it like a dynamic site, and I don’t need a relational database powering the backend. What I mostly use it as, is a static file store. In work projects, I’ve been using the Python Sphinx library for a long time to have project documentation being automatically built by the CI pipeline after a commit is merged into the develop branch.

After listening to Derek Sivers on the Tim Ferriss podcast, he correctly stated that he’s been writing daily journal articles as Markdown files for literal decades, and even with that amount of writing, the text easily fits into memory of even the smallest virtual machine. This gives the benefits that all of your history is essentially system agnostic, instantly moveable, human readable, and wildly fast to search.

Alright, I’m convinced, let’s pull the trigger.

First I needed some engine to convert static files to an actual web framework - I’m a backend guy. I don’t want to be messing around with HTML and CSS unless I absolutely have to. This very much falls into the category of “someone will have already done this better than me”.

A bit of Googling later, and I’ve settled on Hugo as my engine of choice.

Setup

I’m not going to rehash the Hugo quick start tutorial, that’s been done elsewhere, so I’m not going to add any value there.

Needless to say, Hugo is kind of great! I particularly like the ability that it just sort of works out of the box and gives you sensible defaults.

What’s initially great is that it just runs from the command line on my local Mac and I can “see” the site through http://localhost:1313/ with a simple hugo server- allowing me to fix the embarrassing bugs before anyone can lay eyes on it.

When I’m good and done, the v1 skeleton was built with

hugo

And then I could just use FileZilla to FTP the files up to my webhost.

Easy peasy!

Restoration attempt 1: webhost backup

Despite the webhost advertising “unlimited” webhosting, you quickly find that it is actually caveated with a reasonable, fair-use policy for a personal website. Which by their definition is actually 10 GB for everything, which isn’t crazy, but isn’t actually a lot of space.

Why do I mention this? Well because, they also, by default have a pseudo-logarithmic backup policy on by default that wraps your entire site up into a gzipped tarbell. For most users, this would be a nice feature to have an auto-backup option. Unless, your Wordpress engine is constantly trying to auto-update and corrupting itself.

In which case, I’ve been left with the last several backups being essentially vanilla WordPress installs with no content. Awesome!

But wait! There’s more. Because this is a dumb backup. What it has done is overwritten my older backups that I had manually triggered against a known “good” state of the system.

Awesome! Always overwrite the user triggered backups. Because they’re old. Ignore the fact that they are 10x the size on disk.

Restoration attempt 2: random files from disk

Having never been a huge fan of the WordPress WYSIWYG writing interface, because a) I’ve always hated the font (and therefore found it distracting) and b) I often like to write without WiFi much of the content on my website starts as files on my local machine.

For several of the newer pieces, I therefore had Mac Pages, Word .doc or OpenOffice .odt files in a bunch of different directories on this Mac.

Whilst annoying, to copy/paste these different text formats into (the now retired) Atom and then manually adjust some of the formatting to turn it into Markdown, wasn’t a total deal breaker.

However, that left some major holes in the back catalog.

Restoration attempt 3: web archives

I’m not really sure, why the idea popped into my head - but a few times in the past, I’ve had clear memories of blogs, tweets or news articles only to be unable to find them through Google.

So certain have I been of some of these things, I started down the Google cache rabbit hole - only to find that the Internet Archive is actually, by far, the best resource for this. Intellectually, I’ve understood the purpose of trying to snapshot the internet in the past, to avoid an Orwellian or Fahrenheit 451 reshaping of history. But here, I found myself praying to the Internet Gods(TM) that they might save me.

Indeed, heading to https://web.archive.org/web/20230000000000*/http://morganbye.com/ revealed that morganbye.com had been scraped some 37 times since mid-2015. And the earlier iteration of morganbye.net had been scraped 77 times (via https://web.archive.org/web/20230000000000*/http://morganbye.net/) as early as June 2009.

With a bit of clever wildcarding of URLs, I could find a list of all URLs that the Archive had ever scrapped

https://web.archive.org/web/*/http://morganbye.net/*

Unfortunately, with this being WordPress and all, there were a lot of URLs that was just pre-populated Wordpress queries, temporary/draft pages or similar…

http://morganbye.net/?C=S;O=A
http://morganbye.net/?custom-css=1&csblog=1&cscache=6&csrev=4
http://morganbye.net/?page_id=221
http://morganbye.net/author/morgan/
http://morganbye.net/author/morgan/feed/

Of the 1228 URLs listed, perhaps 100 of them were going to be useful content.

Themes

And so began the long and slow process of clicking each link, navigating through the couple of versions of the page the Archive might have and finding the one that was most representative, or in the best format.

It turns out, that over the years, WordPress and the Archive have evolved greatly and things like the image handling or backgrounds vary greatly based upon when the page was scraped.

Additionally, over the years my sites have used 4 or 5 different themes - with some themes being easier to scrape than others.

Most frustratingly, at least one of the themes didn’t seem to have the post datetime preserved, so whilst I had the content, I was left guessing at the approximate date something was published.

Bad restoration

In the cases where I did have the datetimes, another problem seemed to arise. It appears that at some point in 2014, I restored the Wordpress database of the webhost from an archive. In fact this is probably around the time that I changed webhosts and moved away from GoDaddy.

In any case, it appears that the restoration process replaced the publication date with the restoration time, yielding more than 20 posts published within 2 seconds of each other. I might have moments of prolific writing, but not like that.

Scraping settings

Due to the webhost settings, the Archive missing things or my own personal settings (robots.txt, site license or whatever) there were still some pages missing and particularly everything pre-2013 was a real challenge to find.

In some instances, I could find the blog feed from say 2009: https://web.archive.org/web/20111104030648/http://morganbye.net/blog/Morgan/2009

However, none of the linked pages actually got scraped. Puzzling.

Archive conclusion

I felt like I was coming to the end of what was useful from the Internet Archive. However, at the same time, I really felt like it had saved my bacon.

I made my donation to the Archive as some form of payment for helping me out in a tough moment.

But I remained stunned that places like Flickr had kept my profile live and all of my photos up after all these years.

Restoration attempt 4: dig out that hard drive

At this point, I’m getting pretty desperate. Desperate is probably the wrong word. After discovering the webhost had destroyed all my backups, I sort of suspected that it was game over - so the fact that I’ve managed to get anything of my personal archive restored felt like a huge win.

But exactly because I had had some success up until this point, I was on a mission to try and now maximize every possible option. A personal mission.

And so began attempt number 4.

At the bottom of a wardrobe, lies a box of assorted, random cables that I’ve kept because of my OCD that thinks that cables might be useful one day.

I knew in that box was on old backup hard drives from half a lifetime ago.

In my heart, I felt like this is probably what I should have done from the beginning and started to feel a bit stupid for ploughing so many hours into reformatting the Web Archive into markdown files manually.

I pull out the box, I find the Western Digital 2TB disks. The labels tell me that they were manufactured in February of 2011. I cringe. How much hope am I putting into decade old hard drives that haven’t been connected to anything in more than 5 years?

In Sharpie the drives are labelled “BACKUP SERVER 1” and “BACKUP SERVER 2”. These are the same disks that were used in the backup server that I set up in the first weeks of my PhD. Terrifying!

While I’m in the box, I dig out the Icy Box external hard drive dock - which somehow they are still in business and essentially still sell the same item. Of course, the power adapter is for the UK, so I have to find a plug adapter to get it working.

I now have a Frankenstein setup across my desk, of ancient USB cables, plugging into my dock, to get it to talk to my Mac via Thunderbolt, whilst I have a power brick on top of an international adapter on top of a power bar.

This allows me cross my fingers, flip on the power and pray.

The disks spin up, and MacOS just detects them like it isn’t even a thing.

A decade ago, these hard drives defined who I was as a person. I dragged them around the planet with me several times. They’ve seen more airports than I want to count. And frankly if any of those airports had stopped me and inspected the drives there’s probably enough kompromat on them to get me a few years in prison.

Between all of the mid-2000s pirated movies and TV shows, is a folder labelled website.

It contains, almost nothing useful.

Photoshop files of logo development from 10 years ago, a couple of compressed and downsampled images that I have better resolution elsewhere.

I take a couple across to my Mac hard drive for later inspection (before the drive dies), but I am disappointed.

Restoration attempt 5: Hang on! I’ve got a MacBook somewhere!

Hidden somewhere in the depths of a storage box, inspiration struck!

When I first moved to Canada, the good lady wife bought me a new laptop. A shiny, brand new 12-inch MacBook. The one that’s now infamous for having ONLY 1 USB-C port (back when nothing was USB-C) and kickstarting what became the MacBook Air revival.

The 2015 12-inch MacBook, image credit: CNET

I recalled that when I came to Canada, and I was meaningfully unemployed for a little while - I did all of my job applications and scientific journalism on the little machine. There was a strong chance that this little box, now 8 years old, probably contained some secrets.

After a little rummaging, I found the little guy… and because of that USB-C port, I didn’t need to struggle to find a charger, I just pulled one out from my work bag.

With a few minutes to get a bootable amount of charge in the battery, it sprang into life. A few minutes of browsing the file system and I found a folder of SQL dumps from various points across the years.

Okay, now we’re in business!

What do I do with .sql database dumps

But now, I have an interesting dilemma. What do I do with a .sql file?

All of the online documentation for migrating from WordPress to Hugo assumes that you have an operational instance of WordPress.

The Hugo dev team even supports a WordPress plugin that you can install to your instance and then export from the WordPress GUI backend.

But I don’t have an operational WordPress instance.

Option 1 - deploy WordPress online

I could reinstall WordPress on my webhost, then upload the backup files, import and restore them from the WordPress backend.

In principle, probably the fastest solution. Except 1) I’ve already got Hugo live on the webhost and 2) I’m giving up WordPress on this webhost because it’s always been a nightmare to get operational and keep live without everything corrupting (WordPress version, PHP version, MySQL version, VM memory).

Option 2 - spin up WordPress local

I am aware that there is a WordPress Docker image out there, and there are some nice doc pages from web hosts showing you how to use Docker Compose to orchestrate the Docker and MySQL containers.

But man that seems like a lot of effort!

I mean, the problem is that this is intentionally designed for PRD deployments, so everything is hidden behind a Nginx web server, and expects all communications to be encrypted behind a TLS/SSL cert.

I’m glad that it exists, but it’s massive overkill for what I want to do. Basically just restore a backup and click export.

Also, I’ve tried this before in the past, where I wanted to play around with some themes and UI elements but didn’t want to crash the webhost VM.

I had previously given up on the Docker option after an afternoon of debugging.

Option 3 - host my own database

The final option would be to create my own database server locally, run the .sql file, and then export what I want from the database as I want.

A couple of SQL queries, and we’re good. Right?

Wrong!

I’ll detail my journey of exporting the database in a separate post.

Older post

2023-12-10

Newer post

2023-12-24

What distinguishes you from other developers?

I've built data pipelines across 3 continents at petabyte scales, for over 15 years. But the data doesn't matter if we don't solve the human problems first - an AI solution that nobody uses is worthless.

Are the robots going to kill us all?

Not any time soon. At least not in the way that you've got imagined thanks to the Terminator movies. Sure somebody with a DARPA grant is always going to strap a knife/gun/flamethrower on the side of a robot - but just like in Dr.Who - right now, that robot will struggle to even get out of the room, let alone up some stairs.

But AI is going to steal my job, right?

A year ago, the whole world was convinced that AI was going to steal their job. Now, the reality is that most people are thinking 'I wish this POC at work would go a bit faster to scan these PDFs'.

When am I going to get my self-driving car?

Humans are complicated. If we invented driving today - there's NO WAY IN HELL we'd let humans do it. They get distracted. They text their friends. They drink. They make mistakes. But the reality is, all of our streets, cities (and even legal systems) have been built around these limitations. It would be surprisingly easy to build self-driving cars if there were no humans on the road. But today no one wants to take liability. If a self-driving company kills someone, who's responsible? The manufacturer? The insurance company? The software developer?