Happy “check your backups day”; does your restore process work? And how is the rest of your admin process doing?
Posted by jpluimers on 2018/02/01
Today is [WayBack] Check your backups Day!, started by @CyberShambles to commemorate the @Gitlab outage on 20170201.
Please check your restoration process now, as people screw up and accidents happen (I know first hand from a client).
Why isn’t this date on January 31st? Long story short: the failure started on that date, but restoration took most of 20170201. So February 1st it is.
Others will follow, and GitLab wasn’t alone: a few days earlier, soup.io had to restore a 2015 database backup.
It all comes back to
Nobody wants backup.
Everybody wants restore.
which made it into the 2008 [WayBack] adminzen.org – The Admin Zen and has been attributed to various people, including [WayBack] Kristian Köhntopp and [WayBack] Martin Seeger, who told Kristian Köhntopp that it was coined by Sun’s Michael Nagorsnik at one of the early [WayBack] NuBIT. Martin was there; he knows (:
The oldest mention of the phrase I could find was in 2006 by Volker Bir at [WayBack] Spy Sheriff – so how do people get infected w/ this thing?.
Keeping clients in the loop
Since soup.io hosts its updates blog on its own platform, the restore meant that the post directly before [Archive.is] Update after crash ;) – Soup Updates is, somewhat ironically, the mid-2015 [WayBack] Give us your money! – Soup Updates. Usually dogfooding is a good thing, though.
During such a downtime, it is crucial to stay in touch through alternative channels. Soup.io didn’t do a good job on their Twitter account: they only announced the “update after crash”, not that they were down, why, or what progress they were making.
They also deny the WayBack machine access to updates.soup.io through [WayBack] robots.txt because of how they redirect through /remotes, but luckily Archive.is doesn’t care about that and has more recent copies of updates.soup.io, archived as recently as the end of 2015.
GitLab did a much better job on their GitLabStatus account.
Postmortems and organisation culture
Everybody can screw up, and usually a severe outage happens even when everybody tries to do the right thing. The only way to learn from it is to have [WayBack] Blameless PostMortems and a Just Culture – Code as Craft.
These postmortems are invaluable as you will fail, despite using everything from [WayBack] AdminZen:
The Admin Zen
Keep it up and running.
It isn’t without reason that you find a lot of WayBack or Archive.is links on my blog: hopefully they will serve as a backup when any of the original links disappear.
It can happen to anyone
I had a similar issue at a client where I was hired to fix bugs in their server side software. When accepting, I asked them if they had a good backup-restore procedure, which they confirmed; they had outsourced it.
Though I was not supposed to perform any infrastructure tasks, I had made manual backups of the production database and some of the most important files while investigating production performance. Days after that, the RAID array of the main production server collapsed, and they found out that restoring from the outsourced backups only produced a copy that was 3 months old.
Manually stitching everything together took more than a week. About a week later, they had hourly backups on different hardware and daily transfers to an external offline location.
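A three-month-old "latest" backup is the kind of thing even a trivial freshness check catches. Below is a minimal shell sketch, assuming backups land as plain files in one directory. The function name check_backup_fresh, the path and the threshold are all hypothetical, and this does not replace an actual restore test.

```shell
#!/bin/sh
# Sketch: fail when a backup directory holds no file newer than a given age.
# check_backup_fresh and its arguments are hypothetical names for illustration.

check_backup_fresh() {
    backup_dir=$1        # directory the backups are written to
    max_age_minutes=$2   # oldest acceptable age of the newest backup

    if [ ! -d "$backup_dir" ]; then
        echo "CRITICAL: $backup_dir does not exist" >&2
        return 2
    fi

    # -mmin -N matches files modified less than N minutes ago
    # (supported by GNU and BSD find, not part of strict POSIX find).
    newest=$(find "$backup_dir" -type f -mmin "-$max_age_minutes" | head -n 1)

    if [ -z "$newest" ]; then
        echo "CRITICAL: no backup in $backup_dir newer than $max_age_minutes minutes" >&2
        return 1
    fi

    echo "OK: recent backup found: $newest"
}
```

With hourly backups, something like `check_backup_fresh /srv/backups/db 90` (path assumed) could run from cron or a monitoring agent and alert on a non-zero exit code.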
The GitLab story
Back to the GitLab story: some great write-ups on the story are in the links below. Recommended reading.
- [WayBack] Nobody wants backup. Everybody wants restore. – The Isoblog.
- [WayBack] Write post-mortem (#1108) · Issues · GitLab.com / www-gitlab-com · GitLab
- [Archive.is] GitLab.com Database Incident – 2017/01/31
- [WayBack] GitLab.com Database Incident | GitLab
- [Archive.is] More information about today’s outage can be found at …
- [Archive.is] We accidentally deleted production data and might have to restore from backup. Google Doc with live notes …
Oh: and the reference pointing me to soup.io: [WayBack] Post like it is 2015 – The Isoblog.
One of the issues GitLab wants to solve is making it clear which system someone is working on.
Part of that could be using liquidprompt with this ~/.config/liquidpromptrc (or ~/.liquidpromptrc) setting:
That will always show the hostname and have different prompt colours on each host.
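A minimal sketch of such a setting, assuming the option names LP_HOSTNAME_ALWAYS and LP_ENABLE_SSH_COLORS from liquidprompt’s configuration documentation. Verify them against the liquidpromptrc-dist shipped with your version, as names and defaults may differ between releases.

```shell
# ~/.config/liquidpromptrc (or ~/.liquidpromptrc)
# Assumed option names; check the liquidpromptrc-dist of your liquidprompt version.

# Show the hostname in the prompt even on local sessions.
LP_HOSTNAME_ALWAYS=1

# Derive the hostname colour from the hostname itself, so each host looks different.
LP_ENABLE_SSH_COLORS=1
```

As far as I know, the per-host colouring applies to the hostname shown for SSH connections, so it is worth opening a session on two different machines to confirm the prompts really are distinguishable.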
Other references
- [WayBack] MySQL Backup and Recovery from Kristian Köhntopp
- [WayBack] mysql-backup-and-recovery-15-638.jpg (638×359) – Nobody wants backup. Everybody wants restore. Martin Seeger.
- [WayBack] Time Machine desiderata – No Such Weblog
–jeroen