Happy “check your backups day”; does your restore process work? And how is the rest of your admin process doing?


Posted by jpluimers on 2018/02/01

Today is [WayBack] Check your backups Day! started by @CyberShambles in dedication of the @Gitlab outage on 20170201.

Please check your restoration process now, as people screw up and accidents happen (I know first hand from a client).
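A restore check can be automated as a small drill. Below is a minimal sketch, assuming a SQLite database purely for illustration; the function names (`copy_database`, `table_row_counts`, `restore_drill`) and file names are hypothetical and not taken from any of the incidents mentioned here. It restores a backup into a scratch location, runs an integrity check, and compares row counts with what you expect.

```python
import sqlite3
import tempfile
from pathlib import Path


def copy_database(source_path: Path, target_path: Path) -> None:
    """Copy a database with SQLite's online backup API (used for backup and restore)."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(target_path)
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()


def table_row_counts(db_path: Path) -> dict:
    """Return {table name: row count} for all user tables."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
        return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
                for name in tables}
    finally:
        conn.close()


def restore_drill(backup_path: Path, expected_counts: dict) -> bool:
    """Restore the backup into a scratch directory and verify the result."""
    with tempfile.TemporaryDirectory() as scratch:
        restored_path = Path(scratch) / "restored.db"
        try:
            copy_database(backup_path, restored_path)
            conn = sqlite3.connect(restored_path)
            try:
                integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
            finally:
                conn.close()
            return (integrity == "ok"
                    and table_row_counts(restored_path) == expected_counts)
        except sqlite3.DatabaseError:
            # An unreadable backup counts as a failed drill.
            return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        # Throwaway "production" database, for demonstration only.
        live = Path(work) / "live.db"
        conn = sqlite3.connect(live)
        conn.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT)")
        conn.executemany("INSERT INTO issues (title) VALUES (?)",
                         [("first",), ("second",)])
        conn.commit()
        conn.close()

        backup = Path(work) / "backup.db"
        copy_database(live, backup)
        print("restore drill passed:",
              restore_drill(backup, table_row_counts(live)))
```

For a real system the same shape applies with that system's own dump and restore tools: the point is that the drill ends with a check on the restored data, not with the backup job reporting success.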

Why isn’t this date on January 31st? Long story short: the failure started on that date, but restoration took most of 20170201. So February 1st it is.

Others will follow and GitLab wasn’t alone, as a few days before soup.io had to restore a 2015 database backup.

It all comes back to

Nobody wants backup.

Everybody wants restore.

which made it to the 2008 [WayBack] adminzen.org – The Admin Zen and has been attributed to various people, including [WayBack] Kristian Köhntopp and [WayBack] Martin Seeger, who told Kristian Köhntopp that it was coined by Sun’s Michael Nagorsnik at one of the early [WayBack] NuBIT. Martin was there; he knows (:

The oldest mention of the phrase I could find was in 2006 by Volker Bir at [WayBack] Spy Sheriff – so how do people get infected w/ this thing?.

Keeping clients in the loop

Since soup.io hosts their updates blog on their own platform, the restore meant that the post just before [Archive.is] Update after crash ;) – Soup Updates is, somewhat ironically, the mid-2015 [WayBack] Give us your money! – Soup Updates. Usually dogfooding is a good thing, though.

During such a downtime, it is crucial to stay in touch through alternative channels. Soup.io didn’t do a good job on their Twitter account: they only announced the “update after crash”, not that they were down, why, or how the restore was progressing.

They also deny the WayBack machine access to updates.soup.io via [WayBack] robots.txt, because of how they redirect through /remotes. Luckily, Archive.is doesn’t care about that and has updates.soup.io snapshots archived as recently as the end of 2015.

GitLab did a much better job on their GitLabStatus account.

Postmortems and organisation culture

Everybody can screw up, and usually a severe outage happens even when everybody tries to do the right thing. The only way to learn from it is to have [WayBack] Blameless PostMortems and a Just Culture – Code as Craft.

These postmortems are invaluable as you will fail, despite using everything from [WayBack] AdminZen:

The Admin Zen

Keep it up and running.

[WayBack] Know your tools.

[WayBack] Anticipate.

[WayBack] Expect problems.

[WayBack] Design it.

[WayBack] Scale.

[WayBack] Backup.

[WayBack] Communicate.

[WayBack] Document.

Grab the [WayBack] PDF or [WayBack] PNG printout

It isn’t without reason that you find a lot of WayBack or Archive.is links on my blog: hopefully they will serve as a backup when any of the original links disappear.

It can happen to anyone

I had a similar issue at a client where I was hired to fix bugs in their server side software. When accepting, I asked them if they had a good backup-restore procedure; they confirmed they did and that it was outsourced.

Though I was not supposed to perform any infrastructure tasks, I had made manual backups of the production database and some of the most important files while investigating production performance. Days after that, the RAID array of the main production server collapsed, and they found out that restoring the outsourced backups only produced a copy that was 3 months old.

Manually stitching everything together took more than a week. About a week later, they had hourly backups on different hardware and daily transfers to an external offline location.
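A stale backup like that 3-month-old one can be caught early with a freshness check. The following is a minimal sketch, assuming backups land as files in a single directory and that file modification time is a good enough proxy for backup age; the function names and the two-hour threshold are hypothetical choices, picked to leave some slack around an hourly schedule.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional


def newest_backup_age(backup_dir: Path,
                      now: Optional[datetime] = None) -> Optional[timedelta]:
    """Return the age of the most recently modified file, or None if there is none."""
    now = now or datetime.now(timezone.utc)
    mtimes = [p.stat().st_mtime for p in backup_dir.iterdir() if p.is_file()]
    if not mtimes:
        return None
    newest = datetime.fromtimestamp(max(mtimes), tz=timezone.utc)
    return now - newest


def backups_are_fresh(backup_dir: Path,
                      max_age: timedelta = timedelta(hours=2),
                      now: Optional[datetime] = None) -> bool:
    """True only if at least one backup exists and the newest is within max_age."""
    age = newest_backup_age(backup_dir, now)
    return age is not None and age <= max_age
```

Run from a scheduler and wired to an alert, a check like this only tells you that a recent file exists; it does not replace actually restoring from it.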

The GitLab story

Back to the GitLab story: some great write-ups on the story are in the links below. Recommended reading.

Oh: and the reference pointing me to soup.io: [WayBack] Post like it is 2015 – The Isoblog.

One of the issues GitLab wants to solve is making it clear which system someone is working on.

Part of that could be running liquidprompt with these settings in ~/.config/liquidpromptrc (or ~/.liquidpromptrc):

LP_HOSTNAME_ALWAYS=1
LP_ENABLE_SSH_COLORS=1

That will always show the hostname and have different prompt colours on each host.
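The idea behind per-host colours can be illustrated with a short sketch. This is not liquidprompt's actual implementation, only an assumption-laden illustration of the principle: derive a stable colour from the hostname so the same host always looks the same. The function names, the colour table and the prompt layout are all hypothetical.

```python
import zlib

# ANSI foreground colour codes: red, green, yellow, blue, magenta, cyan.
ANSI_COLOURS = [31, 32, 33, 34, 35, 36]


def colour_for_host(hostname: str) -> int:
    """Map a hostname to a stable ANSI colour code via a CRC32 checksum."""
    checksum = zlib.crc32(hostname.encode("utf-8"))
    return ANSI_COLOURS[checksum % len(ANSI_COLOURS)]


def coloured_prompt(user: str, hostname: str) -> str:
    """Build a prompt string with the hostname wrapped in its colour."""
    code = colour_for_host(hostname)
    return f"{user}@\033[{code}m{hostname}\033[0m $ "


if __name__ == "__main__":
    # Example hostnames, chosen for illustration only.
    for host in ("db1.example.org", "db2.example.org"):
        print(coloured_prompt("admin", host))
```

With only six colours, two hosts can end up with the same one, so colour is a hint on top of the always-visible hostname, not a substitute for reading it.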

Other references

[WayBack] MySQL Backup and Recovery from Kristian Köhntopp
- [WayBack] mysql-backup-and-recovery-15-638.jpg (638×359) – Nobody wants backup. Everybody wants restore. Martin Seeger.
[WayBack] Time Machine desiderata – No Such Weblog

–jeroen

More information about today's outage can be found at https://t.co/HxpJAgmYG6

— GitLab.com Status (@gitlabstatus) January 31, 2017

We accidentally deleted production data and might have to restore from backup. Google Doc with live notes https://t.co/EVRbHzYlk8

— GitLab.com Status (@gitlabstatus) February 1, 2017

This entry was posted on 2018/02/01 at 00:00 and is filed under DevOps, Power User.
