The CPU load average metric often is not a good one to alert on


Posted by jpluimers on 2023/04/20

Boy I wish threads with more than one person could be saved by the ThreadReaderApp.

Anyway:

[WayBack] Thread by @mipsytipsy: oh boy.. i was just idly musing over how the single most ubiquitous/useless metric is “CPU load average”, lol i wonder if you could use CPU…

Charity Majors


Jun 4th 2020

oh boy.. i was just idly musing over how the single most ubiquitous/useless metric is “CPU load average”, lol

i wonder if you could use CPU load alerts to score how modern and powerful a team’s toolchain is, like a Waffle House Index for tooling. 🤔

https://twitter.com/drk/status/1261412705316925440

…oh oh! but i was gonna say, this thread between @drk and @shelbyspees is a killer nanotutorial in how to ask better questions about your code — where to start, how to drill down and dig in, how to instrument, and how to approach such an open-ended exploratory jaunt. 👏🐝❤️

it’s a really good illustration of this thing we end up saying all the time, which is “don’t fear the future, it is simpler and clearer and *easier* here! the way you are doing it NOW is the hard way!” 😖

time for cpu load average to go the way of the PC LOAD LETTER …


[WayBack] Thread by @isotopp: This. CPU load or %CPU used is a useful metric, because it tells you how “full” the compute part of a thing is. Alerting on it is a very wei…

Kristian Köhntopp


3 hours ago, 12 tweets, 3 min read


This.

CPU load or %CPU used is a useful metric, because it tells you how “full” the compute part of a thing is.

Alerting on it is a very weird idea, though, and still I find people doing this all of the time. Usually these are people in dire need of a better education.

https://twitter.com/mipsytipsy/status/1268418428542443520

If you right-size your infrastructure (there are a lot of preconditions that go into even being able to do that), your goal is to have as little overhead in provisioned resources as possible: provision as little as needed.

Of course, then an alert will go off all the time.

That is because not having a CPU load alert triggering means you have provisioned idle capacity.

Why would you do that?

Well, your load may be spiky, so you can’t provision for the median or a low-90s percentile; you have to provision for the max or the 99.9th percentile in order to be able to ride the waves.

But that also means that alerting on CPU load is too late already. If you could alert, you could also size.
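To make the sizing argument concrete, here is a minimal sketch of how far apart the median and the tail can be for a spiky load. The request rates and the per-box capacity are made-up numbers for illustration, not figures from the thread, and `percentile` and `boxes_for` are hypothetical helpers.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def boxes_for(load, capacity_per_box):
    """Boxes needed to serve `load`, rounded up to whole machines."""
    return math.ceil(load / capacity_per_box)

# Hypothetical per-minute request rates: 98 quiet minutes and 2 spikes.
load = [100] * 90 + [150] * 8 + [900, 1000]
CAPACITY_PER_BOX = 200  # assumed req/min that one box can serve

for label, target in [
    ("median", percentile(load, 50)),
    ("p90", percentile(load, 90)),
    ("p99.9", percentile(load, 99.9)),
    ("max", max(load)),
]:
    print(f"{label:>6}: {target:>5} req/min -> {boxes_for(target, CAPACITY_PER_BOX)} box(es)")
```

With these assumed numbers, sizing for the median or p90 asks for a single box, while riding the spikes needs several, which is the idle capacity the thread is talking about.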

Or your code may be unpredictable, because you change it a lot with experiments going in and out of the codebase or from 1%-on to full-on.

When experimenting, it is important to expose code to users really quickly. That code being efficient is not a priority, because …

… most of it will be scrapped as net-negative anyway. Not worth putting engineering into code before you have the business side of things right.

Being able to run experiments means you need to overprovision capacity.

But of course that means you need to alert on change in variants, and compare code path variants not only with respect to business metrics change, but also to technical metrics change in order to offset business wins (“This code makes us x Euro/h richer than the base variant”)…

… with alternative costs (“Running this will cost y Euro/h more due to more load” vs. “Refactoring this for efficiency will cost z Euro in engineering time over a potential lifecycle of n hours, so z/n Euro/h”)
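The comparison above boils down to a little arithmetic. The following is a sketch only: the function names and the Euro figures are invented for illustration, and it makes the simplifying assumption that a refactoring removes the extra load completely.

```python
def net_gain_per_hour(business_gain_eur_h, extra_running_cost_eur_h):
    """Business win of a variant minus what its extra load costs to run, in Euro/h."""
    return business_gain_eur_h - extra_running_cost_eur_h

def hourly_refactor_cost(engineering_cost_eur, lifecycle_hours):
    """One-off engineering cost spread over the expected lifetime of the code, in Euro/h."""
    return engineering_cost_eur / lifecycle_hours

def cheaper_option(extra_running_cost_eur_h, engineering_cost_eur, lifecycle_hours):
    """Pick between paying for the extra load and paying for the refactoring.

    Assumes (simplification) that the refactoring eliminates the extra load entirely.
    """
    refactor = hourly_refactor_cost(engineering_cost_eur, lifecycle_hours)
    return "refactor" if refactor < extra_running_cost_eur_h else "keep running as-is"

# Hypothetical variant: earns 50 Euro/h more, costs 8 Euro/h more to run.
print(net_gain_per_hour(50, 8))          # Euro/h the variant is still ahead
print(cheaper_option(8, 20_000, 5_000))  # long-lived code: refactoring pays off
print(cheaper_option(8, 20_000, 1_000))  # short-lived code: just pay for the load
```

The lifecycle length dominates the outcome, which fits the point that most experiment code gets scrapped before engineering effort would pay for itself.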

But in order to pull this off you need to have accurate CAPACITY metrics, which CPU load is not. Testing in production, specifically automated load tests with actual production users, provides these numbers.
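One possible way to turn load-test results into a capacity number is sketched below. This is an assumption about how such a test might be evaluated, not the method described in the thread: the step data, the latency and error limits, and `capacity_from_load_test` are all hypothetical.

```python
def capacity_from_load_test(steps, max_latency_ms, max_error_rate):
    """Return the highest request rate a box sustained before breaching either limit.

    `steps` holds (requests_per_minute, p99_latency_ms, error_rate) tuples,
    one per load-test step against a single box.
    """
    capacity = 0
    for rate, latency_ms, error_rate in sorted(steps):
        if latency_ms > max_latency_ms or error_rate > max_error_rate:
            break  # the box is saturated from this step on
        capacity = rate
    return capacity

# Hypothetical ramp: latency and errors degrade once the box is overloaded.
steps = [
    (1000, 40, 0.0),
    (2000, 45, 0.0),
    (3000, 60, 0.001),
    (4000, 180, 0.002),
    (5000, 900, 0.08),
]
capacity = capacity_from_load_test(steps, max_latency_ms=100, max_error_rate=0.01)
print(f"capacity per box: {capacity} req/min")
```

The result is expressed in requests per minute per box, a unit you can plan with, unlike a CPU load figure.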

They are also a way to go from reactive scaling (“CPU load too high, raise capacity”) to predictive scaling (“Our capacity is x req/m per box, and we have n. Evening peak will be m, so we need y boxen more.”)
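The predictive-scaling sentence is a one-line calculation once the capacity per box is known. A minimal sketch, using the thread's x, n, m and y as parameter names in the comments; the `headroom` safety margin is an addition of this example, not part of the thread, and all numbers are invented.

```python
import math

def extra_boxes_needed(capacity_per_box, current_boxes, expected_peak, headroom=0.0):
    """Boxes to add (y) so that n boxes of capacity x can serve a peak of m.

    `headroom` is an optional safety margin, e.g. 0.1 for 10% above the expected peak.
    """
    required = math.ceil(expected_peak * (1 + headroom) / capacity_per_box)
    return max(0, required - current_boxes)

# Hypothetical: x = 3000 req/min per box, n = 10 boxes, evening peak m = 42000 req/min.
print(extra_boxes_needed(3000, 10, 42_000))                # boxes to add with no margin
print(extra_boxes_needed(3000, 10, 42_000, headroom=0.1))  # boxes to add with 10% margin
print(extra_boxes_needed(3000, 10, 24_000))                # already enough capacity
```

Because the inputs are a measured capacity and a forecast peak, the scaling decision can be made before the evening peak arrives rather than after a CPU alert fires.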

Which brings us back to the spiky loads from where we started.

CPU load alerts: en.wikipedia.org/wiki/Waffle_Ho… of tooling, indeed.

Thank you, @mipsytipsy, totally going to use this.

The thread

–jeroen

This entry was posted on 2023/04/20 at 12:00 and is filed under *nix, Cloud, Development, DevOps, Infrastructure, Power User, Software Development, Systems Architecture.
