The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests

The CPU load average metric often is not a good one to alert on

Posted by jpluimers on 2023/04/20

Boy I wish threads with more than one person could be saved by the ThreadReaderApp.

Anyway:

[WayBack] Thread by @mipsytipsy: oh boy.. i was just idly musing over how the single most ubiquitous/useless metric is “CPU load average”, lol i wonder if you could use CPU…

oh boy.. i was just idly musing over how the single most ubiquitous/useless metric is “CPU load average”, lol

i wonder if you could use CPU load alerts to score how modern and powerful a team’s toolchain is, like a Waffle House Index for tooling. 🤔

 

…oh oh! but i was gonna say, this thread between @drk and @shelbyspees is a killer nanotutorial in how to ask better questions about your code — where to start, how to drill down and dig in, how to instrument, and how to approach such an open-ended exploratory jaunt. 👏🐝❤️

it’s a really good illustration of this thing we end up saying all the time, which is “don’t fear the future, it is simpler and clearer and *easier* here! the way you are doing it NOW is the hard way!” 😖

time for cpu load average to go the way of the PC LOAD LETTER …


[WayBack] Thread by @isotopp: This. CPU load or %CPU used is a useful metric, because it tells you how “full” the compute part of a thing is. Alerting on it is a very wei…

This.

CPU load or %CPU used is a useful metric, because it tells you how “full” the compute part of a thing is.

Alerting on it is a very weird idea, though, and still I find people doing this all of the time. Usually these are people in dire need of a better education.

 

If you right-size your infrastructure (there are a lot of preconditions that go into even being able to do that), your goal is to have as little overhead in provisioned resources as possible: provision as little as needed.

Of course, then an alert will go off all the time.

That is because not having a CPU load alert triggering means you have provisioned idle capacity.

Why would you do that?

Well, your load may be spiky, so you can’t provision for the median or a low-90s percentile; you have to provision for the max or the 99.9th percentile in order to be able to ride the waves.

But that also means that alerting on CPU load is too late already. If you could alert, you could also size.
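To make the sizing argument concrete, here is a minimal sketch with made-up numbers. The demand samples, the `percentile` helper and the choice of the nearest-rank method are all assumptions for illustration; none of them come from the thread.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    rank = min(len(ordered), max(1, rank))  # clamp to a valid 1-based rank
    return ordered[rank - 1]


# Hypothetical CPU demand in cores, one sample per minute:
# mostly quiet, with a few short spikes.
demand = [2.0] * 950 + [4.0] * 45 + [16.0] * 5

median = percentile(demand, 50)
p90 = percentile(demand, 90)
p999 = percentile(demand, 99.9)
peak = max(demand)

# Average utilisation if the box is sized for the 99.9th percentile.
utilisation = (sum(demand) / len(demand)) / p999

print(f"median={median} p90={p90} p99.9={p999} peak={peak}")
print(f"average utilisation when sized for p99.9: {utilisation:.1%}")
```

With these invented samples the median and the 90th percentile are both 2 cores, while the 99.9th percentile equals the 16-core peak. A box sized for the spikes therefore sits at roughly 13.5% average utilisation, which is exactly the idle capacity a "CPU load too low to alert" state implies.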

Or your code may be unpredictable, because you change it a lot with experiments going in and out of the codebase or from 1%-on to full-on.

When experimenting, it is important to expose code to users really quickly. That code being efficient is not a priority, because most of it will be scrapped as net-negative anyway. It is not worth putting engineering into code before you have the business side of things right.

Being able to run experiments means you need to overprovision capacity.

But of course that means you need to alert on changes in variants, and compare code path variants not only with respect to changes in business metrics but also changes in technical metrics, in order to offset business wins (“This code makes us x Euro/h richer than the base variant”) with alternative costs (“Running this will cost y Euro/h more due to more load” vs. “Refactoring this for efficiency will cost z Euro in engineering time over a potential lifecycle of n hours, so z/n Euro/h”).
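The variant comparison above boils down to a little arithmetic. The sketch below is one possible way to write it down; the `Variant` class, the function names and all figures are hypothetical, and it makes the simplifying assumption that a refactoring removes the entire extra load cost.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    revenue_gain_per_hour: float     # business win vs. the base variant, Euro/h
    extra_load_cost_per_hour: float  # added infrastructure cost, Euro/h


def net_gain_per_hour(variant):
    """Business win minus the cost of the extra load, in Euro/h."""
    return variant.revenue_gain_per_hour - variant.extra_load_cost_per_hour


def refactor_cost_per_hour(engineering_cost, lifecycle_hours):
    """Amortise a one-off engineering cost (Euro) over the expected
    lifecycle (hours) to get a comparable Euro/h figure."""
    if lifecycle_hours <= 0:
        raise ValueError("lifecycle_hours must be positive")
    return engineering_cost / lifecycle_hours


def cheaper_option(variant, engineering_cost, lifecycle_hours):
    """Compare paying for the extra load against refactoring it away.
    Assumes (simplification) that refactoring removes all extra load."""
    amortised = refactor_cost_per_hour(engineering_cost, lifecycle_hours)
    if amortised < variant.extra_load_cost_per_hour:
        return "refactor"
    return "run as is"


# Invented example: 50 Euro/h richer, 20 Euro/h more load,
# 6000 Euro of refactoring over an expected 1000 hours of life.
experiment = Variant("new-checkout", 50.0, 20.0)
print(net_gain_per_hour(experiment))             # Euro/h after load cost
print(refactor_cost_per_hour(6000.0, 1000.0))    # Euro/h for the refactoring
print(cheaper_option(experiment, 6000.0, 1000.0))
```

The point of putting everything in Euro/h is that the business win, the load cost and the engineering cost become directly comparable.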
But in order to pull this off you need to have accurate CAPACITY metrics, which CPU load is not. Testing in production, specifically automated load tests with actual production users, provides these numbers.

They are also a way to go from reactive scaling (“CPU load too high, raise capacity”) to predictive scaling (“Our capacity is x req/m per box, and we have n. Evening peak will be m, so we need y boxen more.”)
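The predictive-scaling sentence above is a one-line calculation once the per-box capacity is known. A minimal sketch, assuming the capacity figure comes from a load test; the function name, the `headroom` parameter and the numbers are hypothetical.

```python
import math


def extra_boxes_needed(capacity_per_box, current_boxes, expected_peak, headroom=0.0):
    """Boxes to add so the fleet can serve expected_peak.

    capacity_per_box and expected_peak share one unit (e.g. req/m).
    headroom is an optional safety margin (0.25 means 25% above the
    forecast); it is an assumption added here, not part of the thread.
    """
    if capacity_per_box <= 0:
        raise ValueError("capacity_per_box must be positive")
    required = math.ceil(expected_peak * (1 + headroom) / capacity_per_box)
    return max(0, required - current_boxes)


# Invented numbers: 1200 req/m per box, 10 boxes, evening peak of 15000 req/m.
print(extra_boxes_needed(1200, 10, 15000))
print(extra_boxes_needed(1200, 10, 15000, headroom=0.25))
```

Note that CPU load does not appear anywhere in the calculation: the inputs are measured capacity and forecast demand.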

Which brings us back to the spiky loads from where we started.

CPU load alerts: en.wikipedia.org/wiki/Waffle_Ho… of tooling, indeed.
Thank you, @mipsytipsy, totally going to use this.

The thread

 

–jeroen
