Archive for the ‘x64’ Category

Online x86 and x64 Intel Instruction Assembler

Posted by jpluimers on 2025/11/25

[Wayback/Archive] Online x86 and x64 Intel Instruction Assembler

The source starts in these two files:

Posted in Assembly Language, Conference Topics, Conferences, Development, Event, Software Development, x64, x86

For my link archive: Counting the leading zeroes and ones in a binary number with C#

Posted by jpluimers on 2025/03/13

From a while back, but still interesting:

Especially the first link explains the algorithm very well and is similar to links referred to from the Stack Overflow question as it is based on counting ones (and leading ones are basically leading zeros but bit-inverted).

It also explains a cool thing for leading zeros: modern CPUs have instructions for counting them, which .NET Core can use.
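As a minimal sketch of the idea (in C++ rather than the C# of the linked articles, and not taken from them): smear the highest set bit to the right, count the ones, and subtract from the word size. Leading ones are then the leading zeros of the bit-inverted value. The function names `pop_count`, `leading_zeros` and `leading_ones` are hypothetical, and the sketch assumes 32-bit unsigned values.

```cpp
#include <cstdint>

// Count the set bits of a 32-bit value (classic SWAR population count).
inline int pop_count(std::uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                 // 2-bit partial sums
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); // 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 // 8-bit partial sums
    return static_cast<int>((x * 0x01010101u) >> 24); // add the four bytes
}

// Count leading zeros: copy the highest set bit into every lower position,
// then the number of ones equals 32 minus the number of leading zeros.
inline int leading_zeros(std::uint32_t x) {
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return 32 - pop_count(x);
}

// Leading ones are the leading zeros of the bit-inverted value.
inline int leading_ones(std::uint32_t x) {
    return leading_zeros(~x);
}
```

For example, `leading_zeros(1u)` yields 31 and `leading_ones(0xF0000000u)` yields 4. In production code you would normally prefer the built-in route that maps to the CPU instruction, such as `std::countl_zero` in C++20 or `BitOperations.LeadingZeroCount` in .NET Core 3.0 and later.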

Posted in .NET, AArch64/arm64, Algorithms, ARM, Assembly Language, C, C#, C++, Delphi, Development, Software Development, x64, x86 | Tagged: csharp, dotnet, dotnetcore

Raymond Chen on The AArch64 processor (aka arm64) in many parts

Posted by jpluimers on 2025/01/14

For my link archive: below is a series of articles by Raymond Chen on “The AArch64 processor (aka arm64)” in order of appearance. They are from a few years back and still very relevant today.

It is one of several series on processors that are (or were) supported by Windows. A good reference for finding which version supported which processor architecture is the set of tables in List of Microsoft Windows versions – Wikipedia.

Posted in AArch64/arm64, ARM, Assembly Language, Development, History, MIPS R4000, PowerPC, Software Development, The Old New Thing, Windows Development, x64, x86

GitHub – chip-red-pill/MicrocodeDecryptor

Posted by jpluimers on 2024/09/18

A few years back, the way Intel microcode updates were distributed was deciphered, so it became possible to extract and research the microcode of some processor models.

Repository: [Wayback/Archive] chip-red-pill/MicrocodeDecryptor

Posted in Assembly Language, Development, Software Development, x64, x86

x86_opcode_structure_and_instruction_overview.pdf on -= pnx.tf =-

Posted by jpluimers on 2024/06/18

It is more than a decade old, but still the best reference around: [Wayback/Archive] -= pnx.tf =- has [Wayback] x86_opcode_structure_and_instruction_overview.pdf

I found it via [Wayback/Archive] Alice Climent-Pommeret on Twitter: “I’ve just discovered this amazing document showing super clearly the relation between the opcode and the instruction 🤯 …”

Posted in Assembly Language, Development, Software Development, x64, x86

Reminder to self: check to see if Delphi improved support for MMX/SSE/AVX instructions

Posted by jpluimers on 2024/06/13

This is from a long time ago [Wayback/Archive] Does Delphi support all MMX/SSE instructions? – Stack Overflow:

Delphi 2007 supports the MMX and SSE instruction sets. Certainly, Delphi 2010 and XE support up to the SSE4.2 instruction sets (but so far no support for AVX).

The [Wayback] Delphi 2005 Language Guide explained a bit, but no more recent PDF is available and the [Wayback/Archive] Embarcadero/IDERA Documentation Wiki is very much outdated on this information as per [Wayback/Archive] Talk:Assembler Syntax – RAD Studio:

Re: “Instruction Opcodes” The information on available instruction sets is outdated. D2010 and Fulcrum support the SIMD instruction sets all the way up to SSE4.2 (i.e., SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2).

–jeroen

Posted in Assembly Language, Delphi, Development, Software Development, x64, x86

Very useful link: Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X

Posted by jpluimers on 2023/02/14

If I ever need to go deep into optimisation again, there is lots I can still learn from [Wayback/Archive] Software optimization resources. C++ and assembly. Windows, Linux, BSD, Mac OS X

Thanks [Archive] Kris on Twitter: “@Kharkerlake @unixtippse Agner Fog is actually an anthropologist, but he reverse-engineers internal structures of Intel CPUs, and …, especially 3. The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers is the HPC bible.” / Twitter!

Must-watch video with Agner about Warlike and Peaceful Societies below the signature.

–jeroen

Posted in Assembly Language, C++, Development, Software Development, x64, x86

When floating point code suddenly becomes orders of magnitude slower (via C++ – Why does changing 0.1f to 0 slow down performance by 10x? – Stack Overflow)

Posted by jpluimers on 2022/01/26

When working with converging algorithms, floating point code can sometimes become very slow. That is: orders of magnitude slower than you would expect.

A very interesting answer to [Wayback] c++ – Why does changing 0.1f to 0 slow down performance by 10x? – Stack Overflow explains why.

I’ve only quoted a few bits; read the full question and answer for more background information.

Welcome to the world of denormalized floating-point! They can wreak havoc on performance!!!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating point representation. Operations on denormalized floating-point can be tens to hundreds of times slower than on normalized floating-point. This is because many processors can’t handle them directly and must trap and resolve them using microcode.

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1 is used.

Basically, the convergence uses some values closer to zero than a normal floating point representation can store, so a trick is used called “denormal numbers or denormalized numbers (now often called subnormal numbers)” as described in Denormal number – Wikipedia:

…

In a normal floating-point value, there are no leading zeros in the significand; rather, leading zeros are removed by adjusting the exponent (for example, the number 0.0123 would be written as 1.23 × 10⁻²). Denormal numbers are numbers where this representation would result in an exponent that is below the smallest representable exponent (the exponent usually having a limited range). Such numbers are represented using leading zeros in the significand.

…

Since a denormal number is a boundary case, many processors do not optimise for this.
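To see whether a computation has drifted into this range, you can classify values with the standard library. The snippet below is a minimal sketch, not code from the linked answer: `is_subnormal` and `flush_to_zero` are hypothetical helper names, and the software flush is only an illustration of the idea behind hardware flush-to-zero modes. It assumes default compiler settings (no fast-math style options that already treat subnormals as zero).

```cpp
#include <cmath>
#include <limits>

// True when the value is a denormal (subnormal) number: non-zero, but
// smaller in magnitude than the smallest normalized float.
inline bool is_subnormal(float value) {
    return std::fpclassify(value) == FP_SUBNORMAL;
}

// Software "flush to zero": replace subnormal values by positive zero so
// that later arithmetic stays on the fast path. Normal values, zeros,
// infinities and NaNs pass through unchanged.
inline float flush_to_zero(float value) {
    return is_subnormal(value) ? 0.0f : value;
}
```

For instance, `is_subnormal(std::numeric_limits<float>::denorm_min())` is true, while `is_subnormal(std::numeric_limits<float>::min())` is false because that is the smallest normalized value. Flushing changes results slightly, so whether it is acceptable depends on the algorithm.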

–jeroen

Posted in .NET, Algorithms, ARM, Assembly Language, C, C#, C++, Delphi, Development, Software Development, x64, x86

Some notes on losing performance because of using AVX

Posted by jpluimers on 2019/03/20

It looks like AVX can be a curse most of the time. Below are some (many) links that led me to this conclusion, based on a thread started by Kelly Sommers.

My conclusion

Running AVX instructions will affect the processor frequency, which means that non-AVX code will slow down, so you will only benefit when the gain of using AVX code outweighs the non-AVX loss on anything running on that processor in the same time frame.

In practice, this means you need a long-term gain from AVX on many cores. If you don’t have that, the performance of all cores, including the initial AVX performance, will degrade, often by a lot (dozens of percent).
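The trade-off can be made concrete with a deliberately simplified model. This is my own back-of-the-envelope sketch, not something from the linked sources: it assumes the whole time window runs at the reduced AVX clock, that runtime scales inversely with clock frequency, and that only a fraction of the work can be vectorized. The function and parameter names are hypothetical.

```cpp
// Runtime relative to a pure non-AVX run (1.0 means "no change").
//   avx_fraction  : share of the original work that gets vectorized (0..1)
//   avx_speedup   : how much faster that share runs at equal clock (> 1)
//   clock_penalty : relative frequency drop while the AVX limit applies (0..1)
inline double relative_runtime(double avx_fraction, double avx_speedup,
                               double clock_penalty) {
    const double scalar_work = 1.0 - avx_fraction;
    const double vector_work = avx_fraction / avx_speedup;
    // All remaining work is stretched by the lower clock frequency.
    return (scalar_work + vector_work) / (1.0 - clock_penalty);
}

// Share of the work that must be vectorized before AVX pays off at all,
// obtained by solving relative_runtime(...) == 1 for avx_fraction.
inline double break_even_fraction(double avx_speedup, double clock_penalty) {
    return clock_penalty / (1.0 - 1.0 / avx_speedup);
}
```

With an assumed 2x speedup and a 15% clock drop, the break-even point in this model is 30% of the work: below that, everything on the core gets slower, which matches the gist of the tweets below. Real processors are more complicated (license levels, per-core versus per-package limits), so treat the numbers as illustration only.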

Tweets and pages linked by them

[WayBack] Kelly Sommers on Twitter: “So here’s a real question. What does Amazon and Microsoft and other kubernetes cloud services do to prevent your containers from losing 11ghz of performance because someone deployed some AVX optimized algorithm on the same host?”
[WayBack] Jeroen Pluimers on Twitter: “Where do I learn more on side effects of AVX?… “
[WayBack] Kelly Sommers on Twitter: “Not the greatest link but the quickest one I found lol https://t.co/NUvEl1CEp5… “
- [WayBack] E-class CPUs down clock when AVX is in the execution stack? Is this true, if so why would it?
  - The Core i7 processors that are referred to as “Haswell-E” and “Broadwell-E” are minor variants of the Xeon E5 v3 “Haswell-EP” and Xeon E5 v4 “Broadwell-EP” processors. These have lower “maximum Turbo” frequencies for each core count when 256-bit registers are being used.
  - Certain AVX workloads may run at lower peak turbo frequencies, or drop below the Non-AVX Base Frequency of the SKU. This type of behavior is due to power, thermal, and electrical constraints.
  - [WayBack] PDF: Optimizing Performance with Intel® Advanced Vector Extensions
[WayBack] 🤓science_dot on Twitter: “https://t.co/YAZWbuo9Mn contains the frequency tables for Skylake Xeon for non-AVX, AVX2 and AVX-512. There are nuances that don’t fit into a tweet… (#ImIntel)… https://t.co/vSRFSd9GAb”
- [WayBack] PDF: Intel® Xeon® Processor Scalable Family; Specification Update; February 2018
[WayBack] Phil Dennis-Jordan on Twitter: “Basically, PC games don’t use AVX for this exact reason. AVX is great if >90% of your CPU time is spent in AVX code across all cores, for long-lasting workloads. Which pretty much means it gets used for HPC and maybe CPU-intensive content creation and that’s about it.… https://t.co/HMfmhPeSHH”
- [WayBack] Phil Dennis-Jordan on Twitter: “As soon as you use an AVX instruction on recent Intel CPUs, you trip its protective AVX clock circuitry. This means that for the next N milliseconds, it’s limited to whatever the rated maximum AVX clock rate is.… https://t.co/2poSaQDmXF”
- [WayBack] Phil Dennis-Jordan on Twitter: “It doesn’t matter if you’re pegging the core for long running calculations or just briefly switched to AVX because it has a convenient instruction. For this defined amount of time, anything running will be subject to the AVX clock limit.… https://t.co/ZxgXAzbu3R”
[WayBack] svacko on Twitter: “also check wikichip that has perfect resources on this AVX/AVX2/AVX512 downscaling https://t.co/1nn3Wn1dO1 i was also pretty shocked when i found this as we are massively using AVX2 in our shop.. i think there are no GHz guarantees, the clouders guarantees you only vCPUs…… https://t.co/UwDTsmWT3W”
- [WayBack] Frequency Behavior – Intel – WikiChip: The Frequency Behavior of Intel’s CPUs is complex and is governed by multiple mechanisms that perform dynamic frequency scaling based on the available headroom.
[WayBack] Ben Adams on Twitter: “Cloudflare did write up about AVX2: On the dangers of Intel’s frequency scaling https://t.co/JYbhjTwiD0… “
- [WayBack] On the dangers of Intel’s frequency scaling: While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomena. When benchmarking OpenSSL 1.1.1dev, I discovered that the performance of the cipher ChaCha20-Poly1305 does not scale very well.
[WayBack] Vlad Krasnov on Twitter: “https://t.co/gtcQHjJFLQ When compiling with -mavx512dq it runs 25% on a Gold with 40 threads On Silver with 24 threads: ~30% slower… https://t.co/UmwyYY0ISR”
[WayBack] Ryan Zezeski on Twitter: “I haven’t done much reading (or any testing) about this myself, but I did find this summary interesting: https://t.co/Yjsh8kB789… https://t.co/MZqBXlkiHz”
- [WayBack] avx_sigh.md · GitHub
  
  why doesn’t radfft support AVX on PC?
  - The short version is that unless you’re planning to run AVX-intensive code on that core for at least the next 10ms or so, you are completely shooting yourself in the foot by using AVX float. A complex RADFFT at N=2048 (relevant size for Bink Audio, Miles sometimes uses larger FFTs) takes about 19k cycles when computed 128b-wide without FMAs. That means that the actual FFT runs and completes long before we ever get the higher power license grant, and then when we do get the higher power license, all we’ve done is docked the core frequency by about 15% (25%+ when using AVX-512) for the next couple milliseconds, when somebody else’s code runs. That’s a Really Bad Thing for middleware to be doing, so we don’t.
  - These older CPUs are somewhat faster to grant the higher power license level (but still on the order of 150k cycles), but if there is even one core using AVX code, all cores (well, everything on the same package if you’re in a multi-socket CPU system) get limited to the max AVX frequency. And they don’t seem to have the “light” vs. “heavy” distinction either. Use anything 256b wide, even a single instruction, and you’re docking the max turbo for all cores for the next couple milliseconds. That’s an even worse thing for middleware to be doing, so again, we try not to.
[WayBack] Ben Higgins on Twitter: “I’d have to hunt down the specifics but we saw a perf regression we attributed to AVX512 additions added to glibc. Disabling the AVX512 version improved perf for our app.… https://t.co/IZIzuBO2F8”
[WayBack] Bartek Ogryczak on Twitter: “TIL, thanks for mentioning this. Intel is saying “workaround: none, fix: none” 😱 https://t.co/6BVJqV8rff… https://t.co/STEbs6M2tK”
[WayBack] Ben Higgins on Twitter: “My colleagues tracked it down to an avx512-specific variant of memmove in glibc, despite not calling it very often we saw a big speedup on skylake when we stopped using the avx512 version.… https://t.co/vz2u8dCKbX”
[WayBack] Trent Lloyd 🦆on Twitter: “Though not quite the same issue, this bug has some really interesting insight into reasons you can kill performance using various variants of SSE/AVX… https://t.co/pNghRVJZO0 – was fascinating to me.… https://t.co/b3jb9WFWne”

Kelly raised a bunch of interesting questions and remarks because of the above:

I collected the above links because of [WayBack] GitHub – maximmasiutin/FastMM4-AVX: FastMM4 fork with AVX support and multi-threaded enhancements (faster locking), where it is unclear which parts of the gains are because of AVX and which parts are because of other optimizations. It looks like under heavy load in data-center-like conditions, the total gain is about 30%. The loss for traditional processing there has not been measured, but based on the above, my estimate is that it is at least 20%.

Full tweets below.

Posted in Assembly Language, Development, Software Development, x64, x86

performance – Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture? – Stack Overflow

Posted by jpluimers on 2019/02/28

Geek pr0n at [WayBack] performance – Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture? – Stack Overflow

Via: [WayBack] Very nice #Geekpr0n “Why is C++ faster than my hand-written assembly code?” The comments are of high quality i… – Jan Wildeboer – Google+

–jeroen

Posted in Assembly Language, C, C++, Development, Software Development, x64, x86

The Wiert Corner – irregular stream of stuff

Jeroen W. Pluimers on .NET, C#, Delphi, databases, and personal interests
