Posted by jpluimers on 2019/03/20
It looks like AVX can be a curse most of the time. Below are some (many) links that led me to this conclusion, based on a thread started by Kelly Sommers.
My conclusion
Running AVX instructions lowers the processor frequency, which means that non-AVX code slows down, so you only benefit when the gain from the AVX code outweighs the loss for all non-AVX code running on that processor in the same time frame.
In practice, this means you need a long-term gain from AVX on many cores. If you don’t have that, performance on all cores degrades, including that of the AVX code itself, often by a lot (dozens of percent).
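To make the trade-off concrete, here is a minimal back-of-the-envelope model in Python. It is a sketch under simplifying assumptions, not a measurement: the function names are made up for illustration, the reduced clock is assumed to apply to the whole time window (so non-AVX code pays the penalty too), and the speedup factor is hypothetical. The 15% clock penalty is the ballpark figure quoted in the radfft notes linked below.

```python
def net_speedup(avx_fraction, avx_speedup, clock_penalty):
    """Speedup of a whole workload versus running it without AVX at full clock.

    avx_fraction  -- share of the original run time that can use AVX (0..1)
    avx_speedup   -- how much faster that share runs with AVX at equal clock
    clock_penalty -- relative clock reduction while AVX is in use (0.15 = 15%)

    Assumption: the reduced clock applies to the whole time window, so the
    non-AVX code is slowed down as well.
    """
    time_at_full_clock = (1.0 - avx_fraction) + avx_fraction / avx_speedup
    time_at_reduced_clock = time_at_full_clock / (1.0 - clock_penalty)
    return 1.0 / time_at_reduced_clock


def break_even_fraction(avx_speedup, clock_penalty):
    """Smallest avx_fraction at which AVX stops being a net loss.

    A result above 1.0 means AVX can never pay off under this model.
    """
    return clock_penalty / (1.0 - 1.0 / avx_speedup)


if __name__ == "__main__":
    # Hypothetical numbers: AVX doubles the speed of the vectorized part,
    # at the cost of a 15% lower clock for everything.
    for fraction in (0.05, 0.30, 0.90):
        print(fraction, round(net_speedup(fraction, 2.0, 0.15), 3))
    print("break-even fraction:", break_even_fraction(2.0, 0.15))
```

With these assumed numbers the break-even point sits at about 30% of the run time: spend less than that in AVX code and the whole workload gets slower, which matches the "only for long-lasting, AVX-dominated workloads" advice in the tweets.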
Tweets and pages linked by them
- [WayBack] Kelly Sommers on Twitter: “So here’s a real question. What does Amazon and Microsoft and other kubernetes cloud services do to prevent your containers from losing 11ghz of performance because someone deployed some AVX optimized algorithm on the same host?”
- [WayBack] Jeroen Pluimers on Twitter: “Where do I learn more on side effects of AVX?… “
- [WayBack] Kelly Sommers on Twitter: “Not the greatest link but the quickest one I found lol https://t.co/NUvEl1CEp5… “
- [WayBack] 🤓science_dot on Twitter: “https://t.co/YAZWbuo9Mn contains the frequency tables for Skylake Xeon for non-AVX, AVX2 and AVX-512. There are nuances that don’t fit into a tweet… (#ImIntel)… https://t.co/vSRFSd9GAb”
- [WayBack] Phil Dennis-Jordan on Twitter: “Basically, PC games don’t use AVX for this exact reason. AVX is great if >90% of your CPU time is spent in AVX code across all cores, for long-lasting workloads. Which pretty much means it gets used for HPC and maybe CPU-intensive content creation and that’s about it.… https://t.co/HMfmhPeSHH”
- [WayBack] svacko on Twitter: “also check wikichip that has perfect resources on this AVX/AVX2/AVX512 downscaling https://t.co/1nn3Wn1dO1 i was also pretty shocked when i found this as we are massively using AVX2 in our shop.. i think there are no GHz guarantees, the clouders guarantees you only vCPUs…… https://t.co/UwDTsmWT3W”
- [WayBack] Ben Adams on Twitter: “Cloudflare did write up about AVX2: On the dangers of Intel’s frequency scaling https://t.co/JYbhjTwiD0… “
- [WayBack] On the dangers of Intel’s frequency scaling: While I was writing the post comparing the new Qualcomm server chip, Centriq, to our current stock of Intel Skylake-based Xeons, I noticed a disturbing phenomena. When benchmarking OpenSSL 1.1.1dev, I discovered that the performance of the cipher ChaCha20-Poly1305 does not scale very well.

- [WayBack] Vlad Krasnov on Twitter: “https://t.co/gtcQHjJFLQ When compiling with -mavx512dq it runs 25% on a Gold with 40 threads On Silver with 24 threads: ~30% slower… https://t.co/UmwyYY0ISR”
- [WayBack] Ryan Zezeski on Twitter: “I haven’t done much reading (or any testing) about this myself, but I did find this summary interesting: https://t.co/Yjsh8kB789… https://t.co/MZqBXlkiHz”
- [WayBack] avx_sigh.md · GitHub
why doesn’t radfft support AVX on PC?
- The short version is that unless you’re planning to run AVX-intensive code on that core for at least the next 10ms or so, you are completely shooting yourself in the foot by using AVX float. A complex RADFFT at N=2048 (relevant size for Bink Audio, Miles sometimes uses larger FFTs) takes about 19k cycles when computed 128b-wide without FMAs. That means that the actual FFT runs and completes long before we ever get the higher power license grant, and then when we do get the higher power license, all we’ve done is docked the core frequency by about 15% (25%+ when using AVX-512) for the next couple milliseconds, when somebody else’s code runs. That’s a Really Bad Thing for middleware to be doing, so we don’t.
- These older CPUs are somewhat faster to grant the higher power license level (but still on the order of 150k cycles), but if there is even one core using AVX code, all cores (well, everything on the same package if you’re in a multi-socket CPU system) get limited to the max AVX frequency. And they don’t seem to have the “light” vs. “heavy” distinction either. Use anything 256b wide, even a single instruction, and you’re docking the max turbo for all cores for the next couple milliseconds. That’s an even worse thing for middleware to be doing, so again, we try not to.
- [WayBack] Ben Higgins on Twitter: “I’d have to hunt down the specifics but we saw a perf regression we attributed to AVX512 additions added to glibc. Disabling the AVX512 version improved perf for our app.… https://t.co/IZIzuBO2F8”
- [WayBack] Bartek Ogryczak on Twitter: “TIL, thanks for mentioning this. Intel is saying “workaround: none, fix: none” 😱 https://t.co/6BVJqV8rff… https://t.co/STEbs6M2tK”
- [WayBack] Ben Higgins on Twitter: “My colleagues tracked it down to an avx512-specific variant of memmove in glibc, despite not calling it very often we saw a big speedup on skylake when we stopped using the avx512 version.… https://t.co/vz2u8dCKbX”
- [WayBack] Trent Lloyd 🦆on Twitter: “Though not quite the same issue, this bug has some really interesting insight into reasons you can kill performance using various variants of SSE/AVX… https://t.co/pNghRVJZO0 – was fascinating to me.… https://t.co/b3jb9WFWne”
Kelly raised a bunch of interesting questions and remarks because of the above:
I collected the above links because of [WayBack] GitHub – maximmasiutin/FastMM4-AVX: FastMM4 fork with AVX support and multi-threaded enhancements (faster locking), where it is unclear which part of the gains comes from AVX and which part from other optimizations. It looks like under heavy load in data-center-like conditions, the total gain is about 30%. The loss for traditional processing there has not been measured, but based on the above, my estimate is that it is at least 20%.
Full tweets below.
Posted in Assembly Language, Development, Software Development, x64, x86
Posted by jpluimers on 2018/09/11
On x86/x64/ARM/…:
It’s where the function is going to return to, not where it came from.
And:
Bonus chatter: This reminds me of a quirk of the 6502 processor: When it pushed the return address onto the stack, it actually pushed the return address minus one. This is an artifact of the way the 6502 is implemented, but it results in the nice feature that the stack trace gives you the line number of the call instruction.
Of course, this is all hypothetical, because 6502 debuggers didn’t have fancy features like stack traces or line numbers.
Source: [WayBack] Remember that in a stack trace, the addresses are return addresses, not call addresses – The Old New Thing
Which resulted in these comments at [WayBack] CC +mos6502 – Jeroen Wiert Pluimers – Google+:
- mos6502: And don’t forget the crucial difference in PC on 6502 between RTS and RTI!
- Jeroen Wiert Pluimers: +mos6502 I totally forgot about that one. Thanks for reminding me
<<Note that unlike RTS, the return address on the stack is the actual address rather than the address-1.>>
References:
[WayBack] 6502.org: Tutorials and Aids – RTI
RTI retrieves the Processor Status Word (flags) and the Program Counter from the stack in that order (interrupts push the PC first and then the PSW).
Note that unlike RTS, the return address on the stack is the actual address rather than the address-1.
[WayBack] 6502.org: Tutorials and Aids – RTS
RTS pulls the top two bytes off the stack (low byte first) and transfers program control to that address+1. It is used, as expected, to exit a subroutine invoked via JSR which pushed the address-1.
RTS is frequently used to implement a jump table where addresses-1 are pushed onto the stack and accessed via RTS eg. to access the second of four routines.
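The address-1 convention quoted above is easy to check with a tiny model. The Python sketch below is not an emulator: it only models the hardware stack in page $01 plus the push/pull order of JSR, RTS, interrupts and RTI as described in the 6502.org text. All class and function names are made up for illustration, and details such as the break flag are left out.

```python
class Stack6502:
    """Tiny model of the 6502 hardware stack (page $01, grows downward)."""

    def __init__(self):
        self.mem = bytearray(0x200)
        self.sp = 0xFF

    def push(self, value):
        self.mem[0x100 + self.sp] = value & 0xFF
        self.sp = (self.sp - 1) & 0xFF

    def pull(self):
        self.sp = (self.sp + 1) & 0xFF
        return self.mem[0x100 + self.sp]


def jsr(stack, pc, target):
    """JSR at address pc is 3 bytes long; it pushes the address of its own
    last byte, which is the return address minus one (high byte first)."""
    ret_minus_1 = (pc + 2) & 0xFFFF
    stack.push(ret_minus_1 >> 8)
    stack.push(ret_minus_1 & 0xFF)
    return target


def rts(stack):
    """RTS pulls low byte, then high byte, and continues at that address + 1."""
    lo = stack.pull()
    hi = stack.pull()
    return (((hi << 8) | lo) + 1) & 0xFFFF


def interrupt(stack, pc, status):
    """An interrupt pushes the actual PC first, then the status flags."""
    stack.push(pc >> 8)
    stack.push(pc & 0xFF)
    stack.push(status)


def rti(stack):
    """RTI pulls the flags, then the PC, and does NOT add one."""
    status = stack.pull()
    lo = stack.pull()
    hi = stack.pull()
    return (hi << 8) | lo, status


def rts_dispatch(stack, table, index):
    """Jump-table trick: push (routine address - 1), then 'return' into it."""
    addr_minus_1 = (table[index] - 1) & 0xFFFF
    stack.push(addr_minus_1 >> 8)
    stack.push(addr_minus_1 & 0xFF)
    return rts(stack)
```

A `JSR` located at `$8000` leaves `$8002` on the stack, and `rts` then resumes at `$8003`. Since `$8002` still lies inside the call instruction itself, a stack trace built from such values points at the call, which is the "nice feature" mentioned in the quote.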
–jeroen
Posted in 6502, 6502 Assembly, Assembly Language, Development, History, Software Development, The Old New Thing, Windows Development, x64, x86
Posted by jpluimers on 2018/08/17
It seems there are a few ways to check whether a Windows binary is 32-bit or 64-bit, but only loading the binary is the sure method to know what the process will be using: [WayBack] How to check if a binary is 32 or 64 bit on Windows? – Super User and [WayBack] How do I determine if a .NET application is 32 or 64 bit? – Stack Overflow.
Details in the answers of these questions, here are a few highlights:
- The first few characters in the binary header reveal what it was originally designed for.
- A .NET executable might still have an x64 header for bootstrapping.
- The Windows SDK has a tool dumpbin.exe with the /headers option.
- You can use sigcheck.exe from SysInternals.
- The file utility (e.g. from cygwin, which comes with msysgit) will distinguish between 32- and 64-bit executables.
- Use the command line 7z.exe on the PE file (Exe or DLL) in question which gives you a CPU line.
- Virustotal File detail is a way to find out if a binary is 32 bit or 64 bit.
- Even an executable marked as 32-bit can run as 64-bit if, for example, it’s a .NET executable that can run as 32- or 64-bit. For more information see https://stackoverflow.com/questions/3782191/how-do-i-determine-if-a-net-application-is-32-or-64-bit, which has an answer that says that the CORFLAGS utility can be used to determine how a .NET application will run.
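For the "look at the binary header" approach, here is a minimal Python sketch. It assumes the standard PE layout: an MZ stub whose field at offset 0x3C points to the PE signature, followed by the COFF header whose first 16-bit field is the Machine type. The function names are made up for illustration, and the lookup table covers only three common machine values. As noted above, this only tells you what the file is marked as, not how a .NET executable will actually run.

```python
import struct

# A few Machine values from the COFF file header; the table is not exhaustive.
MACHINE_NAMES = {
    0x014C: "x86 (32-bit)",
    0x8664: "x64 (64-bit)",
    0xAA64: "ARM64 (64-bit)",
}


def pe_machine(data):
    """Return a description of the Machine field of a PE image held in bytes."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # Offset 0x3C of the MZ header holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The COFF header starts right after the signature; Machine comes first.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, "unknown (0x%04X)" % machine)


def pe_machine_of_file(path):
    """Convenience wrapper; reads the whole file to keep the sketch simple."""
    with open(path, "rb") as f:
        return pe_machine(f.read())
```

Calling `pe_machine_of_file` on an Exe or DLL path gives roughly the same information as the machine line in the dumpbin.exe /headers output.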
–jeroen
Search terms: win64, win32, x64, x86_64, x86
Posted in Assembly Language, Development, Power User, Windows, x64, x86
Posted by jpluimers on 2018/02/07
Via [WayBack] Hi all, I’m a bit stuck here with a “simple” task. Looks like Outlook 2016 doesn’t supports “MAPISendMail”, at least, if i trigger this, Thunderbird… – Attila Kovacs – Google+:
Basically only MAPISendMail works cross architecture and only if you fill all fields.
This edited [WayBack] email – MAPI Windows 7 64 bit – Stack Overflow answer by [WayBack] epotter is very insightful (thanks [WayBack] Rik van Kekem – Google+):
Calls to MAPISendMail should work without a problem.
For all other MAPI method and function calls to work in a MAPI application, the bitness (32 or 64) of the MAPI application must be the same as the bitness of the MAPI subsystem on the computer that the application is targeted to run on.
In general, a 32-bit MAPI application must not run on a 64-bit platform (64-bit Outlook on 64-bit Windows) without first being rebuilt as a 64-bit application.
For a more detailed explanation, see the MSDN page on Building MAPI Applications on 32-Bit and 64-Bit Platforms
–jeroen
Posted in Delphi, Delphi x64, Development, Office, Outlook, Power User, Software Development, x86
Posted by jpluimers on 2017/11/02
Quoted in full because even 2.5 years later, it’s just too funny:
- Python: What if everything was a dict?
- Java: What if everything was an object?
- JavaScript: What if everything was a dict *and* an object?
- C: What if everything was a pointer?
- APL: What if everything was an array?
- Tcl: What if everything was a string?
- Prolog: What if everything was a term?
- LISP: What if everything was a pair?
- Scheme: What if everything was a function?
- Haskell: What if everything was a monad?
- Assembly: What if everything was a register?
- Coq: What if everything was a type/proposition?
- COBOL: WHAT IF EVERYTHING WAS UPPERCASE?
- C#: What if everything was like Java, but different?
- Ruby: What if everything was monkey patched?
- Pascal: BEGIN What if everything was structured? END
- C++: What if we added everything to the language?
- C++11: What if we forgot to stop adding stuff?
- Rust: What if garbage collection didn’t exist?
- Go: What if we tried designing C a second time?
- Perl: What if shell, sed, and awk were one language?
- Perl6: What if we took the joke too far?
- PHP: What if we wanted to make SQL injection easier?
- VB: What if we wanted to allow anyone to program?
- VB.NET: What if we wanted to stop them again?
- Forth: What if everything was a stack?
- ColorForth: What if the stack was green?
- PostScript: What if everything was printed at 600dpi?
- XSLT: What if everything was an XML element?
- Make: What if everything was a dependency?
- m4: What if everything was incomprehensibly quoted?
- Scala: What if Haskell ran on the JVM?
- Clojure: What if LISP ran on the JVM?
- Lua: What if game developers got tired of C++?
- Mathematica: What if Stephen Wolfram invented everything?
- Malbolge: What if there is no god?
–jeroen
Posted in .NET, APL, Assembly Language, BASIC, C, C#, C++, COBOL, Development, EPS/PostScript, Fun, Go (golang), Java, Java Platform, JavaScript/ECMAScript, LISP, Makefile, Pascal, Perl, PHP, Python, Quotes, Ruby, Rust, Scala, Scripting, Smalltalk, Software Development, T-Shirt quotes, TCL, Turbo Prolog, VB.NET, Visual BASIC, XML/XSD, XSLT