Sorting is a good 'random problem', but matrix multiplication is WOW when you get all the low-level details right.
L1 blocking, prefetching, column-major vs. row-major access, and so on. You could probably spend a whole semester just optimizing matrix multiplication and understanding each incremental low-level optimization.
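A minimal sketch of the L1-blocking idea (my own illustration, not from the article; `BLOCK` is a made-up tuning knob that real code would tune per CPU):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Naive row-major matrix multiply: C = A * B, all N x N.
// The inner loop over k strides through B column-wise, missing cache.
void matmul_naive(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int N) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}

// Same product, but iterating over BLOCK x BLOCK tiles so each tile of
// A and B stays resident in L1 while it is reused. C must start zeroed,
// since tiles accumulate into it.
constexpr int BLOCK = 64;

void matmul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                    std::vector<double>& C, int N) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                for (int i = ii; i < std::min(ii + BLOCK, N); ++i)
                    for (int k = kk; k < std::min(kk + BLOCK, N); ++k) {
                        double a = A[i * N + k];  // reused across the j loop
                        for (int j = jj; j < std::min(jj + BLOCK, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

Even this one transformation is a lecture's worth of material once you start asking why the loop order ii/kk/jj beats the alternatives.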
It appears that the sorting is done on the memory-mapped buffer? That could incur extra costs (lazily paging in from disk) that an in-memory buffer wouldn't see, which could skew the benchmark.
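One way to take lazy page-in out of the timed region is to prefault the mapping first. A sketch assuming POSIX/Linux (`map_prefaulted` is a made-up helper name, not anything from the article):

```cpp
#include <cassert>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a file read-write and prefault every page before the timed region,
// so a benchmark measures sorting rather than lazy page-in from disk.
// madvise(MADV_WILLNEED) is a hint; the touch loop forces the faults now.
void* map_prefaulted(const char* path, size_t* len_out) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    size_t len = (size_t)st.st_size;
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, len, MADV_WILLNEED);
    volatile char sink = 0;
    long page = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < len; off += (size_t)page)
        sink = sink + ((char*)p)[off];  // touch one byte per page
    (void)sink;
    *len_out = len;
    return p;
}
```

On Linux you could also pass MAP_POPULATE to mmap instead of the touch loop; either way, whether the article's numbers include paging cost is exactly the question worth checking.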
It depends on your needs. If you really need more performance, then they absolutely matter.
In general it makes sense to program for the platform/CPU, not the compiler. Respect the CPU first, the compiler second. This turns into second nature just like everything else.
In my experience, 99% of the people out there disagree, and for good reason: they code for the compiler, not the system. When that's good enough, good enough it is.
For sorting, not really. The big development in compute power is parallelism. Ips4o facto (https://github.com/SaschaWitt/ips4o): if you want to sort large vectors really fast, it makes more sense to sort in place and in parallel. Parallel in-place radix sort is also a solid choice, but far less flexible than a comparison sort.
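For a library-free taste of the idea (this is NOT ips4o's in-place samplesort, just the simplest sort-chunks-then-merge scheme; `parallel_sort` is a name I made up):

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Minimal parallel comparison sort: split the vector into chunks, sort
// each chunk on its own thread, then merge sorted runs pairwise.
// std::inplace_merge keeps the extra memory modest.
void parallel_sort(std::vector<int>& v, unsigned threads = 4) {
    size_t n = v.size();
    if (threads < 2 || n < 2) { std::sort(v.begin(), v.end()); return; }
    size_t chunk = (n + threads - 1) / threads;
    std::vector<std::thread> pool;
    std::vector<size_t> bounds{0};  // run boundaries, ending at n
    for (size_t lo = 0; lo < n; lo += chunk) {
        size_t hi = std::min(lo + chunk, n);
        bounds.push_back(hi);
        pool.emplace_back([&v, lo, hi] {
            std::sort(v.begin() + lo, v.begin() + hi);
        });
    }
    for (auto& t : pool) t.join();
    // Merge adjacent sorted runs until a single run remains.
    while (bounds.size() > 2) {
        std::vector<size_t> next{0};
        for (size_t i = 2; i < bounds.size(); i += 2) {
            std::inplace_merge(v.begin() + bounds[i - 2],
                               v.begin() + bounds[i - 1],
                               v.begin() + bounds[i]);
            next.push_back(bounds[i]);
        }
        if (bounds.size() % 2 == 0)   // odd run count: carry the last run
            next.push_back(bounds.back());
        bounds = std::move(next);
    }
}
```

ips4o does something much smarter (branchless in-place samplesort), but even this naive version shows where the parallel speedup comes from.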
> let us begin with a file of a hundred-million totally random integers [...] This is objectively the worst case, containing the greatest possible entropy; but also the best case, hiding no surprises.
There is no single worst case for sorting; in fact most benchmarks use several different inputs, and even the best algorithms win some and lose some. It's important not to overfit to a specific distribution.
For example, fully random integers can benefit MSB-radix-style sorts, because it's enough to look at a log(n)-bit prefix of each integer to have the data almost sorted, so the later passes have very predictable behavior.
Also, the chance of duplicates is negligible, and duplicates are problematic for some quicksort implementations.
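To make the duplicates point concrete, here's a comparison-counting illustration (my own, not the article's code): a naive Lomuto-partition quicksort degrades to quadratic comparisons on all-equal keys, while a three-way "fat pivot" partition handles them in one pass per distinct value.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

static long lomuto_cmps = 0;

// Classic Lomuto partition: every duplicate of the pivot goes to one
// side, so all-equal input gives maximally lopsided partitions.
void quicksort_lomuto(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; ++j) {
        ++lomuto_cmps;
        if (a[j] <= pivot) std::swap(a[i++], a[j]);
    }
    std::swap(a[i], a[hi]);
    quicksort_lomuto(a, lo, i - 1);
    quicksort_lomuto(a, i + 1, hi);
}

static long fat_cmps = 0;

// Three-way (Dutch national flag) partition: elements equal to the
// pivot stay in the middle and are never recursed into again.
void quicksort_3way(std::vector<int>& a, int lo, int hi) {
    if (lo >= hi) return;
    int pivot = a[lo], lt = lo, gt = hi, i = lo;
    while (i <= gt) {
        ++fat_cmps;
        if (a[i] < pivot)      std::swap(a[lt++], a[i++]);
        else if (a[i] > pivot) std::swap(a[i], a[gt--]);
        else                   ++i;
    }
    quicksort_3way(a, lo, lt - 1);
    quicksort_3way(a, gt + 1, hi);
}
```

On 2000 all-equal keys the Lomuto version does roughly n²/2 ≈ 2 million comparisons versus about n for the three-way version, which is why benchmark inputs with heavy duplication matter.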
This article is missing an important reference to the prior work by Edelkamp and Weiss on branchless quicksort.
I've long since implemented that in pdqsort, along with other techniques I've described elsewhere; it's a drop-in replacement for std::sort that's ~twice as fast on small data with branchless comparisons. It's also available in Boost.Sort.
Author here. I didn't know this was on HN until just now.
The headline is misleading; the article isn't really about whether or not low-level optimisations matter. That said, the idea that one of the big speedups computers have gotten in the past two decades comes from better caching and branch prediction is interesting, and the idea that we use neural nets for branch prediction was already known to me but is still really cool.
Branches matter a lot in many cases. In some cases compilers are smart enough to optimize them out, and in some cases they aren't.
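A small illustration of the two shapes (my own example; whether the compiler actually emits cmov for the second form depends on target and version, so trust the disassembly, not the source):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Branchy: with unpredictable data, the if pays for mispredictions.
int64_t sum_below_branchy(const std::vector<int>& v, int limit) {
    int64_t sum = 0;
    for (int x : v)
        if (x < limit) sum += x;      // data-dependent branch
    return sum;
}

// Branchless: compute the select; compilers often lower the ternary
// to a conditional move instead of a branch.
int64_t sum_below_branchless(const std::vector<int>& v, int limit) {
    int64_t sum = 0;
    for (int x : v)
        sum += (x < limit) ? x : 0;   // select, not branch
    return sum;
}
```

Both return the same value; only the predictability of the control flow differs, which is exactly the thing a source-level reading can't tell you.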
For those interested in the question, it's worth looking at how zstd was made. I have to archive hundreds of TB of JSON about tweets, so I did a compressor bake-off last year. I was surprised at how much better zstd did. Turns out designing for the low-level details of modern processors can make a big difference: https://engineering.fb.com/2016/08/31/core-data/smaller-and-...
> Do Low-Level Optimizations Matter?
1. They matter a _lot_.
2. (Mentioned but not covered in the article) adaptation to known features of your input distribution also matters a lot.
A couple of dumb questions:
1) In practice, how much of low level optimization is writing code differently, vs compiler tweaking? How much is done by offloading to other (specialized) hardware?
2) How could someone learn to do something like this? I've never had any situations where micro-optimization was necessary or even useful: never had to work with extremely large amounts of data, worry about tight timing constraints, and so on. Outside of an industrial setting, getting access to the kinds of problems that require this seems like the most difficult part. I suppose simply sorting massive quantities of numbers is one option, but it feels limited.
The history of the CMOV instructions mentioned in the article is incorrect.
While ARM had such instructions from the beginning, and other computers had them even earlier, in the Intel/AMD x86 instruction set the CMOV instructions (including FCMOV for floating point) were introduced with the Pentium Pro in November 1995, and all later Intel models have them. AMD added the CMOV/FCMOV instructions in the Athlon, in June 1999.
The 64-bit extension of x86 simply inherited them from the 32-bit ISA, and Itanium got them later than the Pentium Pro.
The CMOV instructions were an important new feature of the Pentium Pro, alongside out-of-order execution and triple instruction decoding and execution.
The article is indeed interesting, underlining several cache-miss aspects and their performance impacts, but sorting 1e8 ints in 7 s with the standard library vs. 4 s (with complex and subtle insights, possibly sensitive to the CPU arch) might not be worth it if you're dealing with a database, with less experienced coworkers, or with a more complex computation. So does 2x matter? The article doesn't say, but it is interesting nonetheless.
>Why doesn’t G++ generate cmov instructions? Older releases did, in fact, often enough to generate bug reports. It turned out that, any time a subsequent loop iteration depends on the result, and the branch would have been predicted correctly, cmov may be slower than a branch, sometimes much slower. cmov can stall speculation all by itself.
>The latest designs from Intel and AMD are said to avoid the cmov pipeline stall, often, but Gcc has not caught up with them yet.
This is why you don't optimize first: even things you think you know might not be true in the next HW iteration, compiler version, etc. This pattern shows up all the time. The only sane way to optimize is to measure and write tests to catch regressions; otherwise, always write for readability first.
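A minimal "measure first" sketch (my own harness, not from the article; `best_of` is a made-up helper): time the operation several times and keep the best run, which is usually the least noisy summary for a micro-benchmark.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Run f() `runs` times and return the fastest wall-clock duration in
// seconds. Numbers will differ per CPU, compiler, and flags, which is
// exactly the point: measure on the machine you care about.
template <class F>
double best_of(int runs, F&& f) {
    double best = 1e300;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best,
            std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

Usage would be something like `best_of(5, [&]{ auto v = base; std::sort(v.begin(), v.end()); })`, copying the input inside the lambda so every run sorts fresh data; pair it with a regression threshold in CI and you can upgrade compilers without guessing.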