Hacker News Re-Imagined

55GiB/s FizzBuzz

  • 1260 points
  • 23 days ago

  • @luu
  • Created a post


@EastToWest 22 days

Replying to @luu 🎙

My naive implementation (as shown in the linked question) in Rust was mind-bogglingly slow. C could do around 150MiB/s, while Rust could only do 10MiB/s. A simple `objdump -d` shows Rust generating a lot more code, though I'm not sure how much of that is relevant.

At that point my coffee time ran out. I wish I had more time to figure out why. :-(
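For reference, the gap described here usually comes down to output strategy rather than the FizzBuzz logic itself (naive Rust's `println!` locks and flushes per line). A buffered naive C version, sketched here for illustration rather than being the commenter's actual code, looks like:

```c
#include <stdio.h>

#define BUF_CAP (1 << 16)

/* Append FizzBuzz lines for [from, to] into a local buffer, flushing with
 * fwrite() only when the buffer fills. Returns total bytes produced. */
static size_t fizzbuzz_range(long from, long to, FILE *out) {
    static char buf[BUF_CAP + 64]; /* slack so one line never overflows */
    size_t used = 0, total = 0;
    for (long n = from; n <= to; n++) {
        if (n % 15 == 0)
            used += (size_t)sprintf(buf + used, "FizzBuzz\n");
        else if (n % 3 == 0)
            used += (size_t)sprintf(buf + used, "Fizz\n");
        else if (n % 5 == 0)
            used += (size_t)sprintf(buf + used, "Buzz\n");
        else
            used += (size_t)sprintf(buf + used, "%ld\n", n);
        if (used >= BUF_CAP) {
            fwrite(buf, 1, used, out);
            total += used;
            used = 0;
        }
    }
    fwrite(buf, 1, used, out);
    return total + used;
}
```

Batching like this versus one syscall per line is often an order-of-magnitude difference on its own; the linked answer goes much further by eliminating the copy entirely.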

Reply


@jart 18 days

Replying to @luu 🎙

What's cool here is that if you can make FizzBuzz go 55GB/s, then you have the skills needed to make a whole bunch of other stuff go that speed too. My personal favorite is that the Intel architecture lets us do population count, polynomial division, and eight-tap filters at that speed too, which is wonderful.
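For a flavor of this, bulk population count is the kind of kernel that can be pushed toward memory bandwidth. A scalar sketch for illustration (the fast versions use a `vpshufb` nibble lookup table or AVX-512 `VPOPCNTDQ`; `__builtin_popcountll` assumes GCC/Clang):

```c
#include <stddef.h>
#include <stdint.h>

/* Count set bits across a buffer of 64-bit words. Compilers can
 * auto-vectorize this; hand-tuned SIMD gets it to memory bandwidth. */
static uint64_t popcount_buffer(const uint64_t *data, size_t n_words) {
    uint64_t total = 0;
    for (size_t i = 0; i < n_words; i++)
        total += (uint64_t)__builtin_popcountll(data[i]);
    return total;
}
```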

Reply


@kens 23 days

Replying to @luu 🎙

You could probably get some blazing performance out of an FPGA. I made an FPGA version of FizzBuzz a few years ago, but it was optimized for pointless VGA animation rather than performance.

https://www.righto.com/2018/04/fizzbuzz-hard-way-generating-...

Reply


@fnord77 23 days

Replying to @luu 🎙

    > // Most FizzBuzz routines produce output with `write` or a similar
    > // system call, but these have the disadvantage that they need to copy
    > // the data being output from userspace into kernelspace. It turns out
    > // that when running full speed (as seen in the third phase), FizzBuzz
    > // actually runs faster than `memcpy` does, so `write` and friends are
    > // unusable when aiming for performance - this program runs five times
    > // faster than an equivalent that uses `write`-like system calls.
Why can't `write` use a reference like vmsplice?
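As I understand it, the blocker is `write`'s contract: once it returns, the caller may immediately reuse the buffer, so the kernel must take its own copy. `vmsplice` into a pipe hands the kernel references to the user pages instead; the catch is that the caller must not touch those pages until the pipe's reader has consumed them. A minimal Linux-only sketch of the call (my own, not the program's code):

```c
#define _GNU_SOURCE /* for vmsplice() */
#include <sys/uio.h>
#include <unistd.h>

/* Push len bytes of buf into pipe_fd by reference (no copy into the
 * kernel). Caller must keep buf stable until the reader has consumed it. */
static int vmsplice_all(int pipe_fd, const char *buf, size_t len) {
    while (len > 0) {
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        ssize_t n = vmsplice(pipe_fd, &iov, 1, 0);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```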

Reply


@nobrains 23 days

Replying to @luu 🎙

Did anyone else get recommended the FizzBuzz video on YouTube ( https://youtu.be/QPZ0pIK_wsc ) just before this article, or did I just witness an incredible coincidence?

Reply


@mastax 23 days

Replying to @luu 🎙

Only getting 7GiB/s on a Ryzen 7 1800X w/DDR4 3000. Zen 1 executes AVX2 instructions at half speed, but that doesn't account for all of the difference. I guess I need a new computer. To run FizzBuzz.

Reply


@jedberg 23 days

Replying to @luu 🎙

I wonder if it slows down with really big numbers.

Reply


@dmitrygr 23 days

Replying to @luu 🎙

The amount of completely unnecessary effort unleashed on this problem is amazing. I want to meet the author and shake his hand! It has everything from Linux kernel trivia about output speed to Intel AVX2 code. So unnecessary and so awesome!

I've done stuff like this before, and I can imagine the satisfaction of completing it! A tip of my hat to you, sir!

Reply


@jiggawatts 23 days

Replying to @luu 🎙

“This is faster than a number of standard functions. In particular, it's faster than memcpy, which presents interesting challenges when it comes to I/O”

Wow.

Reply


@kvathupo 23 days

Replying to @luu 🎙

> 64 bytes of FizzBuzz per 4 clock cycles

That's borderline incredible, given that a single AVX2 instruction can take multiple clock cycles, and reciprocal throughput doesn't go below ~0.3 to my, admittedly shallow, knowledge. A remarkable piece of engineering!

Reply


@dw-im-here 22 days

Replying to @luu 🎙

Wow the competition for a slot in the corporation sure is fierce these days

Reply


@ot 23 days

Replying to @luu 🎙

To me it's almost as impressive that pv does not become the bottleneck.

Reply


@HALtheWise 23 days

Replying to @luu 🎙

Now I'm kinda curious to see how much faster you could go on an M1 Max with the GPU generating the data. Once his solution gets to the point of being a bytecode interpreter, it's trivially parallelizable, and the M1 has _fantastic_ memory bandwidth. Does anyone know if the implementation of pv or /dev/null actually requires loading the data into CPU cache?

Reply


@monksy 23 days

Replying to @luu 🎙

I'm not sure why there is coding done here.

A ticket for this work has been filed here: https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...

I'm going to have to ask that the contributors attend the grooming sessions and talk about the business value of this performance enhancement.

Reply


@kruxigt 23 days

Replying to @luu 🎙

How common is optimization this skilled in the software we run every day? Is it somewhere in some inner loop of the browser, in some parts of the OS? Graphics drivers? BLAS and similar? The demo scene?

Reply


@arberx 23 days

Replying to @luu 🎙

How do people have so much time on their hands :(

Reply


@srcreigh 22 days

Replying to @luu 🎙

    ///// Third phase of output
    //
    // This is the heart of this program. It aims to be able to produce a
    // sustained output rate of 64 bytes of FizzBuzz per four clock cycles
    // in its main loop (with frequent breaks to do I/O, and rare breaks
    // to do more expensive calculations).
    //
    // The third phase operates primarily using a bytecode interpreter; it
    // generates a program in "FizzBuzz bytecode", for which each byte of
    // bytecode generates one byte of output. The bytecode language is
    // designed so that it can be interpreted using SIMD instructions; 32
    // bytes of bytecode can be loaded from memory, interpreted, and have
    // its output stored back into memory using just four machine
    // instructions.

Reply


@sireat 22 days

Replying to @luu 🎙

The amount of optimization reminds me of the Story of Mel: http://www.catb.org/jargon/html/story-of-mel.html

Reply


@jtchang 23 days

Replying to @luu 🎙

The amount of low level CPU architecture knowledge to write such a program is mind boggling. Just goes to show how much room for improvement a lot of programs have.

Reply


@digitallyfree 23 days

Replying to @luu 🎙

As an aside, it would be fun to see a programming challenge website focused on performance and optimizations, rather than code golf (shortest program) or edge case correctness (interview type sites). I took a course like this in uni where we learned low-level optimization and got to compete with other classmates to see who had the fastest program - a fun experience that I don't really see much of online.

Reply


@e12e 22 days

Replying to @luu 🎙

But does it support arbitrarily large integers? ;)

Edit: no, but it does pretty well:

> The program outputs a quintillion lines of FizzBuzz and then exits (going further runs into problems related to the sizes of registers). This would take tens of years to accomplish, so hopefully counts as "a very high astronomical number" (although it astonishes me that it's a small enough timespan that it might be theoretically possible to reach a number as large as a quintillion without the computer breaking).
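A rough sanity check on "tens of years" (my own back-of-envelope arithmetic, assuming an average line length of ~20 bytes once line numbers reach 17-18 digits):

```c
/* Estimate how many years emitting `lines` lines takes at a sustained
 * output rate of gib_per_s GiB/s, given an average line length in bytes. */
static double years_to_emit(double lines, double bytes_per_line,
                            double gib_per_s) {
    double bytes = lines * bytes_per_line;
    double seconds = bytes / (gib_per_s * 1024.0 * 1024.0 * 1024.0);
    return seconds / (365.25 * 24.0 * 3600.0);
}
```

With these assumptions, 10^18 lines at 20 bytes/line and 55 GiB/s comes out to roughly a decade, which is consistent with the quoted "tens of years".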

Reply


@Scaevolus 23 days

Replying to @luu 🎙

    This is the heart of this program. It aims to be able to produce a
    sustained output rate of 64 bytes of FizzBuzz per four clock cycles
    in its main loop (with frequent breaks to do I/O, and rare breaks
    to do more expensive calculations).
    
    The third phase operates primarily using a bytecode interpreter; it
    generates a program in "FizzBuzz bytecode", for which each byte of
    bytecode generates one byte of output. The bytecode language is
    designed so that it can be interpreted using SIMD instructions; 32
    bytes of bytecode can be loaded from memory, interpreted, and have
    its output stored back into memory using just four machine
    instructions. This makes it possible to speed up the FizzBuzz
    calculations by hardcoding some of the calculations into the
    bytecode (this is similar to how JIT compilers can create a version
    of the program with some variables hardcoded, and throw it away on
    the rare occasions that those variables' values change).

    The bytecode format is very simple (in order to allow it to be
    interpreted in just a couple of machine instructions):
    - A negative byte represents a literal character (e.g. to produce
      a literal 'F', you use the bytecode -'F', i.e. -70 = 0xba)
    - A byte 0..7 represents the hundreds..billions digit of the line
      number respectively, and asserts that the hundreds digit of the
      line number is even
    - A byte 8..15 represents the hundreds..billions digit of the line
      number respectively, and asserts that the hundreds digit of the
      line number is odd

    In other words, the bytecode program only ever needs to read from
    LINENO_MID; the information stored in LINENO_LOW and LINENO_TOP
    therefore has to be hardcoded into it. The program therefore needs
    to be able to generate 600 lines of output (as the smallest number
    that's divisible by 100 to be able to hardcode the two low digits,
    200 to be able to get the assertions about the hundreds digits
    correct, and 3 and 5 to get the Fizzes and Buzzes in the right
    place).

    The bytecode interpreter consists of four instructions:
    1. Load the bytecode from memory into %ymm2;
    2. Use it as a shuffle mask to shuffle LINENO_MID_TEMP;
    3. Subtract the bytecode from the shuffle result;
    4. Output the result of the subtraction.

    #define INTERPRET_BYTECODE(bc_offset, buf_offset) \
      vmovdqu %ymm2, [%rdx + bc_offset]; \
      vpshufb %ymm0, LINENO_MID_TEMP, %ymm2; \
      vpsubb %ymm0, %ymm0, %ymm2; \
      vmovdqa [OUTPUT_PTR + buf_offset], %ymm0
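
The four quoted instructions map almost one-to-one onto AVX2 intrinsics. Below is a hedged sketch of a single 32-byte interpreter step, with my own names and with the odd/even hundreds-digit assertion trick omitted for simplicity. The convention: a negative bytecode byte has its high bit set, so `vpshufb` emits 0 there and the subtraction produces the literal character; a small index k selects a byte of the digits register, which is assumed to hold (ASCII digit + k) so that subtracting k leaves the plain digit.

```c
#include <immintrin.h>
#include <stdint.h>

#if defined(__GNUC__) && !defined(__AVX2__)
#pragma GCC target("avx2") /* allow compiling without -mavx2 */
#endif

/* One simplified 32-byte step of the FizzBuzz bytecode interpreter. */
static void interpret32(const int8_t bytecode[32], __m256i digits,
                        uint8_t out[32]) {
    __m256i bc  = _mm256_loadu_si256((const __m256i *)bytecode); /* 1. load    */
    __m256i sh  = _mm256_shuffle_epi8(digits, bc);               /* 2. shuffle */
    __m256i res = _mm256_sub_epi8(sh, bc);                       /* 3. subtract*/
    _mm256_storeu_si256((__m256i *)out, res);                    /* 4. store   */
}
```

For example, with every byte of `digits` set to `'5' + index`, a bytecode byte of `3` emits `'5'` and a byte of `-'F'` emits the literal `'F'`.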

Reply



@wellthisisgreat 23 days

Replying to @luu 🎙

Sorry for the super layman question, but can someone explain why this is so amazing?

Reply


@savant_penguin 23 days

Replying to @luu 🎙

"future-proofed where possible to be able to run faster if the relevant processor bottleneck – L2 cache write speed – is ever removed)."

Being able to write a function limited mostly by L2 cache write speed, and to realize that, is rad.

And btw this is an interesting example of how hand-optimized assembly can be much, much faster than any other solution. Can you get as fast as this solution with mostly C/C++? It uses interesting tricks to avoid memcpy (calling it slow rofl)

Reply


@m0ck 23 days

Replying to @luu 🎙

Thanks for my daily dose of software engineer imposter syndrome.

Reply


@DeathArrow 22 days

Replying to @luu 🎙

The next level would be designing an ASIC to do FizzBuzz. :)

Reply


@robertwt7 23 days

Replying to @luu 🎙

I feel stupid after reading this, is this normal?

Reply


@wly_cdgr 23 days

Replying to @luu 🎙

Literal 10x programmer moment

Reply


@Ericson2314 23 days

Replying to @luu 🎙

Now we just need someone to write a machine proof of correctness.

Reply


@DeathArrow 22 days

Replying to @luu 🎙

I'm imagining the author of this being called in for an interview and writing this on the whiteboard, followed by the CTO offering him his job.

Reply


@stefanos82 23 days

Replying to @luu 🎙

I personally need an assembly renaissance, in the form of new books and new content, that is.

This is mind blowing to say the least!

Reply


@optymizer 23 days

Replying to @luu 🎙

How does one debug the L2 cache? I.e., how does one inspect the data in the L2 cache and verify that loading some data from RAM actually cached X bytes, that the correct data landed in the cache lines, and that the next instruction will use the data from there?

Reply


@booleandilemma 23 days

Replying to @luu 🎙

Wow. There is programming, and then there is programming. Whenever I see something like this, I feel like what I do for a living is Duplo blocks in comparison.

Reply


@DeathArrow 22 days

Replying to @luu 🎙

It would be nice to compare its performance with a safe variant written in Rust, a variant written in Java, and one in Python, to see what percentage of performance we lose to high abstraction, convenience, and safety.

Reply


@rustybolt 22 days

Replying to @luu 🎙

I’ve printed out his entry and there are some neat tricks I learned, but ‘stage 3’, where, according to his comments, a bytecode interpreter is used, is still beyond my understanding. I'm also not sure why it helps; a bytecode interpreter sounds like a lot of overhead to me…

Reply


@brundolf 23 days

Replying to @luu 🎙

> Save it as fizzbuzz.S (that's a capital S as the extension)

What's this about?

Reply


@lvncelot 22 days

Replying to @luu 🎙

Somehow this reminds me of Aphyr's Technical Interview series[1], and the thought of some assembler dark wizard actually coding this in an interview is highly amusing.

[1] https://aphyr.com/posts/340-reversing-the-technical-intervie...

Reply


@brokenmachine 23 days

Replying to @luu 🎙

I love it that humans are so diverse - there's always someone prepared to dedicate their life to the most obscure things.

It's so impressive and hilarious to me that he actually spent months on this. Well done!!

Reply


@miguelmurca 22 days

Replying to @luu 🎙

From the comments: "@chx: I already have a master's thesis. This was harder."

Reply


@ummonk 23 days

Replying to @luu 🎙

I guess it's faster than memcpy because both are memory-bound, but this only has to write to memory, whereas memcpy has to both read from and write to memory?

Edit: actually, it sounds like it's more cache-efficient because it can always reuse the same block of cache memory?

Reply


@no_time 23 days

Replying to @luu 🎙

next up: custom FizzBuzz ASIC with 128 fizz and 128 buzz cores for maximum throughput.

Reply


@Snelius 23 days

Replying to @luu 🎙

Good work!

Reply


@1vuio0pswjnm7 22 days

Replying to @luu 🎙

Interesting how he uses the C preprocessor as an assembly preprocessor, i.e. cpp as a general-purpose macro processor. Maybe this is more portable than using built-in assembler macro features that vary from assembler to assembler.

Reply


@jerf 23 days

Replying to @luu 🎙

"The Grid. A digital frontier. I tried to picture clusters of information as they moved through the computer. What did they look like? Ships, motorcycles? Were the circuits like freeways? I kept dreaming of a world I thought I'd never see. And then, one day I got in...

It turns out The Grid is just a guy sitting in a chair, shouting about "Fizz!" and "Buzz!" as fast as he can.

It wasn't really what I had in mind.

(The image of this poor program, stuck shouting "fizz!" and "buzz!" for subjective centuries at a time, struck me...)

Reply


About Us

site design / logo © 2021 Box Piper