Farewell, CUDA OOM: Automatic Gradient Accumulation
4 points • 0 comments
Recent @ffast-math Activity
I named it this in 2017 and was only worried about name collisions with other GitHub repos and ML algorithms. Also it's a backronym for Based On Lookup Tables + sounds at least somewhat evocative of going fast, so it was the best name for an algorithm I could come up with.
Nope. I'd love to though.
This master's thesis sort of does it for individual layers, but it doesn't have any fine-tuning yet so it completely wrecks the accuracy: https://github.com/joennlae/halutmatmul.
If someone worked on contributing this functionality to Composer, I'd be down to help out. I can't justify building it all on my own right now since we're 100% focused on training speedup, but I could definitely meet and talk through it, help code tricky parts, review PRs, etc.
No ML frameworks implement it yet, though I'd be happy to work with people from the PyTorch/TF/JAX/CUDNN/CUTLASS/etc. teams (or volunteers) if anyone wants to make this happen.
Also, while you can get 200x compression, I do want to emphasize that there's a speed vs quality tradeoff and the results will vary by problem. We have much more careful statements in the paper about the exact problem setup, tradeoffs, etc. Also, as I've mentioned in other comments, it probably won't help too much on modern GPUs due to their acceleration of dense GEMMs but not shuffles. CPU inference is the killer app here.
IMO it would be super cool and I hope someone does it. There are a lot of interesting tradeoffs around which techniques to use for which matrix sizes and under which assumptions about read vs write ratios, what you have a training set for, whether you can fuse compression into previous ops, etc.
Email: <my first name>@mosaicml.com
Thanks for posting it!
It should be possible to get large speedups on CPUs, but the trick will be gradually approximating each of the layers in the model (see my reply to sibling comment). It's not conceptually difficult, but will require a fair amount of C++ work to port the code to GPUs* for training; and it will probably go slower than dense ops on modern GPUs due to tensor cores not supporting our memory layout.
I think of this paper as the first in a two-part series, where the next one takes these fast ops and gets them working in full neural nets. (If anyone wants to do this project, happy to coadvise you / talk about it whenever; I won't have bandwidth to do it myself for the foreseeable future).
*Someone recently started doing this as part of their master's thesis: https://github.com/joennlae/halutmatmul
Yes. It's another research project to make this happen, but I think it would be fairly straightforward. The issue is that you can't backprop through the assignment step, so you get no gradient with respect to the input. This mandates a progressive layer freezing strategy. I don't think it would be too hard to get working though; you'd likely just need to train for longer, or start with a pretrained model and fine-tune it as you freeze + approximate the layers.
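A minimal sketch of what that freezing schedule could look like (all names here are illustrative placeholders, not code from the paper or repo):

```python
# Hypothetical front-to-back freeze-and-approximate schedule. Because the
# lookup-table assignment step blocks gradients w.r.t. its input, a layer
# can only be approximated once everything upstream of it is frozen.
layers = ["conv1", "conv2", "conv3", "fc"]
frozen = []

def freeze_and_approximate(name):
    # Real version: swap the layer's dense matmul for the learned
    # lookup-table op and mark its parameters non-trainable.
    frozen.append(name)

for epoch in range(8):
    # fine_tune(...)  # placeholder: train the not-yet-frozen layers
    if epoch % 2 == 1 and len(frozen) < len(layers):
        freeze_and_approximate(layers[len(frozen)])
```

By the end of the loop, every layer has been approximated and frozen, front to back.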
We found sparse, truncated PCA to be the most competitive baseline. We beat it by a lot (see the paper), but the other big drawback is that trading off the rank vs sparsity was an ugly hyperparameter tuning problem. By ugly, I mean that the results were really sensitive to getting this right, it wasn't easy to set a priori, and it took a while to iterate on because the sparse PCA trained much more slowly than any other practical alternative.
There are situations where PCA/SVD is the right approach though. Namely, if you need really little error, our method often can't do that, whereas throwing away dims that explain almost no variance can. Also it's just easier to implement.
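For concreteness, here's roughly what the low-rank (truncated SVD) baseline looks like in NumPy; the shapes and the rank k are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 256))   # tall "data" matrix
B = rng.standard_normal((256, 64))     # smaller, fixed matrix

# Truncated SVD baseline: B ~= Uk @ diag(sk) @ Vtk with rank k
k = 32
U, s, Vt = np.linalg.svd(B, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k]

# Approximate product: (A @ Uk) costs n*d*k; the rest costs n*k*m,
# versus n*d*m for the exact product.
approx = (A @ Uk) * sk @ Vtk
exact = A @ B
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

With random Gaussian matrices there's little low-rank structure, so the error at half rank is substantial; the baseline shines only when the trailing singular values explain almost no variance.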
Exactly. You can run it on sparse inputs. It's just that our implementation doesn't exploit the sparsity, so we don't claim that it will work better.
Definitely. On CPUs, you could make this 2x faster pretty easily with just another execution port for vpshufb / vtbl and a 4-bit lo and hi unpack instruction.
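To make that instruction pattern concrete, here's a NumPy simulation of what a vpshufb-style lookup does: a 16-entry table indexed by 4-bit codes, with two codes packed per byte (the table contents and packed bytes are just toy data):

```python
import numpy as np

# Toy 16-entry lookup table (vpshufb shuffles 16 bytes by 4-bit indices).
table = np.arange(16, dtype=np.uint8) * 3
packed = np.array([0x21, 0xF0], dtype=np.uint8)  # codes (1,2) and (0,15)

lo = packed & 0x0F        # the "4-bit lo unpack"
hi = packed >> 4          # the "4-bit hi unpack"
out = table[np.concatenate([lo, hi])]   # one gather per nibble
```

In real SIMD code, the unpacking costs extra shift/mask instructions per byte, which is why a fused 4-bit unpack would help.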
Though the real speedup would be allowing dense matmul ASICs to operate on 16-byte tables and 4-bit indices as operands. The reason Bolt and MADDNESS end up so fast is that they produce "sparse" representations that are still contiguous, strided arrays in memory. So the kernels and access patterns are just like those of dense GEMMs (and therefore vectorize-able, etc), but with lookup-adds instead of multiply-adds.
Hopefully-clarifying image: https://imgur.com/a/trOB69U
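If it helps, here's a stripped-down NumPy sketch of the lookup-add pattern. Random prototypes and brute-force nearest-neighbor encoding stand in for MADDNESS's learned hashing; this shows the access pattern, not the real algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
C, K = 4, 16          # 4 codebooks, 16 prototypes each (4-bit codes)
D, M = 32, 8          # input dim, number of output columns
subdim = D // C

# Toy prototypes: K candidate subvectors per codebook (random here;
# MADDNESS learns these plus a fast tree-based assignment function).
protos = rng.standard_normal((C, K, subdim))

def encode(x):
    # Assign each subvector of x to its nearest prototype (a 4-bit index).
    codes = np.empty(C, dtype=np.uint8)
    for c in range(C):
        sub = x[c * subdim:(c + 1) * subdim]
        codes[c] = np.argmin(((protos[c] - sub) ** 2).sum(axis=1))
    return codes

# Precompute tables: dot product of every prototype with every column of
# the fixed matrix B. Shape (C, K, M).
B = rng.standard_normal((D, M))
tables = np.einsum('cks,csm->ckm', protos, B.reshape(C, subdim, M))

def lookup_matvec(x):
    codes = encode(x)          # "sparse" codes, stored as a dense array
    # The kernel: C table lookups + adds per output, zero multiplies.
    return sum(tables[c, codes[c]] for c in range(C))

x = rng.standard_normal(D)
approx = lookup_matvec(x)
exact = x @ B
```

The codes array is dense and contiguous, so the inner loop strides through memory exactly like a dense GEMM would.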
There's definitely a tradeoff between speed and accuracy. We characterize this for various problems in the paper (https://arxiv.org/pdf/2106.10860.pdf), but tl;dr is that it speeds things up more at a given level of error when there's more redundancy in your matrices.
Back-of-the-envelope calculation suggests that this won't beat tensor cores on NVIDIA GPUs. This is basically because ~half the die is an ASIC for dense (and 2:4 sparse) matmuls, with no support for the sparsity structure we induce. If 1:16 sparsity were supported or there were a batched warp_shuffle instruction, we'd get similar speedups for GPUs as we do on CPUs.
Author here. Ask me anything--happy to answer questions.
Also, if you like this kind of work, you might like what I've been building for the past year: Composer. It speeds up neural net training by a lot (e.g., 7x faster for ResNet-50) and, in contrast to Bolt/MADDNESS, is polished, documented code you can get working in <5min.
I think we only claim to be able to preprocess a matrix at "up to" 100GB/s/core. The overall matrix product will take longer and depend on the matrix shapes.
To simplify Section 1.1, we help when:
1) You need to perform a matrix product more quickly and can tolerate approximation error
2) You have a training set for the larger matrix
3) The smaller matrix is either a) fixed or b) skinny relative to how tall the larger matrix is.
Re: "an impressive, but dangerous, tool to people who don't know what they're doing."
I believe you are overestimating the usability of my code :). But more seriously, I suspect that people attempting to use our method in contexts it wasn't designed for will quickly discover that they either can't actually call the API the way they wanted to, or that the method is no faster for their purposes. We also characterize our method at least as thoroughly as any approximate matrix multiplication algorithm I'm aware of, and have a variety of (admittedly loose) theoretical guarantees. So I hope that at least those who thoroughly read the paper will have a clear idea of what it can do. Overall, I guess my current thinking is that 1) I'm not sure how to introduce a method any more responsibly, but 2) if I can be of help in ensuring that it gets used well, feel free to reach out.
We have a generalization guarantee in Section 4.5. It's not especially tight though; in practice, the errors from different codebooks tend to be mostly independent, and you get nice Gaussian-like concentration. I would look at the empirical results in Section 5 to get a better feel for how it performs in practice.
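A quick simulation of why that independence matters (purely illustrative numbers, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
C, trials = 16, 100_000

# If each of C codebooks contributes an independent, zero-mean error with
# std sigma, the total error's std scales like sqrt(C)*sigma rather than
# the worst-case C*sigma -- hence the Gaussian-like concentration.
per_codebook = rng.uniform(-1, 1, size=(trials, C))  # sigma = sqrt(1/3)
total = per_codebook.sum(axis=1)
print(total.std())  # ~ sqrt(16/3) ~= 2.31, far below the worst case of 16
```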
Almost certainly not, sadly, unless maybe there's a ton of correlation in the input and you can tolerate quite a bit of error.
So this misses a few aspects of why the method works:
- You can't actually get a speedup from the proposed approach. You'd need a lookup table with 2^b × 2^b entries to multiply two b-bit numbers, which will be much slower than just doing the multiplication.
- Most of what we do is perform fewer operations. We're actually slower per op than multiply-adds because those are supported better in hardware. We just compress the matrices by so much that it's worth it. Relatedly, for sufficiently few codebooks, we're actually sublinear in the input size--i.e., we don't even read it all.
- We have some lightweight machine learning in there to make the sublinearity and other approximations not suck. Getting the ML fast enough to beat a BLAS GEMM is far from trivial.
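To put numbers on the first point: a product lookup table needs one entry per pair of operand values, which blows up fast even at modest bit widths.

```python
b = 8                          # bits per operand
side = 2 ** b                  # 256 possible values per operand
entries = side * side          # 65,536 (x, y) pairs, i.e. 2^(2b)
bytes_each = 2 * b // 8        # a product of two b-bit ints needs 2b bits
table_bytes = entries * bytes_each  # 131,072 bytes = 128 KiB

# The lookup itself works fine; it's just that one potentially
# cache-missing gather per scalar product is far slower than a
# hardware multiply-add.
table = [[x * y for y in range(side)] for x in range(side)]
```

128 KiB already overflows a typical L1 data cache at 8 bits, and the table grows 4x for every extra bit per operand.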
I'm actually not quite sure what you mean by breaking down into two-dimensional operations. We use operations on pairs of vectors, but nothing is assumed to be two-dimensional, and I don't think we suggest imagining anything as a plane? The vectors can be arbitrary and there's no assumption about curvature. Happy to clarify if I can help.
You could still optimize the prototypes, so fine-tuning with this in place would be possible (see, e.g., ). But we don't yet have data on how well this would work using our exact method, how early in training you could do the op replacement, etc.