Researchers Upend AI Status Quo By Eliminating Matrix Multiplication In LLMs
Researchers from UC Santa Cruz, UC Davis, LuxiTech, and Soochow University have developed a new method to run AI language models more efficiently by eliminating matrix multiplication, potentially reducing the environmental impact and operational costs of AI systems. Ars Technica's Benj Edwards reports: Matrix multiplication (often abbreviated to "MatMul") is at the center of most neural network computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations in parallel. [...] In the new paper, titled "Scalable MatMul-free Language Modeling," the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar performance to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per second on a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU's power draw). The implication is that a more efficient FPGA "paves the way for the development of more efficient and hardware-friendly architectures," they write.
The paper doesn't provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, in our experience, you can run a 2.7B parameter version of Llama 2 competently on a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM in only 13 watts on an FPGA (without a GPU), that would be a 38-fold decrease in power usage. The technique has not yet been peer-reviewed, but the researchers -- Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian -- claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment on resource-constrained hardware like smartphones. [...]
The researchers say that scaling laws observed in their experiments suggest that the MatMul-free LM may also outperform traditional LLMs at very large scales. The researchers project that their approach could theoretically intersect with and surpass the performance of standard LLMs at scales around 10^23 FLOPS, which is roughly equivalent to the training compute required for models like Meta's Llama-3 8B or Llama-2 70B. However, the authors note that their work has limitations. The MatMul-free LM has not been tested on extremely large-scale models (e.g., 100 billion-plus parameters) due to computational constraints. They call for institutions with larger resources to invest in scaling up and further developing this lightweight approach to language modeling.
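As a rough illustration of the core idea -- not the authors' actual implementation, which also reworks the attention/token-mixing layers -- here is a minimal Python sketch of how a dense layer whose weights are constrained to {-1, 0, +1} can be evaluated with only additions and subtractions, which is what lets the matrix multiplication drop out. The ternary_linear helper is purely illustrative:

import numpy as np

def ternary_linear(x, w_ternary):
    # Compute x @ w_ternary without any multiplications.
    # x         : (d_in,) activation vector
    # w_ternary : (d_in, d_out) weights restricted to {-1, 0, +1}
    # With every weight equal to -1, 0, or +1, each multiply-accumulate
    # collapses into an add, a subtract, or a skip.
    d_out = w_ternary.shape[1]
    y = np.zeros(d_out)
    for j in range(d_out):
        col = w_ternary[:, j]
        y[j] = x[col == 1].sum() - x[col == -1].sum()  # adds and subtracts only
    return y

# Sanity check against an ordinary matrix multiply
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.integers(-1, 2, size=(8, 4))  # random ternary weights
assert np.allclose(ternary_linear(x, w), x @ w)

A real implementation would of course vectorize this and pack the ternary weights, but the scalar loop makes the absence of multiplications explicit.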
That's amazing (Score:1)
Will I be able to generate more sausage finger porn with this method?
I Took A Look... (Score:3)
This is really pretty clever [arxiv.org]. Yeah a lot of what LLMs do is really excess computation for what gets accomplished in the end.
Re:I Took A Look... (Score:5, Informative)
Since when can an RTX 3060 do a 7B Llama model at 2k tokens/s? Are you sure you don't mean, like, 45 tokens/s? Are you maybe talking prompt eval time rather than generation time or total time? I have a 7B model (8-bit GGUF) running right now on a power-limited (285W) 3090 and here's a typical result:
prompt eval time = 943.80 ms / 2389 tokens ( 0.40 ms per token, 2531.27 tokens per second)
generation eval time = 6649.69 ms / 472 runs ( 14.09 ms per token, 70.98 tokens per second)
total time = 7593.49 ms
Re: (Score:2)
Obviously, ymmv- especially based on your backend. [bentoml.com]
Re: (Score:2)
Those high figures involve large numbers of concurrent users; they're not the figures for a single generation run.
Re: (Score:3)
Thanks!
Re: (Score:1)
Re: I Took A Look... (Score:2)
Re: (Score:2)
Hunting for karma, or just bored?
Re:I Took A Look... (Score:5, Interesting)
In addition to what Rei pointed out already about the distinction between generation and evaluation, don't forget that that's 13W with an FPGA, rather than dedicated hardware. If Rei's numbers are correct, we're talking about what is essentially a general purpose device delivering 1/3 the speed with just 1/25 the wattage. That bodes well for this approach.
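Running the thread's own numbers gives a quick Python check of that: the summary's 23.8 tokens/s at 13 W for the 1.3B MatMul-free model versus the ~71 tokens/s single-stream figure from the power-limited 3090 above. Back-of-the-envelope only, since the two runs use different model sizes and hardware classes, but the perf-per-watt gap is the point:

fpga_tps, fpga_w = 23.8, 13    # 1.3B MatMul-free model on the FPGA (per the summary)
gpu_tps, gpu_w = 70.98, 285    # 7B model, single-stream, power-limited 3090 (per the comment above)

print(f"FPGA: {fpga_tps / fpga_w:.2f} tokens/s/W")   # ~1.83
print(f"GPU:  {gpu_tps / gpu_w:.2f} tokens/s/W")     # ~0.25
print(f"speed ratio ~{gpu_tps / fpga_tps:.1f}x, power ratio ~{gpu_w / fpga_w:.0f}x")  # ~3x, ~22x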
Re: (Score:3)
Please stop mixing up multi-user batch performance with single-user performance.
Re: (Score:1)
we're talking about what is essentially a general purpose device delivering
FPGA is the exact opposite of a general purpose device. The beauty of FPGAs is that they can be configured into specialized processors.
I guess you can say that they're general purpose in that sense, but when people talk about general purpose computing, they're generally referring to CPUs that don't need to be reconfigured / customized for different applications. FPGAs require first programming the FPGA for the task, then using it for that task.
Re: (Score:2)
The beauty of FPGAs is that they can be configured into specialized processors.
I guess you can say that they're general purpose in that sense
That was the intended sense in which I used it. I knew the audience here should have the requisite knowledge to understand what I was getting at, but I tried to acknowledge I was overloading terminology a bit by tossing in that “what is effectively a”, since I wanted to give an indication that I was aware I was stretching the term’s common usage by applying it to FPGAs.
But you get it, right? Going from an FPGA to an ASIC should result in tremendous performance and power efficiency gains. T
Re: (Score:2)
Re: (Score:2)
3060 was a mistake.
Re: (Score:2)
No.
Re: (Score:2)
"tokens/s"
You keep using that word. I do not think it means what you think it means.
Re: (Score:2)
I mentioned a 300W card, but said 3060 -- the 3060 is a 170W card. The A100 is a 300W card.
As you can see, unless there is something I'm missing, the GPU is simply more efficient.
It may not run at lower power, but it does more tokens/s/W.
Re: (Score:2)
That is ridiculous on multiple levels. First, we all know you are full of shit and were pulling the wrong stat for the 3060 rather than its actual generation rate. Second, a 3060 is a $300 piece of hardware vs. AT LEAST $8k for an A100, and either one has a major fab advantage over any FPGA they are going to be compared with. Finally, you are comparing their single-user generation rate vs. batching 10-100 requests into what would otherwise be a single user generation.
Maybe you could slide if you didn't run around going out
Re: (Score:1)
Re: (Score:3)
They're confusing batch processing vs. single-user processing. Basically, you merge many requests into a single rotating context window, masked off from each other, so they're all processed simultaneously (and can even share parts of the same context if they're similar). So you get much higher tokens per second out of it. But it's an apples-to-oranges comparison; you can do batch processing on any LLM. Including ternary ones.
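To make the apples-to-oranges point concrete, here is a toy Python calculation (with made-up latency numbers, purely illustrative) showing why aggregate batched tokens/s climbs with batch size even though each individual user sees roughly the same single-stream rate -- which is why batch throughput figures can't be compared against a single-user 13 W run:

def aggregate_tokens_per_second(batch_size, base_step_ms=14.0, per_seq_ms=0.5):
    # One decode step serves the whole batch: the weights are streamed from
    # memory once per step and shared across every sequence in the batch,
    # so per-step latency grows only slowly with batch size.
    step_ms = base_step_ms + per_seq_ms * batch_size
    return batch_size * 1000.0 / step_ms  # tokens/s summed over all users

for b in (1, 8, 64):
    total = aggregate_tokens_per_second(b)
    print(f"batch={b:3d}  aggregate ~{total:6.0f} tok/s  per-user ~{total / b:5.1f} tok/s")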
OH SHIT (Score:4, Funny)
Re: (Score:2)
Re: (Score:2)
I think they're definitely at risk of being outmoded by algorithmic innovation.
I do wonder (Score:2)
Normally you would expect a gold rush of investors throwing money at hardware companies, or at the very least investing in Intel and AMD. But these days investors don't seem interested in anyone that might compete with whoever is the top dog. Better to
Re: (Score:2)
I can't speak to AMD, but people are keeping their money away from Intel because they are grossly mismanaged. Investors are still interested in ARM, TSMC, Micron, and a few others. There is also a lot of investment via companies like OpenAI in transformative chip plays.
NVDA is just the easy one to invest in.
Time to market and useful life (Score:2)
Crypto went big for ~5 to 7 years, will AI be big for 7 years?
From a hardware perspective, will there be a Moore's-law-like increase in computation for AI-geared processors? If so, would investing in the research and building out production lines for 1,000,000 chips even be worth it?
Re: (Score:2)
Normally you would expect a gold rush of investors throwing money at hardware companies, or at the very least investing in Intel and AMD. But these days investors don't seem interested in anyone that might compete with whoever is the top dog. Better to just buy stock in the top dog and let everyone else fall by the wayside so that your investment in that top dog doesn't get undermined.
I don't think this is correct. There has been a ton of money thrown at finding an Nvidia alternative over many y
Wonder how this affects CA SB942 (Score:5, Insightful)
California is working on SB 942 [ca.gov], a bill to regulate AI. A key part of the regulation is classifying AI models by the FLOPS needed to train and run the model. If this approach dramatically decreases the compute needed, that might throw a monkey wrench in the regulatory approach.
Personally, I'm not surprised. Classifying AI models by compute requirements seems fraught with difficulty given how fast the technology is changing.
Re: (Score:3)
is it? maybe you got the wrong bill? this one is about ai transparency and all i can find is this:
(b) “Covered provider” means a person that creates, codes, or otherwise produces a generative artificial intelligence system that has over 1,000,000 monthly visitors or users and is publicly accessible within the geographic boundaries of the state.
which doesn't make a lot of sense indeed with regards to protecting privacy or enforcing transparency, but has nothing to do with flops or computer power or energy cost which would have been really clueless. politicians being what they are, though, and californians no less, this was a distinct possibility, so i read through the whole thing and now i feel a little betrayed inside. :-/
Re: (Score:3)
potentially reducing the environmental impact
No it won't [greenchoices.org].
Re: (Score:2)
Re: (Score:2)
Stupid premature legislation. Won't help anyone.
Re: (Score:2)
Given that energy use is the problem, regulating based on energy use per query seems like a better classification system.
Re: (Score:2)
Given that energy use is the problem, regulating based on energy use per query seems like a better classification system.
That's not the problem the fine people in Sacramento were trying to solve. They were trying to stop SkyNet. The gist of the bill, as I understand it, was to mandate that any sufficiently powerful AI have a kill switch. The bill gets pretty involved trying to define how "sufficiently powerful" is measured and that's where the FLOPs come in. The idea was to benchmark something like GPT-4 and say any model which required the same amount of FLOPs to train (adjusting for Moore's Law) was covered.
Don't take that
Can one "convert" a matmul model's weights? (Score:3)
Interesting read (the paper).
a) I wonder how it would perform on a classical CISC CPU such as the x86_64 vs classical RISC CPU such as Arm? (Apple may have a real leg up here?)
b) How hard would it be to take an existing model, such as Llama-3-8b-Instruct, and "quantize" the weights to be compatible with the non-MatMul architecture?
c) Is the GPU market going to collapse in 6 months after someone releases a new Llama-nonMatMul or Mixtral-nonMatMul model using the new architecture? Or does a GPU still have the advantage here? (I haven't done low-level machine-level programming in decades... not sure how SIMD et al. relate to all this).
Re: (Score:2)
I was reading this and wondering "gee, this text really looks like it was generated by AI". and indeed, it was. might as well just post the prompt.
Re: (Score:2)
Re:Can one "convert" a matmul model's weights? (Score:4, Informative)
a) Possibly worse than traditional models. They replace the general multiply-accumulate (MAC) operation with a MAC where the weight operand is ternary (-1, 0, or +1). This can use simpler hardware but currently needs more CPU instructions. Memory bottlenecks may change the trade-offs, though.
b) It's easy to do post-training quantization of a model, but usually the results are poor compared to [nvidia.com] using quantization-aware training (see Table 1 of that link) -- a rough sketch of the post-training route follows below.
c) No, GPUs -- or NPUs -- are still good for running general models, but they might get support for this quantization scheme in future hardware. This was done in an FPGA because that doesn't require a new ASIC fabrication for each experimental quantization scheme.
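On point b), here is a minimal Python sketch of the naive post-training route: absmean-style ternarization of an already-trained weight matrix, roughly in the spirit of BitNet-style schemes (details vary), along with the reconstruction error that explains why quantization-aware training tends to win. The ternarize helper is illustrative only:

import numpy as np

def ternarize(w, eps=1e-8):
    # Scale by the mean absolute value, then round each weight to the
    # nearest of {-1, 0, +1}. Returns the ternary weights plus the scale
    # needed to approximately reconstruct the original magnitudes.
    scale = np.abs(w).mean() + eps
    w_t = np.clip(np.round(w / scale), -1, 1)
    return w_t.astype(np.int8), scale

rng = np.random.default_rng(1)
w = rng.standard_normal((512, 512)) * 0.02   # stand-in for a trained weight matrix
w_t, scale = ternarize(w)
print(np.unique(w_t))                        # [-1  0  1]
print(np.abs(w - w_t * scale).mean())        # residual error the model never trained to absorb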
I'm waiting... (Score:2)
I'm waiting for the Doom-style "Can it run on a toaster?" Can it be powered by a potato battery? Can it run on a pregnancy tester?
How significant is the FPGA implementation? (Score:2)
The article suggests that the FPGA hardware might be a GPU alternative. However, if there is true merit in what the FPGA is doing, then the key functions will likely become candidates for fixed function units in the next Nvidia GPU. Any FPGA solution will still need a high bandwidth memory subsystem. I imagine it would be easier to add the fixed function units to a GPU than to add an alternative to the GPU memory subsystem to FPGAs.
In a way, it almost seems like Nvidia is expecting these types of competing
It will not reduce environmental impact and costs (Score:2, Insightful)
It won't do this. It just means, if implemented, the companies running the models will be able to do more and do it faster at the same cost as before. So it will simply accelerate the path that the companies are already on. Instead of being able to (economically) do X and Y in a year, they can do X, Y and even Z this year.
Exactly (Score:3)
It just means, if implemented, the companies running the models will be able to do more and do it faster at the same cost as before.
Exactly. Jevons Paradox [wikipedia.org] is a bitch.
Re: (Score:1)
Does it matter? (Score:2)
Memory bandwidth is the limiting factor with these things and it takes lots of current to operate ultra fast memory controllers and buses.
For example, I tried running a Llama 2 7B on a GPU. While spitting out about 70 t/s, the GPU's memory consumed about 75 watts vs. 66 for the GPU chip itself.
I assume in batch mode the GPU would be loaded and something like this could reduce power consumption with current hardware... yet in the grand scheme of things GPUs are going away for inference and custom chips for example doing matri
Re: (Score:2)
This also reduces RAM use. It's similar to other quantisation, except that it reduces the number range enough that multiplication becomes meaningless.
AI chips already gravitate toward having RAM near the processing, and this extra reduction in storage could make that even more practical.
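A quick back-of-the-envelope on the RAM point, assuming the paper's 2.7B-parameter model and ignoring activations, embeddings, and any per-tensor scales:

from math import log2

params = 2.7e9                          # 2.7B-parameter model from the paper
fp16_gb = params * 16 / 8 / 1e9         # 16-bit weights: ~5.4 GB
packed_gb = params * 2 / 8 / 1e9        # ternary weights packed at 2 bits each: ~0.68 GB
floor_gb = params * log2(3) / 8 / 1e9   # information-theoretic floor, ~1.58 bits/weight: ~0.54 GB

print(f"fp16 {fp16_gb:.2f} GB, 2-bit packed {packed_gb:.2f} GB, theoretical floor {floor_gb:.2f} GB")

That is roughly an 8x reduction in weight storage, which is the kind of saving that makes near-memory or on-chip weight storage more plausible.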
Re: (Score:2)
There was a recent post by Noam Shazeer (now of character.ai, but previously from Google where he was the main architect of the transformer architecture used by LLMs) where he describes having optimized his transformer implementation to the point that memory bandwidth is no longer the bottleneck for large batch LLM inference.
https://research.character.ai/... [character.ai]
Hope this is not some hoax (Score:3)
So I can afford a new gaming video card.
Re: (Score:2)
Get a better job. Much better direction than hoping this will help reduce GPU prices. Which it won't even if successful.
Analogue hardware (Score:2)
Is it cost-effective yet to build VLSI semiconductor analogue electronic hardware that mimics biological brains? If so, this probably would have a far lower power requirement.
This will make everything worse (Score:3)
So there's this thing called Jevons Paradox.
The paradox states that increasing how efficiently you use a resource leads to an increase in how much of that resource gets consumed, not less. This is because more often than not, the consumption is input constrained rather than output constrained. If efficiency doubles, meaning the cost of doing business is halved, I'm not going to do half as much business for the same revenue... I'm going to do twice as much business for the same outlay cost and make twice as much revenue.
More efficient AI computation means the returns on investment are higher, meaning there will be more data centers built and more energy consumed, not less...
=Smidge=
Re: (Score:3)
Only if:
(A) AI returns economically justify the investment, e.g. are massively successful and creating huge amounts of economic activity; and
(B) Simpler LLMs are never "good enough", that everyone always wants ever-better LLMs, rather than cheaper ones.
Re: (Score:2)
(A) Notwithstanding that making something more efficient makes it more likely to be economical, clearly it already is, given the amount of investment and growth in the market. At the most optimistic, there will be a bubble of AI infrastructure deployment that eventually pops, leaving a massive economic and ecological crater, but that actually seems unlikely.
(B) There will never be such a thing as "good enough" because the one and only utility of AI is that it's better than existing tools. As soon as someone makes a
Re: (Score:2)
I must disagree.
(A) AI today isn't being paid for by its revenue. It's being paid for by speculation about its future revenue. There are some companies that are doing well on-net with sustainable business models, but by and large it's speculative. At some point, that speculative future has to realize, or your future investment - rather than continuing to grow exponentially as modeled - will instead crash.
(B) There absolutely is "good enough". For every task, there's a given failure rate for a given mode
Re: (Score:2)
(A) Exactly what I mean by bubble. Again, the optimistic scenario is this improvement in efficiency means a bigger, faster-growing bubble. For better or worse I do think LLMs will reach cost parity well before venture capital dries up as the potential market is absolutely huge. Just customer service alone is something like $12B/yr. Tech support about $47B/yr. I can't imagine what government contracts might be worth if/when they replace every desk clerk with an LLM...
(B) You're thinking in terms of individua
Is Nvidia still relevant? (Score:1)
bad math (Score:2)
>> "You can run ... RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM in only 13 watts on an FPGA (without a GPU), that would be a 38-fold decrease in power usage"
No it really wouldn't.
1) The maximum potential power of your PSU is irrelevant. You calculate based on what is actually being used.
2) The max power of the 3060 is 170W not 200W.
So... even assuming the GPU is running at peak 100% of the total time (which it almo
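For what it's worth, the ratios in question, using the 13 W FPGA figure from the summary (with the caveat noted elsewhere in the thread that the FPGA and GPU runs involve different model sizes and measurement conditions):

fpga_w = 13
for label, gpu_w in (("RTX 3060 TDP", 170), ("summary's '200 W peak'", 200), ("500 W PSU rating", 500)):
    print(f"{label}: {gpu_w} W / {fpga_w} W = {gpu_w / fpga_w:.0f}x")
# 170/13 ~ 13x, 200/13 ~ 15x, 500/13 ~ 38x -- only the PSU rating produces the summary's "38-fold" figure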
Per TFS (Score:3)
According to TFS:
I run a 13-billion parameter LLM (GPT4All [gpt4all.io]) on an M1 Mac Studio Ultra. Maximum power consumption for this computer (per Apple) is 370 watts, but it doesn't use anywhere near that much power when running the LLM. My UPS indicates 80 watts; but it indicates 290 watts when I run X-Plane @4k (no scaling, that's the monitor's native resolution), as a point of reference.
It'd be fascinating to see how much power this non-MM engine would consume on the same hardware, to say nothing of a more recent CPU of this class.
Use AI to design AI chips? (Score:2)
Could they use AI and/or genetic algorithms to evolve the most efficient AI chip design rather than just rely on pondering humans?
Bloody obvious (Score:2)
The next step is an LLM compiler that can eliminate low-latency requirements during training. If you "map/reduce" neural networks, you can stream data, produce weights in a distributed fashion, and merge the generated graphs afterwards. The trick is, we need a "graph multiply," so to speak -- an operation like matrix multiply but for graphs of arbitrary size. The key is node alignment.
Expect 10^23 to drop to roughly 10^15, and NUMA issues to disappear.
AI advancements are so fascinating! (Score:1)