The Matrix AI Math

Researchers Upend AI Status Quo By Eliminating Matrix Multiplication In LLMs

Researchers from UC Santa Cruz, UC Davis, LuxiTech, and Soochow University have developed a new method to run AI language models more efficiently by eliminating matrix multiplication, potentially reducing the environmental impact and operational costs of AI systems. Ars Technica's Benj Edwards reports: Matrix multiplication (often abbreviated to "MatMul") is at the center of most neural network computational tasks today, and GPUs are particularly good at executing the math quickly because they can perform large numbers of multiplication operations in parallel. [...] In the new paper, titled "Scalable MatMul-free Language Modeling," the researchers describe creating a custom 2.7 billion parameter model without using MatMul that features similar performance to conventional large language models (LLMs). They also demonstrate running a 1.3 billion parameter model at 23.8 tokens per second on a GPU that was accelerated by a custom-programmed FPGA chip that uses about 13 watts of power (not counting the GPU's power draw). The implication is that a more efficient FPGA "paves the way for the development of more efficient and hardware-friendly architectures," they write.

The paper doesn't provide power estimates for conventional LLMs, but this post from UC Santa Cruz estimates about 700 watts for a conventional model. However, in our experience, you can run a 2.7B parameter version of Llama 2 competently on a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM in only 13 watts on an FPGA (without a GPU), that would be a 38-fold decrease in power usage. The technique has not yet been peer-reviewed, but the researchers -- Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, and Jason Eshraghian -- claim that their work challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. They argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment on resource-constrained hardware like smartphones. [...]

The researchers say that scaling laws observed in their experiments suggest that the MatMul-free LM may also outperform traditional LLMs at very large scales. The researchers project that their approach could theoretically intersect with and surpass the performance of standard LLMs at scales around 10^23 FLOPS, which is roughly equivalent to the training compute required for models like Meta's Llama-3 8B or Llama-2 70B. However, the authors note that their work has limitations. The MatMul-free LM has not been tested on extremely large-scale models (e.g., 100 billion-plus parameters) due to computational constraints. They call for institutions with larger resources to invest in scaling up and further developing this lightweight approach to language modeling.

  • by Anonymous Coward

    Will I be able to generate more sausage finger porn with this method?

  • by crunchygranola ( 1954152 ) on Tuesday June 25, 2024 @09:09PM (#64578455)

    This is really pretty clever [arxiv.org]. Yeah a lot of what LLMs do is really excess computation for what gets accomplished in the end.

  • OH SHIT (Score:4, Funny)

    by Anonymous Coward on Tuesday June 25, 2024 @09:22PM (#64578471)
    Selling all my NVDA at market open tmrw.
    • Whether or not Nvidia is going to face serious competition from things like ASICs and custom-built hardware. For the longest time huge swaths of crypto were built around video cards, but now that Ethereum is running off proof of stake there's a lot less of it.

      Normally you would expect a gold rush of investors throwing money at hardware companies or the very least investing in Intel and AMD. But these days investors don't seem interested in anyone that might compete with whoever is the top dog. Better than
      • I can't speak to AMD, but people are keeping their money away from Intel because they are grossly mismanaged. Investors are still interested in ARM, TSMC, Micron, and a few others. There is also a lot of investment via companies like OpenAI in transformative chip plays.

        NVDA is just the easy one to invest in.

        • Crypto went big for ~5 to 7 years, will AI be big for 7 years?

          From a hardware perspective, will there be a Moore's law of computation increase for AI geared processors? If so, would investing in the research and building out 1,000,000 chips per production lines be even worth it?

      • Normally you would expect a gold rush of investors throwing money at hardware companies or at the very least investing in Intel and AMD. But these days investors don't seem interested in anyone that might compete with whoever is the top dog. Better to just buy stock in the top dog and let everyone else fall by the wayside so that your investment in that top dog doesn't get undermined.

        I don't think this is correct. There has been a ton of money thrown at finding an Nvidia alternative over many y

  • by smoot123 ( 1027084 ) on Tuesday June 25, 2024 @09:38PM (#64578511)

    California is working on SB 942 [ca.gov], a bill to regulate AI. A key part of the regulation is classifying AI models by the FLOPS needed to train and run the model. If this approach dramatically decreases the compute needed, that might throw a monkey wrench in the regulatory approach.

    Personally, I'm not surprised. Classifying AI models by compute requirements seems fraught with difficulty given how fast the technology is changing.

    • by znrt ( 2424692 )

      is it? maybe you got the wrong bill? this one is about ai transparency and all i can find is this:

      (b) “Covered provider” means a person that creates, codes, or otherwise produces a generative artificial intelligence system that has over 1,000,000 monthly visitors or users and is publicly accessible within the geographic boundaries of the state.

      which doesn't make a lot of sense indeed with regards to protecting privacy or enforcing transparency, but has nothing to do with flops or computer power or energy cost which would have been really clueless. politicians being what they are, though, and californians no less, this was a distinct possibility, so i read through the whole thing and now i feel a little betrayed inside. :-/

    • potentially reducing the environmental impact

      No it won't [greenchoices.org].

    • That kind of regulation is a good way to send AI companies to Texas.
    • by cowdung ( 702933 )

      Stupid premature legislation. Won't help anyone.

    • Given that energy use is the problem, regulating based on energy use per query seems like a better classification system.

  • Interesting read (the paper).

    a) I wonder how it would perform on a classical CISC CPU such as the x86_64 vs classical RISC CPU such as Arm? (Apple may have a real leg up here?)

    b) How hard would it be to take an existing model, such as Llama-3-8b-Instruct, and "quantize" the weights to be compatible with the non-MatMul architecture?

    c) Is the GPU market going to collapse within six months after someone releases a new Llama-nonMatMul or Mixtral-nonMatMul model using the new architecture? Or does a GPU still have the advantage here? (I haven't done low-level machine-level programming in decades... not sure how SIMD, et al., relate to all this).

    • by Entrope ( 68843 ) on Wednesday June 26, 2024 @05:06AM (#64579013) Homepage

      a) Possibly worse than traditional models. They are replacing the general multiply-accumulate (MAC) operation with a MAC where the weight operand is trinary (-1, 0 or +1). This can use simpler hardware but currently needs more CPU instructions. Memory bottlenecks may change the trade-offs, though.

      b) it's easy to do post-training quantization of a model, but usually the results are poor compared to [nvidia.com] using quantization-aware training (see Table 1 of that link).

      c) No, GPUs -- or NPUs -- are still good for running general models, but they might get support for this quantization scheme in future hardware. This was done in an FPGA because that doesn't require a new ASIC fabrication for each experimental quantization scheme.
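The trinary MAC described in (a) can be illustrated with a small Python sketch. This is not the paper's implementation; the function name `ternary_matvec` and the sample values are assumptions made up for illustration. It only shows why a weight restricted to -1, 0 or +1 turns each multiply-accumulate into an add, a subtract, or a skip.

```python
def ternary_matvec(weights, x):
    """Matrix-vector product where every weight is -1, 0 or +1.

    Hypothetical helper for illustration: no multiplication is used,
    only addition, subtraction, and skipping zero weights.
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the input
            elif w == -1:
                acc -= xi      # -1 weight: subtract the input
            # 0 weight: contributes nothing, so skip it
        out.append(acc)
    return out


# Made-up example values.
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [2.0, 3.0, 5.0]
y = ternary_matvec(W, x)
print(y)  # [-3.0, 6.0]
```

The branch per weight is also a hint at why this "currently needs more CPU instructions" on general-purpose hardware, whereas dedicated logic can select add, subtract, or skip directly.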

  • I'm waiting for the Doom-style, "Can it run on a toaster?" Can it be powered by a potato-battery? Can it run on a pregnancy tester?

  • The article suggests that the FPGA hardware might be a GPU alternative. However, if there is true merit in what the FPGA is doing, then the key functions will likely become candidates for fixed function units in the next Nvidia GPU. Any FPGA solution will still need a high bandwidth memory subsystem. I imagine it would be easier to add the fixed function units to a GPU than to add an alternative to the GPU memory subsystem to FPGAs.

    In a way, it almost seems like Nvidia is expecting these types of competing

  • "potentially reducing the environmental impact and operational costs of AI systems."

    It won't do this. It just means, if implemented, the companies running the models will be enabled to do more and do it faster at the same costs as in the old situations. So it will simply accelerate the path that the companies are already on. Instead of being able to (economically) do X and Y in a year, they can do X, Y and even Z this year.
  • Memory bandwidth is the limiting factor with these things and it takes lots of current to operate ultra fast memory controllers and buses.

    For example, I tried running a llama2 7B on a GPU. While spitting out about 70t/s, the GPU's memory consumed about 75 watts vs 66 for the GPU chip.

    I assume in batch mode the GPU would be loaded and something like this could reduce power consumption with current hardware... yet in the grand scheme of things GPUs are going away for inference and custom chips for example doing matri

    • by ET3D ( 1169851 )

      This also reduced RAM use. It's similar to other quantisation, except that it reduces the number range enough that multiplication becomes meaningless.

      AI chips already gravitate to having RAM near processing, and this extra reduction in storage could make this even more practical.
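A rough sketch of where the RAM saving comes from, under an assumed storage scheme: a weight limited to -1, 0 or +1 carries about 1.58 bits of information, and five such values fit in one byte because 3^5 = 243 is at most 256. The packing format and the helper names `pack_trits` and `unpack_trits` below are hypothetical; the format actually used by the paper is not stated in this thread.

```python
def pack_trits(trits):
    """Pack up to five values from {-1, 0, +1} into one integer in 0..242.

    Hypothetical base-3 encoding for illustration only.
    """
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)   # map -1/0/+1 to base-3 digits 0/1/2
    return value


def unpack_trits(value, count=5):
    """Inverse of pack_trits: recover `count` ternary weights."""
    trits = []
    for _ in range(count):
        trits.append(value % 3 - 1)
        value //= 3
    return trits


weights = [1, 0, -1, -1, 1]
packed = pack_trits(weights)
assert unpack_trits(packed) == weights

# Back-of-the-envelope storage for a 2.7 billion parameter model.
params = 2_700_000_000
fp16_gb = params * 2 / 1e9     # two bytes per 16-bit weight
packed_gb = params / 5 / 1e9   # five ternary weights per byte
print(f"16-bit: {fp16_gb:.2f} GB, packed ternary: {packed_gb:.2f} GB")
```

Under these assumptions the weights shrink by about a factor of ten relative to 16-bit storage, which is the kind of reduction that makes keeping RAM close to the processing units more practical.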

    • There was a recent post by Noam Shazeer (now of character.ai, but previously from Google where he was the main architect of the transformer architecture used by LLMs) where he describes having optimized his transformer implementation to the point that memory bandwidth is no longer the bottleneck for large batch LLM inference.

      https://research.character.ai/... [character.ai]

  • by vbdasc ( 146051 ) on Wednesday June 26, 2024 @03:42AM (#64578921)

    So I can afford a new gaming video card.

    • by ET3D ( 1169851 )

      Get a better job. Much better direction than hoping this will help reduce GPU prices. Which it won't even if successful.

  • Is it cost effective yet to build VLSI semiconductor analogue electronic hardware that mimics biological brains? If so this probably would have a far lower power requirement.

  • by Smidge204 ( 605297 ) on Wednesday June 26, 2024 @07:04AM (#64579129) Journal

    So there's this thing called Jevons Paradox.

    The paradox states that increasing how efficiently you use a resource leads to an increase in how much of that resource gets consumed, not less. This is because more often than not, the consumption is input constrained rather than output constrained. If efficiency doubles, meaning the cost of doing business is halved, I'm not going to do half as much business for the same revenue... I'm going to do twice as much business for the same outlay cost and make twice as much revenue.

    More efficient AI computation means the returns on investment are higher, meaning there will be more data centers built and more energy consumed, not less...
    =Smidge=

    • by Rei ( 128717 )

      Only if:

      (A) AI returns economically justify the investment, e.g. are massively successful and creating huge amounts of economic activity; and
      (B) Simpler LLMs are never "good enough", that everyone always wants ever-better LLMs, rather than cheaper ones.

      • (A) Notwithstanding that making something more efficient makes it more likely to be economical, clearly it already is given the amount of investment and growth in the market. At the most optimistic, there will be a bubble of AI infrastructure deployment that eventually pops leaving a massive economic and ecological crater, but that actually seems unlikely.

        (B) There will never be a thing as "good enough" because the one and only utility of AI is that it's better than existing tools. As soon as someone makes a

        • by Rei ( 128717 )

          I must disagree.

          (A) AI today isn't being paid for by its revenue. It's being paid for by speculation about its future revenue. There are some companies that are doing well on-net with sustainable business models, but by and large it's speculative. At some point, that speculative future has to realize, or your future investment - rather than continuing to grow exponentially as modeled - will instead crash.

          (B) There absolutely is "good enough". For every task, there's a given failure rate for a given mode

          • (A) Exactly what I mean by bubble. Again, the optimistic scenario is this improvement in efficiency means a bigger, faster-growing bubble. For better or worse I do think LLMs will reach cost parity well before venture capital dries up as the potential market is absolutely huge. Just customer service alone is something like $12B/yr. Tech support about $47B/yr. I can't imagine what government contracts might be worth if/when they replace every desk clerk with an LLM...

            (B) You're thinking in terms of individua

  • The whole Nvidia product line is based on killer MatMul. Now we may have a sea change in hardware and an open source framework to go with it. It seems like this may pop the market bubble in tech but anything to do with implementation is probably going to take off bigly.
  • >> "You can run ... RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply. So, if you could theoretically completely run an LLM in only 13 watts on an FPGA (without a GPU), that would be a 38-fold decrease in power usage"

    No it really wouldn't.
    1) The maximum potential power of your PSU is irrelevant. You calculate based on what is actually being used.
    2) The max power of the 3060 is 170W not 200W.
    So... even assuming the GPU is running at peak 100% of the total time (which it almo

  • by fyngyrz ( 762201 ) on Wednesday June 26, 2024 @11:47AM (#64579767) Homepage Journal

    According to TFS:

    ...you can run a 2.7B parameter version of Llama 2 competently on a home PC with an RTX 3060 (that uses about 200 watts peak) powered by a 500-watt power supply

    I run a 13-billion parameter LLM (GPT4All [gpt4all.io]) on an M1 Mac Studio Ultra. Maximum power consumption for this computer (per Apple) is 370 watts, but it doesn't use anywhere near that much power when running the LLM. My UPS indicates 80 watts; but it indicates 290 watts when I run X-Plane @4k (no scaling, that's the monitor's native resolution), as a point of reference.

    It'd be fascinating to see how much power this non-MM engine would consume on the same hardware. Much less a more recent CPU of this class.
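The wattage figures quoted in this thread can be put side by side with a few lines of Python. This is only arithmetic on numbers that appear in the summary and comments; the labels and the helper `fold_reduction` are made up for illustration, and the figures are not like-for-like (different models and hardware, and the 13-watt FPGA number excludes the GPU's own power draw).

```python
FPGA_WATTS = 13  # FPGA figure from the summary, not counting the GPU

# Baselines quoted in the summary and comments (watts).
baselines = {
    "500 W power supply rating (summary)": 500,
    "RTX 3060 at 200 W peak (summary)": 200,
    "RTX 3060 at 170 W max (comment)": 170,
    "M1 Mac Studio measured at the UPS (comment)": 80,
}


def fold_reduction(baseline_watts, target_watts=FPGA_WATTS):
    """How many times smaller the target power is than the baseline."""
    return baseline_watts / target_watts


for label, watts in baselines.items():
    print(f"{label}: {fold_reduction(watts):.1f}x")
```

The 38-fold figure only appears when the power supply's rating is used as the baseline; with the GPU's own quoted draw the ratio falls to roughly 13x to 15x.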

  • Could they use AI and/or genetic algorithms to evolve the most efficient AI chip design rather than just rely on pondering humans?

  • We've discussed this approach for over a year. It was painfully obvious.

    The next step is an LLM compiler that can eliminate low latency requirements during training. If you "map/reduce" neural networks, you can stream data and produce weights distributed and merge generated graphs after. The trick is, we need a "graph multiply" so to say. Meaning, we need an operation like matrix multiply for graphs of arbitrary size. The trick is node alignment.

    Expect 10^23 to drop to 10^~15 and NUMA issues to disappear

