Can we reach AGI with just LLMs?
Dr Waku

Published on Feb 12, 2024

In this video we analyze a LessWrong blog post that outlines one possible path to AGI: leveraging heterogeneous architectures. Rather than assuming that scaling up current LLMs will be sufficient to reach AGI, this approach combines several types of models and algorithms. The Transformer is currently the ubiquitous architecture; a newer architecture called Mamba, a selective state space model, has recently been proposed as well.

We go into some detail about how the Transformer architecture works. It relies on an attention mechanism to resolve ambiguity in the input (for example, words that have multiple meanings). The attention mechanism compares every pair of tokens to see how related they are, which makes it a quadratic operation in the sequence length.
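To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (illustrative only, not code from the video or the paper): the score matrix has one entry for every pair of tokens, which is exactly where the quadratic cost in sequence length comes from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention (no masking, batching, or projections).

    Q, K, V: (seq_len, d_model) arrays. The score matrix below is
    (seq_len, seq_len): one similarity per pair of tokens, which is
    why the cost grows quadratically with sequence length.
    """
    d_model = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_model)               # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted mix of values

# Usage: self-attention over a 6-token sequence of random 16-d embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
print(scaled_dot_product_attention(x, x, x).shape)   # (6, 16)
```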

Then we dive into the Mamba architecture. As a state space model, it is quite good at retaining information about the input over long ranges. Mamba has two primary innovations: a selective SSM operation and a very hardware-aware implementation. The selection mechanism plays a role analogous to attention, but it is much more efficient and scales linearly with sequence length during training. The Mamba authors are optimization experts, and they use clever tricks to make sure it runs well on current GPUs.
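For contrast, here is a toy, hedged sketch of a selective state space recurrence for a single input channel. It is not Mamba's actual kernel (which uses a fused parallel scan in fast SRAM); it only illustrates the two ideas above: the update parameters depend on the current input (selection), and the whole sequence is processed in one linear-time pass over a fixed-size state.

```python
import numpy as np

def selective_ssm_channel(x, A, w_B, w_C, w_delta):
    """Toy selective SSM recurrence for one channel (illustrative only).

    x: (seq_len,) scalar input per time step.
    A: (d_state,) negative decay rates for the hidden state.
    w_B, w_C (each (d_state,)) and w_delta (scalar) make B, C, and the
    step size functions of the current input -- the "selection" idea,
    letting the model decide per token what to store or forget.
    One pass over the sequence: O(seq_len) time, O(d_state) memory.
    """
    h = np.zeros_like(A)
    ys = np.empty_like(x)
    for t, x_t in enumerate(x):
        delta = np.log1p(np.exp(w_delta * x_t))   # softplus -> positive step size
        B = w_B * x_t                             # input-dependent input weights
        C = w_C * x_t                             # input-dependent output weights
        A_bar = np.exp(delta * A)                 # discretized state transition
        h = A_bar * h + delta * B * x_t           # update the hidden state
        ys[t] = C @ h                             # read out from the state
    return ys

# Usage: a 1000-step sequence with a 16-dimensional hidden state.
rng = np.random.default_rng(0)
A = -np.abs(rng.normal(size=16))                  # stable (decaying) dynamics
y = selective_ssm_channel(rng.normal(size=1000), A,
                          w_B=rng.normal(size=16),
                          w_C=rng.normal(size=16),
                          w_delta=0.5)
print(y.shape)                                    # (1000,)
```

Because the state has a fixed size, memory stays constant regardless of sequence length, which is what gives the linear-time behavior compared with attention's pairwise score matrix.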

#ai #transformer #mamba

AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them
https://www.lesswrong.com/posts/Btom6...

What is Mamba?
  / what-is-mamba  

Mamba: The Next Evolution in Sequence Modeling
https://anakin.ai/blog/mamba/

Mamba-Chat: A Chat LLM based on State Space Models
  / mambachat_a_chat_llm_based_on_state_space_...  

ChatGPT Doesn’t Have Human-Like Intelligence But New Models of AI May be Able to Compete with Human Intelligence Soon
https://www.digitalinformationworld.c...

Deep Learning: The Transformer
  / deep-learning-the-transformer  

The Illustrated Transformer
https://jalammar.github.io/illustrate...

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
https://jalammar.github.io/visualizin...

The Scaling Hypothesis
https://gwern.net/scaling-hypothesis#...

0:00 Intro
0:27 Contents
0:35 Part 1: Paths to AGI
1:28 LessWrong blog post: new architectures
2:16 Strengths of Transformers vs Mamba
2:40 Predictions for the number of algorithms needed for AGI
3:26 Lots of investment in Transformers
3:37 Example: analogous to CPU architectures
4:31 Part 2: Transformer attention
4:55 Transformer history: attention is all you need
5:41 What the attention mechanism is
6:32 Basic problem is ambiguity
6:53 Example: ambiguous word "it"
7:24 Example: meanings of the word "rainbow"
7:58 Example: the word "set"
8:16 Attention allows each potential meaning to be calculated and ranked
8:59 Part 3: Next-generation Mamba blocks
9:17 SSM or state space model
9:41 Mamba's two main innovations
10:00 Innovation 1: Selective SSM operation (vs attention)
10:50 Linear-time scaling vs quadratic time
11:47 Transformers have quadratic time training
12:19 Selection much more efficient
12:56 Episodic vs long-term memory
13:32 Innovation 2: Mamba's hardware-aware implementation
14:06 Mamba only expands matrices in fast SRAM
14:47 Performance results
15:21 Is Mamba strictly better than Transformers?
16:14 Conclusion
16:44 Attention (Transformers) vs Selection (Mamba)
17:52 Outro
