NVIDIA has been exceptionally aggressive this year. The consensus among tech giants seems to be a relentless pursuit of computational power. Part of the reason is that many large models are built on the Transformer architecture, which demands high computational power. But if, in the iterative process, the Transformer is gradually replaced by architectures with lower computational requirements, could this become NVIDIA's "latent risk"?

Rob Toews, a prominent venture capitalist and partner at Radical Ventures, which invested in OpenAI competitor Cohere, argued in a September 3 column that the Transformer's support for parallelism during training dovetailed with the rise of the GPU. With their thousands of stream processors, GPUs excel at running many dense computations concurrently, making them a natural fit for Transformer workloads.

There's no doubt that the Transformer architecture is powerful and has profoundly shaped the AI industry. However, its drawbacks are equally evident: its computational cost grows quadratically with sequence length, and as model sizes continue to grow, the compute required to train them climbs steeply.

Toews noted that to address these issues, several proposed architectures, such as Hyena, Monarch Mixer, BiGS, and MEGA, replace attention with subquadratic operators to reduce computational complexity and, in turn, compute demand.

Toews candidly stated that while these architectures still have a long way to go before they can challenge the Transformer's dominance, it's undeniable that in the evolution of AI, new innovations continually emerge. In this ever-evolving landscape, perhaps nothing remains unchallenged forever.

When computational demand surges, in some ways, those who possess NVIDIA GPUs hold the most valuable "currency" of the AI era. However, if future architectures that require less computational power replace the Transformer, it could pose a threat to NVIDIA, the leading "supplier" in this context.

The Enormous Computational Cost of Transformers

On June 12, 2017, the paper "Attention is All You Need" was published, introducing the game-changing Transformer architecture. As of September 4, the Transformer has been around for over six years, and the paper has been cited a whopping 87,345 times. Analysts point out that the ever-larger models built on the Transformer come at a steep price in compute and energy. Thus, while the potential of AI might be boundless, physical resources and costs are finite.

Why does the Transformer demand such high computational power?

Toews explained that it primarily boils down to two reasons: 1. The computational complexity of the attention mechanism, and 2. The increasing size of models.

The fundamental principle of the Transformer is to use a self-attention mechanism to capture dependencies in sequential data, regardless of their distance.

The attention mechanism requires comparing every word in a sequence with every other word, leading to a computational complexity of O(n^2). This quadratic complexity means that as text length increases, the computational cost rises sharply.
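To make the quadratic cost concrete, here is a minimal NumPy sketch of scaled dot-product attention (the function and variable names are illustrative, not from any particular implementation). The score matrix it builds has shape (n, n), so doubling the sequence length quadruples both the memory and the compute for this step:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention. Q, K, V: (n, d) arrays,
    where n is sequence length and d is the head dimension."""
    d = Q.shape[-1]
    # Every token is scored against every other token, so this
    # matrix is (n, n): the source of the O(n^2) complexity.
    scores = Q @ K.T / np.sqrt(d)
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (1024, 64)
```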

Furthermore, the Transformer architecture scales up efficiently, prompting researchers to train ever-larger models on it. Current mainstream language models have hundreds of billions or even trillions of parameters, and the compute needed to train them grows in proportion to both parameter count and training data, so each generation demands vastly more computational power than the last.
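A commonly cited rule of thumb from the scaling-law literature makes the cost of size concrete: training a dense Transformer takes roughly 6 FLOPs per parameter per training token. The sketch below applies that approximation; the parameter and token counts are illustrative assumptions, not disclosed figures for any particular model:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    # Rule of thumb for dense Transformers: ~6 FLOPs per
    # parameter per token (forward plus backward pass).
    return 6 * n_params * n_tokens

# Illustrative scales only: 1.3B, 175B, and 1.8T parameters,
# each trained on an assumed 300B tokens.
for params in (1.3e9, 175e9, 1.8e12):
    print(f"{params:.1e} params -> ~{training_flops(params, 300e9):.1e} FLOPs")
```

At these scales the estimate spans roughly 2e21 to 3e24 FLOPs, which is why training budgets translate directly into GPU demand.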

Alphabet's CFO, Ruth Porat, mentioned in an earnings call that due to investments in AI infrastructure, capital expenditures would be "slightly higher" than last year's record levels.

Microsoft's latest report showed that its quarterly capital expenditures exceeded expectations. CFO Amy Hood attributed this to increased investments in AI infrastructure.

Earlier this year, Microsoft invested $10 billion in OpenAI to support the massive computational resources needed to train large language models. Inflection, a startup only 18 months old, raised over $1 billion to build a GPU cluster to train its own large language model.

NVIDIA's GPUs are in such high demand that they've hit production bottlenecks. The latest H100 chips have already sold out, and new orders aren't expected to ship until the first or even the second quarter of 2024.

Toews emphasized that all these factors highlight the immense computational resource demands of Transformer-based models, so much so that the current AI boom has led to a global GPU shortage, with hardware manufacturers struggling to keep up with the surging demand.

Challenges Faced by the Transformer

Toews also pointed out that the Transformer has limitations in the sequence lengths it can process. Most existing methods simply truncate long inputs, which loses information, so pre-training on long texts remains a significant challenge.

This AI arms race is bound to continue. If companies like OpenAI or Anthropic keep using the Transformer architecture, their models' context lengths will remain limited.

Toews mentioned that various attempts have been made to update the Transformer architecture, keeping the attention mechanism but handling long sequences more efficiently. However, these improved variants, such as Longformer, Reformer, Performer, Linformer, and Big Bird, typically sacrifice some performance and haven't been widely adopted.
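One common trick in these variants is sliding-window (local) attention of the kind Longformer popularized: each token attends only to its w nearest neighbors, cutting the cost from O(n^2) to O(n·w). The sketch below builds just the attention mask for that pattern; it's a simplified illustration of the idea, not Longformer's actual implementation:

```python
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    """True where attention is allowed: each token sees only tokens
    within w positions, roughly n*(2w+1) pairs instead of n^2."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = sliding_window_mask(n=8, w=2)
print(int(mask.sum()), "allowed pairs out of", mask.size)  # 34 out of 64
```

With n in the tens of thousands and w in the hundreds, the saving is enormous, but tokens outside the window can no longer interact directly, which is one source of the quality trade-offs mentioned above.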

Toews emphasized that nothing is perfect, and history never stops. While the Transformer currently holds a dominant position, it's not without its flaws, which opens doors for new architectures.

Are There Challengers to the Throne?

Toews believes that the search for an architecture to replace the Transformer is one of the most promising areas in AI research. One direction is to replace the attention mechanism with a new function: several architectures, including Hyena, Monarch Mixer, BiGS, and MEGA, use subquadratic operators to reduce computational complexity and, consequently, compute demand.

Toews highlighted that researchers from Stanford and Mila have introduced a new architecture called Hyena that has the potential to replace the Transformer. It's an attention-free, convolution-based design that can match the quality of attention models at lower computational cost, scaling subquadratically with sequence length while performing strongly on NLP tasks.
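The core primitive behind several of these attention-free designs is token mixing via a long convolution computed with the FFT, which costs O(n log n) rather than O(n^2). The sketch below shows that primitive in isolation, as a hedged simplification of one ingredient of Hyena rather than the full architecture (in Hyena the filters are parameterized implicitly and interleaved with gating):

```python
import numpy as np

def fft_long_conv(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Causal long convolution of a sequence x (n, d) with a
    per-channel filter k (n, d), via FFT in O(n log n) time."""
    n = x.shape[0]
    # Zero-pad to length 2n so circular FFT convolution
    # coincides with ordinary (linear) convolution.
    fx = np.fft.rfft(x, n=2 * n, axis=0)
    fk = np.fft.rfft(k, n=2 * n, axis=0)
    y = np.fft.irfft(fx * fk, n=2 * n, axis=0)
    return y[:n]  # keep only the causal outputs

n, d = 1024, 64
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
k = rng.standard_normal((n, d)) / n  # toy filter; learned in practice
print(fft_long_conv(x, k).shape)  # (1024, 64)
```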

Reportedly, Hyena is the first attention-free architecture to match GPT-style quality, doing so with a 20% reduction in total training FLOPS, and at very long sequence lengths its operator can run as much as 100 times faster than optimized attention. It also shows potential as a general-purpose deep learning operator, with early results on tasks beyond language such as image classification.

Toews noted that the initial Hyena research was conducted on a relatively small scale. The largest Hyena model has 1.3 billion parameters, while GPT-3 has 175 billion, and GPT-4 is rumored to have 1.8 trillion. A crucial test for the Hyena architecture will be whether it can continue to demonstrate strong performance and efficiency improvements when scaled to the current Transformer model sizes.

Another potential successor to the Transformer is the liquid neural network. Inspired by the nematode Caenorhabditis elegans, two researchers from MIT created the so-called "liquid neural networks."

These networks are not only faster but also remarkably stable, meaning they can handle vast amounts of input without going haywire.
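To give a flavor of the mechanism, here is a minimal sketch of a liquid time-constant cell, the building block of liquid neural networks, integrated with a simple Euler step. It is an illustrative simplification of the MIT formulation, not the authors' code, and all sizes and constants are assumptions:

```python
import numpy as np

def ltc_step(h, x, W_h, W_x, b, A=1.0, tau=1.0, dt=0.05):
    """One Euler step of a simplified liquid time-constant cell:
        dh/dt = -(1/tau + f(h, x)) * h + f(h, x) * A
    The gate f changes each neuron's effective time constant on
    the fly, which is what makes the dynamics 'liquid'."""
    f = 1.0 / (1.0 + np.exp(-(W_h @ h + W_x @ x + b)))  # sigmoid gate
    return h + dt * (-(1.0 / tau + f) * h + f * A)

rng = np.random.default_rng(0)
n_h, n_x = 19, 4  # a tiny network, on the order of dozens of units
W_h = 0.1 * rng.standard_normal((n_h, n_h))
W_x = 0.1 * rng.standard_normal((n_h, n_x))
b, h = np.zeros(n_h), np.zeros(n_h)
for _ in range(100):  # roll the dynamics forward over an input stream
    h = ltc_step(h, rng.standard_normal(n_x), W_h, W_x, b)
print(h.shape)  # (19,)
```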

Toews believes that because the architecture is so much smaller, liquid neural networks are more transparent and easier for humans to interpret than the Transformer.

After all, it's far easier for humans to explain what's happening in a network with 253 connections than in one with 175 billion parameters.

As architectures continue to evolve and reduce their reliance on computational power, could this signal a potential impact on NVIDIA's future revenues?