From "Eureka" to "Attention is all you need"

The road to Artificial Intelligence

These two slogans are separated by nearly 2,300 years of history, yet each heralded a true revolution, one in fluid mechanics and the other in natural language processing (NLP). Archimedes' principle made possible the massive ships (floating cities, even) that we see today, while the attention mechanism has, since 2017, paved the way for artificial intelligence models that some claim will surpass human intelligence. Yet while Archimedes' principle is elementary knowledge for any high school student, the attention mechanism remains an enigma for many non-experts. So, what is it all about?

Firstly, the slogan "Attention is all you need" is the title of a paper published in 2017 by a team of researchers from Google Brain and Google Research, including Ashish Vaswani and Noam Shazeer. The paper introduces the "Transformer," an architecture based on attention mechanisms rather than on recurrent neural networks (RNNs). Put simply, an RNN reads its input one element at a time, updating an internal state that summarizes everything it has seen so far; each output therefore depends on the previous inputs, and the computation is necessarily sequential. This is where "Transformers" introduce a break: a Transformer-based architecture processes its input in parallel, not sequentially.
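To make the contrast concrete, here is a minimal sketch in Python (using NumPy, with toy sizes and random numbers standing in for real word embeddings; it illustrates the idea, not the full architecture of the paper). The recurrent loop cannot start step t before step t-1 is finished, whereas self-attention relates every position to every other position in a single matrix computation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 8                      # 5 tokens, 8-dimensional embeddings (toy sizes)
x = rng.normal(size=(seq_len, d))      # stand-ins for token embeddings

# --- Recurrent processing: inherently sequential ---
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for t in range(seq_len):               # step t must wait for step t-1
    h = np.tanh(h @ W_h + x[t] @ W_x)

# --- Self-attention: all positions at once ---
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d)          # every token scored against every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
attended = weights @ V                 # one parallel computation over the whole sequence
print(weights.shape)                   # (5, 5): one attention weight per pair of tokens
```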

More concretely, a model based on "Transformers" can process all parts of an input simultaneously, without concern for their sequential order. For instance, when faced with the question "What were the stakes of the Punic Wars?", a system based on recurrent neural networks (RNNs) would handle the sequence word by word, maintaining an internal state as it moves through "stakes," then "Punic," then "Wars." Conversely, a Transformer-based model processes all the words at once, assigning each word a variable amount of attention (a weight) according to its relative importance and its relations with the other words of the sequence.
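For readers who want to see those weights, the sketch below uses the Hugging Face transformers library with the bert-base-uncased checkpoint, both chosen here purely as examples (neither is mentioned above), to feed the whole question to a pretrained Transformer in one pass and read back how much attention flows between its words.

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

question = "What were the stakes of the Punic Wars?"
inputs = tokenizer(question, return_tensors="pt")    # the whole question in one batch

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each shaped (batch, heads, tokens, tokens): every token attends to every token.
last_layer = outputs.attentions[-1][0]               # (heads, tokens, tokens)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
avg = last_layer.mean(dim=0)                         # average over attention heads
received = avg.mean(dim=0)                           # attention each token receives on average
for tok, w in zip(tokens, received.tolist()):
    print(f"{tok:>10s}  {w:.3f}")
```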

Since the introduction of Transformers, numerous derivative models have been developed for various tasks. BERT (Bidirectional Encoder Representations from Transformers) is optimized for language understanding, while GPT (Generative Pre-trained Transformer) is designed for text generation.
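As a hedged illustration of that division of labour, the snippet below uses the Hugging Face pipeline API with the illustrative checkpoints bert-base-uncased and gpt2 (again, examples rather than models named above): the BERT-style model fills in a masked word (understanding), while the GPT-style model continues a prompt (generation).

```python
from transformers import pipeline

# BERT-style model: predict the word hidden behind the [MASK] token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The Punic Wars opposed Rome and [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))

# GPT-style model: generate a continuation of a prompt, one token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("The stakes of the Punic Wars were", max_new_tokens=30)[0]["generated_text"])
```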

Models based on Transformers are used in a plethora of NLP applications, including machine translation, question answering, text classification, text generation, and many more.


An Artificial Intelligence revolution that benefits DAT's customers