Mixtral 8x7B, the Mistral AI alternative

5 November 2025 · Kenza · 4 min read

Large language models, although they never cease to surprise us, are not yet perfect. Their lack of reliability and robustness remains very real and is slowing their deployment in companies. To address this shortcoming, new models appear every day, each attempting to outperform its predecessors.

Among the companies keen to develop these models is Mistral AI. This French startup, founded in Paris in 2023, had already surprised the community with the launch of fully open-source and particularly high-performance models. Its stated motivation resonated with us: "We are convinced that by training our own models, releasing them openly and encouraging contributions from the community, we can build a credible alternative to the emerging AI oligopoly". Driven by this ambition, Mistral AI has just delivered a particularly promising new model to the community, which we discuss in this article.

Large Language Models (LLMs): particularly energy-hungry technologies

According to the current paradigm, improving model performance means increasing model size. This is because language models are generally trained to have a very broad scope of action, which translates into mastery of a large number of languages and the ability to handle numerous tasks (question answering, summarization, translation, etc.). As a result, large language models are often released in several sizes: 7B, 13B, 70B, etc. (where "B" stands for billions of parameters), with the largest delivering the best performance.

However, this increase in size goes hand in hand with an increase in the computing power required for training and inference, which in turn means considerable hardware and energy consumption. At a time when energy sobriety is in the spotlight, generative AI is quite rightly being singled out.

This is why efforts are often made to use large language models in their smallest variants (7B), at the cost of some performance.

Mixtral 8x7B, the Mistral AI alternative

Mistral AI has just launched Mixtral 8x7B!

Why 8x7B? Because this is a somewhat unusual model, composed of 8 expert models and a routing model. At inference time, the routing model decides which 2 of the 8 sub-models will be responsible for processing each token of the input prompt.

So, although the model has a total of 45 billion parameters (45B), only a subset (12 billion) is used to process each token of the prompt. Consequently, inference runs at the same cost (in computing power) and latency as a 12-billion-parameter model, while benefiting from the fact that each sub-model is specialized in a specific domain.

This strategy, known as SMoE (Sparse Mixture of Experts), is not new and is enjoying renewed interest in deep learning. For Mistral AI, this paradigm shift has enabled them to claim that the model performs as well as, if not better than, the well-known GPT-3.5 and Llama 2 70B, the most widely used OpenAI and Meta language models.

The model also supports a large context window (32k tokens) and 5 languages (English, French, Italian, German and Spanish). An Instruct version of the model, fine-tuned to follow instructions, is also available.

We will be following very closely the progress of large language models that use this technique, and there is no doubt that Mixtral 8x7B will be added to our test bench so that it can be integrated into business use cases via the Wikit Semantics platform.

To find out more about Mixtral 8x7B

- on Mixtral: https://mistral.ai/news/mixtral-of-experts/

- on the mixture-of-experts technique: https://arxiv.org/abs/2209.01667
