In the rapidly evolving landscape of Natural Language Processing, 2023 emerged as a pivotal year for research on Large Language Models (LLMs). These models, characterized by their vast parameter counts and impressive capabilities, played a central role in shaping the direction of AI applications. This article offers a glimpse into the transformative research that unfolded over the year, during which language models were refined, scaled down, and even integrated with external tools to tackle a diverse range of tasks.
If you’d like to skip around, here are the research papers we featured:
- LLaMA by Meta AI
- LLaMA 2 by Meta AI
- GPT-4 by OpenAI
- Sparks of AGI by Microsoft
- BLIP-2 by Salesforce
- InstructBLIP by Salesforce
- PALM-E by Google
- PALM-2 by Google
- Toolformer by Meta AI
- Tree of Thoughts by Princeton University and Google DeepMind
If you find such research summaries useful, subscribe to our AI mailing list to be alerted when we release new material.
Top LLM Research Papers 2023
1. LLaMA by Meta AI
Summary
The Meta AI team asserts that smaller models trained on more tokens are easier to retrain and fine-tune for specific product applications. Therefore, they introduced LLaMA (Large Language Model Meta AI), a collection of foundational language models with 7B to 65B parameters. LLaMA 33B and 65B were trained on 1.4 trillion tokens, while the smallest model, LLaMA 7B, was trained on one trillion tokens. The team used exclusively publicly available datasets, without relying on proprietary or restricted data, and implemented key architectural enhancements and training-speed optimizations. As a result, LLaMA-13B outperformed GPT-3 despite being more than 10 times smaller, and LLaMA-65B was competitive with PaLM-540B.
Where to learn more about this research?
Where can you get implementation code?
- The code implementation of the original LLaMA model is available on GitHub.
2. LLaMA 2 by Meta AI
Summary
LLaMA 2 is an enhanced version of its predecessor, trained on a new data mix, featuring a 40% larger pretraining corpus, doubled context length, and grouped-query attention. The LLaMA 2 series of models includes LLaMA 2 and LLaMA 2-Chat, optimized for dialogue, with sizes ranging from 7 to 70 billion parameters. These models exhibit superior performance in helpfulness and safety benchmarks compared to open-source counterparts and are comparable to some closed-source models. The development process involved rigorous safety measures, including safety-specific data annotation and red-teaming. The paper aims to contribute to the responsible development of LLMs by providing detailed descriptions of fine-tuning methodologies and safety improvements.
Where to learn more about this research?
Where can you get implementation code?
- Meta AI released LLaMA 2 models to individuals, creators, researchers, and businesses of all sizes. You can access model weights and starting code for pretrained and fine-tuned LLaMA 2 language models through GitHub.
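For readers who want to experiment with the released weights, here is a minimal sketch of loading a LLaMA 2 chat checkpoint through the Hugging Face transformers library. The checkpoint name, prompt, and generation settings are illustrative assumptions, access to the gated meta-llama weights must be requested separately, and this is not Meta's official starting code.

```python
# Minimal sketch: text generation with a LLaMA 2 chat checkpoint via Hugging Face transformers.
# Assumes access to the gated meta-llama/Llama-2-7b-chat-hf weights and a GPU with enough memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # any LLaMA 2 size works the same way

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on a single modern GPU
    device_map="auto",          # let accelerate place the weights automatically
)

prompt = "Explain grouped-query attention in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```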
3. GPT-4 by OpenAI
Summary
GPT-4 is a large-scale, multimodal model that accepts image and text inputs and generates text outputs. Due to competitive and safety concerns, specific details about the model’s architecture and training are withheld. In terms of performance, GPT-4 surpasses previous language models on traditional benchmarks and shows significant improvements in user intent understanding and safety properties. The model also achieves human-level performance on various exams, including a top 10% score on a simulated Uniform Bar Examination.
Where to learn more about this research?
Where can you get implementation code?
- Code implementation of GPT-4 is not available.
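Although the weights and code are closed, GPT-4 is accessible through OpenAI's API. The snippet below is a minimal sketch using the openai Python client (v1-style interface); the prompt is a placeholder and an OPENAI_API_KEY environment variable is assumed.

```python
# Minimal sketch: querying GPT-4 through the OpenAI Python client (v1-style interface).
# Assumes the OPENAI_API_KEY environment variable is set; the prompt is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the key capabilities reported for GPT-4."},
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
```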
4. Sparks of AGI by Microsoft
Summary
In this research paper, a team from Microsoft Research analyzes an early version of OpenAI’s GPT-4, which was still under active development at the time. The team argues that GPT-4 represents a new class of large language models, exhibiting more generalized intelligence compared to previous AI models. Their investigation reveals GPT-4’s expansive capabilities across various domains, including mathematics, coding, vision, medicine, law, and psychology. They highlight that GPT-4 can solve complex and novel tasks without specialized prompting, often achieving performance close to human level.
The Microsoft team also emphasizes the potential of GPT-4 to be considered an early, albeit incomplete, form of artificial general intelligence (AGI). They focus on identifying GPT-4’s limitations and discuss the challenges in progressing towards more advanced and comprehensive AGI versions. This includes considering new paradigms beyond the current next-word prediction model.
Where to learn more about this research?
Where can you get implementation code?
5. BLIP-2 by Salesforce
Summary
BLIP-2 is an efficient, generic pre-training framework for vision-and-language models, designed to circumvent the increasingly prohibitive cost of pre-training large-scale models. BLIP-2 leverages off-the-shelf frozen pre-trained image encoders and frozen large language models to bootstrap vision-language pre-training, incorporating a lightweight Querying Transformer (Q-Former) that is pre-trained in two stages. The first stage bootstraps vision-language representation learning from the frozen image encoder, and the second stage bootstraps vision-to-language generative learning from the frozen language model.
Despite having significantly fewer trainable parameters, BLIP-2 outperforms state-of-the-art methods, surpassing DeepMind’s Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. The model also exhibits promising zero-shot image-to-text generation capabilities following natural language instructions.
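To make the frozen-encoder / Q-Former / frozen-LLM pipeline concrete, here is a minimal inference sketch using the Hugging Face transformers port of BLIP-2. The Salesforce/blip2-opt-2.7b checkpoint and the local image file are illustrative assumptions; the official implementation lives in Salesforce's LAVIS repository.

```python
# Minimal sketch: zero-shot captioning and prompted VQA with the transformers port of BLIP-2.
# The checkpoint name and the local image file are illustrative assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

checkpoint = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")  # placeholder local image

# Captioning: with no text prompt, the frozen LLM describes the image via the Q-Former queries.
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
caption_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(caption_ids[0], skip_special_tokens=True))

# Prompted generation: prepend a question so the frozen LLM answers about the image.
inputs = processor(
    images=image, text="Question: what is in the picture? Answer:", return_tensors="pt"
).to(model.device, torch.float16)
answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```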
Where to learn more about this research?
Where can you get implementation code?
6. InstructBLIP by Salesforce
Summary
InstructBLIP is a novel framework for vision-language instruction tuning, enabling general-purpose models to process a wide range of visual tasks using natural language instructions. This study builds on the pre-trained BLIP-2 model, incorporating an image encoder, a large language model, and a Querying Transformer (Q-Former) to integrate the two. The instruction tuning involves fine-tuning the Q-Former while keeping the image encoder and LLM frozen. For comprehensive study and evaluation, the researchers transformed 26 datasets into instruction tuning format, using 13 datasets for instruction tuning and 13 for zero-shot evaluation. A key innovation is the instruction-aware visual feature extraction, allowing the model to extract relevant features based on given instructions.
InstructBLIP models achieve state-of-the-art zero-shot performance across a wide range of vision-language tasks, significantly outperforming BLIP-2 and the much larger Flamingo models. They also reach state-of-the-art results when fine-tuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image context).
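Below is a minimal sketch of instruction-following inference with the transformers port of InstructBLIP. The Salesforce/instructblip-vicuna-7b checkpoint, the local image file, and the instruction text are illustrative assumptions; the official implementation is part of Salesforce's LAVIS repository.

```python
# Minimal sketch: instruction-following visual question answering with InstructBLIP.
# The checkpoint name, local image file, and instruction text are illustrative assumptions.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

checkpoint = "Salesforce/instructblip-vicuna-7b"
processor = InstructBlipProcessor.from_pretrained(checkpoint)
model = InstructBlipForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("chart.png").convert("RGB")  # placeholder local image
instruction = "Describe the main trend shown in this chart in two sentences."

# The instruction is passed both to the Q-Former (instruction-aware feature extraction)
# and to the frozen LLM, which is what distinguishes InstructBLIP from BLIP-2.
inputs = processor(images=image, text=instruction, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=80, do_sample=False)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```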