| Google | Papers · Blog |
| --- | --- |
| 12/2017 | Attention Is All You Need (Transformers) |
| 10/2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |
| 11/2019 | Fast Transformer Decoding: One Write-Head is All You Need |
| 02/2020 | GLU Variants Improve Transformer |
| 09/2020 | Efficient Transformers: A Survey |
| 01/2021 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity |
| 09/2021 | Finetuned Language Models Are Zero-Shot Learners (Flan) |
| 11/2021 | Sparse is Enough in Scaling Transformers |
| 12/2021 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts |
| 01/2022 | LaMDA: Language Models for Dialog Applications |
| 01/2022 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
| 04/2022 | PaLM: Scaling Language Modeling with Pathways |
| 10/2022 | Scaling Instruction-Finetuned Language Models (Flan-PaLM) |
| 10/2022 | Large Language Models Can Self-Improve |
| 11/2022 | Efficiently Scaling Transformer Inference |
| 03/2023 | PaLM-E: An Embodied Multimodal Language Model |
| 04/2023 | Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference |
| 05/2023 | Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes |
| 05/2023 | FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction |
|
|
| OpenAI | Papers · Blog |
| --- | --- |
| 04/2019 | Generating Long Sequences with Sparse Transformers |
| 01/2020 | Scaling Laws for Neural Language Models |
| 05/2020 | Language Models are Few-Shot Learners (GPT-3) |
| 01/2022 | Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets |
| 03/2022 | Training language models to follow instructions with human feedback (InstructGPT) |
| 07/2022 | Efficient Training of Language Models to Fill in the Middle |
| 03/2023 | GPT-4 Technical Report |
| 04/2023 | Consistency Models |
|
|
| DeepMind | Papers · Blog |
| --- | --- |
| 12/2021 | Scaling Language Models: Methods, Analysis & Insights from Training Gopher |
| 12/2021 | Improving language models by retrieving from trillions of tokens (RETRO) |
| 02/2022 | Competition-Level Code Generation with AlphaCode |
| 02/2022 | Unified Scaling Laws for Routed Language Models |
| 03/2022 | Training Compute-Optimal Large Language Models (Chinchilla) |
| 04/2022 | Flamingo: a Visual Language Model for Few-Shot Learning |
| 05/2022 | A Generalist Agent (Gato) |
| 07/2022 | Formal Algorithms for Transformers |
|
|
| Meta | Papers · Blog |
| --- | --- |
| 04/2019 | fairseq: A Fast, Extensible Toolkit for Sequence Modeling |
| 08/2021 | Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation |
| 05/2022 | OPT: Open Pre-trained Transformer Language Models |
| 11/2022 | Galactica: A Large Language Model for Science |
| 02/2023 | LLaMA: Open and Efficient Foundation Language Models |
| 02/2023 | Toolformer: Language Models Can Teach Themselves to Use Tools |
| 03/2023 | Scaling Expert Language Models with Unsupervised Domain Discovery |
| 03/2023 | SemDeDup: Data-efficient learning at web-scale through semantic deduplication |
| 04/2023 | Segment Anything |
| 04/2023 | A Cookbook of Self-Supervised Learning |
| 05/2023 | Learning to Reason and Memorize with Self-Notes |
|
|
| Microsoft | Papers · Blog |
| --- | --- |
| 01/2022 | DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale |
| 03/2022 | DeepNet: Scaling Transformers to 1,000 Layers |
| 01/2023 | Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases |
| 02/2023 | Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) |
| 03/2023 | Sparks of Artificial General Intelligence: Early experiments with GPT-4 |
| 03/2023 | TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs |
| 04/2023 | Instruction Tuning with GPT-4 |
| 04/2023 | Inference with Reference: Lossless Acceleration of Large Language Models |
| 04/2023 | Low-code LLM: Visual Programming over LLMs |
| 04/2023 | WizardLM: Empowering Large Language Models to Follow Complex Instructions |
| 04/2023 | MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks |
| 04/2023 | ResiDual: Transformer with Dual Residual Connections |
|
|
| Anthropic | Papers · Blog |
| --- | --- |
| 06/2022 | Softmax Linear Units |
| 07/2022 | Language Models (Mostly) Know What They Know |
| 12/2022 | Constitutional AI: Harmlessness from AI Feedback (Claude) |
|
|
| Hazy Research (Stanford) | Papers · Blog |
| --- | --- |
| 10/2021 | Efficiently Modeling Long Sequences with Structured State Spaces (S4) |
| 04/2022 | Monarch: Expressive Structured Matrices for Efficient and Accurate Training |
| 05/2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |
| 12/2022 | Hungry Hungry Hippos: Towards Language Modeling with State Space Models |
| 02/2023 | Simple Hardware-Efficient Long Convolutions for Sequence Modeling |
| 02/2023 | Hyena Hierarchy: Towards Larger Convolutional Language Models |
|
|
| THUDM (Tsinghua University) | Papers · GitHub |
| --- | --- |
| 10/2022 | GLM-130B: An Open Bilingual Pre-Trained Model |
| 03/2023 | CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X |
| 04/2023 | DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task |
|
|
| Open Models | |
| --- | --- |
| 06/2021 | GPT-J-6B: 6B JAX-Based Transformer |
| 03/2022 | CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis |
| 04/2022 | GPT-NeoX-20B: An Open-Source Autoregressive Language Model |
| 11/2022 | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model |
| 04/2023 | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling |
| 04/2023 | Visual Instruction Tuning (LLaVA) |
| 05/2023 | StarCoder: May the source be with you! |
| 05/2023 | CodeGen2: Lessons for Training LLMs on Programming and Natural Languages |
| 05/2023 | MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs |
| 05/2023 | Otter: A Multi-Modal Model with In-Context Instruction Tuning |
|
|
| Surveys | |
| --- | --- |
| 02/2023 | A Survey on Efficient Training of Transformers |
| 02/2023 | Transformer models: an introduction and catalog |
| 02/2023 | A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT |
| 03/2023 | A Survey of Large Language Models |
| 04/2023 | On Efficient Training of Large-Scale Deep Learning Models: A Literature Review |
|
|
| Various | |
| --- | --- |
| 09/2014 | Neural Machine Translation by Jointly Learning to Align and Translate |
| 10/2019 | Root Mean Square Layer Normalization |
| 01/2021 | Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks |
| 03/2021 | The Low-Rank Simplicity Bias in Deep Networks |
| 06/2021 | LoRA: Low-Rank Adaptation of Large Language Models |
| 03/2022 | Memorizing Transformers |
| 04/2022 | UL2: Unifying Language Learning Paradigms |
| 06/2022 | nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models |
| 08/2022 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale |
| 09/2022 | Petals: Collaborative Inference and Fine-tuning of Large Models |
| 10/2022 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers |
| 10/2022 | DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation |
| 11/2022 | An Algorithm for Routing Vectors in Sequences |
| 12/2022 | Self-Instruct: Aligning Language Model with Self Generated Instructions |
| 12/2022 | Parallel Context Windows Improve In-Context Learning of Large Language Models |
| 12/2022 | Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor |
| 12/2022 | Pretraining Without Attention |
| 12/2022 | The case for 4-bit precision: k-bit Inference Scaling Laws |
| 12/2022 | Prompting Is Programming: A Query Language for Large Language Models |
| 01/2023 | SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient |
| 01/2023 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot |
| 01/2023 | Memory Augmented Large Language Models are Computationally Universal |
| 02/2023 | Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models |
| 02/2023 | The Wisdom of Hindsight Makes Language Models Better Instruction Followers |
| 03/2023 | CoLT5: Faster Long-Range Transformers with Conditional Computation |
| 03/2023 | High-throughput Generative Inference of Large Language Models with a Single GPU |
| 03/2023 | Meet in the Middle: A New Pre-training Paradigm |
| 03/2023 | Reflexion: an autonomous agent with dynamic memory and self-reflection |
| 03/2023 | Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning |
| 03/2023 | FP8 versus INT8 for efficient deep learning inference |
| 03/2023 | Self-Refine: Iterative Refinement with Self-Feedback |
| 04/2023 | RPTQ: Reorder-based Post-training Quantization for Large Language Models |
| 04/2023 | REFINER: Reasoning Feedback on Intermediate Representations |
| 04/2023 | Generative Agents: Interactive Simulacra of Human Behavior |
| 04/2023 | Compressed Regression over Adaptive Networks |
| 04/2023 | A Cheaper and Better Diffusion Language Model with Soft-Masked Noise |
| 04/2023 | RRHF: Rank Responses to Align Language Models with Human Feedback without tears |
| 04/2023 | CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society |
| 04/2023 | Automatic Gradient Descent: Deep Learning without Hyperparameters |
| 04/2023 | SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models |
| 04/2023 | Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study |
| 04/2023 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling |
| 04/2023 | Scaling Transformer to 1M tokens and beyond with RMT |
| 04/2023 | Answering Questions by Meta-Reasoning over Multiple Chains of Thought |
| 04/2023 | Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables |
| 04/2023 | We're Afraid Language Models Aren't Modeling Ambiguity |
| 04/2023 | The Internal State of an LLM Knows When its Lying |
| 04/2023 | Search-in-the-Chain: Towards the Accurate, Credible and Traceable Content Generation for Complex Knowledge-intensive Tasks |
| 05/2023 | Towards Unbiased Training in Federated Open-world Semi-supervised Learning |
| 05/2023 | Unlimiformer: Long-Range Transformers with Unlimited Length Input |
| 05/2023 | FreeLM: Fine-Tuning-Free Language Model |
| 05/2023 | Cuttlefish: Low-rank Model Training without All The Tuning |
| 05/2023 | AttentionViz: A Global View of Transformer Attention |
|
|
| Articles | |
| --- | --- |
| 03/2019 | Rich Sutton - The Bitter Lesson |
| 04/2021 | EleutherAI - Rotary Embeddings: A Relative Revolution |
| 01/2023 | Lilian Weng - The Transformer Family Version 2.0 |
| 01/2023 | Lilian Weng - Large Transformer Model Inference Optimization |
| 01/2023 | SemiAnalysis - Overview of OpenAI Triton and PyTorch 2.0 |
| 03/2023 | Stanford - Alpaca: A Strong, Replicable Instruction-Following Model |
| 04/2023 | Yohei Nakajima - AsymmeTrix: Asymmetric Vector Embeddings for Directional Similarity Search |