Google Papers Blog

| Date | Paper |
| --- | --- |
| 12/2017 | Attention Is All You Need (Transformers) |
| 10/2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |
| 10/2019 | Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5) |
| 11/2019 | Fast Transformer Decoding: One Write-Head is All You Need |
| 02/2020 | GLU Variants Improve Transformer |
| 03/2020 | Talking-Heads Attention |
| 05/2020 | Conformer: Convolution-augmented Transformer for Speech Recognition |
| 09/2020 | Efficient Transformers: A Survey |
| 12/2020 | RealFormer: Transformer Likes Residual Attention |
| 01/2021 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity |
| 09/2021 | Finetuned Language Models Are Zero-Shot Learners (Flan) |
| 09/2021 | Primer: Searching for Efficient Transformers for Language Modeling |
| 11/2021 | Sparse is Enough in Scaling Transformers |
| 12/2021 | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts |
| 01/2022 | LaMDA: Language Models for Dialog Applications |
| 01/2022 | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
| 04/2022 | PaLM: Scaling Language Modeling with Pathways |
| 10/2022 | Scaling Instruction-Finetuned Language Models (Flan-PaLM) |
| 10/2022 | Towards Better Few-Shot and Finetuning Performance with Forgetful Causal Language Models |
| 10/2022 | Large Language Models Can Self-Improve |
| 11/2022 | Efficiently Scaling Transformer Inference |
| 11/2022 | Fast Inference from Transformers via Speculative Decoding |
| 02/2023 | Symbolic Discovery of Optimization Algorithms (Lion) |
| 03/2023 | PaLM-E: An Embodied Multimodal Language Model |
| 04/2023 | Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference |
| 05/2023 | Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes |
| 05/2023 | FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction |
| 05/2023 | PaLM 2 Technical Report |
| 05/2023 | Symbol tuning improves in-context learning in language models |
| 05/2023 | Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models |
| 05/2023 | Towards Expert-Level Medical Question Answering with Large Language Models (Med-PaLM 2) |
| 05/2023 | DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining |
| 05/2023 | How Does Generative Retrieval Scale to Millions of Passages? |
| 05/2023 | GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints |
| 05/2023 | Small Language Models Improve Giants by Rewriting Their Outputs |
| 06/2023 | StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners |
| 06/2023 | AudioPaLM: A Large Language Model That Can Speak and Listen |
| 06/2023 | Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting |
| 07/2023 | HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models |
| 09/2023 | Uncovering mesa-optimization algorithms in Transformers |
| 10/2023 | Think before you speak: Training Language Models With Pause Tokens |
| 10/2023 | SpecTr: Fast Speculative Decoding via Optimal Transport |
| 11/2023 | UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs |
| 11/2023 | Automatic Engineering of Long Prompts |

OpenAI Papers Blog

| Date | Paper |
| --- | --- |
| 07/2017 | Proximal Policy Optimization Algorithms |
| 04/2019 | Generating Long Sequences with Sparse Transformers |
| 01/2020 | Scaling Laws for Neural Language Models |
| 05/2020 | Language Models are Few-Shot Learners (GPT-3) |
| 01/2022 | Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets |
| 03/2022 | Training language models to follow instructions with human feedback (InstructGPT) |
| 07/2022 | Efficient Training of Language Models to Fill in the Middle |
| 03/2023 | GPT-4 Technical Report |
| 04/2023 | Consistency Models |
| 05/2023 | Let's Verify Step by Step |
| 10/2023 | Improving Image Generation with Better Captions (DALL·E 3) |

DeepMind (Google DeepMind as of 4/2023) Papers Blog

| Date | Paper |
| --- | --- |
| 10/2019 | Stabilizing Transformers for Reinforcement Learning |
| 12/2021 | Scaling Language Models: Methods, Analysis & Insights from Training Gopher |
| 12/2021 | Improving language models by retrieving from trillions of tokens (RETRO) |
| 02/2022 | Competition-Level Code Generation with AlphaCode |
| 02/2022 | Unified Scaling Laws for Routed Language Models |
| 03/2022 | Training Compute-Optimal Large Language Models (Chinchilla) |
| 04/2022 | Flamingo: a Visual Language Model for Few-Shot Learning |
| 05/2022 | A Generalist Agent (GATO) |
| 07/2022 | Formal Algorithms for Transformers |
| 02/2023 | Accelerating Large Language Model Decoding with Speculative Sampling |
| 05/2023 | Tree of Thoughts: Deliberate Problem Solving with Large Language Models |
| 05/2023 | Block-State Transformer |
| 05/2023 | Randomized Positional Encodings Boost Length Generalization of Transformers |
| 08/2023 | From Sparse to Soft Mixtures of Experts |
| 09/2023 | Large Language Models as Optimizers |
| 09/2023 | MADLAD-400: A Multilingual And Document-Level Large Audited Dataset (MT Model) |
| 09/2023 | Scaling Laws for Sparsely-Connected Foundation Models |
| 09/2023 | Language Modeling Is Compression |
| 09/2023 | Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution |
| 10/2023 | Large Language Models as Analogical Reasoners |
| 10/2023 | Controlled Decoding from Language Models |
| 10/2023 | A General Theoretical Paradigm to Understand Learning from Human Preferences |

Meta (Facebook AI Research) Papers Blog

| Date | Paper |
| --- | --- |
| 04/2019 | fairseq: A Fast, Extensible Toolkit for Sequence Modeling |
| 07/2019 | Augmenting Self-attention with Persistent Memory |
| 11/2019 | Improving Transformer Models by Reordering their Sublayers |
| 08/2021 | Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation |
| 05/2022 | OPT: Open Pre-trained Transformer Language Models |
| 07/2022 | Beyond neural scaling laws: beating power law scaling via data pruning |
| 11/2022 | Galactica: A Large Language Model for Science |
| 01/2023 | Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA) |
| 02/2023 | LLaMA: Open and Efficient Foundation Language Models |
| 02/2023 | Toolformer: Language Models Can Teach Themselves to Use Tools |
| 03/2023 | Scaling Expert Language Models with Unsupervised Domain Discovery |
| 03/2023 | SemDeDup: Data-efficient learning at web-scale through semantic deduplication |
| 04/2023 | Segment Anything (SAM) |
| 04/2023 | A Cookbook of Self-Supervised Learning |
| 05/2023 | Learning to Reason and Memorize with Self-Notes |
| 05/2023 | ImageBind: One Embedding Space To Bind Them All |
| 05/2023 | MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers |
| 05/2023 | LIMA: Less Is More for Alignment |
| 05/2023 | Scaling Speech Technology to 1,000+ Languages |
| 05/2023 | READ: Recurrent Adaptation of Large Transformers |
| 05/2023 | LLM-QAT: Data-Free Quantization Aware Training for Large Language Models |
| 06/2023 | Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles |
| 06/2023 | Simple and Controllable Music Generation (MusicGen) |
| 06/2023 | Improving Open Language Models by Learning from Organic Interactions (BlenderBot 3x) |
| 06/2023 | Extending Context Window of Large Language Models via Positional Interpolation |
| 06/2023 | Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale |
| 07/2023 | Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3leon) |
| 07/2023 | Llama 2: Open Foundation and Fine-Tuned Chat Models |
| 08/2023 | SeamlessM4T—Massively Multilingual & Multimodal Machine Translation |
| 08/2023 | D4: Improving LLM Pretraining via Document De-Duplication and Diversification |
| 08/2023 | Code Llama: Open Foundation Models for Code |
| 08/2023 | Nougat: Neural Optical Understanding for Academic Documents |
| 09/2023 | Contrastive Decoding Improves Reasoning in Large Language Models |
| 09/2023 | Effective Long-Context Scaling of Foundation Models |
| 09/2023 | AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model |
| 09/2023 | Vision Transformers Need Registers |
| 10/2023 | RA-DIT: Retrieval-Augmented Dual Instruction Tuning |
| 10/2023 | Branch-Solve-Merge Improves Large Language Model Evaluation and Generation |
| 10/2023 | Generative Pre-training for Speech with Flow Matching |
| 11/2023 | Emu Edit: Precise Image Editing via Recognition and Generation Tasks |

Microsoft Papers Blog

| Date | Paper |
| --- | --- |
| 12/2015 | Deep Residual Learning for Image Recognition |
| 05/2021 | EL-Attention: Memory Efficient Lossless Attention for Generation |
| 01/2022 | DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale |
| 03/2022 | DeepNet: Scaling Transformers to 1,000 Layers |
| 12/2022 | A Length-Extrapolatable Transformer |
| 01/2023 | Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases |
| 02/2023 | Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) |
| 03/2023 | Sparks of Artificial General Intelligence: Early experiments with GPT-4 |
| 03/2023 | TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs |
| 04/2023 | Instruction Tuning with GPT-4 |
| 04/2023 | Inference with Reference: Lossless Acceleration of Large Language Models |
| 04/2023 | Low-code LLM: Visual Programming over LLMs |
| 04/2023 | WizardLM: Empowering Large Language Models to Follow Complex Instructions |
| 04/2023 | MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks |
| 04/2023 | ResiDual: Transformer with Dual Residual Connections |
| 05/2023 | Code Execution with Pre-trained Language Models |
| 05/2023 | Small Models are Valuable Plug-ins for Large Language Models |
| 05/2023 | CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing |
| 06/2023 | Orca: Progressive Learning from Complex Explanation Traces of GPT-4 |
| 06/2023 | Augmenting Language Models with Long-Term Memory |
| 06/2023 | WizardCoder: Empowering Code Large Language Models with Evol-Instruct |
| 06/2023 | Textbooks Are All You Need (phi-1) |
| 07/2023 | In-context Autoencoder for Context Compression in a Large Language Model |
| 07/2023 | Retentive Network: A Successor to Transformer for Large Language Models |
| 08/2023 | Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference |
| 09/2023 | Efficient RLHF: Reducing the Memory Usage of PPO |
| 09/2023 | DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models |
| 09/2023 | Textbooks Are All You Need II (phi-1.5) |
| 09/2023 | PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training |
| 09/2023 | A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models |
| 09/2023 | Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models |
| 10/2023 | Sparse Backpropagation for MoE Training |
| 10/2023 | Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models |
| 10/2023 | Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness |
| 10/2023 | Augmented Embeddings for Custom Retrievals |
| 10/2023 | Guiding Language Model Reasoning with Planning Tokens |
| 10/2023 | Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V |
| 10/2023 | CodeFusion: A Pre-trained Diffusion Model for Code Generation |
| 10/2023 | LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery |
| 10/2023 | FP8-LM: Training FP8 Large Language Models |
| 11/2023 | Orca 2: Teaching Small Language Models How to Reason |

Hazy Research (Stanford) Papers Blog

| Date | Paper |
| --- | --- |
| 10/2021 | Efficiently Modeling Long Sequences with Structured State Spaces (S4) |
| 04/2022 | Monarch: Expressive Structured Matrices for Efficient and Accurate Training |
| 05/2022 | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |
| 12/2022 | Hungry Hungry Hippos: Towards Language Modeling with State Space Models |
| 02/2023 | Simple Hardware-Efficient Long Convolutions for Sequence Modeling |
| 02/2023 | Hyena Hierarchy: Towards Larger Convolutional Language Models |
| 06/2023 | TART: A plug-and-play Transformer module for task-agnostic reasoning |
| 07/2023 | FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning |
| 11/2023 | FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores |

THUDM (Tsinghua University) Papers GitHub

| Date | Paper |
| --- | --- |
| 10/2022 | GLM-130B: An Open Bilingual Pre-Trained Model |
| 03/2023 | CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X |
| 04/2023 | DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task |
| 06/2023 | WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences |
| 09/2023 | GPT Can Solve Mathematical Problems Without a Calculator (MathGLM) |
| 10/2023 | AgentTuning: Enabling Generalized Agent Abilities for LLMs (AgentLM) |
| 11/2023 | CogVLM: Visual Expert for Pretrained Language Models |

Open Models

| Date | Paper |
| --- | --- |
| 06/2021 | GPT-J-6B: 6B JAX-Based Transformer |
| 04/2023 | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling |
| 03/2022 | CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis |
| 04/2022 | GPT-NeoX-20B: An Open-Source Autoregressive Language Model |
| 11/2022 | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model |
| 04/2023 | Visual Instruction Tuning (LLaVA) |
| 05/2023 | StarCoder: May the source be with you! |
| 05/2023 | CodeGen2: Lessons for Training LLMs on Programming and Natural Languages |
| 05/2023 | Otter: A Multi-Modal Model with In-Context Instruction Tuning |
| 05/2023 | InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning |
| 05/2023 | CodeT5+: Open Code Large Language Models for Code Understanding and Generation |
| 05/2023 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities |
| 05/2023 | RWKV: Reinventing RNNs for the Transformer Era |
| 05/2023 | Lion: Adversarial Distillation of Closed-Source Large Language Model |
| 05/2023 | MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training |
| 06/2023 | Segment Anything in High Quality |
| 06/2023 | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding |
| 06/2023 | High-Fidelity Audio Compression with Improved RVQGAN |
| 06/2023 | StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models |
| 06/2023 | Anticipatory Music Transformer |
| 06/2023 | RepoFusion: Training Code Models to Understand Your Repository |
| 06/2023 | MPT-30B: Raising the bar for open-source foundation models |
| 06/2023 | Vec2Vec: A Compact Neural Network Approach for Transforming Text Embeddings with High Fidelity |
| 06/2023 | ViNT: A Foundation Model for Visual Navigation |
| 06/2023 | How Long Can Open-Source LLMs Truly Promise on Context Length? (LongChat) |
| 07/2023 | Hierarchical Open-vocabulary Universal Image Segmentation |
| 07/2023 | Focused Transformer: Contrastive Training for Context Scaling (LongLLaMA) |
| 07/2023 | Rhythm Modeling for Voice Conversion (Urhythmic) |
| 07/2023 | Scaling TransNormer to 175 Billion Parameters |
| 08/2023 | Separate Anything You Describe |
| 08/2023 | StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data |
| 09/2023 | RADIO: Reference-Agnostic Dubbing Video Synthesis |
| 09/2023 | Matcha-TTS: A fast TTS architecture with conditional flow matching |
| 09/2023 | DreamLLM: Synergistic Multimodal Comprehension and Creation |
| 09/2023 | Baichuan 2: Open Large-scale Language Models |
| 09/2023 | Qwen Technical Report |
| 09/2023 | Mistral 7B |
| 10/2023 | MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning |
| 10/2023 | Improved Baselines with Visual Instruction Tuning (LLaVA 1.5) |
| 10/2023 | LLark: A Multimodal Foundation Model for Music |
| 10/2023 | SALMONN: Towards Generic Hearing Abilities for Large Language Models |
| 10/2023 | Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents |
| 11/2023 | Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models |
| 11/2023 | UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition |
| 11/2023 | YUAN 2.0: A Large Language Model with Localized Filtering-based Attention |

Various

| Date | Paper |
| --- | --- |
| 09/2014 | Neural Machine Translation by Jointly Learning to Align and Translate |
| 06/2019 | Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View |
| 10/2019 | Root Mean Square Layer Normalization |
| 10/2019 | Transformers without Tears: Improving the Normalization of Self-Attention |
| 12/2019 | Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection |
| 02/2020 | On Layer Normalization in the Transformer Architecture |
| 04/2020 | Longformer: The Long-Document Transformer |
| 06/2020 | Memory Transformer |
| 07/2020 | Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity |
| 12/2020 | ERNIE-Doc: A Retrospective Long-Document Modeling Transformer |
| 01/2021 | Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks |
| 03/2021 | The Low-Rank Simplicity Bias in Deep Networks |
| 04/2021 | RoFormer: Enhanced Transformer with Rotary Position Embedding |
| 06/2021 | LoRA: Low-Rank Adaptation of Large Language Models |
| 07/2023 | CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention |
| 03/2022 | Memorizing Transformers |
| 04/2022 | UL2: Unifying Language Learning Paradigms |
| 05/2022 | Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (IA3) |
| 06/2022 | nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models |
| 07/2022 | Language Models (Mostly) Know What They Know |
| 08/2022 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale |
| 09/2022 | Petals: Collaborative Inference and Fine-tuning of Large Models |
| 10/2022 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers |
| 10/2022 | Truncation Sampling as Language Model Desmoothing |
| 10/2022 | DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation |
| 11/2022 | An Algorithm for Routing Vectors in Sequences |
| 12/2022 | Self-Instruct: Aligning Language Model with Self Generated Instructions |
| 12/2022 | Parallel Context Windows Improve In-Context Learning of Large Language Models |
| 12/2022 | Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor |
| 12/2022 | Pretraining Without Attention |
| 12/2022 | The case for 4-bit precision: k-bit Inference Scaling Laws |
| 12/2022 | Prompting Is Programming: A Query Language for Large Language Models |
| 01/2023 | SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient |
| 01/2023 | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot |
| 01/2023 | Memory Augmented Large Language Models are Computationally Universal |
| 01/2023 | Progress measures for grokking via mechanistic interpretability |
| 02/2023 | Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models |
| 02/2023 | The Wisdom of Hindsight Makes Language Models Better Instruction Followers |
| 02/2023 | End-to-End Deep Learning Framework for Real-Time Inertial Attitude Estimation using 6DoF IMU |
| 02/2023 | The Stable Entropy Hypothesis and Entropy-Aware Decoding: An Analysis and Algorithm for Robust Natural Language Generation |
| 03/2023 | CoLT5: Faster Long-Range Transformers with Conditional Computation |
| 03/2023 | High-throughput Generative Inference of Large Language Models with a Single GPU |
| 03/2023 | Meet in the Middle: A New Pre-training Paradigm |
| 03/2023 | Reflexion: an autonomous agent with dynamic memory and self-reflection |
| 03/2023 | Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning |
| 03/2023 | FP8 versus INT8 for efficient deep learning inference |
| 03/2023 | Self-Refine: Iterative Refinement with Self-Feedback |
| 04/2023 | RPTQ: Reorder-based Post-training Quantization for Large Language Models |
| 04/2023 | REFINER: Reasoning Feedback on Intermediate Representations |
| 04/2023 | Generative Agents: Interactive Simulacra of Human Behavior |
| 04/2023 | Compressed Regression over Adaptive Networks |
| 04/2023 | A Cheaper and Better Diffusion Language Model with Soft-Masked Noise |
| 04/2023 | RRHF: Rank Responses to Align Language Models with Human Feedback without tears |
| 04/2023 | CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society |
| 04/2023 | Automatic Gradient Descent: Deep Learning without Hyperparameters |
| 04/2023 | SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models |
| 04/2023 | Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study |
| 04/2023 | Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling |
| 04/2023 | Scaling Transformer to 1M tokens and beyond with RMT |
| 04/2023 | Answering Questions by Meta-Reasoning over Multiple Chains of Thought |
| 04/2023 | Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables |
| 04/2023 | We're Afraid Language Models Aren't Modeling Ambiguity |
| 04/2023 | The Internal State of an LLM Knows When its Lying |
| 04/2023 | Search-in-the-Chain: Towards the Accurate, Credible and Traceable Content Generation for Complex Knowledge-intensive Tasks |
| 05/2023 | Towards Unbiased Training in Federated Open-world Semi-supervised Learning |
| 05/2023 | Unlimiformer: Long-Range Transformers with Unlimited Length Input |
| 05/2023 | FreeLM: Fine-Tuning-Free Language Model |
| 05/2023 | Cuttlefish: Low-rank Model Training without All The Tuning |
| 05/2023 | AttentionViz: A Global View of Transformer Attention |
| 05/2023 | Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models |
| 05/2023 | A Frustratingly Easy Improvement for Position Embeddings via Random Padding |
| 05/2023 | Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision |
| 05/2023 | Explanation-based Finetuning Makes Models More Robust to Spurious Cues |
| 05/2023 | An automatically discovered chain-of-thought prompt generalizes to novel models and datasets |
| 05/2023 | Recommender Systems with Generative Retrieval |
| 05/2023 | Fast Distributed Inference Serving for Large Language Models |
| 05/2023 | Chain-of-Dictionary Prompting Elicits Translation in Large Language Models |
| 05/2023 | Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach |
| 05/2023 | Active Retrieval Augmented Generation |
| 05/2023 | Scalable Coupling of Deep Learning with Logical Reasoning |
| 05/2023 | Interpretability at Scale: Identifying Causal Mechanisms in Alpaca |
| 05/2023 | StructGPT: A General Framework for Large Language Model to Reason over Structured Data |
| 05/2023 | Pre-Training to Learn in Context |
| 05/2023 | ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings |
| 05/2023 | Accelerating Transformer Inference for Translation via Parallel Decoding |
| 05/2023 | Cooperation Is All You Need |
| 05/2023 | PTQD: Accurate Post-Training Quantization for Diffusion Models |
| 05/2023 | LLM-Pruner: On the Structural Pruning of Large Language Models |
| 05/2023 | SelfzCoT: a Self-Prompt Zero-shot CoT from Semantic-level to Code-level for a Better Utilization of LLMs |
| 05/2023 | QLoRA: Efficient Finetuning of Quantized LLMs |
| 05/2023 | "According to ..." Prompting Language Models Improves Quoting from Pre-Training Data |
| 05/2023 | Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training |
| 05/2023 | Landmark Attention: Random-Access Infinite Context Length for Transformers |
| 05/2023 | Scaling Data-Constrained Language Models |
| 05/2023 | Fine-Tuning Language Models with Just Forward Passes |
| 05/2023 | Intriguing Properties of Quantization at Scale |
| 05/2023 | Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time |
| 05/2023 | Blockwise Parallel Transformer for Long Context Large Models |
| 05/2023 | The Impact of Positional Encoding on Length Generalization in Transformers |
| 05/2023 | Adapting Language Models to Compress Contexts |
| 05/2023 | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| 06/2023 | AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration |
| 06/2023 | Faster Causal Attention Over Large Sequences Through Sparse Flash Attention |
| 06/2023 | Fine-Grained Human Feedback Gives Better Rewards for Language Model Training |
| 06/2023 | SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression |
| 06/2023 | Fine-Tuning Language Models with Advantage-Induced Policy Alignment |
| 06/2023 | Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards |
| 06/2023 | Inference-Time Intervention: Eliciting Truthful Answers from a Language Model |
| 06/2023 | Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories |
| 06/2023 | Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion |
| 06/2023 | Word sense extension |
| 06/2023 | Mitigating Transformer Overconfidence via Lipschitz Regularization |
| 06/2023 | Recurrent Attention Networks for Long-text Modeling |
| 06/2023 | One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning |
| 06/2023 | SqueezeLLM: Dense-and-Sparse Quantization |
| 06/2023 | Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training |
| 06/2023 | Propagating Knowledge Updates to LMs Through Distillation |
| 06/2023 | Full Parameter Fine-tuning for Large Language Models with Limited Resources |
| 06/2023 | A Simple and Effective Pruning Approach for Large Language Models |
| 06/2023 | InRank: Incremental Low-Rank Learning |
| 06/2023 | Evaluating the Zero-shot Robustness of Instruction-tuned Language Models |
| 06/2023 | Learning to Generate Better Than Your LLM (RLGF) |
| 06/2023 | Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing |
| 06/2023 | H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models |
| 06/2023 | FLuRKA: Fast fused Low-Rank & Kernel Attention |
| 06/2023 | Stay on topic with Classifier-Free Guidance |
| 07/2023 | AutoST: Training-free Neural Architecture Search for Spiking Transformers |
| 07/2023 | Single Sequence Prediction over Reasoning Graphs for Multi-hop QA |
| 07/2023 | Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models |
| 07/2023 | Facing off World Model Backbones: RNNs, Transformers, and S4 |
| 07/2023 | Improving Retrieval-Augmented Large Language Models via Data Importance Learning |
| 07/2023 | Teaching Arithmetic to Small Transformers |
| 07/2023 | QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models |
| 07/2023 | Stack More Layers Differently: High-Rank Training Through Low-Rank Updates |
| 07/2023 | Copy Is All You Need (CoG) |
| 07/2023 | Multi-Method Self-Training: Improving Code Generation With Text, And Vice Versa |
| 07/2023 | Divide & Bind Your Attention for Improved Generative Semantic Nursing |
| 07/2023 | Challenges and Applications of Large Language Models |
| 07/2023 | Soft Prompt Tuning for Augmenting Dense Retrieval with Large Language Models |
| 07/2023 | QuIP: 2-Bit Quantization of Large Language Models With Guarantees |
| 07/2023 | CoRe Optimizer: An All-in-One Solution for Machine Learning |
| 07/2023 | Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time |
| 08/2023 | ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation |
| 08/2023 | EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models |
| 08/2023 | Activation Addition: Steering Language Models Without Optimization |
| 08/2023 | OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models |
| 08/2023 | Accelerating LLM Inference with Staged Speculative Decoding |
| 08/2023 | YaRN: Efficient Context Window Extension of Large Language Models |
| 08/2023 | LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models |
| 09/2023 | Making Large Language Models Better Reasoners with Alignment |
| 09/2023 | Data-Juicer: A One-Stop Data Processing System for Large Language Models |
| 09/2023 | Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices |
| 09/2023 | SLiMe: Segment Like Me |
| 09/2023 | Norm Tweaking: High-performance Low-bit Quantization of Large Language Models |
| 09/2023 | When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale |
| 09/2023 | Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs |
| 09/2023 | Efficient Memory Management for Large Language Model Serving with PagedAttention |
| 09/2023 | Cure the headache of Transformers via Collinear Constrained Attention |
| 09/2023 | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity |
| 09/2023 | LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models |
| 09/2023 | MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation |
| 09/2023 | Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models |
| 09/2023 | Improving Code Generation by Dynamic Temperature Sampling |
| 09/2023 | Efficient Streaming Language Models with Attention Sinks |
| 10/2023 | DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models |
| 10/2023 | GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length |
| 10/2023 | Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models |
| 10/2023 | Elephant Neural Networks: Born to Be a Continual Learner |
| 10/2023 | Ring Attention with Blockwise Transformers for Near-Infinite Context |
| 10/2023 | Retrieval meets Long Context Large Language Models |
| 10/2023 | DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines |
| 10/2023 | LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers |
| 10/2023 | Amortizing intractable inference in large language models (GFlowNet Tuning) |
| 10/2023 | SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF |
| 10/2023 | Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity |
| 10/2023 | Let Models Speak Ciphers: Multiagent Debate through Embeddings |
| 10/2023 | InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining |
| 10/2023 | CacheGen: Fast Context Loading for Language Model Applications |
| 10/2023 | MatFormer: Nested Transformer for Elastic Inference |
| 10/2023 | LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models |
| 10/2023 | Towards End-to-end 4-Bit Inference on Generative Large Language Models (QUIK) |
| 10/2023 | Microscaling Data Formats for Deep Learning |
| 10/2023 | xVal: A Continuous Number Encoding for Large Language Models |
| 10/2023 | An Emulator for Fine-Tuning Large Language Models using Small Language Models |
| 10/2023 | Frozen Transformers in Language Models Are Effective Visual Encoder Layers |
| 10/2023 | LoBaSS: Gauging Learnability in Supervised Fine-tuning Data |
| 10/2023 | Quality-Diversity through AI Feedback |
| 10/2023 | DoGE: Domain Reweighting with Generalization Estimation |
| 10/2023 | E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity |
| 10/2023 | Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation |
| 10/2023 | Personas as a Way to Model Truthfulness in Language Models |
| 10/2023 | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving |
| 11/2023 | AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models |
| 11/2023 | FlashDecoding++: Faster Large Language Model Inference on GPUs |
| 11/2023 | Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization |
| 11/2023 | Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs |
| 11/2023 | REST: Retrieval-Based Speculative Decoding |
| 11/2023 | DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines |
| 11/2023 | Token-level Adaptation of LoRA Adapters for Downstream Task Generalization |
| 11/2023 | Exponentially Faster Language Modelling |
| 11/2023 | MultiLoRA: Democratizing LoRA for Better Multi-Task Learning |
| 11/2023 | LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning |
| 11/2023 | Token Recycling for Efficient Sequential Inference with Vision Transformers |
| 11/2023 | Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization |

Articles

| Date | Article |
| --- | --- |
| 03/2019 | Rich Sutton - The Bitter Lesson |
| 06/2022 | Yann LeCun - A Path Towards Autonomous Machine Intelligence |
| 01/2023 | Lilian Weng - The Transformer Family Version 2.0 |
| 01/2023 | Lilian Weng - Large Transformer Model Inference Optimization |
| 03/2023 | Stanford - Alpaca: A Strong, Replicable Instruction-Following Model |
| 05/2023 | OpenAI - Language models can explain neurons in language models |
| 05/2023 | Alex Turner - Steering GPT-2-XL by adding an activation vector |
| 06/2023 | YyWang - Do We Really Need the KVCache for All Large Language Models |
| 06/2023 | kaiokendev - Extending Context is Hard…but not Impossible |
| 06/2023 | bloc97 - NTK-Aware Scaled RoPE |
| 07/2023 | oobabooga - A direct comparison between llama.cpp, AutoGPTQ, ExLlama, and transformers perplexities |
| 07/2023 | Jianlin Su - Carrying the beta position to the end (better NTK RoPE method) |
| 09/2023 | FasterDecoding - Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads |
| 10/2023 | Tri Dao - Flash-Decoding for Long-Context Inference |
| 10/2023 | Evan Armstrong - Human-Sourced, AI-Augmented: a promising solution for open source conversational data |
| 11/2023 | LMSYS - Break the Sequential Dependency of LLM Inference Using Lookahead Decoding |