|
Google Papers Blog |
12/2017 |
Attention Is All You Need (Transformers) |
10/2018 |
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |
10/2019 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5) |
11/2019 |
Fast Transformer Decoding: One Write-Head is All You Need |
02/2020 |
GLU Variants Improve Transformer |
03/2020 |
Talking-Heads Attention |
05/2020 |
Conformer: Convolution-augmented Transformer for Speech Recognition |
09/2020 |
Efficient Transformers: A Survey |
12/2020 |
RealFormer: Transformer Likes Residual Attention |
01/2021 |
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity |
09/2021 |
Finetuned Language Models Are Zero-Shot Learners (Flan) |
09/2021 |
Primer: Searching for Efficient Transformers for Language Modeling |
11/2021 |
Sparse is Enough in Scaling Transformers |
12/2021 |
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts |
01/2022 |
LaMDA: Language Models for Dialog Applications |
01/2022 |
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models |
04/2022 |
PaLM: Scaling Language Modeling with Pathways |
07/2022 |
Confident Adaptive Language Modeling |
10/2022 |
Scaling Instruction-Finetuned Language Models (Flan-PaLM) |
10/2022 |
Towards Better Few-Shot and Finetuning Performance with Forgetful Causal Language Models |
10/2022 |
Large Language Models Can Self-Improve |
11/2022 |
Efficiently Scaling Transformer Inference |
11/2022 |
Fast Inference from Transformers via Speculative Decoding |
02/2023 |
Symbolic Discovery of Optimization Algorithms (Lion) |
03/2023 |
PaLM-E: An Embodied Multimodal Language Model |
04/2023 |
Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference |
05/2023 |
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes |
05/2023 |
FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction |
05/2023 |
PaLM 2 Technical Report |
05/2023 |
Symbol tuning improves in-context learning in language models |
05/2023 |
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models |
05/2023 |
Towards Expert-Level Medical Question Answering with Large Language Models (Med-PaLM 2) |
05/2023 |
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining |
05/2023 |
How Does Generative Retrieval Scale to Millions of Passages? |
05/2023 |
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints |
05/2023 |
Small Language Models Improve Giants by Rewriting Their Outputs |
06/2023 |
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners |
06/2023 |
AudioPaLM: A Large Language Model That Can Speak and Listen |
06/2023 |
Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting |
07/2023 |
HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models |
09/2023 |
Uncovering mesa-optimization algorithms in Transformers |
10/2023 |
Think before you speak: Training Language Models With Pause Tokens |
10/2023 |
SpecTr: Fast Speculative Decoding via Optimal Transport |
11/2023 |
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs |
11/2023 |
Automatic Engineering of Long Prompts |
12/2023 |
Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses |
12/2023 |
Style Aligned Image Generation via Shared Attention |
01/2024 |
A Minimaximalist Approach to Reinforcement Learning from Human Feedback (SPO) |
02/2024 |
Time-, Memory- and Parameter-Efficient Visual Adaptation (LoSA) |
02/2024 |
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context |
03/2024 |
PERL: Parameter Efficient Reinforcement Learning from Human Feedback |
04/2024 |
TransformerFAM: Feedback attention is working memory |
05/2024 |
eXmY: A Data Type and Technique for Arbitrary Bit Precision Quantization |
05/2024 |
Faster Cascades via Speculative Decoding |
06/2024 |
Proofread: Fixes All Errors with One Tap |
08/2024 |
Natural Language Outlines for Code: Literate Programming in the LLM Era |
08/2024 |
Diffusion Models Are Real-Time Game Engines |
11/2024 |
LAUREL: Learned Augmented Residual Layer |
|
|
|
DeepMind (Google DeepMind as of 4/2023) Papers Blog |
10/2019 |
Stabilizing Transformers for Reinforcement Learning |
12/2021 |
Scaling Language Models: Methods, Analysis & Insights from Training Gopher |
12/2021 |
Improving language models by retrieving from trillions of tokens (RETRO) |
02/2022 |
Competition-Level Code Generation with AlphaCode |
02/2022 |
Unified Scaling Laws for Routed Language Models |
03/2022 |
Training Compute-Optimal Large Language Models (Chinchilla) |
04/2022 |
Flamingo: a Visual Language Model for Few-Shot Learning |
05/2022 |
A Generalist Agent (GATO) |
07/2022 |
Formal Algorithms for Transformers |
02/2023 |
Accelerating Large Language Model Decoding with Speculative Sampling |
05/2023 |
Tree of Thoughts: Deliberate Problem Solving with Large Language Models |
05/2023 |
Block-State Transformer |
05/2023 |
Randomized Positional Encodings Boost Length Generalization of Transformers |
08/2023 |
From Sparse to Soft Mixtures of Experts |
09/2023 |
Large Language Models as Optimizers |
09/2023 |
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset (MT Model) |
09/2023 |
Scaling Laws for Sparsely-Connected Foundation Models |
09/2023 |
Language Modeling Is Compression |
09/2023 |
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution |
10/2023 |
Large Language Models as Analogical Reasoners |
10/2023 |
Controlled Decoding from Language Models |
10/2023 |
A General Theoretical Paradigm to Understand Learning from Human Preferences |
11/2023 |
DiLoCo: Distributed Low-Communication Training of Language Models |
12/2023 |
Gemini: A Family of Highly Capable Multimodal Models |
12/2023 |
AlphaCode 2 Technical Report |
12/2023 |
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator |
12/2023 |
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models |
12/2023 |
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding |
01/2024 |
Solving olympiad geometry without human demonstrations |
02/2024 |
LiPO: Listwise Preference Optimization through Learning-to-Rank |
02/2024 |
Grandmaster-Level Chess Without Search |
02/2024 |
How to Train Data-Efficient LLMs |
02/2024 |
A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts |
02/2024 |
Gemma: Open Models Based on Gemini Research and Technology |
02/2024 |
Genie: Generative Interactive Environments |
02/2024 |
Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models |
03/2024 |
DiPaCo: Distributed Path Composition |
04/2024 |
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models |
05/2024 |
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities |
06/2024 |
Transformers meet Neural Algorithmic Reasoners |
06/2024 |
Gemma 2: Improving Open Language Models at a Practical Size |
06/2024 |
Data curation via joint example selection further accelerates multimodal learning |
07/2024 |
PaliGemma: A versatile 3B VLM for transfer |
07/2024 |
LookupViT: Compressing visual information to a limited number of tokens |
07/2024 |
Mixture of Nested Experts: Adaptive Processing of Visual Tokens |
08/2024 |
Generative Verifiers: Reward Modeling as Next-Token Prediction |
09/2024 |
Imitating Language via Scalable Inverse Reinforcement Learning |
10/2024 |
Preference Optimization as Probabilistic Inference |
10/2024 |
Round and Round We Go! What makes Rotary Positional Encodings useful? |
10/2024 |
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA |
|
|
|
Meta (Facebook AI Research) Papers Blog |
04/2019 |
fairseq: A Fast, Extensible Toolkit for Sequence Modeling |
07/2019 |
Augmenting Self-attention with Persistent Memory |
11/2019 |
Improving Transformer Models by Reordering their Sublayers |
08/2021 |
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation |
03/2022 |
Training Logbook for OPT-175B |
05/2022 |
OPT: Open Pre-trained Transformer Language Models |
07/2022 |
Beyond neural scaling laws: beating power law scaling via data pruning |
11/2022 |
Galactica: A Large Language Model for Science |
01/2023 |
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (I-JEPA) |
02/2023 |
LLaMA: Open and Efficient Foundation Language Models |
02/2023 |
Toolformer: Language Models Can Teach Themselves to Use Tools |
03/2023 |
Scaling Expert Language Models with Unsupervised Domain Discovery |
03/2023 |
SemDeDup: Data-efficient learning at web-scale through semantic deduplication |
04/2023 |
Segment Anything (SAM) |
04/2023 |
A Cookbook of Self-Supervised Learning |
05/2023 |
Learning to Reason and Memorize with Self-Notes |
05/2023 |
ImageBind: One Embedding Space To Bind Them All |
05/2023 |
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers |
05/2023 |
LIMA: Less Is More for Alignment |
05/2023 |
Scaling Speech Technology to 1,000+ Languages |
05/2023 |
READ: Recurrent Adaptation of Large Transformers |
05/2023 |
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models |
05/2023 |
Physics of Language Models: Part 1, Learning Hierarchical Language Structures |
06/2023 |
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles |
06/2023 |
Simple and Controllable Music Generation (MusicGen) |
06/2023 |
Improving Open Language Models by Learning from Organic Interactions (BlenderBot 3x) |
06/2023 |
Extending Context Window of Large Language Models via Positional Interpolation |
06/2023 |
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale |
07/2023 |
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3leon) |
07/2023 |
Llama 2: Open Foundation and Fine-Tuned Chat Models |
08/2023 |
SeamlessM4T—Massively Multilingual & Multimodal Machine Translation |
08/2023 |
D4: Improving LLM Pretraining via Document De-Duplication and Diversification |
08/2023 |
Code Llama: Open Foundation Models for Code |
08/2023 |
Nougat: Neural Optical Understanding for Academic Documents |
09/2023 |
Contrastive Decoding Improves Reasoning in Large Language Models |
09/2023 |
Effective Long-Context Scaling of Foundation Models |
09/2023 |
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model |
09/2023 |
Vision Transformers Need Registers |
09/2023 |
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction |
09/2023 |
Physics of Language Models: Part 3.2, Knowledge Manipulation |
10/2023 |
RA-DIT: Retrieval-Augmented Dual Instruction Tuning |
10/2023 |
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation |
10/2023 |
Generative Pre-training for Speech with Flow Matching |
11/2023 |
Emu Edit: Precise Image Editing via Recognition and Generation Tasks |
12/2023 |
Audiobox: Unified Audio Generation with Natural Language Prompts |
12/2023 |
Universal Pyramid Adversarial Training for Improved ViT Performance |
01/2024 |
Self-Rewarding Language Models |
02/2024 |
Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA) |
02/2024 |
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases |
03/2024 |
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM |
03/2024 |
Reverse Training to Nurse the Reversal Curse |
04/2024 |
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws |
04/2024 |
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length |
04/2024 |
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding |
04/2024 |
Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding |
04/2024 |
MoDE: CLIP Data Experts via Clustering |
04/2024 |
Iterative Reasoning Preference Optimization |
04/2024 |
Better & Faster Large Language Models via Multi-token Prediction |
05/2024 |
Modeling Caption Diversity in Contrastive Vision-Language Pretraining (LLIP) |
05/2024 |
Chameleon: Mixed-Modal Early-Fusion Foundation Models |
05/2024 |
SpinQuant -- LLM quantization with learned rotations |
05/2024 |
Contextual Position Encoding: Learning to Count What's Important |
06/2024 |
The Factorization Curse: Which Tokens You Predict Underlie the Reversal Curse and More |
06/2024 |
Beyond Model Collapse: Scaling Up with Synthesized Data Requires Reinforcement |
07/2024 |
The Llama 3 Herd of Models |
07/2024 |
SAM 2: Segment Anything in Images and Videos |
07/2024 |
Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process |
07/2024 |
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts |
08/2024 |
Self-Taught Evaluators |
08/2024 |
Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems |
08/2024 |
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model |
10/2024 |
The Perfect Blend: Redefining RLHF with Mixture of Judges (CGPO) |
10/2024 |
Movie Gen: A Cast of Media Foundation Models |
10/2024 |
Thinking LLMs: General Instruction Following with Thought Generation |
11/2024 |
Context Parallelism for Scalable Million-Token Inference |
11/2024 |
Adaptive Decoding via Latent Preference Optimization |
|
|
|
Microsoft Papers Blog |
12/2015 |
Deep Residual Learning for Image Recognition |
05/2021 |
EL-Attention: Memory Efficient Lossless Attention for Generation |
01/2022 |
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale |
03/2022 |
DeepNet: Scaling Transformers to 1,000 Layers |
12/2022 |
A Length-Extrapolatable Transformer |
01/2023 |
Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases |
02/2023 |
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) |
03/2023 |
Sparks of Artificial General Intelligence: Early experiments with GPT-4 |
03/2023 |
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs |
04/2023 |
Instruction Tuning with GPT-4 |
04/2023 |
Inference with Reference: Lossless Acceleration of Large Language Models |
04/2023 |
Low-code LLM: Visual Programming over LLMs |
04/2023 |
WizardLM: Empowering Large Language Models to Follow Complex Instructions |
04/2023 |
MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks |
04/2023 |
ResiDual: Transformer with Dual Residual Connections |
05/2023 |
Code Execution with Pre-trained Language Models |
05/2023 |
Small Models are Valuable Plug-ins for Large Language Models |
05/2023 |
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing |
06/2023 |
Orca: Progressive Learning from Complex Explanation Traces of GPT-4 |
06/2023 |
Augmenting Language Models with Long-Term Memory |
06/2023 |
WizardCoder: Empowering Code Large Language Models with Evol-Instruct |
06/2023 |
Textbooks Are All You Need (phi-1) |
07/2023 |
In-context Autoencoder for Context Compression in a Large Language Model |
07/2023 |
Retentive Network: A Successor to Transformer for Large Language Models |
08/2023 |
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference |
09/2023 |
Efficient RLHF: Reducing the Memory Usage of PPO |
09/2023 |
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models |
09/2023 |
Textbooks Are All You Need II (phi-1.5) |
09/2023 |
PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training |
09/2023 |
A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models |
09/2023 |
Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models |
10/2023 |
Sparse Backpropagation for MoE Training |
10/2023 |
Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Models |
10/2023 |
Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness |
10/2023 |
Augmented Embeddings for Custom Retrievals |
10/2023 |
Guiding Language Model Reasoning with Planning Tokens |
10/2023 |
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V |
10/2023 |
CodeFusion: A Pre-trained Diffusion Model for Code Generation |
10/2023 |
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery |
10/2023 |
FP8-LM: Training FP8 Large Language Models |
11/2023 |
Orca 2: Teaching Small Language Models How to Reason |
12/2023 |
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks |
12/2023 |
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction |
01/2024 |
SliceGPT: Compress Large Language Models by Deleting Rows and Columns |
01/2024 |
RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture |
02/2024 |
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens |
02/2024 |
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits (BitNet) |
02/2024 |
ResLoRA: Identity Residual Mapping in Low-Rank Adaption |
03/2024 |
LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression |
03/2024 |
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series |
04/2024 |
LongEmbed: Extending Embedding Models for Long Context Retrieval |
04/2024 |
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone |
05/2024 |
You Only Cache Once: Decoder-Decoder Architectures for Language Models (YOCO) |
06/2024 |
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling |
06/2024 |
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS |
06/2024 |
Automatic Instruction Evolving for Large Language Models |
07/2024 |
Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena |
07/2024 |
Q-Sparse: All Large Language Models can be Fully Sparsely-Activated |
09/2024 |
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models |
10/2024 |
Differential Transformer |
11/2024 |
BitNet a4.8: 4-bit Activations for 1-bit LLMs |
|
|
|
OpenAI Papers Blog |
07/2017 |
Proximal Policy Optimization Algorithms |
04/2019 |
Generating Long Sequences with Sparse Transformers |
01/2020 |
Scaling Laws for Neural Language Models |
05/2020 |
Language Models are Few-Shot Learners (GPT-3) |
01/2022 |
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets |
03/2022 |
Training language models to follow instructions with human feedback (InstructGPT) |
07/2022 |
Efficient Training of Language Models to Fill in the Middle |
03/2023 |
GPT-4 Technical Report |
04/2023 |
Consistency Models |
05/2023 |
Let's Verify Step by Step |
10/2023 |
Improving Image Generation with Better Captions (DALL·E 3) |
10/2024 |
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering |
|
|
|
Hazy Research (Stanford) Papers Blog |
10/2021 |
Efficiently Modeling Long Sequences with Structured State Spaces (S4) |
04/2022 |
Monarch: Expressive Structured Matrices for Efficient and Accurate Training |
05/2022 |
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |
12/2022 |
Hungry Hungry Hippos: Towards Language Modeling with State Space Models |
02/2023 |
Simple Hardware-Efficient Long Convolutions for Sequence Modeling |
02/2023 |
Hyena Hierarchy: Towards Larger Convolutional Language Models |
06/2023 |
TART: A plug-and-play Transformer module for task-agnostic reasoning |
07/2023 |
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning |
11/2023 |
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores |
|
|
|
DeepSeek GitHub |
01/2024 |
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism |
01/2024 |
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence |
02/2024 |
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models |
03/2024 |
DeepSeek-VL: Towards Real-World Vision-Language Understanding |
05/2024 |
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model |
06/2024 |
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence |
07/2024 |
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models |
08/2024 |
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search |
08/2024 |
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning |
08/2024 |
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts |
10/2024 |
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation |
|
|
|
THUDM (Tsinghua University) Papers GitHub |
10/2022 |
GLM-130B: An Open Bilingual Pre-Trained Model |
03/2023 |
CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X |
04/2023 |
DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task |
06/2023 |
WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences |
09/2023 |
GPT Can Solve Mathematical Problems Without a Calculator (MathGLM) |
10/2023 |
AgentTuning: Enabling Generalized Agent Abilities for LLMs (AgentLM) |
11/2023 |
CogVLM: Visual Expert for Pretrained Language Models |
12/2023 |
CogAgent: A Visual Language Model for GUI Agents |
01/2024 |
APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding |
01/2024 |
LongAlign: A Recipe for Long Context Alignment of Large Language Models |
06/2024 |
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools |
08/2024 |
LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs |
11/2024 |
AutoGLM: Autonomous Foundation Agents for GUIs |
|
|
|
Articles |
03/2019 |
Rich Sutton - The Bitter Lesson |
06/2022 |
Yann LeCun - A Path Towards Autonomous Machine Intelligence |
01/2023 |
Lilian Weng - The Transformer Family Version 2.0 |
01/2023 |
Lilian Weng - Large Transformer Model Inference Optimization |
03/2023 |
Stanford - Alpaca: A Strong, Replicable Instruction-Following Model |
05/2023 |
OpenAI - Language models can explain neurons in language models |
05/2023 |
Alex Turner - Steering GPT-2-XL by adding an activation vector |
06/2023 |
YyWang - Do We Really Need the KVCache for All Large Language Models |
06/2023 |
kaiokendev - Extending Context is Hard…but not Impossible |
06/2023 |
bloc97 - NTK-Aware Scaled RoPE |
07/2023 |
oobabooga - A direct comparison between llama.cpp, AutoGPTQ, ExLlama, and transformers perplexities |
07/2023 |
Jianlin Su - Carrying the beta position to the end (better NTK RoPE method) |
08/2023 |
Charles Goddard - On Frankenllama |
10/2023 |
Tri Dao - Flash-Decoding for Long-Context Inference |
10/2023 |
Evan Armstrong - Human-Sourced, AI-Augmented: a promising solution for open source conversational data |
12/2023 |
Anthropic - Long context prompting for Claude 2.1 |
12/2023 |
Andrej Karpathy - On the "hallucination problem" (tweet.jpg) |
12/2023 |
HuggingFace - Mixture of Experts Explained |
01/2024 |
Vgel - Representation Engineering |
01/2024 |
Alex Alemi - KL is All You Need |
02/2024 |
Lilian Weng - Thinking about High-Quality Human Data |
03/2024 |
rayliuca - T-Ragx Project Write Up (Translation RAG) |
04/2024 |
Answer.AI - Efficient finetuning of Llama 3 with FSDP QDoRA |
04/2024 |
Sam Paech - Creating MAGI: A hard subset of MMLU and AGIEval |
05/2024 |
LLaVA Team - LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild |
05/2024 |
Hazy Research - GPUs Go Brrr (ThunderKittens) |
05/2024 |
Anthropic - Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet |
06/2024 |
CharacterAI - Optimizing AI Inference |
07/2024 |
Lilian Weng - Extrinsic Hallucinations in LLMs |
07/2024 |
Andrej Karpathy - Let's reproduce GPT-2 (1.6B) |
07/2024 |
Pierre-Carl Langlais - Announcing Finance Commons and the Bad Data Toolbox |
07/2024 |
Zeyuan Allen-Zhu - Physics of Language Models ICML Talk (Video) |
|
|
|
Open Models |
06/2021 |
GPT-J-6B: 6B JAX-Based Transformer |
09/2021 |
Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning |
03/2022 |
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis |
04/2022 |
GPT-NeoX-20B: An Open-Source Autoregressive Language Model |
11/2022 |
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model |
12/2022 |
DDColor: Towards Photo-Realistic Image Colorization via Dual Decoders |
04/2023 |
Visual Instruction Tuning (LLaVA) |
05/2023 |
StarCoder: May the source be with you! |
05/2023 |
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages |
05/2023 |
Otter: A Multi-Modal Model with In-Context Instruction Tuning |
05/2023 |
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning |
05/2023 |
CodeT5+: Open Code Large Language Models for Code Understanding and Generation |
05/2023 |
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities |
05/2023 |
RWKV: Reinventing RNNs for the Transformer Era |
05/2023 |
Lion: Adversarial Distillation of Closed-Source Large Language Model |
05/2023 |
MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training |
06/2023 |
Segment Anything in High Quality |
06/2023 |
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding |
06/2023 |
High-Fidelity Audio Compression with Improved RVQGAN (DAC) |
06/2023 |
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models |
06/2023 |
Anticipatory Music Transformer |
06/2023 |
RepoFusion: Training Code Models to Understand Your Repository |
06/2023 |
MPT-30B: Raising the bar for open-source foundation models |
06/2023 |
Vec2Vec: A Compact Neural Network Approach for Transforming Text Embeddings with High Fidelity |
06/2023 |
ViNT: A Foundation Model for Visual Navigation |
06/2023 |
How Long Can Open-Source LLMs Truly Promise on Context Length? (LongChat) |
07/2023 |
Hierarchical Open-vocabulary Universal Image Segmentation |
07/2023 |
Focused Transformer: Contrastive Training for Context Scaling (LongLLaMA) |
07/2023 |
Rhythm Modeling for Voice Conversion (Urhythmic) |
07/2023 |
Scaling TransNormer to 175 Billion Parameters |
08/2023 |
Separate Anything You Describe |
08/2023 |
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data |
09/2023 |
RADIO: Reference-Agnostic Dubbing Video Synthesis |
09/2023 |
Matcha-TTS: A fast TTS architecture with conditional flow matching |
09/2023 |
DreamLLM: Synergistic Multimodal Comprehension and Creation |
09/2023 |
Baichuan 2: Open Large-scale Language Models |
09/2023 |
Qwen Technical Report |
09/2023 |
Mistral 7B |
10/2023 |
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning |
10/2023 |
Improved Baselines with Visual Instruction Tuning (LLaVA 1.5) |
10/2023 |
LLark: A Multimodal Foundation Model for Music |
10/2023 |
SALMONN: Towards Generic Hearing Abilities for Large Language Models |
10/2023 |
Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents |
11/2023 |
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models |
11/2023 |
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition |
11/2023 |
YUAN 2.0: A Large Language Model with Localized Filtering-based Attention |
12/2023 |
Making Large Multimodal Models Understand Arbitrary Visual Prompts (ViP-LLaVA) |
12/2023 |
Mamba: Linear-Time Sequence Modeling with Selective State Spaces |
12/2023 |
OpenVoice: Versatile Instant Voice Cloning |
12/2023 |
Sequential Modeling Enables Scalable Learning for Large Vision Models (LVM) |
12/2023 |
Magicoder: Source Code Is All You Need |
12/2023 |
StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers |
12/2023 |
MMM: Generative Masked Motion Model |
12/2023 |
4M: Massively Multimodal Masked Modeling |
12/2023 |
LLM360: Towards Fully Transparent Open-Source LLMs |
12/2023 |
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling |
01/2024 |
Mixtral of Experts |
01/2024 |
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer |
01/2024 |
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications |
01/2024 |
Scalable Pre-training of Large Autoregressive Image Models |
01/2024 |
Orion-14B: Open-source Multilingual Large Language Models |
01/2024 |
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data |
01/2024 |
VMamba: Visual State Space Model |
01/2024 |
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models |
01/2024 |
LLaVA-1.6: Improved reasoning, OCR, and world knowledge |
01/2024 |
MiniCPM: Unveiling the Potential of End-side Large Language Models |
01/2024 |
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild |
02/2024 |
Graph-Mamba: Towards Long-Range Graph Sequence Modeling with Selective State Spaces |
02/2024 |
Introducing Qwen1.5 |
02/2024 |
BlackMamba: Mixture of Experts for State-Space Models |
02/2024 |
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss |
02/2024 |
GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators |
02/2024 |
Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion |
02/2024 |
Brant-2: Foundation Model for Brain Signals |
02/2024 |
CLLMs: Consistency Large Language Models |
03/2024 |
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3) |
03/2024 |
TripoSR: Fast 3D Object Reconstruction from a Single Image |
03/2024 |
Yi: Open Foundation Models by 01.AI |
03/2024 |
VideoMamba: State Space Model for Efficient Video Understanding |
03/2024 |
VOICECRAFT: Zero-Shot Speech Editing and Text-to-Speech in the Wild |
03/2024 |
GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation |
03/2024 |
DBRX: A New State-of-the-Art Open LLM |
03/2024 |
AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation |
03/2024 |
Jamba: A Hybrid Transformer-Mamba Language Model |
04/2024 |
Advancing LLM Reasoning Generalists with Preference Trees (Eurus) |
04/2024 |
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR) |
04/2024 |
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence |
04/2024 |
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models |
05/2024 |
Language-Image Models with 3D Understanding (Cube-LLM) |
05/2024 |
AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding |
05/2024 |
Pandora: Towards General World Model with Natural Language Actions and Video States |
05/2024 |
TerDiT: Ternary Diffusion Models with Transformers |
05/2024 |
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models |
05/2024 |
Phased Consistency Model |
05/2024 |
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series |
05/2024 |
YOLOv10: Real-Time End-to-End Object Detection |
05/2024 |
MegActor: Harness the Power of Raw Video for Vivid Portrait Animation |
06/2024 |
Bootstrap3D: Improving 3D Content Creation with Synthetic Data |
06/2024 |
EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture |
06/2024 |
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec |
06/2024 |
GrootVL: Tree Topology is All You Need in State Space Model |
06/2024 |
An Independence-promoting Loss for Music Generation with Language Models (MusicGen-MMD) |
06/2024 |
Matching Anything by Segmenting Anything |
06/2024 |
Nemotron-4 340B Technical Report |
06/2024 |
TroL: Traversal of Layers for Large Language and Vision Models |
06/2024 |
Depth Anything V2 |
06/2024 |
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale |
06/2024 |
Network Bending of Diffusion Models for Audio-Visual Generation |
06/2024 |
Less is More: Accurate Speech Recognition & Translation without Web-Scale Data (Canary) |
07/2024 |
LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control |
07/2024 |
Qwen2 Technical Report |
07/2024 |
Qwen2-Audio Technical Report |
07/2024 |
ColPali: Efficient Document Retrieval with Vision Language Models |
07/2024 |
Compact Language Models via Pruning and Knowledge Distillation (Minitron) |
08/2024 |
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models |
08/2024 |
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale |
08/2024 |
SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection |
09/2024 |
OLMoE: Open Mixture-of-Experts Language Models |
09/2024 |
Sample-Efficient Diffusion for Text-To-Speech Synthesis (SESD) |
09/2024 |
Multi-Source Music Generation with Latent Diffusion (MSLDM) |
09/2024 |
Prithvi WxC: Foundation Model for Weather and Climate |
09/2024 |
DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency |
09/2024 |
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models |
09/2024 |
MIO: A Foundation Model on Multimodal Tokens |
10/2024 |
UniMuMo: Unified Text, Music and Motion Generation |
10/2024 |
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation |
10/2024 |
Aria: An Open Multimodal Native Mixture-of-Experts Model |
10/2024 |
Taipan: Efficient and Expressive State Space Language Models with Selective Attention |
10/2024 |
DreamCraft3D++: Efficient Hierarchical 3D Generation with Multi-Plane Reconstruction Model |
|
|
|
Various |
09/2014 |
Neural Machine Translation by Jointly Learning to Align and Translate |
06/2019 |
Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View |
10/2019 |
Root Mean Square Layer Normalization |
10/2019 |
Transformers without Tears: Improving the Normalization of Self-Attention |
12/2019 |
Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection |
02/2020 |
On Layer Normalization in the Transformer Architecture |
04/2020 |
Longformer: The Long-Document Transformer |
04/2020 |
Improved Natural Language Generation via Loss Truncation |
06/2020 |
Memory Transformer |
07/2020 |
Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity |
12/2020 |
ERNIE-Doc: A Retrospective Long-Document Modeling Transformer |
01/2021 |
Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks |
03/2021 |
The Low-Rank Simplicity Bias in Deep Networks |
04/2021 |
RoFormer: Enhanced Transformer with Rotary Position Embedding |
06/2021 |
LoRA: Low-Rank Adaptation of Large Language Models |
07/2021 |
CrossFormer: A Versatile Vision Transformer Hinging on Cross-scale Attention |
03/2022 |
Memorizing Transformers |
04/2022 |
UL2: Unifying Language Learning Paradigms |
05/2022 |
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (IA3) |
06/2022 |
nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models |
07/2022 |
Language Models (Mostly) Know What They Know |
08/2022 |
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale |
09/2022 |
Petals: Collaborative Inference and Fine-tuning of Large Models |
10/2022 |
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers |
10/2022 |
Recurrent Memory Transformer |
10/2022 |
Truncation Sampling as Language Model Desmoothing |
10/2022 |
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation |
11/2022 |
An Algorithm for Routing Vectors in Sequences |
11/2022 |
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts |
12/2022 |
Self-Instruct: Aligning Language Models with Self-Generated Instructions |
12/2022 |
Parallel Context Windows Improve In-Context Learning of Large Language Models |
12/2022 |
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor |
12/2022 |
Pretraining Without Attention |
12/2022 |
The case for 4-bit precision: k-bit Inference Scaling Laws |
12/2022 |
Prompting Is Programming: A Query Language for Large Language Models |
01/2023 |
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient |
01/2023 |
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot |
01/2023 |
Memory Augmented Large Language Models are Computationally Universal |
01/2023 |
Progress measures for grokking via mechanistic interpretability |
01/2023 |
Adaptive Computation with Elastic Input Sequence |
02/2023 |
Colossal-Auto: Unified Automation of Parallelization and Activation Checkpoint for Large-scale Models |
02/2023 |
The Wisdom of Hindsight Makes Language Models Better Instruction Followers |
02/2023 |
The Stable Entropy Hypothesis and Entropy-Aware Decoding: An Analysis and Algorithm for Robust Natural Language Generation |
03/2023 |
COLT5: Faster Long-Range Transformers with Conditional Computation |
03/2023 |
High-throughput Generative Inference of Large Language Models with a Single GPU |
03/2023 |
Meet in the Middle: A New Pre-training Paradigm |
03/2023 |
Reflexion: an autonomous agent with dynamic memory and self-reflection |
03/2023 |
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning |
03/2023 |
FP8 versus INT8 for efficient deep learning inference |
03/2023 |
Self-Refine: Iterative Refinement with Self-Feedback |
04/2023 |
RPTQ: Reorder-based Post-training Quantization for Large Language Models |
04/2023 |
REFINER: Reasoning Feedback on Intermediate Representations |
04/2023 |
Generative Agents: Interactive Simulacra of Human Behavior |
04/2023 |
Compressed Regression over Adaptive Networks |
04/2023 |
A Cheaper and Better Diffusion Language Model with Soft-Masked Noise |
04/2023 |
RRHF: Rank Responses to Align Language Models with Human Feedback without tears |
04/2023 |
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society |
04/2023 |
Automatic Gradient Descent: Deep Learning without Hyperparameters |
04/2023 |
SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models |
04/2023 |
Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study |
04/2023 |
Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling |
04/2023 |
Scaling Transformer to 1M tokens and beyond with RMT |
04/2023 |
Answering Questions by Meta-Reasoning over Multiple Chains of Thought |
04/2023 |
Towards Multi-Modal DBMSs for Seamless Querying of Texts and Tables |
04/2023 |
We're Afraid Language Models Aren't Modeling Ambiguity |
04/2023 |
The Internal State of an LLM Knows When its Lying |
04/2023 |
Search-in-the-Chain: Towards the Accurate, Credible and Traceable Content Generation for Complex Knowledge-intensive Tasks |
05/2023 |
Towards Unbiased Training in Federated Open-world Semi-supervised Learning |
05/2023 |
Unlimiformer: Long-Range Transformers with Unlimited Length Input |
05/2023 |
FreeLM: Fine-Tuning-Free Language Model |
05/2023 |
Cuttlefish: Low-rank Model Training without All The Tuning |
05/2023 |
AttentionViz: A Global View of Transformer Attention |
05/2023 |
Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models |
05/2023 |
A Frustratingly Easy Improvement for Position Embeddings via Random Padding |
05/2023 |
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision |
05/2023 |
Explanation-based Finetuning Makes Models More Robust to Spurious Cues |
05/2023 |
An automatically discovered chain-of-thought prompt generalizes to novel models and datasets |
05/2023 |
Recommender Systems with Generative Retrieval |
05/2023 |
Fast Distributed Inference Serving for Large Language Models |
05/2023 |
Chain-of-Dictionary Prompting Elicits Translation in Large Language Models |
05/2023 |
Recommendation as Instruction Following: A Large Language Model Empowered Recommendation Approach |
05/2023 |
Active Retrieval Augmented Generation |
05/2023 |
Scalable Coupling of Deep Learning with Logical Reasoning |
05/2023 |
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca |
05/2023 |
StructGPT: A General Framework for Large Language Model to Reason over Structured Data |
05/2023 |
Pre-Training to Learn in Context |
05/2023 |
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings |
05/2023 |
Accelerating Transformer Inference for Translation via Parallel Decoding |
05/2023 |
Cooperation Is All You Need |
05/2023 |
PTQD: Accurate Post-Training Quantization for Diffusion Models |
05/2023 |
LLM-Pruner: On the Structural Pruning of Large Language Models |
05/2023 |
SelfzCoT: a Self-Prompt Zero-shot CoT from Semantic-level to Code-level for a Better Utilization of LLMs |
05/2023 |
QLoRA: Efficient Finetuning of Quantized LLMs |
05/2023 |
"According to ..." Prompting Language Models Improves Quoting from Pre-Training Data |
05/2023 |
Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training |
05/2023 |
Landmark Attention: Random-Access Infinite Context Length for Transformers |
05/2023 |
Scaling Data-Constrained Language Models |
05/2023 |
Fine-Tuning Language Models with Just Forward Passes |
05/2023 |
Intriguing Properties of Quantization at Scale |
05/2023 |
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time |
05/2023 |
Blockwise Parallel Transformer for Long Context Large Models |
05/2023 |
The Impact of Positional Encoding on Length Generalization in Transformers |
05/2023 |
Adapting Language Models to Compress Contexts |
05/2023 |
Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
06/2023 |
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration |
06/2023 |
Faster Causal Attention Over Large Sequences Through Sparse Flash Attention |
06/2023 |
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training |
06/2023 |
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression |
06/2023 |
Fine-Tuning Language Models with Advantage-Induced Policy Alignment |
06/2023 |
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards |
06/2023 |
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model |
06/2023 |
Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories |
06/2023 |
Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion |
06/2023 |
Word sense extension |
06/2023 |
Mitigating Transformer Overconfidence via Lipschitz Regularization |
06/2023 |
Recurrent Attention Networks for Long-text Modeling |
06/2023 |
One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning |
06/2023 |
SqueezeLLM: Dense-and-Sparse Quantization |
06/2023 |
Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training |
06/2023 |
Propagating Knowledge Updates to LMs Through Distillation |
06/2023 |
Full Parameter Fine-tuning for Large Language Models with Limited Resources |
06/2023 |
A Simple and Effective Pruning Approach for Large Language Models |
06/2023 |
InRank: Incremental Low-Rank Learning |
06/2023 |
Evaluating the Zero-shot Robustness of Instruction-tuned Language Models |
06/2023 |
Learning to Generate Better Than Your LLM (RLGF) |
06/2023 |
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing |
06/2023 |
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models |
06/2023 |
FLuRKA: Fast fused Low-Rank & Kernel Attention |
06/2023 |
Stay on topic with Classifier-Free Guidance |
07/2023 |
AutoST: Training-free Neural Architecture Search for Spiking Transformers |
07/2023 |
Single Sequence Prediction over Reasoning Graphs for Multi-hop QA |
07/2023 |
Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models |
07/2023 |
Facing off World Model Backbones: RNNs, Transformers, and S4 |
07/2023 |
Improving Retrieval-Augmented Large Language Models via Data Importance Learning |
07/2023 |
Teaching Arithmetic to Small Transformers |
07/2023 |
QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models |
07/2023 |
Stack More Layers Differently: High-Rank Training Through Low-Rank Updates |
07/2023 |
Copy Is All You Need (CoG) |
07/2023 |
Multi-Method Self-Training: Improving Code Generation With Text, And Vice Versa |
07/2023 |
Divide & Bind Your Attention for Improved Generative Semantic Nursing |
07/2023 |
Challenges and Applications of Large Language Models |
07/2023 |
Soft Prompt Tuning for Augmenting Dense Retrieval with Large Language Models |
07/2023 |
QuIP: 2-Bit Quantization of Large Language Models With Guarantees |
07/2023 |
CoRe Optimizer: An All-in-One Solution for Machine Learning |
07/2023 |
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time |
08/2023 |
ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation |
08/2023 |
EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models |
08/2023 |
Activation Addition: Steering Language Models Without Optimization |
08/2023 |
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models |
08/2023 |
Accelerating LLM Inference with Staged Speculative Decoding |
08/2023 |
YaRN: Efficient Context Window Extension of Large Language Models |
08/2023 |
LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models |
09/2023 |
Making Large Language Models Better Reasoners with Alignment |
09/2023 |
Data-Juicer: A One-Stop Data Processing System for Large Language Models |
09/2023 |
Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices |
09/2023 |
SLiMe: Segment Like Me |
09/2023 |
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models |
09/2023 |
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale |
09/2023 |
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs |
09/2023 |
Efficient Memory Management for Large Language Model Serving with PagedAttention |
09/2023 |
Cure the headache of Transformers via Collinear Constrained Attention |
09/2023 |
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity |
09/2023 |
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models |
09/2023 |
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation |
09/2023 |
Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models |
09/2023 |
Improving Code Generation by Dynamic Temperature Sampling |
09/2023 |
Efficient Streaming Language Models with Attention Sinks |
10/2023 |
DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models |
10/2023 |
GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length |
10/2023 |
Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models |
10/2023 |
Elephant Neural Networks: Born to Be a Continual Learner |
10/2023 |
Ring Attention with Blockwise Transformers for Near-Infinite Context |
10/2023 |
Retrieval meets Long Context Large Language Models |
10/2023 |
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines |
10/2023 |
LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers |
10/2023 |
Amortizing intractable inference in large language models (GFlowNet Tuning) |
10/2023 |
SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF |
10/2023 |
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity |
10/2023 |
Let Models Speak Ciphers: Multiagent Debate through Embeddings |
10/2023 |
InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining |
10/2023 |
CacheGen: Fast Context Loading for Language Model Applications |
10/2023 |
MatFormer: Nested Transformer for Elastic Inference |
10/2023 |
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models |
10/2023 |
Towards End-to-end 4-Bit Inference on Generative Large Language Models (QUIK) |
10/2023 |
Microscaling Data Formats for Deep Learning |
10/2023 |
xVal: A Continuous Number Encoding for Large Language Models |
10/2023 |
An Emulator for Fine-Tuning Large Language Models using Small Language Models |
10/2023 |
Frozen Transformers in Language Models Are Effective Visual Encoder Layers |
10/2023 |
LoBaSS: Gauging Learnability in Supervised Fine-tuning Data |
10/2023 |
Quality-Diversity through AI Feedback |
10/2023 |
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (SEDD) |
10/2023 |
DoGE: Domain Reweighting with Generalization Estimation |
10/2023 |
E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity |
10/2023 |
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation |
10/2023 |
Personas as a Way to Model Truthfulness in Language Models |
10/2023 |
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving |
10/2023 |
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models |
11/2023 |
AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models |
11/2023 |
FlashDecoding++: Faster Large Language Model Inference on GPUs |
11/2023 |
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization |
11/2023 |
Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs |
11/2023 |
REST: Retrieval-Based Speculative Decoding |
11/2023 |
DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines |
11/2023 |
Token-level Adaptation of LoRA Adapters for Downstream Task Generalization |
11/2023 |
Exponentially Faster Language Modelling |
11/2023 |
MultiLoRA: Democratizing LoRA for Better Multi-Task Learning |
11/2023 |
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning |
11/2023 |
Token Recycling for Efficient Sequential Inference with Vision Transformers |
11/2023 |
Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization |
12/2023 |
GIFT: Generative Interpretable Fine-Tuning Transformers |
12/2023 |
PEFA: Parameter-Free Adapters for Large-scale Embedding-based Retrieval Models |
12/2023 |
Improving Activation Steering in Language Models with Mean-Centring |
12/2023 |
A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA |
12/2023 |
SparQ Attention: Bandwidth-Efficient LLM Inference |
12/2023 |
ESPN: Memory-Efficient Multi-Vector Information Retrieval |
12/2023 |
Aligner: One Global Token is Worth Millions of Parameters When Aligning Large Language Models |
12/2023 |
CBQ: Cross-Block Quantization for Large Language Models |
12/2023 |
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention |
12/2023 |
Weight subcloning: direct initialization of transformers using larger pretrained ones |
12/2023 |
Cascade Speculative Drafting for Even Faster LLM Inference |
12/2023 |
ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference |
12/2023 |
Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy |
12/2023 |
A Semantic Space is Worth 256 Language Descriptions: Make Stronger Segmentation Models with Descriptive Properties |
12/2023 |
Algebraic Positional Encodings |
12/2023 |
Preference as Reward, Maximum Preference Optimization with Importance Sampling |
01/2024 |
LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning |
01/2024 |
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models |
01/2024 |
LLaMA Pro: Progressive LLaMA with Block Expansion |
01/2024 |
Fast and Optimal Weight Update for Pruned Large Language Models |
01/2024 |
Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon |
01/2024 |
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts |
01/2024 |
Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning |
01/2024 |
RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation |
01/2024 |
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models |
01/2024 |
AUTOACT: Automatic Agent Learning from Scratch via Self-Planning |
01/2024 |
Extreme Compression of Large Language Models via Additive Quantization (AQLM) |
01/2024 |
Knowledge Translation: A New Pathway for Model Compression |
01/2024 |
Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks |
01/2024 |
Transformers are Multi-State RNNs |
01/2024 |
Extending LLMs' Context Window with 100 Samples (Entropy-ABF) |
01/2024 |
ChatQA: Building GPT-4 Level Conversational QA Models |
01/2024 |
AutoChunk: Automated Activation Chunk for Memory-Efficient Long Sequence Inference |
01/2024 |
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads |
01/2024 |
Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation |
01/2024 |
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models |
01/2024 |
Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment |
01/2024 |
Dynamic Layer Tying for Parameter-Efficient Transformers |
01/2024 |
MambaByte: Token-free Selective State Space Model |
01/2024 |
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design |
01/2024 |
Accelerating Retrieval-Augmented Language Model Serving with Speculation |
01/2024 |
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities |
01/2024 |
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty |
01/2024 |
With Greater Text Comes Greater Necessity: Inference-Time Training Helps Long Text Generation (Temp LoRA) |
01/2024 |
YODA: Teacher-Student Progressive Learning for Language Models |
01/2024 |
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization |
01/2024 |
LOCOST: State-Space Models for Long Document Abstractive Summarization |
01/2024 |
Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model |
01/2024 |
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval |
02/2024 |
EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models |
02/2024 |
MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts |
02/2024 |
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding |
02/2024 |
Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities |
02/2024 |
HiQA: A Hierarchical Contextual Augmentation RAG for Massive Documents QA |
02/2024 |
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache |
02/2024 |
DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing |
02/2024 |
QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks |
02/2024 |
Hydragen: High-Throughput LLM Inference with Shared Prefixes |
02/2024 |
Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding |
02/2024 |
LESS: Selecting Influential Data for Targeted Instruction Tuning |
02/2024 |
Accurate LoRA-Finetuning Quantization of LLMs via Information Retention |
02/2024 |
AttnLRP: Attention-Aware Layer-wise Relevance Propagation for Transformers |
02/2024 |
X-LoRA: Mixture of Low-Rank Adapter Experts, a Flexible Framework for Large Language Models with Applications in Protein Mechanics |
02/2024 |
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data |
02/2024 |
Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance |
02/2024 |
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference |
02/2024 |
Uncertainty Decomposition and Quantification for In-Context Learning of Large Language Models |
02/2024 |
RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models |
02/2024 |
BitDelta: Your Fine-Tune May Only Be Worth One Bit |
02/2024 |
DoRA: Weight-Decomposed Low-Rank Adaptation |
02/2024 |
In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss |
02/2024 |
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning |
02/2024 |
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding |
02/2024 |
Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts |
02/2024 |
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More |
02/2024 |
DB-LLM: Accurate Dual-Binarization for Efficient LLMs |
02/2024 |
Data Engineering for Scaling Language Models to 128K Context |
02/2024 |
EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs |
02/2024 |
HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts |
02/2024 |
Turn Waste into Worth: Rectifying Top-k Router of MoE |
02/2024 |
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive |
02/2024 |
Q-Probe: A Lightweight Approach to Reward Maximization for Language Models |
02/2024 |
Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization |
02/2024 |
MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models |
02/2024 |
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing |
02/2024 |
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation |
02/2024 |
No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization |
02/2024 |
DropBP: Accelerating Fine-Tuning of Large Language Models by Dropping Backward Propagation |
02/2024 |
CoDream: Exchanging dreams instead of models for federated aggregation with heterogeneous models |
02/2024 |
Humanoid Locomotion as Next Token Prediction |
02/2024 |
KTO: Model Alignment as Prospect Theoretic Optimization |
02/2024 |
Noise Contrastive Alignment of Language Models with Explicit Rewards (NCA) |
02/2024 |
ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs |
02/2024 |
Training-Free Long-Context Scaling of Large Language Models (DCA) |
03/2024 |
Not all Layers of LLMs are Necessary during Inference |
03/2024 |
Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models |
03/2024 |
DenseMamba: State Space Models with Dense Hidden Connection for Efficient Large Language Models |
03/2024 |
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection |
03/2024 |
Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding |
03/2024 |
Scattered Mixture-of-Experts Implementation |
03/2024 |
AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning |
03/2024 |
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences |
03/2024 |
Bifurcated Attention for Single-Context Large-Batch Sampling |
03/2024 |
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference |
03/2024 |
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering |
03/2024 |
Recurrent Drafter for Fast Speculative Decoding in Large Language Models |
03/2024 |
Arcee's MergeKit: A Toolkit for Merging Large Language Models |
03/2024 |
Rotary Position Embedding for Vision Transformer |
03/2024 |
BiLoRA: A Bi-level Optimization Framework for Overfitting-Resilient Low-Rank Adaptation of Large Pre-trained Models |
03/2024 |
Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition |
03/2024 |
DreamReward: Text-to-3D Generation with Human Preference |
03/2024 |
Evolutionary Optimization of Model Merging Recipes |
03/2024 |
Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance |
03/2024 |
When Do We Not Need Larger Vision Models? |
03/2024 |
FeatUp: A Model-Agnostic Framework for Features at Any Resolution |
03/2024 |
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching |
03/2024 |
The Unreasonable Ineffectiveness of the Deeper Layers |
03/2024 |
QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs |
04/2024 |
LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models |
04/2024 |
Prompt-prompted Mixture of Experts for Efficient LLM Generation (GRIFFIN) |
04/2024 |
BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models |
04/2024 |
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget |
04/2024 |
CodecLM: Aligning Language Models with Tailored Synthetic Data |
04/2024 |
Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation |
04/2024 |
Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs |
04/2024 |
Continuous Language Model Interpolation for Dynamic and Controllable Text Generation |
04/2024 |
RULER: What's the Real Context Size of Your Long-Context Language Models? |
04/2024 |
Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models |
04/2024 |
On Speculative Decoding for Multimodal Large Language Models |
04/2024 |
CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models |
04/2024 |
Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs |
04/2024 |
Fewer Truncations Improve Language Modeling |
04/2024 |
When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes |
04/2024 |
Learn2Talk: 3D Talking Face Learns from 2D Talking Face |
04/2024 |
Weak-to-Strong Extrapolation Expedites Alignment (EXPO) |
04/2024 |
decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points |
04/2024 |
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation |
04/2024 |
Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding |
04/2024 |
Mixture of LoRA Experts |
04/2024 |
MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning |
04/2024 |
XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts |
04/2024 |
Retrieval Head Mechanistically Explains Long-Context Factuality |
04/2024 |
Let's Think Dot by Dot: Hidden Computation in Transformer Language Models |
04/2024 |
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting |
05/2024 |
When to Retrieve: Teaching LLMs to Utilize Information Retrieval Effectively |
05/2024 |
A Careful Examination of Large Language Model Performance on Grade School Arithmetic |
05/2024 |
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge |
05/2024 |
Parameter-Efficient Fine-Tuning with Discrete Fourier Transform |
05/2024 |
COPAL: Continual Pruning in Large Language Generative Models |
05/2024 |
Revisiting a Pain in the Neck: Semantic Phrase Processing Benchmark for Language Models |
05/2024 |
AlphaMath Almost Zero: Process Supervision without Process |
05/2024 |
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving |
05/2024 |
xLSTM: Extended Long Short-Term Memory |
05/2024 |
FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference |
05/2024 |
SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models |
05/2024 |
HMT: Hierarchical Memory Transformer for Long Context Language Processing |
05/2024 |
The Future of Large Language Model Pre-training is Federated |
05/2024 |
Layer-Condensed KV Cache for Efficient Inference of Large Language Models |
05/2024 |
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning |
05/2024 |
SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model |
05/2024 |
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention |
05/2024 |
Bagging Improves Generalization Exponentially |
05/2024 |
Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models |
05/2024 |
Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast |
05/2024 |
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum |
05/2024 |
T2 of Thoughts: Temperature Tree Elicits Reasoning in Large Language Models |
05/2024 |
ReALLM: A general framework for LLM compression and fine-tuning |
05/2024 |
SimPO: Simple Preference Optimization with a Reference-Free Reward |
05/2024 |
PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression |
05/2024 |
Removing Bias from Maximum Likelihood Estimation with Model Autophagy |
05/2024 |
RE-Adapt: Reverse Engineered Adaptation of Large Language Models |
05/2024 |
MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence |
05/2024 |
Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining |
05/2024 |
Accelerating Transformers with Spectrum-Preserving Token Merging |
05/2024 |
A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training |
05/2024 |
MoEUT: Mixture-of-Experts Universal Transformers |
05/2024 |
Exploring Context Window of Large Language Models via Decomposed Positional Vectors |
05/2024 |
Transformers Can Do Arithmetic with the Right Embeddings |
05/2024 |
OwLore: Outlier-weighed Layerwise Sampled Low-Rank Projection for Memory-Efficient LLM Fine-tuning |
05/2024 |
MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification |
05/2024 |
Self-Play Preference Optimization for Language Model Alignment |
05/2024 |
The Road Less Scheduled (Schedule-Free) |
06/2024 |
FineWeb: decanting the web for the finest text data at scale |
06/2024 |
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2) |
06/2024 |
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization |
06/2024 |
DeCoOp: Robust Prompt Tuning with Out-of-Distribution Detection |
06/2024 |
MultiMax: Sparse and Multi-Modal Attention Learning |
06/2024 |
MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization |
06/2024 |
Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation |
06/2024 |
QuanTA: Efficient High-Rank Fine-Tuning of LLMs with Quantum-Informed Tensor Adaptation |
06/2024 |
SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining |
06/2024 |
Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models |
06/2024 |
VCR: Visual Caption Restoration |
06/2024 |
LoCoCo: Dropping In Convolutions for Long Context Compression |
06/2024 |
Low-Rank Quantization-Aware Training for LLMs |
06/2024 |
Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters |
06/2024 |
DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion |
06/2024 |
TernaryLLM: Ternarized Large Language Model |
06/2024 |
Image and Video Tokenization with Binary Spherical Quantization |
06/2024 |
Discovering Preference Optimization Algorithms with and for Large Language Models |
06/2024 |
ProTrain: Efficient LLM Training via Memory-Aware Techniques |
06/2024 |
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling |
06/2024 |
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing |
06/2024 |
Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs |
06/2024 |
HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning |
06/2024 |
LieRE: Generalizing Rotary Position Encodings |
06/2024 |
DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer |
06/2024 |
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference |
06/2024 |
Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies |
06/2024 |
mDPO: Conditional Preference Optimization for Multimodal Large Language Models |
06/2024 |
QTIP: Quantization with Trellises and Incoherence Processing |
06/2024 |
Mixture-of-Subspaces in Low-Rank Adaptation (MoSLoRA) |
06/2024 |
Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization |
06/2024 |
Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models |
06/2024 |
DeciMamba: Exploring the Length Extrapolation Potential of Mamba |
06/2024 |
Optimised Grouped-Query Attention Mechanism for Transformers |
06/2024 |
MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression |
06/2024 |
Unsupervised Morphological Tree Tokenizer (TreeTok) |
06/2024 |
Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation |
06/2024 |
What Matters in Transformers? Not All Attention is Needed |
06/2024 |
Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention |
06/2024 |
ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models |
06/2024 |
Adam-mini: Use Fewer Learning Rates To Gain More |
06/2024 |
Large Language Models are Interpretable Learners |
06/2024 |
Selective Prompting Tuning for Personalized Conversations with LLMs |
06/2024 |
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs |
06/2024 |
Eliminating Position Bias of Language Models: A Mechanistic Approach |
07/2024 |
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion |
07/2024 |
Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs |
07/2024 |
LoCo: Low-Bit Communication Adaptor for Large-scale Model Training |
07/2024 |
Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning |
07/2024 |
Learning to (Learn at Test Time): RNNs with Expressive Hidden States (TTT) |
07/2024 |
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps |
07/2024 |
Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules |
07/2024 |
OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training |
07/2024 |
Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization |
07/2024 |
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients |
07/2024 |
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision |
07/2024 |
Lite-SAM Is Actually What You Need for Segment Everything |
07/2024 |
BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks |
07/2024 |
Tiled Bit Networks: Sub-Bit Neural Network Compression Through Reuse of Learnable Binary Vectors |
07/2024 |
Patch-Level Training for Large Language Models |
07/2024 |
Correcting the Mythos of KL-Regularization: Direct Alignment without Overparameterization via Chi-squared Preference Optimization |
07/2024 |
LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference |
07/2024 |
Hi-EF: Benchmarking Emotion Forecasting in Human-interaction |
07/2024 |
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads |
07/2024 |
MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning |
07/2024 |
Palu: Compressing KV-Cache with Low-Rank Projection |
07/2024 |
AI-Assisted Generation of Difficult Math Questions (MATH^2) |
07/2024 |
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models |
07/2024 |
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies |
08/2024 |
POA: Pre-training Once for Models of All Sizes |
08/2024 |
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion |
08/2024 |
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters |
08/2024 |
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion |
08/2024 |
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression |
08/2024 |
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery |
08/2024 |
Post-Training Sparse Attention with Double Sparsity |
08/2024 |
A Spitting Image: Modular Superpixel Tokenization in Vision Transformers (SPiT) |
08/2024 |
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations |
08/2024 |
SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models |
08/2024 |
HMoE: Heterogeneous Mixture of Experts for Language Modeling |
08/2024 |
First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models |
08/2024 |
LLM Pruning and Distillation in Practice: The Minitron Approach |
08/2024 |
FocusLLM: Scaling LLM's Context by Parallel Decoding |
08/2024 |
Memory-Efficient LLM Training with Online Subspace Descent |
08/2024 |
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding |
08/2024 |
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer |
09/2024 |
FedModule: A Modular Federated Learning Framework |
09/2024 |
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers |
09/2024 |
STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning |
09/2024 |
Length Desensitization in Direct Preference Optimization (LD-DPO) |
09/2024 |
CPL: Critical Planning Step Learning Boosts LLM Generalization in Reasoning Tasks |
09/2024 |
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval |
09/2024 |
SOAP: Improving and Stabilizing Shampoo using Adam |
09/2024 |
A Controlled Study on Long Context Extension and Generalization in LLMs |
09/2024 |
Scaling FP8 training to trillion-token LLMs |
09/2024 |
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts |
09/2024 |
INT-FlashAttention: Enabling Flash Attention for INT8 Quantization |
09/2024 |
Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction |
09/2024 |
SATA: Spatial Autocorrelation Token Analysis for Enhancing the Robustness of Vision Transformers |
10/2024 |
VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment |
10/2024 |
FlashMask: Efficient and Rich Mask Extension of FlashAttention |
10/2024 |
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data |
10/2024 |
Parameter Competition Balancing for Model Merging |
10/2024 |
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration |
10/2024 |
ARB-LLM: Alternating Refined Binarizations for Large Language Models |
10/2024 |
Contextual Document Embeddings |
10/2024 |
SparsePO: Controlling Preference Alignment of LLMs via Sparse Token Masks |
10/2024 |
Accelerating Diffusion Transformers with Token-wise Feature Caching |
10/2024 |
Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling |
10/2024 |
Restructuring Vector Quantization with the Rotation Trick |
10/2024 |
Upcycling Large Language Models into Mixture of Experts |
10/2024 |
Parameter-Efficient Fine-Tuning of State Space Models (SDLoRA) |
10/2024 |
ElasticTok: Adaptive Tokenization for Image and Video |
10/2024 |
LeanAgent: Lifelong Learning for Formal Theorem Proving |
10/2024 |
LoLCATs: On Low-Rank Linearizing of Large Language Models |
10/2024 |
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads |
10/2024 |
SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction |
10/2024 |
A Little Human Data Goes A Long Way |
10/2024 |
SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training |
10/2024 |
Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs |
10/2024 |
FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs |
10/2024 |
LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging |
10/2024 |
AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning |
10/2024 |
Stick-breaking Attention |
10/2024 |
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training |
10/2024 |
HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation |
10/2024 |
UFT: Unifying Fine-Tuning of SFT and RLHF/DPO/UNA through a Generalized Implicit Reward Function |
10/2024 |
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference |
10/2024 |
EMMA: End-to-End Multimodal Model for Autonomous Driving |
10/2024 |
TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters |
11/2024 |
PatternBoost: Constructions in Mathematics with a Little Help from AI |
11/2024 |
Inference Optimal VLMs Need Only One Visual Token but Larger Models |
11/2024 |
LASER: Attention with Exponential Transformation |
11/2024 |
LSHBloom: Memory-efficient, Extreme-scale Document Deduplication |
11/2024 |
Aioli: A Unified Optimization Framework for Language Model Data Mixing |
11/2024 |
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning |
11/2024 |
More Expressive Attention with Negative Weights (Cog Attention) |
11/2024 |
The Surprising Effectiveness of Test-Time Training for Abstract Reasoning |
11/2024 |
Entropy Controllable Direct Preference Optimization |
11/2024 |
Cut Your Losses in Large-Vocabulary Language Models |
11/2024 |
Everything is a Video: Unifying Modalities through Next-Frame Prediction |
11/2024 |
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration |
11/2024 |