/ldg/ Wan 2.1 Install and Optimization Guide

This is a noob's guide to help you install Wan and implement every available optimization to maximize the speed of video generation. Achieving this involves trade-offs in quality, but you can disable any of the optimizations if you prefer to prioritize quality over speed. The idea is to offer the fastest possible generation speed in a single, basic workflow, which you can then tailor to your hardware and needs.

The included guide and workflows were created for NVIDIA GPUs with 24GB of VRAM, typically utilizing 21-23GB during inference. 64GB of system RAM is also recommended, along with having Comfy and the models on an SSD for fast swapping.

There are also options for systems with less than 24GB of VRAM, and the workflows can be tweaked to accommodate them. More info is below.

VRAM Requirements and Model Size (aka, "Can My 4GB GPU Run This?")

To determine if your GPU's VRAM can handle a specific model, start by checking the model's file size, as the entire model must be loaded into VRAM for inference. For instance, the wan2.1-t2v-14b-Q8_0.gguf model is 15.9GB, and the wan2.1-i2v-14b-480p-Q8_0.gguf model is 18.1GB, requiring at least 15.9GB and 18.1GB of VRAM, respectively.

However, loading the model is only part of the equation. Additional components, such as text encoders or CLIP models, also consume VRAM. On top of that, inference/generation demands even more VRAM. The extra overhead can equate to 2-5+GB more VRAM usage, depending on resolution and context/frame count, with higher settings potentially increasing this further. The 720p settings/versions of the models are particularly demanding.

You can offset this to an extent by offloading to RAM/CPU (virtual_vram_gb in the UnetLoaderGGUFDisTorchMultiGPU node) at the expense of increased generation time; for example, a 16GB GPU can run the 15.9GB and 18.1GB Q8 480p t2v and i2v models with offloading. You can also offload models like the text encoder to a second GPU if you have one, using the CLIPLoaderMultiGPU node.
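If you want a quick sanity check before downloading anything, the arithmetic is simple enough to script. This is only a rough sketch of the estimate described above (file size plus the 2-5GB overhead range), using an assumed mid-range overhead value; it isn't anything Comfy calculates for you:

    # Rough fit check: model file size + inference overhead vs. available VRAM.
    # The 2-5GB overhead range comes from this guide; 4GB is an assumed midpoint.
    def suggested_offload_gb(model_file_gb, vram_gb, overhead_gb=4.0):
        required = model_file_gb + overhead_gb
        # Anything that doesn't fit is a starting point for virtual_vram_gb
        return max(0.0, required - vram_gb)

    # Example: the 18.1GB i2v Q8 model on a 16GB card
    print(suggested_offload_gb(18.1, 16.0))   # ~6.1GB -> try virtual_vram_gb around 6-7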

If the model is too large even for offloading to offset, you'll need a version quantized at a lower level; these are available in the repositories listed in the installation section below. A quantization is essentially a lower-precision version of the model, which reduces VRAM requirements at the cost of accuracy and quality.

Prerequisite Steps - DO FIRST

  • ComfyUI Portable
  • CUDA 12.8
  • Git. Open a cmd.exe prompt and enter "git". If the command isn't recognized, download it here.

It's not mandatory, but I recommend a clean install of ComfyUI Portable, one created specifically for Wan. This is because the specific pytorch, sage, triton and CUDA installs required by this guide could cause issues with existing installations/workflows/nodes meant for image/audio generation in ComfyUI if those nodes/extensions require specific or non-nightly versions of these libraries.

Next, download these modified versions of Comfy's workflows. They're kept relatively barebones by design, so you can easily modify them to your needs. They use Alibaba's default settings as a baseline, but include the most important optimizations, a LoRA loading fix, and video interpolation, which uses AI to increase the framerate of the generated videos. The workflows output two videos: the raw 16 fps generation and an interpolated 32 fps version.

/ldg/ Comfy I2V 480p workflow: ldg_cc_i2v_14b_480p.json
(updated 17th June 2025)

/ldg/ Comfy T2V 480p workflow: ldg_cc_t2v_14b_480p.json
(updated 17th June 2025)

You can easily adapt these to use the 720P model/setting. See Generating at 720P.

Installation

  1. Ensure that ComfyUI is updated to the very latest version. (update_comfyui.bat in ComfyUI_windows_portable\update)
  2. Download these models. If you have less than 24GB of VRAM, you could also swap out the Q8 models for Q6/Q5/Q4, though you'll see a progressively larger drop in output quality the lower you go.

Do NOT use any other text encoder files with these models! Using a quantized version of umt5_xxl_fp16.safetensors can lead to errors, and using KJ's version of the text encoder will error out before generating with "Exception during processing !!! mat1 and mat2 shapes cannot be multiplied (77x768 and 4096x5120)".

  3. Download this bat file and save it as wan_autoinstall.bat in ComfyUI_windows_portable\
    Run the .bat file and select your GPU type. If you have a 50XX series card, it'll install pytorch 2.8.0dev (needed for Blackwell GPUs), while 40XX and below will install 2.7.1 stable. It also installs the other requirements, all of which will drastically speed up your generations. Run the commands through an LLM to confirm it's safe, or run the steps within manually if you prefer.
  4. Make a copy of run_nvidia_gpu.bat in ComfyUI_windows_portable and call it run_nvidia_gpu_optimizations.bat. Then change the first line to this :

    .\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --use-sage-attention --fast

  5. Run ComfyUI with run_nvidia_gpu_optimizations.bat. Look in the cmd.exe console window and make sure pytorch version: 2.7.1 or pytorch version: 2.8.0dev is shown during startup. You should also see Enabled fp16 accumulation and Using sage attention. Make sure that every time you start Comfy, the pytorch version reads either 2.7.1 or 2.8.0dev, otherwise fp16_fast / fp16 accumulation won't work.
  6. Open one of the provided workflows. Run your first gen. The video interpolation model will automatically download once the node is activated.

You're done!

Common Errors

IMPORTANT! There's an issue with Comfy where it sometimes boots with an old version of pytorch. It can happen :

  • when you first install 2.7.1/2.8.0dev and run Comfy
  • when you update Comfy
  • when you restart Comfy via Manager's restart button

On booting Comfy up, if the cmd.exe console window displays anything but 2.7.1 (for 30XX & 40XX) or 2.8.0dev (for 50XX), restart Comfy manually. If it still isn't listing 2.7.1/2.8.0dev after you've restarted it once or twice, try to manually install pytorch again by running this in Comfy portable's root directory:

30XX or 40XX
.\python_embeded\python.exe -s -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --force-reinstall
50XX
.\python_embeded\python.exe -s -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128 --force-reinstall
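You can also check which pytorch version the embedded Python actually has installed (independently of what Comfy prints at startup) with a standard one-liner, run from the same directory:

    .\python_embeded\python.exe -s -c "import torch; print(torch.__version__)"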

If the workflow freezes during model loading with "Press any key to continue" in the cmd.exe window, you need to restart your computer.

If you get this error when running the workflow :

ImportError: DLL load failed while importing cuda_utils: The specified module could not be found.

Go to \users\username\ and open the .triton directory. Delete the cache subdirectory inside of it. Do not delete the entire .triton directory.
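If you'd rather do it from a cmd.exe prompt, something like this should work, assuming .triton is in the default location under your user profile:

    rmdir /s /q "%USERPROFILE%\.triton\cache"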

If you get an error about :

SamplerCustomAdvanced returned non-zero exit status 1

Download this and extract it to ComfyUI_windows_portable\python_embeded

Important Notes

It is highly recommended you enable previews during generation. If you followed the guide, you'll have the extension required. Go to ComfyUI Settings (the cog icon at the bottom left) and search for "Display animated previews when sampling". Enable it. Then open Comfy Manager and set Preview method to TAESD (slow). At about step 10, the preview will clear up enough to get a general sense of the composition and movement. This can and will save you a lot of time, as you can cancel gens early if you don't like how they look.

The initial generation time estimate you get is NOT accurate. TeaCache kicks in during the gen, and AdaptiveGuidance about midway through if you have it enabled.

TorchCompile needs to compile when running your first gen. You'll see multiple lines in your cmd.exe window as it compiles (DeviceCopy in input program). Once it's finished, subsequent generations will be faster. It needs to recompile every time you restart Comfy or change your LoRA stack.

When a video finishes generating, you'll get two files in their own i2v or t2v directories and subdirectories. The raw files are the 16 fps outputs, while the int files are interpolated to 32 fps, which gives you much smoother motion.

NEVER use the 720p i2v model at 480p resolutions and vice versa. If you use the 720p i2v model and set your res to 832x480 for example, the output you get will be much worse than simply using the 480p i2v model. You won't ever improve quality by genning 480p on the 720p model, so don't do it. The only model which allows you to mix 480p and 720p resolutions is t2v 14B.

lightx2v + NAG (Huge Speed Increase)

This is an experimental implementation of Self Forcing and Normalized Attention Guidance (NAG) for Wan. In this case, Self Forcing is extracted from this distilled model and applied via a LoRA; it's an early, unofficial attempt. The LoRA is also made for T2V, but works with I2V.

There's a massive decrease in generation time (e.g. a 3090 can generate a 720p video in around 5 minutes versus around 50 minutes without it), and it appears to work with pre-existing Wan LoRAs. However, this early implementation has several trade-offs: a decrease in visual quality (more noticeable at 480p than at 720p), reduced motion fluidity, and gens tending to favor slow-motion movement.

As it stands, this isn't a drop-in replacement for normal WAN. If you do a side-by-side comparison between two gens with the same prompt and seed, you'll see there's a difference, and that a regular WAN output is superior to a Light+NAG output. You'll need to test it out to see whether the quality loss is acceptable to you.

I'll update this section as it progresses, but here's a modified version of the 480p workflow with it added and the settings changed to accommodate it. It's easy enough to adapt the nodes/settings to t2v, so I'll only include the i2v for now :

/ldg/ Comfy I2V 480p FAST workflow: ldg_cc_i2v_FAST_14b_480p.json (updated 17th June 2025)

To test it, make sure both ComfyUI and KJNodes are up to date, then get the LoRA for Self Forcing here. Add it to your ComfyUI LoRA directory. Any other LoRAs should be added between it and Patch Model Patcher Order, or via a custom LoRA stacking node. You can set the lightx2v LoRA strength anywhere from 0.8 to 1.0; you might get better motion between 0.8 and 0.9.

There's also this experimental and unofficial FlowMatch scheduler that's meant to be used with Self Forcing models. It's based on the official code, implemented as a Comfy node. Early tests show some small improvements to motion. To use it, run git clone https://github.com/BigStationW/flowmatch_scheduler-comfyui inside custom_nodes, then delete "ModelSamplingSD3" from the workflow and replace "BasicScheduler" with the new "FlowMatchSigmas" node.

Generating more than 5 seconds of video

If you try to set more than 5 seconds/81 frames in the EmptyHunyuanLatentVideo node (length=81), the video will simply begin to loop after the first 5 seconds, because the model is designed to output only 81 frames. However, you can go past 5 seconds without looping thanks to RifleXRoPE for Wan, which can give you an extra 3 seconds at the cost of increased generation time.

To do so, make sure KJNodes is up to date, double-click on the workflow canvas, search for "Apply RifleXRoPE WanVideo" and integrate it (place it between ModelSamplingSD3 and BasicScheduler for the "model" connection, and attach a "latent" connection from EmptyHunyuanLatentVideo or WanImageToVideo for t2v and i2v respectively). Then set EmptyHunyuanLatentVideo's length from 81 to 129, which gives you 8 seconds of video instead of 5.
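For reference, those length values line up with Wan's 16 fps output plus one initial frame (81 = 5x16 + 1, 129 = 8x16 + 1). Here's a tiny sketch of that arithmetic for picking a length value; the formula is inferred from those two figures rather than taken from any official documentation:

    # Convert a target duration to a "length" value; assumes 16 fps plus one
    # starting frame, which matches the 81-frame (5s) and 129-frame (8s) figures above
    def wan_length(seconds, fps=16):
        return seconds * fps + 1

    print(wan_length(5))   # 81  (default, no RifleXRoPE)
    print(wan_length(8))   # 129 (with RifleXRoPE)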

Generating at 720P

If you want to use the 720p model in i2v or 720p res on t2v, you'll need to:

  • On t2v, increase the resolution to 720p (1280x720 / 720x1280). The single 14B t2v model supports both 480p and 720p.
  • On i2v, select the i2v 720p model in the model loader, then set the width and height of your input image to 1280x720 or 720x1280.
  • Set the TeaCache coefficients to i2v_720.
  • Set the TeaCache threshold to 0.2, which is the medium setting. Increase it to 0.3 for faster gens at the expense of a hit to output quality.
  • Increase virtual_vram_gb.
    On a 24GB GPU, increase it until you're using just under 23GB in total. You never want to reach or exceed 23.5GB of usage, or you'll either OOM or massively increase gen times. The same margin applies to any GPU; if you have 16GB, don't exceed 15.5GB and try to keep it at or under 15GB.

VACE (Video Editing/Video to Video)

VACE 14B is a multi-modal model by the makers of Wan, and it's designed for video to video and video editing. There's no /ldg/ workflow for it at the moment, but you can try out the default Comfy workflow and implementation here. You'll need to add optimization nodes yourself.

Supported Resolutions

Each model in Wan 2.1 is trained and fine-tuned to work best at specific resolutions. Sticking to these supported resolutions generally delivers the sharpest, most reliable results, especially for i2v, where each model was apparently tailored to perform optimally at just two resolutions. Straying from these can, in theory, lead to subpar output.

Text to Video - 1.3B    Text to Video - 14B    Image to Video - 480p    Image to Video - 720p
480x832                 720x1280               832x480                  1280x720
832x480                 1280x720               480x832                  720x1280
624x624                 960x960
704x544                 1088x832
544x704                 832x1088
                        480x832
                        832x480
                        624x624
                        704x544
                        544x704

Guide On Using Non-Standard Resolutions on I2V

Though the two I2V models were trained at two specific resolutions each, using non-standard resolutions mainly seems to give you lower-quality outputs due to the reduced pixel space the model has to work with. I haven't noticed any glaring temporal issues or a huge drop in coherence from doing so.

That said, I generally try to keep my own outputs as close to the original resolutions as possible, avoiding extreme shifts from the standard 480p or 720p settings. I prefer to lock one dimension - either 480 for 480p models or 720 for 720p models - and adjust the other dimension downward (never upward) to tweak the aspect ratio as needed.

So with the 480p i2v model, one dimension stays fixed at 480, while the other starts at a maximum of 832 and can be scaled down from there. For the 720p model, one dimension anchors at 720, with the other starting at 1280 and adjustable downward.
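Here's a minimal sketch of that sizing logic, in case you want to script it. The rounding down to a multiple of 16 is my own conservative assumption to keep the dimensions latent-friendly, not a documented requirement:

    def i2v_resolution(img_w, img_h, short_side=480, max_long=832):
        # Lock the short side at 480 (or 720 with max_long=1280), scale the long
        # side from the input image's aspect ratio, and cap it at the model's
        # maximum. Never scale upward past that cap.
        landscape = img_w >= img_h
        aspect = max(img_w, img_h) / min(img_w, img_h)
        long_side = min(max_long, int(short_side * aspect))
        long_side -= long_side % 16   # assumption: round down to a multiple of 16
        return (long_side, short_side) if landscape else (short_side, long_side)

    print(i2v_resolution(1920, 1080))   # (832, 480) for a 16:9 source image
    print(i2v_resolution(1000, 1000))   # (480, 480) for a square source image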

This might be a better solution if you don't want to crop out vital details to make the image fit, and it beats adding black vertical or horizontal bars to force the image into an exact 480p or 720p aspect ratio, given Wan doesn't seem to like letterboxing.

Still, nothing is stopping you from generating videos at, say, 512x512 on the 480p model; it's just that the quality will be worse than if you found a way to crop/scale the image closer to a resolution the model was trained on.

The Optimizations

Several options in this guide speed up inference time. They are fp16_fast (fp16 accumulation), TeaCache, Torch Compile, AdaptiveGuidance and Sage Attention. If you wish to disable them for testing or to increase quality at the expense of time, do the following :

  • fp16_fast : remove --fast from run_nvidia_gpu_optimizations.bat
  • Sage Attention : remove --use-sage-attention from run_nvidia_gpu_optimizations.bat
  • AdaptiveGuidance : set the AdaptiveGuidance node to a threshold of 1
  • Torch Compile : right click on the TorchCompileModelWanVideo node and click Bypass
  • TeaCache : right click the TeaCache node and click Bypass

Changelog

17/06/25

  • Added Self Forcing+NAG section
  • Updated .bat installer with GPU selection for pytorch, given 2.8.0dev seems to be causing OOMs with batch runs on 3090 GPUs
  • Added ComfyUI-Crystools to auto installer for resource monitoring, good to see at a glance if you're using too much VRAM
  • Updated TorchCompileModelWanVideo to TorchCompileModelWanVideoV2

03/06/25

  • Removed KJ section and workflows given nobody was using them and native still remains the best option
  • Added section on RifleXRoPE and VACE

10/05/25

  • Fixed .bat file attempting to download a broken version of torchvision

26/04/25

  • Changed FILM VFI's clear cache from 20 to 10 to prevent OOMs under certain conditions
  • Video files now output to date-formatted directories, with the seed in the filename

22/03/25

  • changed requirement from CUDA 12.6 to 12.8
  • updated pytorch to 2.8.0.dev20250317
  • updated Triton and Sage
  • 50XX series should work with this setup
  • streamlined install process

21/03/25

  • Comfy Workflows: added patch for TorchCompile issue leading to LoRAs being broken in Comfy Native
