/ldg/ Wan 2.1 Install and Optimization Guide

This is a noob's guide to help you install Wan and implement every available optimization to maximize the speed of video generation. Achieving this involves trade-offs in quality, but you can disable any of the optimizations if you prefer to prioritize quality over speed. The idea is to offer the fastest possible generation speed in a single, basic workflow, which you can then tailor to your hardware and needs.

The included guide and workflows were created for NVIDIA GPUs with 24GB of VRAM, typically utilizing 21-23GB during inference. 64GB of system RAM is also recommended, along with having Comfy and the models on an SSD for fast swapping.

There are also options for systems with less than 24GB of VRAM, and the workflows can be tweaked to accommodate them. More info is below.

VRAM Requirements and Model Size (aka, "Can My 4GB GPU Run This?")

To check if your GPU's VRAM can handle a model, verify the model's file size, as it must fully load into VRAM for inference. For example, wan2.1-t2v-14b-Q8_0.gguf (15.9GB) and wan2.1-i2v-14b-480p-Q8_0.gguf (18.1GB) require at least 15.9GB and 18.1GB of VRAM, respectively. That's just to load the models.

Additional components like text encoders or CLIP models also use VRAM, and inference can add 2-5+GB more, depending on resolution and context/frame count, with 720p settings being particularly VRAM-intensive.
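
If you're not sure how much VRAM your card actually has, you can query it from a cmd.exe prompt with NVIDIA's own tool (it ships with the driver):

nvidia-smi --query-gpu=memory.total --format=csv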

To manage VRAM limitations, offload to RAM/CPU using the virtual_vram_gb setting in the UnetLoaderGGUFDisTorchMultiGPU node, though this slows generation, and you can only offload so much before speeds become unusable. Generally, if the model file size exceeds your total VRAM by no more than 5GB or so, you should be able to offload without sacrificing too much time. For example, with offloading, a 16GB GPU can run the 15.9GB or 18.1GB Q8 480p models. You can also offload components like the text encoder to a second GPU using the CLIPLoaderMultiGPU node.
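
As a rough way to pick a starting virtual_vram_gb value, add your model's file size, an inference overhead from the range above, and ~1GB of headroom, then subtract your total VRAM. The numbers below (18.1GB model, 4GB overhead, 16GB card) are only example values, so substitute your own, and treat the result as a starting point to tune from rather than an exact figure:

.\python_embeded\python.exe -c "model=18.1; overhead=4; vram=16; print(round(max(0, model + overhead + 1 - vram), 1), 'GB to offload')"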

If your VRAM is still insufficient, use a lower-quantization model (available in repositories listed in the installation section), which reduces VRAM needs but sacrifices accuracy and quality.

Prerequisite Steps - DO FIRST

  1. ComfyUI Portable. It's not mandatory, but I recommend a clean install of ComfyUI Portable, one created specifically for Wan.
  2. GIT. You might already have this. Open a cmd.exe prompt and enter "git". If the command isn't recognized, download it here.

Next, download these workflows, which are based on Comfy's original Wan workflows as well as user workflows and feedback from /ldg/ users. They're kept relatively barebones by design, so you can easily modify them to suit your needs. They use Alibaba's default settings as a baseline, but include the most important optimizations, a LoRA loading fix, and video interpolation, which uses AI to increase the framerate of the generated videos. The workflows output two videos: the raw 16 fps generation and an interpolated 32 fps version with smoother motion.

You can easily adapt these to use the 720P model/setting. See Generating at 720P.

Installation

  1. Ensure that ComfyUI is updated to the very latest version. (update_comfyui.bat in ComfyUI_windows_portable\update)
  2. Download these models. If you have less than 24GB of VRAM, you could also swap out the Q8 models for Q6/Q5/Q4, though you'll see a progressively larger drop in output quality the lower you go.

Do NOT use any other text encoder files with these models! Using a quantized version of umt5_xxl_fp16.safetensors can lead to errors! Using KJ's version of the text encoder will error out before generating with Exception during processing !!! mat1 and mat2 shapes cannot be multiplied (77x768 and 4096x5120)

  3. Download this bat file and save it as wan_autoinstall.bat in ComfyUI_windows_portable\
    Run the .bat file and select your GPU type. If you have a 50XX series card, it'll install pytorch 2.8.0dev, while 40XX and below will install 2.7.1 Stable. There's no functional difference in speed between the pytorch versions; however, some 40XX and 30XX users have reported issues with 2.8.0dev and WAN. The bat also installs the other requirements, all of which will drastically speed up your generations. Run the commands through an LLM to confirm it's safe, or run the steps within manually if you prefer.
  4. Make a copy of run_nvidia_gpu.bat in ComfyUI_windows_portable, and call it run_nvidia_gpu_optimizations.bat. Then change the first line to this :

    .\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --use-sage-attention --fast

  5. Run ComfyUI with run_nvidia_gpu_optimizations.bat. Look in the cmd.exe console window and make sure pytorch version: 2.7.1 or pytorch version: 2.8.0dev is shown during startup. You should also see Enabled fp16 accumulation and Using sage attention. Every time you start Comfy, make sure the pytorch version reads either 2.7.1 or 2.8.0dev, otherwise fp16_fast / fp16 accumulation won't work. (There's also a quick command-line check after this list.)
  6. Open one of the provided workflows. Run your first gen. The video interpolation model will automatically download once the node is activated.
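
Optional sanity check: to confirm the installer's packages actually landed in the embedded Python, this one-liner (run from ComfyUI_windows_portable\) should print the pytorch version without any import errors. The triton and sageattention import names are how those packages are normally exposed; adjust if your install differs:

.\python_embeded\python.exe -c "import torch, triton, sageattention; print(torch.__version__)"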

You're done!

Common Errors

IMPORTANT! There's an issue with Comfy where it sometimes boots up an old version of pytorch. It can happen :

  • when you first install 2.7.1/2.8.0dev and run Comfy
  • when you update Comfy
  • when you restart Comfy via Manager's restart button

When booting Comfy up, if the cmd.exe console window displays anything other than 2.7.1 (for 30XX & 40XX) or 2.8.0dev (for 50XX), restart Comfy manually. If it still isn't listing 2.7.1/2.8.0dev after you've restarted it once or twice, manually reinstall pytorch by running this in Comfy portable's root directory:

30XX or 40XX
.\python_embeded\python.exe -s -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --force-reinstall
50XX
.\python_embeded\python.exe -s -m pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128 --force-reinstall
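
Once the reinstall finishes, you can confirm what the embedded Python now reports before relaunching Comfy:

.\python_embeded\python.exe -c "import torch; print(torch.__version__)"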

If the workflow freezes during model loading with "Press any key to continue" in the cmd.exe window, you need to restart your computer.

If you get this error when running the workflow :

ImportError: DLL load failed while importing cuda_utils: The specified module could not be found.

Go to \users\username\ and open the .triton directory. Delete the cache subdirectory inside of it. Do not delete the entire .triton directory.
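
If you'd rather do it from a cmd.exe prompt, something like this should work (it assumes .triton is in the default location under your user profile):

rmdir /s /q "%USERPROFILE%\.triton\cache"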

If you get an error about :

SamplerCustomAdvanced returned non-zero exit status 1

Download this and extract it to ComfyUI_windows_portable\python_embeded

Important Notes

It is highly recommended you enable previews during generation. If you followed the guide, you'll have the extension required. Go to ComfyUI Settings (the cog icon at the bottom left) and search for "Display animated previews when sampling". Enable it. Then open Comfy Manager and set Preview method to TAESD (slow). At about step 10, the preview will clear up enough to get a general sense of the composition and movement. This can and will save you a lot of time, as you can cancel gens early if you don't like how they look.

The initial generation time estimate you get is NOT accurate. TeaCache kicks in during the gen, and AdaptiveGuidance about midway through if you have it enabled.

TorchCompile needs to compile when running your first gen. You'll see multiple lines in your cmd.exe window as it compiles (DeviceCopy in input program). Once it's finished, subsequent generations will be faster. It needs to recompile every time you restart Comfy or change your LoRA stack.

When a video finishes generating, you'll get two files in their own i2v or t2v directories and subdirectories. The raw files are the 16 fps outputs, while the int files are interpolated to 32 fps, which gives you much smoother motion.

NEVER use the 720p i2v model at 480p resolutions and vice versa. If you use the 720p i2v model and set your res to 832x480 for example, the output you get will be much worse than simply using the 480p i2v model. You won't ever improve quality by genning 480p on the 720p model, so don't do it. The only model which allows you to mix 480p and 720p resolutions is t2v 14B.

lightx2v + NAG (Huge Speed Increase)

This is an experimental implementation of Self Forcing and Normalized Attention Guidance (NAG) for WAN. In this case, Self Forcing is extracted from this distilled model and applied via a LoRA; it's an early, unofficial attempt.

Overview:

  • Extremely fast generation time. A 40 step 720p output can take up to or over 50 minutes on a 3090. With this 4-step method, it takes around 5 minutes
  • Works in conjunction with existing WAN LoRAs
  • Decreased visual quality, though it's less of an issue at 720p
  • Reduced motion fluidity and prompt adherence
  • Favors slow motion

To be clear, this isn't a drop-in replacement for normal WAN. If you do a side-by-side comparison between two gens with the same prompt and seed, you'll see there's a difference, and that a regular WAN output is superior to an output made using this method. You'll need to test it out to see whether the quality loss is acceptable to you given the large speed gains and the fact that it makes 720p a viable option.

This is a modified workflow for it with all relevant optimizations and settings. I've only added the I2V version, but you should be able to adapt the settings to T2V easily enough :

To test it, make sure both ComfyUI and KJNodes are up to date, then get the LoRA for Self Forcing here. Add it to your ComfyUI LoRA directory. Any other LoRAs should be added between it and Patch Model Patcher Order, or via a custom LoRA stacking node.

There's also this experimental and unofficial FlowMatch scheduler that's meant to be used with Self Forcing models. It's based on the official code, implemented as a Comfy node. Early tests show some small improvements to motion.
To use it, git clone https://github.com/BigStationW/flowmatch_scheduler-comfyui in custom_nodes, then delete "ModelSamplingSD3" from the workflow and replace "BasicScheduler" with the new "FlowMatchSigmas" node.
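
In other words, something like this from the portable root (the custom_nodes path follows the standard portable layout):

cd ComfyUI\custom_nodes
git clone https://github.com/BigStationW/flowmatch_scheduler-comfyui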

You can use the FusionX version of WAN with this by simply loading its gguf as a drop-in replacement. FusionX appears to produce much better motion fluidity when combined with lightx2v than regular WAN does. The T2V version seems outright superior to vanilla WAN, but the I2V version can sometimes drastically change faces due to one of the LoRAs merged into it. You can get the T2V ggufs here and the I2V ggufs here.
All the default settings work with it, including lightx2v, though you need to set lightx2v's LoRA strength to 0.6 instead.

Generating at 720P

For T2V :

  • Increase the resolution to 720p (1280x720 / 720x1280). The single 14B t2v model supports both 480p and 720p.

For I2V :

  • Select the i2v 720p model in UnetLoaderGGUFDisTorchMultiGPU
  • Increase the resolution to 720p (1280x720 / 720x1280).
  • Set Teacache coefficients to i2v_720. (Doesn't apply to the FAST workflow)
  • Set Teacache threshold to 0.2, which is the medium setting. Increase it to 0.3 for faster gens at the expense of a hit to output quality. (Doesn't apply to the FAST workflow)

For both T2V and I2V, you'll likely need to increase virtual_vram_gb accordingly if you don't have enough VRAM. In general, you want to keep around 1GB of VRAM free; the critical point is within 500MB of the limit. A 24GB card hitting 23.5GB of usage will usually offload in an inefficient way that massively increases gen times. The same rule applies to any GPU, e.g. a 16GB card hitting 15.5GB of usage.

VACE (Video Editing/Video to Video)

VACE 14B is a multi-modal model by the makers of Wan, and it's designed for video to video and video editing. There's no /ldg/ workflow for it at the moment, but you can try out the default Comfy workflow and implementation here. You'll need to add optimization nodes yourself.

Supported Resolutions

Each model in Wan 2.1 is trained and fine-tuned to work best at specific resolutions. Sticking to these supported resolutions generally delivers the sharpest, most reliable results, especially for i2v, where each model was apparently tailored to perform optimally at just two resolutions. Straying from these can, in theory, lead to subpar output.

Text to Video - 1.3B: 480x832, 832x480, 624x624, 704x544, 544x704
Text to Video - 14B: 720x1280, 1280x720, 960x960, 1088x832, 832x1088, 480x832, 832x480, 624x624, 704x544, 544x704
Image to Video - 480p: 832x480, 480x832
Image to Video - 720p: 1280x720, 720x1280

Guide On Using Non-Standard Resolutions on I2V

The two I2V models were trained at specific resolutions (480p and 720p), and using non-standard resolutions typically reduces output quality due to limited pixel space, though temporal issues or major coherence drops are minimal.

To maintain quality, keep resolutions close to the trained settings (480p or 720p). Lock one dimension (480 for the 480p model or 720 for the 720p model) and adjust the other downward to tweak aspect ratios. I.e., with the 480p i2v model, one dimension stays fixed at 480, while the other starts at a maximum of 832 and can be scaled down from there. For the 720p model, one dimension anchors at 720, with the other starting at 1280 and adjustable downward. This avoids cropping vital details while keeping the resolution as close to the training set as possible.
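
For example, to fit a 1080x1350 portrait image onto the 480p i2v model, lock the shorter side at 480 and scale the other side by the source aspect ratio, capping it at 832. The arithmetic can be done anywhere; here it is with the embedded Python (the 1080x1350 source size is just an example):

.\python_embeded\python.exe -c "w, h = 1080, 1350; print(480, min(832, round(h / w * 480)))"

Here that gives a generation resolution of 480x600.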

The Optimizations

Several options in this guide speed up inference time. They are fp16_fast (fp16 accumulation), TeaCache, Torch Compile, AdaptiveGuidance and Sage Attention. If you wish to disable them for testing or to increase quality at the expense of time, do the following :

  • fp16_fast : remove --fast from run_nvidia_gpu_optimizations.bat.
  • Sage Attention : remove --use-sage-attention from run_nvidia_gpu_optimizations.bat
  • AdaptiveGuidance : set the AdaptiveGuidance node to a threshold of 1
  • Torch Compile : right click on the TorchCompileModelWanVideo node and click Bypass
  • TeaCache : right click the TeaCache node and click Bypass

Changelog

17/06/25

  • Added Self Forcing+NAG section
  • Updated .bat installer with GPU selection for pytorch, given 2.8.0dev seems to be causing OOM's with batch runs on 3090 GPU's
  • Added ComfyUI-Crystools to auto installer for resource monitoring, good to see at a glance if you're using too much VRAM
  • Updated TorchCompileModelWanVideo to TorchCompileModelWanVideoV2

03/06/25

  • Removed KJ section and workflows given nobody was using them and native still remains the best option
  • Added section on RifleXRoPE and VACE

10/05/25

  • Fixed .bat file attempting to download a broken version of torchvision

26/04/25

  • Changed FILM VFI's clear cache from 20 to 10 to prevent OOM's under certain conditions
  • Video files now output to date-formatted directories, with the seed in the filename

22/03/25

  • changed requirement from CUDA 12.6 to 12.8
  • updated pytorch to 2.8.0.dev20250317
  • updated Triton and Sage
  • 50XX series should work with this setup
  • streamlined install process

21/03/25

  • Comfy Workflows: added patch for TorchCompile issue leading to LoRA's being broken in Comfy Native

Other /ldg/ rentries I maintain
