The foundational library for the LTX-2 Audio-Video generation model. This package contains the raw model definitions, component implementations, and loading logic used by ltx-pipelines and ltx-trainer.
- components/: Modular diffusion components (Schedulers, Guiders, Noisers, Patchifiers) following standard protocols
- conditioning/: Tools for preparing latent states and applying conditioning (image, video, keyframes)
- guidance/: Perturbation system for fine-grained control over attention mechanisms
- loader/: Utilities for loading weights from .safetensors, fusing LoRAs, and managing memory
- model/: PyTorch implementations of the LTX-2 Transformer, Video VAE, Audio VAE, Vocoder, and Upscaler
- text_encoders/gemma/: Gemma text encoder implementation with tokenizers, feature extractors, and separate encoders for audio-video and video-only generation

ltx-core provides the building blocks (models, components, and utilities) needed to construct inference flows. For ready-made inference pipelines use ltx-pipelines; for training, use ltx-trainer.
```bash
# From the repository root
uv sync --frozen

# Or install as a package
pip install -e packages/ltx-core
```
ltx-core provides modular components that can be combined to build custom inference flows:
- Transformer (model/transformer/): The asymmetric dual-stream LTX-2 transformer (14B-parameter video stream, 5B-parameter audio stream) with bidirectional cross-modal attention for joint audio-video processing. Expects inputs in Modality format
- Video VAE (model/video_vae/): Encodes/decodes video pixels to/from latent space with temporal and spatial compression
- Audio VAE (model/audio_vae/): Encodes/decodes audio spectrograms to/from latent space
- Vocoder (model/audio_vae/): Neural vocoder that converts mel spectrograms to audio waveforms
- Text encoder (text_encoders/): Gemma 3-based multilingual encoder with multi-layer feature extraction and thinking tokens; produces separate embeddings for video and audio conditioning
- Upsampler (model/upsampler/): Upsamples latent representations for higher-resolution generation
- Schedulers (components/schedulers.py): Noise schedules (LTX2Scheduler, LinearQuadratic, Beta) that control the denoising process
- Guiders (components/guiders.py): Guidance strategies (CFG, STG, APG) for controlling generation quality and adherence to prompts
- Noisers (components/noisers.py): Add noise to latents according to the diffusion schedule
- Patchifiers (components/patchifiers.py): Convert between spatial latents [B, C, F, H, W] and sequence format [B, seq_len, dim] for transformer processing
- Conditioning (conditioning/): Tools for preparing and applying various conditioning types (image, video, keyframes)
- Guidance (guidance/): Perturbation system for fine-grained control over attention mechanisms (e.g., skipping specific attention layers)
- Loader (loader/): Model loading from .safetensors, LoRA fusion, weight remapping, and memory management

For complete, production-ready pipeline implementations that combine these building blocks, see the ltx-pipelines package.
This section provides a deep dive into the internal architecture of the LTX-2 Audio-Video generation model.
LTX-2 is an asymmetric dual-stream diffusion transformer that jointly models the text-conditioned distribution of video and audio signals, capturing true joint dependencies (unlike sequential T2V→V2A pipelines).
```
┌─────────────────────────────────────────────────────────────┐
│                     INPUT PREPARATION                       │
│                                                             │
│  Video Pixels   → Video VAE Encoder → Video Latents         │
│  Audio Waveform → Audio VAE Encoder → Audio Latents         │
│  Text Prompt    → Gemma 3 Encoder   → Text Embeddings       │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│    LTX-2 ASYMMETRIC DUAL-STREAM TRANSFORMER (48 Blocks)     │
│                                                             │
│  ┌──────────────────────┐      ┌──────────────────────┐     │
│  │  Video Stream (14B)  │      │  Audio Stream (5B)   │     │
│  │                      │      │                      │     │
│  │  3D RoPE (x,y,t)     │      │  1D RoPE (temporal)  │     │
│  │                      │      │                      │     │
│  │  Self-Attn           │      │  Self-Attn           │     │
│  │  Text Cross-Attn     │      │  Text Cross-Attn     │     │
│  │                      │◄────►│                      │     │
│  │  A↔V Cross-Attn      │      │  A↔V Cross-Attn      │     │
│  │  (1D temporal RoPE)  │      │  (1D temporal RoPE)  │     │
│  │  Cross-modality      │      │  Cross-modality      │     │
│  │  AdaLN               │      │  AdaLN               │     │
│  │  Feed-Forward        │      │  Feed-Forward        │     │
│  └──────────────────────┘      └──────────────────────┘     │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                      OUTPUT DECODING                        │
│                                                             │
│  Video Latents   → Video VAE Decoder → Video Pixels         │
│  Audio Latents   → Audio VAE Decoder → Mel Spectrogram      │
│  Mel Spectrogram → Vocoder → Audio Waveform (24 kHz)        │
└─────────────────────────────────────────────────────────────┘
```
The core of LTX-2 is an asymmetric dual-stream diffusion transformer with 48 layers that processes both video and audio tokens simultaneously. The architecture allocates 14B parameters to the video stream and 5B parameters to the audio stream, reflecting the different information densities of the two modalities.
Source: src/ltx_core/model/transformer/model.py
The LTXModel class implements the transformer and supports both video-only and audio-video generation modes. For actual usage, see the ltx-pipelines package, which handles model loading and initialization.
Source: src/ltx_core/model/transformer/transformer.py
Each dual-stream block performs four operations sequentially:
```
┌─────────────────────────────────────────────────────────────┐
│                      TRANSFORMER BLOCK                      │
│                                                             │
│  VIDEO (14B): Input → RMSNorm → AdaLN → Self-Attn →         │
│               RMSNorm → Text Cross-Attn →                   │
│               RMSNorm → AdaLN → A↔V Cross-Attn (1D RoPE) →  │
│               RMSNorm → AdaLN → FFN → Output                │
│                                                             │
│  AUDIO (5B):  Input → RMSNorm → AdaLN → Self-Attn →         │
│               RMSNorm → Text Cross-Attn →                   │
│               RMSNorm → AdaLN → A↔V Cross-Attn (1D RoPE) →  │
│               RMSNorm → AdaLN → FFN → Output                │
│                                                             │
│  RoPE: Video=3D (x,y,t), Audio=1D (t), Cross-Attn=1D (t)    │
│  AdaLN: Timestep-conditioned, cross-modality for A↔V CA     │
└─────────────────────────────────────────────────────────────┘
```
Bidirectional cross-attention enables tight temporal alignment: the video and audio streams exchange information in both directions using 1D temporal RoPE (synchronization only, no spatial alignment), while AdaLN gates conditioned on each modality's timestep keep the two streams synchronized.
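The bidirectional exchange can be sketched in isolation. This is a deliberately tiny stand-in, not the LTXModel implementation: the dimensions and class name are illustrative, LayerNorm stands in for RMSNorm, and text cross-attention, RoPE, and AdaLN gating are omitted:

```python
import torch
import torch.nn as nn

class ToyDualStreamBlock(nn.Module):
    """Each modality runs self-attention on its own tokens, then attends
    to the other modality's tokens (bidirectional A<->V cross-attention).
    kdim/vdim let the two streams keep different widths, mirroring the
    asymmetric 14B/5B split."""

    def __init__(self, video_dim=64, audio_dim=32, heads=4):
        super().__init__()
        self.v_norm = nn.LayerNorm(video_dim)  # real blocks use RMSNorm
        self.a_norm = nn.LayerNorm(audio_dim)
        self.v_self = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.a_self = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(video_dim, heads, kdim=audio_dim,
                                         vdim=audio_dim, batch_first=True)
        self.v2a = nn.MultiheadAttention(audio_dim, heads, kdim=video_dim,
                                         vdim=video_dim, batch_first=True)

    def forward(self, video, audio):
        vn, an = self.v_norm(video), self.a_norm(audio)
        v = video + self.v_self(vn, vn, vn)[0]
        a = audio + self.a_self(an, an, an)[0]
        # Bidirectional exchange: video queries audio, audio queries video.
        v = v + self.a2v(v, a, a)[0]
        a = a + self.v2a(a, v, v)[0]
        return v, a

block = ToyDualStreamBlock()
video = torch.randn(2, 40, 64)   # [B, video_tokens, video_dim]
audio = torch.randn(2, 24, 32)   # [B, audio_tokens, audio_dim]
v_out, a_out = block(video, audio)
assert v_out.shape == video.shape and a_out.shape == audio.shape
```

Because both cross-attention directions are residual additions, each stream keeps its own width and token count while absorbing information from the other.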
The transformer supports perturbations that selectively skip attention operations.
Perturbations allow you to disable specific attention mechanisms during inference, which is useful for guidance techniques like STG (Spatio-Temporal Guidance).
Supported Perturbation Types:
- SKIP_VIDEO_SELF_ATTN: Skip video self-attention
- SKIP_AUDIO_SELF_ATTN: Skip audio self-attention
- SKIP_A2V_CROSS_ATTN: Skip audio-to-video cross-attention
- SKIP_V2A_CROSS_ATTN: Skip video-to-audio cross-attention

Perturbations are used internally by guidance mechanisms such as STG. For usage examples, see the ltx-pipelines package.
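The mechanism can be sketched as follows. The enum and block here are toy stand-ins (not the ltx-core types): the point is that a skipped attention operation simply leaves the residual stream untouched, which is what STG-style guidance exploits by comparing a full pass against a perturbed pass:

```python
import torch
import torch.nn as nn
from enum import Enum, auto

class Perturbation(Enum):
    """Illustrative stand-in for the perturbation types listed above."""
    SKIP_VIDEO_SELF_ATTN = auto()
    SKIP_AUDIO_SELF_ATTN = auto()
    SKIP_A2V_CROSS_ATTN = auto()
    SKIP_V2A_CROSS_ATTN = auto()

class ToyVideoBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, perturbations=frozenset()):
        # When the attention op is skipped, the residual connection
        # makes the block an identity for that operation.
        if Perturbation.SKIP_VIDEO_SELF_ATTN not in perturbations:
            x = x + self.attn(x, x, x)[0]
        return x

block = ToyVideoBlock()
x = torch.randn(1, 10, 64)
full = block(x)
skipped = block(x, {Perturbation.SKIP_VIDEO_SELF_ATTN})
assert torch.equal(skipped, x)   # skipped pass is the identity
```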
The Video VAE (src/ltx_core/model/video_vae/) encodes video pixels into latent representations and decodes them back.
Encoding: [B, 3, F, H, W] pixels → [B, 128, F', H/32, W/32] latents

- F' = 1 + (F-1)/8 (frame count must satisfy (F-1) % 8 == 0)
- Example: [B, 3, 33, 512, 512] → [B, 128, 5, 16, 16]

Decoding: [B, 128, F, H, W] latents → [B, 3, F', H*32, W*32] pixels

- F' = 1 + (F-1)*8
- Example: [B, 128, 5, 16, 16] → [B, 3, 33, 512, 512]

The Video VAE is used internally by pipelines for encoding video pixels to latents and decoding latents back to pixels. For usage examples, see the ltx-pipelines package.
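The shape arithmetic above can be written out as a small helper (a sketch for checking sizes, not a ltx-core function):

```python
def video_latent_shape(B, F, H, W, channels=128,
                       temporal_factor=8, spatial_factor=32):
    """Latent shape for [B, 3, F, H, W] pixels under the compression
    factors above: 8x temporal, 32x spatial, 128 latent channels."""
    assert (F - 1) % temporal_factor == 0, "frame count must be 8k+1"
    assert H % spatial_factor == 0 and W % spatial_factor == 0
    F_lat = 1 + (F - 1) // temporal_factor
    return (B, channels, F_lat, H // spatial_factor, W // spatial_factor)

# Reproduces the example above: 33 frames at 512x512.
assert video_latent_shape(1, 33, 512, 512) == (1, 128, 5, 16, 16)
```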
The Audio VAE (src/ltx_core/model/audio_vae/) processes audio spectrograms.
Compact neural audio representation optimized for diffusion-based training. Natively supports stereo: processes two-channel mel-spectrograms (16 kHz input) with channel concatenation before encoding.
Encoding: [B, mel_bins, T] → [B, 8, T/4, 16] latents (4× temporal downsampling, 8 channels, 16 mel bins in latent space, ~1/25 s per token, 128-dim feature vector)

Decoding: [B, 8, T, 16] → [B, mel_bins, T*4] mel spectrogram
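As with the Video VAE, the audio latent shape follows directly from the downsampling factors. A small sketch (the helper name and the 128 mel bins in the example are illustrative, not ltx-core API):

```python
def audio_latent_shape(B, mel_bins, T, temporal_factor=4):
    """Latent shape for a [B, mel_bins, T] mel spectrogram under the
    4x temporal downsampling above (8 channels, 16 latent mel bins)."""
    assert T % temporal_factor == 0
    return (B, 8, T // temporal_factor, 16)

# E.g. a 400-frame mel spectrogram with 128 bins (illustrative sizes).
assert audio_latent_shape(1, 128, 400) == (1, 8, 100, 16)
```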
The Audio VAE is used internally by pipelines for encoding mel spectrograms to latents and decoding latents back to mel spectrograms. The vocoder converts mel spectrograms to audio waveforms. For usage examples, see the ltx-pipelines package.
LTX-2 uses Gemma 3 (the 12B variant) as its multilingual text-encoder backbone, located in src/ltx_core/text_encoders/gemma/. Advanced text understanding is critical not only for broad language support but also for the phonetic and semantic accuracy of generated speech.
The text conditioning pipeline consists of three stages:
1. Multi-layer feature extraction: hidden states from multiple Gemma layers are stacked into [B, T, D, L]
2. Connector: flattens to [B, T, D×L] and projects via a learnable matrix W, jointly optimized with LTX-2 while the LLM weights stay frozen (Embeddings1DConnector)
3. Modality-specific encoding: separate projections produce distinct video and audio contexts

Encoders:
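The flatten-and-project stage can be sketched in a few lines. The hidden size D and layer count L below are illustrative placeholders, and the bare `nn.Linear` layers stand in for the actual Embeddings1DConnector; only the output widths (4096 for video, 2048 for audio) come from the Output Format section below:

```python
import torch
import torch.nn as nn

B, T, D, L = 2, 16, 3840, 4   # D, L: illustrative hidden size / layer count

# Stage 1: hidden states stacked across L extracted layers -> [B, T, D, L]
features = torch.randn(B, T, D, L)

# Stage 2: flatten the layer axis and project with learnable matrices
# (trained jointly with LTX-2; the Gemma weights themselves stay frozen).
flat = features.reshape(B, T, D * L)              # [B, T, D*L]
video_proj = nn.Linear(D * L, 4096, bias=False)
audio_proj = nn.Linear(D * L, 2048, bias=False)

video_context = video_proj(flat)                  # [B, T, 4096]
audio_context = audio_proj(flat)                  # [B, T, 2048]
assert video_context.shape == (B, T, 4096)
assert audio_context.shape == (B, T, 2048)
```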
- AVGemmaTextEncoderModel: Audio-video generation (two connectors → AVGemmaEncoderOutput with separate video/audio contexts)
- VideoGemmaTextEncoderModel: Video-only generation (single connector → VideoGemmaEncoderOutput)

System prompts are also used to enhance users' prompts:

- gemma_t2v_system_prompt.txt
- gemma_i2v_system_prompt.txt

Important: Video and audio receive different context embeddings, even from the same prompt. This allows better modality-specific conditioning and enables the model to synthesize speech that is synchronized with visible lip movement while remaining natural in cadence, accent, and emotional tone.
Output Format:
- Video context: [B, seq_len, 4096] video-specific text embeddings
- Audio context: [B, seq_len, 2048] audio-specific text embeddings

The text encoder is used internally by pipelines. For usage examples, see the ltx-pipelines package.
The Upscaler (src/ltx_core/model/upsampler/) upsamples latent representations for higher-resolution output.
The spatial upsampler is used internally by two-stage pipelines (e.g., TI2VidTwoStagesPipeline, ICLoraPipeline) to upsample low-resolution latents before final VAE decoding. For usage examples, see the ltx-pipelines package.
Here's how all the components work together conceptually (src/ltx_core/components/):
Pipeline Steps:
- Patchify: convert latents [B, C, F, H, W] to sequence format [B, seq_len, dim] for the transformer

Available pipelines:

- TI2VidTwoStagesPipeline: Two-stage text-to-video (recommended)
- ICLoraPipeline: Video-to-video with IC-LoRA control
- DistilledPipeline: Fast inference with distilled model
- KeyframeInterpolationPipeline: Keyframe-based interpolation

See the ltx-pipelines README for usage examples.
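The patchify step can be sketched as a pure tensor reshape. This minimal version emits one token per latent voxel (a 1×1×1 patch); the actual patchifiers in components/patchifiers.py may group larger patches, so treat this only as an illustration of the [B, C, F, H, W] ↔ [B, seq_len, dim] round trip:

```python
import torch

def patchify(latents: torch.Tensor) -> torch.Tensor:
    """[B, C, F, H, W] -> [B, F*H*W, C]: one token per latent voxel."""
    B, C, F, H, W = latents.shape
    return latents.permute(0, 2, 3, 4, 1).reshape(B, F * H * W, C)

def unpatchify(tokens: torch.Tensor, F: int, H: int, W: int) -> torch.Tensor:
    """Inverse of patchify: [B, F*H*W, C] -> [B, C, F, H, W]."""
    B, seq_len, C = tokens.shape
    assert seq_len == F * H * W
    return tokens.reshape(B, F, H, W, C).permute(0, 4, 1, 2, 3)

# Round trip on a latent of the Video VAE example shape.
latents = torch.randn(1, 128, 5, 16, 16)
tokens = patchify(latents)                       # [1, 1280, 128]
assert tokens.shape == (1, 5 * 16 * 16, 128)
assert torch.equal(unpatchify(tokens, 5, 16, 16), latents)
```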