Configuration functions from peft.tuners.lora.dora that control fused kernel dispatch, memory thresholds, backward-pass heuristics, and distributed training integration.

Info

These are module-level functions, most prefixed with _ (private). They are documented here as a developer reference for understanding the runtime control flow.


Fused Kernel Control

Functions that read environment variables and manage cached decisions about whether to use fused Triton kernels.

peft.tuners.lora.dora._use_fused_kernels()

Return True if fused Triton kernels should be used (when available).

Controlled by env var PEFT_DORA_FUSED:

* "1" or "true" (case-insensitive) → enable
* "0" or "false" → disable
* unset → enable by default when Triton is available

The result is cached after the first call so that os.environ.get() is not invoked on every forward pass. A threading.Lock guards the first-write to avoid TOCTOU races in threaded launchers.

Note

This flag enables Triton kernels for inference-style forward paths. During training, fused compose uses the custom autograd path by default (disable with PEFT_DORA_FUSED_BACKWARD=0).
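The tri-state env-var parse described above can be sketched standalone. This is a minimal illustration of the semantics, not the actual resolver; the function name `resolve_flag` is hypothetical (the real private helper is `_resolve_fused_kernels`, whose caching wrapper is shown in the source below).

```python
import os

def resolve_flag(name: str, default: bool) -> bool:
    # Hypothetical analog of the private resolver: "1"/"true" enable,
    # "0"/"false" disable, unset falls back to a default (in the real
    # code, Triton availability).
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true")

os.environ["PEFT_DORA_FUSED"] = "0"
assert resolve_flag("PEFT_DORA_FUSED", default=True) is False
del os.environ["PEFT_DORA_FUSED"]
assert resolve_flag("PEFT_DORA_FUSED", default=True) is True
```

The caching and double-checked locking around this parse are what the source listing below adds on top.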

Source code in peft/tuners/lora/dora.py
def _use_fused_kernels() -> bool:
    """Return True if fused Triton kernels should be used (when available).

    Controlled by env var ``PEFT_DORA_FUSED``:
      * ``"1"`` or ``"true"`` (case-insensitive) → enable
      * ``"0"`` or ``"false"`` → disable
      * unset → enable by default when Triton is available

    The result is cached after the first call so that ``os.environ.get()``
    is not invoked on every forward pass.  A ``threading.Lock`` guards the
    first-write to avoid TOCTOU races in threaded launchers.

    Note:
        This flag enables Triton kernels for inference-style forward paths.
        During training, fused compose uses the custom autograd path by
        default (disable with ``PEFT_DORA_FUSED_BACKWARD=0``).
    """
    global _cached_use_fused_kernels  # noqa: PLW0603
    # Double-checked locking: the first read outside the lock is a non-atomic
    # read of a module-global reference.  Under CPython (with or without GIL)
    # this is safe because pointer-width writes are atomic on all supported
    # platforms.  Under free-threaded Python (PEP 703, 3.13t+) the explicit
    # lock below serializes the first-write; subsequent reads of a fully
    # constructed Python object reference are safe without the lock.  This
    # relies on CPython implementation details (pointer-width atomicity), not
    # language-level guarantees.
    val = _cached_use_fused_kernels
    if val is not _SENTINEL:
        return val
    # Dynamo cannot trace through threading.Lock context managers (it raises
    # ``Unsupported: Unsupported context manager``).  During compilation,
    # tracing is single-threaded so the lock is unnecessary — resolve the
    # env var directly and let Dynamo inline the boolean constant.
    if dynamo_is_compiling is not None and dynamo_is_compiling():
        return _resolve_fused_kernels()
    with _fused_cache_lock:
        # Double-check after acquiring the lock
        if _cached_use_fused_kernels is not _SENTINEL:
            return _cached_use_fused_kernels
        _cached_use_fused_kernels = _resolve_fused_kernels()
        return _cached_use_fused_kernels

peft.tuners.lora.dora._use_fused_backward()

Return True if the custom autograd backward path should be used.

Controlled by env var PEFT_DORA_FUSED_BACKWARD:

* "1" or "true" → enable
* "0" or "false" → disable
* unset → enabled by default unconditionally (the fused backward uses PyTorch fallbacks when Triton is unavailable, so Triton is not required; set PEFT_DORA_FUSED_BACKWARD=0 to opt out)

Enabled by default because the fused forward-and-inner kernel eliminates the VRAM spike from sequential PyTorch ops, and the frozen-mag path skips the inner allocation entirely when mag_norm_scale doesn't require gradients. Overhead in the normal (unfrozen) case is exactly 1x lora-sized activation per layer (the saved inner).

Set PEFT_DORA_FUSED_BACKWARD=0 to opt out if VRAM is extremely tight.
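To make the "1x lora-sized activation per layer" overhead concrete, here is an illustrative back-of-the-envelope calculation; the batch, sequence length, hidden size, and dtype are assumptions for the example, not measured values.

```python
# Cost of the one saved lora-sized activation (the saved ``inner``)
# for an assumed linear-layer shape under bf16.
batch, seq, d_out = 8, 4096, 4096
bytes_per_elem = 2                       # bf16
saved_bytes = batch * seq * d_out * bytes_per_elem
print(saved_bytes / 2**20, "MiB")        # 256.0 MiB per layer for this shape
```

At this (assumed) shape each DoRA layer keeps one extra 256 MiB activation alive for the backward pass; the frozen-mag path described above skips it entirely.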

Source code in peft/tuners/lora/dora.py
def _use_fused_backward() -> bool:
    """Return True if the custom autograd backward path should be used.

    Controlled by env var ``PEFT_DORA_FUSED_BACKWARD``:
      * ``"1"`` or ``"true"`` → enable
      * ``"0"`` or ``"false"`` → disable
      * unset → **enabled by default unconditionally** (the fused backward
        uses PyTorch fallbacks when Triton is unavailable, so Triton is not
        required; set ``PEFT_DORA_FUSED_BACKWARD=0`` to opt out)

    Enabled by default because the fused forward-and-inner kernel eliminates
    the VRAM spike from sequential PyTorch ops, and the frozen-mag path skips
    the ``inner`` allocation entirely when ``mag_norm_scale`` doesn't require
    gradients.  Overhead in the normal (unfrozen) case is exactly 1x
    ``lora``-sized activation per layer (the saved ``inner``).

    Set ``PEFT_DORA_FUSED_BACKWARD=0`` to opt out if VRAM is extremely tight.
    """
    global _cached_use_fused_backward  # noqa: PLW0603
    val = _cached_use_fused_backward
    if val is not _SENTINEL:
        return val
    if dynamo_is_compiling is not None and dynamo_is_compiling():
        return _resolve_fused_backward()
    with _fused_cache_lock:
        if _cached_use_fused_backward is not _SENTINEL:
            return _cached_use_fused_backward
        _cached_use_fused_backward = _resolve_fused_backward()
        return _cached_use_fused_backward

peft.tuners.lora.dora._invalidate_fused_cache()

Reset cached env var results (fused flags + thresholds + FSDP2 detection). Useful for testing.

Source code in peft/tuners/lora/dora.py
def _invalidate_fused_cache():
    """Reset cached env var results (fused flags + thresholds + FSDP2 detection). Useful for testing."""
    global _cached_use_fused_kernels, _cached_use_fused_backward, _cached_fused_backward_explicit  # noqa: PLW0603
    global _cached_norm_threshold, _cached_fwd_threshold  # noqa: PLW0603
    global _fsdp2_detect_fns, _cached_allow_partial_gather, _cached_force_gather_override, _cached_is_zero3  # noqa: PLW0603
    with _fused_cache_lock:
        _cached_use_fused_kernels = _SENTINEL
        _cached_use_fused_backward = _SENTINEL
        _cached_fused_backward_explicit = _SENTINEL
        _cached_norm_threshold = _SENTINEL
        _cached_fwd_threshold = _SENTINEL
        _cached_allow_partial_gather = _SENTINEL
        _cached_force_gather_override = _SENTINEL
        _cached_is_zero3 = _SENTINEL
        _fsdp2_detect_fns = None
    is_triton_available.cache_clear()

peft.tuners.lora.dora.is_triton_available() (cached)

Return True if Triton is importable, without importing dora_fused.

Keep this lightweight probe separate from dora_fused.is_triton_available to preserve lazy import behavior for users who never execute DoRA paths.

Source code in peft/tuners/lora/dora.py
@lru_cache(maxsize=1)
def is_triton_available() -> bool:
    """Return True if Triton is importable, without importing dora_fused.

    Keep this lightweight probe separate from ``dora_fused.is_triton_available``
    to preserve lazy import behavior for users that never execute DoRA paths.
    """
    try:
        import triton  # noqa: F401
    except ImportError:
        return False
    return True

Memory Thresholds

Functions controlling chunking thresholds for norm computation and forward passes. Matrices exceeding these thresholds are processed in chunks to bound peak memory.
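The chunking idea itself is simple and can be shown in a pure-Python sketch: process rows in blocks whose working set stays under the threshold, accumulating partial results that match the unchunked computation exactly. This is an illustration of the technique only; the real implementation operates on torch tensors and derives chunk sizes from the MB threshold below.

```python
import math

def colnorm_chunked(matrix, chunk_rows):
    # Column L2 norms computed over row chunks: peak memory is bounded
    # by ``chunk_rows`` rows instead of the whole matrix, and the
    # accumulated sums of squares are identical to the full pass.
    ncols = len(matrix[0])
    acc = [0.0] * ncols
    for start in range(0, len(matrix), chunk_rows):
        for row in matrix[start:start + chunk_rows]:
            for j, v in enumerate(row):
                acc[j] += v * v
    return [math.sqrt(a) for a in acc]

m = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
full = colnorm_chunked(m, chunk_rows=len(m))   # one chunk = no chunking
chunked = colnorm_chunked(m, chunk_rows=1)     # one row at a time
assert all(abs(a - b) < 1e-12 for a, b in zip(full, chunked))
```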

peft.tuners.lora.dora.set_dora_norm_threshold_mb(mb)

Set the PEFT_DORA_NORM_CHUNK_MB environment variable to control the working-set memory threshold (in MB). Enforces that mb is an integer >= 16 and <= 65536 (64 GB). Raises ValueError if mb is out of bounds.

Source code in peft/tuners/lora/dora.py
def set_dora_norm_threshold_mb(mb: int) -> None:
    """
    Set the PEFT_DORA_NORM_CHUNK_MB environment variable to control the working-set memory threshold (in MB).
    Enforces that mb is an integer >= 16 and <= 65536 (64 GB).
    Raises ValueError if mb is out of bounds.
    """
    min_mb = 16
    max_mb = 65536  # 64 GB, arbitrary upper bound to prevent mistakes
    if not isinstance(mb, int):
        raise ValueError(f"mb must be an integer, got {type(mb).__name__}")
    if not (min_mb <= mb <= max_mb):
        raise ValueError(f"mb must be between {min_mb} and {max_mb} (got {mb})")
    os.environ["PEFT_DORA_NORM_CHUNK_MB"] = str(mb)
    _invalidate_threshold_cache()

peft.tuners.lora.dora.get_dora_norm_threshold_mb()

Return the current DoRA norm chunk threshold in MB.

Source code in peft/tuners/lora/dora.py
def get_dora_norm_threshold_mb() -> int:
    """Return the current DoRA norm chunk threshold in MB."""
    return int(_get_norm_memory_threshold_bytes() // (1024 * 1024))

peft.tuners.lora.dora.get_dora_norm_threshold_bytes()

Return the current DoRA norm chunk threshold in bytes.

Source code in peft/tuners/lora/dora.py
def get_dora_norm_threshold_bytes() -> int:
    """Return the current DoRA norm chunk threshold in bytes."""
    return int(_get_norm_memory_threshold_bytes())
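A standalone sketch of the setter/getter pair, mirroring the documented validation bounds. The default of 256 MB here is an assumption for the example (the real default lives in dora.py), and these functions are illustrative re-implementations, not the peft API.

```python
import os

def set_threshold_mb(mb: int) -> None:
    # Mirrors the documented contract: integer in [16, 65536] MB.
    if not isinstance(mb, int):
        raise ValueError(f"mb must be an integer, got {type(mb).__name__}")
    if not (16 <= mb <= 65536):
        raise ValueError(f"mb must be between 16 and 65536 (got {mb})")
    os.environ["PEFT_DORA_NORM_CHUNK_MB"] = str(mb)

def get_threshold_bytes(default_mb: int = 256) -> int:
    # default_mb is an assumption; falls back to it when the var is unset.
    return int(os.environ.get("PEFT_DORA_NORM_CHUNK_MB", default_mb)) * 1024 * 1024

set_threshold_mb(128)
assert get_threshold_bytes() == 128 * 1024 * 1024
try:
    set_threshold_mb(8)          # below the 16 MB floor
except ValueError:
    pass
```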

Fused Backward Heuristics

Shape-based heuristics that decide whether the fused backward kernel is beneficial for a given tensor. Small matrices may not benefit from kernel launch overhead.

peft.tuners.lora.dora._should_auto_use_fused_backward_shape(num_rows, num_cols)

Benchmark-informed crossover for auto-enabled fused backward.

num_cols is the activation's last dimension (d_out for linear layers, i.e. lora_out.shape[-1]). num_rows is the product of all other dimensions (batch * seq for linear layers).

The warmed 6-GPU benchmark bundle (L40S, A100, RTX 6000 PRO, H200, B200, B300) shows the crossover entering the win regime around the 2048x6144 (rows x cols) activation shape on Blackwell and Ampere; L40S and H200 may still trail at the threshold and do not consistently win until roughly 2x the threshold work-item count. The threshold is therefore conservative on high-bandwidth HBM GPUs and slightly aggressive on lower-bandwidth / older architectures.

We keep explicit env-var enables as a force-on override; this heuristic only applies when PEFT_DORA_FUSED_BACKWARD is unset (the default auto mode).

Source code in peft/tuners/lora/dora.py
def _should_auto_use_fused_backward_shape(num_rows: int, num_cols: int) -> bool:
    """Benchmark-informed crossover for auto-enabled fused backward.

    ``num_cols`` is the activation's last dimension (``d_out`` for linear
    layers, i.e. ``lora_out.shape[-1]``).  ``num_rows`` is the product of
    all other dimensions (``batch * seq`` for linear layers).

    The warmed 6-GPU benchmark bundle (L40S, A100, RTX 6000 PRO, H200,
    B200, B300) shows the crossover entering the win regime around the
    2048x6144 (rows x cols) activation shape on Blackwell and Ampere;
    L40S and H200 may still trail at the threshold and do not consistently
    win until roughly 2x the threshold work-item count.  The threshold is
    therefore conservative on high-bandwidth HBM GPUs and slightly
    aggressive on lower-bandwidth / older architectures.

    We keep explicit env-var enables as a force-on override; this heuristic
    only applies when PEFT_DORA_FUSED_BACKWARD is unset (the default auto
    mode).
    """

    if num_rows <= 0 or num_cols < _FUSED_BACKWARD_AUTO_MIN_COLS:
        return False
    return num_rows * num_cols >= _FUSED_BACKWARD_AUTO_MIN_WORK_ITEMS
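The crossover check can be exercised standalone. The two threshold constants below are assumptions chosen to match the documented 2048x6144 crossover; the actual values of _FUSED_BACKWARD_AUTO_MIN_COLS and _FUSED_BACKWARD_AUTO_MIN_WORK_ITEMS are module-private and may differ.

```python
# Assumed thresholds matching the documented 2048x6144 crossover.
MIN_COLS = 6144
MIN_WORK_ITEMS = 2048 * 6144     # 12_582_912 work items

def should_auto_fuse(num_rows: int, num_cols: int) -> bool:
    # Same two-gate shape check as _should_auto_use_fused_backward_shape:
    # a floor on the column count plus a floor on total work items.
    if num_rows <= 0 or num_cols < MIN_COLS:
        return False
    return num_rows * num_cols >= MIN_WORK_ITEMS

assert should_auto_fuse(2048, 6144)       # exactly at the crossover
assert not should_auto_fuse(512, 6144)    # too little total work
assert not should_auto_fuse(8192, 4096)   # cols below the floor
```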

peft.tuners.lora.dora._should_use_fused_backward_for_tensor(lora_out, mag_norm_scale=None)

Decide whether training-time compose should route through fused backward.

Source code in peft/tuners/lora/dora.py
def _should_use_fused_backward_for_tensor(
    lora_out: torch.Tensor,
    mag_norm_scale: Optional[torch.Tensor] = None,
) -> bool:
    """Decide whether training-time compose should route through fused backward."""

    if not (_use_fused_backward() and lora_out.is_cuda):
        return False

    # _use_fused_backward() returns True both for "unset (default on)" and
    # "explicitly set to 1".  We need to distinguish: explicit opt-in skips
    # the auto crossover heuristic, while unset defers to shape analysis.
    # _resolve_fused_backward_explicit() returns True/False for explicit
    # setting, None for unset — all cached.
    explicit = _resolve_fused_backward_explicit()
    if explicit is not None:
        return explicit

    if lora_out.ndim == 0:
        return False

    # Apply the auto heuristic only to the linear/embedding-style broadcast
    # pattern that the Triton benchmark suite covers directly.  For other
    # layouts (for example Conv with mag=[1,C,1,1]), preserve the previous
    # custom-autograd behavior.
    if mag_norm_scale is not None and _mag_broadcasts_last_dim(mag_norm_scale, lora_out):
        num_cols = lora_out.shape[-1]
        num_rows = lora_out.numel() // max(num_cols, 1)
        return _should_auto_use_fused_backward_shape(num_rows, num_cols)

    return True
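The rows/cols derivation used above (last dimension is num_cols, product of all other dimensions is num_rows) can be checked on a plain shape tuple; the shape here is an arbitrary example.

```python
from math import prod

# Example (batch, seq, d_out) activation shape — illustrative only.
shape = (2, 1024, 6144)
num_cols = shape[-1]
num_rows = prod(shape) // max(num_cols, 1)
assert (num_rows, num_cols) == (2048, 6144)
```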

FSDP & ZeRO-3 Integration

Functions for detecting and interacting with distributed training frameworks (FSDP2, DeepSpeed ZeRO-3) that shard parameters across devices.

peft.tuners.lora.dora._is_fsdp2_managed(module)

Detect whether module is wrapped by PyTorch FSDP2 (composable API).

FSDP2 (torch.distributed._composable.fsdp.fully_shard, available since PyTorch 2.4) attaches FSDPState to modules but does not wrap them with FullyShardedDataParallel, so the FSDP1 summon_full_params API silently no-ops. We detect FSDP2 by checking for the state object that fully_shard attaches.

Implementation notes (private API dependencies):

- FSDPState: the composable-FSDP state class. Import location moved from torch.distributed._composable.fsdp (2.4–2.9) to torch.distributed.fsdp._fully_shard._fsdp_state (2.10+). Both paths are tried by _resolve_fsdp2_detect_fns.
- torch.distributed._composable_state._get_module_state: stable since PyTorch 2.4. Returns the composable state attached by fully_shard.
- torch.distributed.fsdp._common_utils._get_module_fsdp_state: legacy fallback for PyTorch < 2.4.

Detection functions are resolved once on first call and cached to avoid import overhead on every forward pass (hundreds of layers × thousands of steps). All imports are guarded by try/except so breakage in future PyTorch releases degrades to returning False (FSDP1 behavior preserved). Last verified against PyTorch 2.4.0, 2.5.0, 2.6.0, and 2.10.0.
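The guarded, version-tolerant import probe can be sketched as follows. This is a simplified illustration of the resolution strategy only (the real _resolve_fsdp2_detect_fns also resolves the state-getter functions); it degrades to None on any failure, which maps to the documented "return False, preserve FSDP1 behavior" fallback.

```python
import importlib

def resolve_fsdp2_state_cls():
    # Try the documented import paths, newest first; any breakage in a
    # future release degrades to None rather than raising.
    for path in (
        "torch.distributed.fsdp._fully_shard._fsdp_state",  # 2.10+
        "torch.distributed._composable.fsdp",               # 2.4–2.9
    ):
        try:
            mod = importlib.import_module(path)
            cls = getattr(mod, "FSDPState", None)
            if cls is not None:
                return cls
        except Exception:
            continue
    return None

cls = resolve_fsdp2_state_cls()
assert cls is None or isinstance(cls, type)   # works with or without torch
```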

Source code in peft/tuners/lora/dora.py
def _is_fsdp2_managed(module) -> bool:
    """Detect whether *module* is wrapped by PyTorch FSDP2 (composable API).

    FSDP2 (``torch.distributed._composable.fsdp.fully_shard``, available since
    PyTorch 2.4) attaches ``FSDPState`` to modules but does **not** wrap them
    with ``FullyShardedDataParallel``, so the FSDP1 ``summon_full_params`` API
    silently no-ops.  We detect FSDP2 by checking for the state object that
    ``fully_shard`` attaches.

    Implementation notes (private API dependencies):
      - ``FSDPState``: the composable-FSDP state class.  Import location moved
        from ``torch.distributed._composable.fsdp`` (2.4–2.9) to
        ``torch.distributed.fsdp._fully_shard._fsdp_state`` (2.10+).
        Both paths are tried by ``_resolve_fsdp2_detect_fns``.
      - ``torch.distributed._composable_state._get_module_state``: stable
        since PyTorch 2.4.  Returns the composable state attached by
        ``fully_shard``.
      - ``torch.distributed.fsdp._common_utils._get_module_fsdp_state``:
        legacy fallback for PyTorch < 2.4.

    Detection functions are resolved once on first call and cached to avoid
    import overhead on every forward pass (hundreds of layers × thousands of
    steps).  All imports are guarded by ``try/except`` so breakage in future
    PyTorch releases degrades to returning ``False`` (FSDP1 behavior preserved).
    Last verified against PyTorch 2.4.0, 2.5.0, 2.6.0, and 2.10.0.
    """
    if not isinstance(module, nn.Module):
        return False

    global _fsdp2_detect_fns  # noqa: PLW0603
    if _fsdp2_detect_fns is None:
        with _fused_cache_lock:
            if _fsdp2_detect_fns is None:
                _fsdp2_detect_fns = _resolve_fsdp2_detect_fns()

    fsdp_state_cls, get_state_fn = _fsdp2_detect_fns

    if get_state_fn is None:
        return False

    try:
        state = get_state_fn(module)
    except (TypeError, AttributeError):
        state = None

    if state is not None:
        if fsdp_state_cls is not None:
            if isinstance(state, fsdp_state_cls):
                return True
        elif FSDP is None or not isinstance(module, FSDP):
            # Has composable state but is not FSDP1-wrapped → FSDP2
            return True

    # Note: we intentionally do NOT check for DTensor parameters here.
    # FSDP2 converts child parameters to DTensor when fully_shard() is
    # called on a parent, so _get_module_state returns None for leaf layers
    # even though their params are sharded.  However, DTensor is also used
    # by Tensor Parallelism and Pipeline Parallelism — checking for DTensor
    # would false-positive on TP-only configs and crash DoRA forward.
    # The primary detection via _get_module_state catches directly-wrapped
    # modules.  Parent-only FSDP2 wrapping is a known detection gap, but
    # in that configuration FSDP2's own pre-forward hooks unshard parameters
    # before DoRA's forward runs, so norms are computed from full params.
    return False

peft.tuners.lora.dora._fsdp_full_param_ctx(*modules)

Best-effort context to expose full parameters when modules are wrapped with torch.distributed.fsdp.FullyShardedDataParallel (FSDP).

- Yields exactly once.
- No-ops outside FSDP or if modules are not FSDP-wrapped.
- Does not swallow exceptions raised inside the 'with' body.

This is safe under ZeRO/DP/DDP (it will simply do nothing). Debug logs which modules were successfully summoned (best-effort).

Callers must only pass nn.Module instances — raw tensors or nn.Parameter objects (e.g. embedding LoRA factors) are not individually FSDP-wrapped and should not be passed here. Use _maybe_gather_base_params_ctx for those.

Raises RuntimeError if any module is managed by FSDP2 (composable API), which uses a different full-parameter mechanism that this helper does not support. Failing loudly is preferable to silently computing norms from sharded parameters.
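The "enter whichever contexts succeed, skip the rest, yield exactly once" shape of this helper is a reusable contextlib pattern. The sketch below is a generic, torch-free illustration of that pattern, not the peft implementation.

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def best_effort_ctx(*context_factories):
    # Enter every context that can be constructed and entered; skip the
    # rest; always yield exactly once; never swallow body exceptions.
    with ExitStack() as stack:
        entered = 0
        for factory in context_factories:
            try:
                stack.enter_context(factory())
                entered += 1
            except (TypeError, AttributeError, RuntimeError):
                continue  # not applicable for this object; skip
        yield entered

@contextmanager
def good():
    yield

def bad():
    raise RuntimeError("not wrapped")  # stands in for a non-FSDP module

with best_effort_ctx(good, bad) as n:
    assert n == 1   # one context entered, one skipped, body still runs
```

ExitStack guarantees every successfully entered context is exited even if a later factory raises or the body fails, which is why the real helper uses it rather than nesting `with` statements.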

Source code in peft/tuners/lora/dora.py
@contextmanager
def _fsdp_full_param_ctx(*modules):
    """
    Best-effort context to expose full parameters when modules are wrapped with
    torch.distributed.fsdp.FullyShardedDataParallel (FSDP).
    - Yields exactly once.
    - No-ops outside FSDP or if modules are not FSDP-wrapped.
    - Does not swallow exceptions raised inside the 'with' body.
    This is safe under ZeRO/DP/DDP (it will simply do nothing).
    Debug logs which modules were successfully summoned (best-effort).

    Callers must only pass ``nn.Module`` instances — raw tensors or
    ``nn.Parameter`` objects (e.g. embedding LoRA factors) are not
    individually FSDP-wrapped and should not be passed here.  Use
    ``_maybe_gather_base_params_ctx`` for those.

    Raises ``RuntimeError`` if any module is managed by FSDP2 (composable API),
    which uses a different full-parameter mechanism that this helper does not
    support.  Failing loudly is preferable to silently computing norms from
    sharded parameters.
    """
    if FSDP is None:
        yield
        return

    # Filter to nn.Module instances only — raw tensors and Parameters should
    # use _maybe_gather_base_params_ctx instead.
    modules = tuple(m for m in modules if m is not None and isinstance(m, nn.Module))
    if not modules:
        yield
        return

    # Detect FSDP2 and fail loudly rather than silently returning shards.
    for m in modules:
        if _is_fsdp2_managed(m):
            raise RuntimeError(
                f"DoRA detected FSDP2-wrapped module ({type(m).__name__}). "
                "The current DoRA implementation only supports FSDP1's "
                "`summon_full_params` API. FSDP2 (composable `fully_shard`) "
                "requires a different full-parameter mechanism that is not yet "
                "implemented. Using FSDP2 with DoRA would silently compute "
                "norms from sharded parameters and produce incorrect results."
            )

    with ExitStack() as stack:
        summoned = 0
        for m in modules:
            try:
                cm = FSDP.summon_full_params(m, writeback=False, with_grads=False)
            except (TypeError, AttributeError):
                # Not FSDP-wrapped or incompatible; skip
                continue
            try:
                stack.enter_context(cm)
                summoned += 1
            except RuntimeError:
                # Some FSDP variants may raise at enter time; skip
                continue
        if summoned:
            logger.debug("DoRA: entered FSDP full-param ctx for %d module(s)", summoned)
        yield

peft.tuners.lora.dora._maybe_gather_base_params_ctx(base_layer, *extra_modules)

Only required under DeepSpeed ZeRO-3, where parameters are sharded. For ZeRO-2, params are replicated, so gathering is unnecessary. We gate gathering by:

- the explicit PEFT_FORCE_GATHER override when set, else
- DS_ZERO_STAGE==3 (env), or
- check_deepspeed_zero3_enabled().

We try the param-tuple signature first, else the module object, and log which one worked.

extra_modules are additional modules or raw tensors/parameters whose parameters should also be gathered (e.g. lora_A, lora_B). Under ZeRO-3 the adapter weights can be sharded too, so every tensor consumed by the norm path must be inside the gather scope.

Items that are nn.Module contribute via .parameters(). Items that are bare torch.Tensor / nn.Parameter (e.g. embedding LoRA factors) are included directly in the gather tuple.

Source code in peft/tuners/lora/dora.py
def _maybe_gather_base_params_ctx(base_layer, *extra_modules):
    """
    Only required under DeepSpeed ZeRO-3 where parameters are sharded. For ZeRO-2, params are
    replicated, so gathering is unnecessary. We gate gathering by:
      - explicit ``PEFT_FORCE_GATHER`` override when set, else
      - DS_ZERO_STAGE==3 (env), or
      - check_deepspeed_zero3_enabled().
    We try param-tuple signature first, else module object; logs which one worked.

    *extra_modules* are additional modules or raw tensors/parameters whose
    parameters should also be gathered (e.g. ``lora_A``, ``lora_B``).  Under
    ZeRO-3 the adapter weights can be sharded too, so every tensor consumed by
    the norm path must be inside the gather scope.

    Items that are ``nn.Module`` contribute via ``.parameters()``.  Items that
    are bare ``torch.Tensor`` / ``nn.Parameter`` (e.g. embedding LoRA factors)
    are included directly in the gather tuple.
    """
    if gather_params_ctx is None or not _is_zero3_active():
        return nullcontext()

    # Collect parameters from all modules (base + extras) into a single tuple.
    # Modules contribute via .parameters(); raw tensors are included directly.
    all_modules = [base_layer] + [m for m in extra_modules if m is not None]
    param_iterable = None
    try:
        params = []
        for mod in all_modules:
            if hasattr(mod, "parameters") and callable(mod.parameters):
                params.extend(mod.parameters())
            elif isinstance(mod, torch.Tensor):
                params.append(_resolve_tensor_base(mod))
        if params:
            param_iterable = tuple(params)
    except TypeError:
        param_iterable = None

    @contextmanager
    def _ctx():
        with ExitStack() as stack:
            entered = False
            if param_iterable is not None:
                try:
                    cm = gather_params_ctx(param_iterable)
                    stack.enter_context(cm)
                    logger.debug("DoRA: ZeRO-3 gather using param tuple (%d params)", len(param_iterable))
                    entered = True
                except (TypeError, AttributeError, RuntimeError) as exc:
                    logger.debug("DoRA: param-tuple gather failed (%s: %s), trying module", type(exc).__name__, exc)
                    entered = False

            if not entered:
                # Fall back to per-module gather.  Track successes and
                # failures separately — a *partial* gather (some modules
                # gathered, others not) is worse than no gather at all
                # because it silently mixes full and sharded tensors.
                gathered_mods = []
                failed_mods = []
                for mod in all_modules:
                    try:
                        # GatheredParameters expects an iterable of Parameters
                        # (or a single Parameter).  Passing an nn.Module directly
                        # makes GatheredParameters a silent no-op (the module
                        # isn't iterable and lacks ds_id).  Always extract params.
                        if isinstance(mod, torch.Tensor):
                            target = (_resolve_tensor_base(mod),)
                        elif hasattr(mod, "parameters") and callable(mod.parameters):
                            target = tuple(mod.parameters())
                            if not target:
                                # Module has no parameters — nothing to gather.
                                gathered_mods.append(type(mod).__name__)
                                continue
                        else:
                            failed_mods.append((type(mod).__name__, "TypeError", "not a Module or Tensor"))
                            continue
                        cm = gather_params_ctx(target)
                        stack.enter_context(cm)
                        gathered_mods.append(type(mod).__name__)
                    except (TypeError, AttributeError, RuntimeError) as exc:
                        failed_mods.append((type(mod).__name__, type(exc).__name__, str(exc)))

                if gathered_mods:
                    entered = True
                    logger.debug("DoRA: ZeRO-3 gather using module objects (%s)", ", ".join(gathered_mods))

                if gathered_mods and failed_mods:
                    # Partial gather: some modules gathered, others failed.
                    # This silently mixes fully gathered and sharded tensors
                    # in the norm computation — a correctness violation.
                    failed_desc = "; ".join(f"{n} ({e}: {m})" for n, e, m in failed_mods)
                    msg = (
                        f"DoRA: ZeRO-3 partial gather — gathered [{', '.join(gathered_mods)}] "
                        f"but failed for [{failed_desc}]. "
                        "Norm computation would mix fully gathered and sharded parameters, "
                        "producing incorrect results."
                    )
                    if _allow_partial_gather():
                        logger.warning(msg + " Continuing due to PEFT_DORA_ALLOW_PARTIAL_GATHER=1.")
                    else:
                        raise RuntimeError(
                            msg + " Set PEFT_DORA_ALLOW_PARTIAL_GATHER=1 to override (at your own risk)."
                        )

            if not entered:
                logger.warning(
                    "DoRA: ZeRO-3 gather failed for all modules. "
                    "Proceeding without gathering — outputs may be incorrect if parameters "
                    "are truly sharded.",
                )
            yield

    return _ctx()

peft.tuners.lora.dora._is_zero3_active()

Return True when DoRA should gather sharded parameters.

PEFT_FORCE_GATHER is a cached ternary override:

* unset -> auto-detect ZeRO-3
* 1 / true -> force gather on and cache forever
* 0 / false -> force gather off and cache forever

When the override is unset, only detected True is cached — ZeRO-3 may initialize after the first DoRA call (the common HF Trainer flow: create model -> PEFT wrap -> deepspeed.initialize), so an auto-detected False must still be re-evaluated on later forwards.
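The ternary (True/False/None) override parse can be sketched standalone. `force_gather_override` below is a hypothetical analog of the private cached accessor, shown without the caching:

```python
import os
from typing import Optional

def force_gather_override() -> Optional[bool]:
    # True/False when PEFT_FORCE_GATHER is explicitly set,
    # None when unset (caller then auto-detects ZeRO-3).
    raw = os.environ.get("PEFT_FORCE_GATHER")
    if raw is None:
        return None
    return raw.strip().lower() in ("1", "true")

os.environ["PEFT_FORCE_GATHER"] = "0"
assert force_gather_override() is False    # explicit off wins
del os.environ["PEFT_FORCE_GATHER"]
assert force_gather_override() is None     # unset → auto-detect
```

The three-way return is what lets the caller distinguish "explicitly disabled" (never gather) from "unset" (keep re-checking for late DeepSpeed initialization).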

Source code in peft/tuners/lora/dora.py
def _is_zero3_active() -> bool:
    """Return True when DoRA should gather sharded parameters.

    ``PEFT_FORCE_GATHER`` is a cached ternary override:
      * unset -> auto-detect ZeRO-3
      * ``1`` / ``true`` -> force gather on and cache forever
      * ``0`` / ``false`` -> force gather off and cache forever

    When the override is unset, only detected ``True`` is cached — ZeRO-3 may
    initialize after the first DoRA call (the common HF Trainer flow: create
    model -> PEFT wrap -> deepspeed.initialize), so an auto-detected ``False``
    must still be re-evaluated on later forwards.
    """
    force = _force_gather_override()
    if force is not None:
        return force
    global _cached_is_zero3  # noqa: PLW0603
    if _cached_is_zero3 is True:
        return True
    # No explicit override: don't cache False, re-evaluate late DS init.
    # Skip all DeepSpeed checks when distributed isn't initialized — ZeRO-3
    # can't be active without a process group.  This makes the False path very
    # cheap (~100ns for the is_initialized() boolean) on single-GPU setups
    # with 100+ DoRA layers, avoiding the os.environ.get() on every call.
    is_zero3_ds = False
    try:
        if torch.distributed.is_initialized():
            # Fast path: if the env var is set, trust it immediately.
            if os.environ.get("DS_ZERO_STAGE") == "3":
                with _fused_cache_lock:
                    _cached_is_zero3 = True
                return True
            try:
                is_zero3_ds = check_deepspeed_zero3_enabled()
            except (ImportError, RuntimeError, ValueError):
                pass
            except Exception:
                logger.debug("DoRA: check_deepspeed_zero3_enabled() raised unexpected error", exc_info=True)
    except (RuntimeError, AttributeError):
        pass
    result = is_zero3_ds
    if result:
        with _fused_cache_lock:
            _cached_is_zero3 = True
    return result
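
The body of `_force_gather_override` is not shown above, but the docstring pins down its ternary contract: unset means auto-detect, truthy means force on, falsy means force off. A hypothetical standalone sketch of that parse (the name `parse_ternary_override` and the fallback-to-False handling of unrecognized values are assumptions, not the real helper):

```python
import os
from typing import Optional


def parse_ternary_override(name: str) -> Optional[bool]:
    """Hypothetical sketch of a ternary env override parse.

    unset        -> None  (caller auto-detects)
    "1"/"true"   -> True  (force on)
    anything else -> False (force off; assumed behavior for unrecognized values)
    """
    raw = os.environ.get(name)
    if raw is None:
        return None
    return raw.strip().lower() in ("1", "true")


os.environ.pop("PEFT_FORCE_GATHER", None)
print(parse_ternary_override("PEFT_FORCE_GATHER"))  # None -> auto-detect ZeRO-3
os.environ["PEFT_FORCE_GATHER"] = "true"
print(parse_ternary_override("PEFT_FORCE_GATHER"))  # True -> force gather on
os.environ["PEFT_FORCE_GATHER"] = "0"
print(parse_ternary_override("PEFT_FORCE_GATHER"))  # False -> force gather off
```

The real accessor additionally caches the parsed value, which is why the docstring says an explicit override is "cached forever" while auto-detected False is not.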

peft.tuners.lora.dora._allow_partial_gather()

Return True if PEFT_DORA_ALLOW_PARTIAL_GATHER=1.

Cached after first call, consistent with other env-var accessors.

Source code in peft/tuners/lora/dora.py
def _allow_partial_gather() -> bool:
    """Return True if ``PEFT_DORA_ALLOW_PARTIAL_GATHER=1``.

    Cached after first call, consistent with other env-var accessors.
    """
    global _cached_allow_partial_gather  # noqa: PLW0603
    val = _cached_allow_partial_gather
    if val is not _SENTINEL:
        return val
    # Dynamo cannot trace through threading.Lock — resolve directly during
    # compilation (single-threaded, so the lock is unnecessary).
    if dynamo_is_compiling is not None and dynamo_is_compiling():
        return os.environ.get("PEFT_DORA_ALLOW_PARTIAL_GATHER", "0") == "1"
    with _fused_cache_lock:
        if _cached_allow_partial_gather is not _SENTINEL:
            return _cached_allow_partial_gather
        _cached_allow_partial_gather = os.environ.get("PEFT_DORA_ALLOW_PARTIAL_GATHER", "0") == "1"
        return _cached_allow_partial_gather

Composition Helpers

Eager (non-fused) composition functions used as the fallback path.

peft.tuners.lora.dora._compose_eager_inplace(lora, base, mag_norm_scale, scale)

Numerically stable in-place DoRA composition for eager PyTorch paths.

Computes out = (mag - 1) * base + mag * (scale * lora) in-place into lora, avoiding catastrophic cancellation when mag ≈ 1 in bf16/fp16.

When all operands already match lora.dtype, the in-place path uses lora *= scale then lora *= mag then lora += (mag - 1) * base. The two-step multiply preserves the canonical associativity mag * (scale * lora) (scale first, then mag).

Under mixed dtypes (for example fp32 magnitude with bf16 activations under AMP), eager training defines the reference contract: evaluate the stable form in the promoted dtype, then cast back to the activation dtype. The in-place helper mirrors that by materializing the promoted result and copying it back into lora. This restores bitwise parity across eager out-of-place, eager in-place, and chunked eager composition, but it does not by itself change the separate fused-autograd dtype contract in _compose_with_dispatch.

This is the single source of truth for the eager in-place formula. The Triton kernel in dora_fused.py computes the same expression but is maintained separately because kernel code cannot share Python helpers. See test_compose_formula_cross_reference for the consistency assertion.

Source code in peft/tuners/lora/dora.py
def _compose_eager_inplace(
    lora: torch.Tensor,
    base: torch.Tensor,
    mag_norm_scale: torch.Tensor,
    scale: float,
) -> torch.Tensor:
    """Numerically stable in-place DoRA composition for eager PyTorch paths.

    Computes ``out = (mag - 1) * base + mag * (scale * lora)`` in-place into
    *lora*, avoiding catastrophic cancellation when ``mag ≈ 1`` in bf16/fp16.

    When all operands already match ``lora.dtype``, the in-place path uses
    ``lora *= scale`` then ``lora *= mag`` then ``lora += (mag - 1) * base``.
    The two-step multiply preserves the canonical associativity
    ``mag * (scale * lora)`` (scale first, then mag).

    Under mixed dtypes (for example fp32 magnitude with bf16 activations under
    AMP), eager training defines the reference contract: evaluate the stable
    form in the promoted dtype, then cast back to the activation dtype.  The
    in-place helper mirrors that by materializing the promoted result and
    copying it back into ``lora``.  This restores bitwise parity across eager
    out-of-place, eager in-place, and chunked eager composition, but it does
    not by itself change the separate fused-autograd dtype contract in
    ``_compose_with_dispatch``.

    This is the single source of truth for the eager in-place formula.
    The Triton kernel in ``dora_fused.py`` computes the same expression
    but is maintained separately because kernel code cannot share Python helpers.
    See ``test_compose_formula_cross_reference`` for the consistency assertion.
    """
    if _promoted_compose_dtype(lora.dtype, base.dtype, mag_norm_scale.dtype) != lora.dtype:
        result = mag_norm_scale * (scale * lora) + (mag_norm_scale - 1) * base
        lora.copy_(result)
        return lora

    # Step 1: lora = scale * lora  (in-place, canonical order: scale first)
    lora.mul_(scale)
    # Step 2: lora = mag * lora  (in-place, canonical order: mag second)
    lora.mul_(mag_norm_scale)
    # Step 3: lora += (mag - 1) * base  (in-place, adds the base correction)
    lora.add_(base * (mag_norm_scale - 1))
    return lora
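
The cancellation the docstring guards against can be shown without torch: emulate bfloat16 by rounding every intermediate to the top 16 bits of a float32 encoding. With `mag` near 1 and a small LoRA update, the naive rearrangement `mag * (base + scale*lora) - base` absorbs the update into `base` and then cancels it away, while the stable form keeps it. A pure-Python sketch (the `to_bf16` helper is illustrative, not PEFT code):

```python
import struct


def to_bf16(x: float) -> float:
    # Round to bfloat16 precision: keep the top 16 bits of the float32
    # encoding, rounding to nearest on the dropped lower half.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


mag, base, lora, scale = 1.0078125, 1.0, 0.001, 1.0
exact = (mag - 1) * base + mag * (scale * lora)  # float64 reference

# Naive form, rounding every step: base + scale*lora rounds back to base
# (0.001 is below half a bf16 ulp at 1.0), so the LoRA term vanishes.
t = to_bf16(base + to_bf16(scale * lora))            # -> 1.0 exactly
naive = to_bf16(to_bf16(mag * t) - base)             # -> 0.0078125, lora lost

# Stable form, rounding every step: (mag - 1) is computed before it can be
# swamped by base, so the small LoRA contribution survives.
stable = to_bf16(
    to_bf16(to_bf16(mag - 1) * base) + to_bf16(mag * to_bf16(scale * lora))
)

print(naive, stable, exact)
print(abs(naive - exact), abs(stable - exact))  # naive error is far larger
```

With these inputs the naive result is exactly `(mag - 1) * base` with the LoRA update erased, while the stable form stays within a few bf16 ulps of the float64 reference.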