Configuration functions from peft.tuners.lora.dora that control fused kernel dispatch, memory thresholds, backward-pass heuristics, and distributed training integration.

Info

These are module-level functions, most prefixed with _ (private). They are documented here as a developer reference for understanding the runtime control flow.


Fused Kernel Control

Functions that read environment variables and manage cached decisions about whether to use fused Triton kernels.

peft.tuners.lora.dora._use_fused_kernels()

Return True if fused Triton kernels should be used (when available).

Controlled by env var PEFT_DORA_FUSED:

* "1" or "true" (case-insensitive) → enable
* "0" or "false" → disable
* unset → enable by default when Triton is available

The result is cached after the first call so that os.environ.get() is not invoked on every forward pass. A threading.Lock guards the first-write to avoid TOCTOU races in threaded launchers.

Note

This flag enables Triton kernels for inference-style forward paths. During training, fused compose uses the custom autograd path by default (disable with PEFT_DORA_FUSED_BACKWARD=0).
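The tri-state env-var parse described above can be sketched standalone. This is a minimal illustration of the semantics, not the actual resolver; the function name `resolve_flag` is hypothetical (the real private helper is `_resolve_fused_kernels`, whose caching wrapper is shown in the source below).

```python
import os

def resolve_flag(name: str, default: bool) -> bool:
    # Hypothetical analog of the private resolver: "1"/"true" enable,
    # "0"/"false" disable, unset falls back to a default (in the real
    # code, Triton availability).
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true")

os.environ["PEFT_DORA_FUSED"] = "0"
assert resolve_flag("PEFT_DORA_FUSED", default=True) is False
del os.environ["PEFT_DORA_FUSED"]
assert resolve_flag("PEFT_DORA_FUSED", default=True) is True
```

The caching and double-checked locking around this parse are what the source listing below adds on top.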

Source code in peft/tuners/lora/dora.py
def _use_fused_kernels() -> bool:
    """Return True if fused Triton kernels should be used (when available).

    Controlled by env var ``PEFT_DORA_FUSED``:
      * ``"1"`` or ``"true"`` (case-insensitive) → enable
      * ``"0"`` or ``"false"`` → disable
      * unset → enable by default when Triton is available

    The result is cached after the first call so that ``os.environ.get()``
    is not invoked on every forward pass.  A ``threading.Lock`` guards the
    first-write to avoid TOCTOU races in threaded launchers.

    Note:
        This flag enables Triton kernels for inference-style forward paths.
        During training, fused compose uses the custom autograd path by
        default (disable with ``PEFT_DORA_FUSED_BACKWARD=0``).
    """
    global _cached_use_fused_kernels  # noqa: PLW0603
    # Double-checked locking: the first read outside the lock is a non-atomic
    # read of a module-global reference.  Under CPython (with or without GIL)
    # this is safe because pointer-width writes are atomic on all supported
    # platforms.  Under free-threaded Python (PEP 703, 3.13t+) the explicit
    # lock below serializes the first-write; subsequent reads of a fully
    # constructed Python object reference are safe without the lock.  This
    # relies on CPython implementation details (pointer-width atomicity), not
    # language-level guarantees.
    val = _cached_use_fused_kernels
    if val is not _SENTINEL:
        return val
    # Dynamo cannot trace through threading.Lock context managers (it raises
    # ``Unsupported: Unsupported context manager``).  During compilation,
    # tracing is single-threaded so the lock is unnecessary — resolve the
    # env var directly and let Dynamo inline the boolean constant.
    if dynamo_is_compiling is not None and dynamo_is_compiling():
        return _resolve_fused_kernels()
    with _fused_cache_lock:
        # Double-check after acquiring the lock
        if _cached_use_fused_kernels is not _SENTINEL:
            return _cached_use_fused_kernels
        _cached_use_fused_kernels = _resolve_fused_kernels()
        return _cached_use_fused_kernels

peft.tuners.lora.dora._use_fused_backward()

Return True if the custom autograd backward path should be used.

Controlled by env var PEFT_DORA_FUSED_BACKWARD:

* "1" or "true" → enable
* "0" or "false" → disable
* unset → enabled by default unconditionally (the fused backward uses PyTorch fallbacks when Triton is unavailable, so Triton is not required; set PEFT_DORA_FUSED_BACKWARD=0 to opt out)

Enabled by default because the fused forward-and-inner kernel eliminates the VRAM spike from sequential PyTorch ops, and the frozen-mag path skips the inner allocation entirely when mag_norm_scale doesn't require gradients. Overhead in the normal (unfrozen) case is exactly 1x lora-sized activation per layer (the saved inner).

Set PEFT_DORA_FUSED_BACKWARD=0 to opt out if VRAM is extremely tight.
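To make the "1x lora-sized activation per layer" overhead concrete, here is an illustrative back-of-the-envelope calculation; the batch, sequence length, hidden size, and dtype are assumptions for the example, not measured values.

```python
# Cost of the one saved lora-sized activation (the saved ``inner``)
# for an assumed linear-layer shape under bf16.
batch, seq, d_out = 8, 4096, 4096
bytes_per_elem = 2                       # bf16
saved_bytes = batch * seq * d_out * bytes_per_elem
print(saved_bytes / 2**20, "MiB")        # 256.0 MiB per layer for this shape
```

At this (assumed) shape each DoRA layer keeps one extra 256 MiB activation alive for the backward pass; the frozen-mag path described above skips it entirely.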

Source code in peft/tuners/lora/dora.py
def _use_fused_backward() -> bool:
    """Return True if the custom autograd backward path should be used.

    Controlled by env var ``PEFT_DORA_FUSED_BACKWARD``:
      * ``"1"`` or ``"true"`` → enable
      * ``"0"`` or ``"false"`` → disable
      * unset → **enabled by default unconditionally** (the fused backward
        uses PyTorch fallbacks when Triton is unavailable, so Triton is not
        required; set ``PEFT_DORA_FUSED_BACKWARD=0`` to opt out)

    Enabled by default because the fused forward-and-inner kernel eliminates
    the VRAM spike from sequential PyTorch ops, and the frozen-mag path skips
    the ``inner`` allocation entirely when ``mag_norm_scale`` doesn't require
    gradients.  Overhead in the normal (unfrozen) case is exactly 1x
    ``lora``-sized activation per layer (the saved ``inner``).

    Set ``PEFT_DORA_FUSED_BACKWARD=0`` to opt out if VRAM is extremely tight.
    """
    global _cached_use_fused_backward  # noqa: PLW0603
    val = _cached_use_fused_backward
    if val is not _SENTINEL:
        return val
    if dynamo_is_compiling is not None and dynamo_is_compiling():
        return _resolve_fused_backward()
    with _fused_cache_lock:
        if _cached_use_fused_backward is not _SENTINEL:
            return _cached_use_fused_backward
        _cached_use_fused_backward = _resolve_fused_backward()
        return _cached_use_fused_backward

peft.tuners.lora.dora._invalidate_fused_cache()

Reset cached env var results (fused flags + thresholds + FSDP2 detection). Useful for testing.

Source code in peft/tuners/lora/dora.py
def _invalidate_fused_cache():
    """Reset cached env var results (fused flags + thresholds + FSDP2 detection). Useful for testing."""
    global _cached_use_fused_kernels, _cached_use_fused_backward, _cached_fused_backward_explicit  # noqa: PLW0603
    global _cached_norm_threshold, _cached_fwd_threshold  # noqa: PLW0603
    global _fsdp2_detect_fns, _cached_allow_partial_gather, _cached_force_gather_override, _cached_is_zero3  # noqa: PLW0603
    with _fused_cache_lock:
        _cached_use_fused_kernels = _SENTINEL
        _cached_use_fused_backward = _SENTINEL
        _cached_fused_backward_explicit = _SENTINEL
        _cached_norm_threshold = _SENTINEL
        _cached_fwd_threshold = _SENTINEL
        _cached_allow_partial_gather = _SENTINEL
        _cached_force_gather_override = _SENTINEL
        _cached_is_zero3 = _SENTINEL
        _fsdp2_detect_fns = None
    is_triton_available.cache_clear()

peft.tuners.lora.dora.is_triton_available() (cached)

Return True if Triton is importable, without importing dora_fused.

Keep this lightweight probe separate from dora_fused.is_triton_available to preserve lazy import behavior for users who never execute DoRA paths.

Source code in peft/tuners/lora/dora.py
@lru_cache(maxsize=1)
def is_triton_available() -> bool:
    """Return True if Triton is importable, without importing dora_fused.

    Keep this lightweight probe separate from ``dora_fused.is_triton_available``
    to preserve lazy import behavior for users that never execute DoRA paths.
    """
    try:
        import triton  # noqa: F401
    except ImportError:
        return False
    return True

Memory Thresholds

Functions controlling chunking thresholds for norm computation and forward passes. Matrices exceeding these thresholds are processed in chunks to bound peak memory.
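The chunking idea itself is simple and can be shown in a pure-Python sketch: process rows in blocks whose working set stays under the threshold, accumulating partial results that match the unchunked computation exactly. This is an illustration of the technique only; the real implementation operates on torch tensors and derives chunk sizes from the MB threshold below.

```python
import math

def colnorm_chunked(matrix, chunk_rows):
    # Column L2 norms computed over row chunks: peak memory is bounded
    # by ``chunk_rows`` rows instead of the whole matrix, and the
    # accumulated sums of squares are identical to the full pass.
    ncols = len(matrix[0])
    acc = [0.0] * ncols
    for start in range(0, len(matrix), chunk_rows):
        for row in matrix[start:start + chunk_rows]:
            for j, v in enumerate(row):
                acc[j] += v * v
    return [math.sqrt(a) for a in acc]

m = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
full = colnorm_chunked(m, chunk_rows=len(m))   # one chunk = no chunking
chunked = colnorm_chunked(m, chunk_rows=1)     # one row at a time
assert all(abs(a - b) < 1e-12 for a, b in zip(full, chunked))
```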

peft.tuners.lora.dora.set_dora_norm_threshold_mb(mb)

Set the PEFT_DORA_NORM_CHUNK_MB environment variable to control the working-set memory threshold (in MB). Enforces that mb is an integer >= 16 and <= 65536 (64 GB). Raises ValueError if mb is out of bounds.

Source code in peft/tuners/lora/dora.py
def set_dora_norm_threshold_mb(mb: int) -> None:
    """
    Set the PEFT_DORA_NORM_CHUNK_MB environment variable to control the working-set memory threshold (in MB).
    Enforces that mb is an integer >= 16 and <= 65536 (64 GB).
    Raises ValueError if mb is out of bounds.
    """
    min_mb = 16
    max_mb = 65536  # 64 GB, arbitrary upper bound to prevent mistakes
    if not isinstance(mb, int):
        raise ValueError(f"mb must be an integer, got {type(mb).__name__}")
    if not (min_mb <= mb <= max_mb):
        raise ValueError(f"mb must be between {min_mb} and {max_mb} (got {mb})")
    os.environ["PEFT_DORA_NORM_CHUNK_MB"] = str(mb)
    _invalidate_threshold_cache()

peft.tuners.lora.dora.get_dora_norm_threshold_mb()

Return the current DoRA norm chunk threshold in MB.

Source code in peft/tuners/lora/dora.py
def get_dora_norm_threshold_mb() -> int:
    """Return the current DoRA norm chunk threshold in MB."""
    return int(_get_norm_memory_threshold_bytes() // (1024 * 1024))

peft.tuners.lora.dora.get_dora_norm_threshold_bytes()

Return the current DoRA norm chunk threshold in bytes.

Source code in peft/tuners/lora/dora.py
def get_dora_norm_threshold_bytes() -> int:
    """Return the current DoRA norm chunk threshold in bytes."""
    return int(_get_norm_memory_threshold_bytes())
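A standalone sketch of the setter/getter pair, mirroring the documented validation bounds. The default of 256 MB here is an assumption for the example (the real default lives in dora.py), and these functions are illustrative re-implementations, not the peft API.

```python
import os

def set_threshold_mb(mb: int) -> None:
    # Mirrors the documented contract: integer in [16, 65536] MB.
    if not isinstance(mb, int):
        raise ValueError(f"mb must be an integer, got {type(mb).__name__}")
    if not (16 <= mb <= 65536):
        raise ValueError(f"mb must be between 16 and 65536 (got {mb})")
    os.environ["PEFT_DORA_NORM_CHUNK_MB"] = str(mb)

def get_threshold_bytes(default_mb: int = 256) -> int:
    # default_mb is an assumption; falls back to it when the var is unset.
    return int(os.environ.get("PEFT_DORA_NORM_CHUNK_MB", default_mb)) * 1024 * 1024

set_threshold_mb(128)
assert get_threshold_bytes() == 128 * 1024 * 1024
try:
    set_threshold_mb(8)          # below the 16 MB floor
except ValueError:
    pass
```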

Fused Backward Heuristics

Shape-based heuristics that decide whether the fused backward kernel is beneficial for a given tensor. Small matrices may not benefit from kernel launch overhead.

peft.tuners.lora.dora._should_auto_use_fused_backward_shape(num_rows, num_cols)

Benchmark-informed crossover for auto-enabled fused backward.

num_cols is the activation's last dimension (d_out for linear layers, i.e. lora_out.shape[-1]). num_rows is the product of all other dimensions (batch * seq for linear layers).

The warmed 6-GPU benchmark bundle (L40S, A100, RTX 6000 PRO, H200, B200, B300) shows the crossover entering the win regime around the 2048x6144 (rows x cols) activation shape on Blackwell and Ampere; L40S and H200 may still trail at the threshold and do not consistently win until roughly 2x the threshold work-item count. The threshold is therefore conservative on high-bandwidth HBM GPUs and slightly aggressive on lower-bandwidth / older architectures.

We keep explicit env-var enables as a force-on override; this heuristic only applies when PEFT_DORA_FUSED_BACKWARD is unset (the default auto mode).

Source code in peft/tuners/lora/dora.py
def _should_auto_use_fused_backward_shape(num_rows: int, num_cols: int) -> bool:
    """Benchmark-informed crossover for auto-enabled fused backward.

    ``num_cols`` is the activation's last dimension (``d_out`` for linear
    layers, i.e. ``lora_out.shape[-1]``).  ``num_rows`` is the product of
    all other dimensions (``batch * seq`` for linear layers).

    The warmed 6-GPU benchmark bundle (L40S, A100, RTX 6000 PRO, H200,
    B200, B300) shows the crossover entering the win regime around the
    2048x6144 (rows x cols) activation shape on Blackwell and Ampere;
    L40S and H200 may still trail at the threshold and do not consistently
    win until roughly 2x the threshold work-item count.  The threshold is
    therefore conservative on high-bandwidth HBM GPUs and slightly
    aggressive on lower-bandwidth / older architectures.

    We keep explicit env-var enables as a force-on override; this heuristic
    only applies when PEFT_DORA_FUSED_BACKWARD is unset (the default auto
    mode).
    """

    if num_rows <= 0 or num_cols < _FUSED_BACKWARD_AUTO_MIN_COLS:
        return False
    return num_rows * num_cols >= _FUSED_BACKWARD_AUTO_MIN_WORK_ITEMS
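The crossover check can be exercised standalone. The two threshold constants below are assumptions chosen to match the documented 2048x6144 crossover; the actual values of _FUSED_BACKWARD_AUTO_MIN_COLS and _FUSED_BACKWARD_AUTO_MIN_WORK_ITEMS are module-private and may differ.

```python
# Assumed thresholds matching the documented 2048x6144 crossover.
MIN_COLS = 6144
MIN_WORK_ITEMS = 2048 * 6144     # 12_582_912 work items

def should_auto_fuse(num_rows: int, num_cols: int) -> bool:
    # Same two-gate shape check as _should_auto_use_fused_backward_shape:
    # a floor on the column count plus a floor on total work items.
    if num_rows <= 0 or num_cols < MIN_COLS:
        return False
    return num_rows * num_cols >= MIN_WORK_ITEMS

assert should_auto_fuse(2048, 6144)       # exactly at the crossover
assert not should_auto_fuse(512, 6144)    # too little total work
assert not should_auto_fuse(8192, 4096)   # cols below the floor
```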

peft.tuners.lora.dora._should_use_fused_backward_for_tensor(lora_out, mag_norm_scale=None)

Decide whether training-time compose should route through fused backward.

Source code in peft/tuners/lora/dora.py
def _should_use_fused_backward_for_tensor(
    lora_out: torch.Tensor,
    mag_norm_scale: Optional[torch.Tensor] = None,
) -> bool:
    """Decide whether training-time compose should route through fused backward."""

    if not (_use_fused_backward() and lora_out.is_cuda):
        return False

    # _use_fused_backward() returns True both for "unset (default on)" and
    # "explicitly set to 1".  We need to distinguish: explicit opt-in skips
    # the auto crossover heuristic, while unset defers to shape analysis.
    # _resolve_fused_backward_explicit() returns True/False for explicit
    # setting, None for unset — all cached.
    explicit = _resolve_fused_backward_explicit()
    if explicit is not None:
        return explicit

    if lora_out.ndim == 0:
        return False

    # Apply the auto heuristic only to the linear/embedding-style broadcast
    # pattern that the Triton benchmark suite covers directly.  For other
    # layouts (for example Conv with mag=[1,C,1,1]), preserve the previous
    # custom-autograd behavior.
    if mag_norm_scale is not None and _mag_broadcasts_last_dim(mag_norm_scale, lora_out):
        num_cols = lora_out.shape[-1]
        num_rows = lora_out.numel() // max(num_cols, 1)
        return _should_auto_use_fused_backward_shape(num_rows, num_cols)

    return True
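The rows/cols derivation used above (last dimension is num_cols, product of all other dimensions is num_rows) can be checked on a plain shape tuple; the shape here is an arbitrary example.

```python
from math import prod

# Example (batch, seq, d_out) activation shape — illustrative only.
shape = (2, 1024, 6144)
num_cols = shape[-1]
num_rows = prod(shape) // max(num_cols, 1)
assert (num_rows, num_cols) == (2048, 6144)
```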

FSDP & ZeRO-3 Integration

Functions for detecting and interacting with distributed training frameworks (FSDP2, DeepSpeed ZeRO-3) that shard parameters across devices.

peft.tuners.lora.dora._is_fsdp2_managed(module)

Detect whether module is wrapped by PyTorch FSDP2 (composable API).

FSDP2 (torch.distributed._composable.fsdp.fully_shard, available since PyTorch 2.4) attaches FSDPState to modules but does not wrap them with FullyShardedDataParallel, so the FSDP1 summon_full_params API silently no-ops. We detect FSDP2 by checking for the state object that fully_shard attaches.

Implementation notes (private API dependencies):

- FSDPState: the composable-FSDP state class. Import location moved from torch.distributed._composable.fsdp (2.4–2.9) to torch.distributed.fsdp._fully_shard._fsdp_state (2.10+). Both paths are tried by _resolve_fsdp2_detect_fns.
- torch.distributed._composable_state._get_module_state: stable since PyTorch 2.4. Returns the composable state attached by fully_shard.
- torch.distributed.fsdp._common_utils._get_module_fsdp_state: legacy fallback for PyTorch < 2.4.

Detection functions are resolved once on first call and cached to avoid import overhead on every forward pass (hundreds of layers × thousands of steps). All imports are guarded by try/except so breakage in future PyTorch releases degrades to returning False (FSDP1 behavior preserved). Last verified against PyTorch 2.4.0, 2.5.0, 2.6.0, and 2.10.0.
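The guarded, version-tolerant import probe can be sketched as follows. This is a simplified illustration of the resolution strategy only (the real _resolve_fsdp2_detect_fns also resolves the state-getter functions); it degrades to None on any failure, which maps to the documented "return False, preserve FSDP1 behavior" fallback.

```python
import importlib

def resolve_fsdp2_state_cls():
    # Try the documented import paths, newest first; any breakage in a
    # future release degrades to None rather than raising.
    for path in (
        "torch.distributed.fsdp._fully_shard._fsdp_state",  # 2.10+
        "torch.distributed._composable.fsdp",               # 2.4–2.9
    ):
        try:
            mod = importlib.import_module(path)
            cls = getattr(mod, "FSDPState", None)
            if cls is not None:
                return cls
        except Exception:
            continue
    return None

cls = resolve_fsdp2_state_cls()
assert cls is None or isinstance(cls, type)   # works with or without torch
```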

Source code in peft/tuners/lora/dora.py
def _is_fsdp2_managed(module) -> bool:
    """Detect whether *module* is wrapped by PyTorch FSDP2 (composable API).

    FSDP2 (``torch.distributed._composable.fsdp.fully_shard``, available since
    PyTorch 2.4) attaches ``FSDPState`` to modules but does **not** wrap them
    with ``FullyShardedDataParallel``, so the FSDP1 ``summon_full_params`` API
    silently no-ops.  We detect FSDP2 by checking for the state object that
    ``fully_shard`` attaches.

    Implementation notes (private API dependencies):
      - ``FSDPState``: the composable-FSDP state class.  Import location moved
        from ``torch.distributed._composable.fsdp`` (2.4–2.9) to
        ``torch.distributed.fsdp._fully_shard._fsdp_state`` (2.10+).
        Both paths are tried by ``_resolve_fsdp2_detect_fns``.
      - ``torch.distributed._composable_state._get_module_state``: stable
        since PyTorch 2.4.  Returns the composable state attached by
        ``fully_shard``.
      - ``torch.distributed.fsdp._common_utils._get_module_fsdp_state``:
        legacy fallback for PyTorch < 2.4.

    Detection functions are resolved once on first call and cached to avoid
    import overhead on every forward pass (hundreds of layers × thousands of
    steps).  All imports are guarded by ``try/except`` so breakage in future
    PyTorch releases degrades to returning ``False`` (FSDP1 behavior preserved).
    Last verified against PyTorch 2.4.0, 2.5.0, 2.6.0, and 2.10.0.
    """
    if not isinstance(module, nn.Module):
        return False

    global _fsdp2_detect_fns  # noqa: PLW0603
    if _fsdp2_detect_fns is None:
        with _fused_cache_lock:
            if _fsdp2_detect_fns is None:
                _fsdp2_detect_fns = _resolve_fsdp2_detect_fns()

    fsdp_state_cls, get_state_fn = _fsdp2_detect_fns

    if get_state_fn is None:
        return False

    try:
        state = get_state_fn(module)
    except (TypeError, AttributeError):
        state = None

    if state is not None:
        if fsdp_state_cls is not None:
            if isinstance(state, fsdp_state_cls):
                return True
        elif FSDP is None or not isinstance(module, FSDP):
            # Has composable state but is not FSDP1-wrapped → FSDP2
            return True

    # Note: we intentionally do NOT check for DTensor parameters here.
    # FSDP2 converts child parameters to DTensor when fully_shard() is
    # called on a parent, so _get_module_state returns None for leaf layers
    # even though their params are sharded.  However, DTensor is also used
    # by Tensor Parallelism and Pipeline Parallelism — checking for DTensor
    # would false-positive on TP-only configs and crash DoRA forward.
    # The primary detection via _get_module_state catches directly-wrapped
    # modules.  Parent-only FSDP2 wrapping is a known detection gap, but
    # in that configuration FSDP2's own pre-forward hooks unshard parameters
    # before DoRA's forward runs, so norms are computed from full params.
    return False

peft.tuners.lora.dora._fsdp_full_param_ctx(*modules)

Best-effort context to expose full parameters when modules are wrapped with torch.distributed.fsdp.FullyShardedDataParallel (FSDP).

- Yields exactly once.
- No-ops outside FSDP or if modules are not FSDP-wrapped.
- Does not swallow exceptions raised inside the 'with' body.

This is safe under ZeRO/DP/DDP (it will simply do nothing). Debug logs which modules were successfully summoned (best-effort).

Callers must only pass nn.Module instances — raw tensors or nn.Parameter objects (e.g. embedding LoRA factors) are not individually FSDP-wrapped and should not be passed here. Use _maybe_gather_base_params_ctx for those.

Raises RuntimeError if any module is managed by FSDP2 (composable API), which uses a different full-parameter mechanism that this helper does not support. Failing loudly is preferable to silently computing norms from sharded parameters.
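The "enter whichever contexts succeed, skip the rest, yield exactly once" shape of this helper is a reusable contextlib pattern. The sketch below is a generic, torch-free illustration of that pattern, not the peft implementation.

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def best_effort_ctx(*context_factories):
    # Enter every context that can be constructed and entered; skip the
    # rest; always yield exactly once; never swallow body exceptions.
    with ExitStack() as stack:
        entered = 0
        for factory in context_factories:
            try:
                stack.enter_context(factory())
                entered += 1
            except (TypeError, AttributeError, RuntimeError):
                continue  # not applicable for this object; skip
        yield entered

@contextmanager
def good():
    yield

def bad():
    raise RuntimeError("not wrapped")  # stands in for a non-FSDP module

with best_effort_ctx(good, bad) as n:
    assert n == 1   # one context entered, one skipped, body still runs
```

ExitStack guarantees every successfully entered context is exited even if a later factory raises or the body fails, which is why the real helper uses it rather than nesting `with` statements.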

Source code in peft/tuners/lora/dora.py
@contextmanager
def _fsdp_full_param_ctx(*modules):
    """
    Best-effort context to expose full parameters when modules are wrapped with
    torch.distributed.fsdp.FullyShardedDataParallel (FSDP).
    - Yields exactly once.
    - No-ops outside FSDP or if modules are not FSDP-wrapped.
    - Does not swallow exceptions raised inside the 'with' body.
    This is safe under ZeRO/DP/DDP (it will simply do nothing).
    Debug logs which modules were successfully summoned (best-effort).

    Callers must only pass ``nn.Module`` instances — raw tensors or
    ``nn.Parameter`` objects (e.g. embedding LoRA factors) are not
    individually FSDP-wrapped and should not be passed here.  Use
    ``_maybe_gather_base_params_ctx`` for those.

    Raises ``RuntimeError`` if any module is managed by FSDP2 (composable API),
    which uses a different full-parameter mechanism that this helper does not
    support.  Failing loudly is preferable to silently computing norms from
    sharded parameters.
    """
    if FSDP is None:
        yield
        return

    # Filter to nn.Module instances only — raw tensors and Parameters should
    # use _maybe_gather_base_params_ctx instead.
    modules = tuple(m for m in modules if m is not None and isinstance(m, nn.Module))
    if not modules:
        yield
        return

    # Detect FSDP2 and fail loudly rather than silently returning shards.
    for m in modules:
        if _is_fsdp2_managed(m):
            raise RuntimeError(
                f"DoRA detected FSDP2-wrapped module ({type(m).__name__}). "
                "The current DoRA implementation only supports FSDP1's "
                "`summon_full_params` API. FSDP2 (composable `fully_shard`) "
                "requires a different full-parameter mechanism that is not yet "
                "implemented. Using FSDP2 with DoRA would silently compute "
                "norms from sharded parameters and produce incorrect results."
            )

    with ExitStack() as stack:
        summoned = 0
        for m in modules:
            try:
                cm = FSDP.summon_full_params(m, writeback=False, with_grads=False)
            except (TypeError, AttributeError):
                # Not FSDP-wrapped or incompatible; skip
                continue
            try:
                stack.enter_context(cm)
                summoned += 1
            except RuntimeError:
                # Some FSDP variants may raise at enter time; skip
                continue
        if summoned:
            logger.debug("DoRA: entered FSDP full-param ctx for %d module(s)", summoned)
        yield

peft.tuners.lora.dora._maybe_gather_base_params_ctx(base_layer, *extra_modules)

Only required under DeepSpeed ZeRO-3, where parameters are sharded. For ZeRO-2, params are replicated, so gathering is unnecessary. We gate gathering by:

- the explicit PEFT_FORCE_GATHER override when set, else
- DS_ZERO_STAGE==3 (env), or
- check_deepspeed_zero3_enabled().

We try the param-tuple signature first, else the module object, and log which one worked.

extra_modules are additional modules or raw tensors/parameters whose parameters should also be gathered (e.g. lora_A, lora_B). Under ZeRO-3 the adapter weights can be sharded too, so every tensor consumed by the norm path must be inside the gather scope.

Items that are nn.Module contribute via .parameters(). Items that are bare torch.Tensor / nn.Parameter (e.g. embedding LoRA factors) are included directly in the gather tuple.

Source code in peft/tuners/lora/dora.py
def _maybe_gather_base_params_ctx(base_layer, *extra_modules):
    """
    Only required under DeepSpeed ZeRO-3 where parameters are sharded. For ZeRO-2, params are
    replicated, so gathering is unnecessary. We gate gathering by:
      - explicit ``PEFT_FORCE_GATHER`` override when set, else
      - DS_ZERO_STAGE==3 (env), or
      - check_deepspeed_zero3_enabled().
    We try param-tuple signature first, else module object; logs which one worked.

    *extra_modules* are additional modules or raw tensors/parameters whose
    parameters should also be gathered (e.g. ``lora_A``, ``lora_B``).  Under
    ZeRO-3 the adapter weights can be sharded too, so every tensor consumed by
    the norm path must be inside the gather scope.

    Items that are ``nn.Module`` contribute via ``.parameters()``.  Items that
    are bare ``torch.Tensor`` / ``nn.Parameter`` (e.g. embedding LoRA factors)
    are included directly in the gather tuple.
    """
    if gather_params_ctx is None or not _is_zero3_active():
        return nullcontext()

    # Collect parameters from all modules (base + extras) into a single tuple.
    # Modules contribute via .parameters(); raw tensors are included directly.
    all_modules = [base_layer] + [m for m in extra_modules if m is not None]
    param_iterable = None
    try:
        params = []
        for mod in all_modules:
            if hasattr(mod, "parameters") and callable(mod.parameters):
                params.extend(mod.parameters())
            elif isinstance(mod, torch.Tensor):
                params.append(_resolve_tensor_base(mod))
        if params:
            param_iterable = tuple(params)
    except TypeError:
        param_iterable = None

    @contextmanager
    def _ctx():
        with ExitStack() as stack:
            entered = False
            if param_iterable is not None:
                try:
                    cm = gather_params_ctx(param_iterable)
                    stack.enter_context(cm)
                    logger.debug("DoRA: ZeRO-3 gather using param tuple (%d params)", len(param_iterable))
                    entered = True
                except (TypeError, AttributeError, RuntimeError) as exc:
                    logger.debug("DoRA: param-tuple gather failed (%s: %s), trying module", type(exc).__name__, exc)
                    entered = False

            if not entered:
                # Fall back to per-module gather.  Track successes and
                # failures separately — a *partial* gather (some modules
                # gathered, others not) is worse than no gather at all
                # because it silently mixes full and sharded tensors.
                gathered_mods = []
                failed_mods = []
                for mod in all_modules:
                    try:
                        # GatheredParameters expects an iterable of Parameters
                        # (or a single Parameter).  Passing an nn.Module directly
                        # makes GatheredParameters a silent no-op (the module
                        # isn't iterable and lacks ds_id).  Always extract params.
                        if isinstance(mod, torch.Tensor):
                            target = (_resolve_tensor_base(mod),)
                        elif hasattr(mod, "parameters") and callable(mod.parameters):
                            target = tuple(mod.parameters())
                            if not target:
                                # Module has no parameters — nothing to gather.
                                gathered_mods.append(type(mod).__name__)
                                continue
                        else:
                            failed_mods.append((type(mod).__name__, "TypeError", "not a Module or Tensor"))
                            continue
                        cm = gather_params_ctx(target)
                        stack.enter_context(cm)
                        gathered_mods.append(type(mod).__name__)
                    except (TypeError, AttributeError, RuntimeError) as exc:
                        failed_mods.append((type(mod).__name__, type(exc).__name__, str(exc)))

                if gathered_mods:
                    entered = True
                    logger.debug("DoRA: ZeRO-3 gather using module objects (%s)", ", ".join(gathered_mods))

                if gathered_mods and failed_mods:
                    # Partial gather: some modules gathered, others failed.
                    # This silently mixes fully gathered and sharded tensors
                    # in the norm computation — a correctness violation.
                    failed_desc = "; ".join(f"{n} ({e}: {m})" for n, e, m in failed_mods)
                    msg = (
                        f"DoRA: ZeRO-3 partial gather — gathered [{', '.join(gathered_mods)}] "
                        f"but failed for [{failed_desc}]. "
                        "Norm computation would mix fully gathered and sharded parameters, "
                        "producing incorrect results."
                    )
                    if _allow_partial_gather():
                        logger.warning(msg + " Continuing due to PEFT_DORA_ALLOW_PARTIAL_GATHER=1.")
                    else:
                        raise RuntimeError(
                            msg + " Set PEFT_DORA_ALLOW_PARTIAL_GATHER=1 to override (at your own risk)."
                        )

            if not entered:
                logger.warning(
                    "DoRA: ZeRO-3 gather failed for all modules. "
                    "Proceeding without gathering — outputs may be incorrect if parameters "
                    "are truly sharded.",
                )
            yield

    return _ctx()

peft.tuners.lora.dora._is_zero3_active()

Return True when DoRA should gather sharded parameters.

PEFT_FORCE_GATHER is a cached ternary override:

* unset -> auto-detect ZeRO-3
* 1 / true -> force gather on and cache forever
* 0 / false -> force gather off and cache forever

When the override is unset, only detected True is cached — ZeRO-3 may initialize after the first DoRA call (the common HF Trainer flow: create model -> PEFT wrap -> deepspeed.initialize), so an auto-detected False must still be re-evaluated on later forwards.
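The ternary (True/False/None) override parse can be sketched standalone. `force_gather_override` below is a hypothetical analog of the private cached accessor, shown without the caching:

```python
import os
from typing import Optional

def force_gather_override() -> Optional[bool]:
    # True/False when PEFT_FORCE_GATHER is explicitly set,
    # None when unset (caller then auto-detects ZeRO-3).
    raw = os.environ.get("PEFT_FORCE_GATHER")
    if raw is None:
        return None
    return raw.strip().lower() in ("1", "true")

os.environ["PEFT_FORCE_GATHER"] = "0"
assert force_gather_override() is False    # explicit off wins
del os.environ["PEFT_FORCE_GATHER"]
assert force_gather_override() is None     # unset → auto-detect
```

The three-way return is what lets the caller distinguish "explicitly disabled" (never gather) from "unset" (keep re-checking for late DeepSpeed initialization).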

Source code in peft/tuners/lora/dora.py
def _is_zero3_active() -> bool:
    """Return True when DoRA should gather sharded parameters.

    ``PEFT_FORCE_GATHER`` is a cached ternary override:
      * unset -> auto-detect ZeRO-3
      * ``1`` / ``true`` -> force gather on and cache forever
      * ``0`` / ``false`` -> force gather off and cache forever

    When the override is unset, only detected ``True`` is cached — ZeRO-3 may
    initialize after the first DoRA call (the common HF Trainer flow: create
    model -> PEFT wrap -> deepspeed.initialize), so an auto-detected ``False``
    must still be re-evaluated on later forwards.
    """
    force = _force_gather_override()
    if force is not None:
        return force
    global _cached_is_zero3  # noqa: PLW0603
    if _cached_is_zero3 is True:
        return True
    # No explicit override: don't cache False, re-evaluate late DS init.
    # Skip all DeepSpeed checks when distributed isn't initialized — ZeRO-3
    # can't be active without a process group.  This makes the False path very
    # cheap (~100ns for the is_initialized() boolean) on single-GPU setups
    # with 100+ DoRA layers, avoiding the os.environ.get() on every call.
    is_zero3_ds = False
    try:
        if torch.distributed.is_initialized():
            # Fast path: if the env var is set, trust it immediately.
            if os.environ.get("DS_ZERO_STAGE") == "3":
                with _fused_cache_lock:
                    _cached_is_zero3 = True
                return True
            try:
                is_zero3_ds = check_deepspeed_zero3_enabled()
            except (ImportError, RuntimeError, ValueError):
                pass
            except Exception:
                logger.debug("DoRA: check_deepspeed_zero3_enabled() raised unexpected error", exc_info=True)
    except (RuntimeError, AttributeError):
        pass
    result = is_zero3_ds
    if result:
        with _fused_cache_lock:
            _cached_is_zero3 = True
    return result
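
The body of `_force_gather_override` is not shown above, but the docstring pins down its ternary contract: unset means auto-detect, truthy means force on, falsy means force off. A hypothetical standalone sketch of that parse (the name `parse_ternary_override` and the fallback-to-False handling of unrecognized values are assumptions, not the real helper):

```python
import os
from typing import Optional


def parse_ternary_override(name: str) -> Optional[bool]:
    """Hypothetical sketch of a ternary env override parse.

    unset        -> None  (caller auto-detects)
    "1"/"true"   -> True  (force on)
    anything else -> False (force off; assumed behavior for unrecognized values)
    """
    raw = os.environ.get(name)
    if raw is None:
        return None
    return raw.strip().lower() in ("1", "true")


os.environ.pop("PEFT_FORCE_GATHER", None)
print(parse_ternary_override("PEFT_FORCE_GATHER"))  # None -> auto-detect ZeRO-3
os.environ["PEFT_FORCE_GATHER"] = "true"
print(parse_ternary_override("PEFT_FORCE_GATHER"))  # True -> force gather on
os.environ["PEFT_FORCE_GATHER"] = "0"
print(parse_ternary_override("PEFT_FORCE_GATHER"))  # False -> force gather off
```

The real accessor additionally caches the parsed value, which is why the docstring says an explicit override is "cached forever" while auto-detected False is not.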

peft.tuners.lora.dora._allow_partial_gather()

Return True if PEFT_DORA_ALLOW_PARTIAL_GATHER=1.

Cached after first call, consistent with other env-var accessors.

Source code in peft/tuners/lora/dora.py
def _allow_partial_gather() -> bool:
    """Return True if ``PEFT_DORA_ALLOW_PARTIAL_GATHER=1``.

    Cached after first call, consistent with other env-var accessors.
    """
    global _cached_allow_partial_gather  # noqa: PLW0603
    val = _cached_allow_partial_gather
    if val is not _SENTINEL:
        return val
    # Dynamo cannot trace through threading.Lock — resolve directly during
    # compilation (single-threaded, so the lock is unnecessary).
    if dynamo_is_compiling is not None and dynamo_is_compiling():
        return os.environ.get("PEFT_DORA_ALLOW_PARTIAL_GATHER", "0") == "1"
    with _fused_cache_lock:
        if _cached_allow_partial_gather is not _SENTINEL:
            return _cached_allow_partial_gather
        _cached_allow_partial_gather = os.environ.get("PEFT_DORA_ALLOW_PARTIAL_GATHER", "0") == "1"
        return _cached_allow_partial_gather

Composition Helpers

Eager (non-fused) composition functions used as the fallback path.

peft.tuners.lora.dora._compose_eager_inplace(lora, base, mag_norm_scale, scale)

Numerically stable in-place DoRA composition for eager PyTorch paths.

Computes out = (mag - 1) * base + mag * (scale * lora) in-place into lora, avoiding catastrophic cancellation when mag ≈ 1 in bf16/fp16.

When all operands already match lora.dtype, the in-place path uses lora *= scale then lora *= mag then lora += (mag - 1) * base. The two-step multiply preserves the canonical associativity mag * (scale * lora) (scale first, then mag).

Under mixed dtypes (for example fp32 magnitude with bf16 activations under AMP), eager training defines the reference contract: evaluate the stable form in the promoted dtype, then cast back to the activation dtype. The in-place helper mirrors that by materializing the promoted result and copying it back into lora. This restores bitwise parity across eager out-of-place, eager in-place, and chunked eager composition, but it does not by itself change the separate fused-autograd dtype contract in _compose_with_dispatch.

This is the single source of truth for the eager in-place formula. The Triton kernel in dora_fused.py computes the same expression but is maintained separately because kernel code cannot share Python helpers. See test_compose_formula_cross_reference for the consistency assertion.

Source code in peft/tuners/lora/dora.py
def _compose_eager_inplace(
    lora: torch.Tensor,
    base: torch.Tensor,
    mag_norm_scale: torch.Tensor,
    scale: float,
) -> torch.Tensor:
    """Numerically stable in-place DoRA composition for eager PyTorch paths.

    Computes ``out = (mag - 1) * base + mag * (scale * lora)`` in-place into
    *lora*, avoiding catastrophic cancellation when ``mag ≈ 1`` in bf16/fp16.

    When all operands already match ``lora.dtype``, the in-place path uses
    ``lora *= scale`` then ``lora *= mag`` then ``lora += (mag - 1) * base``.
    The two-step multiply preserves the canonical associativity
    ``mag * (scale * lora)`` (scale first, then mag).

    Under mixed dtypes (for example fp32 magnitude with bf16 activations under
    AMP), eager training defines the reference contract: evaluate the stable
    form in the promoted dtype, then cast back to the activation dtype.  The
    in-place helper mirrors that by materializing the promoted result and
    copying it back into ``lora``.  This restores bitwise parity across eager
    out-of-place, eager in-place, and chunked eager composition, but it does
    not by itself change the separate fused-autograd dtype contract in
    ``_compose_with_dispatch``.

    This is the single source of truth for the eager in-place formula.
    The Triton kernel in ``dora_fused.py`` computes the same expression
    but is maintained separately because kernel code cannot share Python helpers.
    See ``test_compose_formula_cross_reference`` for the consistency assertion.
    """
    if _promoted_compose_dtype(lora.dtype, base.dtype, mag_norm_scale.dtype) != lora.dtype:
        result = mag_norm_scale * (scale * lora) + (mag_norm_scale - 1) * base
        lora.copy_(result)
        return lora

    # Step 1: lora = scale * lora  (in-place, canonical order: scale first)
    lora.mul_(scale)
    # Step 2: lora = mag * lora  (in-place, canonical order: mag second)
    lora.mul_(mag_norm_scale)
    # Step 3: lora += (mag - 1) * base  (in-place, adds the base correction)
    lora.add_(base * (mag_norm_scale - 1))
    return lora
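
The cancellation the docstring guards against can be shown without torch: emulate bfloat16 by rounding every intermediate to the top 16 bits of a float32 encoding. With `mag` near 1 and a small LoRA update, the naive rearrangement `mag * (base + scale*lora) - base` absorbs the update into `base` and then cancels it away, while the stable form keeps it. A pure-Python sketch (the `to_bf16` helper is illustrative, not PEFT code):

```python
import struct


def to_bf16(x: float) -> float:
    # Round to bfloat16 precision: keep the top 16 bits of the float32
    # encoding, rounding to nearest on the dropped lower half.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


mag, base, lora, scale = 1.0078125, 1.0, 0.001, 1.0
exact = (mag - 1) * base + mag * (scale * lora)  # float64 reference

# Naive form, rounding every step: base + scale*lora rounds back to base
# (0.001 is below half a bf16 ulp at 1.0), so the LoRA term vanishes.
t = to_bf16(base + to_bf16(scale * lora))            # -> 1.0 exactly
naive = to_bf16(to_bf16(mag * t) - base)             # -> 0.0078125, lora lost

# Stable form, rounding every step: (mag - 1) is computed before it can be
# swamped by base, so the small LoRA contribution survives.
stable = to_bf16(
    to_bf16(to_bf16(mag - 1) * base) + to_bf16(mag * to_bf16(scale * lora))
)

print(naive, stable, exact)
print(abs(naive - exact), abs(stable - exact))  # naive error is far larger
```

With these inputs the naive result is exactly `(mag - 1) * base` with the LoRA update erased, while the stable form stays within a few bf16 ulps of the float64 reference.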