Configuration
Configuration functions from peft.tuners.lora.dora that control fused kernel dispatch,
memory thresholds, backward-pass heuristics, and distributed training integration.
Info
These are module-level functions, most prefixed with _ (private). They are documented
here as a developer reference for understanding the runtime control flow.
Fused Kernel Control¶
Functions that read environment variables and manage cached decisions about whether to use fused Triton kernels.
peft.tuners.lora.dora._use_fused_kernels()
¶
Return True if fused Triton kernels should be used (when available).
Controlled by env var PEFT_DORA_FUSED:
* "1" or "true" (case-insensitive) → enable
* "0" or "false" → disable
* unset → enable by default when Triton is available
The result is cached after the first call so that os.environ.get()
is not invoked on every forward pass. A threading.Lock guards the
first write to avoid TOCTOU races in threaded launchers.
Note
This flag enables Triton kernels for inference-style forward paths.
During training, fused compose uses the custom autograd path by
default (disable with PEFT_DORA_FUSED_BACKWARD=0).
Source code in peft/tuners/lora/dora.py (lines 276-316)
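The tri-state flag parsing and double-checked caching described above can be sketched as follows; `use_fused_kernels` and `_probe_triton` are hypothetical stand-ins for the real accessors in peft/tuners/lora/dora.py.

```python
import os
import threading

_cache = {}
_cache_lock = threading.Lock()

def _probe_triton():
    # Hypothetical stand-in for the real Triton availability check.
    return True

def use_fused_kernels(env=os.environ):
    # Return the cached decision if present; otherwise compute it once
    # under a lock so concurrent first calls agree (no TOCTOU race).
    if "fused" in _cache:
        return _cache["fused"]
    with _cache_lock:
        if "fused" not in _cache:  # re-check after acquiring the lock
            raw = env.get("PEFT_DORA_FUSED")
            if raw is None:
                # unset: default to on when Triton is available
                _cache["fused"] = _probe_triton()
            else:
                _cache["fused"] = raw.strip().lower() in ("1", "true")
    return _cache["fused"]
```

Note that after the first call the env var is never re-read; this is why the flag must be set before the first forward pass.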
peft.tuners.lora.dora._use_fused_backward()
¶
Return True if the custom autograd backward path should be used.
Controlled by env var PEFT_DORA_FUSED_BACKWARD:
* "1" or "true" → enable
* "0" or "false" → disable
* unset → enabled by default in auto mode (the fused backward
uses PyTorch fallbacks when Triton is unavailable, so Triton is not
required; in auto mode a shape heuristic may still skip the fused path
for small tensors, see _should_auto_use_fused_backward_shape; set
PEFT_DORA_FUSED_BACKWARD=0 to opt out)
Enabled by default because the fused forward-and-inner kernel eliminates
the VRAM spike from sequential PyTorch ops, and the frozen-mag path skips
the inner allocation entirely when mag_norm_scale doesn't require
gradients. Overhead in the normal (unfrozen) case is exactly one
lora-sized activation per layer (the saved inner).
Set PEFT_DORA_FUSED_BACKWARD=0 to opt out if VRAM is extremely tight.
Source code in peft/tuners/lora/dora.py (lines 319-347)
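For example, opting out looks like this (the flag is cached on first read, so it must be set before the first forward pass):

```python
import os

# Disable the fused backward path; must run before the model's first
# forward pass, since the decision is cached on first read.
os.environ["PEFT_DORA_FUSED_BACKWARD"] = "0"
```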
peft.tuners.lora.dora._invalidate_fused_cache()
¶
Reset cached env var results (fused flags + thresholds + FSDP2 detection). Useful for testing.
Source code in peft/tuners/lora/dora.py (lines 350-365)
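A minimal sketch of the cache-and-invalidate pattern this enables, useful for tests that flip env vars between cases; `cached_flag` and `invalidate_cache` are hypothetical simplifications of the real accessors:

```python
_env_cache = {}

def cached_flag(name, env):
    # Each accessor caches its env-var decision on first read.
    if name not in _env_cache:
        _env_cache[name] = env.get(name, "").strip().lower() in ("1", "true")
    return _env_cache[name]

def invalidate_cache():
    # Reset every cached decision so tests can vary env vars per case.
    _env_cache.clear()
```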
peft.tuners.lora.dora.is_triton_available()
cached
¶
Return True if Triton is importable, without importing dora_fused.
This lightweight probe is kept separate from dora_fused.is_triton_available
to preserve lazy-import behavior for users who never execute DoRA paths.
Source code in peft/tuners/lora/dora.py (lines 175-186)
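A lightweight probe along these lines can be written with importlib.util.find_spec, which checks importability without executing the package's import; `triton_importable` is a hypothetical stand-in:

```python
import importlib.util
from functools import lru_cache

@lru_cache(maxsize=None)
def triton_importable():
    # find_spec checks importability without importing triton itself,
    # so no heavy modules load for users who never hit DoRA paths.
    return importlib.util.find_spec("triton") is not None
```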
Memory Thresholds¶
Functions controlling chunking thresholds for norm computation and forward passes. Matrices exceeding these thresholds are processed in chunks to bound peak memory.
peft.tuners.lora.dora.set_dora_norm_threshold_mb(mb)
¶
Set the PEFT_DORA_NORM_CHUNK_MB environment variable to control the working-set memory threshold (in MB). Enforces that mb is an integer >= 16 and <= 65536 (64 GB). Raises ValueError if mb is out of bounds.
Source code in peft/tuners/lora/dora.py (lines 614-627)
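A sketch of the documented validation contract (an integer in [16, 65536]); `set_norm_threshold_mb` is a hypothetical simplification of the real setter:

```python
import os

def set_norm_threshold_mb(mb, env=os.environ):
    # Bounds mirror the documented contract: an integer in [16, 65536].
    # bool is excluded explicitly since it subclasses int in Python.
    if not isinstance(mb, int) or isinstance(mb, bool) or not 16 <= mb <= 65536:
        raise ValueError(f"threshold must be an int in [16, 65536], got {mb!r}")
    env["PEFT_DORA_NORM_CHUNK_MB"] = str(mb)
```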
peft.tuners.lora.dora.get_dora_norm_threshold_mb()
¶
Return the current DoRA norm chunk threshold in MB.
Source code in peft/tuners/lora/dora.py (lines 1746-1748)
peft.tuners.lora.dora.get_dora_norm_threshold_bytes()
¶
Return the current DoRA norm chunk threshold in bytes.
Source code in peft/tuners/lora/dora.py (lines 1751-1753)
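The two getters differ only by unit; a sketch of the conversion, assuming the conventional 1 MB = 1024 * 1024 bytes:

```python
MB = 1024 * 1024

def threshold_bytes(threshold_mb):
    # The bytes getter is presumably just the MB value scaled.
    return threshold_mb * MB
```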
Fused Backward Heuristics¶
Shape-based heuristics that decide whether the fused backward kernel is beneficial for a given tensor. Small matrices may not benefit from kernel launch overhead.
peft.tuners.lora.dora._should_auto_use_fused_backward_shape(num_rows, num_cols)
¶
Benchmark-informed crossover for auto-enabled fused backward.
num_cols is the activation's last dimension (d_out for linear
layers, i.e. lora_out.shape[-1]). num_rows is the product of
all other dimensions (batch * seq for linear layers).
The warmed 6-GPU benchmark bundle (L40S, A100, RTX 6000 PRO, H200, B200, B300) shows the crossover entering the win regime around the 2048x6144 (rows x cols) activation shape on Blackwell and Ampere; L40S and H200 may still trail at the threshold and do not consistently win until roughly 2x the threshold work-item count. The threshold is therefore conservative on high-bandwidth HBM GPUs and slightly aggressive on lower-bandwidth / older architectures.
We keep explicit env-var enables as a force-on override; this heuristic only applies when PEFT_DORA_FUSED_BACKWARD is unset (the default auto mode).
Source code in peft/tuners/lora/dora.py (lines 903-925)
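A sketch of how the row/column split is derived from an activation shape; the threshold constants mirror the documented 2048x6144 crossover, but the real predicate is benchmark-tuned and may differ:

```python
import math

# Hypothetical constants matching the documented 2048 x 6144 crossover.
_ROW_THRESHOLD = 2048
_COL_THRESHOLD = 6144

def should_auto_fuse(activation_shape):
    # num_cols is the last dimension (d_out); num_rows is the product
    # of all leading dimensions (batch * seq for linear layers).
    num_cols = activation_shape[-1]
    num_rows = math.prod(activation_shape[:-1])
    return num_rows >= _ROW_THRESHOLD and num_cols >= _COL_THRESHOLD
```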
peft.tuners.lora.dora._should_use_fused_backward_for_tensor(lora_out, mag_norm_scale=None)
¶
Decide whether training-time compose should route through fused backward.
Source code in peft/tuners/lora/dora.py (lines 928-958)
FSDP & ZeRO-3 Integration¶
Functions for detecting and interacting with distributed training frameworks (FSDP2, DeepSpeed ZeRO-3) that shard parameters across devices.
peft.tuners.lora.dora._is_fsdp2_managed(module)
¶
Detect whether module is wrapped by PyTorch FSDP2 (composable API).
FSDP2 (torch.distributed._composable.fsdp.fully_shard, available since
PyTorch 2.4) attaches FSDPState to modules but does not wrap them
with FullyShardedDataParallel, so the FSDP1 summon_full_params API
silently no-ops. We detect FSDP2 by checking for the state object that
fully_shard attaches.
Implementation notes (private API dependencies):
- FSDPState: the composable-FSDP state class. Import location moved
from torch.distributed._composable.fsdp (2.4–2.9) to
torch.distributed.fsdp._fully_shard._fsdp_state (2.10+).
Both paths are tried by _resolve_fsdp2_detect_fns.
- torch.distributed._composable_state._get_module_state: stable
since PyTorch 2.4. Returns the composable state attached by
fully_shard.
- torch.distributed.fsdp._common_utils._get_module_fsdp_state:
legacy fallback for PyTorch < 2.4.
Detection functions are resolved once on first call and cached to avoid
import overhead on every forward pass (hundreds of layers × thousands of
steps). All imports are guarded by try/except so breakage in future
PyTorch releases degrades to returning False (FSDP1 behavior preserved).
Last verified against PyTorch 2.4.0, 2.5.0, 2.6.0, and 2.10.0.
Source code in peft/tuners/lora/dora.py (lines 416-479)
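The resolve-once pattern for the moving import location can be sketched as below; `resolve_fsdp2_state_cls` is a hypothetical stand-in for `_resolve_fsdp2_detect_fns`, and the broad except mirrors the degrade-to-False contract:

```python
import importlib

def resolve_fsdp2_state_cls():
    # Try the newer import location first, then the older one; any
    # failure degrades to None, which callers treat as "not FSDP2".
    for mod_name in (
        "torch.distributed.fsdp._fully_shard._fsdp_state",  # PyTorch 2.10+
        "torch.distributed._composable.fsdp",                # PyTorch 2.4-2.9
    ):
        try:
            return getattr(importlib.import_module(mod_name), "FSDPState")
        except Exception:  # ImportError, AttributeError, future breakage
            continue
    return None
```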
peft.tuners.lora.dora._fsdp_full_param_ctx(*modules)
¶
Best-effort context to expose full parameters when modules are wrapped with
torch.distributed.fsdp.FullyShardedDataParallel (FSDP).
- Yields exactly once.
- No-ops outside FSDP or if modules are not FSDP-wrapped.
- Does not swallow exceptions raised inside the with body.
This is safe under ZeRO/DP/DDP (it will simply do nothing). Debug-logs which
modules were successfully summoned (best-effort).
Callers must only pass nn.Module instances — raw tensors or
nn.Parameter objects (e.g. embedding LoRA factors) are not
individually FSDP-wrapped and should not be passed here. Use
_maybe_gather_base_params_ctx for those.
Raises RuntimeError if any module is managed by FSDP2 (composable API),
which uses a different full-parameter mechanism that this helper does not
support. Failing loudly is preferable to silently computing norms from
sharded parameters.
Source code in peft/tuners/lora/dora.py (lines 482-542)
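A best-effort context along these lines can be built with contextlib.ExitStack; this sketch handles only the FSDP1 case and omits the FSDP2 RuntimeError and debug logging:

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def fsdp_full_param_ctx(*modules):
    # Best-effort: summon full params for FSDP1-wrapped modules only;
    # anything else (DDP, ZeRO, plain modules) passes through untouched.
    try:
        from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    except ImportError:
        FSDP = None
    with ExitStack() as stack:
        for m in modules:
            if FSDP is not None and isinstance(m, FSDP):
                stack.enter_context(FSDP.summon_full_params(m))
        yield  # exactly one yield; exceptions in the body propagate
```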
peft.tuners.lora.dora._maybe_gather_base_params_ctx(base_layer, *extra_modules)
¶
Gathering is only required under DeepSpeed ZeRO-3, where parameters are
sharded; under ZeRO-2, params are replicated, so gathering is unnecessary.
We gate gathering by:
- the explicit PEFT_FORCE_GATHER override when set, else
- DS_ZERO_STAGE==3 (env), or
- check_deepspeed_zero3_enabled().
We try the param-tuple signature first, falling back to the module object,
and log which one worked.
extra_modules are additional modules or raw tensors/parameters whose
parameters should also be gathered (e.g. lora_A, lora_B). Under
ZeRO-3 the adapter weights can be sharded too, so every tensor consumed by
the norm path must be inside the gather scope.
Items that are nn.Module contribute via .parameters(). Items that
are bare torch.Tensor / nn.Parameter (e.g. embedding LoRA factors)
are included directly in the gather tuple.
Source code in peft/tuners/lora/dora.py (lines 757-868)
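The module-vs-bare-tensor handling can be sketched as a flat tuple builder; `collect_gather_params` is hypothetical and omits the actual gather call that would consume the tuple:

```python
def collect_gather_params(base_layer_params, *extra_items):
    # Flatten everything the norm path consumes into one tuple so a
    # single gather scope covers base weights and adapter factors alike.
    params = list(base_layer_params)
    for item in extra_items:
        if hasattr(item, "parameters"):   # duck-typed nn.Module
            params.extend(item.parameters())
        else:                             # bare tensor / nn.Parameter
            params.append(item)
    return tuple(params)
```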
peft.tuners.lora.dora._is_zero3_active()
¶
Return True when DoRA should gather sharded parameters.
PEFT_FORCE_GATHER is a cached ternary override:
* unset -> auto-detect ZeRO-3
* 1 / true -> force gather on and cache forever
* 0 / false -> force gather off and cache forever
When the override is unset, only detected True is cached — ZeRO-3 may
initialize after the first DoRA call (the common HF Trainer flow: create
model -> PEFT wrap -> deepspeed.initialize), so an auto-detected False
must still be re-evaluated on later forwards.
Source code in peft/tuners/lora/dora.py (lines 650-694)
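The ternary caching policy can be sketched as follows; `should_gather` is a hypothetical simplification that takes the env value and detector as parameters rather than reading globals:

```python
_gather_cache = {}

def should_gather(force_gather, detect_zero3):
    # force_gather is the raw PEFT_FORCE_GATHER value (None when unset);
    # detect_zero3 is a zero-argument callable.
    if "decision" in _gather_cache:
        return _gather_cache["decision"]
    if force_gather is not None:
        # explicit override: cache either answer forever
        _gather_cache["decision"] = force_gather.strip().lower() in ("1", "true")
        return _gather_cache["decision"]
    detected = detect_zero3()
    if detected:
        # only True is sticky; ZeRO-3 may initialize after the first call
        _gather_cache["decision"] = True
    return detected
```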
peft.tuners.lora.dora._allow_partial_gather()
¶
Return True if PEFT_DORA_ALLOW_PARTIAL_GATHER=1.
Cached after first call, consistent with other env-var accessors.
Source code in peft/tuners/lora/dora.py (lines 630-647)
Composition Helpers¶
Eager (non-fused) composition functions used as the fallback path.
peft.tuners.lora.dora._compose_eager_inplace(lora, base, mag_norm_scale, scale)
¶
Numerically stable in-place DoRA composition for eager PyTorch paths.
Computes out = (mag - 1) * base + mag * (scale * lora) in-place into
lora, avoiding catastrophic cancellation when mag ≈ 1 in bf16/fp16.
When all operands already match lora.dtype, the in-place path uses
lora *= scale then lora *= mag then lora += (mag - 1) * base.
The two-step multiply preserves the canonical associativity
mag * (scale * lora) (scale first, then mag).
Under mixed dtypes (for example fp32 magnitude with bf16 activations under
AMP), eager training defines the reference contract: evaluate the stable
form in the promoted dtype, then cast back to the activation dtype. The
in-place helper mirrors that by materializing the promoted result and
copying it back into lora. This restores bitwise parity across eager
out-of-place, eager in-place, and chunked eager composition, but it does
not by itself change the separate fused-autograd dtype contract in
_compose_with_dispatch.
This is the single source of truth for the eager in-place formula.
The Triton kernel in dora_fused.py computes the same expression
but is maintained separately because kernel code cannot share Python helpers.
See test_compose_formula_cross_reference for the consistency assertion.
Source code in peft/tuners/lora/dora.py (lines 961-1002)
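The algebraic rearrangement can be illustrated with plain floats (a sketch; the real helper operates in-place on tensors with dtype promotion):

```python
def compose_stable(lora, base, mag, scale):
    # Stable form: (mag - 1) * base + mag * (scale * lora).
    return (mag - 1.0) * base + mag * (scale * lora)

def compose_naive(lora, base, mag, scale):
    # Algebraically equal, but the trailing subtraction cancels
    # catastrophically in bf16/fp16 when mag is close to 1.
    return mag * (base + scale * lora) - base

# When mag == 1 the stable form reduces exactly to scale * lora:
# compose_stable(0.5, 2.0, 1.0, 0.25) == 0.125
```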