QCFS Threshold (Lambda) Not Learning During ANN-to-SNN Conversion: Root Cause
If your QCFS clipping threshold freezes near its initial value and never moves during calibration, the method isn't broken - your optimizer configuration is. Here's the exact gradient-flow mechanism and the fix.
Direct answer
QCFS's learnable threshold parameter (commonly called lambda or theta) receives gradient signal that is one to two orders of magnitude smaller than the gradient flowing into convolutional weights, because lambda only affects the loss through the clip and floor boundary conditions of the quantization function. If lambda shares a learning rate with the network's weights, its effective update per step is too small to move it away from initialization within a normal training budget - it looks frozen because, numerically, it nearly is. The fix is to put lambda in its own parameter group with a separate, higher learning rate.
What QCFS is supposed to do
Quantization Clip-Floor-Shift (Bu et al., 2022) is a calibration method for converting a trained ANN into a spiking neural network. It replaces each ReLU activation with a function that clips, floors, and shifts the activation into discrete quantization levels, controlled by a learnable per-layer threshold parameter, typically written as λ (lambda) or θ (theta):
QCFS(x, λ, L) = λ · clip(floor((x / λ) · L + 0.5) / L, 0, 1)
The threshold λ is meant to learn the right firing threshold for each layer's converted spiking neurons, so the network's spike rate over T timesteps approximates the original ReLU's continuous output. When this works, the converted SNN should recover most of the original ANN's accuracy with a small number of timesteps. When it doesn't work, λ sits at (or very near) its initial value - often 1.0, or whatever default the implementation initializes it to - for the entire calibration run, and the converted network never properly calibrates per-layer.
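To make the formula concrete, here is a minimal pure-Python sketch that evaluates it for a single scalar activation. The function name `qcfs` and the sample inputs are illustrative assumptions, not taken from any particular implementation; a real pipeline would apply the same arithmetic to whole tensors.

```python
import math

def qcfs(x: float, lam: float, levels: int) -> float:
    """Scalar QCFS: lam * clip(floor(x / lam * levels + 0.5) / levels, 0, 1)."""
    if lam <= 0 or levels <= 0:
        raise ValueError("lam and levels must be positive")
    quantized = math.floor(x / lam * levels + 0.5) / levels
    return lam * min(max(quantized, 0.0), 1.0)

# With lam = 1.0 and levels = 4, outputs are restricted to multiples of 0.25 in [0, 1].
samples = [-0.5, 0.1, 0.3, 0.9, 2.0]
outputs = [qcfs(x, lam=1.0, levels=4) for x in samples]
print(outputs)

# The same input 2.0 is no longer saturated once the threshold is raised.
print(qcfs(2.0, lam=3.0, levels=4))
```

The last two calls show why the threshold matters: with λ = 1.0 an activation of 2.0 saturates at the clip bound, while a larger λ lets it land on an interior quantization level.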
Why this looks like a method failure but isn't
It is tempting, after a calibration run where λ doesn't move, to conclude that QCFS as a method doesn't generalize, or that learnable thresholds don't help. That conclusion is wrong, and the giveaway is in the gradient, not the result. If you log gradient norms separately for λ parameters versus convolutional weight parameters during the first few calibration steps, you will typically see λ's gradient norm sitting one to two orders of magnitude below the weights'. That's not a sign the method has no learning signal for λ - it's a sign the learning signal exists but is small, and a shared learning rate tuned for weights is far too conservative for it.
Where the small gradient comes from
λ appears in the QCFS function twice: as the outer scale factor and inside the clip and floor operations. The floor function's gradient is zero almost everywhere by construction (it's piecewise constant), so in practice QCFS implementations use a surrogate gradient through the floor, typically approximated as 1 within the active range and 0 outside the clip bounds. Under that surrogate, the two appearances of λ nearly cancel for activations inside the active range, leaving only a small quantization-error term. That means λ's gradient contribution comes almost entirely from activations at or beyond the upper clip boundary - a much smaller fraction of the total activation distribution than the activations that contribute gradient to the layer's weights through the main matrix multiply. Fewer activations contributing gradient means a smaller aggregate gradient magnitude for λ, by construction of the function, not as an implementation accident.
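The argument in the preceding paragraph can be written out by differentiating the QCFS formula with respect to λ. This is a sketch under one stated assumption: the floor is treated as the identity in the backward pass (a straight-through surrogate). An implementation that uses a different surrogate will get a different expression.

```latex
h(x) = \lambda \,\operatorname{clip}\!\left(\frac{1}{L}\left\lfloor \frac{xL}{\lambda} + \frac{1}{2} \right\rfloor,\ 0,\ 1\right)

% Backward-pass assumption: \partial \lfloor u \rfloor / \partial u := 1
\frac{\partial h}{\partial \lambda} =
\begin{cases}
\dfrac{h(x) - x}{\lambda}, & \text{clip inactive (active range)} \\[6pt]
0, & \text{clipped at } 0 \\[4pt]
1, & \text{clipped at } 1
\end{cases}

% Inside the active range h(x) rounds x to the nearest multiple of \lambda / L, so
\left| \frac{\partial h}{\partial \lambda} \right| = \frac{|h(x) - x|}{\lambda} \le \frac{1}{2L}
```

Read this way, an activation inside the active range contributes at most 1/(2L) in magnitude, with a sign that depends on the rounding direction, so those terms can partially cancel when summed over a batch. An activation clipped at the upper bound contributes exactly 1. If few activations reach that bound, λ's aggregate gradient stays small.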
The fix: separate parameter groups, separate learning rate
The standard fix, consistent with how the original QCFS paper and most working reimplementations handle this, is to never let λ share an optimizer learning rate with the convolutional/linear weights:
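import torch

# `model` is your existing network. The name filter assumes threshold
# parameters have 'lambda' in their parameter names; adjust if yours differ.
lambda_params = [p for n, p in model.named_parameters() if 'lambda' in n]
weight_params = [p for n, p in model.named_parameters() if 'lambda' not in n]
optimizer = torch.optim.SGD([
    {'params': weight_params, 'lr': 0.01},  # base learning rate
    {'params': lambda_params, 'lr': 0.5},   # 50x base, the top of the 10-50x range
])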
The exact multiplier depends on the network depth and the specific activation statistics of the pretrained ANN, but 10-50x the base learning rate is a reasonable starting search range. The right value is the one where logging λ itself shows it moving by a meaningful fraction of its initial value within the first epoch, not the one that "feels aggressive."
Two more contributing causes worth ruling out
Initialization far from the right scale
If λ is initialized to a constant like 1.0 for every layer regardless of that layer's actual activation range, a small gradient has to travel a long distance to reach the right value. Initializing λ per-layer from the ANN's actual activation statistics (e.g. the 99.7th percentile of that layer's pre-activation outputs on a calibration batch) gives the gradient a much shorter distance to travel, which compounds with the learning-rate fix rather than replacing it.
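A percentile-based initialization along these lines can be sketched with the standard library alone. Everything here is an illustrative assumption: the helper names `percentile` and `init_lambdas`, the nearest-rank percentile definition, and the synthetic activation lists standing in for values you would collect from a real calibration batch.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile of a non-empty sequence, with q in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]

def init_lambdas(activations_by_layer, q=99.7):
    """Map each layer name to a lambda taken from that layer's q-th percentile."""
    return {name: percentile(acts, q) for name, acts in activations_by_layer.items()}

# Hypothetical calibration-batch activations for two layers with different scales.
activations = {
    "conv1": [i * 0.004 for i in range(1000)],  # spans roughly 0 to 4
    "conv2": [i * 0.002 for i in range(1000)],  # spans roughly 0 to 2
}
lambdas = init_lambdas(activations)
print(lambdas)
```

Each layer ends up with a starting threshold near the top of its own activation range instead of a shared constant, so the optimizer only has to refine the value rather than find its scale.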
Early layers calibrating before later layers have stabilized
In a deep network, layer 1's optimal λ depends on what layer 2 onward does with its output, and vice versa. If λ fails to move specifically in layers 1-2 while later layers calibrate normally, check whether your calibration schedule trains all layers jointly from step one or uses a staged or layer-wise warmup. Joint training from a poor initialization can leave early layers stuck in a local optimum that a staged calibration avoids.
How to confirm you actually have this bug
Add a hook or manually print param.grad.norm() for lambda parameters vs. weight parameters during the first 5-10 steps. If lambda's norm is consistently 10x+ smaller, this is your root cause.
A healthy calibration shows each layer's lambda moving away from its initial value within the first epoch and converging to a stable, layer-specific value. A flat line at the initial value across all epochs confirms the bug.
After moving lambda into its own parameter group with a higher learning rate, rerun calibration. If lambda starts moving and per-layer accuracy improves, you've confirmed the diagnosis. If lambda moves but accuracy doesn't improve, look at the surrogate gradient approximation through the floor operation instead.
If every layer initializes lambda to the same constant despite very different activation ranges across layers, fix initialization alongside the learning rate, not instead of it.
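The gradient-norm comparison from the first check in this section could look roughly like the helper below. It is a sketch, not a library API: the name `group_grad_norms` is hypothetical, and it assumes threshold parameters have "lambda" in their names. It only relies on each parameter exposing `.grad` with a `.norm()` method, which is how PyTorch parameters behave.

```python
import math

def group_grad_norms(named_params, marker="lambda"):
    """Return (lambda_norm, weight_norm): global L2 gradient norm per group.

    `named_params` is any iterable of (name, param) pairs where `param.grad`
    is None or exposes `.norm()`, e.g. PyTorch's `model.named_parameters()`.
    """
    squared = {"lambda": 0.0, "weight": 0.0}
    for name, param in named_params:
        if param.grad is None:
            continue  # this parameter received no gradient on this step
        key = "lambda" if marker in name else "weight"
        squared[key] += float(param.grad.norm()) ** 2
    return math.sqrt(squared["lambda"]), math.sqrt(squared["weight"])

# Intended use inside a training loop, after loss.backward() and before
# optimizer.step() (model, loss, and step come from your own code):
#   lam_norm, w_norm = group_grad_norms(model.named_parameters())
#   print(f"step {step}: lambda={lam_norm:.3e} weights={w_norm:.3e}")
```

Logging both numbers for the first 5-10 steps is enough to see whether the lambda group is consistently an order of magnitude or more below the weight group.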
What a properly calibrating QCFS run looks like
| Layer | Lambda at init | Lambda after calibration | Moved |
|---|---|---|---|
| conv1 | 1.00 | 3.42 | Yes |
| conv2 | 1.00 | 2.18 | Yes |
| conv3 (frozen-bug case) | 1.00 | 1.003 | No - bug present |
Lambda values that converge to layer-specific numbers reflecting that layer's actual activation scale are the signature of correct calibration. A value that ends within rounding distance of its initialization, especially in early layers, is the signature of this bug - not evidence that learnable QCFS thresholds don't help.
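The same check can be automated by comparing each layer's lambda before and after calibration. This is a minimal sketch: the function name `stuck_layers` and the 1% relative tolerance are assumptions chosen for illustration, and the numbers are the ones from the table above.

```python
def stuck_layers(init, final, rel_tol=0.01):
    """Return layer names whose lambda moved by less than rel_tol of its initial value."""
    return [
        name for name in init
        if abs(final[name] - init[name]) < rel_tol * abs(init[name])
    ]

# Values taken from the table above.
init = {"conv1": 1.00, "conv2": 1.00, "conv3": 1.00}
final = {"conv1": 3.42, "conv2": 2.18, "conv3": 1.003}
print(stuck_layers(init, final))  # ['conv3']
```

A non-empty result after a full calibration run is a prompt to inspect that layer's gradient norms and learning rate, not proof on its own that the threshold is wrong.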
Sources & further reading
- Bu et al., "Optimal ANN-SNN Conversion for High-accuracy and Ultra-low-latency Spiking Neural Networks" (QCFS), ICLR 2022
- NeuroCUDA conversion pipeline implementation notes, github.com/Krishnav1/neurocuda
Frequently asked questions
Why is my QCFS lambda threshold stuck at 1.0?
The most common cause is that the lambda (clipping threshold) parameter is being trained with the same learning rate as the rest of the network's weights, but its gradient magnitude is typically one to two orders of magnitude smaller because it only receives gradient signal through the floor/clipping operation's boundary cases. With a shared learning rate tuned for weight updates, lambda's effective update step is too small to move it away from its initialization within a normal number of epochs.
What is QCFS in spiking neural networks?
QCFS (Quantization Clip-Floor-Shift) is a calibration technique for converting a trained ANN into a spiking neural network. It replaces ReLU activations with a quantized clip-floor-shift function controlled by a learnable threshold parameter (commonly called lambda), which determines the firing threshold each converted spiking neuron should use to best approximate its original ReLU's output range.
How do you fix gradient flow problems in QCFS calibration?
Group lambda parameters separately from weight parameters in the optimizer and assign them a distinct, typically higher, learning rate. Initialize lambda from the ANN's actual per-layer activation statistics rather than a constant default, and verify with gradient-norm logging that lambda is receiving a non-trivial gradient signal during the first several calibration steps before training proceeds further.