MLow comfort noise
Encodings - mlow-noise
ENC-12 - status: review - audio
Per-subframe shaped comfort noise the MLow decoder adds into the LPC excitation before CELP synthesis, with separate voiced/unvoiced branches.
Run the comfort-noise generator once per subframe of length L (L <= 160,
fs = 16 kHz) and add its output into the LPC excitation before CELP synthesis
(see mlow-synthesis). Output MUST be bit-identical for identical
inputs and state. All arithmetic is single-precision; accumulation order is fixed
(4-wide envelope and pulse loops). pi is the literal 3.1415926535897. sigmoid
clamps its argument to [-80, 80].
Persistent state. Keep across subframes, zero-initialised at decoder start:
env_smth : f32 ; running envelope smoother
env_last : f32 ; last subframe envelope value
out_state_uv[2] : f32 ; ARMA1 output filter state (unvoiced)
out_state_v[2] : f32 ; MA2 output filter state (voiced)
corr_smth[3] : f32 ; smoothed autocorrelation (voiced)
shape_state[2] : f32 ; MA2 shaping-filter state (voiced)
prev_voiced : bool ; voicing of the previous subframe
since_unvoiced : i32 ; subframes since the last unvoiced subframe
rand_seed : i32 ; PRNG seed
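The state list above maps naturally onto a plain struct. The following is a minimal sketch only: the type name ComfortNoiseState and the constructor are hypothetical, and it assumes that "zero-initialised" means false for prev_voiced and 0 for rand_seed.

```rust
/// Persistent comfort-noise state, one instance per decoder.
/// Field names follow the spec; the struct name is illustrative.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ComfortNoiseState {
    pub env_smth: f32,          // running envelope smoother
    pub env_last: f32,          // last subframe envelope value
    pub out_state_uv: [f32; 2], // ARMA1 output filter state (unvoiced)
    pub out_state_v: [f32; 2],  // MA2 output filter state (voiced)
    pub corr_smth: [f32; 3],    // smoothed autocorrelation (voiced)
    pub shape_state: [f32; 2],  // MA2 shaping-filter state (voiced)
    pub prev_voiced: bool,      // voicing of the previous subframe
    pub since_unvoiced: i32,    // subframes since the last unvoiced subframe
    pub rand_seed: i32,         // PRNG seed
}

impl ComfortNoiseState {
    /// State at decoder start: every numeric field zero, the flag false
    /// (assumed reading of "zero-initialised").
    pub fn new() -> Self {
        Self::default()
    }
}
```

Deriving Default keeps the zero-initialisation in one place, so a new field cannot be left uninitialised by accident.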
PRNG. LCG, wrapping i32 arithmetic:
seed' = 907633515 + (u32)seed * 196314165
Pulses are drawn four at a time: each new seed yields four samples by
reinterpreting the seed left-shifted by 0, 8, 16, 24 bits as a signed i32 scaled by
8.1e-10. A trailing tail of fewer than four samples MUST draw one fresh seed per
sample (no bit-rotation). The shared seed MUST advance for every pulse buffer drawn
in a subframe (call count and ordering are observable in the output seed).
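The PRNG rules above can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the function names lcg_next and draw_pulses are hypothetical, the seed is assumed to advance before each group of four samples is produced, and tail samples are assumed to use the fresh seed unshifted.

```rust
/// Scale applied to the reinterpreted i32 value.
const PULSE_SCALE: f32 = 8.1e-10;

/// One LCG step in wrapping 32-bit arithmetic:
/// seed' = 907633515 + (u32)seed * 196314165.
pub fn lcg_next(seed: i32) -> i32 {
    (seed as u32)
        .wrapping_mul(196_314_165)
        .wrapping_add(907_633_515) as i32
}

/// Fill `out` with white pulses, advancing the shared seed in place.
pub fn draw_pulses(seed: &mut i32, out: &mut [f32]) {
    let mut chunks = out.chunks_exact_mut(4);
    for chunk in &mut chunks {
        // Assumption: advance first, then derive four samples from the new seed.
        *seed = lcg_next(*seed);
        for (k, sample) in chunk.iter_mut().enumerate() {
            // Left shift by 0, 8, 16, 24 bits, then reinterpret as signed.
            let shifted = ((*seed as u32) << (8 * k as u32)) as i32;
            *sample = shifted as f32 * PULSE_SCALE;
        }
    }
    // Tail of fewer than four samples: one fresh seed per sample, unshifted.
    for sample in chunks.into_remainder() {
        *seed = lcg_next(*seed);
        *sample = *seed as f32 * PULSE_SCALE;
    }
}
```

Passing the seed by mutable reference makes the "shared seed MUST advance" rule hard to violate: every buffer drawn in a subframe leaves its mark on the caller's seed.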
Voiced branch (voiced = true):
- Compute 3-tap autocorrelation corrs[0..2] of exc_lpc; bias corrs[0] += 1e-12.
- Smooth into corr_smth with coef 0.4 if L == 160 else 0.16: corr_smth[i] += coef * (corrs[i] - corr_smth[i]).
- c = corr_smth * (0.35^2 * corrs[0] / corr_smth[0]); then double c[1] and c[2].
- Map c through the noise DCT (3 x 16 cosine matrix, scale 1/sqrt(16), omega step (0.5 + i) * pi / 16) to a 16-bin spectrum f2; set f2_tgt = max(f2) * 1.5 - f2; map back via DCT transpose to ctgt.
- Draw white pulses; reset env_smth to env_last if previous subframe was unvoiced; apply squared-signal envelope (smoothing coef 0.95); normalise ctgt by 1/(noise energy + 1e-12).
- Spectrally factor ctgt into a 3-tap MA filter; filter the noise through it (shape_state).
- On a voiced subframe immediately following an unvoiced one, seed an unvoiced cross-fade noise with a decaying 0.99 envelope; else zero it when since_unvoiced < 2.
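The first four voiced steps (autocorrelation, smoothing, scaling, and the DCT round trip) can be sketched as below. Several details are assumptions, not taken from the spec: the function names, the autocorrelation using only samples inside the subframe, and the DCT entry being cos(k * (0.5 + i) * pi / 16) / sqrt(16) for k in 0..3 and i in 0..16. The loops also accumulate in plain sequential order rather than the fixed 4-wide order, so the sketch is not bit-exact.

```rust
/// pi as the literal given in the spec.
const PI: f32 = 3.1415926535897;

/// 3-tap autocorrelation of one subframe, with the 1e-12 bias on lag 0.
pub fn autocorr3(exc_lpc: &[f32]) -> [f32; 3] {
    let mut corrs = [0.0f32; 3];
    for lag in 0..3 {
        for i in lag..exc_lpc.len() {
            corrs[lag] += exc_lpc[i] * exc_lpc[i - lag];
        }
    }
    corrs[0] += 1e-12;
    corrs
}

/// Smooth `corrs` into `corr_smth`, then return the scaled vector `c`
/// with c[1] and c[2] doubled.
pub fn smooth_and_scale(corr_smth: &mut [f32; 3], corrs: &[f32; 3], l: usize) -> [f32; 3] {
    let coef = if l == 160 { 0.4 } else { 0.16 };
    for i in 0..3 {
        corr_smth[i] += coef * (corrs[i] - corr_smth[i]);
    }
    let scale = 0.35f32 * 0.35 * corrs[0] / corr_smth[0];
    [
        corr_smth[0] * scale,
        2.0 * corr_smth[1] * scale,
        2.0 * corr_smth[2] * scale,
    ]
}

/// Assumed DCT entry: cos(k * (0.5 + i) * pi / 16) * 1/sqrt(16).
fn dct_entry(k: usize, i: usize) -> f32 {
    (k as f32 * (0.5 + i as f32) * PI / 16.0).cos() * 0.25
}

/// Map `c` to the 16-bin spectrum f2, form f2_tgt = max(f2) * 1.5 - f2,
/// and map back through the transpose to `ctgt`.
pub fn target_correlation(c: &[f32; 3]) -> [f32; 3] {
    let mut f2 = [0.0f32; 16];
    for i in 0..16 {
        for k in 0..3 {
            f2[i] += c[k] * dct_entry(k, i);
        }
    }
    let peak = f2.iter().cloned().fold(f32::MIN, f32::max);
    let mut ctgt = [0.0f32; 3];
    for k in 0..3 {
        for i in 0..16 {
            ctgt[k] += (peak * 1.5 - f2[i]) * dct_entry(k, i);
        }
    }
    ctgt
}
```

The round trip inverts the spectral shape: bins where the excitation is strong receive little noise, and the quietest bins receive the most.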
Unvoiced branch (voiced = false). Clear corr_smth and shape_state, zero the
voiced noise, then:
- If num_pulses > 0: nrg_ratio = energy(exc_lpc) / (nrgres + 1e-20); hardness = 10 + 20 * normalized_bitrate; nrg_tgt = nrgres * ln(exp(hardness * (1 - nrg_ratio)) + 1) / hardness; take the excitation envelope with smoothing coef 0.995.
- Else: nrg_ratio = 0, nrg_tgt = nrgres; take the decaying no-excitation envelope env0 (coef 0.995).
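The energy target in the two cases above is a softplus of the remaining energy budget. A minimal sketch, assuming energy(exc_lpc) means the sum of squares over the subframe and using plain sequential accumulation (so not bit-exact); the function names are hypothetical:

```rust
/// Sum of squares of the subframe excitation (assumed meaning of `energy`).
fn energy(x: &[f32]) -> f32 {
    x.iter().map(|v| v * v).sum()
}

/// Returns (nrg_ratio, nrg_tgt) for one unvoiced subframe.
pub fn unvoiced_energy_target(
    exc_lpc: &[f32],
    num_pulses: u32,
    nrgres: f32,
    normalized_bitrate: f32,
) -> (f32, f32) {
    if num_pulses == 0 {
        // No coded pulses: the noise carries the full residual energy.
        return (0.0, nrgres);
    }
    let nrg_ratio = energy(exc_lpc) / (nrgres + 1e-20);
    let hardness = 10.0 + 20.0 * normalized_bitrate;
    // Softplus: close to (1 - nrg_ratio) when that is positive, near 0 otherwise.
    let soft = ((hardness * (1.0 - nrg_ratio)).exp() + 1.0).ln() / hardness;
    (nrg_ratio, nrgres * soft)
}
```

A larger hardness (higher bitrate) makes the softplus hug max(0, 1 - nrg_ratio) more tightly, so less noise is added once the coded pulses already account for the residual energy.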
Solve for the affine envelope gain (f, g) so that the generated noise hits nrg_tgt while
matching env_last at the subframe boundary (a quadratic, with degenerate f = 0 and g = 0
branches, per the energy-matching solver). Draw white pulses; scale by
f + g * env[i]. When num_pulses > 0 the scaled value MUST be clamped to
fcbgains_uv[fcbg_idx] * 0.5 and applied only where exc_lpc[i] == 0 (pulse
positions left at zero); env_last MUST be updated to the clamped final gain.
Unvoiced gain table: fcbgains_uv[ix] = 10^(0.05 * (ix - 90)), ix in 0..=90.
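The gain table and the clamped application can be sketched as follows. The interpretation here is an assumption: the clamp is read as an upper bound on the per-sample gain f + g * env[i], and "the clamped final gain" is read as the gain at the last sample of the subframe. The function names and the gain_limit parameter are hypothetical, and the (f, g) solver itself is not shown.

```rust
/// fcbgains_uv[ix] = 10^(0.05 * (ix - 90)) for ix in 0..=90.
pub fn fcbgains_uv() -> [f32; 91] {
    let mut table = [0.0f32; 91];
    for (ix, gain) in table.iter_mut().enumerate() {
        *gain = 10f32.powf(0.05 * (ix as f32 - 90.0));
    }
    table
}

/// Scale white pulses by the clamped affine gain, writing noise only where
/// the excitation is zero. Returns the noise and the last clamped gain,
/// which the caller would store in env_last.
/// `pulses` and `env` must be at least as long as `exc_lpc`.
pub fn apply_unvoiced_gain(
    exc_lpc: &[f32],
    pulses: &[f32],
    env: &[f32],
    f: f32,
    g: f32,
    gain_limit: f32, // fcbgains_uv[fcbg_idx] * 0.5 when num_pulses > 0
) -> (Vec<f32>, f32) {
    let mut noise = vec![0.0f32; exc_lpc.len()];
    let mut last_gain = 0.0f32;
    for i in 0..exc_lpc.len() {
        // Assumption: the clamp is an upper bound on the gain.
        last_gain = (f + g * env[i]).min(gain_limit);
        if exc_lpc[i] == 0.0 {
            noise[i] = last_gain * pulses[i];
        }
    }
    (noise, last_gain)
}
```

For example, with fcbg_idx = 90 the table entry is 1.0, so the limit is 0.5 and the noise can never exceed half the codebook gain at any sample.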
Output mixing.
- If prev_voiced || voiced: pass voiced noise through fixed MA2 filter [0.25, -0.496, 0.25] (out_state_v) into the output; else zero the output.
- If since_unvoiced < 2 || !voiced: high-pass shape unvoiced noise via add_noise_uv (ARMA1, corner freq from lsf[0], lsf[1], nrg_ratio, base corner 800 Hz, clamped to 1500 Hz, gain 0.8) and add into output; else reset out_state_uv to [0, 0].
Finally set prev_voiced = voiced; increment since_unvoiced on a voiced subframe,
reset it to 0 on an unvoiced subframe.
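The fixed MA2 output filter and the end-of-subframe bookkeeping can be sketched as below. The state layout is an assumption (direct form, holding the previous two inputs, newest first), and all function names are hypothetical. add_noise_uv is not sketched because its ARMA1 design is not given here.

```rust
/// Fixed MA2 coefficients for the voiced output path.
const MA2_COEFS: [f32; 3] = [0.25, -0.496, 0.25];

/// Direct-form MA2: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2].
/// Assumption: `state` is [x[n-1], x[n-2]] and persists across subframes.
pub fn ma2_filter(input: &[f32], state: &mut [f32; 2], out: &mut [f32]) {
    for (&x, y) in input.iter().zip(out.iter_mut()) {
        *y = MA2_COEFS[0] * x + MA2_COEFS[1] * state[0] + MA2_COEFS[2] * state[1];
        state[1] = state[0];
        state[0] = x;
    }
}

/// Voiced noise reaches the output on, or right after, a voiced subframe.
pub fn voiced_path_active(prev_voiced: bool, voiced: bool) -> bool {
    prev_voiced || voiced
}

/// Unvoiced noise is mixed in on unvoiced subframes and for the first
/// two voiced subframes after one.
pub fn unvoiced_path_active(since_unvoiced: i32, voiced: bool) -> bool {
    since_unvoiced < 2 || !voiced
}

/// End-of-subframe update of the voicing history.
pub fn update_voicing(prev_voiced: &mut bool, since_unvoiced: &mut i32, voiced: bool) {
    *prev_voiced = voiced;
    if voiced {
        *since_unvoiced = since_unvoiced.saturating_add(1);
    } else {
        *since_unvoiced = 0;
    }
}
```

Feeding a unit impulse through ma2_filter returns the coefficients themselves, which is a quick check that the state carries over correctly between calls.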
Inputs. Decode residual energy from the per-subframe Q14 value:
nrgres = max(0, 10^(0.1 * q14 / 2^14) - 3.1622776e-9) * fcb_subfrlen.
normalized_bitrate = sigmoid(1.4 * log2(pulses_per_20ms + 1) - 6.5) with
pulses_per_20ms = num_pulses * frame_length_16 / 320.
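The input formulas above can be sketched as follows. Assumptions: q14 arrives as an i32, pulses_per_20ms is computed in floating point rather than integer arithmetic, and the function names are hypothetical. Bit-exactness also depends on the powf, exp and log2 implementations, which this sketch does not pin down.

```rust
/// Logistic sigmoid with its argument clamped to [-80, 80].
pub fn sigmoid(x: f32) -> f32 {
    let x = x.clamp(-80.0, 80.0);
    1.0 / (1.0 + (-x).exp())
}

/// nrgres = max(0, 10^(0.1 * q14 / 2^14) - 3.1622776e-9) * fcb_subfrlen.
pub fn decode_nrgres(q14: i32, fcb_subfrlen: usize) -> f32 {
    let linear = 10f32.powf(0.1 * q14 as f32 / 16384.0);
    (linear - 3.1622776e-9).max(0.0) * fcb_subfrlen as f32
}

/// normalized_bitrate = sigmoid(1.4 * log2(pulses_per_20ms + 1) - 6.5).
pub fn normalized_bitrate(num_pulses: u32, frame_length_16: u32) -> f32 {
    // Assumption: floating-point division, not integer division.
    let pulses_per_20ms = num_pulses as f32 * frame_length_16 as f32 / 320.0;
    sigmoid(1.4 * (pulses_per_20ms + 1.0).log2() - 6.5)
}
```

The clamp keeps exp within single-precision range: exp(80) is about 5.5e34, below the f32 maximum, so the sigmoid never produces an infinity or NaN.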
Parent: mlow
Requires: mlow, mlow-frame, mlow-excitation, mlow-lsf-lpc, mlow-synthesis, mlow-vad
Breakdown: mlow-decoder, mlow-synthesis
Implemented by
| Flavor | Status | Source | Notes |
|---|---|---|---|
| whatsapp-rust | working | history - blame - commits 674e851 | - |
| meowcaller | partial | history - blame - commits 8cb06a9 42ea850 | encodings DSP modules are KAT-verified piecewise; full decoder orchestration in progress |
Annotation wacrg:ENC-12 — a flavor marks its implementation site in source with this comment; a script clones the source, finds it, and attaches the commit blame/permalink.
Contributors
| Contributor | Role |
|---|---|
| | wrote initial spec |
Open questions
- The exact bit layout and subframe schedule by which num_pulses, nrgres (Q14), fcbg_idx and the LSFs are delivered from the frame to this stage are specified in mlow-frame / mlow-excitation, not here.
References
- RFC 6716: Opus (CELT range coder reused by MLow)
Changelog
- 2026-06-21 — Initial spec entry.