Skip to content

MLow comfort noise

Encodings - mlow-noise

ENC-12 - status: review - audio

Per-subframe shaped comfort noise the MLow decoder adds into the LPC excitation before CELP synthesis, with separate voiced/unvoiced branches.

Run the comfort-noise generator once per subframe of length L (L <= 160, fs = 16 kHz) and add its output into the LPC excitation before CELP synthesis (see mlow-synthesis). Output MUST be bit-identical for identical inputs and state. All arithmetic is single-precision; accumulation order is fixed (4-wide envelope and pulse loops). pi is the literal 3.1415926535897. sigmoid clamps its argument to [-80, 80].

Persistent state. Keep across subframes, zero-initialised at decoder start:

env_smth        : f32        ; running envelope smoother
env_last        : f32        ; last subframe envelope value
out_state_uv[2] : f32        ; ARMA1 output filter state (unvoiced)
out_state_v[2]  : f32        ; MA2 output filter state (voiced)
corr_smth[3]    : f32        ; smoothed autocorrelation (voiced)
shape_state[2]  : f32        ; MA2 shaping-filter state (voiced)
prev_voiced     : bool       ; voicing of the previous subframe
since_unvoiced  : i32        ; subframes since the last unvoiced subframe
rand_seed       : i32        ; PRNG seed

PRNG. LCG, wrapping i32 arithmetic:

seed' = 907633515 + (u32)seed * 196314165

Pulses are drawn four at a time: each new seed yields four samples by reinterpreting the seed left-shifted by 0, 8, 16, 24 bits as a signed i32 scaled by 8.1e-10. A trailing tail of fewer than four samples MUST draw one fresh seed per sample (no bit-rotation). The shared seed MUST advance for every pulse buffer drawn in a subframe (call count and ordering is observable in the output seed).

Voiced branch (voiced = true):

  1. Compute 3-tap autocorrelation corrs[0..2] of exc_lpc; bias corrs[0] += 1e-12.
  2. Smooth into corr_smth with coef 0.4 if L == 160 else 0.16: corr_smth[i] += coef * (corrs[i] - corr_smth[i]).
  3. c = corr_smth * (0.35^2 * corrs[0] / corr_smth[0]); then double c[1] and c[2].
  4. Map c through the noise DCT (3 x 16 cosine matrix, scale 1/sqrt(16), omega step (0.5 + i) * pi / 16) to a 16-bin spectrum f2; set f2_tgt = max(f2) * 1.5 - f2; map back via DCT transpose to ctgt.
  5. Draw white pulses; reset env_smth to env_last if previous subframe was unvoiced; apply squared-signal envelope (smoothing coef 0.95); normalise ctgt by 1/(noise energy + 1e-12).
  6. Spectrally factor ctgt into a 3-tap MA filter; filter the noise through it (shape_state).
  7. On a voiced subframe immediately following an unvoiced one, seed an unvoiced cross-fade noise with a decaying 0.99 envelope; else zero it when since_unvoiced < 2.

Unvoiced branch (voiced = false). Clear corr_smth and shape_state, zero the voiced noise, then:

  • If num_pulses > 0: nrg_ratio = energy(exc_lpc) / (nrgres + 1e-20); hardness = 10 + 20 * normalized_bitrate; nrg_tgt = nrgres * ln(exp(hardness * (1 - nrg_ratio)) + 1) / hardness; take the excitation envelope with smoothing coef 0.995.
  • Else: nrg_ratio = 0, nrg_tgt = nrgres; take the decaying no-excitation envelope env0 (coef 0.995).

Solve for affine envelope gain (f, g) so generated noise hits nrg_tgt while matching env_last at the subframe boundary (quadratic with f = 0 / g = 0 degenerate branches per the energy-matching solver). Draw white pulses; scale by f + g * env[i]. When num_pulses > 0 the scaled value MUST be clamped to fcbgains_uv[fcbg_idx] * 0.5 and applied only where exc_lpc[i] == 0 (pulse positions left at zero); env_last MUST be updated to the clamped final gain. Unvoiced gain table: fcbgains_uv[ix] = 10^(0.05 * (ix - 90)), ix in 0..=90.

Output mixing.

  • If prev_voiced || voiced: pass voiced noise through fixed MA2 filter [0.25, -0.496, 0.25] (out_state_v) into the output; else zero the output.
  • If since_unvoiced < 2 || !voiced: high-pass shape unvoiced noise via add_noise_uv (ARMA1, corner freq from lsf[0], lsf[1], nrg_ratio, base corner 800 Hz, clamped to 1500 Hz, gain 0.8) and add into output; else reset out_state_uv to [0, 0].

Finally set prev_voiced = voiced; increment since_unvoiced on a voiced subframe, reset it to 0 on an unvoiced subframe.

Inputs. Decode residual energy from the per-subframe Q14 value: nrgres = max(0, 10^(0.1 * q14 / 2^14) - 3.1622776e-9) * fcb_subfrlen. normalized_bitrate = sigmoid(1.4 * log2(pulses_per_20ms + 1) - 6.5) with pulses_per_20ms = num_pulses * frame_length_16 / 320.

Parent: mlow
Requires: mlow, mlow-frame, mlow-excitation, mlow-lsf-lpc, mlow-synthesis, mlow-vad
Breakdown: mlow-decoder, mlow-synthesis

Implemented by

Flavor Status Source Notes
whatsapp-rust working history - blame - commits 674e851
meowcaller partial history - blame - commits 8cb06a9 42ea850 encodings DSP modules are KAT-verified piecewise; full decoder orchestration in progress

Annotation wacrg:ENC-12 — a flavor marks its implementation site in source with this comment; a script clones the source, finds it, and attaches the commit blame/permalink.

Contributors

Contributor Role
Rajeh Taher Rajeh Taher wrote initial spec

protocol history / diff - blame

Open questions - Exact bit layout and subframe schedule by which num_pulses, nrgres (Q14), fcbg_idx and the LSFs are delivered from the frame to this stage is specified in mlow-frame / mlow-excitation, not here.

References - RFC 6716 — Opus (CELT range coder reused by MLow)

Changelog

  • 2026-06-21 — Initial spec entry.

Back to the full spec