MLow post-filters

Encodings - mlow-postfilter

ENC-13 - status: draft - audio

The three deterministic decoder-side DSP post-filters MLow runs after range decoding: excitation harmonic comb, HP pitch comb, and harmonic post-filter.

Three enhancement filters run during decode. They consume no bits and have no wire format. A decoder MUST run them in the order and with the constants below. All math is single-precision (f32) with fast (non-strict-IEEE) arithmetic; a strict-IEEE decoder reproduces output to within the i16 quantization step (1/32768), not bit-for-bit.

1. Excitation harmonic comb post-filter

Applied per subframe to the low-band excitation BEFORE LPC synthesis; output is ADDED back into the excitation. Derives a 2nd-order pitch-resonant filter from the excitation's own 3-lag autocorrelation (NOT the pitch lag). Subframe length N is 80 or 160.

Active-subframe path:

  1. Compute 3-lag autocorrelation auto[0..2] of input, then auto[0] += 9.999999960041972e-13.
  2. Smooth into persistent smoothed_c[i] with coef = 0.4 when N == 160 else coef = 0.16: smoothed_c[i] = coef*(auto[i] - smoothed_c[i]) + smoothed_c[i].
  3. local5 = auto[0] * 0.1224999949336052 / smoothed_c[0]; scaled vector {local5*smoothed_c[0], 2*local5*smoothed_c[1], 2*local5*smoothed_c[2]} (lags 1,2 doubled).
  4. Project through fixed G_PITCH 3x16 basis: proj[j] = sum_r scaled[r]*G_PITCH[r][j], peak = max_j proj[j], scale = 1.5*peak, refl[i] = scale - proj[i], comb_c[r] = sum_i G_PITCH[r][i]*refl[i].
  5. Fill N samples of LCG noise (below), seed pitch_gain from env_state on first call, RMS-envelope-smooth input with coef 0.95, multiply noise by envelope.
  6. local5 /= (sum(noise^2) + 9.999999960041972e-13), then comb_c[i] *= local5.
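
Steps 1 to 3 of the active-subframe path can be sketched as follows. This is a minimal illustration, not the reference implementation: it runs in double precision rather than f32, the function names autocorr3 and smooth_and_scale are hypothetical, and the autocorrelation window (an unnormalized sum over the subframe with no history samples) is an assumption the text does not confirm.

```python
EPS = 9.999999960041972e-13      # regularizer added to auto[0] in step 1
ENERGY_FAC = 0.1224999949336052  # constant from step 3


def autocorr3(x):
    """Lags 0..2 of an unnormalized autocorrelation (window is an assumption)."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(3)]


def smooth_and_scale(x, smoothed_c):
    """Update the persistent smoothed_c in place; return (scaled, local5)."""
    auto = autocorr3(x)
    auto[0] += EPS
    coef = 0.4 if len(x) == 160 else 0.16
    for i in range(3):
        # One-pole smoother, written exactly as in step 2.
        smoothed_c[i] = coef * (auto[i] - smoothed_c[i]) + smoothed_c[i]
    local5 = auto[0] * ENERGY_FAC / smoothed_c[0]
    # Lags 1 and 2 are doubled, as in step 3.
    scaled = [local5 * smoothed_c[0],
              2.0 * local5 * smoothed_c[1],
              2.0 * local5 * smoothed_c[2]]
    return scaled, local5
```

Because local5 divides by smoothed_c[0], the first element of the scaled vector always equals auto[0] * 0.1224999949336052 regardless of the smoothing state; only the lag-1 and lag-2 entries depend on the smoothed history.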

Inactive-subframe path resets smoothed_c and LCG state and derives a single scalar gain from the band-energy ratio; it builds no comb coefficients.

Resonator: if comb_c[0] >= 0, add 1.0000000031710769e-30, run the 2-iteration Levinson-style solve (returns r5, r8, denom). On success g = sqrt(comb_c[0]/denom) and resonator FIR {g, r8*g, r5*g}; on failure {sqrt(comb_c[0]), 0, 0}; if comb_c[0] < 0 resonator is zeroed. Run the 3-tap resonator FIR over env-shaped noise, then static de-emphasis FIR {0.25, -0.49599999, 0.25} to produce the additive output. A trailing AR1/MA1 biquad (corner from band energy, sigmoid trailing-pole g5 = sigmoid(0.2*(nrgEnv[1]-nrgEnv[0]+1e-30) - 3)) is added UNLESS the subframe is active with call_count > 1.

LCG noise fill: s = 196314165*s + 907633515 (wrapping i32), output s * 8.100000115085493e-10, emitting byte-shifted views s<<8, s<<16, s<<24 in the 4-wide block; state persists across calls.
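
The scalar recurrence of the noise generator can be sketched as below. This is a hedged illustration only: it emits one sample per LCG step and does not reproduce the 4-wide block layout with the byte-shifted views s<<8, s<<16, s<<24, so its output sequence is not expected to match a real decoder sample-for-sample. The helper names wrap_i32, lcg_next and lcg_noise are hypothetical.

```python
LCG_MUL = 196314165
LCG_ADD = 907633515
NOISE_SCALE = 8.100000115085493e-10


def wrap_i32(v):
    """Reduce a Python int to a signed 32-bit value (two's complement wrap)."""
    v &= 0xFFFFFFFF
    return v - 0x100000000 if v >= 0x80000000 else v


def lcg_next(s):
    """One step of s = 196314165*s + 907633515 with i32 wrapping."""
    return wrap_i32(LCG_MUL * s + LCG_ADD)


def lcg_noise(s, n):
    """Return (n scaled noise samples, final state); the state persists across calls."""
    out = []
    for _ in range(n):
        s = lcg_next(s)
        out.append(s * NOISE_SCALE)
    return out, s
```

Since the state is a signed 32-bit integer, every sample produced this way has magnitude at most 2^31 * 8.100000115085493e-10, roughly 1.74.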

G_PITCH rows (16 columns each):

row0: 0.25 x16
row1: 0.24879618 0.23923509 0.22048031 0.19325261 0.15859832 0.11784916
      0.07257116 0.02450428 -0.02450431 -0.07257118 -0.11784921 -0.15859832
      -0.19325262 -0.22048034 -0.23923509 -0.24879618
row2: 0.24519631 0.20786740 0.13889255 0.04877256 -0.04877258 -0.13889259
      -0.20786741 -0.24519633 -0.24519631 -0.20786738 -0.13889250 -0.04877260
      0.04877260 0.13889261 0.20786740 0.24519633
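
Step 4 of the active-subframe path (projection through G_PITCH and back) can be sketched with the rows above. This is an illustrative double-precision sketch with a hypothetical function name comb_coefs; a conforming decoder performs the same arithmetic in f32.

```python
G_PITCH = [
    [0.25] * 16,
    [0.24879618, 0.23923509, 0.22048031, 0.19325261,
     0.15859832, 0.11784916, 0.07257116, 0.02450428,
     -0.02450431, -0.07257118, -0.11784921, -0.15859832,
     -0.19325262, -0.22048034, -0.23923509, -0.24879618],
    [0.24519631, 0.20786740, 0.13889255, 0.04877256,
     -0.04877258, -0.13889259, -0.20786741, -0.24519633,
     -0.24519631, -0.20786738, -0.13889250, -0.04877260,
     0.04877260, 0.13889261, 0.20786740, 0.24519633],
]


def comb_coefs(scaled):
    """Map the 3-element scaled vector to comb_c[0..2] as described in step 4."""
    # Forward projection onto the 16 basis columns.
    proj = [sum(scaled[r] * G_PITCH[r][j] for r in range(3)) for j in range(16)]
    peak = max(proj)
    scale = 1.5 * peak
    # Reflect each projected value about the scaled peak.
    refl = [scale - p for p in proj]
    # Back-projection onto the three basis rows.
    return [sum(G_PITCH[r][i] * refl[i] for i in range(16)) for r in range(3)]
```

For example, comb_coefs([1.0, 0.0, 0.0]) yields comb_c[0] = 0.5 with the other two entries near zero, because rows 1 and 2 sum to approximately zero. As an observation rather than a normative statement, the tabulated rows agree to within 1e-6 with 0.25*cos(r*(2j+1)*pi/32), which suggests a DCT-style basis; the tabulated values remain the ones to use.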

2. HP (pitch) post-filter

Applied to one frame (FRAME_LEN = 320) of post-LPC-synthesis output. Chain:

de-emphasis (AR1 leaky integrator, coef {1, -0.995})
  -> ARMA2 comb (MA2 numerator, AR2 denominator)
  -> companion pre-emphasis (MA1 differentiator, coef {1, -0.995})

The comb keys on the frame-average pitch lag, lag = sum(l^2)/sum(l), taken over the subframe lags l (0 -> unvoiced). The ARMA2 biquad is built by smpl_calc_hp_coefs with f = 1/lag (voiced) or f = 50/16000 (50 Hz corner) when lag <= 0:

cos_approx(x) = 1 - 0.5*x^2
coef_ma = { 1, -2*cos_approx(2*pi*maf*f), 1 }
far = arf[0]*f + arf[1]*f^2
rar = arr[0]*f + arr[1]*f^2
coef_ar = { 1, -2*cos_approx(2*pi*far)*(1+rar), 1 + (2*rar + rar^2) }
sc = (1 - coef_ar[1] + coef_ar[2]) / (1 - coef_ma[1] + coef_ma[2])
coef_ma *= sc                                   ; unity gain at z = -1 (Nyquist)

AR denominator is a resonance at angle 2*pi*far, radius 1+rar (rar negative for a stable pole). Voiced curve: maf = 0.1, arf = {0.608057355, 0.070939485}, arr = {-2.187380512, 2.291030664}. Default curve: maf = 0.1, arf = {0.728508218, 0.476039848}, arr = {-4.363803713, 8.441854006}.
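
The lag averaging and coefficient construction above can be sketched as follows. This is a double-precision illustration under stated assumptions, not the decoder's code: the names avg_lag and calc_hp_coefs are hypothetical stand-ins (the latter for smpl_calc_hp_coefs), and returning 0 when every subframe lag is 0 is an assumed way of handling the 0/0 case.

```python
import math

VOICED = {"maf": 0.1,
          "arf": (0.608057355, 0.070939485),
          "arr": (-2.187380512, 2.291030664)}
DEFAULT = {"maf": 0.1,
           "arf": (0.728508218, 0.476039848),
           "arr": (-4.363803713, 8.441854006)}


def avg_lag(lags):
    """Frame-average lag sum(l^2)/sum(l); 0 if all subframes are unvoiced (assumed)."""
    total = sum(lags)
    return sum(l * l for l in lags) / total if total > 0 else 0.0


def cos_approx(x):
    return 1.0 - 0.5 * x * x


def calc_hp_coefs(lag):
    """Return (coef_ma, coef_ar) for the ARMA2 comb, following the formulas above."""
    if lag > 0:
        f, c = 1.0 / lag, VOICED
    else:
        f, c = 50.0 / 16000.0, DEFAULT
    ma = [1.0, -2.0 * cos_approx(2.0 * math.pi * c["maf"] * f), 1.0]
    far = c["arf"][0] * f + c["arf"][1] * f * f
    rar = c["arr"][0] * f + c["arr"][1] * f * f
    ar = [1.0,
          -2.0 * cos_approx(2.0 * math.pi * far) * (1.0 + rar),
          1.0 + (2.0 * rar + rar * rar)]
    # Normalize the numerator against the denominator, both evaluated at z = -1.
    sc = (1.0 - ar[1] + ar[2]) / (1.0 - ma[1] + ma[2])
    return [m * sc for m in ma], ar
```

With the coefficients read as polynomials in z^-1, coef_ar[2] equals (1+rar)^2, the squared pole radius, so it stays below 1 whenever rar is negative. Evaluating the scaled numerator and the denominator at z = -1 gives a ratio of one, which is the normalization that sc enforces.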

When lag > 1.25*lag_old or 1.25*lag < lag_old, the decoder MUST run OLD and NEW coefficients over the frame and overlap-add with the cos(omega)^2 down-ramp table (HP_POSTF_TRANSITION_SPEED = 2, d_omega = pi/(2*(FRAME_LEN+1)), omega by repeated addition) before the companion pre-emphasis. lag_old < 0 marks a fresh/reset filter.

3. Harmonic post-filter

Final per-packet stage; runs on the full low-band output after the HP filter. Mixes x[-lag] + x[+lag], low-pass filtered by a lag-dependent kernel, and introduces total group delay SMPL_TOT_POSTFILT_DELAY = 48 (8 feedback + 40 lag-subframe). Constants:

FRAME_LEN = 320            LAG_SUBFR_LEN = 40       FB_DELAY = 8
MIN_PITCH_LAG = 32         MAX_PITCH_LAG = 320      MAXPITCH_LEN = 320
FB_STRENGTH = 0.4734       STRENGTH = 0.6438        CUTOFF_HZ = 4000
NHARM_CUTOFF = 6.3         REDUCTION_FAC = 0.0579   LP_FILT_RES = 2500
PITCH_NUM_SUBFRAMES = 8

Operates per 40-sample lag block. Packet is appended to a persistent StateComb buffer at offset MAX_PITCH_LAG + HARM_DELAY; reads index back into history. Per-packet feedback strength fb_strength = 1 - FB_STRENGTH*normalized_bitrate. For a block with lag > 0:

y_harm[i] = comb[x+i-lag] + comb[x+i+lag]        ; (lookforward-clamped at packet edge)
xy = dot(comb[x..], y_harm)
if xy > 0:
  xx = nrg(comb[x..], L);  yy = 0.25*nrg(y_harm, L)
  strength = 0.5*xy / max(yy, xx)
  high_lag_reduction = 1 - REDUCTION_FAC*((lag-MIN_PITCH_LAG)/(MAX_PITCH_LAG-MIN_PITCH_LAG))
  strength *= high_lag_reduction * STRENGTH
  y_harm *= 0.5*strength
  diff = y_harm - strength*comb[x..]
  lpcoefs = lp_filter(lag) * fb_strength            ; 17-tap symmetric kernel
  y_harm = MA17(diff) + comb[x - FB_DELAY ..]       ; 48-delayed base, recursive

When xy <= 0 (or lag <= 0) the block copies the 48-delayed base; if the previous block filtered, the first 2*FB_DELAY samples carry the previous kernel's zero-input response. Per-bucket LP kernel is built by create_lp_filter from a cosine window filt_win[i] = cos(omega)/(i+1) (omega stepping 0.5*pi/(FB_DELAY+1)), cutoff omega_c = min(omega0*NHARM_CUTOFF, CUTOFF_HZ/16000*pi) with omega0 = 2*pi/lag, scaled to unity sum; bucket index LP_FILT_RES/max(lag+30,80) - LP_FILT_RES/MAX_PITCH_LAG. After processing, StateComb is shifted left by the packet length. Block lag for iteration k is round(lags[k]); prev_lag carries across packets.
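
The per-block strength computation from the pseudocode above can be sketched in isolation. This is a hedged illustration: block_strength is a hypothetical name, nrg is assumed to be a plain sum of squares, returning None to signal the bypass case (xy <= 0 or lag <= 0) is a convention of this sketch, and the 17-tap filtering and buffer handling are not modeled.

```python
MIN_PITCH_LAG = 32
MAX_PITCH_LAG = 320
STRENGTH = 0.6438
REDUCTION_FAC = 0.0579


def block_strength(x, y_harm, lag):
    """Final strength for one lag block, or None when the block is bypassed."""
    if lag <= 0:
        return None
    xy = sum(a * b for a, b in zip(x, y_harm))
    if xy <= 0:
        return None
    xx = sum(a * a for a in x)                # nrg assumed to be a sum of squares
    yy = 0.25 * sum(b * b for b in y_harm)
    strength = 0.5 * xy / max(yy, xx)
    # Long lags get slightly less enhancement.
    span = MAX_PITCH_LAG - MIN_PITCH_LAG
    high_lag_reduction = 1.0 - REDUCTION_FAC * ((lag - MIN_PITCH_LAG) / span)
    return strength * high_lag_reduction * STRENGTH
```

For a perfectly periodic block, where y_harm equals 2*x, the normalized correlation is 1 and the result is STRENGTH at lag = 32, falling to STRENGTH * (1 - REDUCTION_FAC) at lag = 320.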

Notes. Recursive accumulation through the near-unit-circle pitch poles can drift up to ~1.5e-5 from strict IEEE (below the i16 LSB). The one larger residual is the first 48 samples of a silence packet following a voiced packet (the comb's zero-input response under the prior frame's coefficients).

Parent: mlow
Requires: mlow, mlow-excitation, mlow-synthesis, mlow-frame
Breakdown: mlow-decoder, mlow-synthesis

Implemented by

Flavor          Status    Source                                      Notes
whatsapp-rust   working   history - blame - commits 674e851           all three filters ported and validated against the live WASM decoder and the C decoder dumps
meowcaller      partial   history - blame - commits 4323881 5b8d5a5   encodings codec modules are partial

Annotation wacrg:ENC-13 — a flavor marks its implementation site in source with this comment; a script clones the source, finds it, and attaches the commit blame/permalink.

Contributors

Contributor   Role
Rajeh Taher   wrote initial spec


Open questions

  • The excitation comb carries a single unresolved 8/7 output scalar; its exact value is not yet pinned down.
  • Whether the excitation-comb subframe length selection (N in {80,160}) is fully determined by frame size or also depends on bandwidth mode.

References

  • MLow: WhatsApp's low-bitrate speech codec (engineering blog)

Changelog

  • 2026-06-21 — Initial spec entry.
