
Media loop and session

Relay - media-loop

REL-07 - status: draft - audio, video

Per-call session state machine plus the outbound (protect) and inbound (unprotect) audio frame pipelines: RTP framing, E2E-SRTP, and WARP tagging.

Session state

A session MUST hold at minimum: call id, peer JID, call-creator JID, direction (outgoing | incoming), and current phase. Phase MUST be one of:

Idle | Calling | Ringing | Connecting | Active | Ended

Outgoing sessions MUST start in Idle; incoming sessions MUST start in Ringing. The idempotent self-transition x → x MUST be accepted. The only other legal transitions are:

Idle       → Calling      ; outgoing only
Calling    → Ringing
Ringing    → Connecting
Connecting → Active
<any phase except Ended> → Ended

All other transitions MUST be rejected as a no-op. Ended MUST be terminal. An incoming session MUST NOT transition to Calling. Media MAY flow only in Active.
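
The transition rules above can be sketched as a small guard function. This is a minimal illustration, not the reference implementation: the `Session` class, its field names, and the boolean return convention for rejected transitions are all assumptions made for the example.

```python
from enum import Enum


class Phase(Enum):
    IDLE = "Idle"
    CALLING = "Calling"
    RINGING = "Ringing"
    CONNECTING = "Connecting"
    ACTIVE = "Active"
    ENDED = "Ended"


# The single legal forward step out of each non-terminal phase.
_FORWARD = {
    Phase.IDLE: Phase.CALLING,
    Phase.CALLING: Phase.RINGING,
    Phase.RINGING: Phase.CONNECTING,
    Phase.CONNECTING: Phase.ACTIVE,
}


class Session:
    """Hypothetical session holder; field names are illustrative."""

    def __init__(self, call_id: str, peer_jid: str, creator_jid: str, outgoing: bool):
        self.call_id = call_id
        self.peer_jid = peer_jid
        self.creator_jid = creator_jid
        self.outgoing = outgoing
        # Outgoing sessions start in Idle, incoming sessions in Ringing.
        self.phase = Phase.IDLE if outgoing else Phase.RINGING

    def can_transition(self, target: Phase) -> bool:
        if target is self.phase:
            return True  # idempotent self-transition, including Ended -> Ended
        if self.phase is Phase.ENDED:
            return False  # Ended is terminal
        if target is Phase.ENDED:
            return True  # any non-Ended phase may end
        if target is Phase.CALLING and not self.outgoing:
            return False  # incoming sessions never enter Calling
        return _FORWARD.get(self.phase) is target

    def transition(self, target: Phase) -> bool:
        """Apply the transition if legal; otherwise leave state untouched (no-op)."""
        if not self.can_transition(target):
            return False
        self.phase = target
        return True

    def media_allowed(self) -> bool:
        return self.phase is Phase.ACTIVE
```

Returning `False` instead of raising keeps a rejected transition a pure no-op, which matches the wording of the rule; a caller that wants diagnostics can log on a `False` result.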

Media pipeline keying

Derive two independent E2E-SRTP key sets from the call key (see srtp-master-key): send keys, whose HKDF info MUST be the sender's own participant id, and recv keys, whose HKDF info MUST be the peer's participant id. Each side's recv keys are therefore identical to the other side's send keys. Both JIDs MUST be normalized with the E2E-SRTP participant-id rule (see ssrc) before derivation. The two directions MUST NOT be inverted.
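
The direction rule can be illustrated with a generic HKDF. This sketch only demonstrates which participant id feeds which direction; the hash (SHA-256), the empty salt, the 32-byte output length, and the helper names are assumptions for illustration. The actual derivation parameters are defined in srtp-master-key, and the ids passed in are assumed to be already normalized.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, info: bytes, length: int, salt: bytes = b"") -> bytes:
    """RFC 5869 HKDF (extract then expand). SHA-256 is an assumption here."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_direction_keys(call_key: bytes, own_id: str, peer_id: str, key_len: int = 32):
    """Return (send_keys, recv_keys) for one side of the call.

    own_id and peer_id are assumed to be normalized participant ids.
    """
    send_keys = hkdf_sha256(call_key, own_id.encode(), key_len)   # info = our id
    recv_keys = hkdf_sha256(call_key, peer_id.encode(), key_len)  # info = peer's id
    return send_keys, recv_keys
```

Under these assumptions, calling `derive_direction_keys(key, "alice", "bob")` on one side and `derive_direction_keys(key, "bob", "alice")` on the other yields mirrored pairs: each side's send keys equal the other side's recv keys. Swapping the two ids on one side only would break decryption in both directions.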

Outbound loop (protect)

For each audio frame to send, in order:

  1. Obtain the next RTP header from the send sequencer. RTP sequence number starts at 1 and increments by 1 per packet; RTP timestamp advances by samples_per_packet (960 for a 60 ms frame at 16 kHz); payload type MUST be 120 (Opus).
  2. Advance the rollover counter (ROC) for the new sequence number.
  3. Encode the RTP header bytes.
  4. E2E-SRTP encrypt the Opus payload under the send keys, using the header SSRC, sequence number and ROC as keystream inputs (see srtp-e2e).
  5. Concatenate rtp_header || encrypted_payload.
  6. Append the 4-byte WARP message-integrity tag, computed over that concatenation under the send auth key with the ROC (see warp).

Send the result as one binary message on the relay media channel (see stun-relay, call-transport).
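
The six outbound steps can be sketched as follows. The E2E-SRTP cipher and the WARP tag are specified elsewhere (srtp-e2e, warp), so they are passed in as callables with hypothetical signatures. The initial timestamp of 0, the cleared marker bit, and the `SendSequencer` and `protect` names are assumptions made for this example.

```python
import struct
from typing import Callable

PAYLOAD_TYPE_OPUS = 120
SAMPLES_PER_PACKET = 960  # 60 ms at 16 kHz


class SendSequencer:
    """Hypothetical send-side RTP state: sequence number, timestamp, ROC."""

    def __init__(self, ssrc: int):
        self.ssrc = ssrc
        self.seq = 0        # last sequence number issued; first packet gets 1
        self.timestamp = 0  # assumption: timestamps start at 0
        self.roc = 0

    def next_header(self) -> tuple[bytes, int, int]:
        # Step 1: next sequence number and timestamp.
        self.seq = (self.seq + 1) & 0xFFFF
        # Step 2: the ROC advances when the 16-bit sequence number wraps to 0.
        if self.seq == 0:
            self.roc += 1
        ts = self.timestamp
        self.timestamp = (ts + SAMPLES_PER_PACKET) & 0xFFFFFFFF
        # Step 3: fixed 12-byte header, V=2, no padding/extension/CSRC, marker clear.
        header = struct.pack("!BBHII", 0x80, PAYLOAD_TYPE_OPUS, self.seq, ts, self.ssrc)
        return header, self.seq, self.roc


def protect(
    sequencer: SendSequencer,
    opus_payload: bytes,
    encrypt: Callable[[bytes, int, int, int], bytes],  # (payload, ssrc, seq, roc)
    mi_tag: Callable[[bytes, int], bytes],             # (header || ciphertext, roc)
) -> bytes:
    header, seq, roc = sequencer.next_header()
    ciphertext = encrypt(opus_payload, sequencer.ssrc, seq, roc)  # step 4
    packet = header + ciphertext                                   # step 5
    tag = mi_tag(packet, roc)                                      # step 6
    if len(tag) != 4:
        raise ValueError("WARP MI tag must be 4 bytes")
    return packet + tag
```

Passing the cipher and tag functions in keeps the ordering of the steps visible without committing to any particular key schedule; the returned bytes are what would be sent as one binary message.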

Inbound loop (unprotect)

For each relay packet classified as RTP (see rtp-framing):

  1. Reject if shorter than 16 bytes (12-byte minimum RTP header + 4-byte WARP MI tag).
  2. Strip the trailing 4-byte WARP MI tag.
  3. Parse the RTP header, compute its byte length; reject if no payload follows.
  4. E2E-SRTP decrypt the remaining payload under the recv keys, using the parsed SSRC and sequence number together with the receiver's ROC for that sequence number.

The decrypted result is the Opus payload for the decoder (see opus).
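
A sketch of the inbound steps, under stated assumptions: the decrypt function is a caller-supplied callable with a hypothetical signature, the ROC is supplied by the caller (ROC recovery is not specified here), and the header length is computed with the standard RFC 3550 layout (CSRC count and optional extension). The tag is stripped without verification, mirroring the reference composition.

```python
import struct
from typing import Callable

RTP_MIN_HEADER = 12
WARP_TAG_LEN = 4


def rtp_header_len(packet: bytes) -> int:
    """Byte length of the RTP header per RFC 3550, including CSRCs and extension."""
    first = packet[0]
    length = RTP_MIN_HEADER + 4 * (first & 0x0F)  # CSRC entries are 4 bytes each
    if first & 0x10:  # extension bit set
        if len(packet) < length + 4:
            raise ValueError("truncated RTP header extension")
        ext_words = struct.unpack_from("!H", packet, length + 2)[0]
        length += 4 + 4 * ext_words
    return length


def unprotect(
    packet: bytes,
    roc: int,
    decrypt: Callable[[bytes, int, int, int], bytes],  # (payload, ssrc, seq, roc)
) -> bytes:
    # Step 1: minimum size is header plus tag.
    if len(packet) < RTP_MIN_HEADER + WARP_TAG_LEN:
        raise ValueError("packet too short")
    # Step 2: strip the trailing tag (not verified in this sketch).
    body = packet[:-WARP_TAG_LEN]
    # Step 3: locate the payload; reject header-only packets.
    header_len = rtp_header_len(body)
    if len(body) <= header_len:
        raise ValueError("no payload after RTP header")
    seq, _timestamp, ssrc = struct.unpack_from("!HII", body, 2)
    # Step 4: decrypt under the recv keys held by the supplied callable.
    return decrypt(body[header_len:], ssrc, seq, roc)
```

Raising `ValueError` for rejected packets is a choice made for the example; a production loop would more likely count and drop them.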

Codec parameters

Audio MUST be Opus, mono, 16 kHz, 60 ms frames (960 samples per packet), VoIP application mode. Reference encode: 25 kbps, complexity 9. Payload type 120 on send; 120 and 121 both recognized as Opus on the inbound path. Priming frames are fixed constant payloads and MUST bypass the encoder (see rtp-framing).
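
The arithmetic behind these parameters, as a short sketch. The constant and function names are illustrative, and the average payload size is only what the reference bitrate implies, not a wire guarantee.

```python
SAMPLE_RATE_HZ = 16_000
FRAME_MS = 60
REFERENCE_BITRATE_BPS = 25_000

# 16 000 samples/s * 0.060 s = 960 samples per packet.
SAMPLES_PER_PACKET = SAMPLE_RATE_HZ * FRAME_MS // 1000

SEND_PAYLOAD_TYPE = 120
RECV_OPUS_PAYLOAD_TYPES = frozenset({120, 121})


def is_opus_payload_type(payload_type: int) -> bool:
    """Inbound check: both 120 and 121 are treated as Opus."""
    return payload_type in RECV_OPUS_PAYLOAD_TYPES


def avg_payload_bytes(bitrate_bps: int = REFERENCE_BITRATE_BPS) -> float:
    """Average encoded bytes per 60 ms frame implied by a given bitrate."""
    return bitrate_bps * FRAME_MS / 1000 / 8
```

At the reference 25 kbps this works out to an average of 187.5 bytes of Opus payload per frame, before the 12-byte RTP header and 4-byte WARP tag.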

Notes. The inbound path strips but does NOT verify the WARP MI tag in the reference composition. The proprietary "mlow" encode (see mlow) is the on-wire WhatsApp codec; standard libopus at the reference parameters is accepted by the peer.

Requires: srtp-master-key, srtp-e2e, warp, rtp-framing, ssrc, opus, call-transport, stun-relay

Implemented by

Flavor          Status    Source            Notes
whatsapp-rust   working   commit 674e851    session state machine and protect/unprotect pipeline composition; live relay flow over the channel is deferred
zapo-caller     working   -                 signalling + crypto + relay loop

Annotation wacrg:REL-07 — a flavor marks its implementation site in source with this comment; a script clones the source, finds it, and attaches the commit blame/permalink.

Contributors

Contributor     Role
Rajeh Taher     wrote initial spec


Open questions

  • Full inbound ROC-recovery (rollover) algorithm for out-of-order and wrapped sequence numbers.
  • Whether the WARP MI tag is verified on receive in the production client, and the failure policy if it is.
  • Send-side pacing / DTX and comfort-noise behavior during the Active phase.
  • Exact retransmission/jitter-buffer handling on the receive path.

References

  • RFC 3711: SRTP
  • RFC 3550: RTP
  • RFC 6716: Opus

Changelog

  • 2026-06-21 — Initial spec entry.
