Architecture
A WhatsApp 1:1 call runs over two distinct planes that are established and operated independently:
- The signaling plane: call control (offer, ring, accept, reject, terminate, transport updates, mute/video state). This rides the same Noise-protocol-encrypted WebSocket that WhatsApp multi-device uses for messaging, carried as WABinary nodes under a top-level <call> tag.
- The media plane: the actual audio/video bytes. This is SRTP over UDP to WhatsApp voip/relay servers, negotiated using transport endpoints that the signaling plane exchanges. It does not travel over the WebSocket.
Keying bridges the two: the media key is delivered on the signaling plane, encrypted to the peer with the existing Signal protocol session, and the SRTP keys for the media plane are derived from it.
Confidence note: the two-plane split and the Noise-WebSocket signaling transport are well-supported by how multi-device messaging works and are treated as probable. The precise SRTP key-derivation steps are speculative and flagged as such throughout.
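To make the signaling-plane encoding more concrete, the sketch below models a WABinary-style node as a tag, an attribute map, and either child nodes or a byte payload, then nests an offer under a top-level <call> node. It is only an illustration: the `Node` class, the `render` and `wrap_in_call` helpers, and every attribute name and value shown (`to`, `call-id`, `type`) are hypothetical placeholders, not the real wire format or any library's API.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Node:
    """Hypothetical in-memory model of a WABinary-style node."""
    tag: str
    attrs: dict[str, str] = field(default_factory=dict)
    # Either a list of child nodes, a raw byte payload, or nothing.
    content: Union[list["Node"], bytes, None] = None


def render(node: Node) -> str:
    """Render a node as XML-like text for debugging (not the binary encoding)."""
    attrs = "".join(f' {k}="{v}"' for k, v in node.attrs.items())
    if node.content is None:
        return f"<{node.tag}{attrs}/>"
    if isinstance(node.content, bytes):
        # Payloads are opaque ciphertext in practice, so only show the size.
        return f"<{node.tag}{attrs}>[{len(node.content)} bytes]</{node.tag}>"
    inner = "".join(render(child) for child in node.content)
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def wrap_in_call(child: Node, to: str) -> Node:
    """Place a call-control node under the top-level <call> tag."""
    return Node("call", {"to": to}, [child])


# Placeholder values only; attribute names are assumptions for illustration.
offer = Node("offer", {"call-id": "EXAMPLE"}, [Node("enc", {"type": "msg"}, b"\x00" * 4)])
stanza = wrap_in_call(offer, "peer@example")
print(render(stanza))
```

The media plane has no counterpart in this model: audio and video frames never become nodes, because they travel as SRTP over UDP rather than over the WebSocket.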
The two planes at a glance
| | Signaling plane | Media plane |
|---|---|---|
| Carries | Call control stanzas | Audio/video frames |
| Transport | Noise-encrypted WebSocket (TCP) | SRTP over UDP |
| Encoding | WABinary <call> nodes | RTP/SRTP packets |
| Encryption | Noise (link) + Signal (per-stanza <enc>) | SRTP (keys derived from the Signal-delivered media key) |
| Routing | Through the WA server (store-and-forward, multi-device fan-out) | Direct or via WA relays (ICE/TURN-like) |
| Documented in | Signaling, Transport over Noise, Keying | Media / SRTP, ICE & relays |
Where keying sits
Keying is carried on the signaling plane but sits logically between the two planes. When a
device sends an offer, it includes one or more <enc> nodes whose payloads are
Signal-protocol ciphertext. Each <enc> delivers the call/media key to one peer
device (multi-device means several <enc> nodes per offer; see
fan-out). Because the key is wrapped
in the recipient's Signal session, the WA server routes it but cannot read it.
Both endpoints then derive SRTP keys from that shared secret, and the media plane
comes up encrypted end-to-end.
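The idea that both endpoints expand one shared secret into SRTP key material can be sketched as follows. This is not the actual WhatsApp key schedule, which this documentation treats as speculative. The sketch assumes HKDF-SHA256 (RFC 5869) as a stand-in key-derivation function, a hypothetical label string, and the 16-byte master key plus 14-byte master salt sizes of the default AES-128 SRTP profile (RFC 3711). The function names are invented for illustration.

```python
import hashlib
import hmac
import os


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF extract-then-expand (RFC 5869) using only the standard library."""
    if not salt:
        salt = b"\x00" * hashlib.sha256().digest_size
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_srtp_material(call_key: bytes, label: bytes) -> tuple[bytes, bytes]:
    """Return (master_key, master_salt). ASSUMPTION: the real derivation is unknown."""
    okm = hkdf_sha256(call_key, salt=b"", info=label, length=16 + 14)
    return okm[:16], okm[16:]


# The call key would arrive inside a Signal-encrypted <enc> node; here it is random.
call_key = os.urandom(32)
label = b"example caller->callee"  # hypothetical label, not a real constant

caller_side = derive_srtp_material(call_key, label)
callee_side = derive_srtp_material(call_key, label)
print(caller_side == callee_side)
```

Because the derivation is deterministic, the two sides agree on key material without sending anything further over the signaling plane. The server only ever routes the wrapped call key, so it cannot reproduce this step.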
Call lifecycle
The diagram below sketches a typical outgoing 1:1 audio call. It is a model that orients new contributors; individual stanzas are documented (with confidence levels) in the generated stanza catalog, and full sequences live in the flow catalog.
sequenceDiagram
autonumber
participant Caller as Caller device
participant Server as WhatsApp server
participant Callee as Callee device
participant Media as Media plane (SRTP/UDP)
Note over Caller,Callee: Signaling plane, Noise-encrypted WebSocket (WABinary)
Caller->>Server: <call><offer> (audio desc, net, capability,<br/>encopt, <enc> media key per device, destination)
Server->>Callee: routed <call><offer>
Server-->>Caller: <ack>
Callee->>Server: <call><preaccept> (ringing)
Server->>Caller: routed <preaccept>
Callee->>Server: <call><accept>
Server->>Caller: routed <accept>
Server-->>Callee: <ack>
Note over Caller,Callee: Both derive SRTP keys from the Signal-wrapped media key
Caller->>Media: SRTP audio (UDP, via relays)
Callee->>Media: SRTP audio (UDP, via relays)
Note over Caller,Callee: Mid-call: <transport> / <relaylatency> /<br/>mute / video-toggle updates on the signaling plane
Caller->>Server: <call><terminate reason="...">
Server->>Callee: routed <terminate>
Server-->>Caller: <ack>
Reading the diagram
- Steps 1–3: the offer is server-routed; the server <ack>s the sender.
- Steps 4–7: ringing (<preaccept>) and acceptance (<accept>) flow back through the server.
- After acceptance both sides bring up SRTP media over UDP, typically through WA relays selected from the offered transport endpoints. See ICE & relays.
- During the call, transport and state updates continue on the signaling plane.
- The call ends with a <terminate> carrying a reason (e.g. declined, timeout, busy, connection_lost).
A call that is never answered ends with <terminate reason="timeout">; a
declined call with <reject> followed by <terminate reason="declined">. These
reason codes are catalogued in the generated
enums page.
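The lifecycle described above can be approximated as a small state machine driven by the tag of each call-control stanza. This is a sketch for orientation only: the stanza tags come from this page, but the state names, the transition table, and the `step` and `replay` helpers are assumptions, and real clients may accept orderings this model rejects.

```python
from enum import Enum, auto


class CallState(Enum):
    IDLE = auto()
    OFFERED = auto()
    RINGING = auto()
    ACTIVE = auto()
    REJECTED = auto()
    ENDED = auto()


# Assumed transitions, keyed by (current state, stanza tag).
TRANSITIONS = {
    (CallState.IDLE, "offer"): CallState.OFFERED,
    (CallState.OFFERED, "preaccept"): CallState.RINGING,
    (CallState.RINGING, "accept"): CallState.ACTIVE,
    (CallState.OFFERED, "reject"): CallState.REJECTED,
    (CallState.RINGING, "reject"): CallState.REJECTED,
}

# Mid-call updates that leave the call active.
MID_CALL_TAGS = {"transport", "relaylatency"}


def step(state: CallState, tag: str) -> CallState:
    """Advance the call state for one stanza tag, or raise on an unexpected tag."""
    if tag == "terminate" and state not in (CallState.IDLE, CallState.ENDED):
        return CallState.ENDED  # terminate ends any call that has started
    if state is CallState.ACTIVE and tag in MID_CALL_TAGS:
        return CallState.ACTIVE
    try:
        return TRANSITIONS[(state, tag)]
    except KeyError:
        raise ValueError(f"unexpected <{tag}> in state {state.name}") from None


def replay(tags: list[str]) -> CallState:
    """Run a sequence of stanza tags from IDLE and return the final state."""
    state = CallState.IDLE
    for tag in tags:
        state = step(state, tag)
    return state


answered = replay(["offer", "preaccept", "accept", "transport", "terminate"])
declined = replay(["offer", "preaccept", "reject", "terminate"])
print(answered.name, declined.name)
```

An unanswered call is simply `["offer", "terminate"]` in this model. The `reason` attribute on <terminate> is what distinguishes a timeout from the other endings.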
What is firmer vs. softer
- Firmer (probable): the existence of the <call> node family; signaling over the Noise WebSocket; media over SRTP/UDP; the media key being delivered via <enc> Signal ciphertext; multi-device producing multiple <enc> nodes.
- Softer (speculative): exact attribute meanings (e.g. net medium, encopt keygen), the precise SRTP key schedule, relay candidate selection logic, and codec negotiation details.
Every claim above is mirrored in the corpus with an explicit confidence and,
where we are unsure, an open_questions entry. See
methodology for how those levels move.
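As a rough picture of what such a corpus entry might look like, the sketch below pairs a claim with a confidence level and a list of open questions, and flags entries that look incomplete. The `Claim` class, its field names other than `open_questions`, and the rule that a speculative claim should carry at least one open question are assumptions made for illustration; the actual corpus schema may differ.

```python
from dataclasses import dataclass, field

# Only the two levels named on this page; the real corpus may define more.
LEVELS = ("speculative", "probable")


@dataclass
class Claim:
    """Hypothetical shape of one corpus entry."""
    statement: str
    confidence: str
    open_questions: list[str] = field(default_factory=list)


def problems(claim: Claim) -> list[str]:
    """Return human-readable issues with a claim (empty list means none found)."""
    issues = []
    if claim.confidence not in LEVELS:
        issues.append(f"unknown confidence level: {claim.confidence!r}")
    # ASSUMED rule: anything speculative should say what is still unknown.
    if claim.confidence == "speculative" and not claim.open_questions:
        issues.append("speculative claim has no open_questions entry")
    return issues


firm = Claim("Signaling rides the Noise-encrypted WebSocket.", "probable")
soft = Claim("SRTP keys are derived from the media key.", "speculative")
print(problems(firm))
print(problems(soft))
```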