Binary file added docs/assets/images/discord-voice.png
1 change: 1 addition & 0 deletions docs/contributing/reverse-engineering/index.md
@@ -10,3 +10,4 @@ There are various other resources that have been incredibly helpful in our endev
- [DiscordLists](https://github.com/Delitefully/DiscordLists)
- [Discord Datamining](https://github.com/Discord-Datamining/Discord-Datamining)
- [Discord.js](https://discord.js.org) and other bot implementations
- [Discord blog about voice](https://discord.com/blog/how-discord-handles-two-and-half-million-concurrent-voice-users-using-webrtc)
26 changes: 26 additions & 0 deletions docs/contributing/reverse-engineering/voice/index.md
@@ -0,0 +1,26 @@
# Voice

## Types of voice connections

Discord supports two types of voice connections:

- UDP voice connections - used by the desktop client and bots
- Connection initiation: uses raw IP and port to connect over UDP. Does not need ICE, so it is better for clients behind NAT.
- Transport Encryption: uses AEAD AES256-GCM or AEAD XChaCha20 Poly1305
- E2E Encryption: uses DAVE
- Well documented in [Discord developer docs](https://docs.discord.com/developers/topics/voice-connections)
- WebRTC voice connections - used by the browser client
- Connection initiation: uses SDP to negotiate video/audio and connection information. Uses ICE to find connection candidates.
- Transport Encryption: uses the Secure Real-time Transport Protocol (SRTP)
- E2E Encryption: uses DAVE
- Documented in our [WebRTC docs](/contributing/reverse-engineering/voice/webrtc)

A voice connection is composed of two parts: a WebSocket connection to the Voice Gateway, and a UDP/WebRTC connection to the SFU. Since clients connecting from a browser and those connecting from the Desktop client have to be able to communicate with one another in a voice channel, the Voice Gateway and SFU have to accept both types of connections. The client tells the Voice Gateway which type of voice connection it will be using, and the Voice Gateway is able to negotiate the selected type of connection between the SFU and the client.

## Architecture

Discord is able to support many voice connections by distributing them across regions. When a user signals that they are joining a voice channel, the Discord Gateway sends them the endpoint of the Voice Gateway they should connect to: one in the user's geographical region if the voice channel hasn't started yet, or the one already hosting the voice channel if it has.

Each voice server is composed of a WebSocket Voice Gateway and an SFU. The Voice Gateway is used for signaling between the SFU and the user's client, and controls the SFU. The SFU simply routes media packets to the other users in the voice channel, regardless of whether they are connected over WebRTC or UDP.

<img src="/assets/images/discord-voice.png" alt="Diagram of Discord voice architecture">
35 changes: 35 additions & 0 deletions docs/contributing/reverse-engineering/voice/sfu.md
@@ -0,0 +1,35 @@
# Selective Forwarding Unit (SFU) Server

## WebRTC implementation details

The media SFU must be an ICE-lite server: the client is the *controlling agent* that initiates the ICE connection, while the SFU stays in *controlled mode*. This is signaled to the client by giving it a `remoteDescription` containing the line `a=setup:passive` and, optionally, `a=ice-lite`. ICE-lite servers make the ICE connection simpler and more flexible for clients behind NAT, since the SFU's sole connection candidate will always be a public IP with an open port. Once the client begins connecting over ICE, it sends a STUN Binding request to the SFU, which serves as the connectivity check. The SFU replies with a STUN Binding response containing the client's translated transport address on the public side of any NATs between the client and the SFU, making it easier for clients behind NAT to connect.

The SFU and client exchange SDP only once, at the beginning of the connection. A complete SDP is not sent over the Voice Gateway; instead, only the necessary connection details are exchanged. The client first sends an offer containing its fingerprint and ICE username + password, and, separately from the SDP, a list of supported codecs and their payload ids. The SFU server then replies with the answer, which includes the SFU host and port, fingerprint, and ICE username + password. Both server and client have to reconstruct the original offer/answer SDP from this fragment before setting it as the `remoteDescription` on their end of the connection. After connecting, the Discord client uses this single `RTCPeerConnection` for both receiving and sending streams.

In summary, in order for the SFU server to be completely compatible with a Discord client, the server:

- Must support using a single peer connection for sending and receiving media.
- Must support server-side ICE-Lite, with the Discord client being the controlling agent in the ICE connection.
- Must support the client deciding the codec payload ids (dynamic payload ids).

## UDP implementation details

The UDP protocol is a stripped-down version of the WebRTC protocol, with no ICE, no DTLS, and its own transport encryption in place of SRTP. The SFU's UDP endpoint is sent to the client in the **Opcode 2 Ready** message. The UDP client then sends a pseudo-STUN Binding Request to this endpoint, which is just a packet with the following format:
```
+--------------+------------------------+-------------------+-----------------+
| 2 byte (0x1) | 2 byte msg length (70) | 4 byte audio SSRC | 66 byte padding |
+--------------+------------------------+-------------------+-----------------+
```

The server then replies with a pseudo-STUN Binding Response, which contains the public IP address of the UDP client, and its port which was used in the UDP connection that sent the original request:
```
+--------------+------------------------+-------------------+------------------------------------------------------+------------------------+
| 2 byte (0x2) | 2 byte msg length (70) | 4 byte audio SSRC | 64 byte null-terminated string containing IP address | 2 byte containing port |
+--------------+------------------------+-------------------+------------------------------------------------------+------------------------+
```
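
The discovery exchange above can be sketched in Node.js. The field widths follow the packet layouts in this document; big-endian byte order for the integer fields, and the helper names, are assumptions of this sketch:

```javascript
// Build the pseudo-STUN Binding Request: type 0x1, length 70,
// 4-byte audio SSRC, then 66 bytes of zero padding (74 bytes total).
function buildDiscoveryRequest(audioSsrc) {
  const buf = Buffer.alloc(74);      // zero-filled, so padding is already set
  buf.writeUInt16BE(0x1, 0);         // message type: request
  buf.writeUInt16BE(70, 2);          // message length (excludes type + length)
  buf.writeUInt32BE(audioSsrc, 4);   // our audio SSRC
  return buf;
}

// Parse the pseudo-STUN Binding Response: type 0x2, length 70, SSRC,
// 64-byte null-terminated IP string, then a 2-byte port.
function parseDiscoveryResponse(buf) {
  if (buf.readUInt16BE(0) !== 0x2) throw new Error('not a discovery response');
  const ssrc = buf.readUInt32BE(4);
  const raw = buf.subarray(8, 72);                  // 64-byte address field
  const end = raw.indexOf(0);                       // find the null terminator
  const ip = raw.subarray(0, end === -1 ? 64 : end).toString('ascii');
  const port = buf.readUInt16BE(72);
  return { ssrc, ip, port };
}
```
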

The client then sends this IP and port to the Voice Gateway in the **Opcode 1 Select Protocol** message, which also specifies the transport encryption mode and sets the protocol to `udp`.

The UDP protocol still uses regular RTP packets for sending and receiving media, but with its own custom transport encryption. When the UDP client sends media to the SFU, the RTP packets are decrypted and then forwarded to the other UDP/WebRTC clients; similarly, RTP packets coming from other WebRTC/UDP clients are encrypted and forwarded to our UDP client.

Luckily, the UDP protocol is heavily documented in the [Discord Developer Docs](https://docs.discord.com/developers/topics/voice-connections#transport-encryption-and-sending-voice).
211 changes: 211 additions & 0 deletions docs/contributing/reverse-engineering/voice/webrtc.md
@@ -0,0 +1,211 @@
# WebRTC Signaling

Since WebRTC is a well-defined open standard and the built-in browser APIs are heavily documented in many online sources, we will not focus on the WebRTC browser API and how to use it to connect. Rather, this document focuses on the Discord-specific signaling portion of connection establishment. For details on the browser's WebRTC API, see the [Mozilla Developer Docs](https://developer.mozilla.org/en-US/docs/Web/API/RTCPeerConnection).

## Connecting to voice

The signaling for connecting to voice over WebRTC is very similar to the flow used by UDP clients. There are only a few differences, most notably that our connection information is exchanged in SDP format.

### 1. Getting our Voice Gateway endpoint and token

First, send an **Opcode 4 Gateway Voice State Update** to the Gateway:

Example:
```json
{
  "op": 4,
  "d": {
    "guild_id": "41771983423143937",
    "channel_id": "127121515262115840",
    "self_mute": false,
    "self_deaf": false
  }
}
```

The Gateway will respond with two events: a **Voice State Update** and a **Voice Server Update**.

The **Voice Server Update** event contains the Voice Gateway endpoint that we will connect to, as well as our token for authenticating with it.

```json
{
  "t": "VOICE_SERVER_UPDATE",
  "s": 2,
  "op": 0,
  "d": {
    "token": "my_token",
    "guild_id": "41771983423143937",
    "endpoint": "sweetwater-12345.discord.media:2048"
  }
}
```
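
The two-event flow above can be sketched as a small helper that collects the fields needed to open the Voice Gateway connection. The `user_id` and `session_id` fields of the **Voice State Update** dispatch are assumed here, and the helper name is illustrative:

```javascript
// Sketch: accumulate the two Gateway dispatch events until we have the
// endpoint, token, server id, and session id for the Voice Gateway handshake.
function collectVoiceConnectInfo(events, selfUserId) {
  const info = {};
  for (const ev of events) {
    if (ev.op !== 0) continue; // only dispatch (op 0) events carry these payloads
    if (ev.t === 'VOICE_STATE_UPDATE' && ev.d.user_id === selfUserId) {
      info.sessionId = ev.d.session_id;     // our own voice session id
    } else if (ev.t === 'VOICE_SERVER_UPDATE') {
      info.token = ev.d.token;              // Voice Gateway auth token
      info.serverId = ev.d.guild_id;        // becomes server_id in Identify
      info.endpoint = ev.d.endpoint;        // Voice Gateway host:port
    }
  }
  return info;
}
```
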
### 2. Connecting to Voice Gateway

Now that we have the Voice Gateway endpoint and token, we can connect to it. Start a new WebSocket connection to `wss://endpoint_obtained_from_gateway`

Once connected, send an **Opcode 0 Identify** payload with our `server_id`, `user_id`, `session_id`, and `token`:

```json
{
  "op": 0,
  "d": {
    "server_id": "41771983423143937",
    "user_id": "104694319306248192",
    "session_id": "my_session_id",
    "token": "my_token",
    "video": true,
    "streams": [
      {
        "type": "video",
        "rid": "100",
        "quality": 100
      }
    ],
    "max_dave_protocol_version": 1
  }
}
```

The Voice Gateway will respond with an **Opcode 2 Ready**

```json
{
  "op": 2,
  "d": {
    "ssrc": 1,
    "ip": "127.0.0.1",
    "port": 1234,
    "modes": ["xsalsa20_poly1305", "xsalsa20_poly1305_suffix", "xsalsa20_poly1305_lite"],
    "heartbeat_interval": 1,
    "streams": [
      {
        "type": "video",
        "rid": "100",
        "quality": 100,
        "ssrc": 2,
        "rtx_ssrc": 3
      }
    ]
  }
}
```

The IP, port, and SSRC values would be used by a UDP client for connecting and sending media packets, but since we are connecting over WebRTC, we can disregard everything in this message. Our browser will generate its own SSRC values, and the connection information will be exchanged over SDP.

### 3. Start SDP negotiation

To start the SDP negotiation, we send an **Opcode 1 Select Protocol** that includes our offer SDP.

**Important:** our offer SDP is truncated before being sent to the server. You can also send a complete SDP offer; it does not really matter, because the client and server will use SDP munging to negotiate any streams. It *is* important that the `codecs` array has the correct payload types, as the server uses these for its SDP munging.

```json
{
  "op": 1,
  "d": {
    "protocol": "webrtc",
    "data": "our sdp offer here",
    "sdp": "our sdp offer here",
    "codecs": [
      {
        "name": "opus",
        "type": "audio",
        "priority": 1000,
        "payload_type": 111
      },
      {
        "name": "H264",
        "type": "video",
        "priority": 1000,
        "payload_type": 103,
        "rtx_payload_type": 104
      }
    ],
    "rtc_connection_id": "" // uuid
  }
}
```

Voice Gateway will respond with **Opcode 4 Session Description**

```json
{
  "op": 4,
  "d": {
    "video_codec": "H264",
    "sdp": "sdp answer here",
    "media_session_id": "",
    "audio_codec": "opus"
  }
}
```

The SDP answer will not be a full SDP; instead, it is a truncated SDP that contains the necessary connection information, such as the SFU server host and port, fingerprint, and ICE username + password. Using this basic information we should be able to start our WebRTC connection, but first we have to reconstruct a complete SDP answer from it so that our client peer accepts it as valid SDP.

When reconstructing the SDP answer, remember to add `a=setup:passive` to it, since that tells the client to act as the *controlling* side and initiate the ICE connection to the SFU.
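
A minimal sketch of such a reconstruction is shown below. The exact set of SDP lines a browser will accept can vary, and the input field names and candidate priority are assumptions based on the description above, not values taken from the protocol:

```javascript
// Rebuild a full SDP answer from the truncated fragment's fields so the
// browser accepts it as a remoteDescription.
function buildAnswerSdp({ host, port, fingerprint, iceUfrag, icePwd }, media) {
  const session = [
    'v=0',
    `o=- 0 0 IN IP4 ${host}`,
    's=-',
    't=0 0',
    'a=ice-lite', // the SFU is an ICE-lite agent
  ];
  const mediaLines = media.flatMap((m) => [
    `m=${m.kind} ${port} UDP/TLS/RTP/SAVPF ${m.payloadTypes.join(' ')}`,
    `c=IN IP4 ${host}`,
    `a=mid:${m.mid}`,
    `a=ice-ufrag:${iceUfrag}`,
    `a=ice-pwd:${icePwd}`,
    `a=fingerprint:sha-256 ${fingerprint}`,
    'a=setup:passive', // makes the client the active/controlling side
    // The SFU's only candidate: its public IP and open port.
    `a=candidate:1 1 UDP 2130706431 ${host} ${port} typ host`,
  ]);
  return session.concat(mediaLines).join('\r\n') + '\r\n';
}
```
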

## Signaling track events

After we have successfully connected to the SFU over WebRTC, we are ready to begin sending and receiving media. Since the SFU and WebRTC client only exchange SDP once (at the start of the connection), both sides have to do some clever trickery to negotiate changes to the incoming/outgoing streams. The client has to use something called **SDP munging**, where it alters the SDP manually, even generating its own SDP answer once it receives **Opcode 12 Video**.

### Server->Client Opcode 12 Video

An incoming **Opcode 12 Video** event signals that a user in the voice channel has changed their outgoing tracks. It includes the `user_id` so that you can map each track to a user based on the SSRC.

```json
{
  "audio_ssrc": 1,
  "video_ssrc": 2,
  "rtx_ssrc": 3,
  "user_id": "29229393982",
  "streams": [
    {
      "type": "video",
      "rid": "100",
      "ssrc": 2,
      "active": false,
      "quality": 100,
      "rtx_ssrc": 3,
      "max_bitrate": 2500,
      "max_framerate": 30,
      "max_resolution": { "type": "fixed", "width": 1080, "height": 720 }
    }
  ]
}
```

An SSRC value > 0 indicates that the user is publishing that track, while a value of 0 indicates that the user is not currently publishing it. If you receive a positive audio SSRC, expect the PeerConnection's `ontrack` event to fire and a track matching that SSRC to arrive. For video, `video_ssrc` has to be > 0 AND at least one stream must have `active=true`.
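
These rules can be captured in a small, hypothetical helper:

```javascript
// Decide which tracks a user is publishing, per the rules above:
// SSRC 0 means "not publishing"; video also needs an active stream.
function getPublishedTracks(videoEvent) {
  const audio = videoEvent.audio_ssrc > 0;
  const video =
    videoEvent.video_ssrc > 0 &&
    (videoEvent.streams || []).some((s) => s.active === true);
  return { user_id: videoEvent.user_id, audio, video };
}
```
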

### Client->Server Opcode 12 Video

An outgoing **Opcode 12 Video** event signals that your client wants to change its outgoing tracks. It is similar to the incoming event payload, but it omits the `user_id`, since the server already knows the event is coming from you.

```json
{
  "audio_ssrc": 1,
  "video_ssrc": 2,
  "rtx_ssrc": 3,
  "streams": [
    {
      "type": "video",
      "rid": "100",
      "ssrc": 2,
      "active": false,
      "quality": 100,
      "rtx_ssrc": 3,
      "max_bitrate": 2500,
      "max_framerate": 30,
      "max_resolution": { "type": "fixed", "width": 1080, "height": 720 }
    }
  ]
}
```

Similarly, an SSRC value > 0 indicates that you are publishing that track, while a value of 0 indicates that you are not. If you send a positive audio SSRC, the server will expect you to start sending a track matching that SSRC. For video, `video_ssrc` has to be > 0 AND at least one stream must have `active=true`.

## Other Voice Gateway events

The remaining Voice Gateway events are exactly the same for WebRTC clients and UDP clients, so you can simply reference the [Discord developer docs](https://docs.discord.com/developers/topics/voice-connections). These include Speaking events, Heartbeating, Buffered Resume, and the E2EE DAVE protocol-related message events.