RTMP vs SRT vs WebRTC: A Decision Framework for Real-Time Video Architecture

Before your next vendor discussion, here’s the mental model you actually need.

If you’ve recently found yourself in a conversation about low-latency video infrastructure, you’ve almost certainly encountered all three: RTMP, SRT, and WebRTC. Each has passionate advocates. Each has legitimate use cases. And each is routinely misapplied, often by the very vendors pitching them.

This piece won’t give you another latency comparison table. What it will give you is a decision framework: a way of thinking about your pipeline in legs, asking the right questions at each leg, and arriving at an architecture that uses the right protocol for the right job — even if that means using all three simultaneously.

The Core Mental Model: Think in Pipeline Legs, Not Protocols

The most common mistake in streaming architecture decisions is choosing a single protocol and applying it everywhere. Real-world systems almost never work this way.

A live video pipeline has at minimum three distinct legs:

[Source / Performer]
        ↓
[Ingest & Processing]
        ↓
[Distribution & Viewer]

Each leg has different latency requirements, scale characteristics, network conditions, and endpoint types. A protocol that’s optimal for the ingest leg may be catastrophic for viewer distribution. The right question is never “Which protocol should we use?” — it’s “Which protocol is right for which leg of this pipeline?”

With that model in mind, let’s look at each protocol with the precision a real architectural decision requires.

RTMP: The Universal Workhorse (That’s Showing Its Age)

What it is: Real-Time Messaging Protocol was developed by Macromedia in the early 2000s for Flash-based streaming. Adobe later opened the specification. Despite Flash being long dead, RTMP remains the dominant ingest protocol for live streaming infrastructure globally.

Transport: TCP. This is the defining characteristic that explains most of RTMP’s tradeoffs.

Typical latency: 3–8 seconds, and it accumulates. Unlike UDP-based protocols, TCP’s retransmission behavior means that packet loss on a congested network causes head-of-line blocking — the entire stream waits for the dropped packet to be retransmitted before continuing. Over a multi-hour stream, this can compound into significant drift.

Where RTMP genuinely excels:

  • Compatibility. OBS, Wirecast, hardware encoders, and virtually every streaming platform (YouTube Live, Twitch, Facebook Live) speak RTMP for ingest. When you need to accept streams from diverse, unpredictable sources, RTMP is the lowest-friction path.
  • CDN and media server support. Wowza, Nginx-RTMP, AWS Elemental, and most CDNs have decades of hardened RTMP support. If your distribution infrastructure is already built on RTMP, staying on it for ingest reduces architectural surface area.
  • Simplicity at the encoder side. For a broadcaster using OBS, setting up an RTMP stream is a single URL and stream key. Nothing else to configure.

Where RTMP fails:

  • Any use case where latency below 3 seconds matters.
  • Environments with unreliable network conditions (the TCP retransmission behavior makes it worse on lossy networks, not better).
  • Browser-native ingest or delivery — RTMP requires a plugin or native application. There is no browser RTMP.

The verdict: Of the three protocols, RTMP is the right choice when compatibility and ecosystem coverage outweigh latency requirements. It is not suitable as the primary transport for any real-time interactive application.

SRT: The Broadcast Engineer’s Answer to an Unreliable World

What it is: Secure Reliable Transport was developed by Haivision and open-sourced in 2017. It was designed specifically to solve one problem: how do you get broadcast-quality, low-latency video across the public internet reliably?

Transport: UDP, with a reliability layer (ARQ — Automatic Repeat reQuest) built on top, plus built-in AES-128/256 encryption.

Typical latency: 0.5–2 seconds. SRT uses a configurable latency buffer; you trade latency for reliability by adjusting this buffer based on your expected network conditions. On a controlled, low-jitter network, you can push SRT latency toward 200–300ms. On a congested public internet path, you’ll want a larger buffer.
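
The buffer trade-off described above can be sketched as a small helper. This is only an illustration: estimateSrtLatencyMs is a hypothetical name, and the multiplier of 4 times the round-trip time is an assumed rule-of-thumb starting point rather than a figure from this article. The 120 ms floor corresponds to the default latency setting in the reference SRT library.

```javascript
// Hypothetical helper: suggest an SRT latency buffer from a measured round-trip time.
// ASSUMPTION: the 4x RTT multiplier is a rule-of-thumb starting point; tune it per link.
const SRT_DEFAULT_LATENCY_MS = 120; // default latency setting in libsrt

function estimateSrtLatencyMs(rttMs, rttMultiplier = 4) {
    if (!Number.isFinite(rttMs) || rttMs < 0) {
        throw new RangeError('rttMs must be a non-negative number');
    }
    // Never go below the default buffer, even on very short paths.
    return Math.max(SRT_DEFAULT_LATENCY_MS, Math.round(rttMs * rttMultiplier));
}

console.log(estimateSrtLatencyMs(20));   // short, low-jitter path
console.log(estimateSrtLatencyMs(150));  // long-haul public internet path
```

A lossier path would typically call for a larger multiplier, which is exactly the latency-for-reliability trade the buffer exposes.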

Where SRT genuinely excels:

  • Contribution feeds over unpredictable networks. Field reporters, remote venues, OB trucks — SRT was designed for exactly this. It recovers from packet loss gracefully without the TCP head-of-line blocking problem.
  • Cross-datacenter transport. Moving video between data centers across the public internet, SRT is significantly more reliable than RTMP and meaningfully lower latency.
  • Broadcast-grade workflows. Major broadcasters (BBC, Fox Sports, others) have adopted SRT for contribution because it behaves predictably under network stress.
  • Security. Unlike base RTMP, SRT has AES encryption built into the protocol; it is enabled by configuring a passphrase on both ends.
  • Protocol-level latency control. SRT gives you explicit knobs for latency vs. reliability tradeoffs. RTMP gives you none.

Where SRT falls short:

  • Not browser-native. Like RTMP, SRT requires native application support. No browser can originate or receive an SRT stream natively.
  • Not universally supported. While adoption is growing fast, SRT support in CDN edges and ingest points is still less universal than RTMP.
  • Still not low enough for interactive applications. 500ms is a meaningful improvement over RTMP, but it’s still above the threshold for true interactive video. You can hear 500ms delay in a conversation. You can feel it in a live response system.

The verdict: Of the three protocols, SRT is the right choice for the contribution leg: moving video from a source to a media server across a network you don’t fully control, where reliability matters as much as latency. It’s a significant upgrade from RTMP for that specific job. It is not a replacement for WebRTC in interactive or ultra-low-latency scenarios.

WebRTC: The Interactive Web’s Real-Time Foundation

What it is: Web Real-Time Communication is a W3C/IETF open standard developed primarily by Google and now implemented natively in all major browsers. It was built from the ground up for peer-to-peer and server-mediated real-time communication.

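Transport: UDP, with media carried as SRTP keyed through a DTLS handshake (DTLS-SRTP). Encryption is mandatory; there is no unencrypted WebRTC. The protocol stack also includes ICE with STUN/TURN for NAT traversal, and congestion control (for example Google Congestion Control, driven by receiver feedback such as REMB) for adaptive bitrate.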

Typical latency: 50–200ms. This is a qualitatively different class of latency from RTMP or SRT. Sub-200ms means real-time human perception: a performer can react to viewer input, a system can detect stream failure and switch states before a viewer notices.

Where WebRTC genuinely excels:

  • Browser-native ingest and delivery. A performer can stream directly from a browser tab. A viewer can receive a stream in a browser tab. No application install, no plugin, no external dependency.
  • Interactive applications. Video conferencing, live auctions, real-time coaching, interactive performances — anything where participants need to respond to each other in real time.
  • Adaptive bitrate by default. WebRTC’s congestion control automatically adjusts encoding based on network conditions, without requiring manual configuration.
  • Safety-critical routing. When WebRTC sessions are mediated by an SFU (Selective Forwarding Unit), the server controls precisely what each participant receives. This is critical for use cases where certain streams must never be exposed to certain viewers: that enforcement lives at the media routing layer, not the application layer.
  • OBS integration via WHIP. From OBS Studio v30 onwards, WebRTC ingest via the WHIP (WebRTC HTTP Ingest Protocol) standard is natively supported. This removes one of the last remaining barriers to WebRTC adoption in production streaming workflows.
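
To make the WHIP point concrete: a WHIP client starts publishing by sending its SDP offer in an HTTP POST and reading the SDP answer from the response. The sketch below only builds that request; buildWhipRequest is a hypothetical helper, and the endpoint URL and bearer token are placeholders that a real media server would supply.

```javascript
// Sketch: build the HTTP request a WHIP client sends to start publishing.
// ASSUMPTIONS: endpointUrl and token are placeholders provided by your media server.
function buildWhipRequest(endpointUrl, sdpOffer, token) {
    // WHIP carries the SDP offer as the raw request body.
    const headers = { 'Content-Type': 'application/sdp' };
    if (token) {
        headers['Authorization'] = `Bearer ${token}`;
    }
    return {
        url: endpointUrl,
        options: { method: 'POST', headers, body: sdpOffer },
    };
}

// In a browser, the result would be used roughly like this:
//   const { url, options } = buildWhipRequest(whipUrl, peerConnection.localDescription.sdp, token);
//   const response = await fetch(url, options);      // a successful publish returns 201 Created
//   const answerSdp = await response.text();         // SDP answer from the server
//   await peerConnection.setRemoteDescription({ type: 'answer', sdp: answerSdp });
```

Because the exchange is a single HTTP round trip, an encoder needs no custom signalling protocol to publish over WebRTC.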

Where WebRTC falls short:

  • Scale requires an SFU. Pure WebRTC peer-to-peer breaks down beyond a handful of participants. Production WebRTC at scale requires a properly architected SFU layer — which adds infrastructure complexity.
  • CDN delivery is not native. WebRTC isn’t natively supported by CDN edges the way HLS is. For large-audience broadcast delivery (thousands to millions of viewers), WebRTC needs to be combined with HLS/DASH at the egress leg.
  • Not universally supported by hardware encoders. Professional broadcast hardware often speaks RTMP or SRT but not WebRTC. In workflows where hardware encoders are the source, RTMP or SRT ingest may be unavoidable.

The verdict: Of the three protocols, WebRTC is the right choice whenever sub-300ms latency is required, whenever browser-native participation matters, or whenever you need fine-grained server-side control over who receives what stream. It is the modern foundation for interactive live video.

RTMP vs SRT vs WebRTC: The Comparison at a Glance

Dimension                      RTMP                SRT             WebRTC
Latency                        3–8s (accumulates)  0.5–2s          50–200ms
Transport                      TCP                 UDP + ARQ       UDP + DTLS
Encryption                     Optional (RTMPS)    Built-in AES    Mandatory
Browser native                 No                  No              Yes
Adaptive bitrate               No                  Limited         Yes (built-in)
Reliability on lossy networks  Poor                Excellent       Good
CDN/ecosystem support          Excellent           Growing         Limited
Scale model                    Push to server      Push to server  SFU-mediated
OBS support                    Native              Native          Native (v30+, WHIP)
Suitable for interactive       No                  No              Yes
Suitable for broadcast scale   Yes                 Yes             With HLS egress

The Decision Framework: Questions to Ask at Each Pipeline Leg

Rather than picking a protocol, run through these questions for each leg of your pipeline:

Leg 1: Source → Ingest Server

Question 1: Does the source require real-time interaction or feedback?

  • Yes → WebRTC (only protocol with sub-300ms latency at this leg)
  • No → proceed to Question 2

Question 2: How controlled is the network between source and ingest?

  • Unreliable / public internet / variable → SRT (reliability layer handles packet loss)
  • Controlled / datacenter / low jitter → RTMP or SRT (both work; SRT preferred for its lower latency and encryption)

Question 3: What software or hardware is the source using?

  • OBS v30+, browser → WebRTC (WHIP) is viable
  • Hardware encoder, legacy OBS, third-party streaming software → RTMP or SRT depending on support
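
The three Leg 1 questions can be encoded as a small function, which is one way to keep the decision explicit in code reviews. The function name, field names and note strings below are hypothetical; only the branching follows the questions above.

```javascript
// Sketch of the Leg 1 questions as code. All names here are illustrative assumptions.
function chooseIngestProtocol({ needsInteraction, networkControlled, sourceSupports }) {
    const supports = new Set(sourceSupports); // e.g. ['rtmp', 'srt', 'webrtc']

    // Question 1: interaction needs WebRTC; Question 3 decides whether the source can do it.
    if (needsInteraction) {
        return supports.has('webrtc')
            ? { protocol: 'webrtc', note: 'interactive ingest' }
            : { protocol: null, note: 'source cannot meet the interactive latency target' };
    }
    // Question 2: SRT is preferred on both controlled and uncontrolled networks.
    if (supports.has('srt')) {
        return { protocol: 'srt', note: 'reliability layer handles packet loss' };
    }
    if (supports.has('rtmp')) {
        const note = networkControlled
            ? 'acceptable on a controlled network'
            : 'only option for this source; expect delay under packet loss';
        return { protocol: 'rtmp', note };
    }
    // Browser-only sources end up here.
    return supports.has('webrtc')
        ? { protocol: 'webrtc', note: 'only protocol the source supports' }
        : { protocol: null, note: 'no supported ingest protocol' };
}

console.log(chooseIngestProtocol({
    needsInteraction: false,
    networkControlled: false,
    sourceSupports: ['rtmp', 'srt'],
}).protocol);
```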

Leg 2: Media Server → Processing Layer

Question 4: Does the processing layer need real-time responsiveness to stream state?

  • Yes (e.g., AI video processing, failure detection, frame-level operations) → RTP (raw transport from a WebRTC SFU, or GStreamer pipeline with SRT)
  • No (e.g., recording, async processing) → RTMP or SRT are both fine

Question 5: How quickly must failure conditions be detected and handled?

  • Sub-second → WebRTC/SFU-based routing (can enforce viewer-facing states at the routing layer in real time)
  • Seconds are acceptable → SRT or RTMP with application-level monitoring

Leg 3: Media Server → Viewer

Question 6: How many concurrent viewers?

  • Hundreds or fewer, interactive → WebRTC direct delivery
  • Thousands+ → HLS/DASH from a transcoding layer, potentially after WebRTC ingest and SFU processing

Question 7: Do viewers need to interact with the stream (react, participate)?

  • Yes → WebRTC delivery
  • No (passive viewing) → HLS/DASH is more cost-effective at scale
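
Questions 6 and 7 combine into a similar sketch for the viewer leg. The threshold of 1,000 viewers is an assumption standing in for "hundreds or fewer", and the split result for large interactive audiences is one reading of the two answers taken together (a small interactive group on WebRTC, everyone else on HLS/DASH).

```javascript
// Sketch of the Leg 3 questions. WEBRTC_DIRECT_LIMIT is an assumed cut-off, not a measured one.
const WEBRTC_DIRECT_LIMIT = 1000;

function chooseDeliveryProtocols({ concurrentViewers, viewersInteract }) {
    if (!viewersInteract) {
        // Question 7: passive viewing is cheaper over HLS/DASH at any scale.
        return ['hls-dash'];
    }
    if (concurrentViewers < WEBRTC_DIRECT_LIMIT) {
        // Question 6: small interactive audiences can be served directly.
        return ['webrtc'];
    }
    // Large audience with interaction: WebRTC for participants, HLS/DASH for the rest.
    return ['webrtc', 'hls-dash'];
}

console.log(chooseDeliveryProtocols({ concurrentViewers: 200, viewersInteract: true }));
console.log(chooseDeliveryProtocols({ concurrentViewers: 50000, viewersInteract: true }));
```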

Real-World Architecture Blueprints

Blueprint A: Large-Scale Broadcast (YouTube-style)

Broadcaster (OBS/hardware)
    → RTMP or SRT ingest
    → Media server (Wowza/Elemental)
    → Transcoding to HLS/DASH
    → CDN
    → Viewer (browser/app, HLS)

Why: Compatibility at ingest, CDN scale at egress. Latency is 10–30 seconds — acceptable for passive broadcast.

Blueprint B: Low-Latency Interactive Live (Conferencing/Coaching)

Participant (browser or OBS via WHIP)
    → WebRTC ingest
    → SFU (session control, routing)
    → WebRTC delivery
    → Viewer (browser)

Why: Sub-200ms end-to-end. Browser-native. SFU controls exactly who sees what.

Blueprint C: Broadcast with Interactive Layer (Webinar-style)

Speaker (OBS via WHIP or browser)
    → WebRTC ingest → SFU
    → WebRTC delivery → interactive participants (low latency)
    → Transcoding → HLS delivery → passive audience (higher latency, higher scale)

Why: Speakers interact in real time; passive viewers receive a CDN-scaled HLS stream.

Blueprint D: AI Video Processing Pipeline (Real-Time FaceSwap / Avatar / Filter)

Performer (OBS via WHIP)
    → WebRTC ingest → SFU
    → RTP frames extracted → GPU worker (AI processing)
    → Processed frames returned → SFU
    → WebRTC delivery → viewer

Why: WebRTC at ingest gives sub-200ms, enabling the AI processing layer to stay within a real-time latency budget. The SFU enforces the critical safety requirement: raw performer video never reaches the viewer — only the processed output does. SRT or RTMP at the ingest leg would add 500ms–8s before the frame even reaches the GPU, making true real-time processing impossible.
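
The safety requirement in Blueprint D amounts to a default-deny forwarding rule at the routing layer. The sketch below shows the idea with plain objects; the role and stage values are hypothetical and do not correspond to any particular SFU's API.

```javascript
// Sketch of a routing-layer rule: viewers only ever receive processed tracks.
// ASSUMPTION: tracks carry a `stage` label and subscribers a `role`; both are illustrative.
function shouldForward(track, subscriber) {
    if (subscriber.role === 'gpu-worker') {
        return track.stage === 'raw';        // workers consume the raw performer feed
    }
    if (subscriber.role === 'viewer') {
        return track.stage === 'processed';  // viewers never see raw video
    }
    return false;                            // unknown roles get nothing (default deny)
}

function selectTracksFor(subscriber, tracks) {
    return tracks.filter(track => shouldForward(track, subscriber));
}

const tracks = [
    { id: 'performer-cam', stage: 'raw' },
    { id: 'performer-cam-ai', stage: 'processed' },
];
console.log(selectTracksFor({ role: 'viewer' }, tracks).map(t => t.id));
```

Keeping the rule in the media router means an application-layer bug cannot accidentally expose the raw feed.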

The Hybrid Principle: Most Production Systems Use All Three

The practical conclusion from this framework is that most mature real-world systems use multiple protocols at different legs. A common production-grade architecture might use:

  • RTMP for accepting ingest from legacy hardware encoders and third-party broadcasters who won’t change their setup
  • SRT for contribution feeds from field locations over unreliable networks
  • WebRTC for performer ingest and interactive viewer delivery where latency matters
  • HLS for passive large-audience delivery from a CDN

The protocols are not competitors. They are tools with different jobs. A mature infrastructure team reaches for the right one at each stage of the pipeline — and designs the system so that each leg can be upgraded independently as requirements evolve.

Before Your Next Vendor Discussion: The Checklist

Go into vendor conversations with these questions answered for your specific architecture:

  1. What are the latency requirements at each leg? (Interactive vs. near-real-time vs. broadcast-acceptable)
  2. What are the source endpoints? (Browser, OBS, hardware encoder, IP camera)
  3. What network conditions govern each leg? (Controlled datacenter vs. public internet last mile)
  4. What is the peak viewer scale? (Hundreds vs. thousands vs. millions)
  5. Does any leg require server-side control over what viewers receive? (Safety routing, access control, failure states)
  6. What processing happens between ingest and egress? (AI inference, transcoding, recording)
  7. What does the failure handling model look like? (How fast must failure be detected? What do viewers see?)

A vendor who cannot answer these questions in terms of your specific pipeline legs — rather than pitching a single protocol as the universal answer — is not yet thinking at the depth your architecture requires.

Conclusion

The question is never “RTMP vs SRT vs WebRTC?” The question is always “which protocol, at which leg of this pipeline, for which set of constraints?”

RTMP earns its place in compatibility-first ingest scenarios and established CDN workflows. SRT earns its place in contribution feeds across unreliable networks where reliability and modest latency matter. WebRTC earns its place wherever interactivity, browser-native access, server-side routing control, or sub-200ms latency is required.

Real-time AI video processing — including applications like live video filtering, avatar replacement, and face processing — represents a use case where WebRTC at the ingest leg is increasingly not just the best option but the necessary one. The latency budget demanded by real-time AI inference leaves no room for the seconds that RTMP accumulates or the half-second buffer that SRT requires.

Build your architecture in legs. Choose deliberately at each one. And be appropriately skeptical of any vendor who offers you a single-protocol answer to a multi-leg problem.

CentEdge builds real-time communications infrastructure for enterprises in regulated industries. Samvyo is an AI-native WebRTC platform built on a scalable SFU architecture, designed for low-latency media ingest, session control, and adaptive egress.

WebRTC getUserMedia: The secret for using media devices in the browser

The getUserMedia() API in WebRTC is responsible for capturing media streams from the devices currently available. The WebRTC standard provides this API for accessing cameras and microphones connected to the computer or smartphone. These devices are commonly referred to as Media Devices and can be accessed with JavaScript through the navigator.mediaDevices object, which implements the MediaDevices interface. From this object we can enumerate all connected devices, listen for device changes (when a device is connected or disconnected), and open a device to retrieve a Media Stream.

The most common way this is used is through the function getUserMedia(), which returns a promise that will resolve to a MediaStream for the matching media devices. This function takes a single MediaStreamConstraints object that specifies the requirements that we have. For instance, to simply open the default microphone and camera, we would do the following.

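// Thin wrapper around getUserMedia(); resolves to a MediaStream.
const mediaDevices = async (constraints) => {
    return await navigator.mediaDevices.getUserMedia(constraints);
};

// The call must be awaited: otherwise `stream` would be a pending Promise
// and the catch block would never see a rejection.
(async () => {
    try {
        const stream = await mediaDevices({'video': true, 'audio': true});
        console.log('Got MediaStream:', stream);
    } catch (error) {
        console.error('Error while trying to access media devices.', error);
    }
})();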

The call to getUserMedia() will trigger a permissions request. If the user accepts the permission, the promise is resolved with a MediaStream containing one video and one audio track. If the permission is denied, the promise is rejected with a NotAllowedError (older implementations called this PermissionDeniedError). If no matching devices are connected, it is rejected with a NotFoundError.
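
Since the rejection reason is carried in the error's name property, an application can turn it into a message for the user. The helper below is a minimal sketch; describeMediaError and the message wording are hypothetical, while the error names are the standard DOMException names used by getUserMedia().

```javascript
// Sketch: map a getUserMedia() rejection to a user-facing message.
// The function name and message texts are illustrative assumptions.
function describeMediaError(error) {
    switch (error && error.name) {
        case 'NotAllowedError':
            return 'Permission to use the camera or microphone was denied.';
        case 'NotFoundError':
            return 'No camera or microphone matching the request was found.';
        case 'NotReadableError':
            return 'The device is already in use or could not be started.';
        case 'OverconstrainedError':
            return 'No device satisfies the requested constraints.';
        default:
            return 'Could not access media devices.';
    }
}

// Typical use inside a catch block:
//   catch (error) { showBanner(describeMediaError(error)); }   // showBanner is hypothetical
console.log(describeMediaError({ name: 'NotFoundError' }));
```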

Media Constraints

As the snippet above shows, media constraints must be passed when calling getUserMedia() to access the video and audio sources (camera and mic) available to the browser. The constraints object, which must implement the MediaStreamConstraints interface, allows us to open a media device that matches a certain requirement. This requirement can be very loosely defined (audio and/or video) or very specific (minimum camera resolution or an exact device ID). It is recommended that applications using the getUserMedia() API first check the existing devices and then specify a constraint that matches the exact device using the deviceId constraint. Devices will also, if possible, be configured according to the constraints. We can enable echo cancellation on microphones or set a specific or minimum width, height and frame rate for the video from the camera. Below is a brief example of using media constraints in a more advanced way.

// List connected devices of one kind: 'videoinput', 'audioinput' or 'audiooutput'.
async function findConnectedDevices(type) {
    const devices = await navigator.mediaDevices.enumerateDevices();
    return devices.filter(device => device.kind === type);
}

// Open a specific camera with minimum resolution and frame-rate requirements.
async function openCamera(cameraId, minWidth, minHeight) {
    const constraints = {
        'audio': {'echoCancellation': true},
        'video': {
            'deviceId': cameraId,
            'width': {'min': minWidth},
            'height': {'min': minHeight},
            'frameRate': {'min': 10},
        },
    };

    return await navigator.mediaDevices.getUserMedia(constraints);
}

// Both helpers return promises, so their results must be awaited.
(async () => {
    const cameras = await findConnectedDevices('videoinput');
    if (cameras.length > 0) {
        const stream = await openCamera(cameras[0].deviceId, 1280, 720);
    }
})();

The full documentation for the MediaStreamConstraints interface can be found on the MDN web docs.

Playing the video locally

Once a media device has been opened and we have a MediaStream available, we can assign it to a video or audio element to play the stream locally. The HTML needed for a typical video element used with getUserMedia() will usually have the attributes autoplay and playsinline. The autoplay attribute will cause new streams assigned to the element to play automatically. The playsinline attribute allows video to play inline, instead of only in full screen, on certain mobile browsers. For live streams, leave out the controls attribute unless the user should be able to pause them; controls is a boolean attribute, so writing controls="false" would still display the controls.

async function playVideoFromCamera() {
    try {
        const constraints = {'video': true, 'audio': true};
        const stream = await navigator.mediaDevices.getUserMedia(constraints);
        const videoElement = document.querySelector('video#localVideo');
        videoElement.srcObject = stream;
    } catch(error) {
        console.error('Error opening video camera.', error);
    }
}

<html>
<head><title>Local video playback</title></head>
<body>
    <video id="localVideo" autoplay playsinline></video>
</body>
</html>

This post briefly describes how to use the getUserMedia API, which is available in all modern browsers. To explore how it can be used with the advanced settings and configurations needed for production-grade video conferencing applications, refer to this blog post.

If you are planning to build a simple P2P (peer-to-peer) video conferencing application, check this blog post as well to understand another important API, RTCPeerConnection. It should be possible to build a P2P video conferencing app using these two APIs.

If you have any questions about getUserMedia or WebRTC as a whole, you can ask them in this dedicated WebRTC forum.

If you want to learn WebRTC and build a sound understanding of it, along with the technologies in its protocol stack such as ICE, STUN, TURN, DTLS, SRTP and SCTP, check out our live online/onsite instructor-led WebRTC training programs here. If you wish to register for one of our upcoming training programs, you can do so using the registration form link provided there.

Here is an example public GitHub repo for a simple P2P video conferencing app built using these two APIs, with Node.js as the signalling server. Feel free to download the example and play around to understand the basics of WebRTC.

If you want to build serious production-grade video conferencing applications, check out this open source GitHub repo for a production-grade WebRTC signalling server built with Node.js, which you can use to build and deploy your video calling app to any cloud. If this repo does not meet your needs, feel free to visit our services page to learn more about our services.

If you want a custom video application without going through the pain of building it, check out our products page to learn more about our scalable, customizable and fully managed video conferencing / live streaming application as a service, with custom branding.

Feel free to reach out to us at hello@centedge.io for any help and support you need with WebRTC.

WebRTC for Absolute beginners, an overview

With WebRTC, you can add real-time communication capabilities, built on top of an open standard, to your application. It supports sending video, voice, and generic data between peers, allowing developers to build powerful voice- and video-communication solutions. The technology is available on all modern browsers as well as on native clients for all major platforms. The technologies behind WebRTC are implemented as an open web standard and available as regular JavaScript APIs in all major browsers. For native clients, like Android and iOS applications, a library is available that provides the same functionality. The WebRTC project is an open source project and supported by Apple, Google, Microsoft and Mozilla, amongst others.

What can WebRTC really do?

There are many different use cases for WebRTC, from basic web apps that use the camera or microphone to more advanced video-calling applications and screen sharing. WebRTC can be used to build anything from a weekend side project, such as a simple one-to-one video chat app, to complex enterprise-grade video conferencing apps with security and other necessary features.

What exactly is WebRTC?

A WebRTC application usually goes through a common application flow. Put simply, it can be understood in 4 steps: accessing the media devices, opening peer connections, discovering peers, and starting to stream. WebRTC is a collection of APIs that facilitate these steps. Creating a new application based on WebRTC technologies can be overwhelming if one is unfamiliar with these APIs.

WebRTC APIs

The WebRTC standard covers, on a high level, two different technologies: media capture devices and peer-to-peer connectivity.

Media capture devices include video cameras and microphones, but also screen capturing “devices”. For cameras and microphones, we use navigator.mediaDevices.getUserMedia() to capture MediaStreams. For screen recording, we use navigator.mediaDevices.getDisplayMedia() instead.

The peer-to-peer connectivity is handled by the RTCPeerConnection interface. This is the central point for establishing and controlling the connection between two peers in WebRTC.

In the upcoming posts, the APIs will be elaborated with easy-to-follow examples. Stay tuned to learn WebRTC from scratch.

Share the camera or share the screen, do it all with these browser APIs

For the last 5 months, the demand for video conferencing has skyrocketed. The majority of the human population on our planet has been locked up at home, and all the work is getting done through video conferencing. The primary requirement for a video conference is to share the camera and microphone, with occasional screen sharing, so that each individual can be seen, heard and understood properly. The majority of these video conferences nowadays run directly from a browser without the need to install any external software or even a browser extension. Browsers these days have some magical powers to do all things related to cameras, microphones and screen sharing. In this post, we will explore the magical powers of the browser to share these things on demand and the open secret behind them.

The open secret

The much-awaited open secret is the browser API named navigator.mediaDevices. This API provides getUserMedia to acquire the camera and microphone on request, enumerateDevices to list all available devices, and getDisplayMedia to capture the screen, an application window or a browser tab. These are the most commonly used APIs in a typical video conferencing application.

Video conferencing applications can retrieve the current list of connected devices and also listen for changes, since many cameras and microphones connect through USB and can be connected and disconnected during the lifecycle of the application. Since the state of a media device can change at any time, it is recommended that applications register for device changes, by listening for the devicechange event on navigator.mediaDevices, in order to handle them properly.
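
Registering for device changes might look like the sketch below. watchDevices is a hypothetical helper; the mediaDevices object is passed in as a parameter (in a page you would pass navigator.mediaDevices) so that the function can also be exercised outside a browser.

```javascript
// Sketch: re-enumerate devices whenever one is plugged in or removed.
// ASSUMPTION: `mediaDevices` is navigator.mediaDevices or an object with the same methods.
function watchDevices(mediaDevices, onChange) {
    const handler = async () => {
        const devices = await mediaDevices.enumerateDevices();
        onChange(devices);
    };
    mediaDevices.addEventListener('devicechange', handler);
    // Return a function that stops watching, for cleanup when the call ends.
    return () => mediaDevices.removeEventListener('devicechange', handler);
}

// In a browser:
//   const stopWatching = watchDevices(navigator.mediaDevices, devices => {
//       console.log('Cameras:', devices.filter(d => d.kind === 'videoinput'));
//   });
```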

Media constraints

The next thing that needs discussion is media constraints, which define how one can access the camera, the microphone or the screen share while passing specific instructions to the browser.

Capture camera using getUserMedia

For example, if there are 3 cameras available to a browser, a constraint can instruct the browser to use a specific one of those 3 cameras for the video call.

The specific constraints are defined in a MediaTrackConstraints object, one for audio and one for video. The attributes in this object are of type ConstrainULong, ConstrainBoolean, ConstrainDouble or ConstrainDOMString. These can either be a specific value (e.g., a number, boolean or string), a range (ULongRange or DoubleRange with a minimum and maximum value) or an object with either an ideal or exact definition. For a specific value, the browser will attempt to pick something as close as possible. For a range, the best value in that range will be used. When exact is specified, only media streams that exactly match those constraints will be returned.

// Camera with a resolution as close to 640x480 as possible
{
    "video": {
        "width": 640,
        "height": 480
    }
}
// Camera with a resolution in the range 640x480 to 1024x768
{
    "video": {
        "width": {
            "min": 640,
            "max": 1024
        },
        "height": {
            "min": 480,
            "max": 768
        }
    }
}
// Camera with the exact resolution of 1024x768
{
    "video": {
        "width": {
            "exact": 1024
        },
        "height": {
            "exact": 768
        }
    }
}

To determine the actual configuration of a certain track of a media stream, we can call MediaStreamTrack.getSettings(), which returns the MediaTrackSettings currently applied.

It is also possible to update the constraints of a track from a media device we have opened, by calling applyConstraints() on the track. This lets an application re-configure a media device without first having to close the existing stream.
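
As a sketch of how getSettings() and applyConstraints() work together, the helper below lowers the resolution of an open video track only when it exceeds a limit. capResolution is a hypothetical name, and whether the browser can honour the new constraints depends on the device.

```javascript
// Sketch: cap a video track's resolution without reopening the device.
// `track` is assumed to be a MediaStreamTrack (or an object with the same two methods).
async function capResolution(track, maxWidth, maxHeight) {
    const current = track.getSettings();
    if (current.width <= maxWidth && current.height <= maxHeight) {
        return current; // already within limits, nothing to change
    }
    // Rejects with OverconstrainedError if the device cannot satisfy the request.
    await track.applyConstraints({
        width: { max: maxWidth },
        height: { max: maxHeight },
    });
    return track.getSettings(); // settings actually in effect after the change
}

// In a browser:
//   const [videoTrack] = stream.getVideoTracks();
//   const settings = await capResolution(videoTrack, 1280, 720);
```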

Capture screen using getDisplayMedia

An application that wants to be able to perform screen capturing and recording must use the Display Media API. The function getDisplayMedia() (which is part of navigator.mediaDevices) is similar to getUserMedia() and is used to open the content of the display (or a portion of it, such as a window). The returned MediaStream works the same as when using getUserMedia().

The constraints for getDisplayMedia() differ from the ones used for regular video or audio input.

{
    video: {
        cursor: 'always',           // or 'motion' or 'never'
        displaySurface: 'monitor'   // or 'browser' or 'window' (older drafts also listed 'application')
    }
}

The code snippet above shows how the special constraints for screen recording work. Note that these might not be supported by all browsers that have display media support.
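
A screen share is typically started from a user action such as a button click. The sketch below assumes a hypothetical startScreenShare helper that receives the mediaDevices object as a parameter (navigator.mediaDevices in a page) and a callback for when sharing stops.

```javascript
// Sketch: start a screen share and get notified when the user stops it.
// ASSUMPTION: `mediaDevices` is navigator.mediaDevices or a compatible object.
async function startScreenShare(mediaDevices, onEnded) {
    // The browser shows its own picker; the page cannot choose the surface silently.
    const stream = await mediaDevices.getDisplayMedia({ video: true, audio: false });
    const [videoTrack] = stream.getVideoTracks();
    // 'ended' fires when the user stops sharing through the browser's own UI.
    videoTrack.addEventListener('ended', onEnded);
    return stream;
}

// In a browser, inside a click handler:
//   const screenStream = await startScreenShare(navigator.mediaDevices, () => {
//       console.log('Screen sharing stopped');
//   });
```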

Tips and tricks

A MediaStream represents a stream of media content, which consists of tracks (MediaStreamTrack) of audio and video. You can retrieve all the tracks from MediaStream by calling MediaStream.getTracks(), which returns an array of MediaStreamTrack objects.

A MediaStreamTrack has a kind property that is either audio or video, indicating the kind of media it represents. Each track can be muted by toggling its enabled property. Earlier versions of the API also gave a track a Boolean remote property, indicating whether it is sourced by an RTCPeerConnection and comes from a remote peer; current browsers may no longer expose it.
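
Put together, a mute button and a hang-up button can be built from getTracks(), kind and enabled alone. The helper names below are hypothetical; the stream argument is assumed to be a MediaStream.

```javascript
// Sketch: mute or unmute every track of one kind ('audio' or 'video') in a stream.
function setTracksEnabled(stream, kind, enabled) {
    const tracks = stream.getTracks().filter(track => track.kind === kind);
    tracks.forEach(track => { track.enabled = enabled; });
    return tracks.length; // number of tracks affected
}

// Sketch: release the camera and microphone when the call ends.
function stopStream(stream) {
    stream.getTracks().forEach(track => track.stop());
}

// In a browser:
//   setTracksEnabled(stream, 'audio', false);  // mute the microphone
//   setTracksEnabled(stream, 'audio', true);   // unmute
//   stopStream(stream);                        // turns off the camera indicator
```

A disabled track keeps the device open and sends silence or black frames, whereas stop() releases the device entirely.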

WebRTC RTCPeerConnection: The secret behind connecting peers in the new video call app

WebRTC RTCPeerConnection is the API which deals with connecting two applications on different computers to communicate using a peer-to-peer protocol. The communication between peers can be video, audio or arbitrary binary data (for clients supporting the RTCDataChannel API). In order to discover how two peers can connect, both clients need to connect to a common signalling server and also provide an ICE server configuration. The ICE server can be either a STUN or a TURN server, and its role is to provide ICE candidates to each client, which are then transferred to the remote peer. This transferring of ICE candidates is commonly called signalling. All these new terms may sound alien at the beginning, but they are the secret behind successfully connecting a video call between two computers using only browsers.

Signalling is needed in order for two peers to share how they should connect. Usually this is solved through a regular HTTP-based Web API (i.e., a REST service) or another mechanism such as WebSocket, where web applications can relay the necessary information before the peer connection is initiated. Signalling can be implemented in many different ways, and the WebRTC specification doesn’t prefer any specific solution.

Peer connection initiation

Each peer connection is handled by an RTCPeerConnection object, created as shown in the code snippets below. The constructor for this class takes a single RTCConfiguration object as its parameter. This object defines how the peer connection is set up and should contain information about the ICE servers to use.

Once the RTCPeerConnection is created, we need to create an SDP offer or answer, depending on whether we are the calling peer or the receiving peer. Once created, the SDP offer or answer must be sent to the remote peer through a different channel. Passing SDP objects to the remote peer is part of signalling and is not covered by the WebRTC specification.

To initiate the peer connection setup from the calling side, we create an RTCPeerConnection object and then call createOffer() to create an RTCSessionDescription object. This session description is set as the local description using setLocalDescription() and is then sent over our signalling channel to the receiving side. We also set up a listener on our signalling channel for when an answer to our offered session description is received from the receiving side.

Simple signalling server

// Set up an asynchronous communication channel that will be
// used during the peer connection setup.
// Note: SignalingChannel is not a browser API; the application must supply it.
const signalingChannel = new SignalingChannel(remoteClientId);
signalingChannel.addEventListener('message', message => {
    // New message from remote client received
});

// Send an asynchronous message to the remote client
signalingChannel.send('Hello!');
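
SignalingChannel in the snippet above is something the application has to provide. One possible shape is sketched here, assuming the transport is a WebSocket-like object (anything with send() and an onmessage handler) and that messages are JSON-encoded. Unlike the snippet above, this sketch takes the transport rather than a remote client ID in its constructor; routing to a specific client is left to the server, and the URL in the usage comment is a placeholder.

```javascript
// Hypothetical SignalingChannel (an assumption of this sketch, not a WebRTC API).
// Wraps a WebSocket-like transport and hands parsed messages to listeners.
class SignalingChannel {
    constructor(transport) {
        this.transport = transport;
        this.listeners = [];
        // Decode each incoming frame and pass the parsed object to every listener
        this.transport.onmessage = event => {
            const message = JSON.parse(event.data);
            for (const listener of this.listeners) {
                listener(message);
            }
        };
    }

    // Only the 'message' event type is supported; other types are ignored
    addEventListener(type, listener) {
        if (type === 'message') {
            this.listeners.push(listener);
        }
    }

    // Serialize the message as JSON and hand it to the transport
    send(message) {
        this.transport.send(JSON.stringify(message));
    }
}

// Usage in a browser (placeholder URL):
// const signalingChannel = new SignalingChannel(new WebSocket('wss://signalling.example.com'));
```

Listeners receive the parsed message object directly, which matches how the later snippets read message.offer and message.answer.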

Initiating the call from browser A

async function makeCall() {
    const configuration = {'iceServers': [{'urls': 'stun:stun.l.google.com:19302'}]};
    const peerConnection = new RTCPeerConnection(configuration);
    signalingChannel.addEventListener('message', async message => {
        if (message.answer) {
            const remoteDesc = new RTCSessionDescription(message.answer);
            await peerConnection.setRemoteDescription(remoteDesc);
        }
    });
    const offer = await peerConnection.createOffer();
    await peerConnection.setLocalDescription(offer);
    signalingChannel.send({'offer': offer});
}

On the receiving side, we wait for an incoming offer before we create our RTCPeerConnection instance. Once that is done we set the received offer using setRemoteDescription(). Next, we call createAnswer() to create an answer to the received offer. This answer is set as the local description using setLocalDescription() and then sent to the calling side over our signalling server.

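// `configuration` is the same kind of RTCConfiguration object used on the calling side
const peerConnection = new RTCPeerConnection(configuration);
signalingChannel.addEventListener('message', async message => {
    if (message.offer) {
        // Wait for the remote offer to be applied before creating the answer
        await peerConnection.setRemoteDescription(new RTCSessionDescription(message.offer));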
        const answer = await peerConnection.createAnswer();
        await peerConnection.setLocalDescription(answer);
        signalingChannel.send({'answer': answer});
    }
});

Once the two peers have set both the local and remote session descriptions, they know the capabilities of their respective remote peer. This does not mean that the connection between the peers has been established. For that, we need to collect the ICE candidates at each peer and transfer them (over the signalling channel) to the other peer.

ICE Candidates

ICE stands for Interactive Connectivity Establishment. Before two peers can communicate using WebRTC, they need to exchange connectivity information. Since network conditions can vary depending on a number of factors, an external service is usually used for discovering the possible candidates for connecting to a peer. This service is called ICE and uses either a STUN or a TURN server. STUN stands for Session Traversal Utilities for NAT, and is usually used indirectly in most WebRTC applications.

TURN (Traversal Using Relays around NAT) is the more advanced solution that incorporates the STUN protocol, and most commercial WebRTC-based services use a TURN server for establishing connections between peers. The WebRTC API supports both STUN and TURN directly, gathered under the more complete term Interactive Connectivity Establishment. When creating a WebRTC connection, we usually provide one or several ICE servers in the configuration for the RTCPeerConnection object.
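
A configuration with both kinds of server might look like the sketch below. The helper name buildIceConfiguration, the TURN URL, and the credentials are all placeholders invented for illustration; a real deployment would use its own TURN service and short-lived credentials.

```javascript
// Build an RTCConfiguration with one STUN server and, optionally, one TURN server.
// buildIceConfiguration is a hypothetical helper, not part of the WebRTC API.
function buildIceConfiguration(turn) {
    const iceServers = [{ urls: 'stun:stun.l.google.com:19302' }];
    if (turn) {
        // TURN servers normally require a username and credential
        iceServers.push({
            urls: turn.url,
            username: turn.username,
            credential: turn.credential,
        });
    }
    return { iceServers };
}

const configuration = buildIceConfiguration({
    url: 'turn:turn.example.com:3478', // hypothetical TURN server
    username: 'demo-user',             // placeholder
    credential: 'demo-password',       // placeholder
});

// In a browser: const peerConnection = new RTCPeerConnection(configuration);
```

Listing STUN first and TURN second does not force an order of use; ICE tests the gathered candidates and a relayed path is generally used only when a direct one cannot be established.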

Trickle ICE

Trickle ICE is a technique used to reduce the call setup time between the two peers. Once an RTCPeerConnection object is created and a local description is set, the underlying framework uses the provided ICE servers to gather candidates for connectivity establishment. The icegatheringstatechange event on RTCPeerConnection signals which state the ICE gathering is in (new, gathering, or complete).

While it is possible for a peer to wait until the ICE gathering is complete, it is usually much more efficient to use this technique and transmit each ICE candidate to the remote peer as it is discovered. This significantly reduces the setup time for peer connectivity and allows a video call to start with less delay.

To gather ICE candidates, simply add a listener for the icecandidate event. The RTCPeerConnectionIceEvent emitted on that listener will contain a candidate property that represents a new candidate that should be sent to the remote peer using the signalling mechanism described above.
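
To observe the gathering states mentioned above (for example, to measure how long gathering takes), a listener along these lines can be used. The helper name watchIceGathering and the onState callback are assumptions of this sketch.

```javascript
// Report each ICE gathering state change ('new', 'gathering', 'complete').
// watchIceGathering and `onState` are hypothetical, not part of the WebRTC API.
function watchIceGathering(peerConnection, onState) {
    peerConnection.addEventListener('icegatheringstatechange', () => {
        // iceGatheringState is a standard RTCPeerConnection property
        onState(peerConnection.iceGatheringState);
    });
}

// Usage in a browser:
// watchIceGathering(peerConnection, state => console.log('ICE gathering:', state));
```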

// Listen for local ICE candidates on the local RTCPeerConnection
peerConnection.addEventListener('icecandidate', event => {
    if (event.candidate) {
        signalingChannel.send({'iceCandidate': event.candidate});
    }
});

// Listen for remote ICE candidates and add them to the local RTCPeerConnection
signalingChannel.addEventListener('message', async message => {
    if (message.iceCandidate) {
        try {
            await peerConnection.addIceCandidate(message.iceCandidate);
        } catch (e) {
            console.error('Error adding received ice candidate', e);
        }
    }
});

Once ICE candidates are being exchanged, we should expect the state of our peer connection to eventually change to connected. To detect this, we add a listener for connectionstatechange events on our RTCPeerConnection.

// Listen for connectionstatechange on the local RTCPeerConnection
peerConnection.addEventListener('connectionstatechange', event => {
    if (peerConnection.connectionState === 'connected') {
        // Peers connected!
    }
});
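
The connected state is only one of the values connectionState can take. A minimal sketch of handling the others follows; the action names returned here are invented for illustration, and what an application actually does in each state is its own decision.

```javascript
// Map an RTCPeerConnection.connectionState value to a coarse action.
// The action names are assumptions of this sketch, not part of WebRTC.
function actionForConnectionState(state) {
    switch (state) {
        case 'connected':
            return 'start-media';
        case 'disconnected':
            return 'wait';        // transient; the connection may recover on its own
        case 'failed':
            return 'restart-ice'; // e.g. call peerConnection.restartIce() and renegotiate
        case 'closed':
            return 'cleanup';
        default:
            return 'none';        // 'new' or 'connecting'
    }
}

// Usage in a browser:
// peerConnection.addEventListener('connectionstatechange', () => {
//     console.log(actionForConnectionState(peerConnection.connectionState));
// });
```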