WebRTC Architectures: From P2P to Scalable Video Conferencing

The Problem: How Do We Connect Multiple Video Streams?

Imagine you’re building a video conference app. Two people need to see and hear each other.

Easy: a direct connection. No media server needed.

Now add 10 more people. Now 100. Now 1000.

Hard: How do you route video streams between all of them efficiently?

This is where WebRTC architectures come in. Different designs solve this problem in different ways, each with trade-offs.

Part 1: The Simplest Solution — Peer-to-Peer (P2P)

Direct Connection Between Two Peers

In WebRTC, a peer is simply a user. When two peers communicate directly, their media flows between them without passing through a server. A signaling channel is still needed to set up the call, but it never carries audio or video.

User A ←→ User B
(WebRTC over UDP)

How it works:

User A and User B exchange an SDP (Session Description Protocol) offer and answer, plus ICE candidates, over a signaling channel
WebRTC establishes a direct connection, normally over UDP
Video/audio streams flow directly between them
No media server, so no added latency from routing through infrastructure
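The handshake above can be modelled without a browser. This is a toy sketch, not the WebRTC API: the `Peer` class and its methods are hypothetical names, and the SDP strings are placeholders. Only the state names (`stable`, `have-local-offer`, `have-remote-offer`) mirror the signaling states of a real `RTCPeerConnection`.

```python
class Peer:
    """Toy model of one side of the SDP offer/answer exchange (hypothetical class)."""

    def __init__(self, name):
        self.name = name
        self.state = "stable"            # nothing is being negotiated yet
        self.local_description = None
        self.remote_description = None

    def create_offer(self):
        # The caller describes the media it wants to send and receive.
        assert self.state == "stable"
        self.local_description = f"offer-from-{self.name}"   # placeholder, not real SDP
        self.state = "have-local-offer"
        return self.local_description

    def receive_offer(self, offer):
        # The callee stores the remote offer, then replies with its own answer.
        assert self.state == "stable"
        self.remote_description = offer
        self.state = "have-remote-offer"
        self.local_description = f"answer-from-{self.name}"  # placeholder, not real SDP
        self.state = "stable"
        return self.local_description

    def receive_answer(self, answer):
        # The caller stores the answer; both sides now agree on the session.
        assert self.state == "have-local-offer"
        self.remote_description = answer
        self.state = "stable"


a, b = Peer("A"), Peer("B")
offer = a.create_offer()           # in practice, sent to B over a signaling channel
answer = b.receive_offer(offer)    # in practice, sent back to A the same way
a.receive_answer(answer)
print(a.state, b.state)
```

Once both sides are back in `stable`, each holds the other's description, and the media connection can be established.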

Advantages:

Minimal latency (direct connection)
No server cost
Privacy (data doesn’t pass through third parties)
Simple to implement

Disadvantages:

Only works for 2 people
Not scalable
Requires NAT traversal (a STUN server to discover public addresses, and a TURN server to relay media when no direct path exists)

Note

WebRTC Basics: WebRTC sends media over UDP rather than TCP wherever it can, because video calls prioritize speed over reliability. A dropped frame is acceptable; waiting for a retransmission would increase latency. UDP is connectionless and has low overhead, which suits real-time communication.

Part 2: Adding More Peers — The Mesh Problem

What Happens When a 3rd Peer Joins?

Now User C wants to join the call. User C needs to connect with both User A and User B.

    User A
    /    \
   /      \
User B -- User C

Each peer sends its stream to every other peer. This is called mesh P2P architecture.

The Mesh Topology

In a mesh network:

Every peer connects to every other peer
Each peer sends its stream N-1 times (where N = total peers)
Each peer receives N-1 streams

4 peers = 6 connections:

User A ↔ User B
User A ↔ User C
User A ↔ User D
User B ↔ User C
User B ↔ User D
User C ↔ User D

Mathematical formula: For N peers, connections = N × (N-1) / 2

2 peers:  1 connection
3 peers:  3 connections
4 peers:  6 connections
5 peers:  10 connections
10 peers: 45 connections
20 peers: 190 connections
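The same numbers fall out of a few lines of Python. A minimal sketch; `mesh_connections` is just an illustrative name for the formula above.

```python
def mesh_connections(n: int) -> int:
    """Number of pairwise links in a full mesh of n peers: n * (n - 1) / 2."""
    if n < 0:
        raise ValueError("peer count cannot be negative")
    return n * (n - 1) // 2


for n in (2, 3, 4, 5, 10, 20):
    print(f"{n} peers: {mesh_connections(n)} connections")
```

The count grows quadratically: doubling the group from 10 to 20 peers roughly quadruples the number of connections.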

Why Mesh Fails

1. Bandwidth explosion:

Each peer must upload its stream to N-1 peers. If each stream is 2 Mbps:

10 peers: Each peer uploads 2 × 9 = 18 Mbps
20 peers: Each peer uploads 2 × 19 = 38 Mbps

Your home internet might not support this.
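To see where a given connection runs out, the arithmetic can be turned around. A rough sketch: the function names are illustrative, the 2 Mbps stream comes from the example above, and the 10 Mbps uplink is an assumed figure, not a measurement.

```python
def mesh_upload_mbps(n_peers: int, stream_mbps: float = 2.0) -> float:
    """Upload needed by one peer in a mesh: one copy of its stream per other peer."""
    return stream_mbps * (n_peers - 1)


def max_mesh_size(uplink_mbps: float, stream_mbps: float = 2.0) -> int:
    """Largest mesh (including yourself) that a given uplink can feed."""
    return int(uplink_mbps // stream_mbps) + 1


for n in (10, 20):
    print(f"{n} peers: each uploads {mesh_upload_mbps(n)} Mbps")

# Assumed 10 Mbps uplink, purely for illustration.
print("max mesh size on a 10 Mbps uplink:", max_mesh_size(10.0))
```

This ignores audio, protocol overhead, and the download side, so real limits are tighter than the result suggests.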

2. CPU intensive:

A browser typically encodes a separate copy of the stream for each connection, and it also has to decode N-1 incoming streams. That is CPU-heavy, and your laptop's processor maxes out.

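3. Fragile:

Every peer maintains N-1 separate connections, and each one can fail or degrade on its own. A single peer on a bad network means N-1 struggling links that have to be detected and re-established.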

4. Terrible for mobile:

Battery drain from continuous upload. Data usage is astronomical.

Solution: We need a server to centralize connections.

Note

Mesh is only practical for very small groups (2-4 people). Anything larger needs a central server. This is why pure P2P works best for 1-on-1 calls, while larger group calls need infrastructure.

Part 3: Introducing a Server — MCU (Multipoint Control Unit)

Centralized Mixing

Instead of mesh, introduce a server in the middle:

User A \
User B  → MCU Server → [Mixed Stream] → All Users
User C /

The MCU (Multipoint Control Unit) is a server that:

Receives streams from all peers
Mixes them into a single video (like a video grid)
Broadcasts the mixed stream to all peers

How MCU Works

Imagine a grid of 4 video tiles:

[User A] [User B]
[User C] [User D]

The MCU:

Receives the encoded streams from User A, B, C, D and decodes them
Resizes and arranges them into a grid
Composes them into a single video output
Encodes that output and sends the same video to all users
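The "arrange into a grid" step is plain geometry. Here is a minimal sketch that computes only the tile rectangles; decoding, scaling, and encoding are left out. The `grid_layout` name and the 1280×720 output canvas are assumptions for illustration.

```python
import math


def grid_layout(n_tiles: int, canvas_w: int = 1280, canvas_h: int = 720):
    """Return one (x, y, width, height) rectangle per tile in a near-square grid."""
    if n_tiles <= 0:
        return []
    cols = math.ceil(math.sqrt(n_tiles))   # as close to square as possible
    rows = math.ceil(n_tiles / cols)
    tile_w, tile_h = canvas_w // cols, canvas_h // rows
    return [
        ((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
        for i in range(n_tiles)
    ]


for user, rect in zip("ABCD", grid_layout(4)):
    print(f"User {user}: {rect}")
```

With four users, each tile is a quarter of the canvas. An MCU would scale every decoded frame into its rectangle before encoding the combined picture.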

Advantages of MCU

Bandwidth efficient (each peer uploads once, downloads once)
Client load stays flat as the group grows (the server handles the mixing)
Simple for clients (receive one stream, display it)

Disadvantages of MCU

CPU intensive: Real-time video mixing/compositing is expensive
Higher latency: Decoding, composing, and re-encoding every frame adds delay, and the lag can be noticeable
Expensive to run: Server costs scale with processing power
Inflexible: Fixed layout. Everyone receives the same composed picture, so you can't show different content to different users
Quality loss: Re-encoding is lossy, and compositing often reduces resolution to save CPU

Real example: Older WebEx systems used an MCU, with noticeable lag and visual artifacts.

Note

The Compositing Problem: Combining multiple video streams in real-time is non-trivial:

Resize each stream to fit the grid
Blend them together
Re-encode the output
All in under 33ms (for 30fps video)

This is why MCU requires powerful servers.
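The per-frame deadline is simple arithmetic. A small sketch: the function names are illustrative, and the stage timings are made-up numbers, not measurements of any real MCU.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to produce one output frame, in milliseconds."""
    return 1000.0 / fps


def fits_budget(stage_ms: dict, fps: float) -> bool:
    """True if the per-frame stages together finish within one frame interval."""
    return sum(stage_ms.values()) <= frame_budget_ms(fps)


# Made-up stage timings for illustration only.
stages = {"resize": 6.0, "blend": 4.0, "encode": 18.0}

print(f"budget at 30 fps: {frame_budget_ms(30):.1f} ms")
print("fits at 30 fps:", fits_budget(stages, 30))
print("fits at 60 fps:", fits_budget(stages, 60))
```

The same pipeline that fits comfortably at 30fps misses the deadline at 60fps, because doubling the frame rate halves the budget.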

Part 4: The Better Approach — SFU (Selective Forwarding Unit)

No Mixing, Just Forwarding

SFU is an evolution of MCU that solves the mixing problem by… not mixing at all.

User A → \
User B → → SFU Server → [Stream A, Stream B, Stream C] → All Users
User C → /

The SFU (Selective Forwarding Unit):

Receives streams from all peers (like MCU)
Does NOT mix them (unlike MCU)
Forwards the streams, still encoded, to the other peers
Lets the client decide how to render

How SFU Works

Instead of sending a single mixed video:

SFU → User A: [Stream B, Stream C, Stream D]
SFU → User B: [Stream A, Stream C, Stream D]
SFU → User C: [Stream A, Stream B, Stream D]
SFU → User D: [Stream A, Stream B, Stream C]

Each peer receives a separate, still-encoded video stream from every other peer and arranges them client-side.
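The fan-out rule is "everyone's stream goes to everyone else". A minimal sketch of that rule, with `forwarding_table` as an illustrative name:

```python
def forwarding_table(peers):
    """Map each receiver to the streams an SFU forwards to it (all but its own)."""
    return {
        receiver: [f"Stream {sender}" for sender in peers if sender != receiver]
        for receiver in peers
    }


table = forwarding_table(["A", "B", "C", "D"])
for receiver, streams in table.items():
    print(f"SFU → User {receiver}: {streams}")
```

The server still sends N × (N-1) streams in total, but it only copies packets; it never decodes or re-encodes them.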

The client decides:

How to layout the grid
Which video to highlight
Which streams to mute
Which stream to display fullscreen

Advantages of SFU

1. Low latency: No mixing = minimal processing. Just forward streams. Latency is much lower than MCU.

2. Flexible rendering: Each user sees what they choose. User A can fullscreen User B while User C sees everyone in grid mode.

User A's screen: [User B fullscreen]
User C's screen: [Grid: A, B, C, D]
Different clients, different layouts

3. Easy to mute: Want to stop seeing User B? The client asks the SFU to stop forwarding that stream.

4. Scalable: Server doesn’t do heavy processing. It’s basically a smart router. More CPU = more users.

5. Efficient upload: Each peer uploads once, no matter how large the call. Downloads still grow with group size (unlike MCU), but the client can subscribe only to the streams it needs.
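The "selective" part of SFU can be sketched as a subscription table. This is a toy model under stated assumptions: the `Sfu` class and its methods are hypothetical and do not correspond to any real SFU's API.

```python
class Sfu:
    """Toy SFU that tracks which senders each receiver wants (hypothetical API)."""

    def __init__(self, peers):
        self.peers = set(peers)
        # By default, every receiver is subscribed to every other peer.
        self.subscriptions = {p: self.peers - {p} for p in self.peers}

    def unsubscribe(self, receiver, sender):
        """Stop forwarding `sender`'s stream to `receiver`."""
        self.subscriptions[receiver].discard(sender)

    def route(self, sender, packet):
        """Return (receiver, packet) pairs; the packet is forwarded untouched."""
        return [
            (receiver, packet)
            for receiver in sorted(self.peers)
            if sender in self.subscriptions[receiver]
        ]


sfu = Sfu("ABCD")
sfu.unsubscribe("A", "B")          # User A no longer wants User B's video
print(sfu.route("B", b"frame"))    # B's packet now goes to C and D only
```

Because the decision lives in a lookup table rather than in a video mixer, changing what one user sees costs the server almost nothing.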

Disadvantages of SFU

More bandwidth for clients: Each client receives multiple streams (one per peer). Download = N-1 streams
Client-side processing: Decoding and rendering multiple streams uses CPU on your device
Network congestion: If a peer has poor upload, everyone else receives a poor-quality stream from that peer

Trade-off: SFU moves work from the server to the clients. A server is more powerful than any single client, but spreading the decoding and layout across clients distributes the load.

Part 5: Comparing the Three Architectures

Aspect	Mesh P2P	MCU	SFU
Connections per peer	N-1	1	1
Bandwidth per peer	High	Low	Medium-High
Server processing	None	Very High	Low
Latency	Lowest	Highest	Low
Scalability	Poor	Medium	High
Flexibility	High	Low	High
Cost	Low	High	Medium
Best for	1-1 calls	Small groups with fixed layout	Large groups
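To put numbers on the bandwidth row, here is a rough per-peer estimate. It assumes every stream has the same bitrate (2 Mbps, as in the mesh example), every peer receives every other peer, and there is no bitrate adaptation. `per_peer_streams` is an illustrative name.

```python
def per_peer_streams(architecture: str, n_peers: int):
    """Return (uploads, downloads) counted in streams for one peer."""
    others = n_peers - 1
    if architecture == "mesh":
        return (others, others)   # send to and receive from every other peer
    if architecture == "mcu":
        return (1, 1)             # one stream up, one mixed stream down
    if architecture == "sfu":
        return (1, others)        # one stream up, one per other peer down
    raise ValueError(f"unknown architecture: {architecture}")


STREAM_MBPS = 2.0  # assumed bitrate per stream

for arch in ("mesh", "mcu", "sfu"):
    up, down = per_peer_streams(arch, 10)
    print(f"{arch}: upload {up * STREAM_MBPS} Mbps, download {down * STREAM_MBPS} Mbps")
```

For a 10-person call, SFU matches MCU on upload and matches mesh on download, which is where the "Medium-High" rating comes from.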

Part 6: Real-World Examples

Zoom

Zoom is generally described as using an SFU-style architecture.

Why?

Handles 1000+ participants (scalable)
Low latency (you can see people react in real-time)
Flexible UI (you choose gallery view, spotlight, fullscreen)
Server is a router, not a processor

When you’re in a 100-person Zoom call, the server isn’t composing a 100-person grid. It forwards individual video streams, typically only the ones your client is currently displaying.

Google Meet

Google Meet uses SFU with some hybrid elements.

Large meetings: SFU (each client receives multiple streams)
Display: the layout is composed client-side rather than mixed by an MCU

Teams/Skype

Microsoft uses SFU for most calls, with MCU for specific scenarios (like recording the entire meeting as one video).

Discord/Telegram

Small group calls: SFU (Discord supports up to 25 people on video)

Why not larger? Because receiving 25 streams stresses home internet.

Part 7: When Do You Use Each?

Use Mesh P2P When:

Only 2 people (1-on-1 calls)
Privacy is critical (no server involvement)
Low latency is essential
You have good internet

Example: Secure peer-to-peer chat apps, some P2P game networking.

Use MCU When:

Small group (3-10 people)
Fixed layout is acceptable
You want to save client bandwidth
Recording the entire call as one video

Example: Traditional conference systems, old WebEx/Polycom.

Use SFU When:

Large groups (10+ people)
Flexible UI is important
Low latency is critical
Scalability matters

Example: Zoom, Google Meet, modern conferencing platforms.
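The rules of thumb in this part can be condensed into a small helper. This is only a sketch of the guidance above, not a definitive selection algorithm: the function name, its parameters, and the exact cut-offs (2 and 10 participants) are simplifications.

```python
def suggest_architecture(
    participants: int,
    needs_flexible_layout: bool = True,
    single_recording: bool = False,
) -> str:
    """Pick an architecture using this article's rules of thumb."""
    if participants <= 2:
        return "mesh"   # 1-on-1: direct connection, lowest latency
    if single_recording:
        return "mcu"    # the whole call is needed as one composed video
    if participants <= 10 and not needs_flexible_layout:
        return "mcu"    # small group, fixed layout, light on clients
    return "sfu"        # everything else: larger groups or flexible UI


print(suggest_architecture(2))
print(suggest_architecture(6, needs_flexible_layout=False))
print(suggest_architecture(50))
```

Real products mix approaches, so treat the result as a starting point rather than a verdict.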

Part 8: The Client Rendering Problem

In SFU, clients receive multiple streams. Now the client must:

Decode each stream (video codec, like VP8 or H.264)
Resize each stream to fit the grid
Compose them into a canvas
Render at 30fps

This is CPU-intensive. Here’s why Zoom on a laptop can get hot:

10 participants × 30fps × decoding overhead = High CPU usage

Optimization: Clients often:

Reduce resolution of non-focal streams (you don’t need 4K for a small tile)
Pause decoding of streams you’re not looking at
Use hardware acceleration (GPU) for video decoding
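A back-of-the-envelope estimate shows how much the first optimization helps. The sketch uses decoded pixels per second as a rough proxy for decoding cost, which is a simplifying assumption. The resolutions (1280×720 for the focal tile, 320×180 for the small ones) are illustrative choices.

```python
def pixel_rate(width: int, height: int, fps: int = 30) -> int:
    """Pixels decoded per second for one stream."""
    return width * height * fps


def decode_load(tiles, fps: int = 30) -> int:
    """Total pixels per second across all received streams."""
    return sum(pixel_rate(w, h, fps) for w, h in tiles)


# 10 participants means 9 incoming streams.
full = decode_load([(1280, 720)] * 9)                       # every tile at 720p
reduced = decode_load([(1280, 720)] + [(320, 180)] * 8)     # one focal tile, 8 thumbnails

print(f"all 720p:        {full:,} pixels/s")
print(f"focal + thumbs:  {reduced:,} pixels/s")
print(f"reduced load is {reduced / full:.0%} of the original")
```

Actual CPU cost depends on the codec and on hardware decoding, but the pixel count gives a feel for why small tiles should not arrive at full resolution.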

Part 9: Conclusion: Architecture Affects Everything

The architecture you choose isn’t just technical. It affects:

Latency: How fast do users see reactions?
Scalability: How many people can join?
Cost: How much does the server cost?
Quality: What resolution and framerate?
User Experience: Can I mute someone? Go fullscreen? Control my view?

The trend: Modern platforms use SFU because:

Scales better than MCU
Lower latency than MCU
Better UX than P2P
Cost-effective
