WebRTC Architectures: From P2P to Scalable Video Conferencing


January 9, 2026
9 min read

The Problem: How Do We Connect Multiple Video Streams?

Imagine you’re building a video conference app. Two people need to see and hear each other.

Easy: Direct connection. No server needed.

Now add 10 more people. Now 100. Now 1000.

Hard: How do you route video streams between all of them efficiently?

This is where WebRTC architectures come in. Different designs solve this problem in different ways, each with trade-offs.


Part 1: The Simplest Solution — Peer-to-Peer (P2P)

Direct Connection Between Two Peers

In WebRTC, a peer is simply a user. When two peers communicate directly, they establish a connection without any server in between.

User A ←→ User B
(WebRTC over UDP)

How it works:

  1. User A and User B exchange an SDP (Session Description Protocol) offer and answer through a signaling channel
  2. WebRTC establishes a direct UDP connection (using ICE for NAT traversal)
  3. Video/audio streams flow directly between them
  4. No media server in the path, so no added routing latency
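The offer/answer handshake in steps 1–2 can be simulated with plain objects. This is a minimal sketch: `SignalingChannel` and `Peer` are hypothetical names, and a real app would exchange SDP through an actual signaling server (e.g. a WebSocket backend) and the browser's RTCPeerConnection API.

```javascript
// Simulated signaling: an in-memory channel delivers messages to all
// registered peers; each peer ignores messages not addressed to it.
class SignalingChannel {
  constructor() { this.handlers = []; }
  send(msg) { this.handlers.forEach((h) => h(msg)); }
  onMessage(h) { this.handlers.push(h); }
}

class Peer {
  constructor(name, channel) {
    this.name = name;
    this.channel = channel;
    this.remoteDescription = null;
    channel.onMessage((msg) => {
      if (msg.to !== this.name) return;
      this.remoteDescription = msg.sdp;
      if (msg.type === "offer") {
        // Reply to the offer with an answer, completing the handshake.
        channel.send({ type: "answer", to: msg.from, from: this.name,
                       sdp: `answer-from-${this.name}` });
      }
    });
  }
  call(remoteName) {
    this.channel.send({ type: "offer", to: remoteName, from: this.name,
                        sdp: `offer-from-${this.name}` });
  }
}

const channel = new SignalingChannel();
const alice = new Peer("A", channel);
const bob = new Peer("B", channel);
alice.call("B"); // A sends an offer; B replies with an answer
```

Once both sides hold each other's description, real peers would run ICE and media would start flowing directly over UDP.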

Advantages:

  • Minimal latency (direct connection)
  • No server cost
  • Privacy (data doesn’t pass through third parties)
  • Simple to implement

Disadvantages:

  • Only works for 2 people
  • Not scalable
  • Requires NAT traversal (a STUN server for address discovery, with a TURN relay as a fallback when no direct path can be established)
Note

WebRTC Basics: WebRTC uses UDP instead of TCP because video calls prioritize speed over reliability. A dropped frame is acceptable; retransmitting it would increase latency. UDP is connectionless, fast, and perfect for real-time communication.


Part 2: Adding More Peers — The Mesh Problem

What Happens When a 3rd Peer Joins?

Now User C wants to join the call. User C needs to connect with both User A and User B.

        User A
       /      \
User B -------- User C

Each peer sends its stream to every other peer. This is called mesh P2P architecture.

The Mesh Topology

In a mesh network:

  • Every peer connects to every other peer
  • Each peer sends its stream N-1 times (where N = total peers)
  • Each peer receives N-1 streams

4 peers = 6 connections:

User A ↔ User B
User A ↔ User C
User A ↔ User D
User B ↔ User C
User B ↔ User D
User C ↔ User D

Mathematical formula: For N peers, connections = N × (N-1) / 2

2 peers: 1 connection
3 peers: 3 connections
4 peers: 6 connections
5 peers: 10 connections
10 peers: 45 connections
20 peers: 190 connections
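The table above follows directly from the formula and can be checked with a few lines of JavaScript:

```javascript
// Connection count for a full mesh of n peers: every pair of peers
// needs exactly one link, so connections = n * (n - 1) / 2.
function meshConnections(n) {
  return (n * (n - 1)) / 2;
}

for (const n of [2, 3, 4, 5, 10, 20]) {
  console.log(`${n} peers: ${meshConnections(n)} connections`);
}
```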

Why Mesh Fails

1. Bandwidth explosion:

Each peer must upload its stream to N-1 peers. If each stream is 2 Mbps:

10 peers: Each peer uploads 2 × 9 = 18 Mbps
20 peers: Each peer uploads 2 × 19 = 38 Mbps

Your home internet might not support this.
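The upload math is simple enough to sketch; the 2 Mbps per-stream bitrate is an illustrative assumption, not a WebRTC constant:

```javascript
// Per-peer upload in a mesh: each peer sends its own stream once to
// every other peer, so upload = streamMbps * (peers - 1).
function meshUploadMbps(peers, streamMbps = 2) {
  return streamMbps * (peers - 1);
}

console.log(meshUploadMbps(10)); // 10 peers: 18 Mbps up
console.log(meshUploadMbps(20)); // 20 peers: 38 Mbps up
```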

2. CPU intensive:

Encoding multiple streams simultaneously is CPU-heavy. Your laptop’s processor maxes out.

3. Not fault-tolerant:

If one peer’s connection drops, every other peer must detect the failure and renegotiate on its own; there is no central point to coordinate recovery.

4. Terrible for mobile:

Battery drain from continuous upload. Data usage is astronomical.

Solution: We need a server to centralize connections.

Note

Mesh is only practical for very small groups (2-4 people). Anything larger needs a central server. This is why P2P video calls are limited to 1-on-1; group calls need infrastructure.


Part 3: Introducing a Server — MCU (Multipoint Control Unit)

Centralized Mixing

Instead of mesh, introduce a server in the middle:

User A \
User B → MCU Server → [Mixed Stream] → All Users
User C /

The MCU (Multipoint Control Unit) is a server that:

  1. Receives streams from all peers
  2. Mixes them into a single video (like a video grid)
  3. Broadcasts the mixed stream to all peers

How MCU Works

Imagine a grid of 4 video tiles:

[User A] [User B]
[User C] [User D]

The MCU:

  1. Receives raw streams from User A, B, C, D
  2. Resizes and arranges them into a grid
  3. Composes them into a single video output
  4. Encodes and sends the same video to all users
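The layout step (2) can be sketched as follows. This is illustrative only: a real MCU also decodes, scales, blends, and re-encodes the actual pixels, which is where the CPU cost lives.

```javascript
// Choose a near-square grid for n tiles and compute each tile's
// position and size within the composed output frame.
function gridLayout(n, width, height) {
  const cols = Math.ceil(Math.sqrt(n));
  const rows = Math.ceil(n / cols);
  const tileW = Math.floor(width / cols);
  const tileH = Math.floor(height / rows);
  const tiles = [];
  for (let i = 0; i < n; i++) {
    tiles.push({
      x: (i % cols) * tileW,          // column position
      y: Math.floor(i / cols) * tileH, // row position
      w: tileW,
      h: tileH,
    });
  }
  return { cols, rows, tiles };
}

const layout = gridLayout(4, 1280, 720); // 2x2 grid of 640x360 tiles
```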

Advantages of MCU

  • Bandwidth efficient (each peer uploads once, downloads once)
  • Scalable (server handles mixing)
  • Simple for clients (receive one stream, display it)

Disadvantages of MCU

  • CPU intensive: Real-time video mixing/compositing is expensive
  • High latency: Processing multiple streams, composing, encoding = delays. Lag is noticeable
  • Expensive to run: Server costs scale with processing power
  • Inflexible: Fixed layout. If one user goes fullscreen, everyone sees the same stream. You can’t show different content to different users
  • Quality loss: Compositing often reduces quality to save CPU

Real example: Old WebEx systems used MCU. Noticeable lag and visual artifacts.

Note

The Compositing Problem: Combining multiple video streams in real-time is non-trivial:

  • Resize each stream to fit the grid
  • Blend them together
  • Re-encode the output
  • All in under 33ms (for 30fps video)

This is why MCU requires powerful servers.


Part 4: The Better Approach — SFU (Selective Forwarding Unit)

No Mixing, Just Forwarding

SFU is an evolution of MCU that solves the mixing problem by… not mixing at all.

User A → \
User B → → SFU Server → [Stream A, Stream B, Stream C, Stream D] → All Users
User C → /

The SFU (Selective Forwarding Unit):

  1. Receives encoded streams from all peers (like MCU)
  2. Does NOT decode or mix them (unlike MCU)
  3. Forwards the still-encoded streams to the other peers
  4. Lets each client decode and decide how to render

How SFU Works

Instead of sending a single mixed video:

SFU → User A: [Stream B, Stream C, Stream D]
SFU → User B: [Stream A, Stream C, Stream D]
SFU → User C: [Stream A, Stream B, Stream D]
SFU → User D: [Stream A, Stream B, Stream C]

Each peer receives the encoded stream of every other peer, decodes them locally, and arranges them client-side.
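The SFU's core routing rule shown above can be sketched in a few lines (a simulation of the routing table only; a real SFU forwards RTP packets and never touches the routing as plain arrays):

```javascript
// Forward every incoming stream to every subscriber except its own
// sender — no decoding, no mixing, just selective forwarding.
function sfuForward(peers) {
  const routes = {};
  for (const receiver of peers) {
    routes[receiver] = peers.filter((sender) => sender !== receiver);
  }
  return routes;
}

const routes = sfuForward(["A", "B", "C", "D"]);
// routes.A → ["B", "C", "D"], routes.B → ["A", "C", "D"], ...
```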

The client decides:

  • How to lay out the grid
  • Which video to highlight
  • Which streams to mute
  • Which stream to display fullscreen

Advantages of SFU

1. Low latency: No mixing = minimal processing. Just forward streams. Latency is much lower than MCU.

2. Flexible rendering: Each user sees what they choose. User A can fullscreen User B while User C sees everyone in grid mode.

User A's screen: [User B fullscreen]
User C's screen: [Grid: A, B, C, D]
Different clients, different layouts

3. Easy to mute: Want to stop seeing User B? Client simply stops downloading their stream.

4. Scalable: Server doesn’t do heavy processing. It’s basically a smart router. More CPU = more users.

5. Efficient bandwidth: Each peer uploads once. Downloads scale with group size, but that’s the same for all architectures. Benefit: only download what you need.

Disadvantages of SFU

  • More bandwidth for clients: Each client receives multiple streams (one per peer). Download = N-1 streams
  • Client-side processing: Decoding and rendering multiple streams uses CPU on your device
  • Network congestion: If a peer has poor upload, it affects everyone (they receive a poor-quality stream)

Trade-off: SFU moves work from server to clients. Servers are more powerful, but clients distribute the load.
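That trade-off shows up directly in the per-peer bandwidth numbers. The sketch below compares the three architectures (the 2 Mbps per-stream bitrate is an illustrative assumption):

```javascript
// Per-peer bandwidth in Mbps under each architecture: mesh burns
// upload, MCU minimizes both directions, SFU shifts cost to downloads.
function perPeerBandwidth(arch, peers, streamMbps = 2) {
  switch (arch) {
    case "mesh": // send to and receive from every other peer
      return { up: streamMbps * (peers - 1), down: streamMbps * (peers - 1) };
    case "mcu":  // one stream up, one mixed stream down
      return { up: streamMbps, down: streamMbps };
    case "sfu":  // one stream up, one stream down per other peer
      return { up: streamMbps, down: streamMbps * (peers - 1) };
  }
}

console.log(perPeerBandwidth("sfu", 10)); // { up: 2, down: 18 }
```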


Part 5: Comparing the Three Architectures

Aspect               | Mesh P2P  | MCU       | SFU
---------------------|-----------|-----------|------------
Connections per peer | N-1       | 1         | 1
Bandwidth per peer   | High      | Low       | Medium-High
Server processing    | None      | Very High | Low
Latency              | Lowest    | Highest   | Low
Scalability          | Poor      | Medium    | High
Flexibility          | High      | Low       | High
Cost                 | Low       | High      | Medium
Best for             | 1-1 calls | Small groups, fixed layout | Large groups

Part 6: Real-World Examples

Zoom

Zoom primarily uses SFU architecture.

Why?

  • Handles 1000+ participants (scalable)
  • Low latency (you can see people react in real-time)
  • Flexible UI (you choose gallery view, spotlight, fullscreen)
  • Server is a router, not a processor

When you’re in a 100-person Zoom call, the server isn’t composing a 100-person grid. It’s forwarding 100 separate video streams.

Google Meet

Google Meet uses SFU with some hybrid elements.

  • Large meetings: SFU (each participant receives multiple streams)
  • Layout is composed client-side; there is no MCU-style server mixing

Teams/Skype

Microsoft uses SFU for most calls, with MCU for specific scenarios (like recording entire meeting as one video).

Discord/Telegram

Small group calls: SFU (Discord supports up to 25 people on video)

Why not larger? Because receiving 25 streams stresses home internet.


Part 7: When Do You Use Each?

Use Mesh P2P When:

  • Only 2 people (1-on-1 calls)
  • Privacy is critical (no server involvement)
  • Low latency is essential
  • You have good internet

Example: Secure peer-to-peer chat apps, some P2P game networking.

Use MCU When:

  • Small group (3-10 people)
  • Fixed layout is acceptable
  • You want to save client bandwidth
  • Recording the entire call as one video

Example: Traditional conference systems, old WebEx/Polycom.

Use SFU When:

  • Large groups (10+ people)
  • Flexible UI is important
  • Low latency is critical
  • Scalability matters

Example: Zoom, Google Meet, modern conferencing platforms.


Part 8: The Client Rendering Problem

In SFU, clients receive multiple streams. Now the client must:

  1. Decode each stream (video codec, like VP8 or H.264)
  2. Resize each stream to fit the grid
  3. Compose them into a canvas
  4. Render at 30fps

This is CPU-intensive. Here’s why Zoom on a laptop can get hot:

10 participants × 30fps × decoding overhead = High CPU usage

Optimization: Clients often:

  • Reduce resolution of non-focal streams (you don’t need 4K for a small tile)
  • Pause decoding of streams you’re not looking at
  • Use hardware acceleration (GPU) for video decoding
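The first optimization is the idea behind simulcast: the sender publishes the same video at several quality layers, and each receiver subscribes to the smallest layer that still covers its on-screen tile. A minimal sketch of that selection logic (the layer heights here are illustrative, not a standard):

```javascript
// Hypothetical simulcast layers, smallest first.
const LAYERS = [
  { name: "low", height: 180 },
  { name: "medium", height: 360 },
  { name: "high", height: 720 },
];

// Pick the first layer tall enough for the tile; if even the highest
// layer is smaller than the tile, fall back to the highest.
function pickLayer(tileHeightPx) {
  return LAYERS.find((l) => l.height >= tileHeightPx)
    ?? LAYERS[LAYERS.length - 1];
}

console.log(pickLayer(150).name); // → "low"  (small grid tile)
console.log(pickLayer(700).name); // → "high" (fullscreen speaker)
```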

Part 9: Conclusion: Architecture Affects Everything

The architecture you choose isn’t just technical. It affects:

  • Latency: How fast do users see reactions?
  • Scalability: How many people can join?
  • Cost: How much does the server cost?
  • Quality: What resolution and framerate?
  • User Experience: Can I mute someone? Go fullscreen? Control my view?

The trend: Modern platforms use SFU because:

  • Scales better than MCU
  • Lower latency than MCU
  • Better UX than P2P
  • Cost-effective
