Building a Twilio Video Backend That Actually Works in Production

Twilio Video makes the first demo easy. Making it reliable, multi-tenant, and observable in production is a different problem entirely — and that is the one worth writing about.

The Actual Problem With Video Infrastructure

Integrating Twilio Programmable Video at the API level takes an afternoon. You create a video room, generate a JWT access token with the appropriate video grant, hand it to the client, and the demo works. What does not work yet is everything around that: webhook delivery failures, composition jobs that stall silently, participant state that drifts out of sync with what Twilio knows, and multi-tenant token isolation that has to be correct every single time.

The PawSquad and VetCloud telemedicine platforms needed a production-grade video backend that could support veterinary consultations at scale — group rooms, optional session recording, and real-time participant state visible to coordinators. The failure modes that matter in a telehealth context are not theoretical. A dropped participant event means a coordinator does not know the room is still live. A failed composition means a recorded consultation is gone.

The architecture decision made early was to treat Twilio as an external dependency to be isolated, not a foundation to build on. That distinction shapes everything that follows.

Token Generation, Room Lifecycle, and Local State

Room creation goes through a thin service layer that configures the room type, participant limits, and recording preferences before calling Twilio's room API. Every room, session, and participant record is immediately written to a local database via Entity Framework Core. The reason is not that Twilio's API is unreliable; it is that answering every state question from Twilio would couple the application's availability to Twilio's, and that coupling becomes a larger risk as more features depend on it. The local record is the source of truth for the application; Twilio is the transport.
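A minimal sketch of that write-through step, in Python rather than the platform's ASP.NET Core stack, with sqlite3 standing in for Entity Framework Core. The table layout, the field names, and the `create_remote` callable (a stand-in for the call to Twilio's room API) are all hypothetical.

```python
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Callable, Optional

# Hypothetical local schema; the real platform uses EF Core entities.
SCHEMA = """
CREATE TABLE IF NOT EXISTS rooms (
    id TEXT PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    provider_sid TEXT NOT NULL,
    room_type TEXT NOT NULL,
    max_participants INTEGER NOT NULL,
    record_session INTEGER NOT NULL,
    status TEXT NOT NULL,
    created_at TEXT NOT NULL
)
"""

# Stand-in signature for the provider call: returns the provider's room id.
CreateRemote = Callable[[str, str, int, bool], str]


def create_room(db: sqlite3.Connection, tenant_id: str, create_remote: CreateRemote,
                *, room_type: str = "group", max_participants: int = 4,
                record_session: bool = False) -> str:
    """Create the room at the provider, then persist the local record at once."""
    room_id = str(uuid.uuid4())
    provider_sid = create_remote(room_id, room_type, max_participants, record_session)
    db.execute(
        "INSERT INTO rooms VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (room_id, tenant_id, provider_sid, room_type, max_participants,
         int(record_session), "in-progress",
         datetime.now(timezone.utc).isoformat()),
    )
    db.commit()
    return room_id


def room_status(db: sqlite3.Connection, tenant_id: str, room_id: str) -> Optional[str]:
    """Answer state questions from the local record, scoped by tenant."""
    row = db.execute(
        "SELECT status FROM rooms WHERE id = ? AND tenant_id = ?",
        (room_id, tenant_id),
    ).fetchone()
    return row[0] if row else None
```

Because `room_status` reads only the local table, a dashboard query keeps working while the provider is unreachable, and a lookup under the wrong tenant simply finds nothing.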

JWT access tokens are generated by a dedicated token service. Each token carries the claims Twilio requires (account and API key identifiers, room scope, and participant identity) along with a short expiration window, and is signed with the API key secret. Participant identities go through format validation and a reserved-word blocklist before the token is issued. In multi-tenant video platforms, a spoofed or malformed identity is a real attack surface: it can pollute session logs, trigger incorrect billing events, or bypass participant-level access controls.

Multi-tenancy is enforced at every layer. Each client operates under its own API credentials, all database queries are scoped by tenant identifier, and credential lookups are accelerated by a caching layer with a configurable TTL. The tradeoff between cache freshness and lookup overhead is an explicit configuration decision, not an implementation assumption.
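As an illustration of the credential cache, here is a small TTL cache keyed by tenant identifier. The `TenantCredentials` fields, the `loader` callable (a stand-in for the tenant-scoped database lookup), and the injectable clock are hypothetical names chosen for the sketch.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class TenantCredentials:
    account_sid: str
    api_key_sid: str
    api_key_secret: str


class CredentialCache:
    """Caches per-tenant credentials for a configurable number of seconds."""

    def __init__(self, loader: Callable[[str], TenantCredentials],
                 ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        # tenant_id -> (expiry time, credentials)
        self._entries: Dict[str, Tuple[float, TenantCredentials]] = {}

    def get(self, tenant_id: str) -> TenantCredentials:
        now = self._clock()
        entry = self._entries.get(tenant_id)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: no database round trip
        credentials = self._loader(tenant_id)  # expired or missing: reload
        self._entries[tenant_id] = (now + self._ttl, credentials)
        return credentials
```

A longer `ttl_seconds` reduces lookups but delays the moment a rotated or revoked credential takes effect, which is why the value belongs in configuration rather than in code.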

Webhooks, Signature Validation, and Real-Time State

Twilio delivers room lifecycle events (room-created, room-ended, participant-connected, participant-disconnected, dominant-speaker-changed) as signed HTTP POST callbacks. Every incoming webhook is validated against the X-Twilio-Signature header before any payload is processed. This step is non-negotiable: an unvalidated webhook endpoint in a telehealth platform is an unauthenticated state-mutation surface that any caller with the URL can exploit.
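The check itself is small. The sketch below follows Twilio's documented scheme for form-encoded POST callbacks: concatenate the full request URL with each parameter name and value in sorted key order, compute an HMAC-SHA1 keyed with the account's auth token, and compare the base64 result to the header. The function names are ours; Twilio's helper libraries ship an equivalent request validator, which is usually the better choice in production.

```python
import base64
import hashlib
import hmac
from typing import Mapping, Optional


def compute_signature(auth_token: str, url: str, params: Mapping[str, str]) -> str:
    """Signature over the exact public URL plus sorted POST parameters."""
    data = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), data.encode("utf-8"),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_webhook(auth_token: str, url: str, params: Mapping[str, str],
                     header_signature: Optional[str]) -> bool:
    """Constant-time comparison; a missing header is treated as invalid."""
    expected = compute_signature(auth_token, url, params)
    supplied = header_signature or ""
    return hmac.compare_digest(expected.encode("utf-8"), supplied.encode("utf-8"))
```

One practical caveat: the URL must match what Twilio called, including scheme and query string, so a service behind a proxy or load balancer needs to reconstruct the public URL rather than use the internal one.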

Validated events update local participant and session state, then propagate to connected web clients through a SignalR hub. Join events, leave events, and dominant speaker changes are all pushed in real time without polling. The real-time push layer is decoupled from the webhook processing pipeline through a defined interface, which keeps the two concerns independently replaceable without touching either side's logic.
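To make the decoupling concrete, here is a sketch in which the webhook processor depends only on a small notifier interface. `RoomNotifier`, `ParticipantEvent`, and the in-memory implementation are hypothetical; in the real system the implementation behind the interface would forward to the SignalR hub.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Set


@dataclass(frozen=True)
class ParticipantEvent:
    tenant_id: str
    room_id: str
    identity: str
    kind: str  # "connected" or "disconnected"


class RoomNotifier(Protocol):
    """The only thing the webhook pipeline knows about the push layer."""

    def publish(self, event: ParticipantEvent) -> None: ...


class InMemoryNotifier:
    """Test double; a production implementation would push over SignalR."""

    def __init__(self) -> None:
        self.sent: List[ParticipantEvent] = []

    def publish(self, event: ParticipantEvent) -> None:
        self.sent.append(event)


class WebhookProcessor:
    def __init__(self, notifier: RoomNotifier) -> None:
        self._notifier = notifier
        self._present: Dict[str, Set[str]] = {}  # room_id -> identities

    def handle(self, event: ParticipantEvent) -> None:
        """Update local state first, then push the change to subscribers."""
        members = self._present.setdefault(event.room_id, set())
        if event.kind == "connected":
            members.add(event.identity)
        elif event.kind == "disconnected":
            members.discard(event.identity)
        else:
            raise ValueError(f"unsupported event kind: {event.kind}")
        self._notifier.publish(event)

    def participants(self, room_id: str) -> List[str]:
        return sorted(self._present.get(room_id, set()))
```

Swapping the transport means writing another class with a `publish` method; `WebhookProcessor` does not change, and it can be tested without any hub running.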

The reason for this architecture over a polling approach is latency and server load. Coordinator dashboards in telemedicine need to reflect room state within seconds. Polling at a useful frequency generates significant unnecessary traffic against both the application server and Twilio's API; SignalR push eliminates that overhead entirely.

Async Composition Pipeline and Background Workers

Recording introduces the most operationally complex part of the system. When a session ends with recording enabled, Twilio does not immediately produce a composed video file — it produces individual track resources that must be assembled via a separate composition API call. This is a deliberate boundary on Twilio's side, and it means video composition is an asynchronous, multi-step process that continues after the consultation ends.

Two background workers manage this pipeline independently. The first picks up sessions with unprocessed recordings and initiates composition jobs against Twilio's API on a regular polling cycle. The second tracks status updates on in-progress compositions and writes the results back to local state. Both workers operate against local data, so a transient Twilio API failure does not block the next processing cycle. The tradeoff is that final video availability is eventually consistent — for post-consultation review workflows in veterinary telemedicine, that boundary is acceptable and explicitly documented.
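The two worker cycles can be sketched as plain functions over local records. The status names, the `SessionRecord` fields, and the `create_composition` and `fetch_status` callables (stand-ins for the calls to Twilio's composition API) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class SessionRecord:
    session_id: str
    status: str = "pending"  # pending -> composing -> completed | failed
    composition_id: Optional[str] = None
    last_error: Optional[str] = None


def start_pending(sessions: Dict[str, SessionRecord],
                  create_composition: Callable[[str], str]) -> int:
    """Worker 1: request a composition for every session still pending."""
    started = 0
    for record in sessions.values():
        if record.status != "pending":
            continue
        try:
            record.composition_id = create_composition(record.session_id)
        except Exception as exc:  # transient failure: retry on the next cycle
            record.last_error = str(exc)
            continue
        record.status = "composing"
        record.last_error = None
        started += 1
    return started


def refresh_composing(sessions: Dict[str, SessionRecord],
                      fetch_status: Callable[[str], str]) -> int:
    """Worker 2: write terminal composition results back to local state."""
    finished = 0
    for record in sessions.values():
        if record.status != "composing" or record.composition_id is None:
            continue
        try:
            remote_status = fetch_status(record.composition_id)
        except Exception as exc:  # leave the record as-is and retry later
            record.last_error = str(exc)
            continue
        if remote_status in ("completed", "failed"):
            record.status = remote_status
            finished += 1
    return finished
```

A failed API call leaves the record in its previous state, so the next polling cycle picks it up again without any special recovery path.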

Structured logging and error tracking give full visibility into both workers. A stalled composition surfaces as a monitoring alert, not a support ticket filed three days later when a clinic notices a missing recording.
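One way to turn a stalled composition into an alert is an age check over in-progress records, emitted as a structured log line that a monitoring rule can match. The 30-minute threshold, the record fields, and the logger name below are hypothetical.

```python
import json
import logging
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List

logger = logging.getLogger("composition.monitor")


def find_stalled(compositions: Iterable[Dict], now: datetime,
                 max_age: timedelta = timedelta(minutes=30)) -> List[str]:
    """Return ids of compositions that have been in progress for too long."""
    stalled = []
    for item in compositions:
        age = now - item["started_at"]
        if item["status"] == "composing" and age > max_age:
            stalled.append(item["id"])
            # One JSON object per line keeps the event machine-readable.
            logger.warning(json.dumps({
                "event": "composition_stalled",
                "composition_id": item["id"],
                "tenant_id": item["tenant_id"],
                "age_seconds": int(age.total_seconds()),
            }))
    return stalled
```

Running this at the end of each status-polling cycle is enough to surface a stuck job within one threshold period.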

[Figure: Production backend architecture, with Twilio isolated as external transport and the local database as source of truth. Flow: Client → Token Service → Twilio Room → Webhook Handler → real-time push via SignalR. Reported result: less than 3% call failure rate in production.]

Architecture Is the Deployment Plan

The Twilio SDK handles the protocol. The architecture handles the failure modes — stale state, composition delays, webhook replay, tenant isolation. None of those are Twilio problems; they are system design problems that every production video integration eventually encounters.

At Smartnet, this backend runs in production for telemedicine workflows where reliability is not a nice-to-have. The patterns here — local entity caching, validated webhook ingestion, async worker pipelines, SignalR push — are transferable to any domain where video infrastructure needs to be observable, tenant-safe, and recoverable.

This backend is also where the AI reporting pipeline connects. Read about how we reduced post-consultation report writing time by 56% using the session data this architecture produces.

Need a video backend that holds up in production?

Smartnet builds production-grade real-time communication systems on ASP.NET Core — from architecture to deployment.
