WebRTC Voice Applications

WebRTC is commonly used for voice applications, but video takes the spotlight

Relatively speaking, the importance of WebRTC video communications is massively overstated. That may sound like heresy coming from a WebRTC analyst, but it’s also true. That doesn’t mean that video is unimportant, nor that it isn’t going to grow hugely in scope, but it’s certainly not the only game in town. And the focus on video highlights the surprising lack of voice-only use-cases for WebRTC so far.


This reflects a common fallacy in the telecoms industry that so-called “richer” or “multimedia” forms of communication are inherently better, when in fact, they’re just better-suited to certain use-cases or contexts.

Indeed, one only has to look at the huge proliferation of messaging-type applications in recent years, from SMS to web chat to Twitter to email to the various mobile IM models, to realise that often “less is more” in communications. (The obvious counterpoint is RCS/joyn, which amply illustrates that being “rich” doesn’t make you popular.)

Given a broad choice of options, consumers tend to pick whatever seems to be the “right tool for the job”. Even when offered a “multimedia chainsaw”, there are still plenty of occasions when a good old-fashioned textual screwdriver or audio spanner is more appropriate. Globally, around 4-5 billion people use voice and text communications regularly. For video, it’s probably more like 100-200 million – and for multi-party video, only a small fraction of that.

Too many commentators lazily refer to WebRTC as “Skype in the browser”, invoking an image of video chat or conferencing as the default mode. Few people use terms like “VoIP in the browser” or “Viber in the browser”. Yet ironically, it’s the audio codecs which are agreed, while video is still subject to debate.

Now, to be fair, there are various WebRTC audio conferencing products available, and Vonage launched one of the very first mobile WebRTC apps last year. A number of internal contact-centre solutions use a browser dashboard instead of a traditional telephony platform. Twilio, Plivo and Tropo have voice-centric cloud platforms, while a couple of Telco-OTT propositions evolve the normal telephony model to WebRTC. There are even one or two music-jamming applications around.

But these are exceptions. Most prototypes, demos and commercial WebRTC platforms are video-centric. There are dozens of lookalike video chat services, or video contact-centre concepts. There are innumerable presentations and white papers extolling a new age of video interviews, video telemedicine, video dating and connected video-capable “things”.

Yet almost no thought, design or marketing goes into new ways to extend human speech – or other forms of audio – via WebRTC. It’s all eyes, but no ears. It’s as if 120 years of “phone calls” has blinded (deafened?) us to the viability of other formats for voice.

Now, it could simply be that video is “shiny” and demo-friendly in a way that audio generally isn’t. It also attracts vendors selling bigger and costlier network boxes, as video mixing and transcoding aren’t as commoditised or as easily addressed by open-source. It could also be down to psychological or design-related reasons – talking to a browser seems a bit weird for some reason, compared to talking to a standalone app.

But the fact is that the bulk of today’s realtime communications is voice-centric, often for good practical reasons. A lot of people cannot or will not use video in many cases – it may be dangerous (e.g. while driving or walking), distracting, invasive or uncomfortable. In a multi-tasking world, looking at a camera often involves too much cognitive load (especially as you watch your own image), and may inhibit concurrent tasks such as note-taking, or reading presentation slides.

WebRTC-powered video will absolutely have many use cases, but equally, it can never be ubiquitous or the default mode for all instances of communications.

So it seems strange that so few WebRTC applications and services have been targeted at audio-only, or even audio-primary usage. There seems to be a significant gap for companies (or open-source projects) to enable more pure-audio WebRTC than is currently seen. In particular, the assumption that anything based around speech is necessarily a “call”, and could or should be interoperable with the phone system, is wrong.

Yet even within the traditional telecoms industry, we’ve long had other formats for voice communications – walkie-talkies, private radio, push-to-talk, voice messaging, hoot-n-holler and so forth. Add in cloud capabilities like speech recognition, storage, translation, audio-processing of various types and we should have a wide range of WebRTC possibilities. Where’s the “Voice Instagram” that allows people to converse in Glaswegian accents or Donald Duck squawks? Where are the realtime profanity bleep-outs, or inline stress-analysis lie detectors?

And going beyond the actual transmission of spoken words, there’s another world of intent and purpose. Why exactly are people talking, and what are they actually hoping to achieve? How can the web – and the network – enhance that? The contextual capabilities of browsers and devices should be able to add to the experience of audio communications – recognising when to capture and emphasise the sounds of crashing waves on a beach during a call. Or when to block out the sound of a crashing bore in the background at a party.

WebRTC video offers huge opportunities. But at the same time, we should remember that voice communications has delivered trillions of dollars in revenue in the past, and could continue to do so. Let’s ensure that the Future of Voice is as vibrantly-coloured as the Future of Video.
