What is the Media Capture and Streams API?

It's been more than six years since HTML5 became an official recommendation of the World Wide Web Consortium (W3C for short). Nonetheless, the HTML5 standards are still evolving, and many developers can no longer imagine their work without the APIs bundled with HTML5. Not to mention that, given how popular media are among users, modern web development increasingly requires complex media handling to meet the functional expectations placed on every application. With that in mind, let's dive into an API that makes media handling simple - let me introduce you to the Media Capture and Streams API as well as some of its closest relatives.

Historical recap

Starting from the beginning, simple audio and video capture has always been one of the goals of the web developers' community. To achieve this aim, plug-ins such as Adobe Flash or Microsoft Silverlight were created. The questionable popularity of plug-ins goes without saying; however, things changed once HTML5 was presented to a wider audience. This time, the W3C put more emphasis on modularity, defining the characteristics of specific interfaces and developing them as separate specifications. Among the most popular APIs are, for example: Canvas, WebGL (an OpenGL port), Drag & Drop, the History API and, of course, the protagonist of this article - the Media Capture and Streams API. Importantly, this turning point had a truly significant impact on web development as we know it today. Now let's dig into WebRTC, which is the focal point of our topic.

Introducing WebRTC API

The term WebRTC stands for 'Web Real-Time Communication' and as such is self-explanatory. The underlying purpose of this API is to provide a fast and reliable way of connecting peers. If two or more users want to communicate in real time (RT), a server working as a proxy between them is not only redundant but actively hinders the flow. In a perfect scenario, message latency would be exactly zero milliseconds, so peer-to-peer (p2p) communication is the better alternative here. However, we cannot make use of such infrastructure if we don't have a single message to send, can we? This became the main motivation for creating the Media Capture and Streams API (also called the Media Streams API). If we have excellent infrastructure for RT communication, we need just as excellent messages to send. What better candidate for that than audio and video?

GetUserMedia - a deep dive into capturing users' media 

After this short history lesson, we're good to go with the main point. As always, our concern is centred around functions. This time, however, there will be only one function to consider. That's correct - a single function gives access to an extremely powerful piece of functionality. It's possible because this one function encapsulates basically everything included in the Media Streams API, and even features which go beyond it. You probably know it as the simple 'getUserMedia':

getUserMedia();

 or, to be more precise: 

navigator.mediaDevices.getUserMedia()

The Canvas API has similar functionality, allowing programmers to capture a stream or just a single frame from a canvas element (see the sketch below). With that aside, let's break getUserMedia down step by step.
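
As a side note, that canvas counterpart could look roughly like this; the sketch assumes a <canvas id="drawing"> element already exists on the page and relies on the standard captureStream() and toDataURL() methods:

const canvas = document.getElementById('drawing');

// captureStream() turns the live canvas contents into a MediaStream;
// the optional argument is the requested frame rate.
const canvasStream = canvas.captureStream(30);

// A single still frame can be taken with toDataURL() instead.
const snapshot = canvas.toDataURL('image/png');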

Capabilities and constraints

Capabilities and constraints are very basic functional areas within the Media Streams API. The constraints supported by the browser can be obtained via the following method of the MediaDevices interface, and they are basically the 'frame' for our stream:

navigator.mediaDevices.getSupportedConstraints();

We pass these 'media requirements' as an argument to our hero function and that's all we need to do. The Capabilities functionality is the other side of the same coin: as an optional step, before you commit to your constraints you can check whether they are supported at all - and if they are not, fall back to less demanding requirements. Some examples of constraints include (but are not limited to) height and width. If the user is on a phone, they probably have more than one camera - and the decision about which one will be used is up to you. A full list of the possible customizations is available in the official W3C Media Capture and Streams documentation.
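
To make this more concrete, below is a minimal sketch of how such a constraints object could be built; the resolution values and the rear-camera preference are arbitrary choices for illustration only:

// Check which constraints the current browser understands.
const supported = navigator.mediaDevices.getSupportedConstraints();

// Build a constraints object; every property is optional.
const constraints = {
  audio: true,
  video: {
    width: { ideal: 1280 },
    height: { ideal: 720 },
    // Prefer the rear camera on a phone, if the browser supports the constraint.
    ...(supported.facingMode ? { facingMode: 'environment' } : {}),
  },
};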

GetUserMedia function breakdown

Now that we know what constraints are and how to build our own, let's see how to deal with the core function.

navigator.mediaDevices.getUserMedia(constraints)

It's an asynchronous function, so it returns a Promise that resolves to a MediaStream object, or rejects with an error if, for instance, the user refuses to grant permission to access their camera and microphone. After invoking this function you have to make use of the Promise API and chain the `.then()` and `.catch()` functions, or simply await the result with the `async/await` keywords. This is illustrated in the example JavaScript code snippet below:

const showUserHisFace = () => {
  navigator.mediaDevices.getUserMedia({
    video: true,
    audio: true,
  }).then(stream => {
    const videoElement = document.getElementById('mirror-video');

    videoElement.srcObject = stream;
  })
  .catch(err => console.log(err));
};
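
For completeness, the same logic expressed with the `async/await` keywords mentioned above could be sketched as follows, assuming the same hypothetical 'mirror-video' element:

const showUserHisFaceAsync = async () => {
  try {
    // Wait for the user to grant access to the camera and microphone.
    const stream = await navigator.mediaDevices.getUserMedia({
      video: true,
      audio: true,
    });

    const videoElement = document.getElementById('mirror-video');
    videoElement.srcObject = stream;
  } catch (err) {
    // The Promise rejects here if permission is denied or no device is found.
    console.log(err);
  }
};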

This is the most basic usage of the getUserMedia function: it captures the user's camera and microphone input and sets the resulting stream as the video element's `srcObject`. If the page has a corresponding HTML `video` element, it will work as a mirror, showing users their own face. A MediaStream object consists of several synchronised MediaStreamTrack objects, each one representing a single input channel. In most cases there will be two tracks, one for audio and one for video, so you can separate them if you need to save the media. This is a great example of how HTML APIs work together. Nearly everything comes out of the box, so there is no need to parse the data, convert it to an object URL like we all had to in the past, and so on. These operations are significantly simplified now: all you need to do is pass the object you received to an HTML5 <video> element, and you're done. There is no need for the "src" attribute, nor for manipulating or 'feeding' it with <source> tags. To sum up, the amount of code is kept to an absolute minimum, while the number of functionalities is enormous - a mixture that every developer loves.
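
To illustrate the track separation mentioned above, here is a small sketch that pulls the individual tracks out of a stream and stops them when they are no longer needed; the function name is purely illustrative:

const inspectAndStop = (stream) => {
  // Each kind of input is represented by its own MediaStreamTrack.
  const [videoTrack] = stream.getVideoTracks();
  const [audioTrack] = stream.getAudioTracks();

  console.log(videoTrack.label, audioTrack.label);

  // Stopping every track releases the camera and microphone.
  stream.getTracks().forEach(track => track.stop());
};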

Security matters

It is important to know that the Media Streams API requires your worker or window to exist in a secure context. What does that mean? Basically, if your resources do not come from your local environment, they must be served over a secure protocol, e.g. WSS or HTTPS. Most HTML5 APIs of this kind require a secure connection, too. The reason is that they are extremely powerful in terms of privacy interference and, as a result, privacy comes first - ahead of functionality.
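
If you want to verify this at runtime, both checks below rely on standard browser properties; this is just a quick sketch, not a complete capability test:

// true when the page is served over HTTPS, from localhost, etc.
console.log(window.isSecureContext);

// navigator.mediaDevices is simply undefined in insecure contexts
// (and in very old browsers), so feature-detect before using it.
if (!navigator.mediaDevices || !navigator.mediaDevices.getUserMedia) {
  console.warn('getUserMedia is not available in this context');
}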

As far as security matters are concerned, also check out our strategies on how to secure your API!

What's next

If you are interested in the topic of multimedia resources, we recommend exploring the rest of the 'Media APIs' related to WebRTC. The closest relative is the HTMLMediaElement interface: while MediaStream focuses on capturing user input, HTMLMediaElement's core responsibility is to manage media resources through the <audio> and <video> HTML elements. Another WebRTC-related API is the MediaStream Recording API, which enables you to save the results of your work. The MediaStream API by itself may not be enough to create your dream functionality, but the other APIs mentioned above are a complementary extension to it.
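
As a small taste of the MediaStream Recording API mentioned above, here is a minimal sketch of saving a captured stream with its MediaRecorder interface; the 'video/webm' MIME type and the five-second cut-off are assumptions made purely for illustration:

const recordStream = (stream) => {
  const chunks = [];
  const recorder = new MediaRecorder(stream);

  // Collect encoded data as it becomes available.
  recorder.ondataavailable = event => chunks.push(event.data);

  // When recording stops, combine the chunks into a single Blob.
  recorder.onstop = () => {
    const recording = new Blob(chunks, { type: 'video/webm' });
    console.log('Recorded', recording.size, 'bytes');
  };

  recorder.start();

  // Stop after five seconds, just for demonstration.
  setTimeout(() => recorder.stop(), 5000);
};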

What is the MediaStream API’s typical business application? Real-world examples of usage usually give us a general big picture of the usability of a given software solution, so I'll name some of the most recognizable ones. Of course, first place goes to real-time communicators, like Google Hangouts or Microsoft Teams; second place to streaming services, like Twitch or YouTube; and, lastly, third place goes to what is, in my opinion, its most interesting use case - eye tracking.

Bonus - eye tracking with WebGazer

In simplest terms, the eye-tracking technique allows you to track and analyze the movement of a user's eyeballs. What does that mean? In essence, it allows someone with access to such data to state where the user's focus currently is. This is powerful information because it lets you decide where the important content should go: if you have a heatmap indicating which region of your webpage gets most of the user's attention, you can easily expose the most important data there. For this reason, tracking somebody's eye movement is far from a trivial feature.


Maybe you're familiar with eye-tracking solutions like Tobii, which is heavily used by streamers on platforms like Twitch.tv. Tobii, however, requires third-party hardware, which may seem somewhat limiting. Does it have to be that way, though? As it turns out, the MediaStream API together with an ordinary webcam is absolutely sufficient to build an effective real-time eye-tracking feature. There are two ways to approach this: you can do it from scratch (some basic machine-learning knowledge will be required) or, alternatively, by using WebGazer.js.

The main motto of the WebGazer project - "democratizing webcam eye tracking on the browser" - accurately reflects its mission. You can even see a pattern when comparing the MediaStream API and WebGazer: in both cases the required amount of code and the offered functionality are at a similar level. An important advantage is that everything happens "under the hood" - the library processes the whole logic on its own and needs only a similarly small amount of configuration.

So what is WebGazer and what exactly does it do? WebGazer is a self-calibrating, client-side, framework-agnostic library, written entirely in JavaScript, which predicts user eye movement, utilizing the `js-objectdetect` library for face and eye recognition as well as the `TFJS` library for machine learning. TFJS stands for 'TensorFlow.js' and it is a hardware-accelerated JavaScript library for training and deploying machine-learning models. WebGazer trains various regression models that map pupil positions and eye features to on-screen coordinates. If you are curious about how exactly it works, you can explore it in the official research description.

Now let’s explore WebGazer in more detail and take a look at its integration and basic usage, as well as some theoretical usage ideas. First of all, you have to install WebGazer. As stated in the WebGazer documentation, you are free to choose between a CDN (local or not) and a package manager.

<script src="webgazer.js" type="text/javascript"></script> or
npm i webgazer

And then:
import webgazer from 'webgazer';

Now WebGazer is up and running and available under the `webgazer` namespace. The next step is to provide the gazer with input, which will make it start collecting data. The function that you need looks as follows:

const startDataCollecting = async () => {
  webgazer.params.showVideoPreview = true; // required for working

  webgazer
    .setRegression('ridge')
    .begin();
};

Simple, isn't it? Let's take a look at what exactly is going on here. The WebGazer instance exists as a singleton, so invoking this function sets its internal properties, such as the regression model to use and whether the video preview should be visible. Subsequently, data collection and analysis begins, which means that the only thing we're missing is the last step - making use of the returned data.

const updateGazePredictionListener = () => {
  webgazer.setGazeListener((prediction, elapsedTime) => {
    setXPrediction(prediction.x);
    setYPrediction(prediction.y);
  });
};

const setCurrentPredictions = async () => {
  const prediction = await webgazer.getCurrentPrediction();

  setXPrediction(prediction.x);
  setYPrediction(prediction.y);
};

You can access the predictions using the two functions above. The first one registers a listener that is called repeatedly as new predictions arrive, while the second one performs a one-off read (the `setXPrediction` and `setYPrediction` calls stand for whatever state setters your application uses). The returned data is the same in both cases: an object with X and Y coordinates relative to the viewport. These X, Y coordinates indicate the point at which the user is looking!
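
As a hypothetical example of what to do with those coordinates, the sketch below highlights whichever element the user is currently looking at; document.elementFromPoint is a standard DOM method, while the red outline is an arbitrary styling choice:

const highlightGazedElement = (x, y) => {
  // Find the topmost element under the predicted gaze point.
  const gazedElement = document.elementFromPoint(x, y);

  if (gazedElement) {
    gazedElement.style.outline = '2px solid red';
  }
};

webgazer.setGazeListener((prediction) => {
  // The prediction can be null before calibration has gathered enough data.
  if (prediction) {
    highlightGazedElement(prediction.x, prediction.y);
  }
});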

Conclusion

As we can see, the Media Stream API along with its closest relatives is a very powerful yet simple solution for working with users' media. Its basic usage, although seemingly trivial, lets you work in an unconstrained way - where the limit is set only by your imagination - as the WebGazer example shows. Simple media capturing allows you to build extraordinary things! The MediaStream API lets you create very complex logic and functionalities, especially when combined with other complementary HTML5 APIs, simply by letting you capture users' media. For a comprehensive experience, we recommend trying out the other WebRTC-related APIs, and using the MediaStream API as part of WebGazer for eye tracking.

Do you already have ideas on how to use the Media Capture and Streams API in your applications? Join us to share them!
