/dev tips

Cloud Vision API: Hear what your phone sees with image detection from a webpage

18 Oct 2016

Introduction

This post roughly explains how I made a web application that recognises what is in images captured through a webcam or mobile camera and announces aloud what it detected. For example, if you point your camera at a book, the web page announces 'Book'.

Video demo below (includes audio)

This has some interesting use cases:

Overall approach

  1. A video feed from the user is taken (webcam/device camera) with the MediaStreamTrack and getUserMedia APIs.
  2. The video plays on a canvas.
  3. Every second, the current canvas frame is sent to the Google Cloud Vision API as a base64-encoded image.
  4. The Speech Synthesis API reads the response (e.g. dog, book, chair) to the user.
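
Step 2 can be sketched roughly as follows. This is a minimal sketch rather than the code from the demo: `drawFrame` is a hypothetical helper, and the usage part assumes the page contains a `<video>` element that is already playing the camera stream and a `<canvas>` element.

```javascript
// Hypothetical helper: copy the current video frame onto a canvas so that it
// can later be exported with canvas.toDataURL().
function drawFrame(video, canvas) {
    // Keep the canvas the same size as the video's intrinsic size
    if (canvas.width !== video.videoWidth) canvas.width = video.videoWidth;
    if (canvas.height !== video.videoHeight) canvas.height = video.videoHeight;

    const context = canvas.getContext('2d');
    context.drawImage(video, 0, 0, canvas.width, canvas.height);
    return canvas;
}

// Browser-only usage (assumed elements; skipped outside a browser)
if (typeof document !== 'undefined') {
    const video = document.querySelector('video');
    const canvas = document.querySelector('canvas');
    if (video && canvas) {
        // Redraw on every animation frame while the page is open
        const loop = () => {
            drawFrame(video, canvas);
            requestAnimationFrame(loop);
        };
        requestAnimationFrame(loop);
    }
}
```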

How?

Note: these are over-simplified code examples that don't work on their own. Please read the documentation for each API if you wish to do this yourself.

Get camera input

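// List the available media sources and keep the first video source
MediaStreamTrack.getSources(sources => {
    const [{id}] = sources.filter(source => source.kind === 'video');

    // Request a video stream from that specific camera
    navigator.webkitGetUserMedia(
        {video: {optional: [{sourceId: id}]}},
        stream => console.info(stream),
        err => console.error(err)
    );
});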
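
The snippet above relies on `MediaStreamTrack.getSources` and the prefixed `webkitGetUserMedia`, which Chrome later dropped in favour of `navigator.mediaDevices`. Here is a hedged sketch of the same idea with the promise-based API. `firstVideoDeviceId` and `getCameraStream` are hypothetical helper names, not part of the demo.

```javascript
// Pure helper: pick the deviceId of the first video input from an
// enumerateDevices() result, or undefined if there is no camera.
function firstVideoDeviceId(devices) {
    const camera = devices.find(device => device.kind === 'videoinput');
    return camera ? camera.deviceId : undefined;
}

// Request a stream from that camera. Before the user grants permission the
// deviceId can be an empty string, in which case any camera is accepted.
async function getCameraStream(mediaDevices) {
    const deviceId = firstVideoDeviceId(await mediaDevices.enumerateDevices());
    const video = deviceId ? {deviceId: {exact: deviceId}} : true;
    return mediaDevices.getUserMedia({video});
}

// Usage in a page (videoElement is an assumed <video> element):
// getCameraStream(navigator.mediaDevices)
//     .then(stream => { videoElement.srcObject = stream; })
//     .catch(err => console.error(err));
```

On a phone, passing `{video: {facingMode: 'environment'}}` to `getUserMedia` is another way to ask for the rear camera without enumerating devices first.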

Prepare image payload for identification

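setInterval(() => {
    const url = 'https://vision.googleapis.com/v1/images:annotate?key=';
    // Export the current frame as a JPEG data URL and keep only the base64 part
    const content = canvasElement.toDataURL('image/jpeg', 0.5).split(',')[1];
    // Request body asking the API to label that image
    const payload = {
        requests: [{image: {content}, features: [{type: 'LABEL_DETECTION'}]}]
    };
}, 1000);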
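
A fixed one-second interval does not account for a request that takes longer than a second to come back, in which case requests start to overlap. Below is a minimal sketch of one way to guard against that. `skipWhileBusy` and `sendFrame` are hypothetical names and this is not how the demo itself was written.

```javascript
// Wrap an async task so that calls made while a previous call is still
// pending are skipped instead of piling up.
function skipWhileBusy(task) {
    let busy = false;
    return async (...args) => {
        if (busy) return undefined; // previous request still in flight
        busy = true;
        try {
            return await task(...args);
        } finally {
            busy = false;
        }
    };
}

// Usage in a page (sendFrame is an assumed function that posts one frame):
// const guardedSend = skipWhileBusy(sendFrame);
// setInterval(() => guardedSend().catch(err => console.error(err)), 1000);
```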

Identify!

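const response = await fetch(url, {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify(payload)
});
// fetch() resolves to a Response; the labels are nested inside the parsed JSON body
const {responses: [{labelAnnotations}]} = await response.json();

console.log(labelAnnotations.map(label => label.description)) // e.g. ball, circle, apple, sphere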
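
Each label in the API response carries a `description` and a confidence `score`, so the page has to decide which one to read out. Here is a hedged sketch of one way to choose. `bestLabel` and the 0.7 threshold are assumptions for illustration, and the sample response is hand-written rather than real API output.

```javascript
// Pick the description of the highest-scoring label from an images:annotate
// response, or null if nothing clears the (assumed) confidence threshold.
function bestLabel(json, minScore = 0.7) {
    const [first = {}] = json.responses || [];
    const labels = first.labelAnnotations || [];
    const best = labels
        .filter(label => label.score >= minScore)
        .sort((a, b) => b.score - a.score)[0];
    return best ? best.description : null;
}

// Hand-written sample in the shape the API returns (made-up scores)
const sample = {
    responses: [{
        labelAnnotations: [
            {description: 'sphere', score: 0.71},
            {description: 'apple', score: 0.93},
            {description: 'ball', score: 0.55}
        ]
    }]
};
console.log(bestLabel(sample)); // apple
```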

Speak

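const utterance = new SpeechSynthesisUtterance('apple');
// getVoices() can return an empty list until the browser has loaded its voices;
// a null voice makes the browser fall back to its default voice
const [voice = null] = window.speechSynthesis.getVoices();
utterance.voice = voice;
window.speechSynthesis.speak(utterance);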
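
With a detection every second, the same label would be spoken over and over while the camera stays on one object. Below is a minimal sketch of one way to avoid that, assuming repeated announcements are unwanted. `createAnnouncer` is a hypothetical helper, not part of the demo.

```javascript
// Returns a function that passes a label to `speak` only when it differs
// from the previous one, so a static scene is not announced every second.
function createAnnouncer(speak) {
    let lastSpoken = null;
    return label => {
        if (!label || label === lastSpoken) return false;
        lastSpoken = label;
        speak(label);
        return true;
    };
}

// Browser-only wiring (skipped where speechSynthesis is unavailable)
if (typeof window !== 'undefined' && 'speechSynthesis' in window) {
    const announce = createAnnouncer(text =>
        window.speechSynthesis.speak(new SpeechSynthesisUtterance(text))
    );
    announce('apple'); // spoken
    announce('apple'); // skipped, same label as before
}
```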

I have not shared a live demo since using the Cloud Vision API costs me money.

No server-side component is needed for this web application.
