Cloud Vision API: Hear what your phone sees with image detection from a webpage
Last updated: 19th July 2020
What do you think about this tip? Email umar.hansa@gmail.com with feedback
Introduction
This post outlines how I made a web application that recognises objects in images captured through a webcam or mobile camera and announces aloud what was detected. For example, if you point your camera at a book, the web page announces 'Book'.
Video demo below (includes audio)
This has some interesting use cases:
- Understanding the correct pronunciation for the name of an object in front of you
- A visually impaired user understanding their surroundings with a point-and-listen approach
Overall approach
- A video feed from the user is taken (webcam/device camera) with the MediaStreamTrack and getUserMedia APIs.
- The video plays on a canvas.
- At frequent intervals (1 second), the base64 encoded image is sent to the Google Cloud Vision API.
- The Speech Synthesis API reads the response (e.g. dog, book, chair) to the user.
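The steps above can be sketched as a single loop iteration. This is a minimal sketch, not the code behind the demo: `createTick`, `captureFrame`, `identify` and `speak` are hypothetical names, passed in as plain functions so the flow is visible without any browser APIs. The `busy` flag is an added assumption, there so that a slow network response does not cause requests to pile up.

```javascript
// Hypothetical helper: wires the capture, identify and speak steps together.
// The three functions are injected, so none of these names are real browser APIs.
function createTick({ captureFrame, identify, speak }) {
  let busy = false; // true while a request is still in flight

  return async function tick() {
    if (busy) return; // skip this interval rather than stack up requests
    busy = true;
    try {
      const frame = captureFrame();        // e.g. a base64 JPEG taken from the canvas
      const label = await identify(frame); // e.g. a call to the Cloud Vision API
      if (label) speak(label);             // e.g. the Speech Synthesis API
    } catch (err) {
      console.error(err);
    } finally {
      busy = false;
    }
  };
}
```

In a page, something like `setInterval(createTick({captureFrame, identify, speak}), 1000)` would drive it once per second.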
How?
Note, these are over-simplified code examples which don't work on their own. Please read the documentation for the relevant API if you wish to do this yourself.
Get camera input
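// List the available devices and pick the first camera
const devices = await navigator.mediaDevices.enumerateDevices();
// Note: device IDs may be blank until the user has granted camera permission
const [{deviceId}] = devices.filter(device => device.kind === 'videoinput');
try {
  // Request a video stream, preferring the chosen camera
  const stream = await navigator.mediaDevices.getUserMedia({video: {deviceId}});
  console.info(stream);
} catch (err) {
  console.error(err);
}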
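The later steps read frames from a `canvasElement`, so the stream has to reach a canvas first. Here is a minimal sketch, assuming the stream is already playing in a `<video>` element. `drawFrame` is a hypothetical helper name, while `getContext('2d')`, `drawImage`, `videoWidth` and `videoHeight` are standard DOM APIs.

```javascript
// Hypothetical helper: copies the current video frame onto a canvas.
// Assumes `videoElement` is a <video> whose srcObject is the camera stream.
function drawFrame(videoElement, canvasElement) {
  const { videoWidth, videoHeight } = videoElement;
  if (!videoWidth || !videoHeight) return false; // no frame decoded yet

  // Match the canvas to the frame so the exported image is not stretched
  canvasElement.width = videoWidth;
  canvasElement.height = videoHeight;
  canvasElement
    .getContext('2d')
    .drawImage(videoElement, 0, 0, videoWidth, videoHeight);
  return true;
}
```

After setting `videoElement.srcObject = stream` and calling `videoElement.play()`, `drawFrame(videoElement, canvasElement)` can be called just before each frame is exported.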
Prepare image payload for identification
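setInterval(() => {
  const url = 'https://vision.googleapis.com/v1/images:annotate?key=';
  // toDataURL() returns 'data:image/jpeg;base64,<data>'; the API only wants <data>
  const [, content] = canvasElement.toDataURL('image/jpeg', 0.5).split(',');
  // Ask for labels (e.g. dog, book, chair) for this one image
  const payload = {
    requests: [{image: {content}, features: [{type: 'LABEL_DETECTION'}]}]
  };
}, 1000);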
Identify!
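const response = await fetch(url, {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify(payload)
});
// The API returns one entry in `responses` for each request that was sent
const {responses: [{labelAnnotations}]} = await response.json();
console.log(labelAnnotations.map(label => label.description)); // ball, circle, apple, sphere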
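The response holds several candidate labels, each with a `description` and a `score`, while only one word is spoken. Here is a minimal sketch of choosing that word. `pickLabel` is a hypothetical helper, and the `minScore` cut-off of 0.6 is an assumed value rather than an API setting.

```javascript
// Hypothetical helper: picks the single label to read aloud from an
// images:annotate response. `minScore` is an assumed cut-off, not an API setting.
function pickLabel(apiResponse, minScore = 0.6) {
  const [first = {}] = apiResponse.responses || [];
  const labels = first.labelAnnotations || [];

  // Highest score first, without mutating the original array
  const [best] = [...labels].sort((a, b) => b.score - a.score);
  return best && best.score >= minScore ? best.description : null;
}
```

Returning `null` when nothing scores highly enough lets the caller stay silent rather than announce a poor guess.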
Speak
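const utterance = new SpeechSynthesisUtterance('apple');
// getVoices() can return an empty list until the 'voiceschanged' event has fired,
// in which case the browser's default voice is used
const [voice] = window.speechSynthesis.getVoices();
if (voice) utterance.voice = voice;
window.speechSynthesis.speak(utterance);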
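With an image sent every second, the same object would be announced again and again while the camera stays still. One way to avoid that, sketched under the assumption that repeating a label is unwanted, is to speak only when the label changes. `createAnnouncer` is a hypothetical helper, and the `speak` function is passed in by the caller.

```javascript
// Hypothetical helper: wraps a `speak` function so a label is only read
// aloud when it differs from the previous one.
function createAnnouncer(speak) {
  let lastLabel = null;

  return function announce(label) {
    if (!label || label === lastLabel) return false; // nothing new to say
    lastLabel = label;
    speak(label);
    return true;
  };
}
```

In the browser, `speak` could be `text => window.speechSynthesis.speak(new SpeechSynthesisUtterance(text))`.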
I have not shared a live demo since using the Cloud Vision API costs me money.
No server-side component is needed for this web application.