Cloud Vision API: Hear what your phone sees with image detection from a webpage

Last updated: October 18, 2016

What do you think about this tip? Email umar.hansa@gmail.com with feedback

Introduction

This post outlines how I built a web application that recognises objects captured through a webcam or mobile camera and announces aloud what was detected. For example, if you point your camera at a book, the web page announces 'Book'.

Video demo below (includes audio)

This has some interesting use cases:

Understanding the correct pronunciation for the name of an object in front of you
A visually impaired user understanding their surroundings with a point-and-listen approach

Overall approach

A video feed is captured from the user's webcam or device camera with the MediaStreamTrack and getUserMedia APIs.
The video is drawn onto a canvas.
Once per second, the current canvas frame is sent as a base64-encoded image to the Google Cloud Vision API.
The Speech Synthesis API reads the response (e.g. dog, book, chair) to the user.
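
The four steps above can be sketched as a single loop. This is a minimal sketch, not the code behind the demo: `captureFrame`, `annotate` and `announce` are hypothetical placeholders for the camera, Cloud Vision and speech steps, passed in as functions so the loop itself stays small. Skipping a cycle while the previous request is still pending is an added assumption, not something the demo is known to do.

```javascript
// Hypothetical helper: wires the capture -> identify -> speak steps together.
// captureFrame, annotate and announce are placeholders supplied by the caller.
function createCycle({captureFrame, annotate, announce}) {
	let busy = false;

	return async function cycle() {
		// Skip this tick if the previous request has not finished yet
		if (busy) return false;
		busy = true;
		try {
			const frame = captureFrame();
			const label = await annotate(frame);
			if (label) announce(label);
			return true;
		} finally {
			busy = false;
		}
	};
}

// Usage (in the browser): run one cycle per second
// const cycle = createCycle({captureFrame, annotate, announce});
// setInterval(() => cycle().catch(console.error), 1000);
```

Passing the three steps in as functions keeps each one replaceable, which also makes the loop easy to try out without a camera or an API key.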

How?

Note: these are over-simplified code examples that don't work on their own. Please read the documentation for the relevant API if you wish to do this yourself.

Get camera input

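// MediaStreamTrack.getSources() and navigator.webkitGetUserMedia() have since
// been removed from browsers; navigator.mediaDevices is the current equivalent.
navigator.mediaDevices.enumerateDevices()
	.then(devices => {
		// Pick the first camera found
		const [{deviceId}] = devices.filter(device => device.kind === 'videoinput');

		return navigator.mediaDevices.getUserMedia({video: {deviceId}});
	})
	.then(stream => console.info(stream))
	.catch(err => console.error(err));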
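
The stream still has to reach the canvas mentioned in the overall approach. A minimal sketch, assuming the stream is attached to a `<video>` element (for example with `videoElement.srcObject = stream`); `captureFrame` is a hypothetical helper name, not one taken from the demo.

```javascript
// Hypothetical helper: copies the current video frame onto a canvas and
// returns it as a JPEG data URL ('data:image/jpeg;base64,...').
function captureFrame(videoElement, canvasElement) {
	// Match the canvas size to the video's intrinsic size
	canvasElement.width = videoElement.videoWidth;
	canvasElement.height = videoElement.videoHeight;

	const context = canvasElement.getContext('2d');
	context.drawImage(videoElement, 0, 0, canvasElement.width, canvasElement.height);

	// 0.5 is the JPEG quality, which keeps the upload small
	return canvasElement.toDataURL('image/jpeg', 0.5);
}
```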

Prepare image payload for identification

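setInterval(() => {
	// Append your own API key to this URL
	const url = 'https://vision.googleapis.com/v1/images:annotate?key=';
	// toDataURL() returns 'data:image/jpeg;base64,<data>'; the API expects only <data>
	const [, content] = canvasElement.toDataURL('image/jpeg', 0.5).split(',');
	const payload = {
		requests: [{image: {content}, features: [{type: 'LABEL_DETECTION'}]}]
	};
}, 1000);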

Identify!

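const response = await fetch(url, {
	method: 'POST',
	headers: {'Content-Type': 'application/json'},
	body: JSON.stringify(payload)
});
// fetch() resolves to a Response; the labels sit inside the parsed JSON body
const {responses: [{labelAnnotations}]} = await response.json();

console.log(labelAnnotations.map(label => label.description)) // ball, circle, apple, sphere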
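
The Cloud Vision response nests the labels under `responses[0].labelAnnotations`, and each label carries a `description` and a `score` between 0 and 1. A minimal sketch of choosing a single word to announce; `pickLabel` and the 0.7 cut-off are illustrative assumptions, not values from the demo.

```javascript
// Hypothetical helper: returns the description of the highest-scoring label,
// or null when nothing clears the threshold.
function pickLabel(result, minScore = 0.7) {
	const [first = {}] = result.responses || [];
	const labels = first.labelAnnotations || [];

	const best = labels
		.filter(label => label.score >= minScore)
		.sort((a, b) => b.score - a.score)[0];

	return best ? best.description : null;
}

// Example with a hand-written response in the API's shape
const sample = {
	responses: [{
		labelAnnotations: [
			{description: 'ball', score: 0.81},
			{description: 'apple', score: 0.93}
		]
	}]
};
console.log(pickLabel(sample)); // apple
```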

Speak

const utterance = new SpeechSynthesisUtterance('apple');
// getVoices() can return an empty list until the browser has loaded its voices
const [voice] = window.speechSynthesis.getVoices();
if (voice) utterance.voice = voice;
window.speechSynthesis.speak(utterance);
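
With a request every second, the same object would be announced over and over while it stays in view. A minimal sketch of speaking only when the label changes; `createAnnouncer` is a hypothetical helper, and the `speak` function is passed in so the logic does not depend on the browser.

```javascript
// Hypothetical helper: wraps a speak function so that a label is only
// announced when it differs from the previous one.
function createAnnouncer(speak) {
	let lastLabel = null;

	return function announce(label) {
		// Ignore empty results and repeats of the last announcement
		if (!label || label === lastLabel) return false;
		lastLabel = label;
		speak(label);
		return true;
	};
}

// Usage (in the browser):
// const announce = createAnnouncer(text => {
// 	window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
// });
// announce('apple'); // spoken
// announce('apple'); // skipped
// announce('book');  // spoken
```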

I have not shared a live demo since using the Cloud Vision API costs me money.

No server-side component is needed for this web application.
