Automated image descriptions

In the two blog articles Images without Barriers and The Right Image Description, I explained how descriptive texts for images can be added manually and what makes a good description of photos or graphics. Only with this knowledge is it possible to judge automatically generated image descriptions at all. In what follows, I treat two things together: describing what is shown visually and reading out text contained in an image.

How do you get automatic image descriptions?

There are a few services that provide image descriptions. They are based on machine learning. Whereas a few years ago a computer had to be laboriously taught to distinguish a cat from a dog, it is now possible to recognise individual components of an image and, in some cases, to relate them to each other. The resulting image descriptions are quite respectable. However, they do not (yet) replace an image description written by hand.

The different possibilities


There are various ways to obtain image descriptions.


The JAWS screen reader can describe images on request. The image you want described is sent to Microsoft, Facebook or Google, and you then receive the description.


Similar behaviour can be retrofitted in NVDA with the Online OCR extension. However, this works rather poorly.


Apple's VoiceOver screen reader can also describe images. It is the only option presented here that works locally on the iPhone itself, that is, without an internet connection. Only a data set for machine learning has to be downloaded to the device. From a data protection perspective, Apple's approach is very welcome. Because the screen reader is seamlessly integrated into the operating system, even cover art in Apple Music is described. Currently, however, the descriptions are available only in English.

In addition, apps that were not developed with accessibility in mind can be read out in a rudimentary way: the usual symbols and icons are interpreted. However, this is definitely no substitute for apps that have been properly developed for accessibility.



Facebook has already integrated image descriptions into its own products, so unlabelled images automatically receive a description. However, this description reads more like a list of keywords, for example "Nature, sky, tree".


Google's image description algorithm has been built into the Chrome browser since 2019. If you open the context menu on an image, you will find the option to get image descriptions from Google. In contrast to Facebook's descriptions, these are much more extensive. For example, when I had a screenshot described, I learned that it was a screenshot, which app it came from and what text it contained. In this example, the result was a very complete description.

However, there is no general answer to how good such a description is.


Microsoft's image description is advertised as describing images as well as a human would. With very simple subjects, the descriptions come very close to human descriptions. The more complex the image, however, the wider the gap.

Microsoft's image descriptions can be tried out in the Seeing AI app, among others. The special feature of this app is that you can explore recognised images with your finger. This gives an impression of where certain objects or pieces of text are located in the image. In addition, the app works with the live image from the camera, so in many cases there is no need to take a photo beforehand.

Envision AI

For the sake of completeness, Envision AI should be mentioned here as a counterpart to Seeing AI. The concept of this app is very similar to that of Seeing AI. However, it is available for both iOS and Android.

Envision AI uses different platforms for image recognition, so that theoretically the most advanced technology can always be used.


Data protection must not be neglected in this context. Many of the options for automatic image descriptions presented here use one or more cloud services. Keep in mind that personal data may be sent to third parties for evaluation without you knowing exactly what happens to it. It is advisable to find out before use so that you do not breach any contractual obligations.

Thankfully, Apple's image recognition in VoiceOver works offline. Seeing AI and Envision AI also offer at least fast text recognition locally on the smartphone without an internet connection. As computing power increases, it is conceivable that more functions will be handled locally. But only time will tell whether this will happen.
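The preference for local processing can be sketched in a few lines of Python. This is only an illustration under assumptions: the Describer type, the describe_image function and the allow_cloud flag are hypothetical names and do not correspond to the interface of any of the apps or services mentioned. The idea is simply to ask on-device recognisers first and to send the image to a cloud service only if the user has explicitly agreed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class Describer:
    """Hypothetical wrapper around one image description backend."""
    name: str
    runs_locally: bool  # True if the image never leaves the device
    describe: Callable[[bytes], Optional[str]]  # returns None if nothing was recognised


def describe_image(
    image: bytes,
    describers: Iterable[Describer],
    allow_cloud: bool = False,
) -> Optional[Tuple[str, str]]:
    """Return (backend name, description) from the first backend that succeeds.

    Local backends are always tried before cloud backends. Cloud backends
    are skipped entirely unless allow_cloud is True.
    """
    # sorted() is stable, so backends keep their relative order within
    # the "local" group and within the "cloud" group.
    ordered = sorted(describers, key=lambda d: not d.runs_locally)
    for describer in ordered:
        if not describer.runs_locally and not allow_cloud:
            continue  # no consent: the image must not be uploaded
        text = describer.describe(image)
        if text:
            return describer.name, text
    return None
```

Called without allow_cloud=True, the function returns None rather than uploading anything when no local backend recognises the image. Returning the backend name together with the description makes it possible to tell the user where the description came from.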

A look into the future

Despite grand promises, I do not believe that manual image description will become superfluous, at least not in the near future. When a guide dog is recognised as a cat, that's funny. But it also shows how error-prone such automatic systems are.

The strengths of such AI-based systems, however, already show where no image descriptions exist. For example, the iPhone 12 Pro models can determine the distance to objects or people in real time with astonishingly low latency. In a next step, such functionality will be built into glasses so that people can keep their hands free.

The company behind Envision AI has made a start with the Envision Glasses. Other companies will follow.

Nevertheless, the manually written image description remains the method of choice for publications of any kind. It gives you control over which aspects of the image are conveyed.

And this is what it looks like in practice

The following screencast is an example of what is currently possible.

Screencast: Automatic image descriptions

DeepL is a deep learning company that develops AI systems for languages. The company, based in Cologne, Germany, was founded in 2009 as Linguee, and introduced the first internet search engine for translations. Linguee has answered over 10 billion queries from more than 1 billion users.

Dennis Westphal

Dennis is an IT consultant at the Company for the Development of Things. His field is accessibility. Helpfully, Dennis has been blind since birth. He creates his screencasts with open source software.