AI visual search: searching for images with natural language

Reading time: 8 minutes

In the field of artificial intelligence (AI), new language models such as GPT or Gemini are currently bringing about groundbreaking changes. In professional media management (DAM), too, they are opening up possibilities that previously seemed unthinkable. One of these is visual search using natural language: with this technology, both images and videos can be searched by their content – without any need for metadata.

In this article, we explore the technical foundations and practical benefits of AI-based visual search. Because this is a completely new set of functions, no standardized term has been established yet. In English, the term AI visual search is most common; for brevity, we will mostly speak of visual search in the following. Strictly speaking, however, the correct generic term is neural search, because the technology, like other AI searches, is based on specially trained artificial neural networks. More on this in the section after next. But first, let us clarify what exactly is meant by the second important term: natural language.

What is natural language?

Natural language is simply human language, in both spoken and written form; fully developed sign languages are included as well. For our purposes, however, only the written form is relevant. Words can of course be spoken and gestures recorded, but it would amount to the same thing: the information always has to be converted into binary-coded characters before a machine can process it.

In practice, visual search accepts single words, word combinations, full sentences or sentence fragments as image queries. Beyond everyday language usage, there are no special rules to observe, so you are extremely flexible when formulating a search query. A query can also be very specific and might look like this, for example:

Photo of an elderly man with a sun hat sitting in a rowing boat and fishing

If a query returns no hits, gradually remove the less important search criteria. Example:

An elderly man sits in a boat and fishes

And so on. Capitalization is irrelevant, as is the order of sentence elements (as long as the meaning of the sentence is preserved). The sentences A man fishing at the lake and At the lake a man is fishing should therefore lead to the same search result.
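
This step-by-step relaxation of a query is easy to automate. Below is a minimal sketch in Python, assuming a hypothetical DAM client whose search() method accepts free-form natural language and returns a list of hits; all names here are illustrative:

```python
# Try the most specific query first, then fall back to broader formulations.
QUERIES = [
    "Photo of an elderly man with a sun hat sitting in a rowing boat and fishing",
    "An elderly man sits in a boat and fishes",
    "A man fishing",
]

def find_with_relaxation(client, queries=QUERIES):
    """Return the first query that yields hits, together with its results."""
    for query in queries:
        hits = client.search(query)  # hypothetical API call
        if hits:
            return query, hits
    return None, []
```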

AI visual search also works in less common languages (though not always with the same precision). Implementations already exist for more than a hundred languages – from Afrikaans to Zulu.

What are the technical basics of visual search?

Visual search uses large language models (LLMs) to analyze images (including video frames) in an entirely new way. Training the underlying artificial neural networks (ANNs) usually requires hundreds of millions of image-text pairs.

The goal is to use deep learning methods to capture the semantic relationships between depicted objects and associated texts (such as image descriptions or keywords) and to store them in vectorized form. To this end, the image and text information of each data pair is mapped into a common vector space, where the semantic proximity (or distance) between matching images and texts is recognized and reinforced during training. A model trained in this way can then generate suitable descriptions for newly presented images – even if the depicted objects were never explicitly used as training examples.
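
To make this more concrete, here is a minimal sketch in Python of how such a shared vector space is queried. It uses the openly available CLIP model via the Hugging Face transformers library; the file names are illustrative, and a real DAM system would precompute and store the image vectors:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Map a small image collection into the shared vector space.
image_paths = ["boat.jpg", "meadow.jpg", "soccer.jpg"]  # illustrative files
images = [Image.open(path) for path in image_paths]
with torch.no_grad():
    image_embs = model.get_image_features(
        **processor(images=images, return_tensors="pt")
    )
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)

# Map a natural-language query into the same space.
query = "an elderly man sits in a boat and fishes"
with torch.no_grad():
    text_emb = model.get_text_features(
        **processor(text=[query], return_tensors="pt", padding=True)
    )
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Cosine similarity expresses semantic proximity; the closest image wins.
scores = (image_embs @ text_emb.T).squeeze(1)
best = scores.argmax().item()
print(f"best match: {image_paths[best]} (similarity {scores[best].item():.3f})")
```

Because the image vectors do not depend on any particular query, they can be computed once and reused for every future search – which is exactly why no per-image metadata is needed.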

Indeed, a visual search developed in this way recognizes everyday objects (including well-known products and brands) with a high degree of reliability. Text within images, videos or documents is also captured reliably. Manually labeled training data sets therefore only need to be created when particularly specific objects have to be recognized.

Conclusion: with visual search, image content can be reliably found using text input in natural language – without metadata or additional training. The technology is therefore proving to be an absolute game changer in the field of AI-based image recognition.

3 advantages of visual search in the DAM sector

Several advantages of AI visual search for professional media management (digital asset management, or DAM for short) have already been mentioned. Here are the three most important ones:

  1. Increased efficiency: since images no longer have to be tagged and categorized manually, visual search saves a great deal of time and resources. Overall, AI-driven analysis and classification of image content speeds up work processes enormously.
  2. Improved findability: visual search maximizes the findability of image and video files, even for very specific or rare content. Users can enter precise search queries in natural language and receive relevant results without having to rely on manually added metadata.
  3. Accessibility: because search queries can be phrased in simple everyday language, users with very different levels of technical understanding can work with visual search. This lowers the barriers to entry and enables a wider audience to use DAM systems effectively.

In addition to these three main benefits, visual search also helps to improve collaboration within teams by enabling faster and more accurate delivery of required media content.

Best in combination

AI-based visual search will not replace metadata-based search in every area of application; in certain industries, it makes sense to combine the two technologies. Metadata will continue to play an important role wherever legal requirements or specific industry standards apply. Historical archives, scientific research institutions, museums and specialized stock photo agencies will probably never be able to work without human-verified metadata: some content can (so far) only be correctly described, evaluated and classified with specialist knowledge.

However, content that does not require academic expertise to be cataloged can now be indexed almost incidentally using AI-driven technologies. Users can then search hierarchically organized metadata and AI-generated vector data (which has no hierarchy) in parallel, which further increases findability and provides more flexibility overall.
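
As an illustration, here is a minimal sketch of such a parallel (hybrid) lookup in Python. It assumes each asset already carries both curated tags and a precomputed, L2-normalized embedding vector (as in the earlier sketch); all names and structures are illustrative:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class Asset:
    name: str
    tags: set[str]         # curated metadata (hierarchy flattened to tags)
    embedding: np.ndarray  # AI-generated vector, L2-normalized

def hybrid_search(assets, query_emb, required_tags=frozenset(), top_k=5):
    """Apply a hard metadata filter first, then rank by vector similarity."""
    candidates = [a for a in assets if required_tags <= a.tags]
    return sorted(
        candidates,
        key=lambda a: float(a.embedding @ query_emb),  # cosine similarity
        reverse=True,
    )[:top_k]
```

The hard metadata filter ensures that formal requirements (rights, industry standards) are respected, while the vector ranking contributes the flexibility of natural language.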

Solutions that combine classic metadata structures and AI-based search functions are therefore the new gold standard in the DAM sector.

Application examples

In practice, visual search (embedded in a DAM system) can provide greater convenience and efficiency in a wide range of industries. Some examples:

  • Professional sports industry: images of specific game scenes or emotional moments can be found quickly and easily after a day of competition by entering action descriptions in natural language. Time-consuming review of new photo and film material is no longer necessary. A meaningful search query could be, for example: Soccer players in red jerseys cheer after scoring a goal
  • Marketing and advertising industry: suitable campaign motifs can be found more quickly with visual search, because specific emotions or scenarios can be formulated directly in the search query. The full flexibility of natural language is at your disposal: A young woman lies on a green meadow and looks up at the sky with a slight smile. Now such a motif just has to be available :)
  • E-commerce: in the fashion industry, for example, customers could search specifically for visual product features to see which of the products on offer match their personal specifications and style preferences. This improves both the product presentation and the shopping experience. Example: leather boots for women, in green and with a zipper. Let's hope they are in stock!

Conclusion

AI-based visual search is currently revolutionizing the entire DAM industry. Providers of professional media management software will soon no longer be competitive without this technology. The efficiency gains from automatic indexing are so enormous that no one will be able to ignore them. Manual keywording will no longer be necessary in many companies. And although some industries will continue to rely on human-checked metadata, combined methods will save a great deal of time and resources there too.

The technology also makes digital assets easier to manage overall: because image and video content can be identified using natural language, even less tech-savvy users benefit greatly.

Test the Visual Search from teamnext

So far, only two providers in the GSA region have integrated this new technology into a DAM solution. One of these pioneers is teamnext from Kassel. Our AI engineers keep us at the cutting edge of development, and with our Visual Search we can already offer you all the functions described in this article.

Have we piqued your curiosity? You are welcome to experience our AI-based search in practice: with a free 14-day trial of the teamnext | Media Hub, you can try out the Visual Search extensively. Alternatively, you can book a personal online demo with one of our experts.

If you have any further questions, please contact our support team at any time. Further information can be found at teamnext.de/en/contact.
