ML’s elephant in the room: data labeling

13 September 2022 Posted by Niels Verleysen Digital innovation

From healthcare and manufacturing to space and marketing, machine learning is proving to be a great tool to reduce costs, save time, and increase revenue. Managing a machine learning project, however, will be one of the main challenges for businesses in the years to come. Once you’ve identified machine learning as your AI opportunity, the model rests on two primary building blocks: data and, often overlooked, data labels. Labeling those datasets can be a lot trickier than you might expect. Here are some tips to navigate this challenge.

Collecting datasets

In our previous blog, we defined different steps to discover your AI opportunities. Once you’ve identified the process you want to automate and the information you hope to obtain, you’ll need data to feed the model. These are the camera images, audio signals, text messages, or sensor measurements the model will analyze to answer your questions. Whether you are looking to predict the stock market or develop a medical application, low-quality, biased or unreliable data makes the task impossible. Take, for example, a study on blood oxygenation levels that fails to consider that the sensor response of a pulse oximeter differs between patients with different skin colors. A model built on such data would be significantly less likely to detect occult hypoxemia in Black patients than in white patients.
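One way to surface this kind of bias is to report the detection rate per patient group instead of a single overall score. The snippet below is a minimal sketch of that idea. The record layout, the group names and the toy values are made up for illustration and do not come from any real study.

```python
from collections import defaultdict


def recall_by_group(records):
    """Return the fraction of truly positive cases that were detected, per group.

    Each record is a tuple: (group, truly_positive, predicted_positive).
    """
    positives = defaultdict(int)
    detected = defaultdict(int)
    for group, truly_positive, predicted_positive in records:
        if truly_positive:
            positives[group] += 1
            if predicted_positive:
                detected[group] += 1
    return {group: detected[group] / positives[group] for group in positives}


# Hypothetical toy records, for illustration only.
records = [
    ("group_a", True, True),
    ("group_a", True, True),
    ("group_a", True, False),
    ("group_a", False, False),
    ("group_b", True, True),
    ("group_b", True, False),
    ("group_b", True, False),
    ("group_b", False, False),
]

rates = recall_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: detection rate {rate:.2f}")
```

A large gap between groups is a signal to revisit how the data was collected before tuning the model any further.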

Understanding the problem is indispensable to producing a valuable dataset. Your team should understand the variability that matters for the problem in practice. Teams often bias the dataset toward the most accessible data. A self-driving car whose algorithms are trained only on roads the developers happen to travel regularly is not robust. Not entirely unlike humans, ML algorithms find it challenging to assess unknown situations. Because a machine learning model can’t learn what lies outside its data, its outcomes in such situations become unpredictable. High volumes of information gathered in varied circumstances are therefore crucial to developing a trustworthy algorithm.
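A quick coverage check can make such gaps visible before training starts. The sketch below counts samples per recording condition and flags conditions that fall below a chosen share of the dataset. The condition names, the counts and the 10% threshold are hypothetical choices for illustration.

```python
from collections import Counter


def underrepresented(conditions, expected, min_share=0.1):
    """Return {condition: share} for expected conditions below min_share."""
    counts = Counter(conditions)
    total = len(conditions)
    flagged = {}
    for condition in expected:
        share = counts[condition] / total if total else 0.0
        if share < min_share:
            flagged[condition] = share
    return flagged


# Hypothetical dataset: one condition tag per recorded drive.
samples = ["highway"] * 70 + ["city"] * 25 + ["rural"] * 5
# Conditions the team decided the model must handle (an assumption).
expected = ["highway", "city", "rural", "snow"]

gaps = underrepresented(samples, expected)
for condition, share in gaps.items():
    print(f"{condition}: only {share:.0%} of the dataset")
```

Listing the expected conditions explicitly matters here: a condition that was never recorded does not show up in the counts at all, so it can only be flagged if someone wrote it down first.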

Finetuning the labeling process

The importance of data as a crucial building block in a machine learning project is gaining recognition. However, a machine learning project is built not only on raw, high-quality data but also on the data labels. They’re the ground truth of your model and represent the outcome your model should produce. Think of it like this: a parent won’t just point at items to show their baby, they will also say the name of each item. This way the baby learns to recognize and name these items in its surroundings. A machine learning algorithm learns in much the same way.

Obtaining labels can be complex and labor-intensive. Machine vision problems often require manual labeling for specific objects in each image. Depending on the application, the human labelers must have the appropriate qualifications to label medical scans, images of technical defects or any other specific image type.

Some things to consider during the labeling process:

  • Different labeling requirements come at different prices. Requesting only a classification label for the complete image costs a tenth as much as delineating all instances in the picture. The figure below illustrates different labeling approaches.
  • While developing the model, it pays off to evaluate its current weaknesses so you know which labels you need to improve. Knowing what the model struggles with allows you to maximize the return on new data.
  • When you outsource the labeling task to specialized companies, they become critical suppliers, and your team should treat them as such by monitoring their results closely. Too often, the perceived simplicity of the task makes people forget to define strict and well-thought-out quality metrics for the results.
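As one possible quality metric for outsourced labels, you could have an in-house expert relabel a small audit sample and measure how well the supplier agrees with it. The sketch below computes Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The label names and values are made up, and the choice of kappa is an assumption for illustration, not a metric prescribed by this article.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label lists of equal length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if both labelers assigned classes independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used a single, identical class
    return (observed - expected) / (1 - expected)


# Hypothetical audit sample: expert labels versus supplier labels.
reference = ["defect", "ok", "ok", "defect", "ok", "ok", "defect", "ok"]
supplier = ["defect", "ok", "ok", "ok", "ok", "ok", "defect", "defect"]

kappa = cohen_kappa(reference, supplier)
print(f"Cohen's kappa on the audit sample: {kappa:.2f}")
```

In this toy sample the raw agreement is 75%, but kappa is noticeably lower because much of that agreement could arise by chance when most items are "ok". Agreeing on a minimum acceptable value with the supplier up front is one way to make the quality requirement explicit.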

Illustration of different label types. Point annotation (top left) costs less than full mask labeling (bottom right). Squiggles (top right) and bounding box annotation are in between these extremes. (Source)

Maximizing the return

In the classical approach, a model requires labels that carry the same level of information as the desired result: to build a model that delineates objects, you need a dataset of delineated images. This is expensive, because a human has to provide you with ‘examples’ of this valuable output. Techniques are currently being developed to train models on weakly supervised data instead. They aim to use the latent information in cheaper, low-information labels to prepare models for high-information output.
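To make the notion of "level of information" concrete, the sketch below derives the cheaper label types from a full mask. Going from a mask to a bounding box, a point or an image-level class is trivial. The reverse direction is not, and that gap is what weakly supervised techniques try to bridge. The function name, the mask layout and the class name are hypothetical.

```python
def mask_to_weaker_labels(mask, class_name):
    """Derive lower-information labels from a binary mask (rows of 0/1)."""
    pixels = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not pixels:
        return {"image_label": None, "point": None, "bounding_box": None}
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return {
        # Image-level classification: only says the object is present.
        "image_label": class_name,
        # Point annotation: any single pixel inside the object.
        "point": pixels[len(pixels) // 2],
        # Bounding box as (top, left, bottom, right), inclusive.
        "bounding_box": (min(rows), min(cols), max(rows), max(cols)),
    }


# Hypothetical 4x5 mask of a single object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

labels = mask_to_weaker_labels(mask, "scratch")
print(labels)
```

Each step down discards information: the box no longer knows the object's outline, the point no longer knows its extent, and the image label no longer knows its location.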

Whether you are building an algorithm to read text documents or a self-driving car, the message is the same. You don’t just need data: you need a high volume of high-quality data covering all relevant circumstances, and you should definitely not forget to gather high-quality labels. Do this and you’ll be one step closer to the optimal solution for your next ML project. Interested in learning more? Subscribe to our AI blog mail or visit the AILab page.

This article was co-written by Jan Alexander.

Tags: Artificial intelligence, Machine & deep learning
