The artificial intelligence powering computer vision is much more likely to mislabel protective masks as things like duct tape, jewelry and gags when women are wearing them than when men are, according to a new experiment from a data scientist at Wunderman Thompson.
The agency’s director of data science, Ilinca Barsan, recently published a blog post in which she ran 530 images of people wearing masks, split evenly between men and women, through three major artificial intelligence-powered object recognition systems from Google, Microsoft and IBM.
The experiment evolved out of a side project Barsan had been exploring, in which she would rank different street corners by how many people street camera feeds detected wearing masks.
While each of the systems tested had its own quirks in terms of what kinds of objects it mistook the masks for, the disparity between genders was a consistent through-line.
Google’s Cloud Vision tool, for instance, identified 28% of women’s masks as duct tape and just 19% correctly as personal protective equipment (PPE). Men’s masks were more likely to be mistaken for facial hair (27%), but 36% were correctly identified and just 15% were thought to be duct tape.
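To put the Cloud Vision figures side by side, the short Python sketch below computes the percentage-point gap and the women-to-men ratio for two of the labels reported above. It uses only the percentages quoted in this article; the helper name `gap_and_ratio` is a hypothetical one chosen for illustration and is not taken from Barsan's post.

```python
# Percentages as reported in the article for Google's Cloud Vision.
reported = {
    "duct tape": {"women": 28, "men": 15},
    "PPE (correct)": {"women": 19, "men": 36},
}


def gap_and_ratio(women_pct, men_pct):
    """Return (percentage-point gap, women-to-men ratio) for one label.

    Hypothetical helper for illustration only.
    """
    return women_pct - men_pct, women_pct / men_pct


for label, pct in reported.items():
    gap, ratio = gap_and_ratio(pct["women"], pct["men"])
    print(f"{label}: {gap:+d} points, ratio {ratio:.2f}")
# duct tape: +13 points, ratio 1.87
# PPE (correct): -17 points, ratio 0.53
```

By this arithmetic, women's masks were tagged as duct tape nearly twice as often as men's, and correctly tagged as PPE about half as often.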
Barsan illustrates why this disparity is more than an innocuous foible with a simple Google image search: The query “duct tape man” produces mostly pictures of men in full-body duct tape garb, while the same search for “duct tape woman” shows mostly women with duct tape gags.
“Neural networks trained on biased textual data will at best sooner rather than later embarrass its makers, and at worst they will perpetuate harmful stereotypes,” Barsan wrote. “The same holds true for machines that learn to see through a skewed lens.
“That lens, of course, is our very own culture, offline as well as online—a culture in which violence against women, be it fictional or real, is often normalized and exploited,” she added.
Meanwhile, Microsoft’s Azure system was more likely to misinterpret women’s masks as fashion accessories (40%) or lipstick (14%), with just 5% correctly identified. Men’s masks were labeled variously as fashion accessories (13%) and beards (12%), and were correctly identified as masks just 9% of the time.
IBM’s system misidentified women’s masks as “restraint chains” or “gags” 23% of the time, and correctly identified just 5% of them. Men’s masks were correctly identified at a rate of 12%, and were mislabeled as restraint chains or gags 10% of the time each.
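Rates like the ones reported for all three services come down to a simple tally: for each group of images, count how often a given label appears. The Python sketch below shows one way such a tally could be done. It is an illustration under stated assumptions, not Barsan's code; the function name `label_rates` and the four sample records are invented, and the labels would in practice come from whichever vision service is being tested.

```python
from collections import Counter, defaultdict


def label_rates(records):
    """Compute the share of each group's images that received each label.

    records: iterable of (group, labels) pairs, one pair per image.
    Returns {group: {label: fraction of that group's images}}.
    Hypothetical helper for illustration only.
    """
    totals = Counter()            # images seen per group
    hits = defaultdict(Counter)   # images per group carrying each label
    for group, labels in records:
        totals[group] += 1
        for label in set(labels):  # count a label once per image
            hits[group][label] += 1
    return {
        group: {label: n / totals[group] for label, n in hits[group].items()}
        for group in totals
    }


# Invented toy data, NOT the study's images or results.
sample = [
    ("women", ["mask", "duct tape"]),
    ("women", ["duct tape"]),
    ("men", ["mask", "beard"]),
    ("men", ["mask"]),
]
rates = label_rates(sample)
print(rates["women"]["duct tape"], rates["men"]["mask"])
```

Comparing the resulting shares across groups, label by label, gives the kind of side-by-side percentages quoted above.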
Numerous examples have been documented of computer vision, facial recognition and other types of AI reflecting human bias and discrimination embedded in the data on which they are trained. The problem is further complicated by the fact that many of these systems are black-box algorithms in which it’s difficult to pinpoint specific problems.
“It’s fascinating, albeit not surprising, to realize that for each of the three services tested, we stumbled upon gender bias when trying to solve what seemed like a fairly simple machine learning problem,” Barsan wrote. “That this happened across all three competitors, despite vastly different tech stacks and missions, isn’t surprising precisely because the issue extends beyond just one company or one computer vision model.”