Transgender YouTubers had their videos grabbed to train facial recognition software

In the race to train AI, researchers are taking data first and asking questions later

About five or six years ago, one of Karl Ricanek’s students showed him a video on YouTube. It was a time lapse of a person undergoing hormone replacement therapy, or HRT, in order to transition genders. “At the time, we were working on facial recognition,” Ricanek, a professor of computer science at the University of North Carolina at Wilmington, tells The Verge. He says he and his students were always trying to find ways to break the systems they worked on, and that this video seemed like a particularly tricky challenge. “We were like, ‘Wow there’s no way the current technology could recognize this person [after they transitioned].’”

Ricanek turned to YouTube to find images of transgender people

To tackle the problem, Ricanek did what all good scientists do: he started collecting data. Like all AI systems, facial recognition software requires stacks of information to train on, and although there are a number of sizable, freely available face databases (ranging in size from thousands to millions of images), there was nothing documenting faces before and after HRT. So, Ricanek turned to the internet, a decision that would later prove to be controversial.

On YouTube, he found a treasure trove. Individuals undergoing HRT often document their progress and post the results online, sometimes keeping regular diaries, and sometimes making time-lapse videos of the entire process. “I shared my videos because I wanted other trans people to see my transition,” says Danielle, who posted her transition video on YouTube years ago. “These types of transition montages were helpful to me, so I wanted to pay it forward,” she tells The Verge.

The videos also happen to be gold for AI researchers, as each contains dozens of varied, true-to-life photos. As Ricanek wrote on a webpage for the dataset he would compile from the videos: “[It] includes an average of 278 images per subject that are taken under real-world conditions, and hence, include variations in pose, illumination, expression, and occlusion.”

But the problem is: do the people in these videos know or care that the personal journey they shared to help others is being used to improve facial recognition software?

“How is this even legal?”

Adam Harvey, an artist and researcher whose work examines privacy and technology, tells The Verge over email that this sort of data-scraping is “beyond common.” It was Harvey who found the HRT Transgender Dataset during research for an upcoming project examining exactly this sort of AI-training practice. He shared it on Twitter, where reactions were not good. “How is this even legal?” asked one user. “Not okay,” said another.

Ricanek wasn’t aware that his work was being discussed in this way when we reached out to him. He did, however, want to clarify a number of things about the research. First, that the dataset itself was just a set of links to YouTube videos, rather than the videos themselves; second, that he never shared it with anyone for commercial purposes (“Our job is just to illuminate what problem areas exist.”); and third, that he stopped giving access to it altogether three years ago.

“The reason for that is that it felt a little uncomfortable in the current climate to provide those things out there,” he told The Verge. “I have no inclination to distribute even the links any longer, for political reasons. People can use this for harm, and that was not my intent.” He says his team did try to contact individuals whose videos he listed and ask their permission “as a courtesy,” but admitted that if someone didn’t respond, they might have been included anyway.

Individuals were included in the dataset without their consent

Danielle, who is featured in the dataset and whose transition pictures appear in scientific papers because of it, says she was never contacted about her inclusion. “I by no means ‘hide’ my identity,” she told The Verge using an online messaging service. “But this feels like a violation of privacy.” She said she was gratified to know that there are limits on the use of the dataset (especially that it wasn’t sold to companies), but said this sort of biometric collection had “all sorts of implications for the trans community.”

“Someone who works in ‘identity sciences’ should understand the implications of identifying people, particularly those whose identity may make them a target (i.e., trans people in the military who may not be out),” she said. “Within the trans community, there's a non-trivial segment of people terrified by YouTube videos or other content that helps people figure out how to ‘spot the trans person.’”

For Harvey, this story is not surprising. “The lack of public discourse around data collection ethics has allowed researchers to continue amassing vast troves of biometric data from social media sources, namely Flickr and YouTube,” he says. These images can be given a Creative Commons (CC) license by default, allowing them to be downloaded freely and used to train facial-recognition systems even when the research is funded by for-profit companies.

And compared to other datasets, Ricanek’s is a minnow. The MegaFace dataset compiled by the University of Washington, for example, contains 4.7 million images of roughly 627,000 individuals — all taken from Flickr users. The project’s sponsors include Samsung, Intel, and Google, and the data itself is used by researchers from all over the world, whose work almost certainly feeds into paid products.

Example faces from the MegaFace dataset.

Harvey says that putting aside issues of legality and consent, there are “deeper ethical questions about the actual content in these datasets.” He points out that the two most common categories of images in MegaFace are “family” and “wedding,” which makes sense: who do we like to take pictures of more than our loved ones? A look inside the database, says Harvey, “reveals countless personal photos of people's homes, weddings, picnics, beach trips, selfies, and even photos of children. Most, if not all, people in these photos are unaware that biometric companies around the world are honing facial recognition algorithms on their friends, family, and children.”

Law enforcement and national security agencies are also interested in this data. Ricanek’s research is partly funded by the FBI and the Army (although he says the transgender dataset was never shared with any government agencies nor was it funded by them). Ricanek justified the research as a solution to a fantastical border threat. But a system using this kind of research could exacerbate the harassment and humiliation that transgender people already face at travel checkpoints.

“As academics, we see great challenges ... but behind those challenges are real people.”

“What kind of harm can a terrorist do if they understand that taking this hormone can increase their chances of crossing over into a border that’s protected by face recognition? That was the problem that I was really investigating,” he says. “I’m deeply apologetic for any type of pain this may have caused any people in these videos. That’s certainly not where I’m coming from. As academics, we see great challenges and we want to work on them, but behind those challenges are real people, who may be impacted in ways we have not comprehended.”

Harvey says there’s currently “little debate” about the ethics of this sort of data collection. It’s a complex topic, and although individuals might be outraged that their image is being used without permission, there’s little they can do about it.

There is pushback in some instances (like when a researcher scraped 40,000 selfies from Tinder without permission and posted the dataset online), but in the debate about the right and wrong ways to go about acquiring data, the loudest voices are those of big companies. This leads to situations like the one in the UK, where Google’s AI subsidiary DeepMind made an illegal deal to access medical records belonging to 1.6 million individuals.

In a way, we’re used to this deal. It’s the bargain that underpins so much of the modern internet: you give away information about your life, and in return you get free services. But in the age of AI, as the data gathered becomes more and more personal — not just your anonymized browsing habits, but pictures of you, your family, your personal moments — and the systems it creates are more and more controlling, it’s perhaps time to ask ourselves, once again, are we giving away too much?
