Transgender YouTubers had their videos grabbed to train facial recognition software

Photo by Ian Waldie/Getty Images

About five or six years ago, one of Karl Ricanek’s students showed him a video on YouTube. It was a time lapse of a person undergoing hormone replacement therapy, or HRT, in order to transition genders. “At the time, we were working on facial recognition,” Ricanek, a professor of computer science at the University of North Carolina at Wilmington, tells The Verge. He says he and his students were always trying to find ways to break the systems they worked on, and that this video seemed like a particularly tricky challenge. “We were like, ‘Wow there’s no way the current technology could recognize this person [after they transitioned].’”

To tackle the problem, Ricanek did what all good scientists do: he started collecting data. Like all AI systems, facial recognition software requires stacks of information to train on, and although there are a number of sizable, freely available face databases (ranging in size from thousands to millions of images), there was nothing documenting faces before and after HRT. So Ricanek turned to the internet, a decision that would later prove controversial.

On YouTube, he found a treasure trove. Individuals undergoing HRT often document their progress and post the results online, sometimes keeping regular diaries, and sometimes making time-lapse videos of the entire process. “I shared my videos because I wanted other trans people to see my transition,” says Danielle, who posted her transition video on YouTube years ago. “These types of transition montages were helpful to me, so I wanted to pay it forward,” she tells The Verge.

The videos also happen to be gold for AI researchers, as each contains dozens of varied, true-to-life photos. As Ricanek wrote on a webpage for the dataset he would compile from the videos: “[It] includes an average of 278 images per subject that are taken under real-world conditions, and hence, include variations in pose, illumination, expression, and occlusion.”

But the problem is: do the people in these videos know or care that the personal journey they shared to help others is being used to improve facial recognition software?

Adam Harvey, an artist and researcher whose work examines privacy and technology, tells The Verge over email that this sort of data-scraping is “beyond common.” It was Harvey who found the HRT Transgender Dataset during research for an upcoming project examining exactly this sort of AI-training practice. He shared it on Twitter, where reactions were not good. “How is this even legal?” asked one user. “Not okay,” said another.

Ricanek wasn’t aware that his work was being discussed in this way when we reached out to him. He did, however, want to clarify a number of things about the research. First, that the dataset itself was just a set of links to YouTube videos, rather than the videos themselves; second, that he never shared it with anyone for commercial purposes (“Our job is just to illuminate what problem areas exist.”); and third, that he stopped giving access to it altogether three years ago.

“The reason for that is that it felt a little uncomfortable in the current climate to provide those things out there,” he told The Verge. “I have no inclination to distribute even the links any longer, for political reasons. People can use this for harm, and that was not my intent.” He says his team did try to contact individuals whose videos he listed and ask their permission “as a courtesy,” but admitted that if someone didn’t respond, they might have been included anyway.

Danielle, who is featured in the dataset and whose transition pictures appear in scientific papers because of it, says she was never contacted about her inclusion. “I by no means ‘hide’ my identity,” she told The Verge using an online messaging service. “But this feels like a violation of privacy.” She said she was gratified to know that there are limits on the use of the dataset (especially that it wasn’t sold to companies), but said this sort of biometric collection had “all sorts of implications for the trans community.”

“Someone who works in ‘identity sciences’ should understand the implications of identifying people, particularly those whose identity may make them a target (i.e., trans people in the military who may not be out),” she said. “Within the trans community, there's a non-trivial segment of people terrified by YouTube videos or other content that helps people figure out how to ‘spot the trans person.’”

For Harvey, this story is not surprising. “The lack of public discourse around data collection ethics has allowed researchers to continue amassing vast troves of biometric data from social media sources, namely Flickr and YouTube,” he says. These images can be given a Creative Commons (CC) license by default, allowing them to be downloaded freely and used to train facial-recognition systems even when the research is funded by for-profit companies.

And compared to other datasets, Ricanek’s is a minnow. The MegaFace dataset compiled by the University of Washington, for example, contains 4.7 million images of roughly 627,000 individuals — all taken from Flickr users. The project’s sponsors include Samsung, Intel, and Google, and the data itself is used by researchers from all over the world, whose work almost certainly feeds into paid products.

Example faces from the MegaFace dataset.

Harvey says that putting aside issues of legality and consent, there are “deeper ethical questions about the actual content in these datasets.” He points out that the two most common categories of images in MegaFace are “family” and “wedding.” Which makes sense, as who do we like to take pictures of more than our loved ones? A look inside the database, says Harvey, “reveals countless personal photos of people's homes, weddings, picnics, beach trips, selfies, and even photos of children. Most, if not all, people in these photos are unaware that biometric companies around the world are honing facial recognition algorithms on their friends, family, and children.”

Law enforcement and national security agencies are also interested in this data. Ricanek’s research is partly funded by the FBI and the Army (although he says the transgender dataset was never shared with any government agencies nor was it funded by them). Ricanek justified the research as a solution to a fantastical border threat. But a system using this kind of research could exacerbate the harassment and humiliation that transgender people already face at travel checkpoints.

“What kind of harm can a terrorist do if they understand that taking this hormone can increase their chances of crossing over into a border that’s protected by face recognition? That was the problem that I was really investigating,” he says. “I’m deeply apologetic for any type of pain this may have caused any people in these videos. That’s certainly not where I’m coming from. As academics, we see great challenges and we want to work on them, but behind those challenges are real people, who may be impacted in ways we have not comprehended.”

Harvey says there’s currently “little debate” about the ethics of this sort of data collection. It’s a complex topic, and although individuals might be outraged that their image is being used without permission, there’s little they can do about it.

There is pushback in some instances (like when a researcher scraped 40,000 selfies from Tinder without permission and posted the dataset online), but in the debate about what is the right and the wrong way to go about acquiring data, the loudest voices are those of big companies. This leads to situations like in the UK, where Google’s AI subsidiary DeepMind made an illegal deal to access medical records belonging to 1.6 million individuals.

In a way, we’re used to this deal. It’s the bargain that underpins so much of the modern internet: you give away information about your life, and in return you get free services. But in the age of AI, as the data gathered becomes more and more personal — not just your anonymized browsing habits, but pictures of you, your family, your personal moments — and the systems it creates are more and more controlling, it’s perhaps time to ask ourselves, once again, are we giving away too much?


If you’re not paying for the product, you are the product.

Doesn’t apply in this case. The videos are being scraped by researchers, whether you’re paying for your service or not – if it’s publicly available, someone’s training a bot with it.

YouTube videos are copyright protected. I’m not sure how research use interacts with that… Especially when used to develop/train private or corporate AIs.

From YouTube Terms of Service:
[…]You also hereby grant each user of the Service a non-exclusive license to access your Content through the Service, and to use, reproduce, distribute, display and perform such Content as permitted through the functionality of the Service and under these Terms of Service. The above licenses granted by you in video Content you submit to the Service terminate within a commercially reasonable time after you remove or delete your videos from the Service. […]
As long as the video is online, everybody can use it without breaking copyright. At least, this is how I understand that part.

I read that as being that anyone can view the content as long as they abide by the TOS. Not sure taking screenshots and publishing those in research papers counts.

I get that there are some tricky issues with the sensitive nature of these videos.
But you are publishing these videos and pictures on the internet for anyone to use. You lose a lot of privacy by doing that, and I don’t think there should be anything ethically wrong in using these when people are freely putting them out there.

But there is a complication to that line of thought. People might be OK with strangers perusing their pictures, and some might even be OK with the occasional download, but once AIs are trained on them, you can potentially be recognized and followed everywhere there is a video feed. As these systems mature, the impact on everyone’s privacy, beyond sharing some vacation pictures, could be off the scale.

I do kind of get that, but people need to seriously think about what they publicly post.
People have been warned for years now not to post anything online that they don’t want to be public domain. While there are risks of AI doing this, we are well past that point and people need to realize that.

Yes they do need to think about that, but as I wrote in my post below, this is not the question in this article.
This is about research ethics.

This might be semantics confusion, but YouTube isn’t public domain, those videos are copyrighted to the creator.

For research purposes, it might as well be, since machine learning is just analysis and falls under fair use.

Also, you can delete a YouTube video. When someone takes a picture of that video and publishes it without your knowledge, you cannot simply delete it.

Sure, some say "Well think about that before, there are always bad people". But this is a question about research ethics, not what kind of people are out there.

It said in the article they only included links, not the actual video, so it’s hard to see an ethical issue. They were posted publicly for all to see, and the dataset only included the address.

Of course it’s creepy, but what do you expect posting stuff publicly? You should see the notices on the walls at my work – graffiti all over them. The difference is quantity, not quality.

If they aren’t using trans people, then I think terrorists will be surprised if he (since most are men) goes on HRT. As someone who is agender trans (and relates to trans women) and has many trans women friends, I can say it’s going to be rough. So think about what happens when someone who isn’t trans goes on HRT.

First of all, testosterone is more of a bully hormone than estrogen, so one will have to be on HRT for at least a year before seeing any real changes. Plus, there is all the voice training one will need to do, because unless they were on puberty blockers or started HRT around the age of 15-16, their voice really isn’t going to change. I don’t know many trans people on E and T blockers who don’t have dysphoria (or a serious case of it), but around the 5-6 month mark on HRT is when depression and emotions can really fuck with a person, and this can last for at least 6 months. Then there are all these essentialist notions that trans women have to meet in order to pass, like shaved legs, wearing dresses, and so forth. Let’s also not forget that trans people, more specifically trans women of color, are at higher risk of violence and harassment. And if they are from a country or state that doesn’t offer protection, they could be sent to prison or, worse, killed. Hell, in 49 states (California being the exception) you can use trans or gay panic (saying the person disclosed they are LGBTQ, and that you panicked and killed them) as a defense for killing someone.

My point is, it is not easy to be trans (especially a trans woman) on HRT. Transitioning with HRT alone can take 6 months if they are lucky, up to about 2 years for some, and even then some don’t have passing privilege (not too keen on the term).

trans people in the military who may not be out

I understand the privacy implications, but this seems like a poor example. Why would someone not out be outing themselves as transgender on a public YouTube video?

To a computer scientist, this seems super obvious. People really need to be educated about that.

But this is nothing really. There are a few companies selling databases with way more info about you, including political leanings and sexual preferences. Such databases are used by politicians (and not only nasty ones but also nice guys like Obama). And you could use this data to make all kinds of predictions, linking the databases together and such. A lot of organisations probably have that capacity already.

And those of us who aren’t transgender are also having our faces used to train AI. Yawn.

Exactly, I don’t see how a trans person having their pictures used to train AI is any different from a non-trans person. We’re all just people. As for the moral implications of having your pictures used to train AI, I personally wouldn’t have any issue with it; it’s just scientific research.

Gimme a break. If you post a video on YouTube, you are putting yourself out there. This is like people who tweet something objectionable and think they are immune to consequences like being fired from their job.

If they plan to publish the results, then it’s research. If it’s research, then they need to obtain consent from the subjects.

But a system using this kind of research could exacerbate the harassment and humiliation that transgender people already face at travel checkpoints

It would reduce the humiliation at checkpoints. If facial recognition properly recognizes a person, then there is no human interaction required. If facial recognition software says you aren’t who you say you are because it isn’t advanced enough, then prepare to answer a lot of questions.

There are dystopian theories for any kind of database of people, but for things like checkpoints it will make their lives much easier.

The important thing is that none of these things have superior rights over us, so if a border agent claims we are not who we say we are, they have to prove it, not us.

By the way, the same should be for voting or any other supposedly permissioned right — we self testify to our eligibility, and anyone who challenges us has to prove it, not us.

You all with me?
