The whole concept behind the show “The Voice” is pretty simple. Take a bunch of artists and have them sing for four judges whose backs are turned. The tension builds as people of all shapes and sizes perform one by one, hoping to catch a famous musician’s attention with voice alone. The concept arose partly as a reaction to the body-shaming and bad blood surrounding other talent shows: for a long time, we watched people get booted from American Idol by Simon Cowell for not having “the look.” We just started to assume it was a game for the prettiest, not the most talented. Not exactly inspiring for those of us who don’t look like supermodels.
Based on the awards (or lack thereof) that any winner has taken home, I’d say they haven’t quite found the formula for award-winning musicians, but they have given chances to people who might not have won on talent alone, even though every show claims that’s the deciding factor.
That got me thinking: do we really think someone sounds better just because of how they look? How does our brain attach a face to a sound, or make meaning from something we see for only an instant? Our brains are built to make connections from one thing to another, often fallacious ones, and we don’t know any better. I was curious how this might play out in the interview room.
Traditionally, interviews are scary. That’s why we created interviewing.io. Shameless plug (I’ll only do one): it’s a platform where people can practice technical interviewing anonymously and, in the process, find jobs based on their interview performance rather than their resumes. Since we started, we’ve amassed data from thousands of technical interviews, and we routinely share some of the surprising stuff we’ve learned.
So I wanted to take a Voice-esque philosophy and put it into practice in hiring. We built real-time voice masking to investigate the magnitude of bias against women in technical interviews. In short, we made men sound like women and women sound like men and looked at how that affected their interview performance. We also looked at what happened when women did poorly in interviews, how drastically that differed from men’s behaviour, and why that difference matters for the thorny issue of the gender gap in tech.
When an interviewer and an interviewee match on our platform, they meet in a collaborative coding environment with voice, text chat, and a whiteboard and jump right into a technical question. Interview questions on the platform tend to fall into the category of what you’d encounter at a phone screen for a back-end software engineering role, and interviewers typically come from a mix of large companies like Google, Facebook, Twitch, and Yelp, as well as engineering-focused startups like Asana, Mattermark, and others.
After every interview, interviewers rate interviewees on a few different dimensions.
As you can see, we ask the interviewer if they would advance their interviewee to the next round. We also ask about a few different aspects of interview performance using a 1-4 scale. On our platform, a score of three or above is generally considered good.
Women historically haven’t performed as well as men…
One of the big motivators to think about voice masking was the increasingly uncomfortable disparity in interview performance on the platform between men and women. At that time, we had amassed over a thousand interviews, enough data to do some comparisons, and were surprised to discover that women really were doing worse. Specifically, men were getting advanced to the next round 1.4 times more often than women. Technical scores told a similar story: men on the platform had an average technical score of three out of four, compared to 2.5 out of four for women.
Despite these numbers, it was really difficult for me to believe that women were just somehow worse at computers, so when some of our customers asked us to build voice masking to see if that would make a difference in the conversion rates of female candidates, we didn’t need much convincing.

… so we built voice masking

I knew that in order to achieve true interviewee anonymity, hiding gender was something we’d have to deal with, but I put it off for a while because building a real-time voice modulator wasn’t technically trivial. Some early ideas included sending female users a Bane mask. When the Bane mask thing didn’t work out, we decided we ought to build something within the app. If you play the videos below, you can get an idea of what voice masking on interviewing.io sounds like. In the first one, I’m talking in my normal voice.
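The internals of our modulator aren’t covered here, but the basic mechanic behind any pitch shifter is easy to sketch. The toy below is pure Python with all names invented for illustration; it is not our production code, and a real-time masker would also time-stretch the audio (e.g. with a phase vocoder) so the clip doesn’t shorten when the pitch goes up. It shifts a synthetic 220 Hz “voice” up an octave by resampling, and verifies the shift with a brute-force DFT:

```python
import math

def resample(samples, factor):
    """Naive linear-interpolation resampling. factor > 1 raises pitch
    (and shortens the clip; production voice changers pair this with
    time stretching to preserve duration)."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def dominant_freq(samples, sample_rate):
    """Brute-force DFT; returns the frequency of the strongest bin."""
    n = len(samples)
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2):
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mag = re * re + im * im
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k * sample_rate / n

SR = 2000  # toy sample rate, kept tiny so the brute-force DFT stays fast
tone = [math.sin(2 * math.pi * 220 * t / SR) for t in range(SR // 4)]  # 220 Hz "voice"
octave_up = resample(tone, 2.0)  # one octave up, toward a higher-pitched voice
```

Resampling alone changes duration along with pitch, which is why real voice changers combine it with time stretching; the sketch sidesteps that because the test tone’s length doesn’t matter.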
Armed with the ability to hide gender during technical interviews, we were eager to see what the hell was going on and get some insight into why women were consistently underperforming.
The setup for our experiment was simple. Every Tuesday evening at 7 PM Pacific, we host what we call practice rounds. In these practice rounds, anyone with an account can show up, get matched with an interviewer and go to town. And during a few of these rounds, we decided to see what would happen to interviewees’ performance when we started messing with their perceived genders.
In the spirit of not giving away what we were doing and potentially compromising the experiment, we told both interviewees and interviewers that we were slowly rolling out our new voice masking feature and that they could opt in or out of helping us test it out. Most people opted in, and we informed interviewees that their voice might be masked during a given round and asked them to refrain from sharing their gender with their interviewers. For interviewers, we simply told them that interviewee voices might sound a bit processed.
We ended up with 234 total interviews (roughly 2/3 male and 1/3 female interviewees), which fell into one of three categories:
- Completely unmodulated (useful as a baseline)
- Modulated without pitch change
- Modulated with pitch change
You might ask why we included the second condition, i.e. modulated interviews that didn’t change the interviewee’s pitch. As you probably noticed if you played the videos above, the modulated one sounds fairly processed. The last thing we wanted was for interviewers to assume that any processed-sounding interviewee must be the opposite gender of how they sounded, so we threw in that condition as a further control.
After running the experiment, we ended up with some rather surprising results. Contrary to what we expected (and probably contrary to what you expected as well!), masking gender had no effect on interview performance with respect to any of the scoring criteria (would advance to next round, technical ability, problem solving ability). If anything, we started to notice some trends in the opposite direction of what we expected: for technical ability, it appeared that men who were modulated to sound like women did a bit better than unmodulated men and that women who were modulated to sound like men did a bit worse than unmodulated women. Though these trends weren’t statistically significant, I am mentioning them because they were unexpected and definitely something to watch for as we collect more data.
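For readers curious about the mechanics of checking a “no effect” result like this, a standard tool is a two-proportion z-test on advancement rates between conditions. The counts below are invented for illustration only (the per-condition breakdown isn’t published here); the sketch just shows the calculation:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test: is the 'advance to next round' rate in
    group A different from group B? Returns (z, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value via the standard normal CDF (math.erf is stdlib).
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical counts: 40 of 60 modulated interviewees advanced vs.
# 45 of 70 unmodulated. Numbers invented purely for illustration.
z, p = two_proportion_z(40, 60, 45, 70)
```

With counts like these, p comes out far above 0.05, i.e. no detectable difference, which mirrors the shape of the result described above. It also illustrates the sample-size caveat: a few hundred interviews can rule out a staggering bias but not a subtle one.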
On the subject of sample size, we have no delusions that this is the be-all and end-all of pronouncements on the subject of gender and interview performance. We’ll continue to monitor the data as we collect more of it, and it’s very possible that as we do, everything we’ve found will be overturned. I will say, though, that had there been any staggering gender bias on the platform, with a few hundred data points, we would have gotten some kind of result. So that, at least, was encouraging.
So if there’s no systemic bias, why are women performing worse?
After the experiment was over, I was left scratching my head. If the issue wasn’t interviewer bias, what could it be? I went back and looked at the seniority levels of men vs. women on the platform as well as the kind of work they were doing in their current jobs, and neither of those factors seemed to differ significantly between groups. But there was one nagging thing in the back of my mind. I spend a lot of my time poring over interview data, and I had noticed something peculiar when observing the behaviour of female interviewees. Anecdotally, it seemed like women were leaving the platform a lot more often than men. So I ran the numbers.
What I learned was pretty shocking. As it happens, women leave interviewing.io roughly seven times as often as men after they do badly in an interview. And the numbers for two bad interviews aren’t much better.
So, if these are the kinds of behaviours that happen in our microcosm, how much is applicable to the broader world of software engineering? Please bear with me as I wax hypothetical and try to extrapolate what we’ve seen here to our industry at large. And also, please know that what follows is very speculative, based on not that much data, and could be totally wrong… but you gotta start somewhere.
If you consider the attrition data points above, you might want to do what any reasonable person would do in the face of an existential or moral quandary, i.e. fit the data to a curve. An exponential decay curve seemed reasonable for attrition behaviour, and you can see what I came up with below. The x-axis is the number of what I like to call “attrition events”, namely things that might happen to you over the course of your computer science studies and subsequent career that might make you want to quit. The y-axis is what portion of people are left after each attrition event. The red curve denotes women, and the blue curve denotes men.
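As a concrete version of that curve: with a fixed quit probability per attrition event, the fraction of a cohort remaining after k events is (1 − p)^k, which is exactly an exponential decay in k. The per-event probabilities below are hypothetical, chosen only to preserve the roughly 7x ratio mentioned above, not taken from our data:

```python
# Survival after k independent "attrition events", assuming each event
# carries a fixed probability of quitting. The rates are hypothetical;
# the source only reports that women quit roughly 7x as often as men
# after one bad interview.
def remaining(per_event_quit_rate, k):
    """(1 - p)^k: fraction of the cohort left after k attrition events."""
    return (1 - per_event_quit_rate) ** k

P_MEN, P_WOMEN = 0.05, 0.35  # hypothetical per-event quit probabilities (7x ratio)

for k in range(0, 9, 2):
    print(k, round(remaining(P_MEN, k), 3), round(remaining(P_WOMEN, k), 3))
```

Even a modest per-event gap compounds brutally: under these made-up rates, about two-thirds of the men are still standing after eight events, versus only a few percent of the women. That compounding is the whole point of the curve.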
Prior art, or why maybe this isn’t so nuts after all
In a study investigating the effect of perceived performance on likelihood of subsequent engagement, Dunning (of Dunning-Kruger fame) and Ehrlinger administered a scientific reasoning test to male and female undergrads and then asked them how they did. Not surprisingly, though there was no difference in performance between genders, women underrated their own performance more often than men. Afterwards, participants were asked whether they’d like to enter a Science Jeopardy contest on campus in which they could win cash prizes. Again, women were significantly less likely to participate, with participation likelihood being directly correlated with self-perception rather than actual performance.
And of course, what survey of gender difference research would be complete without an allusion to the wretched annals of dating? When I told the interviewing.io team about the disparity in attrition between genders, the resounding response was along the lines of, “Well, yeah. Just think about dating from a man’s perspective.” Indeed, a study published in the Archives of Sexual Behavior confirms that men treat rejection in dating very differently than women, even going so far as to say that men “reported they would experience a more positive than negative affective response after… being sexually rejected.”
Maybe tying coding to sex is a bit tenuous, but, as they say, programming is like sex — one mistake and you have to support it for the rest of your life.
Prior art aside, I would like to leave off on a high note. I mentioned earlier that men are doing a lot better on the platform than women, but here’s the startling thing. Once you factor out interview data from both men and women who quit after one or two bad interviews, the disparity goes away entirely. So while the attrition numbers aren’t great, I’m massively encouraged by the fact that at least in these findings, it’s not about systemic bias against women or women being bad at computers or whatever. Rather, it’s about women being bad at dusting themselves off after failing, which, despite everything, is probably a lot easier to fix.
This article first appeared on RecruitingDaily on July 19th, 2016.