If you’ve been wondering what could knock out one of the United States’ three big cellular carriers’ ability to deliver calls and text messages — and keep it that way for most of an entire day — T-Mobile now has a partial answer that pertains to its extensive nationwide outage Monday.
The company issued an apology late Tuesday that you can read in its entirety below, and on Thursday, CTO Neville Ray provided a further explanation you’ll find at the bottom of this post.
The short version, if we’re reading this correctly: a fiber-optic circuit failed, and its backup circuit also failed, which caused a chain reaction that strained the network to the point that many calls and texts couldn’t make it through.
The longer version:
June 16th, 2020 6:23pm PST
Update on T-Mobile Voice and Text Performance
Every day we see the vital role technology plays in keeping us connected, and we know T-Mobile customers rely on our network to ensure they have connections with family, loved ones and service providers. This is a responsibility my team takes very seriously and is our highest priority. Yesterday, we didn’t meet our own bar for excellence.
Many of our customers experienced a voice and text issue yesterday, specifically with VoLTE (Voice over LTE) calling. My team took immediate action — hundreds of our engineers worked tirelessly alongside vendors and partners throughout the day to resolve the issue starting the minute we were aware of it. Data connections continued to work, as did our non-VoLTE calling for many customers and services like FaceTime, iMessage, Google Meet, Google Duo, Zoom, Skype and others allowed our customers to stay in touch. Additionally, many customers were able to use circuit-switched voice connections and customers on the Sprint network were unaffected. VoLTE and text in all regions were fully recovered by 10 p.m. PDT last night. I’m happy to say the network is fully operational… and we’re working day in and day out to keep it that way.
Our engineers worked through the night to understand the root cause of yesterday’s issues, address it and prevent it from happening again. The trigger event is known to be a leased fiber circuit failure from a third party provider in the Southeast. This is something that happens on every mobile network, so we’ve worked with our vendors to build redundancy and resiliency to make sure that these types of circuit failures don’t affect customers. This redundancy failed us and resulted in an overload situation that was then compounded by other factors. This overload resulted in an IP traffic storm that spread from the Southeast to create significant capacity issues across the IMS (IP Multimedia Subsystem) core network that supports VoLTE calls.
We have worked with our IMS (IP Multimedia Subsystem) and IP vendors to add permanent additional safeguards to prevent this from happening again and we’re continuing to work on determining the cause of the initial overload failure.
So, I want to personally apologize for any inconvenience that we created yesterday and thank you for your patience as we worked through the situation toward resolution.
Neville Ray, T-Mobile President of Technology
It’s not clear which third-party provider’s fiber circuit failed. There was a report on Monday that Level 3, one of the world’s major internet backbone providers, was experiencing an outage, but a spokesperson told TechCrunch otherwise.
On Thursday, Ray downplayed the outage during a presentation at the Wells Fargo Virtual 5G Forum, claiming that only 20 percent of T-Mobile’s calls were dropped because customers were able to complete other calls using mobile data instead.
“The whole thing was triggered by a common garden fiber outage,” he said, adding that it “exposed an issue in a routing issue configuration which led to one of these IP floods across the network,” which in turn “created all kinds of capacity and protection measures in the core architecture.”
“What we did to kind of get through that was to add a lot of capacity on the fly, after we figured out where the problems really existed,” Ray said.
“We have to do better,” said Ray, without offering any particular suggestions about how T-Mobile might prevent such an issue in the future. He characterized the outage as a coincidence: “It was a series of events that, in many ways, from the fiber outage, to the routing network, to the core vulnerability, all of those things happened simultaneously and that’s the outage we saw.”
“Never say never, outages are always part of being a technology company, but we apologize and we’re in a better place.”