Not all our eggs are in one basket: Limited impact from Microsoft Azure outage on S’pore govt ops


SINGAPORE – Most government operations and digital services were not affected by an 18-hour outage at cloud platform Microsoft Azure on Feb 8 as “not all our eggs are in one basket”, a high-level Singapore government official told The Straits Times.

The Government had made a calculated move to diversify its cloud vendor mix, as well as host critical and sensitive operations internally, to avoid being overly dependent on a single platform.

Commenting for the first time since the outage hit global organisations in South-east Asia, Mr Chan Cheow Hoe, government chief digital technology officer of the Smart Nation and Digital Government Office (SNDGO), said: “Most of our services are up despite the service degradation. The impact is limited.”

For instance, the Central Provident Fund (CPF) Board website was down but the CPF app kept working, because the two are hosted on different platforms, he noted.

“Our strategy to move to cloud is premised on the need for options to have better resiliency, agility in app development and security,” said Mr Chan. “We have a very calibrated, considered approach to spread the risks. Not all our eggs are in one basket.”

Besides using Microsoft Azure to run the operations of public agencies, the Government also taps Google Cloud and Amazon Web Services. However, highly sensitive and critical national systems such as those for air traffic control or defence are hosted internally on infrastructure managed by the Government.

SNDGO would not provide more details or examples of applications hosted on the cloud, citing security reasons.

“Our risk mitigation approach does not eliminate all risks,” said Mr Chan, noting that there are security and cost trade-offs to consider. If digital channels are unavailable, there are always physical counters for citizens to access services, he added.

American tech firms Google Cloud, Amazon Web Services and Microsoft Azure command about two-thirds of the world’s cloud infrastructure, with the rest dominated by China’s Alibaba and Tencent. The five cloud giants service every conceivable sector including finance, healthcare, energy, security, transportation, manufacturing, consumer goods and entertainment.

Companies buy cloud services to save up to 60 per cent in costs, as they do not have to buy, install and maintain dedicated hardware and software to host their applications.

Cloud users can also tap a global ecosystem of third-party apps built on open application programming interfaces that allow these apps to “talk” to one another. For instance, a developer who needs a payment service for an e-commerce store can connect to modules such as Stripe or PayPal. One who needs an integrated communication platform to talk to customers can plug into Twilio.

Also, building new applications will take a much shorter time – weeks, rather than months or years – as there is no need to write every single line of code.

Currently, close to 60 per cent of all Singapore public agency operations are hosted on the cloud services of the three American tech giants. Plans are on track to increase that to 70 per cent by end-2023, said Mr Chan.

Singapore’s status as a data centre hub is linked to the presence of cloud data centres here.

According to the Ministry of Trade and Industry, the country’s data centre investment turnover was $397.82 million in 2021 – 1.85 times the volume in the previous year. As at 2021, there were more than 70 operational data centres here with a total available capacity of about 1,000 megawatts of power.

Data from Structure Research shows that in 2020, the Asia-Pacific region accounted for 45 per cent of the world’s data centre revenues – the largest – amounting to US$24.1 billion (S$32.1 billion). North America came in second, accounting for 37 per cent of the market at US$19.8 billion, followed by Europe, the Middle East and Africa (17 per cent) and Latin America (1 per cent).
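As a rough cross-check of the figures above, the two stated revenue shares can each be used to back out the implied worldwide total. This is a minimal sketch using only the numbers quoted in this article; the variable names are illustrative, and the derived totals are estimates, not figures published by Structure Research.

```python
# Figures quoted in the article (2020, US$ billion and per cent share).
apac_revenue, apac_share = 24.1, 0.45
na_revenue, na_share = 19.8, 0.37
other_shares = {"EMEA": 0.17, "Latin America": 0.01}

# Implied worldwide total from each region: revenue divided by share.
total_from_apac = apac_revenue / apac_share
total_from_na = na_revenue / na_share

# The four shares should account for the whole market.
share_sum = apac_share + na_share + sum(other_shares.values())

print(f"Implied total (from Asia-Pacific): US${total_from_apac:.1f} billion")
print(f"Implied total (from North America): US${total_from_na:.1f} billion")
print(f"Sum of regional shares: {share_sum:.0%}")
```

Both calculations point to a worldwide market of roughly US$53.5 billion, which suggests the quoted shares and dollar amounts are consistent with one another.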

The Microsoft Azure outage was caused by a utility power surge in South-east Asia at 3.19am that tripped some cooling units in one of the company's data centres at an undisclosed location in the region. This prompted customers to question why the company did not have redundancy measures to overcome power trips.

Some customers in South-east Asia were taken offline for about 18 hours on Feb 8. These included the CPF Board, transit card issuer EZ-Link, the Esplanade and Nanyang Technological University in Singapore.

Users of Microsoft 365 – a suite of cloud-based productivity and collaboration tools – experienced intermittent disruptions for hours that day. In particular, users of Microsoft Teams in South-east Asia were unable to collaborate with one another on the platform for up to four hours.

One customer who spoke anonymously told ST: “Until now, Microsoft has still not given a full picture of what caused the disruption.”

Others questioned why Microsoft did not protect its own infrastructure against disruptions by “eating its own dog food”. They were referring to the “Azure availability zone” service that the tech firm sells to minimise the impact caused by outages. 

On the Microsoft Azure website, the “Azure availability zone” service is described as linking separate data centre sites with a round-trip latency of “less than two milliseconds”, so that the same application can run from three different sites. In the event that one site is down, the others can immediately kick in.
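The idea of running one application from several sites, so that another site takes over when one goes down, can be sketched in a few lines. This is a simplified illustration only, not Microsoft's implementation: the `Site` class, the `serve_with_failover` function and the site names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Site:
    """One data centre site running a copy of the application (hypothetical)."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        # An unhealthy site cannot serve requests at all.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{self.name} served {request}"


def serve_with_failover(sites: Sequence[Site], request: str) -> str:
    """Try each site in turn and return the first successful response."""
    errors = []
    for site in sites:
        try:
            return site.handle(request)
        except ConnectionError as exc:
            errors.append(str(exc))
    # Only reached when every site has failed.
    raise RuntimeError("all sites failed: " + "; ".join(errors))


sites = [Site("site-a"), Site("site-b"), Site("site-c")]
sites[0].healthy = False  # simulate one site going down, e.g. a cooling failure
result = serve_with_failover(sites, "GET /balance")
print(result)
```

In this sketch the request is still served as long as at least one of the three sites is healthy; the service fails only when all three are down at once.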

Asked why Microsoft did not shield its infrastructure against disruptions, a spokesman said: “We would like to reaffirm that we have resolved the connectivity issues some users in South-east Asia may have experienced. We will continue to investigate to establish the root cause and prevent future occurrences.”

This is the second Azure outage in two weeks. On Jan 25, customers globally experienced a five-hour intermittent outage, during which they could not access a range of applications, including productivity and collaboration tools. Microsoft attributed the disruption to a network engineer who changed the settings of a router using a command that had not been thoroughly vetted.

Almost all the world’s largest companies are on Azure, which has over 500 million active users.

On Sept 14, 2020, cooling loss at an Azure data centre in Britain forced Microsoft to shut down the facility, and brought down the country’s Covid-19 tracking website for more than 10 hours.
