Wednesday, July 08, 2015

No, it's not appropriate to load test on your citizens in production - particularly when it's a critical service

The last week has seen a range of major issues for the Australian Government's new MyTax service.

As reported across both traditional and social media, people using MyTax to file their tax returns have experienced shut-outs, had the process freeze when they were almost complete and had it fail to autofill their pre-saved details.

MyTax is an online version of the eTax software which had been the primary way for people to digitally complete their tax returns for the last fifteen years. eTax improved year on year and had enormous take-up. In all respects it was a major success for the ATO.

This is the first year the Australian Taxation Office has deployed the MyTax system and integrated it with MyGov. While the intention was, and is, good - to give Australians a single way to validate themselves with multiple government agencies - the implementation in this case hasn't withstood the real world.

This isn't a unique experience and it isn't limited to government. We've seen it with certain banking services, with retailers (particularly on a certain contrived Australian online shopping day each year), with A-grade games (such as SimCity) and with a range of other online services such as Apple Maps.

In fact, this type of issue is relatively rare in government compared with the private sector. The last major international example was healthcare.gov in the US, and the last I recall in Australia was the MySchools site launch.

This type of issue will happen from time to time. Unforeseen bugs or network issues, denial of service attacks or other environmental issues can bring down even the most robust service, particularly at launch.

In every one of these cases there's a backlash from customers - and in every one of these cases the organisation responsible is judged based on how they manage and recover from the disaster.

In the MyTax case, while the ATO were probably aware of the risks, and may even have learnt some lessons from several of the issues highlighted above, it appears they're still struggling to manage and recover from the situation.

When asked about the situation, the CIO of the ATO, Jane King, wrote, as reported in the Sydney Morning Herald, that "Capacity planning and testing was completed as part of the rolling out of the new digital design, however due to the complexity of our environment, production is always the real test."

I read this as her saying that while they did conduct testing, they were actually relying on real citizens, at real tax time, to fully evaluate how the MyTax system would perform.

Just as the UTS professor John Leaney, quoted in the SMH article above, says - this type of statement just isn't good enough.

"We're not in the 1950s; we're not even in 1990s, we've learnt a lot and from what we've learnt we apply the techniques for proper capacity modelling," Leaney said. "There should have been much better testing; it's not something you should learn the hard way on a major government system."

The ATO needs to do better at risk planning around situations like this. It needs to test capability properly and not hide behind the 'too many users' defence.

Government agencies need to carefully watch and learn from this experience - and learn the right lessons.

The first lesson is to conduct appropriate capacity testing. Look at the ABS's implementation of eCensus and the level of testing and resilience it put in place the first time eCensus was used in 2006. I attended a great presentation by the ABS on the topic, which highlighted the risk mitigation steps they'd taken - from capacity testing through to multiple redundant systems, real-time monitoring with developers on standby, and fallback manual systems.
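To make the idea of capacity modelling concrete, here is a minimal back-of-the-envelope sketch using Little's Law (average requests in flight = arrival rate x average time in the system). The function name, the headroom allowance and every figure below are my own illustrative assumptions, not ATO or ABS numbers, and a real capacity model would go much further than this.

```python
import math


def required_servers(peak_requests_per_sec, avg_response_sec,
                     concurrent_per_server, headroom=0.5):
    """Rough estimate of servers needed at peak load (hypothetical helper)."""
    # Little's Law: average requests in flight = arrival rate * time in system
    in_flight = peak_requests_per_sec * avg_response_sec
    # Allow extra headroom for bursts above the modelled peak (assumed 50%)
    target = in_flight * (1 + headroom)
    # Round up, since you can't deploy a fraction of a server
    return math.ceil(target / concurrent_per_server)


# Purely illustrative figures, not real tax-time traffic
servers = required_servers(
    peak_requests_per_sec=2000,
    avg_response_sec=0.5,
    concurrent_per_server=100,
)
print(servers)
```

An estimate like this only sets a target. Load testing before launch is still needed to find out whether the system actually copes with that many concurrent requests.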

The second lesson is to not release major systems at a time when they are going to come under a huge load. Release a new tax system in February or March, or after tax time in October, giving time to shake out the production system and address issues before it hits peak load.

The third lesson is to avoid releasing major systems all at once. Instead release smaller, but useful, services and progressively integrate them into a major new service, testing each carefully as you go. This is how Facebook totally replaced its back-end without any disruption to people's use of the service - modularly upgrading aspects of the service until it was completely done.
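One common way to release progressively is to expose a new service to a small, stable percentage of users and widen that percentage as confidence grows. The sketch below is a generic illustration of that technique, not a description of how Facebook or the ATO does it; the function name and the user ID format are hypothetical.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    # Hash the ID so each user lands in the same bucket (0-99) every time
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    # Users in buckets below the threshold get the new service
    return bucket < percent


# Hypothetical user ID; start with 5% of users and raise the figure over time
if in_rollout("user-12345", percent=5):
    print("new service")
else:
    print("existing service")
```

Because the bucket is derived from the user ID, anyone included at 5% stays included when the rollout widens to 20% or 50%, so people aren't bounced between the old and new systems.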

The final lesson is to plan your recovery before your system fails. Design a failover plan for what happens if the system doesn't work for people, including a manual solution if required. The ATO should direct anyone with issues to a hotline where they can complete their tax return over the phone, or via screen sharing, so no-one is left waiting for days or in a position of financial distress because their tax refund is delayed.

I feel for the ATO (particularly their ICT team) and don't blame them for the issues they're having with MyTax; however, I do hold the agency responsible for how it recovers from this disaster.

They need to stop defending their implementation of MyTax and focus on ways to meet citizen needs - even outside the MyTax system - to ensure that the 'tax returns get through'.

Otherwise this issue could turn into another Apple Maps-style disaster, or even worse, as there's no 'competitor' to the ATO that citizens can turn to in order to complete their tax returns. At least, not yet...
