Wednesday, November 21, 2012

Having a website crash due to high traffic is a failure of management, not load

Today has provided an interesting lesson for several organisations, with the crash of both the David Jones and ClickFrenzy websites in Australia.

But first, some background.

ClickFrenzy is a new 24-hour sale for Australian online retailers starting from 7pm on Tuesday 20 November.

Based on the US 'Cyber Monday' sale, which now attracts over 10 million buyers, ClickFrenzy was designed to entice Australian online shoppers to buy from local online retailers by offering massive discounts on product prices for a short period of time.

The event was announced over a month before it was due to start and has been promoted through newspapers, online and in some retail stores, with the ClickFrenzy team expecting thousands of shoppers to log on, likening it to a "digital boxing day sale".

I've kept an eye on the ClickFrenzy site and signed up to receive an email alert when the sale began.

Just before the sale started I hopped back onto the ClickFrenzy site to see how it was going, and saw only a basic page of text, with no graphics or formatting. Puzzled, I tried reloading - and the site wouldn't load at all.

That's when I hopped onto Twitter and learnt from the #clickfrenzy hashtag that the ClickFrenzy site had already crashed from the load and no-one had any idea when it would be back online.

This meant that the list of participating retailers (many of whom had been kept secret) was inaccessible. No shopper knew who had the specials, meaning few sales could occur. Of the retailers that were known to be participating, two-thirds of their sites crashed too (such as Priceline and Myer).

In competition with ClickFrenzy, David Jones had decided to run its own independent 24-hour sale over a similar time period. Their sale, named 'Christmas Frenzy', was to be run from their main website.

How did their launch go? Their site also crashed, and was down for several hours, taking down not only the shopping site but all their corporate information.


So we had two major online sales on the same day from Australian retailers, and both experienced crashes due to the volume of traffic.

What was to blame? Both claimed the failure was due to unprecedented demand. So many people tried to get onto both sites that their servers could not cope (the same reason given for the mySchools website issues at launch in 2010 and the CFA website issues during the Victorian fires in 2009).

Let's unpick that reasoning.

The world wide web is twenty years old. Amazon.com is 18 years old. The US 'Cyber Monday' sale is six years old.

David Jones is an experienced retailer with significant IT resources, and has been operating an online store for some time. Their Christmas Frenzy sale was planned and well promoted.

Click Frenzy is being run by experienced retailers as well. They built an emailing list of people interested in the event and also widely promoted the sale. The retailers supporting them are large names and operate established online shopping sites as well.

In both cases the organisers had a wealth of experience to draw on: the growth of Amazon, the US Cyber Monday sales, their own website traffic figures and email list sign-ups, not to mention a host of public examples of how to manage web server load well, and badly, from media sites, social networks and even government sites (such as the mySchools and CFA examples above).

There are many IT professionals with experience in managing rapid load changes on web servers.

There are scalable hosting solutions which respond almost instantly to fast-increasing loads, such as during an emergency or with breaking news, and 'scale up' the site to support much larger numbers of simultaneous users. (Though in the case of Christmas Frenzy and Click Frenzy a large increase in load was expected, rather than unexpected.)
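
As a rough illustration of what 'scaling up' means, here is a minimal sketch of the kind of rule a scalable host might apply. The function name, the 25% headroom and the minimum of two servers are assumptions made for this example, not any particular provider's settings.

```python
import math

def servers_needed(requests_per_second, capacity_per_server,
                   headroom=0.25, minimum=2):
    """Return how many servers to run for the current load.

    headroom is the fraction of spare capacity kept in reserve (assumed
    25% here) so a surge can be absorbed while new servers are starting.
    """
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    target = requests_per_second * (1 + headroom) / capacity_per_server
    # Never drop below the minimum, and always round up to whole servers.
    return max(minimum, math.ceil(target))
```

If each server can handle 500 requests per second, then servers_needed(10000, 500) asks for 25 servers, while a quiet period falls back to the minimum of two.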

There are even automated processes for testing how much load a website will be able to bear by simulating the impact of thousands or millions of visitors.
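
To give a feel for what such a test involves, here is a minimal sketch in Python using only the standard library. It is not how either sale was tested; the function names and the summary it returns are assumptions for illustration, and real load tests use dedicated tools running from many machines.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Request one page and return (succeeded, seconds_taken)."""
    start = time.perf_counter()
    try:
        with urlopen(url, timeout=timeout) as response:
            response.read()
            ok = 200 <= response.status < 400
    except Exception:
        # Timeouts, refused connections and HTTP errors all count as failures.
        ok = False
    return ok, time.perf_counter() - start

def load_test(url, total_requests, concurrency, request=fetch):
    """Send total_requests to url with up to `concurrency` in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(request, [url] * total_requests))
    durations = sorted(seconds for _, seconds in results)
    failures = sum(1 for ok, _ in results if not ok)
    return {
        "requests": total_requests,
        "failures": failures,
        "slowest_seconds": durations[-1] if durations else 0.0,
    }
```

Pointing load_test at a staging copy of a site you own, and raising the concurrency until failures appear, gives a first estimate of where the site starts to break. Never run it against a site you don't control.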


In other words, there's no longer any technical reason why any organisation should have their website fail due to expected or anticipated load.


Load is not a reason, it is a justification.

We have the experience, knowledge and technology to manage load changes.

What the Click Frenzy and Christmas Frenzy failures illustrate is that some organisations fail to plan for load. They haven't learnt from the experience of others, don't invest in the right infrastructure and may not even test their sites.

They are effectively crossing their fingers and praying that their website won't crash.

A website crashing when it receives a high level of load that could be expected or planned for is crashing due to a failure of management.


The next time your agency's management asks you to build a website which is expected to have a big launch or large traffic spikes, ask them if they're prepared to invest the funds necessary for a scalable and tested website, built on the appropriate infrastructure to mitigate the risk of sudden large increases in traffic.

If they aren't, then let them know to cross their fingers and pray - and that a website crash due to high traffic is a failure of management, not load.

You might even get a Downfall parody video to memorialise the failure - as Click Frenzy received within two hours of their launch crash.

9 comments:

  1. Surely this stuff is well enough understood that we should be looking for other reasons for the failure? Conspiracy theory time?

    My current theory is that this was a deliberate strategy on ClickFrenzy's part to play up the demand for the sale. Many in the general public may buy this line - but as you have pointed out, there is really no technical excuse for this type of failure with the range of hosting available today.

    If this was some sort of PR move, it has certainly been successful in raising (negative) awareness - it's likely to feature in tv and radio news all day today. Unfortunately, it may have the effect of further delaying the emergence of an Australian online retail sector.

    My next question would be - did retailers pay to participate and do they think they got what they paid for?

  2. It's been an interesting period of incredible drama. Retailers were paying for the specific link. No doubt a few very angry ones out there today.

    There is a need to ensure that we support those retailers who adopted this as a key channel for brand awareness and potential growth opportunity in the market. This was not their fault (not that I see any form of blame in the context of the above commentary). I felt very sorry for all of them and especially those without any social media following as this appeared to be a quick way of reassuring consumers the sale was still in effect on their sites.

    As a courtesy to them, I believe that retaining a fair assessment of them as independent from the ClickFrenzy platform is an essential component of the next phase of this event. The retailers deserve an opportunity to recover.

    Bob Keen
    MD: Trusted Websites Pty Ltd

    1. Hi Bob,

      There's no 'need' to support retailers adopting online as a key channel. Retailers exist to service a community need and make a profit doing so. Their reward is their profit and it varies according to how well they meet community needs.

      Retailers also don't 'deserve an opportunity to recover'.

      Amazon is 18 years old. Most large Australian retailers haven't invested in this area substantially even today. If they lose out to overseas players because they've not taken online seriously that is not something that Australian shoppers should take any responsibility for.

      I feel no sympathy for retailers who have ignored environmental changes, any more than I feel sorry for retailers or enterprises from previous decades who missed major environmental shifts - such as Kodak.

  3. There's no excuse for not properly testing your site these days. It's a trivial time/money investment to test your site by spinning up a bunch of EC2 (for example) instances to stress your site and expose any design/implementation flaws.

  4. ...and The Age in Melbourne gave you the last word, Craig.

    If I had my guess, I'd say underinvestment in a scalable infrastructure with enough initial capacity to withstand the transaction load. 1 million visitors in 24 hours? No one designs a solution architecture to that sort of metric. Peak transactions per second, with a bit of hypothesis and behavioural forecasting to cover risk scenarios.

  5. The US has a population of 314 million people and, according to this article, attracts 10 million shoppers - that's 3% of their population. The managers of ClickFrenzy said they successfully tested for up to 1 million shoppers (5% of our 22 million population) and had over 2 million shoppers log on. That's 9% of our population versus the 3% experienced by Cyber Monday in the US.

    Based on those figures, perhaps the organisers of ClickFrenzy thought they had allowed sufficient leeway for their servers. It seems they underestimated the success of their promotions and the interest it generated.

    On another note - and perhaps one that was overlooked - the US has been going through rather difficult economic times following the GFC which kicked off in 2007. By comparison, Australia has been faring rather well, which may explain why the number of potential buyers was so woefully underestimated.

    1. Hi Copy Chick, I appreciate your comment, however servers are not managed in this way.

      While Click Frenzy may have predicted a million visitors over a 24-hour period, it is well known in network management that these visitors don't arrive in an orderly fashion at predictable intervals - they arrive in surges.

      Therefore provisioning for an average of 700 users per minute (which equates to a million visitors in 24 hours) would be insufficient. Instead they should have provisioned for peak surges at least 100x larger - 70,000 visitors per minute.

      Sites need the capacity to handle 100 times the expected level of overall visits to cope with peak periods - using scalable infrastructure that means they only pay for what they use.

      If Click Frenzy had been planning for a million visitors over 24hrs (700/minute), it needed infrastructure capable of handling the equivalent of 100 million visits over 24hrs (70,000/minute) - simply to reflect peak usage levels.

      Given the site had an announced and widely known launch time, it probably needed to provision for an order of magnitude more visitors than this.

      ClickFrenzy also had clear indications that they would receive greater numbers of visitors than they expected from the enormous media attention, level of social media discussion and the number of email sign-ups they had. It appears all of this data was ignored.

      If I'd had 500,000 email subscriptions before launch, I'd be provisioning for a load of at least 5 million visitors over the 24 hours - therefore scalable infrastructure supporting a peak of 500 million visitors for that 24 hour period to cope with peak usage.
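
The arithmetic in the reply above can be checked with a few lines of Python. This is only a worked example: the 100x surge factor is the rule of thumb used in the reply, not a measured figure, and the function names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60

def average_per_minute(visitors_per_day):
    """Visitors per minute if arrivals were spread evenly over 24 hours."""
    return visitors_per_day / MINUTES_PER_DAY

def peak_per_minute(visitors_per_day, surge_factor=100):
    """Visitors per minute to provision for, assuming a 100x surge over the average."""
    return average_per_minute(visitors_per_day) * surge_factor

print(round(average_per_minute(1_000_000)))  # 694, the "700 per minute" above
print(round(peak_per_minute(1_000_000)))     # 69444, roughly 70,000 per minute
print(round(peak_per_minute(5_000_000)))     # 347222 per minute for 5 million a day
```

The last line is the 500,000-subscriber case: 5 million expected visitors over the day means provisioning for a peak of roughly 350,000 visitors per minute.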

    2. Thanks for explaining that so well Craig. I'm more of a word-nerd than a computer geek, so I appreciate your detailed explanation.

      Regardless of who was at fault, it's a shame they haven't been more generous in issuing an apology and assurances that "lessons have been learnt so next year will be bigger & better". There's a lot to be said for how you handle a stuff-up and it seems they're a bit behind the 8-ball on this one too.

  6. And more issues have emerged with the security of the Click Frenzy site:
    ZDNET: Password exposed in Click Frenzy security slip http://www.zdnet.com/password-exposed-in-click-frenzy-security-slip-7000007707/
