Friday, December 05, 2008

Storing and exposing public data sets

Governments collect and distribute a massive amount of public data each year. It is a continual challenge to make this data accessible and usable for citizens, commercial organisations, researchers, scientists and policy makers.

This challenge isn't limited to a few dedicated statistical organisations, such as the ABS. Many other government departments collect, collate and publish extensive public data about their customers, about the market and about their operations.

Putting on my former private-sector hat: public data can be difficult to locate, download and use in a way that adds value to an organisation. I have at times struggled to discover all of the data I needed, and to combine different datasets (from different public providers) with internal data in ways that added value to my employers.

The challenges around public data have led Amazon to launch Public Data Sets on its Amazon Web Services platform.

Described as a "convenient way to share, access, and use public data", the system is designed to provide "a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications."

Why is this significant?

The approach makes it much faster and easier for organisations to locate, download, customise and analyse large public data sets - such as census, scientific or industry data.

Using Amazon's system, "anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly."

Amazon has already exposed data sets such as the US Census and various labour statistics. Shortly it will also provide transport and economic databases.

All of these are public data sets being provided by US government bureaus.

Also available is scientific information such as Human Genome data, a collection of all publicly available DNA sequences, and chemical structures.

Amazon is also working to provide further public domain or non-proprietary data sets and invites organisations to submit applications for data to be included.

Given that this data capacity sits alongside Amazon's cloud computing service, providing an expandable virtual computing environment, it becomes possible for a range of organisations, researchers and individuals to access and make more effective use of large sets of public data, supporting innovation and democratising the marketplace.

It also allows for the creation of data mash-ups, combining data across different agencies with other data sources, maps, graphics, charts and analysis tools to generate new ways of experiencing data and new insights.
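As a minimal sketch of what such a mash-up looks like in practice, the snippet below cross-references two small, entirely hypothetical public data extracts (census population and unemployment figures by region, invented here for illustration) to derive a statistic neither agency publishes on its own:

```python
import csv
import io

# Two hypothetical public data extracts, inlined so the sketch is
# self-contained; in practice each would be downloaded from a
# different government agency.
census_csv = """region,population
NSW,7000000
VIC,5200000
QLD,4300000
"""

jobs_csv = """region,unemployed
NSW,350000
VIC,270000
QLD,240000
"""

def load(text, key, value):
    """Parse a simple CSV extract into a {key: int(value)} mapping."""
    return {row[key]: int(row[value])
            for row in csv.DictReader(io.StringIO(text))}

population = load(census_csv, "region", "population")
unemployed = load(jobs_csv, "region", "unemployed")

# The mash-up: join the two agencies' figures on a shared key
# (region) to compute a derived statistic.
for region in sorted(population):
    rate = unemployed[region] / population[region] * 100
    print(f"{region}: {rate:.1f}% of population unemployed")
```

The same join-on-a-shared-key pattern scales up: hosting the source data sets alongside the compute capacity is what makes it practical at census scale.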

I don't expect Amazon to be the only provider of this type of capacity; Google is very committed to cloud computing and organising the world's data, and Microsoft and IBM are also moving rapidly into these spaces.

In the long run I see this type of platform as a very valuable distribution tool for governments seeking to make their public data accessible and usable by the broadest possible group of citizens and organisations.

In turn this will broaden and deepen innovation and enable new insights based on cross-referencing data from different providers - becoming a competitive advantage for countries savvy enough to make their public data more accessible.

What would it take for Australia to make its public data available via this type of channel? A phone call or email to Amazon and some work in structuring our datasets.

That's a low entry cost compared to the challenge of building a replica system.
