Friday, July 06, 2012

If citizens can help explore galaxies, unfold proteins, track birds and transcribe texts, why can't they help analyse government data?

One area of Gov 2.0 I really think hasn't been thoroughly considered or adopted by many governments, including in Australia, is the process of having citizens help in the creation, exploration and analysis of data.

Is it due to a lack of time, money, imagination or courage?

I don't know, but I would dearly love to see more government agencies consider how they could engage citizens in crowdsourcing initiatives that could help society.

Let me give a few examples of what I mean.

Galaxy Zoo is a collaborative effort from a range of universities and astronomers to classify galaxies in our universe. The site launched in 2007 with a paltry one million galaxies visualised.

The site worked by allowing people to register to classify galaxies (as either spiral or elliptical), with multiple classifications used to verify that each classification was correct.
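The "multiple classifications" idea can be sketched in a few lines of Python. This is only an illustration of majority-vote checking, not Galaxy Zoo's actual method; the function name `consensus` and the thresholds `min_votes` and `min_agreement` are my own assumptions.

```python
from collections import Counter

def consensus(votes, min_votes=3, min_agreement=0.8):
    """Return the agreed label, or None if volunteers have not yet converged.

    min_votes and min_agreement are illustrative thresholds, not values
    taken from the Galaxy Zoo project.
    """
    if len(votes) < min_votes:
        return None  # too few independent classifications to trust
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_agreement else None

# Four of five volunteers agree, so the galaxy is accepted as a spiral.
print(consensus(["spiral", "spiral", "spiral", "elliptical", "spiral"]))
# An even split gives no answer; the galaxy would be shown to more people.
print(consensus(["spiral", "elliptical", "spiral", "elliptical"]))
```

The point of the design is that no single volunteer needs to be an expert: an answer is only accepted once enough independent people agree on it.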

The team behind the site thought it might take two years to classify all one million galaxies. However, within 24 hours of launch the site was receiving 70,000 classifications an hour.

In total more than 50 million classifications were received by the project during its first year, from almost 150,000 people.

This effort was so successful that the team took a selection of 250,000 galaxies and asked people to analyse them for more detailed information, calling this Galaxy Zoo 2. Over 14 months users helped the team make over 60,000,000 classifications.

This work has led into a number of lines of research and supported scientists in understanding more about how our universe works.

Planet Hunters takes a more focused approach, looking for planets around other stars. A collaboration between the group behind Galaxy Zoo and Yale University, it works on a similar basis: users register to look for signs of planets in brightness data from NASA's Kepler space telescope.

Users mark likely targets and, over time, when sufficient users have marked a star as a likely target, the professional astronomers analyse that star in depth.

The site is an experiment, and there's no indication of how many planets have been found using the process. However, as the human eye is particularly good at detecting patterns or aberrations that computers can struggle with, it has a good shot at success. The classifications by humans may also help improve the computer algorithms, making computers better at detecting patterns in data that may indicate planets, or patterns in all kinds of other data as well.

eBird is an initiative from the Cornell Lab of Ornithology and National Audubon Society launched in 2002. It aggregates bird sightings by location from professionals and amateurs to better map the range, migration patterns and changing distribution of bird species.

The system is the largest database of its kind in the world and in March 2012 alone participants reported more than 3.1 million bird observations across North America - data that is valuable to educators, land managers, ornithologists, and conservation biologists amongst other groups.

The data can be viewed on maps by species or as bar and line charts to explore when in the year particular birds are in a particular region. The site also supports gamification elements, listing the top 100 eBirders and tracking each user's personal record of sightings.

Foldit is a site where users can solve scientific puzzles through playing games. The site is most famous for the speed at which gamers solved an AIDS protein puzzle that had stumped traditional scientific approaches. Gamers solved the puzzle in less than three weeks while scientists had been struggling with it for thirty years.

Supported by both universities and corporate interests, the site is exploring many biological puzzles related to protein folding that offer hope for solving many of the worst diseases and conditions afflicting humans and our domesticated animals and plants.

Again the site includes a ranked ladder of the most successful players and offers ways to socialise and share information.

Whale FM is a great site for whale lovers, a place where people can listen to the songs of killer and pilot whales in order to match their patterns. Supported by Scientific American, the site contains thousands of samples of whale songs.

Users can listen to snatches of song and listen for patterns, providing data that helps marine researchers answer questions such as how large the call repertoire of pilot whales is, and whether long-finned and short-finned pilot whales have different call repertoires (or ‘dialects’).

TeamSurv also has a watery focus, enlisting mariners to help create better charts of coastal waters by logging depth and position data whilst they are at sea and uploading the data to the web for processing and display.

The information collected by the site helps improve nautical maps and thereby reduces risks at sea, helping sailors and reducing rescue costs.
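To give a feel for how many boats' depth logs could be combined, here is a rough Python sketch. It is not the site's actual processing: the function `grid_depths`, the `cell_size` value and the sample soundings are all made up for illustration, and a real chart would also need corrections for tides and instrument offsets.

```python
import math
from collections import defaultdict
from statistics import median

def grid_depths(soundings, cell_size=0.01):
    """Group (lat, lon, depth_m) soundings into square cells and take the median depth.

    cell_size is in degrees and is an arbitrary illustrative choice.
    """
    cells = defaultdict(list)
    for lat, lon, depth in soundings:
        key = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        cells[key].append(depth)
    # The median keeps one faulty reading from distorting a whole cell.
    return {key: median(depths) for key, depths in cells.items()}

# Hypothetical readings: three boats pass over one cell, one over another.
soundings = [
    (50.123, -1.456, 12.0),
    (50.124, -1.457, 12.4),
    (50.126, -1.455, 30.0),  # a suspicious outlier
    (50.205, -1.305, 8.0),
]
for cell, depth in grid_depths(soundings).items():
    print(cell, depth)
```

Using the median rather than the average is the design choice that matters here: with enough contributors, the odd bad reading from a miscalibrated depth sounder simply gets outvoted.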

While still in its early stages and very European focused, this crowdsourcing site has great promise. I'd like to see a similar concept extended onto land, using cars with GPS as the collection point of atmospheric and traffic data that can be used to map microclimates and plan traffic measures.

BlueServo, on the other hand, focuses on collecting land-based data on the movements of illegal immigrants across the Mexican-US border. Using a range of web cameras, users are asked to watch for movement and report people crossing the border to the Texas Border Sheriff.

Called the Virtual Border Watch, the approach currently involves twelve cameras and sensors at high risk locations. The site doesn't actually list how successful the project has been (though why would it?).

reCAPTCHA is the crowdsourcing tool that people don't notice they're participating in. In fact you've probably participated in it yourself.

The system, now owned by Google, uses snippets of digitalised books and documents as 'CAPTCHA codes' - those images of letters and numbers used to help stop spambots, programs designed to break into systems to send spam messages.

Whenever you verify you are human by retyping the letters in a reCAPTCHA image you are contributing to the preservation of millions of vintage books through digitalisation, with a 99.5% accuracy rate. In fact, the accuracy of reCAPTCHA matches that of human "key and verify" transcription techniques in which two professional human transcribers independently type the data and discrepancies are corrected.
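The "key and verify" technique itself is simple enough to sketch in Python. This is a minimal illustration of comparing two independent transcriptions, assuming both typists produce the same number of words; the function name `key_and_verify` is my own, and this is not how reCAPTCHA is implemented.

```python
def key_and_verify(first, second):
    """Compare two independent transcriptions word by word.

    Returns (agreed, disputed): agreed holds each word both typists typed
    identically (None where they differ), and disputed lists the positions
    that need a person to check against the original page.
    """
    a, b = first.split(), second.split()
    if len(a) != len(b):
        raise ValueError("transcriptions differ in word count; review the whole line")
    agreed = [x if x == y else None for x, y in zip(a, b)]
    disputed = [i for i, word in enumerate(agreed) if word is None]
    return agreed, disputed

agreed, disputed = key_and_verify("the quick brown fox", "the quick brawn fox")
print(agreed)    # ['the', 'quick', None, 'fox']
print(disputed)  # [2]
```

Only the words where the two versions disagree need further attention, which is why the approach is accurate without anyone having to proofread every word.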

Trove is the last crowdsourcing project I'll mention, but definitely not the least. It is the National Library of Australia's project to digitalise old newspapers, using people to correct errors in the digital scanning. I've discussed Trove before and it continues to go from strength to strength, judging from the Hall of Fame of content correctors.

Tens of millions of lines in newspapers have been corrected, improving the accuracy of Australia's historic record (the Trove site even lists my blog in its archive).

If you're interested in finding more examples of crowdsourcing, a good first stop is the Wikipedia page listing crowdsourcing projects.

Can't governments, with all that data sitting in archives, find uses for crowdsourcing too?


  1. This is very timely considering the concern about manipulated hospital data in the ACT currently.

    Apparently the AIHW had commented that the unusual distribution of admission times was "too perfect".

    Previously in Canada, the release of charity financial data for the first time led to citizens (and data driven journalists!) finding $3.2 billion in tax fraud.

  2. That's the kind of data that lends itself to processes such as are used by the New York government :) reported here: