Tuesday, August 06, 2013

Is it easy for non-programmers to reuse government open data?

Opening up data is one thing, but using it in a productive way is another.

Data may be released in formats that are hard to reuse, data may be 'dirty' (with mistakes) or incomplete.

However when organisations release data in machine-readable formats, with a reasonable level of completeness, it can be surprisingly easy for even a novice with no programming experience to reuse it in meaningful ways.

Below are two examples of how I've recently reused very different sets of data, an example of data released directly by a government agency, and an example of how to capture and reuse data that is public but technically not open.

Example 1: Mapping Australian polling places

Earlier today @Maxious tweeted the release of the Australian Electoral Commission's (AEC) expected polling places for the federal election as a CSV file. CSV is a standard format, like a basic spreadsheet, where every value is separated from the next by a comma, making it easy to import into (or export from) Microsoft Excel, OpenOffice Calc, Google Spreadsheet or other spreadsheets or databases.

The polling locations data is valuable, but in the CSV format simply appears as lines and lines of data. I thought it would be interesting and useful to visually map the polling locations on a map of Australia, making it easy for people to find the polling booths nearest to them.

So I downloaded the CSV file from the AEC website (www.aec.gov.au/About_AEC/cea-notices/election-pp.htm) and went to Google Drive, which supports a type of spreadsheet called Fusion Tables which can map geographic data.

Fortunately the AEC was smart enough to include latitude and longitude for each polling location. This can be easily mapped by Fusion Tables. The CSV also contained address, postcode and state information, which I could also have used, less accurately, to map the locations.

I uploaded the CSV into a newly created Fusion Table, which automatically organised the data into columns and used the Lat/Long coordinates to map the locations - job done! Or so I thought....

When I looked at the map, it only showed NSW polling locations - about 2,400 of them - while the original CSV listed over 8,000.

Clearly something hadn't worked properly, so I tried reloading the data into a new Fusion Table - with the same result - it didn't seem to be a problem with the CSV or the import process.

I went into the CSV using Microsoft Excel and studied at the data. There were many columns of data I didn't need for the map, so I deleted them - reducing the size of the spreadsheet by tens of thousands of cells.

I reimported the CSV into a Fusion Table and it worked! All eight and a half thousand expected polling locations appeared on the map. Clearly there had been too much (extraneous) data for Fusion to manage.

From here finishing the map was easy. It was simply a process of making the data a little more presentable by changing datasheet names and editing what appeared in the  information box that appeared when a polling location was clicked on.

I shared my Fusion Table and published the map so people could view and embed it (see below).

You can view (but not edit) my full Fusion Table at: https://www.google.com/fusiontables/DataSource?docid=1kzLZTqNRkXMu1w4eBdsOLRakx3S8FLHziu6PdbU



So job done - map created with useful information when you click a red dot.

However, these are only expected polling places - the AEC may update this information at any time as they confirm or remove specific polling places.

My map is current at 6 August 2013, however may become out-of-date quite fast. How do I ensure my map updates when the AEC updates their CSV?

The short answer is that I can't - using my Google Fusion Table.

Because the AEC has chosen to release the data in a format easy for them (a CSV, straight from their internal systems), it is less useful for outsiders who wish to keep their maps or mash-ups current.

A programmer would be able to write a script that checked the AEC page each day to see if the CSV had updated, download it into a program that updated a map and published it to the web with the changes - even providing a history of which polling stations were added or removed over time.

However the broader community, including me, don't have the programming skills to do this - and shouldn't need them.

To replicate what the programmer could do in a few lines, any non-programmer, such as me, would have to manually check the page, download the updated CSV (assuming the page provides a clue that it has changed), manually delete all unneeded columns (again) and upload the data into my Fusion Table, simply to keep my map current.

Of course, if the AEC had spent a little more time on their data - releasing it as a datafeed or an API (Application Programming Interface), it would be easy even for non-programmers to reuse the data in a tool like Google Maps for public visualisation - or the AEC could have taken the one additional step necessary to map the information themselves (still providing the raw data), providing a far more useful resource for the community.

This is one of the challenges with open data - releasing it in formats useful for the audience, rather than the agency.

Agencies often choose to release data in what they see as the fastest and easiest solution for them, even though it greatly increases the risk that their data will be reused online in out-of-date or inappropriate ways. Imagine the same issue with a listing of illegal drugs, accident hotspots or holiday dates - anyone who relied on old data, because it didn't automatically update in apps or third-party websites, would potentially be at significant risk.

However with a little more effort and thought, agencies can release their data in ways that biase online reuse towards remaining current and accurate - such as via APIs, which automatically update the information whenever a user accesses a mobile app or website which draws from it. With some data, APIs can potentially save lives - as well as reduce the risks to both agencies and developers.

Example 2: Analysing agency tweets

I'm interested in what government agencies say online and have been tracking the use of Twitter by Australian governments, including local, state and federal agencies, for six years. I track these accounts using my @egovau Twitter account, in two Twitter lists (as the maximum list size is 500 accounts):


Now it's great to track these accounts within Twitter, however how can I easily get a sense of which agencies are most active or have the largest following?

Followerwonk.com Twitter report
I use followerwonk.com for this purpose - a tool which can capture a snapshot of the number of followers, tweets and other details of every account at a particular time. In fact it is so good that I actually pay money for it.

These snapshots can be downloaded as CSVs and analysed in spreadsheets - which makes it easy to identify the most and least active government Twitter users (as I've blogged about in an infographic).

However what Followerwonk doesn't do is to capture and archive the actual tweets from the roughly 890 Australian government agencies and councils that use Twitter. If I want to analyse what they actually say in their tweets, rather than simply analyse the number of tweets, I need different tools.

While it is reasonably easy to archive the tweets from an individual Twitter account (you can download your own tweets from Twitter directly), or tweets that use particular terms or hashtags, using a tool like TweetArchivist, which is really useful for tracking conferences, it is harder to capture all the tweets from a large number of Twitter accounts at the same time - even if they are in the same Twitter list.

I've previously captured some Twitter list tweets using paper.li, which turns them into a daily 'newspaper'. In fact I have mapped Australian Federal parliamentarian tweets, by house and party, for those who wish a daily dose of political discussion in a condensed form.

The beauty of this approach is that paper.li updates as I update my @egovaupollies Twitter lists (where I follow Australian federal politicians) - the use of this datafeed ensures the 'newspapers' are always current.

However paper.li only selectively captures and reports tweets and doesn't allow them to be downloaded in a structured way. It doesn't really help me archive my government agency Twitter lists.

I have tried using a number of tools without success, including the fantastic IFTTT (If This, Then That) site, which allows the creation of 'recipes' which perform actions between different online social networks and web 2.0 tools. I have used IFTTT previously to do things such as automate the change of my Facebook profile image when I change my image in Twitter.

However the fantastic Digital Inspirations blog, written by Amit Agarwal, provides useful code 'recipes' that can be adapted to do all kinds of things by non-programmers.

I tried one of Amit's 'recipes' for converting a Twitter list into an RSS feed, however found it didn't work properly as Twitter had changed its own code. I tweeted to Amit (@labnol) and he graciously replied with a link to an updated post, A Simple Way to Create RSS Feeds for Twitter, which did indeed provide a simple way of doing this, with a step-by-step video.

I followed the video and, using the Twitter Widgets page and the Google script that Amit provided, was able to quickly create the RSS feeds I needed for my Twitter lists (one feed per list).

You can view these RSS feeds using the following (unpretty) web addresses:


However I had a new issue. Taking the tweets from the RSS feeds and archiving them in a structured way into a spreadsheet or database for later analysis.

I thought it would be relatively easy to find a free online or downloadable RSS reader which could archive all the tweets from these RSS feeds. I was wrong.

I could not find an RSS reader that was designed to capture, store and archive RSS - only ones designed to format and view them.

So I went back to IFTTT and searched for a recipe that might help.

Here I found the recipe, Backup RSS Feed to Google Spreadsheet by Martin Hawksey.

The recipe was simple. All I had to do was put in my first RSS feed (above) and adjust the name of the spreadsheet in which it would be stored. Then I activated the recipe, which connected to my Google Drive and created an archival spreadsheet that updated every time a government agency or council on the list tweeted.

As I had two lists, I replicated the recipe, using the second RSS feed and a new spreadsheet name. Then I left it to see what happened....

A few hours later, checking back, the spreadsheets were growing, with about a hundred tweets between them.

I am now able to easily analyse this data to build a picture of what government agencies actually talk about, providing insights that otherwise would never be captured (stay tuned!)

In this case study the government data was already public and visible in agency Twitter accounts, however it was not really 'open, - neither easy to capture nor easy to reuse.  No government or local council in Australia I am aware of currently releases its tweets as open data or in any specific format, such as RSS, which could be captured and stored, (even though many use RSS for media releases).

However these tweets are also useful government data. The tweets are able to paint a picture of how government uses social media, what they talk about, how they say it and who they interact with. It has both historic value for the country as well as current value for understanding what different agencies and local governments are focused on today.

Capturing and reusing these government tweets was harder than reusing the data from the AEC. The AEC at least released the poll locations as open data, albeit in an imperfectly reusable form.

However using some ingenuity, but without any coding, it was still possible for a non-programmer to capture all of government's tweets and make them more useful.

Conclusion

There's still a long, long way for agencies to go with open data. Right now the data released around the countries by state and local jurisdictions is often hard to match up, being in different formats, collected in different ways, presented in different ways and often is not directly comparable from jurisdiction to jurisdiction. Federally there's not the same issue, however different agencies use different geographic areas, different terminology and different formats, again, for releasing data.

Much data remains unreleased, and even where data is technically public (such as tweets or Facebook updates), archives of this data are not always easily available to the public.

However there are now many tools online which can help make some of this imperfect public data more usable and useful - and you no longer need to be a programmer to do it.

2 comments:

  1. Great post Craig.
    I'm interested in your thoughts on agencies releasing data in multiple formats to suit various "market segments": Excel for those people whose analysis skills outstrip their technology skills, CSV (or similar) for the slightly more astute and APIs for programmers or similar.

    ReplyDelete
  2. Hi David,

    Certainly there needs to be different data formats to suit different uses, however there's tools which allow data to be transmuted into a range of data formats automatically with no additional effort by the agency - so this is a very low hurdle to meet.

    ReplyDelete