When not to use crowdsourced data…

‘More or Less’ is a good programme on BBC Radio 4. It regularly explores the numbers and the evidence behind news stories and other important things, and checks if they stand up. However, the piece that was broadcast this week about golf courses and housing in the UK provides a nice demonstration of when not to use crowdsourced information. The issue that was discussed was how much actual space golf courses occupy, when compared to space that is used for housing. All was well, until they announced in the piece the use of clever software (read GIS) with a statistical superhero to do the analysis. Interestingly, the data that was used for the analysis was OpenStreetMap – and because the news item was about Surrey, they started doing the analysis there.

For the analysis to be correct, you need to assume that all the buildings and all the golf courses have been identified and mapped as polygons in OpenStreetMap. My own guess is that in Surrey, this could be the case – especially with all the wonderful work that James Rutter catalysed. However, assuming that this is the case for the rest of the country is, well, a bit fanciful. I wouldn’t dare to state that OpenStreetMap is complete to such a level without lots of quality testing, which I haven’t seen. There is only the road length analysis of ITO World! and other bits of analysis, but we don’t know how complete OSM is.

While I like OpenStreetMap very much, it is utterly unsuitable for any sort of statistical analysis that works at the building level and then sums up to the country level, because of the heterogeneity of the data. For that sort of thing, you have to use a consistent dataset, or at least one that attempts to be consistent, and that data comes from the Ordnance Survey.

As with other statistical affairs, the core case that is made about the assertion as a whole in the rest of the clip is relevant here. First, we should question the unit of analysis (is it right to compare the footprint of a house to the area of golf courses? Probably not) and what is to be gained by adding up individual buildings’ footprints to the level of the UK while ignoring roads, gardens, and all the rest of the built environment. Just because it is possible to add up every building’s footprint doesn’t mean that you should. Second, this analysis is an example of the ‘Big Data’ fallacy, which goes: analyse first, then question (if at all) what the relationship between the data and reality is.
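To make the completeness problem concrete, here is a minimal sketch of how uneven mapping can distort a summed golf-to-buildings ratio. Everything in it is an assumption for illustration only: the `Region` class, the two made-up regions, their areas and their completeness fractions are hypothetical and are not taken from the programme, from OpenStreetMap or from the Ordnance Survey.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """A hypothetical region; every value here is invented for illustration."""
    name: str
    true_building_km2: float      # real building footprint on the ground
    true_golf_km2: float          # real golf course area on the ground
    building_completeness: float  # fraction of building area actually mapped
    golf_completeness: float      # fraction of golf course area actually mapped


def true_totals(regions):
    """Sum the real areas, as a perfectly consistent dataset would report them."""
    buildings = sum(r.true_building_km2 for r in regions)
    golf = sum(r.true_golf_km2 for r in regions)
    return buildings, golf


def mapped_totals(regions):
    """Sum only what has been mapped, as a naive add-up of polygons would."""
    buildings = sum(r.true_building_km2 * r.building_completeness for r in regions)
    golf = sum(r.true_golf_km2 * r.golf_completeness for r in regions)
    return buildings, golf


def golf_to_building_ratio(totals):
    """Ratio of golf course area to building footprint area."""
    buildings, golf = totals
    return golf / buildings


# Assumed scenario: golf courses are easy to spot and mostly mapped everywhere,
# while building coverage varies a lot from one region to the next.
regions = [
    Region("well_mapped", 10.0, 5.0, 0.9, 0.9),
    Region("patchy", 30.0, 5.0, 0.2, 0.9),
]

if __name__ == "__main__":
    print("true ratio:  ", golf_to_building_ratio(true_totals(regions)))
    print("mapped ratio:", golf_to_building_ratio(mapped_totals(regions)))
```

In this invented scenario the mapped ratio comes out at more than double the true one, purely because buildings are less completely mapped than golf courses in one of the two regions. The direction and size of the bias depend entirely on the completeness values, which is exactly what we do not know for OSM at the country level.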

Published by

mukih

Professor of GIScience, University College London

7 thoughts on “When not to use crowdsourced data…”

  1. I think it’s unfair to say don’t use crowd-sourced data. As for any other data analysis activity, it is important to be aware of the caveats associated with the data: the most notable one being that OSM data has not generally been created with this type of use in mind. Personally, I think mappers should be aware that OSM data has analytic potential, which is why I’m talking about it next week at SotM-Eu!

    Broadly speaking, I think OSM is probably fine as a source of golf courses, but unsuitable for building footprints, at least in Surrey. I expand on this below.

    For golf courses it’s fairly straightforward to check that things mapped as golf courses are golf courses, and because they stand out even on Landsat imagery it’s not too hard to scan the imagery itself for any potential additional candidates. My guess is that over 90% of Surrey golf courses have been added to OpenStreetMap, and rather more in terms of total area (it’s small 9-hole courses which are most likely to be missed). My major concerns about such data are three-fold: substantial areas of Surrey were mapped early on in the project and many of the polygons are not as refined as one would expect from areas mapped now; maps from the 1940s were used for at least some of this and some old courses may have gone; several golf courses in Surrey (notably Wentworth and St George’s Hill) are also intermixed with expensive residential houses and thus one might want some clarity about what exactly one includes in the area for a golf club.

    Building outlines are a different matter. It’s hard to quantify the accuracy and completeness of mapped buildings. There are suitable alternative data sets: the generalised polygons from OS VectorMapDistrict and polygons extracted from the more detailed buildings shown on OS StreetView (e.g., as done by Tim Sheerman-Chase’s mapseg program).

    Thus I think for Surrey golf courses it’s fairly easy to achieve a consistent data set from OSM, whereas I would avoid OSM for buildings.

    The usual problem about validating OSM datasets is having access to a suitable reference (typically OSGB data in the UK). If one has access to MasterMap then it can be used for the analysis; if not one must trim one’s sails to fit.

    The truth is that in the absence of access to the comprehensive map data of a national mapping agency, OpenStreetMap is often the only alternative.

    1. Thank you for the thoughtful comments – I didn’t say that crowdsourced data should not be used for analysis (I’ll be the last one to say that!) but here is a clear example of when it is not suitable. The generic case might be translated to ‘I want to aggregate properties in a consistent way across a large area (region, country)’, and without testing very carefully for biases and dealing with them through reasoned assumptions or better datasets, the data is just not fit for purpose, even for ‘digital back of an envelope’ calculations. I think that we agree on the general case…

  2. Listening to the podcast it’s hard to know exactly what data they used and what assumptions they made. It would be an interesting exercise to compare with MasterMap. I think it’s probably possible to work out the building vs golf course ratio from places which are well mapped, and maybe extrapolate from that across the country using other data sources such as population density. But maybe this post is a little bit of reverse psychology for OSM – because basically it sounds like a challenge and I imagine in a few years we could revisit this and have a much better answer. But the real goal, as highlighted in the programme, is to look for golf courses near transport and job hubs and identify them for mass compulsory purchase for housing… that’s a real big data analysis challenge 😉

    1. Thanks Tim, I’m with you on the challenge and compulsory purchase!

      Indeed, and in connection to Jerry’s comment above, it will be great to have a tool that can tell you: if you want to compare places X, Y & Z, it is highly likely that crowdsourced data will be good enough for your task to justify drawing conclusions from the analysis (and to have a definition of ‘good enough’ and ‘highly likely’ for different types of analysis…). We came across interesting examples over at http://crowdgov.wordpress.com/ and in the report that will come out at the end of the month.
