Following the two previous assertions, namely that:
‘you can be supported by a huge crowd for a very short time, or by few for a long time, but you can’t have a huge crowd all of the time (unless data collection is passive)’ (original post here)
‘All information sources are heterogeneous, but some are more honest about it than others’ (original post here)
The third assertion is about patterns of participation. It is one that I’ve mentioned before, and in some ways it is a corollary of the two assertions above.
‘When looking at crowdsourced information, always keep participation inequality in mind’
Because crowdsourced information, whether Volunteered Geographic Information or Citizen Science, is created through a socio-technical process, it is all too easy to forget the social side – especially when you are looking at the information without the metadata of who collected it and when. So when working with OpenStreetMap data, or viewing the distribution of bird species in eBird (below), even though the data source is expected to be heterogeneous, each observation is treated as similar to the others and assumed to be produced in a similar way.
Yet the data is not only heterogeneous in terms of consistency and coverage; it is also highly heterogeneous in terms of contribution. One of the most persistent findings from studies of various systems – for example in Wikipedia, OpenStreetMap and even in volunteer computing – is that there is a very distinctive heterogeneity in contribution. The phenomenon was termed ‘Participation Inequality’ by Jakob Nielsen in 2006 and is summarised succinctly in the diagram below (from the Visual Liberation blog) – a very small number of contributors add most of the content, while most of the people who are involved in using the information will not contribute at all. Even when examining only those who actually contribute, in some projects over 70% contribute only once, with a tiny minority contributing most of the information.
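The shape of this pattern is easy to see with a small sketch. The per-contributor edit counts below are entirely synthetic and illustrative – they are not data from any real project – but they are skewed in the way described above, and the two summary numbers show why the skew matters when you treat every observation as equal:

```python
# Hypothetical per-contributor edit counts for a crowdsourcing project
# (synthetic, illustrative numbers only): a tiny core of heavy mappers,
# a middle group, and a long tail of one-time contributors.
edit_counts = [5000, 1200, 800, 300, 150] + [40] * 10 + [5] * 30 + [1] * 120

total_edits = sum(edit_counts)
counts = sorted(edit_counts, reverse=True)

# Share of all content produced by the five heaviest contributors
top5_share = sum(counts[:5]) / total_edits

# Fraction of contributors who contributed exactly once
one_timers = sum(1 for c in edit_counts if c == 1) / len(edit_counts)

print(f"Top 5 contributors produced {top5_share:.0%} of the edits")
print(f"{one_timers:.0%} of contributors edited only once")
```

With these made-up numbers, five people out of 165 produce over 90% of the content, and about 73% of contributors appear only once – so any quality assessment is, in effect, an assessment of a handful of heavy contributors.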
Therefore, when looking at sources of information that were created through such a process, it is critical to remember the nature of contribution. This has far-reaching implications for quality, which depends on the expertise of the heavy contributors, on their spatial and temporal engagement, and even on their social interaction and practices (e.g. abrasive behaviour towards other participants).
Because of these factors, it is critical to remember the impact and implications of participation inequality when analysing the information. Some analyses will be affected less and others much more, but in either case it needs to be taken into account.
Following the last post, which focused on an assertion about crowdsourced geographic information and citizen science, I continue with another observation. As was noted in the previous post, these can be treated as ‘laws’, as they seem to emerge as common patterns from multiple projects in different areas of activity – from citizen science to crowdsourced geographic information. The first assertion was about the relationship between the number of volunteers who can participate in an activity and the amount of time and effort that they are expected to contribute.
This time, I look at one aspect of data quality, which is about consistency and coverage. Here the following assertion applies:
‘All information sources are heterogeneous, but some are more honest about it than others’
What I mean by that is the ongoing argument about authoritative versus crowdsourced information sources (Flanagin and Metzger 2008 frequently come up in this context), which was also at the root of the Wikipedia vs. Britannica debate, and the mistrust of citizen science observations and the constant questioning of whether they can do ‘real research’.
There are many aspects to these concerns, so the assertion deals with the aspects of comprehensiveness and consistency, which are used as a reason to dismiss crowdsourced information when comparing it to authoritative data. However, on closer inspection we can see that all these information sources are fundamentally heterogeneous. Despite all the effort to define precise standards for data collection in authoritative data, heterogeneity creeps in because of budget and time limitations, decisions about what is worth collecting and how, and the clash between reality and the specifications. Here are two examples:
Take one of the Ordnance Survey Open Data sources – the maps present themselves as consistent and covering the whole country in an orderly way. However, dig into the details of the mapping and you discover that the Ordnance Survey uses different standards for mapping urban, rural and remote areas. Yet the derived products that are generalised and manipulated in various ways, such as Meridian or Vector Map District, do not provide a clear indication of which parts originated from which scale – so the heterogeneity of the source disappears in the final product.
The census is also heterogeneous, and it is a good case of specifications vs. reality. Not everyone fills in the forms, and even with the best efforts of enumerators it is impossible to collect all the data, so statistical analysis and manipulation of the results are required to produce a well-reasoned assessment of the population. This is expected, even though it is not always understood.
Therefore, even the best information sources that we accept as authoritative are heterogeneous; as I’ve stated, they are just not completely honest about it. The ONS doesn’t release the full original set of data before all the manipulations, nor completely disclose all the assumptions that went into reaching the final values. The Ordnance Survey doesn’t tag every line with metadata about the date of collection and the scale.
Somewhat counter-intuitively, exactly because crowdsourced information is expected to be inconsistent, we approach it as such and ask questions about its fitness for use. So in that way it is more honest about the inherent heterogeneity.
Importantly, the assertion should not be taken as dismissive of authoritative sources, or as ignoring that the heterogeneity within crowdsourced information sources is likely to be much higher than in authoritative ones. Of course, all the investment in making things consistent and the effort to achieve universal coverage is indeed worth it, and it would be foolish and counterproductive to suggest that such sources of information can simply be replaced, as has been suggested for the census, or that it’s not worth investing in the Ordnance Survey to update the authoritative datasets.
Moreover, when commercial interests meet crowdsourced geographic information or citizen science, the ‘honesty’ disappears. For example, even though we know that Google Map Maker is now used in many parts of the world (see the figure), even in cases where access to vector data is provided by Google, you cannot find out who contributed, when and where. It is also presented as an authoritative source of information.
Despite the risk of misinterpretation, the assertion can be useful as a reminder that the differences between authoritative and crowdsourced information are not as big as they may seem.
The Guardian Science Weekly podcast is dedicated to Citizen Science – another example of the growing interest of the popular media in Citizen Science. However, the podcast conflates cases where non-professional scientists are involved in scientific projects (Chris Lintott discusses Galaxy Zoo, FoldIt and similar projects) with participation in scientific research through surveys. It is rather interesting that George MacKerron usually explains that Mappiness, despite the wide participation in it, is a social survey tool and not a citizen science project. It is also not strictly a crowdsourcing project, so calling the chronotype survey crowd-sourced science, as the podcast does, is a bit of hype…
20 March, 2010
The Digital Economy is a research programme of Research Councils UK, and as part of it the University of Nottingham is running the Horizon Digital Economy research centre. The centre organised a set of theme days, and the latest one focused on ‘supporting the contextual footprint – infrastructure challenges’. The day was excellent, covering issues such as background on location issues with a review of location technologies and a demonstration of a car-pooling application, data ownership, privacy and control over your information, and finally crowdsourcing. I was asked to give a presentation with a bit of background on OpenStreetMap, discuss the motivations of contributors and mention the business models that are based on open geographical information.
For the purpose of this demonstration, I teamed up with Nama Raj Budhathoki, who is completing his PhD research at the University of Illinois, Urbana-Champaign under the supervision of Zorica Nedović-Budić (now at University College Dublin). His research focuses on user-generated geographical information, and just before Christmas he ran a survey of OpenStreetMap contributors; I was involved in the design of the questionnaire (as well as being lucky enough to be on Nama’s advisory committee).
So here is the presentation – we plan to give more comprehensive feedback on the survey during State of the Map 2010.
17 July, 2009
Chris Parker, a PhD student at Loughborough University, organised a dedicated Volunteered Geographical Information research group site on ResearchGate. While I dislike the term – I usually interpret it as the version of ‘volunteered’ as in ‘mum volunteered me to help the old lady cross the street’ – there is no point in trying to change it. When Mike Goodchild coins an acronym, it will stick; it’s sort of a GIScience law!
If you are interested in user-generated geographical content, crowdsourced geographical information, commons-based peer-produced geographical information, or any other name for this phenomenon (for example VGI) – join the group. It will be good to keep in touch, share information and discuss research aspects.
If you are researching this area, you are also welcome to submit a paper to GISRUK 2010, which will be hosted at UCL – we are keen to have a VGI element in the programme, considering that UCL is the host of OpenStreetMap.