19 September, 2014
The Association of American Geographers is coordinating an effort to create an International Encyclopedia of Geography. Plans started in 2010, with an aim to see the 15 volumes project published in 2015 or 2016. Interestingly, this shows that publishers and scholars are still seeing the value in creating subject-specific encyclopedias. On the other hand, the weird decision by Wikipedians that Geographic Information Science doesn’t exist outside GIS, show that geographers need a place to define their practice by themselves. You can find more information about the AAG International Encyclopedia project in an interview with Doug Richardson from 2012.
As part of this effort, I was asked to write an entry on ‘Volunteered Geographic Information, Quality Assurance‘ as a short piece of about 3000 words. To do this, I have looked around for mechanisms that are used in VGI and in Citizen Science. This are covered in OpenStreetMap studies and similar work in GIScience, and in the area of citizen science, there are reviews such as the one by Andrea Wiggins and colleagues of mechanisms to ensure data quality in citizen science projects, which clearly demonstrated that projects are using multiple methods to ensure data quality.
Below you’ll find an abridged version of the entry (but still long). The citation for this entry will be:
Haklay, M., Forthcoming. Volunteered geographic information, quality assurance. in D. Richardson, N. Castree, M. Goodchild, W. Liu, A. Kobayashi, & R. Marston (Eds.) The International Encyclopedia of Geography: People, the Earth, Environment, and Technology. Hoboken, NJ: Wiley/AAG
In the entry, I have identified 6 types of mechanisms that are used to ensure quality assurance when the data has a geographical component, either VGI or citizen science. If I have missed a type of quality assurance mechanism, please let me know!
Here is the entry:
Volunteered geographic information, quality assurance
Volunteered Geographic Information (VGI) originate outside the realm of professional data collection by scientists, surveyors and geographers. Quality assurance of such information is important for people who want to use it, as they need to identify if it is fit-for-purpose. Goodchild and Li (2012) identified three approaches for VGI quality assurance , ‘crowdsourcing‘ and that rely on the number of people that edited the information, ‘social’ approach that is based on gatekeepers and moderators, and ‘geographic’ approach which uses broader geographic knowledge to verify that the information fit into existing understanding of the natural world. In addition to the approaches that Goodchild and li identified, there are also ‘domain’ approach that relate to the understanding of the knowledge domain of the information, ‘instrumental observation’ that rely on technology, and ‘process oriented’ approach that brings VGI closer to industrialised procedures. First we need to understand the nature of VGI and the source of concern with quality assurance.
While the term volunteered geographic information (VGI) is relatively new (Goodchild 2007), the activities that this term described are not. Another relatively recent term, citizen science (Bonney 1996), which describes the participation of volunteers in collecting, analysing and sharing scientific information, provide the historical context. While the term is relatively new, the collection of accurate information by non-professional participants turn out to be an integral part of scientific activity since the 17th century and likely before (Bonney et al 2013). Therefore, when approaching the question of quality assurance of VGI, it is critical to see it within the wider context of scientific data collection and not to fall to the trap of novelty, and to consider that it is without precedent.
Yet, this integration need to take into account the insights that emerged within geographic information science (GIScience) research over the past decades. Within GIScience, it is the body of research on spatial data quality that provide the framing for VGI quality assurance. Van Oort’s (2006) comprehensive synthesis of various quality standards identifies the following elements of spatial data quality discussions:
- Lineage – description of the history of the dataset,
- Positional accuracy – how well the coordinate value of an object in the database relates to the reality on the ground.
- Attribute accuracy – as objects in a geographical database are represented not only by their geometrical shape but also by additional attributes.
- Logical consistency – the internal consistency of the dataset,
- Completeness – how many objects are expected to be found in the database but are missing as well as an assessment of excess data that should not be included.
- Usage, purpose and constraints – this is a fitness-for-purpose declaration that should help potential users in deciding how the data should be used.
- Temporal quality – this is a measure of the validity of changes in the database in relation to real-world changes and also the rate of updates.
While some of these quality elements might seem independent of a specific application, in reality they can be only be evaluated within a specific context of use. For example, when carrying out analysis of street-lighting in a specific part of town, the question of completeness become specific about the recording of all street-light objects within the bounds of the area of interest and if the data set includes does not include these features or if it is complete for another part of the settlement is irrelevant for the task at hand. The scrutiny of information quality within a specific application to ensure that it is good enough for the needs is termed ‘fitness for purpose’. As we shall see, fit-for-purpose is a central issue with respect to VGI.
To understand the reason that geographers are concerned with quality assurance of VGI, we need to recall the historical development of geographic information, and especially the historical context of geographic information systems (GIS) and GIScience development since the 1960s. For most of the 20th century, geographic information production became professionalised and institutionalised. The creation, organisation and distribution of geographic information was done by official bodies such as national mapping agencies or national geological bodies who were funded by the state. As a results, the production of geographic information became and industrial scientific process in which the aim is to produce a standardised product – commonly a map. Due to financial, skills and process limitations, products were engineered carefully so they can be used for multiple purposes. Thus, a topographic map can be used for navigation but also for urban planning and for many other purposes. Because the products were standardised, detailed specifications could be drawn, against which the quality elements can be tested and quality assurance procedures could be developed. This was the backdrop to the development of GIS, and to the conceptualisation of spatial data quality.
The practices of centralised, scientific and industrialised geographic information production lend themselves to quality assurance procedures that are deployed through organisational or professional structures, and explains the perceived challenges with VGI. Centralised practices also supported employing people with focus on quality assurance, such as going to the field with a map and testing that it complies with the specification that were used to create it. In contrast, most of the collection of VGI is done outside organisational frameworks. The people who contribute the data are not employees and seemingly cannot be put into training programmes, asked to follow quality assurance procedures, or expected to use standardised equipment that can be calibrated. The lack of coordination and top-down forms of production raise questions about ensuring the quality of the information that emerges from VGI.
To consider quality assurance within VGI require to understand some underlying principles that are common to VGI practices and differentiate it from organised and industrialised geographic information creation. For example, some VGI is collected under conditions of scarcity or abundance in terms of data sources, number of observations or the amount of data that is being used. As noted, the conceptualisation of geographic data collection before the emergence of VGI was one of scarcity where data is expensive and complex to collect. In contrast, many applications of VGI the situation is one of abundance. For example, in applications that are based on micro-volunteering, where the participant invest very little time in a fairly simple task, it is possible to give the same mapping task to several participants and statistically compare their independent outcomes as a way to ensure the quality of the data. Another form of considering abundance as a framework is in the development of software for data collection. While in previous eras, there will be inherently one application that was used for data capture and editing, in VGI there is a need to consider of multiple applications as different designs and workflows can appeal and be suitable for different groups of participants.
Another underlying principle of VGI is that since the people who collect the information are not remunerated or in contractual relationships with the organisation that coordinates data collection, a more complex relationships between the two sides are required, with consideration of incentives, motivations to contribute and the tools that will be used for data collection. Overall, VGI systems need to be understood as socio-technical systems in which the social aspect is as important as the technical part.
In addition, VGI is inherently heterogeneous. In large scale data collection activities such as the census of population, there is a clear attempt to capture all the information about the population over relatively short time and in every part of the country. In contrast, because of its distributed nature, VGI will vary across space and time, with some areas and times receiving more attention than others. An interesting example has been shown in temporal scales, where some citizen science activities exhibit ‘weekend bias’ as these are the days when volunteers are free to collect more information.
Because of the difference in the organisational settings of VGI, a different approaches to quality assurance is required, although as noted, in general such approaches have been used in many citizen science projects. Over the years, several approaches emerged and these include ‘crowdsourcing ‘, ‘social’, ‘geographic’, ‘domain’, ‘instrumental observation’ and ‘process oriented’. We now turn to describe each of these approaches.
The ‘crowdsourcing’ approach is building on the principle of abundance. Since there are is a large number of contributors, quality assurance can emerge from repeated verification by multiple participants. Even in projects where the participants actively collect data in uncoordinated way, such as the OpenStreetMap project, it has been shown that with enough participants actively collecting data in a given area, the quality of the data can be as good as authoritative sources. The limitation of this approach is when local knowledge or verification on the ground (‘ground truth’) is required. In such situations, the ‘crowdsourcing’ approach will work well in central, highly populated or popular sites where there are many visitors and therefore the probability that several of them will be involved in data collection rise. Even so, it is possible to encourage participants to record less popular places through a range of suitable incentives.
The ‘social’ approach is also building on the principle of abundance in terms of the number of participants, but with a more detailed understanding of their knowledge, skills and experience. In this approach, some participants are asked to monitor and verify the information that was collected by less experienced participants. The social method is well established in citizen science programmes such as bird watching, where some participants who are more experienced in identifying bird species help to verify observations by other participants. To deploy the social approach, there is a need for a structured organisations in which some members are recognised as more experienced, and are given the appropriate tools to check and approve information.
The ‘geographic’ approach uses known geographical knowledge to evaluate the validity of the information that is received by volunteers. For example, by using existing knowledge about the distribution of streams from a river, it is possible to assess if mapping that was contributed by volunteers of a new river is comprehensive or not. A variation of this approach is the use of recorded information, even if it is out-of-date, to verify the information by comparing how much of the information that is already known also appear in a VGI source. Geographic knowledge can be potentially encoded in software algorithms.
The ‘domain’ approach is an extension of the geographic one, and in addition to geographical knowledge uses a specific knowledge that is relevant to the domain in which information is collected. For example, in many citizen science projects that involved collecting biological observations, there will be some body of information about species distribution both spatially and temporally. Therefore, a new observation can be tested against this knowledge, again algorithmically, and help in ensuring that new observations are accurate.
The ‘instrumental observation’ approach remove some of the subjective aspects of data collection by a human that might made an error, and rely instead on the availability of equipment that the person is using. Because of the increased in availability of accurate-enough equipment, such as the various sensors that are integrated in smartphones, many people keep in their pockets mobile computers with ability to collect location, direction, imagery and sound. For example, images files that are captured in smartphones include in the file the GPS coordinates and time-stamp, which for a vast majority of people are beyond their ability to manipulate. Thus, the automatic instrumental recording of information provide evidence for the quality and accuracy of the information.
Finally, the ‘process oriented’ approach bring VGI closer to traditional industrial processes. Under this approach, the participants go through some training before collecting information, and the process of data collection or analysis is highly structured to ensure that the resulting information is of suitable quality. This can include provision of standardised equipment, online training or instruction sheets and a structured data recording process. For example, volunteers who participate in the US Community Collaborative Rain, Hail & Snow network (CoCoRaHS) receive standardised rain gauge, instructions on how to install it and an online resources to learn about data collection and reporting.
Importantly, these approach are not used in isolation and in any given project it is likely to see a combination of them in operation. Thus, an element of training and guidance to users can appear in a downloadable application that is distributed widely, and therefore the method that will be used in such a project will be a combination of the process oriented with the crowdsourcing approach. Another example is the OpenStreetMap project, which in the general do not follow limited guidance to volunteers in terms of information that they collect or the location in which they collect it. Yet, a subset of the information that is collected in OpenStreetMap database about wheelchair access is done through the highly structured process of the WheelMap application in which the participant is require to select one of four possible settings that indicate accessibility. Another subset of the information that is recorded for humanitarian efforts is following the social model in which the tasks are divided between volunteers using the Humanitarian OpenStreetMap Team (H.O.T) task manager, and the data that is collected is verified by more experienced participants.
The final, and critical point for quality assurance of VGI that was noted above is fitness-for-purpose. In some VGI activities the information has a direct and clear application, in which case it is possible to define specifications for the quality assurance element that were listed above. However, one of the core aspects that was noted above is the heterogeneity of the information that is collected by volunteers. Therefore, before using VGI for a specific application there is a need to check for its fitness for this specific use. While this is true for all geographic information, and even so called ‘authoritative’ data sources can suffer from hidden biases (e.g. luck of update of information in rural areas), the situation with VGI is that variability can change dramatically over short distances – so while the centre of a city will be mapped by many people, a deprived suburb near the centre will not be mapped and updated. There are also limitations that are caused by the instruments in use – for example, the GPS positional accuracy of the smartphones in use. Such aspects should also be taken into account, ensuring that the quality assurance is also fit-for-purpose.
References and Further Readings
Bonney, Rick. 1996. Citizen Science – a lab tradition, Living Bird, Autumn 1996.
Bonney, Rick, Shirk, Jennifer, Phillips, Tina B. 2013. Citizen Science, Encyclopaedia of science education. Berlin: Springer-Verlag.
Goodchild, Michael F. 2007. Citizens as sensors: the world of volunteered geography. GeoJournal, 69(4), 211–221.
Goodchild, Michael F., and Li, Linna. 2012, Assuring the quality of volunteered geographic information. Spatial Statistics, 1 110-120
Haklay, Mordechai. 2010. How Good is volunteered geographical information? a comparative study of OpenStreetMap and ordnance survey datasets. Environment and Planning B: Planning and Design, 37(4), 682–703.
Sui, Daniel, Elwood, Sarah and Goodchild, Michael F. (eds), 2013. Crowdsourcing Geographic Knowledge, Berlin:Springer-Verlag.
Van Oort, Pepjin .A.J. 2006. Spatial data quality: from description to application, PhD Thesis, Wageningen: Wageningen Universiteit, p. 125.
14 August, 2014
As far as I can tell, Nelson et al. 2006 ‘Towards development of a high quality public domain global roads database‘ and Taylor & Caquard 2006 Cybercartography: Maps and Mapping in the Information Era are the first peer review papers that mention OpenStreetMap. Since then, OpenStreetMap received plenty of academic attention. More ‘conservative’ search engines such as ScienceDirect or Scopus find 286 and 236 peer review papers that mention the project (respectively). The ACM digital library finds 461 papers in the areas that are relevant to computing and electronics, while Microsoft Academic Research find only 112. Google Scholar lists over 9000 (!). Even with the most conservative version from Microsoft, we can see an impact on fields ranging from social science to engineering and physics. So lots to be proud about as a major contribution to knowledge beyond producing maps.
Michael Goodchild, in his 2007 paper that started the research into Volunteered Geographic Information (VGI), mentioned OpenStreetMap (OSM), and since then there is a lot of conflation between OSM and VGI. In some recent papers you can find statements such as ‘OpenstreetMap is considered as one of the most successful and popular VGI projects‘ or ‘the most prominent VGI project OpenStreetMap‘ so at some level, the boundary between the two is being blurred. I’m part of the problem – for example, in the title of my 2010 paper ‘How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets‘. However, the more I was thinking about it, the more I am uncomfortable with this equivalence. I would think that the recent line from Neis & Zielstra (2013) is more accurate: ‘One of the most utilized, analyzed and cited VGI-platforms, with an increasing popularity over the past few years, is OpenStreetMap (OSM)‘. I’ll explain why.
Let’s look at the whole area of OpenStreetMap studies. Over the past decade, several types of research papers emerged.
There is a whole set of research projects that use OSM data because it’s easy to use and free to access (in computer vision or even string theory). These studies are not part of ‘OSM studies’ or VGI, as for them, this is just data to be used.
Thirdly, there are studies that also look at the interactions between the contribution and the data – for example, in trying to infer trustworthiness.
[Unfortunately, due to academic practices and publication outlets, a lot of these papers are locked behind paywalls, but this is another issue... ]
In short, this is a significant body of knowledge about the nature of the project, the implications of what it produces, and ways to understand the information that emerge from it. Clearly, we now know that OSM produce good data and know about the patterns of contribution. What is also clear that the many of these patterns are specific to OSM. Because of the importance of OSM to so many applications areas (including illustrative maps in string theory!) these insights are very important. Some of them are expected to be also present in other VGI projects (hence my suggestions for assertions about VGI) but this need to be done carefully, only when there is evidence from other projects that this is the case. In short, we should avoid conflating VGI and OSM.
9 August, 2014
Today, OpenStreetMap celebrates 10 years of operation as counted from the date of registration. I’ve heard about the project when it was in early stages, mostly because I knew Steve Coast when I was studying for my Ph.D. at UCL. As a result, I was also able to secured the first ever research grant that focused on OpenStreetMap (and hence Volunteered Geographic Information – VGI) from the Royal Geographical Society in 2005. A lot can be said about being in the right place at the right time!
Having followed the project during this decade, there is much to reflect on – such as thinking about open research questions, things that the academic literature failed to notice about OSM or the things that we do know about OSM and VGI because of the openness of the project. However, as I was preparing the talk for the INSPIRE conference, I was starting to think about the start dates of OSM (2004), TomTom Map Share (2007), Waze (2008), Google Map Maker (2008). While there are conceptual and operational differences between these projects, in terms of ‘knowledge-based peer production systems’ they are fairly similar: all rely on large number of contributors, all use both large group of contributors who contribute little, and a much smaller group of committed contributors who do the more complex work, and all are about mapping. Yet, OSM started 3 years before these other crowdsourced mapping projects, and all of them have more contributors than OSM.
Since OSM is described as ‘Wikipedia of maps‘, the analogy that I was starting to think of was that it’s a bit like a parallel history, in which in 2001, as Wikipedia starts, Encarta and Britannica look at the upstart and set up their own crowdsourcing operations so within 3 years they are up and running. By 2011, Wikipedia continues as a copyright free encyclopedia with sizable community, but Encarta and Britannica have more contributors and more visibility.
Knowing OSM closely, I felt that this is not a fair analogy. While there are some organisational and contribution practices that can be used to claim that ‘it’s the fault of the licence’ or ‘it’s because of the project’s culture’ and therefore justify this, not flattering, analogy to OSM, I sensed that there is something else that should be used to explain what is going on.
Then, during my holiday in Italy, I was enjoying the offline TripAdvisor app for Florence, using OSM for navigation (in contrast to Google Maps which are used in the online app) and an answer emerged. Within OSM community, from the start, there was some tension between the ‘map’ and ‘database’ view of the project. Is it about collecting the data so beautiful maps or is it about building a database that can be used for many applications?
Saying that OSM is about the map mean that the analogy is correct, as it is very similar to Wikipedia – you want to share knowledge, you put it online with a system that allow you to display it quickly with tools that support easy editing the information sharing. If, on the other hand, OSM is about a database, then OSM is about something that is used at the back-end of other applications, a lot like DBMS or Operating System. Although there are tools that help you to do things easily and quickly and check the information that you’ve entered (e.g. displaying the information as a map), the main goal is the building of the back-end.
Maybe a better analogy is to think of OSM as ‘Linux of maps’, which mean that it is an infrastructure project which is expected to have a lot of visibility among the professionals who need it (system managers in the case of Linux, GIS/Geoweb developers for OSM), with a strong community that support and contribute to it. The same way that some tech-savvy people know about Linux, but most people don’t, I suspect that TripAdvisor offline users don’t notice that they use OSM, they are just happy to have a map.
The problem with the Linux analogy is that OSM is more than software – it is indeed a database of information about geography from all over the world (and therefore the Wikipedia analogy has its place). Therefore, it is somewhere in between. In a way, it provide a demonstration for the common claim in GIS circles that ‘spatial is special‘. Geographical information is infrastructure in the same way that operating systems or DBMS are, but in this case it’s not enough to create an empty shell that can be filled-in for the specific instance, but there is a need for a significant amount of base information before you are able to start building your own application with additional information. This is also the philosophical difference that make the licensing issues more complex!
In short, both Linux or Wikipedia analogies are inadequate to capture what OSM is. It has been illuminating and fascinating to follow the project over its first decade, and may it continue successfully for more decades to come.
1 June, 2014
‘More or Less‘ is a good programme on BBC Radio 4. Regularly exploring the numbers and the evidence behind news stories and other important things, and checking if they stand out. However, the piece that was broadcast this week about Golf courses and housing in the UK provides a nice demonstration of when not to use crowdsourced information. The issue that was discussed was how much actual space golf courses occupy, when compared to space that is used for housing. All was well, until they announced in the piece the use of clever software (read GIS) with a statistical superhero to do the analysis. Interestingly, the data that was used for the analysis was OpenStreetMap – and because the news item was about Surrey, they started doing the analysis with it.
For the analysis to be correct, you need to assume that all the building polygons in OpenStreetMap and all the Golf courses have been identified and mapped. My own guess that in Surrey, this could be the case – especially with all the wonderful work of James Rutter catalysed. However, assuming that this is the case for the rest of the country is, well, a bit fancy. I wouldn’t dare to state that OpenStreetMap is complete to such a level, without lots of quality testing which I haven’t seen. There is only the road length analysis of ITO World! and other bits of analysis, but we don’t know how complete OSM is.
While I like OpenStreetMap very much, it is utterly unsuitable for any sort of statistical analysis that works at the building level and then summing up to the country level – because of the heterogeneity of the data . For that sort of thing, you have to use a consistent dataset, or at least one that attempts to be consistent, and that data comes from the Ordnance Survey.
As with other statistical affairs, the core case that is made about the assertion as a whole in the rest of the clip is relevant here. First, we should question the unit of analysis (is it right to compare the footprint of a house to the area of Golf courses? Probably not) and what is to be gained by adding up individual building’s footprints to the level of the UK while ignoring roads, gardens, and all the rest of the built environment. Just because it is possible to add up every building’s footprint, doesn’t mean that you should. Second, this analysis is the sort of example of ‘Big Data’ fallacy which goes analyse first, then question (if at all) what the relationship between the data and reality.
29 March, 2014
Thursday marked the launch of The Conservation Volunteers (TCV) report on volunteering impact where they summarised a three year project that explored motivations, changes in pro-environmental behaviour, wellbeing and community resilience. The report is worth a read as it goes beyond the direct impact on the local environment of TCV activities, and demonstrates how involvement in environmental volunteering can have multiple benefits. In a way, it is adding ingredients to a more holistic understanding of ‘green volunteering’.
One of the interesting aspects of the report is in the longitudinal analysis of volunteers motivation (copied here from the report). The comparison is from 784 baseline surveys, 202 Second surveys and 73 third surveys, which were done with volunteers while they were involved with the TCV. The second survey was taken after 4 volunteering sessions, and the third after 10 sessions.
The results of the surveys are interesting in the context of online activities (e.g. citizen science or VGI) because they provide an example for an activity that happen off line – in green spaces such as local parks, community gardens and the such. Moreover, the people that are participating in them come from all walks of life, as previous analysis of TCV data demonstrated that they are recruiting volunteers across the socio-economic spectrum. So here is an activity that can be compared to online volunteering. This is valuable, as if the pattern of TCV information are similar, then we can understand online volunteering as part of general volunteering and not assume that technology changes everything.
So the graph above attracted my attention because of the similarities to Nama Budhathoki work on the motivation of OpenStreetMap volunteers. First, there is a difference between the reasons that are influencing the people that join just one session and those that are involved for the longer time. Secondly, social and personal development aspects are becoming more important over time.
There is clear need to continue and explore the data – especially because the numbers that are being surveyed at each period are different, but this is an interesting finding, and there is surly more to explore. Some of it will be explored by Valentine Seymour in ExCiteS who is working with TCV as part of her PhD.
It is also worth listening to the qualitative observations by volunteers, as expressed in the video that open the event, which is provided below.
Following the two previous assertions, namely that:
‘you can be supported by a huge crowd for a very short time, or by few for a long time, but you can’t have a huge crowd all of the time (unless data collection is passive)’ (original post here)
‘All information sources are heterogeneous, but some are more honest about it than others’ (original post here)
The third assertion is about pattern of participation. It is one that I’ve mentioned before and in some way it is a corollary of the two assertions above.
‘When looking at crowdsourced information, always keep participation inequality in mind’
Because crowdsourced information, either Volunteered Geographic Information or Citizen Science, is created through a socio-technical process, all too often it is easy to forget the social side – especially when you are looking at the information without the metadata of who collected it and when. So when working with OpenStreetMap data, or viewing the distribution of bird species in eBird (below), even though the data source is expected to be heterogeneous, each observation is treated as similar to other observation and assumed to be produced in a similar way.
Yet, data is not only heterogeneous in terms of consistency and coverage, it is also highly heterogeneous in terms of contribution. One of the most persistence findings from studies of various systems – for example in Wikipedia , OpenStreetMap and even in volunteer computing is that there is a very distinctive heterogeneity in contribution. The phenomena was term ‘Participation Inequality‘ by Jakob Nielsn in 2006 and it is summarised succinctly in the diagram below (from Visual Liberation blog) – very small number of contributors add most of the content, while most of the people that are involved in using the information will not contribute at all. Even when examining only those that actually contribute, in some project over 70% contribute only once, with a tiny minority contributing most of the information.
Therefore, when looking at sources of information that were created through such process, it is critical to remember the nature of contribution. This has far reaching implications on quality as it is dependent on the expertise of the heavy contributors, on their spatial and temporal engagement, and even on their social interaction and practices (e.g. abrasive behaviour towards other participants).
Because of these factors, it is critical to remember the impact and implications of participation inequality on the analysis of the information. There will be some analysis to which it will have less impact and some where it will have major one. In either cases, it need to be taken into account.
Following the last post, which focused on an assertion about crowdsourced geographic information and citizen science I continue with another observation. As was noted in the previous post, these can be treated as ‘laws’ as they seem to emerge as common patterns from multiple projects in different areas of activity – from citizen science to crowdsourced geographic information. The first assertion was about the relationship between the number of volunteers who can participate in an activity and the amount of time and effort that they are expect to contribute.
This time, I look at one aspect of data quality, which is about consistency and coverage. Here the following assertion applies:
‘All information sources are heterogeneous, but some are more honest about it than others’
What I mean by that is the on-going argument about authoritative and crowdsourced information sources (Flanagin and Metzger 2008 frequently come up in this context), which was also at the root of the Wikipedia vs. Britannica debate, and the mistrust in citizen science observations and the constant questioning if they can do ‘real research’.
There are many aspects for these concerns, so the assertion deals with the aspects of comprehensiveness and consistency which are used as a reason to dismiss crowdsourced information when comparing them to authoritative data. However, at a closer look we can see that all these information sources are fundamentally heterogeneous. Despite of all the effort to define precisely standards for data collection in authoritative data, heterogeneity creeps in because of budget and time limitations, decisions about what is worthy to collect and how, and the clash between reality and the specifications. Here are two examples:
Take one of the Ordnance Survey Open Data sources – the map present themselves as consistent and covering the whole country in an orderly way. However, dig in to the details for the mapping, and you discover that the Ordnance Survey uses different standards for mapping urban, rural and remote areas. Yet, the derived products that are generalised and manipulated in various ways, such as Meridian or Vector Map District, do not provide a clear indication which parts originated from which scale – so the heterogeneity of the source disappeared in the final product.
The census is also heterogeneous, and it is a good case of specifications vs. reality. Not everyone fill in the forms and even with the best effort of enumerators it is impossible to collect all the data, and therefore statistical analysis and manipulation of the results are required to produce a well reasoned assessment of the population. This is expected, even though it is not always understood.
Therefore, even the best information sources that we accept as authoritative are heterogeneous, but as I’ve stated, they just not completely honest about it. The ONS doesn’t release the full original set of data before all the manipulations, nor completely disclose all the assumptions that went into reaching the final value. The Ordnance Survey doesn’t tag every line with metadata about the date of collection and scale.
Somewhat counter-intuitively, exactly because crowdsourced information is expected to be inconsistent, we approach it as such and ask questions about its fitness for use. So in that way it is more honest about the inherent heterogeneity.
Importantly, the assertion should not be taken to be dismissive of authoritative sources, or ignoring that the heterogeneity within crowdsources information sources is likely to be much higher than in authoritative ones. Of course all the investment in making things consistent and the effort to get universal coverage is indeed worth it, and it will be foolish and counterproductive to consider that such sources of information can be replaced as is suggest for the census or that it’s not worth investing in the Ordnance Survey to update the authoritative data sets.
Moreover, when commercial interests meet crowdsourced geographic information or citizen science, the ‘honesty’ disappear. For example, even though we know that Google Map Maker is now used in many part
s of the world (see the figure), even in cases when access to vector data is provided by Google, you cannot find out about who contribute, when and where. It is also presented as an authoritative source of information.
Despite the risk of misinterpretation, the assertion can be useful as a reminder that the differences between authoritative and crowdsourced information are not as big as it may seem.