Citizen Science & Scientific Crowdsourcing – week 5 – Data quality

This week, in the “Introduction to Citizen Science & Scientific Crowdsourcing” course, our focus was on data management, completing the first part of the course (the second part starts in a week’s time, since we have a mid-term “Reading Week” at UCL).

The part that I enjoyed developing most was the segment that addresses the data quality concerns that are frequently raised about citizen science and geographic crowdsourcing. Here are the slides from this segment, and below them the rationale for the content and detailed notes.

I’ve written a lot on this blog about data quality, and in many talks that I have given about citizen science and crowdsourced geographic information, the question about data quality is the first one to come up. It is a valid question, and it has led to useful research – for example on OpenStreetMap. I recall the early conversations about the quality and longevity potential of OSM during a journey to the Association for Geographic Information (AGI) conference 10 years ago.

However, when you are asked the same question again, and again, and again, at some point you start wondering “why am I being asked this question?” – especially when you know that it has been over 10 years since it was demonstrated that the quality is beyond “good enough”, and that there are over 50 papers on citizen science data quality. So why is the problem so persistent?

Therefore, the purpose of the segment was to explain the concerns about citizen science data quality and their origin; then to explain a core misunderstanding (the assumption that the quality assessment methods that are used in “scarcity” conditions also work in “abundance” conditions); and then to cover the main approaches to ensuring quality (based on my article for the International Encyclopedia of Geography). The aim is to equip the students with a suitable explanation of why citizen science projects need to be approached differently, and then to inform them of the available methods. Quite a lot for 10 minutes!

So here are the notes from the slides:

[Slide 1] When it comes to citizen science, it is very common to hear suggestions that the data is not good enough and that volunteers cannot collect data of good quality because, unlike trained researchers, we don’t know who they are – a perception that we know little about the people who are involved and therefore cannot judge their ability. There is also a perception that, like Wikipedia, it is all very loosely coordinated and therefore there are no strict data quality procedures. However, even in the Wikipedia case, the scientific journal Nature showed over a decade ago (2005) that Wikipedia achieves similar quality to Encyclopaedia Britannica, and we will see that OpenStreetMap produces data of a similar quality to professional sources.
In citizen science that includes sensing and data collection from instruments, there are also concerns over the quality of the instruments and their calibration – that is, the ability to compare the results with those from high-end instruments.
The opening of the Hunter et al. paper (which offers some solutions) summarises the concerns that are raised over data quality.

[Slide 2] Based on conversations with scientists and concerns that appear in the literature, there is also a cultural aspect at play, which is expressed in many ways – with data quality being used as an outlet to express it. This can be similar to the concerns that were raised in “The Cult of the Amateur” (which we’ve seen in week 2 regarding the critique of crowdsourcing): protecting the position of professional scientists and avoiding the need to change practices. There are also special concerns when citizen science is connected to activism, as this seems to “politicise” science or make the data suspicious – we will see in the next lecture that the story is more complex. Finally, and more kindly, we can also notice that because scientists are used to top-down mechanisms, they find alternative ways of collecting data and ensuring quality unfamiliar and untested.

[Slide 3] Against this background, it is not surprising to see that checking data quality in citizen science is a popular research topic. Caren Cooper has identified over 50 papers that compare citizen science data with data collected by professionals – as she points out: “To satisfy those who want some nitty gritty about how citizen science projects actually address data quality, here is my medium-length answer, a brief review of the technical aspects of designing and implementing citizen science to ensure the data are fit for intended uses. When it comes to crowd-driven citizen science, it makes sense to assess how those data are handled and used appropriately. Rather than question whether citizen science data quality is low or high, ask whether it is fit or unfit for a given purpose. For example, in studies of species distributions, data on presence-only will fit fewer purposes (like invasive species monitoring) than data on presence and absence, which are more powerful. Designing protocols so that citizen scientists report what they do not see can be challenging which is why some projects place special emphasis on the importance of “zero data.”
It is a misnomer that the quality of each individual data point can be assessed without context. Yet one of the most common ways to examine citizen science data quality has been to compare volunteer data to those collected by trained technicians and scientists. Even a few years ago I’d noticed over 50 papers making these types of comparisons and the overwhelming evidence suggested that volunteer data are fine. And in those few instances when volunteer observations did not match those of professionals, that was evidence of poor project design. While these studies can be reassuring, they are not always necessary nor would they ever be sufficient.” (http://blogs.plos.org/citizensci/2016/12/21/quality-and-quantity-with-citizen-science/)

[Slide 4] One way to examine the issue of data quality is to think of a clash between two concepts and systems of thinking about how to address quality issues. We can consider the conditions of standard scientific research as ones of scarcity: limited funding, a limited number of people with the necessary skills, limited laboratory space, and expensive instruments that need to be used in a very specific way – sometimes unique instruments.
The conditions of citizen science, on the other hand, are of abundance – we have a large number of participants with multiple skills, and the cost per participant is low; they bring their own instruments, use their own time, and are also distributed in places that we usually don’t get to (backyards, across the country – we talked about it in week 2). Conditions of abundance are different and require different thinking for quality assurance.

[Slide 5] Here are some of the differences. Under conditions of scarcity, it is worth investing in long training to ensure that the data collection is as good as possible the first time it is attempted, since time is scarce. Also, we would try to maximise the output from each activity that our researcher carries out, and we will put in place procedures and standards to ensure “once & good” or even “once & best” optimisation. We can also require all the people in the study to use the same equipment and software, as this streamlines the process.
On the other hand, in abundance conditions we need to assume that people come with a whole range of skills and that training can be variable – some people will get trained on the activity over a long time, while to start the process we would want people to have light training and join in. We also think of activities differently – e.g. conceiving the data collection as micro-tasks. We might also have multiple procedures, and even different ways to record information, to cater for different audiences. We will also need to expect a whole range of instrumentation, sometimes with limited information about the characteristics of the instruments.
Once we understand these new conditions, we can come up with appropriate data collection procedures that ensure data quality suitable for this context.

[Slide 6] There are multiple ways of ensuring data quality in citizen science data. Let’s briefly look at each one of these. The first three methods were suggested by Mike Goodchild and Linna Li in a paper from 2012.

[Slide 7] The first method for quality assurance is crowdsourcing – the use of multiple people who carry out the same work, in effect performing peer review or replication of the analysis, which is desirable across the sciences. As Watson and Floridi argued, using the example of Zooniverse, the approaches that are used in crowdsourcing give these methods a stronger claim on accuracy and scientifically correct identification, because they compare multiple observers who work independently.
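To make this concrete, here is a minimal sketch of how agreement between independent classifiers might be aggregated; the function, labels and thresholds are illustrative assumptions, not taken from Zooniverse or any particular project:

```python
from collections import Counter

def consensus(classifications, min_votes=3, min_agreement=0.8):
    """Aggregate independent volunteer classifications of one item.

    Returns the majority label and its support, or None when there are
    too few votes or agreement is too weak to accept an answer.
    """
    if len(classifications) < min_votes:
        return None  # not enough independent observers yet
    label, count = Counter(classifications).most_common(1)[0]
    agreement = count / len(classifications)
    return (label, agreement) if agreement >= min_agreement else None

# Five volunteers classify the same image independently.
print(consensus(["spiral", "spiral", "spiral", "elliptical", "spiral"]))
# -> ('spiral', 0.8)
```

Raising the vote or agreement thresholds trades coverage for confidence; items without consensus can be routed to a more experienced participant, which connects this approach to the social one described next.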

[Slide 8] The social form of quality assurance uses more experienced participants to check the information that was collected by less experienced ones, and so ensure that the data is correct. This is fairly common in many areas of biodiversity observation and is integrated into iSpot, but it also exists in other areas, such as mapping, where some information gets moderated (we’ve seen that in Google Local Guides, when a place is deleted).

[Slide 9] The geographical rules approach is especially relevant to information about mapping and locations. Because we know things about the nature of geography – the most obvious being land and sea in this example – we can use this knowledge to check that the information that is provided makes sense, such as this example of two bumblebees that were recorded in OPAL in the middle of the sea. While it might be the case that someone saw them while sailing or on some other vessel, we can integrate a rule into our data management system and ask for more details when we get observations in such a location. There are many other such rules – about streams, lakes, slopes and more.
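A minimal sketch of such a rule, using a ray-casting point-in-polygon test; the rectangle standing in for a land mass and the flagging behaviour are assumptions for illustration only:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: is (lon, lat) inside the polygon (list of vertices)?"""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > lat) != (yj > lat) and \
           lon < (xj - xi) * (lat - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

# A crude rectangle standing in for a real land-mass polygon.
LAND = [(-6.0, 50.0), (2.0, 50.0), (2.0, 59.0), (-6.0, 59.0)]

def check_record(lon, lat):
    # A terrestrial species reported at sea is not rejected outright;
    # it is flagged so the project can ask the contributor for details.
    return "ok" if point_in_polygon(lon, lat, LAND) else "flag: ask for details"

print(check_record(-0.13, 51.51))  # on land -> ok
print(check_record(-10.0, 48.0))   # out at sea -> flagged
```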

[Slide 10] The ‘domain’ approach is an extension of the geographic one: in addition to geographical knowledge, it uses specific knowledge that is relevant to the domain in which the information is collected. For example, in many citizen science projects that involve collecting biological observations, there will be some body of information about species distribution, both spatially and temporally. Therefore, a new observation can be tested against this knowledge, again algorithmically, helping to ensure that new observations are accurate. If we see a monarch butterfly within the marked area, we can assume that it will not harm the dataset even if it was a mistaken identification, while an outlier (temporal, geographic, or in other characteristics) should stand out.
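A minimal sketch of such an algorithmic check; the bounding box and season below are invented for illustration, and a real project would draw on curated distribution data:

```python
from datetime import date

# Illustrative "known range" for a species: a bounding box plus flight season.
KNOWN_RANGE = {
    "monarch butterfly": {
        "lon": (-105.0, -70.0), "lat": (25.0, 50.0),
        "months": range(4, 11),   # roughly April to October
    },
}

def plausibility_flags(species, lon, lat, when):
    """Return reasons a record looks like an outlier (empty list = plausible)."""
    rng = KNOWN_RANGE.get(species)
    if rng is None:
        return ["species not in reference list"]
    flags = []
    if not (rng["lon"][0] <= lon <= rng["lon"][1] and
            rng["lat"][0] <= lat <= rng["lat"][1]):
        flags.append("outside known spatial range")
    if when.month not in rng["months"]:
        flags.append("outside known season")
    return flags

print(plausibility_flags("monarch butterfly", -80.0, 40.0, date(2018, 6, 15)))
# -> [] (plausible)
print(plausibility_flags("monarch butterfly", -80.0, 40.0, date(2018, 1, 15)))
# -> ['outside known season']
```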

[Slide 11] The ‘instrumental observation’ approach removes some of the subjective aspects of data collection by a human who might make an error, and relies instead on the equipment that the person is using. Because of the increase in availability of accurate-enough equipment, such as the various sensors that are integrated into smartphones, many people keep in their pockets mobile computers with the ability to record location, direction, imagery and sound. For example, image files that are captured on smartphones include the GPS coordinates and a time-stamp in the file, which the vast majority of people are unable to manipulate. Thus, the automatic instrumental recording of information provides evidence for the quality and accuracy of the information. This is where the metadata of the information becomes very valuable, as it provides the necessary evidence.
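As an illustration of how such metadata can be read, here is a minimal sketch using the Pillow imaging library; the file name is a placeholder, and a real project would add error handling for malformed EXIF data:

```python
from PIL import Image, ExifTags  # Pillow

def gps_from_jpeg(path):
    """Extract the GPS coordinates a smartphone camera wrote into a photo's EXIF."""
    exif = Image.open(path)._getexif() or {}
    tags = {ExifTags.TAGS.get(k): v for k, v in exif.items()}
    gps_raw = tags.get("GPSInfo")
    if not gps_raw:
        return None
    gps = {ExifTags.GPSTAGS.get(k): v for k, v in gps_raw.items()}

    def to_degrees(dms, ref):
        # EXIF stores degrees/minutes/seconds as rational numbers.
        deg = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -deg if ref in ("S", "W") else deg

    return (to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"]),
            to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"]))

# print(gps_from_jpeg("observation.jpg"))  # e.g. (51.5246, -0.1340)
```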

[Slide 12] Finally, the ‘process oriented’ approach brings citizen science closer to traditional industrial processes. Under this approach, the participants go through some training before collecting information, and the process of data collection or analysis is highly structured to ensure that the resulting information is of suitable quality. This can include the provision of standardised equipment, online training or instruction sheets, and a structured data recording process. For example, volunteers who participate in the US Community Collaborative Rain, Hail & Snow network (CoCoRaHS) receive a standardised rain gauge, instructions on how to install it, and online resources to learn about data collection and reporting.

[Slide 13] What is important to be aware of is that these methods are not used alone, but in combination. The analysis by Wiggins et al. in 2011 sets out a framework that includes 17 different mechanisms for ensuring data quality. It is therefore not surprising that, with appropriate design, citizen science projects can provide high-quality data.


International Encyclopedia of Geography – Quality Assurance of VGI

The Association of American Geographers is coordinating an effort to create an International Encyclopedia of Geography. Plans started in 2010, with an aim to see the 15-volume project published in 2015 or 2016. Interestingly, this shows that publishers and scholars still see value in creating subject-specific encyclopedias. On the other hand, the weird decision by Wikipedians that Geographic Information Science doesn’t exist outside GIS shows that geographers need a place to define their practice by themselves. You can find more information about the AAG International Encyclopedia project in an interview with Doug Richardson from 2012.

As part of this effort, I was asked to write an entry on ‘Volunteered Geographic Information, Quality Assurance’ as a short piece of about 3,000 words. To do this, I looked around for the mechanisms that are used in VGI and in citizen science. These are covered in OpenStreetMap studies and similar work in GIScience; in the area of citizen science, there are reviews such as the one by Andrea Wiggins and colleagues of mechanisms to ensure data quality in citizen science projects, which clearly demonstrated that projects use multiple methods to ensure data quality.

Below you’ll find an abridged version of the entry (but still long). The citation for this entry will be:

Haklay, M., Forthcoming. Volunteered geographic information, quality assurance. In D. Richardson, N. Castree, M. Goodchild, W. Liu, A. Kobayashi, & R. Marston (Eds.) The International Encyclopedia of Geography: People, the Earth, Environment, and Technology. Hoboken, NJ: Wiley/AAG

In the entry, I identified six types of mechanisms that are used to ensure quality assurance when the data has a geographical component, in either VGI or citizen science. If I have missed a type of quality assurance mechanism, please let me know!

Here is the entry:

Volunteered geographic information, quality assurance

Volunteered Geographic Information (VGI) originates outside the realm of professional data collection by scientists, surveyors and geographers. Quality assurance of such information is important for people who want to use it, as they need to identify whether it is fit-for-purpose. Goodchild and Li (2012) identified three approaches for VGI quality assurance: a ‘crowdsourcing’ approach that relies on the number of people who have edited the information, a ‘social’ approach that is based on gatekeepers and moderators, and a ‘geographic’ approach that uses broader geographic knowledge to verify that the information fits into the existing understanding of the natural world. In addition to the approaches that Goodchild and Li identified, there are also a ‘domain’ approach that relates to the understanding of the knowledge domain of the information, an ‘instrumental observation’ approach that relies on technology, and a ‘process oriented’ approach that brings VGI closer to industrialised procedures. First, we need to understand the nature of VGI and the source of concern with quality assurance.

While the term volunteered geographic information (VGI) is relatively new (Goodchild 2007), the activities that this term describes are not. Another relatively recent term, citizen science (Bonney 1996), which describes the participation of volunteers in collecting, analysing and sharing scientific information, provides the historical context. While the term is relatively new, the collection of accurate information by non-professional participants turns out to have been an integral part of scientific activity since the 17th century, and likely before (Bonney et al 2013). Therefore, when approaching the question of quality assurance of VGI, it is critical to see it within the wider context of scientific data collection, and not to fall into the trap of novelty and consider it to be without precedent.

Yet this integration needs to take into account the insights that have emerged within geographic information science (GIScience) research over the past decades. Within GIScience, it is the body of research on spatial data quality that provides the framing for VGI quality assurance. Van Oort’s (2006) comprehensive synthesis of various quality standards identifies the following elements of spatial data quality discussions:

  • Lineage – a description of the history of the dataset.
  • Positional accuracy – how well the coordinate value of an object in the database relates to the reality on the ground.
  • Attribute accuracy – the correctness of the additional attributes, since objects in a geographical database are represented not only by their geometrical shape but also by additional attributes.
  • Logical consistency – the internal consistency of the dataset.
  • Completeness – how many objects that are expected to be found in the database are missing, as well as an assessment of excess data that should not be included.
  • Usage, purpose and constraints – a fitness-for-purpose declaration that should help potential users decide how the data should be used.
  • Temporal quality – a measure of the validity of changes in the database in relation to real-world changes, and also the rate of updates.

While some of these quality elements might seem independent of a specific application, in reality they can only be evaluated within a specific context of use. For example, when carrying out an analysis of street-lighting in a specific part of town, the question of completeness becomes specific: are all the street-light objects within the bounds of the area of interest recorded? Whether the dataset is complete for another part of the settlement is irrelevant to the task at hand. The scrutiny of information quality within a specific application, to ensure that it is good enough for the needs, is termed ‘fitness for purpose’. As we shall see, fitness-for-purpose is a central issue with respect to VGI.
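A toy numeric example of why completeness is relative to the area of interest; the object IDs and areas below are invented for illustration:

```python
# Completeness is only meaningful relative to an area of interest:
# the same dataset can be complete for one neighbourhood and not another.
def completeness(found, expected):
    return len(found & expected) / len(expected)

# Hypothetical street-light IDs from a field survey vs the database, per area.
survey = {"area_A": {1, 2, 3, 4, 5}, "area_B": {6, 7, 8, 9}}
database = {"area_A": {1, 2, 3, 4, 5}, "area_B": {6, 7}}

for area in survey:
    print(area, completeness(database[area], survey[area]))
# area_A 1.0  (fit for a street-lighting analysis here)
# area_B 0.5  (not fit for the same analysis there)
```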

To understand the reason that geographers are concerned with quality assurance of VGI, we need to recall the historical development of geographic information, and especially the historical context of geographic information systems (GIS) and GIScience development since the 1960s. For most of the 20th century, geographic information production became professionalised and institutionalised. The creation, organisation and distribution of geographic information was done by official bodies, such as national mapping agencies or national geological bodies, which were funded by the state. As a result, the production of geographic information became an industrialised scientific process in which the aim is to produce a standardised product – commonly a map. Due to financial, skills and process limitations, products were engineered carefully so they could be used for multiple purposes. Thus, a topographic map can be used for navigation, but also for urban planning and for many other purposes. Because the products were standardised, detailed specifications could be drawn up, against which the quality elements could be tested and quality assurance procedures developed. This was the backdrop to the development of GIS, and to the conceptualisation of spatial data quality.

The practices of centralised, scientific and industrialised geographic information production lend themselves to quality assurance procedures that are deployed through organisational or professional structures, and this explains the perceived challenges with VGI. Centralised practices also supported employing people who focus on quality assurance, for example going to the field with a map and testing that it complies with the specifications that were used to create it. In contrast, most of the collection of VGI is done outside organisational frameworks. The people who contribute the data are not employees, and seemingly cannot be put through training programmes, asked to follow quality assurance procedures, or expected to use standardised equipment that can be calibrated. The lack of coordination and top-down forms of production raises questions about ensuring the quality of the information that emerges from VGI.

Considering quality assurance within VGI requires an understanding of some underlying principles that are common to VGI practices and differentiate it from organised and industrialised geographic information creation. For example, VGI can be collected under conditions of scarcity or abundance in terms of data sources, the number of observations, or the amount of data that is being used. As noted, the conceptualisation of geographic data collection before the emergence of VGI was one of scarcity, where data is expensive and complex to collect. In contrast, in many applications of VGI the situation is one of abundance. For example, in applications that are based on micro-volunteering, where the participant invests very little time in a fairly simple task, it is possible to give the same mapping task to several participants and statistically compare their independent outcomes as a way to ensure the quality of the data. Another way in which abundance serves as a framework is in the development of software for data collection. While in previous eras there would typically be a single application used for data capture and editing, in VGI there is a need to consider multiple applications, as different designs and workflows can appeal to, and be suitable for, different groups of participants.

Another underlying principle of VGI is that, since the people who collect the information are not remunerated or in contractual relationships with the organisation that coordinates the data collection, a more complex relationship between the two sides is required, with consideration of incentives, motivations to contribute, and the tools that will be used for data collection. Overall, VGI systems need to be understood as socio-technical systems, in which the social aspect is as important as the technical part.

In addition, VGI is inherently heterogeneous. In large-scale data collection activities, such as the census of population, there is a clear attempt to capture all the information about the population over a relatively short time and in every part of the country. In contrast, because of its distributed nature, VGI will vary across space and time, with some areas and times receiving more attention than others. An interesting example has been shown at temporal scales, where some citizen science activities exhibit ‘weekend bias’, as these are the days when volunteers are free to collect more information.
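A minimal sketch of how such a bias could be detected in a project’s records; the submission dates are invented for illustration:

```python
from collections import Counter
from datetime import date

def weekday_profile(dates):
    """Share of records per weekday - reveals 'weekend bias' in volunteer data."""
    counts = Counter(d.strftime("%A") for d in dates)
    total = sum(counts.values())
    return {day: round(n / total, 2) for day, n in counts.items()}

# Hypothetical submission dates; a real project would read these from its records.
records = [date(2018, 2, d) for d in (3, 3, 4, 4, 4, 5, 10, 10, 11, 11)]
print(weekday_profile(records))
# -> {'Saturday': 0.4, 'Sunday': 0.5, 'Monday': 0.1}
```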

Because of the difference in the organisational settings of VGI, different approaches to quality assurance are required, although, as noted, such approaches have in general been used in many citizen science projects. Over the years, several approaches have emerged, and these include ‘crowdsourcing’, ‘social’, ‘geographic’, ‘domain’, ‘instrumental observation’ and ‘process oriented’. We now turn to describe each of these approaches.

The crowdsourcing approach builds on the principle of abundance. Since there is a large number of contributors, quality assurance can emerge from repeated verification by multiple participants. Even in projects where the participants actively collect data in an uncoordinated way, such as the OpenStreetMap project, it has been shown that with enough participants actively collecting data in a given area, the quality of the data can be as good as authoritative sources. The limitation of this approach is where local knowledge or verification on the ground (‘ground truth’) is required. In such situations, the ‘crowdsourcing’ approach will work well in central, highly populated or popular sites, where there are many visitors and therefore the probability that several of them will be involved in data collection rises. Even so, it is possible to encourage participants to record less popular places through a range of suitable incentives.

The social approach also builds on the principle of abundance in terms of the number of participants, but with a more detailed understanding of their knowledge, skills and experience. In this approach, some participants are asked to monitor and verify the information that was collected by less experienced participants. The social method is well established in citizen science programmes such as bird watching, where some participants who are more experienced in identifying bird species help to verify observations by other participants. To deploy the social approach, there is a need for a structured organisation in which some members are recognised as more experienced and are given the appropriate tools to check and approve information.
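A minimal sketch of what this routing might look like in software, assuming the project keeps a record of which participants it recognises as experienced; the names are invented:

```python
# Participants the project has recognised as experienced moderators.
experienced = {"maria", "chen"}

def route(record):
    """Send newcomers' records to a review queue; trusted records pass through."""
    return "accept" if record["observer"] in experienced else "review_queue"

records = [
    {"observer": "maria", "species": "robin"},
    {"observer": "newbie42", "species": "osprey"},  # held for expert verification
]
for r in records:
    print(r["observer"], "->", route(r))
# maria -> accept
# newbie42 -> review_queue
```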

The geographic approach uses known geographical knowledge to evaluate the validity of the information that is received from volunteers. For example, by using existing knowledge about the distribution of streams along a river, it is possible to assess whether the mapping of a new river that was contributed by volunteers is comprehensive or not. A variation of this approach is the use of previously recorded information, even if it is out of date, to verify new information by comparing how much of the information that is already known also appears in a VGI source. Geographic knowledge can potentially be encoded in software algorithms.

The domain approach is an extension of the geographic one: in addition to geographical knowledge, it uses specific knowledge that is relevant to the domain in which the information is collected. For example, in many citizen science projects that involve collecting biological observations, there will be some body of information about species distribution, both spatially and temporally. Therefore, a new observation can be tested against this knowledge, again algorithmically, helping to ensure that new observations are accurate.

The instrumental observation approach removes some of the subjective aspects of data collection by a human who might make an error, and relies instead on the equipment that the person is using. Because of the increase in availability of accurate-enough equipment, such as the various sensors that are integrated into smartphones, many people keep in their pockets mobile computers with the ability to record location, direction, imagery and sound. For example, image files that are captured on smartphones include the GPS coordinates and a time-stamp in the file, which the vast majority of people are unable to manipulate. Thus, the automatic instrumental recording of information provides evidence for the quality and accuracy of the information.

Finally, the ‘process oriented’ approach brings VGI closer to traditional industrial processes. Under this approach, the participants go through some training before collecting information, and the process of data collection or analysis is highly structured to ensure that the resulting information is of suitable quality. This can include the provision of standardised equipment, online training or instruction sheets, and a structured data recording process. For example, volunteers who participate in the US Community Collaborative Rain, Hail & Snow network (CoCoRaHS) receive a standardised rain gauge, instructions on how to install it, and online resources to learn about data collection and reporting.

Importantly, these approaches are not used in isolation, and in any given project a combination of them is likely to be in operation. Thus, an element of training and guidance to users can appear in a downloadable application that is distributed widely, so the method used in such a project will be a combination of the process oriented and crowdsourcing approaches. Another example is the OpenStreetMap project, which in general provides only limited guidance to volunteers in terms of the information that they collect or the locations in which they collect it. Yet a subset of the information that is collected in the OpenStreetMap database, about wheelchair access, is gathered through the highly structured process of the WheelMap application, in which the participant is required to select one of four possible settings that indicate accessibility. Another subset of the information, which is recorded for humanitarian efforts, follows the social model, in which the tasks are divided among volunteers using the Humanitarian OpenStreetMap Team (H.O.T) task manager, and the data that is collected is verified by more experienced participants.

The final, and critical, point for quality assurance of VGI that was noted above is fitness-for-purpose. In some VGI activities the information has a direct and clear application, in which case it is possible to define specifications for the quality assurance elements that were listed above. However, one of the core aspects that was noted above is the heterogeneity of the information that is collected by volunteers. Therefore, before using VGI for a specific application, there is a need to check its fitness for that specific use. While this is true for all geographic information, and even so-called ‘authoritative’ data sources can suffer from hidden biases (e.g. a lack of updates in rural areas), with VGI the variability can change dramatically over short distances – so while the centre of a city will be mapped by many people, a deprived suburb near the centre will not be mapped and updated. There are also limitations that are caused by the instruments in use – for example, the GPS positional accuracy of the smartphones in use. Such aspects should also be taken into account, ensuring that the quality assurance is itself fit-for-purpose.

References and Further Readings

Bonney, Rick. 1996. Citizen science – a lab tradition. Living Bird, Autumn 1996.
Bonney, Rick, Shirk, Jennifer, and Phillips, Tina B. 2013. Citizen science. In Encyclopaedia of Science Education. Berlin: Springer-Verlag.
Goodchild, Michael F. 2007. Citizens as sensors: the world of volunteered geography. GeoJournal, 69(4), 211–221.
Goodchild, Michael F., and Li, Linna. 2012. Assuring the quality of volunteered geographic information. Spatial Statistics, 1, 110–120.
Haklay, Mordechai. 2010. How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environment and Planning B: Planning and Design, 37(4), 682–703.
Sui, Daniel, Elwood, Sarah, and Goodchild, Michael F. (eds). 2013. Crowdsourcing Geographic Knowledge. Berlin: Springer-Verlag.
Van Oort, Pepijn A.J. 2006. Spatial data quality: from description to application. PhD thesis, Wageningen: Wageningen Universiteit, p. 125.