1 June, 2014
‘More or Less’ is a good programme on BBC Radio 4, regularly exploring the numbers and the evidence behind news stories and other important things, and checking if they stand up. However, the piece that was broadcast this week about Golf courses and housing in the UK provides a nice demonstration of when not to use crowdsourced information. The issue that was discussed was how much actual space golf courses occupy, when compared to space that is used for housing. All was well, until they announced in the piece the use of clever software (read GIS) with a statistical superhero to do the analysis. Interestingly, the data that was used for the analysis was OpenStreetMap – and because the news item was about Surrey, they started doing the analysis with it.
For the analysis to be correct, you need to assume that all the building polygons in OpenStreetMap and all the Golf courses have been identified and mapped. My own guess is that in Surrey, this could be the case – especially with all the wonderful work that James Rutter catalysed. However, assuming that this is the case for the rest of the country is, well, a bit fanciful. I wouldn’t dare to state that OpenStreetMap is complete to such a level, without lots of quality testing which I haven’t seen. There is only the road length analysis of ITO World! and other bits of analysis, but we don’t know how complete OSM is.
While I like OpenStreetMap very much, it is utterly unsuitable for any sort of statistical analysis that works at the building level and then sums up to the country level – because of the heterogeneity of the data. For that sort of thing, you have to use a consistent dataset, or at least one that attempts to be consistent, and that data comes from the Ordnance Survey.
As with other statistical affairs, the core case that is made about the assertion as a whole in the rest of the clip is relevant here. First, we should question the unit of analysis (is it right to compare the footprint of a house to the area of Golf courses? Probably not) and what is to be gained by adding up individual buildings’ footprints to the level of the UK while ignoring roads, gardens, and all the rest of the built environment. Just because it is possible to add up every building’s footprint doesn’t mean that you should. Second, this analysis is the sort of example of the ‘Big Data’ fallacy, which goes: analyse first, then question (if at all) what the relationship between the data and reality is.
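To make the completeness concern concrete, here is a minimal sketch of how uneven mapping coverage can distort a comparison of areas. The function name, the areas and the completeness fractions are all hypothetical, chosen only for illustration; they are not figures from the programme or from OpenStreetMap.

```python
def observed_area_ratio(true_golf_area, true_building_area,
                        golf_completeness, building_completeness):
    """Ratio of mapped golf-course area to mapped building area.

    All inputs are hypothetical: the 'true' areas are what exists on the
    ground, and each completeness value is the fraction (0 to 1) of that
    area that has actually been mapped in the dataset.
    """
    mapped_golf = true_golf_area * golf_completeness
    mapped_buildings = true_building_area * building_completeness
    return mapped_golf / mapped_buildings


# Hypothetical case: the two land uses really cover the same area,
# but only half of the buildings have been mapped.
ratio = observed_area_ratio(
    true_golf_area=100.0,
    true_building_area=100.0,
    golf_completeness=1.0,
    building_completeness=0.5,
)
print(ratio)  # the mapped data suggests golf takes twice the area of buildings
```

In this made-up case the true ratio is 1, yet the mapped data reports 2, purely because the two feature types were mapped to different levels of completeness. Without knowing the completeness values, the size of the distortion cannot be estimated.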
29 March, 2014
Thursday marked the launch of The Conservation Volunteers (TCV) report on volunteering impact, in which they summarised a three-year project that explored motivations, changes in pro-environmental behaviour, wellbeing and community resilience. The report is worth a read as it goes beyond the direct impact on the local environment of TCV activities, and demonstrates how involvement in environmental volunteering can have multiple benefits. In a way, it is adding ingredients to a more holistic understanding of ‘green volunteering’.
One of the interesting aspects of the report is in the longitudinal analysis of volunteers’ motivation (copied here from the report). The comparison is from 784 baseline surveys, 202 second surveys and 73 third surveys, which were done with volunteers while they were involved with the TCV. The second survey was taken after 4 volunteering sessions, and the third after 10 sessions.
The results of the surveys are interesting in the context of online activities (e.g. citizen science or VGI) because they provide an example of an activity that happens offline – in green spaces such as local parks, community gardens and the like. Moreover, the people who participate in them come from all walks of life, as previous analysis of TCV data demonstrated that they are recruiting volunteers across the socio-economic spectrum. So here is an activity that can be compared to online volunteering. This is valuable: if the patterns in the TCV information are similar, then we can understand online volunteering as part of general volunteering and not assume that technology changes everything.
So the graph above attracted my attention because of the similarities to Nama Budhathoki’s work on the motivation of OpenStreetMap volunteers. First, there is a difference between the reasons that influence the people who join just one session and those who are involved for a longer time. Secondly, social and personal development aspects become more important over time.
There is a clear need to continue and explore the data – especially because the numbers that are being surveyed at each period are different – but this is an interesting finding, and there is surely more to explore. Some of it will be explored by Valentine Seymour in ExCiteS, who is working with TCV as part of her PhD.
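As a rough illustration of how different the survey numbers are at each period, the following sketch computes each wave as a share of the baseline, using the three counts quoted above. The variable and function names are my own, and the calculation assumes that the later waves are subsets of the baseline respondents, which the report excerpt does not state explicitly.

```python
# Survey counts as quoted from the TCV report.
survey_counts = {"baseline": 784, "second": 202, "third": 73}


def share_of_baseline(counts, reference="baseline"):
    """Return each wave's count as a fraction of the reference wave."""
    base = counts[reference]
    return {wave: n / base for wave, n in counts.items()}


for wave, share in share_of_baseline(survey_counts).items():
    print(f"{wave}: {share:.1%}")
```

Roughly a quarter of the baseline number answered the second survey and under a tenth answered the third, so any comparison of motivations across the waves is also a comparison between a broad group and a much smaller group of persistent volunteers.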
It is also worth listening to the qualitative observations by volunteers, as expressed in the video that opened the event, which is provided below.
Following the two previous assertions, namely that:
‘you can be supported by a huge crowd for a very short time, or by few for a long time, but you can’t have a huge crowd all of the time (unless data collection is passive)’ (original post here)
‘All information sources are heterogeneous, but some are more honest about it than others’ (original post here)
The third assertion is about patterns of participation. It is one that I’ve mentioned before, and in some ways it is a corollary of the two assertions above.
‘When looking at crowdsourced information, always keep participation inequality in mind’
Because crowdsourced information, either Volunteered Geographic Information or Citizen Science, is created through a socio-technical process, all too often it is easy to forget the social side – especially when you are looking at the information without the metadata of who collected it and when. So when working with OpenStreetMap data, or viewing the distribution of bird species in eBird (below), even though the data source is expected to be heterogeneous, each observation is treated as similar to other observations and assumed to be produced in a similar way.
Yet, data is not only heterogeneous in terms of consistency and coverage, it is also highly heterogeneous in terms of contribution. One of the most persistent findings from studies of various systems – for example in Wikipedia, OpenStreetMap and even in volunteer computing – is that there is a very distinctive heterogeneity in contribution. The phenomenon was termed ‘Participation Inequality’ by Jakob Nielsen in 2006 and it is summarised succinctly in the diagram below (from Visual Liberation blog) – a very small number of contributors add most of the content, while most of the people who are involved in using the information will not contribute at all. Even when examining only those who actually contribute, in some projects over 70% contribute only once, with a tiny minority contributing most of the information.
Therefore, when looking at sources of information that were created through such a process, it is critical to remember the nature of contribution. This has far-reaching implications for quality, as it is dependent on the expertise of the heavy contributors, on their spatial and temporal engagement, and even on their social interaction and practices (e.g. abrasive behaviour towards other participants).
Because of these factors, it is critical to remember the impact and implications of participation inequality on the analysis of the information. There will be some analyses on which it will have less impact and some where it will have a major one. In either case, it needs to be taken into account.
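One way to keep participation inequality in mind is to measure it before analysing a dataset. The sketch below is a minimal illustration, assuming that the data comes with a contributor identifier for each record; the function name, the 10% threshold and the example log are all hypothetical and do not come from any real project.

```python
from collections import Counter


def participation_summary(contributor_ids, top_fraction=0.1):
    """Summarise how unevenly contributions are spread.

    contributor_ids: one entry per contribution, naming who made it.
    top_fraction: share of contributors treated as 'heavy' contributors.
    """
    per_contributor = sorted(Counter(contributor_ids).values(), reverse=True)
    n_contributors = len(per_contributor)
    n_top = max(1, round(n_contributors * top_fraction))
    return {
        "contributors": n_contributors,
        # Fraction of all contributions made by the heaviest contributors.
        "top_share": sum(per_contributor[:n_top]) / sum(per_contributor),
        # Fraction of contributors who contributed exactly once.
        "one_time_share": sum(1 for c in per_contributor if c == 1) / n_contributors,
    }


# Hypothetical log: one heavy contributor, one occasional, eight one-off.
log = ["a"] * 90 + ["b"] * 3 + list("cdefghij")
summary = participation_summary(log)
print(summary)
```

In this made-up log, a single contributor out of ten accounts for about 89% of the records and 80% of contributors appear only once, so any property of the dataset largely reflects the habits and expertise of one person.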
Following the last post, which focused on an assertion about crowdsourced geographic information and citizen science, I continue with another observation. As was noted in the previous post, these can be treated as ‘laws’ as they seem to emerge as common patterns from multiple projects in different areas of activity – from citizen science to crowdsourced geographic information. The first assertion was about the relationship between the number of volunteers who can participate in an activity and the amount of time and effort that they are expected to contribute.
This time, I look at one aspect of data quality, which is about consistency and coverage. Here the following assertion applies:
‘All information sources are heterogeneous, but some are more honest about it than others’
What I mean by that is the on-going argument about authoritative and crowdsourced information sources (Flanagin and Metzger 2008 frequently come up in this context), which was also at the root of the Wikipedia vs. Britannica debate, and the mistrust in citizen science observations and the constant questioning if they can do ‘real research’.
There are many aspects to these concerns, so the assertion deals with the aspects of comprehensiveness and consistency, which are used as a reason to dismiss crowdsourced information when comparing it to authoritative data. However, at a closer look we can see that all these information sources are fundamentally heterogeneous. Despite all the effort to define precise standards for data collection in authoritative data, heterogeneity creeps in because of budget and time limitations, decisions about what is worthy to collect and how, and the clash between reality and the specifications. Here are two examples:
Take one of the Ordnance Survey Open Data sources – the maps present themselves as consistent and covering the whole country in an orderly way. However, dig into the details of the mapping, and you discover that the Ordnance Survey uses different standards for mapping urban, rural and remote areas. Yet, the derived products that are generalised and manipulated in various ways, such as Meridian or Vector Map District, do not provide a clear indication of which parts originated from which scale – so the heterogeneity of the source disappears in the final product.
The census is also heterogeneous, and it is a good case of specifications vs. reality. Not everyone fills in the forms, and even with the best effort of enumerators it is impossible to collect all the data; therefore, statistical analysis and manipulation of the results are required to produce a well-reasoned assessment of the population. This is expected, even though it is not always understood.
Therefore, even the best information sources that we accept as authoritative are heterogeneous, but as I’ve stated, they are just not completely honest about it. The ONS doesn’t release the full original set of data before all the manipulations, nor completely disclose all the assumptions that went into reaching the final value. The Ordnance Survey doesn’t tag every line with metadata about the date of collection and scale.
Somewhat counter-intuitively, exactly because crowdsourced information is expected to be inconsistent, we approach it as such and ask questions about its fitness for use. So in that way it is more honest about the inherent heterogeneity.
Importantly, the assertion should not be taken to be dismissive of authoritative sources, or as ignoring that the heterogeneity within crowdsourced information sources is likely to be much higher than in authoritative ones. Of course all the investment in making things consistent and the effort to get universal coverage is indeed worth it, and it would be foolish and counterproductive to consider that such sources of information can be replaced, as is suggested for the census, or that it’s not worth investing in the Ordnance Survey to update the authoritative data sets.
Moreover, when commercial interests meet crowdsourced geographic information or citizen science, the ‘honesty’ disappears. For example, even though we know that Google Map Maker is now used in many parts of the world (see the figure), even in cases when access to vector data is provided by Google, you cannot find out who contributed, when and where. It is also presented as an authoritative source of information.
Despite the risk of misinterpretation, the assertion can be useful as a reminder that the differences between authoritative and crowdsourced information are not as big as they may seem.
Looking across the range of crowdsourced geographic information activities, some regular patterns are emerging and it might be useful to start noticing them as a way to think about what is possible or not possible to do in this area. Since I don’t like the concept of ‘laws’ – as in Tobler’s first law of geography which is stated as ‘Everything is related to everything else, but near things are more related than distant things.’ – I would call them assertions. There is also something nice about using the word ‘assertion’ in the context of crowdsourced geographic information, as it echoes Mike Goodchild’s differentiation between asserted and authoritative information. So not laws, just assertions or even observations.
The first one is a rephrasing of a famous quote:
‘you can be supported by a huge crowd for a very short time, or by few for a long time, but you can’t have a huge crowd all of the time (unless data collection is passive)’
So the Christmas Bird Count can have tens of thousands of participants for a short time, while the number of people who operate weather observation stations will be much smaller. The same thing is true for OpenStreetMap – for crisis mapping, which is a short-term task, you can get many contributors, but for the regular updating of an area under usual conditions, there will be only a few.
The exception to the assertion is the case of passive data collection, where information is collected automatically through the logging of information from a sensor – for example the recording of GPS tracks to improve navigation information.
10 December, 2013
There is something in the physical presence of a book that is pleasurable. Receiving the copy of Introducing Human Geographies was special, as I have contributed a chapter about Geographic Information Systems to the ‘cartographies’ section.
It might be a response to Ron Johnston’s critique of Human Geography textbooks or a decision by the editors to extend the content of the book, but the book now contains three chapters that deal with maps and GIS. The contributions are the ‘Power of maps’ by Jeremy Crampton, a chapter about ‘Geographical information systems’ by me, and ‘Counter geographies’ by Wen Lin. To some extent, we’ve coordinated the writing, as this is a textbook for undergraduates in geography and we wanted to have a coherent message.
In my chapter I have covered both the quantitative/spatial science face of GIS, as well as the critical/participatory one. As the introduction to the section describes:
“Chapter 14 focuses on the place of Geographical Information Systems (GIS) within contemporary mapping. A GIS involves the representation of geographies in digital computers. … GIS is now a widespread and varied form of mapping, both within the academy and beyond. In the chapter, he speaks to that variety by considering the use of GIS both within practices such as location planning, where it is underpinned by the intellectual paradigm of spatial science and quantitative data, and within emergent fields of ‘critical’ and ‘qualitative GIS’, where GIS could be focused on representing the experiences of marginalized groups of people, for example. Generally, Muki argues against the equation of GIS with only one sort of Human Geography, showing how it can be used as a technology within various kinds of research. More specifically, his account shows how current work is pursuing those options through careful consideration of both the wider issues of power and representation present in mapping and the detailed, technical and scientific challenges within GIS development.”
To preview the chapter on Google Books, use this link. I hope that it will be a useful introduction to GIS for Geography students.
18 March, 2013
The Consumers’ Association Which? magazine is probably not the first place to turn to when you look for usability studies. Especially not if you’re interested in computer technology – for that, there are sources such as PC Magazine on the consumer side, and professional magazines such as Interactions from the Association for Computing Machinery (ACM) Special Interest Group on Computer-Human Interaction (SIGCHI).
Over the past few years, Which? has been reviewing, testing and recommending Satnavs (also known as Personal Navigation Devices – PNDs). Which? is an interesting case because it reaches over 600,000 households and because of the level of trust that it enjoys. If you look at their methodology for testing satnavs, you’ll find that it does resemble usability testing – click on the image to see the video from Which? about their methodology. The methodology is more about everyday use and the opinion of the assessors seems to play an important role.
Professionals in geographical information science or human-computer interaction might dismiss the study as unrepresentative, or not fitting their ways of evaluating technologies, but we need to remember that Which? is providing an insight into the experience of the people who are outside our usual professional and social context – people who go to a high street shop or download an app and start using it straightaway. Therefore, it’s worth understanding how they review the different systems and what the experience is like when you try to think like a consumer, with limited technical knowledge and understanding of maps.
There are also aspects that puncture the ‘filter bubble’ of geoweb people – Google Maps are now probably the most used maps on the web, but the satnav application using Google Maps was described as ‘bad, useful for getting around on foot, but traffic information and audio instructions are limited and there’s no speed limit or speed camera data’. Waze, the crowdsourced application, received especially low marks and the magazine noted that it ‘lets users share traffic and road info, but we found its routes and maps are inaccurate and audio is poor’ (both citations from Which? Nov 2012, p. 38). It is also worth reading their description of OpenStreetMap when discussing map updates, and also the opinions on the willingness to pay for map updates.
There are many ways to receive information about the usability and the nature of interaction with geographical technologies, and some of them, while not traditional, can provide useful insights.
20 July, 2011
As part of the Volunteered Geographic Information (VGI) workshop that was held in Seattle in April 2011, Daniel Sui, Sarah Elwood and Mike Goodchild announced that they would be editing a volume dedicated to the topic, published as ‘Crowdsourcing Geographic Knowledge’ (here is a link to the chapter in Crowdsourcing Geographic Knowledge).
My contribution to this volume focuses on citizen science, and shows the links between it and VGI. The chapter is currently under review, but the following excerpt discusses different types of citizen science activities, and I would welcome comments:
“While the aim here is not to provide a precise definition of citizen science, a definition and clarification of what the core characteristics of citizen science are is unavoidable. Therefore, it is defined as scientific activities in which non-professional scientists volunteer to participate in data collection, analysis and dissemination of a scientific project (Cohn 2008; Silvertown 2009). People who participate in a scientific study without playing some part in the study itself – for example, volunteering in a medical trial or participating in a social science survey – are not included in this definition.
While it is easy to identify a citizen science project when the aim of the project is the collection of scientific information, as in the recording of the distribution of plant species, there are cases where the definition is less clear-cut. For example, the process of data collection in OpenStreetMap or Google Map Maker is mostly focused on recording verifiable facts about the world that can be observed on the ground. The tools that OpenStreetMap mappers use – such as remotely sensed images, GPS receivers and map editing software – can all be considered scientific tools. With their attempt to locate observed objects and record them on a map accurately, they follow the footsteps of surveyors such as Robert Hooke, who also carried out an extensive survey of London using scientific methods – although, unlike OpenStreetMap volunteers, he was paid for his effort. Finally, cases where facts are collected in a participatory mapping activity, such as the one that Ghose (2001) describes, should probably be considered a citizen science only if the participants decided to frame it as such. For the purpose of the discussion here, such a broad definition is more useful than a limiting one that tries to reject certain activities.
Notice also that, by definition, citizen science can only exist in a world in which science is socially constructed as the preserve of professional scientists in academic institutions and industry, because, otherwise, any person who is involved in a scientific project would simply be considered a contributor and potentially a scientist. As Silvertown (2009) noted, until the late 19th century, science was mainly developed by people who had additional sources of employment that allowed them to spend time on data collection and analysis. Famously, Charles Darwin joined the Beagle voyage, not as a professional naturalist but as a companion to Captain FitzRoy. Thus, in that era, almost all science was citizen science albeit mostly by affluent gentlemen scientists and gentlewomen. While the first professional scientist is likely to be Robert Hooke, who was paid to work on scientific studies in the 17th century, the major growth in the professionalisation of scientists was mostly in the latter part of the 19th and throughout the 20th centuries.
Even with the rise of the professional scientist, the role of volunteers has not disappeared, especially in areas such as archaeology, where it is common for enthusiasts to join excavations, or in natural science and ecology, where they collect and send samples and observations to national repositories. These activities include the Christmas Bird Count that has been ongoing since 1900 and the British Trust for Ornithology Survey, which has collected over 31 million records since its establishment in 1932 (Silvertown 2009). Astronomy is another area where amateurs and volunteers have been on par with professionals when observation of the night sky and the identification of galaxies, comets and asteroids are considered (BBC 2006). Finally, meteorological observations have also relied on volunteers since the early start of systematic measurements of temperature, precipitation or extreme weather events (WMO 2001).
This type of citizen science provides the first type of ‘classic’ citizen science – the ‘persistence’ parts of science where the resources, geographical spread and the nature of the problem mean that volunteers sometimes predate the professionalisation and mechanisation of science. These research areas usually require a large but sparse network of observers who carry out their work as part of a hobby or leisure activity. This type of citizen science has flourished in specific enclaves of scientific practice, and the progressive development of modern communication tools has made the process of collating the results from the participants easier and cheaper, while inherently keeping many of the characteristics of data collection processes close to their origins.
A second set of citizen science activities is environmental management and, even more specifically, within the context of environmental justice campaigns. Modern environmental management includes strong technocratic and science oriented management practices (Bryant & Wilson 1998; Scott & Barnett 2009) and environmental decision making is heavily based on scientific environmental information. As a result, when an environmental conflict emerges – such as a community protest over a local noisy factory or planned expansion of an airport – the valid evidence needs to be based on scientific data collection. This aspect of environmental justice struggle is encouraging communities to carry out ‘community science’ in which scientific measurements and analysis are carried out by members of local communities so they can develop an evidence base and set out action plans to deal with problems in their area. A successful example of such an approach is the ‘Global Community Monitor’ method to allow communities to deal with air pollution issues (Scott & Barnett 2009). This is performed through a simple method of sampling air using plastic buckets followed by analysis in an air pollution laboratory, and, finally, the community being provided with instructions on how to understand the results. This activity is termed ‘Bucket Brigade’ and was used across the world in environmental justice campaigns. In London, community science was used to collect noise readings in two communities that are impacted by airport and industrial activities. The outputs were effective in bringing environmental problems to the policy arena (Haklay, Francis & Whitaker 2008). As in ‘classic’ citizen science, the growth in electronic communication has enabled communities to identify potential methods – e.g. through the ‘Global Community Monitor’ website – as well as find international standards, regulations and scientific papers that can be used together with the local evidence.
However, the emergence of the Internet and the Web as a global infrastructure has enabled a new incarnation of citizen science: the realisation of scientists that the public can provide free labour, skills, computing power and even funding, and the growing demands from research funders for public engagement, have all contributed to the motivation of scientists to develop and launch new and innovative projects (Silvertown 2009; Cohn 2008). These projects utilise the abilities of personal computers, GPS receivers and mobile phones to double as scientific instruments.
This third type of citizen science has been termed ‘citizen cyberscience’ by Francois Grey (2009). Within it, it is possible to identify three sub-categories: volunteered computing, volunteered thinking and participatory sensing.
Volunteered computing was first developed in 1999, with the foundation of SETI@home (Anderson et al. 2002), which was designed to distribute the analysis of data that was collected from a radio telescope in the search for extra-terrestrial intelligence. The project utilises the unused processing capacity that exists in personal computers, and uses the Internet to send and receive ‘work packages’ that are analysed automatically and sent back to the main server. Over 3.83 million downloads were registered on the project’s website by July 2002. The system on which SETI@home is based, the Berkeley Open Infrastructure for Network Computing (BOINC), is now used for over 100 projects, covering Physics, processing data from the Large Hadron Collider through LHC@home; Climate Science with the running of climate models in Climateprediction.net; and Biology in which the shape of proteins is calculated in Rosetta@home.
While volunteered computing requires very little from the participants, apart from installing software on their computers, in volunteered thinking the volunteers are engaged at a more active and cognitive level (Grey 2009). In these projects, the participants are asked to use a website in which information or an image is presented to them. When they register onto the system, they are trained in the task of classifying the information. After the training, they are exposed to information that has not been analysed, and are asked to carry out classification work. Stardust@home (Westphal et al. 2006) in which volunteers were asked to use a virtual microscope to try to identify traces of interstellar dust was one of the first projects in this area, together with the NASA ClickWorkers that focused on the classification of craters on Mars. Galaxy Zoo (Lintott et al. 2008), a project in which volunteers classify galaxies, is now one of the most developed ones, with over 100,000 participants and with a range of applications that are included in the wider Zooniverse set of projects (see http://www.zooniverse.org/).
Participatory sensing is the final and most recent type of citizen science activity. Here, the capabilities of mobile phones are used to sense the environment. Some mobile phones have up to nine sensors integrated into them, including different transceivers (mobile network, WiFi, Bluetooth), FM and GPS receivers, camera, accelerometer, digital compass and microphone. In addition, they can link to external sensors. These capabilities are increasingly used in citizen science projects, such as Mappiness in which participants are asked to provide behavioural information (feeling of happiness) while the phone records their location to allow the linkage of different locations to wellbeing (MacKerron 2011). Other activities include the sensing of air-quality (Cuff 2007) or noise levels (Maisonneuve et al. 2010) by using the mobile phone’s location and the readings from the microphone.”
At the State of the Map (EU) 2011 conference that was held in Vienna from 15-17 July, I gave a keynote talk on the relationships between the OpenStreetMap (OSM) community and the GIScience research community. Of course, the relationships are especially important for those researchers who are working on Volunteered Geographic Information (VGI), due to the major role of OSM in this area of research.
The talk included an overview of what researchers have discovered about OpenStreetMap over the 5 years since we started to pay attention to OSM. One striking result is that the issue of positional accuracy does not require much more work by researchers. Another important outcome of the research is to understand that quality is impacted by the number of mappers, or that the data can be used with confidence for mainstream geographical applications when some conditions are met. These results are both useful, and of interest to a wide range of groups, but there remain key areas that require further research – for example, specific facets of quality, community characteristics and how the OSM data is used.
Reflecting on the body of research, we can start to form a ‘code of engagement’ for both academics and mappers who are engaged in researching or using OpenStreetMap. One such guideline would be that it is both prudent and productive for any researcher to do some mapping herself, and to understand the process of creating OSM data, if the research is to be relevant and accurate. Other aspects of the proposed ‘code’ are covered in the presentation.