How Many Volunteers Does It Take To Map An Area Well? The validity of Linus’ law to Volunteered Geographic Information
10 January, 2011
The paper “How Many Volunteers Does It Take To Map An Area Well? The validity of Linus’ law to Volunteered Geographic Information” has appeared in The Cartographic Journal. The proper citation for the paper is:
Haklay, M., Basiouka, S., Antoniou, V. and Ather, A. (2010) How Many Volunteers Does It Take To Map An Area Well? The validity of Linus’ law to Volunteered Geographic Information. The Cartographic Journal, 47(4), 315–322.
The abstract of the paper is as follows:
In the area of volunteered geographical information (VGI), the issue of spatial data quality is a clear challenge. The data that are contributed to VGI projects do not comply with standard spatial data quality assurance procedures, and the contributors operate without central coordination and strict data collection frameworks. However, similar to the area of open source software development, it is suggested that the data hold an intrinsic quality assurance measure through the analysis of the number of contributors who have worked on a given spatial unit. The assumption that as the number of contributors increases so does the quality is known as ‘Linus’ Law’ within the open source community. This paper describes three studies that were carried out to evaluate this hypothesis for VGI using the OpenStreetMap dataset, showing that this rule indeed applies in the case of positional accuracy.
To access the paper on the journal’s website, you can follow the link: 10.1179/000870410X12911304958827. However, if you don’t hold a subscription to the journal, a postprint of the paper is available at the UCL Discovery repository. If you would like to get hold of the printed version, email me.
7 January, 2011
EveryAware is a three-year research project, funded under the European Union Seventh Framework Programme (FP7).
The project’s focus is on the development of Citizen Science techniques to allow people to find out about their local environmental conditions, and then to see if the provision of this information leads to behaviour change.
The abstract of the project highlights the core topics that will be covered:
‘The enforcement of novel policies may be triggered by a grassroots approach, with a key contribution from information and communication technology (ICT). Current low-cost sensing technologies allow the citizens to directly assess the state of the environment; social networking tools allow effective data and opinion collection and real-time information-spreading processes. Moreover theoretical and modelling tools developed by physicists, computer scientists and sociologists allow citizens to analyse, interpret and visualise complex data sets.
‘The proposed project intends to integrate all crucial phases (environmental monitoring, awareness enhancement, behavioural change) in the management of the environment in a unified framework, by creating a new technological platform combining sensing technologies, networking applications and data-processing tools; the Internet and the existing mobile communication networks will provide the infrastructure hosting this platform, allowing its replication in different times and places. Case studies concerning different numbers of participants will test the scalability of the platform, aiming to involve as many citizens as possible thanks to low cost and high usability. The integration of participatory sensing with the monitoring of subjective opinions is novel and crucial, as it exposes the mechanisms by which the local perception of an environmental issue, corroborated by quantitative data, evolves into socially-shared opinions, and how the latter, eventually, drives behavioural changes. Enabling this level of transparency critically allows an effective communication of desirable environmental strategies to the general public and to institutional agencies.’
The project will be coordinated by Fondazione ISI (Institute for Scientific Interchange) and the Physics department at Sapienza Università di Roma. Other participants include the L3S Research Center at the Gottfried Wilhelm Leibniz Universität, Hannover, and finally the Environmental Risk and Health unit at the Flemish Institute of Technological Research (VITO).
At UCL, I will run the project together with Dr Claire Ellul. We will focus on Citizen Science, the use of mobile phones for data collection, and understanding behaviour change. We are looking for a PhD student to work on this project so, if this type of activity is of interest, get in touch.
29 November, 2010
The website GPS Business News published an interview with me in which I covered several aspects of OpenStreetMap and crowdsourced geographical information, including spatial data quality, patterns of data collection, inequality in coverage and the implications of these patterns for the wider area of Volunteered Geographical Information.
The interview is available here.
21 October, 2010
One issue that remained open in the studies on the relevance of Linus’ Law for OpenStreetMap was that the previous studies looked only at areas with more than 5 contributors, so the link between the number of users and the quality was not conclusive – although the quality was above 70% at this number of contributors and beyond.
Now, as part of writing up the GISRUK 2010 paper for journal publication, we had an opportunity to fill this gap, to some extent. Vyron Antoniou has developed a method to evaluate the positional accuracy on a larger scale than we have done so far. The methodology uses the geometric position of the Ordnance Survey (OS) Meridian 2 road intersections to evaluate positional accuracy. Although Meridian 2 is created by applying a 20-metre generalisation filter to the centrelines of the OS Roads Database, this generalisation process does not affect the positional accuracy of node points and thus their accuracy is the best available. An algorithm was developed for the identification of the correct nodes between the Meridian 2 and OSM, and the average positional error was calculated for each square kilometre in England. With this data, which provides an estimated positional accuracy for an area of over 43,000 square kilometres, it was possible to estimate the contribution that additional users make to the quality of the data.
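The matching and aggregation step can be sketched as follows. This is an illustrative reconstruction, not the actual algorithm: the function name, the 20-metre matching threshold and the brute-force nearest-neighbour search are my assumptions, and the real implementation processed the full national datasets with dedicated spatial tools.

```python
import math
from collections import defaultdict

def match_and_grid_error(meridian_nodes, osm_nodes, max_dist=20.0, cell=1000.0):
    """For each Meridian 2 intersection node, find the nearest OSM node
    within max_dist metres and record the positional error, aggregated
    per 1 km grid cell. Coordinates are assumed to be in a projected
    metric system (e.g. British National Grid eastings/northings)."""
    errors = defaultdict(list)
    for mx, my in meridian_nodes:
        best = None
        for ox, oy in osm_nodes:
            d = math.hypot(mx - ox, my - oy)
            if d <= max_dist and (best is None or d < best):
                best = d
        if best is not None:
            # Key the error by the 1 km grid cell of the Meridian node
            key = (int(mx // cell), int(my // cell))
            errors[key].append(best)
    # Average positional error per square kilometre
    return {k: sum(v) / len(v) for k, v in errors.items()}
```

With the per-cell averages in hand, each cell’s error can then be related to the number of contributors who worked on that cell.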
As can be seen in the chart below, positional accuracy remains fairly level when the number of users is 13 or more – as we have seen in previous studies. On the other hand, up to 13 users, each additional contributor considerably improves the dataset’s quality. In grey you can see the maximum and minimum values, so the area represents the possible range of positional accuracy results. Interestingly, as the number of users increases, positional accuracy seems to settle close to 5m, which is somewhat expected when considering the source of the information – GPS receivers and aerial imagery. However, this is an aspect of the analysis that clearly requires further testing of the algorithm and the datasets.
It is encouraging to see that the results of the analysis are significantly correlated. For the full dataset the correlation is weak (-0.143) but significant at the 0.01 level (2-tailed). For the average values for each number of contributors (the blue line in the graph), however, the correlation is strong (-0.844) and significant at the 0.01 level (2-tailed).
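As a rough illustration of how the two figures differ, the correlation can be computed both over the raw per-cell pairs and over the mean error for each contributor count. The helper names below are mine, and the significance tests reported above are not reproduced here – this is a minimal sketch of the correlation step only.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_error_by_contributors(pairs):
    """Average the positional error over all cells that share the same
    number of contributors (the blue line in the chart)."""
    groups = defaultdict(list)
    for n_users, err in pairs:
        groups[n_users].append(err)
    counts = sorted(groups)
    return counts, [sum(groups[n]) / len(groups[n]) for n in counts]
```

Correlating the raw (contributors, error) pairs reflects the large within-count variation and gives the weak overall figure, while correlating the per-count means smooths that variation out and gives the strong one.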
An important caveat is that the number of tiles with more than 10 contributors is fairly small, so that is another aspect that requires further exploration. Moreover, spatial data quality is not just positional accuracy, but also attribute accuracy, completeness, currency and other properties. We can expect that these will exhibit similar behaviour to positional accuracy, but this requires further studies – as always.
However, as this is a large-scale analysis that adds to the evidence from the small-scale studies, it is becoming highly likely that Linus’ Law affects the quality of OSM data, and possibly of other so-called Volunteered Geographical Information (VGI) sources, and that there is a diminishing gain in terms of positional accuracy once the number of contributors passes about 10 or so.
Completeness in volunteered geographical information – the evolution of OpenStreetMap coverage (2008-2009)
13 August, 2010
The Journal of Spatial Information Science (JOSIS) is a new open access journal in GIScience, edited by Matt Duckham, Jörg-Rüdiger Sack, and Michael Worboys. In addition, the journal has adopted an open peer review process, so readers are invited to comment on a paper while it goes through the formal peer review process. It therefore seemed the most natural outlet for a new paper that analyses the completeness of OpenStreetMap over 18 months – March 2008 to October 2009. The paper was written in collaboration with Claire Ellul. The abstract of the paper is provided below, and you are very welcome to comment on the paper on the JOSIS forum dedicated to it, where you can also download it.
Abstract: The ability of lay people to collect and share geographical information has increased markedly over the past 5 years as a result of the maturation of web and location technologies. This ability has led to a rapid growth in Volunteered Geographical Information (VGI) applications. One of the leading examples of this phenomenon is the OpenStreetMap project, which started in the summer of 2004 in London, England. This paper reports on the development of the project over the period March 2008 to October 2009 by focusing on the completeness of coverage in England. The methodology used to evaluate completeness is a comparison of the OpenStreetMap dataset to the Ordnance Survey dataset Meridian 2. The analysis evaluates the coverage in terms of physical coverage (how much area is covered), followed by an estimation of the percentage of England’s population which is covered by completed OpenStreetMap data, and finally by using the Index of Deprivation 2007 to gauge socio-economic aspects of OpenStreetMap activity. The analysis shows that within 5 years of project initiation, OpenStreetMap already covers 65% of the area of England, although when details such as street names are taken into consideration, the coverage is closer to 25%. Significantly, this 25% of England’s area covers 45% of its population. There is also a clear bias in data collection practices – more affluent areas and urban locations are better covered than deprived or rural locations. The implications of these outcomes for studies of volunteered geographical information are discussed towards the end of the paper.
10 July, 2010
The slides below are from my presentation at State of the Map 2010 in Girona, Spain. While the conference is about OpenStreetMap, the presentation covers a range of spatially implicit and explicit crowdsourcing projects, as well as activities that we carried out in Mapping for Change, all of which show that, unlike other crowdsourcing activities, geography (and place) both limits and motivates contribution.
In many ways, OpenStreetMap is similar to other open source and open knowledge projects, such as Wikipedia. These similarities include the patterns of contribution and the importance of participation inequality, in which a small group of participants contributes very significantly, while a very large group contributes only occasionally; the general demographic of participants, with strong representation from educated young males; and the temporal patterns of engagement, in which some participants go through a peak of activity and lose interest, while a small group joins and continues to invest its time and effort to help the progress of the project. These aspects have been identified by researchers who explored volunteering and leisure activities and crowdsourcing, as well as those who explored commons-based peer production networks (Benkler & Nissenbaum 2006).
However, OpenStreetMap is a project about geography, and deals with the shape of features and information about places on the face of the Earth. Thus, the emerging question is ‘what influence does geography have on OSM?’ Does geography make some fundamental changes to the basic principles of crowdsourcing, or should OSM be treated as ‘wikipedia for maps’?
In the presentation, which is based on my work as well as the work of Vyron Antoniou and Nama Budhathoki, we argue that geography plays a ‘tyrannical’ role in OSM and other projects based on crowdsourced geographical information, shaping the nature of the project beyond what is usually accepted.
The first influence of geography is on motivation. A survey of OSM participants shows that specific geographical knowledge, which a participant acquired at first hand, and the wish to use this knowledge and see it mapped well, are important factors in participation in the project. We found that participants are driven to mapping activities by their desire to represent the places they care about and to fix the errors on the map. Both of these motives require local knowledge.
A second influence is on the accuracy and completeness of coverage, with places that are highly populated, and therefore have a larger pool of potential participants, showing better coverage than suburban areas of well-mapped cities. Furthermore, there is an ongoing discussion within the OSM community about the value of mapping without local knowledge and the impact of such action on the willingness of potential contributors to fix errors and contribute to the map.
A third, and somewhat surprising, influence is the impact of mapping places that the participants haven’t visited or can’t visit, such as Haiti after the earthquake or Baghdad in 2007. Despite the willingness of participants to join in and help in the data collection process, the details that can be captured without being on the ground are fairly limited, even when multiple sources such as Flickr images, Google Street View and paper maps are used. The details are limited to what was captured at a certain point in time and by the limitations of the sensing device, so the mapping is, by necessity, incomplete.
We will demonstrate these and other aspects of what we termed ‘the tyranny of place’ and its impact on what can be covered by OSM without much effort and which locations will not be covered without a concentrated effort that requires some planning.
4 April, 2010
The opening of Ordnance Survey datasets at the beginning of April 2010 is bound to fundamentally change the way OpenStreetMap (OSM) information is produced in the UK. So just before this major change starts to influence OpenStreetMap, it is worth evaluating what has been achieved so far without this data. It is also time to update the completeness study, as the previous ones were conducted with data from March 2008 and March 2009.
Following the same method that was used in all the previous studies (which is described in detail here), the latest version of Meridian 2 from OS OpenData was downloaded and compared to OSM data downloaded from GeoFabrik. The processing is now streamlined with MapBasic scripts, PostGIS scripts and final processing in Manifold GIS, so it is possible to complete the analysis within 2 days. The colour scheme for the map is based on Cynthia Brewer and Mark Harrower’s ColorBrewer 2.
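The per-cell comparison at the heart of the method can be sketched as follows. This is a simplified illustration under my own assumptions: in the sketch a cell counts as ‘covered’ when the total OSM road length reaches the Meridian 2 length, whereas the actual study classified cells into finer completeness bands.

```python
def completeness_by_cell(osm_len, meridian_len):
    """Classify each 1 km grid cell by comparing total OSM road length
    with total Meridian 2 road length (both in metres).

    osm_len / meridian_len: dicts mapping a cell id to the total road
    length falling inside that cell."""
    complete = under = osm_only = 0
    for cell_id in set(osm_len) | set(meridian_len):
        m = meridian_len.get(cell_id, 0.0)
        o = osm_len.get(cell_id, 0.0)
        if m == 0.0:
            if o > 0.0:
                osm_only += 1  # OSM features where Meridian 2 has none
        elif o >= m:
            complete += 1      # OSM matches or exceeds Meridian 2
        else:
            under += 1         # cell still under-mapped in OSM
    total = complete + under
    return {"complete_pct": 100.0 * complete / total if total else 0.0,
            "under_mapped": under,
            "osm_only_cells": osm_only}
```

Counting the cells where OSM has features but Meridian 2 has none corresponds to the ‘empty cells’ discussed below; the same tally, run on each of the four snapshots, gives the progression figures.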
By the end of March 2010, OpenStreetMap coverage of England had grown to 69.8%, from 51.2% a year ago. When attribute information is taken into account, the coverage has grown to 24.3% from 14.7% a year ago. The chart on the left shows how the coverage progressed over the past 2 years, using the 4 data points that were used for analysis – March 2008, March 2009, October 2009 and March 2010. Notice that in terms of capturing the geometry, less than 5% of the area is now significantly under-mapped when compared to Meridian 2. Another interesting aspect is the decline in empty cells – that is, grid cells that don’t have any feature in Meridian 2 but now have features from OSM appearing in them. So in terms of capturing road information for England, it seems that the goal of capturing the whole country with volunteer effort is within reach, even without the release of Ordnance Survey data.
On the other hand, when attributes are included in the analysis, the picture is very different.
The progression of coverage is far from complete, and although the area that is empty of features with a street or road name in Meridian 2 is much larger, the progress of OSM mappers in completing the information is much slower. While the geometry coverage went up by 18.6% over the past year, the coverage with attributes increased by less than 10% (9.6% to be precise). The reason for this is likely to be the need to carry out a ground survey to find the street name without using other copyrighted sources.
The attribute area is the one where I would expect the Ordnance Survey data release to benefit OSM mapping. Products such as StreetView and VectorMap District can be used either to copy the street name manually (StreetView) or to write an algorithm that copies the street name and other attributes from a vector dataset such as Meridian 2 or VectorMap District.
Of course, this is a failure of the ‘crowd’ in the sense that this bit of information previously required an actual visit on the ground, which is a more challenging task than finding people who are happy to volunteer their time to digitise maps.
As in the previous cases, there are local variations, and the geography of the coverage is interesting. The information includes 4 time points, so the most appropriate visualisation is one that allows for comparison and transition between maps. Below is a presentation (you can download it from SlideShare) that provides maps for the whole of England as well as 5 regional maps, roughly covering the South West, London, Birmingham and the Midlands, Manchester and Liverpool, and Newcastle upon Tyne and the North West.
If you want to create your own visualisation, or use the results of this study, you can download the results in shapefile format from here.
For a very nice visualisation of Meridian 2 and OpenStreetMap data, see Ollie O’Brien’s SupraGeography blog.
Usability of VGI in Haiti earthquake response and the 2nd workshop on usability of geographic information
27 March, 2010
On 23rd March 2010, UCL hosted the second workshop on the usability of geographic information, organised by Jenny Harding (Ordnance Survey Research), Sarah Sharples (Nottingham), and myself. This workshop extended the range of topics that we covered in the first one, which we reported on during the AGI conference last year. This time, we had about 20 participants and it was an excellent day, covering a wide range of topics – from a presentation by Martin Maguire (Loughborough) on the visualisation and communication of climate change data, to Johannes Schlüter’s (Münster) discussion of the use of XO computers with schoolchildren, to a talk by Richard Treves (Southampton) on the impact of Google Earth tours on learning. Especially interesting is the combination of sound and other senses in the work of Nick Bearman (UEA) and Paul Kelly (Queens University, Belfast).
Jenny’s introduction highlighted the different aspects of GI usability, from those that are specific to data to issues with application interfaces. The integration of data with the software that creates the user experience in GIS was discussed throughout the day, and it is one of the reasons that the usability of the information itself is important in this field. The Ordnance Survey is currently running a project to explore how it can integrate usability into the design of its products – Michael Brown’s presentation discussed the development of a survey as part of this project. The integration of data and application was also central to Philip Robinson’s (GE Energy) presentation on the use of GI by utility field workers.
My presentation focused on some preliminary thoughts based on the analysis of the OpenStreetMap and Google Map communities’ responses to the earthquake in Haiti at the beginning of 2010. The presentation discussed a set of issues that, if explored, will provide insights that are relevant beyond the specific case and that can illuminate issues relevant to the daily production and use of geographic information – for example, the very basic metadata that was provided on portals such as GeoCommons, and what users can do to evaluate the fitness for use of a specific dataset (see also Barbara Poore’s (USGS) discussion of the metadata crisis).
Interestingly, the day after giving this presentation I had a chance to discuss GI usability with Map Action volunteers who gave a presentation at GEO-10. Their presentation filled in some gaps, but also reinforced the value of researching GI usability for emergency situations.
20 March, 2010
The Digital Economy is a research programme of Research Councils UK, and as part of it the University of Nottingham is running the Horizon Digital Economy research centre. The centre organised a set of theme days, and the latest one focused on ‘supporting the contextual footprint – infrastructure challenges’. The day was excellent, covering issues such as background on location issues (with a review of location technology and a demonstration of a car-pooling application), data ownership, privacy and control over your information, and finally crowdsourcing. I was asked to give a presentation with a bit of background on OpenStreetMap, discuss the motivation of contributors and mention the business models that are based on open geographical information.
For this presentation, I teamed up with Nama Raj Budhathoki, who is completing his PhD research at the University of Illinois, Urbana-Champaign under the supervision of Zorica Nedović-Budić (now at University College Dublin). His research focuses on user-generated geographical information, and just before Christmas he ran a survey of OpenStreetMap contributors; I was involved in the design of the questionnaire (as well as being lucky enough to be on Nama’s advisory committee).
So here is the presentation; we plan to give more comprehensive feedback on the survey during State of the Map 2010.