Haklay, M., Basiouka, S., Antoniou, V. and Ather, A. (2010) How Many Volunteers Does It Take To Map An Area Well? The validity of Linus' law to Volunteered Geographic Information. The Cartographic Journal, 47(4), 315–322.
The abstract of the paper is as follows:
In the area of volunteered geographical information (VGI), the issue of spatial data quality is a clear challenge. The data that are contributed to VGI projects do not comply with standard spatial data quality assurance procedures, and the contributors operate without central coordination and strict data collection frameworks. However, similar to the area of open source software development, it is suggested that the data hold an intrinsic quality assurance measure through the analysis of the number of contributors who have worked on a given spatial unit. The assumption that as the number of contributors increases so does the quality is known as 'Linus' Law' within the open source community. This paper describes three studies that were carried out to evaluate this hypothesis for VGI using the OpenStreetMap dataset, showing that this rule indeed applies in the case of positional accuracy.
One issue that remained open in the studies of the relevance of Linus' Law to OpenStreetMap was that they looked only at areas with more than 5 contributors, where the link between the number of users and quality was inconclusive – although quality stayed above 70% at and beyond that number of contributors.
Now, as part of writing up the GISRUK 2010 paper for journal publication, we had an opportunity to fill this gap, to some extent. Vyron Antoniou has developed a method to evaluate the positional accuracy on a larger scale than we have done so far. The methodology uses the geometric position of the Ordnance Survey (OS) Meridian 2 road intersections to evaluate positional accuracy. Although Meridian 2 is created by applying a 20-metre generalisation filter to the centrelines of the OS Roads Database, this generalisation process does not affect the positional accuracy of node points and thus their accuracy is the best available. An algorithm was developed for the identification of the correct nodes between the Meridian 2 and OSM, and the average positional error was calculated for each square kilometre in England. With this data, which provides an estimated positional accuracy for an area of over 43,000 square kilometres, it was possible to estimate the contribution that additional users make to the quality of the data.
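A minimal sketch of the matching and per-square averaging, assuming projected coordinates in metres (e.g. British National Grid) and a brute-force nearest-neighbour search – the actual algorithm for identifying the correct node pairs between Meridian 2 and OSM is more involved, and the function name, threshold and input shapes here are illustrative assumptions:

```python
import math
from collections import defaultdict

def match_error_per_cell(meridian_nodes, osm_nodes, max_dist=20.0, cell=1000.0):
    """For each Meridian 2 intersection, find the nearest OSM node within
    max_dist metres and record the offset as the positional error; then
    average the offsets per 1 km grid cell. Coordinates are (x, y) tuples
    in a projected system, in metres."""
    errors = defaultdict(list)
    for mx, my in meridian_nodes:
        best = None
        for ox, oy in osm_nodes:  # a spatial index would replace this scan
            d = math.hypot(mx - ox, my - oy)
            if d <= max_dist and (best is None or d < best):
                best = d
        if best is not None:
            key = (int(mx // cell), int(my // cell))  # 1 km grid cell index
            errors[key].append(best)
    return {k: sum(v) / len(v) for k, v in errors.items()}
```

With per-cell averages in hand, each square kilometre can then be paired with its contributor count for the analysis described below.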
As can be seen in the chart below, positional accuracy remains fairly level when the number of users is 13 or more – as we have seen in previous studies. On the other hand, up to 13 users, each additional contributor considerably improves the dataset’s quality. In grey you can see the maximum and minimum values, so the area represents the possible range of positional accuracy results. Interestingly, as the number of users increases, positional accuracy seems to settle close to 5m, which is somewhat expected when considering the source of the information – GPS receivers and aerial imagery. However, this is an aspect of the analysis that clearly requires further testing of the algorithm and the datasets.
It is encouraging to see that the results of the analysis are significantly correlated. For the full dataset the correlation is weak (-0.143) but significant at the 0.01 level (2-tailed). However, for the average values for each number of contributors (blue line in the graph), the correlation is strong (-0.844) and significant at the 0.01 level (2-tailed).
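Both correlations can be reproduced with a plain Pearson calculation. The sketch below is a hypothetical illustration (the function names and data handling are assumptions), taking a contributor count and an average positional error per square kilometre, and also showing how the per-contributor-count averages behind the blue line might be formed:

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Plain Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def per_count_means(users, errors):
    """Average positional error for each distinct number of contributors,
    i.e. the series plotted as the blue line in the chart."""
    groups = defaultdict(list)
    for u, e in zip(users, errors):
        groups[u].append(e)
    counts = sorted(groups)
    return counts, [sum(groups[c]) / len(groups[c]) for c in counts]
```

Correlating the raw per-cell values gives the weak full-dataset figure; correlating contributor counts against the per-count means gives the much stronger figure, since averaging removes the within-count scatter.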
An important caveat is that the number of tiles with more than 10 contributors is fairly small, so that is another aspect that requires further exploration. Moreover, spatial data quality is not just positional accuracy, but also attribute accuracy, completeness, currency of updates and other properties. We can expect that they will exhibit behaviour similar to positional accuracy, but this requires further studies – as always.
However, as this is a large-scale analysis that adds to the evidence from the small-scale studies, it is becoming highly likely that Linus' Law does affect the quality of OSM data, and possibly that of other so-called Volunteered Geographical Information (VGI) sources, and that the gain in positional accuracy diminishes once the number of contributors passes about 10 or so.
One of the interesting questions that emerged from the work on the quality of OpenStreetMap (OSM) in particular, and Volunteered Geographical Information (VGI) in general, is the validity of the ‘Linus’ Law’ for this type of information.
The law came from Open Source software development and states that 'Given enough eyeballs, all bugs are shallow' (Raymond, 2001, p.19). For mapping, I suggest that this can be translated into the number of contributors that have worked on a given area. The rationale behind it is that if there is only one contributor in an area, he or she might inadvertently introduce some errors. For example, they might forget to survey a street or might position a feature in the wrong location. If there are several contributors, they might notice inaccuracies or 'bugs' and therefore the more users, the fewer 'bugs'.
In my original analysis, I looked only at the number of contributors per square kilometre as a proxy for accuracy, and provided a visualisation of the difference across England.
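The contributor-count proxy can be sketched as a simple grid aggregation. The function below is a hypothetical illustration (not the code used in the original analysis), assuming node records carry projected coordinates in metres and the contributor's user name:

```python
from collections import defaultdict

def contributors_per_cell(nodes, cell=1000.0):
    """Count distinct contributors per 1 km grid cell from (x, y, user)
    node records; coordinates are assumed to be projected, in metres."""
    users = defaultdict(set)
    for x, y, user in nodes:
        users[(int(x // cell), int(y // cell))].add(user)
    return {k: len(v) for k, v in users.items()}
```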
During the past year, Aamer Ather and Sofia Basiouka looked at this issue, by comparing the positional accuracy of OSM in 125 sq km of London. Aamer carried out a detailed comparison of OSM and the Ordnance Survey MasterMap Integrated Transport Network (ITN) layer. Sofia took the results from his study and divided them for each grid square, so it was possible to calculate an overall value for every cell. The value is the average of the overlap between OSM and OS objects, weighted by the length of the ITN object. The next step was to compare the results to the number of users at each grid square, as calculated from the nodes in the area.
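The length-weighted averaging step for a single grid square can be illustrated with a small sketch (the function name and input shape are assumptions, not the code used in the study):

```python
def cell_quality(segments):
    """Length-weighted average of per-segment overlap percentages within
    one grid square. `segments` is a list of (itn_length_m, overlap_pct)
    pairs, so long ITN objects count for more than short ones."""
    total = sum(length for length, _ in segments)
    if total == 0:
        return None  # no ITN objects fall in this cell
    return sum(length * pct for length, pct in segments) / total
```

Weighting by ITN length prevents a short, poorly matched stub from dominating the quality score of a cell that is otherwise well mapped.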
The results show that, above 5 users, there is no clear pattern of improved quality. The graph below provides the details – but the pattern is that the quality, while generally very high, is not dependent on the number of users – so Linus' Law does not apply to OSM (and probably not to VGI in general).
From looking at OSM data, my hypothesis is that, due to the participation inequality in OSM contribution (some users contribute a lot while others don’t contribute very much), the quality is actually linked to a specific user, and not to the number of users.
Yet, I will qualify the conclusion with the statement that further research is necessary. Firstly, the analysis was carried out in London, so checking what is happening in other parts of the country where different users collected the data is necessary. Secondly, the analysis did not include the interesting range of 1 to 5 users, so it might be the case that there is rapid improvement in quality from 1 to 5 and then it doesn’t matter. Maybe the big change is from 1 to 3? Finally, the analysis focused on positional accuracy, and it is worth exploring the impact of the number of users on completeness.