21 October, 2010
One issue that remained open in the studies on the relevance of Linus’ Law for OpenStreetMap was that the previous studies looked only at areas with more than 5 contributors, and the link between the number of users and the quality was not conclusive – although the quality was above 70% at that number of contributors and beyond.
Now, as part of writing up the GISRUK 2010 paper for journal publication, we had an opportunity to fill this gap, to some extent. Vyron Antoniou has developed a method to evaluate the positional accuracy on a larger scale than we have done so far. The methodology uses the geometric position of the Ordnance Survey (OS) Meridian 2 road intersections to evaluate positional accuracy. Although Meridian 2 is created by applying a 20-metre generalisation filter to the centrelines of the OS Roads Database, this generalisation process does not affect the positional accuracy of node points and thus their accuracy is the best available. An algorithm was developed for the identification of the correct nodes between the Meridian 2 and OSM, and the average positional error was calculated for each square kilometre in England. With this data, which provides an estimated positional accuracy for an area of over 43,000 square kilometres, it was possible to estimate the contribution that additional users make to the quality of the data.
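The per-square-kilometre calculation described above can be sketched roughly as follows. This is only an illustration, not the algorithm that was actually developed: it assumes coordinates are British National Grid eastings and northings in metres, it swaps in a simple nearest-neighbour match with a hypothetical 20-metre cut-off (`max_dist`) for the node identification step, and the function names `nearest_match` and `mean_error_per_km_cell` are invented for this example.

```python
import math
from collections import defaultdict


def nearest_match(ref_nodes, osm_nodes, max_dist=20.0):
    """Pair each reference intersection with its nearest OSM intersection.

    Assumption: a plain nearest-neighbour search with a distance cut-off
    stands in for the real node identification algorithm.
    """
    pairs = []
    for rx, ry in ref_nodes:
        best, best_d = None, max_dist
        for ox, oy in osm_nodes:
            d = math.hypot(ox - rx, oy - ry)
            if d <= best_d:
                best, best_d = (ox, oy), d
        if best is not None:
            pairs.append(((rx, ry), best, best_d))
    return pairs


def mean_error_per_km_cell(pairs):
    """Average the positional error of matched nodes in each 1 km grid cell."""
    sums = defaultdict(lambda: [0.0, 0])
    for (rx, ry), _osm, d in pairs:
        # The cell is keyed by the reference node's kilometre square.
        cell = (int(rx // 1000), int(ry // 1000))
        sums[cell][0] += d
        sums[cell][1] += 1
    return {cell: total / n for cell, (total, n) in sums.items()}


# Made-up coordinates, purely for illustration.
reference = [(100.0, 100.0), (900.0, 900.0), (1500.0, 200.0)]
osm = [(103.0, 104.0), (900.0, 906.0), (1500.0, 260.0)]

matched = nearest_match(reference, osm)
cell_errors = mean_error_per_km_cell(matched)
print(cell_errors)
```

With these made-up points, the third reference node has no OSM node within the cut-off, so only the first grid square receives an average error. A real analysis over 43,000 square kilometres would need a spatial index rather than the brute-force inner loop used here.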
As can be seen in the chart below, positional accuracy remains fairly level when the number of users is 13 or more – as we have seen in previous studies. On the other hand, up to 13 users, each additional contributor considerably improves the dataset’s quality. In grey you can see the maximum and minimum values, so the area represents the possible range of positional accuracy results. Interestingly, as the number of users increases, positional accuracy seems to settle close to 5m, which is somewhat expected when considering the source of the information – GPS receivers and aerial imagery. However, this is an aspect of the analysis that clearly requires further testing of the algorithm and the datasets.
It is encouraging to see that the results of the analysis are significantly correlated. For the full dataset the correlation is weak (-0.143) but significant at the 0.01 level (2-tailed). However, for the average values for each number of contributors (blue line in the graph), the correlation is strong (-0.844) and significant at the 0.01 level (2-tailed).
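To illustrate why the two correlations differ so much, here is a small sketch on made-up numbers. It assumes a Pearson correlation coefficient (the type of coefficient is not stated above), and the helper names `pearson` and `mean_by_group` are hypothetical. The point is only that averaging the error for each number of contributors removes the tile-to-tile scatter, so a weak correlation over all tiles can become a strong one over the averages.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mean_by_group(xs, ys):
    """Average y for each distinct x (here: mean error per contributor count)."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    keys = sorted(groups)
    return keys, [sum(groups[k]) / len(groups[k]) for k in keys]


# Synthetic tiles: (number of contributors, positional error in metres).
# These values are invented for illustration and are not the study's data.
contributors = [1, 1, 2, 2, 3, 3, 4, 4]
errors = [20.0, 2.0, 17.0, 3.0, 15.0, 3.0, 12.0, 4.0]

r_all = pearson(contributors, errors)
group_counts, group_means = mean_by_group(contributors, errors)
r_means = pearson(group_counts, group_means)

print(f"all tiles: {r_all:.3f}, per-count averages: {r_means:.3f}")
```

Both coefficients are negative in this toy example, but the one computed on the per-count averages is far stronger, which mirrors the pattern reported above without reproducing its values.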
An important caveat is that the number of tiles with more than 10 contributors is fairly small, so that is another aspect that requires further exploration. Moreover, spatial data quality is not just positional accuracy, but also attribute accuracy, completeness, update and other properties. We can expect that they will also exhibit similar behaviour to positional accuracy, but this requires further studies – as always.
However, as this is a large-scale analysis that adds to the evidence from the small-scale studies, it is becoming highly likely that Linus’ Law is affecting the quality of OSM data, and possibly of other so-called Volunteered Geographical Information (VGI) sources, and that the gain in positional accuracy diminishes once the number of contributors passes about 10 or so.
5 October, 2010
The London Citizen Cyberscience Summit in early September was a stimulating event, which brought together a group of people with an interest in this area. A report from the event, with a very good description of the presentations, including a reflection piece, is available on the ‘Strange Attractor’ blog.
During the summit, I discussed the aspects of ‘Extreme’ Citizen Science, where we move from usual science to participatory research. The presentation was partly based on a paper that I wrote and that I presented during the workshop on the value of Volunteered Geographical Information in advancing science, which was run as part of the GIScience 2010 conference towards the middle of September. Details about the workshop are available on the workshop’s website including a set of interesting position papers.
The presentation below covers the topics that I discussed in both workshops. Here, I provide a brief synopsis for the presentation, as it is somewhat different from the paper.
In the talk, I started by highlighting that by using different terminologies we can notice different facets of the practice of crowd data collection (VGI within the GIScience community, crowdsourcing, participatory mapping …).
The first way in which we can understand this information is in the context of Web 2.0 applications. These applications can be non-spatial (such as Wikipedia or Twitter), implicitly spatial (such as Flickr – you need to be in a location before you can capture a photograph), or explicitly spatial, as in applications that are about collecting geographical information – for example OpenStreetMap. When looking at VGI from the perspective of Web 2.0 it’s possible to identify the specific reasons that it emerged and how other similar applications influence its structure and practices.
The second way to view this information is as part of geographical information produced by companies who need mapping information (such as Google or TomTom). In this case, you notice that it’s about reducing the costs of labour and the need for active or passive involvement of the person who carries out the mapping.
The third, and arguably new, way to view VGI is as part of Citizen Science. These activities have been going on for a long time in ornithology and in meteorology. However, there are new forms of Citizen Science that rely on ICT – such as movement-activated cameras (slide 11 on the left) that are left near animal trails and are operated by volunteers, or a network of accelerometers that form a global earthquake monitoring network. Not all Citizen Science is spatial, and there are very effective non-spatial examples, especially in the area of Citizen Cyberscience. So in this framing of VGI we can pay special attention to the collection of scientific information. Importantly, as in the case of spatial applications, some volunteers become experts, such as Hanny van Arkel who has discovered a type of galaxy in Galaxy Zoo.
Slides 16-17 show the distribution of crowdsourced images, and emphasise the spatial distribution of information near population centres and tourist attractions. Slides 19-25 show the analysis of the data that was collected by OpenStreetMap volunteers and highlight bias towards highly populated and affluent areas.
Citizen Science is not just about data collection. There are also cultural problems regarding the trustworthiness of the data, but slides 28-30 show that the data is self-improving as more volunteers engage in the process (in this case, mapping in OpenStreetMap). On that basis, I do question the assumption about trustworthiness of volunteers and the need to change the way we think about projects. There are emerging examples of such Citizen Science where the engagement of participants is at a higher level. For example, the noise mapping activities that a community near London City Airport carried out (slides 34-39) show that people can engage in science and are well placed, when there are opportunities such as the ash cloud in April 2010, to collect ‘background’ noise. This is not possible without the help of communities.
Finally, slides 40 and 41 demonstrate that it is possible to engage non-literate users in environmental data collection.
So in summary, a limitless Citizen Science is possible – we need to create the tools for it and understand how to run such projects, as well as study them.