Back in September, during AGI Geocommunity ’09, I had a chat with Jo Cook about the barriers to the use of OpenStreetMap data by people who are not experts in the ways the data was created and don’t have the time and resources to evaluate the quality of the information. One of the difficulties is to decide if the coverage is complete (or close to complete) for a given area.
To help with this problem, I obtained permission from the Ordnance Survey research unit to release the results of my analysis, which compares OpenStreetMap coverage to the Ordnance Survey Meridian 2 dataset (see below about the licensing conundrum that the analysis produced as a by-product).
Before using the data, it is necessary to understand how it was created. The methodology can be used for the comparison of completeness as well as the systematic analysis of other properties of two vector datasets. The methodology is based on the evaluation of two datasets A and B, where A is the reference dataset (Ordnance Survey Meridian 2 in this case) and B is the test dataset (OpenStreetMap), and a dataset C which includes the spatial units that will be used for the comparison (a grid of 1km squares across England).
The first step in the analysis is to decide on the spatial units that will be used in the comparison process (dataset C). This can be a reference grid with standard cell size, or some other meaningful geographical unit such as census enumeration units or administrative boundaries (see previous post, where lower level super output areas were used). There are advantages to the use of a regular grid, as this avoids problems that arise from the Modifiable Areal Unit Problem (MAUP) to some extent.
The two datasets (A and B) are then split along the boundaries of the geographical units, while preserving the attributes in each part of the object, to ensure that no information is lost. The splitting is necessary to support queries that address only objects that fall within each geographical unit.
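To make the splitting step concrete, here is a minimal sketch in plain Python. It is not the GIS operation that was used for the analysis. It assumes straight line segments in a projected co-ordinate system measured in metres and a grid aligned to multiples of the cell size; the function name `split_segment` and the `attrs` dictionary are hypothetical.

```python
import math

CELL = 1000.0  # grid cell size in metres (1km squares, as in the post)

def split_segment(p0, p1, attrs, cell=CELL):
    """Split one straight segment at every grid line it crosses.

    Returns a list of (cell_index, start, end, attrs) tuples. The attrs
    dictionary is copied onto every piece so that no attribute is lost.
    """
    (x0, y0), (x1, y1) = p0, p1
    cuts = {0.0, 1.0}
    # Collect the parameter t (0..1 along the segment) of each crossing,
    # first for vertical grid lines (x), then for horizontal ones (y).
    for a0, a1 in ((x0, x1), (y0, y1)):
        if a0 == a1:
            continue  # segment runs parallel to these grid lines
        lo, hi = sorted((a0, a1))
        k = math.floor(lo / cell) + 1
        while k * cell < hi:
            cuts.add((k * cell - a0) / (a1 - a0))
            k += 1
    ts = sorted(cuts)
    pieces = []
    for t_a, t_b in zip(ts, ts[1:]):
        start = (x0 + t_a * (x1 - x0), y0 + t_a * (y1 - y0))
        end = (x0 + t_b * (x1 - x0), y0 + t_b * (y1 - y0))
        # The midpoint decides which grid square the piece belongs to.
        mid_x = (start[0] + end[0]) / 2
        mid_y = (start[1] + end[1]) / 2
        index = (math.floor(mid_x / cell), math.floor(mid_y / cell))
        pieces.append((index, start, end, dict(attrs)))
    return pieces

pieces = split_segment((500.0, 500.0), (2500.0, 500.0), {"name": "A1"})
for index, start, end, attrs in pieces:
    print(index, start, end, attrs)
```

A segment that runs across three grid squares comes back as three pieces, each carrying the original attributes and the index of the square it falls in.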
The next step involves the creation of very small buffers around the geographical units. This is necessary because the co-ordinates where an object was split might be near, but not at, the boundary of the reference geographical unit. The discrepancy comes from computational errors in the algorithm that calculates the intersections and splits the objects, and from the way the operators are implemented in the specific GIS package used. The buffers should be very small, so as to ensure that only objects that should be counted inside the unit’s area are included in the analysis. In our case, the buffers are 25cm around grid squares with sides of 1km.
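The effect of the buffer can be illustrated with a small sketch. It simplifies the buffer to a square enlarged by 25cm on each side (a true GIS buffer has rounded corners), and the function name `in_buffered_cell` and the grid indexing by (column, row) are assumptions made for the example.

```python
BUFFER_M = 0.25  # 25cm tolerance, as in the post

def in_buffered_cell(piece, cell_index, cell=1000.0, buffer_m=BUFFER_M):
    """True if every vertex of a split piece lies inside the grid square
    after the square has been enlarged by buffer_m on every side."""
    col, row = cell_index
    min_x = col * cell - buffer_m
    max_x = (col + 1) * cell + buffer_m
    min_y = row * cell - buffer_m
    max_y = (row + 1) * cell + buffer_m
    return all(min_x <= x <= max_x and min_y <= y <= max_y for x, y in piece)

# A piece whose split points ended up a fraction of a millimetre outside
# the square (1, 0) because of rounding in the splitting step.
piece = [(999.9999999, 500.0), (2000.0000003, 500.0)]
print(in_buffered_cell(piece, (1, 0)))                # with the 25cm buffer
print(in_buffered_cell(piece, (1, 0), buffer_m=0.0))  # exact boundary test
```

With an exact boundary test the piece would be dropped from its own grid square, while the 25cm tolerance keeps it without pulling in pieces that belong to the neighbouring squares.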
Finally, spatial queries can be carried out to evaluate the total length, area or any other property of dataset A that falls within each unit, and to compare these values to the results of the analysis of dataset B. The whole process is described in the image above.
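As a rough sketch of this last step, the per-unit totals and their difference can be computed as follows. It assumes the objects have already been split into straight pieces tagged with a grid square index; the function names and the tuple layout are hypothetical, and a GIS package would normally do this with a spatial query.

```python
import math
from collections import defaultdict

def length_per_cell(pieces):
    """Sum the length of already-split pieces for each grid square.

    pieces is an iterable of (cell_index, start_point, end_point).
    """
    totals = defaultdict(float)
    for cell_index, start, end in pieces:
        totals[cell_index] += math.hypot(end[0] - start[0], end[1] - start[1])
    return dict(totals)

def length_difference(test_pieces, reference_pieces):
    """Per-square total of the test dataset (B) minus the reference (A)."""
    test = length_per_cell(test_pieces)
    reference = length_per_cell(reference_pieces)
    cells = set(test) | set(reference)
    return {c: test.get(c, 0.0) - reference.get(c, 0.0) for c in cells}

# Made-up pieces, purely for illustration.
osm = [
    ((0, 0), (0.0, 0.0), (300.0, 400.0)),
    ((0, 0), (300.0, 400.0), (300.0, 900.0)),
    ((1, 0), (1000.0, 0.0), (1600.0, 0.0)),
]
meridian = [
    ((0, 0), (0.0, 0.0), (0.0, 800.0)),
    ((2, 0), (2000.0, 0.0), (2000.0, 250.0)),
]
print(length_difference(osm, meridian))
```

Squares that appear in only one dataset are treated as having zero length in the other, so a square with Meridian 2 roads and no OpenStreetMap roads gets a negative value.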
The shape file provided here contains values from -4 to +4, and these values correspond to the difference between OpenStreetMap and Meridian 2. In each grid square, the following equation was calculated:
∑(OSM roads length)-∑(Meridian roads length)
If the value is negative, then the total length of Meridian objects is bigger than the length of OpenStreetMap objects. A value of -1, for example, means that ‘there are between 0 and 1000 metres more Meridian 2’ in this grid square, whereas 1 means that ‘there are between 0 and 1000 metres more OpenStreetMap’. Importantly, 4 and -4 mean anything with a positive or negative difference of over 3000 metres. In general, the analysis shows that, if the difference is at levels 3 or 4, then you can consider OpenStreetMap as complete, while 1 and 2 will usually mean that some minor roads are likely to be missing. Also, -1 should be easy to complete. In areas where the values are -2 to -4, the OpenStreetMap community needs to complete the map.
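The banding described above can be written down as a short function. This is a sketch of the classification as I read it from the description, not the code that produced the shape file: the treatment of exact multiples of 1000 metres and of a difference of exactly zero (returned here as 0) are assumptions, and the function name is hypothetical.

```python
import math

def difference_class(osm_length_m, meridian_length_m):
    """Map a length difference in metres to a band from -4 to +4.

    Bands are 1000 metres wide; anything over 3000 metres is 4 (or -4).
    """
    diff = osm_length_m - meridian_length_m
    if diff == 0:
        return 0  # assumption: the post does not say how ties are coded
    magnitude = min(4, math.ceil(abs(diff) / 1000))
    return magnitude if diff > 0 else -magnitude

print(difference_class(1500, 1000))  # 500 m more OpenStreetMap
print(difference_class(1000, 4500))  # 3500 m more Meridian 2
```

The first call falls in band 1 and the second in band -4, matching the reading that -4 covers any shortfall of over 3000 metres.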
Finally, the analysis raises a licensing conundrum that shows the problems with both the Ordnance Survey principles, which state that anything that is derived from its maps is Crown copyright and part of Ordnance Survey intellectual property, and with the use of the Creative Commons licence for OpenStreetMap data.
Look at the equation above. The left-hand side is indisputably derived from OpenStreetMap, so it is under the CC-By-SA licence. The right-hand side is indisputably derived from Ordnance Survey, so it is clearly Crown copyright. The equation, however, includes a lot of UCL’s work and, most importantly, does not contain any geometrical object from either dataset: the grid was created afresh. Without ‘deriving’ the total length from each dataset, it is impossible to compute the results that are presented here, yet they are not derived from one or the other alone. So what is the status of the resulting dataset? It is, in my view, UCL copyright, but it is an interesting problem, and I might be wrong.
You can download the data from here – the file includes a metadata document.
If you use the dataset, please let me know what you have done with it.