Indexed Nearest Neighbour Search in PostGIS

A perennially popular question on the PostGIS users mailing list is “how do I find the N nearest things to this point?”

To date, the answer has generally been quite convoluted. PostGIS index searches work on bounding boxes, so to get the N nearest things you need a search box large enough to capture at least N things. That means you need to know in advance how big to make your search box, which is not possible in general.
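
For comparison, the box-based workaround looks roughly like the sketch below. The 1-degree search radius is a guess picked purely for illustration: if fewer than 10 rows fall inside the box, the query returns fewer than 10 results and has to be re-run with a bigger box, and if the box is too big the sort step gets slow.

```sql
-- Box-based workaround (a sketch): guess a search box, filter on it with
-- the index, then sort whatever survives by true distance.
-- The 1.0 (degree) radius is an assumed value, not a recommendation.
SELECT name, gid
FROM geonames
WHERE geom && ST_Expand(ST_SetSRID(ST_MakePoint(-90, 40), 4326), 1.0)
ORDER BY ST_Distance(geom, ST_SetSRID(ST_MakePoint(-90, 40), 4326))
LIMIT 10;
```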

PostgreSQL has the ability to return ordered information where an index exists, but the ability has been restricted to B-Tree indexes until recently. Thanks to one of our clients, we were able to directly fund PostgreSQL developers Oleg Bartunov and Teodor Sigaev in adding the ability to return sorted results from a GiST index. And since PostGIS indexes use GiST, that means that now we can also return sorted results from our indexes.

Which is a very long way of saying that PostGIS (the development code in the source repository) now has the ability to do index-assisted nearest neighbour searching.

This feature (the PostGIS side of it) was funded by Vizzuality, and hopefully it comes in useful in their CartoDB work.

You will need PostgreSQL 9.1 and the PostGIS source code from the repository, but this is what a nearest neighbour search looks like:

SELECT name, gid
FROM geonames
ORDER BY geom <-> st_setsrid(st_makepoint(-90,40),4326)
LIMIT 10;

Note the <-> operator in the ORDER BY clause. This is where the magic occurs. The <-> is a “distance” operator, but it only makes use of the index when it appears in the ORDER BY clause. By putting the operator in the ORDER BY and using a LIMIT to truncate the result set, we can very quickly (less than 10ms on a 2M record table, in this case) get the 10 nearest points to our test point.
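
To check that the index is really being used, it can help to look at the query plan. The sketch below assumes a GiST index on the geom column of geonames; the index name geonames_geom_gix is made up for illustration.

```sql
-- Assumed setup: a GiST index on the geometry column
-- (the index name is hypothetical).
CREATE INDEX geonames_geom_gix ON geonames USING GIST (geom);

-- Ask the planner how it intends to run the nearest neighbour query.
-- An "Index Scan using geonames_geom_gix" node carrying an "Order By" line
-- indicates an index-assisted search; a "Sort" on top of a "Seq Scan"
-- means the index is not being used for the ordering.
EXPLAIN
SELECT name, gid
FROM geonames
ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-90, 40), 4326)
LIMIT 10;
```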

“It can’t possibly be this easy!!” You’re right. It can’t. Because it is traversing the index, which is made of bounding boxes, the distance operator only works with bounding boxes. For point data, the bounding boxes are equivalent to the points, so the answers are exact. But for any other geometry types (lines, polygons, etc) the results are approximate.

There are actually two different approximations available for you to choose from.

  • Using the <-> operator, you get the nearest neighbour using the centers of the bounding boxes to calculate the inter-object distances.
  • Using the <#> operator, you get the nearest neighbour using the bounding boxes themselves to calculate the inter-object distances.
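
One way to get a feel for how far the two approximations drift from the real answer is to select all three distances side by side. This is only a sketch: it assumes a polygon table called parcels with geom and parcel_id columns, and it assumes the operator semantics described above. Used in the SELECT list like this, the operators are ordinary distance calculations and get no help from the index.

```sql
-- Compare both box-based approximations with the exact distance.
-- Table and column names (parcels, parcel_id, geom) are assumptions.
SELECT
  parcel_id,
  -- distance between bounding box centers
  geom <-> 'SRID=3005;POINT(1011102 450541)'::geometry AS center_distance,
  -- distance between the bounding boxes themselves
  geom <#> 'SRID=3005;POINT(1011102 450541)'::geometry AS box_distance,
  -- exact distance between the geometries
  ST_Distance(geom, 'SRID=3005;POINT(1011102 450541)'::geometry) AS true_distance
FROM parcels
LIMIT 20;
```

For large or oddly shaped polygons, expect the first two columns to disagree noticeably with the third, which is why an exact re-check is needed.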

In general, because the box calculations are approximations of calculations on the objects themselves, getting a more exact “nearest N objects” is going to require a two-phase query, where the first phase grabs a larger candidate set, and the second phase does an exact test (just like all the other index-assisted predicates). So, for example:

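-- Phase 1 (indexed): take the 100 nearest candidates by bounding box distance,
-- computing the exact distance for each candidate along the way.
with index_query as (
  select
    st_distance(geom, 'SRID=3005;POINT(1011102 450541)') as distance,
    parcel_id, address
  from parcels
  order by geom <#> 'SRID=3005;POINT(1011102 450541)' limit 100
)
-- Phase 2 (exact): re-sort the candidates by true distance and keep the best 10.
select * from index_query order by distance limit 10;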

The indexed query pulls the 100 nearest objects by box distance, and the second query pulls the 10 actual closest from that set.

PgCon Notes #3

One of the joys of an open source conference is meeting people face to face with whom you have previously only corresponded via e-mail. At PgCon this year I got to meet Oleg Bartunov and Teodor Sigaev, the team behind the PostgreSQL GiST and GIN indexes.

The GiST and GIN indexes are interesting beasts. Both offer APIs for data type authors (like the PostGIS team) to attach their particular type to an index strategy that makes sense for it. So the GiST index can be used to build an R-Tree pattern, but it can also be used to index arrays.

At PgCon this year, Oleg and Teodor presented their initial findings on a new generalized index that they are calling “SP-GiST”, for “Spatial Partitioning Generalized Search Tree”.

Spatial partitioning is an important topic for GIS data, because GIS data tends to be queried in spatially consistent groups. When data is spatially partitioned, objects in the same index node tend to be close together. The net result should in theory be faster retrieval times for spatial queries. Oleg and Teodor implemented a very basic SP-GiST with a quadtree binding and ran performance tests to see how it fared.

Using a GIS data set (the GeoNames corpus of 2 million points), and comparing a GiST R-Tree to an SP-GiST quad-tree, they found the SP-GiST quad-tree built 3 times faster and fulfilled bounding box queries 6 times faster.

In case you skipped that last paragraph: the SP-GiST quad-tree was six times faster.

The demonstration code was suitable for the performance test, but is not ready for production. It needs to use the PostgreSQL write-ahead log for durability, to support concurrency for transactions, to support vacuuming so it doesn’t bloat over time, and to use a better fill algorithm for database pages.

However, with funding, the work could be ready to be part of the PostgreSQL 9.2 development cycle, which will likely release to production in the fall of next year. That is worth thinking about if you’re interested in seeing your index queries run 6 times faster in PostGIS.