Posts Tagged ‘postgis’

Down with Podal!

Are you anti-podal? So are many geographic edges!

Actually, not many, but it feels like a common first test case people try when they start playing with the geography type. “How far is it between 0,0 and 180,0?”

There’s a big problem with anti-podal edges though: they don’t have a determinate path. That is, to get from (0,0) to (180,0) it doesn’t matter what direction you travel, just start moving. Any other pairing of points generates a single great circle describing the shortest path joining them. So anti-podal points make very bad components of geometry: they don’t define a path, and they can’t bound an area, because only the end points have a determinate location.

Which brings me to my problem. How do I handle geography objects in PostGIS that include anti-podal segments? On the one hand, since they are impossible to do calculations against, I should just disallow them in all cases, and throw an error. On the other hand, people think they have meaning and stick them into functions all the time. There are also a few functions (like ST_Length) where it’s actually possible to calculate a valid answer given an antipodal input (because we know that antipodal points are exactly one half-circumference apart, even if we don’t know what direction someone might travel between them).
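The ST_Length special case is easy to see on a sphere. Here is a minimal sketch (assuming a spherical earth of radius 6371 km and the standard haversine formula, not PostGIS's actual spheroid code):

```python
import math

def great_circle_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance on a sphere, via the haversine formula."""
    f1, f2 = math.radians(lat1), math.radians(lat2)
    dlat = f2 - f1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(f1) * math.cos(f2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Anti-podal points are always exactly half a circumference apart,
# even though the direction of travel between them is indeterminate.
half_circumference = math.pi * 6371.0
d = great_circle_km(0, 0, 180, 0)
```

So a length is well-defined for an anti-podal segment even though the path itself is not.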

What do you think? Is there a best answer? Comments most welcome!

It goes up to 2.0

Maybe someday PostGIS will go to 11, but for now, we’re still shooting for 2, point oh. And happily we are getting closer and closer. We have moved to a weekly schedule of alpha releases (this week was alpha3) and have started cleaning down the list of tickets against the 2.0 milestone.

Last month, much of the time spent by me and Sandro Santilli on PostGIS 2.0 preparation was funded by the Humanitarian Information Unit of the US Department of State. So, from the PostGIS development team, and the PostGIS community in general: thanks, HIU! Why is HIU funding PostGIS? Because the kinds of tools that HIU and its partners use for humanitarian response are backed by PostGIS, and they want to see those tools get better. Funding PostGIS development is an economical way to simultaneously raise the capabilities of a whole ecosystem of tools in HIU’s space.

Getting Curvey

PostGIS has supported curved geometry types — CIRCULARSTRING, COMPOUNDCURVE, CURVEPOLYGON — since version 1.4, but the number of functions that directly calculate against curved features has remained pretty small. You can generate a bounding box, or calculate a length, but that’s about it.

A COMPOUNDCURVE made up of a line string, a circular arc, and another line string.

In order to do more complex calculations like area calculation or intersections, you have to first convert the curved object into a linearized approximation, using the ST_CurveToLine function. This is fine for functions that return numbers (like ST_Area) or booleans (like ST_Intersects), but what about functions that return derived geometries?

Linearized version of the compound curve. The arc has been replaced by a regular collection of lines.

For derived geometries, the result will also be linear: the portions of the output that come from curves will be linearized versions of the original arcs. Wouldn’t it be nice to have those curves back for storage?

Yes, it would, which is why ST_LineToCurve exists. The line-to-curve logic works on the premise that a linearized version of a curve will retain a certain amount of regularity.

The version of the code from PostGIS 1.4 and 1.5 works by looking at the angles between successive segments. Segments that share an angle of deflection with neighbours are probably components of a circular arc. This worked OK, but the code involved a fair amount of trigonometry.

By looking at angles between edges, you can find edges that are former components of an arc.
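The angle idea can be sketched like this (an illustrative Python helper, not the 1.4/1.5 C code): points sampled along a circular arc turn by the same deflection angle at every vertex.

```python
import math

def deflection(p0, p1, p2):
    """Angle (radians) by which the path turns at vertex p1."""
    a1 = math.atan2(p1[1] - p0[1], p1[0] - p0[0])
    a2 = math.atan2(p2[1] - p1[1], p2[0] - p1[0])
    d = a2 - a1
    # normalize into (-pi, pi]
    while d <= -math.pi:
        d += 2 * math.pi
    while d > math.pi:
        d -= 2 * math.pi
    return d

# Points sampled every 30 degrees along a unit circle:
pts = [(math.cos(math.radians(t)), math.sin(math.radians(t)))
       for t in range(0, 121, 30)]
# every interior vertex deflects by the same 30-degree angle
defs = [deflection(*pts[i:i + 3]) for i in range(len(pts) - 2)]
```

A run of vertices sharing a deflection angle is a candidate arc; note the trigonometry required at every vertex.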

A simpler approach used for 2.0 turned out to be looking at the circle the arc is inscribed on. Any circular arc in PostGIS is defined by a start point, mid point and end point. Between them, they imply a circle, and the center of the circle can be calculated. Any successive point which is the same distance from that center point as the arc points can be considered part of the arc.

By using the circle as a basis for comparison, each successive point needs a simple distance check, instead of a trig check.
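A rough sketch of that distance-based test (illustrative Python, not the 2.0 C implementation): derive the center implied by three arc points, then accept any successive point whose distance from the center matches the radius.

```python
import math

def circle_center(a, b, c):
    """Center of the circle through three non-collinear points."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    ux = ((ax * ax + ay * ay) * (by - cy) + (bx * bx + by * by) * (cy - ay)
          + (cx * cx + cy * cy) * (ay - by)) / d
    uy = ((ax * ax + ay * ay) * (cx - bx) + (bx * bx + by * by) * (ax - cx)
          + (cx * cx + cy * cy) * (bx - ax)) / d
    return (ux, uy)

def on_same_arc(arc3, p, tol=1e-9):
    """Is p the same distance from the implied center as the arc points?"""
    center = circle_center(*arc3)
    r = math.dist(center, arc3[0])
    return abs(math.dist(center, p) - r) <= tol * max(r, 1.0)
```

Each candidate point costs one distance computation, instead of an angle comparison built out of several trig calls.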

The new code is a lot simpler, and can deal with derived segments of more variable length than the old code could.

The simplest way to prove that it works is to wrap a curved geometry in multiple nests of ST_LineToCurve and ST_CurveToLine, pushing the geometry back and forth between representations. While the functions are not perfect inverses (the segmentization routine doesn’t necessarily include the middle control points of the input arcs) you can see that the space bounded by the geometries does not change.

Happy curving!

ST_GeomFromGeoJSON

One of the things that makes PostGIS powerful and useful for web developers is that it can handle multiple input formats and multiple output formats. That means that a simple script binding a web service to the database can be a powerful processing and query tool. But, in order for those queries to work, the database has to speak the same language as the web service.

So, for output, in addition to the OGC standard formats, we have ST_AsGML, ST_AsKML, ST_AsGeoJSON, and ST_AsSVG. And for input, in addition to the OGC formats (WKB, WKT), we have ST_GeomFromGML and ST_GeomFromKML. But web developers don’t work in GML or KML, they work in JSON, and our favourite web developers, Vizzuality, wanted an ST_GeomFromGeoJSON function, and backed up their desire with funding.

It turns out that most of the work had already been done, I just needed to clean up an existing patch and integrate with the build system a little more. Like the from-GML and from-KML functions, this iteration of the from-JSON function probably needs to be moved into a more central part of the code tree, but that’s a programmer-level clean-up that can wait. For now, thanks Vizzuality for supporting PostGIS and thanks Kashif Rasul for the patch!

To prove to yourself that it works:

select st_astext(st_geomfromgeojson(st_asgeojson('POLYGON((0 0,1 1,1 0,0 0))')));

To parse some raw GeoJSON directly:

select st_astext(st_geomfromgeojson('{"type":"Point","coordinates":[1,1]}'));
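Under the hood this is just a mapping from the GeoJSON geometry model onto geometries. A toy illustration of that mapping (a hypothetical Point-only helper; the real parser handles every geometry type and coordinate dimension):

```python
import json

def geojson_point_to_wkt(geojson_text):
    """Toy converter: GeoJSON Point -> WKT. Illustration only."""
    g = json.loads(geojson_text)
    if g.get("type") != "Point":
        raise ValueError("this sketch only handles Point geometries")
    x, y = g["coordinates"]
    return "POINT(%g %g)" % (x, y)
```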

Celebrate PostGIS Day with Reduced Rates on the OpenGeo Suite

This year GIS Day fell on Wednesday, November 16. That means that today is PostGIS Day!

(Get it? Post-GIS Day!)

In honor of PostGIS Day and American Thanksgiving, we’d like to extend a special offer to those of you who are considering signing up for an OpenGeo Suite support contract. For a limited time, anyone who purchases our Basic, Professional, or Platform packages of the OpenGeo Suite Enterprise Edition will receive a 10% discount. But wait, it gets even better! For even more savings buy a two year contract and we’ll increase the discount to 20% off! This is one sale where you won’t have to fight any lines at the mall! (If you want to get up before dawn, that’s okay; we’re pretty excited too.)

If you have considered purchasing the OpenGeo Suite, now is the time. As always, you’ll receive full technical support, priority bug-fixes, and help further the mission of building the best open source geospatial software.

At OpenGeo we have a lot to be thankful for: we’re working with the best open source communities around, our clients and friends are doing amazing things with our support, and our team is comprised of some ridiculously talented people. In this season of appreciation and reflection, we want to pass our good fortune on to you.

So while you’re celebrating PostGIS Day, contact us to take advantage of this great deal on the OpenGeo Suite. This special offer is only available from PostGIS Day through Monday, November 28.

How will you be celebrating PostGIS Day? Let us know!

Indexed Nearest Neighbour Search in PostGIS

An always popular question on the PostGIS users mailing list has been “how do I find the N nearest things to this point?”.

To date, the answer has generally been quite convoluted, since PostGIS supports bounding box index searches, and in order to get the N nearest things you need a box large enough to capture at least N things. Which means you need to know how big to make your search box, which is not possible in general.
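The convoluted workaround amounts to guess-and-check (an illustrative Python sketch with a hypothetical helper, standing in for the old expand-the-search-box SQL pattern): pick a box, scan it, and keep doubling it until enough candidates fall inside.

```python
import math

def knn_expanding_box(points, q, n, start=1.0):
    """Naive pre-9.1 strategy: grow a box around q until it holds at
    least n points, then sort those candidates by true distance."""
    half = start
    while True:
        box = [p for p in points
               if abs(p[0] - q[0]) <= half and abs(p[1] - q[1]) <= half]
        if len(box) >= n:
            return sorted(box, key=lambda p: math.dist(p, q))[:n]
        half *= 2  # guessed too small: widen the box and rescan
```

Every bad guess costs another full index search, and there is no principled way to pick the starting size.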

PostgreSQL has the ability to return ordered information where an index exists, but the ability has been restricted to B-Tree indexes until recently. Thanks to one of our clients, we were able to directly fund PostgreSQL developers Oleg Bartunov and Teodor Sigaev in adding the ability to return sorted results from a GiST index. And since PostGIS indexes use GiST, that means that now we can also return sorted results from our indexes.

Which is a very long way of saying that PostGIS (the development code in the source repository) now has the ability to do index-assisted nearest neighbour searching.

This feature (the PostGIS side of it) was funded by Vizzuality, and hopefully it comes in useful in their CartoDB work.

You will need PostgreSQL 9.1 and the PostGIS source code from the repository, but this is what a nearest neighbour search looks like:

SELECT name, gid
FROM geonames
ORDER BY geom <-> st_setsrid(st_makepoint(-90,40),4326)
LIMIT 10;

Note the magic <-> operator in the ORDER BY clause: that’s where the action is. The <-> is a “distance” operator, but it only makes use of the index when it appears in the ORDER BY clause. Between putting the operator in the ORDER BY and using a LIMIT to truncate the result set, we can very quickly (less than 10ms on a 2M record table, in this case) get the 10 nearest points to our test point.

“It can’t possibly be this easy!!” You’re right. It can’t. Because it is traversing the index, which is made of bounding boxes, the distance operator only works with bounding boxes. For point data, the bounding boxes are equivalent to the points, so the answers are exact. But for any other geometry types (lines, polygons, etc) the results are approximate.

There are actually two different approximations available for you to choose from.

  • Using the <-> operator, you get the nearest neighbour using the centers of the bounding boxes to calculate the inter-object distances.
  • Using the <#> operator, you get the nearest neighbour using the bounding boxes themselves to calculate the inter-object distances.
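The difference between the two approximations is easy to sketch for axis-aligned 2D boxes (illustrative Python with hypothetical helpers; the real operators are implemented in C against PostGIS box types):

```python
import math

def box_center_distance(a, b):
    """<-> style: distance between the centers of two boxes (x1, y1, x2, y2)."""
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = a, b
    return math.dist(((ax1 + ax2) / 2, (ay1 + ay2) / 2),
                     ((bx1 + bx2) / 2, (by1 + by2) / 2))

def box_distance(a, b):
    """<#> style: minimum distance between the boxes themselves."""
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = a, b
    dx = max(bx1 - ax2, ax1 - bx2, 0.0)  # gap along x, zero if overlapping
    dy = max(by1 - ay2, ay1 - by2, 0.0)  # gap along y, zero if overlapping
    return math.hypot(dx, dy)
```

For the same pair of boxes the two measures can disagree substantially, which is why neither is an exact stand-in for true geometry-to-geometry distance.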

In general, because the box calculations are approximations of calculations on the objects themselves, getting a more exact “nearest N objects” is going to require a two-phase query, where the first phase grabs a larger candidate set, and the second phase does an exact test (just like all the other index-assisted predicates). So, for example:

with index_query as (
  select
    st_distance(geom, 'SRID=3005;POINT(1011102 450541)') as distance,
    parcel_id, address
  from parcels
  order by geom <#> 'SRID=3005;POINT(1011102 450541)' limit 100
)
select * from index_query order by distance limit 10;

The indexed query pulls the 100 nearest objects by box distance, and the second query pulls the 10 actual closest from that set.

PgCon Notes #3

One of the joys of an open source conference is meeting face to face the people you have previously only corresponded with via e-mail. At PgCon this year I got to meet Oleg Bartunov and Teodor Sigaev, the team behind the PostgreSQL GiST and GIN indexes.

The GiST and GIN indexes are interesting beasts. Both offer APIs for data type authors (like the PostGIS team) to attach their particular type to an index strategy that makes sense for it. So the GiST index can be used to build an R-Tree pattern, but it can also be used to index arrays.

At PgCon this year, Oleg and Teodor presented their initial findings on a new generalized index that they are calling “SP-GiST”, for “Space-Partitioned Generalized Search Tree”.

Spatial partitioning is an important topic for GIS data, because GIS data tends to be queried in spatially consistent groups. When data is spatially partitioned, objects in the same index node tend to be close together. The net result should in theory be faster retrieval times for spatial queries. Oleg and Teodor implemented a very basic SP-GiST with a quadtree binding and ran performance tests to see how it fared.
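Spatial partitioning is easy to illustrate with a toy point quadtree (a minimal sketch, not the SP-GiST code itself): each node owns a square, and when a node fills up it splits into four child squares, so nearby points naturally land in the same node.

```python
class QuadTree:
    """Toy point quadtree: each full node splits its square into quadrants."""
    def __init__(self, x, y, half, capacity=4):
        self.x, self.y, self.half = x, y, half  # center and half-width
        self.capacity = capacity
        self.points = []
        self.children = None

    def insert(self, px, py):
        if abs(px - self.x) > self.half or abs(py - self.y) > self.half:
            return False  # outside this node's square
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((px, py))
                return True
            self._split()
        return any(c.insert(px, py) for c in self.children)

    def _split(self):
        h = self.half / 2
        self.children = [QuadTree(self.x + dx * h, self.y + dy * h, h, self.capacity)
                         for dx in (-1, 1) for dy in (-1, 1)]
        for p in self.points:  # push existing points down into the quadrants
            any(c.insert(*p) for c in self.children)
        self.points = []

    def query(self, x1, y1, x2, y2):
        """All points within the box [x1,x2] x [y1,y2], pruning whole quadrants."""
        if (x1 > self.x + self.half or x2 < self.x - self.half or
                y1 > self.y + self.half or y2 < self.y - self.half):
            return []
        out = [p for p in self.points if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
        if self.children:
            for c in self.children:
                out.extend(c.query(x1, y1, x2, y2))
        return out

tree = QuadTree(0, 0, 10)
for i in range(-5, 6):
    tree.insert(i, i)
```

A box query can discard entire quadrants without visiting them, which is the intuition behind the retrieval speedups.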

Using a GIS data set (the GeoNames corpus of 2 million points), and comparing a GiST R-Tree to an SP-GiST quad-tree, they found the SP-GiST quad-tree built 3 times faster and fulfilled bounding box queries 6 times faster.

In case you skipped that last paragraph: the SP-GiST quad-tree was six times faster.

The demonstration code was suitable for the performance test, but is not ready for production. It needs to use the PostgreSQL write-ahead log for durability, to support concurrency for transactions, to support vacuuming so it doesn’t bloat over time, and to use a better fill algorithm for database pages.

However, with funding, the work could be ready to be part of the PostgreSQL 9.2 development cycle, which will likely release to production in the fall of next year. If you’re interested in seeing your index queries run 6 times faster in PostGIS, get in touch.

OpenGeo Gallery Part 4

Another week, another wing added to the OpenGeo Gallery. As always, we showcase applications from around the world built using some or all of the software components we help to develop: PostGIS, GeoServer, GeoWebCache, OpenLayers, and GeoExt. These range from small-scale applications to national deployments. See for yourself.

We start out down under with the Australian Antarctic Data Navigator, showing observational data from the Australian Antarctic Division using a GeoExt client and served through GeoServer and GeoWebCache. Next up, our friends at the Ordnance Survey in the UK have built an interesting WMS Map Builder using OpenLayers that serves from their horde of national data. Somewhat closer to our home office, the State of Oklahoma has its own broadband map, similar to the National Broadband Map that we’ve highlighted here in the past. Finally, the World Health Organization hosts the International Travel and Health Map, a lightweight application that shows health risks on a per-country basis.

We’re nowhere near done yet—thanks to everyone who has sent in your applications.

Enter the OpenGeo Gallery

PgCon Notes #2

During my keynote at PgCon, I looked to the future of PostgreSQL/PostGIS and pointed out two use cases where users were pushing the limits of software as it exists today.

The first case is where the user wants to write huge and continuous streams of data. Think: Foursquare or Loopt. Think: a fleet-tracking company with 100,000 vehicles each emitting a GPS point every two seconds. This is not a problem unique to spatial. Many web workloads generate very large write streams, and RDBMS systems with a single machine’s worth of I/O channels simply can’t keep up.

It turns out that there is existing work already in flight to support massive concurrent write loads in PostgreSQL, most notably the Postgres-XC project. While Postgres-XC doesn’t currently support PostGIS types, it could, and I talked to a developer at the conference who was interested in adding that feature. If you or your organization is interested in partially (or fully) funding a project to add PostGIS support to XC, get in touch.

The second future performance case is somewhat specific to spatial. When users load multi-million record spatial tables, and then do spatial joins against them, they generate very compute-intensive workloads. The workloads are naturally parallelizable, and could theoretically be run on multiple CPUs at once, but PostgreSQL currently evaluates them serially, so only one CPU is used. After the talk I had some good conversations with a few different developers interested in tackling this problem.

There are two approaches to cracking the parallel-query problem.

The first approach is to work against the PostgreSQL core directly, so that queries can take advantage of the multiple CPUs that are commonly found on modern servers. The advantage to doing the work on the PostgreSQL core is that it would have a stronger likelihood of being maintained and advanced in PostgreSQL over the long term. The disadvantage is that it would be fairly intricate work, and would take a while to become available for production. (At a minimum 12-14 months to hit the 9.2 release cycle.)

The second approach is to build a query dispatch layer above a PostgreSQL cluster. This is the approach taken by GridSQL and (with a more invasive approach to allow deeper integration) Postgres-XC. The second approach is “shared nothing” and has some potential performance advantages over using the core. But the main advantage is that it can be released and stabilized on a different schedule from the core PostgreSQL. A PostGIS-enabled version of GridSQL could theoretically be available for production use in a few months.

As before, I have talked to developers who are willing to pursue both approaches (and I am enthusiastic about both, as they both help push forward the state of the art). If you or your organization is interested in partially (or fully) funding a project to add parallel query to PostgreSQL, or PostGIS support to GridSQL, get in touch.

It was very gratifying that, for the two problems I considered the most difficult to resolve going forward, I was able to talk to several folks with different concrete plans to overcome those problems. PgCon was truly an amazing event.

PgCon Notes #1

Open source conferences are pretty unique in providing very deep discussions of the way software really works, given by the people who actually write it. There is no filter. And PgCon, which I attended last week, was very much in that mould. Most of the major developers of PostgreSQL were in attendance, giving talks, and the conference was book-ended by “summits” discussing the future development of the core, of clustering, of procedural languages, and so on.

I was invited to give the opening keynote, which I did on GIS in general, and on how the audience of core database developers should look at GIS: not just as a funny little niche feature, but as a growing challenge to the core, since the amount of data GIS systems generate and the kinds of queries we require are quite distinct.

I have spent the last year or so being bombarded (along with everyone else in IT) with breathless hype about “NoSQL” persistence stores, and a general sense that the classic RDBMS is yesterday’s news. So PgCon was a real revelation to me. The core team is turning PostgreSQL into a tool that transcends simple data storage and retrieval. Take these four bits of functionality, three of which I learned about at PgCon:

  • “Foreign data wrappers” that allow remote information (flat files, other databases, GIS files, anything one can coerce into a “table-like” representation) to appear as tables in the database for reading and writing.
  • Invoking parallel processing in GPUs using OpenCL inside the database to allow high performance image processing.
  • Publishing events from PostgreSQL (table updates, etc) to an external queue for consumption by other processes.
  • Type extension, like that used by PostGIS, and the availability of many generic index APIs such as GiST, GIN and (hopefully soon) SP-GiST to bind with those custom types. Also custom procedural languages like PL/Python and PL/R that allow complex programs to run inside the database.

Taken all together, the view I have of PostgreSQL is not of a simple storage/query engine, but of an integration environment: using foreign data wrappers to expose remote data to internal processing that can leverage unique types (like PostGIS) and languages (like R) to generate novel analyses, then pushing those analyses out to other systems via a queue or by writing back through another foreign data wrapper. PostgreSQL is now a powerful platform for turning data into information and information into knowledge. It’s the beginning of an exciting chapter for my favourite database.