14

I am trying to start with a geo search project that will find all landmarks in the 10 km/miles (not important for this story) of a particular landmark.

So for example, lets say I have a database of a 1,000,000 landmarks. In order to find all landmarks in 10 miles range of a landmark with certain coordinates, I would have to calculate a distance between a landmark from my search and 1,000,000 landmarks.

Is there a better way to do that?

Alternative I was thinking is to categorize landmarks such as country, region, city, neighborhood, business, historical, etc. in such a way that business can be part of a neighborhood or city. City is a part of a region, a country, etc. This can narrow a list of calculations, but it still looks like a lot of work do to in order for search to be fast and accurate.

Could the Google Maps API help?

Peter Mortensen
  • 1,050
  • 2
  • 12
  • 14
Dario Granich
  • 795
  • 7
  • 16
  • 5
    You could probably eliminate a good many simply by performing a quick Manhattan distance calculation and then performing a second filter afterwards to exclude landmarks which are within a 10km square but are outside the 10km radius. – Neil Nov 05 '18 at 13:30
  • 3
    What database technology are you using? The answer is not database agnostic. – jpmc26 Nov 05 '18 at 16:48
  • 1
    @Neil As a second pass you can include any landmark that where the x and y both fall with in 7km of the origin without calculating the actual distance. – JimmyJames Nov 05 '18 at 20:17

4 Answers4

29

Use a database with support for GIS (geographic information systems) queries. Most databases support this outright or have extensions, but the details will be database-specific (in their answer, Flater shows the syntax for SQL server).

If you need to implement such queries within your application, you can implement a data structure that allows spatial queries, e.g. a k-d Tree. This is like a binary search tree, except that each level of the tree partitions on a different coordinate dimension. This allows you to restrict the search to a smaller set of feasible candidates. Effectively, you translate your search “10km radius” into bounds for each coordinate dimension, and tighten the bounds as you recurse into the tree.

amon
  • 132,749
  • 27
  • 279
  • 375
  • 5
    There is also [a GIS stackexchange](https://gis.stackexchange.com/) – BlueRaja - Danny Pflughoeft Nov 05 '18 at 15:11
  • 8
    [PostGIS](https://postgis.net/) is the premier free option. It supports *much, much* more than SQL Server's very basic GIS types and functions. But this is basic functionality. – jpmc26 Nov 05 '18 at 16:41
  • @amon I find jpmc26's comment as a good addition, and not that much as criticizing your example. "If you want to start from scratch, you don't need to pay for a licensed DB - this free, open-source one will also do the trick really well". – mgarciaisaia Nov 06 '18 at 11:51
11

Yes, there's a better way. You need to use a spatial index. These indexes organize metadata about geometries to filter out far away geometries very rapidly, saving a lot of CPU cycles by avoiding the computations you describe. You shouldn't bother implementing one yourself as all major relational databases provide a spatial geometry type and indexes to go with them.

What you want to look into are "within distance" queries (queries for geometries within a certain distance of some other geometry). These are very standard and very much a solved problem and are possible in all of the above databases (and built into several):

  • PostGIS: ST_DWithin
  • SQL Server: STDistance (Not clear that index use on the 3D geography version of this function is supported)
  • Oracle: SDO_WITHIN_DISTANCE (This doesn't say explicitly that it will trigger index usage. I'd double check the query plan. You might need to apply an SDO_FILTER to get it to use the index.)
  • MySQL: Still figuring this out.

Workaround for triggering index usage

In the worst case where you have trouble getting the system to use the spatial index with these queries, you can add an additional filter. You'd create a square bounding box with sides of length 2*(search distance) centered at your search point and compare the table geometries' bounding boxes against that before checking the actual distance. That's what PostGIS' ST_DWithin above does internally anyway.


Distance in GIS

While spatial indexes are fantastic and absolutely the right solution to your problem, distance calculation can get logically complicated. In particular, you need to worry about what projection (basically all the parameters for the coordinate system) your data is stored in. Most 2D projections (things other than angular coordinate systems like the various lat/long projections) distort length significantly. For example, the Web Mercator projection (the one used by Google, Bing, and every other major base map provider) expands areas and distances increasingly as the location gets further from the equator. I might be wrong as I'm not formally educated in GIS, but the best I've seen for 2D projections is some specific ones that promise correct distances from a single, constant point in the entire world. (No, it's not practical to use a different projection for every query; that would render your indexes useless.)

The bottom line is that you need to make sure your math is accurate. The simplest way of doing so from a development perspective is to use angular projections (These are often referred to as "geographic.") and functions that support doing the math using a spheroid model, but these computations are slightly more expensive than the 2D counterparts and some DBs may not support indexing them. If you can get acceptable performance using them, though, that's probably the way to go. Another common option is regional projections (like UTM zones) that get both distances and areas pretty close to correct if your data is confined to a particular part of the world. What's best for your app will depend on your specific requirements, but be aware that you need to think this through and maybe learn a little bit about it.

This applies even if you don't use built in spatial indexes. Your data has some projection regardless of what technology or technique you are currently using or use in the future, and it's already currently affecting any queries and computations you're making.

jpmc26
  • 5,389
  • 4
  • 25
  • 37
10

Since SQL Server 2008, there is a geography data type which stores locations (lat/lon pairs) and makes it easy for you to write location-related queries.

There is an existing StackOverflow answer that discusses this in-depth.

A basic query to find the nearest 7 items:

USE AdventureWorks2012  
GO  
DECLARE @g geography = 'POINT(-121.626 47.8315)';  
SELECT TOP(7) SpatialLocation.ToString(), City FROM Person.Address  
WHERE SpatialLocation.STDistance(@g) IS NOT NULL  
ORDER BY SpatialLocation.STDistance(@g);  

A basic query to find everything within 100m (second answer to the question)

-- Get the center point
DECLARE @g geography
SELECT @g = geo FROM yourTable WHERE PointId = something

-- Get the results, radius 100m
SELECT * FROM yourTable WHERE @g.STDistance(geo) <= 100
Flater
  • 44,596
  • 8
  • 88
  • 122
  • 1
    This will still be hellishly inefficient if you didn’t create an index for the relevant columns, though. – Konrad Rudolph Nov 05 '18 at 14:54
  • 11
    @KonradRudolph: As is the case for any SQL column that is used for querying on a table with a massive rowcount. You're correct, but that comment would apply to virtually any SQL query that is posted as an answer. – Flater Nov 05 '18 at 14:58
  • 2
    Where did you read "MS SQL Server" in the question? – Doc Brown Nov 05 '18 at 14:58
  • @DocBrown: Where did you read that OP's DB system is _not_ MS SQL Server and that he's not open to changing it? OP asked for "a better way to do that" and did not specify any particular limits as to the DB system used. – Flater Nov 05 '18 at 15:00
  • 3
    @Flater I agree that it would normally obvious and redundant but OP’s wording seems to suggest that they’re unaware of such mechanisms. – Konrad Rudolph Nov 05 '18 at 15:09
  • 1
    I am appalled by the lack of mention of an index. I'd even suggest mentioning that there is a special *kind* of index specifically for GIS data. (SQL Server uses a grid index. Other DBs opt for R-Tree.) Also, if your answer to assuming SQL Server is that the OP might be willing to change it, I'm appalled you didn't point out the existence of PostGIS, which is free and open source. Oracle also has built in GIS support if they're stuck with that. The only DB I'm not really sure about is MySQL, but they have *something* for geometries, too. Point being that switching DBs is likely unnecessary. – jpmc26 Nov 05 '18 at 16:43
  • 1
    @Flater: don't get me wrong, your answer is ok, but I think it could be improved by some introducing words like "in case you are using Windows and you are free to pick a db system ..." Or, even better, "In many DB systems, there is a geography data type available ... For example, MS SQL server provides the following ...". – Doc Brown Nov 05 '18 at 17:00
  • This also isn't what the OP asked for. This is n nearest neighbors. The OP wants *all* results within a specific distance. It's not immediately clear to me from [SQL Server's docs](https://docs.microsoft.com/en-us/sql/relational-databases/spatial/query-spatial-data-for-nearest-neighbor?view=sql-server-2017) how to adjust the query so that it still uses the index. – jpmc26 Nov 05 '18 at 17:07
  • 2
    @jpmc26: You're appalled that I listed a valid option and didn't include some other option? What? If you feel it's relevant to add PostGIS, add the answer yourself (which you did) and don't resort to criticizing others for not having the same idea as you. – Flater Nov 06 '18 at 07:22
  • 3
    Your answer appears to me as basically just a MS SQL sales pitch. Your comments suggesting they *switch databases* to something that would cost 10s of thousands of dollars without actually inquiring about what their situation only makes it appear moreso. It doesn't even describe how the OP can actually implement their query or discuss the fact that doing so and esnuring the spatial index is used is not as straightforward in MS SQL as in other DBs. Nor does it discuss any of the underlying concepts. It's a bad answer, irrespective of whether it's "valid." That's why it bothers me. – jpmc26 Nov 06 '18 at 10:16
  • So something odd. I finally tracked down [documentation](https://docs.microsoft.com/en-us/sql/relational-databases/spatial/spatial-indexes-overview) about `STDistance` supporting spatial indexes (though not as far back as 2008), but it doesn't say the *geography* type is supported. The [*geometry* version](https://docs.microsoft.com/en-us/sql/t-sql/spatial-geometry/stdistance-geometry-data-type) links to that doc in "See Also," but curiously, the [*geography* version](https://docs.microsoft.com/en-us/sql/t-sql/spatial-geography/stdistance-geography-data-type) does not. – jpmc26 Nov 06 '18 at 10:54
3

I would agree that if possible using specific support in a database would be the most sensible way to do this.

However if I had to do this on a database without specific support I would start by querying for a square that encloses the circule e.g. (y > (y1 - rad)) AND (y < (y1 + rad)) AND (x > (x1 - rad)) AND (x < (x1 + rad)). Assuming your points have roughly even distribution querying for a square will get you your true matches plus about 30% extra false matches. You can then cull out the false matches.

Peter Green
  • 2,125
  • 9
  • 15
  • But without an appropriate spatial index, such a query will scan at worst the entire database, at best all items within the given latitude OR longitude range depending on your index, i.e. a "band" rather than a square. If you don't want to kill performance, use a database that supports spatial indexes! – jcaron Nov 05 '18 at 16:21
  • @jcaron I believe this query could be optimized with an ordinary B-tree index on `x` and `y`. (Perhaps combined, perhaps separate. I'd profile a little to figure out which works better in practice.) – jpmc26 Nov 05 '18 at 17:55
  • @jpmc26 Nope, it can’t. Think it through, you’ll see. – jcaron Nov 05 '18 at 20:52
  • @jcaron Perhaps it would be better if you weren't cryptic about something that is clearly not straightforward. B-trees can be used for `BETWEEN` queries. I don't see why worst case you couldn't have 2 indexes and then the filtered results from each index get joined together. (That is something RDBMSes do internally when they deem it worth using multiple indexes.) If a combined index works, it should filter out one dimension entirely at the first level and then relatively quickly narrow down in the second level. – jpmc26 Nov 05 '18 at 21:04
  • 2
    @jcaron actually you can use index for something like `y between -68 and -69 and x between 10 and 11` but of course spatial index do a better job for that task – Juan Carlos Oropeza Nov 05 '18 at 21:04
  • A single B-Tree index on the two columns can only be used for a single column in this case: it can only find values between (-69,10) and (-68,11), but that includes values such as (-68.5, -40) or (-68.2,80), which will need to be filtered out. Some RDBMS will be able to combine two independent indices, but not all can, and for those which can, it can have terrible performance. If you want an index that can actually be used directly to only read the relevant subset from the table, you need a spatial index. – jcaron Nov 05 '18 at 23:01
  • FWIW, this is exactly what I do in a web app I built a couple years ago, and it works very well. For accuracy, you do want to convert the radius to the proper lat/lng range for where you are on the globe first. I believe I did it all in a stored proc to keep the sql simpler. – GrandmasterB Nov 06 '18 at 19:01