100x constraints in WHERE clause makes query extremely slow - sql

I'm using Firebird and created a table, called EVENTS. The columns are:
id (INT) | name (VARCHAR) | category (INT) | website (VARCHAR) | lat (DOUBLE) | lon (DOUBLE)
A user wants to search for events in a certain radius around them, but entered only two or three letters of their home city. So we've got, let's say, 200 possible cities with their latitudes and longitudes. So, my SQL query looks like:
SELECT id FROM events WHERE ((lat BETWEEN 30.09 AND 30.12) AND (lon BETWEEN 40.78 AND 40.81)) OR ((lat BETWEEN 30.09 AND 30.12) AND (lon BETWEEN 40.78 AND 40.81)) OR ...
So, we get 200 constraints in the WHERE clause and it takes seconds to actually get the result.
I know the query might look horrible, but are the many constraints really the bottleneck? Can this query be optimized?

My guess would be that the database engine decides that the criterion will likely return a lot of rows, so it wrongly full scans the table. Hint it to do the right thing, or perform some kind of rewrite of the query e.g. (which might or might not help)
SELECT id
FROM cities c
JOIN events e ON (e.lat BETWEEN c.lat - .01 AND c.lat + .01) AND (e.lon BETWEEN c.lon - .01 AND c.lon + .01)
WHERE c.name LIKE 'x%'
In SQL server you could write
SELECT id
FROM cities c
INNER LOOP JOIN events e ON (e.lat BETWEEN c.lat - .01 AND c.lat + .01) AND (e.lon BETWEEN c.lon - .01 AND c.lon + .01)
WHERE c.name LIKE 'x%'
to ensure the correct plan (you do have an index on the lat and lon columns together?)

Trade space for speed:
Cities don't move. Whenever you add an event, you can pre-calculate the distance between each event and each city, and store the distance to all nearby cities. You can index this by city, so you can directly find events somewhat near a given city (or near 200 cities with the same prefix). Actual longitude/latitude filtering can then be restricted to a much smaller set of events.
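A minimal sketch of that precomputation, using Python's built-in sqlite3 rather than Firebird so that it runs anywhere. The event_near_city table, the 50 km cutoff, the flat-earth distance helper and the sample cities are all assumptions made for illustration; none of them come from the question.

```python
import math
import sqlite3

def approx_km(lat1, lon1, lat2, lon2):
    """Flat-earth distance approximation in km; adequate for short ranges."""
    dy = (lat2 - lat1) * 111.0
    dx = (lon2 - lon1) * 111.0 * math.cos(math.radians((lat1 + lat2) / 2))
    return math.hypot(dx, dy)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL);
    CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL);
    -- Hypothetical lookup table: one row per (city, nearby event) pair.
    CREATE TABLE event_near_city (city_id INTEGER, event_id INTEGER, km REAL);
    CREATE INDEX idx_near_city ON event_near_city (city_id, km);
""")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?, ?)", [
    (1, "Hamburg", 53.55, 10.00),   # sample data, assumed
    (2, "Hanover", 52.37, 9.73),
    (3, "Munich", 48.14, 11.58),
])

MAX_KM = 50.0  # assumed cutoff: only store cities this close to the event

def add_event(event_id, name, lat, lon):
    """Insert an event and precompute its distance to every nearby city."""
    conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)", (event_id, name, lat, lon))
    cities = conn.execute("SELECT id, lat, lon FROM cities").fetchall()
    for city_id, clat, clon in cities:
        km = approx_km(lat, lon, clat, clon)
        if km <= MAX_KM:
            conn.execute("INSERT INTO event_near_city VALUES (?, ?, ?)",
                         (city_id, event_id, km))

def events_near_prefix(prefix, radius_km):
    """Events within radius_km of any city whose name starts with prefix."""
    rows = conn.execute("""
        SELECT DISTINCT n.event_id
        FROM cities c
        JOIN event_near_city n ON n.city_id = c.id
        WHERE c.name LIKE ? AND n.km <= ?
        ORDER BY n.event_id
    """, (prefix + "%", radius_km))
    return [r[0] for r in rows]

add_event(10, "Harbour festival", 53.54, 9.99)
add_event(11, "Beer tasting", 48.13, 11.57)
print(events_near_prefix("Ha", 10))
```

The search then touches only the small lookup table through its (city_id, km) index, at the cost of extra work on every insert.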

You could redesign the database (if that is possible) to store not only the latitude and longitude but also the name of the place where the event is held. Your query would then use a LIKE (or a "begins with") condition on that name. I know this might not be a usable solution, but constraining yourself to square (in the spherical sense) cities or regions seems a bit odd to me ;)

Create a range-search-friendly index (a B-tree index) on events.lat and/or events.lon (but not a single index on both!). That will at least get you in the ballpark.
What you really want is an R-Tree or similar, which allows indexing multi-dimensional data and gives you good range search performance. PostgreSQL has GiST for that; I don't know what kind of support Firebird has for this sort of problem.
Wiki links for more info:
http://en.wikipedia.org/wiki/R-tree
http://en.wikipedia.org/wiki/GiST

You should first run your query through IBExpert and check its plan to see why it is so slow.

Try with a correlated subquery :
select *
from events e
where exists
( select *
from cities c
where c.name like 'X%' and
e.lat BETWEEN c.lat - .01 AND c.lat + .01 and
e.lon BETWEEN c.lon - .01 AND c.lon + .01
)
In some scenarios it works faster than joins.

Related

geolocating self join too slow

I am trying to get the count of all records within 50 miles of each record in a huge table (1m + records), using self join as shown below:
proc sql;
create table lab as
select distinct a.id, sum(case when b.value="New York" then 1 else 0 end)
from latlon a, latlon b
where a.id <> b.id
and geodist(a.lat,a.lon,b.lat,b.lon,"M") <= 50
and a.state = b.state;
This ran for 6 hours and was still running when I last checked.
Is there a way to do this more efficiently?
UPDATE: My intention is to get the number of New Yorkers in a 50 mile radius of every record in table latlon, which has name, location and latitude/longitude, where lat/lon could be anywhere in the world but location will be a person's hometown. I have to do this for close to a dozen towns. Looks like this is the best it can get. I may have to write some C code for this one, I guess.
The geodist() function you're using has no chance of exploiting any index. So you have an algorithm that's O(n**2) at best. That's going to be slow.
You can take advantage of a simple fact of spherical geometry, though, to get access to an indexable query. A degree of latitude (north-south) is equivalent to sixty nautical miles, 69 statute miles, or 111.111 km. The nautical mile was originally defined as one minute of arc of latitude. The original Napoleonic meter was defined as one part in ten million of the distance from the equator to the pole, a span of 90 degrees.
(These definitions depend on the assumption that the earth is spherical. It isn't, quite. If you're a civil engineer these definitions break down. If you use them to design a parking lot, it will have some nasty puddles in it when it rains, and will encroach on the neighbors' property.)
So, what you want is to use a bounding range. Assuming your latitude values a.lat and b.lat are in degrees, two of them are certainly more than fifty statute miles apart unless
a.lat BETWEEN b.lat - 50.0/69.0 AND b.lat + 50.0/69.0
Let's refactor your query. (I don't understand the case stuff about New York so I'm ignoring it. You can add it back.) This will give the IDs of all pairs of places lying within 50 miles of each other. (I'm using the 21st century JOIN syntax here).
select distinct a.id, b.id
from latlon a
JOIN latlon b ON a.id<>b.id
AND a.lat BETWEEN b.lat - 50.0/69.0 AND b.lat + 50.0/69.0
AND a.state = b.state
AND geodist(a.lat,a.lon,b.lat,b.lon,"M") <= 50
Try creating an index on the table on the lat column. That should help performance a LOT.
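A rough sketch of the band-then-exact-check idea, run against SQLite from Python because SAS is not available everywhere. The haversine helper registered as geodist is a stand-in written for this example, not the SAS function of the same name, and the three sample rows are assumptions.

```python
import math
import sqlite3

def geodist_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in statute miles (haversine, spherical earth)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 3956.0 * math.asin(math.sqrt(h))

conn = sqlite3.connect(":memory:")
conn.create_function("geodist", 4, geodist_miles)  # stand-in for SAS geodist()
conn.execute("CREATE TABLE latlon (id INTEGER PRIMARY KEY, state TEXT, lat REAL, lon REAL)")
conn.execute("CREATE INDEX idx_latlon_lat ON latlon (lat)")
conn.executemany("INSERT INTO latlon VALUES (?, ?, ?, ?)", [
    (1, "NY", 40.71, -74.01),   # New York City (sample data, assumed)
    (2, "NY", 40.93, -73.90),   # Yonkers, roughly 16 miles away
    (3, "NY", 42.65, -73.75),   # Albany, well over 100 miles away
])

pairs = conn.execute("""
    SELECT a.id, b.id
    FROM latlon a
    JOIN latlon b
      ON a.id <> b.id
     AND a.state = b.state
     -- cheap latitude band that an index on lat can serve
     AND b.lat BETWEEN a.lat - 50.0/69.0 AND a.lat + 50.0/69.0
     -- exact distance check, evaluated only for rows inside the band
     AND geodist(a.lat, a.lon, b.lat, b.lon) <= 50
    ORDER BY a.id, b.id
""").fetchall()
print(pairs)
```

Albany never reaches the distance function here, because the latitude band has already excluded it.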
Then try creating a compound index on (state, lat, id, lon, value). If you don't get a satisfactory speedup, try those columns in different orders. It's called a covering index because some of its columns (the first two in this case) are used for quick lookups and the rest provide values that would otherwise have to be fetched from the main table.
Your question is phrased ambiguously - I'm interpreting it as "give me all (A, B) city pairs within 50 miles of each other." The NYC special case seems to be for a one-off test - the problem is not to (trivially, in O(n) time) find all cities within 50 miles of NYC.
Rather than computing Great Circle distances for every pair, compare the per-axis (north-south and east-west) offsets first, using simple addition and simple bounding boxes. Given (A, B) city tuples that fall inside a 50-mile bounding box, it is straightforward to prune out the few (in the corners) whose Great Circle (or Euclidean) distance is more than 50 miles.
You didn't show us EXPLAIN output describing the backend optimizer's plan.
You didn't tell us about indexes on the latlon table.
I'm not familiar with the SAS RDBMS. Oracle, MySQL, and others have geospatial extensions to support multi-dimensional indexing. Essentially, they merge high-order coordinate bits, down to low-order coordinate bits, to construct a quadtree index. The technique could prove beneficial to your query.
Your DISTINCT keyword will make a big difference for the query plan. Often it will force a tablescan and a filesort. Consider deleting it.
The equijoin on state seems wrong, but maybe you don't care about the tri-state metropolitan area and similar densely populated regions near state borders.
You definitely want the WHERE clause to prune out b rows that are more than 50 miles from the current a row:
too far north, OR
too far south, OR
too far west, OR
too far east
Each of those conditionals boils down to a simple range query that the RDBMS backend can evaluate and optimize against an index. Unfortunately, if it chooses the latitude index, any longitude index that's on disk will be ignored, and vice versa. Which motivates using your vendor's geospatial support.

Data retrieving by Latitude longitude matching from both tables in mysql

I have two tables, A and B.
Table A has the columns hotelcode_id, latitude, longitude.
Table B has the columns latitude, longitude.
I need to retrieve hotelcode_id for the rows where both the latitude and the longitude match across the two tables.
I wrote the following query, but its performance is still poor:
SELECT a.hotelcode_id, a.latitude,b.latitude,b.longitude,b.longitude
FROM A
JOIN B
ON a.latitude like concat ('%', b.latitude, '%') AND a.longitude like concat ('%', b.longitude, '%')
I also wrote another query, but I could not get accurate data from it.
The query above runs for a very long time and I still have not been able to retrieve the data.
NOTE:
A table has 150k records
B table has 250k records
I have set DECIMAL(10,6) for the latitude and longitude columns in both tables.
I have already tried the following, but query performance is still a problem:
added indexes and checked them with EXPLAIN
hash-partitioned the tables
I think the leading wildcard characters prevent the index from being used.
LIKE queries also perform very poorly in MySQL.
Is there any alternative to wildcards and LIKE in the SELECT query?
If you are sure that the numeric values of the LAT/LON pairs are equal across the two tables, the simple approach would be
SELECT a.hotelcode_id, a.latitude, a.longitude, b.latitude, b.longitude
FROM A JOIN B
WHERE a.latitude = b.latitude
AND a.longitude = b.longitude
If there is some inaccuracy in the data, you may want to define the maximum deviation (here 0.001 degrees, i.e. 3.6 arc seconds) which you would regard as the "same place", e.g.
SELECT a.hotelcode_id, a.latitude, a.longitude, b.latitude, b.longitude
FROM A JOIN B
WHERE ABS(a.latitude-b.latitude) < 0.001
AND ABS(a.longitude-b.longitude) < 0.001
Mind that in the second case the actual distance (in km) covered by the longitude tolerance is not the same at every latitude: the higher the latitude, the shorter the distance.
Also review the sizing of the LON and LAT columns; you know that (usually):
-180 <= LON <= 180
-90 <= LAT <= 90

SQL Cross Apply Performance Issues

My database has a directory of about 2,000 locations scattered throughout the United States with zipcode information (which I have tied to lon/lat coordinates).
I also have a table function which takes two parameters (ZipCode & Miles) to return a list of neighboring zip codes (excluding the same zip code searched)
For each location I am trying to get the neighboring location ids. So if location #4 has three nearby locations, the output should look like:
4 5
4 24
4 137
That is, locations 5, 24, and 137 are within X miles of location 4.
I originally tried to use a cross apply with my function as follows:
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist(A.Sl_Zip,7))) AS Q
WHERE A.SL_StoreNum='04'
However that ran for over 20 minutes with no results so I canceled it. I did try hardcoding in the zipcode and it immediately returned a list
SELECT A.SL_STORENUM,A.Sl_Zip,Q.SL_STORENUM FROM tbl_store_locations AS A
CROSS APPLY (SELECT SL_StoreNum FROM tbl_store_locations WHERE SL_Zip in (select zipnum from udf_GetLongLatDist('12345',7))) AS Q
WHERE A.SL_StoreNum='04'
What is the most efficient way of accomplishing this listing of nearby locations? Keeping in mind while I used "04" as an example here, I want to run the analysis for 2,000 locations.
The "udf_GetLongLatDist" is a function which uses some math to calculate distance between two geographic coordinates and returns a list of zipcodes with a distance of > 0. Nothing fancy within it.
When you use the function you probably have to calculate every single possible distance for each row. That is why it takes so long. Since the actual physical locations don't generally move, what we always did was precalculate the distance from each zipcode to every other zipcode (and update only once a month or so when we added new possible zipcodes). Once the distances are precalculated, all you have to do is run a query like
select zip2 from zipprecalc where zip1 = '12345' and distance <=10
We have something similar and optimized it by only calculating the distance of other zipcodes whose latitude is within a bounded range. So if you want other zips within #miles, you use a
where latitude >= #targetLat - (#miles/69.2) and latitude <= #targetLat + (#miles/69.2)
Then you are only calculating the great circle distance of a much smaller subset of other zip code rows. We found this fast enough in our use to not require precalculating.
The same fixed conversion can't be used for longitude, because the distance that a degree of longitude represents varies between the equator and the poles.
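A longitude bound is still possible if the conversion factor is scaled by the cosine of the latitude, which is what the rectangle calculation in a later thread on this page does. The sketch below is one way to compute such a box; the function name, the pole handling and the choice to widen the box using its poleward edge are assumptions for illustration, and it ignores wrap-around at the 180° meridian.

```python
import math

MILES_PER_DEGREE_LAT = 69.2  # same constant as the latitude filter above

def bounding_box(lat, lon, miles):
    """Approximate (min_lat, max_lat, min_lon, max_lon) around a point.

    Assumes a spherical earth and ignores wrap-around at the 180° meridian.
    """
    dlat = miles / MILES_PER_DEGREE_LAT
    min_lat, max_lat = lat - dlat, lat + dlat
    if max_lat >= 90.0 or min_lat <= -90.0:
        # The circle reaches a pole, so every longitude can qualify.
        return max(min_lat, -90.0), min(max_lat, 90.0), -180.0, 180.0
    # A degree of longitude is shortest at the edge nearest a pole; using that
    # edge makes the box slightly too wide rather than too narrow.
    poleward = max(abs(min_lat), abs(max_lat))
    dlon = miles / (MILES_PER_DEGREE_LAT * math.cos(math.radians(poleward)))
    return min_lat, max_lat, lon - dlon, lon + dlon

print(bounding_box(60.0, 10.0, 69.2))
```

The four values can be plugged into two BETWEEN conditions, with the great circle distance computed only for rows inside the box.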
Other answers here involve re-working the algorithm. I personally advise the pre-calculated map of all zipcodes against each other. It should be possible to embed such optimisations in your existing udf, to minimise code-changes.
A refactoring of the query, however, could be as follows...
SELECT
A.SL_STORENUM, A.Sl_Zip, C.SL_STORENUM
FROM
tbl_store_locations AS A
CROSS APPLY
dbo.udf_GetLongLatDist(A.Sl_Zip,7) AS B
INNER JOIN
tbl_store_locations AS C
ON C.SL_Zip = B.zipnum
WHERE
A.SL_StoreNum='04'
Also, the performance of the CROSS APPLY will benefit greatly if you can ensure that the udf is INLINE rather than MULTI-STATEMENT. This allows the udf to be expanded inline (macro like) for a much cleaner execution plan.
Doing so would also allow you to return additional fields from the udf. The optimiser can then include or exclude those fields from the plan depending on whether you actually use them. Such an example would be to include the SL_StoreNum if it's easily accessible from the query in the udf, and so remove the need for the last join...

Distance between two coordinates, how can I simplify this and/or use a different technique?

I need to write a query which allows me to find all locations within a range (Miles) from a provided location.
The table is like this:
id | name | lat | lng
So I have been doing research and found: this my sql presentation
I have tested it on a table with around 100 rows and will have plenty more! - Must be scalable.
I tried something more simple like this first:
//just some test data this would be required by user input
set #orig_lat=55.857807; set #orig_lng=-4.242511; set #dist=10;
SELECT *, 3956 * 2 * ASIN(
SQRT( POWER(SIN((orig.lat - abs(dest.lat)) * pi()/180 / 2), 2)
+ COS(orig.lat * pi()/180 ) * COS(abs(dest.lat) * pi()/180)
* POWER(SIN((orig.lng - dest.lng) * pi()/180 / 2), 2) ))
AS distance
FROM locations dest, locations orig
WHERE orig.id = '1'
HAVING distance < 1
ORDER BY distance;
This returned rows in around 50ms which is pretty good!
However this would slow down dramatically as the rows increase.
EXPLAIN shows it's only using the PRIMARY key which is obvious.
Then after reading the article linked above. I tried something like this:
// defining variables - this when made into a stored procedure will call
// the values with a SELECT query.
set #mylon = -4.242511;
set #mylat = 55.857807;
set #dist = 0.5;
-- calculate lon and lat for the rectangle:
set #lon1 = #mylon-#dist/abs(cos(radians(#mylat))*69);
set #lon2 = #mylon+#dist/abs(cos(radians(#mylat))*69);
set #lat1 = #mylat-(#dist/69);
set #lat2 = #mylat+(#dist/69);
-- run the query:
SELECT *, 3956 * 2 * ASIN(
SQRT( POWER(SIN((#mylat - abs(dest.lat)) * pi()/180 / 2) ,2)
+ COS(#mylat * pi()/180 ) * COS(abs(dest.lat) * pi()/180)
* POWER(SIN((#mylon - dest.lng) * pi()/180 / 2), 2) ))
AS distance
FROM locations dest
WHERE dest.lng BETWEEN #lon1 AND #lon2
AND dest.lat BETWEEN #lat1 AND #lat2
HAVING distance < #dist
ORDER BY distance;
The time of this query is around 240ms. This is not too bad, but it is slower than the last one. I can imagine that with a much higher number of rows this would work out faster. However, an EXPLAIN shows the possible keys as lat, lng or PRIMARY, and the key used as PRIMARY.
How can I do this better???
I know I could store the lat lng as a POINT(); but I also haven't found too much documentation on this which shows if it's faster or accurate?
Any other ideas would be happily accepted!
Thanks very much!
-Stefan
UPDATE:
As Jonathan Leffler pointed out I had made a few mistakes which I hadn't noticed:
I had only put abs() on one of the lat values. I was also using an id search in the WHERE clause of the second query, when there was no need. The first query was purely experimental; the second one is more likely to hit production.
After these changes EXPLAIN shows the key is now using lng column and average time to respond around 180ms now which is an improvement.
Any other ideas would be happily accepted!
If you want speed (and simplicity) you'll want some decent geospatial support from your database. This introduces geospatial datatypes, geospatial indexes and (a lot of) functions for processing / building / analyzing geospatial data.
MySQL implements part of the OpenGIS specification, although it is (or was, the last time I checked) very rough around the edges and immature (not useful for any real work).
PostGIS on PostgreSQL would make this trivially easy and readable:
(this finds all points in tableb which are closer than 1000 meters to the point in tablea with id 123)
select
myvalue
from
tablea, tableb
where
st_dwithin(tablea.the_geom, tableb.the_geom, 1000)
and
tablea.id = 123
The first query ignores the parameters you set - using 1 instead of #dist for the distance, and using the table alias orig instead of the parameters #orig_lat and #orig_lon.
You then have the query doing a Cartesian product between the table and itself, which is seldom a good idea if you can avoid it. You get away with it because of the filter condition orig.id = 1, which means that there's only one row from orig joined with each of the rows in dest (including the point with dest.id = 1; you should probably have a condition AND orig.id != dest.id). You also have a HAVING clause but no GROUP BY clause, which is indicative of problems. The HAVING clause is not relating any aggregates, but a HAVING clause is (primarily) for comparing aggregate values.
Unless my memory is failing me, COS(ABS(x)) === COS(x), so you might be able to simplify things by dropping the ABS(). Failing that, it is not clear why one latitude needs the ABS and the other does not - symmetry is crucial in matters of spherical trigonometry.
You have a dose of the magic numbers - the value 69 is presumably number of miles in a degree (of longitude, at the equator), and 3956 is the radius of the earth.
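To make those constants explicit, here is a small Python sketch of the same haversine formula with named values. The function and constant names are mine, and a spherical earth of radius 3956 miles is assumed, as in the question. No ABS() is applied anywhere, so points in the southern hemisphere are handled symmetrically.

```python
import math

EARTH_RADIUS_MILES = 3956.0  # the "3956" in the query
# Length of one degree of a great circle; this is where "69" comes from.
MILES_PER_DEGREE = 2 * math.pi * EARTH_RADIUS_MILES / 360.0

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    half_dphi = (phi2 - phi1) / 2
    half_dlmb = math.radians(lng2 - lng1) / 2
    h = (math.sin(half_dphi) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(half_dlmb) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

print(round(MILES_PER_DEGREE, 2))
print(haversine_miles(55.857807, -4.242511, 55.95, -3.19))
```

Swapping the two points gives the same result, which is the symmetry the ABS() calls in the original query break.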
I'm suspicious of the box calculated if the given position is close to a pole. In the extreme case, you might need to allow any longitude at all.
The condition dest.id = 1 in the second query is odd; I believe it should be omitted, but its presence should speed things up, because only one row matches that condition. So the extra time taken is puzzling. But using the primary key index is appropriate as written.
You should move the condition in the HAVING clause into the WHERE clause.
But I'm not sure this is really helping...
The NGS Online Inverse Geodesic Calculator is the traditional reference means to calculate the distance between any two locations on the earth ellipsoid:
http://www.ngs.noaa.gov/cgi-bin/Inv_Fwd/inverse2.prl
But the above calculator is still problematic. Especially between two near-antipodal locations, the computed distance can show an error of some tens of kilometres! The origin of the numeric trouble was identified a long time ago by Thaddeus Vincenty (page 92):
http://www.ngs.noaa.gov/PUBS_LIB/inverse.pdf
In any case, it is preferable to use the reliable and very accurate online calculator by Charles Karney:
http://geographiclib.sourceforge.net/cgi-bin/Geod
Some thoughts on improving performance. It wouldn't simplify things from a maintainability standpoint (makes things more complex), but it could help with scalability.
Since you know the radius, you can add conditions for the bounding box, which may allow the db to optimize the query to eliminate some rows without having to do the trig calcs.
You could pre-calculate some of the trig values of the lat/lon of stored locations and store them in the table. This would shift some of the performance cost when inserting the record, but if queries outnumber inserts, this would be good. See this answer for an idea of this approach:
Query to get records based on Radius in SQLite?
You could look at something like geohashing.
When used in a database, the structure of geohashed data has two advantages. ... Second, this index structure can be used for a quick-and-dirty proximity search - the closest points are often among the closest geohashes.
You could search SO for some ideas on how to implement:
https://stackoverflow.com/search?q=geohash
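For orientation, a minimal sketch of the standard geohash encoding in Python: longitude and latitude are bisected alternately, and every five bits become one base-32 character. The function name and default precision are my own choices; this is not taken from any of the linked answers.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair (degrees) as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, 5))
```

With the hash stored in an indexed text column, a prefix condition such as LIKE 'u4pr%' selects one cell with an ordinary B-tree range scan. Points just across a cell border have different prefixes, so the neighbouring cells usually need to be searched as well.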
If you're only interested in rather small distances, you can approximate the geographical grid by a rectangular grid.
SELECT *, SQRT(POWER(RADIANS(#mylat - dest.lat), 2) +
POWER(RADIANS(#mylon - dest.lng)*COS(RADIANS(#mylat)), 2)
)*#radiusOfEarth AS approximateDistance
…
You could make this even more efficient by storing radians instead of (or in addition to) degrees in your database. If your queries may cross the 180° meridian, some extra care would be necessary there, but many applications don't have to deal with those locations. You could also try to change POWER(x, 2) to x*x, which might get computed faster.
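The same rectangular-grid approximation written out in Python, as a sketch to show what the SQL expression computes. The function name is mine, and the default radius of 3956 miles is borrowed from the question; it is an assumption that the result is wanted in miles.

```python
import math

def approx_distance(lat1, lon1, lat2, lon2, radius=3956.0):
    """Equirectangular approximation; only suitable for small distances."""
    dlat = math.radians(lat2 - lat1)
    # East-west offsets shrink with the cosine of the latitude.
    dlon = math.radians(lon2 - lon1) * math.cos(math.radians(lat1))
    return math.sqrt(dlat * dlat + dlon * dlon) * radius

print(approx_distance(55.857807, -4.242511, 55.86, -4.25))
```

It uses x*x rather than a power function and a single cosine per call, which is the source of the saving over the full haversine formula.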

Optimizing Sqlite query for INDEX

I have a table of 320000 rows which contains lat/lon coordinate points. When a user selects a location my program gets the coordinates from the selected location and executes a query which brings all the points from the table that are near. This is done by calculating the distance between the selected point and each coordinate point from my table row. This is the query I use:
select street from locations
where ( ( (lat - (-34.594804)) *(lat - (-34.594804)) ) + ((lon - (-58.377676 ))*(lon - (-58.377676 ))) <= ((0.00124)*(0.00124)))
group by street;
As you can see the WHERE clause is a simple Pythagoras formula to calculate the distance between two points.
Now my problem is that I can not get an INDEX to be usable. I've tried with
CREATE INDEX indx ON locations(lat,lon)
also with
CREATE INDEX indx ON locations(street,lat,lon)
with no luck. I've noticed that when there is a math operation on lat or lon, the index is not used. Is there any way I can optimize this query to use an INDEX and gain speed?
Thanks in advance!
The problem is that the SQL engine needs to evaluate the expression for every record to do the comparison (WHERE ..... <= ...) and filter the points, so the indexes don't speed up the query.
One approach to solving the problem is to compute a minimum and maximum latitude and longitude to restrict the number of records.
Here is a good link to follow: Finding Points Within a Distance of a Latitude/Longitude
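A sketch of that min/max restriction against the query from the question, run through Python's sqlite3 module. The index name and the three sample rows are assumptions; the coordinates and radius are the ones from the question. The plain BETWEEN conditions give SQLite something it can look up in the index, and the squared-distance test then trims the corners of the square.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locations (id INTEGER PRIMARY KEY, street TEXT, lat REAL, lon REAL)")
conn.execute("CREATE INDEX idx_lat_lon ON locations (lat, lon)")
conn.executemany("INSERT INTO locations (street, lat, lon) VALUES (?, ?, ?)", [
    ("Near St", -34.5950, -58.3780),    # inside the circle (sample data)
    ("Corner St", -34.5937, -58.3766),  # inside the square, outside the circle
    ("Far Ave", -34.7000, -58.5000),    # outside the square
])

QUERY = """
    SELECT street FROM locations
    WHERE lat BETWEEN :lat - :r AND :lat + :r
      AND lon BETWEEN :lon - :r AND :lon + :r
      AND (lat - :lat) * (lat - :lat) + (lon - :lon) * (lon - :lon) <= :r * :r
    GROUP BY street
"""
params = {"lat": -34.594804, "lon": -58.377676, "r": 0.00124}

streets = [row[0] for row in conn.execute(QUERY, params)]
# The last column of EXPLAIN QUERY PLAN holds the human-readable plan step.
plan = " ".join(str(row[-1]) for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, params))
print(streets)
print(plan)
```

Printing the plan is the quickest way to confirm on your own data that the index is actually being picked.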
Did you try adjusting the page size? A table like this might gain from having a different (i.e. the largest?) available page size.
PRAGMA page_size = 32768;
Or any power of 2 between 512 and 32768. If you change the page_size, don't forget to vacuum the database (assuming you are using SQLite 3.5.8 or later; otherwise you can't change it and will need to start with a fresh database).
Also, running the operation on floats might not be as fast as running it on integers (big maybe), so you might gain speed if you store all your coordinates multiplied by 1 000 000.
Finally, Euclidean distance on raw degrees will not yield very accurate proximity results. The further you get from the equator, the more the circle around your point will flatten to resemble an ellipse. There are fast approximations which are not as calculation-intensive as a Great Circle Distance calculation (avoid at all cost!).
You should search in a square instead of a circle. Then you will be able to optimize.
Surely you have a primary key in locations? Probably called id?
Why not just select the id along with the street?
select id, street from locations
where ( ( (lat - (-34.594804)) *(lat - (-34.594804)) ) + ((lon - (-58.377676 ))*(lon - (-58.377676 ))) <= ((0.00124)*(0.00124)))
group by street;