Most efficient way to query for a lat-long rectangle in SQL

I'm constantly querying for blocks of land within a given latitude/longitude rectangle. The coordinates are stored as individual double precision values. I've created a single index covering both columns, and the current query, which returns 15,240 tiles, takes 0.10 seconds on my local machine.
At the moment there are 23 million rows in the table, but there will be around 800 million once the table is complete, so I expect this query to get much slower.
Here's the query I'm running, with example values:
SELECT * FROM territories
WHERE nwlat < 47.606977 and nwlat > 47.506977
and nwlng < -122.232991 and nwlng > -122.338991;
Is there a more efficient way of doing this? I'm fairly new to large databases, so any help is appreciated. FYI, I'm using PostgreSQL.

It would be much more efficient with a GiST or an SP-GiST index and a "box-contains-points" query ...
GiST index
The index is on a box with zero area, built from the same point (point(nwlat, nwlng)) twice.
There is a related code example in the manual for CREATE INDEX.
CREATE INDEX territories_box_gist_idx ON territories
USING gist (box(point(nwlat, nwlng), point(nwlat, nwlng)));
Query with the "overlaps" operator &&:
SELECT *
FROM territories
WHERE box(point(nwlat, nwlng), point(nwlat, nwlng))
&& '(47.606977, -122.232991), (47.506977, -122.338991)'::box;
SP-GiST index
Smaller index on just points:
CREATE INDEX territories_box_spgist_idx ON territories
USING spgist (point(nwlat, nwlng));
Query with the contains operator @>:
SELECT *
FROM territories
WHERE '(47.606977, -122.232991), (47.506977, -122.338991)'::box
@> point(nwlat, nwlng);
I get the fastest results with the SP-GiST index in a simple test with 1M rows on Postgres 9.6.1.
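A minimal sketch of such a test, assuming randomly generated points (the table and index match the definitions above):
CREATE TABLE territories AS
SELECT random() * 90   AS nwlat
     , random() * -180 AS nwlng
FROM generate_series(1, 1000000);

CREATE INDEX territories_box_spgist_idx ON territories
USING spgist (point(nwlat, nwlng));

EXPLAIN ANALYZE
SELECT *
FROM territories
WHERE '(47.606977, -122.232991), (47.506977, -122.338991)'::box
@> point(nwlat, nwlng);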
For more sophisticated needs consider the PostGIS extension.

Related

Why does selecting more result fields double the data scanned in BigQuery

I have a table with 2 integer fields x, y and a few million rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 32.4 MB when run."
Selecting both fields scans more than double the data.
I would expect it to first find the relevant rows based on the where clause and then fetch the extra field without scanning the entire second column.
Any input on why it doubles the data scanned, and how to avoid it, would be appreciated.
In my application I have hundreds of possible fields which I need to fetch for a very small number of rows (50) that answer the query.
Does this mean I will need to process the data of all those fields?
* I'm aware of how a columnar database works, but I wasn't aware of the huge price of bringing back lots of fields based on a very selective where clause.
The following link provides a very clear answer:
best-practices-performance-input
BigQuery has no concept of indexes or anything like that. When you query a column, BigQuery scans through all the values of that column and then performs the operations you want (for a deeper understanding, they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all values of x and y and then find the rows where x = 1.
This ends up being an amazing feature of BQ: you just load your data there and it just works. But it does force you to be aware of how much data each query retrieves. Queries of the type select * from table should be used only if you really need all columns.
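To make the pricing model concrete with the numbers from the question, bytes processed scale with the columns referenced anywhere in the query, not with the 50 rows returned:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
reads column x only (the reported 32.4 MB), while
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
reads both columns x and y (the reported 64.9 MB).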

PostgreSQL, find strings differ by n characters

Suppose I have a table like this
id  data
1   0001
2   1000
3   2010
4   0120
5   0020
6   0002
sql fiddle demo
id is the primary key; data is a fixed-length string whose characters can be 0, 1, or 2.
Is there a way to build an index so I can quickly find strings that differ by n characters from a given string? For example, for the string 0001 and n = 1, I want to get row 6.
Thanks.
There is the levenshtein() function, provided by the additional module fuzzystrmatch. It does exactly what you are asking for:
SELECT *
FROM a
WHERE levenshtein(data, '1110') = 1;
SQL Fiddle.
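If the module is not installed yet, enabling it is a one-liner (Postgres 9.1 or later):
CREATE EXTENSION fuzzystrmatch;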
But it is not very fast: it's slow with big tables, because it can't use an index.
You might get somewhere with the similarity or distance operators provided by the additional module pg_trgm. Those can use a trigram index, as detailed in the linked manual pages. I did not get anywhere with them, though; the module uses a different definition of "similarity".
Generally, the problem fits the KNN ("k nearest neighbours") search pattern.
If your case is as simple as the example in the question, you can use LIKE in combination with a trigram GIN index (a setup sketch follows the query), which should be reasonably fast with big tables:
SELECT *
FROM a
WHERE data <> '1110'
AND (data LIKE '_110' OR
data LIKE '1_10' OR
data LIKE '11_0' OR
data LIKE '111_');
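A sketch of the supporting setup, using the table and column from the fiddle (the index name is made up):
CREATE EXTENSION pg_trgm;
CREATE INDEX a_data_trgm_idx ON a USING gin (data gin_trgm_ops);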
Obviously, this technique quickly becomes unfeasible with longer strings and more than one difference.
However, since the string is so short, any query will match a rather big percentage of the base table. Therefore, index support will hardly buy you anything; most of the time it is faster for Postgres to scan sequentially.
I tested with 10k and 100k rows, with and without a trigram GIN index. Since ~19% of rows match the criteria for the given test case, a sequential scan is faster and levenshtein() still wins. For more selective queries matching less than around 5% of the rows (it depends), a query using an index is (much) faster.

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem. In fact the CPU and memory cost of the sort is close to zero, since Postgres has a Top-N sort: the result set is scanned while maintaining a small sort buffer that holds only the top N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see, the Top-10 sort only adds about 10 ms to a dumb fast count(*), versus a lot longer for a real sort. That's a very neat feature; I use it a lot.
OK, now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using:
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we have the worst-case CROSS JOIN condition ever:
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means every row of A is matched against every row of B (so this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and CPU-intensive functions.
Of course it's horribly slow!
When you remove the ORDER BY, Postgres just comes up (by chance?) with a bunch of matching rows right at the start, outputs those, and stops once the LIMIT is reached.
Here's a little example:
Tables a and b are identical; each contains 1000 rows and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here, 1,000,000 box-overlap (operator &&) tests are completed in 0.28 s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and the speed is ridiculous.
You need to optimize that geometry matching.
Add columns which will cache:
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date (a sketch follows this list).
Add columns of type BOX which will cache:
the bounding box of ST_Centroid(A.wkb_geometry)
the bounding box of ST_Buffer((B.wkb_geometry), 1000)
Add GiST indexes on the BOXes.
Add a box-overlap test (using the && operator) which will use the index.
Keep your ST_Within, which will act as a final filter on the rows that pass.
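A minimal sketch of such a caching trigger; the column, function, and trigger names here are made up, and it assumes PostGIS geometry columns as in the question:
ALTER TABLE global_site ADD COLUMN centroid_geom geometry;

CREATE OR REPLACE FUNCTION cache_centroid() RETURNS trigger AS $$
BEGIN
    -- recompute the cached centroid whenever the row's geometry is written
    NEW.centroid_geom := ST_Centroid(NEW.wkb_geometry);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER global_site_cache_centroid
BEFORE INSERT OR UPDATE ON global_site
FOR EACH ROW EXECUTE PROCEDURE cache_centroid();
The query then compares the precomputed centroid_geom instead of calling ST_Centroid per row; the same pattern applies to the buffered geometry on B.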
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator; see here:
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing during off-peak hours, though, because it can be somewhat intensive with a large amount of data. One thing you will also have to look at with indexes is rebuilding them on a schedule to ensure performance over time. Again, that schedule should fall outside peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off I would look at creating indexes, making sure your DB is being vacuumed, and increasing the shared buffers and work_mem settings for your install.
The first thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know PostgreSQL that well, but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating expression-based indexes for the benefit of the hairier joins and WHERE conditions?
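For instance, a sketch of an expression index over the centroid (the index name is made up; this relies on ST_Centroid being immutable, which it is in PostGIS):
CREATE INDEX global_site_centroid_gist_idx ON global_site
USING gist (ST_Centroid(wkb_geometry));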

Optimizing Sqlite query for INDEX

I have a table of 320,000 rows which contains lat/lon coordinate points. When a user selects a location, my program gets its coordinates and executes a query which returns all the points from the table that are nearby. This is done by calculating the distance between the selected point and each coordinate point in the table. This is the query I use:
select street from locations
where ( ( (lat - (-34.594804)) *(lat - (-34.594804)) ) + ((lon - (-58.377676 ))*(lon - (-58.377676 ))) <= ((0.00124)*(0.00124)))
group by street;
As you can see, the WHERE clause is a simple Pythagorean formula to calculate the (squared) distance between two points.
Now my problem is that I cannot get an INDEX to be used. I've tried with
CREATE INDEX indx ON locations(lat, lon)
also with
CREATE INDEX indx ON locations(street, lat, lon)
with no luck. I've noticed that when there is a math operation on lat or lon, the index is not used. Is there any way I can optimize this query to use an INDEX so as to gain speed?
Thanks in advance!
The problem is that the SQL engine needs to evaluate every record to do the comparison (WHERE ... <= ...) and filter the points, so the indexes don't speed up the query.
One approach to solving the problem is to compute a minimum and maximum latitude and longitude that restrict the number of records; a sketch follows the link below.
Here is a good link to follow: Finding Points Within a Distance of a Latitude/Longitude
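A sketch of that approach with the numbers from the question, where the 0.00124 radius doubles as the half-width of the bounding box:
SELECT street FROM locations
-- the range conditions below can be satisfied by the (lat, lon) index;
-- the exact distance test then runs only on the few surviving rows
WHERE lat BETWEEN -34.594804 - 0.00124 AND -34.594804 + 0.00124
AND lon BETWEEN -58.377676 - 0.00124 AND -58.377676 + 0.00124
AND ((lat - (-34.594804)) * (lat - (-34.594804))
   + (lon - (-58.377676)) * (lon - (-58.377676))) <= 0.00124 * 0.00124
GROUP BY street;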
Did you try adjusting the page size? A table like this might gain from having a different (i.e. the largest?) available page size.
PRAGMA page_size = 32768;
Or any power of two between 512 and 32768. If you change the page_size, don't forget to VACUUM the database (this assumes SQLite 3.5.8 or later; otherwise you can't change it and will need to start a fresh database).
Also, running the operation on floats might not be as fast as running it on integers (a big maybe), so you might gain speed if you store all your coordinates multiplied by 1,000,000.
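A sketch of that integer variant, with hypothetical lat_e6/lon_e6 columns holding the coordinates times 1,000,000 and a square pre-filter as above (0.00124 degrees becomes 1240):
SELECT street FROM locations
WHERE lat_e6 BETWEEN -34594804 - 1240 AND -34594804 + 1240
AND lon_e6 BETWEEN -58377676 - 1240 AND -58377676 + 1240
GROUP BY street;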
Finally, Euclidean distance will not yield very accurate proximity results. The further you get from the equator, the more the circle around your point flattens into an ellipse. There are fast approximations which are not as calculation-intensive as a great-circle distance calculation (avoid that at all costs!).
You should search in a square instead of a circle. Then you will be able to optimize.
Surely you have a primary key in locations? Probably called id?
Why not just select the id along with the street?
select id, street from locations
where ( ( (lat - (-34.594804)) *(lat - (-34.594804)) ) + ((lon - (-58.377676 ))*(lon - (-58.377676 ))) <= ((0.00124)*(0.00124)))
group by street;

Any idea why contains(...) querys so slow in SQL Server 2005

I've got a simple select query which normally executes in under 1 second, but when I add a contains(column, 'text') to the where clause, it suddenly runs for anywhere from 20 seconds up to a minute. The table it selects from has around 208k rows.
Any ideas what would cause this query to run so slow with just the addition of the contains clause?
Substring matching is a computationally expensive operation. Is the field indexed? If this is a major feature, consider a search-caching table so you can simply look up where the words exist.
Depending on the search keyword and the median length of the values in the column, it is logical that it would take a long time.
Consider searching for 'cookie' in a column with median length 100 characters in a dataset of 200k rows.
Best-case scenario, with early outs, you would do 100 * 200k = 20m comparisons.
Worst-case scenario, near-missing on every compare (matching several characters at each position before failing), you would do (5 * 100) * 200k = 100m comparisons.
Generally I would:
reorder your query to filter out as much as possible prior to the string matching
limit the number of results if you don't need all of them at once (TOP x)
reduce the number of characters in your search term
reduce the number of search terms by filtering out terms that are likely to match a lot, or not at all (if applicable)
cache query results if possible (though cache invalidation can get pretty tricky if you want to do it right)
Try this:
SELECT *
FROM table
WHERE CONTAINS((column1, column2, column3), '"*keyword*"')
Instead of this:
SELECT *
FROM table
WHERE CONTAINS(column1, '"*keyword*"')
OR CONTAINS(column2, '"*keyword*"')
OR CONTAINS(column3, '"*keyword*"')
The first one is a lot faster.
CONTAINS does a lot of extra work. There are a few things to note here:
NVarChar is always faster, so do CONTAINS(column, N'text')
If all you want to do is see whether the word is in there, compare the performance to column LIKE '%' + text + '%'.
Compare the query plans before and after: did it go to a table scan? If so, post more details so we can figure out why.
Finally, you can break the text up into its individual words in a separate table so they can be indexed; a sketch follows.
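A minimal sketch of that word table (all names here are made up; populating it at write time is left to the application or a trigger):
CREATE TABLE doc_words (
    word   NVARCHAR(100) NOT NULL,
    doc_id INT NOT NULL
);
CREATE INDEX IX_doc_words_word ON doc_words (word);

-- finding the rows that contain a word then becomes an index seek:
SELECT doc_id FROM doc_words WHERE word = N'keyword';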