I have a MySQL SELECT query that calculates a distance using Pythagoras in the WHERE clause to restrict results to a certain radius.
I also use the exact same calculation in the ORDER BY clause to sort the results by smallest distance first.
Does MySQL calculate the distance twice (once for the WHERE, and again for the ORDER BY)?
If it does, how can I optimize the query so it is only calculated once (if possible at all)?
Does MySQL calculate the distance twice (once for the WHERE, and again for the ORDER BY)?
No, the calculation will not be performed twice if it is written in exactly the same way. However, if your aim is to improve the performance of your application, you might want to look at the bigger picture rather than concentrating on this minor detail, which can give you at most a factor-of-two difference. A more serious problem is that your query prevents efficient use of indexes and will result in a full table scan.
I would recommend that you change your database so that you use the geometry type and create a spatial index on your data. Then you can use MBRWithin to quickly find the points that lie inside the bounding box of your circle. Once you have found those points you can run your more expensive distance test on those points only. This approach will be significantly faster if your table is large and a typical search returns only a small fraction of the rows.
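As a rough sketch of that approach (the table name places, the POINT column pt, and the centre (15, 55) / radius 5 are all made-up values; depending on the MySQL version the column may also need an SRID declared for the spatial index to be used):

ALTER TABLE places ADD COLUMN pt POINT NOT NULL;
CREATE SPATIAL INDEX idx_places_pt ON places (pt);

-- The bounding-box test is answered from the spatial index;
-- the exact distance check then only touches the surviving rows.
SELECT p.*
FROM places p
WHERE MBRWithin(p.pt, ST_GeomFromText('POLYGON((10 50, 20 50, 20 60, 10 60, 10 50))'))
  AND SQRT(POW(ST_X(p.pt) - 15, 2) + POW(ST_Y(p.pt) - 55, 2)) <= 5
ORDER BY SQRT(POW(ST_X(p.pt) - 15, 2) + POW(ST_Y(p.pt) - 55, 2));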
If you can't change the data model then you can still improve the performance by using a bounding box check first, for example WHERE x BETWEEN 10 AND 20 AND y BETWEEN 50 AND 60. The bounding box check will be able to use an index, but because R-Tree indexes are only supported on the geometry type you will have to use the standard B-Tree index which is not as efficient for this type of query (but still much better than what you are currently doing).
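A sketch of that fallback, assuming plain numeric columns x and y and the same made-up centre and radius as above (the B-Tree index can only seek on its leading column for the range, but it still cuts the candidate set down considerably):

CREATE INDEX idx_places_xy ON places (x, y);

SELECT *
FROM places
WHERE x BETWEEN 10 AND 20
  AND y BETWEEN 50 AND 60
  AND SQRT(POW(x - 15, 2) + POW(y - 55, 2)) <= 5
ORDER BY SQRT(POW(x - 15, 2) + POW(y - 55, 2));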
You could select the distance once, filter on it in the HAVING clause, and reuse it in the ORDER BY clause; then the calculation is certainly only done once. I would guess that this is slower, though, because it has to work with more data. The calculation itself is not that expensive.
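For completeness, a sketch of that version (MySQL lets HAVING and ORDER BY refer to a select alias; the table and column names here are made up):

SELECT id, x, y,
       SQRT(POW(x - 15, 2) + POW(y - 55, 2)) AS dist
FROM places
HAVING dist <= 5
ORDER BY dist;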
I have a postgres table ("dist_mx") that indicates the distances between two points (geographic space). The points are defined in the "hex_0" and "hex_1" columns. The table will eventually be 10^7 to 10^8 rows. The table is structured as such:
One of the purposes of this table is to query the shortest distance from a list of points (1000s) to the points that correspond to locations of interest. For example, I want to know the shortest distance from each point to a grocery store (we know how each grocery store corresponds to point ids).
I'm using a UNION statement to run the query. The OR condition is used because the order of the points is arbitrary (i.e., pairs aren't repeated in reverse order). See below:
SELECT MIN(distances) FROM dist_mx
WHERE (point_id_0 = '8829abb139fffff' AND point_id_1 IN ('8829abb555fffff', ...))
   OR (point_id_1 = '8829abb139fffff' AND point_id_0 IN ('8829abb555fffff', ...))
UNION
SELECT MIN(distances) FROM dist_mx
WHERE (point_id_0 = '8829abb469fffff' AND point_id_1 IN ('8829abb555fffff', ...))
   OR (point_id_1 = '8829abb469fffff' AND point_id_0 IN ('8829abb555fffff', ...))
...
The query seems to be working as intended but it is slow. It takes 20 minutes for the query to run on a list of ~4500 points. I have tried chunking the query so I only include 500 queries at a time (i.e., connected by the UNION statement), but this does not significantly change performance.
I'm relatively new to Postgres, so I am hoping there is a fairly simple speedup (or even a not-so-simple one)?
EDIT:
adding schema
Without seeing an EXPLAIN ANALYZE for your query, and also the whole query, I can't give specific advice. There's also probably a better way to write your query, but it's unclear exactly what you're doing.
Here's some general advice.
The basic performance tool is indexes. Without indexes, Postgres must scan the whole table, probably repeatedly. See Use The Index, Luke for more.
A multi-column index on (point_id_0, point_id_1) will allow Postgres to quickly find the matching rows without having to scan the whole table.
create index dist_mx_points_idx on dist_mx (point_id_0, point_id_1);
That should help significantly.
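Because the WHERE clause matches the two id columns in either order (the OR branches), a second index with the columns reversed may also be worth trying, so both branches can be answered from an index; this is a guess without seeing the actual plan:

create index dist_mx_points_rev_idx on dist_mx (point_id_1, point_id_0);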
One of the purposes of this table is to query the shortest distance from a list of points (1000s) to the points that correspond to locations of interest. For example, I want to know the shortest distance from each point to a grocery store (we know how each grocery store corresponds to point ids).
Use PostGIS.
Other notes.
Don't store hex as a string; store it as a bigint and convert (see the sketch after these notes). This takes less space and is faster.
Don't store numbers as text, use an integer.
Don't store your points as two columns, use a single point column. Then you can use geometric operators. However, these are 2D calculations and only accurate for GIS over short distances.
Since you're doing GIS, don't do this by hand. Use PostGIS.
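As a sketch of the bigint point above (the column names are from the question; the conversion expression is a common Postgres idiom for hex strings of up to 16 digits):

ALTER TABLE dist_mx
  ADD COLUMN point_id_0_num bigint,
  ADD COLUMN point_id_1_num bigint;

-- Interpret each 15-character hex id as a 64-bit integer.
UPDATE dist_mx
SET point_id_0_num = ('x' || lpad(point_id_0, 16, '0'))::bit(64)::bigint,
    point_id_1_num = ('x' || lpad(point_id_1, 16, '0'))::bit(64)::bigint;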
I'm currently taking an SQL course and trying to understand efficiency of queries.
Given this query, what's the efficiency of it:
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age)
FROM Customers)
What I'm trying to understand is whether the subquery runs once at the beginning, making the query O(n + n)?
Or does the subquery run every time you go through a customer's age, which makes it O(n^2)?
Thank you!
If you want to understand how the query optimizer interprets a query, you have to review the execution/explain plan, which almost every RDBMS makes available.
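For example, in MySQL or PostgreSQL you can simply prefix the statement with EXPLAIN (other engines expose the same information through their own tools, such as the graphical plan in SQL Server Management Studio):

EXPLAIN
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age)
             FROM Customers);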
As noted in the comments you tell the RDBMS what you want, not how to get it.
Very often it helps to have a deeper understanding of the particular database engine being used in order to write a query in the most performant way, ie, to be able to think like the query processor.
Like any language, there's more than one way to skin a cat, so to speak, and with SQL there is usually more than one way to write a query that results in the same output - very often many ways, depending on the complexity.
How a query execution plan gets built and executed is determined by the query optimizer at compile time and depends on many factors that vary by RDBMS, such as data cardinality, table size, row size, estimated number of rows, sargability, indexes, available resources, current load, concurrency, and isolation level, just to name a few.
It often helps to write queries in the most performant way by thinking what you would have to do to accomplish the same task.
In your example, you are looking for all the rows in a table where a particular value equals another value. You have chosen to find that value by first looking for the minimum age - you would only have to do this once as it's a single scalar value, so it's reasonable to assume (but not guaranteed) the database engine would do the same.
You could also approach the problem by aggregating and limiting to the top qualifying row and including ties, if the syntax is supported by the RDBMS, and joining the results.
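A hedged sketch of that alternative in two common dialects; whether it actually beats the subquery version is exactly the kind of thing the execution plan has to tell you:

-- SQL Server: keep every row that ties for the lowest age
SELECT TOP 1 WITH TIES *
FROM Customers
ORDER BY Age;

-- Engines that support the standard FETCH FIRST syntax (e.g. recent PostgreSQL)
SELECT *
FROM Customers
ORDER BY Age
FETCH FIRST 1 ROW WITH TIES;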
Ultimately there is no black and white answer.
This is the minimal query statement I want to execute.
select count(*) from temper_300_1 group by onegid;
I do have "where" clauses to go along with it as well, though. What I am trying to do is build a histogram query and determine the number of elements with a particular "onegid". The query takes about 7 seconds on 800 million rows. Could someone suggest a faster alternative or optimization?
I was actually trying to plot a heatmap from spatial data consisting of latitudes and longitudes. I have assigned a grid id to each element, but the "group by" aggregation is coming out to be pretty costly in terms of time.
You're not going to get much faster than group by, though your current query won't display which group item is associated with each count.
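Including the grouping column in the select list fixes that:

select onegid, count(*) from temper_300_1 group by onegid;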
Make sure that the table is properly distributed with
select datasliceid, count(1) from temper_300_1 group by datasliceid;
The counts should be roughly equal. If they're not, your DBA needs to redistribute the table on a better distribution key.
If it is, you could ask your DBA to create a materialized view on that specific column, ordered by that column. You may see some performance gains.
I would say that there are two primary considerations for performance related to your query: distribution and row size/extent density.
Distribution:
As @jeremytwfortune mentions, it is important that your data be well distributed with little skew. In an MPP system such as Netezza, you are only as fast as your slowest data slice, and if one data slice has 10x the data of the rest it will likely drag your performance down.
The other distribution consideration is that if your table is not already distributed on onegid, it will be dynamically redistributed on onegid when the query runs in support of your GROUP BY onegid clause. This will happen for GROUP BYs and windowed aggregates with PARTITION BYs. If the distribution of onegid values is not relatively even you may be faced with processing skew.
If your table is already distributed on onegid and you don't supply any other WHERE predicates then you are probably already optimally configured from that standpoint.
Row Size / Extent Density
As Netezza reads data to support your query, each data slice will read its disk in 3 MB extents. If your row is substantially wider than just the onegid value, you will be reading more data from the disk than you need in order to answer your query. If your table is large, your rows are wider than just onegid, and query time performance is paramount, then you might consider creating a materialized view, like so:
CREATE MATERIALIZED VIEW temper_300_1_mv AS select onegid from temper_300_1 ORDER BY onegid;
When you execute your query against temper_300_1 with only onegid in the SELECT clause, the optimizer will refer to the materialized view only, which will be able to pack more rows into a given 3MB extent. This can be a significant performance boost.
The ORDER BY clause in the MVIEW creation statement will also likely increase the effectiveness of compression of the MVIEW, further reducing the number of extents required to hold a given number of rows, and further improving performance.
These are basic SQL functions and keywords.
Are there any tips or tricks to speed up your SQL?
For example, I have a query with a lot of keywords (AND, GROUP BY, ORDER BY, IN, BETWEEN, LIKE, etc.).
Which keyword should come first in my query?
How can I decide?
Example;
WHERE NUMBER IN (156, 646)
AND DATE BETWEEN '01/01/2011' AND '01/02/2011'
OR
WHERE DATE BETWEEN '01/01/2011' AND '01/02/2011'
AND NUMBER IN (156, 646)
Which one is faster? What does it depend on?
Don't use functions in the WHERE clause, because the query engine must execute the function for every single row.
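For example, using the question's DATE column and a hypothetical Orders table, the first predicate hides the column behind a function and defeats any index on it, while the second is an equivalent range that stays sargable:

SELECT * FROM Orders WHERE YEAR(DATE) = 2011;                              -- function on the column: full scan
SELECT * FROM Orders WHERE DATE >= '2011-01-01' AND DATE < '2012-01-01';   -- range predicate: can seek on an index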
There are no "tricks".
Given the competition between the database vendors about which one is "faster", any "trick" that is always true would be implemented in the database itself. (The tricks are implemented in the part of the database called "optimizer").
There are only things to be aware of, but they typically can't be reduced into:
Use feature X
Avoid feature Y
Model like this
Never model like that
Look at all the raging questions/discussions about indexes, index types, index strategies, clustering, single column keys, compound keys, referential integrity, access paths, joins, join mechanisms, storage engines, optimizer behaviour, datatypes, normalization, query transformations, denormalization, procedures, buffer cache, resultset cache, application cache, modeling, aggregation, functions, views, indexed views, set processing, procedural processing and the list goes on.
All of them were invented to attack a specific problem area. Variations on that problem make the "trick" more or less suitable. Very often the tricks have zero effect, and sometimes they are flat-out horrible. Why? Because when we don't understand why something works, we are basically just throwing features at the problem until it goes away.
The key point here is that there is a reason why something makes a query go faster, and the understanding of what that something is, is crucial to the process of understanding why a different unrelated query is slow, and how to deal with it. And it is never a trick, nor magic.
We (humans) are lazy, and we want to be thrown that fish when what we really need is to learn how to catch it.
Now, what specific fish do YOU want to catch?
Edited for comments:
The placement of your predicates in the WHERE clause makes no difference, since the order in which they are processed is determined by the database. Some of the things which will affect that order (for your example) are:
Whether or not the query can be rewritten against an indexed view
What indexes are available that covers one or both of columns NUMBER and DATE and in what order they exist in that index
The estimated selectivity of your predicates, which basically means the estimated percentage of rows matched by your predicate. The lower the %, the more likely the optimizer is to use your index efficiently.
The clustering factor (or whatever the name is in SQL Server), if SQL Server factors that into the query cost. This has to do with how the order of the index entries aligns with the physical order of the table rows. Better alignment reduces the cost of fetching a higher % of rows via that index.
Now, if the only values you have in column NUMBER are 156 and 646 and they are pretty much evenly spread, an index would be useless. A full scan would be a better alternative.
On the other hand, if those are unique order numbers (backed by a unique index), the optimizer will pick that index and drive the query from there. Similarly, if the rows having a DATE between the first and second of January 2011 make up a small enough % of the rows, an index leading with DATE will be considered.
Or if you include ORDER BY NUMBER, DATE, another parameter comes into the equation: the cost of sorting. An index on (NUMBER, DATE) will now seem more attractive to the optimizer, because even though it might not be the most efficient way of acquiring the rows, the sorting (which is expensive) can be skipped.
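For instance (with a hypothetical table name), an index whose key order matches the ORDER BY lets the rows come back already sorted:

CREATE INDEX IX_Orders_Number_Date ON Orders (NUMBER, DATE);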
Or, if your query included a join to another table (say customer) on customer_id and you also had a filter on customer.ssn, again the equation changes, because (since you did a good job with foreign keys and a backing index) you will now have a very efficient access path into your first table, without using the indexes on NUMBER or DATE. Unless you only have one customer and all of the 10 million orders were his...
Read about sargable queries (ones which can use an index versus ones which can't). Avoid correlated subqueries, functions in WHERE clauses, cursors, and while loops. Don't use SELECT *, especially if you have joins, and never return more data than you need.
Actually, there are whole books written on performance tuning; get one and read it for the database you are using, as the techniques vary from database to database.
Learn to use indexes properly.
http://Use-The-Index-Luke.com/
What is the Big-O for SQL select, for a table with n rows and for which I want to return m result?
And What is the Big-O for an Update, or delete, or Create operation?
I am talking about mysql and sqlite in general.
As you don't control the algorithm selected, there is no way to know directly. However, without indexes a SELECT should be O(n) (a table scan has to inspect every record which means it will scale with the size of the table).
With an index a SELECT is probably O(log(n)) (although it would depend on the algorithm used for indexing and the properties of the data itself if that holds true for any real table). To determine your results for any table or query you have to resort to profiling real world data to be sure.
INSERT without indexes should be very quick (close to O(1)) while UPDATE needs to find the records first and so will be slower (slightly) than the SELECT that gets you there.
INSERT with indexes will probably again be in the ballpark of O(log(n)^2) when the index tree needs to be rebalanced, closer to O(log(n)) otherwise. The same slowdown will occur with an UPDATE if it affects indexed rows, on top of the SELECT costs.
Edit: O(log(n^2)) = O(2 log(n)) = O(log(n)); did you mean O(log(n)^2)?
All bets are off once you are talking about JOIN in the mix: you will have to profile and use your database's query estimation tools to get a read on it. Also note that if this query is performance critical you should re-profile from time to time, as the algorithms used by your query optimizer will change as the data load changes.
Another thing to keep in mind... big-O doesn't tell you about fixed costs for each transaction. For smaller tables these are probably higher than the actual work costs. As an example: the setup, tear down and communication costs of a cross network query for a single row will surely be more than the lookup of an indexed record in a small table.
Because of this I found that being able to bundle a group of related queries in one batch can have vastly more impact on performance than any optimization I did to the database proper.
I think the real answer can only be determined on a case by case basis (database engine, table design, indices, etc.).
However, if you are a MS SQL Server user, you can familiarize yourself with the Estimated Execution Plan in Query Analyzer (2000) or Management Studio (2005+). That gives you a lot of information you can use for analysis.
It all depends on how (well) you write your SQL and how well your database is designed for the operation you are performing. Try to use the explain plan function to see how things will be executed by the db. Then you can calculate the big-O.