The Time Complexity of Built-in SQL Functions such as sum, count, avg - sql

What is the time complexity of a function such as count, sum, avg or any other of the built in "math"-functions in mysql, sql server, oracle and others?
One would think that calling sum(myColumn) would be linear.
But count(1) isn't. How come, and what is the real time complexity?
In a perfect world I would want sum, avg and count to be O(1). But we don't live in one of those, do we?

In MySQL with MyISAM, COUNT(*) without GROUP BY is O(1) (constant)
It is stored in the table metadata.
In all systems, MAX and MIN on indexed expressions without GROUP BY are O(log(n)) (logarithmic).
They are fetched with a single index seek.
Aggregate functions are O(n) (linear) when used without GROUP BY, or when GROUP BY uses HASH.
Aggregate functions are O(n log(n)) when GROUP BY uses SORT.
All values must be fetched, processed and accumulated into the state variables (which may be kept in a hash table).
In addition, when using SORT, the values must also be sorted.
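For illustration, a hedged sketch using a hypothetical table t with an index on my_column (all names are invented):
-- MIN/MAX on an indexed column: a single index seek, O(log n)
SELECT MIN(my_column) FROM t;
-- SUM without GROUP BY: every qualifying row is visited, O(n)
SELECT SUM(my_column) FROM t;
-- with GROUP BY: O(n) using a hash aggregate, O(n log n) if the groups are sorted
SELECT my_group, SUM(my_column) FROM t GROUP BY my_group;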

In SQL the mathematical complexity of the aggregates themselves is totally irrelevant. The only thing that really matters is the data access complexity: what access path is chosen (table scan, index range scan, index seek etc.) and how many pages are read. There may be slight differences in the internals of each aggregate, but they all work pretty much the same way (keep running state and compute the running aggregation for each input value), and there is absolutely NO aggregate that looks at its input twice, so they are all O(n) as internal implementations, where 'n' is the number of records fed to the aggregate (not necessarily the number of records in the table!).
Some aggregates have internal shortcuts, e.g. COUNT(*) may return the count from metadata on some systems, if possible.
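As a hedged illustration of that point (a hypothetical orders table with an index on customer_id), the same aggregate can cost very differently depending on the access path:
-- full table scan (or a metadata shortcut, where available)
SELECT COUNT(*) FROM orders;
-- index range scan: far fewer pages read, even though the aggregate itself is unchanged
SELECT COUNT(*) FROM orders WHERE customer_id = 42;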

Note: this is speculation based on my understanding of how SQL query planners work, and may not be entirely accurate.
I believe all the aggregate functions, or at least the "math" ones you name above, should be O(n). The query will be executed roughly as follows:
Fetch rows matching the join predicates and filter predicates (i.e. the WHERE clause)
Create row-groups according to the GROUP BY clause. A single row-group is created for queries with no GROUP BY.
For each row-group, apply the aggregate function to the rows in the group. For things like SUM, AVG, MIN, MAX, as well as non-numeric functions like CONCAT, there are simple O(n) algorithms, and I suspect those are used.
Create one row in the output set for each row-group created in step #2.
If a HAVING predicate is present, filter output rows using this predicate
Note, however, that even though the aggregate functions are O(n), the operation might not be. If you create a query which cartesian joins a table to itself, you will be looking at O(n*n) minimum just to create the initial row set (step #1). Sorting to create row-groups (step #2) may be O(n lg n), and may require disk storage for the sort operation (as opposed to in-memory operation only), so your query may still perform poorly if you are manipulating many rows.
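A hedged example (table and column names invented) that touches every step above:
SELECT customer_id, SUM(amount) AS total   -- aggregate per row-group, one output row per group
FROM orders
WHERE order_date >= '2011-01-01'           -- fetch/filter rows (WHERE clause)
GROUP BY customer_id                       -- create row-groups
HAVING SUM(amount) > 1000;                 -- filter output rows (HAVING predicate)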

For big data-warehouse-style queries, the major databases can parallelize the task, so multiple CPUs work on it. There will therefore be threshold points where the behaviour is not quite linear, as the cost of coordinating the parallel threads trades off against the benefit of using multiple CPUs.

Related

Best performance in sampling repeated value from a grouped column

This question is about the functionality of first_value(), using another function or workaround.
It is also about "little gain in performance" in big tables. Using e.g. max() in the context explained below demands spurious comparisons. Even if fast, it imposes some additional cost.
This typical query
SELECT x, y, count(*) as n
FROM t
GROUP BY x, y;
needs to repeat all columns in the GROUP BY to return more than one column. Syntactic sugar for this is to use positional references:
SELECT x, y, count(*) as n
FROM t
GROUP BY x, 2 -- imagine that 2, 3, etc. are repeated with x
Sometimes you need not only sugar, but also some semantics to understand a complex context:
SELECT x, COALESCE(y,z), count(*) as n
FROM t
GROUP BY x, y, z -- y and z are not "real need" grouping clauses?
I can imagine many other complex contexts. Let's look at the usual solutions:
SELECT x, max(y) as y, count(*) as n
FROM t
GROUP BY x -- best semantic! no need for other columns here
where the max() function can be any "sample()" (e.g. first or last value). Something that does nothing would perform better than max(); the function first_value() is such a candidate, but it needs a WINDOW, so performance is lost again. There are some old suggestions to implement first/last aggregate functions in C.
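For reference, a minimal SQL-only sketch of such a first() aggregate in PostgreSQL (the function and aggregate names are just examples; the C implementations referred to above would be faster):
CREATE FUNCTION first_agg(anyelement, anyelement)
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT $1; $$;

CREATE AGGREGATE first(anyelement) (
    SFUNC = first_agg,
    STYPE = anyelement
);

-- usage: keeps the first y encountered per group (no guaranteed order)
SELECT x, first(y) AS y, count(*) AS n FROM t GROUP BY x;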
Is there some "get any one value fast" aggregate function with better performance than max() or GROUP BY X,2,...?
Perhaps some new feature in a recent release?
If you really don't care which member of the set is picked, and if you don't need to compute additional aggregates (like count), there is a fast and simple alternative with DISTINCT ON (x) without ORDER BY:
SELECT DISTINCT ON (x) x, y, z FROM t;
x, y and z are from the same row, but the row is an arbitrary pick from each set of rows with the same x.
If you need a count anyway, your options with regard to performance are limited since the whole table has to be read in either case. Still, you can combine it with window functions in the same SELECT:
SELECT DISTINCT ON (x) x, y, z, count(*) OVER (PARTITION BY x) AS x_count FROM t;
Consider the sequence of events in a SELECT query:
Best way to get result count before LIMIT was applied
Depending on requirements, there may be faster ways to get counts:
Fast way to discover the row count of a table in PostgreSQL
In combination with GROUP BY the only realistic option I see to gain some performance is the first_last_agg extension. But don't expect much.
For other use cases without count (including the simple case at the top), there are faster solutions, depending on your exact use case, in particular to get the "first" or "last" value of each set. Emulate a loose index scan (as @Mihai commented):
Optimize GROUP BY query to retrieve latest record per user
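A hedged sketch of such a loose index scan emulation in PostgreSQL, reusing the t/x/y names from above and assuming an index on t (x, y):
WITH RECURSIVE cte AS (
   (SELECT x, y FROM t ORDER BY x LIMIT 1)          -- arbitrary row for the smallest x
   UNION ALL
   SELECT sub.x, sub.y
   FROM cte
   CROSS JOIN LATERAL (
      SELECT x, y FROM t
      WHERE x > cte.x
      ORDER BY x
      LIMIT 1                                       -- arbitrary row for the next distinct x
   ) sub
)
SELECT x, y FROM cte;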
Not an official source, but some thoughts on a question perceived as rather generic:
In general, aggregators need to process all matching rows. From your question text, you seem to target aggregators that try to identify specific values (max, min, first, last, n-th, etc.). Those could benefit from data structures that maintain the proper values for such an aggregator. Then "selecting" that value can be sped up drastically.
E.g. some databases keep track of max and min values of columns.
You can view this support as highly specialised internal indexes that are maintained by the system itself and not under (direct) control of a user.
Now PostgreSQL focuses more on support that helps improve queries in general, not just special cases. So they avoid spending effort on speeding up special cases that do not obviously benefit a broad range of use cases.
Back to speeding up sample value aggregators.
With aggregators having to process all rows in the general case, and with no general strategy that allows short-circuiting that requirement for aggregators that try to identify specific values ("sample"-kind aggregators, for now), it is obvious that any reformulation of a query that does not reduce the set of rows to be processed will take a similar time to complete.
For speeding up such queries beyond processing all rows you will need a supporting datastructure. With databases this usually is provided in the form of an index.
You also could benefit from special execution operations that allow reducing the number of rows to be read.
With pg you have the capability of providing own index implementation. So you could add an implementation that best supports a special kind of aggregator you are interested in. (At least for cases where you do need to run such queries often.)
Also, execution operations like index-only scans or lazy evaluation with recursive queries may allow writing a specific query in a way that is faster than the "straight" coding.
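For example (a hypothetical covering index; whether an index-only scan is actually chosen also depends on the visibility map and planner estimates):
CREATE INDEX t_x_y_idx ON t (x, y);
-- can be answered from the index alone, without touching the heap
SELECT x, min(y) FROM t GROUP BY x;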
If your question is aimed more at general approaches, you might be better off consulting researchers on such topics, as that is beyond anything SO is intended to provide.
If you have a specific (set of) queries that need to be improved, providing explicit questions on those might allow the community to help identify potential optimizations. Trying to optimize without a good base of measurements leads nowhere, as what yields a perfect result in one case might kill performance in another.

Is there a faster alternative to "group by" aggregation in Netezza?

This is the minimal query statement I want to execute.
select count(*) from temper_300_1 group by onegid;
I do have "where" clauses to go along as well, though. What I am trying to do is build a histogram query and determine the number of elements with a particular "onegid". The query takes about 7 seconds on 800 million rows. Could someone suggest a faster alternative or optimization?
I am actually trying to plot a heatmap from spatial data consisting of latitudes and longitudes. I have assigned a grid id to each element, but the "group by" aggregation is turning out to be pretty costly in terms of time.
You're not going to get much faster than group by, though your current query won't display which group item is associated with each count.
Make sure that the table is properly distributed with
select datasliceid, count(1) from temper_300_1 group by datasliceid;
The counts should be roughly equal. If they're not, your DBA needs to redistribute the table on a better distribution key.
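A hedged sketch of such a redistribution in Netezza (the new table name is arbitrary; here the grouping column is used as the distribution key):
CREATE TABLE temper_300_1_redist AS
SELECT * FROM temper_300_1
DISTRIBUTE ON (onegid);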
If it is, you could ask your DBA to create a materialized view on that specific column, ordered by that column. You may see some performance gains.
I would say that there are two primary considerations for performance related to your query: distribution and row size/extent density.
Distribution:
As @jeremytwfortune mentions, it is important that your data be well distributed with little skew. In an MPP system such as Netezza, you are only as fast as your slowest data slice, and if one data slice has 10x the data of the rest it will likely drag your performance down.
The other distribution consideration is that if your table is not already distributed on onegid, it will be dynamically redistributed on onegid when the query runs in support of your GROUP BY onegid clause. This will happen for GROUP BYs and windowed aggregates with PARTITION BYs. If the distribution of onegid values is not relatively even you may be faced with processing skew.
If your table is already distributed on onegid and you don't supply any other WHERE predicates then you are probably already optimally configured from that standpoint.
Row Size / Extent Density
As Netezza reads data to support your query, each data slice will read its disk in 3 MB extents. If your row is substantially wider than just the onegid value, you will be reading more data from disk than you need in order to answer your query. If your table is large, your rows are wider than just onegid, and query time performance is paramount, then you might consider creating a materialized view, like so:
CREATE MATERIALIZED VIEW temper_300_1_mv AS select onegid from temper_300_1 ORDER BY onegid;
When you execute your query against temper_300_1 with only onegid in the SELECT clause, the optimizer will refer to the materialized view only, which will be able to pack more rows into a given 3MB extent. This can be a significant performance boost.
The ORDER BY clause in the MVIEW creation statement will also likely increase the effectiveness of compression of the MVIEW, further reducing the number of extents required to hold a given number of rows, and further improving performance.

Confused about the cost of SQL in PL/SQL Developer

Select pn.pn_level_id From phone_number pn Where pn.phone_number='15183773646'
Select pn.pn_level_id From phone_number pn Where pn.phone_number=' 15183773646'
Do you think they are the same? No. They are not the same in PL/SQL Developer.
I'm wondering why the latter query's cost is lower than that of the former.
Any help would be appreciated!
The cost is not the same because, when the cost is calculated, the planner takes the available statistics into account. Statistics contain, among other things, the values that appear most frequently in a column, together with their frequencies. This allows the planner to better estimate the number of rows that will be fetched and to decide how best to get the data (e.g. via a sequential scan or via an index).
In your case the value 15183773646 is probably among the most frequently appearing ones, and that is why the planner's estimate is different for the query involving it (the planner has a better estimate of the number of such rows) compared to other values, for which it basically guesses.
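If you want to verify this, a hedged way (assuming you have access to the statistics views) is to check whether a histogram exists on the column:
SELECT column_name, num_distinct, histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'PHONE_NUMBER';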
The optimizer evaluates all the ways to extract the data. Since your database does not contain a row with pn.phone_number = ' 15183773646', that value is not stored in the index, so Oracle skips one I/O operation; this is why the second cost is lower.

Pre-fetching row counts before query - performance

I recently answered this question based on my experience:
Counting rows before proceeding to actual searching
but I'm not 100% satisfied with the answer I gave.
The question is basically this: Can I get a performance improvement by running a COUNT over a particular query before deciding to run the query that brings back the actual rows?
My intuition is this: you will only save the I/O and wire time associated with retrieving the data instead of the count, because to count the data you still need to actually find the rows. The possible exception to this is when the query is a simple function of the indexes.
My question then is this: Is this always true? What other exception cases are there? From a pure performance perspective, in what cases would one want to do a COUNT before running the full query?
First, the answer to your question is highly dependent on the database.
I cannot think of a situation when doing a COUNT() before a query will shorten the overall time for both the query and the count().
In general, doing a count will pre-load tables and indexes into the page cache. Assuming the data fits in memory, this will make the subsequent query run faster (although not much faster if you have fast I/O and the database does read-ahead page reading). However, you have just shifted the time frame to the COUNT(), rather than reducing overall time.
To shorten the overall time (including the run time of the COUNT()) would require changing the execution plan. Here are two ways this could theoretically happen:
A database could update statistics as a table is read in, and these statistics, in turn, change the query plan for the main query.
A database could change the execution plan based on whether tables/indexes are already in the page cache.
Although theoretically possible, I am not aware of any database that does either of these.
You could imagine that intermediate results could be stored, but this would violate the dynamic nature of SQL databases. That is, updates/inserts could occur on the tables between the COUNT() and the query. A database engine could not maintain integrity while keeping such intermediate results.
Doing a COUNT() has disadvantages, relative to speeding up the subsequent query. The query plan for the COUNT() might be quite different from the query plan for the main query. Your example with indexes is one case. Another case would be in a columnar database, where different vertical partitions of the data do not need to be read.
Yet another case would be a query such as:
select t.*, r.val
from table t left outer join
ref r
on t.refID = r.refID
and refID is a unique index on the ref table. This join can be eliminated for a count, since there are no duplicates and all records in t are used. However, the join is clearly needed for this query. Once again, whether a SQL optimizer recognizes and acts on this situation is entirely the decision of the writers of the database. However, the join could theoretically be optimized away for the COUNT().
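As a hedged illustration (writing the left table simply as t here), the COUNT() variant of the same query could in principle be executed without the join at all:
-- the left outer join to a unique refID cannot change the row count,
-- so for counting purposes the query reduces to a scan of the left table alone
select count(*)
from t left outer join
     ref r
     on t.refID = r.refID;
-- equivalent, for the count, to:
select count(*) from t;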

Tips and Tricks to speed up an SQL [duplicate]

Possible Duplicate:
Does the order of columns in a WHERE clause matter?
These are the basic SQL functions and keywords.
Are there any tips or tricks to speed up your SQL?
For example, I have a query with a lot of keywords (AND, GROUP BY, ORDER BY, IN, BETWEEN, LIKE, etc.).
Which keyword should come first in my query?
How can I decide?
Example;
Where NUMBER IN (156, 646)
AND DATE BETWEEN '01/01/2011' AND '01/02/2011'
OR
Where DATE BETWEEN '01/01/2011' AND '01/02/2011'
AND NUMBER IN (156, 646)
Which one is faster? What does it depend on?
Don't use functions on columns in the WHERE clause, because the query engine must execute the function for every single row.
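A hedged illustration (table and column names invented):
-- non-sargable: the function must be evaluated for every row, so an index on order_date is not used
SELECT * FROM orders WHERE YEAR(order_date) = 2011;
-- sargable: the same filter as a range predicate, so an index on order_date can be used
SELECT * FROM orders WHERE order_date >= '2011-01-01' AND order_date < '2012-01-01';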
There are no "tricks".
Given the competition between the database vendors about which one is "faster", any "trick" that is always true would be implemented in the database itself. (The tricks are implemented in the part of the database called "optimizer").
There are only things to be aware of, but they typically can't be reduced into:
Use feature X
Avoid feature Y
Model like this
Never model like that
Look at all the raging questions/discussions about indexes, index types, index strategies, clustering, single column keys, compound keys, referential integrity, access paths, joins, join mechanisms, storage engines, optimizer behaviour, datatypes, normalization, query transformations, denormalization, procedures, buffer cache, resultset cache, application cache, modeling, aggregation, functions, views, indexed views, set processing, procedural processing and the list goes on.
All of them were invented to attack a specific problem area. Variations on that problem make the "trick" more or less suitable. Very often the tricks have zero effect, and sometimes they are flat-out horrible. Why? Because when we don't understand why something works, we are basically just throwing features at the problem until it goes away.
The key point here is that there is a reason why something makes a query go faster, and the understanding of what that something is, is crucial to the process of understanding why a different unrelated query is slow, and how to deal with it. And it is never a trick, nor magic.
We (humans) are lazy, and we want to be thrown that fish when what we really need is to learn how to catch it.
Now, what specific fish do YOU want to catch?
Edited for comments:
The placement of your predicates in the WHERE clause makes no difference, since the order in which they are processed is determined by the database. Some of the things which will affect that order (for your example) are:
Whether or not the query can be rewritten against an indexed view
What indexes are available that covers one or both of columns NUMBER and DATE and in what order they exist in that index
The estimated selectivity of your predicates, which basically means the estimated percentage of rows matched by your predicate. The lower the percentage, the more likely the optimizer is to use your index efficiently.
The clustering factor (or whatever the name is in SQL Server), if SQL Server factors that into the query cost. This has to do with how the order of the index entries aligns with the physical order of the table rows. Better alignment reduces the cost of fetching a higher percentage of rows via that index.
Now, if the only values you have in column NUMBER are 156, 646 and they are pretty much evenly spread, an index would be useless. A full scan would be a better alternative.
On the other hand, if those are unique order numbers (backed by a unique index), the optimizer will pick that index and drive the query from there. Similarly, if the rows having a DATE between the first and second of January 2011 make up a small enough percentage of the rows, an index leading with DATE will be considered.
Or if you include ORDER BY NUMBER, DATE, another parameter comes into the equation: the cost of sorting. An index on (NUMBER, DATE) will now seem more attractive to the optimizer, because even though it might not be the most efficient way of acquiring the rows, the sorting (which is expensive) can be skipped.
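A hedged sketch of that situation, reusing the NUMBER/DATE columns from the example above (index and table names are placeholders):
CREATE INDEX ix_orders_number_date ON orders (NUMBER, DATE);

SELECT *
FROM orders
WHERE NUMBER IN (156, 646)
  AND DATE BETWEEN '01/01/2011' AND '01/02/2011'
ORDER BY NUMBER, DATE;   -- the sort can be skipped by reading the index in order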
Or, if your query included a join to another table (say customer) on customer_id and you also had a filter on customer.ssn, again the equation changes, because (since you did a good job with foreign keys and a backing index) you will now have a very efficient access path into your first table, without using the indexes on NUMBER or DATE. Unless you only have one customer and all 10 million orders were his...
Read about sargable queries (ones which can use the index versus ones which can't). Avoid correlated subqueries, functions in WHERE clauses, cursors and while loops. Don't use SELECT *, especially if you have joins, and never return more than the data you need.
Actually, there are whole books written on performance tuning; get one and read it for the database you are using, as the techniques vary from database to database.
Learn to use indexes properly.
http://Use-The-Index-Luke.com/