question about aggregate function internals in SQL/Postgres - sql

How does a function like SUM work? If I execute
select id,sum(a) from mytable group by id
does it sort by id and then sum over each range of equal id's? I am no planner expert, but it looks like that is what is happening, where mytable is maybe a hundred million rows with a few million distinct id's.
Or does it just keep a hash of id -> current_sum, and then at each row either increments the value of id or add a new key? Isn't that far faster and less memory hungry?

SQL standards try to dictate external behavior, not internal behavior. In this particular case, a SQL implementation that conforms to (one of the many) standards is supposed to act like it does things in this order.
Build a working table from all the table constructors in the FROM clause. (There's only one in your example.)
In the GROUP BY clause, partition the working table into groups. Reduce each group to one row. Replace the working table with the grouped table.
Resolve the expressions in the SELECT clause.
Query optimizers that follow SQL standards are free to rearrange things however they like, as long as the result is the same as if it had followed those steps.
You can find more details in the answers and comments to this SO question.

So, I found this, http://helmingstay.blogspot.com/2009/06/postgresql-poetry-aggregate-median-with.html, which claims that it does indeed use the accumulator pattern. Hmmm.

Related

Question about execution relating to derived columns in SQL query

This is a more conceptual question about what happens during execution rather than anything wrong with the code. I was thinking about this in relation to the exercise here: https://www.hackerrank.com/challenges/earnings-of-employees/submissions/code/214661572
The solution I provided was:
SELECT * FROM (
SELECT (SALARY * MONTHS), COUNT(*)
FROM EMPLOYEE
GROUP BY (SALARY*MONTHS)
ORDER BY (SALARY*MONTHS) DESC
)
WHERE ROWNUM = 1;
First, looking at the subquery, we see that we are grouping on the derived column salary*months.
However, something that confuses me is that it was explained to me that the order of execution begins with the from statement, and then proceeds to joins, where, group by, etc. clauses.
The problem I have was in my mind, I have imagined the from statement as the command which tells SQL what table we are dealing with - so when we invoke from the employee table we have the columns as specified in the exercise link.
However, now the next step is to group by (SALARY * MONTHS)... which is a derived column. But, the derived column does not exist in the table that we specified in the "FROM" statement. So how does SQL know what to group by if the column isn't provided in the original table?
The order of execution explanation I am looking at is here: https://sqlbolt.com/lesson/select_queries_order_of_execution
Thank you.
The article from SQLBolt, is not realistic.
The article explains how a naïve database engine would execute a query. It can be helpful as a simplistic approach to understand the basics of a query execution, but if an engine worked that way it would be very slow except for the most simple queries only. Nowadays the engines are much smarter than that.
The key concept that you must understand is that SQL is a declarative language, not an imperative one. You tell what you need, the engine decides how to produce it.
As a general rule modern databases process the query using the phases listed below:
Caching: Find if the query was already executed. If found, skip to step #7.
Parsing: Translate the SQL statement into an internal representation.
Rephrasing: Simplify the internal representation. All dirty tricks are valid here: for example using the same node for all occurrences of SALARY * MONTHS.
Planning/Prunning: Produce all possible execution plan trees, and prune as soon as possible. Include existing indexes to produce plans.
Cost Assessment: Determine the cost of a plan. The cost algorithm must be extremely fast (typically an heuristic can do), and somewhat accurate, since it must be computed for all candidate plans.
Optimizing: Select the best plan according to the cost. Update the cache.
Executing: Execute the plan tree starting from the root node and walk the tree by depth. This is not strictly true due to pipelining.
Node Pipelining or Materializing: a node can start returning rows as soon as possible to the parent node.
Return Result Set: Walk back to the parent nodes, until the root node is reached. The root node starts returning data to the client app.
As you see, there's a lot going on behind the scenes. Mind that some engines are much more sophisticated than this (e.g. Oracle, DB2, PostgreSQL) since they have smarter shortcuts and have implemented so may dirty tricks. Yep... all is valid as long as the returned result is correct.
Two different things are going on here.
The more important is that what gets executed is a directed acyclic graph of data operations. It really has (very little) to do with the SQL you write. The only guarantee is that it produces the results that you specify.
That is, SQL is a declarative language, not a procedural language. A query describes the result set.
The second thing that is going on is the scoping of identifiers: what does a column reference mean? These are defined in the FROM clause. JOINs have nothing to do with this, because they are just operators (like + or || except on tables) in the FROM clause.
Then, the references can be used in the WHERE, GROUP BY, SELECT, and other clauses. Column aliases defined in the SELECT can really only be used in the ORDER BY clause. Some databases also allow them in the HAVING and GROUP BY clauses as well but not Oracle.
As for your specific question, there is no requirement in SQL that the GROUP BY keys be present in the SELECT. Usually they are, but that is not a requirement. In fact, there are cases when using dates with string names that they might not be used. For instance, one could write:
select to_char(datecol, 'MON'), count(*)
from t
group by to_char(datecol, 'MON'), extract(month from date)
order by extract(month from date);
The month number is functionally equivalent to the month name. To have it for sorting, you could include it as a group by key.
Some people confuse the scoping rules with the order of execution. That is due to a misunderstanding of how SQL engines actually work.
I just give you my opinion, it is pretty intuitive.
You have the subquery query:
SELECT * FROM
-- init_subquery
(SELECT (SALARY * MONTHS), COUNT(*) FROM EMPLOYEE
GROUP BY (SALARY*MONTHS)
ORDER BY (SALARY*MONTHS) DESC)
-- finish_subquery
WHERE ROWNUM = 1;
First of all it do the arithmetic operation, It produces the result of SALARY*MONTHS. That result is the column for Oracle, It only understand that it has some values in one column...
Like: SELECT (SALARY * MONTHS), FROM EMPLOYEE
After this execution it gets the GROUP BY clouse to do the COUNT, because COUNT operation depends if there are any group clause.
And finally it order your table.
I do not want to get any point from your article. Just that is a good question and I am preparing for 1Z0_071 Exam, and it is interesting to me to discuse.

Current State Query / Architecture

I expect this is a common enough use-case, but I'm unsure the best way to leverage database features to do it. Hopefully the community can help.
Given a business domain where there are a number of attributes to make up a record. We can just call these a,b,c
Each of these belong to a parent record, of which there can be many,
Given an external datasource that will post updates to those attributes, at arbitrary times, and typically only a subset, so you get instructions like
z:{a:3}
or
y:{b:2,c:100}
What are good ways to be able to query postgres for the 'current state', ie. wanting a single row result that represents the most recent value for all of a,b,c, for each of the parent records.
current state looks overall like
x:{a:0, b:0, c:1}
y:{a:1, b:2, c:3}
z:{a:2, b:65, c:6}
If it matters, The difference in time between updates on a single value could be arbitrarily long
I am deliberately avoiding having a table that keeps updating and writing an individual row for the state because the write-contention could be a problem, and I think there must be a better overall pattern.
Your question is a bit theorical - but in essence you are describing a top-1-per-group problem. In Postgres, you can use distinct on for this.
Assuming that your table is called mytable, where attributes are stored in column attribute, and that column ordering_id defines the ordering of the rows (that could be a timestamp or an serial for example), you would phrase the query as:
select distinct on (attribute) t.*
from mytable t
order by attribute, ordering_id desc

How does implicit sorting in SELECT *,MIN(x) work in SQLite?

Today I had an apparently very common problem of selecting the row with the minimum value from each group of a dataset split by a group by. I found a solution that is unique to SQLite (it works incorrectly in MySQL and throws an error in PostgreSQL) and doesn't use any joins. It looks like this:
SELECT *, min(x) FROM table GROUP BY y
Here is a fiddle with an example.
However, I don't understand why this works - just by including an aggregate function each group was somehow implicitly sorted and returned the row to which the result of the aggregate function corresponds. Default SQL behavior is to select an arbitrary row. I dug through relevant SQLite documentation and found no explanation of this. This is what I'd like an explanation for.
Edit: both answers so far guess that this is a coincidence. It is not. In the actual table I have ~90 records split into ~30 groups with this method and it works as expected on every one. See for yourself.
To be compatible with MySQL, SQLite allows to use columns that are neither aggregated nor grouped by.
MySQL does not guarantee that the values come from any specific row, and neither did SQLite before version 3.7.11. However, due to how grouping is implemented in SQLite, the values in such columns happened to come from the row that matches the min()/max() in certain cases.
Some paying customer found this useful and wanted a guarantee for this, so SQLite enforced it in all cases and documented it in the changelog of version 3.7.11, which makes it a supported feature (i.e., it's tested, and will never be removed).
While it is safe to use, this behaviour is a violation extension of the SQL standard that was never properly designed, and never meant to be a selling feature, so it is not mentioned in the actual documentation.
It probably works by accident. SQLite will return an arbitrary row for each group. The row does not necessarily have to have the minimum x value for the group.
Learn to express the query correctly:
SELECT t.*
FROM table t
WHERE t.x = (SELECT MIN(t2.x) FROM table t2 WHERE t2.y = t.y)
The record you see was arbitrary chosen.
You cannot count on the behaviour which seems fix to you.
It can be changed due to changes in the table structure (e.g. added/removed indexes), between versions etc.
https://www.sqlite.org/lang_select.html
If the SELECT statement is an aggregate query with a GROUP BY clause
...
Each expression in the result-set is then evaluated once for each
group of rows. If the expression is an aggregate expression, it is
evaluated across all rows in the group. Otherwise, it is evaluated
against a single arbitrarily chosen row from within the group. If
there is more than one non-aggregate expression in the result-set,
then all such expressions are evaluated for the same row.
This reminds me of a famous pitfall related to Oracle's GROUP BY.
Everybody just knew that if you use GROUP BY you can skip the ORDER BY because the result set is already ordered.
The reason the result set was ordered at that time is that Oracle used a sort based algorithm for the implementation of the group by.
In version 10gR2 Oracle added an additional GROUP BY algorithm based on HASH.
You can guess the rest of the story.

Does ms access group by order the results?

I need to run a query that groups the result and orders it. When I used the following query I noticed that the results were ordered by the field name:
SELECT name, count(name)
FROM contacts
GROUP BY name
HAVING count(name)>1
Originally I planed on using the following query:
SELECT name, count(name)
FROM contacts
GROUP BY name
HAVING count(name)>1
ORDER BY name
I'm worried that order by significantly slows the running time.
Can I depend on ms-access to always order by the field I am grouping by, and eliminate the order by?
EDIT: I tried grouping different fields in other tables and it was always ordered by the grouped field.
I have found answers to this question to other SQL DBMSs, but not access.
How GROUP BY and ORDER BY work in general
Databases usually choose between sorting and hashing when creating groups for GROUP BY or DISTINCT operations. If they do choose sorting, you might get lucky and the sorting is stable between the application of GROUP BY and the actual result set consumption. But at some later point, this may break as the database might suddenly prefer an order-less hashing algorithm to produce groups.
In no database, you should ever rely on any implicit ordering behaviour. You should always use explicit ORDER BY. If the database is sophisticated enough, adding an explicit ORDER BY clause will hint that sorting is more optimal for the grouping operation as well, as the sorting can then be re-used in the query execution pipeline.
How this translates to your observation
I tried grouping different fields in other tables and it was always ordered by the grouped field.
Have you exhaustively tried all possible queries that could ever be expressed? I.e. have you tried:
JOIN
OUTER JOIN
semi-JOIN (using EXISTS or IN)
anti-JOIN (using NOT EXISTS or NOT IN)
filtering
grouping by many many columns
DISTINCT + GROUP BY (this will certainly break your ordering)
UNION or UNION ALL (which defeats this argument anyway)
I bet you haven't. And even if you tried all of the above, can you be sure there isn't a very peculiar configuration where the above breaks, just because you've observed the behaviour in some (many) experiments?
You cannot.
MS Access specific behaviour
As far as MS Access is concerned, consider the documentation on ORDER BY
Remarks
ORDER BY is optional. However, if you want your data displayed in sorted order, then you must use ORDER BY.
Notice the wording. "You must use ORDER BY". So, MS Acces is no different from other databases.
The answer
So your question about performance is going in the wrong direction. You cannot sacrifice correctness for performance in this case. Better tackle performance by using indexes.
Here is the MSDN documentation for the GROUP BY clause in Access SQL:
https://msdn.microsoft.com/en-us/library/bb177905(v=office.12).aspx
The page makes no reference to any implied or automatic ordering of results - if you do see desired ordering without an explicit ORDER BY then it is entirely coincidental.
The only way to guarantee the particular ordering of results in SQL is with ORDER BY.
There is a slight performance problem with using ORDER BY (in general) in that it requires the DBMS to get all of the results first before it outputs the first row of results (though the DBMS is free to use an "online sort" algorithm that sorts data as it gets each row from its backing store, it still needs to get the last row from the backing store before it can return the first row to the client (in case the last row from the backing-store happens to be the 1st result according to the ORDER BY) - however unless you're querying tens of thousands of rows in a latency-sensitive application this is not a problem - and as you're using Access already it's very clear that this is not a performance-sensitive application.

Does SELECT DISTINCT imply a sort of the results

Does including DISTINCT in a SELECT query imply that the resulting set should be sorted?
I don't think it does, but I'm looking for a an authoritative answer (web link).
I've got a query like this:
Select Distinct foo
From Bar
In oracle, the results are distinct but are not in sorted order. In Jet/MS-Access there seems to be some extra work being done to ensure that the results are sort. I'm assuming that oracle is following the spec in this case and MS Access is going beyond.
Also, is there a way I can give the table a hint that it should be sorting on foo (unless otherwise specified)?
From the SQL92 specification:
If DISTINCT is specified, then let TXA be the result of eliminating redundant duplicate values from TX. Otherwise, let TXA be TX.
...
4) If an is not specified, then the ordering of the rows of Q is implementation-dependent.
Ultimately the real answer is that DISTINCT and ORDER BY are two separate parts of the SQL statement; If you don't have an ORDER BY clause, the results by definition will not be specifically ordered.
No. There are a number of circumstances in which a DISTINCT in Oracle does not imply a sort, the most important of which is the hashing algorithm used in 10g+ for both group by and distinct operations.
Always specify ORDER BY if you want an ordered result set, even in 9i and below.
There is no "authoritative" answer link, since this is something that no SQL server guarantees.
You will often see results in order when using distinct as a side effect of the best methods of finding those results. However, any number of other things can mix up the results, and some server may hand back results in such a way as to not give them sorted even if it had to sort to get the results.
Bottom line: if your server doesn't guarantee something you shouldn't count on it.
Not to my knowledge, no. The only reason I can think of is that SQL Server would internally sort the data in order to detect and filter out duplicates, and thus return it in a "pre-sorted" manner. But I wouldn't rely on that "side effect" :-)
No, it is not implying a sort. In my experience, it sorts by the known index, which may happen to be foo.
Why be subtle? Why not specific Select Distinct foo from Bar Order by foo?
On at least one server I've used (probably either Oracle or SQL Server, about six years ago), SELECT DISTINCT was rejected if you didn't have an ORDER BY clause. It was accepted on the "other" server (Oracle or SQL Server). Your mileage may vary.
No, the results are not sorted. If you want to give it a 'hint', you can certainly supply an ORDER BY:
select distinct foo
from bar
order by foo
But keep in mind that you might want to sort on more than just alphabetically. Instead you might want to sort on criteria on other fields. See:
http://weblogs.sqlteam.com/jeffs/archive/2007/12/13/select-distinct-order-by-error.aspx
As the answers mostly say, DISTINCT does not mandate a sort - only ORDER BY mandates that. However, one standard way of achieving DISTINCT results is to sort; the other is to hash the values (which tends to lead to semi-random sequencing). Relying on the sort effect of DISTINCT would be foolish.
In my case (SQL server), as an example I had a list of countries with a numerical value X assigned against each. When I did a select distinct * from Table order by X, it ordered it by X but at the same time result set countries were also ordered which was not directly implemented.
From my experience, I'll say that distinct does imply an implicit sort.
Yes. Oracle does use a sort do calculate a distinct. You can see that if you look at the explain plan. The fact that it did a sort for that calculation does not in any way imply
that the result set will be sorted. If you want the result set sorted, you are required to use the ORDER BY clause.