How to avoid exponential time cost with SQL Server SUM function?

I realized my query takes exponentially longer every time I use a SUM function...
For Example, the below code takes 2 seconds
SELECT sub.a, SUM(sub.b)
FROM (
    SELECT a, b, c
    FROM temp
) sub
GROUP BY a;
And using a second SUM now takes 4 seconds and so on...
SELECT sub.a, SUM(sub.b), SUM(sub.c)
FROM (
    SELECT a, b, c
    FROM temp
) sub
GROUP BY a;
It seems the subquery is being executed again for every SUM I add. If this is correct, what would be the best practice to avoid the time cost?
The example above just represents the question in the most basic way.

TL;DR: No, this is completely wrong.
When you run a query in SQL Server, the optimizer compiles it into the most efficient method it can find. You can see the result by clicking Include Actual Execution Plan in SSMS.
For the query you specify, it would typically do something like this:
It notes that the subquery can be inlined into the query, and does so:
SELECT sub.a, SUM(sub.b), SUM(sub.c)
FROM temp
GROUP BY a;
It then evaluates the best way to aggregate the table by a values. Let's assume there is no index at all; a Hash Aggregate would most likely be chosen here.
On execution, every row is fed into the Hash Aggregate, which builds up an in-memory hash table with a values as the key. Each row is looked up based on a, and a key is added to the hash table if it hasn't been seen before. Then the b and c values are added to the running totals for that key.
Let's say you have an index on (a, b, c). Now a much faster method is possible, called a Stream Aggregate, because the values pass through the aggregate already sorted by a.
Each row passes through the aggregate. If the a value is the same as the row before, its b and c values are added to whatever we have so far. When the a value changes, the existing result is output, and we start aggregating again.
It is true that summing extra columns is extra overhead, but it's pretty small compared to reading the table off disk or hashing, which is only done once for the whole query.
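For reference, a minimal sketch of the kind of index described above, assuming the table and columns from the question (the index name is made up; the right design depends on your real schema):
-- Key the index on a, and carry b and c in the leaf level so the
-- aggregate can be computed from the index alone, in a order.
CREATE INDEX IX_temp_a ON temp (a) INCLUDE (b, c);
-- With rows available in a order, the optimizer can use a Stream Aggregate
-- instead of hashing; each extra SUM then only adds a small per-row cost.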

Related

Are there any database implementations that allow for tables that don't contain data but generate data upon query?

I have an application that works well with database query outputs but now need to run with each output over a range of numbers. Sure, I could refactor the application to iterate over the range for me, but it would arguably be cleaner if I could just have a "table" in the database that I could CROSS JOIN with my normal query outputs. Sure, I could just make a table that contains a range of values, but that seems like unnecessary waste.
For example a "table" in a database that represents a range of values, say 0 to 999,999 in a column called "number" WITHOUT having to actually store a million rows, but can be used in a query with a CROSS JOIN with another table as though there actually existed such a table.
I am mostly just curious if such a construct exists in any database implementation.
PostgreSQL has generate_series. SQLite has it as a loadable extension.
SELECT * FROM generate_series(0,9);
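A hedged example of the CROSS JOIN the question describes, using PostgreSQL's generate_series; my_table here is a hypothetical stand-in for your normal query output:
-- Nothing is stored: the series rows are produced on the fly at query time.
SELECT t.*, n.number
FROM my_table AS t
CROSS JOIN generate_series(0, 999999) AS n(number);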
On databases which support recursive CTE (SQLite, PostgreSQL, MariaDB), you can do this and then join with it.
WITH RECURSIVE cnt(x) AS (
    VALUES(0)
    UNION ALL
    SELECT x+1 FROM cnt WHERE x < 1000000
)
SELECT x FROM cnt;
The initial select (VALUES(0)) runs first and returns a single row containing 0, which is placed on a queue. The database then repeatedly takes one row off the queue, adds it to the contents of cnt, and runs the recursive select with that single row as if it were the complete content of cnt, producing a new row with the value incremented by 1, which goes back on the queue. So the 0 row produces 1, the 1 row produces 2, and so on. This continues until the row containing 1000000 is taken off the queue and added to cnt; at that point the WHERE clause makes the recursive select return no rows, the queue stays empty, and the recursion stops.
Generally speaking, this depends a lot on the database you're using. In SQLite, for example, suppose you want to generate a sequence from 1 to 100. You could write it like this:
WITH RECURSIVE basic(i) AS (
    VALUES(1)
),
seq(i) AS (
    SELECT i FROM basic
    UNION ALL
    SELECT i + 1 FROM seq WHERE i < 100
)
SELECT * FROM seq;
Hope this helps.
Looks like the answer to my question "Are there any database implementations that allow for tables that don't contain data but generate data upon query?" is yes. For example, in SQLite there are virtual tables: https://www.sqlite.org/vtab.html
In fact, it has the exact sort of thing I was looking for with generate_series: https://www.sqlite.org/series.html

Query applying Where clause BEFORE Joins?

I'm honestly really confused here, so I'll try to keep it simple.
We have Table A:
id
Table B:
id || number
Table A is a "prefilter" to B, since B contains a lot of different objects, including A.
So here is my query, trying to get all A's with a filter:
SELECT * FROM A a
JOIN B b ON b.id = a.id
WHERE CAST(SUBSTRING(b.number, 2, 30) AS integer) between 151843 and 151865
Since ALL instances of A start with a letter ("X******"), I just want to strip the first letter to let the filter do its work with the number specified by the user.
At first glance, there should be absolutely no worries. But it seems I was wrong. And on something I didn't expect to be...
It seems like my WHERE clause is executed BEFORE my JOIN. Therefore, since many B's have number with more than one Letter at the start, I have an invalid conversion happening. Despite the fact that it would NEVER happen if we stay in A's.
I always thought that the WHERE clause was executed after joins, but in this case, it seems Postgres wants to prove me wrong.
Any explanations ?
SQLFiddle demonstrating problem: http://sqlfiddle.com/#!15/cd7e6e/7
And even with a subquery, it still throws the same error...
You can use the regex substring function to remove everything but digits: CAST(substring(B.number from '\d+') AS integer).
See working example here: http://sqlfiddle.com/#!15/cd7e6e/18
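Applied to the original query, that would look something like this (just a sketch, reusing the tables and range from the question):
SELECT *
FROM A a
JOIN B b ON b.id = a.id
WHERE CAST(substring(b.number from '\d+') AS integer) BETWEEN 151843 AND 151865;
-- substring(... from '\d+') never returns letters and yields NULL when there are
-- no digits at all, so the cast cannot fail no matter when the filter is evaluated.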
SQL is a declarative language. For a select statement, you declare the criteria the data you are looking for must meet. You don't get to choose the execution path, your query isn't executed procedurally.
Thus the optimizer is free to choose any execution plan it likes, as long as it returns records specified by your criteria.
I suggest you change your query to cast to string instead of to integer. Something like:
WHERE SUBSTRING(b.number, 2, 30) between CAST(151843 AS varchar) and CAST(151865 AS varchar)
Do the records of A that are in B have the same id in table B as in A? If those records are inserted in a different order, this may not be the case, and the query may therefore return different records than expected.

How to speed up min/max aggregates in Postgres without an index that is unnecessary otherwise

Say I have a table with an int column, and all I am ever going to read from it is the MAX() int value.
If I create an index on that column, Postgres can perform a reverse scan of that index to get the MAX() value. But since all except one row in the index are just overhead, can we get the same performance without having to create the full index?
Yes, one can create a trigger to update a single-row table that tracks the MAX value, and query that table instead of issuing a MAX() against the main table. But I am looking for something elegant, because I know Postgres has partial indexes, and I can't seem to find a way to leverage them for this purpose.
Update: This partial-index definition is ideally what I'd like, but Postgres does not allow subqueries in the WHERE clause of a partial-index.
create index on test(a) where a = (select max(a) from test);
You cannot use aggregate functions or subquery expressions in the predicate of a partial index. That would make no sense logically anyway, given the IMMUTABLE nature of index entries.
If you have a range of integers and you can guarantee that the maximum will always be greater than x, you can benefit from this meta-information, though.
CREATE INDEX test_max_idx ON test (a) WHERE a > x;
This index will only be used by the query planner if you include a WHERE clause that matches the index predicate. For instance:
SELECT max(a) FROM test WHERE a > x;
There can be more conditions, but this one has to be included to use the index. (WHERE a > x + 123 would work, too.)
I am serious about "guarantee", though. Your query will return nothing if the predicate is false.
You could build in a fail-safe:
SELECT COALESCE( (SELECT max(a) FROM test WHERE a > x)
, (SELECT max(a) FROM test));
You can generalize this approach with more than one partial index. Similar to this answer, just a lot simpler:
Can spatial index help a "range - order by - limit" query
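A rough sketch of that generalization (x1 and x2 are placeholder thresholds, x1 < x2, that you would have to guarantee just like x above):
CREATE INDEX test_a_gt_x1_idx ON test (a) WHERE a > x1;
CREATE INDEX test_a_gt_x2_idx ON test (a) WHERE a > x2;

SELECT COALESCE( (SELECT max(a) FROM test WHERE a > x2)
               , (SELECT max(a) FROM test WHERE a > x1)
               , (SELECT max(a) FROM test));
-- COALESCE only evaluates later branches if the earlier ones return NULL,
-- so the smallest (cheapest) partial index is tried first.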
I would consider the trigger approach, except for very heavy write loads on the table.
The other rows in the index are not unnecessary because they enable you to keep the max accurate even in case of deletions, or in case of updates reducing the current maximum.
If you don't have such operations (IOW the max only ever increases) you can maintain the max value yourself. Do it in the application code or in a trigger.
Postgres cannot know that the max will only ever increase. It has to uphold the capability to do deletes and updates.
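If you go the trigger route, a minimal sketch in PostgreSQL might look like this (the single-row helper table and the names test_max / track_max_a are assumptions; test(a) is the table from the question):
-- Single-row table holding the current maximum.
CREATE TABLE test_max (max_a integer);
INSERT INTO test_max VALUES (NULL);

CREATE FUNCTION track_max_a() RETURNS trigger AS $$
BEGIN
  -- Only valid if the max can never shrink, i.e. no deletes or decreasing updates.
  UPDATE test_max SET max_a = GREATEST(COALESCE(max_a, NEW.a), NEW.a);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER test_max_ins AFTER INSERT ON test
  FOR EACH ROW EXECUTE FUNCTION track_max_a();

-- Reading the current max is then a single-row lookup:
SELECT max_a FROM test_max;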

How does SQL choose which row to show when grouping multiple rows together?

Consider the following table:
CREATE TABLE t
(
    a INTEGER NOT NULL,
    b INTEGER NOT NULL,
    c INTEGER,
    PRIMARY KEY (a, b)
)
Now if I do this:
SELECT a,b,c FROM t GROUP BY a;
I expect to get each distinct value of a only once. But since I'm asking for b and c as well, it's going to give me a row for every value of a. Therefore, if, for a single value of a, there are many rows to choose from, how can I predict which row SQL will choose? My tests show that it chooses to return the row for which b is the greatest. But what is the logic in that? How would this apply to strings, blobs, dates or anything else?
My question: How does SQL choose which row to show when grouping multiple rows together?
btw: My particular problem concerns SQLite3, but I'm guessing this is an SQL issue not dependent on the DBMS...
That shouldn't actually work in a decent DBMS :-)
Any column not used in the group by clause should be subject to an aggregation function, such as:
select a, max(b), sum(c) from t group by a
If it doesn't complain in SQLite (and I have no immediate reason to doubt you), I'd just put it down to the way the DBMS is built. From memory, there are a few areas where it doesn't worry too much about the "purity" of the data (such as every column being able to hold multiple types, with the type belonging to the data in that row/column intersection rather than to the column specification).
All the SQL engines that I know will complain about the query that you mentioned with an error message like "b and c appear in the field list but not in the group by list". You are only allowed to use b or c in an aggregate function (like MAX / MIN / COUNT / AVG whatever) or you'll be forced to add them in the GROUP BY list.
You're not quite correct in your assumption that this is RDBMS-independent. Most RDBMS don't allow selecting fields that are not also in the GROUP BY clause. Exceptions to this (to my knowledge) are SQLite and MySQL. In general, you shouldn't do this, because values for b and c are chosen pretty arbitrarily (depending on the applied grouping algorithm). Even if this may be documented in your database, it's always better to express a query in a way that fully and unambiguously specifies the outcome.
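If what you actually want is, say, the row with the largest b for each a, a fully specified way to write that (a sketch against the table t from the question) is a correlated subquery:
SELECT t1.a, t1.b, t1.c
FROM t AS t1
WHERE t1.b = (SELECT MAX(t2.b) FROM t AS t2 WHERE t2.a = t1.a);
-- This pins down exactly which row is returned per a, instead of leaving the
-- choice of b and c to whatever the grouping implementation happens to do.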
It's not a matter of what the database will choose, but of the order in which your data are going to be returned.
Your primary key is handling your sort order by default, since you didn't provide one.
You can use ORDER BY a, c if that's what you want.

What is the most efficient way to count rows in a table in SQLite?

I've always just used "SELECT COUNT(1) FROM X" but perhaps this is not the most efficient. Any thoughts? Other options include SELECT COUNT(*) or perhaps getting the last inserted id if it is auto-incremented (and never deleted).
How about if I just want to know if there is anything in the table at all? (e.g., count > 0?)
The best way is to make sure that you run SELECT COUNT on a single column (SELECT COUNT(*) is slower) - but SELECT COUNT will always be the fastest way to get a count of things (the database optimizes the query internally).
If you check out the comments below, you can see arguments for why SELECT COUNT(1) is probably your best option.
To follow up on girasquid's answer, as a data point, I have a SQLite table with 2.3 million rows. Using select count(*) from table, it took over 3 seconds to count the rows. I also tried using SELECT rowid FROM table (thinking that rowid is a default primary indexed key), but that was no faster. Then I made an index on one of the fields in the database (just an arbitrary field, but I chose an integer field because I knew from past experience that indexes on short fields can be very fast, I think because the index stores a copy of the value in the index itself). SELECT my_short_field FROM table brought the time down to less than a second.
If you are sure (really sure) that you've never deleted any row from that table, and your table has not been defined with the WITHOUT ROWID optimization, you can get the number of rows by calling:
select max(RowId) from table;
Or if your table is a circular queue you could use something like
select MaxRowId - MinRowId + 1 from
(select max(RowId) as MaxRowId from table) JOIN
(select min(RowId) as MinRowId from table);
This is really, really fast (milliseconds), but you must pay attention: SQLite only says that the row id is unique among all rows in the same table. It does not declare that row ids are, and will always be, consecutive numbers.
The fastest way to get row counts is directly from the table metadata, if any. Unfortunately, I can't find a reference for this kind of data being available in SQLite.
Failing that, any query of the type
SELECT COUNT(non-NULL constant value) FROM table
should optimize to avoid the need for a table, or even an index, scan. Ideally the engine will simply return the current number of rows known to be in the table from internal metadata. Failing that, it simply needs to know the number of entries in the index of any non-NULL column (the primary key index being the first place to look).
As soon as you introduce a column into the SELECT COUNT you are asking the engine to perform at least an index scan and possibly a table scan, and that will be slower.
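For the second part of the question (just checking whether the table contains anything at all), you don't need a count; a minimal sketch in SQLite, assuming the table name X from the question:
-- Returns 1 if at least one row exists, 0 otherwise; EXISTS lets the engine
-- stop after finding the first row instead of scanning the whole table.
SELECT EXISTS (SELECT 1 FROM X);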
I do not believe you will find a special method for this. However, you could do your select count on the primary key to be a little bit faster.
sp_spaceused 'table_name' (substitute your table name)
This will return the number of rows in that table; it is the most efficient way I have come across yet.
It's more efficient than SELECT COUNT(1) FROM table_name.
sp_spaceused can be used for any table. It's very helpful when the table is exceptionally big (hundreds of millions of rows): it returns the number of rows right away, whereas SELECT COUNT(1) might take more than 10 seconds. Moreover, it does not need any column names or key fields to consider.