How do aggregates (GROUP BY) work on SQL Server?

How does SQL Server implement group by clauses (aggregates)?
As inspiration, take the execution plan of this question's query:
select p_id, DATEDIFF(D, MIN(TreatmentDate), MAX(TreatmentDate))
from patientsTable
group by p_id
Before querying the data, a simple select statement and its execution plan look like this:
After retrieving the data with the query above, the execution plan looks like this:

Usually it's a Stream Aggregate or a Hash Aggregate.
A Stream Aggregate sorts the resultset, scans it and emits a result each time the grouping value changes (i.e. differs from the previous one in the scan). This way it only needs to keep a single set of aggregate state variables.
A Hash Aggregate builds a hash table from the resultset. Each entry keeps the aggregate state variables, which are initialized on a hash miss and updated on a hash hit.
Let's see how AVG works. It needs two state variables: sum and count. Take this sample resultset:
grouper  value
1        4
1        3
2        8
1        7
2        1
1        2
2        6
2        3
Stream Aggregate
First, it needs to sort the values:
grouper  value
1        4
1        3
1        7
1        2
2        8
2        1
2        6
2        3
Then, it keeps one set of state variables, initialized to 0, and scans the sorted resultset:
grouper  value    sum  count
-- Entered
-- Variables:       0      0
1        4          4      1
1        3          7      2
1        7         14      3
1        2         16      4
-- Group change. Return the result and reinitialize the variables
-- Returning 1, 4
-- Variables:       0      0
2        8          8      1
2        1          9      2
2        6         15      3
2        3         18      4
-- Group change. Return the result and reinitialize the variables
-- Returning 2, 4.5
-- End
Hash Aggregate
Just scanning the values and keeping the state variables in the hash table:
grouper  value   hash table
-- Hash miss. Adding new entry to the hash table
--               [1] (0, 0)
-- ... and updating it:
1        4       [1] (4, 1)
-- Hash hit. Updating the entry:
1        3       [1] (7, 2)
-- Hash miss. Adding new entry to the hash table
--               [1] (7, 2)   [2] (0, 0)
-- ... and updating it:
2        8       [1] (7, 2)   [2] (8, 1)
1        7       [1] (14, 3)  [2] (8, 1)
2        1       [1] (14, 3)  [2] (9, 2)
1        2       [1] (16, 4)  [2] (9, 2)
2        6       [1] (16, 4)  [2] (15, 3)
2        3       [1] (16, 4)  [2] (18, 4)
-- Scanning the hash table and returning the aggregated values
-- 1  4
-- 2  4.5
Usually, the sort-based Stream Aggregate is faster if the resultset is already ordered (for instance, when the values come from an index or from a resultset sorted by a previous operation).
The Hash Aggregate is faster if the resultset is not sorted (hashing is cheaper than sorting).
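Which operator the optimizer picks can also be steered for testing: SQL Server supports the ORDER GROUP and HASH GROUP query hints. A minimal sketch against the question's patientsTable (assumed to exist as described):
-- Force a Stream Aggregate (the optimizer adds a Sort if no suitable index exists)
select p_id, DATEDIFF(D, MIN(TreatmentDate), MAX(TreatmentDate))
from patientsTable
group by p_id
option (order group);
-- Force a Hash Aggregate
select p_id, DATEDIFF(D, MIN(TreatmentDate), MAX(TreatmentDate))
from patientsTable
group by p_id
option (hash group);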
MIN and MAX are special cases, since they don't require scanning the whole group: only the first and the last values of the aggregated column within each group are needed.
Unfortunately SQL Server, unlike most other systems, cannot exploit this efficiently, since it's not good at doing an INDEX SKIP SCAN (jumping over distinct index keys).
While a simple MAX or MIN (without a GROUP BY clause) uses a TOP method when an index on the aggregated column is present, MIN and MAX with GROUP BY use the same methods as other aggregate functions do.
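To illustrate the TOP method mentioned above, here is a minimal sketch, assuming an index on TreatmentDate exists (the rewrite shows the effective shape of the plan, not literal optimizer output):
-- A simple MIN without GROUP BY:
select min(TreatmentDate) from patientsTable;
-- can be answered by reading a single key from the index,
-- which is what the TOP-based plan amounts to:
select top (1) TreatmentDate
from patientsTable
where TreatmentDate is not null   -- MIN ignores NULLs, so the TOP form must too
order by TreatmentDate;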

As I don't have the table available, I tried with my custom table PRICE.
I defined the primary key as ID_PRICE.
Select PRICE.ID_PRICE, Max(PRICE.COSTKVARH) - min(PRICE.COSTKVARH) from PRICE
GROUP BY PRICE.ID_PRICE
Plan:
PLAN (PRICE ORDER PK_PRICE)
Adapted plan:
PLAN (PRICE ORDER PK_PRICE)
In your case p_id is the primary key, so the adapted plan will first order patientsTable by p_id, and then the grouping and the difference calculation will happen.

Related

INSERT rows into SQL Server by looping through a column with numbers

Let's say I have a very basic table:
DAY_ID  Value  Inserts
5       8      2
4       3      0
3       3      0
2       4      1
1       8      0
I want to be able to "loop" through the Inserts column and add that many rows.
For each added row, I want DAY_ID to be decreased by 1 and Value to remain the same; the Inserts column is irrelevant, we can set it to 0.
So 2 new rows should be added from DAY_ID = 5 and Value = 8, and 1 new row with DAY_ID = 2 and Value = 4. The final output of the new rows would be:
DAY_ID  Value  Inserts
(5-1)   8      0
(5-2)   8      0
(2-1)   4      0
I haven't tried much in SQL Server. I was able to create a solution in R and Python using arrays, but I'm really hoping I can make something work in SQL Server for this project.
I think this can be done using a loop in SQL.
Looping is generally not the way you solve any problems in SQL - SQL is designed and optimized to work with sets, not one row at a time.
Consider this source table:
CREATE TABLE dbo.src(DAY_ID int, Value int, Inserts int);
INSERT dbo.src VALUES
(5, 8, 2),
(4, 3, 0),
(3, 3, 0),
(2, 4, 1),
(1, 8, 0);
There are many ways to "explode" a set based on a single value. One is to split a string of commas replicated to the value less one (STRING_SPLIT then returns one row per element, i.e. Inserts rows):
-- INSERT dbo.src(DAY_ID, Value, Inserts)
SELECT
  -- @@SPID is just a cheap expression to satisfy the ORDER BY required by ROW_NUMBER()
  DAY_ID = DAY_ID - ROW_NUMBER() OVER (PARTITION BY DAY_ID ORDER BY @@SPID),
  src.Value,
  Inserts = 0
FROM dbo.src
CROSS APPLY STRING_SPLIT(REPLICATE(',', src.Inserts - 1), ',') AS v
WHERE src.Inserts > 0;
Output:
DAY_ID  Value  Inserts
1       4      0
4       8      0
3       8      0
Working example in this fiddle.
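On SQL Server 2022 and later, the same "explode" can be written with GENERATE_SERIES instead of splitting replicated commas. A sketch under that version assumption (and assuming the correlated column is accepted via CROSS APPLY), against the same dbo.src table:
-- INSERT dbo.src(DAY_ID, Value, Inserts)
SELECT
  DAY_ID  = s.DAY_ID - n.value,   -- n.value runs 1..Inserts, so DAY_ID-1, DAY_ID-2, ...
  s.Value,
  Inserts = 0
FROM dbo.src AS s
CROSS APPLY GENERATE_SERIES(1, s.Inserts) AS n
WHERE s.Inserts > 0;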

Performant query count of rows within range over sequence

I have a SQLite table with an Id and an active period, and I am trying to get counts of the number of active rows over a sequence of times.
A vastly simplified version of this table is:
CREATE TABLE Data (
EntityId INTEGER NOT NULL,
Start INTEGER NOT NULL,
Finish INTEGER
);
With some example data
INSERT INTO Data VALUES
(1, 0, 2),
(1, 4, 6),
(1, 8, NULL),
(2, 5, 7),
(2, 9, NULL),
(3, 8, NULL);
And a desired output of something like:
Time  Count
0     1
1     1
2     0
3     0
4     1
5     2
6     1
7     0
8     2
9     3
For which I am querying with:
WITH RECURSIVE Generate_Time(Time) AS (
SELECT 0
UNION ALL
SELECT Time + 1 FROM Generate_Time
WHERE Time + 1 <= (SELECT MAX(Start) FROM Data)
)
SELECT Time, COUNT(EntityId)
FROM Data
JOIN Generate_Time ON Start <= Time AND (Finish > Time OR Finish IS NULL)
GROUP BY Time
There is also some data I need to categorise the counts by (some of it is on the original table, some requires a join), but I am hitting a performance bottleneck in the order of seconds on even small amounts of data (~25,000 rows) without any of that.
I have added an index on the table covering Start and Finish:
CREATE INDEX Ix_Data ON Data (
Start,
Finish
);
and that helped somewhat but I can't help but feel there's a more elegant & performant way of doing this. Using the CTE to iterate over a range doesn't seem like it will scale very well but I can't think of another way to calculate what I need.
I've been looking at the query plan too, and I think the slow part is the GROUP BY, since it can't use an index for that (the rows come from the CTE), so SQLite generates a temporary B-tree:
3 0 0 MATERIALIZE 3
7 3 0 SETUP
8 7 0 SCAN CONSTANT ROW
21 3 0 RECURSIVE STEP
22 21 0 SCAN TABLE Generate_Time
27 21 0 SCALAR SUBQUERY 2
32 27 0 SEARCH TABLE Data USING COVERING INDEX Ix_Data
57 0 0 SCAN SUBQUERY 3
59 0 0 SEARCH TABLE Data USING INDEX Ix_Data (Start<?)
71 0 0 USE TEMP B-TREE FOR GROUP BY
Any suggestions of a way to speed this query up, or even a better way of storing this data to craft a tighter query would be most welcome!
To get to the desired output as per your question, the following can be done.
For better performance, one option is to make use of generate_series to generate the rows instead of the recursive CTE, limiting the number of rows to the maximum value available in Data.
WITH RECURSIVE Generate_Time(Time) AS (
SELECT 0
UNION ALL
SELECT Time + 1 FROM Generate_Time
WHERE Time + 1 <= (SELECT MAX(Start) FROM Data)
)
SELECT gt.Time
,count(d.entityid)
FROM Generate_Time gt
LEFT JOIN Data d
ON gt.Time between d.start and IFNULL(d.finish,gt.Time)
GROUP BY gt.Time
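If the SQLite build includes the generate_series table-valued function (the "series" extension, compiled into many distributions), the row generation itself can be replaced as well. A sketch under that assumption, keeping the original "active at time" condition:
SELECT gt.value AS Time,
       COUNT(d.EntityId) AS "Count"
FROM generate_series(0, (SELECT MAX(Start) FROM Data)) AS gt
LEFT JOIN Data d
  ON d.Start <= gt.value AND (d.Finish > gt.value OR d.Finish IS NULL)
GROUP BY gt.value
ORDER BY gt.value;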
This ended up being simply a case of the result set being too large. In my real data, the result set before grouping was ~19,000,000 records. I was able to do some partitioning on my client side, splitting the queries into smaller discrete chunks which improved performance ~10x, which still wasn't quite as fast as I wanted but was acceptable for my use case.

Separating an Oracle query with 1.8 million rows into 40,000-row blocks

I have a project where I am taking Documents from one system and importing them into another.
The first system has the documents and associated keywords stored. I have a query that will return the results which will then be used as the index file to import them into the new system. There are about 1.8 million documents involved so this means 1.8 million rows (One per document).
I need to divide the returned results into blocks of 40,000 so the documents can be imported in batches of 40,000 at a time rather than in one long import.
I have the query to return the results I need; I just need to know how to take that and break it up for easier import. My apologies if I have included too little information. This is my first time here asking for help.
Use the built-in function ORA_HASH to divide the rows into 45 buckets of roughly the same number of rows. For example:
select * from some_table where ora_hash(id, 44) = 0;
select * from some_table where ora_hash(id, 44) = 1;
...
select * from some_table where ora_hash(id, 44) = 44;
The function is deterministic and will always return the same result for the same input. The resulting number starts with 0 - which is normal for a hash, but unusual for Oracle, so the query may look off-by-one at first. The hash works better with more distinct values, so pass in the primary key or another unique value if possible. Don't use a low-cardinality column, like a status column, or the buckets will be lopsided.
This process is in some ways inefficient, since you're re-reading the same table 45 times. But since you're dealing with documents, I assume the table scanning won't be the bottleneck here.
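Before kicking off the 45 exports, it's easy to sanity-check how evenly the hash distributes the rows. A quick sketch using the same placeholder names as above (some_table, id):
select ora_hash(id, 44) as bucket, count(*) as rows_in_bucket
from some_table
group by ora_hash(id, 44)
order by bucket;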
A preferred way to bucket the IDs is to use the NTILE analytic function.
I'll demonstrate this on a simplified example with a table of 18 rows that should be divided into four chunks.
select listagg(id,',') within group (order by id) from tab;
1,2,3,7,8,9,10,15,16,17,18,19,20,21,23,24,25,26
Note that the IDs are not consecutive, so no simple arithmetic can be used. NTILE takes the requested number of buckets (4) as a parameter and calculates the chunk_id:
select id,
ntile(4) over (order by ID) as chunk_id
from tab
order by id;
        ID   CHUNK_ID
---------- ----------
         1          1
         2          1
         3          1
         7          1
         8          1
         9          2
        10          2
        15          2
        16          2
        17          2
        18          3
        19          3
        20          3
        21          3
        23          4
        24          4
        25          4
        26          4
18 rows selected.
NTILE distributes the rows as evenly as possible: bucket sizes differ by at most one row, with the larger buckets coming first (here buckets 1 and 2 get 5 rows, buckets 3 and 4 get 4).
If you want to calculate the ranges, use simple aggregation:
with chunk as (
select id,
ntile(4) over (order by ID) as chunk_id
from tab)
select chunk_id, min(id) ID_from, max(id) id_to
from chunk
group by chunk_id
order by 1;
  CHUNK_ID    ID_FROM      ID_TO
---------- ---------- ----------
         1          1          8
         2          9         17
         3         18         21
         4         23         26
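Applied to the original problem, the same pattern with 45 buckets yields chunks of roughly 40,000 rows each, together with their ID ranges. A sketch with assumed names (documents, doc_id) standing in for the real table and key:
with chunk as (
  select doc_id,
         ntile(45) over (order by doc_id) as chunk_id
  from documents)
select chunk_id, min(doc_id) as id_from, max(doc_id) as id_to, count(*) as rows_in_chunk
from chunk
group by chunk_id
order by 1;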

Misleading count of 1 on JOIN in Postgres 11.7

I've run into a subtlety around count(*) and join, and am hoping to get some confirmation that I've figured out what's going on correctly. For background, we commonly convert continuous timeline data into discrete bins, such as hours. And since we don't want gaps for bins with no content, we'll use generate_series to synthesize the buckets we want values for. If there's no entry for, say, 10AM, fine, we still get a result. However, I noticed that I'm sometimes getting 1 instead of 0. Here's what I'm trying to confirm:
The count is 1 if you count the "grid" series, and 0 if you count the data table.
This only has to do with count, and no other aggregate.
The code below sets up some sample data to show what I'm talking about:
DROP TABLE IF EXISTS analytics.measurement_table CASCADE;
CREATE TABLE IF NOT EXISTS analytics.measurement_table (
hour smallint NOT NULL DEFAULT NULL,
measurement smallint NOT NULL DEFAULT NULL
);
INSERT INTO measurement_table (hour, measurement)
VALUES ( 0, 1),
( 1, 1), ( 1, 1),
(10, 2), (10, 3), (10, 5);
Here are the goal results for the query. I'm using 12 hours to keep the example results shorter.
Hour Count sum
0 1 1
1 2 2
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
10 3 10
11 0 0
12 0 0
This works correctly:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(measurement_table.hour) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (measurement_table.hour = hour_series.hour)
GROUP BY 1
ORDER BY 1
This returns misleading 1's where there is no match:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(*) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (hour_series.hour = measurement_table.hour)
GROUP BY 1
ORDER BY 1
0 1 1
1 2 2
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
10 3 10
11 1 0
12 1 0
The only difference between these two examples is the count term:
count(*) -- A result of 1 on no match, and a correct count otherwise.
count(joined-to table field) -- 0 on no match, a correct count otherwise.
That seems to be it: you've got to make it explicit that you're counting the data table. Otherwise you get a count of 1, since the series row matches once. Is this a nuance of joining, or a nuance of count in Postgres?
Does this impact any other aggregate? It seems like it shouldn't.
P.S. generate_series is just about the best thing ever.
You figured out the problem correctly: count() behaves differently depending on the argument it is given.
count(*) counts how many rows belong to the group. This just cannot be 0 since there is always at least one row in a group (otherwise, there would be no group).
On the other hand, when given a column name or expression as its argument, count() takes into account any non-null value and ignores null values. For your query, this lets you distinguish groups that have no match in the left-joined table from groups that do have matches.
Note that this behavior is not Postgres-specific but belongs to the ANSI SQL standard (all databases that I know of conform to it).
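A minimal way to see the difference in isolation, using an inline VALUES list (hypothetical data, not from the question):
select count(*) as all_rows,       -- counts every row, NULL or not
       count(x) as non_null_only   -- skips the NULL
from (values (1), (null), (3)) as t(x);
-- all_rows = 3, non_null_only = 2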
Bottom line:
in general cases, use count(*); it is more efficient, since the database does not need to check for nulls (and it makes clear to the reader of the query that you just want to know how many rows belong to the group)
in specific cases such as yours, put the relevant expression inside count()

MonetDB: Enumerate groups of rows based on a given "boundary" condition

Consider the following table:
id gap groupID
0 0 1
2 3 1
3 7 2
4 1 2
5 5 2
6 7 3
7 3 3
8 8 4
9 2 4
Where groupID is the desired, computed column, such that its value is incremented whenever the gap column is greater than a threshold (in this case 6). The id column defines the sequential order of appearance of the rows (and it's already given).
Can you please help me figure out how to dynamically fill out the appropriate values for groupID?
I have looked at several other entries here on StackOverflow, and I've seen sum used as an aggregate in a window function. I can't use sum because it's not supported in MonetDB window functions (only rank, dense_rank, and row_number are). I can't use triggers (to modify the record insertion before it takes place) either, because I need to keep the data mentioned above within a stored function in a local temporary table, and trigger declarations are not supported in MonetDB function definitions.
I have also tried filling out the groupID column by reading the previous table (id and gap) into another temporary table (id, gap, groupID), with the hope that this would force a row-by-row operation. But this failed as well, because it assigns groupID 0 to all records:
declare threshold int;
set threshold = 6;
insert into newTable( id, gap, groupID )
select A.id, A.gap,
case when A.gap > threshold then
(select case when max(groupID) is null then 0 else max(groupID)+1 end from newTable)
else
(select case when max(groupID) is null then 0 else max(groupID) end from newTable)
end
from A
order by A.id asc;
Any help, tip, or reference is greatly appreciated. It's been a long time already trying to figure this out.
BTW: Cursors are not supported in MonetDB either.
You can assign the group using a correlated subquery. Simply count how many gap values at or before the current row exceed 6:
select id, gap,
(select 1 + count(*)
from t as t2
where t2.id <= t.id and t2.gap > 6
) as Groupid
from t;
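For comparison only (the question notes that MonetDB does not support SUM as a window function), most other engines would express the same grouping as a running sum over a boundary flag:
-- Not usable in MonetDB per the question; shown only as the equivalent windowed form.
select id, gap,
       1 + sum(case when gap > 6 then 1 else 0 end) over (order by id) as groupid
from t;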