I have a SQLite table with an Id and an active period, and I am trying to get counts of the number of active rows over a sequence of times.
A vastly simplified version of this table is:
CREATE TABLE Data (
EntityId INTEGER NOT NULL,
Start INTEGER NOT NULL,
Finish INTEGER
);
With some example data
INSERT INTO Data VALUES
(1, 0, 2),
(1, 4, 6),
(1, 8, NULL),
(2, 5, 7),
(2, 9, NULL),
(3, 8, NULL);
And a desired output of something like:
Time  Count
0     1
1     1
2     0
3     0
4     1
5     2
6     1
7     0
8     2
9     3
For which I am querying with:
WITH RECURSIVE Generate_Time(Time) AS (
SELECT 0
UNION ALL
SELECT Time + 1 FROM Generate_Time
WHERE Time + 1 <= (SELECT MAX(Start) FROM Data)
)
SELECT Time, COUNT(EntityId)
FROM Data
JOIN Generate_Time ON Start <= Time AND (Finish > Time OR Finish IS NULL)
GROUP BY Time
There is also some data I need to categorise the counts by (some of it is on the original table, some requires a join), but I am hitting a performance bottleneck on the order of seconds even on small amounts of data (~25,000 rows) without any of that.
I have added an index on the table covering Start/Finish:
CREATE INDEX Ix_Data ON Data (
Start,
Finish
);
and that helped somewhat but I can't help but feel there's a more elegant & performant way of doing this. Using the CTE to iterate over a range doesn't seem like it will scale very well but I can't think of another way to calculate what I need.
I've been looking at the query plan too, and I think the slow part is the GROUP BY, since it can't use an index for that (the times come from the CTE), so SQLite generates a temporary B-tree:
3 0 0 MATERIALIZE 3
7 3 0 SETUP
8 7 0 SCAN CONSTANT ROW
21 3 0 RECURSIVE STEP
22 21 0 SCAN TABLE Generate_Time
27 21 0 SCALAR SUBQUERY 2
32 27 0 SEARCH TABLE Data USING COVERING INDEX Ix_Data
57 0 0 SCAN SUBQUERY 3
59 0 0 SEARCH TABLE Data USING INDEX Ix_Data (Start<?)
71 0 0 USE TEMP B-TREE FOR GROUP BY
Any suggestions of a way to speed this query up, or even a better way of storing this data to craft a tighter query would be most welcome!
To get to the desired output as per your question, the following can be done.
For better performance, one option is to make use of generate_series to generate the rows instead of the recursive CTE, limiting it to the maximum Start value available in Data (see the sketch after the query below).
WITH RECURSIVE Generate_Time(Time) AS (
SELECT 0
UNION ALL
SELECT Time + 1 FROM Generate_Time
WHERE Time + 1 <= (SELECT MAX(Start) FROM Data)
)
SELECT gt.Time
,count(d.entityid)
FROM Generate_Time gt
LEFT JOIN Data d
ON gt.Time >= d.Start AND (d.Finish > gt.Time OR d.Finish IS NULL)
GROUP BY gt.Time
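For reference, a minimal sketch of the generate_series variant, assuming your SQLite build includes the series extension (it is bundled with the sqlite3 shell); the function exposes the generated number as a column named value:
-- Sketch only: requires the generate_series table-valued function.
SELECT gt.value AS Time,
       COUNT(d.EntityId) AS "Count"
FROM generate_series(0, (SELECT MAX(Start) FROM Data)) AS gt
LEFT JOIN Data d
       ON d.Start <= gt.value
      AND (d.Finish > gt.value OR d.Finish IS NULL)
GROUP BY gt.value
ORDER BY gt.value;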
This ended up being simply a case of the result set being too large. In my real data, the result set before grouping was ~19,000,000 records. I was able to do some partitioning on my client side, splitting the queries into smaller discrete chunks which improved performance ~10x, which still wasn't quite as fast as I wanted but was acceptable for my use case.
Related
Let's say I have a very basic table:
DAY_ID  Value  Inserts
5       8      2
4       3      0
3       3      0
2       4      1
1       8      0
I want to be able to "loop" through the Inserts column and add that many rows.
For each added row, I want DAY_ID to be decreased by 1 each time and Value to remain the same; the Inserts column is irrelevant, we can set it to 0.
So 2 new rows should be added from DAY_ID = 5 and Value = 8, and 1 new row with DAY_ID = 2 and Value = 4. The final output of the new rows would be:
DAY_ID  Value  Inserts
(5-1)   8      0
(5-2)   8      0
(2-1)   4      0
I haven't tried much in SQL Server, I was able to create a solution in R and Python using arrays, but I'm really hoping I can make something work in SQL Server for this project.
I think this can be done using a loop in SQL.
Looping is generally not the way you solve any problems in SQL - SQL is designed and optimized to work with sets, not one row at a time.
Consider this source table:
CREATE TABLE dbo.src(DAY_ID int, Value int, Inserts int);
INSERT dbo.src VALUES
(5, 8, 2),
(4, 3, 0),
(3, 3, 0),
(2, 4, 1),
(1, 8, 0);
There are many ways to "explode" a set based on a single value. One is to split a set of commas (replicated to the length of the value, less 1).
-- INSERT dbo.src(DAY_ID, Value, Inserts)
SELECT
DAY_ID = DAY_ID - ROW_NUMBER() OVER (PARTITION BY DAY_ID ORDER BY @@SPID),
src.Value,
Inserts = 0
FROM dbo.src
CROSS APPLY STRING_SPLIT(REPLICATE(',', src.Inserts-1), ',') AS v
WHERE src.Inserts > 0;
Output:
DAY_ID  Value  Inserts
1       4      0
4       8      0
3       8      0
Working example in this fiddle.
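For comparison, a sketch of the same row explosion using a small numbers CTE instead of STRING_SPLIT, against the dbo.src table above (the 100-row cap on the CTE is an assumed upper bound for Inserts):
WITH Numbers AS (
    SELECT 1 AS n
    UNION ALL
    SELECT n + 1 FROM Numbers WHERE n < 100  -- assumed maximum Inserts value
)
SELECT
    DAY_ID  = s.DAY_ID - n.n,
    s.Value,
    Inserts = 0
FROM dbo.src AS s
JOIN Numbers AS n
  ON n.n <= s.Inserts
OPTION (MAXRECURSION 100);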
I've run into a subtlety around count(*) and join, and am hoping to get some confirmation that I've figured out what's going on correctly. For background, we commonly convert continuous timeline data into discrete bins, such as hours. And since we don't want gaps for bins with no content, we'll use generate_series to synthesize the buckets we want values for. If there's no entry for, say, 10AM, fine, we still get a result. However, I noticed that I'm sometimes getting 1 instead of 0. Here's what I'm trying to confirm:
The count is 1 if you count the "grid" series, and 0 if you count the data table.
This only has to do with count, and no other aggregate.
The code below sets up some sample data to show what I'm talking about:
DROP TABLE IF EXISTS analytics.measurement_table CASCADE;
CREATE TABLE IF NOT EXISTS analytics.measurement_table (
hour smallint NOT NULL DEFAULT NULL,
measurement smallint NOT NULL DEFAULT NULL
);
INSERT INTO measurement_table (hour, measurement)
VALUES ( 0, 1),
( 1, 1), ( 1, 1),
(10, 2), (10, 3), (10, 5);
Here are the goal results for the query. I'm using 12 hours to keep the example results shorter.
Hour Count sum
0 1 1
1 2 2
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
10 3 10
11 0 0
12 0 0
This works correctly:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(measurement_table.hour) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (measurement_table.hour = hour_series.hour)
GROUP BY 1
ORDER BY 1
This returns misleading 1's on the match:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(*) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (hour_series.hour = measurement_table.hour)
GROUP BY 1
ORDER BY 1
0 1 1
1 2 2
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
10 3 10
11 1 0
12 1 0
The only difference between these two examples is the count term:
count(*) -- A result of 1 on no match, and a correct count otherwise.
count(joined-to table's field) -- 0 on no match, a correct count otherwise.
That seems to be it: you've got to make it explicit that you're counting the data table. Otherwise, you get a count of 1, since the series row matches once. Is this a nuance of joining, or a nuance of count in Postgres?
Does this impact any other aggregate? It seems like it shouldn't.
P.S. generate_series is just about the best thing ever.
You figured out the problem correctly: count() behaves differently depending on the argument it is given.
count(*) counts how many rows belong to the group. This just cannot be 0 since there is always at least one row in a group (otherwise, there would be no group).
On the other hand, when given a column name or expression as its argument, count() takes into account any non-null value and ignores null values. For your query, this lets you distinguish groups that have no match in the left-joined table from groups that do have matches.
Note that this behavior is not Postgres specific, but belongs to the standard
ANSI SQL specification (all databases that I know conform to it).
Bottom line:
in general cases, use count(*); this is more efficient, since the database does not need to check for nulls (and it makes clear to the reader of the query that you just want to know how many rows belong to the group)
in specific cases such as yours, put the relevant expression in the count() (as in the sketch below)
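If you prefer to keep count(*) in the query, Postgres (9.4 and later) can express the same distinction with a FILTER clause; a sketch against the example tables above:
SELECT hour_series.hour,
       count(measurement_table.hour) AS frequency,
       count(*) FILTER (WHERE measurement_table.hour IS NOT NULL) AS frequency_filtered, -- same result as frequency
       COALESCE(sum(measurement_table.measurement), 0) AS total
FROM generate_series(0, 12) AS hour_series(hour)
LEFT JOIN measurement_table ON measurement_table.hour = hour_series.hour
GROUP BY 1
ORDER BY 1;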
With example tables:
create table user_login (
user_id integer not null,
login_time numeric not null, -- seconds since epoch or similar
unique (user_id, login_time)
);
create table user_page_visited (
page_id integer not null,
page_visited_at numeric not null -- seconds since epoch or similar
);
with example data:
> user_login
#  user_id  login_time
1  1        100
2  1        140

> user_page_visited
#  page_id  page_visited_at
1  1        100
2  1        200
3  2        120
4  2        130
5  3        160
6  3        150
I wish to return all rows of user_page_visited that fall into a range based off user_login.login_time, for example, return all pages accessed within 20 seconds of an existing login_time:
> user_page_visited
#  page_id  page_visited_at
1  1        100
3  2        120
5  3        160
6  3        150
How would I do this efficiently when both tables have lots of rows? For example, the following query does something similar (it returns duplicate rows when ranges overlap), but seems to be very slow:
select * from
user_login l cross join
user_page_visited v
where v.page_visited_at >= l.login_time
and v.page_visited_at <= l.login_time + 20;
First, use regular join syntax:
select *
from user_login l join
user_page_visited v
on v.page_visited_at >= l.login_time and
v.page_visited_at <= l.login_time + 20;
Next, be sure that you have indexes on the columns used for the join: user_login(login_time) and user_page_visited(page_visited_at).
If these don't work, then you still have a couple of options. If the "20" is fixed, you can vary the type of index. There are also tricks if you are only looking for one match between, say, the login and the page visited.
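For clarity, the indexes referred to above might look like this (index names are illustrative):
CREATE INDEX idx_user_login_login_time ON user_login (login_time);
CREATE INDEX idx_user_page_visited_visited_at ON user_page_visited (page_visited_at);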
This solution is based on the comments of the answer from Gordon Linoff.
First we retrieve the tuples that were accessed in the same time slice as a user connection or in the following time slice using the following query:
SELECT DISTINCT page_id, page_visited_at
FROM user_login
INNER JOIN user_page_visited ON login_time::INT / 20 = page_visited_at::INT / 20 OR login_time::INT / 20 = page_visited_at::INT / 20 - 1;
We now need indexes in order to get a good query plan:
CREATE INDEX i_user_login_login_time_20 ON user_login ((login_time::INT / 20));
CREATE INDEX i_user_page_visited_page_visited_at_20 ON user_page_visited ((page_visited_at::INT / 20));
CREATE INDEX i_user_page_visited_page_visited_at_20_minus_1 ON user_page_visited ((page_visited_at::INT / 20 - 1));
If you EXPLAIN the query with these indexes, you get a BitmapOr on two Bitmap Index Scan operations, with some low constant cost. On the other hand, without these indexes you get a sequential scan with a way higher cost (I tested with tables of ~100k tuples each).
However, this query returns too many results. We need to filter it again to get the final result:
SELECT DISTINCT page_id, page_visited_at
FROM user_login
INNER JOIN user_page_visited ON login_time::INT / 20 = page_visited_at::INT / 20 OR login_time::INT / 20 = page_visited_at::INT / 20 - 1
WHERE page_visited_at BETWEEN login_time AND login_time + 20;
Using EXPLAIN on this query shows that PostgreSQL still uses the Bitmap Index Scans.
With ~100k rows in user_login and ~200k rows in user_page_visited the query needs ~1.4s to retrieve ~200k rows versus 3.5s without the slice prefilter.
(uname -a: Linux shepwork 4.4.26-gentoo #8 SMP Mon Nov 21 09:45:10 CET 2016 x86_64 AMD FX(tm)-6300 Six-Core Processor AuthenticAMD GNU/Linux)
I want to SUM a lot of rows.
Is it quicker (or better practice, etc) to do Option A or Option B?
Option A
SELECT
    [Person],
    SUM([Value]) AS Total
FROM
    [Database]
WHERE
    [Value] > 0
GROUP BY
    [Person]
Option B
SELECT
    [Person],
    SUM([Value]) AS Total
FROM
    [Database]
GROUP BY
    [Person]
So if I have, for Person X:
0, 7, 0, 6, 0, 5, 0, 0, 0, 4, 0, 9, 0, 0
Option A does:
a) Remove zeros
b) 7 + 6 + 5 + 4 + 9
Option B does:
a) 0 + 7 + 0 + 6 + 0 + 5 + 0 + 0 + 0 + 4 + 0 + 9 + 0 + 0
Option A has less summing, because it has fewer records to sum, because I've excluded the load that have a zero value. But Option B doesn't need a WHERE clause.
Anyone got an idea as to whether either of these are significantly quicker/better than the other? Or is it just something that doesn't matter either way?
Thanks :-)
Well, if you have a filtered index that exactly matches the WHERE clause, and if that index removes a significant amount of data (as in: a good chunk of the data is zeros), then definitely the first. If you don't have such an index, then you'll need to test it on your specific data, but I would probably expect the unfiltered scenario to be faster, as it can use a range of tricks to do the sum if it doesn't need to do branching, etc.
However, the two examples aren't functionally equivalent at the moment (the second includes negative values, the first doesn't).
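For the filtered-index scenario mentioned above, a sketch of what such an index could look like (table and column names are the question's placeholders; the index name is made up):
-- Filtered index matching Option A's WHERE clause (SQL Server 2008+).
CREATE NONCLUSTERED INDEX IX_Database_Person_ValuePositive
    ON [Database] ([Person])
    INCLUDE ([Value])
    WHERE [Value] > 0;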
Assuming that Value is never negative, the two queries might still return a different number of rows if there's a Person with all zeroes: Option B returns that Person with a total of 0, while Option A doesn't return them at all.
Otherwise you should simply test actual runtime/CPU on a really large amount of rows.
As already pointed out, the two are not functionally equivalent. In addition to the differences already pointed out (negative values, different output row count), Option A also filters out rows where Value is NULL; Option B doesn't.
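A quick way to see the NULL difference with a throwaway table (names are illustrative):
CREATE TABLE #NullDemo (Person nvarchar(200), Value int);
INSERT INTO #NullDemo VALUES ('X', NULL), ('Y', 5);

-- Option A style: the WHERE clause discards the NULL row, so Person X
-- disappears from the result entirely.
SELECT Person, SUM(Value) AS Total
FROM #NullDemo
WHERE Value > 0
GROUP BY Person;

-- Option B style: Person X is still returned, with a NULL total
-- (SUM ignores NULLs but the group itself survives).
SELECT Person, SUM(Value) AS Total
FROM #NullDemo
GROUP BY Person;

DROP TABLE #NullDemo;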
Based on the execution plan for both of these, using a small dataset similar to the one you provided, Option B is slightly faster, with an Estimated Subtree Cost of .0146636 vs .0146655. However, you may get different results depending on the query or size of the dataset. The only option is to test and see for yourself.
http://www.developer.com/db/how-to-interpret-query-execution-plan-operators.html
Drop Table #Test
Create Table #Test (Person nvarchar(200), Value int)
Insert Into #Test
Select 'Todd', 12 Union
Select 'Todd', 11 Union
Select 'Peter', 20 Union
Select 'Peter', 29 Union
Select 'Griff', 10 Union
Select 'Griff', 0 Union
Select 'Peter', 0
SELECT [Person], SUM([Value]) AS Total
FROM #Test
WHERE [Value] > 0
GROUP BY [Person]
SELECT [Person],SUM([Value]) AS Total
FROM #Test
GROUP BY [Person]
How does SQL Server implement group by clauses (aggregates)?
As inspiration, take the execution plan of this question's query:
select p_id, DATEDIFF(D, MIN(TreatmentDate), MAX(TreatmentDate)) from
patientsTable group by p_id
Before querying the data, the execution plan for a simple SELECT statement looks like this: [execution plan screenshot]
After retrieving the data with the GROUP BY query: [execution plan screenshot]
Usually it's a Stream Aggregate or a Hash Aggregate.
Stream aggregate sorts the resultset, scans it, and emits a result each time the grouping value changes (i.e. differs from the previous row). It only needs to keep one set of the aggregate state variables.
Hash aggregate builds a hash table from the resultset. Each entry keeps the aggregate state variables which are initialized on hash miss and updated on hash hit.
Let's see how AVG works. It needs two state variables: sum and count
grouper value
1 4
1 3
2 8
1 7
2 1
1 2
2 6
2 3
Stream Aggregate
First, it needs to sort the values:
grouper value
1 4
1 3
1 7
1 2
2 8
2 1
2 6
2 3
Then, it keeps one set of state variables, initialized to 0, and scans the sorted resultset:
grouper value sum count
-- Entered
-- Variables: 0 0
1 4 4 1
1 3 7 2
1 7 14 3
1 2 16 4
-- Group change. Return the result and reinitialize the variables
-- Returning 1, 4
-- Variables: 0 0
2 8 8 1
2 1 9 2
2 6 15 3
2 3 18 4
-- Group change. Return the result and reinitialize the variables
-- Returning 2, 4.5
-- End
Hash aggregate
Just scanning the values and keeping the state variables in the hash table:
grouper value
-- Hash miss. Adding new entry to the hash table
-- [1] (0, 0)
-- ... and updating it:
1 4 [1] (4, 1)
-- Hash hit. Updating the entry:
1 3 [1] (7, 2)
-- Hash miss. Adding new entry to the hash table
-- [1] (7, 2) [2] (0, 0)
-- ... and updating it:
2 8 [1] (7, 2) [2] (8, 1)
1 7 [1] (14, 3) [2] (8, 1)
2 1 [1] (14, 3) [2] (9, 2)
1 2 [1] (16, 4) [2] (9, 2)
2 6 [1] (16, 4) [2] (15, 3)
2 3 [1] (16, 4) [2] (18, 4)
-- Scanning the hash table and returning the aggregated values
-- 1 4
-- 2 4.5
Usually, the stream (sort-based) aggregate is faster if the resultset is already ordered (e.g. the values come out of an index or from a resultset sorted by the previous operation).
Hash aggregate is faster if the resultset is not sorted (hashing is faster than sorting).
MIN and MAX are special cases, since they don't require scanning the whole group: only the first and the last value of the aggregated column within the group.
Unfortunately, SQL Server, unlike most other systems, cannot utilize this efficiently, since it's not good at doing an INDEX SKIP SCAN (jumping over distinct index keys).
While simple MAX and MIN (without GROUP BY clause) use a TOP method if the index on the aggregated column is present, MIN and MAX with GROUP BY use same methods as other aggregate functions do.
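If you want to see both strategies on the question's query, SQL Server has query hints that force one or the other; a sketch (the resulting plans still depend on your indexes and statistics):
-- Force a sort-based Stream Aggregate:
SELECT p_id, DATEDIFF(D, MIN(TreatmentDate), MAX(TreatmentDate))
FROM patientsTable
GROUP BY p_id
OPTION (ORDER GROUP);

-- Force a Hash Aggregate:
SELECT p_id, DATEDIFF(D, MIN(TreatmentDate), MAX(TreatmentDate))
FROM patientsTable
GROUP BY p_id
OPTION (HASH GROUP);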
As I don't have the table available, I tried with my custom table PRICE.
I defined Primary key = ID_PRICE
Select PRICE.ID_PRICE, Max(PRICE.COSTKVARH) - min(PRICE.COSTKVARH) from PRICE
GROUP BY PRICE.ID_PRICE
Plan:
PLAN (PRICE ORDER PK_PRICE)
Adapted plan:
PLAN (PRICE ORDER PK_PRICE)
In your case p_id is a primary key, so the adapted plan will first order patientsTable by p_id, then the grouping and difference calculation will happen.