Let's say I have a very basic table:
DAY_ID  Value  Inserts
------  -----  -------
5       8      2
4       3      0
3       3      0
2       4      1
1       8      0
I want to be able to "loop" through the Inserts column and add that many new rows.
For each added row, DAY_ID should be decreased by 1 (cumulatively, so the second row added from the same source row is decreased by 2), Value should stay the same, and the Inserts column is irrelevant; we can set it to 0.
So two new rows should be added from the row with DAY_ID = 5 and Value = 8, and one new row from the row with DAY_ID = 2 and Value = 4. The final output of the new rows would be:
DAY_ID  Value  Inserts
------  -----  -------
(5-1)   8      0
(5-2)   8      0
(2-1)   4      0
I haven't tried much in SQL Server. I was able to create a solution in R and Python using arrays, but I'm really hoping I can make something work in SQL Server for this project.
I think this can be done using a loop in SQL.
Looping is generally not the way to solve problems in SQL; SQL is designed and optimized to work with sets, not one row at a time.
Consider this source table:
CREATE TABLE dbo.src(DAY_ID int, Value int, Inserts int);
INSERT dbo.src VALUES
(5, 8, 2),
(4, 3, 0),
(3, 3, 0),
(2, 4, 1),
(1, 8, 0);
There are many ways to "explode" a set based on a single value. One is to split a string of commas replicated Inserts - 1 times (splitting n-1 commas yields n rows per source row).
-- INSERT dbo.src(DAY_ID, Value, Inserts)
SELECT
DAY_ID = DAY_ID - ROW_NUMBER() OVER (PARTITION BY DAY_ID ORDER BY @@SPID),
src.Value,
Inserts = 0
FROM dbo.src
CROSS APPLY STRING_SPLIT(REPLICATE(',', src.Inserts-1), ',') AS v
WHERE src.Inserts > 0;
Output:
DAY_ID  Value  Inserts
------  -----  -------
1       4      0
4       8      0
3       8      0
Working example in this fiddle.
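If you happen to be on SQL Server 2022 or later (an assumption; the answer above doesn't require it), another way to explode the rows is GENERATE_SERIES. A minimal sketch, filtering out Inserts = 0 before the APPLY:
-- Sketch only: GENERATE_SERIES ships with SQL Server 2022+
SELECT
  DAY_ID  = s.DAY_ID - n.value,  -- subtract 1, 2, ... for each generated row
  s.Value,
  Inserts = 0
FROM (SELECT DAY_ID, Value, Inserts FROM dbo.src WHERE Inserts > 0) AS s
CROSS APPLY GENERATE_SERIES(1, s.Inserts) AS n;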
I have a SQLite table with an Id and an active period, and I am trying to get counts of the number of active rows over a sequence of times.
A vastly simplified version of this table is:
CREATE TABLE Data (
EntityId INTEGER NOT NULL,
Start INTEGER NOT NULL,
Finish INTEGER
);
With some example data
INSERT INTO Data VALUES
(1, 0, 2),
(1, 4, 6),
(1, 8, NULL),
(2, 5, 7),
(2, 9, NULL),
(3, 8, NULL);
And a desired output of something like:
Time  Count
----  -----
0     1
1     1
2     0
3     0
4     1
5     2
6     1
7     0
8     2
9     3
For which I am querying with:
WITH RECURSIVE Generate_Time(Time) AS (
SELECT 0
UNION ALL
SELECT Time + 1 FROM Generate_Time
WHERE Time + 1 <= (SELECT MAX(Start) FROM Data)
)
SELECT Time, COUNT(EntityId)
FROM Data
JOIN Generate_Time ON Start <= Time AND (Finish > Time OR Finish IS NULL)
GROUP BY Time
There is also some data I need to categorise the counts by (some of it is on the original table, some comes via a join), but I am hitting a performance bottleneck on the order of seconds on even small amounts of data (~25,000 rows) without any of that.
I have added an index on the table covering Start/Finish:
CREATE INDEX Ix_Data ON Data (
Start,
Finish
);
and that helped somewhat, but I can't help feeling there's a more elegant and performant way of doing this. Using the CTE to iterate over a range doesn't seem like it will scale very well, but I can't think of another way to calculate what I need.
I've been looking at the query plan too, and I think the slow part is the GROUP BY, since it can't use an index for that (the times come from the CTE), so SQLite builds a temporary B-tree:
3 0 0 MATERIALIZE 3
7 3 0 SETUP
8 7 0 SCAN CONSTANT ROW
21 3 0 RECURSIVE STEP
22 21 0 SCAN TABLE Generate_Time
27 21 0 SCALAR SUBQUERY 2
32 27 0 SEARCH TABLE Data USING COVERING INDEX Ix_Data
57 0 0 SCAN SUBQUERY 3
59 0 0 SEARCH TABLE Data USING INDEX Ix_Data (Start<?)
71 0 0 USE TEMP B-TREE FOR GROUP BY
Any suggestions of a way to speed this query up, or even a better way of storing this data to craft a tighter query would be most welcome!
To get to the desired output as per your question, the following can be done.
For better performance, one option is to make use of generate_series to generate the rows instead of the recursive CTE, and to limit the number of rows to the maximum value available in Data (a sketch of that variant follows the query below).
WITH RECURSIVE Generate_Time(Time) AS (
SELECT 0
UNION ALL
SELECT Time + 1 FROM Generate_Time
WHERE Time + 1 <= (SELECT MAX(Start) FROM Data)
)
SELECT gt.Time
,count(d.entityid)
FROM Generate_Time gt
LEFT JOIN Data d
ON d.Start <= gt.Time AND (d.Finish > gt.Time OR d.Finish IS NULL)
GROUP BY gt.Time
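For reference, a sketch of what the generate_series variant could look like. This assumes your SQLite build includes the optional series extension (its generated column is named value); it is not part of the answer above:
SELECT gt.value AS Time
      ,count(d.EntityId)
FROM generate_series(0, (SELECT MAX(Start) FROM Data)) AS gt
LEFT JOIN Data d
    ON d.Start <= gt.value AND (d.Finish > gt.value OR d.Finish IS NULL)
GROUP BY gt.value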
This ended up being simply a case of the result set being too large. In my real data, the result set before grouping was ~19,000,000 records. I was able to do some partitioning on the client side, splitting the queries into smaller discrete chunks, which improved performance ~10x. That still wasn't quite as fast as I wanted, but it was acceptable for my use case.
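For anyone reading along, a rough sketch of what such a chunked query might look like in SQLite; the :chunk_start and :chunk_end parameters are placeholders bound by the client, not part of the original query:
WITH RECURSIVE Generate_Time(Time) AS (
    SELECT :chunk_start
    UNION ALL
    SELECT Time + 1 FROM Generate_Time
    WHERE Time + 1 <= :chunk_end
)
SELECT gt.Time
      ,count(d.EntityId)
FROM Generate_Time gt
LEFT JOIN Data d
    ON d.Start <= gt.Time AND (d.Finish > gt.Time OR d.Finish IS NULL)
GROUP BY gt.Time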
I am looking for a simple, clean method to obtain the sequence {1, 2, 3, 4, 5, 6, 7, ..., 1000000} in MS Access SQL. I thought of creating a table with a column that numbers from 1 to 1000000; however, this is inefficient.
Is there a way of generating the numbers 1 to 1000000 in MS Access using SQL?
I tried the GENERATE_SERIES() function, but MS Access SQL does not support it.
id | number
-----------
1  | 1
2  | 2
3  | 3
4  | 4
5  | 5
6  | 6
7  | 7
8  | 8
Yes, and it is not painful: use a Cartesian query.
First, create a small query returning 10 records:
SELECT
DISTINCT Abs([id] Mod 10) AS N
FROM
MSysObjects;
Save it as Ten.
Then run this simple query:
SELECT
[Ten_0].[N]+[Ten_1].[N]*10+[Ten_2].[N]*100+[Ten_3].[N]*1000+[Ten_4].[N]*10000+[Ten_5].[N]*100000 AS Id
FROM
Ten AS Ten_0,
Ten AS Ten_1,
Ten AS Ten_2,
Ten AS Ten_3,
Ten AS Ten_4,
Ten AS Ten_5
which, in two seconds, will return Id from 0 to 999999.
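If you want exactly 1 to 1000000 and want to keep the result around, a possible follow-up (not part of the original answer) is to add 1 to the expression and store it with a make-table query; the table name Numbers is just an example, and note that Access SQL has no comment syntax:
SELECT
[Ten_0].[N]+[Ten_1].[N]*10+[Ten_2].[N]*100+[Ten_3].[N]*1000+[Ten_4].[N]*10000+[Ten_5].[N]*100000+1 AS Id
INTO Numbers
FROM
Ten AS Ten_0,
Ten AS Ten_1,
Ten AS Ten_2,
Ten AS Ten_3,
Ten AS Ten_4,
Ten AS Ten_5;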
Very painful, but you can do the following:
create table numbers (
id int autoincrement,
number int
);
-- 1 row
insert into numbers (number) values (1);
-- 2 rows
insert into numbers (number) select number from numbers;
-- 4 rows
insert into numbers (number) select number from numbers;
-- 8 rows
insert into numbers (number) select number from numbers;
-- repeat a total of 20 times
The value of number is always 1, but id increments. You can make them equal using an update:
update numbers
set number = id;
If this were SQL server you could use a recursive CTE to do this.
WITH number
AS
(
SELECT num = 1
UNION ALL
SELECT num + 1
from number
where num < 1000000
)
SELECT num FROM number
option(maxrecursion 0)
But you are asking about MS Access. Since Access does not support recursive CTEs, you could try doing it with a macro, inserting the numbers into a table and reading them back out.
I've run into a subtlety around count(*) and join, and am hoping to get some confirmation that I've figured out what's going on correctly. For background, we commonly convert continuous timeline data into discrete bins, such as hours. And since we don't want gaps for bins with no content, we'll use generate_series to synthesize the buckets we want values for. If there's no entry for, say, 10 AM, fine, we still get a result. However, I noticed that I'm sometimes getting 1 instead of 0. Here's what I'm trying to confirm:
The count is 1 if you count the "grid" series, and 0 if you count the data table.
This only has to do with count, and no other aggregate.
The code below sets up some sample data to show what I'm talking about:
DROP TABLE IF EXISTS analytics.measurement_table CASCADE;
CREATE TABLE IF NOT EXISTS analytics.measurement_table (
hour smallint NOT NULL DEFAULT NULL,
measurement smallint NOT NULL DEFAULT NULL
);
INSERT INTO measurement_table (hour, measurement)
VALUES ( 0, 1),
( 1, 1), ( 1, 1),
(10, 2), (10, 3), (10, 5);
Here are the goal results for the query. I'm using 12 hours to keep the example results shorter.
Hour  Count  Sum
----  -----  ---
0     1      1
1     2      2
2     0      0
3     0      0
4     0      0
5     0      0
6     0      0
7     0      0
8     0      0
9     0      0
10    3      10
11    0      0
12    0      0
This works correctly:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(measurement_table.hour) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (measurement_table.hour = hour_series.hour)
GROUP BY 1
ORDER BY 1
This returns misleading 1's where there is no match:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(*) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (hour_series.hour = measurement_table.hour)
GROUP BY 1
ORDER BY 1
0 1 1
1 2 2
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
10 3 10
11 1 0
12 1 0
The only difference between these two examples is the count term:
count(*) -- A result of 1 on no match, and a correct count otherwise.
count(joined to table field) -- 0 on no match, correct count otherwise.
That seems to be it: you've got to make it explicit that you're counting the data table. Otherwise, you get a count of 1 because the series row still matches once. Is this a nuance of joining, or a nuance of count in Postgres?
Does this impact any other aggregate? It seems like it shouldn't.
P.S. generate_series is just about the best thing ever.
You figured out the problem correctly: count() behaves differently depending on the argument it is given.
count(*) counts how many rows belong to the group. This can never be 0, since there is always at least one row in a group (otherwise there would be no group).
On the other hand, when given a column name or expression as its argument, count() takes into account any non-null value and ignores null values. For your query, this lets you distinguish groups that have no match in the left-joined table from groups that do have matches.
Note that this behavior is not Postgres specific; it comes from the ANSI SQL standard (all databases that I know of conform to it).
Bottom line:
in general cases, use count(*); it is more efficient, since the database does not need to check for nulls (and it makes clear to the reader that you just want to know how many rows belong to the group)
in specific cases such as yours, put the relevant expression inside count() (a small illustration follows below)
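To make the difference concrete, here is a small side-by-side illustration against the sample data from the question (the column aliases are mine):
SELECT hour_series.hour,
       count(*)                      AS rows_in_group, -- never 0: every group has at least one row
       count(measurement_table.hour) AS matched_rows   -- 0 when the LEFT JOIN produced only NULLs
FROM generate_series(0, 12) AS hour_series(hour)
LEFT JOIN measurement_table ON measurement_table.hour = hour_series.hour
GROUP BY 1
ORDER BY 1;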
I have a column COL in a table which has integer values like 1, 2, 3, 10, 11, and so on. Uniqueness in the table is created by an ID. Each ID can be associated with multiple COL values. For example,
ID | COL
---+----
 1 |   2
 1 |   3
 1 |  10
is valid.
What I want to do is select only the COL values from the table that are greater than 3, AND (the problematic part) also select the value that is the MAX of 1, 2, and 3, if they exist at all. So in the table above, I would want to select values [3, 10] because 10 is greater than 3 and 3 = MAX(3, 2).
I know I can do this with two SQL statements, but it's sort of messy. Is there a way of doing it with one statement only?
SELECT col FROM table
WHERE
col > 3
UNION
SELECT MAX(col) FROM table
WHERE
col <= 3
This query does not assume you want the results per id, because you don't explicitly mention it.
I don't think you need PL/SQL for this; SQL is enough.
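If you do need the result per id, a possible variant (keeping the placeholder table and column names used above) would be:
SELECT id, col FROM table
WHERE col > 3
UNION
SELECT id, MAX(col) FROM table
WHERE col <= 3
GROUP BY id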
How does SQL Server implement group by clauses (aggregates)?
As inspiration, take the execution plan of this question's query:
select p_id, DATEDIFF(D, MIN(TreatmentDate), MAX(TreatmentDate))
from patientsTable
group by p_id
Before querying the data, a simple SELECT statement produces this execution plan:
After retrieving the data with the GROUP BY query, the execution plan looks like this:
Usually it's a Stream Aggregate or a Hash Aggregate.
A Stream Aggregate sorts the resultset, scans it, and emits a result each time the grouping value changes (i.e. is not equal to the previous value in the scan). This way it only needs to keep one set of aggregate state variables at a time.
A Hash Aggregate builds a hash table over the resultset. Each entry keeps the aggregate state variables, which are initialized on a hash miss and updated on a hash hit.
Let's see how AVG works. It needs two state variables: sum and count.
grouper value
1 4
1 3
2 8
1 7
2 1
1 2
2 6
2 3
Stream Aggregate
First, it needs to sort the values:
grouper value
1 4
1 3
1 7
1 2
2 8
2 1
2 6
2 3
Then, it keeps one set of state variables, initialized to 0, and scans the sorted resultset:
grouper value sum count
-- Entered
-- Variables: 0 0
1 4 4 1
1 3 7 2
1 7 14 3
1 2 16 4
-- Group change. Return the result and reinitialize the variables
-- Returning 1, 4
-- Variables: 0 0
2 8 8 1
2 1 9 2
2 6 15 3
2 3 18 4
-- Group change. Return the result and reinitialize the variables
-- Returning 2, 4.5
-- End
Hash Aggregate
Just scanning the values and keeping the state variables in the hash table:
grouper value
-- Hash miss. Adding new entry to the hash table
-- [1] (0, 0)
-- ... and updating it:
1 4 [1] (4, 1)
-- Hash hit. Updating the entry:
1 3 [1] (7, 2)
-- Hash miss. Adding new entry to the hash table
-- [1] (7, 2) [2] (0, 0)
-- ... and updating it:
2 8 [1] (7, 2) [2] (8, 1)
1 7 [1] (14, 3) [2] (8, 1)
2 1 [1] (14, 3) [2] (9, 2)
1 2 [1] (16, 4) [2] (9, 2)
2 6 [1] (16, 4) [2] (15, 3)
2 3 [1] (16, 4) [2] (18, 4)
-- Scanning the hash table and returning the aggregated values
-- 1 4
-- 2 4.5
Usually, the sort-based approach is faster if the resultset is already ordered (for example, the values come out of an index, or the resultset was sorted by a previous operation).
Hash is faster if the resultset is not sorted (hashing is faster than sorting).
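As a side note, SQL Server lets you nudge the optimizer toward either strategy with the documented GROUP query hints, which is handy for comparing the two plans on your own data (query shape borrowed from the question):
-- Request a sort-based Stream Aggregate
SELECT p_id, DATEDIFF(D, MIN(TreatmentDate), MAX(TreatmentDate))
FROM patientsTable
GROUP BY p_id
OPTION (ORDER GROUP);

-- Request a Hash Aggregate
SELECT p_id, DATEDIFF(D, MIN(TreatmentDate), MAX(TreatmentDate))
FROM patientsTable
GROUP BY p_id
OPTION (HASH GROUP);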
MIN and MAX are special cases, since they don't require scanning the whole group: only the first and the last value of the aggregated column within the group are needed.
Unfortunately SQL Server, unlike most other systems, cannot exploit this efficiently, since it is not good at doing an INDEX SKIP SCAN (jumping over distinct index keys).
While simple MAX and MIN (without a GROUP BY clause) use a TOP method if an index on the aggregated column is present, MIN and MAX with GROUP BY use the same methods as other aggregate functions do.
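To illustrate the TOP method mentioned above: assuming an index on TreatmentDate exists (an assumption, not stated in the question), a simple MAX is roughly equivalent to reading a single row from the end of that index:
-- Simple MAX without GROUP BY
SELECT MAX(TreatmentDate) FROM patientsTable;

-- Roughly what the TOP method does under the covers
SELECT TOP (1) TreatmentDate
FROM patientsTable
ORDER BY TreatmentDate DESC;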
As I don't have the table available, I tried with my custom table PRICE.
I defined the primary key as ID_PRICE.
SELECT PRICE.ID_PRICE, MAX(PRICE.COSTKVARH) - MIN(PRICE.COSTKVARH)
FROM PRICE
GROUP BY PRICE.ID_PRICE
Plan:
PLAN (PRICE ORDER PK_PRICE)
Adapted plan:
PLAN (PRICE ORDER PK_PRICE)
In your case p_id is the primary key, so the adapted plan will first order patientsTable by p_id, and then the grouping and the difference calculation will happen.