I have the following SQL code:
with mytable(stock, datetime, price) as (
select * from values
(1, '2022-12-13 12:31:45.00'::timestamp, 10.0),
(1, '2022-12-13 12:31:45.01'::timestamp, 10.1),
(1, '2022-12-13 12:31:45.02'::timestamp, 10.2),
(1, '2022-12-13 12:31:46.00'::timestamp, 11.0),
(1, '2022-12-13 12:31:46.01'::timestamp, 11.1),
(1, '2022-12-13 12:31:46.02'::timestamp, 11.2),
(1, '2022-12-13 12:31:46.03'::timestamp, 11.3),
(1, '2022-12-13 12:31:47.03'::timestamp, 11.3),
(1, '2022-12-13 12:31:48.00'::timestamp, 11.3)
)
select t1.*
,t2.datetime as next_datetime
,t2.price as next_price
,t3.datetime as next2_datetime
,t3.price as next2_price
,t4.datetime as next3_datetime
,t4.price as next3_price
from mytable as t1
left join mytable as t2
on t1.stock = t2.stock and timediff(second, t1.datetime, t2.datetime) < 1
left join mytable as t3
on t1.stock = t3.stock and timediff(second, t1.datetime, t3.datetime) < 2
left join mytable as t4
on t1.stock = t4.stock and timediff(second, t1.datetime, t4.datetime) < 3
qualify row_number() over (partition by t1.stock, t1.datetime order by (t2.datetime, t3.datetime, t4.datetime) desc) = 1
ORDER BY 1,2;
With a small table, this works just fine. However, I have a table with at least 100,000 rows, and doing a single left join takes about 30 minutes. Is there a better, faster way to do the above operation in SQL?
There is more extensive refactoring that could be done to this query, and there is also a lot that can be done with data shaping and other techniques on these kinds of time series joins. For now, it's important to reduce the intermediate cardinality in the plan. Specifically, we want to reduce the cardinality of the join, which feeds into the window function and filter. That's currently 3,897.
Step one is to run this after running your sample query to see if the refactor produces identical results:
select hash_agg(*) from table(result_scan(last_query_id()));
This produces 1433824845005768014 when running your sample.
Now on the join, we want to restrict the number of rows that survive the join condition. Since we're looking for rows that happen in the future only, we don't have to join the ones that happen in the past:
with mytable(stock, datetime, price) as (
select * from values
(1, '2022-12-13 12:31:45.00'::timestamp, 10.0),
(1, '2022-12-13 12:31:45.01'::timestamp, 10.1),
(1, '2022-12-13 12:31:45.02'::timestamp, 10.2),
(1, '2022-12-13 12:31:46.00'::timestamp, 11.0),
(1, '2022-12-13 12:31:46.01'::timestamp, 11.1),
(1, '2022-12-13 12:31:46.02'::timestamp, 11.2),
(1, '2022-12-13 12:31:46.03'::timestamp, 11.3),
(1, '2022-12-13 12:31:47.03'::timestamp, 11.3),
(1, '2022-12-13 12:31:48.00'::timestamp, 11.3)
)
select t1.*
,t2.datetime as next_datetime
,t2.price as next_price
,t3.datetime as next2_datetime
,t3.price as next2_price
,t4.datetime as next3_datetime
,t4.price as next3_price
from mytable as t1
left join mytable as t2
on t1.stock = t2.stock and timediff(second, t1.datetime, t2.datetime) < 1 and t1.datetime <= t2.datetime
left join mytable as t3
on t1.stock = t3.stock and timediff(second, t1.datetime, t3.datetime) < 2 and t1.datetime <= t3.datetime
left join mytable as t4
on t1.stock = t4.stock and timediff(second, t1.datetime, t4.datetime) < 3 and t1.datetime <= t4.datetime
qualify row_number() over (partition by t1.stock, t1.datetime order by (t2.datetime, t3.datetime, t4.datetime) desc) = 1
ORDER BY 1,2;
This reduces the cardinality heading into the window function and qualify filter to 989. That's a 75% reduction of rows heading into the window function. Next, run the check again to make sure the results are identical:
select hash_agg(*) from table(result_scan(last_query_id()));
Edit: A slight change to the method of ensuring only future rows get joined reduced the intermediate cardinality from 989 to 497, an 87% reduction.
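The revised predicate isn't shown here. For reference, one alternative way to express the future-only window (an assumption on my part, not necessarily the change made above) is to bound t2.datetime directly instead of calling timediff, for example:
left join mytable as t2
    on  t1.stock = t2.stock
    and t2.datetime >= t1.datetime
    and t2.datetime <  dateadd(second, 1, t1.datetime)
Note that timediff(second, ...) counts second-boundary crossings rather than elapsed time, so this is not exactly equivalent to the original predicate; rerun the hash_agg check before adopting anything like it.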
Edit 2: This is a JavaScript UDTF approach that does not increase cardinality at all. I tested it on real-world NYSE data; it processed 2.55 billion rows in 12 minutes on a Large warehouse.
create or replace function LAG_BY_TIME(ROW_TIME float, LAGGED_VALUE float, LAG_TIME1 float, LAG_TIME2 float, LAG_TIME3 float)
returns table(LAGGED_TIME1 float, LAGGED_VALUE1 float, LAGGED_TIME2 float, LAGGED_VALUE2 float, LAGGED_TIME3 float, LAGGED_VALUE3 float)
language javascript
as
$$
{
    initialize: function (argumentInfo, context) {
        // One candidate buffer per lag window. The partition is ordered by
        // DATETIME descending, so rows arrive newest-first.
        this.buffer1 = [];
        this.buffer2 = [];
        this.buffer3 = [];
    },
    processRow: function (row, rowWriter, context) {
        var shifted;
        var laggedTime = new Date(row.ROW_TIME);
        this.buffer1.push({laggedTime:laggedTime, laggedValue:row.LAGGED_VALUE});
        this.buffer2.push({laggedTime:laggedTime, laggedValue:row.LAGGED_VALUE});
        this.buffer3.push({laggedTime:laggedTime, laggedValue:row.LAGGED_VALUE});
        // Horizon for each lag window, measured forward from the current row's time.
        var laggedTime1 = new Date(laggedTime.getTime() + row.LAG_TIME1);
        var laggedTime2 = new Date(laggedTime.getTime() + row.LAG_TIME2);
        var laggedTime3 = new Date(laggedTime.getTime() + row.LAG_TIME3);
        // Evict rows at or beyond each horizon so the head of each buffer is the
        // latest row that still falls inside that window.
        do {
            if (this.buffer1[0].laggedTime >= laggedTime1) {
                this.buffer1.shift();
                shifted = true;
            } else {
                shifted = false;
            }
        } while (shifted)
        do {
            if (this.buffer2[0].laggedTime >= laggedTime2) {
                this.buffer2.shift();
                shifted = true;
            } else {
                shifted = false;
            }
        } while (shifted)
        do {
            if (this.buffer3[0].laggedTime >= laggedTime3) {
                this.buffer3.shift();
                shifted = true;
            } else {
                shifted = false;
            }
        } while (shifted)
        rowWriter.writeRow({
            LAGGED_TIME1:this.buffer1[0].laggedTime, LAGGED_VALUE1:this.buffer1[0].laggedValue,
            LAGGED_TIME2:this.buffer2[0].laggedTime, LAGGED_VALUE2:this.buffer2[0].laggedValue,
            LAGGED_TIME3:this.buffer3[0].laggedTime, LAGGED_VALUE3:this.buffer3[0].laggedValue,
        });
    },
    finalize: function (rowWriter, context) {/*...*/},
}
$$;
create or replace function TO_EPOCH(TS timestamp) returns float as
$$ round(datediff(milliseconds, '1970-01-01'::timestamp, TS),0)::float $$;
create or replace function FROM_EPOCH(EPOCH float) returns timestamp as
$$ dateadd(milliseconds, round(EPOCH,0), '1970-01-01'::timestamp) $$;
with mytable(stock, datetime, price) as (
select * from values
(1, '2022-12-13 12:31:45.00'::timestamp, 10.0),
(1, '2022-12-13 12:31:45.01'::timestamp, 10.1),
(1, '2022-12-13 12:31:45.02'::timestamp, 10.2),
(1, '2022-12-13 12:31:46.00'::timestamp, 11.0),
(1, '2022-12-13 12:31:46.01'::timestamp, 11.1),
(1, '2022-12-13 12:31:46.02'::timestamp, 11.2),
(1, '2022-12-13 12:31:46.03'::timestamp, 11.3),
(1, '2022-12-13 12:31:47.03'::timestamp, 11.3),
(1, '2022-12-13 12:31:48.00'::timestamp, 11.3)
)
select STOCK
,DATETIME
,PRICE as PRICE
,from_epoch(LAGGED_TIME1) as NEXT_DATETIME
,LAGGED_VALUE1::number(12,2) as NEXT_PRICE
,from_epoch(LAGGED_TIME2) as NEXT2_DATETIME
,LAGGED_VALUE2::number(12,2) as NEXT2_PRICE
,from_epoch(LAGGED_TIME3) as NEXT3_DATETIME
,LAGGED_VALUE3::number(12,2) as NEXT3_PRICE
from MYTABLE, table(lag_by_time(to_epoch(DATETIME), PRICE::float, 1000::float, 2000::float, 3000::float)
over (partition by stock order by DATETIME desc))
order by STOCK, DATETIME
;
If you quantize your data to seconds and pick a "winner" row per second, then you can use LAG/LEAD (see the first sketch below).
Or:
If you need each millisecond entry to be matched to the next n seconds of rows, quantize the data into windows and double the data, then do an equi-join on that, so the sliding windows operate on very heavily constrained groupings (see the second sketch below).
Or:
Use a user defined table function to provide aggregate/cache values, and build yourself a latch/caching scrolling window.
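A minimal sketch of the first idea, assuming the "winner" per second is the latest tick within that second (adjust the qualify ordering to whatever winner rule you need):
with winners as (
    select stock,
           date_trunc('second', datetime) as sec,
           datetime,
           price
    from mytable
    qualify row_number() over (partition by stock, date_trunc('second', datetime)
                               order by datetime desc) = 1
)
select stock,
       sec,
       price,
       lead(price, 1) over (partition by stock order by sec) as next_price,
       lead(price, 2) over (partition by stock order by sec) as next2_price,
       lead(price, 3) over (partition by stock order by sec) as next3_price
from winners
order by stock, sec;
Keep in mind that LEAD counts rows, so this matches "n seconds ahead" only if every second in the range has a winner row; gaps would need to be filled or handled separately.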
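And a rough sketch of the second idea, the windowed equi-join, assuming 1-second buckets, a 1-second lookahead, and a "latest row wins" rule (all of which are illustrative choices, not fixed requirements):
with candidates as (
    -- each candidate row is emitted under its own 1-second bucket and under the
    -- preceding bucket, so a probe row only has to equi-join on its own bucket
    -- to see every candidate within the next second
    select stock, datetime, price, date_trunc('second', datetime) as bucket
    from mytable
    union all
    select stock, datetime, price, dateadd(second, -1, date_trunc('second', datetime)) as bucket
    from mytable
)
select t1.stock,
       t1.datetime,
       t1.price,
       t2.datetime as next_datetime,
       t2.price    as next_price
from mytable as t1
left join candidates as t2
  on  t1.stock = t2.stock
  and date_trunc('second', t1.datetime) = t2.bucket      -- equi-join on the bucket
  and t2.datetime >= t1.datetime
  and t2.datetime <  dateadd(second, 1, t1.datetime)     -- exact window filter
qualify row_number() over (partition by t1.stock, t1.datetime
                           order by t2.datetime desc) = 1
order by t1.stock, t1.datetime;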
I ran your query and here is part of the execution plan:
As you can see, your 9 rows became 3,897 rows. If each stock has an average of 10 rows in your real table, that means even a single left join will multiply the rows by 10! If some stocks have more rows than others, this may also cause serious skew and performance degradation.
The result of your sample query doesn't make much sense to me. Maybe your sample data has some issues, because the differences between rows are mostly milliseconds while your query looks at seconds. Anyway, I suggest you check the LEAD function:
https://docs.snowflake.com/en/sql-reference/functions/lead.html
with mytable(stock, datetime, price) as (
select * from values
(1, '2022-12-13 12:31:45.00'::timestamp, 10.0),
(1, '2022-12-13 12:31:45.01'::timestamp, 10.1),
(1, '2022-12-13 12:31:45.02'::timestamp, 10.2),
(1, '2022-12-13 12:31:46.00'::timestamp, 11.0),
(1, '2022-12-13 12:31:46.01'::timestamp, 11.1),
(1, '2022-12-13 12:31:46.02'::timestamp, 11.2),
(1, '2022-12-13 12:31:46.03'::timestamp, 11.3),
(1, '2022-12-13 12:31:47.03'::timestamp, 11.3),
(1, '2022-12-13 12:31:48.00'::timestamp, 11.3)
)
select t1.*
,LEAD( t1.datetime) over (partition by t1.stock order by t1.datetime ) as next_datetime
,LEAD( t1.price) over (partition by t1.stock order by t1.datetime ) as next_price
,LEAD( t1.datetime, 2) over (partition by t1.stock order by t1.datetime ) as next2_datetime
,LEAD( t1.price, 2) over (partition by t1.stock order by t1.datetime ) as next2_price
,LEAD( t1.datetime, 3) over (partition by t1.stock order by t1.datetime ) as next3_datetime
,LEAD( t1.price, 3) over (partition by t1.stock order by t1.datetime ) as next3_price
from mytable as t1
ORDER BY 1,2;
Say, for example, I wanted to SUM the Quantity values starting from the current row and spanning the number of rows given by Count. See the table below:
Category  Row Number  Quantity  Count
A         1           10        3
A         2           15        4
A         3           25        2
A         4           30        1
A         5           40        5
A         6           60        2
B         1           12        2
B         2           13        3
B         3         <br> 17        1
B         4           11        2
B         5           10        5
B         6           7         3
For example:
Category A, Row 1: 10+15+25 = 50 (because it adds Rows 1 to 3 due to Count)
Category A, Row 2: 15+25+30+40 = 110 (because it adds Rows 2 to 5 due to count)
Category A, Row 5: 40+60 = 100 (because it adds Rows 5 and 6; the Count is 5, but the category ends at Row 6, so it sums all available data, which is Rows 5 and 6 only, giving 100).
Same goes for Category B.
How do I do this?
You can do this using window functions:
with tt as (
select t.*,
sum(quantity) over (partition by category order by rownumber) as running_quantity,
max(rownumber) over (partition by category) as max_rownumber
from t
)
select tt.*,
coalesce(tt2.running_quantity, ttlast.running_quantity) - tt.running_quantity + tt.quantity
from tt left join
tt tt2
on tt2.category = tt.category and
tt2.rownumber = tt.rownumber + tt.count - 1 left join
tt ttlast
on ttlast.category = tt.category and
ttlast.rownumber = ttlast.max_rownumber
order by category, rownumber;
I can imagine that under some circumstances this would be much faster -- particularly if the count values are relatively large. For small values of count, the lateral join is probably faster, but it is worth checking if performance is important.
Actually, a pure window functions approach is probably the best approach:
with tt as (
select t.*,
sum(quantity) over (partition by category order by rownumber) as running_quantity
from t
)
select tt.*,
(coalesce(lead(tt.running_quantity, tt.count - 1) over (partition by tt.category order by tt.rownumber),
first_value(tt.running_quantity) over (partition by tt.category order by tt.rownumber desc)
) - tt.running_quantity + tt.quantity
)
from tt
order by category, rownumber;
Here is a db<>fiddle.
Try this:
DECLARE @DataSource TABLE
(
[Category] CHAR(1)
,[Row Number] BIGINT
,[Quantity] INT
,[Count] INT
);
INSERT INTO @DataSource ([Category], [Row Number], [Quantity], [Count])
VALUES ('A', 1, 10, 3)
,('A', 2, 15, 4)
,('A', 3, 25, 2)
,('A', 4, 30, 1)
,('A', 5, 40, 5)
,('A', 6, 60, 2)
--
,('B', 1, 12, 2)
,('B', 2, 13, 3)
,('B', 3, 17, 1)
,('B', 4, 11, 2)
,('B', 5, 10, 5)
,('B', 6, 7, 3);
SELECT *
FROM @DataSource E
CROSS APPLY
(
SELECT SUM(I.[Quantity])
FROM @DataSource I
WHERE I.[Row Number] <= E.[Row Number] + E.[Count] - 1
AND I.[Row Number] >= E.[Row Number]
AND E.[Category] = I.[Category]
) DS ([Sum]);
I have a table consisting of 10 million rows where I am trying to find who was the first/last maintainer of some machines (id), depending on some dates and also on what status the machine had. My query uses six joins; is there another, better option?
EDIT: The original table has an index; I am trying to optimise the query by replacing the joins, if possible.
SQL Fiddle with example:
SQL Fiddle
EDIT (added additional information below):
Example table:
CREATE TABLE vendor_info (
id INT,
datestamp INT,
statuz INT,
maintainer VARCHAR(25));
INSERT INTO vendor_info VALUES (1, 20180101, 0, 'Jay');
INSERT INTO vendor_info VALUES (2, 20180101, 0, 'Eric');
INSERT INTO vendor_info VALUES (3, 20180101, 1, 'David');
INSERT INTO vendor_info VALUES (1, 20180201, 1, 'Jay');
INSERT INTO vendor_info VALUES (2, 20180201, 0, 'Jay');
INSERT INTO vendor_info VALUES (3, 20180201, 1, 'Jay');
INSERT INTO vendor_info VALUES (1, 20180301, 1, 'Jay');
INSERT INTO vendor_info VALUES (2, 20180301, 1, 'David');
INSERT INTO vendor_info VALUES (3, 20180301, 1, 'Eric');
Query and desired output:
SELECT
id
, MIN(datestamp) AS min_datestamp
, MAX(datestamp) AS max_datestamp
, MAX(case when statuz = 0 then datestamp end) AS max_s0_date
, MAX(case when statuz = 1 then datestamp end) AS max_s1_date
, MIN(case when statuz = 0 then datestamp end) AS min_s0_date
, MIN(case when statuz = 1 then datestamp end) AS min_s1_date
INTO vendor_dates
FROM vendor_info
GROUP BY id;
SELECT
vd.id
, v1.maintainer AS first_maintainer
, v2.maintainer AS last_maintainer
, v3.maintainer AS last_s0_maintainer
, v4.maintainer AS last_s1_maintainer
, v5.maintainer AS first_s0_maintainer
, v6.maintainer AS first_s1_maintainer
FROM vendor_dates vd
LEFT JOIN vendor_info v1 ON vd.id = v1.id AND vd.min_datestamp = v1.datestamp
LEFT JOIN vendor_info v2 ON vd.id = v2.id AND vd.max_datestamp = v2.datestamp
LEFT JOIN vendor_info v3 ON vd.id = v3.id AND vd.max_s0_date = v3.datestamp
LEFT JOIN vendor_info v4 ON vd.id = v4.id AND vd.max_s1_date = v4.datestamp
LEFT JOIN vendor_info v5 ON vd.id = v5.id AND vd.min_s0_date = v5.datestamp
LEFT JOIN vendor_info v6 ON vd.id = v6.id AND vd.min_s1_date = v6.datestamp;
Adding an index to vendor_info reduces the duration of your 2nd query from over 300 ms to under 30 ms, averaged over repeated runs:
PRIMARY KEY CLUSTERED (id, datestamp)
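If the table already exists, the key can be added with something like the following (an illustrative statement; it assumes id and datestamp are, or can be made, NOT NULL, and the constraint name is up to you):
ALTER TABLE vendor_info
ADD CONSTRAINT PK_vendor_info PRIMARY KEY CLUSTERED (id, datestamp);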
Changing the two-step process into a CTE reduces the total duration even further, to well under 15 ms averaged over repeated runs.
The CTE method lets the query optimiser use the new primary key:
CREATE TABLE vendor_info (
id INT NOT NULL,
datestamp INT NOT NULL,
statuz INT,
maintainer VARCHAR(25),
PRIMARY KEY CLUSTERED (id, datestamp)
);
INSERT INTO vendor_info VALUES (1, 20180101, 0, 'Jay');
INSERT INTO vendor_info VALUES (2, 20180101, 0, 'Eric');
INSERT INTO vendor_info VALUES (3, 20180101, 1, 'David');
INSERT INTO vendor_info VALUES (1, 20180201, 1, 'Jay');
INSERT INTO vendor_info VALUES (2, 20180201, 0, 'Jay');
INSERT INTO vendor_info VALUES (3, 20180201, 1, 'Jay');
INSERT INTO vendor_info VALUES (1, 20180301, 1, 'Jay');
INSERT INTO vendor_info VALUES (2, 20180301, 1, 'David');
INSERT INTO vendor_info VALUES (3, 20180301, 1, 'Eric');
WITH vendor_dates AS
(SELECT
id
, MIN(datestamp) AS min_datestamp
, MAX(datestamp) AS max_datestamp
, MAX(case when statuz = 0 then datestamp end) AS max_s0_date
, MAX(case when statuz = 1 then datestamp end) AS max_s1_date
, MIN(case when statuz = 0 then datestamp end) AS min_s0_date
, MIN(case when statuz = 1 then datestamp end) AS min_s1_date
FROM vendor_info
GROUP BY id
)
SELECT
vd.id
, v1.maintainer AS first_maintainer
, v2.maintainer AS last_maintainer
, v3.maintainer AS last_s0_maintainer
, v4.maintainer AS last_s1_maintainer
, v5.maintainer AS first_s0_maintainer
, v6.maintainer AS first_s1_maintainer
FROM vendor_dates vd
LEFT JOIN vendor_info v1 ON vd.id = v1.id AND vd.min_datestamp = v1.datestamp
LEFT JOIN vendor_info v2 ON vd.id = v2.id AND vd.max_datestamp = v2.datestamp
LEFT JOIN vendor_info v3 ON vd.id = v3.id AND vd.max_s0_date = v3.datestamp
LEFT JOIN vendor_info v4 ON vd.id = v4.id AND vd.max_s1_date = v4.datestamp
LEFT JOIN vendor_info v5 ON vd.id = v5.id AND vd.min_s0_date = v5.datestamp
LEFT JOIN vendor_info v6 ON vd.id = v6.id AND vd.min_s1_date = v6.datestamp;
Check the following query.
WITH
a AS (
SELECT
id, datestamp, maintainer, statuz,
MIN(datestamp) OVER(PARTITION BY id) AS fm,
MAX(datestamp) OVER(PARTITION BY id) AS lm,
MIN(datestamp) OVER(PARTITION BY id, statuz) AS fZm,
MAX(datestamp) OVER(PARTITION BY id, statuz) AS lZm
FROM vendor_info
)
SELECT
id,
MIN(IIF(datestamp = fm, maintainer, NULL)) AS first_maintainer,
MAX(IIF(datestamp = lm, maintainer, NULL)) AS last_maintainer,
MAX(IIF(datestamp = lZm AND statuz = 0, maintainer, NULL)) AS last_s0_maintainer,
MAX(IIF(datestamp = lZm AND statuz = 1, maintainer, NULL)) AS last_s1_maintainer,
MIN(IIF(datestamp = fZm AND statuz = 0, maintainer, NULL)) AS first_s0_maintainer,
MIN(IIF(datestamp = fZm AND statuz = 1, maintainer, NULL)) AS first_s1_maintainer
FROM a
GROUP BY id;
It can be tested on SQL Fiddle.
I haven't had time yet to generate 10 million test records, but try this with an index on (id, datestamp) - I have high hopes for it; the execution plan looked good. Edit: with the 50 million records I generated, it looked pretty fast as long as the (id, datestamp) index (or another suitable index) is there.
SELECT tID.id, V1.first_maintainer, V2.last_maintainer, V3.last_s0_maintainer, V4.last_s1_maintainer, V5.first_s0_maintainer, V6.first_s1_maintainer
FROM (SELECT DISTINCT ID from vendor_info) tID
OUTER APPLY
(SELECT TOP 1 vi1.maintainer first_maintainer
FROM vendor_info vi1
WHERE vi1.id = tID.id
ORDER BY vi1.datestamp ASC) V1
OUTER APPLY
(SELECT TOP 1 vi2.maintainer last_maintainer
FROM vendor_info vi2
WHERE vi2.id = tID.id
ORDER BY vi2.datestamp DESC) V2
OUTER APPLY
(SELECT TOP 1 vi3.maintainer last_s0_maintainer
FROM vendor_info vi3
WHERE vi3.statuz = 0 AND vi3.id = tID.id
ORDER BY vi3.datestamp DESC) V3
OUTER APPLY
(SELECT TOP 1 vi4.maintainer last_s1_maintainer
FROM vendor_info vi4
WHERE vi4.statuz = 1 AND vi4.id = tID.id
ORDER BY vi4.datestamp DESC) V4
OUTER APPLY
(SELECT TOP 1 vi5.maintainer first_s0_maintainer
FROM vendor_info vi5
WHERE vi5.statuz = 0 AND vi5.id = tID.id
ORDER BY vi5.datestamp ASC) V5
OUTER APPLY
(SELECT TOP 1 vi6.maintainer first_s1_maintainer
FROM vendor_info vi6
WHERE vi6.statuz = 1 AND vi6.id = tID.id
ORDER BY vi6.datestamp ASC) V6
I'd go with Andrei Odegov's answer.
The perfect solution would be an aggregation function that gives you the name for the maximum or minimum date, like Oracle's KEEP FIRST/LAST. SQL Server doesn't have such a function, so using window functions as shown by Andrei Odegov seems the best solution.
If this is still too slow, it may be worth trying to concatenate dates and names and looking for the MIN/MAX of those (e.g. '20180101Eric' < '20180201Jay'), then extracting the names. A lot of string manipulation, but simple aggregation, and you must read the whole table anyway.
WITH vi AS
(
SELECT
id,
statuz,
CONVERT(VARCHAR, datestamp) + maintainer AS date_and_name
FROM vendor_info
)
SELECT
id
, SUBSTRING(MIN(date_and_name), 9, 100) AS first_maintainer
, SUBSTRING(MAX(date_and_name), 9, 100) AS last_maintainer
, SUBSTRING(MAX(case when statuz = 0 then date_and_name end), 9, 100) AS last_s0_maintainer
, SUBSTRING(MAX(case when statuz = 1 then date_and_name end), 9, 100) AS last_s1_maintainer
, SUBSTRING(MIN(case when statuz = 0 then date_and_name end), 9, 100) AS first_s0_maintainer
, SUBSTRING(MIN(case when statuz = 1 then date_and_name end), 9, 100) AS first_s1_maintainer
FROM vi
GROUP BY id
ORDER BY id;
(If you store the dates as dates and not as integers as shown in your SQL fiddle, then you'll have to change CONVERT and maybe SUBSTRING accordingly.)
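For example, if datestamp were a DATE column, the concatenation could use CONVERT style 112 (yyyymmdd), which keeps the 8-character prefix and therefore the SUBSTRING offsets unchanged (a sketch, not tested against your fiddle):
CONVERT(VARCHAR(8), datestamp, 112) + maintainer AS date_and_name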
SQL fiddle: http://sqlfiddle.com/#!18/9ee2c7/46
Also it's possible to use UNPIVOT/JOIN/PIVOT:
WITH
a AS (
SELECT
id, statuz,
MIN(datestamp) AS fzm, MAX(datestamp) AS lzm,
MIN(MIN(datestamp)) OVER(PARTITION BY id) AS fm,
MAX(MAX(datestamp)) OVER(PARTITION BY id) AS lm
FROM vendor_info
GROUP BY id, statuz
),
b AS (
SELECT
v.id,
up.[type] + IIF(up.[type] IN('fm', 'lm'), '', STR(up.statuz, 1)) AS p,
v.maintainer
FROM a
UNPIVOT(datestamp FOR [type] IN(fm, lm, fzm, lzm)) AS up
JOIN vendor_info v
ON up.id = v.id AND up.datestamp = v.datestamp
)
SELECT
id,
fm AS first_maintainer, lm AS last_maintainer,
lzm0 AS last_s0_maintainer, lzm1 AS last_s1_maintainer,
fzm0 AS first_s0_maintainer, fzm1 AS first_s1_maintainer
FROM b
PIVOT(MIN(maintainer) FOR p IN(fm, lm, lzm0, lzm1, fzm0, fzm1)) AS p;
It can be tested on SQL Fiddle.
I have some sample data like:
INSERT INTO mytable
([FK_ID], [TYPE_ID])
VALUES
(10, 1),
(11, 1), (11, 2),
(12, 1), (12, 2), (12, 3),
(14, 1), (14, 2), (14, 3), (14, 4),
(15, 1), (15, 2), (15, 4)
Now, here I am trying to check whether, in each group by FK_ID, we have an exact match of the TYPE_ID values 1, 2 & 3.
So, the expected output is like:
(10, 1) this should fail
As in group FK_ID = 10 we only have one record
(11, 1), (11, 2) this should also fail
As in group FK_ID = 11 we have two records.
(12, 1), (12, 2), (12, 3) this should pass
As in group FK_ID = 12 we have three records,
and all the TYPE_ID values exactly match 1, 2 & 3.
(14, 1), (14, 2), (14, 3), (14, 4) this should also fail
As we have 4 records here.
(15, 1), (15, 2), (15, 4) this should also fail
Even though we have three records, it should fail because the TYPE_ID values here (1, 2, 4) do not match the required set (1, 2, 3).
Here is my attempt:
select * from mytable t1
where exists (select COUNT(t2.TYPE_ID)
from mytable t2 where t2.FK_ID = t1.FK_ID
and t2.TYPE_ID IN (1, 2, 3)
group by t2.FK_ID having COUNT(t2.TYPE_ID) = 3);
This is not working as expected, because it also passes for FK_ID = 14, which has four records.
Demo: SQL Fiddle
Also, how can we make it generic, so that if we need to check for 4 or more TYPE_ID values, like (1,2,3,4) or (1,2,3,4,5), we can do that easily by updating a few values?
The following query will do what you want:
select fk_id
from t
group by fk_id
having sum(case when type_id in (1, 2, 3) then 1 else 0 end) = 3 and
sum(case when type_id not in (1, 2, 3) then 1 else 0 end) = 0;
This assumes that you have no duplicate pairs (although depending on how you want to handle duplicates, it might be as easy as using from (select distinct * from t) t).
As for "genericness", you need to update the in lists and the 3.
If you want something more generic:
with vals as (
select id
from (values (1), (2), (3)) v(id)
)
select fk_id
from t
group by fk_id
having sum(case when type_id in (select id from vals) then 1 else 0 end) = (select count(*) from vals) and
sum(case when type_id not in (select id from vals) then 1 else 0 end) = 0;
You can use this code:
SELECT y.fk_id FROM
(SELECT x.fk_id, COUNT(x.type_id) AS count, SUM(x.type_id) AS sum
FROM mytable x GROUP BY (x.fk_id)) AS y
WHERE y.count = 3 AND y.sum = 6
For making it generic, you can compare y.count with N and y.sum with N*(N+1)/2, where N is the number you are looking for (1, 2, ..., N).
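For example, for N = 4 (TYPE_ID values 1 through 4) the check becomes the following sketch; like the query above, it relies on COUNT and SUM together to identify the set, so duplicate TYPE_ID values or values outside 1..N can throw it off:
SELECT y.fk_id FROM
(SELECT x.fk_id, COUNT(x.type_id) AS count, SUM(x.type_id) AS sum
FROM mytable x GROUP BY (x.fk_id)) AS y
WHERE y.count = 4 AND y.sum = 10  -- N = 4, N*(N+1)/2 = 10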
You can try this query. COUNT and DISTINCT are used to eliminate duplicate records.
SELECT
[FK_ID]
FROM
#mytable T
GROUP BY
[FK_ID]
HAVING
COUNT(DISTINCT CASE WHEN [TYPE_ID] IN (1,2,3) THEN [TYPE_ID] END) = 3
AND COUNT(CASE WHEN [TYPE_ID] NOT IN (1,2,3) THEN [TYPE_ID] END) = 0
Try this:
select FK_ID,count(distinct TYPE_ID) from mytable
where TYPE_ID<=3
group by FK_ID
having count(distinct TYPE_ID)=3
You can use a CTE and pass the value you mentioned in the question dynamically.
WITH CTE
AS (
SELECT FK_ID,
COUNT(*) CNT
FROM #mytable
GROUP BY FK_ID
HAVING COUNT(*) = 3), -- <-- pass here the value you want to check
CTE1
AS (
SELECT T.[ID],
T.[FK_ID],
T.[TYPE_ID],
ROW_NUMBER() OVER(PARTITION BY T.[FK_ID] ORDER BY
(
SELECT NULL
)) RN
FROM #mytable T
INNER JOIN CTE C ON C.FK_ID = T.FK_ID),
CTE2
AS (
SELECT C1.FK_ID
FROM CTE1 C1
GROUP BY C1.FK_ID
HAVING SUM(C1.TYPE_ID) = SUM(C1.RN))
SELECT TT1.*
FROM CTE2 C2
INNER JOIN #mytable TT1 ON TT1.FK_ID = C2.FK_ID;
The SQL above will produce this result (I passed 3):
ID FK_ID TYPE_ID
4 12 1
5 12 2
6 12 3
Example query
USE HES
SELECT T1.ID, T2.DATE, T1.ORDINAL
FROM TABLE1 AS T1
LEFT JOIN TABLE2 AS T2
ON T1.ID = T2.ID AND T1.PARTYEAR = T2.PARTYEAR
WHERE
T1.MONTHYEAR = '201501'
Results from example query
ID Date Ordinal
1 01/01/2016 1
1 02/01/2016 2
1 03/01/2016 3
2 04/01/2016 1
2 05/01/2016 2
3 06/01/2016 1
3 07/01/2016 2
3 08/01/2016 3
4 09/01/2016 1
4 10/01/2016 1
Question
Each user has a unique ID; for each ID, how would I check that each data submission contains an Ordinal that is greater than the one that was previously submitted?
So, in the example query results above, ID 4 contains an issue.
I'm fairly new to SQL, I've been searching for similar examples but with no success.
Any help would be greatly appreciated.
Use LAG with OVER clause:
WITH cte AS
(
SELECT T1.ID, T2.DATE, T1.ORDINAL, LAG(T1.ORDINAL) OVER(PARTITION BY T1.ID ORDER BY T1.ORDINAL) AS LagOrdinal
FROM TABLE1 AS T1
LEFT JOIN TABLE2 AS T2
ON T1.ID = T2.ID AND T1.PARTYEAR = T2.PARTYEAR
WHERE
T1.MONTHYEAR = '201501'
)
SELECT ID, DATE, ORDINAL, CASE WHEN ORDINAL > LagOrdinal THEN 1 ELSE 0 END AS OrdinalIsGreater
FROM cte;
Try this one:
SELECT * INTO #tmp
FROM (VALUES
(1, CONVERT(date, '01/01/2016'), 1),
(1, '02/01/2016', 2),
(1, '03/01/2016', 3),
(2, '04/01/2016', 1),
(2, '05/01/2016', 2),
(3, '06/01/2016', 1),
(3, '07/01/2016', 2),
(3, '08/01/2016', 3),
(4, '09/01/2016', 1),
(4, '10/01/2016', 1)
)T(ID, Date, Ordinal)
WITH Numbered AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Date) R, *
FROM #tmp
)
SELECT N2.ID, N2.Date, N1.Ordinal Prev, N2.Ordinal Curr
FROM Numbered N1
JOIN Numbered N2 ON N1.R+1=N2.R AND N1.ID=N2.ID
WHERE N1.Ordinal >= N2.Ordinal
It can be simplified when the SQL Server version is >= 2012 (#tmp being your current result), as sketched below.
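A sketch of that >= 2012 simplification using LAG directly on #tmp (my guess at what was meant, not the author's exact query):
WITH Numbered AS
(
    SELECT ID, Date, Ordinal,
           LAG(Ordinal) OVER (PARTITION BY ID ORDER BY Date) AS Prev
    FROM #tmp
)
SELECT ID, Date, Prev, Ordinal AS Curr
FROM Numbered
WHERE Prev >= Ordinal;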
As @Serg said, you can achieve this using LAG:
select *
from (
SELECT T1.ID, T2.DATE, T1.ORDINAL,
lag(t1.ordinal) over (partition by t1.id order by t2.date) as prevOrdinal
FROM TABLE1 AS T1
LEFT JOIN TABLE2 AS T2
ON T1.ID = T2.ID AND T1.PARTYEAR = T2.PARTYEAR
WHERE
T1.MONTHYEAR = '201501') as t
where t.prevOrdinal >= t.ordinal;
OUTPUT
ID DATE ORDINAL prevOrdinal
4 2016-10-01 1 1
Group rows by T, and in each group find the row whose value equals the sum of the other rows in that group (this will be the largest value, or the smallest if the values are negative), then delete that row (one per group). If a group does not have enough elements to form such a sum, or has enough but none of its rows equals the sum of the others, nothing happens.
CREATE TABLE Test (
T varchar(10),
V int
);
INSERT INTO Test
VALUES ('A', 4),
('B', -5),
('C', 5),
('A', 2),
('B', -1),
('C', 10),
('A', 2),
('B', -4),
('C', 5),
('D', 0);
expected result:
A 2
A 2
B -1
B -4
C 5
C 5
D 0
Like the comments say, the requirements seem strange. The code below assumes the sum row is already present in the data and merely removes the row(s) with the largest absolute value per group, as long as that value is not 0.
if object_id('tempdb..#test') is not null
drop table #test
CREATE TABLE #Test (
T varchar(10),
V int
);
INSERT INTO #Test
VALUES ('A', 4), ('B', -5), ('C', 5), ('A', 2), ('B', -1), ('C', 10), ('A', 2), ('B', -4), ('C', 5), ('D', 0);
if object_id('tempdb..#test2') is not null
drop table #test2
SELECT
T,
V,
ABS(V) as absV
INTO #TEST2
FROM #TEST
SELECT * FROM #TEST2
if object_id('tempdb..#max') is not null
drop table #max
SELECT
T,
MAX(absV) AS MaxAbsV
INTO #Max
FROM #TEST2
GROUP BY T
HAVING MAX(AbsV) != 0
DELETE #TEST2
FROM #TEST2
INNER JOIN #MAX ON #TEST2.T = #MAX.T AND #TEST2.absV = #Max.MaxAbsV
SELECT * FROM #TEST2
ORDER BY T ASC
; with cte as
(
select T, V,
R = row_number() over (partition by T order by ABS(V) desc),
C = count(*) over (partition by T)
from Test
)
delete c
from cte c
inner join
(
select T, S = sum(V)
from cte
where R <> 1
group by T
) s on c.T = s.T
where c.C >= 3
and c.R = 1
and c.V = s.S
Using ABS and NOT Exists
DECLARE @Test TABLE (
T varchar(10),
V int
);
INSERT INTO @Test
VALUES ('A', 4), ('B', -5), ('C', 5), ('A', 2), ('B', -1), ('C', 10), ('A', 2), ('B', -4), ('C', 5), ('D', 0);
;WITH CTE as (
select T, max(ABS(v)) v from @Test
WHERE V <> 0
GROUP BY T )
SELECT T,V FROM @Test T where NOT exists (Select 1 FROM cte WHERE T = T.T AND v = ABS(T.V) )
ORDER BY T.T
Determine first if the rows are positive or negative by checking if SUM(V) is positive. And then determine if the smallest or largest value is equal to the SUM of the other rows, by subtracting from SUM(V) the MIN(V) if negative or MAX(V) if positive:
DELETE t
FROM Test t
INNER JOIN (
SELECT
T,
SUM(V) - CASE WHEN SUM(V) >= 0 THEN MAX(V) ELSE MIN(V) END AS ToDelete
FROM Test
GROUP BY T
HAVING COUNT(*) >= 3
) a
ON a.T = t.T
AND a.ToDelete = t.V
ONLINE DEMO
You can use the query below to get the required output:
select * into #t1 from test
select * from
(
select TT.T as T,TT.V as V
from test TT
JOIN
(select T,max(abs(V)) as V from #t1
group by T) P
on TT.T=P.T
where abs(TT.V) <> P.V
UNION ALL
select A.T as T,A.V as V from test A
JOIN(
select T,count(T) as Tcount from test
group by T
having count(T)=1) B on A.T=B.T
) X order by T
drop table #t1
You are looking for a value per group that is the sum of all the group's other values. E.g. 4 of (2,2,4) or -5 of (-5,-4,-1).
This is usually only one record per group, but the same number can qualify more than once. Here are examples with ties: (0,0), (-2,2,4,4), (-2,-2,4,4,4), or (-10,3,3,3,3,4).
As you can see, in every case you are looking for values that equal half of the group's total sum. (Of course: we are looking for n + n, where one n is in one record and the other n is the sum of all the other records.)
The only special case is a group whose only value is zero; that row we don't want to delete, of course.
Here is a delete statement that cannot deal with ties - it would delete all tied values instead of just one:
delete from test
where 2 * v =
(
select case when count(*) = 1 then null else sum(v) end
from test fullgroup
where fullgroup.t = test.t
);
In order to deal with ties you would need artificial row numbers, so as to delete only one record of all candidates.
with candidates as
(
select t, v, row_number() over (partition by t order by t) as rn
from
(
select
t, v,
sum(v) over (partition by t) as sumv,
count(*) over (partition by t) as cnt
from test
) comparables
where sumv = 2 * v and cnt > 1
)
delete
from candidates
where rn = 1;
SQL fiddle: http://sqlfiddle.com/#!6/6d97e/1
See if the query below helps:
DELETE AA FROM [Audit].[dbo].[Test] as AA
INNER JOIN (select T,
CASE
WHEN MAX(V) < 0 THEN MIN(V)
WHEN MIN(V) > 0 THEN MAX(V) ELSE MAX(V)
END as MAX_V,
CASE
WHEN SUM(V) > 0 THEN SUM(V) - MAX(V)
WHEN SUM(V) < 0 THEN SUM(V) - MIN(V) ELSE SUM(V)
END as SUM_V_REST
from [Audit].[dbo].[Test]
Group by T
Having Count(V) > 1) as BB ON AA.T = BB.T and AA.V = BB.MAX_V