Group by difference from starting point of group - sql

I have a lot of measurements in a Postgres database table, and I need to split this set into groups whenever a value strays too far from the "starting" point of the current group (more than some threshold). Sort order is determined by the id column.
Example: splitting with threshold = 1:
id measurements
---------------
1 1.5
2 1.4
3 1.8
4 2.6
5 3.7
6 3.5
7 3.0
8 2.6
9 2.5
10 2.8
It should be split into groups as follows:
id measurements group
---------------------
1 1.5 0 --- start new group
2 1.4 0
3 1.8 0
4 2.6 1 --- start new group because it is too far from 1.5
5 3.7 2 --- start new group because it is too far from 2.6
6 3.5 2
7 3.0 2
8 2.6 3 --- start new group because it is too far from 3.7
9 2.5 3
10 2.8 3
I can do this by writing a function using a LOOP, but I'm looking for a more efficient way. Performance is very important, as the actual table contains millions of rows.
Is it possible to achieve the goal by using PARTITION OVER, CTE or any other kind of SELECT?

Is it possible to achieve the goal by using PARTITION OVER, CTE or any other kind of SELECT?
This is an inherently procedural problem. Depending on where you start, all later rows can end up in a different group and / or with a different group value. Window functions (using the PARTITION clause) are no good for this.
You can use a recursive CTE:
WITH RECURSIVE rcte AS (
   (
   SELECT id
        , measurement
        , measurement - 1 AS grp_min
        , measurement + 1 AS grp_max
        , 1 AS grp
   FROM   tbl
   ORDER  BY id
   LIMIT  1
   )
   UNION ALL
   (
   SELECT t.id
        , t.measurement
        , CASE WHEN t.same_grp THEN r.grp_min ELSE t.measurement - 1 END  -- AS grp_min
        , CASE WHEN t.same_grp THEN r.grp_max ELSE t.measurement + 1 END  -- AS grp_max
        , CASE WHEN t.same_grp THEN r.grp     ELSE r.grp + 1         END  -- AS grp
   FROM   rcte r
   CROSS  JOIN LATERAL (
      SELECT *, t.measurement BETWEEN r.grp_min AND r.grp_max AS same_grp
      FROM   tbl t
      WHERE  t.id > r.id
      ORDER  BY t.id
      LIMIT  1
      ) t
   )
   )
SELECT id, measurement, grp
FROM   rcte;
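Note that grp starts at 1 here, while the expected output in the question numbers groups from 0; subtract 1 in the outer SELECT if the exact numbering matters.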
It's elegant. And decently fast. But it's only about as fast as (or even slower than) a procedural-language function with a single loop over the set, when that function is implemented efficiently:
CREATE OR REPLACE FUNCTION f_measurement_groups(_threshold numeric = 1)
  RETURNS TABLE (id int, grp int, measurement numeric) AS
$func$
DECLARE
   _grp_min numeric;
   _grp_max numeric;
BEGIN
   grp := 0;  -- init
   FOR id, measurement IN
      SELECT * FROM tbl t ORDER BY t.id
   LOOP
      IF measurement BETWEEN _grp_min AND _grp_max THEN
         RETURN NEXT;
      ELSE  -- the first row also lands here, while _grp_min / _grp_max are still NULL
         SELECT INTO grp, _grp_min, _grp_max
                grp + 1, measurement - _threshold, measurement + _threshold;
         RETURN NEXT;
      END IF;
   END LOOP;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_measurement_groups(); -- optionally supply different threshold
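For example, to split with a threshold of 0.5 instead of the default 1:
SELECT * FROM f_measurement_groups(0.5);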
My money is on the procedural function.
Typically, set-based solutions are faster. But not when solving an inherently procedural problem.
Related:
GROUP BY and aggregate sequential numeric values

You seem to be starting a group when the difference between consecutive rows exceeds 0.5. If I assume you have an ordering column, you can use lag() and a cumulative count to get your groups:
select t.*,
       count(*) filter (where prev_value < value - 0.5)
                over (order by <ordering col>) as grouping
from (select t.*,
             lag(value) over (order by <ordering col>) as prev_value
      from t
     ) t;

One way to attack this problem is by using a recursive CTE. This example is written using SQL Server syntax (because I don't work with postgres). It should be straightforward to translate, however.
-- Table #Test:
-- sequenceno measurements
-- ----------- ------------
-- 1 1.5
-- 2 1.4
-- 3 1.8
-- 4 2.6
-- 5 3.7
-- 6 3.5
-- 7 3.0
-- 8 2.6
-- 9 2.5
-- 10 2.8
WITH datapoints
AS
(
SELECT sequenceno,
measurements,
startmeasurement = measurements,
groupno = 0
FROM #Test
WHERE sequenceno = 1
UNION ALL
SELECT sequenceno = A.sequenceno + 1,
measurements = B.measurements,
startmeasurement =
CASE
WHEN abs(B.measurements - A.startmeasurement) >= 1 THEN B.measurements
ELSE A.startmeasurement
END,
groupno =
A.groupno +
CASE
WHEN abs(B.measurements - A.startmeasurement) >= 1 THEN 1
ELSE 0
END
FROM datapoints as A
INNER JOIN #Test as B
ON A.sequenceno + 1 = B.sequenceno
)
SELECT sequenceno,
measurements,
groupno
FROM datapoints
ORDER BY
sequenceno
-- Output:
-- sequenceno measurements groupno
-- ----------- --------------- -------
-- 1 1.5 0
-- 2 1.4 0
-- 3 1.8 0
-- 4 2.6 1
-- 5 3.7 2
-- 6 3.5 2
-- 7 3.0 2
-- 8 2.6 3
-- 9 2.5 3
-- 10 2.8 3
Note that I added a "sequenceno" column to the starting table, because relational tables are considered to be unordered sets. Also, if the number of input rows exceeds the default recursion limit (100), you will have to adjust the MAXRECURSION option (at least in SQL Server).
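For example, the limit can be raised, or removed entirely with 0, via a query hint on the statement that uses the CTE:
SELECT sequenceno,
       measurements,
       groupno
FROM datapoints
ORDER BY sequenceno
OPTION (MAXRECURSION 0);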
Additional note: Just noticed that the original question mentions that there are millions of records in the input data sets. The CTE approach would only work if that data could be broken up into manageable chunks.


Snowflake: Repeating rows based on column value

How do I repeat rows based on a column value in Snowflake using SQL?
I tried a few methods, such as dual and connect by, but they are not working.
I have two columns: Id and Quantity. For each Id, there are different values of Quantity.
So if you have a count, you can use a generator:
with ten_rows as (
select row_number() over (order by null) as rn
from table(generator(ROWCOUNT=>10))
), data(id, count) as (
select * from values
(1,2),
(2,4)
)
SELECT
d.*
,r.rn
from data as d
join ten_rows as r
on d.count >= r.rn
order by 1,3;
ID  COUNT  RN
 1      2   1
 1      2   2
 2      4   1
 2      4   2
 2      4   3
 2      4   4
OK, let's start by generating some data. We will create 10 rows, with a QTY randomly chosen as 1 or 2. Next we want to duplicate the rows with a QTY of 2 and leave the QTY = 1 rows as they are. Simply stack SPLIT_TO_TABLE() and REPEAT() with a LATERAL() join and voila. Obviously you can change all the parameters below to suit your needs; this solution works super fast and, in my opinion, way better than table generation.
WITH TEN_ROWS AS (
    SELECT ROW_NUMBER() OVER (ORDER BY NULL) AS SOME_ID,
           UNIFORM(1, 2, RANDOM()) AS QTY
    FROM TABLE(GENERATOR(ROWCOUNT => 10))
)
SELECT TEN_ROWS.*
FROM TEN_ROWS,
     LATERAL SPLIT_TO_TABLE(REPEAT('|', QTY - 1), '|') ALTERNATIVE_APPROACH;
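If ARRAY_GENERATE_RANGE is available in your account, a similar sketch avoids the delimiter-string trick entirely by flattening a generated array (my_table and quantity are placeholder names):
SELECT t.*
FROM my_table t,
     LATERAL FLATTEN(ARRAY_GENERATE_RANGE(0, t.quantity)) g;  -- one output row per array element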

Keyset pagination with composite key

I am using an Oracle 12c database and I have a table with the following structure:
Id NUMBER
SeqNo NUMBER
Val NUMBER
Valid VARCHAR2
A composite primary key is created with the field Id and SeqNo.
I would like to fetch the data with Valid = 'Y' and apply keyset pagination with a page size of 3. Assume I have the following data:
Id SeqNo Val Valid
1 1 10 Y
1 2 20 N
1 3 30 Y
1 4 40 Y
1 5 50 Y
2 1 100 Y
2 2 200 Y
Expected result:
----------------------------
Page 1
----------------------------
Id SeqNo Val Valid
1 1 10 Y
1 3 30 Y
1 4 40 Y
----------------------------
Page 2
----------------------------
Id SeqNo Val Valid
1 5 50 Y
2 1 100 Y
2 2 200 Y
Offset pagination can be done like this:
SELECT * FROM table ORDER BY Id, SeqNo OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY;
However, the actual table has more than 5 million records, and using OFFSET is going to slow down the query a lot. Therefore, I am looking for a keyset pagination approach (skip records using some unique fields instead of OFFSET).
Since a composite primary key is used, I need to offset the page with information from more than one field.
This is a sample SQL that should work in PostgreSQL (fetch the 2nd page):
SELECT * FROM table WHERE (Id, SeqNo) > (1, 4) AND Valid = 'Y' ORDER BY Id, SeqNo LIMIT 3;
How do I achieve the same in oracle?
Use the row_number() analytic function together with the ceil() arithmetic function. Arithmetic functions have no negative impact on performance, and the row_number() over (order by ...) expression orders the data itself, regardless of insertion order and without an extra order by clause in the main query. So, consider:
select Id,SeqNo,
ceil(row_number() over (order by Id,SeqNo)/3) as page
from tab
where Valid = 'Y';
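To actually fetch a single page with this approach, wrap the query and filter on the computed page number (a sketch):
select *
from (select t.*,
             ceil(row_number() over (order by Id, SeqNo) / 3) as page
      from tab t
      where Valid = 'Y')
where page = 2;  -- 2nd page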
P.S. It also works for Oracle 11g, while OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY works only for Oracle 12c.
You can use ORDER BY and then fetch rows using FETCH and OFFSET as follows:
Select ID, SEQ, VAL, VALID FROM TABLE
WHERE VALID = 'Y'
ORDER BY ID, SEQ
--FETCH FIRST 3 ROWS ONLY -- first page
--OFFSET 3 ROWS FETCH NEXT 3 ROWS ONLY -- second pages
--OFFSET 6 ROWS FETCH NEXT 3 ROWS ONLY -- third page
--Update--
You can use the row_number analytic function as follows:
Select id, seqNo, Val, valid from
  (Select t.*,
          row_number() over (order by id, seqNo) as rn
   from table t
   Where valid = 'Y')
Where ceil(rn/3) = 2 -- for page no. 2
Cheers!!
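Note, though, that both approaches above still read and number all the preceding rows. True keyset pagination, as in the PostgreSQL example from the question, can be had by expanding the row-value comparison (Id, SeqNo) > (1, 4) into an equivalent predicate that Oracle accepts (a sketch, using the question's placeholder table name):
SELECT *
FROM table
WHERE Valid = 'Y'
  AND (Id > 1 OR (Id = 1 AND SeqNo > 4))  -- (1, 4) = last row of the previous page
ORDER BY Id, SeqNo
FETCH FIRST 3 ROWS ONLY;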

oracle sql query optimization further 1

I have written a query (select * from bdb) to get only the updated values of PRICE for each combination of DAY, INST in the newest ACT.
I created A table like
CREATE TABLE bdb(
ACT NUMBER(8) NOT NULL,
INST NUMBER(8) NOT NULL,
DAY DATE NOT NULL,
PRICE VARCHAR2 (3),
CURR NUMBER (8,2),
PRIMARY KEY (ACT,INST,DAY)
);
I used this to populate the table:
DECLARE
   t_day bdb.day%type := '1-JAN-16';
   n     pls_integer;
BEGIN
   << act_loop >>
   FOR i IN 1..3 LOOP            -- NUMBER OF ACT i
      << inst_loop >>
      FOR j IN 1..1000 LOOP      -- NUMBER OF INST j
         t_day := '3-JAN-16';
         << day_loop >>
         FOR k IN 1..260 LOOP    -- NUMBER OF DAYS k
            n := dbms_random.value(1,3);
            INSERT INTO bdb (ACT, INST, DAY, PRICE, CURR) VALUES (i, j, t_day, n, 10.3);
            t_day := t_day + 1;
         END LOOP day_loop;
      END LOOP inst_loop;
   END LOOP act_loop;
END;
/
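Note that the date literals above rely on the session's NLS_DATE_FORMAT; an explicit conversion is more robust, e.g.:
t_day := TO_DATE('03-JAN-16', 'DD-MON-RR');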
Using this query I get only DAY, INST, PRICE:
select day,inst,price from bdb where (act=(select max(act) from bdb))
minus
select day,inst,price from bdb where act=(select max(act)-1 from bdb);
The above one is fast, but I want to get all the fields in an efficient way. The one I came up with is a bit slow:
select e1.*
from
   (select *
    from bdb
    where act = (select max(act) from bdb)
   ) e1,
   (select day, inst, price from bdb where act = (select max(act) from bdb)
    minus
    select day, inst, price from bdb where act = (select max(act) - 1 from bdb)
   ) e2
where e1.day  = e2.day
  and e1.inst = e2.inst;
Can anyone give any suggestion on how to optimize this further, or how to get the required output without joining the two subqueries? Help me ;)
Simply, what I need is:
ACT INST DAY PRI CURR
------------------------------------
3 890 05-MAR-16 3 10.3
3 890 06-MAR-16 2 10.3
3 890 07-MAR-16 2 10.3
3 891 05-MAR-16 2 10.3
3 891 06-MAR-16 1 10.3
3 891 07-MAR-16 2 10.3
4 890 05-MAR-16 3 10.3
4 890 06-MAR-16 2 10.3
4 890 07-MAR-16 1 10.3
4 891 05-MAR-16 2 10.3
4 891 06-MAR-16 2 10.3
4 891 07-MAR-16 1 10.3
Here, for (890, 05-MAR-16), (890, 06-MAR-16), (890, 07-MAR-16) and
(891, 05-MAR-16), (891, 06-MAR-16), (891, 07-MAR-16) in act = 3,
the prices are
3, 2, 2
2, 1, 2
But when act = 4 happens, the price values of
(890, 07-MAR-16)
(891, 06-MAR-16)
(891, 07-MAR-16)
change from what they were in act = 3;
the others do not change.
Ultimately, what I need is:
ACT INST DAY PRI CURR
------------------------------------
4 890 07-MAR-16 1 10.3
4 891 06-MAR-16 2 10.3
4 891 07-MAR-16 1 10.3
It looks like you're after the day, inst and price values which have a row where the act column has the maximum act value out of the whole table, but doesn't have a row where the act column is one less than the max act value.
You could try this:
SELECT day,
inst,
price
FROM (SELECT day,
inst,
price,
act,
MAX(act) OVER () max_overall_act
FROM bdb)
WHERE act IN (max_overall_act, max_overall_act -1)
GROUP BY day, inst, price
HAVING MAX(CASE WHEN act = max_overall_act THEN 1 END) = 1
AND MAX(CASE WHEN act = max_overall_act - 1 THEN 1 END) IS NULL;
First of all, the subquery finds the maximum act value across the whole table.
Then we select all rows that have an act value that is the maximum value or one less than that.
After that, we group the rows and find out which ones have an act = max act val, but don't have an act = max act val -1.
However, from what you said in your post:
I have written a query to select * from bdb to get only updated values in PRICE for the combination of DAY,INST in the newest ACT
neither the query you came up with nor the above query in my answer seems to tally with what you are after.
I think instead, you're after something like:
SELECT act,
inst,
DAY,
price,
curr,
prev_price -- if desired
FROM (SELECT act,
inst,
DAY,
price,
curr,
LEAD(price) OVER (PARTITION BY inst, DAY ORDER BY act DESC) prev_price,
row_number() OVER (PARTITION BY inst, DAY ORDER BY act DESC) rn
FROM bdb)
WHERE rn = 1
AND prev_price != price;
What this does is use the LEAD() analytic function (based on the descending act order) to find the price of the row with the previous act for each inst and day, along with the row number.
Then, to find the latest act row, we simply select the rows where the row number is 1 and where the previous price doesn't match the current price. You can then display both the current and the previous price, if you want to.

SQL random number that doesn't repeat within a group

Suppose I have a table:
HH SLOT RN
--------------
1 1 null
1 2 null
1 3 null
--------------
2 1 null
2 2 null
2 3 null
I want to set RN to be a random number between 1 and 10. It's ok for the number to repeat across the entire table, but it's bad to repeat the number within any given HH. E.g.,:
HH SLOT RN_GOOD RN_BAD
--------------------------
1 1 9 3
1 2 4 8
1 3 7 3 <--!!!
--------------------------
2 1 2 1
2 2 4 6
2 3 9 4
This is on Netezza, if it makes any difference. This one is a real head-scratcher for me. Thanks in advance!
To get a random number between 1 and the number of rows in the hh, you can use:
select hh, slot, row_number() over (partition by hh order by random()) as rn
from t;
The larger range of values is a bit more challenging. The following calculates a table (called randoms) with numbers and a random position in the same range. It then uses slot to index into the position and pull the random number from the randoms table:
with nums as (
select 1 as n union all select 2 union all select 3 union all select 4 union all select 5 union all
select 6 union all select 7 union all select 8 union all select 9
),
randoms as (
select n, row_number() over (order by random()) as pos
from nums
)
select t.hh, t.slot, hnum.n
from (select hh, randoms.n, randoms.pos
from (select distinct hh
from t
) t cross join
randoms
) hnum join
t
on t.hh = hnum.hh and
t.slot = hnum.pos;
I demonstrated this in Postgres, which I assume is close enough to Netezza to have matching syntax.
I am not an expert on SQL, but probably do something like this:
1. Initialize a counter CNT = 1.
2. Create a table such that you sample 1 row randomly from each group, and a count of null RN, say C_NULL_RN.
3. With probability C_NULL_RN / (10 - CNT + 1) for each row, assign CNT as RN.
4. Increment CNT and go to step 2.
Well, I couldn't get a slick solution, so I did a hack:
1. Created a new integer field called rand_inst.
2. Assign a random number to each empty slot.
3. Update rand_inst to be the instance number of that random number within this household. E.g., if I get two 3's, then the second 3 will have rand_inst set to 2.
4. Update the table to assign a different random number anywhere that rand_inst > 1.
5. Repeat assignment and update until we converge on a solution.
Here's what it looks like. Too lazy to anonymise it, so the names are a little different from my original post:
/* Iterative hack to fill 6 slots with a random number between 1 and 13.
A random number *must not* repeat within a household_id.
*/
update c3_lalfinal a
set a.rand_inst = b.rnum
from (
select household_id
,slot_nbr
,row_number() over (partition by household_id,rnd order by null) as rnum
from c3_lalfinal
) b
where a.household_id = b.household_id
and a.slot_nbr = b.slot_nbr
;
update c3_lalfinal
set rnd = CAST(0.5 + random() * (13-1+1) as INT)
where rand_inst>1
;
/* Repeat until this query returns 0: */
select count(*) from (
select household_id from c3_lalfinal group by 1 having count(distinct(rnd)) <> 6
) x
;

Best way to interpolate values in SQL

I have a table with rates at certain dates:
Rates
Id | Date | Rate
----+---------------+-------
1 | 01/01/2011 | 4.5
2 | 01/04/2011 | 3.2
3 | 04/06/2011 | 2.4
4 | 30/06/2011 | 5
I want to get the output rate based on a simple linear interpolation.
So if I enter 17/06/2011:
Date Rate
---------- -----
01/01/2011 4.5
01/04/2011 3.2
04/06/2011 2.4
17/06/2011
30/06/2011 5.0
the linear interpolation is (5 + 2.4) / 2 = 3.7
Is there a way to do this in a simple query (SQL Server 2005), or does this kind of thing need to be done programmatically (C#...)?
Something like this (corrected):
SELECT CASE WHEN next.Date IS NULL THEN prev.Rate
            WHEN prev.Date IS NULL THEN next.Rate
            WHEN next.Date = prev.Date THEN prev.Rate
            ELSE ( DATEDIFF(d, prev.Date, @InputDate) * next.Rate
                 + DATEDIFF(d, @InputDate, next.Date) * prev.Rate
                 ) / DATEDIFF(d, prev.Date, next.Date)
       END AS interpolationRate
FROM
    ( SELECT TOP 1
             Date, Rate
      FROM Rates
      WHERE Date <= @InputDate
      ORDER BY Date DESC
    ) AS prev
    CROSS JOIN
    ( SELECT TOP 1
             Date, Rate
      FROM Rates
      WHERE Date >= @InputDate
      ORDER BY Date ASC
    ) AS next
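As a quick check against the sample data with @InputDate = '17/06/2011': prev is (04/06/2011, 2.4) and next is (30/06/2011, 5.0), both 13 days away, so the result is (13 * 5.0 + 13 * 2.4) / 26 = 3.7, matching the expected output.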
As #Mark already pointed out, the CROSS JOIN has its limitations. As soon as the target value falls outside the range of defined values no records will be returned.
Also, the above solution is limited to one result only. For my project I needed an interpolation for a whole list of x values and came up with the following solution. Maybe it is of interest to other readers too?
-- generate some grid data values in table #ddd:
CREATE TABLE #ddd (id int,x float,y float, PRIMARY KEY(id,x));
INSERT INTO #ddd VALUES (1,3,4),(1,4,5),(1,6,3),(1,10,2),
(2,1,4),(2,5,6),(2,6,5),(2,8,2);
SELECT * FROM #ddd;
-- target x-values in table #vals (results are to go into column yy):
CREATE TABLE #vals (xx float PRIMARY KEY,yy float null, itype int);
INSERT INTO #vals (xx) VALUES (1),(3),(4.3),(9),(12);
-- do the actual interpolation
WITH valstyp AS (
SELECT id ii,xx,
CASE WHEN min(x)<xx THEN CASE WHEN max(x)>xx THEN 1 ELSE 2 END ELSE 0 END flag,
min(x) xmi,max(x) xma
FROM #vals INNER JOIN #ddd ON id=1 GROUP BY xx,id
), ipol AS (
SELECT v.*,(b.x-xx)/(b.x-a.x) f,a.y ya,b.y yb
FROM valstyp v
INNER JOIN #ddd a ON a.id=ii AND a.x=(SELECT max(x) FROM #ddd WHERE id=ii
AND (flag=0 AND x=xmi OR flag=1 AND x<xx OR flag=2 AND x<xma))
INNER JOIN #ddd b ON b.id=ii AND b.x=(SELECT min(x) FROM #ddd WHERE id=ii
AND (flag=0 AND x>xmi OR flag=1 AND x>xx OR flag=2 AND x=xma))
)
UPDATE v SET yy=ROUND(f*ya+(1-f)*yb,8),itype=flag FROM #vals v INNER JOIN ipol i ON i.xx=v.xx;
-- list the interpolated results table:
SELECT * FROM #vals
When running the above script you will get the following data grid points in table #ddd
id x y
-- -- -
1 3 4
1 4 5
1 6 3
1 10 2
2 1 4
2 5 6
2 6 5
2 8 2
[[ The table contains grid points for two identities (id=1 and id=2). In my example I referenced only the 1-group via ON id=1 in the valstyp CTE. This can be changed to suit your requirements. ]]
and the results table #vals with the interpolated data in column yy:
xx yy itype
--- ---- -----
1 2 0
3 4 0
4.3 4.7 1
9 2.25 1
12 1.5 2
The last column itype indicates the type of interpolation/extrapolation that was used to calculate the value:
0: extrapolation to lower end
1: interpolation within given data range
2: extrapolation to higher end
The trick with CROSS JOIN here is that it won't return any records if either of the tables has no rows (1 * 0 = 0), and the query may break. A better way is to use a FULL OUTER JOIN with an inequality condition (to avoid getting more than one row):
( SELECT TOP 1
Date, Rate
FROM Rates
WHERE Date <= #InputDate
ORDER BY Date DESC
) AS prev
FULL OUTER JOIN
( SELECT TOP 1
Date, Rate
FROM Rates
WHERE Date >= #InputDate
ORDER BY Date ASC
) AS next
ON (prev.Date <> next.Date) [or Rate depending on what is unique]
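Put together, a sketch of the first query rewritten with the FULL OUTER JOIN. The CASE expression already handles the NULL side when @InputDate falls outside the defined range; DISTINCT guards against the duplicate row that appears when prev and next are the same row, since the ON inequality then splits it into two:
SELECT DISTINCT
       CASE WHEN next.Date IS NULL THEN prev.Rate
            WHEN prev.Date IS NULL THEN next.Rate
            WHEN next.Date = prev.Date THEN prev.Rate
            ELSE ( DATEDIFF(d, prev.Date, @InputDate) * next.Rate
                 + DATEDIFF(d, @InputDate, next.Date) * prev.Rate
                 ) / DATEDIFF(d, prev.Date, next.Date)
       END AS interpolationRate
FROM
    ( SELECT TOP 1 Date, Rate
      FROM Rates
      WHERE Date <= @InputDate
      ORDER BY Date DESC
    ) AS prev
    FULL OUTER JOIN
    ( SELECT TOP 1 Date, Rate
      FROM Rates
      WHERE Date >= @InputDate
      ORDER BY Date ASC
    ) AS next
    ON prev.Date <> next.Date;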