How to make a specific group by (window like function) - sql

Below you can see the table and context.
I want to get 3 groups out of the table. I want to group by ABDC_IDENT, but only while the DATE_RANKS order is unbroken: as you can see in the data, after DATE_RANKS 11 come 1 and 2 (because they belong to group B), and then the ranking continues for group A (the order by is on VARIOUS_DATES).
What I want to get is 3 groups: the first is group A with ranks 1 to 11, the second is group B with ranks 1 and 2, and the third is group A again, but with ranks 12 to 21. I hope this is clear for everyone.
I'm currently experimenting with ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Any idea is welcome; maybe it can be done some other way. Cheers and thanks.
Here is my fiddle, so you can easily build it yourself:
CREATE TABLE Table1
(ABDC_IDENT varchar(5), VARIOUS_DATES date, DATE_RANKS int)
;
INSERT INTO Table1
(ABDC_IDENT, VARIOUS_DATES, DATE_RANKS)
VALUES
('A', '31.12.2010', 1),
('A', '31.01.2011', 2),
('A', '28.02.2011', 3),
('A', '31.03.2011', 4),
('A', '29.04.2011', 5),
('A', '31.05.2011', 6),
('A', '30.06.2011', 7),
('A', '29.07.2011', 8),
('A', '31.08.2011', 9),
('A', '30.09.2011', 10),
('A', '31.10.2011', 11),
('B', '30.11.2011', 1),
('B', '30.12.2011', 2),
('A', '31.01.2012', 12),
('A', '29.02.2012', 13),
('A', '30.03.2012', 14),
('A', '30.04.2012', 15),
('A', '31.05.2012', 16),
('A', '29.06.2012', 17),
('A', '31.07.2012', 18),
('A', '31.08.2012', 19),
('A', '28.09.2012', 20),
('A', '31.10.2012', 21)
;
The desired result would then be inserted into another table:
Table2
GROUP_ABC | MIN_DATE   | MAX_DATE   |
A         | 31.12.2010 | 31.10.2011 |
B         | 30.11.2011 | 30.12.2011 |
C         | 31.01.2012 | 31.10.2012 |

I think you can use convert with style 104 (dd.mm.yyyy) to handle the date strings.
Does this work?
select
substring('ABCDEF', row_number() over (order by min(VARIOUS_DATES)), 1) as GROUP_ABC,
min(VARIOUS_DATES) as MIN_DATE,
max(VARIOUS_DATES) as MAX_DATE
from (
select
ABDC_IDENT,
convert(date, VARIOUS_DATES, 104) as VARIOUS_DATES,
row_number() over (order by convert(date, VARIOUS_DATES, 104)) - DATE_RANKS as grp
from Table1
) data
group by ABDC_IDENT, grp
or:
select
substring('ABCDEF', row_number() over (order by MIN_DATE), 1) as GROUP_ABC,
MIN_DATE, MAX_DATE
from (
select
ABDC_IDENT as GROUP_ABC,
min(VARIOUS_DATES) as MIN_DATE,
max(VARIOUS_DATES) as MAX_DATE
from (
select
ABDC_IDENT,
convert(date, VARIOUS_DATES, 104) as VARIOUS_DATES,
row_number()
over (order by convert(date, VARIOUS_DATES, 104)) - DATE_RANKS as grp
from Table1
) data
group by ABDC_IDENT, grp
) t
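To see why subtracting DATE_RANKS from a global row number separates the three groups, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. It is an illustration under assumptions, not the SQL Server query above: the table is shortened to seven rows, the dates are stored as ISO strings (so no convert is needed), SQLite 3.25 or later is required for window functions, and `label_islands` is a made-up helper name.

```python
import sqlite3

# Shortened sample (assumption): 3 rows of A, 2 of B, then A again.
# Dates are ISO strings so that text order equals date order in SQLite.
ROWS = [
    ("A", "2010-12-31", 1),
    ("A", "2011-01-31", 2),
    ("A", "2011-02-28", 3),
    ("B", "2011-11-30", 1),
    ("B", "2011-12-30", 2),
    ("A", "2012-01-31", 4),
    ("A", "2012-02-29", 5),
]

ISLAND_QUERY = """
SELECT ABDC_IDENT,
       MIN(VARIOUS_DATES) AS MIN_DATE,
       MAX(VARIOUS_DATES) AS MAX_DATE
FROM (
    SELECT ABDC_IDENT,
           VARIOUS_DATES,
           -- global position minus per-ident rank is constant inside one run
           ROW_NUMBER() OVER (ORDER BY VARIOUS_DATES) - DATE_RANKS AS grp
    FROM Table1
) AS data
GROUP BY ABDC_IDENT, grp
ORDER BY MIN_DATE
"""


def label_islands(rows):
    """Return (letter, min_date, max_date) per run, lettered by start date."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Table1 "
        "(ABDC_IDENT TEXT, VARIOUS_DATES TEXT, DATE_RANKS INTEGER)"
    )
    conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", rows)
    found = conn.execute(ISLAND_QUERY).fetchall()
    conn.close()
    # Same role as substring('ABCDEF', row_number() ..., 1) in the answer.
    return [("ABCDEF"[i], lo, hi) for i, (_, lo, hi) in enumerate(found)]


islands = label_islands(ROWS)
for island in islands:
    print(island)
```

On this shortened sample grp is 0 for the first A run, 3 for the B run, and 2 for the second A run, so grouping by (ABDC_IDENT, grp) yields three rows.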

Related

sql that finds records within 3 days of a condition being met

I am trying to find all records that fall within a date range prior to an event occurring. In my table below, I want to pull all records that are 3 days or less before the switch field changes from 0 to 1, ordered by date, partitioned by product. My solution does not work: it includes the first record, which should be skipped because it is outside the 3-day window. I am scanning a table with millions of records; is there a way to reduce the complexity/cost while maintaining my desired results?
http://sqlfiddle.com/#!18/eebe7
CREATE TABLE productlist
([product] varchar(13), [switch] int, [switchday] date)
;
INSERT INTO productlist
([product], [switch], [switchday])
VALUES
('a', 0, '2019-12-28'),
('a', 0, '2020-01-02'),
('a', 1, '2020-01-03'),
('a', 0, '2020-01-06'),
('a', 0, '2020-01-07'),
('a', 1, '2020-01-09'),
('a', 1, '2020-01-10'),
('a', 1, '2020-01-11'),
('b', 1, '2020-01-01'),
('b', 0, '2020-01-02'),
('b', 0, '2020-01-03'),
('b', 1, '2020-01-04')
;
my solution:
with switches as (
SELECT
*,
case when lead(switch) over (partition by product order by switchday)=1
and switch=0 then 'first day switch'
else null end as leadswitch
from productlist
),
switchdays as (
select * from switches
where leadswitch='first day switch'
)
select pl.*
,'lead'
from productlist pl
left join switchdays ss
on pl.product=ss.product
and pl.switchday = ss.switchday
and datediff(day, pl.switchday, ss.switchday)<=3
where pl.switch=0
desired output, capturing records that occur within 3 days of a switch going from 0 to 1, for each product, ordered by date:
product switch switchday
a 0 2020-01-02 lead
a 0 2020-01-06 lead
a 0 2020-01-07 lead
b 0 2020-01-02 lead
b 0 2020-01-03 lead
If I understand correctly, you can just use lead() twice:
select pl.*
from (select pl.*,
lead(switch) over (partition by product order by switchday) as next_switch_1,
lead(switch, 2) over (partition by product order by switchday) as next_switch_2
from productlist pl
) pl
where switch = 0 and
1 in (next_switch_1, next_switch_2);
Here is a db<>fiddle.
EDIT (based on comment):
select pl.*
from (select pl.*,
min(case when switch = 1 then switchday end) over (partition by product order by switchday desc) as next_switch_1_day
from productlist pl
) pl
where switch = 0 and
next_switch_1_day <= dateadd(day, 2, switchday);
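As a rough check of the "date of the next switch = 1 row" idea, the sketch below runs it on the question's sample data in an in-memory SQLite database from Python. Several things here are assumptions: SQLite (3.25 or later) stands in for SQL Server, SQLite's `date(x, '+3 days')` replaces `dateadd`, the alias `next_switch_day` and the helper `rows_before_switch` are made-up names, and the window is set to 3 days to follow the question's wording rather than the 2 used in the query above.

```python
import sqlite3

# Sample data copied from the question.
ROWS = [
    ("a", 0, "2019-12-28"), ("a", 0, "2020-01-02"), ("a", 1, "2020-01-03"),
    ("a", 0, "2020-01-06"), ("a", 0, "2020-01-07"), ("a", 1, "2020-01-09"),
    ("a", 1, "2020-01-10"), ("a", 1, "2020-01-11"),
    ("b", 1, "2020-01-01"), ("b", 0, "2020-01-02"), ("b", 0, "2020-01-03"),
    ("b", 1, "2020-01-04"),
]

LEAD_QUERY = """
SELECT product, switch, switchday
FROM (
    SELECT product, switch, switchday,
           -- scanning dates in descending order, the running MIN is the
           -- nearest switch = 1 date on or after the current row
           MIN(CASE WHEN switch = 1 THEN switchday END)
               OVER (PARTITION BY product ORDER BY switchday DESC)
               AS next_switch_day
    FROM productlist
) AS pl
WHERE switch = 0
  AND next_switch_day <= date(switchday, '+3 days')
ORDER BY product, switchday
"""


def rows_before_switch(rows):
    """Return switch = 0 rows at most 3 days before the next switch = 1 row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE productlist (product TEXT, switch INTEGER, switchday TEXT)"
    )
    conn.executemany("INSERT INTO productlist VALUES (?, ?, ?)", rows)
    result = conn.execute(LEAD_QUERY).fetchall()
    conn.close()
    return result


lead_rows = rows_before_switch(ROWS)
for row in lead_rows:
    print(row)
```

With a 3-day window this returns the five rows listed in the desired output; the 2019-12-28 row is dropped because the next switch is six days away.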

LEAST(STRING) and GREATEST(STRING) for long STRINGS in Legacy BigQuery SQL

I'd like to run the following SQL query in a BigQuery table:
SELECT
LEAST(origin, destination) AS point_1,
GREATEST(origin, destination) AS point_2,
COUNT(*) AS journey_count,
FROM route
GROUP BY point_1, point_2
ORDER BY point_1, point_2;
on a table like:
INSERT INTO route
( route_id, origin, destination, dur)
VALUES
( 1, 'AA', 'BB', 2),
( 2, 'CC', 'DD', 4),
( 3, 'BB', 'AA', 6),
( 4, 'CC', 'AA', 2),
( 5, 'DD', 'CC', 12);
But BigQuery tells me that, although the query is syntactically correct, string is not a valid argument type for the LEAST function, for string length > 1. I tried to cast it to numeric, like LEAST(cast(origin as numeric), cast(destination as numeric)) AS point_1 but it tells me LEAST cannot handle bytes.
How do I make LEAST and GREATEST compare long strings in BigQuery?
#legacySQL
SELECT
IF(origin < destination, CONCAT(origin, ' - ', destination), CONCAT(destination, ' - ', origin)) route,
COUNT(1) journey_count
FROM [project:dataset.table]
GROUP BY route
ORDER BY route
Applied to the sample data from your example, the result is:
Row | route   | journey_count
1   | AA - BB | 2
2   | AA - CC | 1
3   | CC - DD | 2
see this
with t as (
(select 1 as route_id, 'AA' as origin, 'BB' as destination, 2 as dur)
union all
(select 2, 'CC', 'DD', 4)
union all
(select 3, 'BB', 'AA', 6)
union all
(select 4, 'CC', 'AA', 2)
union all
(select 5, 'DD', 'CC', 12))
select
if(origin<destination,origin,destination) as point_1,
if(origin<destination,destination,origin) as point_2,
count(1) as journey_count
from t
GROUP BY point_1, point_2
ORDER BY point_1, point_2;

Postgres select if another event exists before and after a time range

I have a table like this:
I need to select the following records:
All category A
Category B only if before and after 20 seconds a category A exists for the same name
To create a test table:
CREATE TABLE test(
time TIMESTAMP,
name CHAR(10),
category CHAR(50)
);
INSERT INTO test (time, name, category)
VALUES ('2019-02-25 18:30:10', 'john', 'A'),
('2019-02-25 18:30:15', 'john', 'B'),
('2019-02-25 19:00:00', 'phil', 'A'),
('2019-02-25 20:00:00', 'tim', 'A'),
('2019-02-25 21:00:00', 'tim', 'B'),
('2019-02-25 21:00:00', 'frank', 'B');
So from the above, this is the desired output:
You can use an exists subquery to determine if there is an A row within 20 seconds:
select *
from test t1
where category = 'A'
or exists
(
select *
from test t2
where t2.category = 'A'
and t2.name = t1.name
and abs(extract(epoch from t2.time - t1.time)) < 20
)
You can use exists. But you can also use window functions:
select t.*
from (select t.*,
max(t.time) filter (where t.category = 'A') over (partition by name order by time) as prev_a,
min(t.time) filter (where t.category = 'A') over (partition by name order by time desc) as next_a
from test t
) t
where category = 'A' or
(category = 'B' and
(prev_a > time - interval '20 second' or
next_a < time + interval '20 second'
)
);
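The exists approach can be tried on the test data without a Postgres server. The following is a minimal sketch using Python's built-in sqlite3 module, under a few assumptions: the `time` column is renamed to `ts`, the epoch difference is computed with SQLite's `strftime('%s', ...)` instead of `extract(epoch from ...)`, and `rows_near_a` is a made-up helper name.

```python
import sqlite3

# Test rows copied from the question (column "time" is called ts here).
ROWS = [
    ("2019-02-25 18:30:10", "john", "A"),
    ("2019-02-25 18:30:15", "john", "B"),
    ("2019-02-25 19:00:00", "phil", "A"),
    ("2019-02-25 20:00:00", "tim", "A"),
    ("2019-02-25 21:00:00", "tim", "B"),
    ("2019-02-25 21:00:00", "frank", "B"),
]

EXISTS_QUERY = """
SELECT ts, name, category
FROM test AS t1
WHERE category = 'A'
   OR EXISTS (
        SELECT 1
        FROM test AS t2
        WHERE t2.category = 'A'
          AND t2.name = t1.name
          -- absolute gap in seconds between the two rows
          AND abs(CAST(strftime('%s', t2.ts) AS INTEGER)
                  - CAST(strftime('%s', t1.ts) AS INTEGER)) < 20
   )
ORDER BY ts, name
"""


def rows_near_a(rows):
    """Return all A rows plus B rows within 20 s of an A row of the same name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test (ts TEXT, name TEXT, category TEXT)")
    conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)
    result = conn.execute(EXISTS_QUERY).fetchall()
    conn.close()
    return result


selected = rows_near_a(ROWS)
for row in selected:
    print(row)
```

Only john's B row survives among the B rows: tim's B row is an hour after his A row, and frank has no A row at all.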

Increment column for streaks

How do I get the following result highlighted in yellow?
Essentially I want a calculated field which increments by 1 when VeganOption = 1 and is zero when VeganOption = 0
I have tried using the following query but using partition continues to increment after a zero. I'm a bit stuck on this one.
SELECT [UniqueId]
,[Meal]
,[VDate]
,[VeganOption]
, row_number() over (partition by [VeganOption] order by [UniqueId])
FROM [Control]
order by [UniqueId]
Table Data:
CREATE TABLE Control
([UniqueId] int, [Meal] varchar(10), [VDate] datetime, [VeganOption] int);
INSERT INTO Control ([UniqueId], [Meal], [VDate], [VeganOption])
VALUES
('1', 'Breakfast',' 2018-08-01 00:00:00', 1),
('2', 'Lunch',' 2018-08-01 00:00:00', 1),
('3', 'Dinner',' 2018-08-01 00:00:00', 1),
('4', 'Breakfast',' 2018-08-02 00:00:00', 1),
('5', 'Lunch',' 2018-08-02 00:00:00', 0),
('6', 'Dinner',' 2018-08-02 00:00:00', 0),
('7', 'Breakfast',' 2018-08-03 00:00:00', 1),
('8', 'Lunch',' 2018-08-03 00:00:00', 1),
('9', 'Dinner',' 2018-08-03 00:00:00', 1),
('10', 'Breakfast',' 2018-08-04 00:00:00', 0),
('11', 'Lunch',' 2018-08-04 00:00:00', 1),
('12', 'Dinner',' 2018-08-04 00:00:00', 1)
;
This is for SQL Server 2016+
You could create subgroups using SUM and then ROW_NUMBER:
WITH cte AS (
SELECT [UniqueId]
,[Meal]
,[VDate]
,[VeganOption]
,sum(CASE WHEN VeganOption = 1 THEN 0 ELSE 1 END)
over (order by [UniqueId]) AS grp --switching 0 <-> 1
FROM [Control]
)
SELECT *,CASE WHEN VeganOption =0 THEN 0
ELSE ROW_NUMBER() OVER(PARTITION BY veganOption, grp ORDER BY [UniqueId])
END AS VeganStreak -- main group and calculated subgroup
FROM cte
order by [UniqueId];
Rextester Demo
This is a variant on gaps-and-islands.
I like to define streaks using the difference of row numbers. This looks like
select c.*,
(case when veganoption = 1
then row_number() over (partition by veganoption, seqnum - seqnum_v order by uniqueid)
else 0
end) as veganstreak
from (select c.*,
row_number() over (partition by veganoption order by uniqueid) as seqnum_v,
row_number() over (order by uniqueid) as seqnum
from Control c
) c;
Why this works is a bit hard to explain. But, if you look at the results of the subquery, you'll see how the difference of row numbers defines the streaks you want to identify. The rest is just applying row_number() to enumerate the meals.
Here is a Rextester.
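To make the difference-of-row-numbers trick less mysterious, the sketch below prints the subquery columns next to the resulting streak for the question's twelve rows. It is an illustration under assumptions: it runs on SQLite (3.25 or later) from Python rather than SQL Server, keeps only the UniqueId and VeganOption columns, and `trace_streaks` and the `diff` column are made-up names.

```python
import sqlite3

# VeganOption for UniqueId 1..12, copied from the question's sample data.
VEGAN_OPTIONS = [1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1]

TRACE_QUERY = """
SELECT UniqueId,
       VeganOption,
       seqnum,
       seqnum_v,
       seqnum - seqnum_v AS diff,
       CASE WHEN VeganOption = 1
            THEN ROW_NUMBER() OVER (
                     PARTITION BY VeganOption, seqnum - seqnum_v
                     ORDER BY UniqueId)
            ELSE 0
       END AS VeganStreak
FROM (
    SELECT UniqueId,
           VeganOption,
           ROW_NUMBER() OVER (PARTITION BY VeganOption
                              ORDER BY UniqueId) AS seqnum_v,
           ROW_NUMBER() OVER (ORDER BY UniqueId) AS seqnum
    FROM Control
) AS c
ORDER BY UniqueId
"""


def trace_streaks(options):
    """Return (UniqueId, VeganOption, seqnum, seqnum_v, diff, VeganStreak) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Control (UniqueId INTEGER, VeganOption INTEGER)")
    # enumerate(..., start=1) yields (UniqueId, VeganOption) pairs
    conn.executemany(
        "INSERT INTO Control VALUES (?, ?)", list(enumerate(options, start=1))
    )
    result = conn.execute(TRACE_QUERY).fetchall()
    conn.close()
    return result


trace = trace_streaks(VEGAN_OPTIONS)
print("id opt seqnum seqnum_v diff streak")
for row in trace:
    print(*row)
```

Within one unbroken run both row numbers advance by 1 per row, so their difference stays constant (0, 4, 2, 7, 3 for the five runs here); whenever the other value interrupts, only seqnum keeps counting and the difference jumps.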
One method is to use a CTE to define your groupings, and then do a further ROW_NUMBER() on those, resulting in:
WITH Grps AS(
SELECT *,
ROW_NUMBER() OVER (ORDER BY UniqueID ASC) -
ROW_NUMBER() OVER (PARTITION BY VeganOption ORDER BY UniqueID ASC) AS Grp
FROM Control)
SELECT *,
CASE VeganOption WHEN 0 THEN 0 ELSE ROW_NUMBER() OVER (PARTITION BY VeganOption, Grp ORDER BY UniqueID ASC) END AS VeganStreak
FROM Grps
ORDER BY UniqueId;

Running Distinct Count with a Partition

I'd like a running distinct count with a partition by year for the following data:
DROP TABLE IF EXISTS #FACT;
CREATE TABLE #FACT("Year" INT,"Month" INT, "Acc" varchar(5));
INSERT INTO #FACT
values
(2015, 1, 'A'),
(2015, 1, 'B'),
(2015, 1, 'B'),
(2015, 1, 'C'),
(2015, 2, 'D'),
(2015, 2, 'E'),
(2015, 3, 'E'),
(2016, 1, 'A'),
(2016, 1, 'A'),
(2016, 2, 'B'),
(2016, 2, 'C');
SELECT * FROM #FACT;
The following returns the correct answer but is there a more concise way that is also performant?
WITH
dnsRnk AS
(
SELECT
"Year"
, "Month"
, DenseR = DENSE_RANK() OVER(PARTITION BY "Year", "Month" ORDER BY "Acc")
FROM #FACT
),
mxPerMth AS
(
SELECT
"Year"
, "Month"
, RunningTotal = MAX(DenseR)
FROM dnsRnk
GROUP BY
"Year"
, "Month"
)
SELECT
"Year"
, "Month"
, X = SUM(RunningTotal) OVER (PARTITION BY "Year" ORDER BY "Month")
FROM mxPerMth
ORDER BY
"Year"
, "Month";
The above returns the following - the answer should also return exactly the same table:
If you want a running count of distinct accounts:
SELECT f.*,
sum(case when seqnum = 1 then 1 else 0 end) over (partition by year order by month) as cume_distinct_acc
FROM (
SELECT
f.*
,row_number() over (partition by acc order by year, month) as seqnum
FROM #fact f
) f;
This counts each account during the first month when it appears.
EDIT:
Oops. The above doesn't aggregate by year and month and then start over for each year. Here is the correct solution:
SELECT
year
,month
,sum( sum(case when seqnum = 1 then 1 else 0 end)
) over (partition by year order by month) as cume_distinct_acc
FROM (
SELECT
f.*
,row_number() over (partition by acc, year order by month) as seqnum
FROM #fact f
) f
group by year, month
order by year, month;
And, SQL Fiddle isn't working but the following is an example:
with FACT as (
SELECT yyyy, mm, account
FROM (values
(2015, 1, 'A'),
(2015, 1, 'B'),
(2015, 1, 'B'),
(2015, 1, 'C'),
(2015, 2, 'D'),
(2015, 2, 'E'),
(2015, 3, 'E'),
(2016, 1, 'A'),
(2016, 1, 'A'),
(2016, 2, 'B'),
(2016, 2, 'C')) v(yyyy, mm, account)
)
SELECT
yyyy
,mm
,sum(sum(case when seqnum = 1 then 1 else 0 end)) over (partition by yyyy order by mm) as cume_distinct_acc
FROM (
SELECT
f.*
,row_number() over (partition by account, yyyy order by mm) as seqnum
FROM fact f
) f
group by yyyy, mm
order by yyyy, mm;
Demo Here:
;with cte as (
SELECT "Year", "Month", count(distinct Acc) as cnt
FROM #FACT
GROUP BY "Year", "Month"
)
SELECT
"Year"
,"Month"
,sum(cnt) over (partition by "Year" order by "Month" rows unbounded preceding) as x
FROM cte
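The two styles of answer above are not interchangeable: summing per-month distinct counts (which is also what the question's own DENSE_RANK query computes) counts an account again in every month it appears, while the seqnum = 1 approach counts it only in its first month of the year. The sketch below compares both on the question's data. It is a hedged illustration: it uses SQLite (3.25 or later) from Python instead of SQL Server, a regular table named `fact` instead of `#FACT`, plain CTEs instead of a nested sum(sum(...)), and the helper name `running_counts` is made up.

```python
import sqlite3

# Sample data copied from the question.
ROWS = [
    (2015, 1, "A"), (2015, 1, "B"), (2015, 1, "B"), (2015, 1, "C"),
    (2015, 2, "D"), (2015, 2, "E"), (2015, 3, "E"),
    (2016, 1, "A"), (2016, 1, "A"), (2016, 2, "B"), (2016, 2, "C"),
]

# Running sum of the distinct accounts seen in each month.
PER_MONTH_DISTINCT = """
WITH per_month AS (
    SELECT "Year" AS y, "Month" AS m, COUNT(DISTINCT "Acc") AS cnt
    FROM fact
    GROUP BY "Year", "Month"
)
SELECT y, m, SUM(cnt) OVER (PARTITION BY y ORDER BY m) AS x
FROM per_month
ORDER BY y, m
"""

# Running sum that counts an account only in its first month of the year.
FIRST_SEEN = """
WITH ranked AS (
    SELECT "Year" AS y, "Month" AS m,
           ROW_NUMBER() OVER (PARTITION BY "Acc", "Year"
                              ORDER BY "Month") AS seqnum
    FROM fact
),
per_month AS (
    SELECT y, m, SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS new_acc
    FROM ranked
    GROUP BY y, m
)
SELECT y, m, SUM(new_acc) OVER (PARTITION BY y ORDER BY m) AS x
FROM per_month
ORDER BY y, m
"""


def running_counts(rows):
    """Return (per-month-distinct result, first-seen result) for the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE fact ("Year" INTEGER, "Month" INTEGER, "Acc" TEXT)')
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?)", rows)
    summed = conn.execute(PER_MONTH_DISTINCT).fetchall()
    first_seen = conn.execute(FIRST_SEEN).fetchall()
    conn.close()
    return summed, first_seen


summed, first_seen = running_counts(ROWS)
print("per-month distinct:", summed)
print("first seen in year:", first_seen)
```

The results differ only for 2015 month 3, where account E reappears: the per-month variant reaches 6, the first-seen variant stays at 5. Which one is wanted depends on whether an account that shows up in two months of the same year should be counted twice.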