Group a set of records by date in Teradata (SQL)

Currently I have data in a table as shown below:
date id value
1-Jan-13 1 100
2-Jan-13 1 100
3-Jan-13 1 100
4-Jan-13 1 200
5-Jan-13 1 200
6-Jan-13 1 100
7-Jan-13 1 100
I am trying to group consecutive records by id and value, and to version each group with a start date and an end date.
Desired output:
start date end date id value
1-Jan-13 3-Jan-13 1 100
4-Jan-13 5-Jan-13 1 200
6-Jan-13 7-Jan-13 1 100

I'm not a Teradata expert, but since window functions (specifically ROW_NUMBER) are supported, you should be able to do something like this:
SELECT MIN(date) start_date, MAX(date) end_date, id, value
FROM
(
SELECT date, id, value,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) -
ROW_NUMBER() OVER (PARTITION BY id, value ORDER BY date) island
FROM table1
) q
GROUP BY id, value, island
ORDER BY start_date, end_date
Sample output:
| START_DATE | END_DATE | ID | VALUE |
|------------|------------|----|-------|
| 2013-01-01 | 2013-01-03 | 1 | 100 |
| 2013-01-04 | 2013-01-05 | 1 | 200 |
| 2013-01-06 | 2013-01-07 | 1 | 100 |
Here is a SQLFiddle demo (it runs on SQL Server, but the query should work the same way in Teradata).
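To see why the difference of the two row numbers identifies each island, here is a minimal Python sketch of the same idea, run on the sample data from the question. The function name `islands` and the tuple layout are assumptions made for this illustration only; it mirrors the query's logic rather than anything Teradata-specific.

```python
from collections import defaultdict
from datetime import date

# Sample data from the question: (date, id, value)
rows = [
    (date(2013, 1, 1), 1, 100),
    (date(2013, 1, 2), 1, 100),
    (date(2013, 1, 3), 1, 100),
    (date(2013, 1, 4), 1, 200),
    (date(2013, 1, 5), 1, 200),
    (date(2013, 1, 6), 1, 100),
    (date(2013, 1, 7), 1, 100),
]

def islands(rows):
    """Return (start_date, end_date, id, value) per run of equal values."""
    rn_id = defaultdict(int)        # ROW_NUMBER() OVER (PARTITION BY id ORDER BY date)
    rn_id_value = defaultdict(int)  # ROW_NUMBER() OVER (PARTITION BY id, value ORDER BY date)
    groups = {}
    for d, id_, value in sorted(rows):
        rn_id[id_] += 1
        rn_id_value[(id_, value)] += 1
        # The difference stays constant while the value does not change.
        island = rn_id[id_] - rn_id_value[(id_, value)]
        key = (id_, value, island)
        start, _ = groups.get(key, (d, d))
        groups[key] = (start, d)
    return sorted((s, e, i, v) for (i, v, _), (s, e) in groups.items())

result = islands(rows)
for row in result:
    print(row)
```

For the sample rows the island numbers come out as 0, 3 and 2, which is why grouping by id, value and island keeps the two separate runs of value 100 apart.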

The ROW_NUMBER version can be further simplified: modified SQL Fiddle
For Teradata:
SELECT
id,val,MIN(dt),MAX(dt)
FROM
(
SELECT
id,val,dt,
dt - ROW_NUMBER() OVER (PARTITION BY id ORDER BY val, dt) AS dummy
FROM table1
) AS dt
GROUP BY 1,2,dummy
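The date-minus-row-number trick can be sketched in Python as well. This is only an illustration under one assumption that the query shares: there is exactly one row per day inside each run, so the date and the row number advance in step. The function name `date_minus_rownum_islands` is made up for this example.

```python
from datetime import date, timedelta
from itertools import groupby

# Sample data from the question: (id, val, dt)
rows = [
    (1, 100, date(2013, 1, 1)),
    (1, 100, date(2013, 1, 2)),
    (1, 100, date(2013, 1, 3)),
    (1, 200, date(2013, 1, 4)),
    (1, 200, date(2013, 1, 5)),
    (1, 100, date(2013, 1, 6)),
    (1, 100, date(2013, 1, 7)),
]

def date_minus_rownum_islands(rows):
    """Return (id, val, min_dt, max_dt) per run of consecutive days."""
    rn = {}
    tagged = []
    for id_, val, dt in sorted(rows):  # ORDER BY val, dt within each id
        rn[id_] = rn.get(id_, 0) + 1   # ROW_NUMBER() OVER (PARTITION BY id ORDER BY val, dt)
        dummy = dt - timedelta(days=rn[id_])  # constant while days are consecutive
        tagged.append((id_, val, dummy, dt))
    out = []
    for (id_, val, _), grp in groupby(sorted(tagged), key=lambda r: r[:3]):
        dts = [r[3] for r in grp]
        out.append((id_, val, min(dts), max(dts)))
    return out

result = date_minus_rownum_islands(rows)
for row in result:
    print(row)
```

A gap in the dates shifts `dummy`, so the rows after the gap land in a new group even though id and val are unchanged.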
And there are some little-known functions in TD13.10 for processing time series data:
WITH cte(id,val,pd) AS
(
SELECT id, val, PERIOD(dt, dt+1) AS pd
FROM table1
)
SELECT
id, val,
BEGIN(pd) AS start_dt,
LAST(pd) AS end_dt
FROM
TABLE (TD_NORMALIZE_MEET
(NEW VARIANT_TYPE(cte.id,cte.val)
,cte.pd)
RETURNS (id INTEGER
,val INTEGER
,pd PERIOD(DATE)
,Nrm_Count INTEGER)
HASH BY id
LOCAL ORDER BY id, val, pd
) A
ORDER BY start_dt, end_dt

Related

SQL: Get date difference between rows in the same column [duplicate]

I am trying to create a report and this is my input data.
Stage Name Date
1 x 12/05/2019 10:00:03
1 x 12/05/2019 10:05:01
1 y 12/06/2019 12:00:07
2 x 12/06/2019 13:12:03
2 x 12/06/2019 13:23:00
1 y 12/08/2019 16:00:07
2 x 12/09/2019 09:17:59
This is my desired output.
Stage Name DateFrom DateTo DateDiff
1 x 12/05/2019 10:00:03 12/06/2019 12:00:07 1
1 y 12/06/2019 12:00:07 12/06/2019 13:12:03 0
2 x 12/06/2019 13:12:03 12/08/2019 16:00:07 2
1 y 12/08/2019 16:00:07 12/09/2019 09:17:59 1
I cannot use a GROUP BY clause over stage and name, since it would group the 3rd and 6th rows of my input together. I tried joining the table to itself, but I am not getting the desired result. Is this even possible in SQL? Any ideas would be helpful. I am using Microsoft SQL Server.
This is a variation of the gaps-and-islands problem. You want to group together adjacent rows (i.e. rows having the same stage and name), but you want to use the start date of the next group as the end date of the current group.
Here is one way to do it:
select
stage,
name,
min(date) date_from,
lead(min(date)) over(order by min(date)) date_to,
datediff(day, min(date), lead(min(date)) over(order by min(date))) date_diff
from (
select
t.*,
row_number() over(order by date) rn1,
row_number() over(partition by stage, name order by date) rn2
from mytable t
) t
group by stage, name, rn1 - rn2
order by date_from
Demo on DB Fiddle:
stage | name | date_from | date_to | datediff
----: | :--- | :------------------ | :------------------ | -------:
1 | x | 12/05/2019 10:00:03 | 12/06/2019 12:00:07 | 1
1 | y | 12/06/2019 12:00:07 | 12/06/2019 13:12:03 | 0
2 | x | 12/06/2019 13:12:03 | 12/08/2019 16:00:07 | 2
1 | y | 12/08/2019 16:00:07 | 12/09/2019 09:17:59 | 1
2 | x | 12/09/2019 09:17:59 | null | null
Note that this does not produce exactly the result that you showed: there is an additional, pending record at the end of the resultset, that represents the "on-going" series of records. If needed, you can filter it out by nesting the query:
select *
from (
select
stage,
name,
min(date) date_from,
lead(min(date)) over(order by min(date)) date_to,
datediff(day, min(date), lead(min(date)) over(order by min(date))) date_diff
from (
select
t.*,
row_number() over(order by date) rn1,
row_number() over(partition by stage, name order by date) rn2
from mytable t
) t
group by stage, name, rn1 - rn2
) t
where date_to is not null
order by date_from
This is a variation of the gaps-and-islands problem, but it has a pretty simple solution.
Just keep every row where the previous row has a different stage or name. Then use lead() to get the next date. Here is the basic idea:
select t.stage, t.name, t.date as datefrom,
lead(t.date) over (order by t.date) as dateto,
datediff(day, t.date, lead(t.date) over (order by t.date)) as diff
from (select t.*,
lag(date) over (partition by stage, name order by date) as prev_sn_date,
lag(date) over (order by date) as prev_date
from t
) t
where prev_sn_date <> prev_date or prev_sn_date is null;
If you really want to filter out the last row, you need one more step; I'm not sure if that is desirable.
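The "keep only the rows where stage or name changes, then look ahead" idea can be traced in a few lines of Python. This is a sketch on the question's sample data, not a replacement for the SQL; the function name `stage_changes` is hypothetical, and the dates are read as month/day/year, which is what the expected DateDiff values imply.

```python
from datetime import datetime

# Sample data from the question: (stage, name, date)
rows = [
    (1, "x", datetime(2019, 12, 5, 10, 0, 3)),
    (1, "x", datetime(2019, 12, 5, 10, 5, 1)),
    (1, "y", datetime(2019, 12, 6, 12, 0, 7)),
    (2, "x", datetime(2019, 12, 6, 13, 12, 3)),
    (2, "x", datetime(2019, 12, 6, 13, 23, 0)),
    (1, "y", datetime(2019, 12, 8, 16, 0, 7)),
    (2, "x", datetime(2019, 12, 9, 9, 17, 59)),
]

def stage_changes(rows):
    """Return (stage, name, date_from, date_to, diff_days) per run."""
    ordered = sorted(rows, key=lambda r: r[2])
    # Keep a row only if the previous row (by date) has a different stage or name.
    starts = [cur for prev, cur in zip([None] + ordered, ordered)
              if prev is None or prev[:2] != cur[:2]]
    out = []
    for cur, nxt in zip(starts, starts[1:] + [None]):
        date_to = nxt[2] if nxt else None  # lead(date) over (order by date)
        # DATEDIFF(day, a, b) counts calendar-day boundaries, not 24-hour spans.
        diff = (date_to.date() - cur[2].date()).days if nxt else None
        out.append((cur[0], cur[1], cur[2], date_to, diff))
    return out

result = stage_changes(rows)
for row in result:
    print(row)
```

The last row has no successor, so its date_to and diff are None; that is the extra "ongoing" row mentioned above.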

Query for negative account balance period in BigQuery

I am playing around with BigQuery and hit an interesting use case. I have a collection of customers and account balances. The account balances collection records any account balance change.
Customers:
+---------+--------+
| ID | Name |
+---------+--------+
| 1 | Alice |
| 2 | Bob |
+---------+--------+
Accounts balances:
+---------+---------------+---------+------------+
| ID | customer_id | value | timestamp |
+---------+---------------+---------+------------+
| 1 | 1 | -500 | 2019-02-12 |
| 2 | 1 | -200 | 2019-02-10 |
| 3 | 2 | 200 | 2019-02-10 |
| 4 | 1 | 0 | 2019-02-09 |
+---------+---------------+---------+------------+
The goal is to find out, for how long a customer has a negative account balance. The resulting collection would look like this:
+---------+--------+---------------------------------+
| ID | Name | Negative account balance since |
+---------+--------+---------------------------------+
| 1 | Alice | 2 days |
+---------+--------+---------------------------------+
Bob is not in the collection, because his last account record shows a positive value.
I think following steps are involved:
get last account balance per customer, see if it is negative
go through the account balance values until you hit a positive (or no more) value
compute datediff
Is something like this even possible in SQL? Do you have any ideas on how to create such a query? To get customers that currently have a negative account balance, I use this query:
SELECT customer_id FROM (
SELECT t.customer_id, t.value, ROW_NUMBER() OVER (PARTITION BY t.customer_id ORDER BY t.timestamp DESC) AS seqnum FROM `account_balances` t
) t
WHERE seqnum = 1 AND value < 0
Below is for BigQuery Standard SQL
#standardSQL
SELECT customer_id, name,
SUM(IF(negative_positive < 0, days, 0)) negative_days,
SUM(IF(negative_positive = 0, days, 0)) zero_days,
SUM(IF(negative_positive > 0, days, 0)) positive_days
FROM (
SELECT customer_id, negative_positive, grp,
1 + DATE_DIFF(MAX(ts), MIN(ts), DAY) days
FROM (
SELECT customer_id, ts, SIGN(value) negative_positive,
COUNTIF(flag) OVER(PARTITION BY customer_id ORDER BY ts) grp
FROM (
SELECT *, SIGN(value) = IFNULL(LEAD(SIGN(value)) OVER(PARTITION BY customer_id ORDER BY ts), 0) flag
FROM `project.dataset.balances`
)
)
GROUP BY customer_id, negative_positive, grp
)
LEFT JOIN `project.dataset.customers`
ON id = customer_id
GROUP BY customer_id, name
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.balances` AS (
SELECT 1 customer_id, -500 value, DATE '2019-02-12' ts UNION ALL
SELECT 1, -200, '2019-02-10' UNION ALL
SELECT 2, 200, '2019-02-10' UNION ALL
SELECT 1, 0, '2019-02-09'
), `project.dataset.customers` AS (
SELECT 1 id, 'Alice' name UNION ALL
SELECT 2, 'Bob'
)
SELECT customer_id, name,
SUM(IF(negative_positive < 0, days, 0)) negative_days,
SUM(IF(negative_positive = 0, days, 0)) zero_days,
SUM(IF(negative_positive > 0, days, 0)) positive_days
FROM (
SELECT customer_id, negative_positive, grp,
1 + DATE_DIFF(MAX(ts), MIN(ts), DAY) days
FROM (
SELECT customer_id, ts, SIGN(value) negative_positive,
COUNTIF(flag) OVER(PARTITION BY customer_id ORDER BY ts) grp
FROM (
SELECT *, SIGN(value) = IFNULL(LEAD(SIGN(value)) OVER(PARTITION BY customer_id ORDER BY ts), 0) flag
FROM `project.dataset.balances`
)
)
GROUP BY customer_id, negative_positive, grp
)
LEFT JOIN `project.dataset.customers`
ON id = customer_id
GROUP BY customer_id, name
-- ORDER BY customer_id
with result
Row customer_id name negative_days zero_days positive_days
1 1 Alice 3 1 0
2 2 Bob 0 0 1
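The three steps listed in the question (check the latest balance, walk back until a non-negative value, compute the date difference) can also be sketched directly in Python. This is an illustration under stated assumptions: the function name `negative_since` is hypothetical, a balance of 0 counts as not negative, and the duration is measured up to the customer's latest record rather than up to today.

```python
from datetime import date

# Sample data from the question: (customer_id, value, timestamp)
balances = [
    (1, -500, date(2019, 2, 12)),
    (1, -200, date(2019, 2, 10)),
    (2, 200, date(2019, 2, 10)),
    (1, 0, date(2019, 2, 9)),
]

def negative_since(balances):
    """Map customer_id -> days the current negative streak has lasted."""
    by_customer = {}
    for cid, value, ts in balances:
        by_customer.setdefault(cid, []).append((ts, value))
    result = {}
    for cid, history in by_customer.items():
        history.sort(reverse=True)           # newest record first
        latest_ts, latest_value = history[0]
        if latest_value >= 0:
            continue                         # step 1: latest balance is not negative
        start = latest_ts
        for ts, value in history:            # step 2: walk back through the streak
            if value >= 0:
                break
            start = ts
        result[cid] = (latest_ts - start).days  # step 3: date difference
    return result

print(negative_since(balances))
```

With the sample data only customer 1 (Alice) appears, with a streak of 2 days, which matches the result table in the question.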

SQL: Filter rows by direction

I have a table with two columns: date (timestamp) and status (boolean).
It contains many rows like these:
| date | status |
|-------------------------- |-------- |
| 2018-11-05T19:04:21.125Z | true |
| 2018-11-05T19:04:22.125Z | true |
| 2018-11-05T19:04:23.125Z | true |
....
I need to get a result like this:
| date_from | date_to | status |
|-------------------------- |-------------------------- |-------- |
| 2018-11-05T19:04:21.125Z | 2018-11-05T19:04:27.125Z | true |
| 2018-11-05T19:04:27.125Z | 2018-11-05T19:04:47.125Z | false |
| 2018-11-05T19:04:47.125Z | 2018-11-05T19:04:57.125Z | true |
So I need to collapse runs of identical values and return only the periods during which status was true or false.
I wrote a query like this:
SELECT max("current_date"), current_status, previous_status
FROM (SELECT date as "current_date",
status as current_status,
(lag(status, 1) OVER (ORDER BY msgtime))::boolean AS previous_status
FROM "table" as table
) as raw_data
group by current_status, previous_status
but the result never contains more than 4 rows.
This is a gaps-and-islands problem. A typical method uses the difference of row numbers:
select min(date), max(date), status
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by status order by date) as seqnum_s
from t
) t
group by status, (seqnum - seqnum_s);
Yes, you could use LAG, but then you also need a running counter that increments every time the status changes:
WITH cte1 AS (
SELECT date, status, CASE WHEN LAG(status) OVER (ORDER BY date) = status THEN 0 ELSE 1 END AS chg
FROM yourdata
), cte2 AS (
SELECT date, status, SUM(chg) OVER (ORDER BY date) AS grp
FROM cte1
)
SELECT MIN(date) AS date_from, MAX(date) AS date_to, status
FROM cte2
GROUP BY grp, status
ORDER BY date_from
DB Fiddle
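The change flag plus running sum can be traced step by step in Python. This is a minimal sketch; the sample rows are made up for the illustration (the question only shows an excerpt of the data), and `status_periods` is a hypothetical name.

```python
from itertools import accumulate

# Hypothetical sample rows: (date, status); ISO-8601 strings sort chronologically
rows = [
    ("2018-11-05T19:04:21.125Z", True),
    ("2018-11-05T19:04:22.125Z", True),
    ("2018-11-05T19:04:27.125Z", False),
    ("2018-11-05T19:04:30.125Z", False),
    ("2018-11-05T19:04:47.125Z", True),
]

def status_periods(rows):
    """Return (date_from, date_to, status) per run of equal status."""
    ordered = sorted(rows)
    statuses = [s for _, s in ordered]
    # chg = 1 when the status differs from the previous row (like LAG);
    # the first row always opens a group.
    chg = [1] + [int(a != b) for a, b in zip(statuses, statuses[1:])]
    grp = list(accumulate(chg))  # SUM(chg) OVER (ORDER BY date)
    periods = {}
    for (ts, status), g in zip(ordered, grp):
        first, _, _ = periods.get(g, (ts, ts, status))
        periods[g] = (first, ts, status)
    return [periods[g] for g in sorted(periods)]

result = status_periods(rows)
for row in result:
    print(row)
```

Note that date_to here is the last row of each run, as in the queries above. The desired output in the question ends each period at the first row of the next run, which would need an extra look-ahead step.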

SQL Server - find absence date occurrences [duplicate]

I have the following dataset:
Here is script for this data:
;with dataset AS (
select 'EMP01' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-07' AS DATE) AS CUT_DATE
UNION
select 'EMP01' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-15' AS DATE) AS CUT_DATE
UNION
select 'EMP02' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-09' AS DATE) AS CUT_DATE
)
select *
from dataset
I need to split these periods (PERIOD_START to PERIOD_END) at each CUT_DATE, excluding the cut dates themselves from the resulting periods. There can be any number of cut dates (3, 5, 8, etc.).
The expected result for the dataset above is:
If your version of SQL Server supports LAG, you can use this.
SELECT EMPLOYEE_ID,
ITEM_TYPE,
MIN(APPLY_DATE) AS STARTDATE,
MAX(APPLY_DATE) AS ENDDATE
FROM
(SELECT T.*,
SUM(CASE WHEN PREV_TYPE=ITEM_TYPE THEN 0 ELSE 1 END)
OVER(PARTITION BY EMPLOYEE_ID ORDER BY APPLY_DATE) AS GRP
FROM (SELECT D.*,
LAG(ITEM_TYPE) OVER(PARTITION BY EMPLOYEE_ID ORDER BY APPLY_DATE) AS PREV_TYPE
FROM DATA D
) T
) T
WHERE ITEM_TYPE IN ('Sickness','Vacation')
GROUP BY EMPLOYEE_ID,ITEM_TYPE,GRP
The logic is to get the previous row's item_type (based on ascending order of apply_date) and compare it with the current row's value. If they are equal, they belong to the same group. Else you start a new group. This is done in the sum window function. After groups are assigned, you just need to get the max and min date for an employee_id,item_type.
Sample Demo
You would use the LAG function.
If you order by something, the LAG function gives the previous value;
a full description can be found at: http://www.sqlservercentral.com/articles/T-SQL/106783/
Take a look at vkp's answer for a full query
This is another way, if LAG is supported.
Rextester Sample
with tbl as
(select d.*
,case when (item_type = lag(item_type) over (partition by employee_id order by apply_date))
then 0
else 1
end grp_tmp
from DATA2 d
where
item_type <> 'Worked'
)
,tbl2 as
(select t.*
,sum(grp_tmp) over (order by employee_id,apply_date
rows between unbounded preceding and current row
)
as grp
from tbl t
)
select
EMPLOYEE_ID
,ITEM_TYPE
,(CONVERT(VARCHAR(24),min(apply_date),103)
+' - '
+CONVERT(VARCHAR(24),max(apply_date),103)
) as range
from tbl2
group by EMPLOYEE_ID,
ITEM_TYPE
,grp
order by
employee_id
,min(apply_date);
Output
+-------------+-----------+-------------------------+
| EMPLOYEE_ID | ITEM_TYPE | range |
+-------------+-----------+-------------------------+
| 1 | Sickness | 23/05/2017 - 24/05/2017 |
| 1 | Vacation | 26/05/2017 - 29/05/2017 |
| 1 | Sickness | 01/06/2017 - 01/06/2017 |
| 2 | Sickness | 25/05/2017 - 30/05/2017 |
+-------------+-----------+-------------------------+

Select distinct users group by time range

I have a table with the following info
|date | user_id | week_beg | month_beg|
SQL to create table with test values:
CREATE TABLE uniques
(
date DATE,
user_id INT,
week_beg DATE,
month_beg DATE
)
INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01')
INPUT TABLE:
| date | user_id | week_beg | month_beg |
| 2013-01-01 | 1 | 2012-12-30 | 2013-01-01 |
| 2013-01-03 | 3 | 2012-12-30 | 2013-01-01 |
| 2013-01-06 | 4 | 2013-01-06 | 2013-01-01 |
| 2013-01-07 | 4 | 2013-01-06 | 2013-01-01 |
OUTPUT TABLE:
| date | time_series | cnt |
| 2013-01-01 | D | 1 |
| 2013-01-01 | W | 1 |
| 2013-01-01 | M | 1 |
| 2013-01-03 | D | 1 |
| 2013-01-03 | W | 2 |
| 2013-01-03 | M | 2 |
| 2013-01-06 | D | 1 |
| 2013-01-06 | W | 1 |
| 2013-01-06 | M | 3 |
| 2013-01-07 | D | 1 |
| 2013-01-07 | W | 1 |
| 2013-01-07 | M | 3 |
I want to calculate the number of distinct user_id's for a date:
1. For that date
2. For that week up to that date (week to date)
3. For that month up to that date (month to date)
1 is easy to calculate.
For 2 and 3 I am trying to use such queries:
SELECT
date,
'W' AS "time_series",
COUNT(DISTINCT user_id) OVER (PARTITION BY week_beg) AS "cnt"
FROM user_subtitles
SELECT
date,
'M' AS "time_series",
COUNT(DISTINCT user_id) OVER (PARTITION BY month_beg) AS "cnt"
FROM user_subtitles
PostgreSQL does not support DISTINCT inside window functions, so this approach does not work.
I have also tried a GROUP BY approach, but that gives me the counts for whole weeks/months rather than the to-date counts.
What's the best way to approach this problem?
Count all rows
SELECT date, '1_D' AS time_series, count(DISTINCT user_id) AS cnt
FROM uniques
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM uniques
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM uniques
ORDER BY 1, time_series
Your columns week_beg and month_beg are 100% redundant and can easily be replaced by
date_trunc('week', date + 1) - 1 and date_trunc('month', date) respectively.
Your week seems to start on Sunday (off by one), therefore the + 1 .. - 1.
The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That's exactly what you need.
Use UNION ALL, not UNION.
Your unfortunate choice for time_series (D, W, M) does not sort well, I renamed to make the final ORDER BY easier.
This query can deal with multiple rows per day. Counts include all peers for a day.
More about DISTINCT ON:
Select first row in each GROUP BY group?
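To check the claim that week_beg and month_beg can be derived from the date alone, here is a small Python sketch of the same arithmetic, compared against the values stored in the question's table. The function names are chosen to match the column names and are otherwise hypothetical; the week is assumed to start on Sunday, as in the sample data.

```python
from datetime import date, timedelta

def week_beg(d):
    """Sunday-based week start; mirrors date_trunc('week', d + 1) - 1."""
    # date.weekday(): Monday is 0 ... Sunday is 6
    return d - timedelta(days=(d.weekday() + 1) % 7)

def month_beg(d):
    """First day of the month; mirrors date_trunc('month', d)."""
    return d.replace(day=1)

# (date, stored week_beg, stored month_beg) from the question's INSERT statements
stored = [
    (date(2013, 1, 1), date(2012, 12, 30), date(2013, 1, 1)),
    (date(2013, 1, 3), date(2012, 12, 30), date(2013, 1, 1)),
    (date(2013, 1, 6), date(2013, 1, 6), date(2013, 1, 1)),
    (date(2013, 1, 7), date(2013, 1, 6), date(2013, 1, 1)),
]

for d, w, m in stored:
    print(d, week_beg(d) == w, month_beg(d) == m)
```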
DISTINCT users per day
To count every user only once per day, use a CTE with DISTINCT ON:
WITH x AS (SELECT DISTINCT ON (1,2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM x
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W'
,count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1)
ORDER BY date)
FROM x
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M'
,count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM x
ORDER BY 1, 2
DISTINCT users over dynamic period of time
You can always resort to correlated subqueries. Tend to be slow with big tables!
Building on the previous queries:
WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1,2)
,d AS (
SELECT date
,(date_trunc('week', date + 1)::date - 1) AS week_beg
,date_trunc('month', date)::date AS month_beg
FROM uniques
GROUP BY 1
)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM du
GROUP BY 1
UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.week_beg AND d.date )
FROM d
GROUP BY date, week_beg
UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.month_beg AND d.date)
FROM d
GROUP BY date, month_beg
ORDER BY 1,2;
SQL Fiddle for all three solutions.
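As a cross-check of what the correlated subqueries compute, the to-date distinct counts can be written out naively in Python. This is only a sketch for verifying results on small data: it rescans all rows for every day, the function name `to_date_counts` is hypothetical, and a Sunday-based week is assumed as in the sample data.

```python
from datetime import date, timedelta

# Sample data from the question: (date, user_id)
rows = [
    (date(2013, 1, 1), 1),
    (date(2013, 1, 3), 3),
    (date(2013, 1, 6), 4),
    (date(2013, 1, 7), 4),
]

def to_date_counts(rows):
    """Return (day, series, distinct_users) for day, week-to-date, month-to-date."""
    out = []
    for day in sorted({d for d, _ in rows}):
        week_start = day - timedelta(days=(day.weekday() + 1) % 7)  # Sunday-based
        month_start = day.replace(day=1)

        def distinct_since(start):
            # count(DISTINCT user_id) WHERE date BETWEEN start AND day
            return len({u for d, u in rows if start <= d <= day})

        out.append((day, "1_D", distinct_since(day)))
        out.append((day, "2_W", distinct_since(week_start)))
        out.append((day, "3_M", distinct_since(month_start)))
    return out

for row in to_date_counts(rows):
    print(row)
```

On the sample data the counts agree with the OUTPUT TABLE in the question: user 4 appears on both 2013-01-06 and 2013-01-07 but is counted once for the week and once for the month.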
Faster with dense_rank()
@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.
Building on a simplified and sanitized data model
- without the redundant columns
- day as column name instead of date
date is a reserved word in standard SQL and a basic type name in PostgreSQL and shouldn't be used as identifier.
CREATE TABLE uniques(
day date -- instead of "date"
,user_id int
);
Improved query:
WITH du AS (
SELECT DISTINCT ON (1, 2)
day, user_id
,date_trunc('week', day + 1)::date - 1 AS week_beg
,date_trunc('month', day)::date AS month_beg
FROM uniques
)
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM (
SELECT user_id, day
,dense_rank() OVER(PARTITION BY week_beg ORDER BY user_id) AS w
,dense_rank() OVER(PARTITION BY month_beg ORDER BY user_id) AS m
FROM du
) s
GROUP BY day
ORDER BY day;
SQL Fiddle demonstrating the performance of 4 faster variants. It depends on your data distribution which is fastest for you.
All of them are about 10x as fast as the correlated subqueries version (which isn't bad for correlated subqueries).
Without correlated subqueries. SQL Fiddle
with u as (
select
"date", user_id,
date_trunc('week', "date" + 1)::date - 1 week_beg,
date_trunc('month', "date")::date month_beg
from uniques
)
select
"date", count(distinct user_id) D,
max(week_dr) W, max(month_dr) M
from (
select
user_id, "date",
dense_rank() over(partition by week_beg order by user_id) week_dr,
dense_rank() over(partition by month_beg order by user_id) month_dr
from u
) s
group by "date"
order by "date"
Try
SELECT
*
FROM
(
SELECT dates, count(user_id), 'D' as timeseries FROM users_data GROUP BY dates
UNION
SELECT max(dates), count(user_id), 'W' FROM users_data GROUP BY date_part('year',dates), date_part('week',dates)
UNION
SELECT max(dates), count(user_id), 'M' FROM users_data GROUP BY date_part('year',dates), date_part('month',dates)
) tEMP order by dates, timeseries
SQLFIDDLE
Try queries like this
SELECT count(distinct user_id), date_format(date, '%Y-%m-%d') as date_period
FROM uniques
GROUP By date_period