How to add a new column to the result table? - sql

This is the table mytable:
identifier thedate direction
111 2017-06-03 11:20 2
111 2017-06-03 12:22 1
222 2017-06-04 12:15 1
333 2017-06-05 12:21 1
444 2017-06-05 12:39 2
444 2017-06-08 14:23 2
555 2017-06-08 15:33 1
555 2017-06-08 16:12 2
I am calculating the average hourly count of unique identifiers in Apache Hive as follows:
SELECT HOUR(thedate) as hour,
COUNT(DISTINCT identifier, CAST(thedate as date),
HOUR(thedate)) / COUNT(DISTINCT CAST(thedate as date),
HOUR(thedate)) as hourly_avg_count
FROM mytable
GROUP BY HOUR(thedate)
Now I need to add a new calculated column to the result table (not the original one). This column, called newcolumn, should have the value A when thedate falls on a date in the list ["2017-06-03","2017-06-04"], and the value B when thedate falls on a date in ["2017-06-05","2017-06-06"]. Any thedate not covered by either list should get the value C.
The resulting table should have the following columns:
newcolumn hour hourly_avg_count
A 11 0.5
A 12 1
B ... ...
C ... ...

You would just add this to the GROUP BY:
SELECT (CASE WHEN DATE(thedate) IN ('2017-06-03', '2017-06-04') THEN 'A'
WHEN DATE(thedate) IN ('2017-06-05', '2017-06-06') THEN 'B'
ELSE 'C'
END) as grp,
HOUR(thedate) as hour,
COUNT(DISTINCT identifier, CAST(thedate as date), HOUR(thedate)
) / COUNT(DISTINCT CAST(thedate as date), HOUR(thedate)) as hourly_avg_count
FROM mytable
GROUP BY HOUR(thedate),
(CASE WHEN DATE(thedate) IN ('2017-06-03', '2017-06-04') THEN 'A'
WHEN DATE(thedate) IN ('2017-06-05', '2017-06-06') THEN 'B'
ELSE 'C'
END);

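Use a CASE expression. Cast thedate to a date first; otherwise times after midnight on the end date fall outside the BETWEEN range:
SELECT CASE WHEN CAST(thedate AS date) BETWEEN '2017-06-03' AND '2017-06-04'
THEN 'A'
WHEN CAST(thedate AS date) BETWEEN '2017-06-05' AND '2017-06-06'
THEN 'B'
ELSE 'C'
END AS newcolumn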
...
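The effect of comparing the raw timestamp versus its date part can be checked on a few of the question's rows. The sketch below is only an illustration: it runs on Python's built-in sqlite3 module rather than Hive, stores thedate as text (an assumption), and the aliases raw_bucket and date_bucket are made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (identifier INTEGER, thedate TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [
        (111, "2017-06-03 11:20"),
        (222, "2017-06-04 12:15"),
        (333, "2017-06-05 12:21"),
        (444, "2017-06-08 14:23"),
    ],
)

# raw_bucket compares the full timestamp; date_bucket compares only the date part.
rows = conn.execute(
    """
    SELECT identifier,
           CASE WHEN thedate BETWEEN '2017-06-03' AND '2017-06-04' THEN 'A'
                WHEN thedate BETWEEN '2017-06-05' AND '2017-06-06' THEN 'B'
                ELSE 'C' END AS raw_bucket,
           CASE WHEN date(thedate) BETWEEN '2017-06-03' AND '2017-06-04' THEN 'A'
                WHEN date(thedate) BETWEEN '2017-06-05' AND '2017-06-06' THEN 'B'
                ELSE 'C' END AS date_bucket
    FROM mytable
    ORDER BY identifier
    """
).fetchall()

for row in rows:
    print(row)
```

In this sketch the row for identifier 222 (2017-06-04 12:15) lands in bucket C when the raw value is compared and in bucket A when only the date part is compared.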

Related

BigQuery - Picking latest not null value within 28 interval

I'm trying to add a column on this table and stuck for a little while
ID  Category 1  Date        Data1
A   1           2022-05-30  21
B   2           2022-05-21  15
A   2           2022-05-02  33
A   1           2022-02-11  3
B   2           2022-05-01  19
A   1           2022-05-15  null
A   1           2022-05-20  11
A   2           2022-04-20  22
to
ID  Category 1  Date        Data1  Picked_Data
A   1           2022-05-30  21     11
B   2           2022-05-21  15     19
A   2           2022-05-02  33     22
A   1           2022-02-11  3      some number or null
B   2           2022-05-01  19     some number or null
A   1           2022-05-15  null   some number or null
A   1           2022-05-20  11     some number or null
A   2           2022-04-20  22     some number or null
The logic is to partition by Category 1 and ID, then pick the latest non-null value within the past 28 days. If no such value exists, the result is null.
For the first row (ID = A, Category 1 = 1), it picks the value from the 7th row, since both rows share the same category and ID and the date difference is <= 28 days. It skips the 4th row because its date is too far back, and the 6th row because its value is null.
I've tried querying this with
select first_value(Data1) over (partition by Category1 order by case when Data1 is not null and Date between Date - interval 28 day and Date then 1 else 2 end) as Picked_Data
but it picks incorrect rows. My guess is that this condition
Date between Date - interval 28 day and Date
is not selecting the correct dates. Could anyone give me advice on how I could tweak this query?
Consider the approach below. Because the window is ordered by ascending date, last_value returns the latest non-null value in the frame:
select *,
last_value(data1 ignore nulls) over past_28_days as picked_data
from your_table
window past_28_days as (
partition by id, category_1
order by unix_date(date)
range between 28 preceding and 1 preceding
)
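To sanity-check a window query like this against the rule stated in the question (latest non-null value from 1 to 28 days earlier, same ID and category), the rule can be written out directly. This is a plain-Python sketch, not BigQuery; the function name pick_latest_non_null and the max_days parameter are made up for illustration.

```python
from datetime import date

# Sample rows from the question: (id, category_1, date, data1)
rows = [
    ("A", 1, date(2022, 5, 30), 21),
    ("B", 2, date(2022, 5, 21), 15),
    ("A", 2, date(2022, 5, 2), 33),
    ("A", 1, date(2022, 2, 11), 3),
    ("B", 2, date(2022, 5, 1), 19),
    ("A", 1, date(2022, 5, 15), None),
    ("A", 1, date(2022, 5, 20), 11),
    ("A", 2, date(2022, 4, 20), 22),
]


def pick_latest_non_null(rows, max_days=28):
    """For each row, return the latest non-null data1 found 1..max_days
    days earlier within the same (id, category_1) group, else None."""
    picked = []
    for row_id, category, current_date, _ in rows:
        candidates = [
            (other_date, other_value)
            for other_id, other_category, other_date, other_value in rows
            if other_id == row_id
            and other_category == category
            and other_value is not None
            and 1 <= (current_date - other_date).days <= max_days
        ]
        # The candidate with the greatest date is the latest one.
        latest = max(candidates, key=lambda c: c[0]) if candidates else None
        picked.append(latest[1] if latest else None)
    return picked


picked = pick_latest_non_null(rows)
print(picked)
```

On the sample rows the first three results are 11, 19 and 22, matching the desired Picked_Data values in the question; the remaining rows have no qualifying earlier value.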
Consider below approach:
with sample_data as (
select 'A' as ID, 1 as category_1, date('2022-05-30') as date, 21 as data1,
union all select 'B' as ID, 2 as category_1, date('2022-05-21') as date, 15 as data1,
union all select 'A' as ID, 2 as category_1, date('2022-05-02') as date, 33 as data1,
union all select 'A' as ID, 1 as category_1, date('2022-02-11') as date, 3 as data1,
union all select 'B' as ID, 2 as category_1, date('2022-05-01') as date, 19 as data1,
union all select 'A' as ID, 1 as category_1, date('2022-05-15') as date, NULL as data1,
union all select 'A' as ID, 1 as category_1, date('2022-05-20') as date, 11 as data1,
union all select 'A' as ID, 2 as category_1, date('2022-04-20') as date, 22 as data1,
),
with_prev_data as (
-- lag() reads only the immediately preceding row in each partition, so a null there is returned as-is rather than skipped
select *,
lag(date) over (partition by ID,category_1 order by date) as prev_date,
lag(data1) over (partition by ID,category_1 order by date) as prev_data,
from sample_data
)
select
id,
category_1,
date,
data1,
if(date_diff(date, prev_date,day) <= 28, prev_data, null) as picked_data
from with_prev_data

Get max date for each from either of 2 columns

I have a table like below
AID BID CDate
-----------------------------------------------------
1 2 2018-11-01 00:00:00.000
8 1 2018-11-08 00:00:00.000
1 3 2018-11-09 00:00:00.000
7 1 2018-11-15 00:00:00.000
6 1 2018-12-24 00:00:00.000
2 5 2018-11-02 00:00:00.000
2 7 2018-12-15 00:00:00.000
And I am trying to get a result set as follows
ID MaxDate
-------------------
1 2018-12-24 00:00:00.000
2 2018-12-15 00:00:00.000
Each value in the ID columns (AID, BID) should return its max CDate.
E.g., for 1 the max CDate is 2018-12-24 00:00:00.000 (here 1 appears under BID);
for 2 the max CDate is 2018-12-15 00:00:00.000 (here 2 appears under AID).
I tried the following.
1.
select
g.AID,g.BID,
max(g.CDate) as 'LastDate'
from dbo.TT g
inner join
(select AID,BID,max(CDate) as maxdate
from dbo.TT
group by AID,BID)a
on (a.AID=g.AID or a.BID=g.BID)
and a.maxdate=g.CDate
group by g.AID,g.BID
and 2.
SELECT
AID,
CDate
FROM (
SELECT
*,
max_date = MAX(CDate) OVER (PARTITION BY [AID])
FROM dbo.TT
) AS s
WHERE CDate= max_date
Please suggest a 3rd solution.
You can assemble the data in a table expression first; computing the max for each value is then simple. For example:
select
id, max(cdate)
from (
select aid as id, cdate from t
union all
select bid, cdate from t
) x
group by id
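As a quick check, the same UNION ALL query can be run against the question's rows. This is a sketch using Python's built-in sqlite3 module instead of SQL Server; the dates are stored as ISO text without the time part, which is an assumption made to keep the demo short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (aid INTEGER, bid INTEGER, cdate TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, 2, "2018-11-01"),
        (8, 1, "2018-11-08"),
        (1, 3, "2018-11-09"),
        (7, 1, "2018-11-15"),
        (6, 1, "2018-12-24"),
        (2, 5, "2018-11-02"),
        (2, 7, "2018-12-15"),
    ],
)

# Stack both id columns into one column, then take the max date per id.
max_by_id = dict(
    conn.execute(
        """
        SELECT id, MAX(cdate)
        FROM (
            SELECT aid AS id, cdate FROM t
            UNION ALL
            SELECT bid, cdate FROM t
        ) x
        GROUP BY id
        """
    ).fetchall()
)

print(max_by_id[1], max_by_id[2])
```

For ids 1 and 2 this yields 2018-12-24 and 2018-12-15, the values the question asks for; every other id that appears in either column is returned as well.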
You seem to only care about values that are in both columns. If this interpretation is correct, then:
select id, max(cdate)
from ((select aid as id, cdate, 1 as is_a, 0 as is_b
from t
) union all
(select bid as id, cdate, 0 as is_a, 1 as is_b
from t
)
) ab
group by id
having max(is_a) = 1 and max(is_b) = 1;
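The "present in both columns" filter can be tried on the question's rows as well. Again this is only a sketch on Python's built-in sqlite3 module, not SQL Server; the dates are shortened to ISO text and the parentheses around the two SELECTs are dropped because SQLite does not accept them in a compound query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (aid INTEGER, bid INTEGER, cdate TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, 2, "2018-11-01"),
        (8, 1, "2018-11-08"),
        (1, 3, "2018-11-09"),
        (7, 1, "2018-11-15"),
        (6, 1, "2018-12-24"),
        (2, 5, "2018-11-02"),
        (2, 7, "2018-12-15"),
    ],
)

# Flag which column each id came from, then keep ids seen in both columns.
in_both = conn.execute(
    """
    SELECT id, MAX(cdate)
    FROM (
        SELECT aid AS id, cdate, 1 AS is_a, 0 AS is_b FROM t
        UNION ALL
        SELECT bid AS id, cdate, 0 AS is_a, 1 AS is_b FROM t
    ) ab
    GROUP BY id
    HAVING MAX(is_a) = 1 AND MAX(is_b) = 1
    ORDER BY id
    """
).fetchall()

print(in_both)
```

On the sample rows the ids that occur under both AID and BID are 1, 2 and 7.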

PIVOT without aggregation function

I have a single table that I want to pivot into a new table. I used PIVOT, but the aggregate function filters out data. How can I pivot a table without an aggregate function, or could you recommend another approach?
Original table
ID Name Value Date
1 A 5.00 06/01/2019 13:00
2 A 13.15 06/02/2019 15:32
3 B 3.20 06/02/2019 15:32
4 B 33.11 05/11/2019 13:00
5 B 32.00 05/11/2019 13:00
Transformed into the new table
ID A B Date
1 5.00 NULL 06/01/2019 13:00
2 13.15 3.20 06/02/2019 15:32
3 NULL 33.11 05/11/2019 13:00
4 NULL 32.00 05/11/2019 13:00
Note: ID is an identity column in both tables.
My pivot code, which keeps only the max value:
PIVOT(
MAX(Value)
FOR Name IN (A,B)) AS S
ORDER BY Date DESC
A standard pivot query should work here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Date, Name ORDER BY ID) rn
FROM yourTable
)
SELECT
Date,
MAX(CASE WHEN Name = 'A' THEN [Value] END) AS A,
MAX(CASE WHEN Name = 'B' THEN [Value] END) AS B
FROM cte
GROUP BY
Date, rn;
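To see why the row number keeps both 05/11/2019 values, the query can be run on the question's rows. This sketch uses Python's built-in sqlite3 module (window functions need SQLite 3.25 or later) rather than SQL Server, stores the dates as text, and assumes the "15.32" in row 3 was meant to be "15:32".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourTable (id INTEGER, name TEXT, value REAL, date TEXT)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?, ?)",
    [
        (1, "A", 5.00, "06/01/2019 13:00"),
        (2, "A", 13.15, "06/02/2019 15:32"),
        (3, "B", 3.20, "06/02/2019 15:32"),  # assumed typo fix: 15.32 -> 15:32
        (4, "B", 33.11, "05/11/2019 13:00"),
        (5, "B", 32.00, "05/11/2019 13:00"),
    ],
)

# rn numbers the rows that share a date and name, so duplicates
# end up in separate groups instead of being collapsed by MAX.
pivoted = conn.execute(
    """
    WITH cte AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY date, name ORDER BY id) AS rn
        FROM yourTable
    )
    SELECT date,
           MAX(CASE WHEN name = 'A' THEN value END) AS A,
           MAX(CASE WHEN name = 'B' THEN value END) AS B
    FROM cte
    GROUP BY date, rn
    ORDER BY date, rn
    """
).fetchall()

for row in pivoted:
    print(row)
```

The two B rows on 05/11/2019 get rn 1 and 2, so they come out as two separate result rows, while the A and B rows on 06/02/2019 both have rn 1 and share one row.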

Using EXISTS within a GROUP BY clause

Is it possible to do the following:
I have a table that looks like this:
CREATE TABLE #tran_TABLE (
EOMONTH DATE,
AccountNumber INT,
CLASSIFICATION_NAME VARCHAR(50),
Value Float
)
INSERT INTO #tran_TABLE VALUES('2018-11-30','123','cat1',10)
INSERT INTO #tran_TABLE VALUES('2018-11-30','123','cat1',15)
INSERT INTO #tran_TABLE VALUES('2018-11-30','123','cat1',5 )
INSERT INTO #tran_TABLE VALUES('2018-11-30','123','cat2',10)
INSERT INTO #tran_TABLE VALUES('2018-11-30','123','cat3',12)
INSERT INTO #tran_TABLE VALUES('2019-01-31','123','cat1',5 )
INSERT INTO #tran_TABLE VALUES('2019-01-31','123','cat2',10)
INSERT INTO #tran_TABLE VALUES('2019-01-31','123','cat2',15)
INSERT INTO #tran_TABLE VALUES('2019-01-31','123','cat3',5 )
INSERT INTO #tran_TABLE VALUES('2019-01-31','123','cat3',2 )
INSERT INTO #tran_TABLE VALUES('2019-03-31','123','cat1',15)
EOMONTH AccountNumber CLASSIFICATION_NAME Value
2018-11-30 123 cat1 10
2018-11-30 123 cat1 15
2018-11-30 123 cat1 5
2018-11-30 123 cat2 10
2018-11-30 123 cat3 12
2019-01-31 123 cat1 5
2019-01-31 123 cat2 10
2019-01-31 123 cat2 15
2019-01-31 123 cat3 5
2019-01-31 123 cat3 2
2019-03-31 123 cat1 15
I want to produce a result where it will check whether in each month, for each AccountNumber (just one in this case) there exists a CLASSIFICATION_NAME cat1, cat2, cat3.
If all 3 exist for the month, then return 1 but if any are missing return 0.
The result should look like:
EOMONTH AccountNumber CLASSIFICATION_NAME
2018-11-30 123 1
2019-01-31 123 1
2019-03-31 123 0
But I want to do it as compactly as possible, without first creating a table that groups everything by CLASSIFICATION_NAME, EOMONTH and AccountNumber and then selects from that table.
For example, in the pseudo code below, is it possible to use maybe an EXISTS statement to do the group by?
SELECT
EOMONTH
,AccountNumber
,CASE WHEN EXISTS (CLASSIFICATION_NAME = 'cat1' AND 'cat2' AND 'cat3') THEN 1 ELSE 0 end
,SUM(Value) AS totalSpend
FROM #tran_TABLE
GROUP BY
EOMONTH
,AccountNumber
You could emulate this behavior by counting the distinct classifications that answer this condition (per group):
SELECT
EOMONTH
,AccountNumber
,CASE COUNT(DISTINCT CASE WHEN classification_name IN ('cat1', 'cat2', 'cat3') THEN classification_name END)
WHEN 3 THEN 1
ELSE 0
END
,SUM(Value) AS totalSpend
FROM #tran_TABLE
GROUP BY
EOMONTH
,AccountNumber
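The conditional COUNT(DISTINCT ...) can be tried on the question's data. This is a sketch on Python's built-in sqlite3 module rather than SQL Server; the table name tran_table (without the # prefix) and the alias has_all_three are made up for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tran_table "
    "(eomonth TEXT, accountnumber INTEGER, classification_name TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO tran_table VALUES (?, ?, ?, ?)",
    [
        ("2018-11-30", 123, "cat1", 10), ("2018-11-30", 123, "cat1", 15),
        ("2018-11-30", 123, "cat1", 5), ("2018-11-30", 123, "cat2", 10),
        ("2018-11-30", 123, "cat3", 12), ("2019-01-31", 123, "cat1", 5),
        ("2019-01-31", 123, "cat2", 10), ("2019-01-31", 123, "cat2", 15),
        ("2019-01-31", 123, "cat3", 5), ("2019-01-31", 123, "cat3", 2),
        ("2019-03-31", 123, "cat1", 15),
    ],
)

# Names outside the wanted set become NULL and are ignored by COUNT(DISTINCT ...),
# so the count reaches 3 only when cat1, cat2 and cat3 are all present.
spend = conn.execute(
    """
    SELECT eomonth, accountnumber,
           CASE COUNT(DISTINCT CASE WHEN classification_name IN ('cat1', 'cat2', 'cat3')
                                    THEN classification_name END)
                WHEN 3 THEN 1 ELSE 0 END AS has_all_three,
           SUM(value) AS totalSpend
    FROM tran_table
    GROUP BY eomonth, accountnumber
    ORDER BY eomonth
    """
).fetchall()

for row in spend:
    print(row)
```

Restricting the inner CASE to the three wanted names means an unrelated fourth classification in a month would not be counted towards the 3.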
Try this-
SELECT EOMONTH,
AccountNumber,
CASE
WHEN COUNT(DISTINCT CLASSIFICATION_NAME) = 3 THEN 1
ELSE 0
END CLASSIFICATION_NAME
FROM #tran_TABLE
GROUP BY EOMONTH,AccountNumber
Output is-
2018-11-30 123 1
2019-01-31 123 1
2019-03-31 123 0
You can count distinct values with a query like this.
The column 'Three_Unique_Cat' checks for any three distinct classifications, while 'Three_Cat1_Cat2_Cat3' checks for exactly 'cat1', 'cat2' and 'cat3'.
SELECT
EOMONTH, AccountNumber
,CASE WHEN
COUNT(DISTINCT CLASSIFICATION_NAME)=3 THEN 1
ELSE 0
END AS 'Three_Unique_Cat'
,CASE WHEN
COUNT(DISTINCT CASE WHEN CLASSIFICATION_NAME IN ('cat1','cat2','cat3')
THEN CLASSIFICATION_NAME ELSE NULL END)=3 THEN 1
ELSE 0
END AS 'Three_Cat1_Cat2_Cat3'
,SUM(Value) AS totalSpend
FROM #tran_TABLE
GROUP BY EOMONTH, AccountNumber
Output:
EOMONTH AccountNumber Three_Unique_Cat Three_Cat1_Cat2_Cat3 totalSpend
2018-11-30 123 1 1 52
2019-01-31 123 1 1 37
2019-03-31 123 0 0 15
It's easy, just as below:
select
EOMONTH,
AccountNumber,
case when count(distinct CLASSIFICATION_NAME) = 3 then 1 else 0 end as CLASSIFICATION_NAME
from
#tran_TABLE
group by
EOMONTH,
AccountNumber

SQL Server : remove duplicates and add columns

I have a table with duplicate records. This is what the table looks like:
ID   Date               Status          ModifiedBy
------------------------------------------
1    1/2/2019 10:29     Assigned(0)     xyz
1    1/2/2019 12:21     Pending(1)      abc
1    1/4/2019 11:42     Completed(5)    abc
1    1/20/2019 2:45     Closed(8)       pqr
2    9/18/2018 10:05    Assigned(0)     xyz
2    9/18/2018 11:15    Pending(1)      abc
2    9/21/2018 11:15    Completed(5)    abc
2    10/7/2018 2:46     Closed(8)       pqr
What I want to do is take the minimum date value for each ID and also add two additional columns, PendingStartDate and PendingEndDate.
PendingStartDate: date when ID went into pending status
PendingEndDate: date when ID went from pending status to any other status
So my final output should look like this
ID AuditDate Status ModifiedBy PendingStartDate PendingEndDate
---------------------------------------------------------------------------
1 1/2/2019 10:29 Assigned(0) xyz 1/2/2019 12:21 1/4/2019 11:42
2 9/18/2018 10:05 Assigned(0) abc 9/18/2018 11:15 9/21/2018 11:15
Any help as to how to do this is appreciated.
Thanks
I think you want conditional aggregation:
select id, min(date) as auditdate,
max(case when seqnum = 1 then status end) as status,
max(case when seqnum = 1 then modifiedBy end) as modifiedBy,
min(case when status like 'Pending%' then date end) as pendingStartDate,
max(case when status like 'Pending%' then next_date end) as pendingEndDate
from (select t.*,
row_number() over (partition by id order by date) as seqnum,
lead(date) over (partition by id order by date) as next_date
from t
) t
group by id;
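The conditional aggregation can be checked on the question's rows. This is a sketch on Python's built-in sqlite3 module (window functions need SQLite 3.25 or later), not SQL Server. The dates are rewritten as ISO text so that they sort correctly as strings, and the output aliases first_status and first_modified_by are renamed for the demo; both are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, date TEXT, status TEXT, modifiedBy TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "2019-01-02 10:29", "Assigned(0)", "xyz"),
        (1, "2019-01-02 12:21", "Pending(1)", "abc"),
        (1, "2019-01-04 11:42", "Completed(5)", "abc"),
        (1, "2019-01-20 02:45", "Closed(8)", "pqr"),
        (2, "2018-09-18 10:05", "Assigned(0)", "xyz"),
        (2, "2018-09-18 11:15", "Pending(1)", "abc"),
        (2, "2018-09-21 11:15", "Completed(5)", "abc"),
        (2, "2018-10-07 02:46", "Closed(8)", "pqr"),
    ],
)

# seqnum = 1 marks the earliest row per id; next_date is the date of the
# following row, i.e. when the current status ended.
result = conn.execute(
    """
    SELECT id,
           MIN(date) AS auditdate,
           MAX(CASE WHEN seqnum = 1 THEN status END) AS first_status,
           MAX(CASE WHEN seqnum = 1 THEN modifiedBy END) AS first_modified_by,
           MIN(CASE WHEN status LIKE 'Pending%' THEN date END) AS pendingStartDate,
           MAX(CASE WHEN status LIKE 'Pending%' THEN next_date END) AS pendingEndDate
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS seqnum,
                 LEAD(date) OVER (PARTITION BY id ORDER BY date) AS next_date
          FROM t) AS s
    GROUP BY id
    ORDER BY id
    """
).fetchall()

for row in result:
    print(row)
```

Each id collapses to one row carrying the earliest status and modifier plus the start and end of its Pending period.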
Please try this:
Create Table #Tab (Id int, [Date] DATETIME,[Status] Varchar(25),ModifiedBy varchar(10))
Insert into #Tab
SELECT 1,'1/2/2019 10:29','Assigned(0)','xyz' Union All
SELECT 1,'1/2/2019 11:29','Started(0)','xyz' Union All
SELECT 1,'1/2/2019 12:21','Pending(1)','abc' Union All
SELECT 1,'1/2/2019 12:21','In-Progress(1)','abc' Union All
SELECT 1,'1/4/2019 11:42','Completed(5)','abc'Union All
SELECT 1,'1/20/2019 2:45','Closed(8)','pqr' Union All
SELECT 2,'9/18/2018 10:05','Assigned(0)','xyz'Union All
SELECT 2,'9/18/2018 11:15','Pending(1)','abc' Union All
SELECT 2,'9/21/2018 11:15','Completed(5)','abc' Union All
SELECT 2,'10/7/2018 2:46','Closed(8)','pqr'
;with cte As
(
Select * ,lead(date) over (partition by id order by date) as pendingStartDate
from #Tab
Where Status in ('Assigned(0)','Pending(1)','Completed(5)')
)
,cte2 As
(
Select * , lead(pendingStartDate) over (partition by id order by date) As pendingEndDate
from cte
)
Select * from cte2 where Status ='Assigned(0)'
As you mentioned in the comments, I have included a few extra states between Assigned, Pending and Completed.