Select first and last row for each group - sql

I want to find the delta between the first-row and last-row value for each group in my SQL query, but the sub-query returns different values in the time_last and last_value columns on each run.
Please help me fix my query.
table1 contains unique time values and duplicated name and value entries,
something like this:
time                     name       value
2023-01-16 08:52:51.965  apple      1100.0
2023-01-16 08:52:23.665  apple      691.3
2023-01-16 08:52:01.915  apple      107.0
2023-01-16 08:51:33.621  apple      1000.0
2023-01-16 08:51:11.815  apple_two  50.0
2023-01-16 08:50:51.574  apple_two  61.9
2023-01-16 08:50:42.575  apple_two  69.0
2023-01-16 08:50:21.800  apple_two  94.0
Problematic sub-query:
SELECT groupArray(time)[-1] as time_last, name , (groupArray(value)[-1]) as last_value
FROM stage.table1 il
WHERE time >= '2023-01-16 08:08:15'
AND time <= '2023-01-16 08:54:00'
AND name like '%apple%'
GROUP BY name
ORDER BY time_last
The totals query I want to use:
SELECT name, (last_value - first_value) as delta
FROM
(SELECT groupArray(time)[1] as time_first, name , (groupArray(value)[1]) as first_value
FROM stage.table1 il
WHERE time >= '2023-01-16 08:08:15'
AND time <= '2023-01-16 08:54:00'
AND name like '%apple%'
GROUP BY name
ORDER BY time_first
) as frst
JOIN
(SELECT groupArray(time)[-1] as time_last, name , (groupArray(value)[-1]) as last_value
FROM stage.table1 il
WHERE time >= '2023-01-16 08:08:15'
AND time <= '2023-01-16 08:54:00'
AND name like '%apple%'
GROUP BY name
ORDER BY time_last ) lst on frst.name = lst.name
having name like '%apple%'
returned values:
first run:
time_first name first_value time_last `lst.name` last_value delta
2023-01-16 08:08:15.010 apple 1100 2023-01-16 08:29:04.804 apple 1000 -100
second run:
time_first name first_value time_last `lst.name` last_value delta
2023-01-16 08:10:44.813 apple 200 2023-01-16 08:53:59.782 apple 254 54

create table t(time DateTime64(3), name String, value Float64) Engine=Memory as
select * from values(
('2023-01-16 08:52:51.965','apple', 1100.0),
('2023-01-16 08:52:23.665','apple', 691.3),
('2023-01-16 08:52:01.915','apple', 107.0),
('2023-01-16 08:51:33.621','apple', 1000.0),
('2023-01-16 08:51:11.815','apple_two', 50.0),
('2023-01-16 08:50:51.574','apple_two', 61.9),
('2023-01-16 08:50:42.575','apple_two', 69.0),
('2023-01-16 08:50:21.800','apple_two', 94.0));
SELECT
name,
max(time),
min(time),
argMax(value, time) AS last,
argMin(value, time) AS first
FROM t
GROUP BY name
┌─name──────┬───────────────max(time)─┬───────────────min(time)─┬─last─┬─first─┐
│ apple     │ 2023-01-16 08:52:51.965 │ 2023-01-16 08:51:33.621 │ 1100 │  1000 │
│ apple_two │ 2023-01-16 08:51:11.815 │ 2023-01-16 08:50:21.800 │   50 │    94 │
└───────────┴─────────────────────────┴─────────────────────────┴──────┴───────┘
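The fix works because ClickHouse's groupArray collects values in whatever order rows happen to arrive, so indexing `[1]` and `[-1]` is nondeterministic, while argMin/argMax deterministically pick the value at the minimum/maximum time. Outside ClickHouse the same first/last-per-group logic can be sketched with standard window functions; the following Python/SQLite snippet is an illustration of the idea, not the original ClickHouse query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t(time TEXT, name TEXT, value REAL);
INSERT INTO t VALUES
  ('2023-01-16 08:52:51.965','apple',1100.0),
  ('2023-01-16 08:52:23.665','apple',691.3),
  ('2023-01-16 08:52:01.915','apple',107.0),
  ('2023-01-16 08:51:33.621','apple',1000.0),
  ('2023-01-16 08:51:11.815','apple_two',50.0),
  ('2023-01-16 08:50:51.574','apple_two',61.9),
  ('2023-01-16 08:50:42.575','apple_two',69.0),
  ('2023-01-16 08:50:21.800','apple_two',94.0);
""")

# FIRST_VALUE/LAST_VALUE over an explicitly ordered window play the role
# of argMin/argMax: the order is stated, so the result is deterministic.
rows = conn.execute("""
SELECT DISTINCT name,
       FIRST_VALUE(value) OVER w AS first_value,
       LAST_VALUE(value)  OVER w AS last_value,
       LAST_VALUE(value)  OVER w - FIRST_VALUE(value) OVER w AS delta
FROM t
WINDOW w AS (PARTITION BY name ORDER BY time
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY name
""").fetchall()
for r in rows:
    print(r)
```

Note the frame `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`: without it, LAST_VALUE would only see rows up to the current one.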

Counting Sick days over the weekend

I'm trying to solve a problem in the following (simplified) dataset:
Name     Date        Workday  Calenderday  Leave
PersonA  2023-01-01  0        1            NULL
PersonA  2023-01-07  0        1            NULL
PersonA  2023-01-08  0        1            NULL
PersonA  2023-01-13  1        1            Sick
PersonA  2023-01-14  0        1            NULL
PersonA  2023-01-15  0        1            NULL
PersonA  2023-01-16  1        1            Sick
PersonA  2023-01-20  1        1            Holiday
PersonA  2023-01-21  0        1            NULL
PersonA  2023-01-22  0        1            NULL
PersonA  2023-01-23  1        1            Holiday
PersonB  2023-01-01  0        1            NULL
PersonB  2023-01-02  1        1            Sick
PersonB  2023-01-03  1        1            Sick
The lines with NULL in [Leave] are weekend days.
What I want is a result looking like this:
Name     Leave    PeriodStartDate  PeriodEndDate  Workdays  Weekdays
PersonA  Sick     2023-01-13       2023-01-16     2         4
PersonA  Holiday  2023-01-20       2023-01-23     2         4
PersonB  Sick     2023-01-02       2023-01-03     2         2
where the difference between [Workdays] and [Weekdays] is that weekdays also counts the weekend.
What I have been trying is to first make a row (in two different ways)
ROW_NUMBER() OVER (PARTITION BY [Name] ORDER BY [Date]) as RowNo1
ROW_NUMBER() OVER (PARTITION BY [Name], [Leave] ORDER BY [Date]) as RowNo2
and after that to make a period base date:
DATEADD(DAY, 0 - [RowNo1], [Date]) as PeriodBaseDate1
,DATEADD(DAY, 0 - [RowNo2], [Date]) as PeriodBaseDate2
and after that do something like this:
MIN([Date]) as PeriodStartDate
,MAX([Date]) as PeriodEndDate
,SUM([Calenderday]) as Weekdays
,SUM([Workday]) as Workdays
GROUP BY [PeriodBaseDate (1 or 2?)], [Leave], [Name]
But whatever I do I can't seem to get it to count the weekends in the periods.
It doesn't have to include my try with the RowNo, PeriodBaseDate etc.
As we don't have your actual full attempt, I've provided a complete working solution. First, I use LAST_VALUE to give every row a value for its Leave (provided there was a non-NULL value previously).
Once that's done, you have a gaps-and-islands problem and can aggregate based on it.
Since no version details are given, I assume you are using SQL Server 2022 (the latest version at the time of writing) and thus have access to the IGNORE NULLS syntax.
SELECT *
INTO dbo.YourTable
FROM (VALUES('PersonA',CONVERT(date,'2023-01-01'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-07'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-08'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-13'),1,1,'Sick'),
('PersonA',CONVERT(date,'2023-01-14'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-15'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-16'),1,1,'Sick'),
('PersonA',CONVERT(date,'2023-01-20'),1,1,'Holiday'),
('PersonA',CONVERT(date,'2023-01-21'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-22'),0,1,NULL),
('PersonA',CONVERT(date,'2023-01-23'),1,1,'Holiday'),
('PersonB',CONVERT(date,'2023-01-01'),0,1,NULL),
('PersonB',CONVERT(date,'2023-01-02'),1,1,'Sick'),
('PersonB',CONVERT(date,'2023-01-03'),1,1,'Sick'))V(Name,Date,Workday,Calenderday,Leave);
GO
WITH Leaves AS(
SELECT Name,
[Date],
Workday,
Calenderday, --It's spelled "Calendar"; you should correct this typographical error, as misspelled object names lead to further problems.
--Leave,
LAST_VALUE(Leave) IGNORE NULLS OVER (PARTITION BY Name ORDER BY Date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Leave
FROM dbo.YourTable YT),
LeaveGroups AS(
SELECT Name,
[Date],
Workday,
CalenderDay,
Leave,
ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Date) -
ROW_NUMBER() OVER (PARTITION BY Name, Leave ORDER BY Date) AS Grp
FROM Leaves)
SELECT Name,
Leave,
MIN([Date]) AS PeriodStartDate,
MAX([Date]) AS PeriodEndDate,
SUM(WorkDay) AS WorkDays, --Assumes Workday is not a bit; if it is, CAST or CONVERT it to an int
DATEDIFF(DAY,MIN([Date]), MAX([Date]))+1 AS Weekdays
--SUM(CASE WHEN (DATEPART(WEEKDAY,[Date]) + @@DATEFIRST + 5) % 7 BETWEEN 0 AND 4 THEN 1 END) AS Weekdays --This method is language agnostic
FROM LeaveGroups
WHERE Leave IS NOT NULL
GROUP BY Name,
Leave,
Grp
ORDER BY Name,
PeriodStartDate;
GO
DROP TABLE dbo.YourTable;
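The difference of the two ROW_NUMBERs in LeaveGroups is the classic gaps-and-islands trick: it is constant within each consecutive run of the same (Name, Leave). The same forward-fill-then-group logic can be sketched in plain Python (a hypothetical illustration, assuming rows arrive ordered by name and date):

```python
from itertools import groupby
from datetime import date

# (name, date, workday, calendarday, leave) rows, ordered by name then date
rows = [
    ("PersonA", date(2023, 1, 1), 0, 1, None),
    ("PersonA", date(2023, 1, 7), 0, 1, None),
    ("PersonA", date(2023, 1, 8), 0, 1, None),
    ("PersonA", date(2023, 1, 13), 1, 1, "Sick"),
    ("PersonA", date(2023, 1, 14), 0, 1, None),
    ("PersonA", date(2023, 1, 15), 0, 1, None),
    ("PersonA", date(2023, 1, 16), 1, 1, "Sick"),
    ("PersonA", date(2023, 1, 20), 1, 1, "Holiday"),
    ("PersonA", date(2023, 1, 21), 0, 1, None),
    ("PersonA", date(2023, 1, 22), 0, 1, None),
    ("PersonA", date(2023, 1, 23), 1, 1, "Holiday"),
    ("PersonB", date(2023, 1, 1), 0, 1, None),
    ("PersonB", date(2023, 1, 2), 1, 1, "Sick"),
    ("PersonB", date(2023, 1, 3), 1, 1, "Sick"),
]

# Step 1: forward-fill Leave within each person
# (the LAST_VALUE ... IGNORE NULLS step)
filled = []
last = {}
for name, d, work, cal, leave in rows:
    if leave is not None:
        last[name] = leave
    filled.append((name, d, work, cal, last.get(name)))

# Step 2: gaps and islands -- consecutive runs of the same (name, leave)
# form one period, exactly what the row-number difference identifies
islands = []
for (name, leave), grp in groupby(filled, key=lambda r: (r[0], r[4])):
    grp = list(grp)
    if leave is None:
        continue
    start, end = grp[0][1], grp[-1][1]
    workdays = sum(r[2] for r in grp)
    weekdays = (end - start).days + 1  # counts weekend days too
    islands.append((name, leave, start, end, workdays, weekdays))

for row in islands:
    print(row)
```

The key point is that the weekend rows (Leave = NULL) sit *inside* a filled run, so the end-minus-start span counts them even though they are excluded from the workday sum.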
I am not sure what you are trying to do, but based on what I understood, the script below gives the expected output.
SELECT Name, Leave, MIN(Date) PeriodStartDate, MAX(Date) PeriodEndDate,
       SUM(Workday) Workdays, DATEDIFF(DAY, MIN(Date), MAX(Date)) + 1 Weekdays
FROM YourTable
WHERE Leave IS NOT NULL
GROUP BY Name, Leave

Obtain corresponding value to max value of another column

I need to find the corresponding value to the max value of another column.
My data is as below:
group  subgroup  subgroup_2  value_a  value_b  date
A      101       1           200      101      20220301
A      102       1           105      90       20220301
A      103       2           90       202      20220301
A      211       2           75       107      20220301
B      212       1           91       65       20220301
B      213       1           175      101      20220301
I would need to format the data like this:
group  subgroup_2  max_value_a  value_b  date
A      1           200          101      20220301
A      2           90           202      20220301
B      1           175          101      20220301
I can achieve the format fairly easily via a GROUP BY; however, I then have to aggregate value_b, which doesn't give me the result I need.
I know I can use rank() over a partition, but it doesn't seem to provide the format I require.
This is the query I used; however, it only provides the max for one subgroup_2 rather than the max of each:
select group, subgroup_2, max_value_a, value_b, date
from
(
select a.group, a.subgroup_2, a.max_value_a, a.value_b, a.date,
rank() over(partition by a.group, subgroup_2, a.date order by a.max_value_a desc) as rnk
from table_1 a
)s
where rnk=1
You want to use ROW_NUMBER here:
SELECT group, subgroup_2, value_a AS max_value_a, value_b, date
FROM
(
SELECT group, subgroup_2, value_a, value_b, date,
ROW_NUMBER() OVER (PARTITION BY group, subgroup_2 ORDER BY value_a DESC) rn
FROM table_1
) t
WHERE rn = 1;
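ROW_NUMBER with a partition per (group, subgroup_2) and a filter on rn = 1 is the standard "top row per group" pattern. Here is a hypothetical SQLite/Python sketch of the same query; `group` and `date` are reserved words in many dialects, so the columns are renamed `grp` and `dt` for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_1(grp TEXT, subgroup INTEGER, subgroup_2 INTEGER,
                     value_a INTEGER, value_b INTEGER, dt TEXT);
INSERT INTO table_1 VALUES
  ('A', 101, 1, 200, 101, '20220301'),
  ('A', 102, 1, 105,  90, '20220301'),
  ('A', 103, 2,  90, 202, '20220301'),
  ('A', 211, 2,  75, 107, '20220301'),
  ('B', 212, 1,  91,  65, '20220301'),
  ('B', 213, 1, 175, 101, '20220301');
""")

# Number the rows within each (grp, subgroup_2) from highest value_a down,
# then keep only the first row of each partition.
rows = conn.execute("""
SELECT grp, subgroup_2, value_a AS max_value_a, value_b, dt
FROM (
  SELECT grp, subgroup_2, value_a, value_b, dt,
         ROW_NUMBER() OVER (PARTITION BY grp, subgroup_2
                            ORDER BY value_a DESC) AS rn
  FROM table_1
) t
WHERE rn = 1
ORDER BY grp, subgroup_2
""").fetchall()
for r in rows:
    print(r)
```

Because the surviving row is a whole row, value_b comes along un-aggregated, which is exactly what a plain GROUP BY could not do.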

how to find if one cell appears one time or multiple times

I would like to ask an SQL question:
serial_number  produced_date  date_fixed  days_since_fixed  order_number
1000106        2020-07-09     2021-09-17  432               10258678
1000122        2020-05-20     2021-10-05  497               10266059
1000171        2020-05-27     2021-09-06  457               10249739
1000174        2020-05-12     2020-07-28  56                10117509
1000183        2020-08-14     2020-08-20  6                 10125927
1000183        2020-08-14     2020-08-22  8                 10126417
1000227        2020-05-19     2021-08-26  457               10245064
The table looks like this. I would like to check whether a serial_number equals the next one, since the same serial number can appear multiple times with different order_numbers, meaning that serial_number has had several fixes. For a serial_number that appears only once, the new column days_since_last_fix should be 0; if it appears multiple times, it should be the next date_fixed minus the current date_fixed. I'm not sure how to do this..
but something like this:
select t1.*,
  case when t1.serial_number = t2.serial_number
    --if it appears multiple times
    then date_diff(t2.date_fixed, t1.date_fixed, day)
    --if it appears once
    else date_diff(t1.date_fixed, t1.produced_date, day)
  end as days_since_last_repair
from table t1, table t2
where t1.serial_number = t2.serial_number
and t1.order_number != t2.order_number
thank you
I'm not quite sure what you want to achieve. What do t1 and t2 look like? If you want to look at values on the next row, you can use analytic functions; check this doc out for some help. Alternatively, you could group the rows by serial number as below.
Single row grouped approach
WITH data as
((SELECT 1000106 serial_number, '2020-07-09' produced_date, '2021-09-17' date_fixed, 432 days_since_fixed, 10258678 order_number)
UNION ALL ( SELECT 1000122,'2020-05-20','2021-10-05',497,10266059)
UNION ALL ( SELECT 1000171,'2020-05-27','2021-09-06',457,10249739)
UNION ALL ( SELECT 1000174,'2020-05-12','2020-07-28',56,10117509)
UNION ALL ( SELECT 1000183,'2020-08-14','2020-08-20',6,10125927)
UNION ALL ( SELECT 1000183,'2020-08-14','2020-08-22',8,10126417)
UNION ALL ( SELECT 1000227,'2020-05-19','2021-08-26',457,10245064))
SELECT serial_number, COUNT(serial_number) as fix_count, ARRAY_AGG(order_number) as order_numbers,
CASE
WHEN COUNT(serial_number) > 1 THEN DATE_DIFF(MAX(DATE(date_fixed)), MIN(DATE(date_fixed)), DAY)
WHEN COUNT(serial_number) = 1 THEN 0
END as days_since_last_repair
FROM data
GROUP BY serial_number
Results from above
serial_number  fix_count  order_numbers         days_since_last_repair
1000106        1          [10258678]            0
1000122        1          [10266059]            0
1000171        1          [10249739]            0
1000174        1          [10117509]            0
1000183        2          [10125927, 10126417]  2
1000227        1          [10245064]            0
Analytic function approach
The query below uses the functions from the doc above. It takes the data from the original table, partitions by serial_number, orders by date_fixed, then looks at the previous row for the last date it was fixed, replacing NULLs with 0.
WITH data as
((SELECT 1000106 serial_number, '2020-07-09' produced_date, '2021-09-17' date_fixed, 432 days_since_fixed, 10258678 order_number)
UNION ALL ( SELECT 1000122,'2020-05-20','2021-10-05',497,10266059)
UNION ALL ( SELECT 1000171,'2020-05-27','2021-09-06',457,10249739)
UNION ALL ( SELECT 1000174,'2020-05-12','2020-07-28',56,10117509)
UNION ALL ( SELECT 1000183,'2020-08-14','2020-08-20',6,10125927)
UNION ALL ( SELECT 1000183,'2020-08-14','2020-08-22',8,10126417)
UNION ALL ( SELECT 1000227,'2020-05-19','2021-08-26',457,10245064))
SELECT serial_number, produced_date, date_fixed,
IFNULL(DATE_DIFF(DATE(date_fixed), DATE(LAG(date_fixed, 1) OVER (PARTITION BY serial_number ORDER BY date_fixed)), DAY), 0) as days_since_fixed
FROM data
Result from the above query.
serial_number  produced_date  date_fixed  days_since_fixed
1000106        2020-07-09     2021-09-17  0
1000122        2020-05-20     2021-10-05  0
1000171        2020-05-27     2021-09-06  0
1000174        2020-05-12     2020-07-28  0
1000183        2020-08-14     2020-08-20  0
1000183        2020-08-14     2020-08-22  2
1000227        2020-05-19     2021-08-26  0

sql/oracle select values separated by comma with grouping

I have a first table, table_1:
date        group_number  c_id  rate
01.01.2020  A             001   12.0
02.01.2020  A             001   12.0
01.01.2020  A             002   10.0
01.01.2020  B             103   8.0
01.01.2020  B             101   8.0
01.01.2020  C             203   11.0
And a second table, table_2, with the group names and dates of records:
date        group_number
01.01.2020  A
02.02.2020  A
03.03.2020  A
01.01.2020  B
01.02.2020  B
01.01.2020  C
The task is to add a new column to table_2 containing the rates of each c_id, separated by commas and grouped by group_number. The result should look like this:
date        group_number  rate_for_groups
01.01.2020  A             12.0, 10.0
02.02.2020  A             12.0, 10.0
03.03.2020  A             12.0, 10.0
01.01.2020  B             8.0, 8.0
01.02.2020  B             8.0, 8.0
01.01.2020  C             11.0
I have tried something like this:
select *,
listagg(rate, ',') within group (order by C_ID) as rates
from table_1
group by group_number
but it raised the error "not a group by expression"
Your query shows only half the task: you are only looking at table_1. With GROUP BY group_number you tell the DBMS to return one row per group_number, which is fine for that table. But you cannot SELECT * then, because there are several rows per group_number, and the DBMS cannot know which row's values to display.
Remove the * from that query to make it valid and select the group_number instead. Then join this result to table_2:
select *
from table_2 t2
left outer join
(
select
group_number,
listagg(rate, ',') within group (order by c_id) as rates
from table_1
group by group_number
) t1 on t1.group_number = t2.group_number
order by t2.group_number, t2.date;
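The same join-an-aggregated-subquery shape works in any engine that has a string-aggregation function. As a hypothetical sketch in Python/SQLite, where group_concat is the counterpart of LISTAGG (with the caveat that SQLite does not guarantee the within-group order the way LISTAGG ... WITHIN GROUP does, and `date` is renamed `dt` to avoid the keyword):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_1(dt TEXT, group_number TEXT, c_id TEXT, rate REAL);
INSERT INTO table_1 VALUES
  ('01.01.2020', 'A', '001', 12.0),
  ('02.01.2020', 'A', '001', 12.0),
  ('01.01.2020', 'A', '002', 10.0),
  ('01.01.2020', 'B', '103', 8.0),
  ('01.01.2020', 'B', '101', 8.0),
  ('01.01.2020', 'C', '203', 11.0);
CREATE TABLE table_2(dt TEXT, group_number TEXT);
INSERT INTO table_2 VALUES
  ('01.01.2020', 'A'), ('02.02.2020', 'A'), ('03.03.2020', 'A'),
  ('01.01.2020', 'B'), ('01.02.2020', 'B'),
  ('01.01.2020', 'C');
""")

# Aggregate table_1 to one row per group_number, then LEFT JOIN so every
# table_2 row keeps its date and picks up the concatenated rates.
rows = conn.execute("""
SELECT t2.dt, t2.group_number, t1.rates
FROM table_2 t2
LEFT JOIN (
  SELECT group_number,
         group_concat(rate, ', ') AS rates  -- SQLite's LISTAGG equivalent
  FROM table_1
  GROUP BY group_number
) t1 ON t1.group_number = t2.group_number
ORDER BY t2.group_number, t2.dt
""").fetchall()
for r in rows:
    print(r)
```

Note that group A has the rate 12.0 twice in table_1 (one row per date), so the concatenation carries the duplicate; if the desired output really is one rate per c_id, the subquery should aggregate over `SELECT DISTINCT group_number, c_id, rate` first.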

How to get deduped conversions per unique users with a 24 hour window

I need to get deduped conversions for each unique user. The rule: I need a column that counts only the first conversion made within a day. So I can trigger 10 conversions on 3/28/2019, but the 'Deduped' column will only show a count of 1.
This is my original data in BigQuery:
Date User_ID
3/3/19 1234
3/3/19 1234
3/3/19 1234
3/3/19 12
3/3/19 12
3/4/19 1234
3/4/19 1234
3/5/19 1
I want my final output to look like this:
Date User_ID Total_Conversions Deduped
3/3/19 1234 3 1
3/3/19 12 2 1
3/5/19 1 1 1
3/4/19 1234 2 1
Below is for BigQuery Standard SQL
#standardSQL
SELECT day, user_id,
COUNT(1) total_conversions,
COUNT(DISTINCT user_id) deduped
FROM `project.dataset.table`
GROUP BY day, user_id
Applying the above to the sample data from your question, the result is:
Row day user_id total_conversions deduped
1 3/4/19 1234 2 1
2 3/5/19 1 1 1
3 3/3/19 1234 3 1
4 3/3/19 12 2 1
Note: check the comments I left below your question!
How about if we didn't have the date column in the final output? What if the deduped rule were constructed so that the grouping is based on user_id, without the 'day' column in the final output?
The example below does this:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '3/3/19' day, 1234 user_id UNION ALL
SELECT '3/3/19', 1234 UNION ALL
SELECT '3/3/19', 1234 UNION ALL
SELECT '3/3/19', 12 UNION ALL
SELECT '3/3/19', 12 UNION ALL
SELECT '3/4/19', 1234 UNION ALL
SELECT '3/4/19', 1234 UNION ALL
SELECT '3/5/19', 1
)
SELECT user_id,
COUNT(1) total_conversions,
COUNT(DISTINCT day) deduped
FROM `project.dataset.table`
GROUP BY user_id
with this result:
Row user_id total_conversions deduped
1 12 2 1
2 1 1 1
3 1234 5 2
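Both variants are plain GROUP BY aggregations; only the grouping key changes. As a hypothetical illustration, the second (per-user) variant in Python/SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversions(day TEXT, user_id INTEGER);
INSERT INTO conversions VALUES
  ('3/3/19', 1234), ('3/3/19', 1234), ('3/3/19', 1234),
  ('3/3/19', 12),   ('3/3/19', 12),
  ('3/4/19', 1234), ('3/4/19', 1234),
  ('3/5/19', 1);
""")

# COUNT(*) counts every conversion; COUNT(DISTINCT day) counts at most one
# conversion per day per user, which is exactly the dedup rule.
rows = conn.execute("""
SELECT user_id,
       COUNT(*) AS total_conversions,
       COUNT(DISTINCT day) AS deduped
FROM conversions
GROUP BY user_id
ORDER BY user_id
""").fetchall()
for r in rows:
    print(r)
```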