View and complex query: count distinct locations an employee stayed in (SQL)

I have a view, view_1, which looks like this:
id  Office   Begin_dt    Last_dt     Days
1   Office1  2019-09-02  2019-09-08  6
1   Office2  2019-09-09  2019-09-30  21
1   Office1  2019-10-01  2019-10-31  30
5   Office3  2017-10-01  2017-10-16  15
5   Office2  2017-10-17  2017-10-30  13
5   Office2  2017-11-01  2017-11-30  30
I want to find the office where the employee stayed the longest, and also the number of distinct office locations they stayed in.
Expected output:
id  Max_time_in_Office  Days  Distinct_office_locations
1   Office1             36    2
5   Office2             43    2
So id 1 spends 6 and 30 days in Office1, 36 days overall; his maximum time is spent in Office1, and the distinct locations are 2.
Id 5 spends 13 and 30 days in Office2, 43 overall; his maximum time is spent in Office2, and the distinct locations are 2.
Code tried:
select v.*
from (select v.id, v.office, sum(days) as Max_time_in_Office,
             count(Office) as Distinct_office_locations,
             rank() over (partition by id order by sum(days) desc) as seqnum
      from view_1 v
      group by id, office
     ) v
where seqnum = 1;
Output obtained:
id  Max_time_in_Office  Days  Distinct_office_locations
1   Office1             36    1
5   Office2             43    1
So I am getting the wrong output. Can someone please help?

Close. You want a window function:
select v.*
from (select v.id, v.office, sum(days) as Max_time_in_Office,
             count(*) over (partition by id) as Distinct_office_locations,
             rank() over (partition by id order by sum(days) desc) as seqnum
      from view_1 v
      group by id, office
     ) v
where seqnum = 1;
Basically, the window function counts the number of rows returned after the aggregation -- and there is one row per office.
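For intuition, running just the inner aggregation on the sample data would yield one row per (id, office) pair (a sketch of the intermediate result, not verified against a live server):

select v.id, v.office, sum(days) as Max_time_in_Office
from view_1 v
group by id, office;

-- id | office  | Max_time_in_Office
-- 1  | Office1 | 36
-- 1  | Office2 | 21
-- 5  | Office2 | 43
-- 5  | Office3 | 15

So count(*) over (partition by id) yields 2 for each id, which is exactly the number of distinct offices.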

You could use the apply operator to achieve that:
select V.Id,
       T.Max_Time_Office,
       T.Days,
       Distinct_office_locations = count(distinct V.Office)
from view_1 V
cross apply
(
    select top 1 VG.Id,
           Max_Time_Office = VG.Office,
           Days = sum(VG.Days)
    from view_1 VG
    where V.Id = VG.Id
    group by VG.Id, VG.Office
    order by sum(VG.Days) desc
) T
group by V.Id, T.Max_Time_Office, T.Days
Basically, you get the office with the most days via the order by sum(Days) desc inside the cross apply, and use it in the outer query. Then count(distinct V.Office) gives the distinct offices.
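As a sanity check, with the sample rows above this should produce the expected output (a sketch of the result, not run against a live server):

-- id | Max_Time_Office | Days | Distinct_office_locations
-- 1  | Office1         | 36   | 2
-- 5  | Office2         | 43   | 2

The outer group by V.Id, T.Max_Time_Office, T.Days is what lets count(distinct V.Office) scan each id's original rows while keeping the single winning office from T.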

Related

Recursive snowflake query of monthly snapshots for time series analysis dropping records prior to change for users who changed department

This is a follow-up to an earlier question, which explains the objective of this query and provides a sample of the source data.
With help, I have this recursive query running; it is more efficient than my non-recursive query, which was repeated 36 times and unioned together.
The purpose of this query is to know which department an employee was in at the end of each month. The problem with this code is that, for employees who changed departments, it returns only the month-end department values for months after the most recent department change, and no prior records. For employees who changed departments, the output should contain this data:
Month - Department Code
0 - 100
1 - 100
2 - 200
3 - 200
And it is currently returning:
Month - Department Code
0 - 100
1 - 100
Here is the query:
WITH Q AS (
    select
        row_number() over(order by null) as q_level,
        last_day(dateadd(month, -q_level, CURRENT_DATE), month) as last_day_month
    from table(generator(ROWCOUNT=>36))
), Q1 AS (
    select
        q.q_level
        ,q.last_day_month
        ,v_dept_history_adj.associate_id
        ,v_dept_history_adj.home_department_code
        ,v_dept_history_adj.position_effective_date
        ,max(position_effective_date) OVER(PARTITION BY v_dept_history_adj.associate_id) AS most_recent_record
    from datawarehouse.srctable as v_dept_history_adj
    ,Q
    where v_dept_history_adj.position_effective_date <= q.last_day_month
)
select
    associate_id
    ,position_effective_date
    ,home_department_code
    ,most_recent_record
    ,last_day_month AS month
FROM Q1
where position_effective_date = most_recent_record
order by month desc, position_effective_date desc
So now that the larger picture of your question makes sense:
To get the most recent department per month for each employee, I would write this query like so:
with emp_data(emp_id, dep_id, date) as (
select * from values
(1, 10, '2022-01-01'::date),
(1, 20, '2022-07-10'::date),
(2, 10, '2022-07-14'::date)
), last_36_months as (
select
row_number() over(order by null) as q_level,
last_day(dateadd(month, -q_level, CURRENT_DATE), month) as last_day_month
--from table(generator(ROWCOUNT=>36))
from table(generator(ROWCOUNT=>12))
), month_end_data as (
select
e.emp_id
,e.dep_id
,l.last_day_month as month
from last_36_months as l
join emp_data as e
on e.date <= l.last_day_month
qualify row_number() over(partition by e.emp_id, l.last_day_month order by e.date desc) = 1
)
select *
from month_end_data
order by 1,3 desc;
I reduced 36 to 12 and moved the data to 2022 so the output is less verbose, but it gives:
EMP_ID  DEP_ID  MONTH
1       20      2022-10-31
1       20      2022-09-30
1       20      2022-08-31
1       20      2022-07-31
1       10      2022-06-30
1       10      2022-05-31
1       10      2022-04-30
1       10      2022-03-31
1       10      2022-02-28
1       10      2022-01-31
2       10      2022-10-31
2       10      2022-09-30
2       10      2022-08-31
2       10      2022-07-31
which seems more aligned with what you want, and simpler to read.
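If it helps, swapping the inline emp_data values for the real source should only require renaming columns in one CTE (a sketch; the column names come from the question's query and are assumed to map as shown):

with emp_data as (
    select associate_id            as emp_id,
           home_department_code    as dep_id,
           position_effective_date as date
    from datawarehouse.srctable
)
-- last_36_months and month_end_data then continue exactly as in the query above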

Group items from the first time + certain time period

I want to group orders from the same customer if they happen within 10 minutes of the customer's first order, then find the next first order, group from there, and so on.
Ex:
Customer  group  orders
6         1      3
          2      4,5
          3      8
7         1      9,10
          2      11,12
          3      13
id customer time
3 6 2021-05-12 12:14:22.000000
4 6 2021-05-12 12:24:24.000000
5 6 2021-05-12 12:29:16.000000
8 6 2021-05-12 13:01:40.000000
9 7 2021-05-14 12:13:11.000000
10 7 2021-05-14 12:20:01.000000
11 7 2021-05-14 12:45:00.000000
12 7 2021-05-14 12:48:41.000000
13 7 2021-05-14 12:58:16.000000
18 9 2021-05-18 12:22:13.000000
25 15 2021-05-18 13:44:02.000000
26 16 2021-05-17 09:39:02.000000
27 16 2021-05-18 19:38:43.000000
28 17 2021-05-18 15:40:02.000000
29 18 2021-05-19 15:32:53.000000
30 18 2021-05-19 15:45:56.000000
31 18 2021-05-19 16:29:09.000000
34 15 2021-05-24 15:45:14.000000
35 15 2021-05-24 15:45:14.000000
36 19 2021-05-24 17:14:53.000000
Here is what I have currently. I think it is not grouping by customer when evaluating case when d.StartTime > dateadd(minute, 10, c.first_time), so it compares the StartTime of all orders across all customers.
with
data as (
    select Customer, StartTime, Id,
           row_number() over(partition by Customer order by StartTime) rn
    from orders t
),
cte as (
    select d.*, StartTime as first_time
    from data d
    where rn = 1
    union all
    select d.*,
           case when d.StartTime > dateadd(minute, 10, c.first_time)
                then d.StartTime
                else c.first_time
           end
    from cte c
    inner join data d on d.rn = c.rn + 1
)
select c.*, dense_rank() over(partition by Customer order by first_time) grp
from cte c;
I have two databases (MySQL & SQL Server) with similar schemas, so either would work for me.
Try the following on SQL Server:
SELECT customer,
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY grp) AS group_no,
STRING_AGG(id, ',') AS orders
FROM
(
SELECT id,customer, [time],
(DATEDIFF(SECOND, MIN([time]) OVER (PARTITION BY CUSTOMER), [time])/60)/10 grp
FROM orders
) T
GROUP BY customer, grp
ORDER BY customer
According to your posted requirement, you are trying to divide the period between the first order date and the last order date into groups (or, let's say, time frames), each 10 minutes long.
What I did in this query: for each customer order, find the difference between the order date and the minimum date (the first customer order date) in seconds, divide by 60 to get minutes, then divide by 10 to get its time-frame number. E.g. for a difference of 599s, the frame number is 599/60 = 9m, 9/10 = 0; for a difference of 620s, it is 620/60 = 10m, 10/10 = 1.
After defining the correct group/time frame for each order, you can simply use the STRING_AGG function to get the desired output. Note that STRING_AGG applies to SQL Server 2017 (14.x) and later.
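Since you mentioned MySQL would also work: here is a minimal sketch of the same idea for MySQL, assuming MySQL 8.0+ (needed for window functions) and the time column from your sample data; GROUP_CONCAT replaces STRING_AGG:

SELECT customer,
       ROW_NUMBER() OVER (PARTITION BY customer ORDER BY grp) AS group_no,
       GROUP_CONCAT(id ORDER BY id) AS orders
FROM
(
    SELECT id, customer, `time`,
           -- seconds since the customer's first order, bucketed into 10-minute frames
           FLOOR(TIMESTAMPDIFF(SECOND,
                               MIN(`time`) OVER (PARTITION BY customer),
                               `time`) / 600) AS grp
    FROM orders
) T
GROUP BY customer, grp
ORDER BY customer, grp;

Note that GROUP_CONCAT output is capped by group_concat_max_len (default 1024), which may matter for large groups.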

SQLite query - Limit occurrence of value

I have a query that returns this result. How can I limit the number of occurrences of a value in the 4th column?
19 1 _BOURC01 1
20 1 _BOURC01 3 2019-11-18
20 1 _BOURC01 3 2017-01-02
21 1 _BOURC01 6
22 1 _BOURC01 10
23 1 _BOURC01 13 2016-06-06
24 1 _BOURC01 21 2016-09-19
My Query:
SELECT "_44_SpeakerSpeech"."id" AS "id", "_44_SpeakerSpeech"."active" AS "active", "_44_SpeakerSpeech"."id_speaker" AS "id_speaker", "_44_SpeakerSpeech"."Speech" AS "Speech", "34 Program Weekend"."date" AS "date"
FROM "_44_SpeakerSpeech"
LEFT JOIN "_34_programWeekend" "34 Program Weekend" ON "_44_SpeakerSpeech"."Speech" = "34 Program Weekend"."theme_id"
WHERE "id_speaker" = "_BOURC01"
ORDER BY id_speaker, Speech, date DESC
Thanks
I think this is what you want here:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY s.id, s.active, s.id_speaker, s.Speech
ORDER BY p.date DESC) rn
FROM "_44_SpeakerSpeech" s
LEFT JOIN "_34_programWeekend" p ON s.Speech = p.theme_id
WHERE s.id_speaker = '_BOURC01'
)
SELECT id, active, id_speaker, Speech, date
FROM cte
WHERE rn = 1;
This logic assumes that when two or more records share the same column values (excluding the date), you want to retain only the latest record.
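If "limit the occurrence" means keeping up to N rows per group rather than only the latest, the same pattern works with a looser filter (a sketch; window functions require SQLite 3.25+):

WITH cte AS (
    SELECT s.*, p.date,
           ROW_NUMBER() OVER (PARTITION BY s.id, s.active, s.id_speaker, s.Speech
                              ORDER BY p.date DESC) rn
    FROM "_44_SpeakerSpeech" s
    LEFT JOIN "_34_programWeekend" p ON s.Speech = p.theme_id
    WHERE s.id_speaker = '_BOURC01'
)
SELECT id, active, id_speaker, Speech, date
FROM cte
WHERE rn <= 2;  -- keep at most two rows per group instead of one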

To subtract a previous row value in SQL Server 2012

This is my SQL query:
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) [Sno],
_Date,
SUM(Payment) Payment
FROM
DailyPaymentSummary
GROUP BY
_Date
ORDER BY
_Date
This returns output like this
Sno _Date Payment
---------------------------
1 2017-02-02 46745.80
2 2017-02-03 100101.03
3 2017-02-06 140436.17
4 2017-02-07 159251.87
5 2017-02-08 258807.51
6 2017-02-09 510986.79
7 2017-02-10 557399.09
8 2017-02-13 751405.89
9 2017-02-14 900914.45
How can I get an additional column like the one below?
Sno _Date Payment Diff
--------------------------------------
1 02/02/2017 46745.80 46745.80
2 02/03/2017 100101.03 53355.23
3 02/06/2017 140436.17 40335.14
4 02/07/2017 159251.87 18815.70
5 02/08/2017 258807.51 99555.64
6 02/09/2017 510986.79 252179.28
7 02/10/2017 557399.09 46412.30
8 02/13/2017 751405.89 194006.80
9 02/14/2017 900914.45 149508.56
I have tried the following query but have not been able to resolve the error:
WITH cte AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) [Sno],
_Date,
SUM(Payment) Payment
FROM
DailyPaymentSummary
GROUP BY
_Date
ORDER BY
_Date
)
SELECT
t.Payment,
t.Payment - COALESCE(tprev.col, 0) AS diff
FROM
DailyPaymentSummary t
LEFT OUTER JOIN
t tprev ON t.seqnum = tprev.seqnum + 1;
Can anyone help me?
Use an ORDER BY with explicit column(s) to get consistent results.
Use the lag function to get data from the previous row and do the subtraction, like this:
with t
as (
select ROW_NUMBER() over (order by _date) [Sno],
_Date,
sum(Payment) Payment
from DailyPaymentSummary
group by _date
)
select *,
Payment - lag(Payment, 1, 0) over (order by [Sno]) diff
from t;
You can also use lag() to get the previous row's value:
coalesce(lag(sum_payment_col) OVER (ORDER BY (SELECT 1)), 0)
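For completeness, a minimal sketch that embeds this expression in the original aggregation (sum_payment_col above is a placeholder for the aggregated payment; ordering by _Date is deterministic, unlike (SELECT 1)):

select _Date,
       sum(Payment) as Payment,
       -- subtract the previous day's total; 0 for the first row
       sum(Payment) - coalesce(lag(sum(Payment)) over (order by _Date), 0) as Diff
from DailyPaymentSummary
group by _Date
order by _Date;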

How to calculate median of a numeric sequence in Google BigQuery efficiently?

I need to calculate the median value of a numeric sequence in Google BigQuery efficiently. Is that possible?
Yeah, it's possible with the PERCENTILE_CONT window function. Per the documentation:
Returns values that are based upon linear interpolation between the values of the group, after ordering them per the ORDER BY clause. The percentile argument must be between 0 and 1.
This window function requires ORDER BY in the OVER clause.
So an example query would look like the following (the max() is there just to collapse the rows for the GROUP BY; it is not doing any real math and should not confuse you):
SELECT room, max(median)
FROM (
  SELECT room,
         percentile_cont(0.5) OVER (PARTITION BY room ORDER BY temperature) AS median
  FROM
    (SELECT 1 AS room, 11 AS temperature),
    (SELECT 1 AS room, 12 AS temperature),
    (SELECT 1 AS room, 14 AS temperature),
    (SELECT 1 AS room, 19 AS temperature),
    (SELECT 1 AS room, 13 AS temperature),
    (SELECT 2 AS room, 20 AS temperature),
    (SELECT 2 AS room, 21 AS temperature),
    (SELECT 2 AS room, 29 AS temperature),
    (SELECT 3 AS room, 30 AS temperature)
)
GROUP BY room
This returns:
+------+--------+
| room | median |
+------+--------+
|    1 |     13 |
|    2 |     21 |
|    3 |     30 |
+------+--------+
Alternative solution, when you don't need absolutely exact results and an approximation is fine: you can use a combination of the NTH and QUANTILES aggregation functions. The advantage of this method is that it is much more scalable than analytic window functions; the disadvantage is that it gives approximate results.
SELECT room,
       NTH(50, QUANTILES(temperature, 101))
FROM
  (SELECT 1 AS room, 11 AS temperature),
  (SELECT 1 AS room, 12 AS temperature),
  (SELECT 1 AS room, 14 AS temperature),
  (SELECT 1 AS room, 19 AS temperature),
  (SELECT 1 AS room, 13 AS temperature),
  (SELECT 2 AS room, 20 AS temperature),
  (SELECT 2 AS room, 21 AS temperature),
  (SELECT 2 AS room, 29 AS temperature),
  (SELECT 3 AS room, 30 AS temperature)
GROUP BY room
This returns:
room  median
1     13
2     21
3     30
2018 update with more metrics:
BigQuery SQL: Average, geometric mean, remove outliers, median
For my own memory purposes, working queries with taxi data:
Approximate quantiles:
SELECT MONTH(pickup_datetime) month, NTH(51, QUANTILES(tip_amount,101)) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
Gives the same results as PERCENTILE_DISC:
SELECT month, FIRST(median) median
FROM (
SELECT MONTH(pickup_datetime) month, tip_amount, PERCENTILE_DISC(0.5) OVER(PARTITION BY month ORDER BY tip_amount) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
)
GROUP BY 1
ORDER BY 1
StandardSQL:
#StandardSQL
SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) month, APPROX_QUANTILES(tip_amount,1000)[OFFSET(500)] median
FROM `nyc-tlc.green.trips_2015`
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
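For reference, current BigQuery standard SQL also exposes PERCENTILE_CONT directly as an analytic function, so the room example can be reproduced like this (a sketch, rebuilding the sample rows with UNNEST):

#standardSQL
SELECT DISTINCT room,
       -- exact interpolated median per room; DISTINCT collapses the per-row window output
       PERCENTILE_CONT(temperature, 0.5) OVER (PARTITION BY room) AS median
FROM UNNEST([
  STRUCT(1 AS room, 11 AS temperature),
  (1, 12), (1, 14), (1, 19), (1, 13),
  (2, 20), (2, 21), (2, 29),
  (3, 30)
])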