How to partition by a customized sum value? - sql

I have a table with the following columns: customer_id, event_date_time
I'd like to figure out how many times a customer triggers an event every 12 hours from the start of an event. In other words, aggregate the time between events for up to 12 hours by customer.
For example, if a customer triggers an event (in order) at noon, 1:30pm, 5pm, 2am, and 3pm, I would want to return the noon, 2am, and 3pm records.
I've written this query:
select cust_id, event_datetime,
       nvl(24 * (event_datetime - lag(event_datetime) over
                 (partition by cust_id order by event_datetime)), 0) as difference
from tbl
I feel like I'm close with this. Is there a way to add something like
over (partition BY cust_id, sum(difference)<12 ORDER BY event_datetime)
EDIT: I'm adding some sample data:
+---------+-----------------+-------------+---+
| cust_id | event_datetime  | DIFFERENCE  | X |
+---------+-----------------+-------------+---+
| 1       | 6/20/2015 23:35 | 0           | x |
| 1       | 6/21/2015 0:09  | 0.558611111 |   |
| 1       | 6/21/2015 0:49  | 0.667777778 |   |
| 1       | 6/21/2015 1:30  | 0.688333333 |   |
| 1       | 6/21/2015 9:38  | 8.133055556 |   |
| 1       | 6/21/2015 10:09 | 0.511111111 |   |
| 1       | 6/21/2015 10:45 | 0.600555556 |   |
| 1       | 6/21/2015 11:09 | 0.411111111 |   |
| 1       | 6/21/2015 11:32 | 0.381666667 |   |
| 1       | 6/21/2015 11:55 | 0.385       | x |
| 1       | 6/21/2015 12:18 | 0.383055556 |   |
| 1       | 6/21/2015 12:23 | 0.074444444 |   |
| 1       | 6/22/2015 10:01 | 21.63527778 | x |
| 1       | 6/22/2015 10:24 | 0.380555556 |   |
| 1       | 6/22/2015 10:46 | 0.373611111 |   |
+---------+-----------------+-------------+---+
The rows marked "x" are the records that should be pulled, since they're the first records in each 12-hour block.

If I understand correctly, you want the first record in each 12-hour block where the blocks of time are defined by the first event time.
If so, you need to modify your query to get the difference from the *first* time for each customer. The rest is just arithmetic. The query would look something like this:
with t as (
      select cust_id, event_datetime,
             24 * (event_datetime -
                   min(event_datetime) over (partition by cust_id)
                  ) as difference
      from tbl
     )
select t.*
from (select t.*,
             row_number() over (partition by cust_id, floor(difference / 12)
                                order by difference) as seqnum
      from t
     ) t
where seqnum = 1;
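The bucketing arithmetic is easy to sanity-check outside the database. Here's a minimal Python sketch of the same idea, using hypothetical timestamps matching the noon/1:30pm/5pm/2am/3pm example, with 12-hour blocks measured from each customer's first event:

```python
from datetime import datetime

# Events for one customer, mirroring the example in the question:
# noon, 1:30pm, 5pm, then 2am and 3pm the next day.
events = [
    datetime(2015, 6, 20, 12, 0),
    datetime(2015, 6, 20, 13, 30),
    datetime(2015, 6, 20, 17, 0),
    datetime(2015, 6, 21, 2, 0),
    datetime(2015, 6, 21, 15, 0),
]

first = min(events)
seen_blocks = set()
kept = []
for e in sorted(events):
    # hours since the first event, bucketed into 12-hour blocks
    # (the same thing floor(difference / 12) does in the query)
    block = int((e - first).total_seconds() // 3600 // 12)
    if block not in seen_blocks:  # first event seen in this block
        seen_blocks.add(block)
        kept.append(e)

print([e.strftime("%H:%M") for e in kept])
```

Only the noon, 2am, and 3pm events survive, matching the expected result.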

Related

SQL (Redshift) get start and end values for consecutive data in a given column

I have a table that has the subscription state of users on any given day. The data looks like this
+------------+------------+--------------+
| account_id | date       | current_plan |
+------------+------------+--------------+
| 1          | 2019-08-01 | free         |
| 1          | 2019-08-02 | free         |
| 1          | 2019-08-03 | yearly       |
| 1          | 2019-08-04 | yearly       |
| 1          | 2019-08-05 | yearly       |
| ...        |            |              |
| 1          | 2020-08-02 | yearly       |
| 1          | 2020-08-03 | free         |
| 2          | 2019-08-01 | monthly      |
| 2          | 2019-08-02 | monthly      |
| ...        |            |              |
| 2          | 2019-08-31 | monthly      |
| 2          | 2019-09-01 | free         |
| ...        |            |              |
| 2          | 2019-11-26 | free         |
| 2          | 2019-11-27 | monthly      |
| ...        |            |              |
| 2          | 2019-12-27 | monthly      |
| 2          | 2019-12-28 | free         |
+------------+------------+--------------+
I would like to have a table that gives the start and end dates of a subscription. It would look something like this:
+------------+------------+------------+-------------------+
| account_id | start_date | end_date   | subscription_type |
+------------+------------+------------+-------------------+
| 1          | 2019-08-03 | 2020-08-02 | yearly            |
| 2          | 2019-08-01 | 2019-08-31 | monthly           |
| 2          | 2019-11-27 | 2019-12-27 | monthly           |
+------------+------------+------------+-------------------+
I started by doing a LAG window function with a bunch of WHERE statements to grab the "state changes", but this makes it difficult to see when customers float in and out of subscriptions, and I'm not sure this is the best method.
with lagged as (
    select *,
           LAG(current_plan) OVER (PARTITION BY account_id ORDER BY date ASC) AS previous_plan,
           LAG(date) OVER (PARTITION BY account_id ORDER BY date ASC) AS previous_plan_date
    from data
)
SELECT *
FROM lagged
WHERE (current_plan = 'free' and previous_plan in ('monthly', 'yearly'))
This is a gaps-and-islands problem. I think a difference of row numbers works:
select account_id, current_plan, min(date) as start_date, max(date) as end_date
from (select d.*,
             row_number() over (partition by account_id order by date) as seqnum,
             row_number() over (partition by account_id, current_plan order by date) as seqnum_2
      from data d
     ) d
where current_plan <> 'free'
group by account_id, current_plan, (seqnum - seqnum_2);
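To make the row-number trick concrete, here is a runnable sketch using Python's built-in sqlite3 as a stand-in for Redshift (my own toy data, not the question's full table; it assumes a SQLite build with window-function support, 3.25+):

```python
import sqlite3

# Toy version of the subscriptions table (not the question's full data).
rows = [
    (1, "2019-08-01", "free"),    (1, "2019-08-02", "free"),
    (1, "2019-08-03", "yearly"),  (1, "2019-08-04", "yearly"),
    (1, "2019-08-05", "free"),
    (2, "2019-08-01", "monthly"), (2, "2019-08-02", "monthly"),
    (2, "2019-08-03", "free"),    (2, "2019-08-04", "monthly"),
]
con = sqlite3.connect(":memory:")
con.execute("create table data (account_id int, date text, current_plan text)")
con.executemany("insert into data values (?, ?, ?)", rows)

# The difference of the two row_number() sequences is constant
# within each run ("island") of identical current_plan values.
sql = """
select account_id, min(date) as start_date, max(date) as end_date,
       current_plan as subscription_type
from (select d.*,
             row_number() over (partition by account_id
                                order by date) as seqnum,
             row_number() over (partition by account_id, current_plan
                                order by date) as seqnum_2
      from data d
     ) d
where current_plan <> 'free'
group by account_id, current_plan, seqnum - seqnum_2
order by account_id, start_date
"""
result = con.execute(sql).fetchall()
print(result)
```

Account 2's monthly plan comes out as two separate islands (Aug 1-2 and Aug 4) because the free day in between shifts the row-number difference.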

SQL - Calculate number of occurrences of previous day?

I want to calculate, on a daily basis, the number of people who also had an occurrence the previous day, but I'm not sure how to do this.
Sample Table:
+----+-----------+
| ID | Date      |
+----+-----------+
| 1  | 1/10/2020 |
| 1  | 1/11/2020 |
| 2  | 2/20/2020 |
| 3  | 2/20/2020 |
| 3  | 2/21/2020 |
| 4  | 2/23/2020 |
| 4  | 2/24/2020 |
| 5  | 2/22/2020 |
| 5  | 2/23/2020 |
| 5  | 2/24/2020 |
+----+-----------+
Desired Output:
+-----------+-------+
| Date      | Count |
+-----------+-------+
| 1/11/2020 | 1     |
| 2/21/2020 | 1     |
| 2/23/2020 | 1     |
| 2/24/2020 | 2     |
+-----------+-------+
Edit: Added desired output. The output count should be unique to the ID, not the number of date occurrences; i.e., ID 5 can appear on this list 10 times for dates 2/23/2020 and 2/24/2020, but that would count as 1.
Use lag(). Given your edit, count distinct ids rather than rows:
select date, count(distinct id)
from (select t.*, lag(date) over (partition by id order by date) as prev_date
      from t
     ) t
where prev_date = dateadd(day, -1, date)
group by date;
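A quick way to verify this on the sample data is sqlite3 from the Python standard library (dateadd becomes sqlite's date(..., '-1 day'); the dates are rewritten in ISO form and the column name dt is my own):

```python
import sqlite3

# Sample data from the question, in ISO format so dates sort as text.
rows = [
    (1, "2020-01-10"), (1, "2020-01-11"),
    (2, "2020-02-20"), (3, "2020-02-20"), (3, "2020-02-21"),
    (4, "2020-02-23"), (4, "2020-02-24"),
    (5, "2020-02-22"), (5, "2020-02-23"), (5, "2020-02-24"),
]
con = sqlite3.connect(":memory:")
con.execute("create table t (id int, dt text)")
con.executemany("insert into t values (?, ?)", rows)

# Keep rows whose previous event (per id) was exactly one day earlier,
# then count distinct ids per day.
sql = """
select dt, count(distinct id) as cnt
from (select t.*,
             lag(dt) over (partition by id order by dt) as prev_dt
      from t
     ) t
where prev_dt = date(dt, '-1 day')
group by dt
order by dt
"""
result = con.execute(sql).fetchall()
print(result)
```

This reproduces the desired output: one person on 1/11 and 2/21 and 2/23, two on 2/24.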

Sum up following rows with same status

I am trying to sum up following rows if they have the same id and status.
The DB is running on a Windows Server 2016 and is a Microsoft SQL Server 14.
I was thinking about using a self join, but that would only sum up two rows, or somehow using lead/lag.
Here is what the table looks like (Duration is the days between this row and the next row with the same id, sorted by mod_Date):
+-----+--------------+-------------------------+----------+
| ID  | Status       | mod_Date                | Duration |
+-----+--------------+-------------------------+----------+
| 1   | In Inventory | 2015-04-10 09:11:37.000 | 12       |
| 1   | Deployed     | 2015-04-22 10:13:35.000 | 354      |
| 1   | Deployed     | 2016-04-10 09:11:37.000 | 30       |
| 1   | In Inventory | 2016-05-10 09:11:37.000 | Null     |
| 2   | In Inventory | 2013-04-10 09:11:37.000 | 12       |
| ... | ...          | ...                     | ...      |
+-----+--------------+-------------------------+----------+
There can be several rows with the same status and id following each other, not only two.
And what I want to get is:
+-----+--------------+-------------------------+----------+
| ID  | Status       | mod_Date                | Duration |
+-----+--------------+-------------------------+----------+
| 1   | In Inventory | 2015-04-10 09:11:37.000 | 12       |
| 1   | Deployed     | 2015-04-22 10:13:35.000 | 384      |
| 1   | In Inventory | 2016-05-10 09:11:37.000 | Null     |
| 2   | In Inventory | 2013-04-10 09:11:37.000 | 12       |
| ... | ...          | ...                     | ...      |
+-----+--------------+-------------------------+----------+
This is an example of gaps and islands. In this case, I think the difference of row numbers suffices:
select id, status, min(mod_date) as mod_date, sum(duration) as duration
from (select t.*,
             row_number() over (partition by id, status order by mod_date) as seqnum_is,
             row_number() over (partition by id order by mod_date) as seqnum_i
      from t
     ) t
group by id, status, seqnum_i - seqnum_is;
The trick here is that the difference of two increasing sequences identifies "islands" of where the values are the same. This is rather mysterious the first time you see it. But if you run the subquery, you'll probably quickly see how this works.
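To see why the difference identifies islands, here is the bookkeeping done by hand in Python for the example's first id (statuses in mod_date order):

```python
# Statuses for id 1, in mod_date order (from the question's example).
statuses = ["In Inventory", "Deployed", "Deployed", "In Inventory"]

per_status = {}  # running row_number() over (partition by id, status)
islands = []
for seqnum_i, s in enumerate(statuses, start=1):  # row_number() over (id)
    per_status[s] = per_status.get(s, 0) + 1
    # the difference is constant across consecutive rows with the same status
    islands.append((s, seqnum_i - per_status[s]))

print(islands)
```

The two consecutive Deployed rows share the key (Deployed, 1) and collapse into one group, while the two In Inventory rows get different keys (0 and 2) and stay separate.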

Counting events only once if an event happens more than once every X minutes

I have a table that is filled every time a user starts a session in my app, but I don't want to count a user's session more than once if it starts within 10 minutes of the previous one. How can I do it?
Here's an example of what is returned from the table
select
*
from table
limit 100
+----------+--------+---------+----------------+
| event_ID | userid | city_id | created_at     |
+----------+--------+---------+----------------+
| 1        | a      | 1       | 15/08/19 10:10 |
| 2        | b      | 1       | 15/08/19 10:11 |
| 3        | a      | 1       | 15/08/19 10:14 |
| 4        | a      | 1       | 15/08/19 10:25 |
| 5        | b      | 1       | 15/08/19 10:27 |
| 6        | c      | 1       | 15/08/19 10:30 |
| 7        | c      | 1       | 15/08/19 10:35 |
| 8        | d      | 1       | 15/08/19 10:40 |
| 9        | d      | 1       | 15/08/19 10:49 |
| 10       | c      | 1       | 15/08/19 10:55 |
+----------+--------+---------+----------------+
In the end, I want to count the unique event_ids for each user, where an event only counts as new if it happens more than 10 minutes after the user's previous event.
So it should be something like this in the end:
+--------+------------------+
| userid | unique_event_ids |
+--------+------------------+
| a      | 2                |
| b      | 2                |
| c      | 2                |
| d      | 1                |
+--------+------------------+
+--------+------------------+
| Total  | 7                |
+--------+------------------+
Any suggestion on how to start?
Use lag() to determine when the previous event was created for the user. Then some date filtering and aggregation:
select userid, count(*)
from (select t.*,
lag(created_at) over (partition by userid order by created_at) as prev_created_at
from t
) t
where prev_created_at is null or prev_created_at < created_at - interval '10 minute'
group by userid
I would do:
select
userid,
sum(case when created_at - interval '10 minute' < prev then 0 else 1 end)
as unique_events_ids
from (
select
*,
lag(created_at) over(partition by userid order by created_at) as prev
from t
) x
group by userid
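Both answers hinge on comparing each event to the user's previous one. Here is the first approach run against the sample data, with Python's sqlite3 standing in for the database (interval arithmetic becomes sqlite's datetime(..., '-10 minutes'), and the timestamps are rewritten in ISO form):

```python
import sqlite3

rows = [  # (event_id, userid, created_at) from the question's sample
    (1, "a", "2019-08-15 10:10:00"), (2, "b", "2019-08-15 10:11:00"),
    (3, "a", "2019-08-15 10:14:00"), (4, "a", "2019-08-15 10:25:00"),
    (5, "b", "2019-08-15 10:27:00"), (6, "c", "2019-08-15 10:30:00"),
    (7, "c", "2019-08-15 10:35:00"), (8, "d", "2019-08-15 10:40:00"),
    (9, "d", "2019-08-15 10:49:00"), (10, "c", "2019-08-15 10:55:00"),
]
con = sqlite3.connect(":memory:")
con.execute("create table t (event_id int, userid text, created_at text)")
con.executemany("insert into t values (?, ?, ?)", rows)

# Count an event only if it is the user's first, or more than
# 10 minutes after that user's previous event.
sql = """
select userid, count(*) as unique_event_ids
from (select t.*,
             lag(created_at) over (partition by userid
                                   order by created_at) as prev_created_at
      from t
     ) t
where prev_created_at is null
   or prev_created_at < datetime(created_at, '-10 minutes')
group by userid
order by userid
"""
result = con.execute(sql).fetchall()
print(result)
```

This reproduces the expected per-user counts (a: 2, b: 2, c: 2, d: 1) and a total of 7.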

max(sum(field)) query in Hive/SQL

I have a table with lots of transactions for users across a month.
I need to take the hour from each day where Sum(cost) is at its highest.
I've tried MAX(SUM(Cost)) but get an error.
How would I go about doing this please?
here is some sample data
+-------------+------+----------+------+
| user id     | hour | date     | Cost |
+-------------+------+----------+------+
| 343252      | 13   | 20170101 | 21.5 |
| 32532532    | 13   | 20170101 | 22.5 |
| 35325325    | 13   | 20170101 | 30.5 |
| 325325325   | 13   | 20170101 | 10   |
| 64643643    | 12   | 20170101 | 22   |
| 643643643   | 12   | 20170101 | 31   |
| 436325234   | 13   | 20170101 | 15   |
| 213213213   | 13   | 20170101 | 12   |
| 53265436436 | 17   | 20170101 | 19   |
+-------------+------+----------+------+
Expected Output:
I need just one row per day, where it shows the total cost from the 'most expensive' hour. In this case, 13:00 had a total cost of 111.5
select hr, dt, total_cost
from (select dt, hr, sum(cost) as total_cost,
             row_number() over (partition by dt
                                order by sum(cost) desc) as rn
      from mytable
      group by dt, hr
     ) t
where rn = 1
+----+------------+------------+
| hr | dt         | total_cost |
+----+------------+------------+
| 13 | 2017-01-01 | 111.5      |
+----+------------+------------+
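As a runnable check, here is the same shape in Python's sqlite3 (my own toy copy of the sample data; the sum is pushed into an inner subquery so the window function only sees the already-grouped totals, which also keeps the query portable):

```python
import sqlite3

rows = [  # (user_id, hr, dt, cost) from the question's sample
    (343252, 13, "20170101", 21.5),    (32532532, 13, "20170101", 22.5),
    (35325325, 13, "20170101", 30.5),  (325325325, 13, "20170101", 10.0),
    (64643643, 12, "20170101", 22.0),  (643643643, 12, "20170101", 31.0),
    (436325234, 13, "20170101", 15.0), (213213213, 13, "20170101", 12.0),
    (53265436436, 17, "20170101", 19.0),
]
con = sqlite3.connect(":memory:")
con.execute("create table mytable (user_id int, hr int, dt text, cost real)")
con.executemany("insert into mytable values (?, ?, ?, ?)", rows)

# Total cost per (day, hour), then keep the top-ranked hour per day.
sql = """
select hr, dt, total_cost
from (select dt, hr, total_cost,
             row_number() over (partition by dt
                                order by total_cost desc) as rn
      from (select dt, hr, sum(cost) as total_cost
            from mytable
            group by dt, hr)
     )
where rn = 1
"""
result = con.execute(sql).fetchall()
print(result)
```

Hour 13 wins with a total of 111.5, matching the expected output.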
Try this (SQL Server):
select top (1) with ties date, hour, sum(cost) as TotalCost
from dbo.Table_3
group by date, hour
order by row_number() over (partition by date order by sum(cost) desc);