I have data like the below:
date id process name
2022-01-01 12:23:33 12 security John
2022-01-01 12:25:33 12 security John
2022-01-01 12:27:33 12 security John
2022-01-01 12:29:33 12 security John
2022-01-01 14:04:45 12 security John
2022-01-05 03:53:11 12 Infra Sasha
2022-01-05 03:57:30 12 Infra Sasha
2022-01-06 12:23:33 12 Infra Sasha
Rows within a 10-minute date difference that have the same values in the other fields are basically multi-clicks, and only the first one needs to be considered.
Expected result:
2022-01-01 12:23:33 12 security John
2022-01-01 14:04:45 12 security John
2022-01-05 03:53:11 12 Infra Sasha
2022-01-06 12:23:33 12 Infra Sasha
I know we can use datediff(), but I don't know how to group rows that share the same values in the rest of the fields. I don't know where to start with this logic.
Can someone help me get this please?
Thanks!
You can use LAG() to peek at a value from the previous row, according to a subgrouping and an ordering within each subgroup. For example:
select *
from (
    select *, lag(date) over(partition by id, process order by date) as pd
    from t
) x
where pd is null or date > pd + interval '10 minute'
Result:
date                 id  process   name   pd
-------------------- --- --------- ------ -------------------
2022-01-05 03:53:11  12  Infra     Sasha  null
2022-01-06 12:23:33  12  Infra     Sasha  2022-01-05 03:57:30
2022-01-01 12:23:33  12  security  John   null
2022-01-01 14:04:45  12  security  John   2022-01-01 12:29:33
See running example at db<>fiddle.
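The interval '10 minute' arithmetic above is PostgreSQL-style; on an engine without it (SQL Server, for example) the same comparison can be written with DATEADD. A minimal sketch, assuming SQL Server and the same table t:
select *
from (
    select *,
           lag(date) over(partition by id, process order by date) as pd
    from t
) x
-- keep the first row of each subgroup (pd is null) and any row that
-- arrives more than 10 minutes after the previous one
where pd is null or date > dateadd(minute, 10, pd)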
I am looking to filter very large tables to the latest entry per user per month. I'm not sure I've found the best way to do this. I know I "should" trust the SQL engine (Snowflake), but there is a part of me that does not like the join on three columns.
Note that this is a very common operation on many big tables, and I want to use it in DBT views, which means it will get run all the time.
To illustrate, my data is of this form:
mytable

userId  loginDate   year  month  value
------  ----------  ----  -----  ------
1       2021-01-04  2021  1      41.1
1       2021-01-06  2021  1      411.1
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-06  2021  2      32
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
And I'm trying to use SQL to get the last value (by loginDate) for each month.
I'm currently doing a groupby & a join as follows:
WITH latest_entry_by_month AS (
    SELECT "userId", "year", "month", max("loginDate") AS "loginDate"
    FROM mytable
    GROUP BY "userId", "year", "month"
)
SELECT * FROM mytable NATURAL JOIN latest_entry_by_month
The above results in my desired output:
userId  loginDate   year  month  value
------  ----------  ----  -----  ------
1       2021-01-25  2021  1      251.1
2       2021-01-05  2021  1      4369
2       2021-02-14  2021  2      731
3       2021-01-20  2021  1      258
3       2021-02-19  2021  2      4251
3       2021-03-15  2021  3      171
But I'm not sure if it's optimal.
Any guidance on how to do this faster? Note that I am not materializing the underlying data, so it is effectively un-clustered (I'm getting it from a vendor via the Snowflake marketplace).
Using QUALIFY and the ROW_NUMBER() window function:
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER(PARTITION BY userId, year, month
                          ORDER BY loginDate DESC) = 1
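QUALIFY is not standard SQL (Snowflake, Teradata, and a few other engines support it). If you ever need the same filter on an engine without it, the window function can be moved into a derived table; a sketch:
SELECT userId, loginDate, year, month, value
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER(PARTITION BY userId, year, month
                             ORDER BY loginDate DESC) AS rn
    FROM mytable t
) ranked
WHERE rn = 1  -- rn = 1 is the latest login per user per month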
I have the following table
timestamp ID eur
-----------------------
2022-01-01 A 10
2022-01-02 A 20
2022-01-01 B 30
2022-01-02 B 40
2022-01-03 B 50
2022-01-04 B 60
Now I am interested in all previous information for a specific ID. Then I want to do something with this information, let's say calculate the mean. Here is what I am aiming for:
timestamp ID eur sum_all mean_all
------------------------------------------------
2022-01-01 A 10 10 10
2022-01-02 A 20 30 15
2022-01-01 B 30 30 30
2022-01-02 B 40 70 35
2022-01-03 B 50 120 40
2022-01-04 B 60 180 45
This seems so easy but I just can't get my head around how to do this in SQL.
I appreciate any help. Thanks!
You can use the sum and avg window functions:
select *,
       sum(eur) over(partition by ID order by timestamp) as sum_all,
       avg(eur) over(partition by ID order by timestamp) as mean_all
from table_name
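One caveat: with an ORDER BY and no explicit frame, the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that share the exact same timestamp are treated as peers and aggregated together. If duplicate timestamps are possible and you want a strictly row-by-row running total, spell out a ROWS frame; a sketch:
select *,
       sum(eur) over(partition by ID order by timestamp
                     rows between unbounded preceding and current row) as sum_all,
       avg(eur) over(partition by ID order by timestamp
                     rows between unbounded preceding and current row) as mean_all
from table_name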
I have 2 dimension tables and 1 fact table as follows:
user_dim

user_id  user_name  user_joining_date
-------  ---------  -----------------
1        Steve      2013-01-04
2        Adam       2012-11-01
3        John       2013-05-05
4        Tony       2012-01-01
5        Dan        2010-01-01
6        Alex       2019-01-01
7        Kim        2019-01-01
bundle_dim

bundle_id  bundle_name            bundle_type  bundle_cost_per_day
---------  ---------------------  -----------  -------------------
101        movies and TV          prime        5.5
102        TV and sports          prime        6.5
103        Cooking                prime        7
104        Sports and news        prime        5
105        kids movie             extra        2
106        kids educative         extra        3.5
107        spanish news           extra        2.5
108        Spanish TV and sports  extra        3.5
109        Travel                 extra        2
plans_fact

user_id  bundle_id  bundle_start_date  bundle_end_date
-------  ---------  -----------------  ---------------
1        101        2019-10-10         2020-10-10
2        107        2020-01-15         (null)
2        106        2020-01-15         2020-12-31
2        101        2020-01-15         (null)
2        103        2020-01-15         2020-02-15
1        101        2020-10-11         (null)
1        107        2019-10-10         2020-10-10
1        105        2019-10-10         2020-10-10
4        101        2021-01-01         2021-02-01
3        104        2020-02-17         2020-03-17
2        108        2020-01-15         (null)
4        102        2021-01-01         (null)
4        103        2021-01-01         (null)
4        108        2021-01-01         (null)
5        103        2020-01-15         (null)
5        101        2020-01-15         2020-02-15
6        101        2021-01-01         2021-01-17
6        101        2021-01-20         (null)
6        108        2021-01-01         (null)
7        104        2020-02-17         (null)
7        103        2020-01-17         2020-01-18
1        102        2020-12-11         (null)
2        106        2021-01-01         (null)
7        107        2020-01-15         (null)
Note: a NULL bundle_end_date indicates an active subscription.
User active days can be calculated as bundle_end_date - bundle_start_date (for the given bundle).
Total revenue per user can be calculated as total number of active days * bundle rate per day.
I am looking to write a query to find revenue generated per user per year.
Here is what I have for the overall revenue per user:
select pf.user_id
     , sum(datediff(day, pf.bundle_start_date, coalesce(pf.bundle_end_date, getdate())) * bd.bundle_cost_per_day) as total_cost_per_bundle
from plans_fact pf
inner join bundle_dim bd on bd.bundle_id = pf.bundle_id
group by pf.user_id
order by pf.user_id;
You need a 'year' table to help parse out each multi-year-spanning row into its separate years. For each year, you also need to recalculate the start and end dates. That's what I do in the yearParsed CTE in the code below. I hard-code the years into the join statement that creates y. You will probably do it differently, but however you get those values will work.
After that, pretty much sum as you did before, just adding the year column to your grouping.
Aside from that, all I did was move the null-coalescing logic into the CTE to make the overall logic simpler.
with yearParsed as (
    select pf.*,
           y.year,
           startDt = iif(pf.bundle_start_date > y.startDt, pf.bundle_start_date, y.startDt),
           endDt = iif(ap.bundle_end_date < y.endDt, ap.bundle_end_date, y.endDt)
    from plans_fact pf
    cross apply (select bundle_end_date = isnull(pf.bundle_end_date, getdate())) ap
    join (values
        (2019, '2019-01-01', '2019-12-31'),
        (2020, '2020-01-01', '2020-12-31'),
        (2021, '2021-01-01', '2021-12-31')
    ) y (year, startDt, endDt)
        on pf.bundle_start_date <= y.endDt
        and ap.bundle_end_date >= y.startDt
)
select yp.user_id,
       yp.year,
       total_cost_per_bundle = sum(datediff(day, yp.startDt, yp.endDt) * bd.bundle_cost_per_day)
from yearParsed yp
join bundle_dim bd on bd.bundle_id = yp.bundle_id
group by yp.user_id,
         yp.year
order by yp.user_id,
         yp.year;
Now, if this is a common need, you should probably create a base table for your 'year' data. But if it's not common, and you just don't want to keep coming back to hard-code the year information into the y table for this report, you can do this:
declare @yearTable table (
    year int,
    startDt char(10),
    endDt char(10)
);

with y as (
    select year = year(min(pf.bundle_start_date))
    from plans_fact pf
    union all
    select year + 1
    from y
    where year < year(getdate())
)
insert @yearTable
select year,
       startDt = convert(char(4), year) + '-01-01',
       endDt = convert(char(4), year) + '-12-31'
from y;
and it will create the appropriate years for you. But you can see why creating a base table may be preferred if you have this or a similar need often.
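Once @yearTable is populated (in the same batch, since table variables are batch-scoped), the hard-coded values list in the yearParsed CTE can be replaced with a join to it; a sketch of the reworked query:
with yearParsed as (
    select pf.*,
           y.year,
           startDt = iif(pf.bundle_start_date > y.startDt, pf.bundle_start_date, y.startDt),
           endDt = iif(ap.bundle_end_date < y.endDt, ap.bundle_end_date, y.endDt)
    from plans_fact pf
    cross apply (select bundle_end_date = isnull(pf.bundle_end_date, getdate())) ap
    join @yearTable y  -- generated years instead of hard-coded values
        on pf.bundle_start_date <= y.endDt
        and ap.bundle_end_date >= y.startDt
)
select yp.user_id,
       yp.year,
       total_cost_per_bundle = sum(datediff(day, yp.startDt, yp.endDt) * bd.bundle_cost_per_day)
from yearParsed yp
join bundle_dim bd on bd.bundle_id = yp.bundle_id
group by yp.user_id, yp.year
order by yp.user_id, yp.year;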
Is there a way to find a solution such that for the 2-day case there are 2 UDs (because June 24 appears 2 times), and for the rest there are single days?
I am showing the expected output here:
Primary key UD Date
-------------------------------------------
1 123 2015-06-24 00:00:00.000
6 456 2015-06-24 00:00:00.000
2 123 2015-06-25 00:00:00.000
3 658 2015-06-26 00:00:00.000
4 598 2015-06-27 00:00:00.000
5 156 2015-06-28 00:00:00.000
No of times Number of days
-----------------------------
4 1
2 2
The logic is that there are 4 users who used the application on 1 day, and 2 users who used the application on 2 days.
You can use two levels of aggregation: the inner query counts the rows per date, and the outer query counts how many dates share each of those counts:
select cnt, count(*)
from (
    select date, count(*) as cnt
    from t
    group by date
) d
group by cnt
order by cnt desc;
I have the following table data for processing.
SYMBOL DATE OPENVALUE CLOSEVALUE
-------------------------------------------------
ABC 2019-01-01 10 15
ABC 2019-01-02 17 19
ABC 2019-01-03 13 20
ABC 2019-01-04 18 30
ABC 2019-01-07 25 45
ABC 2019-01-08 40 50
I want to process and display information as follow
SYMBOL DATE OPENVALUE PREVDAYCLOSINGVALUE
--------------------------------------------------------------
ABC 2019-01-01 10 NA
ABC 2019-01-02 17 15
ABC 2019-01-03 13 19
ABC 2019-01-04 18 20
ABC 2019-01-07 25 30
ABC 2019-01-08 40 45
I am facing a problem with inner joining the current date with the previous available date's data. Any help is appreciated.
You are looking for lag():
select t.*,
       lag(closevalue) over (partition by symbol order by date) as prev_closevalue
from t;
Use LAG().
The 3-argument form lets you specify a default value. I would not recommend 'NA', since it does not have the same datatype as the other values (which look like positive integers), so I used -1.
SELECT
    t.*,
    LAG(CLOSEVALUE, 1, -1) OVER(PARTITION BY [SYMBOL] ORDER BY [DATE]) AS PREVDAYCLOSINGVALUE
FROM mytable t
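If the literal 'NA' from the expected output is genuinely required, the whole column must be character-typed, so cast the lagged value and supply 'NA' as the fallback; a minimal sketch, assuming SQL Server:
SELECT
    t.*,
    COALESCE(
        CONVERT(varchar(20), LAG(CLOSEVALUE) OVER(PARTITION BY [SYMBOL] ORDER BY [DATE])),
        'NA'
    ) AS PREVDAYCLOSINGVALUE
FROM mytable t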