Join sum to closest timestamp once up to interval cap - google-bigquery

I am trying to join a site_interactions table with a store_transactions table. For each username, I want store_transactions.sales_amount to be attached to the closest preceding site_interactions.timestamp, at most once, and only if the transaction falls within 7 days after that site_interactions.timestamp.
site_interaction table:
username timestamp
John 01.01.2020 15:00:00
John 02.01.2020 11:30:00
Sarah 03.01.2020 12:00:00
store_transactions table:
username timestamp sales_amount
John 02.01.2020 16:00:00 45
John 03.01.2020 16:00:00 70
John 09.01.2020 16:00:00 15
Sarah 02.01.2020 09:00:00 35
Tim 02.01.2020 10:00:00 60
Desired output:
username timestamp sales_amount
John 01.01.2020 15:00:00 NULL
John 02.01.2020 11:30:00 115
Sarah 03.01.2020 12:00:00 NULL
Explanation:
John has 3 transactions in the store_transactions table. The first and second purchases were made within the 7-day limit, so the sum of these two transactions (45 + 70 = 115) was attached, only once, to the closest match, i.e. John's second interaction (timestamp = 02.01.2020 11:30:00). John's third transaction was not attached to any site interaction because it falls outside the 7-day interval (taking the time of day into account).
Sarah has one transaction, made before her interaction with the site. Thus her sales_amount of 35 was not attached to the site_interaction table.
Last, Tim's transaction was not attached anywhere because this username does not appear in the site_interaction table.
Here is a link to the tables: https://rextester.com/RKSUK73038
Thanks in advance!

Below is for BigQuery Standard SQL
#standardSQL
select i.username, i.timestamp,
  sum(t.sales_amount) as sales_amount
from (
  -- next_timestamp: the user's next interaction, or timestamp + 7 days for the last one
  select username, timestamp,
    ifnull(lead(timestamp) over(partition by username order by timestamp), timestamp_add(timestamp, interval 7 day)) next_timestamp
  from `project.dataset.site_interaction`
) i
left join `project.dataset.store_transactions` t
  on i.username = t.username
  and t.timestamp >= i.timestamp
  -- stop at the next interaction or at the 7-day cap, whichever comes first
  and t.timestamp < least(i.next_timestamp, timestamp_add(i.timestamp, interval 7 day))
group by i.username, i.timestamp
Applied to the sample data from your question, the query returns the desired output shown above.
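The windowing logic of the query can be checked outside BigQuery. Below is a minimal Python sketch, not the answer's own code: the function name `attach_sales` and the in-memory lists are hypothetical, and the data is the sample from the question. Each interaction collects the transactions of the same user from its own timestamp up to the next interaction or the 7-day cap, whichever comes first.

```python
from datetime import datetime, timedelta


def parse(ts):
    # Timestamps in the question use the DD.MM.YYYY HH:MM:SS layout.
    return datetime.strptime(ts, "%d.%m.%Y %H:%M:%S")


interactions = [
    ("John", parse("01.01.2020 15:00:00")),
    ("John", parse("02.01.2020 11:30:00")),
    ("Sarah", parse("03.01.2020 12:00:00")),
]
transactions = [
    ("John", parse("02.01.2020 16:00:00"), 45),
    ("John", parse("03.01.2020 16:00:00"), 70),
    ("John", parse("09.01.2020 16:00:00"), 15),
    ("Sarah", parse("02.01.2020 09:00:00"), 35),
    ("Tim", parse("02.01.2020 10:00:00"), 60),
]


def attach_sales(interactions, transactions, cap=timedelta(days=7)):
    """Hypothetical helper mirroring the lead()/least() window of the query."""
    rows = sorted(interactions)  # by username, then timestamp
    result = []
    for idx, (user, ts) in enumerate(rows):
        window_end = ts + cap
        # Like lead(): the next interaction of the same user closes the window early.
        if idx + 1 < len(rows) and rows[idx + 1][0] == user:
            window_end = min(window_end, rows[idx + 1][1])
        matched = [amt for u, t, amt in transactions
                   if u == user and ts <= t < window_end]
        # Like sum() over a left join: no matches gives NULL (None), not 0.
        result.append((user, ts, sum(matched) if matched else None))
    return result


result = attach_sales(interactions, transactions)
for user, ts, amount in result:
    print(user, ts, amount)
```

Because every transaction falls into at most one window per user, each sales_amount is counted at most once.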

Related

How to calculate time between rows with condition?

I have a table in SQL Server about how people going in and out of building.
user_id datetime direction
1 27.09.2022 10:30 in
1 27.09.2022 12:30 out
1 27.09.2022 14:30 in
1 27.09.2022 15:35 out
2 27.09.2022 11:30 in
2 27.09.2022 13:20 out
2 27.09.2022 15:00 in
2 27.09.2022 15:40 out
3 27.09.2022 11:45 in
3 27.09.2022 11:46 in
3 27.09.2022 15:40 out
3 27.09.2022 15:47 in
3 27.09.2022 18:00 out
I need to calculate how much time each user spent inside the building by days.
For example, on 27th Sep user #1 spent 3 hours 5 minutes. User #2 spent 2 hours 30 minutes.
There is also a bug that may spoil the results: sometimes there are two 'in' or two 'out' rows in a row, as in the case of user #3. I understand the nature of this bug and know that I only have to keep the last of two identical rows (in fact, user #3 entered at 11:46, not 11:45). Does anyone have an idea how to solve that?
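select user_id
      ,sum(time_spent) as time_spent_minutes
from (
     select *
           -- only 'out' rows that directly follow an 'in' row get a value,
           -- so a repeated 'in' (as for user 3) is not counted as time spent
           ,case when direction = 'out'
                 then datediff(minute, lag(case when direction = 'in' then datetime end) over(partition by user_id order by datetime), datetime)
            end as time_spent
     from t
     ) t
group by user_id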
user_id time_spent_minutes
1 185
2 150
The window functions would be a nice fit here.
Select user_id
,Duration = convert(time(0),dateadd(second,sum(Secs),0))
From (
Select user_id
,Secs = datediff(second,case when direction ='in'
and lead([direction],1) over (partition by user_id order by datetime)='out'
then [datetime]
end
,lead([datetime],1) over (partition by user_id order by datetime))
From YourTable
) A
Group By user_id
Results
user_id Duration
1 03:05:00 -- << Check your desired results
2 02:30:00
3 06:07:00
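The pairing rule used here (count an interval only when an 'in' row is directly followed by an 'out' row of the same user) can be sketched in Python to check the numbers. This is an illustration on the question's sample data, not part of either answer; `minutes_inside` is a hypothetical name, and like the queries above it totals per user without splitting by day.

```python
from collections import defaultdict
from datetime import datetime

events = [
    (1, "27.09.2022 10:30", "in"), (1, "27.09.2022 12:30", "out"),
    (1, "27.09.2022 14:30", "in"), (1, "27.09.2022 15:35", "out"),
    (2, "27.09.2022 11:30", "in"), (2, "27.09.2022 13:20", "out"),
    (2, "27.09.2022 15:00", "in"), (2, "27.09.2022 15:40", "out"),
    (3, "27.09.2022 11:45", "in"), (3, "27.09.2022 11:46", "in"),
    (3, "27.09.2022 15:40", "out"), (3, "27.09.2022 15:47", "in"),
    (3, "27.09.2022 18:00", "out"),
]


def minutes_inside(events):
    """Sum the minutes of every 'in' row that is directly followed by an 'out' row."""
    by_user = defaultdict(list)
    for user_id, ts, direction in events:
        by_user[user_id].append((datetime.strptime(ts, "%d.%m.%Y %H:%M"), direction))

    totals = {}
    for user_id, rows in by_user.items():
        rows.sort()  # chronological order per user
        total = 0
        # Compare each row with its successor, like lead() over (partition by user_id).
        for (t1, d1), (t2, d2) in zip(rows, rows[1:]):
            if d1 == "in" and d2 == "out":
                total += int((t2 - t1).total_seconds() // 60)
        totals[user_id] = total
    return totals


totals = minutes_inside(events)
print(totals)  # {1: 185, 2: 150, 3: 367}
```

For user 3 the earlier of the two consecutive 'in' rows (11:45) is skipped, so the total is 367 minutes, which matches the 06:07:00 above.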

How do I calculate the amount of time between multiple datetimes in multiple rows in sql

I've done a search but I can't find any that are exactly what I need. I need to be able to calculate the amount of time that someone has been in the building over time in a sql query (T-SQL on SQL Server). The data looks like this:
UserId Clocking Status
------------------------------
1 01/12/2020 09:00 In
2 01/12/2020 09:12 In
1 01/12/2020 09:25 Out
3 01/12/2020 10:00 In
2 01/12/2020 10:45 Out
3 01/12/2020 13:11 Out
1 03/12/2020 11:14 In
2 03/12/2020 15:56 In
1 03/12/2020 16:04 Out
2 03/12/2020 17:00 Out
I want the output to look like this:
UserId TimeInBuilding
----------------------
1 03:35
2 05:25
3 03:11
Assuming that the ins/outs are perfectly interleaved, you can do this by assigning the next "out" time to the "in" time and aggregating:
select userid,
sum(datediff(second, clocking, out_time)) / (60.0 * 60) as decimal_hours
from (select t.*,
lead(clocking) over (partition by userid order by clocking) as out_time
from t
) t
where status = 'In'
group by userid;
You can convert this to HH:MM format using:
select userid,
       convert(varchar(5),
               convert(time,
                       dateadd(second,
                               sum(datediff(second, clocking, out_time)),
                               0)
                      )
              ) as hhmm
from (select t.*,
             lead(clocking) over (partition by userid order by clocking) as out_time
      from t
     ) t
where status = 'In'
group by userid;
Here is a db<>fiddle.
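One caveat with the HH:MM conversion: the SQL Server time type holds a time of day, so a total of 24 hours or more wraps around when converted. A small Python sketch of a formatting step that does not wrap is shown below; `format_hhmm` is a hypothetical helper, assumed to receive a non-negative total in seconds such as the value computed by sum(datediff(second, ...)).

```python
def format_hhmm(total_seconds):
    """Format a non-negative duration in seconds as HH:MM without wrapping at 24 hours."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes = remainder // 60  # leftover seconds are truncated, not rounded
    return f"{hours:02d}:{minutes:02d}"


print(format_hhmm(18900))  # 05:15
print(format_hhmm(90000))  # 25:00
```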

How can I extract the values of the last aggregation date in sql

I have the following table.
id user time_stamp
1 Mike 2020-02-13 00:00:00 UTC
2 John 2020-02-13 00:00:00 UTC
3 Levy 2020-02-12 00:00:00 UTC
4 Sam 2020-02-12 00:00:00 UTC
5 Frodo 2020-02-11 00:00:00 UTC
Let's say 2020-02-13 00:00:00 UTC is the last day, and I would like to query this table to display only the last day's results. I want to create a view in BigQuery so that I always get only the last day's results.
So that in the end I get something like this (For last day which is 2020-02-13 00:00:00 UTC )
id user time_stamp
1 Mike 2020-02-13 00:00:00 UTC
2 John 2020-02-13 00:00:00 UTC
You can use window functions:
select t.* except (seqnum)
from (select t.*,
             -- descending order, so that rank 1 is the latest time_stamp
             dense_rank() over (order by time_stamp desc) as seqnum
      from t
     ) t
where seqnum = 1;
This may not work well on a large amount of data -- because of the way that BQ implements window functions with no partitioning. So, you might find that this works better (especially if the above runs out of resources):
select t.*
from t join
(select max(time_stamp) as max_time_stamp
from t
) tt
on t.time_stamp = max_time_stamp;
Also, if the timestamps actually have date components, then you will want to convert to a date or remove the time component somehow.
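To illustrate the last point, here is a minimal Python sketch of the same filter with the time component removed before comparing. It is an assumption-laden illustration on the question's sample rows, not BigQuery code; `last_day_rows` is a hypothetical name.

```python
from datetime import datetime

rows = [
    (1, "Mike", "2020-02-13 00:00:00 UTC"),
    (2, "John", "2020-02-13 00:00:00 UTC"),
    (3, "Levy", "2020-02-12 00:00:00 UTC"),
    (4, "Sam", "2020-02-12 00:00:00 UTC"),
    (5, "Frodo", "2020-02-11 00:00:00 UTC"),
]


def last_day_rows(rows):
    """Keep only the rows whose calendar date equals the latest date in the data."""
    if not rows:
        return []
    # Truncate each timestamp to its date so that rows at different
    # times on the last day are all kept.
    dated = [(row, datetime.strptime(row[2], "%Y-%m-%d %H:%M:%S UTC").date())
             for row in rows]
    last_day = max(day for _, day in dated)
    return [row for row, day in dated if day == last_day]


latest = last_day_rows(rows)
print(latest)
```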

How to have the rolling distinct count of each day for past three days in Oracle SQL?

I searched for this a lot, but I couldn't find the solution yet. let me explain my question by sample data and my desired output.
sample data:
datetime customer
---------- --------
2018-10-21 09:00 Ryan
2018-10-21 10:00 Sarah
2018-10-21 20:00 Sarah
2018-10-22 09:00 Peter
2018-10-22 10:00 Andy
2018-10-23 09:00 Sarah
2018-10-23 10:00 Peter
2018-10-24 10:00 Andy
2018-10-24 20:00 Andy
my desired output is to have the distinctive number of customers for past three days relative to each day:
trunc(datetime) progressive count distinct customer
--------------- -----------------------------------
2018-10-21 2
2018-10-22 4
2018-10-23 4
2018-10-24 3
Explanation: for the 21st, because we have only Ryan and Sarah, the count is 2 (also because we have no records before the 21st). For the 22nd, Andy and Peter are added to the distinct list, so it's 4. For the 23rd, no new customer is added, so it stays at 4. For the 24th, however, we should only consider the past 3 days (as per business logic), i.e. the 24th, 23rd and 22nd, so the distinct customers are Sarah, Andy and Peter and the count is 3.
I believe it is called a progressive count, moving count or rolling count, but I couldn't implement it in Oracle 11g SQL. Obviously it's easy using PL/SQL programming (stored procedure/function), but I would prefer to do it with a single SQL query if possible.
What you seem to want is:
select date,
count(distinct customer) over (order by date rows between 2 preceding and current row)
from (select distinct trunc(datetime) as date, customer
from t
) t
group by date;
However, Oracle does not support window frames with count(distinct).
One rather brute force approach is a correlated subquery:
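select dte,
       (select count(distinct t2.customer)
        from t t2
        where t2.datetime >= t.dte - 2   -- two days back ...
          and t2.datetime < t.dte + 1    -- ... up to the end of the current day
       ) as running_3
from (select distinct trunc(datetime) as dte   -- "date" is a reserved word in Oracle
      from t
     ) t;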
This should have reasonable performance for a small number of dates. As the number of dates increases, the performance will degrade linearly.
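The expected counts can be checked with a short Python sketch of the same idea: for each day, union the customers seen on that day and the two days before it. This is only an illustration on the question's sample data; `rolling_distinct` is a hypothetical name.

```python
from datetime import datetime, timedelta

rows = [
    ("2018-10-21 09:00", "Ryan"), ("2018-10-21 10:00", "Sarah"),
    ("2018-10-21 20:00", "Sarah"), ("2018-10-22 09:00", "Peter"),
    ("2018-10-22 10:00", "Andy"), ("2018-10-23 09:00", "Sarah"),
    ("2018-10-23 10:00", "Peter"), ("2018-10-24 10:00", "Andy"),
    ("2018-10-24 20:00", "Andy"),
]


def rolling_distinct(rows, days=3):
    """Distinct customers per day over a window of `days` calendar days ending on that day."""
    by_day = {}
    for ts, customer in rows:
        day = datetime.strptime(ts, "%Y-%m-%d %H:%M").date()  # like trunc(datetime)
        by_day.setdefault(day, set()).add(customer)

    counts = {}
    for day in sorted(by_day):
        seen = set()
        # Current day plus the (days - 1) preceding calendar days.
        for offset in range(days):
            seen |= by_day.get(day - timedelta(days=offset), set())
        counts[day.isoformat()] = len(seen)
    return counts


counts = rolling_distinct(rows)
print(counts)
```

On the sample data this gives 2, 4, 4 and 3 for the 21st through the 24th, matching the desired output in the question.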

I need to calculate the time between dates in different lines. (PLSQL)

I have a table where I store all status changes and the time each change was made. So, when I search for the order number in the table of times, I get all the dates of my changes, but what I really want is the time (hours/minutes) that the order spent in each status.
The table of times looks like this:
ID_ORDER | Status | Date
1 Waiting 27/09/2017 12:00:00
1 Late 27/09/2017 14:00:00
1 In progress 28/09/2017 08:00:00
1 Validating 30/09/2017 14:00:00
1 Completed 30/09/2017 14:00:00
Thanks!
Use lead():
select t.*,
(lead(date) over (partition by id_order order by date) - date) as time_in_order
from t;
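In Oracle, subtracting one DATE from another gives a number of days, so the result above would need to be multiplied by 24 to get hours. The same next-row subtraction can be sketched in Python on the question's sample rows; `hours_in_status` is a hypothetical name, and the last status has no following row, so its duration is left empty (None), just as lead() returns NULL.

```python
from datetime import datetime

history = [
    ("Waiting", "27/09/2017 12:00:00"),
    ("Late", "27/09/2017 14:00:00"),
    ("In progress", "28/09/2017 08:00:00"),
    ("Validating", "30/09/2017 14:00:00"),
    ("Completed", "30/09/2017 14:00:00"),
]


def hours_in_status(history):
    """Pair each status with the hours until the next status change of one order."""
    parsed = [(status, datetime.strptime(ts, "%d/%m/%Y %H:%M:%S"))
              for status, ts in history]
    parsed.sort(key=lambda row: row[1])  # stable: ties keep their input order

    durations = []
    for i, (status, start) in enumerate(parsed):
        if i + 1 < len(parsed):
            # Like lead(date) - date, converted to hours.
            hours = (parsed[i + 1][1] - start).total_seconds() / 3600
        else:
            hours = None  # no later row for the final status
        durations.append((status, hours))
    return durations


durations = hours_in_status(history)
for status, hours in durations:
    print(status, hours)
```

Note that Validating and Completed share the same timestamp in the sample, so Validating gets 0 hours and the order between the two rows depends on how ties are broken.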