Aggregate time series by change events on BigQuery - google-bigquery

On BigQuery I have a time series of data that represents snapshots of a DEX pool on Ethereum. Each row has a timestamp, a pool address, and a balance. I need a query that returns only the rows where the balance changes.
So for example having the following rows:
ts | pool | balance
------------------------------
1 | 0x123 | 100
2 | 0x123 | 100
3 | 0x123 | 80
4 | 0x123 | 80
5 | 0x123 | 100
The query would return:
ts | pool | balance
------------------------------
1 | 0x123 | 100
3 | 0x123 | 80
5 | 0x123 | 100
Can I get some help?

Consider the option below:
select * from pools where true
qualify ifnull(balance != lag(balance) over win, true)
window win as (partition by pool order by ts)
If applied to the sample data in your question, the output is the three rows with ts 1, 3 and 5, which matches the expected result.

As I was writing and simplifying my problem I ended up solving it myself :)
So here's the query I wrote; I hope it helps you solve similar problems:
WITH pools AS (
SELECT 1 as ts, "a" as pool, 100 as balance UNION ALL
SELECT 2 as ts, "a" as pool, 100 as balance UNION ALL
SELECT 3 as ts, "a" as pool, 80 as balance UNION ALL
SELECT 4 as ts, "a" as pool, 80 as balance UNION ALL
SELECT 5 as ts, "a" as pool, 100 as balance
),
data AS (
SELECT pool, ts, balance, (LAG(ts) OVER (PARTITION BY pool ORDER BY ts ASC)) AS prev_ts,
(LAG(balance) OVER (PARTITION BY pool ORDER BY ts ASC)) AS prev_balance
FROM pools
ORDER BY ts
)
SELECT ts, pool, balance, prev_balance
FROM data
WHERE balance != prev_balance or prev_balance is NULL
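To check the logic outside BigQuery, the same change-detection rule can be sketched in plain Python. This is only an illustration: it assumes the rows are available as (ts, pool, balance) tuples, and the function name rows_on_change is made up for this example.

```python
def rows_on_change(rows):
    """Keep a row only if its balance differs from the previous row of the same pool."""
    last_balance = {}  # pool -> balance of the most recent row seen for that pool
    kept = []
    # Sorting by (pool, ts) plays the role of PARTITION BY pool ORDER BY ts.
    for ts, pool, balance in sorted(rows, key=lambda r: (r[1], r[0])):
        # The first row of a pool has no previous balance, so it is always kept.
        if pool not in last_balance or last_balance[pool] != balance:
            kept.append((ts, pool, balance))
        last_balance[pool] = balance
    return kept


# Sample data from the question.
sample = [
    (1, "0x123", 100),
    (2, "0x123", 100),
    (3, "0x123", 80),
    (4, "0x123", 80),
    (5, "0x123", 100),
]
print(rows_on_change(sample))
```

On the sample rows this keeps ts 1, 3 and 5, the same rows the queries above return.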

Related

Special SQL windows function that works like a loop

I’m looking for some kind of SQL window function that calculates values based on a value calculated in a previous iteration when looping over the window. I’m not looking for ‘lag’, which just takes the original value of the previous row.
Here is the case: We have web analytics sessions. We would like to attribute each session to the last relevant channel. There are 3 channels: direct, organic and paid. However, they have different priorities: paid is always relevant. Organic is only relevant if there was no paid channel in the last 30 days, and direct is only relevant if there was no paid or organic channel in the last 30 days.
So in the example table we would like to calculate the values in column ‘attributed’ based on the channel and date columns. Note that the data covers several users, so this should be calculated per user.
+─────────────+───────+──────────+─────────────+
| date | user | channel | attributed |
+─────────────+───────+──────────+─────────────+
| 2022-01-01 | 123 | direct | direct |
| 2022-01-14 | 123 | paid | paid |
| 2022-02-01 | 123 | direct | paid |
| 2022-02-12 | 123 | direct | paid |
| 2022-02-13 | 123 | organic | paid |
| 2022-03-08 | 123 | direct | direct |
| 2022-03-10 | 123 | paid | paid |
+─────────────+───────+──────────+─────────────+
So in the table above row 1 is attributed direct because it’s the first line. The second then is paid as this has priority to direct. It stays paid for the next 2 sessions as direct has lower priority, then it switches to organic as the paid attribution is older than 30 days. The last one is then paid again as it has a higher priority than organic.
I would know how to solve it if you could decide whether a new channel needs to be attributed only based on the current row and the previous. I added here the SQL to do it:
with source as ( -- example data
select cast("2022-01-01" as date) as date, 123 as user, "direct" as channel
union all
select "2022-01-14", 123, "paid"
union all
select "2022-02-01", 123, "direct"
union all
select "2022-02-12", 123, "direct"
union all
select "2022-02-13", 123, "organic"
union all
select "2022-03-08", 123, "direct"
union all
select "2022-03-10", 123, "paid"
),
flag_new_channel as( -- flag sessions that would override channel information; this only works statically here
select *,
case
when lag(channel) over (partition by user order by date) is null then 1
when date_diff(date,lag(date) over (partition by user order by date),day)>30 then 1
when channel = "paid" then 1
when channel = "organic" and lag(channel) over (partition by user order by date)!='paid' then 1
else 0
end flag
from source
qualify flag=1
)
select s.*,
f.channel attributed_channel,
row_number() over (partition by s.user, s.date order by f.date desc) rn -- number of flagged previous sessions
from source s
left join flag_new_channel f
on s.user=f.user and s.date>=f.date -- only match flagged sessions of the same user
qualify rn=1 --only keep the last flagged session at or before the current session
However, this would for example set "organic" in row 5 because it doesn't know "paid" is still relevant.
+─────────────+───────+──────────+─────────────────────+
| date | user | channel | attributed_channel |
+─────────────+───────+──────────+─────────────────────+
| 2022-01-01 | 123 | direct | direct |
| 2022-01-14 | 123 | paid | paid |
| 2022-02-01 | 123 | direct | paid |
| 2022-02-12 | 123 | direct | paid |
| 2022-02-13 | 123 | organic | organic |
| 2022-03-08 | 123 | direct | organic |
| 2022-03-10 | 123 | paid | paid |
+─────────────+───────+──────────+─────────────────────+
Any ideas? Not sure recursive queries can help or udfs. I’m using BigQuery usually but if you know solutions in other dialects it would still be interesting to know.
Here's one approach:
Updated: Corrected. I lost track of your correct / expected result, due to the confusing story.
For PostgreSQL, we can do something like this (with CTE and window functions):
The fiddle for PG 14
pri - provides a table of (channel, priority) pairs
cte0 - provides the test data
cte1 - determines the minimum priority over the last 30 days per user
final - the final query expression obtains the attributed channel name
WITH pri (channel, pri) AS (
VALUES ('paid' , 1)
, ('organic' , 2)
, ('direct' , 3)
)
, cte0 (date, xuser, channel) AS (
VALUES
('2022-01-01'::date, 123, 'direct')
, ('2022-01-14' , 123, 'paid')
, ('2022-02-01' , 123, 'direct')
, ('2022-02-12' , 123, 'direct')
, ('2022-02-13' , 123, 'organic')
, ('2022-03-08' , 123, 'direct')
, ('2022-03-10' , 123, 'paid')
)
, cte1 AS (
SELECT cte0.*
, pri.pri
, MIN(pri) OVER (PARTITION BY xuser ORDER BY date
RANGE BETWEEN INTERVAL '30' DAY PRECEDING AND CURRENT ROW
) AS mpri
FROM cte0
JOIN pri
ON pri.channel = cte0.channel
)
SELECT cte1.*
, pri.channel AS attributed
FROM cte1
JOIN pri
ON pri.pri = cte1.mpri
;
The result:
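date | xuser | channel | pri | mpri | attributed
----------------------------------------------------
2022-01-01 | 123 | direct | 3 | 3 | direct
2022-01-14 | 123 | paid | 1 | 1 | paid
2022-02-01 | 123 | direct | 3 | 1 | paid
2022-02-12 | 123 | direct | 3 | 1 | paid
2022-02-13 | 123 | organic | 2 | 1 | paid
2022-03-08 | 123 | direct | 3 | 2 | organic
2022-03-10 | 123 | paid | 1 | 1 | paid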
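The rule behind cte1 (take the best priority seen for the user in the trailing 30 days, current session included) can also be sketched in plain Python. This is a rough illustration of that rule only, not a replacement for the query; the names PRIORITY, CHANNEL and attribute are invented for this example.

```python
from datetime import date, timedelta

PRIORITY = {"paid": 1, "organic": 2, "direct": 3}  # lower number = higher priority
CHANNEL = {p: c for c, p in PRIORITY.items()}      # reverse lookup: priority -> channel


def attribute(sessions, window_days=30):
    """sessions: list of (date, user, channel) tuples.

    Returns (date, user, channel, attributed), where attributed is the
    highest-priority channel the user had within the trailing window,
    including the current session.
    """
    out = []
    for d, user, channel in sessions:
        start = d - timedelta(days=window_days)
        best = min(
            PRIORITY[c]
            for d2, u2, c in sessions
            if u2 == user and start <= d2 <= d
        )
        out.append((d, user, channel, CHANNEL[best]))
    return out


# Sample data from the question.
sessions = [
    (date(2022, 1, 1), 123, "direct"),
    (date(2022, 1, 14), 123, "paid"),
    (date(2022, 2, 1), 123, "direct"),
    (date(2022, 2, 12), 123, "direct"),
    (date(2022, 2, 13), 123, "organic"),
    (date(2022, 3, 8), 123, "direct"),
    (date(2022, 3, 10), 123, "paid"),
]
for row in attribute(sessions):
    print(row)
```

The inner scan is quadratic in the number of sessions, which is fine for a sanity check; the window frame in the SQL version is the better fit for real data volumes.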

Is contract active on a given date

In my table contracts I have all contracts and orders (orders belong to particular contracts, identified by the parent id pid). Contracts and orders are distinguished by id_type:
1 = contract (active at the beginning)
2 = deactivation order (contract becomes inactive)
3 = reactivation order (contract becomes active again)
Contracts can be deactivated or reactivated many times. Also, contracts can be deactivated and never reactivated again.
table of records:
id | pid | id_type | start_date
===+=====+=========+===========
20 | | 1 | 2021-01-01 --> contract 20 started and active
38 | 20 | 2 | 2021-02-15 --> contract 20 temporarily deactivated
42 | 20 | 3 | 2021-02-25 --> contract 20 activated again
54 | 20 | 2 | 2021-04-01 --> contract 20 temporarily deactivated
95 | 20 | 3 | 2021-04-15 --> contract 20 activated again
30 | | 1 | 2021-01-12 --> contract 30 started and active
I need an SQL query that returns whether a contract is active or deactivated on a given date.
For example for date 2021-02-20 I should get that contract 20 is inactive.
I made some tries with LAG/LEAD functions but without success.
You can get the most recent row on or before a particular date using:
select t.*
from (select t.*,
row_number() over (partition by coalesce(pid, id) order by start_date desc) as seqnum
from t
where start_date <= date '2021-02-20'
) t
where seqnum = 1;
If you only want the status and date, then you can also use group by and keep:
select coalesce(pid, id), max(start_date),
max(id_type) keep (dense_rank first order by start_date desc) as id_type
from t
where start_date <= date '2021-02-20'
group by coalesce(pid, id)
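The idea in both queries (take the latest record on or before the date for each contract, then read its id_type) can be sketched in Python as follows. This is a minimal illustration under the assumption that id_type 2 means inactive and anything else means active; the function name status_on is hypothetical.

```python
from datetime import date

# (id, pid, id_type, start_date); pid is None for the contract row itself.
records = [
    (20, None, 1, date(2021, 1, 1)),
    (38, 20, 2, date(2021, 2, 15)),
    (42, 20, 3, date(2021, 2, 25)),
    (54, 20, 2, date(2021, 4, 1)),
    (95, 20, 3, date(2021, 4, 15)),
    (30, None, 1, date(2021, 1, 12)),
]


def status_on(records, day):
    """Return {contract_id: 'active' | 'inactive'} for contracts started on or before day."""
    latest = {}  # contract id -> (start_date, id_type) of the latest record up to `day`
    for rec_id, pid, id_type, start in records:
        if start > day:
            continue  # record lies in the future relative to `day`
        contract = pid if pid is not None else rec_id  # same as coalesce(pid, id)
        if contract not in latest or start > latest[contract][0]:
            latest[contract] = (start, id_type)
    # Assumption: type 2 (deactivation order) means inactive, types 1 and 3 mean active.
    return {c: ("inactive" if t == 2 else "active") for c, (_, t) in latest.items()}


print(status_on(records, date(2021, 2, 20)))
```

For 2021-02-20 this reports contract 20 as inactive and contract 30 as active.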

How to pull data based on current and last update?

Our data table looks like this:
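Machine Name | Lot Number | Qty | Load TxnDate | Unload TxnDate
--------------------------------------------------------------
M123 | ABC | 500 | 10/1/2020 | 10/2/2020
M741 | DEF | 325 | 10/1/2020 |
M123 | ZZZ | 100 | 10/5/2020 | 10/7/2020
M951 | AAA | 550 | 10/5/2020 | 10/9/2020
M123 | BBB | 550 | 10/7/2020 |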
I need to create an SQL query that shows the currently loaded Lot number - Machines with no Unload TxnDate - and the last loaded Lot number based on the unload TxnDate.
So in the example, when I run a query for M123, the result will show:
Machine Name | Lot Number | Qty | Load TxnDate | Unload TxnDate
--------------------------------------------------------------
M123 | ZZZ | 100 | 10/5/2020 | 10/7/2020
M123 | BBB | 550 | 10/7/2020 |
As you can see, although the machine has 3 records, the results only show the currently loaded lot and the last loaded one. Is there any way to replicate this? The Machine Name is dynamic, so my user can enter a Machine Name and see the results for that machine, based on the missing Unload TxnDate and the last Unload TxnDate.
You seem to want the last two rows. That would be something like this:
select t.*
from t
where machine_name = 'M123'
order by load_txn_date desc
fetch first 2 rows only;
Note: not all databases support the fetch first clause. Some spell it limit, or select top, or even something else.
If you want two rows per machine, one option uses window functions:
select *
from (
select t.*,
row_number() over(
partition by machine_name, (case when unload_txn_date is null then 0 else 1 end)
order by coalesce(unload_txn_date, load_txn_date) desc
) rn
from mytable t
) t
where rn = 1
The idea is to separate rows between those that have an unload date, and those that do not. We can then bring the top record per group.
For your sample data, this returns:
Machine_Name | Lot_Number | Qty | Load_Txn_Date | Unload_Txn_Date | rn
:----------- | :--------- | --: | :------------ | :-------------- | -:
M123 | BBB | 550 | 2020-10-07 | null | 1
M123 | ZZZ | 100 | 2020-10-05 | 2020-10-07 | 1
M741 | DEF | 325 | 2020-10-01 | null | 1
M951 | AAA | 550 | 2020-10-05 | 2020-10-09 | 1
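The grouping trick in that query (one group per machine for rows without an unload date, one for rows with it, then the top row of each group) can be sketched in Python like this. It is only an illustration of the idea; the function name current_and_last is made up, and the dates are read as month/day/year.

```python
from datetime import date

# (machine, lot, qty, load_date, unload_date); unload_date is None while still loaded.
rows = [
    ("M123", "ABC", 500, date(2020, 10, 1), date(2020, 10, 2)),
    ("M741", "DEF", 325, date(2020, 10, 1), None),
    ("M123", "ZZZ", 100, date(2020, 10, 5), date(2020, 10, 7)),
    ("M951", "AAA", 550, date(2020, 10, 5), date(2020, 10, 9)),
    ("M123", "BBB", 550, date(2020, 10, 7), None),
]


def current_and_last(rows):
    """Return the top row per (machine, has unload date) group."""
    best = {}  # (machine, has_unload) -> (sort value, row)
    for row in rows:
        machine, _lot, _qty, load, unload = row
        key = (machine, unload is not None)
        # Same as coalesce(unload_txn_date, load_txn_date) in the ORDER BY.
        sort_value = unload if unload is not None else load
        if key not in best or sort_value > best[key][0]:
            best[key] = (sort_value, row)
    return [row for _, row in best.values()]


for row in current_and_last(rows):
    print(row)
```

Filtering the result on a single machine name gives the two rows the question asks for.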
You might use the following query, presuming that your database supports window (or analytic) functions:
WITH t AS
(
SELECT COALESCE(Unload_Txn_Date -
LAG(Load_Txn_Date) OVER
(PARTITION BY Machine_Name ORDER BY Load_Txn_Date DESC),0) AS lg,
MAX(CASE WHEN Unload_Txn_Date IS NULL THEN Load_Txn_Date END) OVER
(PARTITION BY Machine_Name) AS mx,
t.*
FROM tab t
), t2 AS
(
SELECT DENSE_RANK() OVER (ORDER BY mx DESC NULLS LAST) AS dr, t.*
FROM t
WHERE mx IS NOT NULL
)
SELECT Machine_Name,Lot_Number,Qty,Load_Txn_Date,Unload_Txn_Date
FROM t2
WHERE dr = 1 AND lg = 0
ORDER BY Load_Txn_Date
Here lg is 0 when a row's Unload_Txn_Date equals the Load_Txn_Date of the next lot loaded on the same machine, which is taken to mean that the job continued without interruption; it is also 0 when either of the two dates is missing. mx is the latest Load_Txn_Date among each machine's rows that have no Unload_Txn_Date. The final query then filters the result set on the values derived from these window functions.
Demo

Workload distribution in Oracle SQL

I'm trying to build a workload distribution in SQL, but it seems hard.
My data are :
work-station | workload
------------------------
Station1 | 500
Station2 | 450
Station3 | 50
Station4 | 600
Station5 | 2
Station6 | 350
And :
Real Worker Number : 5
My needs are the following :
The theoretical worker number (the distribution) must exactly match the real worker number
I don't want to put someone in a station if it's not required (example: station5)
I don't need to know whether my workers will be able to finish the complete workload
I want the best theoretical placement of my workers to get the best productivity
Is it possible to compute this workload distribution in a SQL query?
Possible result :
work-station | workload | theoretical worker distribution
------------------------
Station1 | 500 | 1
Station2 | 450 | 1
Station3 | 50 | 0
Station4 | 600 | 2
Station5 | 2 | 0
Station6 | 350 | 1
Here is a very simplistic way to do it by prorating the workers by the percentage of total work assigned to each station.
The complexity comes from making sure that an integer number of workers is assigned and that the total number of assigned workers equals the number of workers that are available. Here is the query that does that:
with params as ( SELECT 5 total_workers FROM DUAL),
info ( station, workload) AS (
SELECT 'Station1', 500 FROM DUAL UNION ALL
SELECT 'Station2', 450 FROM DUAL UNION ALL
SELECT 'Station3', 50 FROM DUAL UNION ALL
SELECT 'Station4', 600 FROM DUAL UNION ALL
SELECT 'Station5', 2 FROM DUAL UNION ALL
SELECT 'Station6', 350 FROM DUAL ),
targets as (
select station,
workload,
-- What % of total work is assigned to station?
workload/sum(workload) over ( partition by null) pct_work,
-- How many workers (target_workers) would we assign if we could assign fractional workers?
total_workers * (workload/sum(workload) over ( partition by null)) target_workers,
-- Take the integer part of target_workers
floor(total_workers * (workload/sum(workload) over ( partition by null))) target_workers_floor,
-- Take the fractional part of target workers
mod(total_workers * (workload/sum(workload) over ( partition by null)),1) target_workers_frac
from params, info )
select t.station,
t.workload,
-- Start with the integer part of target workers
target_workers_floor +
-- Order the stations by the fractional part of target workers and assign 1 additional worker to each station until
-- the total number of workers assigned = the number of workers we have available.
case when row_number() over ( partition by null order by target_workers_frac desc )
<= total_workers - sum(target_workers_floor) over ( partition by null) THEN 1 ELSE 0 END target_workers
from params, targets t
order by station;
STATION | WORKLOAD | TARGET_WORKERS
------------------------------------
Station1 | 500 | 1
Station2 | 450 | 1
Station3 | 50 | 0
Station4 | 600 | 2
Station5 | 2 | 0
Station6 | 350 | 1
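The prorating scheme described above (floor of each station's proportional share, then one extra worker to the stations with the largest fractional parts until all workers are placed) is often called the largest-remainder method. A small Python sketch of it follows, using integer arithmetic instead of fractions to avoid rounding surprises. The function name distribute is invented for this example, and it assumes the total workload is greater than zero.

```python
def distribute(workloads, total_workers):
    """workloads: dict station -> workload. Returns dict station -> assigned workers."""
    total = sum(workloads.values())  # assumed to be > 0
    # Integer part of each station's proportional share of the workers.
    floors = {s: w * total_workers // total for s, w in workloads.items()}
    # Remainder of the same division; it orders stations like the fractional part does.
    remainders = {s: w * total_workers % total for s, w in workloads.items()}
    left = total_workers - sum(floors.values())  # workers not yet placed
    # Stations with the largest remainders get one extra worker each.
    extra = sorted(workloads, key=lambda s: remainders[s], reverse=True)[:left]
    return {s: floors[s] + (1 if s in extra else 0) for s in workloads}


# Sample data from the question.
workloads = {
    "Station1": 500,
    "Station2": 450,
    "Station3": 50,
    "Station4": 600,
    "Station5": 2,
    "Station6": 350,
}
print(distribute(workloads, 5))
```

With 5 workers this gives 1, 1, 0, 2, 0, 1 for Station1 to Station6, and the assigned workers always sum to total_workers.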
The query below should work:
First I assign workers to the stations that have more workload than the mean workload.
Then I assign the remaining workers to the stations in descending order of their remaining workloads.
http://sqlfiddle.com/#!4/55491/12
5 represents the number of workers.
SELECT
workload,
station,
SUM (worker_count)
FROM
(
SELECT workload, station, floor( workload / ( SELECT SUM (workload) / 5 FROM work_station ) ) worker_count -- assign workers to the stations that have more workload than the mean
FROM
work_station works
UNION ALL
SELECT t_table.*, 1
FROM ( SELECT workload, station
FROM work_station
ORDER BY
( workload - floor( workload / ( SELECT SUM (workload) / 5 FROM work_station ) ) * ( SELECT SUM (workload) / 5 FROM work_station )
) DESC
) t_table
WHERE
rownum < ( 5 - ( SELECT SUM ( floor( workload / ( SELECT SUM (workload) / 5 FROM work_station ) ) ) FROM work_station ) + 1
) -- count of the rest of the workers
) table_sum
GROUP BY
workload,
station
ORDER BY
station

Reporting task completion status with only create and operation_date params

I have two tables: the first one stores task data (task name, create date, assign_to etc.) and the second one stores task history data, e.g. operation_date, task completed, task rejected etc. (the Task and Task_history tables).
The company creates tasks and assigns them to employees; the employees then accept the tasks and complete them.
The task create_date column specifies the sequence in which the tasks should be done; the operation_date and completed status columns together specify the sequence in which the tasks were completed.
I need a query for a per-employee report: does an employee complete the tasks in the sequence specified at the beginning? How many tasks were completed in accordance with the given sequence?
I tried a query over completed tasks that orders the two tables by task creation date and operation_date for an employee on a given day, adds the rownum to each select, and then joins the two. If the rownums are equal, the employee completed the task in the given sequence, otherwise not. But the query result was not what I expected. The rownums are displayed like this: r_h--> 1,2,3 ; r_t--> 1,15,17
SELECT *
FROM (SELECT W.id, w.create_date, ROWNUM as r_t
FROM wfm_task_1 W where W.task_status = 3
ORDER BY W.create_date ASC) TASK_SEQ LEFT OUTER JOIN
( SELECT H.wfm_task, H.record_date, ROWNUM as r_h
FROM wfm_task_history H
WHERE H.task_status = 3
AND H.record_date BETWEEN (TO_DATE ('12.07.2013',
'DD.MM.YYYY')
- 1)
AND (TO_DATE ('12.07.2013',
'DD.MM.YYYY')
+ 1)
ORDER BY H.record_date ASC) HISTORY_SEQ
ON TASK_SEQ.id = HISTORY_SEQ.wfm_task
Sample dataset
wfm_task (ID, CREATION_DATE, TASK_NAME)
49361 | 06.07.2013 11:50:00 | missionx
49404 | 10.07.2013 13:01:00 | missiony
49407 | 11.07.2013 11:02:00 | missiona
49108 | 01.07.2013 21:02:00 | missionb
task_history (ID,WFM_TASK,OP_DATE, STATUS)
98 | 49361 | 12.07.2013 15:19:19 | 3
92 | 49404 | 12.07.2013 11:10:50 | 3
90 | 49407 | 12.07.2013 11:06:58 | 3
78 | 49108 | 03.07.2013 11:02:00 | 1
result (WFM_TASK,RECORD_DATE,R_H,ID,CREATE_DATE,R_T)
49361 | 12.07.2013 15:19:19 | 3 | 49361 | 06.07.2013 11:50:00 | 15
49404 | 12.07.2013 11:10:50 | 2 | 49404 | 10.07.2013 13:01:00 | 17
49407 | 12.07.2013 11:06:58 | 1 | 49407 | 11.07.2013 11:02:00 | 1
Status 3 = completed. I want to find out whether the tasks were completed in order, i.e. whether the task completion order matches the task creation order.
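You'll probably have to use the ROW_NUMBER function instead of ROWNUM, because ROWNUM is assigned before the ORDER BY is applied and so does not reflect the sorted order.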
SELECT a.id, a.create_date,
row_number() over (order by a.create_date) r_t,
b.record_date,
row_number() over (order by b.record_date) r_h
from wfm_task a left outer join task_history b
on a.id = b.wfm_task
where b.status = 3
and b.record_date between date'2013-07-12' - 1 and date'2013-07-12' + 1
Demo here.
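The comparison of the two row numbers can be sketched in Python to see what the query is meant to produce on the sample data. This is a rough illustration that treats "in sequence" as "same position in both orderings", which is one possible reading of the question; the names rank and in_sequence are made up for this example.

```python
from datetime import datetime

# Completed tasks from the sample dataset: task id -> creation time / completion time.
created = {
    49361: datetime(2013, 7, 6, 11, 50),
    49404: datetime(2013, 7, 10, 13, 1),
    49407: datetime(2013, 7, 11, 11, 2),
}
completed = {
    49361: datetime(2013, 7, 12, 15, 19, 19),
    49404: datetime(2013, 7, 12, 11, 10, 50),
    49407: datetime(2013, 7, 12, 11, 6, 58),
}


def rank(times):
    """Return task id -> 1-based position when ordered by time (like ROW_NUMBER)."""
    ordered = sorted(times, key=times.get)
    return {task: pos for pos, task in enumerate(ordered, start=1)}


def in_sequence(created, completed):
    """Return task id -> True if creation rank equals completion rank."""
    tasks = set(created) & set(completed)  # only tasks present on both sides
    r_t = rank({t: created[t] for t in tasks})
    r_h = rank({t: completed[t] for t in tasks})
    return {t: r_t[t] == r_h[t] for t in tasks}


flags = in_sequence(created, completed)
print(flags, sum(flags.values()))
```

On the sample data only task 49404 holds the same position in both orderings, so one task counts as completed in sequence.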