DATE_DIFF() in BigQuery to calculate time between rows - sql

I would like to calculate the time delay between several customer purchases. However, each purchase is saved in an individual row. The data set looks similar to the following:
| customer  | order_id | purchase_date | product      | sequencen | ... |
|-----------|----------|---------------|--------------|-----------|-----|
| customer1 | 1247857  | 2020-01-30    | ProdA, ProdB | 1         | ... |
| customer2 | 4454874  | 2020-02-07    | ProdA        | 1         | ... |
| customer3 | 3424556  | 2020-02-28    | ProdA        | 1         | ... |
| customer4 | 5678889  | 2020-03-14    | ProdB        | 1         | ... |
| customer3 | 5853778  | 2020-03-22    | ProdA, ProdB | 2         | ... |
| customer4 | 7578345  | 2020-03-30    | ProdA, ProdB | 2         | ... |
| customer2 | 4892978  | 2020-05-10    | ProdA        | 2         | ... |
| customer5 | 4834789  | 2020-07-05    | ProdA, ProdB | 1         | ... |
| customer5 | 9846726  | 2020-07-27    | ProdB        | 2         | ... |
| customer1 | 1774783  | 2020-12-12    | ProdB        | 2         | ... |
Per customer, I would like to end up with a table that shows the time difference (in days) between each purchase and the one before it. Basically, I would like to know the latency between a customer's first and second purchase, second and third purchase, and so on. The result should look like the following:
| customer  | order_id | purchase_date | product      | sequencen | ... | purchase_latency |
|-----------|----------|---------------|--------------|-----------|-----|------------------|
| customer1 | 1247857  | 2020-01-30    | ProdA, ProdB | 1         | ... |                  |
| customer1 | 1774783  | 2020-12-12    | ProdB        | 2         | ... | 317              |
| customer2 | 4454874  | 2020-02-07    | ProdA        | 1         | ... |                  |
| customer2 | 4892978  | 2020-05-10    | ProdA        | 2         | ... | 93               |
| customer3 | 3424556  | 2020-02-28    | ProdA        | 1         | ... |                  |
| customer3 | 5853778  | 2020-03-22    | ProdA, ProdB | 2         | ... | 23               |
| customer4 | 5678889  | 2020-03-14    | ProdB        | 1         | ... |                  |
| customer4 | 7578345  | 2020-03-30    | ProdA, ProdB | 2         | ... | 16               |
| customer5 | 4834789  | 2020-07-05    | ProdA, ProdB | 1         | ... |                  |
| customer5 | 9846726  | 2020-07-27    | ProdB        | 2         | ... | 22               |
I am struggling to add the purchase_latency calculation to my current query, as it requires a calculation across rows. Any ideas how I could add this to the query below?
SELECT
  order_id,
  MAX(customer) AS customer,
  MAX(purchase_date) AS purchase_date,
  STRING_AGG(product, ",") AS product,
  ...
FROM (
  SELECT
    od.order_number AS order_id,
    od.customer_email AS customer,
    od.order_date AS purchase_date,
    dd.sku AS product,
    ROW_NUMBER() OVER (PARTITION BY od.customer_email ORDER BY od.order_date) AS sequencen
  FROM orders_data od
  JOIN detail_data dd
    ON od.order_number = dd.order_number
  WHERE od.price > 0
    AND od.sku IN ("ProdA", "ProdB")
)
GROUP BY order_id

Did you try navigation functions like LAG()? For example:
WITH finishers AS (
  SELECT 'Sophia Liu' AS name,
         TIMESTAMP '2016-10-18 2:51:45' AS finish_time,
         'F30-34' AS division
  UNION ALL SELECT 'Lisa Stelzner', TIMESTAMP '2016-10-18 2:54:11', 'F35-39'
  UNION ALL SELECT 'Nikki Leith', TIMESTAMP '2016-10-18 2:59:01', 'F30-34'
  UNION ALL SELECT 'Lauren Matthews', TIMESTAMP '2016-10-18 3:01:17', 'F35-39'
  UNION ALL SELECT 'Desiree Berry', TIMESTAMP '2016-10-18 3:05:42', 'F35-39'
  UNION ALL SELECT 'Suzy Slane', TIMESTAMP '2016-10-18 3:06:24', 'F35-39'
  UNION ALL SELECT 'Jen Edwards', TIMESTAMP '2016-10-18 3:06:36', 'F30-34'
  UNION ALL SELECT 'Meghan Lederer', TIMESTAMP '2016-10-18 3:07:41', 'F30-34'
  UNION ALL SELECT 'Carly Forte', TIMESTAMP '2016-10-18 3:08:58', 'F25-29'
  UNION ALL SELECT 'Lauren Reasoner', TIMESTAMP '2016-10-18 3:10:14', 'F30-34'
)
SELECT name,
       finish_time,
       division,
       LAG(name) OVER (PARTITION BY division ORDER BY finish_time ASC) AS preceding_runner
FROM finishers;
+-----------------+-------------+----------+------------------+
| name            | finish_time | division | preceding_runner |
+-----------------+-------------+----------+------------------+
| Carly Forte     | 03:08:58    | F25-29   | NULL             |
| Sophia Liu      | 02:51:45    | F30-34   | NULL             |
| Nikki Leith     | 02:59:01    | F30-34   | Sophia Liu       |
| Jen Edwards     | 03:06:36    | F30-34   | Nikki Leith      |
| Meghan Lederer  | 03:07:41    | F30-34   | Jen Edwards      |
| Lauren Reasoner | 03:10:14    | F30-34   | Meghan Lederer   |
| Lisa Stelzner   | 02:54:11    | F35-39   | NULL             |
| Lauren Matthews | 03:01:17    | F35-39   | Lisa Stelzner    |
| Desiree Berry   | 03:05:42    | F35-39   | Lauren Matthews  |
| Suzy Slane      | 03:06:24    | F35-39   | Desiree Berry    |
+-----------------+-------------+----------+------------------+
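Applied to the question's data, here is a minimal sketch of the same pattern: LAG(purchase_date) fetches each customer's previous purchase date, and DATE_DIFF() turns the gap into days. The inline rows below are only a stand-in for the aggregated result of the query in the question; in practice the outer SELECT would wrap that query.
WITH purchases AS (
  SELECT 'customer1' AS customer, 1247857 AS order_id, DATE '2020-01-30' AS purchase_date
  UNION ALL SELECT 'customer1', 1774783, DATE '2020-12-12'
  UNION ALL SELECT 'customer2', 4454874, DATE '2020-02-07'
  UNION ALL SELECT 'customer2', 4892978, DATE '2020-05-10'
)
SELECT customer,
       order_id,
       purchase_date,
       -- days since this customer's previous purchase; NULL for the first one
       DATE_DIFF(purchase_date,
                 LAG(purchase_date) OVER (PARTITION BY customer ORDER BY purchase_date),
                 DAY) AS purchase_latency
FROM purchases
ORDER BY customer, purchase_date;
For customer1 this returns NULL and then 317, and for customer2 NULL and then 93, matching the expected output above.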

Related

SQL - Multiple conditionals with row_number()

+------------+------------------+-------+-----------------+------------------------------+
| product_id | date             | STOCK | REAL_STOCK-DIFF | Counting one time?           |
+------------+------------------+-------+-----------------+------------------------------+
| 1ab7       | 10/18/2022 18:30 |  6009 | 495             | 495                          |
| 1ab7       | 10/18/2022 20:10 |  6003 | 495             | 0                            |
| 1ab7       | 10/20/2022 10:05 |  5514 | 495             | 0                            |
| 1ab7       | 10/20/2022 11:05 | 23856 | 0               | 0                            |
| 1ab7       | 10/20/2022 12:05 | 25850 | 0               | 0                            |
| 1ab7       | 10/20/2022 13:05 | 44160 | 0               | 0                            |
| 1ab7       | 10/20/2022 14:05 | 48205 | 130             | 130                          |
| 1ab7       | 10/20/2022 17:05 | 48122 | 130             | 0                            |
| 1ab7       | 10/20/2022 18:05 | 48075 | 130             | 0                            |
| 1ab7       | 10/20/2022 19:05 | 17438 | 128             | 128                          |
| 1ab7       | 10/21/2022 1:38  | 17310 | 128             | 0                            |
| 2ab7       | 10/18/2022 18:30 | 85692 | 0               | 0                            |
| 2ab7       | 10/20/2022 14:05 | 84498 |                 | SUM DIF STOCK == 495+130+128 |
| 2ab7       | 10/20/2022 15:05 | 84477 |                 |                              |
| 2ab7       | 10/20/2022 16:05 |     0 |                 |                              |
| 2ab7       | 10/20/2022 23:38 |     0 |                 |                              |
| 2ab7       | 10/21/2022 0:05  |     0 |                 |                              |
+------------+------------------+-------+-----------------+------------------------------+
This table shows the SELECT I tried to build with PARTITION BY in SQL. I'm tracking stock and need to show the stock difference per product. I was doing something like MAX - MIN per partition, but several conditions apply: the stock can suddenly grow or shrink, and can even be removed completely (stock = 0), so PARTITION BY alone won't solve it.
My real stock is the third column, "STOCK": it drops from 6009 to a minimum of 5514, then jumps to 23856 and grows up to 48205, then is deducted down to 17438.
The logic would be something like: if 23856 > the previous value (5514), then start a new minimum and maximum at 23856; I just don't know how to partition on that. For the 17438 case, something like: if the previous row's value > 17438 * 1.2 (a 20% drop), then new minimum = 17438.
The SQL I wrote gives me the "Dif_Stock" column, which wrongly displays 42691 as the difference.
All I'm trying to achieve are the values I entered in the "Counting one time?" column:
"SUM DIF STOCK == 495+130+128"
My SQL code:
SELECT DISTINCT
  product_id,
  date,
  stock,
  MaxStock,
  MinStock,
  (MaxStock - MinStock) AS Dif_Stock
FROM (SELECT product_id,
             date,
             stock,
             MAX(stock) OVER (PARTITION BY product_id) AS MaxStock,
             MIN(stock) OVER (PARTITION BY product_id) AS MinStock,
             ROW_NUMBER() OVER (PARTITION BY product_id ORDER BY date) AS ROWN
      FROM (SELECT product_id,
                   category,
                   product_name,
                   vol,
                   price,
                   CAST(stock AS int) AS stock,
                   date
            FROM stock_control
            WHERE 1 = 1) STOCK
      GROUP BY date,
               product_id,
               category,
               product_name,
               vol,
               price,
               stock
      --HAVING STOCK != 0
     ) STOCK_2
--ORDER BY (MaxStock - MinStock) DESC
ORDER BY product_id,
         date ASC;
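A possible sketch of the island logic described in the question (my reading of the intent, not an accepted answer): start a new island whenever the stock jumps up (a restock) or drops by more than the 20% threshold mentioned above, then take MAX - MIN inside each island and sum the islands per product. This is written in BigQuery syntax and assumes date is a real DATETIME column; the thresholds are assumptions taken from the question.
WITH marked AS (
  SELECT product_id,
         date,
         stock,
         -- 1 marks the start of a new island: a restock (stock jumps up) or a
         -- drop of more than 20% versus the previous reading (both assumptions)
         CASE
           WHEN LAG(stock) OVER (PARTITION BY product_id ORDER BY date) IS NULL THEN 0
           WHEN stock > LAG(stock) OVER (PARTITION BY product_id ORDER BY date) THEN 1
           WHEN LAG(stock) OVER (PARTITION BY product_id ORDER BY date) > stock * 1.2 THEN 1
           ELSE 0
         END AS new_island
  FROM stock_control
),
islands AS (
  SELECT *,
         -- a running sum of the markers turns them into per-product island ids
         SUM(new_island) OVER (PARTITION BY product_id ORDER BY date) AS grp
  FROM marked
)
SELECT product_id,
       -- singleton islands contribute 0, so 1ab7 yields 495 + 130 + 128
       SUM(island_diff) AS total_stock_diff
FROM (SELECT product_id, grp, MAX(stock) - MIN(stock) AS island_diff
      FROM islands
      GROUP BY product_id, grp)
GROUP BY product_id;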

Query to reorganize dates

I need to do a transformation of a Postgres database table and I don't know where to start.
This is the table:
| Customer Code | Activity | Start Date |
|:---------------:|:--------:|:----------:|
| 100 | A | 01/05/2017 |
| 100 | A | 19/07/2017 |
| 100 | B | 18/09/2017 |
| 100 | C | 07/12/2017 |
| 101 | A | 11/02/2018 |
| 101 | B | 02/04/2018 |
| 101 | B | 14/06/2018 |
| 100 | A | 13/07/2018 |
| 100 | B | 14/08/2018 |
Customers can perform activities A, B, and C, always in that order: to carry out activity B, a customer first has to carry out activity A; to carry out C, they have to carry out A and then B.
An activity or cycle can be performed more than once by the same customer.
I need to reorganize the table in this way, placing the beginning and end of each step:
| Customer Code | Activity | Start Date | End Date |
|:---------------:|:--------:|:----------:|:----------:|
| 100 | A | 01/05/2017 | 18/09/2017 |
| 100 | B | 18/09/2017 | 07/12/2017 |
| 100 | C | 07/12/2017 | 13/07/2018 |
| 101 | A | 11/02/2018 | 02/04/2018 |
| 101 | B | 02/04/2018 | |
| 100 | A | 13/07/2018 | 14/08/2018 |
| 100 | B | 14/08/2018 | |
Here is an approach to this gaps-and-islands problem:
select
customer_code,
activity,
start_date,
case when (activity, lead(activity) over(partition by customer_code order by start_date))
in (('A', 'B'), ('B', 'C'), ('C', 'A'))
then lead(start_date) over(partition by customer_code order by start_date)
end end_date
from (
select
t.*,
lead(activity) over(partition by customer_code order by start_date) lead_activity
from mytable t
) t
where activity is distinct from lead_activity
The query starts by removing consecutive rows that have the same customer_code and activity. Then conditional logic brings in the start_date of the next row whenever the next activity follows in sequence.
Demo on DB Fiddle:
customer_code | activity | start_date | end_date
------------: | :------- | :--------- | :---------
100 | A | 2017-07-19 | 2017-09-18
100 | B | 2017-09-18 | 2017-12-07
100 | C | 2017-12-07 | 2018-07-13
100 | A | 2018-07-13 | 2018-08-14
100 | B | 2018-08-14 | null
101 | A | 2018-02-11 | 2018-06-14
101 | B | 2018-06-14 | null
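Note that the demo keeps the last row of each run of duplicates, which is why the first row shows 2017-07-19 rather than 01/05/2017. A hedged variant (my adjustment, not part of the original answer): filter on lag() instead of lead(), which keeps the first row of each run and reproduces the expected output exactly.
select
  customer_code,
  activity,
  start_date,
  -- end_date is the next remaining row's start_date, but only when that
  -- activity follows in sequence (A -> B, B -> C, or C -> A for a new cycle)
  case when (activity, lead(activity) over(partition by customer_code order by start_date))
            in (('A', 'B'), ('B', 'C'), ('C', 'A'))
       then lead(start_date) over(partition by customer_code order by start_date)
  end end_date
from (
  select
    t.*,
    -- keep the first row of each run of identical consecutive activities
    lag(activity) over(partition by customer_code order by start_date) lag_activity
  from mytable t
) t
where activity is distinct from lag_activity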

How to Do Data-Grouping in BigQuery?

I have a dataset that needs to be grouped. I've successfully done this in R, but now I have to do it in BigQuery. The data is shown in the following table:
| category | sub_category | date | day | timestamp | type | cpc | gmv |
|---------- |-------------- |----------- |----- |------------- |------ |------ |--------- |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:37:36 PM | BI | 1.94 | 252,293 |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:37:39 PM | RT | 1.94 | 252,293 |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:38:29 PM | RT | 1.58 | 205,041 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:14 AM | BI | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:18 AM | RT | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:52 AM | RT | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:06:33 AM | BI | 1.55 | 201,354 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:55:47 PM | PP | 1 | 129,282 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:56:23 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:19 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:34 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:58:46 PM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:27 PM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:51 PM | RT | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:00:57 AM | BI | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:01:11 AM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:03:01 AM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:12:42 AM | RT | 1.19 | 154,886 |
I want to group the rows: rows whose timestamps are <= 8 minutes apart should be merged into one group, with the output example below:
| category | sub_category | date | day | time | start_timestamp | end_timestamp | type | cpc | gmv |
|---------- |-------------- |----------------------- |--------- |---------- |--------------------- |--------------------- |---------- |------ |--------- |
| ABC | ABC-1 | 2/17/2020 | Mon | 23:37:36 | (02/17/20 23:37:36) | (02/17/20 23:38:29) | BI|RT | 1.82 | 236,542 |
| ABC | ABC-1 | 2/18/2020 | Tue | 0:05:14 | (02/18/20 00:05:14) | (02/18/20 00:06:33) | BI|RT | 1.59 | 206,636 |
| XYZ | XYZ-1 | 02/17/2020|02/18/2020 | Mon|Tue | 0:06:21 | (02/17/20 23:55:47) | (02/18/20 00:12:42) | PP|RT|BI | 0.95 | 123,815 |
Some fields are newly generated, as defined below:
| fields | definition |
|----------------- |-------------------------------------------------------- |
| day | Day of the row (combination if there's different days) |
| time | Start of timestamp |
| start_timestamp | Start timestamp of the first row in group |
| end_timestamp | Start timestamp of the last row in group |
| type | Type of Row (combination if there's different types) |
| cpc | Average CPC of the Group |
| gmv | Average GMV of the Group |
Could anyone help me build the query for the requirements above?
Thank you
This is a gaps-and-islands problem. Here is a solution that uses lag() and a cumulative sum() to define groups of adjacent records separated by less than 8 minutes; the rest is aggregation.
select
  category,
  sub_category,
  -- note: with DISTINCT, BigQuery's STRING_AGG can only ORDER BY the
  -- aggregated expression itself, so we order by day/type rather than dt
  string_agg(distinct day, '|' order by day) day,
  min(dt) start_dt,
  max(dt) end_dt,
  string_agg(distinct type, '|' order by type) type,
  avg(cpc) cpc,
  avg(gmv) gmv
from (
  select
    t.*,
    -- start a new group whenever the gap to the previous row exceeds 8 minutes
    sum(case when dt <= datetime_add(lag_dt, interval 8 minute) then 0 else 1 end)
      over(partition by category, sub_category order by dt) grp
  from (
    select
      t.*,
      lag(dt) over(partition by category, sub_category order by dt) lag_dt
    from (
      -- combine the separate date and time columns into a single DATETIME
      select t.*, datetime(date, timestamp) dt
      from mytable t
    ) t
  ) t
) t
group by category, sub_category, grp
Note that you should not store the date and time parts of your timestamps in separate columns: it makes the logic more complicated when you need to combine them (I added another level of nesting to avoid repeated conversions, which would have obfuscated the code).
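As a quick sanity check against the expected output: the first ABC group contains the three 2/17 rows, so avg cpc = (1.94 + 1.94 + 1.58) / 3 = 1.82 and avg gmv = (252,293 + 252,293 + 205,041) / 3 ≈ 236,542, matching the first expected row.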

Aggregate data from days into a month

I have data that is reported by the day, and I want to roll it up into a monthly report. The data looks like this:
INVOICE_DATE   GROSS_REVENUE   NET_REVENUE
2018-06-28     1623.99         659.72
2018-06-27     112414.65       38108.13
2018-06-26     2518.74         1047.14
2018-06-25     475805.92       172193.58
2018-06-22     1151.79         478.96
How do I go about creating a report that gives the total gross revenue and net revenue for the months of June, July, August, and so on, when the data is reported by the day?
So far this is what I have
SELECT invoice_date,
SUM(gross_revenue) AS gross_revenue,
SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY invoice_date
I would simply group by year and month. (The select list has to match the grouping, so select the year and month rather than the raw date.)
SELECT YEAR(invoice_date) AS invoice_year,
       MONTH(invoice_date) AS invoice_month,
       SUM(gross_revenue) AS gross_revenue,
       SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY YEAR(invoice_date), MONTH(invoice_date)
Since I don't know whether you have access to the YEAR and MONTH functions, another solution is to cast the date as a varchar and group by the leftmost 7 characters (year + month):
SELECT left(cast(invoice_date as varchar(50)),7) AS invoice_date,
SUM(gross_revenue) AS gross_revenue,
SUM(net_revenue) AS net_revenue
FROM wc_revenue GROUP BY left(cast(invoice_date as varchar(50)),7)
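If the database is MySQL (as the ROLLUP example below suggests), a tidier alternative is DATE_FORMAT, which builds a YYYY-MM key directly; this is my addition, not part of the original answers:
-- group by a YYYY-MM key derived from the date
SELECT DATE_FORMAT(invoice_date, '%Y-%m') AS invoice_month,
       SUM(gross_revenue) AS gross_revenue,
       SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY DATE_FORMAT(invoice_date, '%Y-%m');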
You could try a ROLLUP. Sample illustration below:
Table data:
mysql> select * from wc_revenue;
+--------------+---------------+-------------+
| invoice_date | gross_revenue | net_revenue |
+--------------+---------------+-------------+
| 2018-06-28   |       1623.99 |      659.72 |
| 2018-06-27   |     112414.65 |    38108.13 |
| 2018-06-26   |       2518.74 |     1047.14 |
| 2018-06-25   |     475805.92 |   172193.58 |
| 2018-06-22   |       1151.79 |      478.96 |
| 2018-07-02   |        150.00 |      100.00 |
| 2018-07-05   |        350.00 |      250.00 |
| 2018-08-07   |        600.00 |      400.00 |
| 2018-08-09   |        900.00 |      600.00 |
+--------------+---------------+-------------+
mysql> SELECT month(invoice_date) as MTH, invoice_date, SUM(gross_revenue) AS gross_revenue, SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY MTH, invoice_date WITH ROLLUP;
+------+--------------+---------------+-------------+
| MTH  | invoice_date | gross_revenue | net_revenue |
+------+--------------+---------------+-------------+
|    6 | 2018-06-22   |       1151.79 |      478.96 |
|    6 | 2018-06-25   |     475805.92 |   172193.58 |
|    6 | 2018-06-26   |       2518.74 |     1047.14 |
|    6 | 2018-06-27   |     112414.65 |    38108.13 |
|    6 | 2018-06-28   |       1623.99 |      659.72 |
|    6 | NULL         |     593515.09 |   212487.53 |
|    7 | 2018-07-02   |        150.00 |      100.00 |
|    7 | 2018-07-05   |        350.00 |      250.00 |
|    7 | NULL         |        500.00 |      350.00 |
|    8 | 2018-08-07   |        600.00 |      400.00 |
|    8 | 2018-08-09   |        900.00 |      600.00 |
|    8 | NULL         |       1500.00 |     1000.00 |
| NULL | NULL         |     595515.09 |   213837.53 |
+------+--------------+---------------+-------------+

Boolean was amount ever greater than x?

Interesting question for you all. Here's a sample of my dataset (see below). I have warehouses, dates, and the change in inventory level at that specific date for a given warehouse.
Example: assuming 1/1/2018 is the first date, warehouse 1 starts out with 100 in inventory, then 600, then 300, then 500, and so on.
The question I'd like to answer in SQL: by warehouse ID, did each warehouse ever have inventory of more than 750 (yes/no)?
I can't just sum the entire column, because the ending inventory (the sum of the column by warehouse) may be lower than a past inventory level. Any help is appreciated!
+--------------+------------+---------------+
| Warehouse_id | Date       | Inventory_Amt |
+--------------+------------+---------------+
| 1            | 1/1/2018   | +100          |
| 1            | 6/1/2018   | +500          |
| 1            | 6/15/2018  | -300          |
| 1            | 7/1/2018   | +200          |
| 1            | 8/1/2018   | -400          |
| 1            | 12/15/2018 | +100          |
| 2            | 1/1/2018   | +10           |
| 2            | 6/1/2018   | +50           |
| 2            | 6/15/2018  | -30           |
| 2            | 7/1/2018   | +20           |
| 2            | 8/1/2018   | -40           |
| 2            | 12/15/2018 | +10           |
| 3            | 1/1/2018   | +100          |
| 3            | 6/1/2018   | +500          |
| 4            | 6/15/2018  | +300          |
| 4            | 7/1/2018   | +200          |
| 4            | 8/1/2018   | -400          |
| 4            | 12/15/2018 | +100          |
+--------------+------------+---------------+
You want a cumulative sum and then filtering:
select i.*
from (select i.*,
             -- running inventory level per warehouse
             sum(inventory_amt) over (partition by warehouse_id order by date) as inventory
      from inventory i
     ) i
-- filter on the running total, not on the per-row change
where inventory > 750
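To collapse this into a single yes/no flag per warehouse, as the question asks, here is a hedged follow-up sketch (my addition, assuming the same inventory table): take the maximum of a per-row indicator over the running total.
select warehouse_id,
       -- 1 if the running inventory ever exceeded 750, else 0
       max(case when inventory > 750 then 1 else 0 end) as ever_above_750
from (select warehouse_id,
             sum(inventory_amt) over (partition by warehouse_id order by date) as inventory
      from inventory
     ) i
group by warehouse_id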