How to Do Data-Grouping in BigQuery?

I have a dataset that needs to be grouped. I've done this successfully in R, but now I have to do it in BigQuery. The data is shown in the following table:
| category | sub_category | date | day | timestamp | type | cpc | gmv |
|---------- |-------------- |----------- |----- |------------- |------ |------ |--------- |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:37:36 PM | BI | 1.94 | 252,293 |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:37:39 PM | RT | 1.94 | 252,293 |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:38:29 PM | RT | 1.58 | 205,041 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:14 AM | BI | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:18 AM | RT | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:52 AM | RT | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:06:33 AM | BI | 1.55 | 201,354 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:55:47 PM | PP | 1 | 129,282 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:56:23 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:19 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:34 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:58:46 PM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:27 PM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:51 PM | RT | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:00:57 AM | BI | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:01:11 AM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:03:01 AM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:12:42 AM | RT | 1.19 | 154,886 |
I want to group the rows: any row whose timestamp is within 8 minutes of the next row should be merged into the same group, producing output like the example below:
| category | sub_category | date | day | time | start_timestamp | end_timestamp | type | cpc | gmv |
|---------- |-------------- |----------------------- |--------- |---------- |--------------------- |--------------------- |---------- |------ |--------- |
| ABC | ABC-1 | 2/17/2020 | Mon | 23:37:36 | (02/17/20 23:37:36) | (02/17/20 23:38:29) | BI|RT | 1.82 | 236,542 |
| ABC | ABC-1 | 2/18/2020 | Tue | 0:05:14 | (02/18/20 00:05:14) | (02/18/20 00:06:33) | BI|RT | 1.59 | 206,636 |
| XYZ | XYZ-1 | 02/17/2020|02/18/2020 | Mon|Tue | 0:06:21 | (02/17/20 23:55:47) | (02/18/20 00:12:42) | PP|RT|BI | 0.95 | 123,815 |
The output contains some newly generated fields, defined below:
| fields          | definition                                               |
|-----------------|----------------------------------------------------------|
| day             | Day of the row (combined if there are different days)    |
| time            | Time part of the group's start timestamp                 |
| start_timestamp | Timestamp of the first row in the group                  |
| end_timestamp   | Timestamp of the last row in the group                   |
| type            | Type of the row (combined if there are different types)  |
| cpc             | Average CPC of the group                                 |
| gmv             | Average GMV of the group                                 |
Could anyone help me write a query that meets the above requirements?
Thank you

This is a gaps-and-islands problem. Here is a solution that uses lag() and a cumulative sum() to define groups of adjacent records no more than 8 minutes apart; the rest is aggregation.
select
    category,
    sub_category,
    -- note: with DISTINCT, BigQuery only allows ordering by the aggregated expression itself
    string_agg(distinct day, '|' order by day) as day,
    min(dt) as start_dt,
    max(dt) as end_dt,
    string_agg(distinct type, '|' order by type) as type,
    avg(cpc) as cpc,
    avg(gmv) as gmv
from (
    select
        t.*,
        -- start a new group whenever the gap to the previous row exceeds 8 minutes
        sum(case when dt <= datetime_add(lag_dt, interval 8 minute) then 0 else 1 end)
            over(partition by category, sub_category order by dt) as grp
    from (
        select
            t.*,
            lag(dt) over(partition by category, sub_category order by dt) as lag_dt
        from (
            -- combine the separate DATE and TIME columns into a single DATETIME
            select t.*, datetime(date, timestamp) as dt
            from mytable t
        ) t
    ) t
) t
group by category, sub_category, grp
Note that you should not store the date and time parts of your timestamps in separate columns: it makes the logic more complicated when you need to combine them (I added an extra level of nesting to avoid repeating the conversion, which would have obfuscated the code).
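For illustration, a one-time conversion along those lines might look like the sketch below (the dataset and table names are hypothetical; it assumes `date` is a DATE column and `timestamp` is a TIME column, as in the question):
-- sketch only: materialize a single DATETIME column so later queries
-- don't have to recombine the parts
create table mydataset.mytable_clean as
select
    * except(date, timestamp),          -- drop the separate date/time parts
    datetime(date, timestamp) as dt     -- one combined DATETIME column
from mydataset.mytable;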

Related

How to perform aggregation (sum) group by month field using HiveQL?

Below is my data. I am looking to generate the sum of revenue per month using the columns event_time and price.
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
| oct_data.event_time | oct_data.event_type | oct_data.product_id | oct_data.category_id | oct_data.category_code | oct_data.brand | oct_data.price | oct_data.user_id | oct_data.user_session |
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
| 2019-10-01 00:00:00 UTC | cart | 5773203 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:03 UTC | cart | 5773353 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:07 UTC | cart | 5881589 | 2151191071051219817 | | lovely | 13.48 | 429681830 | 49e8d843-adf3-428b-a2c3-fe8bc6a307c9 |
| 2019-10-01 00:00:07 UTC | cart | 5723490 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:15 UTC | cart | 5881449 | 1487580013522845895 | | lovely | 0.56 | 429681830 | 49e8d843-adf3-428b-a2c3-fe8bc6a307c9 |
| 2019-10-01 00:00:16 UTC | cart | 5857269 | 1487580005134238553 | | runail | 2.62 | 430174032 | 73dea1e7-664e-43f4-8b30-d32b9d5af04f |
| 2019-10-01 00:00:19 UTC | cart | 5739055 | 1487580008246412266 | | kapous | 4.75 | 377667011 | 81326ac6-daa4-4f0a-b488-fd0956a78733 |
| 2019-10-01 00:00:24 UTC | cart | 5825598 | 1487580009445982239 | | | 0.56 | 467916806 | 2f5b5546-b8cb-9ee7-7ecd-84276f8ef486 |
| 2019-10-01 00:00:25 UTC | cart | 5698989 | 1487580006317032337 | | | 1.27 | 385985999 | d30965e8-1101-44ab-b45d-cc1bb9fae694 |
| 2019-10-01 00:00:26 UTC | view | 5875317 | 2029082628195353599 | | | 1.59 | 474232307 | 445f2b74-5e4c-427e-b7fa-6e0a28b156fe |
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
I have used the query below, but the sum does not seem to be computed. Please suggest the best approach to generate the desired output.
select date_format(event_time,'MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(event_time,'MM')
order by Month;
Note: event_time field is in TIMESTAMP format.
First convert the timestamp to a date and then apply date_format():
select date_format(cast(event_time as date),'MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(cast(event_time as date),'MM')
order by Month;
This will work if all the dates are in the same year.
If not, you should also group by the year.
Your code should work -- unless you are using an old version of Hive. date_format() has accepted a timestamp argument since 1.1.2 -- released in early 2016. That said, I would strongly suggest that you include the year:
select date_format(event_time, 'yyyy-MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(event_time, 'yyyy-MM')
order by Month;
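As a quick illustration (with made-up dates) of why the year matters: two Octobers from different years collapse into one group under 'MM' but stay apart under 'yyyy-MM'.
select date_format(d, 'MM')      as mm,
       date_format(d, 'yyyy-MM') as yyyy_mm
from (
    select cast('2019-10-05 00:00:00' as timestamp) as d
    union all
    select cast('2020-10-07 00:00:00' as timestamp) as d
) t;
-- mm | yyyy_mm
-- 10 | 2019-10
-- 10 | 2020-10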

How to get the 3rd report to combine the customer and order data

I have a question about retention rate.
I have 2 tables, including the customer data and the order data.
DISTRIBUTOR as d
+---------+-----------+--------------+--------------------+
| ID | SETUP_DT | REINSTATE_DT | LOCAL_REINSTATE_DT |
+---------+-----------+--------------+--------------------+
| C111111 | 2018/1/1 | Null | Null |
| C111112 | 2015/12/9 | 2018/10/25 | 2018/10/25 |
| C111113 | 2018/10/1 | Null | Null |
| C111114 | 2018/10/6 | 2018/12/14 | 2018/12/14 |
+---------+-----------+--------------+--------------------+
ORDER as o; please note that the data is for reference only...
+---------+----------+-----+
| ID | ORD_DT | OAL |
+---------+----------+-----+
| C111111 | 2018/1/1 | 112 |
| C111111 | 2018/1/1 | 100 |
| C111111 | 2018/1/1 | 472 |
| C111111 | 2018/1/1 | 452 |
| C111111 | 2018/1/1 | 248 |
| C111111 | 2018/1/1 | 996 |
+---------+----------+-----+
The third table I have in mind for the retention rate report:
+---------+-----------+-----------+---------------+-----------+
| ID | APP_MON | ORDER_MON | TimeDiff(Mon) | TTL AMT |
+---------+-----------+-----------+---------------+-----------+
| C111111 | 2018/1/1 | 2018/1/1 | - | 25,443 |
| C111111 | 2018/1/1 | 2018/2/1 | 1 | 7,610 |
| C111111 | 2018/1/1 | 2018/3/1 | 2 | 20,180 |
| C111111 | 2018/1/1 | 2018/4/1 | 3 | 22,265 |
| C111111 | 2018/1/1 | 2018/5/1 | 4 | 34,118 |
| C111111 | 2018/1/1 | 2018/6/1 | 5 | 19,523 |
| C111111 | 2018/1/1 | 2018/7/1 | 6 | 20,220 |
| C111111 | 2018/1/1 | 2018/8/1 | 7 | 2,006 |
| C111111 | 2018/1/1 | 2018/9/1 | 8 | 15,813 |
| C111111 | 2018/1/1 | 2018/10/1 | 9 | 16,733 |
| C111111 | 2018/1/1 | 2018/11/1 | 10 | 20,973 |
| C111112 | 2018/10/1 | 2017/11/1 | - | 516 |
| C111112 | 2018/10/1 | 2018/10/1 | - | 1 |
| C111113 | 2018/10/1 | Null | - | Null |
| C111114 | 2018/12/1 | Null | - | Null |
+---------+-----------+-----------+---------------+-----------+
Definition:
- APP_MON: the month that the customer joined, which is the max date from the start date of [d.SETUP_DT], [d.REINSTATE_DT] and [d.LOCAL_REINSTATE_DT]
- ORD_MON: the month that the customer purchased, which is the start date of the order date month
- TimeDiff: the duration in months between APP_MON and ORD_MON, e.g. if A's ORD_MON is 2018/1/1 and A's APP_MON is 2018/2/1, the duration is 1.
- TTL_AMT: the total order amount that the customer bought in the related order date month
I tried to produce the data for the third table.
But my query runs very slowly... I need a more efficient way, since I have millions of rows...
Thanks.
I don't think you need to use unpivot. To get the latest date you can just use the greatest() function.
This solution has two subqueries, one to calculate the app_mon for each new customer and the other to calculate the earliest order date for all customers who placed an order in the last two years. This may not be the most performant approach, but your first priority should be to get the correct outcome; once you have that, you can tune it if necessary:
with cust as
(
    select d.dist_id as id
         -- note: greatest() returns NULL if any argument is NULL;
         -- wrap the arguments in coalesce() if the dates can be NULL
         , greatest(d.setup_dt, d.reinstate_dt, d.local_reinstate_dt) as app_mon
    from mjensen_dev.gc_distributor d
    where d.setup_dt >= date '2017-01-01'
       or d.reinstate_dt >= date '2017-01-01'
       or d.local_reinstate_dt >= date '2017-01-01'
)
, ord as
(
    select o.dist_id as id
         , min(o.ord_dt) as ord_mon
         , sum(o.oal) as ord_amt
    from gc_orders o
    where o.ord_dt >= date '2017-01-01'
    group by o.dist_id
           , trunc(o.ord_dt, 'mm')
)
select cust.id
     , cust.app_mon
     , ord.ord_mon
     , floor(months_between(ord.ord_mon, cust.app_mon)) as mon_diff
     , ord.ord_amt
from cust
inner join ord on ord.id = cust.id
order by 1, 2
/
You may wish to tweak my calculation of mon_diff. This calculation treats 2018/2/1 - 2018/1/1 as a one-month difference, because it seems odd to me that a customer who places an order on the day they joined would have a mon_diff of 1 rather than zero. But if your statement of the business rule is correct, you would need to add 1 to the calculation. Likewise, I have not included trunc() in the processing of the dates, but you may wish to reinstate it.
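For reference, a quick sanity check of how floor(months_between()) behaves in Oracle (illustrative dates):
-- the same calendar month floors to 0; the first of the next month floors to 1
select floor(months_between(date '2018-01-15', date '2018-01-01')) as same_month
     , floor(months_between(date '2018-02-01', date '2018-01-01')) as next_month
from dual;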

How to show all dates from a certain month?

I have a database table in PostgreSQL named "time", like:
| Name   | StartDate  | EndDate    |
------------------------------------
| Oct-18 | 2018-10-01 | 2018-10-31 |
| Nov-18 | 2018-11-01 | 2018-11-30 |
| Dec-18 | 2018-12-01 | 2018-12-31 |
I want the result for each month to look like:
| Date       | Name   |
------------------------
| 2018-10-01 | Oct-18 |
| 2018-10-02 | Oct-18 |
| 2018-10-03 | Oct-18 |
| 2018-10-04 | Oct-18 |
| 2018-10-05 | Oct-18 |
| 2018-10-06 | Oct-18 |
| ...        | ...    |
| 2018-10-31 | Oct-18 |
I think generate_series() does what you want:
-- the identifiers below assume the table and columns were created unquoted
-- (so they folded to lower case); quote them ("time", "StartDate", ...) if not
select generate_series(t.startdate, t.enddate, interval '1 day')::date as date,
       t.name
from "time" t;
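A self-contained way to try generate_series(), independent of the table:
-- produces every date in October 2018
select d::date as date
from generate_series(date '2018-10-01', date '2018-10-31', interval '1 day') as g(d);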

Aggregate data from days into a month

I have data that is reported by the day, and I want to roll it up into a monthly report. The data looks like this.
INVOICE_DATE  GROSS_REVENUE  NET_REVENUE
2018-06-28    1623.99        659.72
2018-06-27    112414.65      38108.13
2018-06-26    2518.74        1047.14
2018-06-25    475805.92      172193.58
2018-06-22    1151.79        478.96
How do I create a report that gives me the total gross revenue and net revenue for June, July, August, etc., when the data is reported by the day?
So far this is what I have
SELECT invoice_date,
SUM(gross_revenue) AS gross_revenue,
SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY invoice_date
I would simply group by year and month.
SELECT year(invoice_date) AS invoice_year,
       month(invoice_date) AS invoice_month,
       SUM(gross_revenue) AS gross_revenue,
       SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY year(invoice_date), month(invoice_date)
Since I don't know whether you have access to the year() and month() functions, another solution is to cast the date to a varchar and group by the left-most 7 characters (year + month):
SELECT left(cast(invoice_date as varchar(50)),7) AS invoice_date,
SUM(gross_revenue) AS gross_revenue,
SUM(net_revenue) AS net_revenue
FROM wc_revenue GROUP BY left(cast(invoice_date as varchar(50)),7)
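For instance, a date of 2018-06-28 casts to the string '2018-06-28' (assuming the engine renders dates in ISO format), and its left-most 7 characters are '2018-06', so every June 2018 row shares the same group key.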
You could try a ROLLUP. Sample illustration below:
Table data:
mysql> select * from wc_revenue;
+--------------+---------------+-------------+
| invoice_date | gross_revenue | net_revenue |
+--------------+---------------+-------------+
| 2018-06-28 | 1623.99 | 659.72 |
| 2018-06-27 | 112414.65 | 38108.13 |
| 2018-06-26 | 2518.74 | 1047.14 |
| 2018-06-25 | 475805.92 | 172193.58 |
| 2018-06-22 | 1151.79 | 478.96 |
| 2018-07-02 | 150.00 | 100.00 |
| 2018-07-05 | 350.00 | 250.00 |
| 2018-08-07 | 600.00 | 400.00 |
| 2018-08-09 | 900.00 | 600.00 |
+--------------+---------------+-------------+
mysql> SELECT month(invoice_date) as MTH, invoice_date, SUM(gross_revenue) AS gross_revenue, SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY MTH, invoice_date WITH ROLLUP;
+------+--------------+---------------+-------------+
| MTH | invoice_date | gross_revenue | net_revenue |
+------+--------------+---------------+-------------+
| 6 | 2018-06-22 | 1151.79 | 478.96 |
| 6 | 2018-06-25 | 475805.92 | 172193.58 |
| 6 | 2018-06-26 | 2518.74 | 1047.14 |
| 6 | 2018-06-27 | 112414.65 | 38108.13 |
| 6 | 2018-06-28 | 1623.99 | 659.72 |
| 6 | NULL | 593515.09 | 212487.53 |
| 7 | 2018-07-02 | 150.00 | 100.00 |
| 7 | 2018-07-05 | 350.00 | 250.00 |
| 7 | NULL | 500.00 | 350.00 |
| 8 | 2018-08-07 | 600.00 | 400.00 |
| 8 | 2018-08-09 | 900.00 | 600.00 |
| 8 | NULL | 1500.00 | 1000.00 |
| NULL | NULL | 595515.09 | 213837.53 |
+------+--------------+---------------+-------------+
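If the NULL marker rows are confusing, MySQL 8.0+ also provides GROUPING() to label them; a sketch against the same table (the labels are made up):
SELECT month(invoice_date) AS mth,
       invoice_date,
       IF(GROUPING(invoice_date), 'subtotal/total', 'detail') AS row_type,
       SUM(gross_revenue) AS gross_revenue,
       SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY mth, invoice_date WITH ROLLUP;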

SQL window excluding current group?

I'm trying to produce rolled-up summaries of the following data, both for the group in question on its own and for everything excluding that group. I think this can be done with a window function, but I'm having trouble getting the syntax right (in my case, Hive SQL).
I want the following data to be aggregated
+------------+---------+--------+
| date | product | rating |
+------------+---------+--------+
| 2018-01-01 | A | 1 |
| 2018-01-02 | A | 3 |
| 2018-01-20 | A | 4 |
| 2018-01-27 | A | 5 |
| 2018-01-29 | A | 4 |
| 2018-02-01 | A | 5 |
| 2017-01-09 | B | NULL |
| 2017-01-12 | B | 3 |
| 2017-01-15 | B | 4 |
| 2017-01-28 | B | 4 |
| 2017-07-21 | B | 2 |
| 2017-09-21 | B | 5 |
| 2017-09-13 | C | 3 |
| 2017-09-14 | C | 4 |
| 2017-09-15 | C | 5 |
| 2017-09-16 | C | 5 |
| 2018-04-01 | C | 2 |
| 2018-01-13 | D | 1 |
| 2018-01-14 | D | 2 |
| 2018-01-24 | D | 3 |
| 2018-01-31 | D | 4 |
+------------+---------+--------+
Aggregated results:
+------+-------+---------+----+------------+------------------+----------+
| year | month | product | ct | avg_rating | avg_rating_other | other_ct |
+------+-------+---------+----+------------+------------------+----------+
| 2018 | 1 | A | 5 | 3.4 | 2.5 | 4 |
| 2018 | 2 | A | 1 | 5 | NULL | 0 |
| 2017 | 1 | B | 4 | 3.6666667 | NULL | 0 |
| 2017 | 7 | B | 1 | 2 | NULL | 0 |
| 2017 | 9 | B | 1 | 5 | 4.25 | 4 |
| 2017 | 9 | C | 4 | 4.25 | 5 | 1 |
| 2018 | 4 | C | 1 | 2 | NULL | 0 |
| 2018 | 1 | D | 4 | 2.5 | 3.4 | 5 |
+------+-------+---------+----+------------+------------------+----------+
I've also considered producing two aggregates, one with the product in question and one without, but I'm having trouble creating the appropriate join key.
You can do:
select year(date), month(date), product,
count(*) as ct, avg(rating) as avg_rating,
sum(count(*)) over (partition by year(date), month(date)) - count(*) as ct_other,
((sum(sum(rating)) over (partition by year(date), month(date)) - sum(rating)) /
(sum(count(*)) over (partition by year(date), month(date)) - count(*))
) as avg_other
from t
group by year(date), month(date), product;
The rating for the "other" group is a bit tricky. You need to add everything up, subtract out the current group, and then calculate the average as that sum divided by that count.
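As a sanity check against the sample data: in 2018-01, products A and D together contribute 9 rows with a rating sum of 27. For product A (5 rows, sum 17) that gives (27 - 17) / (9 - 5) = 2.5, and for product D (4 rows, sum 10) it gives (27 - 10) / (9 - 4) = 3.4, both matching the expected output. One caveat: sum(rating) skips NULL ratings but count(*) does not, so if NULL ratings can appear in the "other" rows you may want count(rating) in the denominator instead.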