Aggregate data from days into a month - sql

I have data that is reported by the day and I want to aggregate it into a monthly report. The data looks like this:
INVOICE_DATE  GROSS_REVENUE  NET_REVENUE
2018-06-28    1623.99        659.72
2018-06-27    112414.65      38108.13
2018-06-26    2518.74        1047.14
2018-06-25    475805.92      172193.58
2018-06-22    1151.79        478.96
How do I go about creating a report that gives me the total gross revenue and net revenue for June, July, August, etc., when the data is reported by the day?
So far this is what I have:
SELECT invoice_date,
SUM(gross_revenue) AS gross_revenue,
SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY invoice_date

I would simply group by year and month.
SELECT year(invoice_date) AS invoice_year,
month(invoice_date) AS invoice_month,
SUM(gross_revenue) AS gross_revenue,
SUM(net_revenue) AS net_revenue
FROM wc_revenue GROUP BY year(invoice_date), month(invoice_date)
Since I don't know if you have access to the year and month functions, another solution would be to cast the date as a varchar and group by the left-most 7 characters (year+month)
SELECT left(cast(invoice_date as varchar(50)),7) AS invoice_date,
SUM(gross_revenue) AS gross_revenue,
SUM(net_revenue) AS net_revenue
FROM wc_revenue GROUP BY left(cast(invoice_date as varchar(50)),7)
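If your database supports DATE_FORMAT (MySQL, for example), the same year-month grouping key can be built directly from the date. A minimal sketch, assuming MySQL syntax and the table above:
-- One row per year+month, assuming MySQL's DATE_FORMAT is available
SELECT DATE_FORMAT(invoice_date, '%Y-%m') AS invoice_month,
       SUM(gross_revenue) AS gross_revenue,
       SUM(net_revenue)   AS net_revenue
FROM wc_revenue
GROUP BY DATE_FORMAT(invoice_date, '%Y-%m')
ORDER BY invoice_month;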

You could try a ROLLUP. Sample illustration below:
Table data:
mysql> select * from wc_revenue;
+--------------+---------------+-------------+
| invoice_date | gross_revenue | net_revenue |
+--------------+---------------+-------------+
| 2018-06-28   |       1623.99 |      659.72 |
| 2018-06-27   |     112414.65 |    38108.13 |
| 2018-06-26   |       2518.74 |     1047.14 |
| 2018-06-25   |     475805.92 |   172193.58 |
| 2018-06-22   |       1151.79 |      478.96 |
| 2018-07-02   |        150.00 |      100.00 |
| 2018-07-05   |        350.00 |      250.00 |
| 2018-08-07   |        600.00 |      400.00 |
| 2018-08-09   |        900.00 |      600.00 |
+--------------+---------------+-------------+
mysql> SELECT month(invoice_date) as MTH, invoice_date, SUM(gross_revenue) AS gross_revenue, SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY MTH, invoice_date WITH ROLLUP;
+------+--------------+---------------+-------------+
| MTH  | invoice_date | gross_revenue | net_revenue |
+------+--------------+---------------+-------------+
|    6 | 2018-06-22   |       1151.79 |      478.96 |
|    6 | 2018-06-25   |     475805.92 |   172193.58 |
|    6 | 2018-06-26   |       2518.74 |     1047.14 |
|    6 | 2018-06-27   |     112414.65 |    38108.13 |
|    6 | 2018-06-28   |       1623.99 |      659.72 |
|    6 | NULL         |     593515.09 |   212487.53 |
|    7 | 2018-07-02   |        150.00 |      100.00 |
|    7 | 2018-07-05   |        350.00 |      250.00 |
|    7 | NULL         |        500.00 |      350.00 |
|    8 | 2018-08-07   |        600.00 |      400.00 |
|    8 | 2018-08-09   |        900.00 |      600.00 |
|    8 | NULL         |       1500.00 |     1000.00 |
| NULL | NULL         |     595515.09 |   213837.53 |
+------+--------------+---------------+-------------+
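If you only want the per-month subtotal and grand-total lines from the ROLLUP output, you can filter out the per-day detail rows. A sketch, assuming invoice_date itself is never NULL in the table:
-- Keep only the rollup rows (per-month subtotals and the grand total)
SELECT month(invoice_date) AS MTH, invoice_date,
       SUM(gross_revenue) AS gross_revenue,
       SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY MTH, invoice_date WITH ROLLUP
HAVING invoice_date IS NULL;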

Related

DATE_DIFF() in BigQuery to calculate time between rows

I would like to calculate the time delay between several customer purchases. However, each purchase is saved in an individual row. The data set looks similar to the following:
customer | order_id | purchase_date | product | sequencen| ... |
customer1 | 1247857 | 2020-01-30 | ProdA, ProdB | 1 | ... |
customer2 | 4454874 | 2020-02-07 | ProdA | 1 | ... |
customer3 | 3424556 | 2020-02-28 | ProdA | 1 | ... |
customer4 | 5678889 | 2020-03-14 | ProdB | 1 | ... |
customer3 | 5853778 | 2020-03-22 | ProdA, ProdB | 2 | ... |
customer4 | 7578345 | 2020-03-30 | ProdA, ProdB | 2 | ... |
customer2 | 4892978 | 2020-05-10 | ProdA | 2 | ... |
customer5 | 4834789 | 2020-07-05 | ProdA, ProdB | 1 | ... |
customer5 | 9846726 | 2020-07-27 | ProdB | 2 | ... |
customer1 | 1774783 | 2020-12-12 | ProdB | 2 | ... |
Per customer, I would like to end up with a table that calculates the time difference (in days) between a given purchase and the purchase that came before it. Basically, I would like to know the time delay (latency) between a customer's first and second purchase, second and third purchase, and so on. The result should look like the following:
customer | order_id | purchase_date | product | sequencen| ... | purchase_latency
customer1 | 1247857 | 2020-01-30 | ProdA, ProdB | 1 | ... |
customer1 | 1774783 | 2020-12-12 | ProdB | 2 | ... | 317
customer2 | 4454874 | 2020-02-07 | ProdA | 1 | ... |
customer2 | 4892978 | 2020-05-10 | ProdA | 2 | ... | 93
customer3 | 3424556 | 2020-02-28 | ProdA | 1 | ... |
customer3 | 5853778 | 2020-03-22 | ProdA, ProdB | 2 | ... | 23
customer4 | 5678889 | 2020-03-14 | ProdB | 1 | ... |
customer4 | 7578345 | 2020-03-30 | ProdA, ProdB | 2 | ... | 16
customer5 | 4834789 | 2020-07-05 | ProdA, ProdB | 1 | ... |
customer5 | 9846726 | 2020-07-27 | ProdB | 2 | ... | 22
I am struggling to add the purchase_latency calculation to my current query, as it requires a calculation across rows. Any ideas on how I could add this to my current query?
SELECT
order_id,
max(customer) as customer,
max(purchase_date) as purchase_date,
STRING_AGG(product, ",") as product,
...,
FROM (
SELECT
od.order_number as order_id,
od.customer_email as customer,
od.order_date as purchase_date,
dd.sku as product,
ROW_NUMBER() OVER (PARTITION BY od.customer_email ORDER BY od.order_date) as sequencen
FROM orders_data od
JOIN detail_data dd
ON od.order_number = dd.order_number
WHERE od.price > 0 AND
od.sku in ("ProdA","ProdB"))
GROUP BY order_id
Did you try row navigation functions like LAG?
WITH finishers AS
(SELECT 'Sophia Liu' as name,
TIMESTAMP '2016-10-18 2:51:45' as finish_time,
'F30-34' as division
UNION ALL SELECT 'Lisa Stelzner', TIMESTAMP '2016-10-18 2:54:11', 'F35-39'
UNION ALL SELECT 'Nikki Leith', TIMESTAMP '2016-10-18 2:59:01', 'F30-34'
UNION ALL SELECT 'Lauren Matthews', TIMESTAMP '2016-10-18 3:01:17', 'F35-39'
UNION ALL SELECT 'Desiree Berry', TIMESTAMP '2016-10-18 3:05:42', 'F35-39'
UNION ALL SELECT 'Suzy Slane', TIMESTAMP '2016-10-18 3:06:24', 'F35-39'
UNION ALL SELECT 'Jen Edwards', TIMESTAMP '2016-10-18 3:06:36', 'F30-34'
UNION ALL SELECT 'Meghan Lederer', TIMESTAMP '2016-10-18 3:07:41', 'F30-34'
UNION ALL SELECT 'Carly Forte', TIMESTAMP '2016-10-18 3:08:58', 'F25-29'
UNION ALL SELECT 'Lauren Reasoner', TIMESTAMP '2016-10-18 3:10:14', 'F30-34')
SELECT name,
finish_time,
division,
LAG(name)
OVER (PARTITION BY division ORDER BY finish_time ASC) AS preceding_runner
FROM finishers;
+-----------------+-------------+----------+------------------+
| name            | finish_time | division | preceding_runner |
+-----------------+-------------+----------+------------------+
| Carly Forte     | 03:08:58    | F25-29   | NULL             |
| Sophia Liu      | 02:51:45    | F30-34   | NULL             |
| Nikki Leith     | 02:59:01    | F30-34   | Sophia Liu       |
| Jen Edwards     | 03:06:36    | F30-34   | Nikki Leith      |
| Meghan Lederer  | 03:07:41    | F30-34   | Jen Edwards      |
| Lauren Reasoner | 03:10:14    | F30-34   | Meghan Lederer   |
| Lisa Stelzner   | 02:54:11    | F35-39   | NULL             |
| Lauren Matthews | 03:01:17    | F35-39   | Lisa Stelzner    |
| Desiree Berry   | 03:05:42    | F35-39   | Lauren Matthews  |
| Suzy Slane      | 03:06:24    | F35-39   | Desiree Berry    |
+-----------------+-------------+----------+------------------+
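The same pattern combined with DATE_DIFF gives the latency in days. Below is a minimal, self-contained sketch with made-up purchases that mirror your data; in your query you would apply the same LAG/DATE_DIFF expression on top of the result of your GROUP BY order_id subquery, partitioned by customer:
-- Days between each purchase and the customer's previous purchase (BigQuery)
WITH purchases AS
 (SELECT 'customer1' AS customer, DATE '2020-01-30' AS purchase_date
  UNION ALL SELECT 'customer1', DATE '2020-12-12'
  UNION ALL SELECT 'customer2', DATE '2020-02-07'
  UNION ALL SELECT 'customer2', DATE '2020-05-10')
SELECT customer,
       purchase_date,
       DATE_DIFF(purchase_date,
                 LAG(purchase_date) OVER (PARTITION BY customer ORDER BY purchase_date),
                 DAY) AS purchase_latency
FROM purchases;
-- purchase_latency is NULL for each customer's first purchase,
-- then 317 for customer1 and 93 for customer2, matching the expected output.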

How to perform aggregation(sum) group by month field using HiveQL?

Below is my data, where I am looking to generate the sum of revenue per month using the columns event_time and price.
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
| oct_data.event_time | oct_data.event_type | oct_data.product_id | oct_data.category_id | oct_data.category_code | oct_data.brand | oct_data.price | oct_data.user_id | oct_data.user_session |
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
| 2019-10-01 00:00:00 UTC | cart | 5773203 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:03 UTC | cart | 5773353 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:07 UTC | cart | 5881589 | 2151191071051219817 | | lovely | 13.48 | 429681830 | 49e8d843-adf3-428b-a2c3-fe8bc6a307c9 |
| 2019-10-01 00:00:07 UTC | cart | 5723490 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:15 UTC | cart | 5881449 | 1487580013522845895 | | lovely | 0.56 | 429681830 | 49e8d843-adf3-428b-a2c3-fe8bc6a307c9 |
| 2019-10-01 00:00:16 UTC | cart | 5857269 | 1487580005134238553 | | runail | 2.62 | 430174032 | 73dea1e7-664e-43f4-8b30-d32b9d5af04f |
| 2019-10-01 00:00:19 UTC | cart | 5739055 | 1487580008246412266 | | kapous | 4.75 | 377667011 | 81326ac6-daa4-4f0a-b488-fd0956a78733 |
| 2019-10-01 00:00:24 UTC | cart | 5825598 | 1487580009445982239 | | | 0.56 | 467916806 | 2f5b5546-b8cb-9ee7-7ecd-84276f8ef486 |
| 2019-10-01 00:00:25 UTC | cart | 5698989 | 1487580006317032337 | | | 1.27 | 385985999 | d30965e8-1101-44ab-b45d-cc1bb9fae694 |
| 2019-10-01 00:00:26 UTC | view | 5875317 | 2029082628195353599 | | | 1.59 | 474232307 | 445f2b74-5e4c-427e-b7fa-6e0a28b156fe |
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
I have used the query below, but the aggregation does not produce the expected result. Please suggest the best approach to generate the desired output.
select date_format(event_time,'MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(event_time,'MM')
order by Month;
Note: event_time field is in TIMESTAMP format.
First convert the timestamp to date and then apply date_format():
select date_format(cast(event_time as date),'MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(cast(event_time as date),'MM')
order by Month;
This will work if all the dates are in the same year.
If not, you should also group by the year.
Your code should work -- unless you are using an old version of Hive. date_format() has accepted a timestamp argument since 1.1.2 -- released in early 2016. That said, I would strongly suggest that you include the year:
select date_format(event_time, 'yyyy-MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(event_time, 'yyyy-MM')
order by Month;
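If date_format() is not available in your Hive version, the year-month can also be taken as a leading substring of the timestamp. A sketch, assuming the usual yyyy-MM-dd HH:mm:ss rendering when the timestamp is cast to a string:
-- Group by the leading "yyyy-MM" of the timestamp string
select substr(cast(event_time as string), 1, 7) as Month,
       sum(price) as Monthly_Revenue
from oct_data_new
group by substr(cast(event_time as string), 1, 7)
order by Month;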

How to Do Data-Grouping in BigQuery?

I have a dataset that needs to be grouped. I've successfully done this using R, but now I have to do it using BigQuery. The data is shown in the following table:
| category | sub_category | date | day | timestamp | type | cpc | gmv |
|---------- |-------------- |----------- |----- |------------- |------ |------ |--------- |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:37:36 PM | BI | 1.94 | 252,293 |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:37:39 PM | RT | 1.94 | 252,293 |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:38:29 PM | RT | 1.58 | 205,041 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:14 AM | BI | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:18 AM | RT | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:52 AM | RT | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:06:33 AM | BI | 1.55 | 201,354 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:55:47 PM | PP | 1 | 129,282 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:56:23 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:19 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:34 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:58:46 PM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:27 PM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:51 PM | RT | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:00:57 AM | BI | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:01:11 AM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:03:01 AM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:12:42 AM | RT | 1.19 | 154,886 |
I want to group the rows: rows whose timestamps are no more than 8 minutes apart from the next row should be combined into one row, as in the example output below:
| category | sub_category | date | day | time | start_timestamp | end_timestamp | type | cpc | gmv |
|---------- |-------------- |----------------------- |--------- |---------- |--------------------- |--------------------- |---------- |------ |--------- |
| ABC | ABC-1 | 2/17/2020 | Mon | 23:37:36 | (02/17/20 23:37:36) | (02/17/20 23:38:29) | BI|RT | 1.82 | 236,542 |
| ABC | ABC-1 | 2/18/2020 | Tue | 0:05:14 | (02/18/20 00:05:14) | (02/18/20 00:06:33) | BI|RT | 1.59 | 206,636 |
| XYZ | XYZ-1 | 02/17/2020|02/18/2020 | Mon|Tue | 0:06:21 | (02/17/20 23:55:47) | (02/18/20 00:12:42) | PP|RT|BI | 0.95 | 123,815 |
Some newly generated fields are defined below:
| fields | definition |
|----------------- |-------------------------------------------------------- |
| day | Day of the row (combination if there's different days) |
| time | Start of timestamp |
| start_timestamp | Start timestamp of the first row in group |
| end_timestamp | Start timestamp of the last row in group |
| type | Type of Row (combination if there's different types) |
| cpc | Average CPC of the Group |
| gmv | Average GMV of the Group |
Could anyone help me write the query for the above requirements?
Thank you
This is a gaps-and-islands problem. Here is a solution that uses lag() and a cumulative sum() to define groups of adjacent records with gaps of at most 8 minutes; the rest is aggregation.
select
    category,
    sub_category,
    string_agg(distinct day, '|' order by dt) day,
    min(dt) start_dt,
    max(dt) end_dt,
    string_agg(distinct type, '|' order by dt) type,
    avg(cpc) cpc,
    avg(gmv) gmv
from (
    select
        t.*,
        sum(case when dt <= datetime_add(lag_dt, interval 8 minute) then 0 else 1 end)
            over(partition by category, sub_category order by dt) grp
    from (
        select
            t.*,
            lag(dt) over(partition by category, sub_category order by dt) lag_dt
        from (
            select t.*, datetime(date, timestamp) dt
            from mytable t
        ) t
    ) t
) t
group by category, sub_category, grp
Note that you should not be storing the date and time parts of your timestamps in separate columns: this makes the logic more complicated when you need to combine them (I added another level of nesting to avoid repeated conversions, which would have obfuscated the code).

SQL percent of total and weighted average

I have the following PostgreSQL table stock, where the structure is the following:
| column | pk |
+--------+-----+
| date | yes |
| id | yes |
| type | yes |
| qty | |
| fee | |
The table looks like this:
| date | id | type | qty | cost |
+------------+-----+------+------+------+
| 2015-01-01 | 001 | CB04 | 500 | 2 |
| 2015-01-01 | 002 | CB04 | 1500 | 3 |
| 2015-01-01 | 003 | CB04 | 500 | 1 |
| 2015-01-01 | 004 | CB04 | 100 | 5 |
| 2015-01-01 | 001 | CB02 | 800 | 6 |
| 2015-01-02 | 002 | CB03 | 3100 | 1 |
I want to create a view or query, so that the result looks like this.
The table will show the t_qty, % of total Qty, and weighted fee for each day and each type:
% of total Qty = qty / t_qty
weighted fee = fee * % of total Qty
| date | id | type | qty | cost | t_qty | % of total Qty | weighted fee |
+------------+-----+------+------+------+-------+----------------+--------------+
| 2015-01-01 | 001 | CB04 | 500 | 2 | 2600 | 0.19 | 0.38 |
| 2015-01-01 | 002 | CB04 | 1500 | 3 | 2600 | 0.58 | 1.73 |
| 2015-01-01 | 003 | CB04 | 500 | 1 | 2600 | 0.19 | 0.192 |
| 2015-01-01 | 004 | CB04 | 100 | 5 | 2600 | 0.04 | 0.192 |
| | | | | | | | |
I could do this in Excel, but the dataset is too big to process.
You can use SUM as a window function and some calculation to do it.
SELECT *,
SUM(qty) OVER (PARTITION BY date ORDER BY date) t_qty,
qty::numeric/SUM(qty) OVER (PARTITION BY date ORDER BY date) ,
fee * (qty::numeric/SUM(qty) OVER (PARTITION BY date ORDER BY date))
FROM T
If you want rounding, you can use the ROUND function.
SELECT *,
SUM(qty) OVER (PARTITION BY date ORDER BY date) t_qty,
ROUND(qty::numeric/SUM(qty) OVER (PARTITION BY date ORDER BY date),3) "% of total Qty",
ROUND(fee * (qty::numeric/SUM(qty) OVER (PARTITION BY date ORDER BY date)),3) "weighted fee"
FROM T
sqlfiddle
[Results]:
| date | id | type | qty | fee | t_qty | % of total Qty | weighted fee |
|------------|-----|------|------|-----|-------|----------------|--------------|
| 2015-01-01 | 001 | CB04 | 500 | 2 | 2600 | 0.192 | 0.385 |
| 2015-01-01 | 002 | CB04 | 1500 | 3 | 2600 | 0.577 | 1.731 |
| 2015-01-01 | 003 | CB04 | 500 | 1 | 2600 | 0.192 | 0.192 |
| 2015-01-01 | 004 | CB04 | 100 | 5 | 2600 | 0.038 | 0.192 |
| 2015-01-02 | 002 | CB03 | 3100 | 1 | 3100 | 1 | 1 |
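If t_qty should be the total per date and type (the expected output shows 2600 for the CB04 rows while the CB02 row on the same date is excluded), partitioning by both columns is a small change. A sketch of the same query with that assumption:
-- Same calculation, but totals are per (date, type) rather than per date
SELECT *,
       SUM(qty) OVER (PARTITION BY date, type) AS t_qty,
       ROUND(qty::numeric / SUM(qty) OVER (PARTITION BY date, type), 3) AS "% of total Qty",
       ROUND(fee * (qty::numeric / SUM(qty) OVER (PARTITION BY date, type)), 3) AS "weighted fee"
FROM T;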

How do you join records one to one with multiple possible matches ...?

I have a table of transactions like the following
| ID | Trans Type | Date | Qty | Total | Item Number | Work Order |
-------------------------------------------------------------------------
| 1 | Issue | 11/27/2012 | 3 | 3.50 | NULL | 10 |
| 2 | Issue | 11/27/2012 | 3 | 3.50 | NULL | 11 |
| 3 | Issue | 11/25/2012 | 1 | 1.25 | NULL | 12 |
| 4 | ID Issue | 11/27/2012 | -3 | -3.50 | 100 | NULL |
| 5 | ID Issue | 11/27/2012 | -3 | -3.50 | 102 | NULL |
| 6 | ID Issue | 11/25/2012 | -1 | -1.25 | 104 | NULL |
These transactions are duplicates: the 'Issue' rows have a work order ID while the 'ID Issue' rows have the item number. I would like to update the [Item Number] field for the 'Issue' transactions to include the item number. When I do a join on the Date, Qty, and Total, I get something like this:
| ID | Trans Type | Date | Qty | Total | Item Number | Work Order |
-------------------------------------------------------------------------
| 1 | Issue | 11/27/2012 | 3 | 3.50 | 100 | 10 |
| 1 | Issue | 11/27/2012 | 3 | 3.50 | 102 | 10 |
| 2 | Issue | 11/27/2012 | 3 | 3.50 | 100 | 11 |
| 2 | Issue | 11/27/2012 | 3 | 3.50 | 102 | 11 |
| 3 | Issue | 11/25/2012 | 1 | 1.25 | 104 | 12 |
The duplicates are multiplied! I would like this
| ID | Trans Type | Date | Qty | Total | Item Number | Work Order |
-------------------------------------------------------------------------
| 1 | Issue | 11/27/2012 | 3 | 3.50 | 100 | 10 |
| 2 | Issue | 11/27/2012 | 3 | 3.50 | 102 | 11 |
| 3 | Issue | 11/25/2012 | 1 | 1.25 | 104 | 12 |
Or this (Item Number is switched for the two matches)
| ID | Trans Type | Date | Qty | Total | Item Number | Work Order |
-------------------------------------------------------------------------
| 1 | Issue | 11/27/2012 | 3 | 3.50 | 102 | 10 |
| 2 | Issue | 11/27/2012 | 3 | 3.50 | 100 | 11 |
| 3 | Issue | 11/25/2012 | 1 | 1.25 | 104 | 12 |
Either would be fine. What would be a simple solution?
Use SELECT DISTINCT to filter out identical results, or you could partition your results to get the first item in each grouping.
UPDATE
Here's the code to illustrate the partition approach.
SELECT ID, [Trans Type], [Date], [Qty], [Total], [Item Number], [Work Order]
FROM
(
SELECT
ID, [Trans Type], [Date], [Qty], [Total], [Item Number], [Work Order], ROW_NUMBER() OVER
(PARTITION BY ID, [Trans Type], [Date], [Qty], [Total]
ORDER BY [Item Number]) AS ItemRank
FROM YourTable
) AS SubQuery
WHERE ItemRank = 1
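If the ranking above still pairs the same Item Number with more than one Issue row, a further option is to number the Issue and ID Issue rows separately within each (Date, Qty, Total) group and join on that number, so each Issue row gets exactly one Item Number. A sketch only (T-SQL assumed, YourTable as above; the ID Issue quantities and totals are negated in the join because they are negative in the sample data):
WITH issues AS (
    -- One row number per Issue transaction within each (Date, Qty, Total) group
    SELECT ID, [Date], [Qty], [Total], [Work Order],
           ROW_NUMBER() OVER (PARTITION BY [Date], [Qty], [Total] ORDER BY ID) AS rn
    FROM YourTable
    WHERE [Trans Type] = 'Issue'
),
id_issues AS (
    -- Matching row numbers on the ID Issue side
    SELECT [Date], [Qty], [Total], [Item Number],
           ROW_NUMBER() OVER (PARTITION BY [Date], [Qty], [Total] ORDER BY ID) AS rn
    FROM YourTable
    WHERE [Trans Type] = 'ID Issue'
)
SELECT i.ID, i.[Date], i.[Qty], i.[Total], d.[Item Number], i.[Work Order]
FROM issues i
JOIN id_issues d
  ON d.[Date]  = i.[Date]
 AND d.[Qty]   = -i.[Qty]
 AND d.[Total] = -i.[Total]
 AND d.rn      = i.rn;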