How to perform aggregation (sum) grouped by a month field using HiveQL?

Below is my data. I am looking to generate the sum of revenue per month using the columns event_time and price.
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
| oct_data.event_time | oct_data.event_type | oct_data.product_id | oct_data.category_id | oct_data.category_code | oct_data.brand | oct_data.price | oct_data.user_id | oct_data.user_session |
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
| 2019-10-01 00:00:00 UTC | cart | 5773203 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:03 UTC | cart | 5773353 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:07 UTC | cart | 5881589 | 2151191071051219817 | | lovely | 13.48 | 429681830 | 49e8d843-adf3-428b-a2c3-fe8bc6a307c9 |
| 2019-10-01 00:00:07 UTC | cart | 5723490 | 1487580005134238553 | | runail | 2.62 | 463240011 | 26dd6e6e-4dac-4778-8d2c-92e149dab885 |
| 2019-10-01 00:00:15 UTC | cart | 5881449 | 1487580013522845895 | | lovely | 0.56 | 429681830 | 49e8d843-adf3-428b-a2c3-fe8bc6a307c9 |
| 2019-10-01 00:00:16 UTC | cart | 5857269 | 1487580005134238553 | | runail | 2.62 | 430174032 | 73dea1e7-664e-43f4-8b30-d32b9d5af04f |
| 2019-10-01 00:00:19 UTC | cart | 5739055 | 1487580008246412266 | | kapous | 4.75 | 377667011 | 81326ac6-daa4-4f0a-b488-fd0956a78733 |
| 2019-10-01 00:00:24 UTC | cart | 5825598 | 1487580009445982239 | | | 0.56 | 467916806 | 2f5b5546-b8cb-9ee7-7ecd-84276f8ef486 |
| 2019-10-01 00:00:25 UTC | cart | 5698989 | 1487580006317032337 | | | 1.27 | 385985999 | d30965e8-1101-44ab-b45d-cc1bb9fae694 |
| 2019-10-01 00:00:26 UTC | view | 5875317 | 2029082628195353599 | | | 1.59 | 474232307 | 445f2b74-5e4c-427e-b7fa-6e0a28b156fe |
+--------------------------+----------------------+----------------------+-----------------------+-------------------------+-----------------+-----------------+-------------------+---------------------------------------+
I have used the query below, but the sum does not seem to be computed. Please suggest the best approach to generate the desired output.
select date_format(event_time,'MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(event_time,'MM')
order by Month;
Note: event_time field is in TIMESTAMP format.

First convert the timestamp to date and then apply date_format():
select date_format(cast(event_time as date),'MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(cast(event_time as date),'MM')
order by Month;
This will work if all the dates are of the same year.
If not then you should also group by year.

Your code should work unless you are using an old version of Hive: date_format() was added in Hive 1.2.0 and accepts a date, timestamp, or string argument. That said, I would strongly suggest that you include the year, so that the same month from different years is not summed together:
select date_format(event_time, 'yyyy-MM') as Month,
sum(price) as Monthly_Revenue
from oct_data_new
group by date_format(event_time, 'yyyy-MM')
order by Month;
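
To illustrate why the year belongs in the grouping key, here is a minimal Python sketch of the same aggregation. It is not Hive code. The first three rows are copied from the sample data in the question; the last two rows and the function name monthly_revenue are hypothetical, added only so that more than one group appears.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal

# (event_time, price) pairs. The first three come from the question's sample;
# the last two are hypothetical rows added to show several groups.
rows = [
    ("2019-10-01 00:00:00 UTC", "2.62"),
    ("2019-10-01 00:00:03 UTC", "2.62"),
    ("2019-10-01 00:00:07 UTC", "13.48"),
    ("2019-11-02 10:15:00 UTC", "4.75"),   # hypothetical
    ("2020-10-05 08:00:00 UTC", "0.56"),   # hypothetical
]

def monthly_revenue(rows):
    """Sum price per 'yyyy-MM' key, like GROUP BY date_format(event_time, 'yyyy-MM')."""
    totals = defaultdict(Decimal)  # Decimal() starts each group at 0
    for event_time, price in rows:
        ts = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S UTC")
        totals[ts.strftime("%Y-%m")] += Decimal(price)
    return dict(sorted(totals.items()))

result = monthly_revenue(rows)
for month, revenue in result.items():
    print(month, revenue)
```

With a key of only 'MM', the October 2019 and October 2020 rows would fall into the same group.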

Related

Time Series Downsampling/Upsampling

I am trying to downsample and upsample time series data on MonetDB.
Time series database systems (TSDS) usually have an option to make the downsampling and upsampling with an operator like SAMPLE BY (1h).
My time series data looks like the following:
sql>select * from datapoints limit 5;
+----------------------------+------------+--------------------------+--------------------------+--------------------------+--------------------------+--------------------------+
| time | id_station | temperature | discharge | ph | oxygen | oxygen_saturation |
+============================+============+==========================+==========================+==========================+==========================+==========================+
| 2019-03-01 00:00:00.000000 | 0 | 407.052 | 0.954 | 7.79 | 12.14 | 12.14 |
| 2019-03-01 00:00:10.000000 | 0 | 407.052 | 0.954 | 7.79 | 12.13 | 12.13 |
| 2019-03-01 00:00:20.000000 | 0 | 407.051 | 0.954 | 7.79 | 12.13 | 12.13 |
| 2019-03-01 00:00:30.000000 | 0 | 407.051 | 0.953 | 7.79 | 12.12 | 12.12 |
| 2019-03-01 00:00:40.000000 | 0 | 407.051 | 0.952 | 7.78 | 12.12 | 12.12 |
+----------------------------+------------+--------------------------+--------------------------+--------------------------+--------------------------+--------------------------+
I tried the following query but the results are obtained by aggregating all the values from different days, which is not what I am looking for:
sql>SELECT EXTRACT(HOUR FROM time) AS "hour",
AVG(pH) AS avg_ph
FROM datapoints
GROUP BY "hour";
| hour | avg_ph |
+======+==========================+
| 0 | 8.041121283524923 |
| 1 | 8.041086970785418 |
| 2 | 8.041152801724111 |
| 3 | 8.04107828783526 |
| 4 | 8.041060110153223 |
| 5 | 8.041167286877407 |
| ... | ... |
| 23 | 8.041219444444451 |
I tried then to aggregate the time series data first based on the day then on the hour:
SELECT EXTRACT(DATE FROM time) AS "day", EXTRACT(HOUR FROM time) AS "hour",
AVG(pH) AS avg_ph
FROM datapoints
GROUP BY "day", "hour";
But I am getting the following exception:
syntax error, unexpected sqlDATE in: "select extract(date"
My question: how could I aggregate/downsample the data to a specific period of time (e.g. obtain an aggregated value every 2 days or 12 hours)?
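
One common way to reason about this kind of downsampling is to floor each timestamp to the start of a fixed-width bucket and then group by that bucket start. The sketch below shows only the arithmetic, in Python rather than MonetDB SQL. The sample values, the choice of 1970-01-01 as the bucket origin, and the names bucket_start and downsample are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed origin from which buckets are counted

def bucket_start(ts, width):
    """Floor ts to the start of the fixed-width bucket that contains it."""
    return EPOCH + ((ts - EPOCH) // width) * width

def downsample(points, width):
    """Average the values that fall into each bucket of the given width."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in points:
        key = bucket_start(ts, width)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sorted(sums)}

# Hypothetical (time, ph) readings, not taken from the question.
points = [
    (datetime(2019, 3, 1, 0, 0, 0), 7.0),
    (datetime(2019, 3, 1, 11, 59, 50), 9.0),
    (datetime(2019, 3, 1, 12, 0, 0), 8.0),
    (datetime(2019, 3, 2, 0, 0, 10), 6.0),
]

averages = downsample(points, timedelta(hours=12))
for start, avg_ph in averages.items():
    print(start, avg_ph)
```

Because the bucket key keeps the full date, readings from different days are no longer merged, and changing the width to timedelta(days=2) gives two-day buckets.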

SQL (Redshift) get start and end values for consecutive data in a given column

I have a table that has the subscription state of users on any given day. The data looks like this
+------------+------------+--------------+
| account_id | date | current_plan |
+------------+------------+--------------+
| 1 | 2019-08-01 | free |
| 1 | 2019-08-02 | free |
| 1 | 2019-08-03 | yearly |
| 1 | 2019-08-04 | yearly |
| 1 | 2019-08-05 | yearly |
| ... | | |
| 1 | 2020-08-02 | yearly |
| 1 | 2020-08-03 | free |
| 2 | 2019-08-01 | monthly |
| 2 | 2019-08-02 | monthly |
| ... | | |
| 2 | 2019-08-31 | monthly |
| 2 | 2019-09-01 | free |
| ... | | |
| 2 | 2019-11-26 | free |
| 2 | 2019-11-27 | monthly |
| ... | | |
| 2 | 2019-12-27 | monthly |
| 2 | 2019-12-28 | free |
+------------+------------+--------------+
I would like to have a table that gives the start and end dates of each subscription. It would look something like this:
+------------+------------+------------+-------------------+
| account_id | start_date | end_date | subscription_type |
+------------+------------+------------+-------------------+
| 1 | 2019-08-03 | 2020-08-02 | yearly |
| 2 | 2019-08-01 | 2019-08-31 | monthly |
| 2 | 2019-11-27 | 2019-12-27 | monthly |
+------------+------------+------------+-------------------+
I started by using a LAG window function with a bunch of WHERE conditions to grab the "state changes", but this makes it difficult to see when customers float in and out of subscriptions, and I'm not sure this is the best method.
with lag as (
select *, LAG(current_plan) OVER (PARTITION BY account_id ORDER BY date ASC) AS previous_plan
, LAG(date) OVER (PARTITION BY account_id ORDER BY date ASC) AS previous_plan_date
from data
)
SELECT *
FROM lag
where (current_plan = 'free' and previous_plan in ('monthly', 'yearly'))
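This is a gaps-and-islands problem. I think a difference of row numbers works: within a run of consecutive rows that share the same plan, both row numbers advance together, so their difference stays constant and identifies the run.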
select account_id, min(date) as start_date, max(date) as end_date, current_plan as subscription_type
from (select d.*,
row_number() over (partition by account_id order by date) as seqnum,
row_number() over (partition by account_id, current_plan order by date) as seqnum_2
from data d
) d
where current_plan <> 'free'
group by account_id, current_plan, (seqnum - seqnum_2);
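
The difference-of-row-numbers idea can be traced by hand in a few lines of Python. This is a sketch of the technique only, not Redshift code. The rows are a shortened, partly hypothetical version of the table in the question (one row per account per day is assumed), and subscription_periods is an illustrative name.

```python
from collections import defaultdict
from datetime import date

# (account_id, date, current_plan); shortened and partly hypothetical sample.
rows = [
    (1, date(2019, 8, 1), "free"),
    (1, date(2019, 8, 2), "free"),
    (1, date(2019, 8, 3), "yearly"),
    (1, date(2019, 8, 4), "yearly"),
    (1, date(2019, 8, 5), "free"),
    (1, date(2019, 8, 6), "yearly"),
    (2, date(2019, 8, 1), "monthly"),
    (2, date(2019, 8, 2), "monthly"),
]

def subscription_periods(rows):
    seqnum = defaultdict(int)    # row_number() partitioned by account_id
    seqnum_2 = defaultdict(int)  # row_number() partitioned by account_id, current_plan
    islands = {}
    for account_id, day, plan in sorted(rows):  # ordered by account, then date
        seqnum[account_id] += 1
        seqnum_2[(account_id, plan)] += 1
        if plan == "free":
            continue  # corresponds to: where current_plan <> 'free'
        # The difference is constant inside one uninterrupted run of a plan.
        key = (account_id, plan, seqnum[account_id] - seqnum_2[(account_id, plan)])
        start, end = islands.get(key, (day, day))
        islands[key] = (min(start, day), max(end, day))
    return sorted((a, s, e, p) for (a, p, _), (s, e) in islands.items())

periods = subscription_periods(rows)
for period in periods:
    print(period)
```

Account 1 yields two separate yearly periods because the free day in between changes the difference between the two counters.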

How to Do Data-Grouping in BigQuery?

I have a dataset that needs to be grouped. I have done this successfully in R, but now I have to do it in BigQuery. The data is shown in the following table:
| category | sub_category | date | day | timestamp | type | cpc | gmv |
|---------- |-------------- |----------- |----- |------------- |------ |------ |--------- |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:37:36 PM | BI | 1.94 | 252,293 |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:37:39 PM | RT | 1.94 | 252,293 |
| ABC | ABC-1 | 2/17/2020 | Mon | 11:38:29 PM | RT | 1.58 | 205,041 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:14 AM | BI | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:18 AM | RT | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:05:52 AM | RT | 1.6 | 208,397 |
| ABC | ABC-1 | 2/18/2020 | Tue | 12:06:33 AM | BI | 1.55 | 201,354 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:55:47 PM | PP | 1 | 129,282 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:56:23 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:19 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:57:34 PM | PP | 0.98 | 126,928 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:58:46 PM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:27 PM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 11:59:51 PM | RT | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:00:57 AM | BI | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:01:11 AM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:03:01 AM | PP | 0.89 | 116,168 |
| XYZ | XYZ-1 | 2/17/2020 | Mon | 12:12:42 AM | RT | 1.19 | 154,886 |
I want to group the rows. A row whose timestamp is at most 8 minutes before the next row's timestamp should be merged with it into a single row, as in the example output below:
| category | sub_category | date | day | time | start_timestamp | end_timestamp | type | cpc | gmv |
|---------- |-------------- |----------------------- |--------- |---------- |--------------------- |--------------------- |---------- |------ |--------- |
| ABC | ABC-1 | 2/17/2020 | Mon | 23:37:36 | (02/17/20 23:37:36) | (02/17/20 23:38:29) | BI|RT | 1.82 | 236,542 |
| ABC | ABC-1 | 2/18/2020 | Tue | 0:05:14 | (02/18/20 00:05:14) | (02/18/20 00:06:33) | BI|RT | 1.59 | 206,636 |
| XYZ | XYZ-1 | 02/17/2020|02/18/2020 | Mon|Tue | 0:06:21 | (02/17/20 23:55:47) | (02/18/20 00:12:42) | PP|RT|BI | 0.95 | 123,815 |
The newly generated fields are defined as follows:
| fields | definition |
|----------------- |-------------------------------------------------------- |
| day | Day of the row (combination if there's different days) |
| time | Start of timestamp |
| start_timestamp | Start timestamp of the first row in group |
| end_timestamp | Start timestamp of the last row in group |
| type | Type of Row (combination if there's different types) |
| cpc | Average CPC of the Group |
| gmv | Average GMV of the Group |
Could anyone help me to make the query as per above requirements?
Thank you
This is a gaps-and-islands problem. Here is a solution that uses lag() and a cumulative sum() to define groups of adjacent records whose gap is at most 8 minutes; the rest is aggregation.
select
category,
sub_category,
string_agg(distinct day, '|' order by dt) day,
min(dt) start_dt,
max(dt) end_dt,
string_agg(distinct type, '|' order by dt) type,
avg(cpc) cpc,
avg(gmv) gmv
from (
select
t.*,
sum(case when dt <= datetime_add(lag_dt, interval 8 minute) then 0 else 1 end)
over(partition by category, sub_category order by dt) grp
from (
select
t.*,
lag(dt) over(partition by category, sub_category order by dt) lag_dt
from (
select t.*, datetime(date, timestamp) dt
from mytable t
) t
) t
) t
group by category, sub_category, grp
Note that you should not store the date and time parts of your timestamps in separate columns: this makes the logic more complicated when you need to combine them (I added another level of nesting to avoid repeated conversions, which would have obfuscated the code).
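
The lag-plus-running-count logic can be checked on the ABC rows from the question with a short Python sketch. This is an illustration of the technique rather than BigQuery code. It assumes that date and time have already been combined into one datetime, it leaves out the gmv and day columns for brevity, and group_by_gap is a hypothetical name.

```python
from datetime import datetime, timedelta
from itertools import groupby

# (category, sub_category, dt, type, cpc) for the ABC rows in the question.
rows = [
    ("ABC", "ABC-1", datetime(2020, 2, 17, 23, 37, 36), "BI", 1.94),
    ("ABC", "ABC-1", datetime(2020, 2, 17, 23, 37, 39), "RT", 1.94),
    ("ABC", "ABC-1", datetime(2020, 2, 17, 23, 38, 29), "RT", 1.58),
    ("ABC", "ABC-1", datetime(2020, 2, 18, 0, 5, 14), "BI", 1.6),
    ("ABC", "ABC-1", datetime(2020, 2, 18, 0, 5, 18), "RT", 1.6),
    ("ABC", "ABC-1", datetime(2020, 2, 18, 0, 5, 52), "RT", 1.6),
    ("ABC", "ABC-1", datetime(2020, 2, 18, 0, 6, 33), "BI", 1.55),
]

def group_by_gap(rows, max_gap=timedelta(minutes=8)):
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    result = []
    for (category, sub_category), part in groupby(ordered, key=lambda r: (r[0], r[1])):
        islands = []   # each island is a list of adjacent rows
        lag_dt = None  # plays the role of lag(dt)
        for row in part:
            dt = row[2]
            # A new island starts on the first row or when the gap exceeds max_gap.
            if lag_dt is None or dt > lag_dt + max_gap:
                islands.append([])
            islands[-1].append(row)
            lag_dt = dt
        for island in islands:
            result.append({
                "category": category,
                "sub_category": sub_category,
                "start_dt": island[0][2],
                "end_dt": island[-1][2],
                # distinct types, in order of first appearance
                "type": "|".join(dict.fromkeys(r[3] for r in island)),
                "cpc": sum(r[4] for r in island) / len(island),
            })
    return result

groups = group_by_gap(rows)
for g in groups:
    print(g["start_dt"], g["end_dt"], g["type"], round(g["cpc"], 2))
```

The seven rows collapse into two groups, matching the two ABC rows of the expected output in the question.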

How to show all date from a certain Month?

I have a database table in PostgreSQL named "time" that looks like this:
| Name | | StartDate | | EndDate |
----------------------------------------
| Oct-18 | | 2018-10-01| | 2018-10-31|
| Nov-18 | | 2018-11-01| | 2018-11-30|
| Dec-18 | | 2018-12-01| | 2018-12-31|
I want the result for each month like
| Date | | Name |
-------------------------
| 2018-10-01| | Oct-18 |
| 2018-10-02| | Oct-18 |
| 2018-10-03| | Oct-18 |
| 2018-10-04| | Oct-18 |
| 2018-10-05| | Oct-18 |
| 2018-10-06| | Oct-18 |
.....
| 2018-10-31| | Oct-18 |
I think generate_series() does what you want:
select generate_series(t.start_date, t.end_date, interval '1 day') as date, name
from t;
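
For readers who want to see what the expansion produces, here is a rough Python equivalent of generating one row per day between a start date and an end date. It is a sketch of the idea, not PostgreSQL code; the tuples mirror the table in the question, and expand_days is an illustrative name.

```python
from datetime import date, timedelta

# (name, start_date, end_date), mirroring the table in the question.
periods = [
    ("Oct-18", date(2018, 10, 1), date(2018, 10, 31)),
    ("Nov-18", date(2018, 11, 1), date(2018, 11, 30)),
    ("Dec-18", date(2018, 12, 1), date(2018, 12, 31)),
]

def expand_days(periods):
    """Yield (day, name) for every day from start_date to end_date inclusive."""
    for name, start, end in periods:
        for offset in range((end - start).days + 1):
            yield start + timedelta(days=offset), name

days = list(expand_days(periods))
for day, name in days[:3]:
    print(day, name)
```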

Aggregate data from days into a month

I have data that is presented by day and I want to aggregate it into a monthly report. The data looks like this.
INVOICE_DATE GROSS_REVENUE NET_REVENUE
2018-06-28 ,1623.99 ,659.72
2018-06-27 ,112414.65 ,38108.13
2018-06-26 ,2518.74 ,1047.14
2018-06-25 ,475805.92 ,172193.58
2018-06-22 ,1151.79 ,478.96
How do I go about creating a report where it gives me the total gross revenue and net revenue for the month of June, July, August etc where the data is reported by the day?
So far this is what I have
SELECT invoice_date,
SUM(gross_revenue) AS gross_revenue,
SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY invoice_date
I would simply group by year and month.
SELECT year(invoice_date) AS invoice_year,
month(invoice_date) AS invoice_month,
SUM(gross_revenue) AS gross_revenue,
SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY year(invoice_date), month(invoice_date)
Since I don't know if you have access to the year and month functions, another solution would be to cast the date as a varchar and group by the left-most 7 characters (year+month):
SELECT left(cast(invoice_date as varchar(50)),7) AS invoice_date,
SUM(gross_revenue) AS gross_revenue,
SUM(net_revenue) AS net_revenue
FROM wc_revenue GROUP BY left(cast(invoice_date as varchar(50)),7)
You could try a ROLLUP. Sample illustration below:
Table data:
mysql> select * from wc_revenue;
+--------------+---------------+-------------+
| invoice_date | gross_revenue | net_revenue |
+--------------+---------------+-------------+
| 2018-06-28 | 1623.99 | 659.72 |
| 2018-06-27 | 112414.65 | 38108.13 |
| 2018-06-26 | 2518.74 | 1047.14 |
| 2018-06-25 | 475805.92 | 172193.58 |
| 2018-06-22 | 1151.79 | 478.96 |
| 2018-07-02 | 150.00 | 100.00 |
| 2018-07-05 | 350.00 | 250.00 |
| 2018-08-07 | 600.00 | 400.00 |
| 2018-08-09 | 900.00 | 600.00 |
+--------------+---------------+-------------+
mysql> SELECT month(invoice_date) as MTH, invoice_date, SUM(gross_revenue) AS gross_revenue, SUM(net_revenue) AS net_revenue
FROM wc_revenue
GROUP BY MTH, invoice_date WITH ROLLUP;
+------+--------------+---------------+-------------+
| MTH | invoice_date | gross_revenue | net_revenue |
+------+--------------+---------------+-------------+
| 6 | 2018-06-22 | 1151.79 | 478.96 |
| 6 | 2018-06-25 | 475805.92 | 172193.58 |
| 6 | 2018-06-26 | 2518.74 | 1047.14 |
| 6 | 2018-06-27 | 112414.65 | 38108.13 |
| 6 | 2018-06-28 | 1623.99 | 659.72 |
| 6 | NULL | 593515.09 | 212487.53 |
| 7 | 2018-07-02 | 150.00 | 100.00 |
| 7 | 2018-07-05 | 350.00 | 250.00 |
| 7 | NULL | 500.00 | 350.00 |
| 8 | 2018-08-07 | 600.00 | 400.00 |
| 8 | 2018-08-09 | 900.00 | 600.00 |
| 8 | NULL | 1500.00 | 1000.00 |
| NULL | NULL | 595515.09 | 213837.53 |
+------+--------------+---------------+-------------+
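
The monthly subtotals and the grand total from the rollup output can be reproduced with a small Python sketch. It is an illustration only, not MySQL code. It uses the nine rows of the sample table, keys each month by the first seven characters of the date (year and month), and uses the hypothetical names monthly_rollup and "TOTAL".

```python
from collections import defaultdict
from decimal import Decimal

# (invoice_date, gross_revenue, net_revenue) from the sample table above.
rows = [
    ("2018-06-28", "1623.99", "659.72"),
    ("2018-06-27", "112414.65", "38108.13"),
    ("2018-06-26", "2518.74", "1047.14"),
    ("2018-06-25", "475805.92", "172193.58"),
    ("2018-06-22", "1151.79", "478.96"),
    ("2018-07-02", "150.00", "100.00"),
    ("2018-07-05", "350.00", "250.00"),
    ("2018-08-07", "600.00", "400.00"),
    ("2018-08-09", "900.00", "600.00"),
]

def monthly_rollup(rows):
    """Return one (month, gross, net) row per month plus a final total row."""
    subtotals = defaultdict(lambda: [Decimal(0), Decimal(0)])
    for invoice_date, gross, net in rows:
        month = invoice_date[:7]  # 'yyyy-MM'
        subtotals[month][0] += Decimal(gross)
        subtotals[month][1] += Decimal(net)
    report = [(month, g, n) for month, (g, n) in sorted(subtotals.items())]
    total = ("TOTAL", sum(g for _, g, _ in report), sum(n for _, _, n in report))
    return report + [total]

report = monthly_rollup(rows)
for line in report:
    print(*line)
```

The subtotal rows correspond to the rows with a NULL invoice_date in the rollup output, and the last row corresponds to the row where both columns are NULL.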