Select most popular hour per country based on number of sales - SQL

I want to get the most popular hour for each country, based on the max value of count(id), which tells how many purchases were made.
I've tried getting the max value of purchases after truncating the timestamp to hours, but the query returns every hour for every country, when I want only a single hour (the one with the most purchases) per country.
The table is like:
id | country | time
---+---------+---------------
 1 | AE      | 19:20:00.00000
 1 | AE      | 20:13:00.00000
 3 | GB      | 23:17:00.00000
 4 | IN      | 10:23:00.00000
 6 | IN      | 02:01:00.00000
 7 | RU      | 05:54:00.00000
 2 | RU      | 16:34:00.00000
SELECT max(purchases), country, tss
FROM (
SELECT time_trunc(time, hour) AS tss,
count(id) as purchases,
country
FROM spending
WHERE dt > date_sub(current_date(), interval 30 DAY)
GROUP BY tss, country
)
GROUP BY tss, country
Expected output:
amount of purchases | Country | Most popular Hour
34 | GB | 16:00
445 | US | 21:00

You can use window functions along with GROUP BY. Notice that it uses the RANK function, so, for example, if one particular country has the same amount of sales at 11AM and 2PM, it'll return both hours for that country.
WITH cte AS (
SELECT country
, time_trunc(time, hour) AS hourofday
, COUNT(id) AS purchases
, RANK() OVER(PARTITION BY country ORDER BY COUNT(id) DESC) AS rnk
FROM t
GROUP BY country, time_trunc(time, hour)
)
SELECT *
FROM cte
WHERE rnk = 1
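If you want exactly one row per country even when two hours tie, swap RANK for ROW_NUMBER and add a tiebreaker to the ORDER BY. Since the time_trunc/date_sub syntax in the question suggests BigQuery, here is a sketch using its QUALIFY clause (assumption: BigQuery Standard SQL; ties resolve to the earliest hour):
SELECT country
, time_trunc(time, hour) AS hourofday
, COUNT(id) AS purchases
FROM t
GROUP BY country, hourofday
QUALIFY ROW_NUMBER() OVER(PARTITION BY country ORDER BY COUNT(id) DESC, hourofday) = 1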

How to aggregate over date including all prior dates

I am working with a table in Databricks Delta Lake. It gets new records appended every month. The field insert_dt indicates when the records were inserted.
| ID | Mrc | insert_dt |
|----|-----|------------|
| 1 | 40 | 2022-01-01 |
| 2 | 30 | 2022-01-01 |
| 3 | 50 | 2022-01-01 |
| 4 | 20 | 2022-02-01 |
| 5 | 45 | 2022-02-01 |
| 6 | 55 | 2022-03-01 |
Now I want to aggregate by insert_dt and calculate the average of Mrc. For each date, the average is taken not just over the records of that date but over all records with a date up to and including it. In this example, that means 3 rows for 2022-01-01, 5 rows for 2022-02-01 and 6 rows for 2022-03-01. The expected results would look like this:
| Mrc | insert_dt |
|-----|------------|
| 40 | 2022-01-01 |
| 37 | 2022-02-01 |
| 40 | 2022-03-01 |
How do I write a query to do that?
I checked the documentation for Databricks Delta Lake (https://docs.databricks.com/sql/language-manual/sql-ref-window-functions.html) and it looks like T-SQL, so I think this will work for you, though you may need to tweak it slightly.
The approach is to condense each day to a single point and then use window functions to get the running totals. Note that any given day may have a different count, so you can't just average the averages.
--Enter the sample data you gave as a CTE for testing
;with cteSample as (
SELECT * FROM ( VALUES
(1, 40, CONVERT(date,'2022-01-01'))
, (2, 30, CONVERT(date,'2022-01-01'))
, (3, 50, CONVERT(date,'2022-01-01'))
, (4, 20, CONVERT(date,'2022-02-01'))
, (5, 45, CONVERT(date,'2022-02-01'))
, (6, 55, CONVERT(date,'2022-03-01'))
) as TabA(ID, Mrc, insert_dt)
)--Solution begins here, find the total and count for each date
--because window can only handle a single "last row"
, cteGrouped as (
SELECT insert_dt, SUM(Mrc) as MRCSum, COUNT(*) as MRCCount
FROM cteSample
GROUP BY insert_dt
)--Now use the window function to get the totals "up to today"
, cteTotals as (
SELECT insert_dt
, SUM(MRCSum) OVER (ORDER BY insert_dt RANGE
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcSum
, SUM(MRCCount) OVER (ORDER BY insert_dt RANGE
BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcCount
FROM cteGrouped as G
) --Now divide out to get the average to date
SELECT insert_dt, MrcSum/MrcCount as MRCAverage
FROM cteTotals as T
This gives the following output:
| insert_dt  | MRCAverage |
|------------|------------|
| 2022-01-01 | 40         |
| 2022-02-01 | 37         |
| 2022-03-01 | 40         |
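The same running totals can also be written more compactly by nesting the grouped aggregates inside window functions, so the sums, counts and the division happen in one statement. A sketch of this variant (the_table stands for your source table; the 1.0 * avoids integer division):
SELECT insert_dt
, 1.0 * SUM(SUM(Mrc)) OVER (ORDER BY insert_dt)
      / SUM(COUNT(*)) OVER (ORDER BY insert_dt) AS MRCAverage
FROM the_table
GROUP BY insert_dt
ORDER BY insert_dt;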
Calculate a running average using a window function (the inner subquery) and then pick only one row per insert_dt - the one with the highest id. I only tested this on PostgreSQL 13, so I'm not sure how much of the SQL standard Delta Lake supports and whether it will work there.
select mrc, insert_dt from
(
select avg(mrc) over (order by insert_dt, id) mrc, insert_dt,
row_number() over (partition by insert_dt order by id desc) rn
from the_table
) t
where rn = 1
order by insert_dt;
Update: if the_table has no id column, then use a CTE to add one.
with t_id as (select *, row_number() over (order by insert_dt) id from the_table)
select mrc, insert_dt from
(
select avg(mrc) over (order by insert_dt, id) mrc, insert_dt,
row_number() over (partition by insert_dt order by id desc) rn
from t_id
) t
where rn = 1
order by insert_dt;

Start SUM aggregation at a certain threshold in BigQuery

The energy usage of a device is logged hourly:
+--------------+-----------+-----------------------+
| energy_usage | device_id | timestamp |
+--------------+-----------+-----------------------+
| 10 | 1 | 2019-02-12T01:00:00 |
| 16 | 2 | 2019-02-12T01:00:00 |
| 26 | 1 | 2019-03-12T02:00:00 |
| 24 | 2 | 2019-03-12T02:00:00 |
+--------------+-----------+-----------------------+
My goal is:
Create two columns, one for energy_usage_day (8am-8pm) and another for energy_usage_night (8pm-8am)
Create a monthly aggregate, group by device_id and sum up the energy usage
So the result might look like this:
+--------------+------------------+--------------------+-----------+---------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+---------+------+
| 80 | 30 | 50 | 1 | 2 | 2019 |
| 130 | 60 | 70 | 2 | 3 | 2019 |
+--------------+------------------+--------------------+-----------+---------+------+
The following query produces these results:
SELECT SUM(energy_usage) energy_usage
, SUM(IF(EXTRACT(HOUR FROM timestamp) BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_day
, SUM(IF(EXTRACT(HOUR FROM timestamp) NOT BETWEEN 8 AND 19, energy_usage, 0)) energy_usage_night
, device_id
, EXTRACT(MONTH FROM timestamp) month, EXTRACT(YEAR FROM timestamp) year
FROM `data`
GROUP BY device_id, month, year
Say I am only interested in energy usage aggregates above a certain threshold, e.g. 50. I want to start the SUM at a total energy usage of 50. The result should look like this:
+--------------+------------------+--------------------+-----------+---------+------+
| energy_usage | energy_usage_day | energy_usage_night | device_id | month | year |
+--------------+------------------+--------------------+-----------+---------+------+
| 30 | 10 | 20 | 1 | 2 | 2019 |
| 80 | 50 | 30 | 2 | 3 | 2019 |
+--------------+------------------+--------------------+-----------+---------+------+
In other words: the query should start summing up energy_usage, energy_usage_day and energy_usage_night only when energy_usage reaches the threshold of 50.
Is this possible in BigQuery?
Below is for BigQuery Standard SQL; the logic is that it starts aggregating usage ONLY after it reaches 50 (per device, per month).
#standardSQL
WITH temp AS (
SELECT *, SUM(energy_usage) OVER(win) > 50 qualified,
EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 20 day_hour,
EXTRACT(MONTH FROM `timestamp`) month,
EXTRACT(YEAR FROM `timestamp`) year
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
)
SELECT SUM(energy_usage) energy_usage,
SUM(IF(day_hour, energy_usage, 0)) energy_usage_day,
SUM(IF(NOT day_hour, energy_usage, 0)) energy_usage_night,
device_id,
month,
year
FROM temp
WHERE qualified
GROUP BY device_id, month, year
Say the current SUM of usage is 49 and the next usage entry has a value of 2. The SUM will be 51, so the full usage of 2 will be added. Instead, only the 1 that exceeds the threshold should've been added. Can we solve such a problem in BigQuery SQL?
#standardSQL
WITH temp AS (
SELECT *, SUM(energy_usage) OVER(win) > 50 qualified,
SUM(energy_usage) OVER(win) - 50 rolling_sum,
EXTRACT(HOUR FROM `timestamp`) BETWEEN 8 AND 20 day_hour,
EXTRACT(MONTH FROM `timestamp`) month,
EXTRACT(YEAR FROM `timestamp`) year
FROM `project.dataset.table`
WINDOW win AS (PARTITION BY device_id, TIMESTAMP_TRUNC(`timestamp`, MONTH) ORDER BY `timestamp`)
), temp_with_adjustments AS (
SELECT *,
IF(
ROW_NUMBER() OVER(PARTITION BY device_id, month, year ORDER BY `timestamp`) = 1,
rolling_sum,
energy_usage
) AS adjusted_energy_usage
FROM temp
WHERE qualified
)
SELECT SUM(adjusted_energy_usage) energy_usage,
SUM(IF(day_hour, adjusted_energy_usage, 0)) energy_usage_day,
SUM(IF(NOT day_hour, adjusted_energy_usage, 0)) energy_usage_night,
device_id,
month,
year
FROM temp_with_adjustments
GROUP BY device_id, month, year
As you can see, I've just added logic for temp_with_adjustments (and rolling_sum in the temp CTE to support it) - the rest is the same.
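For what it's worth, the first-row adjustment can also be folded into a single expression: once the running sum has crossed the threshold, each row contributes at most its own usage, capped by how far the running sum sits above 50. A self-contained sketch with made-up sample rows (the inline data and the LEAST formulation are illustrations, not part of the answer above; in the real query the window would keep the PARTITION BY device and month):
#standardSQL
WITH sample AS (
SELECT * FROM UNNEST([
STRUCT(10 AS energy_usage, TIMESTAMP '2019-02-12 01:00:00' AS ts)
, (15, TIMESTAMP '2019-02-12 02:00:00')
, (24, TIMESTAMP '2019-02-12 03:00:00')
, (2, TIMESTAMP '2019-02-12 04:00:00')
, (5, TIMESTAMP '2019-02-12 05:00:00')
])
)
SELECT ts, energy_usage
, SUM(energy_usage) OVER(ORDER BY ts) AS rolling_sum
-- rows before the threshold contribute 0; the crossing row contributes
-- only the part above 50; every later row counts in full
, IF(SUM(energy_usage) OVER(ORDER BY ts) > 50
, LEAST(energy_usage, SUM(energy_usage) OVER(ORDER BY ts) - 50)
, 0) AS adjusted_energy_usage
FROM sample
ORDER BY ts
The 49 + 2 example from the question then yields an adjusted usage of 1 for the crossing row, as desired.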

Get sum over last entries per day per article

Let's say there is a table structured like this:
ID | article_id | article_count | created_at
---|------------------------------------------
1 | 1 | 10 | 2019-03-20T18:20:03.685059Z
2 | 1 | 22 | 2019-03-20T19:20:03.685059Z
3 | 2 | 32 | 2019-03-20T18:20:03.685059Z
4 | 2 | 20 | 2019-03-20T19:20:03.685059Z
5 | 1 | 3 | 2019-03-21T18:20:03.685059Z
6 | 1 | 15 | 2019-03-21T19:20:03.685059Z
7 | 2 | 3 | 2019-03-21T18:20:03.685059Z
8 | 2 | 30 | 2019-03-21T19:20:03.685059Z
The goal now is to sum the article_count over all article_ids, taking only the last entry per day for each article, and return this total per day. So in the case above I'd like to get a result showing:
total | date
--------|------------
42 | 2019-03-20
45 | 2019-03-21
So far, I tried something like:
SELECT SUM(article_count), DATE_TRUNC('day', created_at)
FROM myTable
WHERE created_at IN
(
SELECT DISTINCT ON (a.created_at::date, article_id::int) created_at
FROM myTable a
ORDER BY created_at::date DESC, article_id, created_at DESC
)
GROUP BY DATE_TRUNC('day', created_at)
In the distinct query I tried to pull only the latest entries per day per article_id and then match on created_at to sum up all the article_count values.
This does not work - it still outputs the sum of the whole day instead of summing only the last entries.
Besides that, I am quite sure there is a more elegant way than the WHERE condition.
Thanks in advance (as well as for any explanation).
I think you just want to filter down to the last entry per day for each article:
SELECT DATE_TRUNC('day', created_at), SUM(article_count)
FROM (SELECT DISTINCT ON (a.created_at::date, article_id::int) a.*
FROM myTable a
ORDER BY article_id, created_at::date DESC, created_at DESC
) a
GROUP BY DATE_TRUNC('day', created_at);
You are looking for a ranking function (ROW_NUMBER here):
WITH cte AS (
    SELECT article_id,
           article_count,
           DATE_TRUNC('day', created_at) AS some_date,
           ROW_NUMBER() OVER (
               PARTITION BY article_id, DATE_TRUNC('day', created_at)
               ORDER BY created_at DESC) AS n
    FROM mytable)
SELECT SUM(article_count) AS total,
       some_date
FROM cte
WHERE n = 1
GROUP BY some_date
Then sum only the first (latest) row per day and article.
Check it at https://rextester.com/INODNS67085

Calculate 7, 14 and 30 day moving average in bigquery

I am playing around with BigQuery. I have IoT uptime recordings as input:
+---------------+-------------+----------+------------+
| device_id | reference | uptime | timestamp |
+---------------+-------------+----------+------------+
| 1 | 1000-5 | 0.7 | 2019-02-12 |
| 2 | 1000-6 | 0.9 | 2019-02-12 |
| 1 | 1000-5 | 0.8 | 2019-02-11 |
| 2 | 1000-6 | 0.95 | 2019-02-11 |
+---------------+-------------+----------+------------+
I want to calculate the 7, 14 and 30 day moving average of the uptime grouped by device. The output should look as follows:
+---------------+-------------+---------+--------+--------+
| device_id | reference | avg_7 | avg_14 | avg_30 |
+---------------+-------------+---------+--------+--------+
| 1 | 1000-5 | 0.7 | .. | .. |
| 2 | 1000-6 | 0.9 | .. | .. |
+---------------+-------------+---------+--------+--------+
What I have tried:
SELECT
device_id,
AVG(uptime) OVER (ORDER BY day RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
FROM (
SELECT device_id, uptime, UNIX_DATE(DATE(timestamp)) as day FROM `uptime_recordings`
)
GROUP BY device_id, uptime, day
I have recordings for 1000 distinct devices and 200k readings. The grouping does not work and the query returns 200k records instead of 1000. Any ideas what's wrong?
Instead of GROUP BY device_id, uptime, day, use GROUP BY device_id, day.
A full working query:
WITH data
AS (
SELECT title device_id, views uptime, datehour timestamp
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-09'
AND wiki='br'
AND title='Chile'
)
SELECT device_id, day
, AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
FROM (
SELECT device_id, AVG(uptime) uptime, (DATE(timestamp)) as day
FROM `data`
GROUP BY device_id, day
)
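The question also asks for 14- and 30-day averages; with the same per-day pre-aggregation those are just wider frames. A sketch against the uptime_recordings table from the question:
SELECT device_id, day
, AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
, AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 13 PRECEDING AND CURRENT ROW) AS avg_14d
, AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_30d
FROM (
SELECT device_id, AVG(uptime) uptime, DATE(timestamp) AS day
FROM `uptime_recordings`
GROUP BY device_id, day
)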
Edit: as requested in the comments, though I'm not sure what the goal of averaging all of the 7-day averages is:
WITH data
AS (
SELECT title device_id, views uptime, datehour timestamp
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-09'
AND wiki='br'
AND title IN ('Chile', 'Saozneg')
)
SELECT device_id, AVG(avg_7d) avg_avg_7d
FROM (
SELECT device_id, day
, AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
FROM (
SELECT device_id, AVG(uptime) uptime, (DATE(timestamp)) as day
FROM `data`
GROUP BY device_id, day
)
)
GROUP BY device_id

Find max, min, avg, percentile of count(*) per mmdd PostgreSQL

Postgres version 9.4.18, PostGIS Version 2.2.
Here are the tables I'm working with (I am unlikely to be able to make significant changes to the table structure):
Table ltg_data (spans 1988 to 2018):
Column | Type | Modifiers
----------+--------------------------+-----------
intensity | integer | not null
time | timestamp with time zone | not null
lon | numeric(9,6) | not null
lat | numeric(8,6) | not null
ltg_geom | geometry(Point,4269) |
Indexes:
"ltg_data2_ltg_geom_idx" gist (ltg_geom)
"ltg_data2_time_idx" btree ("time")
Size of ltg_data (~800M rows):
ltg=# select pg_relation_size('ltg_data');
pg_relation_size
------------------
149729288192
Table counties:
Column | Type | Modifiers
-----------+-----------------------------+----------------------------------------------------------
gid | integer | not null default
nextval('counties_gid_seq'::regclass)
objectid_1 | integer |
objectid | integer |
state | character varying(2) |
cwa | character varying(9) |
countyname | character varying(24) |
fips | character varying(5) |
time_zone | character varying(2) |
fe_area | character varying(2) |
lon | double precision |
lat | double precision |
the_geom | geometry(MultiPolygon,4269) |
Indexes:
"counties_pkey" PRIMARY KEY, btree (gid)
"counties_gix" gist (the_geom)
"county_cwa_idx" btree (cwa)
"countyname_cwa_idx" btree (countyname)
I have a query that calculates the total number of rows per day of the year (month-day) spanning the 30 years. With the help of Stack Overflow, the query to get these counts is working fine. Here's the query and results, using the following function.
Function:
CREATE FUNCTION f_mmdd(date) RETURNS int LANGUAGE sql IMMUTABLE AS
$$SELECT to_char($1, 'MMDD')::int$$;
Query:
SELECT d.mmdd, COALESCE(ct.ct, 0) AS total_count
FROM (
SELECT f_mmdd(d::date) AS mmdd -- ignoring the year
FROM generate_series(timestamp '2018-01-01' -- any dummy year
, timestamp '2018-12-31'
, interval '1 day') d
) d
LEFT JOIN (
SELECT f_mmdd(time::date) AS mmdd, count(*) AS ct
FROM counties c
JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY 1
) ct USING (mmdd)
ORDER BY 1;
Results:
mmdd | total_count
-----|------------
 725 | 2126
 726 | 558
 727 | 2
 728 | 2
 729 | 2
 730 | 0
 731 | 0
 801 | 0
 802 | 10
Desired results: I'm trying to find other statistical information about the counts for the days of the year. For instance, I know that on July 25 (725 in the table below) the total count over the many years in the table is 2126. What I'm looking for, per day of the year, is the max daily count, the min, the percent of years where count(*) is not zero, and percentiles (10th, 25th, 50th, 75th and 90th); stdev would be useful too. It would also be good to see in what year the max_daily occurred. I guess if there haven't been any counts for that day in any of the years, the year_max_daily would be blank or zero.
mmdd | total_count | max_daily | year_max_daily | percent_years_count_not_zero | 10th_percentile_daily | 90th_percentile_daily
-----|-------------|-----------|----------------|------------------------------|-----------------------|----------------------
 725 | 2126        | 1000      | 1990           | 30                           | 15                    | 900
 726 | 558         | 120       | 1992           | 20                           | 10                    | 80
 727 | 2           | 1         | 1991           | 2                            | 0                     | 1
 728 | 2           | 1         | 1990           | 2                            | 0                     | 1
 729 | 2           | 1         | 1989           | 2                            | 0                     | 1
 730 | 0           | 0         |                | 0                            | 0                     | 0
 731 | 0           | 0         |                | 0                            | 0                     | 0
 801 | 0           | 0         |                | 0                            | 0                     | 0
 802 | 10          | 10        | 1990           | 0                            | 1                     | 8
What I've tried thus far just isn't working. It returns the same results as the totals. I think it's because I'm just trying to get an avg after the totals have already been calculated, so I'm not really looking at the counts for each day of each year and finding the average.
Attempt:
SELECT AVG(CAST(total_count as FLOAT)), day
FROM
(
SELECT d.mmdd as day, COALESCE(ct.ct, 0) as total_count
FROM (
SELECT f_mmdd(d::date) AS mmdd
FROM generate_series(timestamp '2018-01-01', timestamp '2018-12-31', interval '1 day') d
) d
LEFT JOIN (
SELECT mmdd, avg(q.ct) FROM (
SELECT f_mmdd((time at time zone 'utc+12')::date) as mmdd, count(*) as ct
FROM counties c
JOIN ltg_data d on ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY 1
)
) as q
ct USING (mmdd)
ORDER BY 1
Thanks for any help!
I haven't included calculations for all the requested stats - there is too much in one question - but I hope you'll be able to extend the query below and add the extra stats you need.
I'm using CTEs below to make the query readable. If you want, you can put it all in one huge query. I'd recommend running the query step by step, CTE by CTE, and examining intermediate results to understand how it works.
CTE_Dates is a simple list of all possible dates for 30 years.
CTE_DailyCounts is a list of basic counts for each day for 30 years (I took your existing query for that).
CTE_FullStats is again a list of all dates together with some stats calculated for each (month, day) using window functions partitioned by month and day. ROW_NUMBER there is used to find, for each (month, day), the year in which the daily count was largest.
The final query selects only the row with the largest count for each (month, day), along with the rest of the information.
I didn't try to run the query, because the question doesn't have sample data, so there may be some typos.
WITH
CTE_Dates
AS
(
SELECT
d::date AS dt
,EXTRACT(MONTH FROM d::date) AS dtMonth
,EXTRACT(DAY FROM d::date) AS dtDay
,EXTRACT(YEAR FROM d::date) AS dtYear
FROM
generate_series(timestamp '1988-01-01', timestamp '2018-12-31', interval '1 day') AS d
-- full range of possible dates
)
,CTE_DailyCounts
AS
(
SELECT
time::date AS dt
,count(*) AS ct
FROM
counties c
INNER JOIN ltg_data d ON ST_contains(c.the_geom, d.ltg_geom)
WHERE cwa = 'MFR'
GROUP BY time::date
)
,CTE_FullStats
AS
(
SELECT
CTE_Dates.dt
,CTE_Dates.dtMonth
,CTE_Dates.dtDay
,CTE_Dates.dtYear
,CTE_DailyCounts.ct
,SUM(COALESCE(CTE_DailyCounts.ct, 0)) OVER (PARTITION BY dtMonth, dtDay) AS total_count
,MAX(COALESCE(CTE_DailyCounts.ct, 0)) OVER (PARTITION BY dtMonth, dtDay) AS max_daily
,SUM(CASE WHEN CTE_DailyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY dtMonth, dtDay) AS nonzero_day_count
,COUNT(*) OVER (PARTITION BY dtMonth, dtDay) AS years_count
,100.0 * SUM(CASE WHEN CTE_DailyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY dtMonth, dtDay)
/ COUNT(*) OVER (PARTITION BY dtMonth, dtDay) AS percent_years_count_not_zero
-- COALESCE matters here: NULLs sort first under DESC in PostgreSQL,
-- so days missing from CTE_DailyCounts must count as 0
,ROW_NUMBER() OVER (PARTITION BY dtMonth, dtDay ORDER BY COALESCE(CTE_DailyCounts.ct, 0) DESC) AS rn
FROM
CTE_Dates
LEFT JOIN CTE_DailyCounts ON CTE_DailyCounts.dt = CTE_Dates.dt
)
SELECT
dtMonth
,dtDay
,total_count
,max_daily
,dtYear AS year_max_daily
,percent_years_count_not_zero
FROM
CTE_FullStats
WHERE
rn = 1
ORDER BY
dtMonth
,dtDay
;
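A sketch (untested, same tables and WHERE cwa = 'MFR' filter as above) of the percentile and stdev columns the question asks for, using the ordered-set aggregates available since PostgreSQL 9.4; this could equally be joined back into the query above as another CTE:
SELECT
    EXTRACT(MONTH FROM d.dt) AS dtMonth
    ,EXTRACT(DAY FROM d.dt) AS dtDay
    ,percentile_cont(0.10) WITHIN GROUP (ORDER BY COALESCE(dc.ct, 0)) AS pct10_daily
    ,percentile_cont(0.50) WITHIN GROUP (ORDER BY COALESCE(dc.ct, 0)) AS median_daily
    ,percentile_cont(0.90) WITHIN GROUP (ORDER BY COALESCE(dc.ct, 0)) AS pct90_daily
    ,stddev_samp(COALESCE(dc.ct, 0)) AS stdev_daily
FROM generate_series(timestamp '1988-01-01', timestamp '2018-12-31', interval '1 day') AS d(dt)
LEFT JOIN (
    -- per-day counts; missing days become NULL via the LEFT JOIN
    -- and are coalesced to 0 so they pull the percentiles down
    SELECT time::date AS dt, count(*) AS ct
    FROM counties c
    INNER JOIN ltg_data l ON ST_contains(c.the_geom, l.ltg_geom)
    WHERE c.cwa = 'MFR'
    GROUP BY time::date
) dc ON dc.dt = d.dt::date
GROUP BY 1, 2
ORDER BY 1, 2;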