Get sum over last entries per day per article - SQL

Let's say there is a table structured like this:
ID | article_id | article_count | created_at
---|------------|---------------|----------------------------
 1 | 1          | 10            | 2019-03-20T18:20:03.685059Z
 2 | 1          | 22            | 2019-03-20T19:20:03.685059Z
 3 | 2          | 32            | 2019-03-20T18:20:03.685059Z
 4 | 2          | 20            | 2019-03-20T19:20:03.685059Z
 5 | 1          | 3             | 2019-03-21T18:20:03.685059Z
 6 | 1          | 15            | 2019-03-21T19:20:03.685059Z
 7 | 2          | 3             | 2019-03-21T18:20:03.685059Z
 8 | 2          | 30            | 2019-03-21T19:20:03.685059Z
The goal now is to sum the article_count over all article_ids, taking only the last entry per day per article, and return this total count per day. So in the case above I'd like to get a result showing:
total | date
------|------------
   42 | 2019-03-20
   45 | 2019-03-21
So far, I tried something like:
SELECT SUM(article_count), DATE_TRUNC('day', created_at)
FROM myTable
WHERE created_at IN
    (
        SELECT DISTINCT ON (a.created_at::date, article_id::int) created_at
        FROM myTable a
        ORDER BY created_at::date DESC, article_id, created_at DESC
    )
GROUP BY DATE_TRUNC('day', created_at)
In the DISTINCT query I tried to pull only the latest entries per day per article_id and then match on created_at to sum up all the article_count values.
This does not work - it still outputs the sum of the whole day instead of summing only over the last entries.
Besides that, I am quite sure there is a more elegant way than the WHERE condition.
Thanks in advance (as well as for any explanation).

I think you just want to filter down to the last entry per day for each article:
SELECT DATE_TRUNC('day', created_at), SUM(article_count)
FROM (SELECT DISTINCT ON (a.created_at::date, article_id::int) a.*
      FROM myTable a
      ORDER BY article_id, created_at::date DESC, created_at DESC
     ) a
GROUP BY DATE_TRUNC('day', created_at);
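Applied to the sample data, the DISTINCT ON subquery keeps rows 2, 4, 6 and 8 (the last entry per article per day), so the outer query returns 42 for 2019-03-20 and 45 for 2019-03-21, matching the expected output.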

You are looking for a ranking (window) function:
WITH cte AS (
    SELECT article_id,
           article_count,
           DATE_TRUNC('day', created_at) AS some_date,
           ROW_NUMBER() OVER (
               PARTITION BY article_id, DATE_TRUNC('day', created_at)
               ORDER BY created_at DESC
           ) AS n
    FROM mytable
)
SELECT SUM(article_count) AS total,
       some_date
FROM cte
WHERE n = 1
GROUP BY some_date
This just sums the first (latest) entry of each day per article.
Check it at https://rextester.com/INODNS67085

Related

How to aggregate over date including all prior dates

I am working with a table in Databricks Delta Lake. It gets new records appended every month. The field insert_dt indicates when the records were inserted.
| ID | Mrc | insert_dt |
|----|-----|------------|
| 1 | 40 | 2022-01-01 |
| 2 | 30 | 2022-01-01 |
| 3 | 50 | 2022-01-01 |
| 4 | 20 | 2022-02-01 |
| 5 | 45 | 2022-02-01 |
| 6 | 55 | 2022-03-01 |
Now I want to aggregate by insert_dt and calculate the average of Mrc. For each date, the average is taken not just over the records of that date but over all records with dates up to and including that date. In this example, that is 3 rows for 2022-01-01, 5 rows for 2022-02-01 and 6 rows for 2022-03-01. The expected results would look like this:
| Mrc | insert_dt |
|-----|------------|
| 40 | 2022-01-01 |
| 37 | 2022-02-01 |
| 40 | 2022-03-01 |
How do I write a query to do that?
I checked the documentation for Databricks Delta Lake (https://docs.databricks.com/sql/language-manual/sql-ref-window-functions.html) and its window functions look like T-SQL, so I think this will work for you, but you may need to tweak it slightly.
The approach is to condense each day to a single point and then use window functions to get the running totals. Note that any given day may have a different count, so you can't just average the averages.
--Enter the sample data you gave as a CTE for testing
;WITH cteSample AS (
    SELECT * FROM (VALUES
          (1, 40, CONVERT(date, '2022-01-01'))
        , (2, 30, '2022-01-01')
        , (3, 50, '2022-01-01')
        , (4, 20, '2022-02-01')
        , (5, 45, '2022-02-01')
        , (6, 55, '2022-03-01')
    ) AS TabA(ID, Mrc, insert_dt)
)
--Solution begins here: find the total and count for each date,
--because the window can only handle a single "last row"
, cteGrouped AS (
    SELECT insert_dt, SUM(Mrc) AS MRCSum, COUNT(*) AS MRCCount
    FROM cteSample
    GROUP BY insert_dt
)
--Now use the window function to get the totals "up to today"
, cteTotals AS (
    SELECT insert_dt
         , SUM(MRCSum)   OVER (ORDER BY insert_dt RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcSum
         , SUM(MRCCount) OVER (ORDER BY insert_dt RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS MrcCount
    FROM cteGrouped AS G
)
--Now divide out to get the average to date
SELECT insert_dt, MrcSum / MrcCount AS MRCAverage
FROM cteTotals AS T
This gives the following output:
| insert_dt  | MRCAverage |
|------------|------------|
| 2022-01-01 | 40         |
| 2022-02-01 | 37         |
| 2022-03-01 | 40         |
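Note (not part of the original answer): if Mrc is an integer column, MrcSum / MrcCount is integer division in T-SQL. It happens to be exact for this sample, but for fractional averages you would cast first, replacing the final SELECT above with something like:
--a sketch: cast one operand so the division is done in decimal
SELECT insert_dt, CAST(MrcSum AS decimal(10, 2)) / MrcCount AS MRCAverage
FROM cteTotals AS T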
Calculate a running average using a window function (the inner subquery) and then pick only one row per insert_dt: the one with the highest id. I only tested this on PostgreSQL 13, so I am not sure how much of the SQL standard Delta Lake supports and whether it will work there.
select mrc, insert_dt
from (
    select avg(mrc) over (order by insert_dt, id) as mrc, insert_dt,
           row_number() over (partition by insert_dt order by id desc) as rn
    from the_table
) t
where rn = 1
order by insert_dt;
DB-fiddle demo
Update: if the_table has no id column, then use a CTE to add one.
with t_id as (
    select *, row_number() over (order by insert_dt) as id
    from the_table
)
select mrc, insert_dt
from (
    select avg(mrc) over (order by insert_dt, id) as mrc, insert_dt,
           row_number() over (partition by insert_dt order by id desc) as rn
    from t_id
) t
where rn = 1
order by insert_dt;
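A shorter variant of the same idea (a sketch, assuming the engine supports standard RANGE frames): with a RANGE frame the window includes all peer rows that share the current insert_dt, so no id column is needed and DISTINCT collapses each date to one row.
select distinct
       avg(mrc) over (order by insert_dt
                      range between unbounded preceding and current row) as mrc,
       insert_dt
from the_table
order by insert_dt;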

Calculate 7 Day Retention with SQL

Given the following tables,
users                             page_views
+--------------+-----------+      +---------+-----------+
| user_id      | varchar   |<--+  | pv_id   | varchar   |
| reg_ts       | timestamp |   |  | pv_ts   | timestamp |
| reg_device   | varchar   |   +--| user_id | varchar   |
| mktg_channel | varchar   |      | url     | varchar   |
+--------------+-----------+      | device  | varchar   |
                                  +---------+-----------+
Table "users" has one row per registered user.
Table "page_views" has one row per page view event.
What % of users who first visited on a given day came back again 1 week later?
I'm currently using SQLite and created a sample database, but my output is off...
Below is what I have so far:
-- day 1 active users
SELECT *
FROM page_views
LEFT JOIN page_views AS future_page_views
    ON page_views.user_id = future_page_views.user_id
    AND page_views.pv_ts = future_page_views.pv_ts - datetime(future_page_views.pv_ts, '+7 day')

-- day 7 retained users
SELECT
    future_page_views.pv_ts,
    COUNT(DISTINCT page_views.user_id) AS active_users,
    COUNT(DISTINCT future_page_views.user_id) AS retained_users,
    CAST(COUNT(DISTINCT future_page_views.user_id) / COUNT(DISTINCT page_views.user_id) AS float) AS retention
FROM page_views
LEFT JOIN page_views AS future_page_views
    ON page_views.user_id = future_page_views.user_id
    AND page_views.pv_ts = future_page_views.pv_ts - datetime(page_views.pv_ts, '+7 day')
GROUP BY 1
Not sure if I should use the strftime function (as a DATEDIFF substitute) in this instance to capture the 7 days. Open to any suggestions and feedback, thanks in advance.
EDIT:
Sample data below. Based on that data set, I expect only user_id 8 to show up as 7-day retained (first day 2020-01-02, last day 2020-01-09).
Desired Output:
- User_ID
- p.pv_ts as First_Day
- f.pv_ts as Last_Day
- Retention Days (i.e. 1, 2, 3, 4, 5 days...)
- % of users who visited and came back on day 7
You can look at just the first two page visits per user and then aggregate. This gives:
select user_id, min(pv_ts) as first_ts,
       nullif(max(pv_ts), min(pv_ts)) as second_ts
from (select pv.*,
             row_number() over (partition by user_id order by pv_ts) as seqnum
      from page_views pv
     ) pv
where seqnum <= 2
group by user_id;
Then to get the totals:
select count(*),
       sum(case when second_ts < datetime(first_ts, '+7 day') then 1 else 0 end)
from (select user_id, min(pv_ts) as first_ts,
             nullif(max(pv_ts), min(pv_ts)) as second_ts
      from (select pv.*,
                   row_number() over (partition by user_id order by pv_ts) as seqnum
            from page_views pv
           ) pv
      where seqnum <= 2
      group by user_id
     ) u;
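To get the percentage the question asks for, you could divide the two aggregates in the same query (a sketch on top of the subquery above; the 100.0 factor forces floating-point division in SQLite):
select 100.0 * sum(case when second_ts < datetime(first_ts, '+7 day') then 1 else 0 end)
       / count(*) as pct_retained_7d
from (select user_id, min(pv_ts) as first_ts,
             nullif(max(pv_ts), min(pv_ts)) as second_ts
      from (select pv.*,
                   row_number() over (partition by user_id order by pv_ts) as seqnum
            from page_views pv
           ) pv
      where seqnum <= 2
      group by user_id
     ) u;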

Select most popular hour per country based on number of sales

I want to get the most popular hour for each country based on max value of count(id) which tells how many purchases were made.
I've tried getting the max value of purchases and converted the timestamp into hours, but it always returns each hour for each country when I want only a single hour (the one with most purchases) per country.
The table is like:
id | country | time
---|---------|----------------
 1 | AE      | 19:20:00.00000
 1 | AE      | 20:13:00.00000
 3 | GB      | 23:17:00.00000
 4 | IN      | 10:23:00.00000
 6 | IN      | 02:01:00.00000
 7 | RU      | 05:54:00.00000
 2 | RU      | 16:34:00.00000
SELECT max(purchases), country, tss
FROM (
    SELECT time_trunc(time, hour) AS tss,
           count(id) AS purchases,
           country
    FROM spending
    WHERE dt > date_sub(current_date(), interval 30 DAY)
    GROUP BY tss, country
)
GROUP BY tss, country
Expected output:
amount of purchases | Country | Most popular Hour
--------------------|---------|-------------------
                 34 | GB      | 16:00
                445 | US      | 21:00
You can use window functions along with GROUP BY. Notice that it uses the RANK function, so, for example, if one particular country has the same number of sales at 11 AM and 2 PM, it will return both hours for that country.
WITH cte AS (
    SELECT country
         , time_trunc(time, hour) AS hourofday
         , COUNT(id) AS purchases
         , RANK() OVER (PARTITION BY country ORDER BY COUNT(id) DESC) AS rnk
    FROM t
    GROUP BY country, time_trunc(time, hour)
)
SELECT *
FROM cte
WHERE rnk = 1
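If you want exactly one row per country even when two hours tie, a ROW_NUMBER() variant of the same CTE would do (a sketch; without an extra tie-breaker in the ORDER BY the choice between tied hours is arbitrary):
WITH cte AS (
    SELECT country
         , time_trunc(time, hour) AS hourofday
         , COUNT(id) AS purchases
         , ROW_NUMBER() OVER (PARTITION BY country ORDER BY COUNT(id) DESC) AS rn
    FROM t
    GROUP BY country, time_trunc(time, hour)
)
SELECT purchases, country, hourofday
FROM cte
WHERE rn = 1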

Calculate 7, 14 and 30 day moving average in bigquery

I am playing around with BigQuery. I have IoT uptime recordings as input:
+---------------+-------------+----------+------------+
| device_id | reference | uptime | timestamp |
+---------------+-------------+----------+------------+
| 1 | 1000-5 | 0.7 | 2019-02-12 |
| 2 | 1000-6 | 0.9 | 2019-02-12 |
| 1 | 1000-5 | 0.8 | 2019-02-11 |
| 2 | 1000-6 | 0.95 | 2019-02-11 |
+---------------+-------------+----------+------------+
I want to calculate the 7, 14 and 30 day moving average of the uptime grouped by device. The output should look as follows:
+---------------+-------------+---------+--------+--------+
| device_id | reference | avg_7 | avg_14 | avg_30 |
+---------------+-------------+---------+--------+--------+
| 1 | 1000-5 | 0.7 | .. | .. |
| 2 | 1000-6 | 0.9 | .. | .. |
+---------------+-------------+---------+--------+--------+
What I have tried:
SELECT
    device_id,
    AVG(uptime) OVER (ORDER BY day RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
FROM (
    SELECT device_id, uptime, UNIX_DATE(DATE(timestamp)) AS day
    FROM `uptime_recordings`
)
GROUP BY device_id, uptime, day
I have recordings for 1000 distinct devices and 200k readings. The grouping does not work and the query returns 200k records instead of 1000. Any ideas what's wrong?
Instead of GROUP BY device_id, uptime, day do GROUP BY device_id, day.
A full working query:
WITH data AS (
    SELECT title device_id, views uptime, datehour timestamp
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
    WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-09'
      AND wiki = 'br'
      AND title = 'Chile'
)
SELECT device_id, day
     , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
FROM (
    SELECT device_id, AVG(uptime) uptime, DATE(timestamp) AS day
    FROM `data`
    GROUP BY device_id, day
)
Edit: As requested in the comments (I'm not sure what the goal of averaging all of the 7-day averages is, though):
WITH data AS (
    SELECT title device_id, views uptime, datehour timestamp
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
    WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-09'
      AND wiki = 'br'
      AND title IN ('Chile', 'Saozneg')
)
SELECT device_id, AVG(avg_7d) AS avg_avg_7d
FROM (
    SELECT device_id, day
         , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
    FROM (
        SELECT device_id, AVG(uptime) uptime, DATE(timestamp) AS day
        FROM `data`
        GROUP BY device_id, day
    )
)
GROUP BY device_id
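The question also asks for 14- and 30-day averages; those follow the same pattern, just with wider frames (a sketch against the original uptime_recordings table, after condensing to one averaged reading per device per day):
SELECT device_id, day
     , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 6  PRECEDING AND CURRENT ROW) AS avg_7d
     , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 13 PRECEDING AND CURRENT ROW) AS avg_14d
     , AVG(uptime) OVER (PARTITION BY device_id ORDER BY UNIX_DATE(day) RANGE BETWEEN 29 PRECEDING AND CURRENT ROW) AS avg_30d
FROM (
    SELECT device_id, AVG(uptime) AS uptime, DATE(timestamp) AS day
    FROM `uptime_recordings`
    GROUP BY device_id, day
)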

Get the id based on condition in GROUP BY

I'm trying to create an SQL query to merge rows that have equal dates. The idea is to do this based on the highest number of hours, so that in the end I get the corresponding id for each date with the highest number of hours. I've been trying to do it with a simple GROUP BY, but that does not seem to work, since I can't just put an aggregate function on the id column; it has to be chosen based on the hours condition.
+----+------------+-------+
| id | date       | hours |
+----+------------+-------+
| 1  | 2012-01-01 | 37    |
| 2  | 2012-01-01 | 10    |
| 3  | 2012-01-01 | 5     |
| 4  | 2012-01-02 | 37    |
+----+------------+-------+
Desired result:
+----+------------+-------+
| id | date       | hours |
+----+------------+-------+
| 1  | 2012-01-01 | 37    |
| 4  | 2012-01-02 | 37    |
+----+------------+-------+
If you want exactly one row -- even if there are ties -- then use row_number():
select t.*
from (select t.*, row_number() over (partition by date order by hours desc) as seqnum
      from t
     ) t
where seqnum = 1;
Ironically, both Postgres and Oracle (the original tags) have what I would consider to be better ways of doing this, but they are quite different.
Postgres:
select distinct on (date) t.*
from t
order by date, hours desc;
Oracle:
select date, max(hours) as hours,
       max(id) keep (dense_rank first order by hours desc) as id
from t
group by date;
Here's one approach using row_number:
select id, dt, hours
from (
    select id, dt, hours, row_number() over (partition by dt order by hours desc) as rn
    from yourtable
) t
where rn = 1
You can use a correlated subquery approach:
select t.*
from table t
where id = (select t1.id
            from table t1
            where t1.date = t.date
            order by t1.hours desc
            limit 1);
In Oracle you can use FETCH FIRST 1 ROW ONLY in the subquery instead of the LIMIT clause, as sketched below.
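A minimal sketch of that Oracle variant (yourtable and event_date are placeholder names, since date itself is a reserved word in Oracle and could not be used unquoted as a column name):
-- Oracle: FETCH FIRST replaces LIMIT; yourtable / event_date are placeholder names
select t.*
from yourtable t
where t.id = (select t1.id
              from yourtable t1
              where t1.event_date = t.event_date
              order by t1.hours desc
              fetch first 1 row only);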