Postgres insert function by aggregating data from multiple tables - sql

I have got two tables, 'page_visits' and 'comments', which store new webpage visits and new comments, respectively.
PAGE_VISITS
| id | page_id | created_at               |
|----|---------|--------------------------|
| 1  | 1111    | 2021-12-02T04:55:26.779Z |
| 2  | 1442    | 2021-12-02T02:25:32.219Z |
| 3  | 1111    | 2021-12-02T04:55:26.214Z |
COMMENTS
| id | page_id | ... | created_at               |
|----|---------|-----|--------------------------|
| 1  | 1024    | ... | 2021-12-02T04:55:26.779Z |
| 2  | 1111    | ... | 2021-12-02T02:25:32.219Z |
| 3  | 3849    | ... | 2021-12-02T04:55:26.214Z |
I want to aggregate the data from both tables for the past hour to use for analytics, so that it looks like the table below.
PAGE_DATA
| page_id | visit_count | comment_count | created_at               |
|---------|-------------|---------------|--------------------------|
| 1024    | 14          | 3             | 2021-12-02T04:55:26.779Z |
| 1111    | 11          | 8             | 2021-12-02T02:25:32.219Z |
| 3849    | 1           | 0             | 2021-12-02T04:55:26.214Z |
| 2412    | 0           | 1             | 2021-12-02T04:55:26.779Z |
SELECT page_visits.page_id
     , COUNT(page_visits.id) AS visitCount
     , COALESCE(cmts.cmt_cnt, 0) AS commentCount
FROM page_visits
LEFT OUTER JOIN (
        SELECT page_id
             , COUNT(*) AS cmt_cnt
        FROM comments
        WHERE created_at >= NOW() - INTERVAL '1 HOUR'
        GROUP BY page_id
     ) AS cmts
  ON cmts.page_id = page_visits.page_id
WHERE page_visits.created_at >= NOW() - INTERVAL '1 HOUR'
GROUP BY page_visits.page_id, cmts.cmt_cnt;
I have the above query so far. It returns rows where comment_count is NULL (pages with visits but no comments), but it never returns a row where visit_count would be 0 and comment_count > 0.
My first question is: how do I get a row even when visit_count is 0? Someone could have visited the page in an earlier hour and only commented later.
Secondly, I am trying to run this every hour using pg_cron. I know I can call a function directly from the cron scheduler, but I have not managed to turn the query above into a working Postgres function that inserts a new row into the 'page_data' table each time it is called.
Could someone help me out with these two issues? Thank you.

Consider a FULL JOIN of two aggregates:
SELECT COALESCE(vsts.page_id, cmts.page_id) AS page_id
     , COALESCE(vsts.vst_cnt, 0) AS visitCount
     , COALESCE(cmts.cmt_cnt, 0) AS commentCount
FROM (
        SELECT page_id
             , COUNT(*) AS vst_cnt
        FROM page_visits
        WHERE created_at >= NOW() - INTERVAL '1 HOUR'
        GROUP BY page_id
     ) AS vsts
FULL OUTER JOIN (
        SELECT page_id
             , COUNT(*) AS cmt_cnt
        FROM comments
        WHERE created_at >= NOW() - INTERVAL '1 HOUR'
        GROUP BY page_id
     ) AS cmts
  ON cmts.page_id = vsts.page_id;
Note the COALESCE on page_id as well: after a FULL JOIN either side can be NULL, so you cannot select page_id from just one of the two subqueries.
Alternatively, aggregate a UNION query of both tables:
SELECT page_id
     , SUM(vst_n) AS vst_cnt
     , SUM(cmt_n) AS cmt_cnt
FROM (
        SELECT page_id, 1 AS vst_n, 0 AS cmt_n
        FROM page_visits
        WHERE created_at >= NOW() - INTERVAL '1 HOUR'
        UNION ALL
        SELECT page_id, 0 AS vst_n, 1 AS cmt_n
        FROM comments
        WHERE created_at >= NOW() - INTERVAL '1 HOUR'
     ) AS sub
GROUP BY page_id;
Regarding the last question: if I understand you correctly, simply run an INSERT ... SELECT using the query above. I am not sure how you aggregated created_at, but you can add a MIN or MAX to the aggregations above and include it as an additional column:
INSERT INTO page_data (page_id, visit_count, comment_count)
SELECT ...above query...
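Putting that together, one way to wrap it for pg_cron looks like this. This is only a sketch: it assumes page_data has exactly the four columns shown in the question, and the function name and job name here are arbitrary.

```sql
-- Sketch: assumes page_data(page_id, visit_count, comment_count, created_at).
CREATE OR REPLACE FUNCTION insert_page_data()
RETURNS void
LANGUAGE sql
AS $$
INSERT INTO page_data (page_id, visit_count, comment_count, created_at)
SELECT COALESCE(vsts.page_id, cmts.page_id)   -- FULL JOIN: either side may be NULL
     , COALESCE(vsts.vst_cnt, 0)
     , COALESCE(cmts.cmt_cnt, 0)
     , NOW()                                  -- timestamp of this aggregation run
FROM (
        SELECT page_id, COUNT(*) AS vst_cnt
        FROM page_visits
        WHERE created_at >= NOW() - INTERVAL '1 HOUR'
        GROUP BY page_id
     ) AS vsts
FULL OUTER JOIN (
        SELECT page_id, COUNT(*) AS cmt_cnt
        FROM comments
        WHERE created_at >= NOW() - INTERVAL '1 HOUR'
        GROUP BY page_id
     ) AS cmts ON cmts.page_id = vsts.page_id;
$$;

-- Schedule it with pg_cron at the top of every hour (the job name is arbitrary):
SELECT cron.schedule('insert-page-data', '0 * * * *', 'SELECT insert_page_data()');
```

cron.schedule takes a job name, a standard cron expression, and the SQL to run, so the function gets called once per hour and appends one row per active page.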

Related

PostgreSQL - Get count of items in a table grouped by a datetime column for N intervals

I have a User table, where there are the following fields.
| id | created_at | username |
I want to filter this table so that I can get the number of users who have been created in a datetime range, separated into N intervals. e.g. for users having created_at in between 2019-01-01T00:00:00 and 2019-01-02T00:00:00 separated into 2 intervals, I will get something like this.
_______________________________
| dt | count |
-------------------------------
| 2019-01-01T00:00:00 | 6 |
| 2019-01-01T12:00:00 | 7 |
-------------------------------
Is it possible to do so in one hit? I am currently using my Django ORM to create N date ranges and then making N queries, which isn't very efficient.
Generate the times you want and then use left join and aggregation:
select gs.ts, count(u.id)
from generate_series('2019-01-01T00:00:00'::timestamp,
                     '2019-01-01T12:00:00'::timestamp,
                     interval '12 hour'
     ) gs(ts)
left join users u
       on u.created_at >= gs.ts
      and u.created_at < gs.ts + interval '12 hour'
group by 1
order by 1;
EDIT:
If you want to specify the number of rows, you can use something similar:
select v.ts, count(u.id)
from generate_series(1, 10, 1) as gs(n) cross join lateral
     (values ('2019-01-01T00:00:00'::timestamp + (gs.n - 1) * interval '12 hour')
     ) v(ts)
left join users u
       on u.created_at >= v.ts
      and u.created_at < v.ts + interval '12 hour'
group by 1
order by 1;
In Postgres, there is a dedicated function for this (several overloaded variants, really): width_bucket().
One additional difficulty: it does not work on type timestamp directly. But you can work with extracted epoch values like this:
WITH cte(min_ts, max_ts, buckets) AS ( -- interval and nr of buckets here
SELECT timestamp '2019-01-01T00:00:00'
, timestamp '2019-01-02T00:00:00'
, 2
)
SELECT width_bucket(extract(epoch FROM t.created_at)
, extract(epoch FROM c.min_ts)
, extract(epoch FROM c.max_ts)
, c.buckets) AS bucket
, count(*) AS ct
FROM tbl t
JOIN cte c ON t.created_at >= min_ts -- incl. lower
AND t.created_at < max_ts -- excl. upper
GROUP BY 1
ORDER BY 1;
Empty buckets (intervals with no rows in them) are not returned at all. Your comment seems to suggest you want those included.
Notably, this accesses the table once - as requested and as opposed to generating intervals first and then joining to the table (repeatedly).
See:
How to reduce result rows of SQL query equally in full range?
Aggregating (x,y) coordinate point clouds in PostgreSQL
That does not yet include effective bounds, just bucket numbers. Actual bounds can be added cheaply:
WITH cte(min_ts, max_ts, buckets) AS ( -- interval and nr of buckets here
SELECT timestamp '2019-01-01T00:00:00'
, timestamp '2019-01-02T00:00:00'
, 2
)
SELECT b.*
, min_ts + ((c.max_ts - c.min_ts) / c.buckets) * (bucket-1) AS lower_bound
FROM (
SELECT width_bucket(extract(epoch FROM t.created_at)
, extract(epoch FROM c.min_ts)
, extract(epoch FROM c.max_ts)
, c.buckets) AS bucket
, count(*) AS ct
FROM tbl t
JOIN cte c ON t.created_at >= min_ts -- incl. lower
AND t.created_at < max_ts -- excl. upper
GROUP BY 1
ORDER BY 1
) b, cte c;
Now you only change input values in the CTE to adjust results.

Subtraction of counts of 2 tables

I have 2 different tables, A and B. A records created items and B records removed items.
I want to obtain the net difference of the counts per week in an SQL query.
Currently I have
SELECT DATE_TRUNC('week', timestamp AT TIME ZONE '+08') AS Week,
       COUNT(id) AS "A - New"
FROM table_name.A
GROUP BY 1
ORDER BY 1
This gets me the count per week for table A only. How could I incorporate the logic of subtracting the same COUNT(id) from B, for the same timeframe?
Thanks! :)
The potential issue here is that for any given week you might have only additions or only removals, so to align the counts from the 2 tables by week, an approach would be a full outer join, like this:
SELECT COALESCE(a.week, b.week) AS week
     , count_a
     , count_b
     , COALESCE(count_a, 0) - COALESCE(count_b, 0) AS net
FROM (
        SELECT DATE_TRUNC('week', timestamp AT TIME ZONE '+08') AS week
             , COUNT(*) AS count_a
        FROM table_a
        GROUP BY DATE_TRUNC('week', timestamp AT TIME ZONE '+08')
     ) a
FULL OUTER JOIN (
        SELECT DATE_TRUNC('week', timestamp AT TIME ZONE '+08') AS week
             , COUNT(*) AS count_b
        FROM table_b
        GROUP BY DATE_TRUNC('week', timestamp AT TIME ZONE '+08')
     ) b ON a.week = b.week
The usual syntax for subtracting values from 2 queries is as follows (note: FROM dual is Oracle syntax, not Postgres):
SELECT (Query1) - (Query2) FROM dual;
Assuming both tables have the same number of ids in the 'id' column and your given query works for table A, the following query will subtract the COUNT(id) of both tables:
select (SELECT DATE_TRUNC('week', timestamp AT TIME ZONE '+08') AS Week,
               COUNT(id) AS "A - New"
        FROM table_name.A GROUP BY 1 ORDER BY 1)
     - (SELECT DATE_TRUNC('week', timestamp AT TIME ZONE '+08') AS Week,
               COUNT(id) AS "B - New"
        FROM table_name.B GROUP BY 1 ORDER BY 1)
from dual
Or you can also try the following approach:
Select c1 - c2 from (Query1 count() as c1), (Query2 count() as c2);
So your query will be like:
Select c1 - c2
from (SELECT DATE_TRUNC('week', timestamp AT TIME ZONE '+08') AS Week, COUNT(id) AS c1
      FROM table_name.A GROUP BY 1 ORDER BY 1),
     (SELECT DATE_TRUNC('week', timestamp AT TIME ZONE '+08') AS Week, COUNT(id) AS c2
      FROM table_name.B GROUP BY 1 ORDER BY 1);

Optimizing Max Value query

I wanted to ask for advice on how to optimize my query; I hope to make it run faster, as the current speed hurts the UX.
My program collects data every hour, and I want to optimize the query that takes the latest data and builds the top 100 people for a specific event:
SELECT a.user_id AS user, nickname, value, s.created_on
FROM stats s, accounts a
WHERE a.user_id = s.user_id
  AND event_id = 1
  AND s.created_on IN
      (SELECT created_on FROM stats WHERE created_on >= NOW() - '1 hour'::INTERVAL)
ORDER BY value DESC
LIMIT 100
The query returns the top 100 from the last hour for event_id = 1, but I wish to optimize it, and I believe the subquery is the root cause of the problem. Other queries I have tried either produce duplicates or do not draw from the latest dataset.
Thank you
EDIT:
The accounts table contains [user_id, nickname];
the stats table contains [user_id, event_id, value, created_on].
NOW() - '1 hour'::INTERVAL is not MySQL syntax; perhaps you meant NOW() - INTERVAL 1 HOUR?
IN ( SELECT ... ) optimizes very poorly.
Not knowing the relationship between accounts and stats (1:1, 1:many, etc), I can only guess at what might work:
SELECT a.user_id AS user, nickname, value, s.created_on
FROM stats s, accounts a
WHERE a.user_id = s.user_id
  AND event_id = 1
  AND s.created_on >= NOW() - INTERVAL 1 HOUR
ORDER BY value DESC
LIMIT 100
INDEX(event_id, value) -- if they are both in `a`
INDEX(user_id, created_on)
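Written out as DDL, those hints would look like the following. This is a sketch: the index names are arbitrary, and the indexes belong on whichever table actually holds those columns (shown here on stats, per the question's schema).

```sql
-- Covering index for the event filter plus the sort column
CREATE INDEX idx_stats_event_value ON stats (event_id, value);
-- Index for the per-user recency check
CREATE INDEX idx_stats_user_created ON stats (user_id, created_on);
```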
or...
SELECT a.user_id AS user, nickname, value,
       ( SELECT MAX(created_on) FROM stats
         WHERE user_id = a.user_id ) AS created_on
FROM accounts a
WHERE event_id = 1
  AND EXISTS
      ( SELECT *
        FROM stats
        WHERE created_on >= NOW() - INTERVAL 1 HOUR
          AND user_id = a.user_id
      )
ORDER BY value DESC
LIMIT 100
INDEX(user_id, created_on)
INDEX(event_id, value)
Please provide
SHOW CREATE TABLE for each table
EXPLAIN SELECT ...; for any reasonable candidates

How to find the number of purchases over time intervals SQL

I'm using Redshift (Postgres), and Pandas to do my work. I'm trying to get the number of user actions, lets say purchases to make it easier to understand. I have a table, purchases that holds the following data:
user_id, timestamp, price
1, 2015-02-01, 200
1, 2015-02-02, 50
1, 2015-02-10, 75
Ultimately, I would like the number of purchases over certain time windows, such as:
userid, 28-14_days, 14-7_days, 7
Here is what I have so far; I'm aware I don't have an upper limit on the dates:
SELECT DISTINCT x_days.user_id, SUM(x_days.purchases) AS x_num, SUM(y_days.purchases) AS y_num,
x_days.x_date, y_days.y_date
FROM
(
SELECT purchases.user_id, COUNT(purchases.user_id) as purchases,
DATE(purchases.timestamp) as x_date
FROM purchases
WHERE purchases.timestamp > (current_date - INTERVAL '%(x_days_ago)s day') AND
purchases.max_value > 200
GROUP BY DATE(purchases.timestamp), purchases.user_id
) AS x_days
JOIN
(
SELECT purchases.user_id, COUNT(purchases.user_id) as purchases,
DATE(purchases.timestamp) as y_date
FROM purchases
WHERE purchases.timestamp > (current_date - INTERVAL '%(y_days_ago)s day') AND
purchases.max_value > 200
GROUP BY DATE(purchases.timestamp), purchases.user_id) AS y_days
ON
x_days.user_id = y_days.user_id
GROUP BY
x_days.user_id, x_days.x_date, y_days.y_date
params={'x_days_ago':x_days_ago, 'y_days_ago':y_days_ago}
where these are set in python/pandas
x_days_ago = 14
y_days_ago = 7
But this didn't work out exactly as planned:
user_id x_num y_num x_date y_date
0 5451772 1 1 2015-02-10 2015-02-10
1 5026678 1 1 2015-02-09 2015-02-09
2 6337993 2 1 2015-02-14 2015-02-13
3 6204432 1 3 2015-02-10 2015-02-11
4 3417539 1 1 2015-02-11 2015-02-11
Even though I don't have an upper date to look between (so x is effectively searching from 14 days to now and y is 7 days to now, meaning overlap), in some cases y is higher.
Can anyone help me either fix this or give me a better way?
Thanks!
It might not be the most efficient answer, but you can generate each sum with a sub-select:
WITH
summed AS (
SELECT user_id, day, COUNT(1) AS purchases
FROM (SELECT user_id, DATE(timestamp) AS day FROM purchases) AS _
GROUP BY user_id, day
),
users AS (SELECT DISTINCT user_id FROM purchases)
SELECT user_id,
(SELECT SUM(purchases) FROM summed
WHERE summed.user_id = users.user_id
AND day >= DATE(NOW() - interval ' 7 days')) AS days_7,
(SELECT SUM(purchases) FROM summed
WHERE summed.user_id = users.user_id
AND day >= DATE(NOW() - interval '14 days')) AS days_14
FROM users;
(This was tested in Postgres, not in Redshift; but the Redshift documentation suggests that both WITH and DISTINCT are supported.) I would have liked to do this with a window, to obtain rolling sums; but it's a little onerous without generate_series().
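A single-pass alternative is conditional aggregation. This is only a sketch, assuming the purchases(user_id, timestamp) columns from the question; SUM(CASE ...) is used rather than the Postgres FILTER clause so that it also runs on Redshift:

```sql
-- One scan of purchases; each SUM(CASE ...) counts the rows in its window.
SELECT user_id
     , SUM(CASE WHEN timestamp >= NOW() - INTERVAL '7 days'  THEN 1 ELSE 0 END) AS days_7
     , SUM(CASE WHEN timestamp >= NOW() - INTERVAL '14 days' THEN 1 ELSE 0 END) AS days_14
FROM purchases
WHERE timestamp >= NOW() - INTERVAL '14 days'   -- outer bound: widest window
GROUP BY user_id;
```

Because the WHERE clause already restricts rows to the widest window, each narrower window only needs its own CASE condition, and the table is read once.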

How do I combine 3 SQL queries into 1?

This is written to count how many people have visited within the last day. I want to also include how many have visited in the last week and year and have it output altogether without doing 3 separate queries.
SELECT COUNT(updated_at) AS 'TODAY'
FROM parts_development.page_views p
WHERE updated_at >= DATE_SUB(NOW(),INTERVAL 1 day)
GROUP BY parts_user_id;
SELECT DAY(updated_at), WEEK(updated_at), COUNT(*) AS visits
FROM parts_development.page_views p
WHERE updated_at >= DATE_SUB(NOW(),INTERVAL 1 year)
GROUP BY
DAY(updated_at), WEEK(updated_at) WITH ROLLUP
This will count visits within a year, grouping them by day, week, and total.
If you just want to select visits for a day, week and a year in three columns, use this:
SELECT (
SELECT COUNT(*)
FROM parts_development.page_views p
WHERE updated_at >= DATE_SUB(NOW(),INTERVAL 1 DAY)
) AS last_day,
(
SELECT COUNT(*)
FROM parts_development.page_views p
WHERE updated_at >= DATE_SUB(NOW(),INTERVAL 7 DAY)
) AS last_week,
(
SELECT COUNT(*)
FROM parts_development.page_views p
WHERE updated_at >= DATE_SUB(NOW(),INTERVAL 1 YEAR)
) AS last_year
The SQL UNION Operator
http://www.w3schools.com/sql/sql_union.asp
If you want two more rows, then use UNION ALL. You still effectively have 3 queries, but executed as one.
If you want two more columns, then use SUM(CASE ...). Basically, you move your WHERE condition into the CASE expression, three times, each with its own condition.
No need to join or subselect from the table more than once.
SELECT parts_user_id,
SUM( IF( updated_at >= DATE_SUB( NOW(), INTERVAL 1 DAY ), 1, 0 ) )
as day_visits,
SUM( IF( updated_at >= DATE_SUB( NOW(), INTERVAL 7 DAY ), 1, 0 ) )
as week_visits,
count(*) as year_visits
FROM parts_development.page_views
WHERE updated_at >= DATE_SUB( NOW(),INTERVAL 1 year )
GROUP BY parts_user_id
SELECT COUNT(day.updated_at) AS 'TODAY'
FROM parts_development.page_views day
INNER JOIN (SELECT COUNT(updated_at) AS 'WEEK', parts_user_id AS userid
            FROM parts_development.page_views p
            WHERE updated_at >= DATE_SUB(NOW(), INTERVAL 1 WEEK)
            GROUP BY parts_user_id) week
        ON day.parts_user_id = week.userid
INNER JOIN (SELECT COUNT(updated_at) AS 'YEAR', parts_user_id AS userid
            FROM parts_development.page_views p
            WHERE updated_at >= DATE_SUB(NOW(), INTERVAL 1 YEAR)
            GROUP BY parts_user_id) year
        ON day.parts_user_id = year.userid
WHERE day.updated_at >= DATE_SUB(NOW(), INTERVAL 1 DAY)
GROUP BY day.parts_user_id
Don't quote me on the INTERVAL syntax; I didn't look it up, I'm a TSQL guy myself. This could also be accomplished with unions. You could also replace the WHERE clauses with predicates in the joins.
How about:
SELECT count(*), IsToday(), IsThisWeek()
FROM whatever
WHERE IsThisYear()
GROUP BY IsToday(), IsThisWeek()
where the Is*() functions are boolean functions (or expressions).
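Spelled out against the page_views table from the question, that idea might look like the sketch below (MySQL syntax; the boolean expressions stand in for the hypothetical Is*() functions):

```sql
-- Each boolean expression becomes a grouping column, so one scan of the
-- year's rows yields counts split by today / this week membership.
SELECT COUNT(*) AS visits,
       (updated_at >= DATE_SUB(NOW(), INTERVAL 1 DAY)) AS is_today,
       (updated_at >= DATE_SUB(NOW(), INTERVAL 7 DAY)) AS is_this_week
FROM parts_development.page_views
WHERE updated_at >= DATE_SUB(NOW(), INTERVAL 1 YEAR)
GROUP BY is_today, is_this_week;
```

The day/week/year totals then fall out by summing the appropriate groups, at the cost of a small pivot step in the client.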