Optimizing Max Value query - sql

I'd like advice on how to optimize my query; it runs slowly enough that it hurts the UX. My program collects data every hour, and the query below takes the latest data and builds a top 100 of users for a specific event:
SELECT a.user_id as user, nickname, value, s.created_on
FROM stats s, accounts a
WHERE a.user_id = s.user_id AND event_id = 1 AND s.created_on in
(SELECT created_on FROM stats WHERE created_on >= NOW() - '1 hour'::INTERVAL)
ORDER BY value desc
LIMIT 100
The query I have returns the top 100 from the last hour for event_id = 1, but I want to optimize it, and I believe the subquery is the root of the problem. I've tried other queries, but they either return duplicates or the result is not from the latest dataset.
Thank you
EDIT:
The accounts table contains [user_id, nickname]
The stats table contains [user_id, event_id, value, created_on]

NOW() - '1 hour'::INTERVAL is not MySQL syntax; perhaps you meant NOW() - INTERVAL 1 HOUR?
IN ( SELECT ... ) optimizes very poorly.
Not knowing the relationship between accounts and stats (1:1, 1:many, etc), I can only guess at what might work:
SELECT a.user_id as user, nickname, value, s.created_on
FROM stats s, accounts a
WHERE a.user_id = s.user_id
  AND event_id = 1
  AND s.created_on >= NOW() - INTERVAL 1 HOUR
ORDER BY value desc
LIMIT 100
INDEX(event_id, value) -- if they are both in `s`
INDEX(user_id, created_on)
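Spelled out as MySQL DDL, those hints would look something like this (the index names are my own):
ALTER TABLE stats ADD INDEX idx_event_value (event_id, value);
ALTER TABLE stats ADD INDEX idx_user_created (user_id, created_on);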
or...
SELECT user_id as user, nickname, value,
    ( SELECT MAX(created_on) FROM stats
        WHERE user_id = a.user_id ) AS created_on
FROM accounts a
WHERE event_id = 1
  AND EXISTS
    ( SELECT *
        FROM stats
        WHERE created_on >= NOW() - INTERVAL 1 HOUR
          AND user_id = a.user_id
    )
ORDER BY value desc
LIMIT 100
INDEX(user_id, created_on)
INDEX(event_id, value)
Please provide
SHOW CREATE TABLE for each table
EXPLAIN SELECT ...; for any reasonable candidates

BigQuery: iterating groups within a window of 28 days before a start_date column using _TABLE_SUFFIX

I got a table like this:
group_id | start_date | end_date
19335    | 20220613   | 20220714
19527    | 20220620   | 20220719
19339    | 20220614   | 20220720
19436    | 20220616   | 20220715
20095    | 20220711   | 20220809
I am trying to retrieve data from another table that is partitioned, where the data should be accessed with _TABLE_SUFFIX BETWEEN start_date AND end_date.
Each group_id contains different user_ids within the period [start_date, end_date]. What I need is to retrieve each user's data for a column/metric over the 28 days prior to the start_date of each group_id.
My idea is to:
1. Retrieve distinct user_id per group_id within the period [start_date, end_date]
2. Retrieve the previous 28 days of metric data prior to the start_date of each group_id
A code snippet showing how to retrieve the data for a single group_id follows:
WITH users_per_group AS (
  SELECT
    users_metadata.user_id,
    users_metadata.group_id,
  FROM
    `my_table_users_*` users_metadata
  WHERE
    _TABLE_SUFFIX BETWEEN '20220314' -- start_date
      AND '20220413' -- end_date
    AND experiment_id = 16709
  GROUP BY
    1, 2
)
SELECT
  _TABLE_SUFFIX AS date,
  user_id,
  SUM(COALESCE(metric, 0)) AS metric,
FROM
  users_per_group
JOIN `my_metric_table*` metric USING (user_id)
WHERE
  _TABLE_SUFFIX BETWEEN FORMAT_TIMESTAMP(
    '%Y%m%d',
    TIMESTAMP_SUB(
      PARSE_TIMESTAMP('%Y%m%d', '20220314'), -- start_date
      INTERVAL 28 DAY
    )
  ) -- 28 days before it starts
  AND FORMAT_TIMESTAMP(
    '%Y%m%d',
    TIMESTAMP_SUB(
      PARSE_TIMESTAMP('%Y%m%d', '20220314'), -- start_date
      INTERVAL 1 DAY
    )
  ) -- 1 day before it starts
GROUP BY
  1, 2
ORDER BY
  date ASC
Also, I want to avoid retrieving all data (for all dates) from that metric table, as it is huge and would take a very long time to read.
Is there an easy way to retrieve the metric data of each user across groups, considering the 28 days prior to the start_date of each group_id?
I can think of 2 approaches:
1. Join all the tables and then perform your query.
2. Create dynamic queries for each of your users.
Both approaches require search_from and search_to to be available beforehand, i.e. you need to calculate each user's search range before you do anything.
E.g.:
WITH users_per_group AS (
  SELECT
    user_id, group_id,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 4 DAY) AS search_from,
    DATE_SUB(parse_date("%Y%m%d", start_date), INTERVAL 1 DAY) AS search_to
  FROM TableName
)
Once you have this kind of table, you can use either of the approaches mentioned.
Since I don't have your data and don't know your table names, I am giving an example using a public dataset.
Approach 1
-- consider this your main table which contains user, grp, start_date, end_date
with maintable as (
  select 'India' visit_from, '20161115' as start_date, '20161202' end_date
  union all select 'Sweden', '20161201', '20161202'
),
-- then calculate the search from/to dates for every user and group
user_per_grp as (
  select *,
    date_sub(parse_date("%Y%m%d", start_date), interval 4 day) as search_from, -- change interval as per your need
    date_sub(parse_date("%Y%m%d", start_date), interval 1 day) as search_to
  from maintable
)
select visit_from, _TABLE_SUFFIX as date, count(visitId) as total_visits
from user_per_grp ug
left join `bigquery-public-data.google_analytics_sample.ga_sessions_*` as pub
  on pub.geoNetwork.country = ug.visit_from
where _TABLE_SUFFIX between format_date("%Y%m%d", ug.search_from) and format_date("%Y%m%d", ug.search_to)
group by 1, 2
Approach 2
declare queries array<string> default [];
create temp table maintable as (
  select 'India' visit_from, '20161115' as start_date, '20161202' end_date
  union all select 'Sweden', '20161201', '20161202'
);
create temp table user_per_grp as (
  select *,
    date_sub(parse_date("%Y%m%d", start_date), interval 4 day) as search_from,
    date_sub(parse_date("%Y%m%d", start_date), interval 1 day) as search_to
  from maintable
);
-- for each user, create a separate query here
for record in (select * from user_per_grp) do
  set queries = queries || [format('select "%s" Visit_From,_TABLE_SUFFIX Date,count(visitId) total_visits from `bigquery-public-data.google_analytics_sample.ga_sessions_*` where _TABLE_SUFFIX between format_date("%%Y%%m%%d","%t") and format_date("%%Y%%m%%d","%t") and geoNetwork.country="%s" group by 1,2', record.visit_from, record.search_from, record.search_to, record.visit_from)];
  -- replace your query here
end for;
-- aggregate all the queries and execute them
execute immediate (select string_agg(query, ' union all ') from unnest(queries) query);
Here the 2nd approach processed much less data (~750 KB) than the 1st approach (~17 MB). But that might not hold for your dataset, as the date ranges of two users may overlap, which would lead to reading the same table twice.

Postgres insert function by aggregating data from multiple tables

I have two tables, 'page_visits' and 'comments', which store new webpage visits and new comments, respectively.
PAGE_VISITS
id | page_id | created_at
1  | 1111    | 2021-12-02T04:55:26.779Z
2  | 1442    | 2021-12-02T02:25:32.219Z
3  | 1111    | 2021-12-02T04:55:26.214Z
COMMENTS
id | page_id | ... | created_at
1  | 1024    | ... | 2021-12-02T04:55:26.779Z
2  | 1111    | ... | 2021-12-02T02:25:32.219Z
3  | 3849    | ... | 2021-12-02T04:55:26.214Z
I want to aggregate the data from both tables over the past hour for analytics, so that it looks like the table below.
PAGE_DATA
page_id | visit_count | comment_count | created_at
1024    | 14          | 3             | 2021-12-02T04:55:26.779Z
1111    | 11          | 8             | 2021-12-02T02:25:32.219Z
3849    | 1           | 0             | 2021-12-02T04:55:26.214Z
2412    | 0           | 1             | 2021-12-02T04:55:26.779Z
SELECT page_visits.page_id
     , COUNT(page_visits.id) AS visitCount
     , COALESCE(cmts.cmt_cnt, 0) AS commentCount
FROM page_visits
LEFT OUTER JOIN ( SELECT page_id
                       , COUNT(*) AS cmt_cnt
                  FROM comments
                  WHERE created_at >= NOW() - INTERVAL '1 HOUR'
                  GROUP BY page_id
                ) AS cmts
  ON cmts.page_id = page_visits.page_id
WHERE page_visits.created_at >= NOW() - INTERVAL '1 HOUR'
GROUP BY page_visits.page_id, cmts.cmt_cnt;
I have the above code as of now; however, it only returns a row when comment_count is NULL, but not when visit_count is 0 and comment_count is greater than 0.
My first question is: how do I get it to return a row even when visit_count is 0? Someone could have visited the page the hour before but only made a comment later on.
Secondly, I am trying to run this code every hour with pg_cron. I know that I can run a function directly from the cron scheduler; however, I have been unable to turn the above code into a working Postgres function that inserts a new row into the 'page_data' table each time it is called.
Could someone help me out with these 2 issues? Thank you.
Consider a FULL JOIN of two aggregates:
SELECT COALESCE(vsts.page_id, cmts.page_id) AS page_id
, COALESCE(vsts.vst_cnt, 0) AS visitCount
, COALESCE(cmts.cmt_cnt, 0) AS commentCount
FROM (
SELECT page_id
, COUNT(*) AS vst_cnt
FROM page_visits
WHERE created_at >= NOW() - INTERVAL '1 HOUR'
GROUP BY page_id
) AS vsts
FULL OUTER JOIN (
SELECT page_id
, COUNT(*) AS cmt_cnt
FROM comments
WHERE created_at >= NOW() - INTERVAL '1 HOUR'
GROUP BY page_id
) AS cmts
ON cmts.page_id = vsts.page_id
Alternatively, aggregate a UNION query of both tables:
SELECT page_id
, SUM(vst_n) AS vst_cnt
, SUM(cmt_n) AS cmt_cnt
FROM (
SELECT page_id, 1 AS vst_n, 0 AS cmt_n
FROM page_visits
WHERE created_at >= NOW() - INTERVAL '1 HOUR'
UNION ALL
SELECT page_id, 0 AS vst_n, 1 AS cmt_n
FROM comments
WHERE created_at >= NOW() - INTERVAL '1 HOUR'
) AS sub
GROUP BY page_id
Regarding the last question: if I understand you, simply run an INSERT ... SELECT using the query above. It's not quite clear how you aggregated created_at, but add a MIN or MAX to the above aggregations and include the additional column below:
INSERT INTO page_data (page_id, visit_count, comment_count)
SELECT ...above query...
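For the pg_cron part, here is a minimal sketch, assuming page_data has the four columns shown earlier and that MAX is the aggregate you want for created_at; the function and job names are my own:
CREATE OR REPLACE FUNCTION insert_page_data() RETURNS void AS $$
    INSERT INTO page_data (page_id, visit_count, comment_count, created_at)
    SELECT page_id
         , SUM(vst_n)
         , SUM(cmt_n)
         , MAX(created_at)  -- or MIN, depending on what you need
    FROM (
        SELECT page_id, 1 AS vst_n, 0 AS cmt_n, created_at
        FROM page_visits
        WHERE created_at >= NOW() - INTERVAL '1 HOUR'
        UNION ALL
        SELECT page_id, 0 AS vst_n, 1 AS cmt_n, created_at
        FROM comments
        WHERE created_at >= NOW() - INTERVAL '1 HOUR'
    ) AS sub
    GROUP BY page_id;
$$ LANGUAGE sql;
-- Schedule it to run at the top of every hour:
SELECT cron.schedule('page-data-hourly', '0 * * * *', 'SELECT insert_page_data()');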

Daily count of user_ids who have visited my store 4 or more than 4 times every day

I have a table of user_ids who have visited my platform. I want a count of only those user IDs who have visited my store 4 or more times per day, every day, for a duration of 10 days.
To achieve this I am using this query:
select date(arrival_timestamp), count(user_id)
from mytable
where date(arrival_timestamp) >= current_date-10
and date(arrival_timestamp) < current_date
group by 1
having count(user_id)>=4
order by 2 desc
limit 10;
But this query takes all users whose total count over the period is 4 or more, not per day, which covers almost every user, so I am not able to segregate only those users who visit my store 4 or more times on a particular day. Any help in this regard is appreciated.
Thanks
You can try this:
with list as (
select user_id, count(*) as user_count, array_agg(arrival_timestamp) as arrival_timestamp
from mytable
where date(arrival_timestamp) >= current_date-10
and date(arrival_timestamp) < current_date
group by user_id)
select user_id, unnest(arrival_timestamp)
from list
where user_count >= 4
From a list of daily users that have visited your store 4 or more times a day over the last 10 days (the inner query), select those who have 10 occurrences, i.e. every day.
select user_id
from
(
select user_id
from the_table
where arrival_timestamp::date between current_date - 10 and current_date - 1
group by user_id, arrival_timestamp::date
having count(*) >= 4
) t
group by user_id
having count(*) = 10;
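If what you actually need is the daily count of such users (as the title suggests), the same inner query can be aggregated by day instead; a sketch:
select day, count(*) as users_with_4_plus_visits
from
(
    select user_id, arrival_timestamp::date as day
    from the_table
    where arrival_timestamp::date between current_date - 10 and current_date - 1
    group by user_id, arrival_timestamp::date
    having count(*) >= 4
) t
group by day
order by day;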

Postgres inner query performance

I have a table from which I need to select everything matching this rule:
id = 4524522000143 and validPoint = true
and date > (max(date)- interval '12 month')
-- the max date is the max date for this id
Explaining the rule: I have to get all registers and count them; they must be at most 1 year older than the newest register for that id.
This is my current query:
WITH points as (
select 1 as ct from base_faturamento_mensal
where id = 4524522000143 and validPoint = true
group by id,date
having date > (max(date)- interval '12 month')
) select sum(ct) from points
Is there a more efficient way for this?
Well, your query uses the trick of referencing a grouped column in the HAVING clause, and I don't find it particularly bad. It seems fine, but without the EXPLAIN ANALYZE <query> output I can't say much more.
One thing you can do is get rid of the CTE and use count(*) within the same query, instead of returning 1 and then running a sum on it afterwards:
select count(*) as ct
from base_faturamento_mensal
where id = 4524522000143
and validPoint = true
group by id, date
having date > max(date) - interval '12 months'
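If the scan itself turns out to be the bottleneck, a partial index matching the filter may also help. A sketch, assuming validPoint is a boolean column (the index name is my own, untested against your schema):
CREATE INDEX idx_bfm_id_date ON base_faturamento_mensal (id, date) WHERE validPoint;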

How to calculate an SQL MAX() for any N-second duration within a certain timeframe

I need to answer a question like this:
For each user, what is the most items that user viewed in any 60 second
time frame between START_TIMESTAMP and END_TIMESTAMP?
The 60 second time frame is a sliding window. It's not just a matter of "items viewed" counts for each whole minute. Also, 60 seconds was just an example, it should work for any number of seconds.
My data is stored like this:
-- Timestamped log of users viewing items
CREATE TABLE user_item_views (
user_id integer,
item_id integer,
timestamp timestamp
);
Doing it for each whole minute is easy enough: just format the timestamp to something like YYYY-MM-DD hh:mm and do a count grouped by that formatted timestamp and the user_id.
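For instance, in MySQL the whole-minute version might look like this (a sketch):
SELECT user_id,
       DATE_FORMAT(timestamp, '%Y-%m-%d %H:%i') AS minute_bucket,
       COUNT(*) AS items_viewed
FROM user_item_views
GROUP BY user_id, minute_bucket;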
Doing it for a sliding window, I have no idea how to approach.
If this would be easier outside of SQL, I am open to exporting the data to another format, or using another language.
Desired output is something like:
User ID | Max items viewed in N seconds, between START and END
...     | ...
...     | ...
...     | ...
How can I do this?
Here's how I would do it (beware, untested code, this is just to outline the idea).
You need a helper table with as many rows as there are seconds between START_TIMESTAMP and END_TIMESTAMP. Create that as a temp table before you begin your query.
For the sake of the sample, let's call it every_second. I'm assuming your minimum time resolution is one second.
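In MySQL 8+, for example, the helper table could be built with a recursive CTE; a sketch, where the two literal timestamps stand in for START_TIMESTAMP and END_TIMESTAMP:
SET SESSION cte_max_recursion_depth = 1000000; -- the default (1000) is too low for long ranges
CREATE TEMPORARY TABLE every_second AS
WITH RECURSIVE seconds (ts) AS (
    SELECT TIMESTAMP('2022-01-01 00:00:00')     -- START_TIMESTAMP
    UNION ALL
    SELECT ts + INTERVAL 1 SECOND FROM seconds
    WHERE ts < TIMESTAMP('2022-01-01 01:00:00') -- END_TIMESTAMP
)
SELECT ts AS timestamp FROM seconds;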
Then do:
SELECT
s.timestamp,
v.user_id,
(
SELECT COUNT(*) FROM user_item_views
WHERE timestamp BETWEEN s.timestamp AND ADDTIME(s.timestamp, '00:00:59')
AND user_id = v.user_id
) item_count
FROM
every_second s
LEFT JOIN user_item_views v ON v.timestamp = s.timestamp
GROUP BY
s.timestamp,
v.user_id
Store that in another temporary table and select the desired maxima from it (this is necessary because of the "select max from group" problem).
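Assuming the result above went into a temp table called per_window_counts (a name I made up), that last step is just:
SELECT user_id, MAX(item_count) AS max_item_count
FROM per_window_counts
WHERE user_id IS NOT NULL
GROUP BY user_id;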
In MySQL (assuming that timestamp is unique):
SELECT
user_id
, MAX(max_count) AS max_count
FROM
( SELECT
a.user_id
, COUNT(*) AS max_count
FROM
user_item_views AS a
JOIN
user_item_views AS b
ON a.user_id = b.user_id
AND a.timestamp <= b.timestamp
AND b.timestamp < a.timestamp + INTERVAL 60 SECOND
GROUP BY
a.user_id
, a.timestamp
) AS grp
GROUP BY
user_id
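For this self-join to stay fast on a large table, a composite index covering the join and range conditions would likely be needed; a sketch (the index name is my own):
ALTER TABLE user_item_views ADD INDEX idx_user_ts (user_id, timestamp);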