PSQL Filter query by time intervals - sql

I have a query that counts all completed issuances for a specific network. The problem is that the DB has a lot of issuances, going back to 2019-2020, and the query counts all of them, while I only need the ones from the last month (relative to the current time, not a fixed date), in a practical way. Examples:
This is the query that counts everything, which is about 12k:
select count(*)
from issuances_extended
where network = 'ethereum'
and status = 'completed'
And this is the query I wrote that counts from a month ago to the current time, which is about 100:
select count(*)
from issuances_extended
where network = 'ethereum'
and issued_at > now() - interval '1 month'
and status = 'completed'
But I have a lot to count (1, 2, 3, 4, 5 months ago, year to date) and different networks, so writing one query per interval and network is ultimately a very inefficient way of solving this. Is there a better way? It seems like this could be done via JS transformers, but I couldn't figure it out.

Try using GROUP BY and DATE_TRUNC.
SELECT DATE_TRUNC('month', issued_at) as month, count(*) as issuances
FROM issuances_extended
WHERE network = 'ethereum'
AND status = 'completed'
GROUP BY DATE_TRUNC('month', issued_at)
How to Group by Month in PostgreSQL
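If the windows have to be measured back from the current time (last 1, 2, 3 months, year to date) rather than by calendar month, one option is conditional aggregation: a single pass over the table that counts each window in its own column, grouped by network. In PostgreSQL this is usually written as count(*) FILTER (WHERE issued_at > now() - interval '1 month'). The sketch below runs the same idea against an in-memory SQLite database from Python so it can be tried without a server. SQLite, the sample rows, and the fixed :now value are assumptions for illustration only, not part of the original schema.

```python
import sqlite3

# Illustrative stand-in for issuances_extended; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE issuances_extended (network TEXT, status TEXT, issued_at TEXT)"
)
conn.executemany(
    "INSERT INTO issuances_extended VALUES (?, ?, ?)",
    [
        ("ethereum", "completed", "2024-06-10 12:00:00"),
        ("ethereum", "completed", "2024-05-01 09:00:00"),
        ("ethereum", "completed", "2024-02-20 09:00:00"),
        ("ethereum", "completed", "2023-11-05 09:00:00"),
        ("ethereum", "pending", "2024-06-11 08:00:00"),
        ("polygon", "completed", "2024-06-01 10:00:00"),
    ],
)

# One pass over the table: each SUM counts the rows that fall in one window.
# In SQLite a comparison evaluates to 1 or 0, so SUM(condition) acts as a count.
QUERY = """
SELECT network,
       SUM(issued_at > datetime(:now, '-1 month'))       AS last_1_month,
       SUM(issued_at > datetime(:now, '-2 months'))      AS last_2_months,
       SUM(issued_at > datetime(:now, '-3 months'))      AS last_3_months,
       SUM(issued_at >= datetime(:now, 'start of year')) AS year_to_date,
       COUNT(*)                                          AS all_time
FROM issuances_extended
WHERE status = 'completed'
GROUP BY network
ORDER BY network
"""

# A fixed "now" keeps the example reproducible; pass 'now' to use the real clock.
rows = conn.execute(QUERY, {"now": "2024-06-15 00:00:00"}).fetchall()
for row in rows:
    print(row)
```

Each network comes back as one row with one column per window, so adding another interval means adding one more column rather than one more query.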

Related

PostgreSQL: A window with all records of a month, plus the last record of the previous month

I have a table with update requests for disk quotas. Each record has a request time, a path and a quota size.
I need to aggregate all requests for a certain path in a certain month, plus the last request of the previous month. Something like
sum(quota) over (partition by )
I would appreciate some ideas or thoughts about how to do that in the most elegant way. Of course, it may be (and probably is) a multi-phase process.
Your question is short on details. To give a complete answer, we would need the table definition (DDL), test data (or better, a fiddle), and the expected results for that data. However, retrieving this month's data and the latest request from the prior month becomes something like:
select sq.path, sum(sq.quota) over (partition by sq.path) total_quota
from ( -- all requests for the path in the current month
       select s.path, s.quota
       from some_table s
       where s.path = 'certain path'
         and date_trunc('month', s.request_time) = date_trunc('month', current_date)
       union all
       -- plus the latest request for the path in the previous month
       select sc.path, sc.quota
       from some_table sc
       where sc.path = 'certain path'
         and (sc.path, sc.request_time) =
             (select sl.path, max(sl.request_time)
              from some_table sl
              where sl.path = sc.path
                and date_trunc('month', sl.request_time) = date_trunc('month', current_date - interval '1 month')
              group by sl.path
             )
     ) sq;
If the month of interest does not contain current_date you could pass a parameter value for the month of interest.
Note: this is not tested.

Calculations based on condition in PostgreSQL

I am having trouble doing calculations in one table using conditional statements. I have a table 'df' with the following column names:
id - int
time - timestamp
correctness - boolean
subject - text
Every student (id) completes tasks on particular subject (subject). The system assigns "True" value in column "correctness" if the assignment is completed correctly and "False" if not. The time (time) when the student completes the task is also saved by the system.
I need to write an optimal sql query that counts all students who completed 20 tasks successfully within an hour during March 2020.
Thanks in advance!
You can do this with no subqueries using:
select distinct t.id
from df t
where t.time >= '2020-03-01' and t.time < '2020-04-01'
group by t.id, date_trunc('hour', t.time)
having count(*) >= 20;
Note: You may want to require that the tasks are completed successfully, but that is not actually mentioned in your question.
For performance, you want an index on (time).
You need to look at each 'correct' task and see whether at least 20 correct tasks, itself included, were delivered in the hour leading up to it.
That means you have to inner join the tasks table to itself and then count the matches.
select distinct on (tasks.id) tasks.id, tasks.time, count(*) as correct_in_hour
from tasks
inner join tasks previous_tasks
  on tasks.id = previous_tasks.id
  -- previous_tasks: correct tasks in the hour ending at tasks.time
  and previous_tasks.time <= tasks.time
  and tasks.time - previous_tasks.time < interval '1 hour'
  and tasks.correctness
  and previous_tasks.correctness
  and tasks.time >= '2020-03-01' and tasks.time < '2020-04-01'
  and previous_tasks.time >= '2020-03-01' and previous_tasks.time < '2020-04-01'
group by 1, 2
having count(*) >= 20

SQL question: count of occurrence greater than N in any given hour

I'm looking through login logs (in Netezza) and trying to find users who have greater than a certain number of logins in any 1 hour time period (any consecutive 60 minute period, as opposed to strictly a clock hour) since December 1st. I've viewed the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span
You could use the analytic function lag to look back in a sorted sequence of time stamps to see whether the record that came 19 entries earlier is within an hour difference:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output will show the matching users with the first time at which they logged in for the twentieth time within an hour.
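To see why the lag comparison is enough, the same logic can be sketched outside the database. This is only an illustration in Python, assuming the login times for one user are already available as a list; first_burst and its parameters are hypothetical names, not anything from Netezza.

```python
from datetime import datetime, timedelta

def first_burst(login_times, n=20, window=timedelta(hours=1)):
    """Return the time of the first login that is the n-th within `window`, or None."""
    times = sorted(login_times)
    # Mirror lag(login_time, n - 1): compare each login with the one n - 1 rows earlier.
    for i in range(n - 1, len(times)):
        if times[i] - times[i - (n - 1)] < window:
            return times[i]
    return None

start = datetime(2018, 12, 3, 9, 0)
# 20 logins two minutes apart: the 20th comes 38 minutes after the first.
busy = [start + timedelta(minutes=2 * k) for k in range(20)]
# 20 logins ten minutes apart: the 20th comes 190 minutes after the first.
calm = [start + timedelta(minutes=10 * k) for k in range(20)]

print(first_burst(busy))  # 2018-12-03 09:38:00
print(first_burst(calm))  # None
```

Because the times are sorted, the login 19 positions back is the earliest of any 20 consecutive logins, so a single comparison per row covers every possible 60-minute window.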
I think you might do something like this (I'll use a login table with user and datetime as its only columns, for the sake of simplicity):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + 1 hour)
) as consecutive_logons
from connections ua
It is up to you to substitute your own column names (user, datetime).
It is up to you to find the date-add facility for your database (ua.datetime + 1 hour won't work as written); this is more or less dependent on the DB implementation, for example it is DATE_ADD in MySQL (https://www.w3schools.com/SQl/func_mysql_date_add.asp)
Due to the subquery (select count(*) ...), the whole query will not be the fastest, because it is a correlated subquery: it needs to be re-evaluated for each row.
The with clause simply computes a subset of user_logons to minimize the cost. It might not be strictly needed, but it lessens the complexity of the query.
You might get better performance using a stored function or a function written in an application language (e.g. Java, PHP, ...).

How do I design a SQL query that will show me all users who visited at least one page for the last 20 out of 24 hours?

In order to identify human traffic (as opposed to crawlers, bots, etc), I would like to design an SQL query that will identify all unique visitor ID's that have visited websites in the last 20 of 24 hours (as most humans would not be browsing for that long). I believe I understand how I want to structure it, "How many UNIQUE hours have any activity for each visitor in the past 24 hours, and WHERE 20 hours have at least some activity".
While the specifics of such a query would depend on the tables involved, I'm having trouble understanding if my structure is on the right track:
SELECT page_url, affinity, num
FROM (
SELECT AGG GROUP BY visitor_id, pages.page_url, max(v.max_affinity) as affinity, COUNT(*) as num, Row_Number()
OVER (Partition By v.visitor_id ORDER BY COUNT(visitor_id) DESC) AS RowNumber
FROM audience_lab_active_visitors v
SELECT pages ON pages.p_date >= '2017-09-14'
WHERE v.p_date='2017-09-14'
GROUP BY v.vispage_visitors, pages.page_url
) tbl WHERE RowNumber < 20
I don't believe your query is valid SQL, but I have an idea of what you're trying to accomplish. Rather than use a static date, I filtered by the past 24 hours and truncated the current timestamp to the hour, otherwise the query would be considering 25 unique hours. I also removed page_url from the query since it didn't seem relevant to the results based on what you're trying to solve.
For each visitor_id, the query counts the number of unique hours recorded, based on the column (timestamp_col in this example) used to record the timestamp of the page view. HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20 returns those you've identified as humans, meaning they visited the website in up to 19 of the past 24 hours.
SELECT
visitor_id,
COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) AS num,
MAX(v.max_affinity) AS affinity
FROM audience_lab_active_visitors AS v
JOIN pages AS p ON v.page_url = p.page_url
WHERE
v.p_date >= DATE_TRUNC('hour', CURRENT_TIMESTAMP) - INTERVAL '24' hour
GROUP BY 1
HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20;
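The counting step can also be sketched in plain Python to make the logic concrete: truncate every page-view timestamp to its clock hour, collect the distinct hours per visitor, and keep the visitors with fewer than 20. The data and the names hour_of and active_hours are made up for illustration and are not taken from the tables above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def hour_of(ts):
    """Truncate a timestamp to the start of its clock hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

def active_hours(views, now):
    """Count distinct clock hours with activity per visitor over the past 24 hours."""
    cutoff = hour_of(now) - timedelta(hours=24)
    hours = defaultdict(set)
    for visitor_id, ts in views:
        if ts >= cutoff:
            hours[visitor_id].add(hour_of(ts))
    return {visitor: len(seen) for visitor, seen in hours.items()}

now = datetime(2017, 9, 15, 12, 0)
# Made-up traffic: a crawler hitting the site every 30 minutes for a day,
# and a person with three page views in one morning.
views = [("crawler", now - timedelta(minutes=30 * k)) for k in range(1, 49)]
views += [
    ("person", datetime(2017, 9, 15, 8, 5)),
    ("person", datetime(2017, 9, 15, 8, 40)),
    ("person", datetime(2017, 9, 15, 9, 10)),
]

counts = active_hours(views, now)
# Same cut-off as the HAVING clause: fewer than 20 active hours.
likely_human = sorted(v for v, n in counts.items() if n < 20)
print(counts, likely_human)
```

Here the crawler is active in 24 distinct hours and the person in 2, so only the person passes the filter.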

Grouping timestamps by day, not by time

I've got a table storing user access/view logs for the webservice I run. It tracks the time as a timestamp though I'm finding when I do aggregate reporting I only care about the day, not the time.
I'm currently running the following query:
SELECT
user_logs.timestamp
FROM
user_logs
WHERE
user_logs.timestamp >= %(timestamp_1)s
AND user_logs.timestamp <= %(timestamp_2)s
ORDER BY
user_logs.timestamp
There are often other where conditions but they shouldn't matter to the question. I'm using Postgres but I'd assume whatever feature is used will work in other languages.
I pull the results into a Python script which counts the number of views per date but it'd make much more sense to me if the database could group and count for me.
How do I strip it down so it'll group by the day and ignore the time?
SELECT date_trunc('day', user_logs.timestamp) "day", count(*) views
FROM user_logs
WHERE user_logs.timestamp >= %(timestamp_1)s
AND user_logs.timestamp <= %(timestamp_2)s
group by 1
ORDER BY 1
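With the grouping done in the database, the Python side only has to read (day, views) pairs. A minimal sketch of that, using an in-memory SQLite database as a stand-in so it runs without a Postgres server; the sample rows are made up, and date(...) is SQLite's way of dropping the time of day where Postgres would use date_trunc('day', ...) or a cast to date.

```python
import sqlite3

# Illustrative stand-in for the user_logs table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_logs (timestamp TEXT)")
conn.executemany(
    "INSERT INTO user_logs VALUES (?)",
    [
        ("2023-03-01 08:15:00",),
        ("2023-03-01 17:40:00",),
        ("2023-03-02 09:05:00",),
        ("2023-03-05 23:59:00",),
    ],
)

# The database groups by calendar day and counts; no counting loop in Python.
views_per_day = conn.execute(
    """
    SELECT date(timestamp) AS day, COUNT(*) AS views
    FROM user_logs
    WHERE timestamp >= :timestamp_1 AND timestamp <= :timestamp_2
    GROUP BY day
    ORDER BY day
    """,
    {"timestamp_1": "2023-03-01 00:00:00", "timestamp_2": "2023-03-03 00:00:00"},
).fetchall()
print(views_per_day)
```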