My question is the same as the one asked here, except the chosen answer says "assuming you want to restart the rolling average after each 15 minute interval." What if I don't? That is, what if I want, for every row, a rolling average over that row and the next fifteen minutes of rows?
I would approach this as a correlated subquery:
select t.*,
(select avg(t2.col)
from t t2
where t2.timestamp >= t.timestamp and
t2.timestamp < t.timestamp + interval '15 minute'
) as future_15min_avg
from t;
This is challenging to do with window functions because the size of the window can change for each row.
There is an alternative approach which is more cumbersome but more efficient on larger amounts of data. It probably works best with a temporary table. The idea is (a concrete sketch follows the steps):
Insert each timestamp with its value in the table
Insert each timestamp plus 15 minutes in the table with a value of 0
Add an index on timestamp
Do a cumulative sum on the values
Use two joins between the original table and this table to get the sum
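To make that concrete, here is a minimal sketch, assuming a source table t(ts, col), Postgres-style syntax, distinct timestamps, and no row landing exactly 15 minutes after another (those edge cases need extra care):
create temporary table marks as
    select ts, col as val, 1 as cnt from t          -- step 1: real rows
    union all
    select ts + interval '15 minute', 0, 0 from t;  -- step 2: zero-valued sentinels

create index marks_ts_idx on marks (ts);            -- step 3

with cum as (                                       -- step 4: running totals
    select ts,
           sum(val) over (order by ts) as cum_val,
           sum(cnt) over (order by ts) as cum_cnt
    from marks
)
select t.ts,                                        -- step 5: two joins per row
       -- avg over [ts, ts + 15 min): totals at the sentinel minus totals
       -- at the row itself, plus the row's own value/count
       (c2.cum_val - c1.cum_val + t.col) * 1.0
         / (c2.cum_cnt - c1.cum_cnt + 1) as future_15min_avg
from t
join cum c1 on c1.ts = t.ts
join cum c2 on c2.ts = t.ts + interval '15 minute';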
You can try this with a CTE, and the performance might be OK; however, I think the index may be important for performance. And the correlated subquery is probably fine even for several hundred or a few thousand rows, assuming you have an index on the time column.
Related
In an interview I was asked: "You have a table with huge amounts of data, but there is a requirement to view the rows that have been added in the last 15 minutes. How do you do this effectively, without querying the whole table, since that takes so long?"
I said that I would create a view holding the latest 1,000 records (here I am assuming that fewer than 1,000 records were created in the last 15 minutes) and query the view rather than the entire table. The interviewer was okay with that, but he said there is a better approach, and I am not able to find it.
You just need to create an index on the created_at column; this avoids scanning the whole table and will considerably improve performance.
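For example (MySQL syntax assumed; the table and column names are placeholders):
CREATE INDEX idx_created_at ON my_table (created_at);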
You can try something like this:
SELECT columns
FROM table
WHERE last_seen >= NOW() - INTERVAL 15 MINUTE; -- keep the indexed column bare so the index can be used
Or:
select *
from my_table
where my_column > timestamp '2020-10-09 00:00:05' - numtodsinterval(15,'MINUTE')
I am not sure about the DB you are using.
This goes 15 minutes back from last seen.
I have this oracle query that takes around 1 minute to get the results:
SELECT TRUNC(sysdate - data_ricezione) AS delay
FROM notifiche#fe_engine2fe_gateway n
WHERE NVL(n.data_ricezione, TO_DATE('01011900', 'ddmmyyyy')) =
(SELECT NVL(MAX(n2.data_ricezione), TO_DATE('01011900', 'ddmmyyyy'))
FROM notifiche#fe_engine2fe_gateway n2
WHERE n.id_sdi = n2.id_sdi)
--AND sysdate-data_ricezione > 15
Basically I have this table named "notifiche", where each record represents a kind of update to another type of object (invoices). I want to know which invoices have not received any update in the last 15 days. I can do that by self-joining the notifiche table (n2), getting the most recent record for each invoice, and evaluating the difference between the update date (data_ricezione) and the current date (sysdate).
When I add the commented condition, the query takes a seemingly infinite amount of time to complete (I mean hours; I never saw the end of it...).
How is it possible that this simple condition makes the query so slow?
How can I improve the performance?
Try to keep data_ricezione alone on one side of the comparison; if there's an index on it, it might then be used.
So: switch from
and sysdate - data_ricezione > 15
to
and -data_ricezione > 15 - sysdate
and then (multiplying both sides by -1, which flips the comparison) to
and data_ricezione < sysdate - 15
As everything is done over the database link, see whether the driving_site hint does any good, i.e.
select /*+ driving_site (n) */ --> "n" is table's alias
trunc(sysdate-data_ricezione) as delay
from
notifiche#fe_engine2fe_gateway n
...
Use an analytic function to avoid a self-join over a database link. The query below only reads the table once, divides the rows into windows, finds the MAX value for each window, and lets you select rows based on that maximum. Analytic functions are tricky to understand at first, but they often lead to code that is smaller and more efficient.
select id_sdi, data_ricezione
from
(
    select id_sdi, data_ricezione,
           max(data_ricezione) over (partition by id_sdi) as max_date
    from notifiche#fe_engine2fe_gateway
)
where sysdate - max_date > 15;
As for why adding a simple condition can make the query slow: it's all about cardinality estimates. Cardinality, the number of rows, drives most of the optimizer's decisions. The best way to join a small amount of data may be very different from the best way to join a large amount of data, and Oracle must always estimate how many rows an operation returns in order to choose the right algorithm.
Optimizer statistics (metadata about the tables, columns, and indexes) are what Oracle uses to make cardinality estimates. For example, to estimate the number of rows filtered out by sysdate - data_ricezione > 15, the optimizer would want to know how many rows are in the table (DBA_TABLES.NUM_ROWS), what the maximum value of the column is (DBA_TAB_COLUMNS.HIGH_VALUE), and maybe a breakdown of how many rows fall into different age ranges (DBA_TAB_HISTOGRAMS).
All of that information depends on optimizer statistics being correctly gathered. If a DBA foolishly disabled automatic optimizer statistics gathering, then these problems will happen all the time. But even if your system is using good settings, the predicate you're using may be an especially difficult case. Optimizer statistics aren't free to gather, so the system only collects them when 10% of the data changes. But since your predicate involves SYSDATE, the percentage of rows will change every day even if the table doesn't change. It may make sense to manually gather stats on this table more often than the default schedule, or use a /*+ dynamic_sampling */ hint, or create a SQL Profile/Plan Baseline, or one of the many ways to manage optimizer statistics and plan stability. But hopefully none of that will be necessary if you use an analytic function instead of a self-join.
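For reference, hedged sketches of two of the options mentioned above (Oracle syntax; the schema name is a placeholder):
begin
    dbms_stats.gather_table_stats(ownname => 'MY_SCHEMA', tabname => 'NOTIFICHE');  -- manually refresh stats
end;
/

select /*+ dynamic_sampling(n 4) */ ...  -- or let the optimizer sample at parse time
from notifiche#fe_engine2fe_gateway n
...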
I have a view that works to pull the most recent data for a Hive history table. The history table is partitioned by day. The way that the view works is very straightforward—it has a subquery that does a max date on the date field (the one that is used as the partition) then filters the table based upon that value. The table contains hundreds of days (partitions), each with many millions of rows. In order to speed up the subquery, I am attempting to limit the partitions that are scanned to the last one created. To account for holiday weekends, I'm going back four days to ensure that the query returns data.
If I hard code the values with dates, the subquery runs very fast, and limits to the partitions correctly.
However, if I attempt to limit the partitions with a subquery to calculate the last partition, it doesn’t recognize the partitions and does a full table scan. The query will return correct results, as the filter works, but it takes a long time because it is not limiting the partitions scanned.
I tried doing the subquery as a WITH statement, then using an INNER JOIN on bus_date, but got the same results—partitions were not utilized.
The behavior is repeatable via a query, so I’ll use that rather than the view to demonstrate:
SELECT *
FROM a.transactions
WHERE bus_date IN (SELECT MAX (bus_date)
FROM a.transactions maxtrans
WHERE bus_date >= date_sub (CURRENT_DATE, 4));
There are no error messages, and the query actually works (filters to pull the correct data), but it scans all partitions so it is extremely slow. How can I limit the query to utilize the partitions identified in the subquery?
I'm still hopeful that someone will have an answer for this, but I did want to post the workaround that I've come up with in case it is useful for someone else.
SELECT *
FROM a.transactions
WHERE bus_date >= date_sub (CURRENT_DATE, 4)
AND bus_date IN (SELECT MAX (bus_date)
FROM a.transactions maxtrans
WHERE bus_date >= date_sub (CURRENT_DATE, 4));
The query is a little clumsy, as it is filtering on the business date twice. The first time it limits the main set of data to the last four days (which limits to those partitions and avoids a scan of all partitions) and the second pins it down to the last day for which data has been loaded (via the MAX bus_date). This is far from perfect, but performs CONSIDERABLY better than the query scanning all partitions. Thanks.
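Another workaround, if two passes are acceptable, is to resolve the MAX outside the query and pass it in as a Hive variable, so the partition filter is already a literal when the query is planned. A hedged sketch, assuming beeline/hive CLI variable substitution (the date shown is just a placeholder):
SELECT MAX(bus_date)                -- first pass: find the latest loaded day
FROM a.transactions
WHERE bus_date >= date_sub(CURRENT_DATE, 4);

-- second pass, invoked with e.g. --hivevar max_bus_date='2021-01-04'
SELECT *
FROM a.transactions
WHERE bus_date = '${hivevar:max_bus_date}';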
I have a table called "dutyroster". I want to make a random selection from this table's "names" column, but I want the selection to be different from the last 10 records, so that the same guy is not given a second duty within 10 days. Is that possible?
Create a temporary table with a single column called oldnames, which starts out empty. For each selection, execute a query like
select names from dutyroster where dutyroster.names not in (select oldnames from temporarytable) limit 10
and when the selection is made, add the result set to the temporary table.
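A minimal sketch of that approach (MySQL-style syntax; oldnames and temporarytable come from the answer above, everything else is a placeholder):
create temporary table temporarytable (oldnames varchar(100));

select names                            -- pick people who have not served recently
from dutyroster
where names not in (select oldnames from temporarytable)
limit 10;

insert into temporarytable (oldnames)   -- remember the selection afterwards
values ('<selected name>');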
The other answer here already addresses the part of the question about avoiding repeat selections.
To accomplish the random part of the selection, leverage newid() directly within your select statement. I've made this sqlfiddle as an example.
SELECT TOP 10
newid() AS [RandomSortColumn],
*
FROM
dutyroster
ORDER BY
[RandomSortColumn] ASC
Keep executing the query, and you'll keep getting different results. Use the technique in the other answer for avoiding doubling a guy up.
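Putting the two together, a hedged sketch (SQL Server syntax; last10 is a hypothetical table holding the last ten assignments):
SELECT TOP 1
    names
FROM
    dutyroster
WHERE
    names NOT IN (SELECT names FROM last10)
ORDER BY
    NEWID();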
The basic idea is to use a subquery to exclude anyone on duty in the last ten days, then sort the rest randomly:
select dr.*
from dutyroster dr
where dr.name not in (select dr2.name
from dutyroster dr2
where dr2.datetimecol >= date_sub(curdate(), interval 10 day)
)
order by rand()
limit 1;
Different databases may have different syntax for limit, rand(), and for the date/time functions. The above gives the structure of the query, but the functions may differ.
If you have a large amount of data and performance is a concern, there are other (more complicated) ways to take a random sample.
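One such alternative, sketched under the assumption of an auto-increment id column with few gaps (MySQL syntax; the sampling is not perfectly uniform, and you would still combine it with the not in filter above):
select dr.*
from dutyroster dr
join (select floor(rand() * (select max(id) from dutyroster)) as rid) r
  on dr.id >= r.rid
order by dr.id
limit 1;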
You could use TOP for SQL Server, and for MySQL you could use LIMIT.
Maybe this would help...
SELECT TOP number|percent column_name(s)
FROM table_name;
Source: http://www.w3schools.com/sql/sql_top.asp
Can this query below be optimized?
select
max(date), sysdate - max(date)
from
table;
Query execution time ~5.7 seconds
I have another approach
select
date, sysdate - date
from
(select * from table order by date desc)
where
rownum = 1;
Query execution ~7.9 seconds
In this particular case, table has around 17,000,000 entries.
Is there a more optimal way to rewrite this?
Update: Well, I tried the hint a few of you suggested in a development database, although with a smaller subset than the original (approximately 1,000,000 records). Without the index, the queries run slower than with it.
The first query: ~0.56 s without the index, ~0.2 s with it. The second query: ~0.41 s without the index, ~0.005 s with it. (This surprised me; I thought the first query would run faster than the second. Maybe it's more suitable for a smaller set of records.)
I suggested this solution to the DBA, who will change the table structure to accommodate it, and then I will test it with the actual data. Thanks.
Is there an index on the date column?
That query is simple enough that there's likely nothing that can be done to optimize it beyond adding an index on the date column. What database is this? And is sysdate another column of the table?
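If it helps, a hedged sketch of the index idea (Oracle syntax; date_col stands in for the real column name, since date itself is a reserved word):
create index my_table_date_idx on my_table (date_col);

-- with the index in place, Oracle can typically satisfy this with an
-- INDEX FULL SCAN (MIN/MAX) instead of reading all 17,000,000 rows
select max(date_col), sysdate - max(date_col) from my_table;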