DB2 - optimized query - apply function to current date or db column?

DB2 - optimized query - apply function to current date or db column? - sql

I need to write a query that will be as efficient as possible returning rows that have mycol (timestamp value) equal to today's date minus 100 or 200 days. (exactly 100 or 200 days ago - not a range)
Note that mycol always has 00.00.000000 time. (Ignore why that is)
Here is one example of how this can be written:
select * from mytable mt where
date(mycol) in (current date - 100 days, current date - 200 days)
I'm thinking this may be more efficient:
select * from mytable mt where
mycol in (timestampadd(16,-100,timestamp(current date,'00:00:00')),
timestampadd(16,-200,timestamp(current date,'00:00:00')))
The reason I believe it is more efficient is because I'm not calling a function on mycol (as I did in the first example) and the calculations on current date happen only once per execution of this query and not for every row.
Am I correct in my assumption?

I would write the second version as:
select *
from mytable mt
where mycol in (timestamp(current date - 100 days, '00:00:00'),
timestamp(current date - 200 days, '00:00:00')
);
If you care about performance, then you should have an index on mytable(mycol), because this will speed the query. Without an index, the additional overhead is the call to date() in each row. You would need to run timings in your environment to determine whether that is an issue in your environment.

I think second option is more eficient but for other reason.
If you declare an index over your column mycol the IN operator will be much faster.
But at the moment you perform date(mycol) db can't use the index anymore.
The best way is test both query unsing analyze/ explain.
ADD INFO:
You should consider using EXISTS instead of IN as show HERE
Test your two querys using EXPLAIN de DB2 or DB2 SQL Performance Analyzer
And then try the same using the EXISTS version.

Related

Incorrect parameter count in the call to native function 'DATEDIFF' in sql

what command should I use to make an output based on this problem:
Display the all values of customers who have joined as members for more than 700 days until today
This is the table that I have created:
table Customers
I've tried other references using DATEDIFF(), but it's always invalid :
SELECT * FROM Customers where DATEDIFF(DAY,customer_join,GETDATE())>700;

In MySQL/MariaDB, as opposed to SQL Server, DATEDIFF() takes just two arguments, and returns an integer number of days between them. We have timestampdiff(), which takes three arguments.
Also, getdate() is not a thing in MySQL (this is a bespoke SQL Server function).
You don't really need date functions here. I would phrase this logic using simple data arithmetics:
select *
from customers
where customer_join < current_date - interval 700 day
This expression can take advantage of an index on customer_join.
Depending on whether you want to take in account the time portion of customer_join (if it has one), you might want to use now() instead of current_date.

SQL to group time intervals by arbitrary time period

I need help with this SQL query. I have a big table with the following schema:
time_start (timestamp) - start time of the measurement,
duration (double) - duration of the measurement in seconds,
count_event1 (int) - number of measured events of type 1,
count_event2 (int) - number of measured events of type 2
I am guaranteed that the no rows will overlap - in SQL talk, there are no two rows such that time_start1 < time_start2 AND time_start1 + duration1 > time_start2.
I would like to design an efficient SQL query which would group the measurements by some arbitrary time period (I call it the group_period), for instance 3 hours. I have already tried something like this:
SELECT
ROUND(time_start/group_period,0) AS time_period,
SUM(count_event1) AS sum_event1,
SUM(count_event2) AS sum_event2
FROM measurements
GROUP BY time_period;
However, there seems to be a problem. If there is a measurement with duration greater than the group_period, I would expect such measurement to be grouped into all time period it belongs to, but since the duration is never taken into account, it gets grouped only into the first one. Is there a way to fix this?
Performance is of concern to me because in time, I expect the table size to grow considerably reaching millions, possibly tens or hundreds of millions of rows. Do you have any suggestions for indexes or any other optimizations to improve the speed of this query?

Based on Timekiller's advice, I have come up with the following query:
-- Since there's a problem with declaring variables in PostgreSQL,
-- we will be using aliases for the arguments required by the script.
-- First some configuration:
-- group_period = 3600 -- group by 1 hour (= 3600 seconds)
-- min_time = 1440226301 -- Sat, 22 Aug 2015 06:51:41 GMT
-- max_time = 1450926301 -- Thu, 24 Dec 2015 03:05:01 GMT
-- Calculate the number of started periods in the given interval in advance.
-- period_count = CEIL((max_time - min_time) / group_period)
SET TIME ZONE UTC;
BEGIN TRANSACTION;
-- Create a temporary table and fill it with all time periods.
CREATE TEMP TABLE periods (period_start TIMESTAMP)
ON COMMIT DROP;
INSERT INTO periods (period_start)
SELECT to_timestamp(min_time + group_period * coefficient)
FROM generate_series(0, period_count) as coefficient;
-- Group data by the time periods.
-- Note that we don't require exact overlap of intervals:
-- A. [period_start, period_start + group_period]
-- B. [time_start, time_start + duration]
-- This would yield the best possible result but it would also slow
-- down the query significantly because of the part B.
-- We require only: period_start <= time_start <= period_start + group_period
SELECT
period_start,
COUNT(measurements.*) AS count_measurements,
SUM(count_event1) AS sum_event1,
SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
ON time_start BETWEEN period_start AND (period_start + group_period)
GROUP BY period_start;
COMMIT TRANSACTION;
It does exactly what I was going for, so mission accomplished. However, I would still appreciate if anybody could give me some feedback to the performance of this query for the following conditions:
I expect the measurements table to have about 500-800 million rows.
The time_start column is primary key and has unique btree index on it.
I have no guarantees about min_time and max_time. I only know that group period will be chosen so that 500 <= period_count <= 2000.

(This turned out way too large for a comment, so I'll post it as an answer instead).
Adding to my comment on your answer, you probably should go with getting best results first and optimize later if it turns out to be slow.
As for performance, one thing I've learned while working with databases is that you can't really predict performance. Query optimizers in advanced DBMS are complex and tend to behave differently on small and large data sets. You'll have to get your table filled with some large sample data, experiment with indexes and read the results of EXPLAIN, there's no other way.
There are a few things to suggest, though I know Oracle optimizer much better than Postgres, so some of them might not work.
Things will be faster if all fields you're checking against are included in the index. Since you're performing a left join and periods is a base, there's probably no reason to index it, since it'll be included fully either way. duration should be included in the index though, if you're going to go with proper interval overlap - this way, Postgres won't have to fetch the row to calculate the join condition, index will suffice. Chances are it will not even fetch the table rows at all since it needs no other data than what exists in indexes. I think it'll perform better if it's included as the second field to time_start index, at least in Oracle it would, but IIRC Postgres is able to join indexes together, so perhaps a second index would perform better - you'll have to check it with EXPLAIN.
Indexes and math don't mix well. Even if duration is included in the index, there's no guarantee it will be used in (time_start + duration) - though, again, look at EXPLAIN first. If it's not used, try to either create a function-based index (that is, include time_start + duration as a field), or alter the structure of the table a bit, so that time_start + duration is a separate column, and index that column instead.
If you don't really need left join (that is, you're fine with missing empty periods), then use inner join instead - optimizer will likely start with a larger table (measurements) and join periods against it, possibly using hash join instead of nested loops. If you do that, than you should also index your periods table in the same fashion, and perhaps restructure it the same way, so that it contains start and end periods explicitly, as optimizer has even more options when it doesn't have to perform any operations on the columns.
Perhaps the most important, if you have max_time and min_time, USE IT to limit the results of measurements before joining! The smaller your sets, the faster it will work.

sqlalchemy select by date column only x newset days

suppose I have a table MyTable with a column some_date (date type of course) and I want to select the newest 3 months data (or x days).
What is the best way to achieve this?
Please notice that the date should not be measured from today but rather from the date range in the table (which might be older then today)
I need to find the maximum date and compare it to each row - if the difference is less than x days, return it.
All of this should be done with sqlalchemy and without loading the entire table.
What is the best way of doing it? must I have a subquery to find the maximum date? How do I select last X days?
Any help is appreciated.
EDIT:
The following query works in Oracle but seems inefficient (is max calculated for each row?) and I don't think that it'll work for all dialects:
select * from my_table where (select max(some_date) from my_table) - some_date < 10

You can do this in a single query and without resorting to creating datediff.
Here is an example I used for getting everything in the past day:
one_day = timedelta(hours=24)
one_day_ago = datetime.now() - one_day
Message.query.filter(Message.created > one_day_ago).all()
You can adapt the timedelta to whatever time range you are interested in.
UPDATE
Upon re-reading your question it looks like I failed to take into account the fact that you want to compare two dates which are in the database rather than today's day. I'm pretty sure that this sort of behavior is going to be database specific. In Postgres, you can use straightforward arithmetic.
Operations with DATEs
1. The difference between two DATES is always an INTEGER, representing the number of DAYS difference
DATE '1999-12-30' - DATE '1999-12-11' = INTEGER 19
You may add or subtract an INTEGER to a DATE to produce another DATE
DATE '1999-12-11' + INTEGER 19 = DATE '1999-12-30'
You're probably using timestamps if you are storing dates in postgres. Doing math with timestamps produces an interval object. Sqlalachemy works with timedeltas as a representation of intervals. So you could do something like:
one_day = timedelta(hours=24)
Model.query.join(ModelB, Model.created - ModelB.created < interval)
I haven't tested this exactly, but I've done things like this and they have worked.

I ended up doing two selects - one to get the max date and another to get the data
using the datediff recipe from this thread I added a datediff function and using the query q = session.query(MyTable).filter(datediff(max_date, some_date) < 10)
I still don't think this is the best way, but untill someone proves me wrong, it will have to do...

How to get all rows from a table inserted in a particular date.

I am trying to write a query that gets all the rows of a table for a particular date.
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE='2013-05-07'
However that does not work, because in the table the COLUMN_CONTAINING_DATE contains data like '2013-05-07 00:00:01' etc. So, this would work
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE>='2013-05-07' AND COLUMN_CONTAINING_DATE<'2013-05-08'
However, I dont want to go for option 2 because that feels like a hacky way. I would rather put a query that says get me all the rows for a give date and somehow not bother about the minutes and hours in the COLUMN_CONTAINING_DATE.
I am trying to have this query run on both H2 and DB2.
Any suggestions?

You can do:
select *
from MY_Table
where trunc(COLUMN_CONTAINING_DATE) = '2013-05-07';
However, the version that you describe as a "hack" is actually better. By wrapping a function around the data, many SQL optimizers will not use indexes. With just direct comparisons, an index would definitely be used.

Use something like this
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE=DATE('2013-05-07')

You can ease this if you use the Temporal data management capability from DB2 10.1.
For more information:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1204db2temporaldata/

If your concerns are related to the different data types (timestamp in the column, and a string containing a date), you can do this:
SELECT * FROM MY_TABLE
WHERE
COLUMN_CONTAINING_DATE >= '2013-05-07 00:00:00'
and COLUMN_CONTAINING_DATE < '2013-05-08 00:00:00'
and I'd pay attention to the formatting of the where clause, because this will improve readability a lot, if you have to look at your queries two months later. Just pick a style you prefer for ranges like "a <= x < b". Unfortunately SQL's between does not support this.
One could argue that the milliseconds are still missing, so perfectionists may append another ".0" in the timestamp ...

How Does Dateadd Impact the Performance of a SQL Query?

Say for instance I'm joining on a number table to perform some operation between two dates in a subquery, like so:
select n
,(select avg(col1)
from table1
where timestamp between dateadd(minute, 15*n, #ArbitraryDate)
and dateadd(minute, 15*(n+1), #ArbitraryDate))
from numbers
where n < 1200
Would the query perform better if I, say, constructed the date from concatenating varchars than using the dateadd function?

Keeping data in the datetime format using DATEADD is most likely to be quicker
Check this question: Most efficient way in SQL Server to get date from date+time?
The accepted answer (not me!) demonstrates DATEADD over string conversions. I've seen another too many years ago that showed the same

Be careful with between and dates, take a look at How Does Between Work With Dates In SQL Server?
I once optmized a query to run from over 24 hours to 36 seconds. Just don't use date functions or conversions on the column , see here: Only In A Database Can You Get 1000% + Improvement By Changing A Few Lines Of Code
to see what query performs better, execute both queries and look at execution plans, you can also use statistics io and statistics time to get how many reads and the time it took to execute the queries

I would NOT go with concatenating varchars.
DateAdd will def be better performace than string contatenation, and casting to DATETIME.
As always, you best bet would be to profile the 2 options, and determine the best result, as no DB is specified.

most likely there will be no differenfce one way or another.
I would run this:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
followed by both variants of your query, so that you see and compare real execution costs.

As long as your predicate calculations do not include references to the columns of the table you're querying, your approach shouldn't matter either way (go for clarity).
If you were to include something from Table1 in the calculation, though, I'd watch out for table scans or covering index scans as it may no longer be sargable.
In any case, check (or post!) the execution plan to confirm.

Why would you ever use a correlated subquery to begin with? That's going to slow you up far more than dateadd. They are like cursors, they work row by row.
Will something like this work?
select n.n , avgcol1
from numbers n
left outer join
(
select avg(col1) as avgcol1, n
from table1
where timestamp between dateadd(minute, 15*n, #ArbitraryDate)
and dateadd(minute, 15*(n+1), #ArbitraryDate)
Group by n
) t
on n.n = t.n
where n < 1200

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas