I'm trying to write a query that looks for all records for yesterday from a column with type: TIMESTAMP(6) WITH LOCAL TIME ZONE.
The issue I'm facing is that the query runs extremely slowly. I'm wondering if there is something I'm doing incorrectly when querying against a timestamp column?
select * from inventory_transaction
WHERE Complete_timestamp >= to_timestamp(sysdate-1)
and Complete_timestamp < to_timestamp(sysdate);
First, if you want "yesterday", the logic would be:
select *
from inventory_transaction
where Complete_timestamp >= trunc(sysdate - 1) and
Complete_timestamp < trunc(sysdate);
What is important here is the trunc(), so you get calendar days; otherwise, the time of day ends up in the comparison. Despite its name, sysdate has a time component.
I would also use sysdate - interval '1' day, but that has no impact on performance or results.
Why is this slow? There could be multiple reasons. The most likely is that inventory_transaction is a table. In that case, you want an index on inventory_transaction(complete_timestamp).
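For example (a sketch only; the index name is made up):
CREATE INDEX inventory_txn_complete_ts_idx
    ON inventory_transaction (complete_timestamp);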
Another possibility is that there is a lot of data. More data takes more time for the database to read. There is no way around that, other than upgrading the server or changing server parameters (if they are suboptimal).
Another possibility is that inventory_transaction is really a view. If that is the case, you need to optimize the view. It is possible that an index on an underlying column will speed the query. However, it might be the entire view that needs to be optimized.
What's the problem?
I'm trying to select two different date periods at once within a query (to be used in Data Studio), to do some complex period comparison calculations.
This has to all happen in the query, because it's getting used in Data Studio (and other reporting platforms).
However the logic I'm using to prune the partitions, seems to be inconsistent on the BigQuery side.
One of them prunes partitions correctly, the other doesn't. Sometimes variations seem to scan arbitrary amounts of data.
Examples of partition pruning
Here's the first example. It either scans the correct amount, ~12 MB, or the entire table, ~5 GB.
SELECT
columns...
FROM
table
WHERE
(
date(_PARTITIONTIME) >= PARSE_DATE('%Y%m%d', "20220509")
AND date(_PARTITIONTIME) <= PARSE_DATE('%Y%m%d', "20220509")
)
OR
(
date(_PARTITIONTIME) >= DATE_SUB(PARSE_DATE('%Y%m%d', "20220509"), INTERVAL 1 DAY)
AND date(_PARTITIONTIME) <= DATE_SUB(PARSE_DATE('%Y%m%d', "20220509"), INTERVAL 1 DAY)
)
AND REGEXP_CONTAINS (path, r"/c/")
I can't consistently recreate the 12 MB version; it more typically returns the 5 GB version.
Question 1: Is this some sort of caching effect, where it has somehow run the date calculation before?
Question 2: I'm also not sure why it's not pruning the partitions here, as I'm not doing any subqueries, just some date calculations. Why isn't it pruning?
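For what it's worth, since the two branches cover adjacent days, the same partitions can be selected with a single range filter directly on _PARTITIONTIME, which is the kind of predicate BigQuery prunes most reliably. A sketch, not a confirmed fix (note it applies the REGEXP_CONTAINS filter to both days, which the original's AND/OR precedence does not):
SELECT
  columns...
FROM
  table
WHERE
  _PARTITIONTIME >= TIMESTAMP(DATE_SUB(PARSE_DATE('%Y%m%d', "20220509"), INTERVAL 1 DAY))
  AND _PARTITIONTIME < TIMESTAMP(DATE_ADD(PARSE_DATE('%Y%m%d', "20220509"), INTERVAL 1 DAY))
  AND REGEXP_CONTAINS(path, r"/c/")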
I'm modifying a query in Oracle, in which the data from the previous month is needed. At present, the query has a where clause with the following:
...
...
...
WHERE cust_dt BETWEEN TO_DATE('05-01-2022', 'mm-dd-yyyy') AND TO_DATE('05-31-2022', 'mm-dd-yyyy')
I am modifying the query so that the start date and end date do not need to be manually changed every month to run the query. After doing some research, I came up with the following:
...
...
...
WHERE TO_CHAR(cust_dt, 'MM-YYYY') = TO_CHAR(add_months(sysdate, -1), 'MM-YYYY')
The results I get back are as I want them, but I am curious as to which query will be better performance-wise given a larger set of data. All the posts I saw online used BETWEEN, so I was wondering if there was some reason for this.
I am a complete novice as far as tuning, testing, performance, etc. goes on queries. The actual query itself is fairly complex with several joins, so performance is important. At present, I only have a small amount of test data to work with, so I am limited in what I can do to find the best result.
So to circle back to my question, which query would be best? The one that uses a BETWEEN, or the one that uses TO_CHAR?
I recommend not using either of your two queries. The first query checks a hard-coded date range, which is generally a risk that should be avoided when possible. The second query makes an unnecessary conversion from date to char, which will likely be slow.
You can just use common date functions to get the data for the previous month.
According to your description, your reference date is the current date, i.e. sysdate.
In order to rule out incorrect comparison because of the time, you can use the function TRUNC to remove the time from the date.
The function ADD_MONTHS can be used to go one month back.
The function LAST_DAY can be used to find the last day of the month.
Putting this together, you can use a where clause like this:
WHERE cust_dt BETWEEN ADD_MONTHS(TRUNC(sysdate,'mm'),-1) AND
LAST_DAY(ADD_MONTHS(TRUNC(sysdate,'mm'),-1));
This will execute fast and safely, and it avoids unnecessary conversions and hard-coded dates.
A last note: consider changing the above query to...
WHERE cust_dt BETWEEN ADD_MONTHS(TRUNC(sysdate,'mm'),-1)
AND TRUNC(sysdate,'MM')-INTERVAL '0.001' SECOND;
...depending on whether you need the full last day of the previous month or not.
Please see the difference here and try it out: db<>fiddle
WHERE TO_CHAR(cust_dt, 'MM-YYYY') = TO_CHAR(add_months(sysdate, -1), 'MM-YYYY')
Will not use an index on the cust_dt column; instead you would need to create a separate function-based index on TO_CHAR(cust_dt, 'MM-YYYY')
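For example, a sketch of such a function-based index (the table and index names here are placeholders):
CREATE INDEX cust_dt_month_idx ON your_table (TO_CHAR(cust_dt, 'MM-YYYY'));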
WHERE cust_dt BETWEEN TO_DATE('05-01-2022', 'mm-dd-yyyy')
AND TO_DATE('05-31-2022', 'mm-dd-yyyy')
Will use an index on the cust_dt column.
However, in Oracle, a DATE data-type consists of 7 bytes representing: century, year-of-century, month, day, hour, minute and second. It ALWAYS has those components (but often the client application you use to access the database will default to only showing the date component and not the time component - but the time component will still exist).
This means that your query will find values where cust_dt is between 2022-05-01 00:00:00 and 2022-05-31 00:00:00. It will not match values where cust_dt is between 2022-05-31 00:00:01 and 2022-05-31 23:59:59.
So to circle back to my question, which query would be best?
Neither.
You want (if you are hardcoding the dates):
WHERE cust_dt >= TO_DATE('05-01-2022', 'mm-dd-yyyy')
AND cust_dt < TO_DATE('06-01-2022', 'mm-dd-yyyy')
Or (if you are finding the date dynamically):
WHERE cust_dt >= ADD_MONTHS(TRUNC(SYSDATE, 'MM'), -1)
AND cust_dt < TRUNC(SYSDATE, 'MM')
Which will use an index on cust_dt and match the entire range for the month.
db<>fiddle here
I need help with this SQL query. I have a big table with the following schema:
time_start (timestamp) - start time of the measurement,
duration (double) - duration of the measurement in seconds,
count_event1 (int) - number of measured events of type 1,
count_event2 (int) - number of measured events of type 2
I am guaranteed that no rows will overlap - in SQL terms, there are no two rows such that time_start1 < time_start2 AND time_start1 + duration1 > time_start2.
I would like to design an efficient SQL query which would group the measurements by some arbitrary time period (I call it the group_period), for instance 3 hours. I have already tried something like this:
SELECT
ROUND(EXTRACT(EPOCH FROM time_start) / group_period) AS time_period,
SUM(count_event1) AS sum_event1,
SUM(count_event2) AS sum_event2
FROM measurements
GROUP BY time_period;
However, there seems to be a problem. If there is a measurement with a duration greater than the group_period, I would expect such a measurement to be grouped into all the time periods it belongs to, but since the duration is never taken into account, it gets grouped only into the first one. Is there a way to fix this?
Performance is of concern to me because in time, I expect the table size to grow considerably reaching millions, possibly tens or hundreds of millions of rows. Do you have any suggestions for indexes or any other optimizations to improve the speed of this query?
Based on Timekiller's advice, I have come up with the following query:
-- Since there's a problem with declaring variables in PostgreSQL,
-- we will be using aliases for the arguments required by the script.
-- First some configuration:
-- group_period = 3600 -- group by 1 hour (= 3600 seconds)
-- min_time = 1440226301 -- Sat, 22 Aug 2015 06:51:41 GMT
-- max_time = 1450926301 -- Thu, 24 Dec 2015 03:05:01 GMT
-- Calculate the number of started periods in the given interval in advance.
-- period_count = CEIL((max_time - min_time) / group_period)
SET TIME ZONE UTC;
BEGIN TRANSACTION;
-- Create a temporary table and fill it with all time periods.
CREATE TEMP TABLE periods (period_start TIMESTAMP)
ON COMMIT DROP;
INSERT INTO periods (period_start)
SELECT to_timestamp(min_time + group_period * coefficient)
FROM generate_series(0, period_count) as coefficient;
-- Group data by the time periods.
-- Note that we don't require exact overlap of intervals:
-- A. [period_start, period_start + group_period]
-- B. [time_start, time_start + duration]
-- This would yield the best possible result but it would also slow
-- down the query significantly because of the part B.
-- We require only: period_start <= time_start <= period_start + group_period
SELECT
period_start,
COUNT(measurements.*) AS count_measurements,
SUM(count_event1) AS sum_event1,
SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
ON time_start BETWEEN period_start AND (period_start + group_period * INTERVAL '1 second')
GROUP BY period_start;
COMMIT TRANSACTION;
It does exactly what I was going for, so mission accomplished. However, I would still appreciate it if anybody could give me some feedback on the performance of this query under the following conditions:
I expect the measurements table to have about 500-800 million rows.
The time_start column is the primary key and has a unique btree index on it.
I have no guarantees about min_time and max_time. I only know that group period will be chosen so that 500 <= period_count <= 2000.
(This turned out way too large for a comment, so I'll post it as an answer instead).
Adding to my comment on your answer: you should probably go with getting the best results first and optimize later if it turns out to be slow.
As for performance, one thing I've learned while working with databases is that you can't really predict performance. Query optimizers in advanced DBMSs are complex and tend to behave differently on small and large data sets. You'll have to fill your table with some large sample data, experiment with indexes and read the results of EXPLAIN; there's no other way.
There are a few things to suggest, though I know the Oracle optimizer much better than Postgres, so some of them might not work.
Things will be faster if all the fields you're checking against are included in the index. Since you're performing a left join and periods is the driving table, there's probably no reason to index it, since it'll be read in full either way. duration should be included in the index, though, if you're going to go with proper interval overlap - this way, Postgres won't have to fetch the row to calculate the join condition; the index will suffice. Chances are it won't fetch the table rows at all, since it needs no data other than what exists in the indexes. I think it'll perform better if duration is included as the second field of the time_start index - at least in Oracle it would - but IIRC Postgres is able to combine indexes, so perhaps a separate index would perform better. You'll have to check it with EXPLAIN.
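For example, a sketch of the composite-index variant (the index name is made up):
CREATE INDEX measurements_start_duration_idx
    ON measurements (time_start, duration);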
Indexes and math don't mix well. Even if duration is included in the index, there's no guarantee it will be used for (time_start + duration) - though, again, look at EXPLAIN first. If it's not used, try to either create a function-based index (that is, index the expression time_start + duration), or alter the structure of the table a bit, so that time_start + duration is a separate column, and index that column instead.
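In Postgres that would be an expression index, along these lines (a sketch that assumes duration is stored as seconds):
CREATE INDEX measurements_end_time_idx
    ON measurements ((time_start + duration * INTERVAL '1 second'));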
If you don't really need the left join (that is, you're fine with missing empty periods), then use an inner join instead - the optimizer will likely start with the larger table (measurements) and join periods against it, possibly using a hash join instead of nested loops. If you do that, then you should also index your periods table in the same fashion, and perhaps restructure it the same way, so that it contains start and end periods explicitly, as the optimizer has even more options when it doesn't have to perform any operations on the columns.
Perhaps most importantly: if you have max_time and min_time, USE THEM to limit the measurements before joining! The smaller your sets, the faster it will work.
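For example, a sketch of the final SELECT from your script with that extra range pushed into the join condition (same placeholder names as in your script):
SELECT
  period_start,
  COUNT(measurements.*) AS count_measurements,
  SUM(count_event1) AS sum_event1,
  SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
  ON time_start >= to_timestamp(min_time)
 AND time_start < to_timestamp(max_time)
 AND time_start BETWEEN period_start AND (period_start + group_period * INTERVAL '1 second')
GROUP BY period_start;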
I am trying to write a query that gets all the rows of a table for a particular date.
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE='2013-05-07'
However, that does not work, because in the table COLUMN_CONTAINING_DATE contains values like '2013-05-07 00:00:01'. So this would work:
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE>='2013-05-07' AND COLUMN_CONTAINING_DATE<'2013-05-08'
However, I don't want to go for option 2 because it feels like a hack. I would rather write a query that says "get me all the rows for a given date" and somehow not bother about the hours and minutes in COLUMN_CONTAINING_DATE.
I am trying to have this query run on both H2 and DB2.
Any suggestions?
You can do:
select *
from MY_Table
where trunc(COLUMN_CONTAINING_DATE) = '2013-05-07';
However, the version that you describe as a "hack" is actually better. When you wrap a function around the column, many SQL optimizers cannot use indexes on it. With direct comparisons, an index can be used.
Use something like this
SELECT * FROM MY_TABLE WHERE COLUMN_CONTAINING_DATE=DATE('2013-05-07')
You can make this easier if you use the temporal data management capability in DB2 10.1.
For more information:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1204db2temporaldata/
If your concerns are related to the different data types (timestamp in the column, and a string containing a date), you can do this:
SELECT * FROM MY_TABLE
WHERE
COLUMN_CONTAINING_DATE >= '2013-05-07 00:00:00'
and COLUMN_CONTAINING_DATE < '2013-05-08 00:00:00'
I'd also pay attention to the formatting of the WHERE clause, because it improves readability a lot if you have to look at your queries two months later. Just pick a style you prefer for half-open ranges like "a <= x < b"; unfortunately, SQL's BETWEEN does not support them.
One could argue that the milliseconds are still missing, so perfectionists may append another ".0" in the timestamp ...
Could somebody recommend a query to retrieve records up to today or up to a certain date?
I'm required to produce an Oracle report where the user enters a date and records up to that date are shown.
I tried
select * from the_table where the_date <= sysdate
However, it seems to produce an inaccurate result. What is a better query for this? For now I'm just playing around with sysdate. Later I will need to use a certain date keyed in by the user, and all the records up to that date need to be shown.
Any suggestions?
Sometimes you get inaccurate results because of small differences in hours, minutes and seconds when two dates fall on the same day/month/year. Try the following:
select * from the_table where TRUNC(the_date) <= sysdate
TRUNC removes the time portion (hours, minutes and seconds), which is often the cause of those inaccurate results.
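If the cutoff later comes from a user-entered date, a sketch along the same lines (the bind variable :user_date is a made-up name) that keeps the column unwrapped, so an index on the_date can still be used:
select * from the_table where the_date < TRUNC(:user_date) + 1;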