We have a table with more than two million rows, and every query against it is a BETWEEN lookup on Col1 and Col2. Each lookup has exactly one possible result. For example:
Col1 Col2
1 5
6 10
11 15
select * from table1 where 8 between Col1 and Col2
I currently have a unique clustered index on (Col1, Col2). So far I have been unable to figure out how to further tune the query and the indexes to minimize the rows handled. The execution plan currently reports a cost of almost 0.5 and 113k rows handled when locating the one and only correct answer.
What options might I be overlooking?
As requested, some details from the current execution plan:
Operation
Clustered Index Seek
Predicate
CONVERT_IMPLICIT(bigint,[#2],0)<=[Col2]
Seek Predicate
Seek Keys[1]: End: Col1 <= Scalar Operator(CONVERT_IMPLICIT(bigint,[#1],0))
Are the ranges always non-overlapping? You mention that there is always only one match. If they are, you can write it as:
SELECT * FROM table1
WHERE 8 <= Col2
ORDER BY Col2 ASC
LIMIT 1
This will give you the row with the lowest value of Col2 that is at least 8, which is the range you are interested in. The index would only be needed on Col2, and the cost should be small.
Since you did not mention the DBMS you are using, the LIMIT 1 should be replaced with whatever your DB uses to fetch the first N results.
You will have to check Col1 <= your_value in code to ensure that the value you are looking for really is in the range.
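The lookup plus the follow-up check can be sketched as below. This is only a minimal illustration using Python's built-in sqlite3 module as a stand-in DBMS (the question does not name one); the helper name find_range, the index name ix_table1_col2 and the extra (21, 25) row are assumptions made for the example.

```python
import sqlite3

def find_range(conn, value):
    """Return the (col1, col2) range containing value, or None if there is none."""
    row = conn.execute(
        "SELECT col1, col2 FROM table1 "
        "WHERE col2 >= ? ORDER BY col2 ASC LIMIT 1",
        (value,),
    ).fetchone()
    # The query only guarantees col2 >= value; col1 <= value is checked in code.
    if row is not None and row[0] <= value:
        return row
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER, col2 INTEGER)")
# Hypothetical index name; only col2 needs to be indexed for this lookup.
conn.execute("CREATE UNIQUE INDEX ix_table1_col2 ON table1 (col2)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(1, 5), (6, 10), (11, 15), (21, 25)],  # (21, 25) leaves a gap after 15
)

print(find_range(conn, 8))   # the range that contains 8
print(find_range(conn, 18))  # 18 falls in the gap between 15 and 21
```

The second call shows why the check in code matters: the query still returns a row, but that row's Col1 is greater than the value being looked up.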
I think I have found the answer. I had to start by creating a unique clustered index on Col1, then create a unique nonclustered index on Col2. The query then had to be updated to force a lookup on each index.
select * from table1 where Col1 =
(select max(Col1) from table1 where Col1 <= 8)
and Col2 =
(select min(Col2) from table1 where Col2 >= 8)
Execution plan now reports 0.0098 cost and 1 row handled.
select * from table1 where Col1 <= 8 and Col2 >= 8
Maybe the "between" with two columns is causing an issue.
Also, you should just have 1 composite index on both columns (Col1, Col2).
I was searching for a way to get the latest occurrences based on col1 and col2.
Let's suppose we have the following table (all rows needed are marked with *):
col1 col2 col3
---------------------------------------------------------
002478 ABC 2019-08-23 *
002478 ABC 2019-05-14
002588 CVMG 2019-01-07 *
002588 IP 2019-01-31 *
002588 MMG 2019-09-04 *
002588 MMG 2019-08-28
002588 NUSA 2019-11-04 *
002588 NUSA 2019-04-24
002746 IE 2019-01-15 *
003467 IE 2020-01-10
003467 IE 2020-03-13 *
I was able to get the latest occurrences based on col1 and col2 with the following select.
SELECT t.col1,
t.col2,
t.col3
FROM
teste t
WHERE t.col3 IN (SELECT max(a.col3)
FROM teste a
WHERE a.col1 = t.col1 AND a.col2 = t.col2)
In this example, it only takes about 7 to 10 ms to complete, but on my real database it takes about 1 hour.
I removed all the JOINs that I could, and the best time I reached was about 55 minutes.
As I'm using Progress, there's no window function (that I'm aware of) like PARTITION BY.
Is there another way to solve this problem? The only query I could think of was in that "style".
Here's an SQL Fiddle with that example database.
Another way of writing the same query is to select the rows for which there does not exist a newer related row:
SELECT t.col1, t.col2, t.col3
FROM teste t
WHERE NOT EXISTS
(
SELECT NULL
FROM teste t_newer
WHERE t_newer.col1 = t.col1
AND t_newer.col2 = t.col2
AND t_newer.col3 > t.col3
);
This may be faster or slower or equally fast. This depends on how your DBMS runs this internally.
With either of the two queries the DBMS faces the task of quickly looking up other rows with the same col1 and col2. With only the table, the DBMS would have to read it sequentially again and again. This is where indexes come into play. You provide the DBMS with indexes in which it can look up where the matching rows are in the table.
In your case you want an index on col1 and col2, in order to provide a means to find the related rows. You can also add col3, as this must be compared, too. Maybe it doesn't matter whether the index starts with col1 or col2, maybe it does. How many different col1 values are in the table, and how many different col2 values? If one has just 5 different values and the other 5,000, then start the index with the one with 5,000 values, because for one value you will find fewer rows, i.e. get to the rows you are interested in faster.
An index could then look like
create index idx on teste (col1, col2, col3);
The queries stay the same. The DBMS will look at your query and decide whether to use an index or not. For the given queries I am sure it will use the index mentioned, because the queries are all about quickly looking up related rows.
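As a rough check of the above, here is a minimal sketch that loads the sample rows from the question, creates the index, and runs the NOT EXISTS query. It uses Python's built-in sqlite3 module purely as a stand-in (the question is about Progress, whose optimizer may behave differently), and it stores col3 as ISO date text, which is an assumption about the column type.

```python
import sqlite3

# Sample rows from the question; col3 is stored as ISO text so that
# string comparison matches date order (an assumption for this sketch).
ROWS = [
    ("002478", "ABC", "2019-08-23"),
    ("002478", "ABC", "2019-05-14"),
    ("002588", "CVMG", "2019-01-07"),
    ("002588", "IP", "2019-01-31"),
    ("002588", "MMG", "2019-09-04"),
    ("002588", "MMG", "2019-08-28"),
    ("002588", "NUSA", "2019-11-04"),
    ("002588", "NUSA", "2019-04-24"),
    ("002746", "IE", "2019-01-15"),
    ("003467", "IE", "2020-01-10"),
    ("003467", "IE", "2020-03-13"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teste (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany("INSERT INTO teste VALUES (?, ?, ?)", ROWS)
# The index suggested above: the lookup columns first, then the compared column.
conn.execute("CREATE INDEX idx ON teste (col1, col2, col3)")

latest = conn.execute("""
    SELECT t.col1, t.col2, t.col3
    FROM teste t
    WHERE NOT EXISTS (
        SELECT NULL
        FROM teste t_newer
        WHERE t_newer.col1 = t.col1
          AND t_newer.col2 = t.col2
          AND t_newer.col3 > t.col3)
    ORDER BY t.col1, t.col2
""").fetchall()

for row in latest:
    print(row)
```

The result should contain exactly the rows marked with * in the question. Whether the index is actually used has to be confirmed with the query plan of your own DBMS.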
I have a quite complex query that is based on multiple tables unioned together. At the moment, we are using a view in order to perform operations on all the rows we need, so the view and a query look like:
CREATE VIEW
V_VIEW
(
COL1, COL2, COL3, COL4
) AS
SELECT
"COL1", "COL2", "COL3", "COL4"
FROM
TABLE1
UNION ALL
SELECT
"COL1", "COL2", "COL3", "COL4"
FROM
TABLE2;
SELECT
COL1, COL2
FROM
( SELECT
COL1, COL2
FROM
V_VIEW
WHERE
COL1 like 'val%'
AND COL2 =
(
SELECT
MAX(COL3)
FROM
V_VIEW
WHERE
COL4 = 'Y' ) ) part1
UNION ALL
SELECT
COL1, COL2
FROM
( SELECT
COL1, COL2
FROM
V_VIEW
WHERE
COL1 like 'sth%'
AND COL2 =
(
SELECT
MIN(COL3)
FROM
V_VIEW
WHERE
COL4 = 'N' ) ) part2;
I'm looking for a way to improve the performance of this query. Unfortunately, creating a new table that contains all rows of Table1 and Table2 is not an option for now (we are not allowed to interfere with the way rows are inserted there). I tried to use a WITH clause instead of the view, so it would look a bit like:
WITH TEMP_TABLE AS (
SELECT
COL1, COL2, COL3, COL4
FROM
TABLE1
UNION ALL
SELECT
COL1, COL2, COL3, COL4
FROM
TABLE2 )
SELECT
COL1, COL2
FROM
( SELECT
COL1, COL2
FROM
TEMP_TABLE
WHERE
COL1 like 'val%'
AND COL2 =
(
SELECT
MAX(COL3)
FROM
TEMP_TABLE
WHERE
COL4 = 'Y' ) ) part1
UNION ALL
SELECT
COL1, COL2
FROM
( SELECT
COL1, COL2
FROM
TEMP_TABLE
WHERE
COL1 like 'sth%'
AND COL2 =
(
SELECT
MIN(COL3)
FROM
TEMP_TABLE
WHERE
COL4 = 'N' ) ) part2
On a small data volume (Table1 and Table2 have about 20k rows) this improves performance considerably. However, those tables will eventually hold millions of rows. I don't entirely understand how the WITH clause is processed, so I wonder: is there a chance that a query using the WITH clause will fail on a large data set (due to lack of memory?), where a query without it would be slow but would finish just fine?
You could try using the following:
WITH main_res AS (SELECT col1,
col2,
MIN(CASE WHEN col4 = 'N' THEN col3 END) OVER () col3_n_min,
MAX(CASE WHEN col4 = 'Y' THEN col3 END) OVER () col3_y_max
FROM v_view
WHERE col1 LIKE 'val%'
OR col1 LIKE 'sth%')
SELECT col1,
col2
FROM main_res
WHERE (col1 LIKE 'val%' AND col2 = col3_y_max)
OR (col1 LIKE 'sth%' AND col2 = col3_n_min);
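This uses conditional analytic functions to return the required col3 value (depending on the col4 value) across all the rows that pass the WHERE clause. Because analytic functions are evaluated after the WHERE clause, rows whose col1 matches neither pattern do not contribute to these values, unlike the subqueries in the original query.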
Once you know that information, you can then filter on it appropriately. This should reduce the number of times you're querying each table, which usually is faster (but not always!) than the original query. I advise you test this query and work out if it's faster than the original query (and any other answers) before you choose which one to use.
A WITH clause is a kind of view that is created on the fly and used once; its definition is not stored in the DB. However, it consumes main memory to store the information related to the cursor that retrieves rows from the WITH SELECT query. You are right: a WITH query on tables with huge amounts of data can slow down the DB.
I am not aware of:
a) Whether TABLE1 and TABLE2 hold the full data set or are incrementally updated.
b) Whether there are date columns in these tables.
c) At what interval these tables are populated or updated.
Based on the answers to the above questions:
After discussing with your DBAs:
You can ask the DBAs to extract the data belonging to TRUNC(SYSDATE) or TRUNC(SYSDATE)-1 from TABLE1 and TABLE2 and populate it into a single "new" table with the same columns plus two additional columns:
a) One column containing the first three letters of the COL1 value.
b) Another column holding a status value with DEFAULT 'Q'.
Create a LIST partition on this new table on COL1 for the values 'Val' and 'Sth' and on COL4 for Y and N.
Write an anonymous block that prepares the data the way you need it. A simple query on this new table should then fetch the data for you. The anonymous block can be scheduled as a job, depending on how often data becomes available in the source tables TABLE1 and TABLE2.
These suggestions are based on a set of assumptions and on the amount of information you have shared.
If any UI or report runs on this data, then housekeeping of this data is required.
Bottom line:
Prepare the data as required by the subsequent process(es) beforehand rather than preparing it on the fly when it is needed. This will simplify your entire process and the query as well.
Most of the time, when we encounter performance bottlenecks in a Prod or Int environment, we look for short-term solutions. Short-term solutions are very much needed to sort out the issue at hand. However, I would suggest being prepared with a long-term solution as well.
Before investing too much time in rewriting, it would be helpful to ensure that the optimizer is given a fighting chance at doing a good job. Make sure the tables have good stats and appropriate indexes.
Run explain plans on your queries to see what Oracle is actually doing in each case. You may find that something unexpected is going on with those UNION ALL statements. The optimizer sometimes makes dumb decisions and you may need to help it with indexes or strategically applied hints.
The WITH clause is quite handy and does the same job as a standalone view or a view defined inline in the table list, with one key exception: Oracle treats standalone views, WITH-clause views, and inline views slightly differently in the optimization process.
Oracle may choose to materialize the results of a view defined in a WITH clause, while it may merge the view if it is defined inline.
The point is that changing between these three kinds of views in your query will cause odd nuances of the optimizer to start showing up.
Finally, what version of Oracle are you on? The optimizer is one area where version really matters.
Which would be the more cost-effective way to write a basic SELECT query?
Option one:
SELECT id
FROM table
WHERE COL0 NOT IN (2,3,4,5,6,7,8,9...)
AND COL1 >= 20
AND COL2 <= 10
AND .... ;
Or option two:
SELECT id FROM table WHERE COL0 NOT IN (2,3,4,5,6,7,8,9...);
COL0 is an FK column.
The first thing needed would be an index on COL0. But beyond that...
The NOT IN clause could contain anywhere from 1 to 1000 values, for example.
Questions:
Would the additional conditions in the WHERE clause help the DB perform the query faster by eliminating rows that should not be in the response, or would they just be additional work to check?
Would having hundreds of ID values in the NOT IN clause be considered a bad and "expensive" design?
I'm using Firebird 2.5 .
The DB query optimizer will try to use the index that filters out the most rows.
So you should use the first approach and add either:
separate indexes on col0, col1 and col2
a composite index on all three (col0, col1, col2)
So imagine you have 1000 rows but only 10 have col1 >= 20: the optimizer will use the col1 index to filter out 990 rows, making the rest of the query faster.
Also, instead of using NOT IN, you could save those values in a separate table tblFilter:
SELECT T1.id
FROM table T1
LEFT JOIN tblFilter T2
ON T1.col0 = T2.col0
WHERE T2.col0 IS NULL
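A minimal sketch of this anti-join pattern follows, run through Python's built-in sqlite3 module rather than Firebird 2.5, so it shows the shape of the query and not Firebird's plan. The table name items (standing in for the question's placeholder "table"), the index name and the sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "items" stands in for the question's table; names and data are hypothetical.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, col0 INTEGER)")
conn.execute("CREATE INDEX ix_items_col0 ON items (col0)")
# The values that used to sit in the NOT IN (...) list go into a filter table.
conn.execute("CREATE TABLE tblFilter (col0 INTEGER PRIMARY KEY)")

conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 3), (4, 10), (5, 2)],
)
conn.executemany("INSERT INTO tblFilter VALUES (?)", [(2,), (3,)])

# Anti-join: keep only the rows of items that have no match in tblFilter.
kept_ids = [
    row[0]
    for row in conn.execute("""
        SELECT T1.id
        FROM items T1
        LEFT JOIN tblFilter T2 ON T1.col0 = T2.col0
        WHERE T2.col0 IS NULL
        ORDER BY T1.id
    """)
]
print(kept_ids)
```

One difference to keep in mind: a row whose col0 is NULL is kept by this anti-join, whereas NOT IN (...) would exclude it.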
I am using sqlite3 and I am trying to retrieve all rows ordered by a column, with null values coming last. As of now I am using this kind of query:
select * from table order by row1 is null, row1 asc
As there are many rows in my table, the query ran quite slowly, so I decided to create an index on table(row1).
Creating the index greatly improved the speed of queries like:
select * from table order by row1 asc
However, SQLite doesn't seem to use that index for queries of the "order by row1 is null" type.
Why can't SQLite, based on that index, just move the rows with null values to the end?
Is there any way I can make null values come last without evaluating every row every time?
SQLite 3.9.0 and later (the release was numbered 3.8.12 during development) support expressions in indexes:
> CREATE TABLE t(x);
> CREATE INDEX tnx ON T(x IS NULL, x);
> EXPLAIN QUERY PLAN SELECT * FROM t ORDER BY x IS NULL, x;
0|0|0|SCAN TABLE t USING COVERING INDEX tnx
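The same experiment can be repeated from Python's built-in sqlite3 module, as sketched below. It assumes the SQLite library linked into your Python is version 3.9.0 or newer; the sample values are made up, and the exact wording of the EXPLAIN QUERY PLAN output differs between SQLite versions, so it is only printed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [(3,), (None,), (1,), (None,), (2,)],  # made-up sample values
)
# Expression index: first key is 0 for non-NULL x and 1 for NULL x, then x itself.
conn.execute("CREATE INDEX tnx ON t (x IS NULL, x)")

ordered = [
    row[0] for row in conn.execute("SELECT x FROM t ORDER BY x IS NULL, x")
]
print(ordered)

# Show which access path SQLite picks; the text varies by version.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t ORDER BY x IS NULL, x"
).fetchall()
for step in plan:
    print(step)
```

The ORDER BY expression has to match the indexed expression for the index to be considered.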
In earlier versions, you can split the query into two subqueries, each of which can use an index:
SELECT *
FROM (SELECT *
FROM MyTable
WHERE row1 IS NOT NULL
ORDER BY row1)
UNION ALL
SELECT *
FROM MyTable
WHERE row1 IS NULL;
You can try a conditional in order by.
select * from table
order by case when row1 is null then 1 else 0 end, row1
The default ordering is ascending. Hence asc has been omitted from the query above.
I have a table named table with 21 million records, of which 20 million meet the criterion col1 = 'text'. I then started iteratively setting col2 to a non-NULL value. After I had updated 10 million records, the following query, which was fast in the beginning, became slow:
SELECT T_PK
FROM (SELECT T_PK
FROM table
WHERE col1= 'text' AND col2 IS NULL
ORDER BY T_PK DESC)
WHERE ROWNUM < 100;
I noticed that the query is fast again (a couple of seconds, under 10 s) as soon as I remove the DESC, the whole ORDER BY T_PK DESC clause, or the whole outer query with the condition WHERE ROWNUM < 100.
The execution plan (not reproduced here) shows an index full scan descending on the PK of the table. Besides the index on the PK, I have an index defined on col2.
What could be the reason that the query was fast and then got slow? How can I make the query fast regardless of how many records are already set to non-null value?
For this query:
SELECT T_PK
FROM (SELECT T_PK
FROM table
WHERE col1= 'text' AND col2 IS NULL
ORDER BY T_PK DESC
) t
WHERE ROWNUM < 100;
The optimal index is table(col1, col2, t_pk).
I think the problem is that the optimizer has a choice of two indexes -- either for the where clause (col1 and -- probably -- col2) or one on t_pk. If you have a single index that handles both clauses, then performance should improve.
One reason that the DESC might make a difference is where the matching rows lie. If all the matching rows are in the first 100,000 rows of the table, then when you order descending, the query might have to throw out 20.9 million rows before finding a match.
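To make the suggested index concrete, here is a small sketch using Python's built-in sqlite3 module. It is only an analogue: SQLite has no ROWNUM, so LIMIT stands in for the outer ROWNUM filter, and SQLite's planner does not reproduce Oracle's choices. The table name tbl, the index name and the generated data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "tbl" stands in for the question's table; the data below is generated.
conn.execute("CREATE TABLE tbl (t_pk INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    # Every third row still has col2 NULL; the others were already updated.
    [(i, "text", None if i % 3 == 0 else "done") for i in range(1, 31)],
)
# One index covering both WHERE columns and the ORDER BY column.
conn.execute("CREATE INDEX ix_tbl_col1_col2_pk ON tbl (col1, col2, t_pk)")

top = [
    row[0]
    for row in conn.execute("""
        SELECT t_pk
        FROM tbl
        WHERE col1 = 'text' AND col2 IS NULL
        ORDER BY t_pk DESC
        LIMIT 5
    """)
]
print(top)
```

With such an index the matching rows are adjacent and already sorted by t_pk, so the first few can be read without walking past rows that fail the WHERE clause.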
I think Burleson explained this quite nicely:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
Beware!
This use of rownum< can cause performance problems. Using rownum may change the all_rows optimizer mode for a query to first_rows, causing unexpected sub-optimal execution plans. One solution is to always include an all_rows hint when using rownum to perform a top-n query.