SQL: how do I speed up this query - sql

Here is the situation. I have one table that contains records based on records in many different tables (t1 below). t2 is one of the tables that has information pulled from it into t1.
t1
table_oid --which table id is a FK to
id --fk to other table
store_num --field
t2
t2_id
Here is what I need to find: I need the largest t2_id where the store_num is not null in the corresponding record of t1. Here is the query I wrote:
select max(id) from t1
join t2 on t2.t2_id = t1.id
where store_num is not null
and table_oid = 1234;
However, this takes fairly long. I think this should be a fast query. all _ids have indexes for them. (t1.id/t1.table_oid, t2.t2_id). The vast majority of entries in t1 have a store_num.
Mentally, I would get the t2_ids in desc order, than one by one, try them against t1 until I found the first one that had a store_num.
select t2_id from t2 order by t2_id desc;
has an explain cost of 25612
select t1.* from t1 where table_oid = 1234
and id in (select max(t2_id) from t2);
has an explain cost of 8.
So why wouldn't the above query be a cost of at most 25612*8 = 204896? When I explain it, it comes back as more than 3 times that.
Really, my question is how do I re-write that query to run faster.
NOTE: I am using Oracle.
EDIT:
t2 has 11,895,731 rows
t1 has 473,235,192 rows
EDIT 2:
As I've tried different things, the part of the query that is taking the longest is the full scan on t1 looking for the store_num. Is there a way to keep this from doing a full scan, since I only need the biggest entry?

You say:
all _ids have indexes for them
But your query is:
...
where store_num is not null
and table_oid = 1234;
All of your _id indexes are useless for this query unless store_num and table_oid are also indexed, and are the first columns in said index.
So of course it has to do a full scan; it can give you back max(id) instantly without any filter conditions, but as soon as you put in the filter, it can't use the id index anymore because it doesn't know which part of the index matches those store_num is not null entries - not without a scan.
To speed the query up, you need to create an index on (store_num, table_oid, id). Standard disclaimers about creating indexes for a single ad-hoc query apply; having too many indexes will hurt insert/update performance.
It really doesn't matter how you "rewrite" your query - this isn't like application code, the optimizer is going to rearrange all of the pieces of your query anyway. Unless you have sufficiently-selective indexes on your seek columns or the entire query is completely covered by a single index, it's going to be slow.

Not sure if these apply to Oracle. Do you have an index on the fk id column for the join. Also if you can avoid the 'NOT IN' is't a non-sargable type in SQL which slows down a query.
another option that might be slower is doing an outer join then checking for null on that column. (not sure if that only applies to sql also)
select max(id) from t1
left outer join t2 on t2.t2_id = t1.id
where t1... IS NULL
and table_oid = 1234;

The best way I can think of to have this run fast is to:
Create an index on (TABLE_OID, ID DESC, COVERED_ENTITY_ID) in that order. Why?
table_oid -- this is your primary access condition
id -- so you don't have to access a data block to read it,
-- and you get higher ID values first
covered_entity_id -- you're filtering the data based on this, null vs not null
That should prevent the need to access the 473m rows in T1 at all.
Ensure that there's an index on T2_ID.
If all that's in place, a query like:
select max(id)
from t1
inner join t2
on t2.t2_id = t1.id
where covered_entity_id is not null
and table_oid = 1234;
should be (the optimizer is a finicky beast) able to do a semi-join driven by a fast full scans against the index on T1, never scanning the data blocks. Also consider writing it manaully as:
select max(id)
from t1
where covered_entity_id is not null
and table_oid = 1234
and exists (select null
from t2
where t1.id = t2.t2_id);
select max(id)
from t1
where covered_entity_id is not null
and table_oid = 1234
and id in (select t2_id from t2);
As the optimizer may write those plans slightly differently.

In the following I assume covered_entity_id is the same as store_num - it would really make things easier for us if you were consistent in your naming.
The vast majority of entries in t1
have a store_num.
Given that this is the case, the following clause shouldn't have any impact on the performance of your query ...
where covered_entity_id is not null
However, you go on to say
the part of the query that is taking
the longest is the full scan on t1
looking for the store_num
This suggests the query is looking for covered_entity_id is not null first rather than the presumably far more selective table_oid = 1234. The solution might be as simple as re-writing the query like this ...
where table_oid = 1234
and covered_entity_id is not null;
... although I suspect not. You could try hinting to get the query to use the index on table_oid.
The other thing is, how fresh are the statistics? When the optimizer chooses a radically bad execution plan it is often because the stats are out of date.
Incidentally, why are you joining to T2 at all? Your requirements could be met by selecting max(id) from T1 (unless you don't have a foreign key enforcing T1.ID references T2.T2_ID, and hence need to be sure).
edit
To check your statistics run this query:
select table_name
, num_rows
, last_analyzed
from user_tables
where table_name in ('T1', 'T2')
/
If the results show num_rows is widely divergent from the values you gave in your first edit then you should re-gather statistics. If last_anlayzed is something like the day you went live then you definitely should re-gather. You may want to export your statistics first; refreshing the statistics can affect the execution plans (that is the object of the exercise) usually for good but sometimes things can get worse. Find out more.

Related

SQL Server 2016 : query performance with join and without join

I have 2 tables TABLE1 AND TABLE2.
TABLE1 has columns masterId, Id, col1, col2, category
TABLE2 has columns Id, col1, col2
TABLE2.Id is primary key and TABLE1.Id is foreign key.
TABLE1.masterId is primary key of TABLE1.
TABLE1 has 10 million rows with Id 1 to 10 million and first 10 rows having category = 1
TABLE2 has only 10 rows with Id 1 to 10.
Now I want col1 and col2 values with category=1 (either from TABLE1 OR TABLE2 because the values are same in both tables)
Which among below 2 queries gives output faster?
Solution1:
SELECT T1.col1, T1.col2
FROM TABLE1 T1
WHERE T1.category = 1
Solution2:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.Id = T2.Id
WHERE T1.category = 1
Does Solution2 save Table Scan time on millions of rows of TABLE1.
Limitation is:
In my real db scenario, I can make Table1.Id as non clustered index and Table1.category also non clustered index. I cannot make Table1.Id as clustered index because I actually have another auto increment column as primary key in my Table1 in real scenario. So please share your thoughts with this limitation.
Please confirm and share thoughts on this.
It depends on the existing indexes. With a nonclustered index on Id in T1, then the solution 2 might perform better than solution 1, that would require a complete table scan to select the rows with category1. If instead we also have a nonclustered index on Category, then the solution 1 will be faster, since it would only have to seek the nonclustered index to find the rows.
Without any index on Id on T1 a full scan would be required to find the T2.Id row, therefore there might be 10 full scan of T1 for solution 2 and 1 full scan on T1.Category for solution 1, so the solution 1 might be faster. But this depends on the query optimizer and a test the real case to see what are the actual execution plans would be the best way to answer.
But the way to go is to implement the right model and then proceed to create the indexes needed to make the query run fast.
Edit: adapted the answer according to the query edits.
Edit2: index coverage would be expensive and a 10 index seek on PK on table 1 would not cost so much.
[Notice]
This answer was given for an older version of the question, https://stackoverflow.com/revisions/65263530/7
The scenario back then was:
T2 also had a category column, and,
the second query was:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.categoryId = T2.category Id
WHERE T2.category = 1
Assuming the only indices are the PKs, nope, Solution 2 will NOT avoid the table scan. Worse:
Solution 1
Full table scan
Solution 2
Full table scan on T2 (T2.category) and then nested loops (T2.category = T1.category)
Please, what are your goals here?
To begin with, this statement shows a lack of understanding of databases:
first 10 rows having category = 1
SQL tables represent unordered sets. There is no such thing as "first 10 rows". In the context of your question, I think you mean "the 10 rows with the lowest values of the id". However, the ordering of the table is still arbitrary from the perspective of the engine. There are situations where a clustered index could reasonably be assumed to be a "table ordering", but there is never a guarantee that:
select *
from t;
returns data in a particular ordering even with a clustered index.
Two possible execution plans for the first query -- depending on the indexing -- are:
Scanning the table (i.e. reading millions of rows) and doing the test for each row.
Scanning an index on category and just fetching the rows that are needed.
In general, (1) would be much, much slower than (2) when the scanned rows is in the millions and the returned rows are just a few. However, if this may not be true if a significant proportion of all records were returned.
I interpret your question as asking whether the second query could ever be faster than the first:
SELECT T2.col1, T2.col2
FROM TABLE2 T2 INNER JOIN
TABLE1 T1
ON T1.Id = T2.Id
WHERE T1.category = 1;
The answer is "definitely faster than the scan". This is a possible if you have an index on Table1(id, category). However, the query would be better written using EXISTS:
select t2.*
from table2 t2
where exists (select 1
from table1 t1
where t1.id = t2.id and t2.category = 1
);
I would expect this to be faster than the indexed version of the first query as well. Even with an index on (category), the database still has to fetch the data for the select. If the data is on one page (as the "first" statement might suggest), then the two might be quite comparable. However, it would be hard to measure the difference in performance with the correct indexing on table1.
A note about clustered indexes in SQL Server. If the id is an identity primary key and there is no other clustered index, then it is automatically used as the clustered index.

Why query with "in" and "on" statement runs infinitely

I have three tables, table3 is bascially the intermediate table of table1 and table2. When I execute the query statement that contains "in" and joins table1 and table3, it just kept running and I could not get the result. If I use id=134 instead of id in (134,267,390,4234 ... ), the result comes up. I don't understand why "in" has the effect, does anyone have an idea?
Query statement:
select count(*) from table1, table3 on id=table3.table1_id where table3.table2_id = 123 and id in (134,267,390,4234) and item = 30;
table structure:
table1:
id integer primary key,
item integer
table2:
id integer,
item integer
table3:
table1_id integer,
table2_id integer
-- the DB without index was 0.8 TB after the three indices is now 2.5 TB
indices on: table1.item, table3.table1_id, table3.table2_id
env: Linux, sqlite 3.7.17
from table1, table3 is a cross join on most databases, with the size of your data a cross join is enormous, but in SQLite3 it's an inner join. From the SQLite SELECT docs
Side note: Special handling of CROSS JOIN. There is no difference between the "INNER JOIN", "JOIN" and "," join operators. They are completely interchangeable in SQLite.
That's not your problem in this specific instance, but let's not tempt fate; always write out your joins explicitly.
select count(*)
from table1
join table3 on id=table3.table1_id
where table3.table2_id = 123
and id in (134,267,390,4234);
Since you're just counting, you don't need any data from table1 but the ID. table3 has table1_id, so there's no need to join with table1. We can do this entirely with the table3 join table.
select count(*)
from table3
where table2_id = 123
and table1_id in (134,267,390,4234);
SQLite can only use one index per table. For this to be performant on such a large data set, you need a composite index of both columns: table3(table1_id, table2_id). Presumably you don't want duplicates, so this should take the form of a unique index. That will cover queries for just table1_id and queries for both table1_id and table2_id; you should drop your table1_id index to save space and time.
create unique index table3_unique on table3(table1_id, table2_id);
The composite index will not for queries which use only table2_id, keep your existing table2_id index.
Your query should now run lickity-split.
For more, read about the SQLite Query Optimizer.
A terabyte is a lot of data. While SQLite technicly can handle this, it might not be the best choice. It's great for small and simple databases, but it's missing a lot of features. You should look into a more powerful database such as PostgreSQL. It is not a magic bullet, all the same principles apply, but it is much more appropriate for data at that scale.

How to redesign a database to find distinct values more effectively?

I often have a need to select a set of distincs values from a column with low selectivity in a big table while joining it to some other table where I can't really filter the entries in the resulting set to some reasonable amount.
For example, I have a table with 20M rows, with column someID which has 200 unique values. I join this table with some other result set on another column and filter 20M rows down to, say, 10M rows (still a lot), and then need to find distinct someID. So I end up with a 10M rows scan no matter what, which is a pain.
In this join, there is no way to filter the results more, 10M records is really the set I need to find distint someID in.
Is there any standard approach to redesign the tables or create some additional table to make this work better?
Your basic query is:
select distinct t1.someID
from table1 t1 join
table2 t2
on t1.col1 = t2.col1;
The optimal indexes for this query are table1(col1, someId) and table2(col1).
Here is another version of the query:
select distinct t1.someId
from table1 t1
where exists (select 1 from table2 t2 where t1.col1 = t2.col1);
In this case, the optimal index would be table1(someid, col1). It is possible that SQL Server will be intelligent in this case and stop looking for an exists value when it encounters a match (although I am a bit skeptical). You would have to investigate the execution plans generated on your data.
Another idea extends this even further:
select s.someId
from someIdtable s
where exists (select 1
from table1 t1 join
table2 t2
on t1.col1 = t2.col1 and t1.someId = s.someId);
This removes the outer distinct, depending only on the semi-join in the exists clause. The optimal index would be table1(someid, col1).
Under some circumstances, this version would probably have the best performance -- for instance, if all the someIds were in the result set. On the other hand, if very few are, this might have poor performance.
I'm stealing the "basic query" from Gordons answer:
select t1.someID
from table1 t1
join table2 t2 on t1.col1 = t2.col1
group by t1.someID
This query fits the requirements of indexed views. You can index this query. Running it will result in a simple clustered index scan which is as cheap as it gets.

Execute MySQL update query on 750k rows

I've added a field to a MySQL table. I need to populate the new column with the value from another table. Here is the query that I'd like to run:
UPDATE table1 t1
SET t1.user_id =
(
SELECT t2.user_id
FROM table2 t2
WHERE t2.usr_id = t1.usr_id
)
I ran that query locally on 239K rows and it took about 10 minutes. Before I do that on the live environment I wanted to ask if what I am doing looks ok i.e. does 10 minutes sound reasonable. Or should I do it another way, a php loop? a better query?
Use an UPDATE JOIN! This will provide you a native inner join to update from, rather than run the subquery for every bloody row. It tends to be much faster.
update table1 t1
inner join table2 t2 on
t1.usr_id = t2.usr_id
set t1.user_id = t2.user_id
Ensure that you have an index on each of the usr_id columns, too. That will speed things up quite a bit.
If you have some rows that don't match up, and you want to set t1.user_id = null, you will need to do a left join in lieu of an inner join. If the column is null already, and you're just looking to update it to the values in t2, use an inner join, since it's faster.
I should make mention, for posterity, that this is MySQL syntax only. The other RDBMS's have different ways of doing an update join.
There are two rather important pieces of information missing:
What type of tables are they?
What indexes exist on them?
If table2 has an index that contains user_id and usr_id as the first two columns and table1 is indexed on user_id, it shouldn't be that bad.
You don't have an index on t2.usr_id.
Create this index and run your query again, or a multiple-table UPDATE proposed by #Eric (with LEFT JOIN, of course).
Note that MySQL lacks other JOIN methods than NESTED LOOPS, so it's index that matters, not the UPDATE syntax.
However, the multiple table UPDATE is more readable.

SQL: Optimization problem, has rows?

I got a query with five joins on some rather large tables (largest table is 10 mil. records), and I want to know if rows exists. So far I've done this to check if rows exists:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query, in a stored procedure takes 22 seconds and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I got indexes on the fields that I'm joining on and the fields in the WHERE clause.
Any ideas?
switch to EXISTS predicate. In general I have found it to be faster than selecting top 1 etc.
So you could write like this IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 .. do your stuff
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can use see a diagram of the execution path of any query you submit.
In Oracle and MySQL you can use the EXPLAIN keyword to get details about how the query is working.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
4 options
Try COUNT(*) in place of TOP 1 tbl.id
An index per column may not be good enough: you may need to use composite indexes
Are you on SQL Server 2005? If som, you can find missing indexes. Or try the database tuning advisor
Also, it's possible that you don't need 5 joins.
Assuming parent-child-grandchild etc, then grandchild rows can't exist without the parent rows (assuming you have foreign keys)
So your query could become
SELECT TOP 1
tbl.Id --or count(*)
FROM
grandchildtable tbl
INNER JOIN
anothertable ON ... = ...
WHERE
tbl.xxx = ...
Try EXISTS.
For either for 5 tables or for assumed heirarchy
SELECT TOP 1 --or count(*)
tbl.Id
FROM
grandchildtable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
-- or
SELECT TOP 1 --or count(*)
tbl.Id
FROM
mytable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
AND
EXISTS (SELECT *
FROM
yetanothertable T3
WHERE
tbl.key = T3.key /* AND T3 condition*/)
Doing a filter early on your first select will help if you can do it; as you filter the data in the first instance all the joins will join on reduced data.
Select top 1 tbl.id
From
(
Select top 1 * from
table tbl1
Where Key = Key
) tbl1
inner join ...
After that you will likely need to provide more of the query to understand how it works.
Maybe you could offload/cache this fact-finding mission. Like if it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying to have the appropriate clustered index. Granted you may be using these tables for other types of queries, but for the absolute fastest way to go, you can tune all your clustered indexes for this one query.
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
Use the maximun row table first in every join and if more than one condition use
in where then sequence of the where is condition is important use the condition
which give you maximum rows.
use filters very carefully for optimizing Query.