How to redesign a database to find distinct values more effectively? - sql

I often have a need to select a set of distinct values from a column with low selectivity in a big table while joining it to some other table, where I can't really filter the entries in the resulting set down to some reasonable amount.
For example, I have a table with 20M rows, with column someID which has 200 unique values. I join this table with some other result set on another column and filter 20M rows down to, say, 10M rows (still a lot), and then need to find distinct someID. So I end up with a 10M rows scan no matter what, which is a pain.
In this join, there is no way to filter the results further; 10M records really is the set I need to find distinct someID in.
Is there any standard approach to redesign the tables or create some additional table to make this work better?

Your basic query is:
select distinct t1.someID
from table1 t1 join
table2 t2
on t1.col1 = t2.col1;
The optimal indexes for this query are table1(col1, someId) and table2(col1).
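As a sketch, assuming SQL Server and made-up index names, those indexes could be created like this:
-- hypothetical index names, adjust to your conventions
create index ix_table1_col1_someID on table1 (col1, someID);
create index ix_table2_col1 on table2 (col1);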
Here is another version of the query:
select distinct t1.someId
from table1 t1
where exists (select 1 from table2 t2 where t1.col1 = t2.col1);
In this case, the optimal index would be table1(someid, col1). It is possible that SQL Server will be intelligent here and stop probing table2 as soon as it encounters a match (although I am a bit skeptical). You would have to investigate the execution plans generated on your data.
Another idea extends this even further:
select s.someId
from someIdtable s
where exists (select 1
from table1 t1 join
table2 t2
on t1.col1 = t2.col1 and t1.someId = s.someId);
This removes the outer distinct, depending only on the semi-join in the exists clause. The optimal index would be table1(someid, col1).
Under some circumstances, this version would probably have the best performance -- for instance, if all the someIds were in the result set. On the other hand, if very few are, this might have poor performance.
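If a someIdtable holding the ~200 known values does not already exist, a minimal sketch for building one (assuming someID is an integer) could be:
-- hypothetical lookup table of the ~200 distinct values
create table someIdtable (someId int primary key);
insert into someIdtable (someId)
select distinct someID from table1;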

I'm stealing the "basic query" from Gordon's answer:
select t1.someID
from table1 t1
join table2 t2 on t1.col1 = t2.col1
group by t1.someID
This query fits the requirements for indexed views, so you can index this query. Running it will then result in a simple clustered index scan, which is as cheap as it gets.
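A minimal sketch of such an indexed view, assuming SQL Server, the dbo schema, and made-up view/index names (indexed views require SCHEMABINDING, two-part table names, and COUNT_BIG(*) when GROUP BY is used):
create view dbo.vDistinctSomeID
with schemabinding
as
select t1.someID, count_big(*) as cnt
from dbo.table1 t1
join dbo.table2 t2 on t1.col1 = t2.col1
group by t1.someID;
go
create unique clustered index ix_vDistinctSomeID on dbo.vDistinctSomeID (someID);
On editions other than Enterprise/Developer you would typically query it with the NOEXPAND hint, e.g. select someID from dbo.vDistinctSomeID with (noexpand);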

Related

SQL Server 2016 : query performance with join and without join

I have 2 tables TABLE1 AND TABLE2.
TABLE1 has columns masterId, Id, col1, col2, category
TABLE2 has columns Id, col1, col2
TABLE2.Id is the primary key and TABLE1.Id is a foreign key.
TABLE1.masterId is the primary key of TABLE1.
TABLE1 has 10 million rows with Id 1 to 10 million and first 10 rows having category = 1
TABLE2 has only 10 rows with Id 1 to 10.
Now I want the col1 and col2 values with category = 1 (either from TABLE1 or TABLE2, because the values are the same in both tables).
Which among below 2 queries gives output faster?
Solution1:
SELECT T1.col1, T1.col2
FROM TABLE1 T1
WHERE T1.category = 1
Solution2:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.Id = T2.Id
WHERE T1.category = 1
Does Solution2 save table scan time on the millions of rows of TABLE1?
Limitation is:
In my real db scenario, I can make Table1.Id a nonclustered index and Table1.category a nonclustered index as well. I cannot make Table1.Id the clustered index because I actually have another auto-increment column as the primary key in my Table1 in the real scenario. So please share your thoughts with this limitation in mind.
Please confirm and share thoughts on this.
It depends on the existing indexes. With a nonclustered index on Id in T1, solution 2 might perform better than solution 1, which would require a complete table scan to select the rows with category = 1. If instead we also have a nonclustered index on category, then solution 1 will be faster, since it only has to seek that nonclustered index to find the rows.
Without any index on Id in T1, a full scan would be required to find each T2.Id row, so there might be 10 full scans of T1 for solution 2 and one full scan on T1.category for solution 1; solution 1 might then be faster. But this depends on the query optimizer, and testing the real case to see the actual execution plans would be the best way to answer.
But the way to go is to implement the right model and then proceed to create the indexes needed to make the query run fast.
Edit: adapted the answer according to the query edits.
Edit2: making the index covering would be expensive, and 10 index seeks on the PK of Table1 would not cost much.
[Notice]
This answer was given for an older version of the question, https://stackoverflow.com/revisions/65263530/7
The scenario back then was:
T2 also had a category column, and,
the second query was:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.categoryId = T2.categoryId
WHERE T2.category = 1
Assuming the only indices are the PKs, nope, Solution 2 will NOT avoid the table scan. Worse:
Solution 1
Full table scan
Solution 2
Full table scan on T2 (T2.category) and then nested loops (T2.category = T1.category)
Please, what are your goals here?
To begin with, this statement shows a lack of understanding of databases:
first 10 rows having category = 1
SQL tables represent unordered sets. There is no such thing as "first 10 rows". In the context of your question, I think you mean "the 10 rows with the lowest values of the id". However, the ordering of the table is still arbitrary from the perspective of the engine. There are situations where a clustered index could reasonably be assumed to be a "table ordering", but there is never a guarantee that:
select *
from t;
returns data in a particular ordering even with a clustered index.
Two possible execution plans for the first query -- depending on the indexing -- are:
Scanning the table (i.e. reading millions of rows) and doing the test for each row.
Scanning an index on category and just fetching the rows that are needed.
In general, (1) would be much, much slower than (2) when the scanned rows number in the millions and the returned rows are just a few. However, this may not be true if a significant proportion of all records were returned.
I interpret your question as asking whether the second query could ever be faster than the first:
SELECT T2.col1, T2.col2
FROM TABLE2 T2 INNER JOIN
TABLE1 T1
ON T1.Id = T2.Id
WHERE T1.category = 1;
The answer is "definitely faster than the scan". This is possible if you have an index on Table1(id, category). However, the query would be better written using EXISTS:
select t2.*
from table2 t2
where exists (select 1
from table1 t1
where t1.id = t2.id and t1.category = 1
);
I would expect this to be faster than the indexed version of the first query as well. Even with an index on (category), the database still has to fetch the data for the select. If the data is on one page (as the "first" statement might suggest), then the two might be quite comparable. However, it would be hard to measure the difference in performance with the correct indexing on table1.
A note about clustered indexes in SQL Server. If the id is an identity primary key and there is no other clustered index, then it is automatically used as the clustered index.
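As a sketch (hypothetical index names), a covering index for the first query and the (id, category) index mentioned above could be created like this:
-- covering index for solution 1: seek on category, no lookups for col1/col2
create index ix_table1_category on TABLE1 (category) include (col1, col2);
-- index supporting the join / EXISTS version
create index ix_table1_id_category on TABLE1 (Id, category);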

ORACLE join multiple tables performance

I have kinda complex question.
Let's say that I have 7 tables (20mil+ rows each) (Table1, Table2 ...) with corresponding pk (pk1, pk2, ....) (cardinality among all tables is 1:1)
I want to get my final table (using hash join) as:
Create table final_table as select
t1.column1,
t2.column2,
t3.column3,
t4.column4,
t5.column5,
t6.column6,
t7.column7
from table1 t1
join table2 t2 on t1.pk1 = t2.pk2
join table3 t3 on t1.pk1 = t3.pk3
join table4 t4 on t1.pk1 = t4.pk4
join table5 t5 on t1.pk1 = t5.pk5
join table6 t6 on t1.pk1 = t6.pk6
join table7 t7 on t1.pk1 = t7.pk7
I would like to know if it would be faster to create partial tables and then final table, like this?
Create table partial_table1 as select
t1.pk1, -- keep pk1 so the later joins can use it
t1.column1,
t2.column2
from table1 t1
join table2 t2 on t1.pk1 = t2.pk2;

create table partial_table2 as select
t1.pk1,
t1.column1, t1.column2,
t3.column3
from partial_table1 t1
join table3 t3 on t1.pk1 = t3.pk3;

create table partial_table3 as select
t1.pk1,
t1.column1, t1.column2, t1.column3,
t4.column4
from partial_table2 t1
join table4 t4 on t1.pk1 = t4.pk4;
...
...
...
I know it depends on RAM (because I want to use hash join), actual server usage, etc.. I am not looking for specific answer, I am looking for some explanations why and in what situations would it be better to use partial results or why it would it be better to use all 7 joins in 1 select.
Thanks, I hope that my question is easy to understand.
In general, it is not better to create temporary tables. SQL engines have an optimization phase, and this optimization phase should do well at figuring out the best query plan.
In the case of a bunch of joins, this is mostly about join order, use of indexes, and the optimal algorithm.
This is a good default attitude. Does it mean that temporary tables are never useful for performance optimization? Not at all. Here are some exceptions:
The optimizer generates a suboptimal query plan. In this case, query hints can push the optimizer in the right direction. And, temporary tables can help.
Indexing the temporary tables. Sometimes an index on the temporary tables can be a big win for performance. The optimizer might not pick this up.
Re-use of temporary tables across queries.
For your particular goal of using hash joins, you can use a query hint to ensure that the optimizer does what you would like. I should note that if the joins are on primary keys, then a hash join might not be the optimal algorithm.
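For example, a sketch of the original statement with a hash-join hint (Oracle hint syntax; whether it actually helps depends on your data and available memory):
create table final_table as
select /*+ use_hash(t1 t2 t3 t4 t5 t6 t7) */
       t1.column1, t2.column2, t3.column3, t4.column4,
       t5.column5, t6.column6, t7.column7
from table1 t1
join table2 t2 on t1.pk1 = t2.pk2
join table3 t3 on t1.pk1 = t3.pk3
join table4 t4 on t1.pk1 = t4.pk4
join table5 t5 on t1.pk1 = t5.pk5
join table6 t6 on t1.pk1 = t6.pk6
join table7 t7 on t1.pk1 = t7.pk7;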
It is not a good idea to create temporary tables in your database. To optimize your query for reporting purposes or faster results, try using views instead; they can lead to much better results.
For your specific case, where you want to use a hash join, can you please explain a bit more why you want that in particular? The optimizer will determine the best plan by itself, and you don't need to worry about the type of join it performs.

Retrieve count from query using joins which has billions of rows in SQL

I found a faster way to get the count but only 1 table can be specified.
SELECT CONVERT(bigint, rows)
FROM sysindexes
WHERE id = OBJECT_ID('table_name')
AND indid < 2
Is there a way to use the above query when we are using joins?
Ex: to get the count of this query -
Select t1.col1, t2.col1
from t1
join t2 on t1.col2 = t2.col2
No.
The sysindexes query you show is using metadata to return the estimated row count for the table. It is not necessarily up to date; it can even be lied to (See the link from #Ivan-starostin).
To accomplish your join requires relating two tables and counting the results of that relation. This is not even theoretically possible using metadata; it requires looking at actual data.
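In other words, to count the join result you end up running something like the following, which has to read real data (sketch using the columns from the question):
select count_big(*)
from t1
join t2 on t1.col2 = t2.col2;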

Select proper columns from JOIN statement

I have two tables: table1, table2. Table1 has 10 columns, table2 has 2 columns.
SELECT * FROM table1 AS T1 INNER JOIN table2 AS T2 ON T1.ID = T2.ID
I want to select all columns from table1 and only 1 column from table2. Is it possible to do that without enumerating all columns from table1 ?
Yes, you can do the following:
SELECT t1.*, t2.my_col FROM table1 AS T1 INNER JOIN table2 AS T2 ON T1.ID = T2.ID
Even though you can do the t1.*, t2.col1 thing, I would not recommend it in production code.
I would never ever use a SELECT * in production - why?
you're telling SQL Server to get all columns - do you really, really need all of them?
by not specifying the column names, SQL Server has to go figure that out itself - it has to consult the data dictionary to find out what columns are present which does cost a little bit of performance
most importantly: you don't know what you're getting back. Suddenly, the table changes and another column or two are added. If you have any code which relies on e.g. the sequence or the number of columns in the table without explicitly checking for that, your code can break
My recommendation for production code: always (no exceptions!) specify exactly those columns you really need - and even if you need all of them, spell it out explicitly. Less surprises, less bugs to hunt for, if anything ever changes in the underlying table.
Use table1.* in place of all columns of table1 ;)

SQL: how do I speed up this query

Here is the situation. I have one table that contains records based on records in many different tables (t1 below). t2 is one of the tables that has information pulled from it into t1.
t1
table_oid --which table id is a FK to
id --fk to other table
store_num --field
t2
t2_id
Here is what I need to find: I need the largest t2_id where the store_num is not null in the corresponding record of t1. Here is the query I wrote:
select max(id) from t1
join t2 on t2.t2_id = t1.id
where store_num is not null
and table_oid = 1234;
However, this takes fairly long. I think this should be a fast query. all _ids have indexes for them. (t1.id/t1.table_oid, t2.t2_id). The vast majority of entries in t1 have a store_num.
Mentally, I would get the t2_ids in desc order, then one by one, try them against t1 until I found the first one that had a store_num.
select t2_id from t2 order by t2_id desc;
has an explain cost of 25612
select t1.* from t1 where table_oid = 1234
and id in (select max(t2_id) from t2);
has an explain cost of 8.
So why wouldn't the above query be a cost of at most 25612*8 = 204896? When I explain it, it comes back as more than 3 times that.
Really, my question is how do I re-write that query to run faster.
NOTE: I am using Oracle.
EDIT:
t2 has 11,895,731 rows
t1 has 473,235,192 rows
EDIT 2:
As I've tried different things, the part of the query that is taking the longest is the full scan on t1 looking for the store_num. Is there a way to keep this from doing a full scan, since I only need the biggest entry?
You say:
all _ids have indexes for them
But your query is:
...
where store_num is not null
and table_oid = 1234;
All of your _id indexes are useless for this query unless store_num and table_oid are also indexed, and are the first columns in said index.
So of course it has to do a full scan; it can give you back max(id) instantly without any filter conditions, but as soon as you put in the filter, it can't use the id index anymore because it doesn't know which part of the index matches those store_num is not null entries - not without a scan.
To speed the query up, you need to create an index on (store_num, table_oid, id). Standard disclaimers about creating indexes for a single ad-hoc query apply; having too many indexes will hurt insert/update performance.
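A sketch of that index in Oracle syntax (the index name is made up):
create index t1_store_oid_id_ix on t1 (store_num, table_oid, id);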
It really doesn't matter how you "rewrite" your query - this isn't like application code, the optimizer is going to rearrange all of the pieces of your query anyway. Unless you have sufficiently-selective indexes on your seek columns or the entire query is completely covered by a single index, it's going to be slow.
Not sure if these apply to Oracle. Do you have an index on the fk id column for the join? Also, if you can, avoid 'NOT IN'; it's non-sargable in SQL, which slows down a query.
Another option, which might be slower, is doing an outer join and then checking for null on that column (not sure if that only applies to SQL Server either).
select max(id) from t1
left outer join t2 on t2.t2_id = t1.id
where t1... IS NULL
and table_oid = 1234;
The best way I can think of to have this run fast is to:
Create an index on (TABLE_OID, ID DESC, COVERED_ENTITY_ID) in that order. Why?
table_oid -- this is your primary access condition
id -- so you don't have to access a data block to read it,
-- and you get higher ID values first
covered_entity_id -- you're filtering the data based on this, null vs not null
That should prevent the need to access the 473m rows in T1 at all.
Ensure that there's an index on T2_ID.
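A sketch of those two indexes (hypothetical names, Oracle syntax; the descending key relies on Oracle's DESC index support):
create index t1_oid_id_cov_ix on t1 (table_oid, id desc, covered_entity_id);
create index t2_id_ix on t2 (t2_id);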
If all that's in place, a query like:
select max(id)
from t1
inner join t2
on t2.t2_id = t1.id
where covered_entity_id is not null
and table_oid = 1234;
should be (the optimizer is a finicky beast) able to do a semi-join driven by fast full scans against the index on T1, never scanning the data blocks. Also consider writing it manually as:
select max(id)
from t1
where covered_entity_id is not null
and table_oid = 1234
and exists (select null
from t2
where t1.id = t2.t2_id);
select max(id)
from t1
where covered_entity_id is not null
and table_oid = 1234
and id in (select t2_id from t2);
The optimizer may generate slightly different plans for those two forms.
In the following I assume covered_entity_id is the same as store_num - it would really make things easier for us if you were consistent in your naming.
The vast majority of entries in t1
have a store_num.
Given that this is the case, the following clause shouldn't have any impact on the performance of your query ...
where covered_entity_id is not null
However, you go on to say
the part of the query that is taking
the longest is the full scan on t1
looking for the store_num
This suggests the query is looking for covered_entity_id is not null first rather than the presumably far more selective table_oid = 1234. The solution might be as simple as re-writing the query like this ...
where table_oid = 1234
and covered_entity_id is not null;
... although I suspect not. You could try hinting to get the query to use the index on table_oid.
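A sketch of such a hint, where the index name is hypothetical:
select /*+ index(t1 t1_table_oid_ix) */ max(id)
from t1
where table_oid = 1234
and covered_entity_id is not null;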
The other thing is, how fresh are the statistics? When the optimizer chooses a radically bad execution plan it is often because the stats are out of date.
Incidentally, why are you joining to T2 at all? Your requirements could be met by selecting max(id) from T1 (unless you don't have a foreign key enforcing T1.ID references T2.T2_ID, and hence need to be sure).
edit
To check your statistics run this query:
select table_name
, num_rows
, last_analyzed
from user_tables
where table_name in ('T1', 'T2')
/
If the results show num_rows is widely divergent from the values you gave in your first edit then you should re-gather statistics. If last_analyzed is something like the day you went live then you definitely should re-gather. You may want to export your statistics first; refreshing the statistics can affect the execution plans (that is the object of the exercise), usually for good, but sometimes things can get worse. Find out more.
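If you do decide to re-gather, a typical call (adjust owner and table names to your schema) would be something along these lines:
begin
  dbms_stats.gather_table_stats(ownname => user, tabname => 'T1');
  dbms_stats.gather_table_stats(ownname => user, tabname => 'T2');
end;
/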