Azure Synapse query optimization: join of three huge tables

One of my stored procedures joins three tables and inserts the result into a new table, and it is taking forever to process.
Table1 has 861,244,940 records (84 GB). Table2 has 729,841,882 records (56 GB).
Table3 has 46,140 records.
Any idea how can I process this faster?
The query (inside the SP) is below:
insert into newtable
select ...
from t1 join t2 on t1.id = t2.id
join t3 on t2.id2 = t3.id2
id2 has a lot of duplicate values (it is not unique), but all records are required. The total number of records in the final table will increase after the join, since id2 is not unique.
I ran it for 5 hours and it failed because of a query timeout. There is also a date column I could use to process the results batch by batch, but I am not sure how long that would take.
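For illustration, the batch-by-date idea might look like the sketch below. This is a minimal sketch, not the original procedure: the load_date column name, the date range, and the monthly step are all assumptions.
DECLARE @batch_start date = '2015-01-01'; -- assumed earliest date
DECLARE @batch_end date = '2023-01-01'; -- assumed latest date
WHILE @batch_start < @batch_end
BEGIN
    INSERT INTO newtable
    SELECT ... -- same column list as the original query
    FROM t1
    JOIN t2 ON t1.id = t2.id
    JOIN t3 ON t2.id2 = t3.id2
    WHERE t1.load_date >= @batch_start
      AND t1.load_date < DATEADD(month, 1, @batch_start);
    SET @batch_start = DATEADD(month, 1, @batch_start);
END;
Each batch commits separately, so a timeout only costs you the current month rather than the whole insert.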

Query optimization is one of the solutions for the above issue. There are many optimization techniques; one of them is the clustered columnstore index.
Note: you can find other optimization techniques in Optimising Query Performance — In Azure Synapse Analytics | by Samueldavidwinter | Medium.
Clustered columnstore index
A clustered columnstore index is the physical storage for the entire table. To reduce fragmentation of the column segments and improve performance, the columnstore index may store some data temporarily in a clustered index called a deltastore, plus a B-tree list of IDs for deleted rows. For more information, check the following links:
Columnstore indexes - Query performance - SQL Server | Microsoft Learn
Performance tuning with ordered clustered columnstore index - Azure Synapse Analytics | Microsoft Learn
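For example, in a dedicated SQL pool an ordered clustered columnstore index can be declared when the table is created. A minimal sketch (the table and column names here are illustrative, not from the question):
CREATE TABLE sales_ordered
WITH
(
    DISTRIBUTION = HASH (sale_id),
    CLUSTERED COLUMNSTORE INDEX ORDER (sale_id)
)
AS
SELECT *
FROM sales;
Ordering by a commonly filtered or joined column helps segment elimination, so fewer segments have to be scanned at query time.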
I created demo tables Account, Acc, and Persons and inserted some random repeated values into them.
(Screenshots of the Account, Acc, and Persons tables omitted.)
Now, each table should be converted to columnstore storage in order to reduce the storage space:
CREATE TABLE account2
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM account;
Apply the above query to all three tables. Now compare the size of each table before and after applying the clustered columnstore index using the following query:
EXEC sp_spaceused 'account';
(Before and after screenshots of the sp_spaceused output omitted.)
Joining these three indexed tables and inserting the results into a fourth table:
SELECT *
INTO finalt
FROM account2 t1
INNER JOIN Persons2 t2 ON t1.ACCOUNTKEY = t2.PersonID
INNER JOIN acc2 t3 ON t2.PersonID = t3.IKEY;
The join over the tables with the clustered columnstore index takes less time than the same join over the tables without it, as the screenshots showed.
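One caveat worth adding: in Synapse, table distribution often matters as much as the index for a join this size. ROUND_ROBIN tables have to be shuffled at join time, so hash-distributing the two large tables on the join key can avoid that data movement. A sketch, assuming id is selective enough to avoid skew (an assumption, not something verified here):
CREATE TABLE t1_hash
WITH
(
    DISTRIBUTION = HASH (id),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM t1;
Do likewise for t2; the 46,140-row t3 is small enough to be a candidate for DISTRIBUTION = REPLICATE.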

Related

SQL Server 2016: query performance with join and without join

I have 2 tables, TABLE1 and TABLE2.
TABLE1 has columns masterId, Id, col1, col2, category
TABLE2 has columns Id, col1, col2
TABLE2.Id is primary key and TABLE1.Id is foreign key.
TABLE1.masterId is primary key of TABLE1.
TABLE1 has 10 million rows with Id 1 to 10 million and first 10 rows having category = 1
TABLE2 has only 10 rows with Id 1 to 10.
Now I want the col1 and col2 values where category = 1 (from either TABLE1 or TABLE2, because the values are the same in both tables).
Which of the two queries below gives output faster?
Solution1:
SELECT T1.col1, T1.col2
FROM TABLE1 T1
WHERE T1.category = 1
Solution2:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.Id = T2.Id
WHERE T1.category = 1
Does Solution 2 save table-scan time on the millions of rows of TABLE1?
The limitation is: in my real db scenario, I can make TABLE1.Id a nonclustered index and TABLE1.category a nonclustered index as well, but I cannot make TABLE1.Id the clustered index because I actually have another auto-increment column as the primary key of TABLE1. So please confirm and share your thoughts with this limitation in mind.
It depends on the existing indexes. With a nonclustered index on Id in TABLE1, solution 2 might perform better than solution 1, which would require a complete table scan to select the rows with category = 1. If instead there is also a nonclustered index on category, then solution 1 will be faster, since it would only have to seek that nonclustered index to find the rows.
Without any index on Id in TABLE1, a full scan would be required to find each TABLE2.Id row, so solution 2 might need 10 full scans of TABLE1 against 1 full scan (filtered on category) for solution 1; solution 1 might therefore be faster. But this depends on the query optimizer, and testing the real case to see the actual execution plans would be the best way to answer.
Either way, the way to go is to implement the right model first and then create the indexes needed to make the query run fast.
Edit: adapted the answer according to the query edits.
Edit 2: a covering index would be expensive, and 10 index seeks on the PK of TABLE1 would not cost much.
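To make the options above concrete, a sketch of the candidate indexes (index names are illustrative):
-- Helps solution 2: seek TABLE1 by Id from the join
CREATE NONCLUSTERED INDEX IX_Table1_Id ON TABLE1 (Id);
-- Helps solution 1: seek directly on category
CREATE NONCLUSTERED INDEX IX_Table1_category ON TABLE1 (category);
-- Covering variant for solution 1; per Edit 2 this is wider and
-- therefore more expensive to maintain:
-- CREATE NONCLUSTERED INDEX IX_Table1_category_cov
--     ON TABLE1 (category) INCLUDE (col1, col2);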
[Notice]
This answer was given for an older version of the question, https://stackoverflow.com/revisions/65263530/7
The scenario back then was:
T2 also had a category column, and
the second query was:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.categoryId = T2.categoryId
WHERE T2.category = 1
Assuming the only indices are the PKs: nope, Solution 2 will NOT avoid the table scan. Worse:
Solution 1: full table scan.
Solution 2: full table scan on T2 (T2.category) and then nested loops (T2.category = T1.category).
Please, what are your goals here?
To begin with, this statement shows a lack of understanding of databases:
first 10 rows having category = 1
SQL tables represent unordered sets. There is no such thing as "first 10 rows". In the context of your question, I think you mean "the 10 rows with the lowest values of the id". However, the ordering of the table is still arbitrary from the perspective of the engine. There are situations where a clustered index could reasonably be assumed to be a "table ordering", but there is never a guarantee that:
select *
from t;
returns data in a particular ordering even with a clustered index.
Two possible execution plans for the first query -- depending on the indexing -- are:
(1) Scanning the table (i.e. reading millions of rows) and doing the test for each row.
(2) Scanning an index on category and just fetching the rows that are needed.
In general, (1) would be much, much slower than (2) when the scanned rows number in the millions and the returned rows are just a few. However, this may not be true if a significant proportion of all records were returned.
I interpret your question as asking whether the second query could ever be faster than the first:
SELECT T2.col1, T2.col2
FROM TABLE2 T2
INNER JOIN TABLE1 T1 ON T1.Id = T2.Id
WHERE T1.category = 1;
The answer is "definitely faster than the scan". This is possible if you have an index on Table1(id, category). However, the query would be better written using EXISTS:
select t2.*
from table2 t2
where exists (select 1
              from table1 t1
              where t1.id = t2.id and t1.category = 1
             );
I would expect this to be faster than the indexed version of the first query as well. Even with an index on (category), the database still has to fetch the data for the select. If the data is on one page (as the "first" statement might suggest), then the two might be quite comparable. However, it would be hard to measure the difference in performance with the correct indexing on table1.
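For reference, a sketch of the index that supports the EXISTS form (the index name is illustrative):
-- The join seeks on Id; category is checked in the same index entry,
-- so no lookup into the base table is needed for the filter.
CREATE NONCLUSTERED INDEX IX_Table1_Id_category
    ON TABLE1 (Id, category);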
A note about clustered indexes in SQL Server. If the id is an identity primary key and there is no other clustered index, then it is automatically used as the clustered index.
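For instance, a sketch of that default (column types are illustrative, not from the question):
CREATE TABLE TABLE1 (
    masterId int IDENTITY(1,1) PRIMARY KEY, -- becomes the clustered index by default
    Id int,
    col1 varchar(50),
    col2 varchar(50),
    category int
);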

Query performance of non clustered index during inner join

We have two tables in an inner join query. Which of the options below is recommended for high performance when there are millions of records in both tables?
NOTE: We have a non-clustered index on the foreign key column. We don't have enough data to verify the performance in the development environment. Also, more tables might come into this join with INNER or LEFT joins.
Tables:
Subscriber(SID(PK), Name)
Account(AID(PK),SID(FK), AName)
Query:
SELECT *
FROM Account A
INNER JOIN Subscriber S ON S.SubscriberID = A.SubscriberID
WHERE
S.SubscriberID = @subID -- option 1
A.SubscriberID = @subID -- option 2
It should not make any difference which column you put the predicate on.
SQL Server can see the S.SubscriberID= A.SubscriberID condition and create an "implied predicate" for the other column anyway.
The join then effectively becomes a cross join (between the filtered rows from both sides), as discussed here.

Impact of index on different columns while join

Let's say we have two tables, table A and table B, and both have 5 million records each. They have common fields, id and name. I want to check what the impact of an index on the join field would be when joining the tables, and what the impact of an index on the selected column would be. Below is the query:
select t1.name from table A t1 inner join table B t2 on t1.id=t2.id;
On which field should I create the index in order to get faster results? Should I put the index on id or on name? Please help.
My expectation is that if we put the index on the id column, the query will return results in a shorter duration than if we put the index on the name field.
I am looking for a performance improvement.
My expectation is that if we put the index on the id column, the query will return results in a shorter duration than if we put the index on the name field.
For this query:
select t1.name
from tableA t1
inner join tableB t2 on t1.id = t2.id;
I would expect the best index to be tableB(id). This is the key used for the JOIN.
Under some circumstances, an index on tableA(id, name) might be the best alternative. This would be particularly true if tableA were much larger than tableB.
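A sketch of those two candidates (index names are illustrative):
-- The join key on the inner side of the join
CREATE INDEX IX_tableB_id ON tableB (id);
-- Covering alternative on tableA: serves both the join and the
-- select list without touching the base table
CREATE INDEX IX_tableA_id_name ON tableA (id, name);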

Index suggestion for a table which has id columns

I have a table whose all columns store ids of other tables (huge tables).
CREATE TABLE #mytable (
    Table1Id int,
    Table2Id int,
    Table3Id int,
    Table4Id int,
    Table5Id int
)
Now my select joins to all the tables whose ids are stored in the columns of my table.
select T1.col1, T2.col1, T3.col1, ...
from #mytable MyTable
inner join table1 T1 on MyTable.Table1Id = T1.Id
inner join table2 T2 on MyTable.Table2Id = T2.Id
inner join table3 T3 on MyTable.Table3Id = T3.Id
inner join table4 T4 on MyTable.Table4Id = T4.Id
inner join table5 T5 on MyTable.Table5Id = T5.Id
order by T1.col1, T2.col1
At the moment I only have an index on Table1Id and on all the id columns of the other tables. Any suggestions to improve the performance?
You don't say which column your index is currently defined on, but based on your example query, you should create an index on all five columns:
Table1Id, Table2Id, Table3Id, Table4Id, Table5Id
This allows the SQL engine to resolve the query just by reading the index, which should be faster than reading the index, then reading the table.
If you run queries that access only some of the columns, then you need an index for those columns as well. Let's say you run a query on Table3Id and Table4Id. Then you need to create an index on:
Table3Id, Table4Id
I can't tell from the information you provided in your question if these indexes should be unique or non unique. You would have to make that determination.
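For concreteness, a sketch of both suggested indexes (names are illustrative; make them UNIQUE only if your data allows it):
CREATE INDEX IX_mytable_all
    ON #mytable (Table1Id, Table2Id, Table3Id, Table4Id, Table5Id);
CREATE INDEX IX_mytable_t3_t4
    ON #mytable (Table3Id, Table4Id);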
Examine #mytable: you have no search criteria on that table (no where, no order by, no group by). You are just going to get those rows in no particular order. There is no use for any index on #mytable.
The index Table1Id is not used by that query and will slow down inserts
I suspect #mytable is just an output table and the where conditions are used to populate that table.
The join will use the ID on the table to be joined.
So index Id on table1 through table5, and make it a PK (clustered) if you can.
If that index is fragmented then defrag.
That join should be an index seek and you can't do any better.
Verify the query plan has index seeks on the joins.
If you don't have index seeks on those joins, then post the query plan.
You could experiment with hints on the join but I suspect the query optimizer will get it right - that may be a big query but it is not a complex query.
Since SQL Server grabs whole pages, if you order #mytable by the individual columns you have a better chance of the needed page being in memory.
A PK is free IF you can insert in the order of the PK. In that case you would put the column with the most values in the last position. Actually, you would put the column with the tightest groupings in the last position, and then sort by the PK.
For the statement that you have put in your question, there is probably little that you can do. In fact, indexes could even hurt under some circumstances if you are in a memory limited environment.
As a first step, though, you should have indexes in the numbered tables on the id column. That is, you should be storing and then joining on the primary key of these tables (the index is automatic on a primary key).
Generally, the purpose of indexes is to prevent scanning an entire table to find a particular set of records. In this case, it looks like you want all the records anyway, so full-table scans are necessary. That limits the applicability of indexes. There is a good chance that SQL Server will turn these joins into hash joins, which is an efficient way of joining a table when you need to read all the rows.
Additional indexes might be warranted depending on the where and group by clauses.
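As a sketch of that first step, assuming Id is not already the primary key of each numbered table (if it is, nothing needs to change):
-- The PK is clustered by default in SQL Server; repeat for table2..table5
ALTER TABLE table1 ADD CONSTRAINT PK_table1 PRIMARY KEY (Id);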

Performance issue with select query in Firebird

I have two tables, one small (~ 400 rows), one large (~ 15 million rows), and I am trying to find the records from the small table that don't have an associated entry in the large table.
I am encountering massive performance issues with the query.
The query is:
SELECT * FROM small_table WHERE NOT EXISTS
(SELECT NULL FROM large_table WHERE large_table.small_id = small_table.id)
The column large_table.small_id references small_table's id field, which is its primary key.
The query plan shows that the foreign key index is used for the large_table:
PLAN (large_table (RDB$FOREIGN70))
PLAN (small_table NATURAL)
Statistics have been recalculated for indexes on both tables.
The query takes several hours to run. Is this expected?
If so, can I rewrite the query so that it will be faster?
If not, what could be wrong?
I'm not sure about Firebird, but in other DBs a join is often faster:
SELECT *
FROM small_table st
LEFT JOIN large_table lt
ON st.id = lt.small_id
WHERE lt.small_id IS NULL
Maybe give that a try?
Another option, if you're really stuck, and depending on the situation this needs to run in, is to pull the small_id column out of large_table, possibly into a temp table, and then do a left join / EXISTS query against that (see the sketch below).
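A sketch of that temp-table variant, using Firebird's global temporary table syntax (available in Firebird 2.1+; names are illustrative, and the INSERT and SELECT must run in the same transaction because of ON COMMIT DELETE ROWS):
CREATE GLOBAL TEMPORARY TABLE tmp_small_ids (small_id INT)
    ON COMMIT DELETE ROWS;

INSERT INTO tmp_small_ids
    SELECT DISTINCT small_id FROM large_table;

SELECT st.*
FROM small_table st
LEFT JOIN tmp_small_ids t ON t.small_id = st.id
WHERE t.small_id IS NULL;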
If the large table only has relatively few distinct values for small_id, the following might perform better:
select *
from small_table st
left outer join (
    select distinct small_id
    from large_table
) lt on lt.small_id = st.id
where lt.small_id is null
In this case, performance would be better served by a full scan of the large table followed by index lookups in the small table -- the opposite of what the current plan does. The distinct can be satisfied by just an index scan on the large table, which then uses the primary key index on the small table.