Nested LEFT OUTER JOINs take too long compared to a programmatic solution - sql

I have an SQL query generated by Entity Framework that contains a two-level nested LEFT OUTER JOIN on tables TableA, TableB, TableC1 and TableC2, with the foreign key relations indicated by the arrows below.
TableA -> TableB -> TableC1
                 -> TableC2
TableA contains 1,000 rows, and all the other tables contain around 100,000 rows each.
The SQL statement looks something like this:
SELECT * FROM TableA A
LEFT JOIN TableB B ON A.Id = B.TableAId
LEFT JOIN TableC1 C1 ON B.Id = C1.TableBId
LEFT JOIN TableC2 C2 ON B.Id = C2.TableBId
When the SQL is executed on Microsoft SQL Server, it takes around 30 seconds.
However, if I select each table separately, retrieve the rows as lists in memory, and join them programmatically in C#, it takes around 3 seconds.
Can anyone give any indication of why SQL Server is so slow?
Thanks

When you use joins in SQL, the flattened result set you produce behaves much like a Cartesian product: every parent row is repeated for each of its child rows.
For example, say I have 2 tables, Table A and Table B, where Table A has 10 rows and each row has 10 related Table B rows:
If I load these 2 tables separately, I am loading 110 rows. 10 rows from table A, and 100 rows from table B.
If I JOIN these tables I am loading 100 rows; however, each of those hundred rows represents the combined data of both tables. If Table A has 10 columns and Table B has 20 columns, the total data read loading the tables separately would be 10x10 + 100x20, or 2,100 columns worth of data. With a JOIN, I am loading 30x100, or 3,000 columns worth of data. That's not a huge difference, but it compounds as I join more tables.
If each Table B row has a related Table C with an average of 5 rows of 10 columns each, loading them separately would add 5,000 (500x10) columns worth of data, for a total of 7,100. When joined, the result becomes 500 rows of 40 columns, or 20,000 columns worth of total data being loaded into memory or sifted through. You should see how this can and does snowball quite quickly if you start doing SELECT * FROM ... with joins.
When EF builds a query that loads entity graphs (related entities), the resulting query will often use JOINs, producing these Cartesian-style result sets. EF loads them, then sifts through them to build the resulting object graph, condensing the rows back down into the 10 A's with 10 B's and 5 C's each, and so on. It still takes memory and time to chew through all of that flattened data. EF Core offers query splitting (AsSplitQuery()), which executes essentially what your counter-comparison did: it loads the related tables with separate queries and pieces the graph together, greatly reducing the total amount of data being read.
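For illustration, with query splitting the generated SQL ends up shaped roughly like the separate statements below instead of one wide join (this is only a sketch; the exact SQL EF Core emits, including the ORDER BY clauses it uses to stitch results together, will differ):
-- one query for the roots, then one per included collection
SELECT A.* FROM TableA A;
SELECT B.* FROM TableA A JOIN TableB B ON B.TableAId = A.Id;
SELECT C1.* FROM TableA A JOIN TableB B ON B.TableAId = A.Id JOIN TableC1 C1 ON C1.TableBId = B.Id;
SELECT C2.* FROM TableA A JOIN TableB B ON B.TableAId = A.Id JOIN TableC2 C2 ON C2.TableBId = B.Id;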
Ultimately, to improve the performance of queries generated by EF:
Use Select or AutoMapper's ProjectTo to select just the values you need from the related tables, rather than loading entities with Include to eager-load related entities, when reading "sets" of entities such as search results. Load entities with Include for single entities, such as when updating one. (A sketch of such a projection follows this list.)
When querying the above data, inspect the execution plans for index suggestions.
If you do need to load large amounts of related data, consider using query splitting.
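For instance, the SQL that a projection of just the needed values boils down to is a narrow column list instead of every column of every entity (the column names below are made up for illustration):
SELECT A.Id, A.Name, B.Description
FROM TableA A
LEFT JOIN TableB B ON B.TableAId = A.Id;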

Related

Creating a view from JOIN two massive tables

Context
I have a big table, say table_A, with roughly 20 billion rows and 600 columns. I don't own this table but I can read from it.
For a fraction of these columns I produce a few extra columns (50) which I store in a separate table, say table_B, which is therefore roughly 20 bn X 50 large.
Now I need to expose the join of table_A and table_B to users, which I tried as:
CREATE VIEW table_AB
AS
SELECT *
FROM table_A AS ta
LEFT JOIN table_B AS tb ON (ta.tec_key = tb.tec_key)
The problem is that any simple query like SELECT * FROM table_AB LIMIT 2 will fail because of memory issues: apparently Impala attempts to do the full join in memory first, which would result in a table of 0.5 petabytes. Hence the failure.
Question
What is the best way to create such a view?
How can one instruct SQL that filtering operations on table_AB are to be executed before the join?
Creating a new table is also suboptimal because it would mean duplicating the data in table_AB, using up hundreds of terabytes.
I have also tried [...] SELECT STRAIGHT_JOIN * [...] but it did not help.
What is the best way to create such a view?
Since both tables are huge, there will be memory problems. Here are some points I would recommend (a rough sketch of the resulting view follows this list):
Assuming table_A and table_B share the same tec_key values, do an INNER JOIN.
Keep the (smaller) table_B as the driver: CREATE VIEW vw AS SELECT ... FROM b JOIN a ON .... Impala stores the driver table in memory, so this will require less memory.
Select only the columns that are required; do not select all of them.
Put the filter in the view.
Partition table_B if you can, on some date/year/region/anything that distributes the data evenly.
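A minimal sketch of such a view along those lines; tec_key comes from the question, but the other column names and the filter are placeholders for whatever the users actually need:
CREATE VIEW table_AB AS
SELECT tb.tec_key,
       ta.col1,            -- only the handful of table_A columns users need (name assumed)
       tb.extra1           -- the derived columns from table_B (name assumed)
FROM table_B tb            -- smaller table as the driver, per the advice above
JOIN table_A ta ON ta.tec_key = tb.tec_key
WHERE ta.region = 'Asia';  -- example filter baked into the view (column assumed)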
How can one instruct SQL that filtering operations on table_AB are to be executed before the join?
You cannot ensure whether a filter runs before or after the join. The only way to ensure a filter will improve performance is to have a partition on the filter column. Otherwise, you can try to filter first and then join, to see if it improves performance, like this:
SELECT ... FROM b
JOIN (SELECT ... FROM a WHERE region='Asia') a ON ...  -- won't improve much
Creating a new table is also suboptimal because it would mean duplicating the data in table_AB, using up hundreds of Terabytes.
Completely agree on this. Multiple smaller tables are much better than one giant table with 600 columns. So, create a few staging tables with only the required fields and then enrich that data. It's a difficult data set, but no one will change 20 bn rows every day, so some sort of incremental load is also possible to implement.
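A rough sketch of that staging idea, again with made-up column names (col1, extra1) and a made-up incremental filter column (load_date):
CREATE TABLE stg_ab AS
SELECT ta.tec_key, ta.col1, tb.extra1       -- only the required fields, not all 650 columns
FROM table_B tb
JOIN table_A ta ON ta.tec_key = tb.tec_key
WHERE ta.load_date >= '2020-01-01';         -- hypothetical incremental window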

Inner join returns duplicate data which does not exist in the table

When I execute this it returns more than a million rows; the first table has 315,000 rows, the second about 14,000. What should I do to get all rows from both tables? Also, if I don't stop the server, it breaks down while listing the nonexistent rows.
select *
from tblNormativiIspratnica
inner join tblNormativiSubIspratnica on tblNormativiIspratnica.ZaklucokBroj = tblNormativiSubIspratnica.ZaklucokBroj
If the first table has 315,000 rows, the second one has 14,000 rows, and you get over a million rows back, then the fields you are joining on do not form a good primary-key-foreign-key relationship, and the result is Cartesian-product-style duplicates. If you want a well-defined result, you must have well-defined fields that serve these purposes. By the way, take care of your server breakdowns: do not run queries that fetch large result sets if you have no idea what you are doing. Quickly look up the basics and understand the design by writing simple queries with more specific criteria that fetch small result sets, before you run large ones.
If you did not have issues with server performance and breakdowns, I would have suggested DISTINCT as a (poor) quick solution for getting unique rows, but remember that DISTINCT queries can, at times, take a performance toll.
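For illustration only, and only if the duplicate rows are truly identical in the columns you care about, that DISTINCT workaround would look something like this (the columns other than ZaklucokBroj are placeholders):
SELECT DISTINCT i.ZaklucokBroj, i.SomeColumn, s.OtherColumn
FROM tblNormativiIspratnica i
INNER JOIN tblNormativiSubIspratnica s ON i.ZaklucokBroj = s.ZaklucokBroj
The real fix, as noted above, is to join on fields that actually form a key relationship.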

Cardinality estimation when the foreign key is limited to a small subset

Recently I have been optimizing the performance of a large-scale ERP package.
One of the performance issues I haven't been able to solve involves bad cardinality estimation for a foreign key which is limited to a very small subset of records from a large table.
Table A holds 3 million records and has a type field.
Table B holds 7 million records and has a foreign key FK to Table A.
The foreign key will only be filled with primary keys from Table A of a certain type; only 36 of the 3 million records in Table A have this type.
B JOIN A ON B.FK = A.PK AND A.TYPE = X AND A.Name = Y
Now, using the correct statistics, SQL Server knows Table A will only return 1 row.
But SQL Server estimates that only 2 records will be returned from Table B (my guess is 7 million / 3 million), while actually 930,000 records are returned.
This results in a slow query plan being used.
The real query is more complex, but the cause of the bad query plan is this simplified statement.
Our DB does have accurate statistics for the FK (the histogram shows EQ_ROWS for every distinct value of this FK), and filtering on a fixed FK value does result in accurate estimates.
Is there any way to show SQL Server that this FK can only hold a small number of distinct values, or to help it with the estimates in any other way?
If we had the chance we would split up the table and put these 36 records in a separate table, but unfortunately this is how the ERP system works.
Extra info:
We are using SQL Server 2014.
The ERP system is Dynamics AX 2012 R3.
Using trace flag 9481 does help (not perfect, but a lot better), but unfortunately we cannot set trace flags for individual queries with Dynamics AX.
I've encountered these kinds of problems before, and have often found that I can dramatically reduce total run time for a stored proc or script by pulling those 'very few relevant rows' from a large table into a small temp table and then joining that temp table into the main query later. Or using CTE queries to isolate the few needed rows. A little experimentation should quickly tell you if this has potential in your case.
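A sketch of the CTE variant against the simplified join from the question (TableA, TableB and the literals are placeholders for the real ERP names):
WITH FilteredA AS (
    SELECT A.PK
    FROM TableA A
    WHERE A.TYPE = 'X' AND A.Name = 'Y'   -- isolates the ~36 relevant rows first
)
SELECT B.*
FROM TableB B
JOIN FilteredA FA ON B.FK = FA.PK;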
Look at the query plan.
Clearly you want it to filter on TYPE early.
It is probably doing a loop join.
FROM B
JOIN A
ON B.FK = A.PK
AND A.TYPE = X
AND A.Name = Y
Try the various join hints.
Next would be to create a #temp table and join to it.
Declare a PK on your temp table.
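For example, a sketch of that temp table approach against the same simplified query (again with placeholder names):
SELECT A.PK
INTO #FilteredA
FROM TableA A
WHERE A.TYPE = 'X' AND A.Name = 'Y';           -- only the ~36 qualifying rows

ALTER TABLE #FilteredA ADD PRIMARY KEY (PK);   -- declare the PK on the temp table

SELECT B.*
FROM TableB B
JOIN #FilteredA F ON B.FK = F.PK;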

Best practices for multiple table joins: UNION vs JOIN

I'm using a query which brings ~74 fields from different database tables.
The query consists of 10 FULL JOINS and 15 LEFT JOINS.
Each join condition is based on different fields.
The query fetches data from a main table which contains almost 90% foreign keys.
I'm using those joins for the foreign keys, but some of the data doesn't require all of them because its type of data (in terms of logic) doesn't use that information.
Let me give an example:
Each employee can have multiple Tasks. There are four types of tasks (1, 2, 3, 4).
Each TaskType has a different meaning. When running the query, I'm getting data for all those task types and then doing some logic to show them separately.
My question is: is it better to use UNION ALL and split the 4 different cases into separate queries? That way I could use only the required joins for each case in each branch of the union.
Thanks,
I would think it depends strongly on the size (row count) of the main table and of the task tables.
Say your main table has tens of millions of rows and the task tables are smaller: a union over all the task types will necessitate scanning the main table once per branch, whereas a join with the smaller task tables can do this with one table scan.
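To make the trade-off concrete, the UNION ALL variant would be shaped roughly like this (Employees, Task1, Task2 and the columns are made-up names); note that every branch visits the main table again:
SELECT e.EmployeeId, 1 AS TaskType, t.SomeField
FROM Employees e
JOIN Task1 t ON t.EmployeeId = e.EmployeeId
UNION ALL
SELECT e.EmployeeId, 2 AS TaskType, t.SomeField
FROM Employees e
JOIN Task2 t ON t.EmployeeId = e.EmployeeId
-- ...and likewise for task types 3 and 4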

Inefficient JOIN Method?

I'm trying to query two fairly large tables here to pull some results, and I'm having some trouble with efficiency.
Note: I've only included relevant columns to make this not look so messy!
TableA (Stock) has productID, ownerID, and count columns
TableB (Owners) has ID, accountHolderID, and name columns
What I'm trying to do is query TableA and, where productID = X, pull up Stock.productID, Stock.accountHolderID and Owners.name. The relation between these two tables is Stock.ownerID = Owners.ID, so if the WHERE condition pulled, say, five productIDs, then I'd want the name from TableB that matches up to the ownerID from TableA.
The only unique ID in this situation is Owners.ID from TableB
Just doing a basic SELECT query on TableA for those products takes 15 seconds; however, when I add an INNER JOIN to match things up to TableB, the query takes significantly longer, upwards of 10 minutes. I'm guessing I've designed this query inefficiently.
SELECT
Owners.name,
Stock.productID,
Stock.ownerID
FROM Stock
INNER JOIN
Owners
ON Stock.ownerID = Owners.ID
WHERE
Stock.productID = 42301679
How can I make this query more efficient?
Would adding ORs to the WHERE condition allow me to pull multiple productIDs at once?
Based on your comment, it looks like you're missing a very critical index on the owners.id field. Now, keep in mind this index will help this query, but you have to take into consideration all of the other queries that run against this table to determine if it is a good idea to add that index.
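Creating it would look something like this (MySQL syntax, since MySQL is mentioned in the other answer, and assuming Owners.ID is not already the primary key):
CREATE INDEX idx_owners_id ON Owners (ID);
-- or, since Owners.ID is described as the unique ID, making it the primary key may be the more natural fix:
-- ALTER TABLE Owners ADD PRIMARY KEY (ID);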
At 29M rows, having an index on a table that is frequently inserted to may have a noticeable effect on insert times.
This may be a situation where different applications need different indexes - namely your OLTP app and your reporting app (which may just be you running ad hoc queries). A common solution is to have a second server that runs your reporting/data warehouse queries that has indexes properly tuned to this function.
Best of luck.
Your query looks right.
Perhaps we can see the schema?
In order to pull multiple productIDs at once you can use the IN operator instead of OR:
SELECT
Owners.name,
Stock.productID,
Stock.ownerID
FROM Stock
INNER JOIN
Owners
ON Stock.ownerID = Owners.ID
WHERE
Stock.productID IN (42301679,123232,232324)
If productID is unique in the Stock table, it makes sense to index it, and this can greatly improve performance, as others have mentioned.
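For example (MySQL syntax, assuming no such index exists yet):
CREATE INDEX idx_stock_productID ON Stock (productID);
-- or, if each productID really appears only once in Stock:
-- ALTER TABLE Stock ADD UNIQUE (productID);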
Another performance gain comes from giving the Owners.name field a specific length. In MySQL, VARCHAR can be used for strings of varying length, while a CHAR(32) column indicates that the name will always occupy 32 characters. The extra unused space is just padded, so you can really think of the (32) as a maximum length. The performance advantage comes from the fact that the database now knows exactly how many bytes each row occupies, and it can use this information to improve lookup time.
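A sketch of that change, assuming the column is currently a VARCHAR and 32 characters is a safe upper bound for the data (MySQL syntax):
ALTER TABLE Owners MODIFY name CHAR(32);   -- fixed-width column; shorter values are padded with spaces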