I am curious about how exactly LINQ (not LINQ to SQL) performs its joins behind the scenes, in relation to how SQL Server performs joins.
Before executing a query, SQL Server generates an Execution Plan. The Execution Plan is basically a tree of operators describing what it believes is the best way to execute the query. Each node provides information on the operation to perform: a Sort, Scan, Select, Join, etc.
On a 'Join' node in our execution plan, we can see one of three possible algorithms: Hash Join, Merge Join, or Nested Loops Join. SQL Server will choose which algorithm to use for each join operation based on the expected number of rows in the inner and outer tables, the type of join being done (some algorithms don't support all types of joins), whether the data needs to be ordered, and probably many other factors.
Join Algorithms:
Nested Loops Join:
Best for small inputs; can be optimized with an ordered inner table.
Merge Join:
Best for medium to large sorted inputs, or for output that needs to be ordered.
Hash Join:
Best for medium to large inputs; can be parallelized to scale linearly.
LINQ Query:
DataTable firstTable, secondTable;
...
var rows = from firstRow in firstTable.AsEnumerable()
           join secondRow in secondTable.AsEnumerable()
           on firstRow.Field<object>(randomObject.Property)
           equals secondRow.Field<object>(randomObject.Property)
           select new { firstRow, secondRow };
SQL Query:
SELECT *
FROM firstTable fT
INNER JOIN secondTable sT ON fT.Property = sT.Property
SQL Server might use a Nested Loops Join if it knows there are a small number of rows in each table, a Merge Join if the join inputs are already sorted (for example, by indexes on the join columns), and a Hash Join if it knows there are a lot of rows in either table and neither input is sorted.
Does LINQ choose an algorithm for its joins, or does it always use the same one?
The methods on System.Linq.Enumerable are performed in the order they are issued. There is no query optimizer at play.
Many of the methods are lazy (deferred execution), which allows you to avoid fully enumerating the source by putting .First(), .Any(), or .Take() at the end of the query. That is the easiest optimization to be had.
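For example, a small self-contained sketch (the theAs/theBs names match the examples below; the numbers are arbitrary):

using System.Linq;

var theAs = Enumerable.Range(0, 1000).Select(i => new { prop = i % 100 });
var theBs = Enumerable.Range(0, 1000).Select(i => new { prop = i % 100 });

// Nothing executes until the query is enumerated, and Take(10) stops
// enumerating the outer sequence after ten results. (Join still hashes
// all of theBs before it yields its first result.)
var firstTen = (from a in theAs
                join b in theBs on a.prop equals b.prop
                select new { a, b })
               .Take(10)
               .ToList();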
For System.Linq.Enumerable.Join specifically, the docs state that this is a hash join.
The default equality comparer, Default, is used to hash and compare keys.
So examples:
//hash join (n+m) Enumerable.Join
from a in theAs
join b in theBs on a.prop equals b.prop
select new { a, b }

//nested loop join (n*m) Enumerable.SelectMany
from a in theAs
from b in theBs
where a.prop == b.prop
select new { a, b }
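To make the hash-join case concrete, here is a rough sketch of the strategy behind Enumerable.Join. This is not the real implementation (which builds an internal Lookup and defers the build until the first enumeration), just the shape of it; unlike the real Join, this sketch doesn't handle null keys.

using System;
using System.Collections.Generic;

static class HashJoinSketch
{
    public static IEnumerable<TResult> HashJoin<TOuter, TInner, TKey, TResult>(
        IEnumerable<TOuter> outer, IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKey, Func<TInner, TKey> innerKey,
        Func<TOuter, TInner, TResult> result)
    {
        // Build phase, O(m): hash every inner element by its join key.
        var buckets = new Dictionary<TKey, List<TInner>>();
        foreach (var i in inner)
        {
            if (!buckets.TryGetValue(innerKey(i), out var bucket))
                buckets[innerKey(i)] = bucket = new List<TInner>();
            bucket.Add(i);
        }
        // Probe phase, O(n): stream the outer sequence and look up each key.
        foreach (var o in outer)
            if (buckets.TryGetValue(outerKey(o), out var matches))
                foreach (var i in matches)
                    yield return result(o, i);
    }
}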
Linq to SQL does not send join hints to the server. Thus the performance of a join using Linq to SQL will be identical to the performance of the same join sent "directly" to the server (i.e. using pure ADO or SQL Server Management Studio) without any hints specified.
Linq to SQL also doesn't allow you to use join hints (as far as I know). So if you want to force a specific type of join, you'll have to do it using a stored procedure or the Execute[Command|Query] method. But unless you specify a join type by writing INNER [HASH|LOOP|MERGE] JOIN, SQL Server always picks the type of join it thinks will be most efficient - it doesn't matter where the query came from.
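For reference, the inline hint syntax looks like this in raw T-SQL, shown here against the firstTable/secondTable query from the question:

SELECT *
FROM firstTable fT
INNER HASH JOIN secondTable sT ON fT.Property = sT.Property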
Other Linq query providers - such as Entity Framework and NHibernate Linq - will do exactly the same thing as Linq to SQL. None of these have any direct knowledge of how you've indexed your database and so none of them send join hints.
Linq to Objects is a little different - it will (almost?) always perform a "hash join" in SQL Server parlance. That is because it lacks the indexes necessary to do a merge join, and hash joins are usually more efficient than nested loops, unless the number of elements is very small. But determining the number of elements in an IEnumerable<T> might require a full iteration in the first place, so in most cases it's faster just to assume the worst and use a hashing algorithm.
LINQ itself does not choose algorithms of any kind, as LINQ, strictly speaking, is simply a way of expressing a query in SQL-like syntax that can map to function calls on either IEnumerable<T> or IQueryable<T>. LINQ is entirely a language feature and does not provide functionality, only another way of expressing existing function calls.
In the case of IQueryable<T>, it's entirely up to the provider (such as LINQ to SQL) to choose the best method of producing the results.
In the case of LINQ to Objects (using IEnumerable<T>), simple enumeration is what's used (roughly equivalent to nested loops) in all cases. There is no deep inspection of (or even knowledge about) the underlying data types in order to optimize the query.
Related
I have two tables, one small and one large. When joining them, does it matter which table I put on the left and which on the right, or will the query optimizer perform the same either way?
For example:
--1
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON smalltable.column1 = largetable.column1 ;
--2
SELECT smalltable.column1,
largetable.column1
FROM smalltable
INNER JOIN largetable
ON largetable.column1 = smalltable.column1 ;
Which query will be faster, or does it not matter?
If you're talking about Microsoft SQL Server, both queries are equivalent to the query optimizer. In fact, to almost any cost-based query optimizer they'll be equivalent. You can verify this by looking at the execution plan (see http://www.simple-talk.com/sql/performance/execution-plan-basics/ for details).
The query optimizer in most decent SQL engines will handle this. Some primitive ones don't have a query optimizer at all (older MySQL and Access come to mind), and some may get overloaded by complex decisions (this one is simple).
But in general - trust the query optimizer first.
It should not matter which order you use, as your SQL Server should optimise the query execution for you. However, (if you are using Microsoft SQL Server) you could use SQL Server Profiler (found under the Tools menu of SQL Server Management Studio) to check the execution plans of both options.
If one of the tables is smaller than the other:
Place the smaller table first and then the larger table, as the optimizer will have less work to do; moreover, this will help it choose a plan that uses a hash join.
Then run the query profiler and check that a hash join is used, because this is the best and fastest option in this scenario.
If there are no indexes on the joined tables, the optimizer will select a hash join.
You can force a hash join by adding OPTION (HASH JOIN) at the end of the query, as in the sketch below.
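For example, using the table names from the question above (the OPTION clause goes at the end of the whole statement):

SELECT smalltable.column1,
       largetable.column1
FROM smalltable
INNER JOIN largetable
ON smalltable.column1 = largetable.column1
OPTION (HASH JOIN);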
From MSDN: http://blogs.msdn.com/b/irenak/archive/2006/03/24/559855.aspx
The column name that joins the table is called a hash key. In the example above, it’ll be au_id. SQL Server examines the two tables being joined, chooses the smaller table (so called build input), and builds a hash table applying a hash algorithm to the values of a hash key. Each row is inserted into a hash bucket depending on the hash value computed for the hash key. If build input is done completely in-memory, the hash join is called an “in-memory hash join”. If SQL Server doesn’t have enough memory to hold the entire build input, the process will be done in chunks, and is called “grace hash join”.
Before running both queries, select 'Include Actual Execution Plan' from the menu and then run the queries. SQL Server will show the execution plan, which is the best tool for creating optimized queries. See more about execution plans here.
The order of the join columns does matter. See this post for more detail. Also there has been no discussion of indexing in this thread. It is the combination of optimal join table order AND useful indexing that results in the fastest executing queries.
Just wondering if anyone has any tricks (or tools) they use to visualize joins. You know, you write the perfect query, hit run, and after it's been running for 20 minutes, you realize you've probably created a cartesian join.
I sometimes have difficulty visualizing what's going to happen when I add another join statement and wondered if folks have different techniques they use when trying to put together lots of joins.
Always keep the end in mind.
Ascertain which columns you need.
Figure out the minimum number of tables needed to produce them.
Write your FROM clause with the table that supplies the most columns, e.g. FROM Teams T.
Add each join one by one on a new line, checking at each step whether you need an OUTER, INNER, LEFT, or RIGHT JOIN.
This usually works for me. Keep in mind that it is a Structured Query Language: always break your query into logical lines and it's much easier.
Every join combines two resultsets into one. Each may be from a single database table or a temporary resultset which is the result of previous join(s) or of a subquery.
Always know the order in which the joins are processed and, for each join, the nature of the two intermediate result sets you are joining together. Know what logical entity each row in a result set represents, and which attributes in that result set uniquely identify that entity. If your join is intended to match one row to one row, those key attributes are the ones you need to use (in the join conditions) to implement the join. If your join is intended to generate some kind of Cartesian product, then understanding the above is critical to understanding how the join conditions (whatever they are) will affect the cardinality of the new joined result set.
Try to be consistent in the direction of your outer joins. I try to always use LEFT JOIN when I need an outer join, as I "think" of each join as joining the new table (on the right) to whatever I have already joined together (on the left of the LEFT JOIN statement)...
Run an explain plan.
These plans are always hierarchical trees (to do this, first I must do that). Many tools exist to render these plans as graphical trees, some built into SQL browsers (e.g. Oracle SQL Developer, or whatever SQL Server's GUI client is called). If you don't have such a tool, most textual plan output includes a "depth" column, which you can use to indent the lines.
What you want to look for is the cost of each row. (Note that for Oracle, though, higher costs can mean less time, if it allows Oracle to do a hash join rather than nested loops, and if the final result set has high cardinality (many, many rows).)
I have never found a better tool than thinking it through and using my own mind.
If the query is so complicated that you cannot do that, you may want to use either CTE's, views, or some other carefully organized subqueries to break it into logical pieces so you can easily understand and visualize each piece even if you cannot manage the whole.
Also, if your concern is efficiency, SQL Server Management Studio 2005 or later lets you get estimated query execution plans without actually executing the query. This can give you very good ideas of where problems lie, if you are using MS SQL Server.
I'm working with a medical-record system that stores data in a construct that resembles a spreadsheet--date/time in column headers, measurements (e.g. physician name, Rh, blood type) in first column of each row, and a value in the intersecting cell. Reports that are based on this construct often require 10 or more of these measures to be displayed.
For reporting purposes, the dataset needs to have one row for each patient, the date/time the measurement was taken, and a column for each measurement. In essence, one needs to pivot the construct by 90 degrees.
At one point, I actually used SQL Server's PIVOT functionality to do just that. For a variety of reasons, it became apparent that this approach wouldn't work. I decided that I would use an inline view (IV) to massage the data into the desired format. The simplified query resembles:
SELECT patient_id,
datetime,
m1.value AS physician_name,
m2.value AS blood_type,
m3.value AS rh
FROM patient_table
INNER JOIN ( complex query here
WHERE measure_id=1) m1...
INNER JOIN (complex query here
WHERE measure_id=2) m2...
LEFT OUTER JOIN (complex query here
WHERE measure_id=3) m3...
As you can see, in some cases these IVs are used to restrict the resulting dataset (INNER JOIN), in other cases they do not restrict the dataset (LEFT OUTER JOIN). However, the 'complex query' part is essentially the same for each of these measure, except for the difference in measure_id. While this approach works, it leads to fairly large SQL statements, limits reuse, and exposes the query to errors.
My thought was to replace the 'complex query' and WHERE clause with an inline table-valued UDF. This would simplify the queries quite a bit, reduce errors, and increase code reuse. The only question on my mind is performance. Will the UDF approach lead to a significant decrease in performance? Might it improve matters?
A correctly defined TVF will not introduce any problems. You'll find many claims on the internet blasting TVFs for performance problems as compared to views or temp tables and variables. What is usually not understood is that a TVF behaves differently from a view. A view definition is placed into the original query, and the optimizer will then rearrange the query tree as it sees fit (unless the NOEXPAND hint is used on indexed views). A TVF has different semantics, and sometimes, especially when updating data, this results in the TVF output being spooled for Halloween protection. It helps to mark the function WITH SCHEMABINDING; see Improving query plans with the SCHEMABINDING option on T-SQL UDFs.
It is also important to understand the concepts of deterministic and precise functions. Although they apply mostly to scalar-valued functions, TVFs can also be affected. See User-Defined Function Design Guidelines.
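As a minimal sketch, assuming a hypothetical dbo.measurements table modeled on the question's measure/value construct, an inline TVF with SCHEMABINDING might look like this:

CREATE FUNCTION dbo.MeasureValues (@measure_id int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
    -- SCHEMABINDING requires two-part names for referenced objects.
    SELECT patient_id, datetime, value
    FROM dbo.measurements
    WHERE measure_id = @measure_id
);

Each 'complex query here' inline view in the original statement would then collapse to a call like INNER JOIN dbo.MeasureValues(1) m1 ON ...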
Since you need a SQL string and may not have the ability to add a view or UDF to the system, you may want to use a WITH ... AS common table expression to confine the complex query to one place (at least for this statement):
WITH complex(patientid, datetime, measure_id, value) AS
(Select... Complex Query)
SELECT patient_id
, datetime
, m1.value AS physician_name
, m2.value AS blood_type
, m3.value AS rh
FROM patient_table
INNER JOIN (Select ,,,, From complex WHERE measure_id=1) m1...
INNER JOIN (Select ,,,, From complex WHERE measure_id=2) m2...
LEFT OUTER JOIN (Select ,,,, From complex WHERE measure_id=3) m3...
You also have a third option: a traditional VIEW (assuming that you have a key to join on). In theory, there shouldn't be a performance difference between the three options, because SQL Server should evaluate and optimize the plans accordingly. The reality is that sometimes that doesn't happen as well as we'd like.
The benefit of a traditional view is that you could make it an indexed view, and give SQL Server another performance aid; however, you'll just have to test and see.
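A rough sketch of that option, again using a hypothetical dbo.measurements table (indexed views carry a long list of restrictions, SCHEMABINDING among them, so treat this only as a starting point):

CREATE VIEW dbo.PatientMeasures
WITH SCHEMABINDING
AS
SELECT patient_id, datetime, measure_id, value
FROM dbo.measurements;
GO
-- A unique clustered index is what actually materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_PatientMeasures
ON dbo.PatientMeasures (patient_id, datetime, measure_id);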
Sql Server 2005 answer:
You can reduce the inline views by using temp tables or table variables. The performance cost is the insert into the temp structure that you pay on each execution of the query, but if the result sets are small enough, they can help. You can use primary keys on table variables, and primary keys/indexes on temp tables. Contrary to common belief, I have found a couple of articles indicating that both temp tables and table variables are stored in tempdb.
We have found UDFs to be less performant when you have multi-layer UDFs in complex queries, but they do maintain usability.
Be sure to create the functions correctly for the various conditions specified: those that will be used for inner joins, and those that will be used for left joins.
So, in general: we do use UDFs, but when we find that performance degrades, we move the UDF selections into temp/var tables and join on those.
Create functionality for ease of use/maintenance, and apply performance enhancements where and when required.
EDIT:
If you are required to run this for Crystal and you plan to use stored procedures: yes, you can execute SQL statements inside the SP to fill temp/var tables.
Let me know if you are going to use SPs. SQL Server will then also cache the SP plans for the given parameters as required.
Also, from previous experience with Crystal, things to avoid: grouping in Crystal that could be done in the SP, page numbers if not required, and function calls that could be handled on the server.
My question is not how to use INNER JOIN in SQL; I know how it matches rows between table A and table B.
I'd like to ask about the internal workings of INNER JOIN. What algorithms does it involve? What happens internally when joining multiple tables?
There are different algorithms, depending on the DB server, the indexes, the data order (clustered PK), whether calculated values are joined, etc.
Have a look at the query plan, which most SQL systems can produce for a query; it should give you an idea of what is going on.
In MS Sql, different join algorithms will be used in different situations depending on the tables (their size, what sort of indexes are available, etc). I imagine other DB engines also use a variety of algorithms.
The main types of join used by Ms Sql are:
- Nested loops joins
- Merge joins
- Hash joins
You can read more about them on this page: MSDN - Advanced Query Tuning Concepts
If you get SQL Server to display the 'execution plan' for your queries, you will be able to see what type of join is being used in different situations.
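For intuition, here is a rough sketch of the merge-join idea in C# (not SQL Server's actual implementation; for simplicity it assumes both inputs are already sorted on the join key and that keys are unique within each input):

using System;
using System.Collections.Generic;

static class MergeJoinSketch
{
    public static IEnumerable<(TLeft, TRight)> MergeJoin<TLeft, TRight, TKey>(
        IEnumerable<TLeft> left, IEnumerable<TRight> right,
        Func<TLeft, TKey> leftKey, Func<TRight, TKey> rightKey)
        where TKey : IComparable<TKey>
    {
        using (var l = left.GetEnumerator())
        using (var r = right.GetEnumerator())
        {
            bool hasLeft = l.MoveNext(), hasRight = r.MoveNext();
            while (hasLeft && hasRight)
            {
                int cmp = leftKey(l.Current).CompareTo(rightKey(r.Current));
                if (cmp == 0)
                {
                    yield return (l.Current, r.Current); // keys match: emit a joined row
                    hasLeft = l.MoveNext();
                    hasRight = r.MoveNext();
                }
                else if (cmp < 0)
                    hasLeft = l.MoveNext();  // left key is smaller: advance the left side
                else
                    hasRight = r.MoveNext(); // right key is smaller: advance the right side
            }
        }
    }
}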
It depends on what database you're using, what you're joining (large/small, in sequence/random, indexed/non-indexed etc).
For example, SQL Server has several different join algorithms; loop joins, merge joins, hash joins. Which one is used is determined by the optimizer when it is working out an execution plan. Sometimes it makes a misjudgement and you can then force a specific join algorithm by using join hints.
You may find the following MSDN pages interesting:
http://msdn.microsoft.com/en-us/library/ms191318.aspx (loop)
http://msdn.microsoft.com/en-us/library/ms189313.aspx (hash)
http://msdn.microsoft.com/en-us/library/ms190967.aspx (merge)
http://msdn.microsoft.com/en-us/library/ms173815.aspx (hints)
In this case you should look at how the data is stored in a B-tree; after that, I think you will understand the JOIN algorithms.
It's all based on set theory, which has been around a while.
Try not to link too many tables at any one time; it seems to exhaust database resources with all the scanning. Indexes help with performance; look at some SQL sites and search on optimising SQL queries to get some insight. SQL Server Management Studio has a built-in execution plan utility that's often interesting, especially for large, complex queries.
The optimizer will (or should) choose the fastest join algorithm.
However there are two different kinds of determining what is fast:
You measure the time that it takes to return all the joined rows.
You measure the time that it takes to return the first joined rows.
If you want to return all the rows as fast as possible, the optimizer will often choose a hash join or a merge join. If you want to return the first few rows as fast as possible, the optimizer will choose a nested loops join.
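In SQL Server you can state that preference explicitly with the FAST query hint; a sketch against hypothetical orders/order_details tables:

-- Ask the optimizer to favor returning the first row quickly,
-- which typically biases it toward a nested loops plan.
SELECT o.id, d.amount
FROM orders o
INNER JOIN order_details d ON d.order_id = o.id
OPTION (FAST 1);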
Conceptually, it creates a Cartesian product of the two tables and then selects the matching rows out of it (engines don't literally materialize that product). Read Korth's database book (Database System Concepts) for details.
I have a query that looks like this:
SELECT *
FROM employees e
LEFT JOIN
(
SELECT *
FROM timereports
WHERE date = '2009-05-04'
) t
ON e.id = t.employee_id
As you can see, the second table of my LEFT JOIN is generated by a subquery.
Does the db evaluate this subquery only once, or multiple times?
This depends on the RDBMS.
In most of them, a HASH OUTER JOIN will be employed, in which case the subquery will be evaluated once.
MySQL, on the other hand, isn't capable of performing HASH JOINs, which is why it will most probably push the predicate down into the subquery and issue this query:
SELECT *
FROM timereports t
WHERE t.employee_id = e.id
AND date = '2009-05-04'
in a nested loop. If you have an index on timereports (employee_id, date), this will also be efficient.
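That index could be created like so (the index name is made up; the table and columns come from the question):

CREATE INDEX ix_timereports_employee_date
ON timereports (employee_id, date);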
If you are using SQL Server, you can take a look at the execution plan of the query. The SQL Server query optimizer will optimize the query so that it takes the least time in execution. The best plan will depend on conditions such as indexing and the like.
You have to ask the database to show the plan. The algorithm is chosen dynamically (at query time) based on many factors. Some databases use statistics of the key distribution to decide which algorithm to use. Other databases have relatively fixed rules.
Further, each database has a menu of different algorithms. The database could use a sort-merge algorithm, or nested loops. In this case, there may be a query flattening strategy.
You need to use your database's own "Explain Plan" feature to look at the query execution plan.
You also need to know if your database uses hints (usually comments embedded in the SQL) to pick an algorithm.
You also need to know whether your database uses statistics (this is sometimes called a "cost-based query optimizer") to pick an algorithm.
Once you know all that, you'll know how your query is executed and whether an inner query is evaluated multiple times, flattened into the parent query, or evaluated once to create a temporary result that's used by the parent query.
What do you mean by evaluated?
The database has a couple of different options how to perform a join, the two most common ones being
Nested loops, in which case each row in one table will be looped through and the corresponding row in the other table will be looked up, and
Hash join, which means that both tables will be scanned once and the results are then merged using some hash algorithm.
Which of those two options is chosen depends on the database, the size of the tables, and the available indexes (and perhaps other things as well).
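For intuition, the nested-loops option sketched in C# (a simplification of the idea, not any engine's actual code):

using System;
using System.Collections.Generic;

static class NestedLoopSketch
{
    public static IEnumerable<(TLeft, TRight)> NestedLoopJoin<TLeft, TRight>(
        IEnumerable<TLeft> left, IEnumerable<TRight> right,
        Func<TLeft, TRight, bool> matches)
    {
        foreach (var l in left)          // scan the left input once
            foreach (var r in right)     // rescan the right input for every left row
                if (matches(l, r))       // O(n*m) predicate evaluations overall
                    yield return (l, r);
    }
}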