I am trying to create a stored procedure that calculates the difference between last week's version of a large table and this week's version (the current data).
Both LEFT JOIN and EXCEPT will eventually give the same results. However, I would like to know if there is a preferred approach in terms of performance.
LEFT JOIN and EXCEPT do not produce the same results.
EXCEPT is a set operator that eliminates duplicates. LEFT JOIN is a type of join that can actually produce duplicates. It is not unusual in SQL that two different things produce the same result set for a given set of input data.
I would suggest that you use the one that best fits your use-case. If both work, test which one is faster and use that.
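For a concrete picture, here is a minimal sketch with invented snapshot tables (ThisWeek and LastWeek, sharing the same columns and an Id key) showing that the two approaches answer slightly different questions:

-- Set-based diff: rows present this week but not last week, compared across
-- all listed columns; EXCEPT also removes duplicate rows from the output.
SELECT Id, Col1, Col2
FROM ThisWeek
EXCEPT
SELECT Id, Col1, Col2
FROM LastWeek;

-- Join-based diff: rows whose key has no match last week. Only the key is
-- compared, and duplicates survive if Id is not unique in ThisWeek.
SELECT t.Id, t.Col1, t.Col2
FROM ThisWeek t
LEFT JOIN LastWeek l ON l.Id = t.Id
WHERE l.Id IS NULL;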
My boss wants me to do a join on three tables, let's call them tableA, tableB, tableC, which have respectively 74M, 3M and 75M rows.
In case it's useful, the query looks like this:
SELECT A.*,
C."needed_field"
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
ON B."field_join_BC" = C."field_join_BC"
When trying the query on Dataiku Data Science Studio + Vertica, it seems to create temporary data to produce the output, which fills up the 1 TB of space on the server, bloating it.
My boss doesn't know much about SQL, so he doesn't understand that, in the worst-case scenario, it could produce a table with 74M * 3M * 75M ≈ 1.7*10^22 rows, which is possibly the problem here (and I'm brand new and don't know the data yet, so I can't tell whether the query is likely to produce that many rows or not).
Therefore I would like to know if there is a way of knowing beforehand how many rows will be produced. For instance, if I did a COUNT() such as this:
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
ON B."field_join_BC" = C."field_join_BC"
Does the underlying engine produce the whole dataset and then count it? (Which would mean I can't count it beforehand, at least not that way.)
Or is it possible that a COUNT() gives me a result without building the dataset, because it works it out some other way?
(NB: I am currently testing it, but the count has been running for 35 min now.)
Vertica is a columnar database. Any query you do only needs to look at the columns required to resolve output, joins, predicates, etc.
Vertica also is able to query against encoded data in many cases, avoiding full materialization until it is actually needed.
Counts like that can be very fast in Vertica. You don't really need to jump through hoops; Vertica will only include columns that are actually used. The optimizer won't try to reconstitute the entire row, only the columns it needs.
What's probably happening here is that you have hash joins with rebroadcasting. If your underlying projections do not line up, your sorts are different, and you are joining multiple large tables together, just the join itself can be expensive, because Vertica has to load everything into hash tables and do a lot of network rebroadcasting of the data to get the joins to happen on the initiator node.
I would consider running DBD (Vertica's Database Designer) using these queries as input, especially if these are common query patterns. If you haven't run DBD at all yet and are not using custom projections, then your default projections will likely not perform well and cause the situation I mention above.
You can do an explain to see what's going on.
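For example (a sketch against the tables from the question; the exact plan output will depend on your projections), prefixing the statement with EXPLAIN returns the plan without executing it, and you can look for HASH JOIN, BROADCAST, or RESEGMENT steps:

EXPLAIN
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN "tableB" B ON A."field_join_AB" = B."field_join_AB"
INNER JOIN "tableC" C ON B."field_join_BC" = C."field_join_BC";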
Just wondering if anyone has any tricks (or tools) they use to visualize joins. You know, you write the perfect query, hit run, and after it's been running for 20 minutes, you realize you've probably created a cartesian join.
I sometimes have difficulty visualizing what's going to happen when I add another join statement and wondered if folks have different techniques they use when trying to put together lots of joins.
Always keep the end in mind.
Ascertain which are the columns you need
Try to figure out the minimum number of tables which will be needed to do it.
Write your FROM part with the table which will give the maximum number of columns, e.g. FROM Teams T.
Add each join one by one on a new line. Decide whether you need an INNER, LEFT, RIGHT, or FULL OUTER JOIN at each step.
Usually works for me. Keep in mind that it is a Structured Query Language: always break your query into logical lines and it's much easier. A quick sketch of this step-by-step build is below.
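For instance, building on the Teams example with some invented related tables:

SELECT t.TeamName,
       p.PlayerName,
       s.Points
FROM Teams t                                      -- the table giving the most columns
INNER JOIN Players p ON p.TeamId = t.TeamId       -- each join on its own line, join type chosen deliberately
LEFT JOIN  Scores  s ON s.PlayerId = p.PlayerId;  -- optional data, so an outer join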
Every join combines two resultsets into one. Each may be from a single database table or a temporary resultset which is the result of previous join(s) or of a subquery.
Always know the order that joins are processed, and, for each join, know the nature of the two temporary result sets that you are joining together. Know what logical entity each row in that resultset represents, and what attributes in that resultset uniquely identify that entity. If your join is intended to always join one row to one row, these key attributes are the ones you need to use (in join conditions) to implement the join. If your join is intended to generate some kind of cartesian product, then it is critical to understand the above to understand how the join conditions (whatever they are) will affect the cardinality of the new joined resultset.
Try to be consistent in the use of outer join directions. I try to always use Left Joins when I need an outer join, as I "think" of each join as "joining" the new table (to the right) to whatever I have already joined together (on the left) of the Left Join statement...
Run an explain plan.
These are always hierarchical trees (to do this, first I must do that). Many tools exist to render these plans as graphical trees, some built into SQL browsers (e.g., Oracle SQL Developer, or SQL Server Management Studio). If you don't have such a tool, most plan text output includes a "depth" column, which you can use to indent the lines.
What you want to look for is the cost of each step. (Note that for Oracle, though, higher costs can mean less time, if that allows Oracle to do a hash join rather than nested loops, and if the final result set has high cardinality (many, many rows).)
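For example, in Oracle you might generate and display a plan like this (table names are made up; the indentation in the output reflects the tree depth mentioned above):

EXPLAIN PLAN FOR
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);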
I have never found a better tool than thinking it through and using my own mind.
If the query is so complicated that you cannot do that, you may want to use either CTE's, views, or some other carefully organized subqueries to break it into logical pieces so you can easily understand and visualize each piece even if you cannot manage the whole.
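For example (a sketch with invented tables), each CTE isolates one logical piece so you can sanity-check its row counts and keys on its own before looking at the final join:

WITH active_customers AS (
    SELECT customer_id, region
    FROM customers
    WHERE is_active = 1
),
recent_orders AS (
    SELECT customer_id, SUM(total) AS order_total
    FROM orders
    GROUP BY customer_id
)
SELECT ac.customer_id, ac.region, ro.order_total
FROM active_customers ac
INNER JOIN recent_orders ro ON ro.customer_id = ac.customer_id;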
Also, if your concern is efficiency, then SQL Server Management Studio 2005 or later lets you get estimated query execution plans without actually executing the query. This can give you very good ideas of where problems lie, if you are using MS SQL Server.
SELECT * FROM TableA
INNER JOIN TableB
ON TableA.name = TableB.name
SELECT * FROM TableA, TableB
where TableA.name = TableB.name
Which is the preferred way and why?
Will there be any performance difference when keywords like JOIN is used?
Thanks
The second way is the classical way of doing it, from before the join keyword existed.
Normally the query processor generates the same database operations from the two queries, so there would be no difference in performance.
Using JOIN better describes what you are doing in the query. If you have many joins, it's also better because each joined table and its condition are beside each other, instead of putting all tables in one place and all conditions in another.
Another aspect is that with the second way it's easier to write an unconstrained join by mistake, resulting in a cross join containing all combinations of rows from the two tables.
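For example (using the tables from the question), forgetting the condition in the comma form silently produces every combination of rows, whereas most engines reject the explicit form without an ON clause:

SELECT * FROM TableA, TableB;                -- oops: no WHERE clause, so this is a cross join

-- SELECT * FROM TableA INNER JOIN TableB;   -- most engines raise a syntax error: ON is required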
Use the first one, as it is:
More explicit
The standard way
As for performance - there should be no difference.
find out by using EXPLAIN SELECT …
it depends on the engine used, on the query optimizer, on the keys, on the table; on pretty much everything
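For example, in MySQL or PostgreSQL you could compare the plans of the two forms directly (using the question's tables):

EXPLAIN SELECT * FROM TableA INNER JOIN TableB ON TableA.name = TableB.name;
EXPLAIN SELECT * FROM TableA, TableB WHERE TableA.name = TableB.name;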
In some SQL engines the second form (associative or comma-style joins) is deprecated. Use the first form.
The second form is less explicit and makes beginners to SQL pause when writing code. It is much more difficult to manage in complex SQL because the join conditions in the WHERE clause have to be kept lined up with the table list in the FROM clause; once they drift apart, it becomes easy to drop or mismatch a condition, which changes the returned data set even though ordering of elements at the same level shouldn't change the results.
When joins containing multiple tables are created, it gets REALLY difficult to code, quite fast, using the second form.
EDIT: Performance: I consider ease of coding and debugging part of personal performance, so the first form is more performant for me - it just takes me less time to do/understand stuff during the development and maintenance cycles.
Most current databases will optimize both of those queries into the exact same execution plan. However, use the first syntax; it is the current standard. Learning and using this join syntax will also help when you write queries with LEFT OUTER JOIN and RIGHT OUTER JOIN, which become tricky and problematic with the older syntax that puts the joins in the WHERE clause.
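For contrast, a sketch using the question's tables. The old outer-join notation (SQL Server's *=, shown commented out) is rejected under modern compatibility levels, and its meaning becomes ambiguous once extra filters land in the WHERE clause, whereas the standard syntax keeps the outer-join condition unambiguous:

-- Old style (deprecated/removed):
-- SELECT * FROM TableA, TableB WHERE TableA.name *= TableB.name

SELECT *
FROM TableA
LEFT OUTER JOIN TableB ON TableA.name = TableB.name;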
Filtering joins solely using WHERE can be extremely inefficient in some common scenarios. For example:
SELECT * FROM people p, companies c WHERE p.companyID = c.id AND p.firstName = 'Daniel'
A database with a naive planner may execute this query quite literally, first taking the Cartesian product of the people and companies tables and then filtering down to the rows with matching companyID and id fields. While the fully unconstrained product does not exist anywhere but in memory, and then only for a moment, its calculation does take some time.
A better approach is to group the constraints with the JOINs where relevant. This is not only subjectively easier to read but also far more efficient:
SELECT * FROM people p JOIN companies c ON p.companyID = c.id
WHERE p.firstName = 'Daniel'
It's a little longer, but the database is able to look at the ON clause and use it to compute the fully-constrained JOIN directly, rather than starting with everything and then limiting down. This is faster to compute (especially with large data sets and/or many-table joins) and requires less memory.
I change every query I see which uses the "comma JOIN" syntax. In my opinion, the only purpose for its existence is conciseness. Considering the performance impact, I don't think this is a compelling reason.
I have a relatively simple query joining two tables. The "Where" criteria can be expressed either in the join criteria or as a where clause. I'm wondering which is more efficient.
Query is to find max sales for a salesman from the beginning of time until they were promoted.
Case 1
select salesman.salesmanid, max(sales.quantity)
from salesman
inner join sales on salesman.salesmanid =sales.salesmanid
and sales.salesdate < salesman.promotiondate
group by salesman.salesmanid
Case 2
select salesman.salesmanid, max(sales.quantity)
from salesman
inner join sales on salesman.salesmanid =sales.salesmanid
where sales.salesdate < salesman.promotiondate
group by salesman.salesmanid
Note that Case 1 lacks a WHERE clause altogether.
RDBMS is Sql Server 2005
EDIT
If the second piece of the join criteria (or the WHERE clause) were sales.salesdate < some fixed date - so it's not actually a criterion for joining the two tables - does that change the answer?
I wouldn't use performance as the deciding factor here - and quite honestly, I don't think there's any measurable performance difference between those two cases, really.
I would always use case #2 - why? Because in my opinion, you should only put the actual criteria that establish the JOIN between the two tables into the JOIN clause - everything else belongs in the WHERE clause.
Just a matter of keeping things clean and put things where they belong, IMO.
Obviously, there are cases with LEFT OUTER JOINs where the placement of the criteria does make a difference in terms of what results get returned - those cases would be excluded from my recommendation, of course.
Marc
You can run the execution plan estimator and SQL Profiler to see how they stack up against each other.
However, they are semantically the same underneath the hood according to this SQL Server MVP:
http://www.eggheadcafe.com/conversation.aspx?messageid=29145383&threadid=29145379
I prefer to have any hard coded criteria in the join. It makes the SQL much more readable and portable.
Readability:
You can see exactly what data you're going to get because all the table criteria are written right there in the join. In large statements, the criteria may be buried among 50 other expressions and are easily missed.
Portability:
You can just copy a chunk out of the FROM clause and paste it somewhere else. That gives the joins and any criteria you need to go with it. If you always use that criteria when joining those two tables, then putting it in the join is the most logical.
For Example:
FROM
table1 t1
JOIN table2 t2_ABC ON
t1.c1 = t2_ABC.c1 AND
t2_ABC.c2 = 'ABC'
If you need to get a second column out of table 2, you just copy that block into Notepad, search/replace "ABC", and presto, an entire new block of code is ready to paste back in.
Additional:
It's also easier to change between an inner and outer join without having to worry about any criteria that may be floating around in the WHERE clause.
I reserve the WHERE clause strictly for run-time criteria where possible.
As for efficiency:
If you're referring to execution speed, then as everyone else has stated, it's redundant.
If you're referring to easier debugging and reuse, then I prefer option 1.
One final thing I want to add, something I noticed:
Both ways may give the same performance, or using the criteria in the WHERE clause may be a little faster, as noted in some answers.
But I identified one difference that you can use for your logic:
With an OUTER join, putting the criteria in the ON clause does not filter out rows; the joined columns simply come back as NULL when the condition fails.
Putting the criteria in the WHERE clause filters those rows out of the entire result.
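A quick sketch of that difference using the question's tables, but with a LEFT JOIN (with an INNER JOIN the two forms return the same rows):

-- Criteria in the ON clause: every salesman is kept; sales rows that fail the
-- date test just come back as NULLs.
SELECT s.salesmanid, sa.quantity
FROM salesman s
LEFT JOIN sales sa
  ON sa.salesmanid = s.salesmanid
 AND sa.salesdate < s.promotiondate;

-- Criteria in the WHERE clause: the filter runs after the join, so salesmen
-- whose sales rows all fail the test (and are therefore NULL) disappear entirely.
SELECT s.salesmanid, sa.quantity
FROM salesman s
LEFT JOIN sales sa
  ON sa.salesmanid = s.salesmanid
WHERE sa.salesdate < s.promotiondate;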
I don't think you'll find a finite answer for this one that applies to all cases. The 2 are not always interchangeable - since for some queries (some left joins) you will come up with different results by placing the criteria in the WHERE vs the FROM line.
In your case, you should evaluate both of these queries. In SSMS, you can view the estimated and actual execution plans of both of these queries - that would be a good first step in determining which is more optimal. You could also view the time & IO for each (set statistics time on, set statistics io on) - and that will also give you information to make your decision.
In the case of the queries in your question - I'd bet that they'll both come out with the same query plan - so in this case it may not matter, but in others it could potentially produce different plans.
Try this to see the difference between the 2...
SET STATISTICS IO ON
SET STATISTICS TIME ON
select salesman.salesmanid,
max(sales.quantity)
from salesman inner join sales on salesman.salesmanid = sales.salesmanid
and sales.salesdate < salesman.promotiondate
group by salesman.salesmanid
select salesman.salesmanid,
max(sales.quantity)
from salesman inner join sales on salesman.salesmanid = sales.salesmanid
where sales.salesdate < salesman.promotiondate
group by salesman.salesmanid
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
It may seem flippant, but the answer is whichever query for which the query analyzer produces the most efficient plan.
To my mind, they seem to be equivalent, so the query analyzer may well produce identical plans, but you'd have to test.
Neither is more efficient; using the WHERE method is considered the old way to do it (http://msdn.microsoft.com/en-us/library/ms190014.aspx). You can look at the execution plan and see that they do the same thing.
Become familiar with the Estimated Execution Plan in SQL Management Studio!! Like others have said, you're at the mercy of the analyzer no matter what you do so trust its estimates. I would guess the two you provided would produce the exact same plan.
If it's an attempt to change a development culture, pick the one that gives you a better plan; for the ones that are identical, follow the culture
I've commented this on other "efficiency" posts like this one (it's both sincere and sarcastic) -- if this is where your bottlenecks reside, then high-five to you and your team.
Case 1 (criteria in the JOIN) is better for encapsulation, and increased encapsulation is usually a good thing: decreased copy/paste omissions to another query, decreased bugs if later converted to LEFT JOIN, and increased readability (related stuff together and less "noise" in WHERE clause). In this case, the WHERE clause only captures principal table criteria or criteria that spans multiple tables.
I have a query containing three inner join statements in the WHERE clause. The query takes roughly 2 minutes to execute. If I simply change the order of two of the inner joins, the run time drops to 40 seconds.
How can doing nothing but changing the order of the inner joins have such a drastic impact of query performance? I would have thought the optimizer would figure all this out.
SQL is declarative, that is, the JOIN order should not matter.
However it can in practice, say, if it's a complex query when the optimiser does not explore all options (which in theory could take months).
Another option is that it's a very different query if you reorder and you get different results, but this is usually with OUTER JOINs.
And it could also be the way the ON clause is specified: it has to change if you reorder the FROM clause. Unless you are using the older (and bad) JOIN-in-the-WHERE-clause.
Finally, if it's a concern, you could use parentheses to change the evaluation order and make your intentions clear, say, filtering a large table first to generate a derived table.
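For example (a sketch with made-up names), a derived table makes the "filter the big table first" intention explicit:

SELECT f.order_id, c.customer_name
FROM (
    SELECT order_id, customer_id
    FROM orders
    WHERE order_date >= '2019-01-01'    -- hypothetical filter that discards most rows
) AS f
INNER JOIN customers c ON c.customer_id = f.customer_id;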
Because by changing the order of the joins, SQL Server is coming up with a different execution plan for your query (chances are it's changing the way it's filtering the tables based on your joins).
In this case, I'm guessing you have several large tables...one of which performs the majority of the filtering.
In one query, your joins are joining several of the large tables together and then filtering the records at the end.
In the other, you are filtering the first table down to a much smaller sub-set of the data...and then joining the rest of the tables in. Since that initial table got filtered before joining the other large recordsets, performance is much better.
You could always verify by running the query with the 'Show query plan' option enabled and seeing what the query plan is for the two different join orders.
I would have thought it was smart enough to do that as well, but clearly it's still performing the joins in the order you explicitly list them... As to why that affects the performance, if the first join produces an intermediate result set of only 100 records in one ordering scheme, then the second join will be from that 100-record set to the third table.
If putting the other join first produces a first intermediate result set of one million records, then the second join will be from a one million row result set to the third table...
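If you need to pin the faster ordering down, a sketch with invented table names (whether this actually helps depends entirely on your data): SQL Server's FORCE ORDER hint makes the optimizer join the tables in the order written in the FROM clause, which also makes it easy to compare the two plans side by side.

SELECT b2.id
FROM filtered_small_table f
INNER JOIN big_table_1 b1 ON b1.key1 = f.key1
INNER JOIN big_table_2 b2 ON b2.key2 = b1.key2
OPTION (FORCE ORDER);   -- joins in the written order: f, then b1, then b2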