I am not sure whether the title of this question is correct or not.
I have a table, for example users, which contains different types of users, like user type 10, 20, 30, etc.
In a query I need to join the user table, but I only want users of type 20. So which of the queries below performs better?
SELECT fields
FROM consumer c
INNER JOIN user u ON u.userid = c.userid
WHERE u.type = 20
Or, written another way:
SELECT fields
FROM consumer c
INNER JOIN (SELECT user_fields FROM user WHERE type = 20) u ON u.userid = c.userid
Please advise.
Let's start with this query:
SELECT . . .
FROM consumer c INNER JOIN
user u
ON u.userid = c.userid
WHERE u.type = 20;
Assuming that the type is relatively rare, you want indexes on the tables. The best indexes are probably user(type, userid) and consumer(userid). It is possible that an index on user(userid, type) would be better (and it would be unnecessary if userid is a clustered primary key).
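As a rough sketch, assuming the table and column names from the question, those index definitions would look something like this (adjust names to your actual schema):
-- Supports the filter on type and the join on userid in one seek
CREATE INDEX IX_user_type_userid ON [user] (type, userid);
-- Supports the join from the consumer side (unnecessary if userid
-- is already the clustered primary key of consumer)
CREATE INDEX IX_consumer_userid ON consumer (userid);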
The second query . . . well, from the SQL Server perspective it is probably the same. Why? SQL Server has a good optimizer. You can check the execution plans if you like. Because of the optimizer:
There is no benefit to having a subquery select only a handful of columns. For better or worse, SQL Server pushes that information down to the node that reads the data.
The where clause is not necessarily going to be evaluated before the join. SQL Server is smart enough to re-arrange operations.
Not all optimizers are this smart. In a database such as MySQL, MS Access, or SQLite, I'm pretty sure the first version is much better than the second.
Run the two queries in SSMS as a single batch with "Include Actual Execution Plan" turned on, and you will see the execution plan of each query along with its query cost (relative to the batch): 50%.
That means they are the same.
If they were different (because one was optimized better than the other), you would see a different ratio.
I simulated your query and found the query cost = 50% for each ===> i.e. they are the same.
It really depends on a various number of factors:
is "userid" on both table indexed?
is "type" on table "users" indexed?
how many rows in each table?
Usually a subquery produces slower performance, but depending on the conditions listed above and on how your SQL Server installation is configured, both queries can be resolved (and so executed) the same way by the query optimizer.
SQL Server takes your query and tries to optimize it, so it can happen that query B is "transformed" into query A.
Look at both queries in the Query Analyzer tool and see whether their plans differ.
Generally speaking, inner queries are better avoided, and you'll probably get the best performance with query A. (To answer the first two bullets above, see the sketch below.)
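A quick way to check the existing indexes in SQL Server (a sketch, assuming the table names from the question; sp_helpindex lists each index and the columns it covers):
EXEC sp_helpindex 'user';
EXEC sp_helpindex 'consumer';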
Both your options are valid. Personally, I would code it like this:
SELECT fields
FROM consumer c
INNER JOIN user u ON u.userid = c.userid and u.type = 20
Run both queries in SQL Server Management Studio as one batch and tick 'Include Actual Execution Plan'. This will let you compare the performance of your queries against each other. The result will depend on your particular database.
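If you also want hard numbers alongside the graphical plans, you can (as a sketch) turn on I/O and timing statistics for the batch before running both versions:
SET STATISTICS IO ON;   -- reports logical reads per table
SET STATISTICS TIME ON; -- reports CPU and elapsed time per statement
-- ... run both queries here ...
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;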
Related
I've just been debugging a slow SQL query.
It's a join between 2 tables, with a WHERE clause that filters on a property of one table OR the other.
If I re-write it as a UNION then it's suddenly 2 orders of magnitude faster, even though those 2 queries produce identical outputs:
DECLARE @UserId UNIQUEIDENTIFIER = '0019813D-4379-400D-9423-56E1B98002CB'
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId) OR Bookings.MixedDealBroker in (@UserId))
--Execution time: ~4000ms
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId))
UNION
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (@UserId))
--Execution time: ~70ms
This seems rather surprising to me! I would have expected the SQL compiler to be entirely capable of identifying that the 2nd form was equivalent and would have used that compilation approach if it were available.
Some context notes:
I've checked and IN (@UserId) vs = @UserId makes no difference.
Nor does JOIN vs LEFT JOIN.
Those tables each have 100,000s of records, and the filter cuts that down to ~100 rows.
In the slow version it seems to be reading every row of both tables.
So:
Does anyone have any ideas as to how this comes about?
What (if anything) can I do to fix the performance without just re-writing the query as a series of UNIONs (not viable for a variety of reasons)?
Execution Plans: (plan screenshots not reproduced here)
This is a common limitation of SQL engines, not just SQL Server but other database systems as well. The OR complicates the predicate enough that the execution plan selected isn't always ideal. This probably relates to the fact that (for the most part) only one index can be seeked per instance of a table object at a time, and in your specific case the OR predicate spans two different tables, along with other factors in how SQL engines are designed.
By using a UNION clause, you now have two instances of the Bookings table referenced, each of which can be seeked separately in the most efficient way possible. That allows the SQL engine to pick a better execution plan to serve your query.
This is pretty much just one of those things that is the way it is, and you need to remember the UNION workaround for future encounters with this kind of performance issue.
Also, in response to your comment:
I don't understand how the difference can affect the EP, given that the 2 different "phrasings" of the query are identical?
A new execution plan is generated every time one doesn't exist in the plan cache for a given query, essentially. The way the engine determines whether a plan for a query is already cached is based on an exact hash of that query statement, so even an extra space character at the end of the query can result in a new plan being generated. Theoretically that plan can be different. So a differently written query (despite being logically the same) can certainly result in a different execution plan.
There are other reasons a plan can change on re-generation too, such as different data and statistics of that data, in the tables referenced in the query between executions. But these reasons don't really apply to your question above.
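If you want to confirm that the two phrasings really did get separate entries in the plan cache, a rough sketch against SQL Server's plan-cache DMVs (filtering on a table name from your query) is:
-- Each distinct query text hashes to its own cached plan
SELECT cp.usecounts,
       cp.cacheobjtype,
       st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%BookingPricings%';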
As already stated, the OR condition prevents the database engine from efficiently using the indexes in a single query. Because the OR condition spans tables, I doubt that the Tuning Advisor will come up with anything useful.
If you have a case where the query you have posted is part of a larger query, or the results are complex and you do not want to repeat code, you can wrap your initial query in a Common Table Expression (CTE) or a subquery and then feed the combined results into the remainder of your query. Sometimes just selecting one or more PKs in your initial query will be sufficient.
Something like:
SELECT <complex select list>
FROM (
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId))
UNION
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (@UserId))
) PRE
JOIN Bookings B ON B.ID = PRE.BookingsID
JOIN BookingPricings BP ON BP.ID = PRE.BookingPricingsID
<more joins>
WHERE <more conditions>
Having just the IDs in your initial select makes the UNION more efficient. The UNION can also be changed to a yet more-efficient UNION ALL with careful use of additional conditions, such as AND Bookings.MixedDealBroker <> @UserId (plus an IS NULL guard) in the first part, to avoid overlapping results; a sketch follows below.
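A sketch of that UNION ALL variant, using the tables and columns from your query (the extra predicate keeps a row that matches both conditions out of the first branch, so each row is returned only once):
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE BookingPricings.[Owner] = @UserId
  -- exclude rows the second branch will return
  AND (Bookings.MixedDealBroker <> @UserId OR Bookings.MixedDealBroker IS NULL)
UNION ALL -- no duplicate-elimination step, unlike UNION
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE Bookings.MixedDealBroker = @UserId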
Which query is better from a performance perspective?
A
select users.email
from (select * from purchases where id = ***) A
left join users on A.user_id = users.id;
B
select users.email
from purchases A
left join users on A.user_id = users.id where A.id = ***;
Basically, I'm thinking A is better (unless SQL Server optimizes the query significantly).
Please explain which query is better and why. Thanks.
Almost any database optimizer is going to ignore the subquery. Why? SQL queries describe the result set being produced, not the steps for processing it. The SQL optimizer produces the underlying code that is run.
And most optimizers are smart enough to ignore subqueries and to choose optimal indexes, partitions, and algorithms regardless of them. One exception is that some versions of MySQL/MariaDB tend to materialize subqueries -- and that is a performance killer. I think even that has improved in more recent versions.
What is the best way to write a query with joins?
First join the tables and then add the where conditions, or
First apply the where conditions in a subquery and then join?
For example, which of the following queries has better performance?
select * from person persons
inner join role roles on roles.person_id_fk = persons.id_pk
where roles.deleted is null
or
select * from person persons
inner join (select * from role roles where roles.deleted is null) as roles
on roles.person_id_fk = persons.id_pk
In a decent database, there should be no difference between the two queries. Remember, SQL is a descriptive language, not a procedural language. That is, a SQL SELECT statement describes the result set that should be returned. It does not specify the steps for creating it.
Your two queries are semantically equivalent and the SQL optimizer should be able to recognize that.
Of course, SQL optimizers are not omniscient. So, sometimes how you write a query does affect the execution plan. However, the queries that you are describing are turned into execution plans that have no concept of "subquery", so it is reasonable that they would produce the same execution plan.
Note: Some databases -- such as MySQL and MS Access -- do not have very good optimizers and such queries do produce different execution plans. Alas.
As I build bigger, more advanced web applications, I'm finding myself writing extremely long and complex queries. I tend to write queries within queries a lot because I feel making one call to the database from PHP is better than making several and correlating the data.
However, anyone who knows anything about SQL knows about JOINs. Personally, I've used a JOIN or two before, but quickly stopped when I discovered using subqueries because it felt easier and quicker for me to write and maintain.
Commonly, I'll do subqueries that may contain one or more subqueries from related tables.
Consider this example:
SELECT
(SELECT username FROM users WHERE records.user_id = user_id) AS username,
(SELECT last_name||', '||first_name FROM users WHERE records.user_id = user_id) AS name,
in_timestamp,
out_timestamp
FROM records
ORDER BY in_timestamp
Rarely, I'll do subqueries after the WHERE clause.
Consider this example:
SELECT
user_id,
(SELECT name FROM organizations WHERE (SELECT organization FROM locations WHERE records.location = location_id) = organization_id) AS organization_name
FROM records
ORDER BY in_timestamp
In these two cases, would I see any sort of improvement if I decided to rewrite the queries using a JOIN?
As more of a blanket question, what are the advantages/disadvantages of using subqueries or a JOIN? Is one way more correct or accepted than the other?
In simple cases, the query optimiser should be able to produce identical plans for a simple join versus a simple sub-select.
But in general (and where appropriate), you should favour joins over sub-selects.
Plus, you should avoid correlated subqueries (queries in which the inner expression refers to the outer), as they are effectively a for loop within a for loop. In most cases a correlated subquery can be rewritten as a join.
JOINs are preferable to separate [sub]queries.
If the subselect (AKA subquery) is not correlated with the outer query, it's very likely the optimizer will scan the table(s) in the subselect only once, because the value isn't likely to change. When you have correlation, as in the example provided, single-pass optimization becomes very unlikely. In the past, correlated subqueries have been believed to execute RBAR -- Row By Agonizing Row. With a JOIN, the same result can be achieved while ensuring a single pass over the table.
This is a proper re-write of the query provided:
SELECT u.username,
u.last_name||', '|| u.first_name AS name,
r.in_timestamp,
r.out_timestamp
FROM RECORDS r
LEFT JOIN USERS u ON u.user_id = r.user_id
ORDER BY r.in_timestamp
...because the subselect can return NULL if the user_id doesn't exist in the USERS table. Otherwise, you could use an INNER JOIN:
SELECT u.username,
u.last_name ||', '|| u.first_name AS name,
r.in_timestamp,
r.out_timestamp
FROM RECORDS r
JOIN USERS u ON u.user_id = r.user_id
ORDER BY r.in_timestamp
Derived tables/inline views are also possible using JOIN syntax.
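For example, the same lookup could be written as a derived table (a sketch; the columns follow the rewrite above):
SELECT u.username,
       u.name,
       r.in_timestamp,
       r.out_timestamp
FROM RECORDS r
LEFT JOIN (
    -- the inline view: evaluated as a set, then joined like an ordinary table
    SELECT user_id,
           username,
           last_name || ', ' || first_name AS name
    FROM USERS
) u ON u.user_id = r.user_id
ORDER BY r.in_timestamp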
a) I'd start by pointing out that the two are not necessarily interchangeable. Nesting as you have requires there to be 0 or 1 matching values; otherwise you will get an error. A join imposes no such requirement and may exclude the record or introduce more rows, depending on your data and the type of join.
b) In terms of performance, you will need to check the query plans, but your nested examples are unlikely to be more efficient than a table join. Typically sub-queries are executed once per row, but that very much depends on your database, unique constraints, foreign keys, NOT NULL constraints, etc. Maybe the DB can rewrite them more efficiently, but joins can use a wider variety of techniques and drive the data from different tables, because they do different things (though you may not observe any difference in your output, depending on your data).
c) Most DB-aware programmers I know would look at your nested queries and rewrite them using joins, subject to the data being suitably 'clean'.
d) Regarding "correctness" - I would favour joins backed up with proper constraints on your data where necessary (e.g. a unique user ID). You as a human may make certain assumptions but the DB engine cannot unless you tell it. The more it knows, the better job it (and you) can do.
Joins in most cases will be much faster.
Let's work through an example.
Let's use your first query:
SELECT
(SELECT username FROM users WHERE records.user_id = user_id) AS username,
(SELECT last_name||', '||first_name FROM users WHERE records.user_id = user_id) AS name,
in_timestamp,
out_timestamp
FROM records
ORDER BY in_timestamp
Now consider that we have 100 records in records and 100 records in users (assuming we don't have an index on user_id).
So if we trace your query's algorithm, it says:
For each record
Scan all 100 records in users to find out username
Scan all 100 records in users to find out last name and first name
So it's like we scanned the users table 200 times (100 records * 2 subqueries), reading 100 rows each time, for 100 * 100 * 2 = 20,000 row reads. Is that really worth it? An index on user_id would make things better, but is it still worth it?
Now consider a join (a nested-loop join would behave much like the above, but consider a hash join):
It works like this:
Build a hash map of users.
For each record
Look up the matching record in the hash map, which will certainly be much faster than looping over the table to find it.
So clearly, joins should be favored.
NOTE: An example with only 100 records may produce an identical plan, but the idea is to analyze how this can affect performance.
I've been toying around with switching from ms-access files to SQLite files for my simple database needs; for the usual reasons: smaller file size, less overhead, open source, etc.
One thing that is preventing me from making the switch is what seems to be a lack of speed in SQLite. For simple SELECT queries, SQLite seems to perform as well as, or better than MS-Access. The problem occurs with a fairly complex SELECT query with multiple INNER JOIN statements:
SELECT DISTINCT
DESCRIPTIONS.[oCode] AS OptionCode,
DESCRIPTIONS.[descShort] AS OptionDescription
FROM DESCRIPTIONS
INNER JOIN tbl_D_E ON DESCRIPTIONS.[oCode] = tbl_D_E.[D]
INNER JOIN tbl_D_F ON DESCRIPTIONS.[oCode] = tbl_D_F.[D]
INNER JOIN tbl_D_H ON DESCRIPTIONS.[oCode] = tbl_D_H.[D]
INNER JOIN tbl_D_J ON DESCRIPTIONS.[oCode] = tbl_D_J.[D]
INNER JOIN tbl_D_T ON DESCRIPTIONS.[oCode] = tbl_D_T.[D]
INNER JOIN tbl_Y_D ON DESCRIPTIONS.[oCode] = tbl_Y_D.[D]
WHERE ((tbl_D_E.[E] LIKE '%')
AND (tbl_D_H.[oType] ='STANDARD')
AND (tbl_D_J.[oType] ='STANDARD')
AND (tbl_Y_D.[Y] = '41')
AND (tbl_Y_D.[oType] ='STANDARD')
AND (DESCRIPTIONS.[oMod]='D'))
In MS-Access, this query executes in about 2.5 seconds. In SQLite, it takes a little over 8 minutes. It takes the same amount of time whether I'm running the query from VB code or from the command prompt using sqlite3.exe.
So my questions are the following:
Is SQLite just not optimized to handle multiple INNER JOIN statements?
Have I done something obviously stupid in my query (because I am new to SQLite) that makes it so slow?
And before anyone suggests a completely different technology, no I can not switch. My choices are MS-Access or SQLite. :)
UPDATE:
Assigning an INDEX to each of the columns in the SQLite database reduced the query time from over 8 minutes down to about 6 seconds. Thanks to Larry Lustig for explaining why the INDEXing was needed.
As requested, I'm reposting my previous comment as an actual answer (when I first posted the comment I was not able, for some reason, to post it as an answer):
MS Access is very aggressive about indexing columns on your behalf, whereas SQLite will require you to explicitly create the indexes you need. So, it's possible that Access has indexed either [Description] or [D] for you but that those indexes are missing in SQLite. I don't have experience with that amount of JOIN activity in SQLite. I used it in one Django project with a relatively small amount of data and did not detect any performance issues.
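As a sketch (index names are illustrative), the explicit indexes SQLite would need for the join keys look like this, and prefixing the query with EXPLAIN QUERY PLAN will show whether they are actually used:
-- One index per join key referenced in the query
CREATE INDEX idx_descriptions_ocode ON DESCRIPTIONS (oCode);
CREATE INDEX idx_tbl_d_e_d ON tbl_D_E (D);
CREATE INDEX idx_tbl_d_f_d ON tbl_D_F (D);
CREATE INDEX idx_tbl_d_h_d ON tbl_D_H (D);
CREATE INDEX idx_tbl_d_j_d ON tbl_D_J (D);
CREATE INDEX idx_tbl_d_t_d ON tbl_D_T (D);
CREATE INDEX idx_tbl_y_d_d ON tbl_Y_D (D);
-- Trimmed to one join for brevity: look for "SEARCH ... USING INDEX" (good)
-- rather than "SCAN TABLE" (bad) in the output
EXPLAIN QUERY PLAN
SELECT DISTINCT DESCRIPTIONS.[oCode], DESCRIPTIONS.[descShort]
FROM DESCRIPTIONS
INNER JOIN tbl_D_H ON DESCRIPTIONS.[oCode] = tbl_D_H.[D]
WHERE DESCRIPTIONS.[oMod] = 'D';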
Do you have issues with referential integrity? I ask because I have the impression you've got unnecessary joins, so I re-wrote your query as:
SELECT DISTINCT
t.[oCode] AS OptionCode,
t.[descShort] AS OptionDescription
FROM DESCRIPTIONS t
JOIN tbl_D_H h ON h.[D] = t.[oCode]
AND h.[oType] = 'STANDARD'
JOIN tbl_D_J j ON j.[D] = t.[oCode]
AND j.[oType] = 'STANDARD'
JOIN tbl_Y_D d ON d.[D] = t.[oCode]
AND d.[Y] = '41'
AND d.[oType] ='STANDARD'
WHERE t.[oMod] = 'D'
If DESCRIPTIONS and tbl_D_E require multiple row scans, then oCode and D should be indexed. Look at the example here to see how to create the indexes and how to tell how many row scans there are (http://www.siteconsortium.com/h/p1.php?id=mysql002).
This might fix it, though:
-- (SQLite syntax; SQLite does not support MySQL's USING BTREE clause)
CREATE INDEX ocode_index ON DESCRIPTIONS (oCode);
CREATE INDEX d_index ON tbl_D_E (D);
etc ....
Indexing correctly is one piece of the puzzle that can easily double, triple or more the speed of the query.