SQL Perfomance : Subselects in joins or direct joins? - sql

I have got a question on the performance of the below tables\
Table A -- Has only 5 customer ID's(5 Rows 1 column)
Table B -- Is the master base for all Customer's and their information.(1 Million Rows and 500 Columns)
Query 1:-
Select A.*,
B.Age
from A
left join B
on A.Customer_id = B.Customer_id;
Query 2:-
Select a.*,
B.Age
from A
left join
(select Customer_id,age from B) C
on A.Customer_id = C.Customer_id;
The main question of performance here is because of the presence of 500 columns in Table B.
I feel the 2nd Query is better as SQL wont have to create a temporary table during the join containing all columns from table B.
Please let me know if this is wrong?

I feel the 2nd Query is better as SQL wont have to create a temporary table during the join containing all columns from table B.
You can tell whether Oracle does create a temporary table during the execution or not from the explain plan. You should also consider whether the Oracle kernel developers would not have got round such an obvious performance problem if it existed.
As it happens, there will be no temporary table, and there is nothing wrong with your first query. There is almost never a need to manipulate the query for performance reasons -- write queries that are the best encapsulation of the logic you require.

CREATE INDEX index_name ON table_b (customer_id)
then use
Select a.*,
B.Age
from A
left join (select Customer_id,
age
from B) C
on A.Customer_id = C.Customer_id;

500 columns is rather extensive.
Maybe you can create an index like:
CREATE INDEX index_name
ON table_b (customer_id,
age
);

sub query in select is faster than using join (no matter if direct join or sub select)
select
a.*,
(select b.age
from b
where b.customer_id = a.customer_id)
from a
note:
it behaves like outer join (return empty field in age if customer_id from b doesn't exists in a)
the sub query should return only one row from b per row from a.

Related

SQL Inner Join w/ Unique Vals

Questions similar to this one about using DISTINCT values in an INNER JOIN have been asked a few times, but I don't see my (simple) use case.
Problem Description:
I have two tables Table A and Table B. They can be joined via a variable ID. Each ID may appear on multiple rows in both Table A and Table B.
I would like to INNER JOIN Table A and Table B on the distinct values of ID which appear in Table B and select all rows of Table A with a Table A.ID which appears matching some condition in Table B.
What I want:
I want to make sure I get only one copy of each row of Table A with a Table A.ID matching a Table B.ID which satisfies [some condition].
What I would like to do:
SELECT * FROM TABLE A
INNER JOIN (
SELECT DISTINCT ID FROM TABLE B WHERE [some condition]
) ON TABLE A.ID=TABLE B.ID
Additionally:
As a further (really dumb) constraint, I can't say anything about the SQL standard in use, since I'm executing the SQL query through Stata's odbc load command on a database I have no information about beyond the variable names and the fact that "it does accept SQL queries," ( <- this is the extent of the information I have).
If you want all rows in a that match an id in b, then use exists:
select a.*
from a
where exists (select 1 from b where b.id = a.id);
Trying to use join just complicates matters, because it both filters and generates duplicates.

Compare 2 Tables and write RowID from Table A to Table B

I have Table A with RowID, Date, Vendor and Cost, Confirmed
I have table B with RowID, Date, Vendor and Cost, Confirmed
Table A list our purchases.
Table B list the statement data from the credit card.
I would like to compare Date, Vendor and Cost in Table A with the same columns in Table B. If there is a match with those three columns, then I would like to take the RowID value from Table A and write it to the matching row in Table B under the Confirmation column.
I am very new to SQL and I am not even sure this is a reasonable expectation.
What do you think?
Is this enough detail to provide your opinion?
Thank you for any help you can give.
Currently I am using an Outer Right Join to get all the rows that do NOT have a match. What I really need is the opposite.
It might help to know what database engine you are using...
My answers are going to relate to MS SQL Server, but much SQL Syntax is the same...
To answer your first question, I would write something like:
Update TableB Set Confirmation = TableA.RowID From TableA
Where TableA.Vendor = TableB.Vendor
And TableA.Cost = TableB.Cost
And TableA.Date = TableB.Date
I would use aliases, but I left them out to hopefully make it easier to understand.
To answer your second question, you can specify an INNER JOIN which is the opposite of an OUTER JOIN and as you mentioned, what you are looking for, as it will return ALL rows that Match and exclude the rest.
You can do this with this query
UPDATE
B
SET
Confirmation = A.RowID
FROM
TableA A
INNER JOIN TableB B
ON B.Vendor = A.Vendor
AND B.Cost = A.Cost
AND B.Date = A.Date
Basically we do an inner join to keep the intersection (the records that match) of the two tables. and update the records that coincided from table B with the id of table A

sqlite query does not complete - bad index, or just too much data?

I have two tables in SQLITE: from_original and from_binary. I want to LEFT OUTER JOIN to pull the records in from_original that are not present in from_binary. The problem is that the query I have written does not complete (I terminated it after about 1 hour). Can anyone help me to understand why?
I have an index defined for each of the fields in my query, but explain query plan only mentions one of the indices will be referenced. I am not sure if this is the problem, or if it is just a matter of too much data.
Here is the query I am trying to run:
select * from from_original o
left outer join from_binary b
on o.id = b.id
and o.timestamp = b.timestamp
where b.id is null
The tables each have about 4 million records:
I have an index defined on all the id and timestamp fields (see schema at end of post), but explain query plan only indicates that one of the id indices will be used.
Here is the table schema, including index definitions:
This is your query:
select o.*
from from_original o left outer join
from_binary b
on o.id = b.id and o.timestamp = b.timestamp
where b.id is null;
(Note: you do not need the columns from b because they should all be NULL.)
The best index for this query is a composite index on from_binary(id, timestamp). You do have a lot of data for SQLite, but this might finish in a finite amount of time.

SQL Join between tables with conditions

I'm thinking about which should be the best way (considering the execution time) of doing a join between 2 or more tables with some conditions. I got these three ways:
FIRST WAY:
select * from
TABLE A inner join TABLE B on A.KEY = B.KEY
where
B.PARAM=VALUE
SECOND WAY
select * from
TABLE A inner join TABLE B on A.KEY = B.KEY
and B.PARAM=VALUE
THIRD WAY
select * from
TABLE A inner join (Select * from TABLE B where B.PARAM=VALUE) J ON A.KEY=J.KEY
Consider that tables have more than 1 milion of rows.
What your opinion? Which should be the right way, if exists?
Usually putting the condition in where clause or join condition has no noticeable differences in inner joins.
If you are using outer joins ,putting the condition in the where clause improves query time because when you use condition in the where clause of
left outer joins, rows which aren't met the condition will be deleted from the result set and the result set becomes smaller.
But if you use the condition in join clause of left outer joins ,no rows deletes and result set is bigger in comparison to using condition in the where clause.
for more clarification,follow the example.
create table A
(
ano NUMBER,
aname VARCHAR2(10),
rdate DATE
)
----A data
insert into A
select 1,'Amand',to_date('20130101','yyyymmdd') from dual;
commit;
insert into A
select 2,'Alex',to_date('20130101','yyyymmdd') from dual;
commit;
insert into A
select 3,'Angel',to_date('20130201','yyyymmdd') from dual;
commit;
create table B
(
bno NUMBER,
bname VARCHAR2(10),
rdate DATE
)
insert into B
select 3,'BOB',to_date('20130201','yyyymmdd') from dual;
commit;
insert into B
select 2,'Br',to_date('20130101','yyyymmdd') from dual;
commit;
insert into B
select 1,'Bn',to_date('20130101','yyyymmdd') from dual;
commit;
first of all we have normal query which joins 2 tables with each other:
select * from a inner join b on a.ano=b.bno
the result set has 3 records.
now please run below queries:
select * from a inner join b on a.ano=b.bno and a.rdate=to_date('20130101','yyyymmdd')
select * from a inner join b on a.ano=b.bno where a.rdate=to_date('20130101','yyyymmdd')
as you see above results row counts have no differences,and According to my experience there is no noticeable performance differences for data in large volume.
please run below queries:
select * from a left outer join b on a.ano=b.bno and a.rdate=to_date('20130101','yyyymmdd')
in this case,the count of output records will be equal to table A records.
select * from a left outer join b on a.ano=b.bno where a.rdate=to_date('20130101','yyyymmdd')
in this case , records of A which didn't met the condition deleted from the result set and as I said the result set will have less records(in this case 2 records).
According to above examples we can have following conclusions:
1-in case of using inner joins,
there is no special differences between putting condition in where clause or join clause ,but please try to put tables in from clause in order to have minimum intermediate result row counts:
(http://www.dba-oracle.com/art_dbazine_oracle10g_dynamic_sampling_hint.htm)
2-In case of using outer joins,whenever you don't care of exact result row counts (don't care of missing records of table A which have no paired records in table B and fields of table B will be null for these records in the result set),put the condition in the where clause to delete a set of rows which aren't met the condition and obviously improve query time by decreasing the result row counts.
but in special cases you HAVE TO put the condition in the join part.for example if you want that your result row count will be equal to table 'A' row counts(this case is common in ETL processes) you HAVE TO put the condition in the join clause.
3-avoiding subquery is recommended by lots of reliable resources and expert programmers.It usually increase the query time and you can use subquery just when its result data set is small.
I hope this will be useful:)
1M rows really isn't that much - especially if you have sensible indexes. I'd start off with making your queries as readable and maintainable as possible, and only start optimizing if you notice a perforamnce problem with the query (and as Gordon Linoff said in his comment - it's doubtful there would even be a difference between the three).
It may be a matter of taste, but to me, the third way seems clumsy, so I'd cross it out. Personally, I prefer using JOIN syntax for the joining logic (i.e., how A and B's rows are matched) and WHERE for filtering (i.e., once matched, which rows interest me), so I'd go for the first way. But again, it really boils down to personal taste and preferences.
You need to look at the execution plans for the queries to judge which is the most computationally efficient. As pointed out in the comments you may find they are equivalent. Here is some information on Oracle execution plans. Depending on what editor / IDE you use the may be a shortcut for this e.g. F5 in PL/SQL Developer.

SQL (any) Request for insight on a query optimization

I have a particularly slow query due to the vast amount of information being joined together. However I needed to add a where clause in the shape of id in (select id from table).
I want to know if there is any gain from the following, and more pressing, will it even give the desired results.
select a.* from a where a.id in (select id from b where b.id = a.id)
as an alternative to:
select a.* from a where a.id in (select id from b)
Update:
MySQL
Can't be more specific sorry
table a is effectively a join between 7 different tables.
use of * is for examples
Edit, b doesn't get selected
Your question was about the difference between these two:
select a.* from a where a.id in (select id from b where b.id = a.id)
select a.* from a where a.id in (select id from b)
The former is a correlated subquery. It may cause MySQL to execute the subquery for each row of a.
The latter is a non-correlated subquery. MySQL should be able to execute it once and cache the results for comparison against each row of a.
I would use the latter.
Both queries you list are the equivalent of:
select a.*
from a
inner join b on b.id = a.id
Almost all optimizers will execute them in the same way.
You could post a real execution plan, and someone here might give you a way to speed it up. It helps if you specify what database server you are using.
YMMV, but I've often found using EXISTS instead of IN makes queries run faster.
SELECT a.* FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
Of course, without seeing the rest of the query and the context, this may not make the query any faster.
JOINing may be a more preferable option, but if a.id appears more than once in the id column of b, you would have to throw a DISTINCT in there, and you more than likely go backwards in terms of optimization.
I would never use a subquery like this. A join would be much faster.
select a.*
from a
join b on a.id = b.id
Of course don't use select * either (especially never use it when doing a join as at least one field is repeated) and it wastes network resources to send unnneeded data.
Have you looked at the execution plan?
How about
select a.*
from a
inner join b
on a.id = b.id
presumably the id fields are primary keys?
Select a.* from a
inner join (Select distinct id from b) c
on a.ID = c.AssetID
I tried all 3 versions and they ran about the same. The execution plan was the same (inner join, IN (with and without where clause in subquery), Exists)
Since you are not selecting any other fields from B, I prefer to use the Where IN(Select...) Anyone would look at the query and know what you are trying to do (Only show in a if in b.).
your problem is most likely in the seven tables within "a"
make the FROM table contain the "a.id"
make the next join: inner join b on a.id = b.id
then join in the other six tables.
you really need to show the entire query, list all indexes, and approximate row counts of each table if you want real help