I'm working with two tables A and B. Table A identifies equity securities and Table B has a number of details on the security.
For example, when B.Item = 5301, the row specifies price for a given security. When B.Item = 9999, the row specifies dividends for a given security. I am trying to get both price and dividends in the same row. In order to achieve this, I FULL JOINed table B twice to table A.
SELECT *
FROM a a
FULL JOIN (SELECT *
FROM b) b
ON b.code = a.code
AND b.item = 5301
FULL JOIN (SELECT *
FROM b) b2
ON b2.code = a.code
AND b.item = 9999
AND b2.year_ = b.year_
AND b.freq = b2.freq
AND b2.seq = b.seq
WHERE a.code IN ( 122514 )
The remaining fields in the join clause like Year_, Freq, and Seq just make sure the dates of the price and dividends match. A.Code simply identifies a single security.
My issue is that when I flip the order of the full joins I get a different number of results. If b.Item = 9999 comes before b.Item = 5301, I get one result; the other way around I get two. I realized that table B has no dividend entries for security 122514 but has two price entries.
When price is specified first, I get both prices, and the dividend fields are NULL. However, when dividend is specified first, I get NULLs for the dividend fields and also NULLs for the price fields.
Why aren't the two price entries showing up? I would expect them to in a FULL JOIN.
It's because your second FULL OUTER JOIN refers to your first FULL OUTER JOIN. This means changing the order of them is making a fundamental change to the query.
Here is some pseudo-SQL that demonstrates how this works:
CREATE TABLE #a (Id INT, Name VARCHAR(50));
INSERT INTO #a VALUES (1, 'Dog Trades');
INSERT INTO #a VALUES (2, 'Cat Trades');
CREATE TABLE #b (Id INT, ItemCode VARCHAR(1), PriceDate DATE, Price INT, DividendDate DATE, Dividend INT);
INSERT INTO #b VALUES (1, 'p', '20141001', 100, '20140101', 1000);
INSERT INTO #b VALUES (1, 'p', '20141002', 50, NULL, NULL);
INSERT INTO #b VALUES (2, 'c', '20141001', 10, '20141001', 500);
INSERT INTO #b VALUES (2, 'c', NULL, NULL, '20141002', 300);
--Same results
SELECT a.*, b1.*, b2.* FROM #a a FULL OUTER JOIN #b b1 ON b1.Id = a.Id AND b1.ItemCode = 'p' FULL OUTER JOIN #b b2 ON b2.Id = a.Id AND b2.ItemCode = 'c';
SELECT a.*, b2.*, b1.* FROM #a a FULL OUTER JOIN #b b1 ON b1.Id = a.Id AND b1.ItemCode = 'c' FULL OUTER JOIN #b b2 ON b2.Id = a.Id AND b2.ItemCode = 'p';
--Different results
SELECT a.*, b1.*, b2.* FROM #a a FULL OUTER JOIN #b b1 ON b1.Id = a.Id AND b1.ItemCode = 'p' FULL OUTER JOIN #b b2 ON b2.Id = a.Id AND b2.ItemCode = 'c' AND b2.DividendDate = b1.PriceDate;
SELECT a.*, b2.*, b1.* FROM #a a FULL OUTER JOIN #b b1 ON b1.Id = a.Id AND b1.ItemCode = 'c' FULL OUTER JOIN #b b2 ON b2.Id = a.Id AND b2.ItemCode = 'p' AND b2.DividendDate = b1.PriceDate;
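One way to make the result independent of join order is to pair up the price and dividend rows with each other first, and only then attach the pairs to A. The following is a minimal sketch against the tables from the question, not a tested solution: it assumes item 5301 is price and 9999 is dividend (as the question text states), and the aliases p, d and pd and the output column names are made up for illustration. B's actual value columns are not shown in the question, so the sketch only exposes the item codes; add whatever columns you need from p and d.

```sql
-- Sketch only: p, d and pd are hypothetical aliases; item codes follow the question text.
SELECT a.*, pd.*
FROM a
LEFT JOIN (
    -- FULL JOIN the two filtered subsets to each other. This join is symmetric,
    -- so swapping p and d does not change which rows come back.
    SELECT COALESCE(p.code,  d.code)  AS code,
           COALESCE(p.year_, d.year_) AS year_,
           COALESCE(p.freq,  d.freq)  AS freq,
           COALESCE(p.seq,   d.seq)   AS seq,
           p.item AS price_item,      -- NULL when there is no price row
           d.item AS dividend_item    -- NULL when there is no dividend row
    FROM      (SELECT * FROM b WHERE item = 5301) AS p
    FULL JOIN (SELECT * FROM b WHERE item = 9999) AS d
           ON  d.code  = p.code
           AND d.year_ = p.year_
           AND d.freq  = p.freq
           AND d.seq   = p.seq
) AS pd
       ON pd.code = a.code
WHERE a.code IN (122514);
```

Because the item filters sit inside the derived tables rather than in an ON clause that refers to an earlier join, neither side's rows depend on whether the other side found a match.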
Related
I know this shouldn't happen in a database, but it happened and we have to deal with it. We need to insert new rows into a table if they don't already exist, based on the values in another table. This is easy enough (just do a LEFT JOIN and check for NULL values in the 1st table). But the join isn't very straightforward: we need to search the 1st table on 2 conditions with an OR, not an AND. So if there is a match on either of the 2 attributes, we consider that the corresponding row in the 1st table exists and we don't have to insert a new one. If there are no matches on either of the 2 attributes, we consider it a new row. We can use an OR condition in the LEFT JOIN, but from what I understand it does a full table scan, and the query takes a very long time to complete even though it yields the right results. We cannot use UNION either, because it will not give us what we're looking for.
Just for simplicity purpose consider the scenario below (we need to insert data into tableA).
If(OBJECT_ID('tempdb..#tableA') Is Not Null) Begin
Drop Table #tableA End
If(OBJECT_ID('tempdb..#tableB') Is Not Null) Begin
Drop Table #tableB End
create table #tableA ( email nvarchar(50), id int )
create table #tableB ( email nvarchar(50), id int )
insert into #tableA (email, id) values ('123#abc.com', 1), ('456#abc.com', 2), ('789#abc.com', 3), ('012#abc.com', 4)
insert into #tableB (email, id) values ('234#abc.com', 1), ('456#abc.com', 2), ('567#abc.com', 3), ('012#abc.com', 4), ('345#abc.com', 5)
--THIS QUERY IS CORRECTLY RETURNING 1 RECORD
select B.email, B.id
from #tableB B
left join #tableA A on A.email = B.email or B.id = A.id
where A.id is null
--THIS QUERY IS INCORRECTLY RETURNING 3 RECORDS SINCE THERE ARE ALREADY RECORDS WITH ID's 1 & 3 in tableA though the email addresses of these records don't match
select B.email, B.id
from #tableB B
left join #tableA A on A.email = B.email
where A.id is null
union
select B.email, B.id
from #tableB B
left join #tableA A on B.id = A.id
where A.id is null
If(OBJECT_ID('tempdb..#tableA') Is Not Null) Begin
Drop Table #tableA End
If(OBJECT_ID('tempdb..#tableB') Is Not Null) Begin
Drop Table #tableB End
The 1st query works correctly and only returns 1 record, but the tables here hold just a few records, so it completes in under 1 sec. When the 2 tables have thousands of records, the query may take 10 min to complete. The 2nd query of course returns records we don't want to insert, because we consider them existing. Is there a way to optimize this query so it takes an acceptable time to complete?
You are using an anti join, which is another way of writing the straight-forward NOT EXISTS:
where not exists
(
select null
from #tableA A
where A.email = B.email or B.id = A.id
)
I.e. where no row exists in table A with the same email or the same id. By De Morgan's law that is the same as: where no row exists with the same email, and no row exists with the same id.
where not exists (select null from #tableA A where A.email = B.email)
and not exists (select null from #tableA A where B.id = A.id)
With the appropriate indexes
-- index names are placeholders
create index ix_tableA_id on #tableA (id);
create index ix_tableA_email on #tableA (email);
this should be very fast.
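Putting the pieces together, the full insert might look like the following. This is a sketch against the temp tables from the question; it assumes #tableB itself contains no duplicate emails or ids, since rows are only checked against #tableA and not against each other.

```sql
-- Insert only those rows of #tableB that match no row in #tableA on email
-- and also match no row in #tableA on id.
insert into #tableA (email, id)
select B.email, B.id
from #tableB B
where not exists (select null from #tableA A where A.email = B.email)
  and not exists (select null from #tableA A where A.id = B.id);
```

With the sample data from the question, only the row with id 5 passes both checks, which matches the single record returned by the original OR query.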
It's hard to tune something you can't see. Another option to get the data is to:
SELECT B.email
, B.id
FROM #TableB B
EXCEPT
(
SELECT B.email
, B.id
FROM #tableB B
INNER JOIN #tableA A
ON A.email = B.email
UNION ALL
SELECT B.email
, B.id
FROM #tableB B
INNER JOIN #tableA A
ON B.id = A.id
)
This way you don't have to use OR, you can use INNER JOIN rather than LEFT JOIN and you can use UNION ALL instead of UNION (though this advantage may well be negated by the EXCEPT). All of which may help your performance. Perhaps the joins can be more efficient when replaced with EXISTS.
You didn't mention how this problem occurred (where the data from both tables is coming from, and why they are out of sync when they shouldn't be), but it would be preferable to fix it at the source.
No, the second query correctly returns 3 rows,
because
select B.email, B.id
from #tableB B
left join #tableA A on A.email = B.email
where A.id is null
alone already returns those 3 rows.
As for your "problem":
select B.email, B.id
from #tableB B
left join #tableA A on A.email = B.email or B.id = A.id
where A.id is null
will check, for every pair of rows, whether either condition is true.
So for example, take
('123#abc.com', 1) ('234#abc.com', 1)
Because the ids are the same, these two rows are joined by the OR query.
But when you join on the emails alone, the condition is false, so the row from #tableB is included in the result set.
You can only use the UNION approach when you are comparing just the emails or just the ids. When you compare both, the two queries are not equivalent.
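Following that reasoning, if the two anti-joins are to be kept as separate queries, they would have to be combined with INTERSECT rather than UNION: "no match on email or id" means "no match on email" and "no match on id" at the same time. A sketch using the temp tables from the question (INTERSECT assumes SQL Server 2005 or later):

```sql
-- Rows of #tableB with no email match in #tableA ...
select B.email, B.id
from #tableB B
left join #tableA A on A.email = B.email
where A.id is null
intersect
-- ... that also have no id match in #tableA.
select B.email, B.id
from #tableB B
left join #tableA A on B.id = A.id
where A.id is null
```

With the sample data, the first branch yields the rows with ids 1, 3 and 5 and the second yields only the row with id 5, so the intersection is that single row. Note that INTERSECT also removes duplicate rows, just as UNION does.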
I am probably trying to use JOINs for purposes they were not intended here.
Here's my (simplified) table structure:
Table A
ID
Table C ID
IsStatic (bit)
Table B
ID
Table A ID (nullable)
Table C ID
Table C
ID
My goal is to get all of Table B rows joined to Table A rows where Table B's Table A ID column has a value and equals Table A's ID column value.
I also need all of Table B rows where Table B's Table A ID column has no value.
I also need all of Table A rows where there were no joined Table B rows and Table A's IsStatic column is true.
Table C must also be associated with Table A or Table B. If Table B does not have a value for TableAID then its value for TableCID should equal TableC's ID value. Otherwise TableA's TableCID should equal TableC's ID value.
Here's some SQL to create some temp tables and populate them with sample data:
CREATE TABLE #TableA (TableAID int, TableCID int, IsStatic bit)
CREATE TABLE #TableB (TableBID int, TableAID int, TableCID int)
CREATE TABLE #TableC (TableCID int)
INSERT INTO #TableC (TableCID) VALUES (1)
INSERT INTO #TableC (TableCID) VALUES (2)
INSERT INTO #TableA (TableAID, TableCID, IsStatic) VALUES (1, 1, 0)
INSERT INTO #TableA (TableAID, TableCID, IsStatic) VALUES (2, 2, 1)
INSERT INTO #TableA (TableAID, TableCID, IsStatic) VALUES (3, 2, 1)
INSERT INTO #TableA (TableAID, TableCID, IsStatic) VALUES (4, 2, 0)
INSERT INTO #TableB (TableBID, TableAID, TableCID) VALUES (1, NULL, 1)
INSERT INTO #TableB (TableBID, TableAID, TableCID) VALUES (2, 1, 1)
INSERT INTO #TableB (TableBID, TableAID, TableCID) VALUES (3, 2, 2)
Here's my (simplified) query that didn't quite work:
SELECT
a.TableAID,
b.TableBID
FROM #TableC c
LEFT OUTER JOIN #TableB b ON
(b.TableAID IS NOT NULL OR (b.TableAID IS NULL AND b.TableCID = c.TableCID))
LEFT OUTER JOIN #TableA a ON
a.TableCID = c.TableCID
AND ((a.IsStatic = 1 AND b.TableBID IS NULL)
OR (b.TableBID IS NOT NULL AND b.TableAID = a.TableAID))
The result of this query using the sample data is:
TableAID TableBID
-----------------
NULL 1
1 2
NULL 3 (not required)
NULL 2 (not required)
2 3
The required result is:
TableAID TableBID
-----------------
NULL 1
3 NULL (missing)
2 3
1 2
Problem with this query is that if TableB.TableAID has no value then the Table A rows where TableA.IsStatic is true without any matching TableB rows are never included. Also some TableB rows are being included and they shouldn't be.
The only other way I can see of doing this is with a union with a not exists but I was hoping to do this in a more efficient way.
Update: Adding a WHERE clause removes the "not required" rows but still omits the missing row.
WHERE (b.TableBID IS NULL OR b.TableAID IS NULL OR b.TableAID = a.TableAID)
The result of the same query with the where clause is:
TableAID TableBID
-----------------
NULL 1
1 2
2 3
select b.TableAID, b.TableBID
from #TableB b
left join #TableA a on a.TableAID = b.TableAID
inner join #TableC c on c.TableCID = case when a.TableAID IS NULL then b.TableCID else a.TableCID end
union all
select a.TableAID, NULL
from #TableA a
inner join #TableC c on c.TableCID = a.TableCID
left join #TableB b on b.TableAID = a.TableAID
where b.TableAID is NULL
and a.IsStatic = 1
What a mind twister. I think that this is another way to express it. You'll have to see if the performance is good or not:
select a.TableAID, b.TableBID
from (select a.*
from #TableA a
join #TableC c
on c.TableCID = a.TableCID) a
full outer join (select b.*
from #TableB b
join #TableC c
on c.TableCID = b.TableCID) b
on b.TableAID = a.TableAID
where b.TableBID is not null or a.IsStatic = 1
I should also mention that it's hard to know for sure if the above query really respects your requirements using the sample data you provided. To illustrate, if I use this simplified query below that simply ignores the #TableC table, I still get the right results with your sample data:
select a.TableAID, b.TableBID
from #TableA a
full outer join #TableB b
on b.TableAID = a.TableAID
where b.TableBID is not null or a.IsStatic = 1
EDIT: Funny discussion in the comments about the interpretation of the OP's requirements... But if I had to address Anton's point:
select a.TableAID, b.TableBID
from (select a.*,
case when c.TableCID is not null then 1 end as has_c
from #TableA a
left join #TableC c
on c.TableCID = a.TableCID) a
full outer join (select b.*,
case when c.TableCID is not null then 1 end as has_c
from #TableB b
left join #TableC c
on c.TableCID = b.TableCID) b
on b.TableAID = a.TableAID
where (b.TableBID is not null or a.IsStatic = 1)
and (a.has_c = 1 or b.has_c = 1)
I have tried to convert old MS SQL join syntax to the new join syntax, but the number of rows in the results does not match.
Original SQL:
select
b.Amount
from
TableA a, TableB b,TableC c, TableD d
where
a.inv_no *= b.inv_no and
a.inv_item *= b.inv_item and
c.currency *= b.cash_ccy and
d.tx_code *= b.cash_receipt
Converted SQL:
SELECT
b.AMOUNT
FROM
(TableA AS a
LEFT OUTER JOIN
TableB AS b ON a.INV_NO = b.INV_NO
AND a.inv_item = b.inv_item
LEFT OUTER JOIN
TableC AS c ON c.currency = b.cash_ccy)
LEFT OUTER JOIN
TableD as d ON d.tx_code = b.cash_receipt
Findings
The results are the same for the original SQL and the modified SQL up to the joining of 3 tables, but when the fourth table (TableD) is joined in the modified SQL, the number of rows returned is different.
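The order of fields within predicates is important when using SQL Server's (deprecated) proprietary outer join operators *= and =* with ANSI 89 comma-join syntax: the table on the * side of the operator is the one whose rows are preserved.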
So while
SELECT *
FROM TableA AS A
LEFT JOIN TableB AS B
ON A.ColA = B.ColB;
Is exactly the same as
SELECT *
FROM TableA AS A
LEFT JOIN TableB AS B
ON B.ColB = A.ColA; -- NOTE ORDER HERE
The equivalent
SELECT *
FROM TableA AS A, TableB AS b
WHERE A.ColA *= B.ColB;
Is not the same as
SELECT *
FROM TableA AS A, TableB AS b
WHERE B.ColB *= A.ColA;
This last query's ANSI 92 equivalent would be
SELECT *
FROM TableA AS A
RIGHT JOIN TableB AS B
ON A.ColA = B.ColB;
Or if you dislike RIGHT JOIN as much as I do you would probably write:
SELECT *
FROM TableB AS B
LEFT OUTER JOIN TableA AS A
ON B.ColB = A.ColA;
So actually the equivalent query in ANSI 92 join syntax would involve starting with TableA, TableC and TableD (since these are the leading fields in the original WHERE Clause). Then since there is no direct link between the three, you end up with a cross join
SELECT b.Amount
FROM TableA AS a
CROSS JOIN TableD AS d
CROSS JOIN TableC AS c
LEFT JOIN TableB AS B
ON c.currency = b.cash_ccy
AND d.tx_code = b.cash_receipt
AND a.INV_NO = b.INV_NO
AND a.inv_item = b.inv_item;
This is the equivalent rewrite, and it explains the difference in the number of rows.
WORKING EXAMPLE
Needs to be run on SQL Server 2008 or earlier with compatibility level 80 or less
-- SAMPLE DATA --
CREATE TABLE #TableA (Inv_No INT, Inv_item INT);
CREATE TABLE #TableB (Inv_No INT, Inv_item INT, cash_ccy INT, cash_receipt INT, Amount INT);
CREATE TABLE #TableC (currency INT);
CREATE TABLE #TableD (tx_code INT);
INSERT #TableA (inv_no, inv_item) VALUES (1, 1), (2, 2);
INSERT #TableB (inv_no, inv_item, cash_ccy, cash_receipt, Amount) VALUES (1, 1, 1, 1, 1), (2, 2, 2, 2, 2);
INSERT #TableC (currency) VALUES (1), (2), (3), (4);
INSERT #TableD (tx_code) VALUES (1), (2), (3), (4);
-- ORIGINAL QUERY(32 ROWS)
SELECT
b.Amount
FROM
#TableA a, #TableB b,#TableC c, #TableD d
WHERE
a.inv_no *= b.inv_no and
a.inv_item *= b.inv_item and
c.currency *= b.cash_ccy and
d.tx_code *= b.cash_receipt
-- INCORRECT ANSI 92 REWRITE (2 ROWS)
SELECT b.AMOUNT
FROM #TableA AS a
LEFT OUTER JOIN #TableB AS b
ON a.INV_NO = b.INV_NO
and a.inv_item = b.inv_item
LEFT OUTER JOIN #TableC AS c
ON c.currency = b.cash_ccy
LEFT OUTER JOIN #TableD as d
ON d.tx_code = b.cash_receipt;
-- CORRECT ANSI 92 REWRITE (32 ROWS)
SELECT b.Amount
FROM #TableA AS a
CROSS JOIN #TableD AS d
CROSS JOIN #TableC AS c
LEFT JOIN #TableB AS B
ON c.currency = b.cash_ccy
AND d.tx_code = b.cash_receipt
AND a.INV_NO = b.INV_NO
AND a.inv_item = b.inv_item;
Are there any case where these two are not equivalent?
A OUTER JOIN (B JOIN C)
A OUTER JOIN C OUTER JOIN B
Yes:
In your first example (A OUTER JOIN (B JOIN C)), if either B or C does not have a matching record, both B and C are omitted.
In your second example (A OUTER JOIN C OUTER JOIN B), C can be returned even if B does not have a matching record.
Still the same as my comment: if B join C produces an empty result set, then A outer join "empty result set" is the same as just A.
A outer join B outer join C is something different (at least if one of B and C is not empty).
Since A seems to connect to C, it would be clearer to write the first option as:
A OUTER JOIN (C JOIN B)
The key difference between your two options is whether data from C is returned when there is a match between A and C. Just looking at this aspect the two options could be viewed as the sets:
intersect(A,intersect(C,B))
intersect(A,C)
Clearly, the two are different since the first form can eliminate rows from C before it is intersected with A.
Which fields you join on can make a difference, as tables are not guaranteed to have only one possible field on which to join to another table. In this case, does table b have a field to join to either table c or table a? That makes a difference. Which fields you want returned makes a difference to the result set and to whether two queries will return the same results. The state of the data makes a difference, as some queries will appear to be equivalent until the data changes. So understanding that these are not equivalent queries helps you avoid these mistakes. Whether you use a full, left or right outer join makes a difference as well. And finally, the WHERE clauses you add can make a difference in whether the queries appear to be equivalent.
Check out these examples using temp tables (SQL server syntax)
create table #a (aid int, sometext varchar(50))
create table #b (bid int, sometext2 varchar(50), cid int, aid int)
create table #c (cid int, sometext3 varchar(50), aid int)
insert into #a
values(1, 'test') , (2, 'test2'), (3, 'test3')
insert into #b
values(1, 'test', 1, 2) , (2, 'test2', 2, 1), (3, 'test3', 2, 2)
insert into #c
values(1, 'test', 1) , (2, 'test2', 2), (3, 'test3', 1)
select *
from #a a
left outer join #c c on a.aid = c.aid
left outer join #b b on a.aid = b.aid
select *
from #a a
left outer join #c c on a.aid = c.aid
left outer join #b b on c.cid = b.cid
select *
from #a a
left outer join #b b
join #c c on b.cid = c.cid
on a.aid = b.aid
select *
from #a a
right outer join #c c on a.aid = c.aid
right outer join #b b on a.aid = b.aid
select *
from #a a
right outer join #c c on a.aid = c.aid
right outer join #b b on c.cid = b.cid
select *
from #a a
right outer join #b b
join #c c on b.cid = c.cid
on a.aid = b.aid
select *
from #a a
full outer join #c c on a.aid = c.aid
full outer join #b b on a.aid = b.aid
select *
from #a a
full outer join #c c on a.aid = c.aid
full outer join #b b on c.cid = b.cid
select *
from #a a
full outer join #b b
join #c c on b.cid = c.cid
on a.aid = b.aid
Let's say I have a database that looks like this:
tblA:
ID, Name, Sequence, tblBID
1 a 5 14
2 b 3 15
3 c 3 16
4 d 3 17
tblB:
ID, Group
14 1
15 1
16 2
17 3
I would like to sequence A so that the sequences go 1...n for each group of B.
So in this case, the sequences going down should be 1,2,1,1.
The ordering needs to be consistent with the current ordering, but there are no guarantees as to the current ordering.
I am not exactly a sql master and I am sure there is a fairly easy way to do this, but I really don't know the right route to take. Any hints?
If you are using SQL Server 2005 or later, you can use a ranking function:
Select tblA.Id, tblA.Name
, Row_Number() Over ( Partition By tblB.[Group] Order By tblA.Id ) As Sequence
, tblA.tblBID
From tblA
Join tblB
On tblA.tblBID = tblB.ID
Row_Number ranking function.
Here's another solution that would work in SQL Server 2000 and prior.
Select A.Id, A.Name
, (Select Count(*)
From tblB As B1
Where B1.[Group] = B.[Group]
And B1.Id < B.ID) + 1 As Sequence
, A.tblBID
From tblA As A
Join tblB As B
On B.Id = A.tblBID
EDIT
Also want to make it clear that I want to actually update tblA to reflect the proper sequences.
In SQL Server, you can use their proprietary From clause in an Update statement like so:
Update tblA
Set Sequence = (
Select Count(*)
From tblB As B1
Where B1.[Group] = B.[Group]
And B1.Id < B.ID
) + 1
From tblA As A
Join tblB As B
On B.Id = A.tblBID
The Hoyle ANSI solution might be something like:
Update tblA
Set Sequence = (
Select (Select Count(*)
From tblB As B1
Where B1.[Group] = B.[Group]
And B1.Id < B.ID) + 1
From tblA As A
Join tblB As B
On B.Id = A.tblBID
Where A.Id = tblA.Id
)
EDIT
Can we do that [the inner group] comparison based on A.Sequence instead of B.ID?
Select A1.*
, (Select Count(*)
From tblB As B2
Join tblA As A2
On A2.tblBID = B2.Id
Where B2.[Group] = B1.[Group]
And A2.Sequence < A1.Sequence) + 1
From tblA As A1
Join tblB As B1
On B1.Id = A1.tblBID
Because it's SQL 2000, we can't use a windowing function. That's okay.
Thomas's queries are good and will work. However, they will get worse and worse as the number of rows increases—with different characteristics depending on how wide (the number of groups) and how deep (the number of items per group). This is because those queries use a partial cross-join, perhaps we could call it a "pyramidal cross-join" where the crossing part is limited to right side values less than left side values rather than left crossing to all right values.
What to do?
I think you will be surprised to find that the following long and painful-looking script will outperform the pyramidal join at a certain size of data (which may not be all that big) and eventually, with really large data sets must be considered a screaming performer:
CREATE TABLE #tblA (
ID int identity(1,1) NOT NULL,
Name varchar(1) NOT NULL,
Sequence int NOT NULL,
tblBID int NOT NULL,
PRIMARY KEY CLUSTERED (ID)
)
INSERT #tblA VALUES ('a', 5, 14)
INSERT #tblA VALUES ('b', 3, 15)
INSERT #tblA VALUES ('c', 3, 16)
INSERT #tblA VALUES ('d', 3, 17)
CREATE TABLE #tblB (
ID int NOT NULL PRIMARY KEY CLUSTERED,
GroupID int NOT NULL
)
INSERT #tblB VALUES (14, 1)
INSERT #tblB VALUES (15, 1)
INSERT #tblB VALUES (16, 2)
INSERT #tblB VALUES (17, 3)
CREATE TABLE #seq (
seq int identity(1,1) NOT NULL,
ID int NOT NULL,
GroupID int NOT NULL,
PRIMARY KEY CLUSTERED (ID)
)
INSERT #seq
SELECT
A.ID,
B.GroupID
FROM
#tblA A
INNER JOIN #tblB B ON A.tblBID = b.ID
ORDER BY B.GroupID, A.Sequence
UPDATE A
SET A.Sequence = S.seq - X.MinSeq + 1
FROM
#tblA A
INNER JOIN #seq S ON A.ID = S.ID
INNER JOIN (
SELECT GroupID, MinSeq = Min(seq)
FROM #seq
GROUP BY GroupID
) X ON S.GroupID = X.GroupID
SELECT * FROM #tblA
DROP TABLE #seq
DROP TABLE #tblB
DROP TABLE #tblA
If I understood you correctly, then ORDER BY B.GroupID, A.Sequence is correct. If not, you can switch A.Sequence to B.ID.
Also, my index on the temp table should be experimented with. For a certain quantity of rows, and also the width and depth characteristics of those rows, clustering on one of the other two columns in the #seq table could be helpful.
Last, a different data organization is possible: leaving GroupID out of the #seq table and joining again. I suspect it would be worse, but am not 100% sure.
Something like:
SELECT a.id, a.name, row_number() over (partition by b.[Group] order by a.id)
FROM tblA a
JOIN tblB b on a.tblBID = b.ID;
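Since the question asks for tblA to actually be updated, the ranking-function approach can be turned into an update through a common table expression. This is only a sketch and assumes SQL Server 2005 or later, so it does not apply to the SQL 2000 case discussed above. The CTE name numbered and the column alias NewSequence are made up for illustration, and ordering by the current Sequence with ID as a tie-breaker is an assumption about what "consistent with the current ordering" means.

```sql
-- Hypothetical names: numbered, NewSequence.
WITH numbered AS (
    SELECT a.Sequence,
           ROW_NUMBER() OVER (PARTITION BY b.[Group]
                              ORDER BY a.Sequence, a.ID) AS NewSequence
    FROM tblA AS a
    JOIN tblB AS b
      ON b.ID = a.tblBID
)
-- The update goes through the CTE to the single underlying base table, tblA.
UPDATE numbered
SET Sequence = NewSequence;
```

If the numbering should follow tblA.ID instead, replace the ORDER BY inside the OVER clause accordingly.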