Strategies to reduce redundant data on join


My production query has 10 joins, and its select list includes large text columns.
The problem is that some searches return more than 8 MB of data, of which about 7 MB is duplicate data caused by the joins. It looks something like the image above, but much bigger.
My example table structure:
create table A (
Id int primary key identity(1,1),
Value varchar(100)
)
create table B(
Id int primary key identity(1,1),
AId int,
Value varchar(100)
)
create table C(
Id int primary key identity(1,1),
BId int,
Value varchar(100)
)
Inserts:
insert into A values ('value A1')
insert into A values ('value A2')
insert into A values ('value A3')
insert into B values(1, 'value B1')
insert into B values(1, 'value B2')
insert into B values(1, 'value B3')
insert into C values(1, 'value C1')
insert into C values(1, 'value C2')
insert into C values(1, 'value C3')
How can I return only the necessary data in a smart way, with performance equal to or better than the join?
The query I run is:
select * from A
left join B on A.Id = B.AId
left join C on C.BId = B.Id
where A.Id = 1

You can use row_number() to return only the first appearance of the values:
select A.Id,
(case when row_number() over (partition by a.id order by b.id, c.id) = 1
then a.value
end) as a_value,
B.id,
(case when row_number() over (partition by a.id, b.id order by c.id) = 1
then b.value
end) as b_value,
C.id,
C.value
from A left join
B
on A.Id = B.AId left join
C
on C.BId = B.Id
where A.Id = 1
order by a.id, b.id, c.id;
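To see what the row_number() approach above returns for the question's sample data, here is a minimal sketch. It is an assumption-laden stand-in, not the production setup: it uses Python's built-in sqlite3 module with an in-memory SQLite database (3.25 or later, for window functions) instead of SQL Server, and INTEGER PRIMARY KEY instead of identity(1,1).

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (Id INTEGER PRIMARY KEY, Value TEXT);
CREATE TABLE B (Id INTEGER PRIMARY KEY, AId INTEGER, Value TEXT);
CREATE TABLE C (Id INTEGER PRIMARY KEY, BId INTEGER, Value TEXT);
INSERT INTO A (Value) VALUES ('value A1'), ('value A2'), ('value A3');
INSERT INTO B (AId, Value) VALUES (1, 'value B1'), (1, 'value B2'), (1, 'value B3');
INSERT INTO C (BId, Value) VALUES (1, 'value C1'), (1, 'value C2'), (1, 'value C3');
""")

# Same shape as the answer's query: a parent value is emitted only on the
# first row of its partition and is NULL on every repeated row.
rows = conn.execute("""
SELECT A.Id,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY A.Id ORDER BY B.Id, C.Id) = 1
            THEN A.Value END AS a_value,
       B.Id,
       CASE WHEN ROW_NUMBER() OVER (PARTITION BY A.Id, B.Id ORDER BY C.Id) = 1
            THEN B.Value END AS b_value,
       C.Id,
       C.Value
FROM A
LEFT JOIN B ON A.Id = B.AId
LEFT JOIN C ON C.BId = B.Id
WHERE A.Id = 1
ORDER BY A.Id, B.Id, C.Id
""").fetchall()

for row in rows:
    print(row)
```

The number of rows is the same as with the plain join (five for this sample data). Only the repeated text is replaced by NULL, so the saving comes from not resending the large columns, not from returning fewer rows.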

Related

how to capture the data that is outside of the join scope into a tempTBL

Let's say we have the tables A and B, where A is the parent table of B
TableA:
ID | VAL
1 | "foo"
2 | "bar"
TableB:
ID | aID
1 | 2
OK?
Now let's have a join:
select *
from A
inner join B on a.Id = b.aID
Is there a way to use the INTO keyword to immediately store the failed join records in a temporary table, perhaps something similar to the OUTPUT clause?
I know it is a bit far-fetched, but maybe there is a way I'm not aware of. It pays to ask.

CREATE TABLE ##tmp (
ID int,
VAL nvarchar(3),
IDD int,
aID int
)
CREATE TABLE ##tmp1 (
ID int,
VAL nvarchar(3)
)
;WITH TableA AS (
SELECT *
FROM (VALUES
(1, 'foo'),(2, 'bar')) as t(ID, VAL)
), TableB AS (
SELECT *
FROM (VALUES
(1, 2)) as t(ID, aID)
)
INSERT INTO ##tmp
select a.ID,
a.VAL,
b.ID AS IDD,
b.aID
from TableA a
FULL OUTER JOIN TableB B on a.Id = b.aID
DELETE FROM ##tmp
OUTPUT deleted.ID, deleted.VAL INTO ##tmp1
WHERE IDD IS NULL
Data in ##tmp:
ID VAL IDD aID
----------- ---- ----------- -----------
2 bar 1 2
(1 row(s) affected)
Data in ##tmp1:
ID VAL
----------- ----
1 foo
(1 row(s) affected)

Failed join record? Do you mean non-matching records?
select *
from A
left join B on a.Id = b.aID
where b.aID IS NULL
To store them in a temporary table, create a table with the required columns from the rows retrieved by the join, then do:
INSERT INTO #temp
SELECT * from A
left join B on a.Id = b.aID
where b.aID IS NULL
Or, if you need all the columns, use SELECT * INTO:
SELECT * INTO #temp from A
left join B on a.Id = b.aID
where b.aID IS NULL
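As a quick check of the left join / IS NULL pattern on the question's sample rows, here is a small sketch. It assumes SQLite through Python's sqlite3 module, not SQL Server, joins on the question's a.Id = b.aID condition, and uses a hypothetical temp table name, unmatched.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (ID INTEGER, VAL TEXT);
CREATE TABLE B (ID INTEGER, aID INTEGER);
INSERT INTO A VALUES (1, 'foo'), (2, 'bar');
INSERT INTO B VALUES (1, 2);

-- Anti-join: keep only the rows of A that found no partner in B.
-- "unmatched" is a hypothetical name for the temp table.
CREATE TEMP TABLE unmatched AS
SELECT a.ID, a.VAL
FROM A a
LEFT JOIN B b ON a.ID = b.aID
WHERE b.aID IS NULL;
""")

unmatched = conn.execute("SELECT ID, VAL FROM unmatched").fetchall()
matched = conn.execute("""
SELECT a.ID, a.VAL, b.ID, b.aID
FROM A a
JOIN B b ON a.ID = b.aID
""").fetchall()

print("matched:", matched)
print("unmatched:", unmatched)
```

This still takes two statements, one for the matched rows and one for the unmatched ones; it does not capture the failed rows as a side effect of the inner join.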

Parent Child Query without using SubQuery

Let's say I have two tables,
Table A
ID Name
-- ----
1 A
2 B
Table B
AID Date
-- ----
1 1/1/2000
1 1/2/2000
2 1/1/2005
2 1/2/2005
Now I need this result without using a subquery,
ID Name Date
-- ---- ----
1 A 1/2/2000
2 B 1/2/2005
I know how to do this with a subquery, but I want to avoid using one. Is that possible?

If I got your meaning right and you need the latest date from TableB, then the query below should do it:
select a.id,a.name,max(b.date)
from TableA a
join TableB b on b.aid = a.id
group by a.id,a.name

SELECT A.ID, A.Name, MAX(B.Date)
FROM TableA A
INNER JOIN TableB B
ON B.AID = A.ID
GROUP BY A.ID, A.Name
It's a simple aggregation. It looks like you want the highest date per ID/Name combination.

create table #t1 (id int, Name varchar(10))
create table #t2 (Aid int, Dt date)
insert #t1 values (1, 'A'), (2, 'B')
insert #t2 values (1, '1/1/2000'), (1, '1/2/2000'), (2, '1/1/2005'), (2, '1/2/2005')
;WITH cte (AId, MDt)
as
(
select Aid, MAX(Dt) from #t2 group by AiD
)
select #t1.Id, #t1.Name, cte.MDt
from #t1
join cte
on cte.AId = #t1.Id

SQL - 1 Parent Table, 2 Child Tables - return single row for each row in the child table

Table A
ParentID
Name
Table B
BKey
ParentID
DescB
Table C
CKey
ParentID
DescC
I need to return 1 row for each combined row of data in B/C that matches the parent id. If one of the child tables has more rows than the other, a row should still be returned, with NULL for the missing description.
For example, if the data was as following
Table A
1 FirstParent
2 SecondParent
Table B
1 1 BDesc1
2 1 BDesc2
3 2 P2BDesc1
Table C
1 1 CDesc1
2 2 P2CDesc1
3 2 P2CDesc2
If I retrieve based on FirstParent, the results should be:
1 FirstParent BDesc1 CDesc1
1 FirstParent BDesc2 NULL
If I retrieve based on SecondParent, the results should be:
2 SecondParent P2BDesc1 P2CDesc1
2 SecondParent NULL P2CDesc2
Is there any way of doing this without having to use unions?

declare @ParentID int
set @ParentID = 1
select a.name,
       bc.descb,
       bc.descc
from TableA as a
  cross join (select b.descb,
                     c.descc
              from (select *,
                           row_number() over(order by b.bkey) as rn
                    from TableB as b
                    where b.parentid = @ParentID) as b
                full outer join
                   (select *,
                           row_number() over(order by c.ckey) as rn
                    from TableC as c
                    where c.parentid = @ParentID) as c
                  on b.rn = c.rn) as bc
where a.parentid = @ParentID
Try here: https://data.stackexchange.com/stackoverflow/qt/112538/
Edit: A version using ExternalKey to query multiple ParentIDs
Suggested indexes:
create index IX_B_ParentID on TableB(ParentID) include (DescB)
create index IX_C_ParentID on TableC(ParentID) include (DescC)
I would create a table variable that holds the ParentIDs that match the ExternalKey and then use that instead of TableA in the query.
declare @ExternalKey int = 1
declare @T table(ParentID int primary key, Name varchar(20))
insert into @T (ParentID, Name)
select ParentID, Name
from TableA
where ExternalKey = @ExternalKey
select a.name,
       bc.descb,
       bc.descc
from @T as a
  inner join (select b.descb,
                     c.descc,
                     coalesce(b.ParentID, c.ParentID) as ParentID
              from (select b.ParentID,
                           b.DescB,
                           row_number() over(partition by b.ParentID order by b.bkey) as rn
                    from TableB as b
                    where b.parentid in (select ParentID from @T)) as b
                full outer join
                   (select c.ParentID,
                           c.DescC,
                           row_number() over(partition by c.ParentID order by c.ckey) as rn
                    from TableC as c
                    where c.parentid in (select ParentID from @T)) as c
                  on b.rn = c.rn and
                     b.ParentID = c.ParentID) as bc
    on a.ParentID = bc.ParentID

I truly hope this is an MSSQL question
declare @a table(
ParentID int,
Name varchar(15))
declare @b table(
BKey int,
ParentID int,
DescB varchar(10))
declare @c table(
CKey int,
ParentID int,
DescC varchar(10))
insert @a values (1,'FirstParent')
insert @a values (2,'SecondParent')
insert @b values(1, 1, 'BDesc1')
insert @b values(2, 1, 'BDesc2')
insert @b values(3, 2, 'P2BDesc1')
insert @c values(1, 1, 'CDesc1')
insert @c values(2, 2, 'P2CDesc1')
insert @c values(3, 2, 'P2CDesc2')
;with b as
(
select DescB, ParentID, row_number() over (partition by parentid order by DescB) rn from @b
),
c as
(
select DescC, ParentID, row_number() over (partition by parentid order by DescC) rn from @c
),
d as (
select DescB, DescC, coalesce(b.parentid, c.parentid) parentid from b
full outer join c
on c.parentid = b.parentid and c.rn = b.rn
)
select a.ParentID, a.Name, d.DescB, d.DescC from @a a
join d
on a.parentid = d.parentid
order by 1
Try here:
https://data.stackexchange.com/stackoverflow/q/112537/

You can implement it in 2 steps:
1) Calculate the number of records in each child table.
2) Join the 1st or the 2nd child table, depending on the counts from the 1st step.
select a.ParentId, a.Name, b.DescB, c.DescC
from (
    select ParentId, Name,
           (select count(*) from b where b.ParentId = a.ParentId) as cntB,
           (select count(*) from c where c.ParentId = a.ParentId) as cntC
    from a
) as a
left join b on a.cntB >= a.cntC and a.ParentId = b.ParentId
left join c on a.cntB < a.cntC and a.ParentId = c.ParentId

Join to only the "latest" record with t-sql

I've got two tables. Table "B" has a one to many relationship with Table "A", which means that there will be many records in table "B" for one record in table "A".
The records in table "B" are mainly differentiated by a date. I need to produce a result set that includes the record in table "A" joined with only the latest record in table "B". For illustration purposes, here's a sample schema:
Table A
-------
ID
Table B
-------
ID
TableAID
RowDate
I'm having trouble formulating the query to give me the result set I'm looking for. Any help would be greatly appreciated.

SELECT *
FROM tableA A
OUTER APPLY (SELECT TOP 1 *
FROM tableB B
WHERE A.ID = B.TableAID
ORDER BY B.RowDate DESC) as B

select a.*, bm.MaxRowDate
from (
select TableAID, max(RowDate) as MaxRowDate
from TableB
group by TableAID
) bm
inner join TableA a on bm.TableAID = a.ID
If you need more columns from TableB, do this:
select a.*, b.* --use explicit columns rather than * here
from (
select TableAID, max(RowDate) as MaxRowDate
from TableB
group by TableAID
) bm
inner join TableB b on bm.TableAID = b.TableAID
and bm.MaxRowDate = b.RowDate
inner join TableA a on bm.TableAID = a.ID

table B join is optional: it depends if there are other columns you want
SELECT
*
FROM
tableA A
JOIN
tableB B ON A.ID = B.TableAID
JOIN
(
SELECT Max(RowDate) AS MaxRowDate, TableAID
FROM tableB
GROUP BY TableAID
) foo ON B.TableAID = foo.TableAID AND B.RowDate= foo.MaxRowDate

With ABDateMap AS (
    SELECT Max(RowDate) AS LastDate, TableAID FROM TableB GROUP BY TableAID
),
LatestBRow AS (
    -- MAX(b.ID) breaks ties when several rows share the latest RowDate
    SELECT MAX(b.ID) AS ID, b.TableAID
    FROM ABDateMap d
    INNER JOIN TableB b ON b.TableAID = d.TableAID AND b.RowDate = d.LastDate
    GROUP BY b.TableAID
)
SELECT columns
FROM TableA a
INNER JOIN LatestBRow m ON m.TableAID = a.ID
INNER JOIN TableB b ON b.ID = m.ID

For clarity's sake, and for the benefit of those who stumble upon this old question: the accepted answer returns duplicate rows if Table B has duplicate RowDate values for the same TableAID. A safer and more efficient way is to use ROW_NUMBER():
Select a.*, b.* -- Use explicit column list rather than * here
From [Table A] a
Inner Join ( -- Use Left Join if the records missing from Table B are still required
Select *,
ROW_NUMBER() OVER (PARTITION BY TableAID ORDER BY RowDate DESC) As _RowNum
From [Table B]
) b
On b.TableAID = a.ID
Where b._RowNum = 1
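The difference on tied dates can be checked with a small sketch. It assumes SQLite 3.25 or later through Python's sqlite3 module, not SQL Server, and made-up sample rows. The extra ID DESC tie-breaker is my own addition so that the chosen row is deterministic; the query above orders by RowDate only.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (ID INTEGER PRIMARY KEY);
CREATE TABLE TableB (ID INTEGER PRIMARY KEY, TableAID INTEGER, RowDate TEXT);
INSERT INTO TableA (ID) VALUES (1), (2);
-- Rows 2 and 3 share the latest date for TableAID = 1 (made-up data).
INSERT INTO TableB (ID, TableAID, RowDate) VALUES
  (1, 1, '2020-01-01'),
  (2, 1, '2020-01-02'),
  (3, 1, '2020-01-02'),
  (4, 2, '2020-01-05');
""")

# MAX(RowDate) joined back to TableB: every row with the latest date matches.
max_join = conn.execute("""
SELECT a.ID, b.ID
FROM (SELECT TableAID, MAX(RowDate) AS MaxRowDate
      FROM TableB GROUP BY TableAID) bm
JOIN TableB b ON bm.TableAID = b.TableAID AND bm.MaxRowDate = b.RowDate
JOIN TableA a ON bm.TableAID = a.ID
ORDER BY a.ID, b.ID
""").fetchall()

# ROW_NUMBER(): exactly one row per TableAID, ties broken by ID DESC.
row_num = conn.execute("""
SELECT a.ID, b.ID
FROM TableA a
JOIN (SELECT ID, TableAID,
             ROW_NUMBER() OVER (PARTITION BY TableAID
                                ORDER BY RowDate DESC, ID DESC) AS rn
      FROM TableB) b ON b.TableAID = a.ID
WHERE b.rn = 1
ORDER BY a.ID
""").fetchall()

print("max join  :", max_join)
print("row_number:", row_num)
```

With these rows the MAX-based join returns two rows for TableAID 1, while the ROW_NUMBER() version returns one row per parent.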

Try using this:
BEGIN
DECLARE @TB1 AS TABLE (ID INT, NAME VARCHAR(30) )
DECLARE @TB2 AS TABLE (ID INT, ID_TB1 INT, PRICE DECIMAL(18,2))
INSERT INTO @TB1 (ID, NAME) VALUES (1, 'PRODUCT X')
INSERT INTO @TB1 (ID, NAME) VALUES (2, 'PRODUCT Y')
INSERT INTO @TB2 (ID, ID_TB1, PRICE) VALUES (1, 1, 3.99)
INSERT INTO @TB2 (ID, ID_TB1, PRICE) VALUES (2, 1, 4.99)
INSERT INTO @TB2 (ID, ID_TB1, PRICE) VALUES (3, 1, 5.99)
INSERT INTO @TB2 (ID, ID_TB1, PRICE) VALUES (1, 2, 0.99)
INSERT INTO @TB2 (ID, ID_TB1, PRICE) VALUES (2, 2, 1.99)
INSERT INTO @TB2 (ID, ID_TB1, PRICE) VALUES (3, 2, 2.99)
SELECT A.ID, A.NAME, B.PRICE
FROM @TB1 A
INNER JOIN @TB2 B ON A.ID = B.ID_TB1 AND B.ID = (SELECT MAX(ID) FROM @TB2 WHERE ID_TB1 = A.ID)
END

This will fetch the latest record with a JOIN. I think this will help someone:
SELECT cmp.*, lr_entry.lr_no FROM
(SELECT * FROM lr_entry ORDER BY id DESC LIMIT 1)
lr_entry JOIN companies as cmp ON cmp.id = lr_entry.company_id

SQL to resequence items by groups

Let's say I have a database that looks like this:
tblA:
ID, Name, Sequence, tblBID
1 a 5 14
2 b 3 15
3 c 3 16
4 d 3 17
tblB:
ID, Group
14 1
15 1
16 2
17 3
I would like to sequence A so that the sequences go 1...n for each group of B.
So in this case, the sequences going down should be 1,2,1,1.
The ordering needs to be consistent with the current ordering, but there are no guarantees as to the current ordering.
I am not exactly a sql master and I am sure there is a fairly easy way to do this, but I really don't know the right route to take. Any hints?

If you are using SQL Server 2005 or later, you can use a ranking function:
Select tblA.Id, tblA.Name
, Row_Number() Over ( Partition By tblB.[Group] Order By tblA.Id ) As Sequence
, tblA.tblBID
From tblA
Join tblB
On tblB.ID = tblA.tblBID
See the documentation for the Row_Number ranking function.
Here's another solution that would work in SQL Server 2000 and prior.
Select A.Id, A.Name
, (Select Count(*)
From tblB As B1
Where B1.[Group] = B.[Group]
And B1.Id < B.ID) + 1 As Sequence
, A.tblBID
From tblA As A
Join tblB As B
On B.Id = A.tblBID
EDIT
Also want to make it clear that I want to actually update tblA to reflect the proper sequences.
In SQL Server, you can use their proprietary From clause in an Update statement like so:
Update tblA
Set Sequence = (
Select Count(*)
From tblB As B1
Where B1.[Group] = B.[Group]
And B1.Id < B.ID
) + 1
From tblA As A
Join tblB As B
On B.Id = A.tblBID
The Hoyle ANSI solution might be something like:
Update tblA
Set Sequence = (
Select (Select Count(*)
From tblB As B1
Where B1.[Group] = B.[Group]
And B1.Id < B.ID) + 1
From tblA As A
Join tblB As B
On B.Id = A.tblBID
Where A.Id = tblA.Id
)
EDIT
Can we do that [the inner group] comparison based on A.Sequence instead of B.ID?
Select A1.*
, (Select Count(*)
From tblB As B2
Join tblA As A2
On A2.tblBID = B2.Id
Where B2.[Group] = B1.[Group]
And A2.Sequence < A1.Sequence) + 1
From tblA As A1
Join tblB As B1
On B1.Id = A1.tblBID

Because it's SQL 2000, we can't use a windowing function. That's okay.
Thomas's queries are good and will work. However, they will get worse and worse as the number of rows increases—with different characteristics depending on how wide (the number of groups) and how deep (the number of items per group). This is because those queries use a partial cross-join, perhaps we could call it a "pyramidal cross-join" where the crossing part is limited to right side values less than left side values rather than left crossing to all right values.
What to do?
I think you will be surprised to find that the following long and painful-looking script will outperform the pyramidal join at a certain size of data (which may not be all that big) and eventually, with really large data sets must be considered a screaming performer:
CREATE TABLE #tblA (
ID int identity(1,1) NOT NULL,
Name varchar(1) NOT NULL,
Sequence int NOT NULL,
tblBID int NOT NULL,
PRIMARY KEY CLUSTERED (ID)
)
INSERT #tblA VALUES ('a', 5, 14)
INSERT #tblA VALUES ('b', 3, 15)
INSERT #tblA VALUES ('c', 3, 16)
INSERT #tblA VALUES ('d', 3, 17)
CREATE TABLE #tblB (
ID int NOT NULL PRIMARY KEY CLUSTERED,
GroupID int NOT NULL
)
INSERT #tblB VALUES (14, 1)
INSERT #tblB VALUES (15, 1)
INSERT #tblB VALUES (16, 2)
INSERT #tblB VALUES (17, 3)
CREATE TABLE #seq (
seq int identity(1,1) NOT NULL,
ID int NOT NULL,
GroupID int NOT NULL,
PRIMARY KEY CLUSTERED (ID)
)
INSERT #seq
SELECT
A.ID,
B.GroupID
FROM
#tblA A
INNER JOIN #tblB B ON A.tblBID = b.ID
ORDER BY B.GroupID, A.Sequence
UPDATE A
SET A.Sequence = S.seq - X.MinSeq + 1
FROM
#tblA A
INNER JOIN #seq S ON A.ID = S.ID
INNER JOIN (
SELECT GroupID, MinSeq = Min(seq)
FROM #seq
GROUP BY GroupID
) X ON S.GroupID = X.GroupID
SELECT * FROM #tblA
DROP TABLE #seq
DROP TABLE #tblB
DROP TABLE #tblA
If I understood you correctly, then ORDER BY B.GroupID, A.Sequence is correct. If not, you can switch A.Sequence to B.ID.
Also, my index on the temp table should be experimented with. For a certain quantity of rows, and also the width and depth characteristics of those rows, clustering on one of the other two columns in the #seq table could be helpful.
Last, a different data organization is possible: leaving GroupID out of the #seq table and joining again. I suspect it would be worse, but am not 100% sure.

Something like:
SELECT a.id, a.name, row_number() over (partition by b.[group] order by a.id) as sequence
FROM tblA a
JOIN tblB b on a.tblBID = b.ID;
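Since the question asks to actually update tblA, here is a rough sketch of persisting the row_number() result. It assumes SQLite 3.25 or later through Python's sqlite3 module, not SQL Server, and renames the Group column to GroupID to avoid the reserved word. The CTE name ranked is hypothetical. Ordering by a.ID reproduces the 1,2,1,1 sequence from the question.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblA (ID INTEGER PRIMARY KEY, Name TEXT, Sequence INTEGER, tblBID INTEGER);
CREATE TABLE tblB (ID INTEGER PRIMARY KEY, GroupID INTEGER);
INSERT INTO tblA VALUES (1, 'a', 5, 14), (2, 'b', 3, 15), (3, 'c', 3, 16), (4, 'd', 3, 17);
INSERT INTO tblB VALUES (14, 1), (15, 1), (16, 2), (17, 3);
""")

# Number the rows 1..n inside each group, then write the numbers back.
# The ranking depends only on ID and GroupID, which the update never changes.
conn.execute("""
WITH ranked AS (
    SELECT a.ID,
           ROW_NUMBER() OVER (PARTITION BY b.GroupID ORDER BY a.ID) AS rn
    FROM tblA a
    JOIN tblB b ON b.ID = a.tblBID
)
UPDATE tblA
SET Sequence = (SELECT rn FROM ranked WHERE ranked.ID = tblA.ID)
""")
conn.commit()

sequences = [row[0] for row in conn.execute("SELECT Sequence FROM tblA ORDER BY ID")]
print(sequences)
```

To follow the existing Sequence order instead, the ORDER BY inside the window would change to a.Sequence, a.ID. In that case it is safer to compute the ranking into a separate table first, because the column being ranked is also the one being updated.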
