SQL joins with multiple records into one with a default - sql

My 'people' table has one row per person, and that person has a division (not unique) and a company (not unique).
I need to join people to p_features, c_features, d_features on:
people.person=p_features.num_value
people.division=d_features.num_value
people.company=c_features.num_value
... in a way that if there is a record match in p_features/d_features/c_features only, it would be returned, but if it was in 2 or 3 of the tables, the most specific record would be returned.
From my test data below, for example, query for person=1 would return
'FALSE'
person 3 returns maybe, person 4 returns true, and person 9 returns default
The biggest issue is that there are 100 features and I have queries that need to return all of them in one row. My previous attempt was a function which queried on feature,num_value in each table and did a foreach, but 100 features * 4 tables meant 400 reads and it brought the database to a halt it was so slow when I loaded up a few million rows of data.
create table p_features (
num_value int8,
feature varchar(20),
feature_value varchar(128)
);
create table c_features (
num_value int8,
feature varchar(20),
feature_value varchar(128)
);
create table d_features (
num_value int8,
feature varchar(20),
feature_value varchar(128)
);
create table default_features (
feature varchar(20),
feature_value varchar(128)
);
create table people (
person int8 not null,
division int8 not null,
company int8 not null
);
insert into people values (4,5,6);
insert into people values (3,5,6);
insert into people values (1,2,6);
insert into p_features values (4,'WEARING PANTS','TRUE');
insert into c_features values (6,'WEARING PANTS','FALSE');
insert into d_features values (5,'WEARING PANTS','MAYBE');
insert into default_features values('WEARING PANTS','DEFAULT');

You need to transpose the features into rows with a ranking. Here I used a common-table expression. If your database product does not support them, you can use temporary tables to achieve the same effect.
;With RankedFeatures As
(
Select 1 As FeatureRank, P.person, PF.feature, PF.feature_value
From people As P
Join p_features As PF
On PF.num_value = P.person
Union All
Select 2, P.person, PF.feature, PF.feature_value
From people As P
Join d_features As PF
On PF.num_value = P.division
Union All
Select 3, P.person, PF.feature, PF.feature_value
From people As P
Join c_features As PF
On PF.num_value = P.company
Union All
Select 4, P.person, DF.feature, DF.feature_value
From people As P
Cross Join default_features As DF
)
, HighestRankedFeature As
(
Select Min(FeatureRank) As FeatureRank, person
From RankedFeatures
Group By person
)
Select RF.person, RF.FeatureRank, RF.feature, RF.feature_value
From people As P
Join HighestRankedFeature As HRF
On HRF.person = P.person
Join RankedFeatures As RF
On RF.FeatureRank = HRF.FeatureRank
And RF.person = P.person
Order By P.person

I don't know if I had understood very well your question, but to use JOIN, you need your table loaded already and then use the SELECT statement with INNER JOIN, LEFT JOIN or whatever you need to show.
If you post some more information, maybe turn it easy to understand.

There are some aspects of your schema I'm not understanding, like how to relate to the default_features table if there's no match in any of the specific tables. The only possible join condition is on feature, but if there's no match in the other 3 tables, there's no value to join on. So, in my example, I've hard-coded the DEFAULT since I can't think of how else to get it.
Hopefully this can get you started and if you can clarify the model a bit more, the solution can be refined.
select p.person, coalesce(pf.feature_value, df.feature_value, cf.feature_value, 'DEFAULT')
from people p
left join p_features pf
on p.person = pf.num_value
left join d_features df
on p.division = df.num_value
left join c_features cf
on p.company = cf.num_value

Related

Efficient/optimized query for my query using multiple UNIONS with JOIN

Can someone please have a look into query and suggest any improvement or optimized query for the same so that query runs faster .
So basically, I have 2 table Survey and SurveyInvite.
Sample data for Table Survey
CREATE TABLE dbo.Survey
(
createdate date,
emailinvite char(4),
phoneinvite char(4),
smsinvite char(4),
surveyid int
);
INSERT dbo.Survey VALUES
('20220201','12ab','12bc', null ,1),
('20220210','23be','45hg','45tr',2),
('20220220','65hg', null ,'89kj',3);
Sample data for Table SurveyInvite
CREATE TABLE dbo.SurveyInvite
(
sentdate date,
id char(4)
);
INSERT dbo.SurveyInvite VALUES
('20220201','12ab'),
('20220205','12bc'),
('20220210','23be'),
('20220214','45hg'),
('20220218','45tr'),
('20220220','65hg'),
('20220224','89kj');
The output should be
Type
sentdate
inviteid
surveyid
Email
2022-02-01
12ab
1
Email
2022-02-10
23be
2
Email
2022-02-20
65hg
3
Phone
2022-02-05
12bc
1
Phone
2022-02-14
45hg
2
SMS
2022-02-18
45tr
2
SMS
2022-02-24
89kj
3
So basically, I have to get sentdate from SurveyInvite table against each type(email,phone,sms).
Survey table should be unpivoted on email,phone and sms to transform column into rows.
Here's my query
SELECT 'Email' as Type,esi.sentdate,emailinvite as inviteid,s.surveyid
FROM Survey s
INNER JOIN SurveyInvite esi on s.emailinvite=esi.id
UNION
SELECT 'SMS' as Type,ssi.sentdate,smsinvite as inviteid,s.surveyid
FROM Survey s
INNER JOIN SurveyInvite ssi on s.smsinvite=ssi.id
UNION
SELECT 'Phone' as Type,psi.sentdate,phoneinvite as inviteid,s.surveyid
FROM Survey s
INNER JOIN SurveyInvite psi on s.phoneinvite=psi.id
Please suggest other way to write query if that makes query faster. I am still trying using UNPIVOT,left join,CTE to avoid using UNION.
Sample setup here
You don't need to query the tables three times, you can just unpivot. The easiest way to do this is with a CROSS APPLY (VALUES
SELECT
v.Type,
ssi.sentdate,
v.inviteid,
s.surveyid
FROM Survey s
CROSS APPLY (VALUES
('Email', s.emailinvite),
('Phone', s.phoneinvite),
('SMS', s.smsinvite)
) v (Type, inviteid)
INNER JOIN SurveyInvite ssi on v.inviteid = ssi.id;
I suggest you consider normalizing your database in the first place by storing the data unpivoted in a separate table.
Another way (again the key is to only read either table once instead of three times):
SELECT i.sentdate,
[Type] = REPLACE(u.Types, 'invite', ''),
inviteid = u.id,
u.surveyid
FROM dbo.Survey AS s
UNPIVOT (Id FOR Types IN
(emailinvite, phoneinvite, smsinvite)) AS u
INNER JOIN dbo.SurveyInvite AS i ON u.Id = i.id;
As you can see from the db<>fiddle, this eliminates 4 of the 6 table scans and also an expensive distinct sort.
I assume that you have set the primary and foreign keys correctly. It might also beneficial to have indexes on the foreign keys. See: Should every SQL Server foreign key have a matching index?.
As always with these performance questions. Only benchmarking different variants can tell you which one is the fastest. The same query can perform very differently with a different set of data.
One possibility is to use joins and base the query on SurveyInvite:
SELECT
I.sentdate,
CASE WHEN SE.id IS NOT NULL THEN 'Email'
WHEN SP.id IS NOT NULL THEN 'Phone'
ELSE 'SMS'
END AS Type,
I.id AS inviteid,
CASE WHEN SE.id IS NOT NULL THEN SE.surveyid
WHEN SP.id IS NOT NULL THEN SP.surveyid
ELSE SS.surveyid
END AS surveyid
FROM
SurveyInvite I
LEFT JOIN Survey SE
ON I.emailinvite = SE.id
LEFT JOIN Survey SP
ON I.phoneinvite = SP.id
LEFT JOIN Survey SS
ON I.smsinvite = SS.id

What is the best way to join tables

this is more like a general question.
I am looking for the best way to join 4, maybe 5 different tables. I am trying to create a Power Bi pulling live information from an IBM AS400 where customer service can type one of our parts number,
see how many parts we have in inventory, if none, see the lead time and if there are any orders already already entered for the typed part number.
SERI is our inventory table with 37180 records.
(active inventory that is available)
METHDM is our kit table with 37459 records.
(this table contains the bill of materials for custom kits, KIT A123 contains different part numbers in it witch are in SERI as well.)
STKA is our part lead time table with 76796 records.
(lead time means how long will it take for parts to come in)
OCRI is our sales order table with 6497 records.
(This table contains all customer orders)
I have some knowledge in writing queries but this one is more challenging of what I have created in the past. Should I start with the table that has the most records and start left joining the rest ?
From STKA 76796 records
Left join METHDM 37459 records on STKA
left join SERI 37180 records on STKA
left join OCRI 6497 records on STAK
Select
STKA.v6part as part,
STKA.v6plnt as plant,
STKA.v6tdys as pur_leadtime,
STKA.v6prpt as Pur_PrepLeadtime,
STKA.v6lead as Mfg_leadtime,
STKA.v6prpt as Mfg_PrepLeadTime,
METHDM.AQMTLP AS COMPONENT,
METHDM.AQQPPC AS QTYNEEDED,
SERI.HTLOTN AS BATCH,
SERI.HTUNIT AS UOM,
(HTQTY - HTQTYC) as ONHAND,
OCRI.DDORD# AS SALESORDER,
OCRI.DDRDAT AS PROMISED
from stka
left join METHDM on STKA.V6PART = METHDM.AQPART
left join SERI on STKA.V6PART = SERI.HTPART
left join OCRI on STKA.V6PART = OCRI.DDPART
Is this the best way to join the tables?
I think you already have your answer, but conceptually, there are a few issues here to deal with, and I figured I would give you a few examples, using data a little bit like yours, but massively simplified.
CREATE TABLE #STKA (V6PART INT, OTHER_DATA VARCHAR(50));
CREATE TABLE #METHDM (AQPART INT, KIT_ID INT, SOME_DATE DATETIME, OTHER_DATA VARCHAR(50));
CREATE TABLE #SERI (HTPART INT, OTHER_DATA VARCHAR(50));
CREATE TABLE #OCRI (DDPART INT, OTHER_DATA VARCHAR(50));
INSERT INTO #STKA SELECT 1, NULL UNION ALL SELECT 2, NULL UNION ALL SELECT 3, NULL; --1, 2, 3 Ids
INSERT INTO #METHDM SELECT 1, 1, '20200108 10:00', NULL UNION ALL SELECT 1, 2, '20200108 11:00', NULL UNION ALL SELECT 2, 1, '20200108 13:00', NULL; --1 Id appears twice, 2 Id once, no 3 Id
INSERT INTO #SERI SELECT 1, NULL UNION ALL SELECT 3, NULL; --1 and 3 Ids
INSERT INTO #OCRI SELECT 1, NULL UNION ALL SELECT 4, NULL; --1 and 4 Ids
So fundamentally we have a few issues here:
o the first problem is that the IDs in the tables differ, one table has an ID #4 but this isn't in any of the others;
o the second issue is that we have multiple rows for the same ID in one table;
o the third issue is that some tables are "missing" IDs that are in other tables, which you already covered by using LEFT JOINs, so I will ignore this.
--This will select ID 1 twice, 2 once, 3 once, and miss 4 completely
SELECT
*
FROM
#STKA
LEFT JOIN #METHDM ON #METHDM.AQPART = #STKA.V6PART
LEFT JOIN #SERI ON #SERI.HTPART = #STKA.V6PART
LEFT JOIN #OCRI ON #OCRI.DDPART = #STKA.V6PART;
So the problem here is that we don't have every ID in our "anchor" table STKA, and in fact there's no single table that has every ID in it. Now your data might be fine here, but if it isn't then you can simply add a step to find every ID, and use this as the anchor.
--This will select each ID, but still doubles up on ID 1
WITH Ids AS (
SELECT V6PART AS ID FROM #STKA
UNION
SELECT AQPART AS ID FROM #METHDM
UNION
SELECT HTPART AS ID FROM #SERI
UNION
SELECT DDPART AS ID FROM #OCRI)
SELECT
*
FROM
Ids I
LEFT JOIN #STKA ON #STKA.V6PART = I.Id
LEFT JOIN #METHDM ON #METHDM.AQPART = I.Id
LEFT JOIN #SERI ON #SERI.HTPART = I.Id
LEFT JOIN #OCRI ON #OCRI.DDPART = I.Id;
That's using a common-table expression, but a subquery would also do the job. However, this still leaves us with an issue where ID 1 appears twice in the list, because it has multiple rows in one of the sub-tables.
One way to fix this is to pick the row with the latest date, or any other ORDER you can apply to the data:
--Pick the best row for the table where it has multiple rows, now we get one row per ID
WITH Ids AS (
SELECT V6PART AS ID FROM #STKA
UNION
SELECT AQPART AS ID FROM #METHDM
UNION
SELECT HTPART AS ID FROM #SERI
UNION
SELECT DDPART AS ID FROM #OCRI),
BestMETHDM AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY AQPART ORDER BY SOME_DATE DESC) AS ORDER_ID
FROM
#METHDM)
SELECT
*
FROM
Ids I
LEFT JOIN #STKA ON #STKA.V6PART = I.Id
LEFT JOIN BestMETHDM ON BestMETHDM.AQPART = I.Id AND BestMETHDM.ORDER_ID = 1
LEFT JOIN #SERI ON #SERI.HTPART = I.Id
LEFT JOIN #OCRI ON #OCRI.DDPART = I.Id;
Of course you could also add some aggregation (SUM, MAX, MIN, AVG, etc.) to fix this problem (if it is indeed an issue). Also, I used a common-table expression, but this would work just as well with a subquery.
Expanding on a comment made on the question..
I would say I will start with SERI as that table contains the entire inventory for our facility and should cover the other tables
However the question said
SERI is our inventory table with 37180 records. (active inventory that is available)
In my experience, active inventory, isn't the same as all parts.
Normally, in a query like this, I'd expect the first table to be a Parts Master table of some sort that contains every possible part ID.

Matching similar entities based on many to many relationship

I have two entities in my database that are connected with a many to many relationship. I was wondering what would be the best way to list which entities have the most similarities based on it?
I tried doing a count(*) with intersect, but the query takes too long to run on every entry in my database (there are about 20k records). When running the query I wrote, CPU usage jumps to 100% and the database has locking issues.
Here is some code showing what I've tried:
My tables look something along these lines:
/* 20k records */
create table Movie(
Id INT PRIMARY KEY,
Title varchar(255)
);
/* 200-300 records */
create table Tags(
Id INT PRIMARY KEY,
Desc varchar(255)
);
/* 200,000-300,000 records */
create table TagMovies(
Movie_Id INT,
Tag_Id INT,
PRIMARY KEY (Movie_Id, Tag_Id),
FOREIGN KEY (Movie_Id) REFERENCES Movie(Id),
FOREIGN KEY (Tag_Id) REFERENCES Tags(Id),
);
(This works, but it is terribly slow)
This is the query that I wrote to try and list them:
Usually I also filter with top 1 & add a where clause to get a specific set of related data.
SELECT
bk.Id,
rh.Id
FROM
Movies bk
CROSS APPLY (
SELECT TOP 15
b.Id,
/* Tags Score */
(
SELECT COUNT(*) FROM (
SELECT x.Tag_Id FROM TagMovies x WHERE x.Movie_Id = bk.Id
INTERSECT
SELECT x.Tag_Id FROM TagMovies x WHERE x.Movie_Id = b.Id
) Q1
)
as Amount
FROM
Movies b
WHERE
b.Id <> bk.Id
ORDER BY Amount DESC
) rh
Explanation:
Movies have tags and the user can get try to find movies similar to the one that they selected based on other movies that have similar tags.
Hmm ... just an idea, but maybe I didnt understand ...
This query should return best matched movies by tags for a given movie ID:
SELECT m.id, m.title, GROUP_CONCAT(DISTINCT t.Descr SEPARATOR ', ') as tags, count(*) as matches
FROM stack.Movie m
LEFT JOIN stack.TagMovies tm ON m.Id = tm.Movie_Id
LEFT JOIN stack.Tags t ON tm.Tag_Id = t.Id
WHERE m.id != 1
AND tm.Tag_Id IN (SELECT Tag_Id FROM stack.TagMovies tm WHERE tm.Movie_Id = 1)
GROUP BY m.id
ORDER BY matches DESC
LIMIT 15;
EDIT:
I just realized that it's for M$ SQL ... but maybe something similar can be done...
You should probably decide on a naming convention and stick with it. Are tables singular or plural nouns? I don't want to get into that debate, but pick one or the other.
Without access to your database I don't know how this will perform. It's just off the top of my head. You could also limit this by the M.id value to find the best matches for a single movie, which I think would improve performance by quite a bit.
Also, TOP x should let you get the x closest matches.
SELECT
M.id,
M.title,
SM.id AS similar_movie_id,
SM.title AS similar_movie_title,
COUNT(*) AS matched_tags
FROM
Movie M
INNER JOIN TagsMovie TM1 ON TM1.movie_id = M.movie_id
INNER JOIN TagsMovie TM2 ON
TM2.tag_id = TM1.tag_id AND
TM2.movie_id <> TM1.movie_id
INNER JOIN Movie SM ON SM.movie_id = TM2.movie_id
GROUP BY
M.id,
M.title,
SM.id AS similar_movie_id,
SM.title AS similar_movie_title
ORDER BY
COUNT(*) DESC

Join on resultant table of another join without using subquery,CTE or temp tables

My question is can we join a table A to resultant table of inner join of table A and B without using subquery, CTE or temp tables ?
I am using SQL Server.
I will explain the situation with an example
The are two tables GoaLScorers and GoalScoredDetails.
GoaLScorers
gid Name
-----------
1 A
2 B
3 A
GoalScoredDetails
DetailId gid stadium goals Cards
---------------------------------------------
1 1 X 2 1
2 2 Y 5 2
3 3 Y 2 1
The result I am expecting is if I select a stadium 'X' (or 'Y')
I should get name of all who may or may not have scored there, also aggregate total number of goals,total cards.
Null value is acceptable for names if no goals or no cards.
I can get the result I am expecting with the below query
SELECT
gs.name,
SUM(goal) as TotalGoals,
SUM(cards) as TotalCards
FROM
(SELECT
gid, stadium, goal, cards
FROM
GoalScoredDetails
WHERE
stadium = 'Y') AS vtable
RIGHT OUTER JOIN
GoalScorers AS gs ON vtable.gid = gs.gid
GROUP BY
gs.name
My question is can we get the above result without using a subquery or CTE or temp table ?
Basically what we need to do is OUTER JOIN GoalScorers to resultant virtual table of INNER JOIN OF GoalScorers and GoalScoredDetails.
But I am always faced with ambiguous column name error as "gid" column is present in GoalScorers and also in resultant table. Error persists even if I try to use alias for column names.
I have created a sql fiddle for this her: http://sqlfiddle.com/#!3/40162/8
SELECT gs.name, SUM(gsd.goal) AS totalGoals, SUM(gsd.cards) AS totalCards
FROM GoalScorers gs
LEFT JOIN GoalScoredDetails gsd ON gsd.gid = gs.gid AND
gsd.Stadium = 'Y'
GROUP BY gs.name;
IOW, you could push your where criteria onto joining expression.
The error Ambiguous column name 'ColumnName' occurs when SQL Server encounters two or more columns with the same and it hasn't been told which to use. You can avoid the error by prefixing your column names with either the full table name, or an alias if provided. For the examples below use the following data:
Sample Data
DECLARE #GoalScorers TABLE
(
gid INT,
Name VARCHAR(1)
)
;
DECLARE #GoalScoredDetails TABLE
(
DetailId INT,
gid INT,
stadium VARCHAR(1),
goals INT,
Cards INT
)
;
INSERT INTO #GoalScorers
(
gid,
Name
)
VALUES
(1, 'A'),
(2, 'B'),
(3, 'A')
;
INSERT INTO #GoalScoredDetails
(
DetailId,
gid,
stadium,
goals,
Cards
)
VALUES
(1, 1, 'x', 2, 1),
(2, 2, 'y', 5, 2),
(3, 3, 'y', 2, 1)
;
In this first example we recieve the error. Why? Because there is more than one column called gid it cannot tell which to use.
Failed Example
SELECT
gid
FROM
#GoalScoredDetails AS gsd
RIGHT OUTER JOIN #GoalScorers as gs ON gs.gid = gsd.gid
;
This example works because we explicitly tell SQL which gid to return:
Working Example
SELECT
gs.gid
FROM
#GoalScoredDetails AS gsd
RIGHT OUTER JOIN #GoalScorers as gs ON gs.gid = gsd.gid
;
You can, of course, return both:
Example
SELECT
gs.gid,
gsd.gid
FROM
#GoalScoredDetails AS gsd
RIGHT OUTER JOIN #GoalScorers as gs ON gs.gid = gsd.gid
;
In multi table queries I would always recommend prefixing every column name with a table/alias name. This makes the query easier to follow, and reduces the likelihood of this sort of error.

SELECT Statement in CASE

Please don't downgrade this as it is bit complex for me to explain. I'm working on data migration so some of the structures look weird because it was designed by someone like that.
For ex, I have a table Person with PersonID and PersonName as columns. I have duplicates in the table.
I have Details table where I have PersonName stored in a column. This PersonName may or may not exist in the Person table. I need to retrieve PersonID from the matching records otherwise put some hardcode value in PersonID.
I can't write below query because PersonName is duplicated in Person Table, this join doubles the rows if there is a matching record due to join.
SELECT d.Fields, PersonID
FROM Details d
JOIN Person p ON d.PersonName = p.PersonName
The below query works but I don't know how to replace "NULL" with some value I want in place of NULL
SELECT d.Fields, (SELECT TOP 1 PersonID FROM Person where PersonName = d.PersonName )
FROM Details d
So, there are some PersonNames in the Details table which are not existent in Person table. How do I write CASE WHEN in this case?
I tried below but it didn't work
SELECT d.Fields,
CASE WHEN (SELECT TOP 1 PersonID
FROM Person
WHERE PersonName = d.PersonName) = null
THEN 123
ELSE (SELECT TOP 1 PersonID
FROM Person
WHERE PersonName = d.PersonName) END Name
FROM Details d
This query is still showing the same output as 2nd query. Please advise me on this. Let me know, if I'm unclear anywhere. Thanks
well.. I figured I can put ISNULL on top of SELECT to make it work.
SELECT d.Fields,
ISNULL(SELECT TOP 1 p.PersonID
FROM Person p where p.PersonName = d.PersonName, 124) id
FROM Details d
A simple left outer join to pull back all persons with an optional match on the details table should work with a case statement to get your desired result.
SELECT
*
FROM
(
SELECT
Instance=ROW_NUMBER() OVER (PARTITION BY PersonName),
PersonID=CASE WHEN d.PersonName IS NULL THEN 'XXXX' ELSE p.PersonID END,
d.Fields
FROM
Person p
LEFT OUTER JOIN Details d on d.PersonName=p.PersonName
)AS X
WHERE
Instance=1
Ooh goody, a chance to use two LEFT JOINs. The first will list the IDs where they exist, and insert a default otherwise; the second will eliminate the duplicates.
SELECT d.Fields, ISNULL(p1.PersonID, 123)
FROM Details d
LEFT JOIN Person p1 ON d.PersonName = p1.PersonName
LEFT JOIN Person p2 ON p2.PersonName = p1.PersonName
AND p2.PersonID < p1.PersonID
WHERE p2.PersonID IS NULL
You could use common table expressions to build up the missing datasets, i.e. your complete Person table, then join that to your Detail table as follows;
declare #n int;
-- set your default PersonID here;
set #n = 123;
-- Make sure previous SQL statement is terminated with semilcolon for with clause to parse successfully.
-- First build our unique list of names from table Detail.
with cteUniqueDetailPerson
(
[PersonName]
)
as
(
select distinct [PersonName]
from [Details]
)
-- Second get unique Person entries and record the most recent PersonID value as the active Person.
, cteUniquePersonPerson
(
[PersonID]
, [PersonName]
)
as
(
select
max([PersonID]) -- if you wanted the original Person record instead of the last, change this to min.
, [PersonName]
from [Person]
group by [PersonName]
)
-- Third join unique datasets to get the PersonID when there is a match, otherwise use our default id #n.
-- NB, this would also include records when a Person exists with no Detail rows (they are filtered out with the final inner join)
, cteSudoPerson
(
[PersonID]
, [PersonName]
)
as
(
select
coalesce(upp.[PersonID],#n) as [PersonID]
coalesce(upp.[PersonName],udp.[PersonName]) as [PersonName]
from cteUniquePersonPerson upp
full outer join cteUniqueDetailPerson udp
on udp.[PersonName] = p.[PersonName]
)
-- Fourth, join detail to the sudo person table that includes either the original ID or our default ID.
select
d.[Fields]
, sp.[PersonID]
from [Details] d
inner join cteSudoPerson sp
on sp.[PersonName] = d.[PersonName];