Excluding results that appear in another column of a CONNECT BY query - sql

Have a heavy query (takes 15 minutes to run), but it's returning more results than I need. It's a CONNECT BY query, and I'm getting nodes that are descendants in the root node results. I.E.:
Ted
Bob
John
Bob
John
John
Normally, the way to resolve this is using a START WITH condition, typically requiring the parent of a node to be null. But due to the nature of the query, I don't have the START WITH values I need to compare to until I have the full resultset. I'm basically trying to double-query my results to say QUERY STUFF START WITH RECORDS THAT AREN'T IN THAT STUFF.
Here's the query (built with the help of Nicholas Krasnov, here: Oracle Self-Join on multiple possible column matches - CONNECT BY?):
select cudroot.root_user, cudroot.node_level, cudroot.user_id, cudroot.new_user_id,
cudbase.* -- Not really, just simplifying
from css.user_desc cudbase
join (select connect_by_root(user_id) root_user,
user_id user_id,
new_user_id new_user_id,
level node_level
from (select cudordered.user_id,
coalesce(cudordered.new_user_id, cudordered.nextUser) new_user_id
from (select cud.user_id,
cud.new_user_id,
decode(cud.global_hr_id, null, null, lead(cud.user_id ignore nulls) over (partition by cud.global_hr_id order by cud.user_id)) nextUser
from css.user_desc cud
left join gsu.stg_userdata gstgu
on (gstgu.user_id = cud.user_id
or (gstgu.sap_asoc_global_id = cud.global_hr_id))
where upper(cud.user_type_code) in ('EMPLOYEE','CONTRACTOR','DIV_EMPLOYEE','DIV_CONTRACTOR','DIV_MYTEAPPROVED')) cudordered)
connect by nocycle user_id = prior new_user_id) cudroot
on cudbase.user_id = cudroot.user_id
order by
cudroot.root_user, cudroot.node_level, cudroot.user_id;
This gives me results about related users (based off of user_id renames or associated SAP IDs) that look like this:
ROOT_ID LEVEL USER_ID NEW_USER_ID
------------------------------------------------
A5093522 1 A5093522 FG096489
A5093522 2 FG096489 A5093665
A5093522 3 A5093665
FG096489 1 FG096489 A5093665
FG096489 2 A5093665
A5093665 1 A5093665
What I need is a way to filter the first join (select connect_by_root(user_id)... to exclude FG096489 and A5093665 from the root list.
The best START WITH I can think of would look like this (not tested yet):
start with user_id not in (select new_user_id
from (select coalesce(cudordered.new_user_id, cudordered.nextUser) new_user_id
from (select cud.new_user_id,
decode(cud.global_hr_id, null, null, lead(cud.user_id ignore nulls) over (partition by cud.global_hr_id order by cud.user_id)) nextUser
from css.user_desc cud
where upper(cud.user_type_code) in ('EMPLOYEE','CONTRACTOR','DIV_EMPLOYEE','DIV_CONTRACTOR','DIV_MYTEAPPROVED')) cudordered)
connect by nocycle user_id = prior new_user_id)
... but I'm effectively executing my 15 minute query twice.
I've looked at using partitions in the query, but there's not really a partition... I want to look at the full resultset of new_user_ids. Have also explored analytical functions such as rank()... my bag of tricks is empty.
Any ideas?
Clarification
The reason I don't want the extra records in the root list is because I only want one group of results for each user. I.E., if Bob Smith has had four accounts during his career here (people come and go frequently, as employees and/or contractors), I want to work with a set of accounts that all belong(ed) to Bob Smith.
If Bob came here as a contractor, converted to an employee, left, came back as a contractor in another country, and left/returned to a legal org that is now in our SAP system, his account rename/chain might look like:
Bob Smith CONTRACTOR ---- US0T0001 -> US001101 (given a new ID as an employee)
Bob Smith EMPLOYEE ---- US001101 -> EB0T0001 (contractor ID for the UK)
Bob Smith CONTRACTOR SAP001 EB0T0001 (no rename performed)
Bob Smith EMPLOYEE SAP001 TE110001 (currently-active ID)
In the above example, the four accounts are linked by either a new_user_id field that was set when the user was renamed or through having the same SAP ID.
Because HR frequently fails to follow the business process, returning users may end up with any of those four IDs being restored to them. I have to analyze all the IDs for Bob Smith and say "Bob Smith can only have TE110001 restored", and kick back an error if they try to restore something else. I have to do it for 90,000+ records.
The first column, "Bob Smith", is just an identifier to the group of associated accounts. In my original example, I'm using the root User ID as the identifier (e.g. US0T0001). If I use first/last names to identify users, I end up with collisions.
So Bob Smith would look like this:
US0T0001 1 CONTRACTOR ---- US0T0001 -> US001101 (given a new ID as an employee)
US0T0001 2 EMPLOYEE ---- US001101 -> EB0T0001 (contractor ID for the UK)
US0T0001 3 CONTRACTOR SAP001 EB0T0001 (no rename performed)
US0T0001 4 EMPLOYEE SAP001 TE110001 (currently-active ID)
... where 1, 2, 3, 4 are the levels in the hierarchy.
Since US0T0001, US001101, EB0T0001, and TE110001 are all accounted for, I don't want another group for them. But the results I have now have those accounts listed in multiple groups:
US001101 1 EMPLOYEE ---- US001101 -> EB0T0001
US001101 2 CONTRACTOR SAP001 EB0T0001
US001101 3 EMPLOYEE SAP001 TE110001
EB0T0001 1 CONTRACTOR SAP001 EB0T0001
EB0T0001 2 EMPLOYEE SAP001 TE110001
TE110001 1 EMPLOYEE SAP001 TE110001
This causes two problems:
When I query the results for a User ID, I get hits from multiple groups
Each group will report a different expected user ID for Bob Smith.
You asked for an expanded set of records... here are some actual data:
-- NumRootUsers tells me how many accounts are associated with a user.
-- The new user ID field is explicitly set in the database, but may be null.
-- The calculated new user ID analyzes records to determine what the next related record is
NumRoot New User Calculated
RootUser Users Level UserId ID Field New User ID SapId LastName FirstName
-----------------------------------------------------------------------------------------------
BG100502 3 1 BG100502 BG1T0873 BG1T0873 GRIENS VAN KION
BG100502 3 2 BG1T0873 BG103443 BG103443 GRIENS VAN KION
BG100502 3 3 BG103443 41008318 VAN GRIENS KION
-- This group causes bad matches for Kion van Griens... the IDs are already accounted for,
-- and this group doesn't even grab all of the accounts for Kion. It's also using a new
-- ID to identify the group
BG1T0873 2 1 BG1T0873 BG103443 BG103443 GRIENS VAN KION
BG1T0873 2 2 BG103443 41008318 VAN GRIENS KION
-- Same here...
BG103443 1 1 BG103443 41008318 VAN GRIENS KION
-- Good group of records
BG100506 3 1 BG100506 BG100778 41008640 MALEN VAN LARS
BG100506 3 2 BG100778 BG1T0877 41008640 MALEN VAN LARS
BG100506 3 3 BG1T0877 41008640 VAN MALEN LARS
-- Bad, unwanted group of records
BG100778 2 1 BG100778 BG1T0877 41008640 MALEN VAN LARS
BG100778 2 2 BG1T0877 41008640 VAN MALEN LARS
-- Third group for Lars
BG1T0877 1 1 BG1T0877 41008640 VAN MALEN LARS
-- Jan... fields are set differently than the above examples, but the chain is calculated correctly
BG100525 3 1 BG100525 BG1T0894 41008651 ZANWIJK VAN JAN
BG100525 3 2 BG1T0894 TE035165 TE035165 41008651 VAN ZANWIJK JAN
BG100525 3 3 TE035165 41008651 VAN ZANWIJK JAN
-- Bad
BG1T0894 2 1 BG1T0894 TE035165 TE035165 41008651 VAN ZANWIJK JAN
BG1T0894 2 2 TE035165 41008651 VAN ZANWIJK JAN
-- Bad bad
TE035165 1 1 TE035165 41008651 VAN ZANWIJK JAN
-- Somebody goofed and gave Ziano a second SAP ID... but we still matched correctly
BG100527 3 1 BG100527 BG1T0896 41008652 STEFANI DE ZIANO
BG100527 3 2 BG1T0896 TE033030 TE033030 41008652 STEFANI DE ZIANO
BG100527 3 3 TE033030 42006172 DE STEFANI ZIANO
-- And we still got extra, unwanted groups
BG1T0896 3 2 BG1T0896 TE033030 TE033030 41008652 STEFANI DE ZIANO
BG1T0896 3 3 TE033030 42006172 DE STEFANI ZIANO
TE033030 3 3 TE033030 42006172 DE STEFANI ZIANO
-- Mark's a perfect example of the missing/frustrating data I'm dealing with... but we still matched correctly
BG102188 3 1 BG102188 BG1T0543 41008250 BULINS MARK
BG102188 3 2 BG1T0543 TE908583 41008250 BULINS R.J.M.A.
BG102188 3 3 TE908583 41008250 BULINS RICHARD JOHANNES MARTINUS ALPHISIUS
-- Not wanted
BG1T0543 3 2 BG1T0543 TE908583 41008250 BULINS R.J.M.A.
BG1T0543 3 3 TE908583 41008250 BULINS RICHARD JOHANNES MARTINUS ALPHISIUS
TE908583 3 3 TE908583 41008250 BULINS RICHARD JOHANNES MARTINUS ALPHISIUS
-- One more for good measure
BG1T0146 3 1 BG1T0146 BG105905 BG105905 LUIJENT VALERIE
BG1T0146 3 2 BG105905 TE034165 42006121 LUIJENT VALERIE
BG1T0146 3 3 TE034165 42006121 LUIJENT VALERIE
BG105905 3 2 BG105905 TE034165 42006121 LUIJENT VALERIE
BG105905 3 3 TE034165 42006121 LUIJENT VALERIE
TE034165 3 3 TE034165 42006121 LUIJENT VALERIE
Not sure if all that info makes it clearer or will make your eyes roll back into your head : )
Thanks for looking at this!

I think I have it. We have allowed ourselves to become fixated on the chronological order whereas in fact it doesn't matter. Your START WITH clause should be 'NEW_USER_ID IS NULL'.
To get chronological order you could 'ORDER BY cudroot.node_level * -1'.
I would also recommend that you look at using a WITH clause to form your base data and perform the hierarchical query on that.
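As a rough illustration of that suggestion, here is a minimal Oracle sketch. It assumes every rename chain ends in a row whose calculated new_user_id is null, and it uses three rows copied from the sample output in the question in place of the real base data; the name chain is made up for the example. Note that the CONNECT BY condition is reversed compared with the original query, so the walk starts at the newest ID and moves back to older ones.

```sql
-- Sketch only (Oracle). The three rows in "chain" are copied from the sample
-- output in the question; in the real query "chain" would wrap the existing
-- cudordered inline view instead. "chain" is a hypothetical name.
with chain as (
  select 'A5093522' user_id, 'FG096489' new_user_id from dual union all
  select 'FG096489', 'A5093665' from dual union all
  select 'A5093665', null from dual
)
select connect_by_root(user_id) root_user,
       level                    node_level,
       user_id,
       new_user_id
from   chain
start with new_user_id is null                   -- begin at the end of each rename chain
connect by nocycle prior user_id = new_user_id   -- walk backwards to the older IDs
order by root_user, node_level desc;             -- oldest ID first within a group
```

With this shape each account should be reached from exactly one starting row, so it lands in a single group. The trade-off is that the group is identified by its latest ID rather than its earliest one.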

Perhaps what you need here is multiple queries. Each query will find a subset of the records you are trying to find. Each query will hopefully be simpler and faster than a single, ginormous query. Something like:
where new_user_id is null and SAP ID is null
where new_user_id is not null and SAP ID is null
where new_user_id is null and SAP ID is not null
where new_user_id is not null and SAP ID is not null
(these are off-the-cuff examples)
I think part of the problem with solving this conundrum is that the problem space is too large. By subdividing this problem into smaller pieces, each piece will be workable.

Related

Self JOIN to find the parent detail which matches with the row data

I am trying to query in MS SQL and I can not resolve it. I have a table employees:
Id Name Surname FatherName MotherName WifeName Pincode isChild
-- ------- ------- ---------- ---------- -------- ------- -------
1 John Green James Sue null 101011 1
2 Michael Sloan Barry Lilly null 101011 1
3 Sally Green Andrew Molly Jemi 101011 1
4 Barry Sloan Soul Paul Lilly 101011 0
5 James Green Ned White Sue 101011 0
I want a query that selects the rows whose name and wife name match the father name and mother name of a child row. For example, for id=1, John's father name (James) and mother name (Sue) match id 5, which has James as the name and Sue as the wife name. So my query should return (this is my expected result):
Id Name Surname FatherName MotherName WifeName Pincode isChild
-- ------- ------- ---------- ---------- -------- ------- -------
5 James Green Ned White Sue 101011 0
4 Barry Sloan Soul Paul Lilly 101011 0
I tried the query below, but it checks for James only. How can I change my query so that it checks all the names and returns the expected result?
select * FROM employees
where first_name like '%James%'
and wife_name like '%Sue%'
and pincode=101011;
Any tips on this will be really helpful. I am new to joins, need help on writing self join to get the result.
…
select *
from thetable as p -- the parent/father
where exists -- with one child at least
(
select *
from thetable as c
where c.fathername = p.name
and c.mothername = p.wifename
-- lastname?
)
Too long for a comment, but also not intended as a slam against what you are working with. Please take as constructive criticism.
Aside from the VERY POOR DESIGN of the table content, getting that corrected should be done first, before you get too deep into whatever you are working on. A more typical design might be a table of people. Then, to get the relationships, you could go a couple of ways. One is to add two additional IDs to each person's record: FatherID and MotherID. These IDs join the child directly to the parent rows, instead of matching on hard strings. Take a surname like Smith or Jones, and consider how many instances of a "John Smith" may exist: a lot. The probability of also finding a matching wife's name of Sue, Mary, or whatever else is lower, but even that could lead to multiple possibilities. Yes, you are adding a PIN, but even a computer can generate a random pin of 1234.
By having the IDs, there is NO ambiguity of who the relationship is with.
If the data were slightly altered to something like
Id Name Surname FatherID MotherID SpouseID
-- ------- ------- ---------- ---------- --------
1 John Green 5 6 null
2 Michael Sloan 4 3 null
3 Lilly Sloan null null 4
4 Barry Sloan null null 3
5 James Green 9 10 6
6 Sue Green 7 8 5
7 Bill Jones null null 8
8 Martha Jones null null 7
9 Brian Green null null 10
10 Beth Smith-Green null null 9
So, in this modified example, you can see right away that ID#1 John Green has a father (ID#5), James, and a mother (ID#6), Sue. But James is in turn a child of father (ID#9) Brian and mother (ID#10) Beth. This scenario goes up to the grandparent level: James and Sue are each also children of their respective parents, Sue's parents having the Jones surname.
For Michael Sloan, parents of #4 Barry, and #3 Lilly.
And I additionally added a spouse ID. This prevents redundancy of people's names copied all over. Then you can query on the child's parents' respective IDs rather than on a hopeful name LIKE guess.
So, even though this does not solve your relatively simple query, fixing the underlying foundation of your database and its relations will, long-term, ease your querying in the future.
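To make the ID-based lookup concrete, here is a minimal SQL Server sketch. The table name people is hypothetical, and the rows are only a subset of the modified example above, supplied inline so the query can be run on its own.

```sql
-- Sketch only (SQL Server). "people" is a hypothetical name for the
-- redesigned table; the rows are a subset of the modified example.
with people (Id, Name, Surname, FatherID, MotherID, SpouseID) as (
    select *
    from (values
        (1, 'John',    'Green', 5,    6,    null),
        (2, 'Michael', 'Sloan', 4,    3,    null),
        (3, 'Lilly',   'Sloan', null, null, 4),
        (4, 'Barry',   'Sloan', null, null, 3),
        (5, 'James',   'Green', 9,    10,   6),
        (6, 'Sue',     'Green', 7,    8,    5)
    ) as v (Id, Name, Surname, FatherID, MotherID, SpouseID)
)
select child.Name  as ChildName,
       father.Name as FatherName,
       mother.Name as MotherName
from   people as child
join   people as father on father.Id = child.FatherID   -- exact key match, no LIKE
join   people as mother on mother.Id = child.MotherID
order by child.Id;
```

Because the joins are on keys, two different people who happen to share a name can no longer be confused. With this subset, James and Sue drop out of the result only because their own parents (IDs 7 to 10) were left out of the inline rows.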
Try this:
SELECT
T2.*
FROM employees T1
JOIN employees T2 ON T2.Name = T1.FatherName
AND T2.WifeName = T1.MotherName

Is this multiple join on 2 tables possible?

I have 2 tables and I am having trouble joining it to give me the desired output.
First table is called Future. It is future meetings I have.
Date Name Subject Importance Location
7/08/2020 David Work 1 London
7/08/2020 George Updates 2 New York
7/08/2020 Frank New Appointments 5 London
7/08/2020 Steph Policy 1 Paris
The second table is called Previous. It is previous meetings I have had.
Date Name Subject Importance Location Time Rating
1/08/2020 David Work 3 London 23.50 4
2/10/2018 David Emails 3 New York 18.20 3
1/08/2019 George New Appointments 5 London 55.10 2
3/04/2020 Steph Dismissal 1 Paris 33.20 5
What I need is to look up each person in the Previous table by name to see the previous meetings I have had with them, and I want all the data from the Previous table there. I also need to limit it to showing a maximum of 5 previous meetings with each person.
Date Name Subject Importance Location Time Rating
7/08/2020 David Work 1 London - -
1/08/2020 David Work 3 London 23.50 4
2/10/2018 David Emails 3 New York 18.20 3
7/08/2020 George Updates 2 New York - -
1/08/2019 George New Appointments 5 London 55.10 2
The Name column will need to be a left join, but then I need to just do a regular join on the other columns. I am also unsure how to limit the name results to a maximum of 5 of the same value. Thanks for your help in advance.
Basically, you want union all:
select m.*
from ((select Date, Name, Subject, Importance, Location, NULL as time, NULL as rating
from future
) union all
(select Date, Name, Subject, Importance, Location, time, rating
from previous
)
) m
order by name, date desc;
You can apply other conditions to this result. It is not clear what other conditions you really want, but this is a start.
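One such condition is the limit of 5 previous meetings per person. A possible sketch follows, assuming the DBMS supports window functions and that Date is stored as a real date type rather than as text. The rn column is introduced here only for the example, and Date and Time may need quoting in some systems.

```sql
-- Sketch only. Assumes window-function support and a real date type for Date;
-- rn is a helper column introduced for this example.
select m.Date, m.Name, m.Subject, m.Importance, m.Location, m.Time, m.Rating
from (
    -- upcoming meetings: always kept (rn = 0)
    select f.Date, f.Name, f.Subject, f.Importance, f.Location,
           null as Time, null as Rating, 0 as rn
    from   future f
    union all
    -- previous meetings, numbered newest-first per person,
    -- and only for people who also have a future meeting
    select p.Date, p.Name, p.Subject, p.Importance, p.Location,
           p.Time, p.Rating,
           row_number() over (partition by p.Name order by p.Date desc) as rn
    from   previous p
    where  exists (select 1 from future f where f.Name = p.Name)
) m
where  m.rn <= 5
order by m.Name, m.rn, m.Date desc;
```

Sorting on rn puts each person's upcoming meeting first, followed by up to five of their most recent previous meetings.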

Identifying Records Where a String Appears More Than Once

I have a following dataset that looks like:
ID Medication Dose
1 Aspirin 4
1 Tylenol 7
1 Aspirin 2
1 Ibuprofen 1
2 Aspirin 6
2 Aspirin 2
2 Ibuprofen 6
2 Tylenol 4
3 Tylenol 3
3 Tylenol 7
3 Tylenol 2
I would like to write code that identifies patients who have been administered a medication more than once. So for example, ID 1 had Aspirin twice, ID 2 had Aspirin twice and ID 3 had Tylenol three times.
I could be wrong but I think the easiest way to do this would be to concatenate each ID based on Medication using a code similar to the one below; but I'm not quite sure what to do after that - is it possible to count if a string appears twice within a cell?
SELECT DISTINCT ST2.[ID],
SUBSTRING(
(
SELECT ','+ST1.Medication AS [text()]
FROM ED_NOTES_MASTER ST1
WHERE ST1.[ID] = ST2.[ID]
Order BY [ID]
FOR XML PATH ('')
), 1, 200000) [Result]
FROM ED_NOTES_MASTER ST2
I would like the output to look like the following:
ID MEDICATION Aspirin2x Tylenol2x Ibuprofen2x
1 Aspirin, Tylenol , Aspirin YES NO NO
2 Ibuprofen, Aspirin, Aspirin YES NO NO
3 Tylenol, Tylenol ,Tylenol NO YES NO
For the first part of your question (identify patients that have had a particular medication more than once), you can do this using GROUP BY to group by the ID and medication, and then using COUNT to get how many times each medication was given to each patient. For example:
SELECT ID, Medication, COUNT(*) AS amount
FROM ED_NOTES_MASTER
GROUP BY ID, Medication
This will give you a list of all ID - Medication combinations that appear in the table and a count of how many times each combo appears. To limit these results to just those with a count of 2 or more, you can add a condition on the count using HAVING. SQL Server does not allow a column alias in HAVING, so the aggregate is repeated:
SELECT ID, Medication, COUNT(*) AS amount
FROM ED_NOTES_MASTER
GROUP BY ID, Medication
HAVING COUNT(*) >= 2
The problem now is formatting the results in the way you want. What you will get from the query above is a list of all patient - medication combinations that came up in the table more than once, like this:
ID | Medication | Count
------+---------------+-------
1 | Aspirin | 2
2 | Aspirin | 2
3 | Tylenol | 3
I'd suggest that you try to work with this format if possible, because as you have found, to get multiple values returned in a comma-delimited list as you have in your Medication column you have to resort to some hacks (although SQL Server 2017 and later do provide proper group concatenation through STRING_AGG). If you really need the Aspirin2x etc. columns, take a look at the PIVOT operation in SQL Server.
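If the YES/NO columns are required, one way to get them without PIVOT is conditional aggregation over the counted result. This is only a sketch: it swaps PIVOT for CASE expressions inside MAX, and it hard-codes the three medication names from the sample data.

```sql
-- Sketch only (SQL Server). Conditional aggregation used in place of PIVOT;
-- the medication names are hard-coded from the sample data.
select c.ID,
       -- MAX works on the strings because 'YES' sorts after 'NO'
       max(case when c.Medication = 'Aspirin'   and c.amount >= 2 then 'YES' else 'NO' end) as Aspirin2x,
       max(case when c.Medication = 'Tylenol'   and c.amount >= 2 then 'YES' else 'NO' end) as Tylenol2x,
       max(case when c.Medication = 'Ibuprofen' and c.amount >= 2 then 'YES' else 'NO' end) as Ibuprofen2x
from (
    -- one row per patient and medication, with the number of administrations
    select ID, Medication, count(*) as amount
    from   ED_NOTES_MASTER
    group by ID, Medication
) as c
group by c.ID
order by c.ID;
```

A new medication would need a new column added by hand, which is the usual cost of a fixed column list with either approach.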

Add incremental number by matching the value of another column

Below is my SQL Server 2012 query example. How do I add an extra column “StaffNo” to show an incremental integer?
This int always starts at 1, and it doesn’t need to be tied to a particular staff name. For example, “Joe” might show StaffNo 1 in this run of the query, and 2, 3, or any other number in the next run.
The same staff member always gets the same StaffNo. Different staff members get different numbers.
The number must be sequential and the increment is 1.
There are more than 100 staff members, so please don’t write the query like “select case when staff = ‘Joe’ then 1 End”.
my query:
Staff CaseNumber
Joe 5880
Joe 4489
Joe 2235
Emily 7790
Emily 8813
expected result:
Staff CaseNumber StaffNo
Joe 5880 1
Joe 4489 1
Joe 2235 1
Emily 7790 2
Emily 8813 2
Use DENSE_RANK over the entire table, without a partition, and order by the staff member's name.
SELECT
Staff,
CaseNumber,
DENSE_RANK() OVER (ORDER BY Staff) StaffNo
FROM yourTable;

SQL - Find duplicate children

I have a table containing meetings:
MeetID Description
-----------------------------------------------------
1 SQL Workshop
2 Cake Workshop
I have another table containing all participants in the meetings:
PartID MeetID Name Role
-----------------------------------------------------
1 1 Jan Coordinator
2 1 Peter Participant
3 1 Eva Participant
4 1 Michael Coordinator
5 2 Jan Coordinator
6 2 Peter Participant
What I want to find is a list of all meetings that have 2 or more participants with Role = 'Coordinator'.
Eg. in the example above that would be the meeting with MeetID=1 and not 2.
I cannot for the life of me figure out how to do this, although I think it should be simple :-)
(I am using SQL Server 2012)
This is easy to do using group by and having:
select MeetId
from participants p
where Role = 'Coordinator'
group by MeetId
having count(*) >= 2;
Note: Role is a potential keyword/reserved word, so it is a bad choice for a column name.