Fix One-One relationship to be One-Many by removing duplicates


The Mistake
Originally, there was a one-to-one relationship between Orgs and Servers, where the key of Server was simply an OrganizationId. This turned out to be a bad design: the business logic changed, and now multiple Orgs can share the same server. Before we changed the model, we simply duplicated servers for each org, so multiple Orgs have Servers with the same Subdomain. Below is the current setup.
Requirements
First off, this is unfortunately on prod with a lot of data, so dropping the whole database and recreating it with the correct model is off the table.
What we would like to do now is remove duplicate Servers so that each Subdomain appears only once. For example, if Org1 and Org2 had Ser1 and Ser2, both with the subdomain "test", we would set the FK Org.Server_Id to the lowest-Id server with that subdomain, in this case Ser1, so that both Org1 and Org2 point to Ser1. Below is a high tech excel example:
Things we have tried
We were able to get as far as getting Org.Server_Id to be the correct value based on Server.OrganizationId via:
UPDATE Organization
SET Server_Id = t.Id
FROM(
SELECT Id, OrganizationId
FROM Server
) t
WHERE t.OrganizationId = Organization.Id
but whenever we try to go further, we get stuck because we can't use ORDER BY in the inner FROM to grab the first occurrence in some aggregate way.
This is finally what we got to, but of course it doesn't work because we can't access t inside the inner FROM, and I also don't think this is even the correct path to be following:
UPDATE Organization SET Server_Id = t.Id
FROM
(
SELECT Id, OrganizationId
FROM (
SELECT TOP(1) Subdomain, Id, OrganizationId
FROM Server
WHERE Subdomain = t.Subdomain
) a
) t
WHERE t.OrganizationId = Organization.Id

I can't say I completely understand everything you have going on, but in the past when I have needed to grab the first entry of duplicate information I used a Partition function in the inner query. I don't know how you want to order the results but it would look something like this:
(
SELECT ROW_NUMBER() OVER (PARTITION BY column1, column2, etc... ORDER BY columnX DESC/ASC) As row_num, Id, OrganizationId
FROM Server
) t
WHERE t.OrganizationId = Organization.Id AND row_num = 1
That would be essentially the same thing as what you tried to do in your second code block (I believe). The column1 and column2 would be the set of columns of duplicate data that you want to collapse into one entry and columnX would be the column to order the results by. By having row_num = 1 in the WHERE statement, you would only get back the first result for each unique column1, column2, etc.. combo from the inner query.

After using the partition suggested by #user2731076, we were able to modify our query to this:
UPDATE Organization SET Server_Id = t.Id
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY Subdomain ORDER BY [Server].Id ASC) As row_num, [Server].Id, OrganizationId, Subdomain
FROM Organization
INNER JOIN [Server] ON [Server].OrganizationId = [Organization].Id
) t
WHERE t.Subdomain IN(SELECT Subdomain FROM Server WHERE OrganizationId = Organization.Id) AND row_num = 1
The issue we had with our code:
t.OrganizationId = Organization.Id
was that, since there was always a Server associated with an Org, it would just set Org.Server_Id to the value it already had. What we actually wanted was the row_num = 1 Server whose subdomain matches the subdomain of the current Org's server. This required the INNER JOIN to build the partition, and the IN subquery in the WHERE clause to fetch the current Org's subdomain, so that we could compare t.Subdomain against it.
There is probably a more efficient way to do this, and we will look into it in the future.
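As one possible simpler alternative, here is a minimal sketch that avoids the window function by taking MIN(Id) per subdomain in a correlated subquery. It runs against an in-memory SQLite database through Python's sqlite3 module purely so it can be executed as-is; the tiny schema and sample rows are assumptions modeled on the Org1/Org2/Ser1/Ser2 example, and the SQL Server syntax may need adjusting. It also assumes every Org currently has exactly one Server row, as described above.

```python
import sqlite3

# Hypothetical minimal schema using the column names from the question.
SCHEMA_AND_DATA = """
CREATE TABLE Organization (Id INTEGER PRIMARY KEY, Server_Id INTEGER);
CREATE TABLE Server (Id INTEGER PRIMARY KEY, OrganizationId INTEGER, Subdomain TEXT);
INSERT INTO Organization (Id, Server_Id) VALUES (1, 1), (2, 2), (3, 3);
INSERT INTO Server (Id, OrganizationId, Subdomain)
VALUES (1, 1, 'test'), (2, 2, 'test'), (3, 3, 'other');
"""

# For each org, look up its own server (s_own), then take the lowest Id
# among all servers that share that server's subdomain (s_any).
REPOINT_SQL = """
UPDATE Organization
SET Server_Id = (
    SELECT MIN(s_any.Id)
    FROM Server AS s_own
    JOIN Server AS s_any ON s_any.Subdomain = s_own.Subdomain
    WHERE s_own.OrganizationId = Organization.Id
)
"""


def repoint_demo():
    """Build the sample data, run the update, return (org id, server id) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA_AND_DATA)
    conn.execute(REPOINT_SQL)
    rows = conn.execute(
        "SELECT Id, Server_Id FROM Organization ORDER BY Id"
    ).fetchall()
    conn.close()
    return rows


print(repoint_demo())
```

With the sample rows, Org1 and Org2 both end up pointing at server 1, while Org3 keeps server 3.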

Related

SQL Select Count Subquery, Joins messing up everything

I've got a task to work with 4 different tables. I think I've got the "logic" correct, but I think I'm failing on joining the various separately working things together.
The CASE somehow returns two rows when the comparison is true; if it isn't, it displays (correctly) just one. Works fine without joins.
The count subquery works by itself, but when I try to tie it together, it either shows the same number everywhere or displays far too large numbers (likely multiples of multiples).
Select Distinct RPD_PERSONS.PERSON_ID "id",
RPD_PERSONS.SURN_TXT ||' '|| RPD_PERSONS.NAME_TXT "Name",
Case ADD_ROLE_PERS.ROLE_CODE When 'Manager'
Then 'yes'
Else 'no'
End "Manager",
(
Select Count(LDD_CERTS.Cert_ID)
From LDD_CERTS
Join LDD_PERS_CERTS
On LDD_PERS_CERTS.CERT_ID = LDD_CERTS.CERT_ID
Where MONTHS_BETWEEN(LDD_CERTS.VALID_TO,SYSDATE)>0
And LDD_PERS_CERTS.CERT_CHANGE_TYPE>=0
) "no. of certificates"
From RPD_PERSONS
Join ADD_ROLE_PERS
On ADD_ROLE_PERS.Person_ID = RPD_PERSONS.Person_ID
Where RPD_PERSONS.Partic_ID = 1
Group By RPD_PERSONS.PERSON_ID, RPD_PERSONS.SURN_TXT ||' '|| RPD_PERSONS.NAME_TXT, ADD_ROLE_PERS.ROLE_CODE
Order By RPD_PERSONS.Person_ID;
This is the subquery that, by itself, seems to work perfectly.
Select LDD_PERS_CERTS.PERSON_UID,Count(LDD_CERTS.Cert_ID)
From LDD_CERTS
Join LDD_PERS_CERTS
ON LDD_PERS_CERTS.CERT_ID = LDD_CERTS.CERT_ID
Where MONTHS_BETWEEN(LDD_CERTS.VALID_TO,SYSDATE)>0
AND LDD_PERS_CERTS.CERT_CHANGE_TYPE>=0
Group By LDD_PERS_CERTS.PERSON_UID
order by LDD_PERS_CERTS.PERSON_UID;

You have a lot going on in a fairly short query, so let me try to summarize what I THINK you are trying to get.
You want a list of distinct people within the company with a count of how many ACTIVE certs (not expired) per person. From that, you also want to know if they are in a management position or not (via roles).
Q: A person may be a manager but also report to a higher-up, since typical business structures can have multiple layers of management. Do you want to see that person once per role, or only once, flagged as a manager or not? If a person has 3 or more roles, do you want to see every instance? If your PRIMARY care is Manager Yes or No, the query gets even simpler.
Now, your query of counts for valid certs. The MONTHS_BETWEEN() function suggests you are running Oracle. Comparing Valid_To against SYSDATE indicates that Valid_To is expected to be in the future (i.e., the cert is still active). If so, the query cannot use an index on Valid_To effectively, because wrapping the column in a function call makes the predicate non-sargable.
Instead, you should only need WHERE Valid_To > SysDate, in other words, only those certs that have not yet expired. You MIGHT even be better served by pre-aggregating the counts of still-active certs per Cert ID, then joining to the person certs table, since the person cert check is for all rows where cert_change_type >= 0, which could imply ALL. Under what condition would Cert_Change_Type be less than zero? If never, that WHERE clause is pointless.
Next, your SELECT DISTINCT query needs some adjustment. Your column-level subquery has no reference to the outer person ID, so it just counts all certs. Nothing correlates the certs being counted with the person. I can only GUESS that there is some relationship such as
RPD_Persons.Person_id = LDD_Pers_Certs.Person_UID
Having stated all that, I would have the following table/indexes
table            index
LDD_PERS_CERTS   ( CERT_CHANGE_TYPE, PERSON_UID, CERT_ID )
LDD_CERTS        ( valid_to, cert_id )
RPD_PERSONS      ( partic_id, person_id, surn_txt, name_txt )
ADD_ROLE_PERS    ( person_id, role_code )
I would try something like
Select
lpc.PERSON_UID,
ValCerts.CertCount
From
( select
Cert_id,
count(*) CertCount
from
LDD_CERTS
where
Valid_To > sysDate
group by
Cert_id ) ValCerts
JOIN LDD_PERS_CERTS lpc
on ValCerts.Cert_id = lpc.cert_id
Where
lpc.CERT_CHANGE_TYPE >= 0
Now, if you only care whether a given person is a manager or not, I would pre-query only that, as you were not actually returning a person's SPECIFIC ROLE, just whether they were a manager. My final query might look like
select
p.PERSON_ID id,
max( p.SURN_TXT || ' ' || p.NAME_TXT ) Name,
max( Case when arp.Person_id IS NULL
then 'no' else 'yes' end ) Manager,
max( coalesce( certs.CertCount, 0 )) ActiveCertsForUser
from
RPD_PERSONS p
LEFT Join ADD_ROLE_PERS arp
On p.Person_ID = arp.Person_ID
AND arp.role_code = 'Manager'
LEFT JOIN
( Select
lpc.PERSON_UID,
ValCerts.CertCount
From
( select
Cert_id,
count(*) CertCount
from
LDD_CERTS
where
Valid_To > sysDate
group by
Cert_id ) ValCerts
JOIN LDD_PERS_CERTS lpc
on ValCerts.Cert_id = lpc.cert_id
AND lpc.CERT_CHANGE_TYPE >= 0
) Certs
on p.Person_id = Certs.Person_uid
Where
p.Partic_ID = 1
GROUP BY
p.PERSON_ID
Now, if p.partic_id = 1 represents only 1 person, then it won't make as much sense to query all people with a given certificate status, etc. But if Partic_id = 1 represents a group of people, such as a given association / division of a company, then it should be fine.
Any questions, let me know and I can revise / update answer

CASE issue: there can, presumably, be multiple records in ADD_ROLE_PERS for each person. If a person can have two or more roles running concurrently, then you need to decide what business logic to use to handle this. If a person can only have one active role at a time, presumably there is an "active/disabled" column or effective date columns you should be using to identify the active record (or, potentially, there is a data issue).
The subquery returns the same value for every single row in your result set, as it is completely isolated/standalone from your main query. If you want it to produce counts that are relevant to each row, then you will need to connect it to the tables in the main query (look up correlated subqueries if you don't know how to do this).
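To illustrate the correlated-subquery idea, here is a minimal sketch run against an in-memory SQLite database via Python's sqlite3 module. The link RPD_PERSONS.PERSON_ID = LDD_PERS_CERTS.PERSON_UID is a guess, the sample rows are invented, and SQLite's date('now') stands in for Oracle's SYSDATE, so treat it as an outline rather than a drop-in query. Using EXISTS for the manager flag keeps one row per person even when a person has several roles.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question.
SETUP_SQL = """
CREATE TABLE RPD_PERSONS (PERSON_ID INTEGER, SURN_TXT TEXT, NAME_TXT TEXT, PARTIC_ID INTEGER);
CREATE TABLE ADD_ROLE_PERS (PERSON_ID INTEGER, ROLE_CODE TEXT);
CREATE TABLE LDD_CERTS (CERT_ID INTEGER, VALID_TO TEXT);
CREATE TABLE LDD_PERS_CERTS (PERSON_UID INTEGER, CERT_ID INTEGER, CERT_CHANGE_TYPE INTEGER);
INSERT INTO RPD_PERSONS VALUES (1, 'Smith', 'Anna', 1), (2, 'Jones', 'Bob', 1), (3, 'Doe', 'Cy', 2);
INSERT INTO ADD_ROLE_PERS VALUES (1, 'Manager'), (1, 'Auditor'), (2, 'Clerk');
INSERT INTO LDD_CERTS VALUES (10, '2999-12-31'), (11, '2000-01-01'), (12, '2999-12-31');
INSERT INTO LDD_PERS_CERTS VALUES (1, 10, 0), (1, 11, 0), (2, 10, 0), (2, 12, 1), (3, 10, 0);
"""

# Both subqueries reference p.PERSON_ID, so they are evaluated per person.
# EXISTS yields a single yes/no no matter how many roles a person holds.
REPORT_SQL = """
SELECT p.PERSON_ID,
       p.SURN_TXT || ' ' || p.NAME_TXT AS full_name,
       CASE WHEN EXISTS (SELECT 1
                         FROM ADD_ROLE_PERS r
                         WHERE r.PERSON_ID = p.PERSON_ID
                           AND r.ROLE_CODE = 'Manager')
            THEN 'yes' ELSE 'no' END AS manager,
       (SELECT COUNT(*)
        FROM LDD_PERS_CERTS pc
        JOIN LDD_CERTS c ON c.CERT_ID = pc.CERT_ID
        WHERE pc.PERSON_UID = p.PERSON_ID        -- assumed relationship
          AND c.VALID_TO > date('now')           -- SYSDATE in Oracle
          AND pc.CERT_CHANGE_TYPE >= 0) AS active_certs
FROM RPD_PERSONS p
WHERE p.PARTIC_ID = 1
ORDER BY p.PERSON_ID
"""


def cert_report_demo():
    """Return one row per person: id, name, manager flag, active cert count."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP_SQL)
    rows = conn.execute(REPORT_SQL).fetchall()
    conn.close()
    return rows


for row in cert_report_demo():
    print(row)
```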

Selecting rows from other tables based on the first table using SQL

I have three T-SQL statements that I'd like to combine into one, so it is just a single call to the database, not three.
SELECT * FROM Clients
The first one, selects every client from the Clients table.
SELECT * FROM History
The second one, selects all the history entries from the History table. I then use some code to find the first history for each client. i.e. first history in the table for ClientID gets set into the HasHistory column for that ClientID.
SELECT * FROM Actions
The final one, I get all the actions from the action table. I then use some code to find the last action for each client. i.e. last action in the table for ClientID gets set into the LastAction column for that ClientID.
So I'm wondering if there is a way to write an SQL statement like this for example? Note this is not real SQL, just pseudo code to illustrate what I'm trying to achieve.
SELECT *
FROM Clients
AND
SELECT First History Row
FROM History
WHERE History.ClientID = Clients.ClientID
AND
SELECT Last Action Row
FROM Actions
WHERE Actions.ClientID = Clients.ClientID

There are a number of ways you can do this, but here is one example. I'll work on it a bit at a time to explain what we are doing. You haven't shown us the table design, so the column names are a guess, but you should get the idea.
First, you have to somehow mark which history rows you care about. One way to do this is to do a query that puts an order number on every history row, that starts from 1 with every new client, and orders them by date. This way, the first history row for each client (the one you want) always has a row number of one. This would look something like
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY clientID ORDER BY historyDate) AS orderNo
FROM
History
You would do something similar with actions, except you want the latest action, not the first one, so your order by column has to be in reverse order - you do this by telling the ORDER BY to use descending order, something like this
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY clientID ORDER BY actionDate DESC) AS orderNo
FROM Actions
You should now have two queries where the only rows you want are marked with a order number of one. What you do now is start with your first query, and join to these other two queries so that you only join to the orderno = 1 rows. Then all the data you want will be available in one row. You have to decide which join type to use - an inner join will only return Clients that actually have a history and an action. If you want to see clients that have no rows at all in the other tables, you need to use a left outer join. But your final query (you only need this one) will look something like
SELECT
C.*, H.*, A.*
FROM
Clients C
LEFT OUTER JOIN
(SELECT
*,
ROW_NUMBER() OVER (PARTITION BY clientID ORDER BY historyDate) AS orderNo
FROM History) H ON H.clientID = C.clientID AND H.orderNo = 1
LEFT OUTER JOIN
(SELECT
*,
ROW_NUMBER() OVER (PARTITION BY clientID ORDER BY actionDate DESC) AS orderNo
FROM Actions) A ON A.clientID = C.clientID AND A.orderNo = 1
What this says is: take Clients (which we'll call C), then for each row, try and join to (match a row from) the History query we looked at above (which we'll call H) where the client ID is the same and the orderNo is 1 - ie the first history row. It also does the same for the Actions query.
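For anyone who wants to try the pattern end to end, here is a small runnable sketch using Python's sqlite3 module (window functions need SQLite 3.25 or newer). The columns name, note and actionName, and all sample rows, are made up for illustration, just as historyDate and actionDate were guesses above.

```python
import sqlite3

# Hypothetical tables and rows; only clientID comes from the question.
SETUP_SQL = """
CREATE TABLE Clients (clientID INTEGER, name TEXT);
CREATE TABLE History (clientID INTEGER, historyDate TEXT, note TEXT);
CREATE TABLE Actions (clientID INTEGER, actionDate TEXT, actionName TEXT);
INSERT INTO Clients VALUES (1, 'Acme'), (2, 'Beta');
INSERT INTO History VALUES (1, '2020-01-01', 'signed up'), (1, '2020-02-01', 'upgraded');
INSERT INTO Actions VALUES (1, '2021-01-01', 'call'), (1, '2021-03-01', 'email');
"""

# orderNo = 1 marks the earliest history row and the latest action row
# per client; LEFT OUTER JOIN keeps clients that have neither.
QUERY_SQL = """
SELECT C.clientID, C.name, H.note, A.actionName
FROM Clients C
LEFT OUTER JOIN (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY clientID ORDER BY historyDate) AS orderNo
    FROM History
) H ON H.clientID = C.clientID AND H.orderNo = 1
LEFT OUTER JOIN (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY clientID ORDER BY actionDate DESC) AS orderNo
    FROM Actions
) A ON A.clientID = C.clientID AND A.orderNo = 1
ORDER BY C.clientID
"""


def first_history_last_action_demo():
    """Return (clientID, name, first history note, last action) per client."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP_SQL)
    rows = conn.execute(QUERY_SQL).fetchall()
    conn.close()
    return rows


for row in first_history_last_action_demo():
    print(row)
```

Client 2 has no history or actions in the sample data, so its last two columns come back as NULL (None in Python) instead of the row disappearing.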

SQL Server: Counting Inventory Issue (with subquery)

I currently have a query that is going out into an inventory table (of servers), filtering which ones are 'Developer', and producing a list of distinct users from an audit-related table. Essentially, trying to find out who has access to development servers in this particular inventory.
Everything worked until I added the second line, which I commented out in the code below:
select distinct tAudit.[USER_ID]
--, count(tAudit.[USER_ID]) AS [USER_COUNT]
from table_audit as tAudit
where tAudit.inst_name IN (
SELECT (SUBSTRING([Computer Name],0,CHARINDEX('.',[Computer Name],0))) AS INST_NAME
FROM table_server_inventory
WHERE [SQL Server Edition] = 'Developer'
)
order by tAudit.user_id asc
So, the question is: How can I count how many times a particular user appears? Is there a conflict with the fact I am using distinct? There's another query I produced, purely to see if I was on the right track. This is an example:
select tAudit.[USER_ID]
, count(tAudit.[USER_ID]) AS [USER_COUNT]
from table_audit as tAudit
where tAudit.user_id IN ('user_001', 'user_009', 'user_199', 'user_222')
group by tAudit.user_id
And it looked something like this:
USER_ID USER_COUNT
user_001 5
user_009 32
user_199 14
user_222 8
Ideally, when the primary query is working it'll look like the example above, just with dozens more results.
NOTE: The table_audit is actually very large and lists servers and users each time. Example:
COMPUTER_NAME USER_ID
serverAA user_001
serverAA user_009
serverAA user_199
serverAA user_222
serverBB user_001
serverBB user_009
serverCC user_001
serverCC user_199
serverCC user_222

You just want a GROUP BY query, not SELECT DISTINCT:
select tAudit.[USER_ID], count(tAudit.[USER_ID]) AS [USER_COUNT]
from table_audit as tAudit
where tAudit.inst_name IN (
SELECT (SUBSTRING([Computer Name],0,CHARINDEX('.',[Computer Name],0))) AS INST_NAME
FROM table_server_inventory
WHERE [SQL Server Edition] = 'Developer'
)
group by tAudit.[USER_ID]
order by tAudit.user_id asc

How do I identify like sets using SQL?

Using SQL Server, I have a table that looks like this:
What I need to do is write a query to identify scenarios where the Name and Permissions fields are equal so that I can give them a unique Set ID.
For instance, rows 2 and 4 would be a set I can give a SetID as well as rows 6 and 7 are a set that I can give another SetID. But rows 2 and 3 are NOT a set.
So far I have tried DENSE_RANK() OVER (ORDER BY Name), which adds an ID based on like Names but doesn't take matching Permissions into account. I have also tried joining the table to itself, but with millions of rows of data I end up with unwanted duplicates.
The logic I am following is this:
If (Name and Permissions) of one row = (Name and Permissions) of another row give them a SetID to share.
Please help; I have been banging my head against the wall with this one. Ideally a SQL query would accomplish this, but I am open to anything.
Thank you!

You could do it for example like this:
select
Name,
Permission,
row_number() over (order by Name, Permission) as RN
from (
select distinct
Name,
Permission
from
permissions
) TMP
order by Name, Permission
The inner select gets the distinct combinations, and the outer one assigns the numbers.
SQL Fiddle: http://sqlfiddle.com/#!6/c8319/3

This will probably do something similar to what you want.
SELECT
name,
permissions,
accountname,
ROW_NUMBER() OVER (PARTITION BY name,permissions ORDER By name,permissions) as SetID
FROM table;
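One caveat: ROW_NUMBER() with PARTITION BY name, permissions restarts at 1 inside each group, so it numbers the rows within a set rather than giving the set a shared ID. A possible variation is DENSE_RANK() ordered by both columns, which gives every row with the same (name, permissions) pair the same number. The sketch below runs it in SQLite through Python's sqlite3 module (SQLite 3.25+ for window functions); the table name account_permissions and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical table; column names follow the query above.
SETUP_SQL = """
CREATE TABLE account_permissions (name TEXT, permissions TEXT, accountname TEXT);
INSERT INTO account_permissions VALUES
    ('A', 'read',  'acc1'),
    ('A', 'write', 'acc2'),
    ('A', 'read',  'acc3'),
    ('B', 'read',  'acc4');
"""

# DENSE_RANK gives ties the same rank and leaves no gaps, so each distinct
# (name, permissions) pair maps to exactly one SetID.
QUERY_SQL = """
SELECT accountname,
       DENSE_RANK() OVER (ORDER BY name, permissions) AS SetID
FROM account_permissions
ORDER BY accountname
"""


def set_id_demo():
    """Return (accountname, SetID) pairs for the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP_SQL)
    rows = conn.execute(QUERY_SQL).fetchall()
    conn.close()
    return rows


print(set_id_demo())
```

Here acc1 and acc3 share SetID 1 because both are ('A', 'read'), while acc2 and acc4 get their own IDs.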

SQL: select a user with entries in ALL of a list of sites

If I have a table1 with columns table1.user and table1.site, how can I return a list of distinct users that have access to ALL of a list of sites?
Let me clarify. I could start with the following code:
SELECT DISTINCT user
FROM table1
WHERE (site IN ('site1','site2','site3'))
Of course, this will display all distinct users that have entries for ANY of the three listed sites. I only want the users who have entries for ALL of the three listed sites.
I feel like there's probably an obvious way to do this, and I'm probably going to feel quite stupid once someone points it out. Still, I'm drawing a blank.

SELECT user
FROM table1
WHERE site IN ('site1','site2','site3')
GROUP BY user
HAVING count(*) = 3
This assumes that the user cannot be assigned to the same site more than once. If that is not the case, then Gordon's comment holds true, and the HAVING expression must be replaced with HAVING count(distinct site) = 3
A more dynamic approach where the sites are only specified once and the count() adjusts automatically could be this:
WITH sites (site_name) AS (
VALUES ('site1'),
('site2'),
('site3')
)
SELECT user
FROM table1
WHERE site IN (SELECT site_name FROM sites)
GROUP BY user
HAVING count(*) = (select count(*) from sites);
(This is ANSI SQL, as no DBMS has been specified)

This should work in most DBMS:
select user from table1
where site in ('site1', 'site2', 'site3')
group by user
having count(distinct site) = 3
Note that you'll have to match the number in the HAVING clause (3 in this case) with the number of sites listed in the IN clause.
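To keep the count and the site list in sync automatically, one option is to build the query from application code. The following is a rough sketch using Python's sqlite3 module with an in-memory database; the helper name users_with_all_sites and the sample rows are invented for illustration. It uses COUNT(DISTINCT site), so duplicate rows for the same user and site do not produce false matches.

```python
import sqlite3


def users_with_all_sites(conn, sites):
    """Return users that have at least one row for every site in `sites`."""
    wanted = sorted(set(sites))  # drop duplicates so the count is right
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT "user"
        FROM table1
        WHERE site IN ({placeholders})
        GROUP BY "user"
        HAVING COUNT(DISTINCT site) = ?
        ORDER BY "user"
    """
    # The last parameter is the number of distinct sites requested.
    rows = conn.execute(sql, [*wanted, len(wanted)]).fetchall()
    return [row[0] for row in rows]


def all_sites_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 ("user" TEXT, site TEXT);
        INSERT INTO table1 VALUES
            ('alice', 'site1'), ('alice', 'site1'), ('alice', 'site2'), ('alice', 'site3'),
            ('bob',   'site1'), ('bob',   'site2'),
            ('carol', 'site1'), ('carol', 'site1'), ('carol', 'site1');
    """)
    result = users_with_all_sites(conn, ["site1", "site2", "site3"])
    conn.close()
    return result


print(all_sites_demo())
```

In the sample data carol has three rows, but all for site1, so a plain count(*) = 3 would wrongly include her; only alice is returned here.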
