Oracle - Join multiple columns trying different combinations - sql

I'll try to explain my problem:
I need to find the most efficient way to join two tables on 4 columns, but the data quality is poor, so in some cases I can join on only 3 or 2 columns because the third and/or fourth were stored badly (with spaces, zeros, dashes, ...).
I should try to achieve something like this:
select * from table a
join table b
on a.key1=b.key1
and a.key2=b.key2
or a.key3=b.key3
or a.key4=b.key4
I have already done some data cleansing, but the number of records is really high (table a has 300k records and table b about 25M).
I know that the example I provided is not efficient and that it would be better to write separate joins and then "union" them, but I'm asking whether there is a better way to do it.
Thanks in advance

You haven't explained your problem very well, so let's create an example:
There is a table of clients and a table of orders. The two are not related via keys, because they are imported from different systems. Your task is now to find the client for each order.
Both tables contain the client's last name, first name, city, and a client number. However, these columns are optional in the order table (but either the last name or the client number is always given). And sometimes a first name or city may be abbreviated or misspelled (e.g. J./James, NY/New York, Cris/Chris).
So, if the order contains a client number, we have a match and are done. Otherwise the last name must match. In the latter case we look at first name and city, too. Do both match? Only one? Neither?
We use RANK to rank the clients per order and pick the best matches. Some orders will end up with exactly one match, others will have ties and we must examine the data manually then (the worst case being no client number and no last name match because of a misspelled name).
select *
from
(
  select
    o.*,
    -- alias the client columns so their names do not clash with o.*
    c.client_number as matched_client_number,
    c.last_name as matched_last_name,
    c.first_name as matched_first_name,
    c.city as matched_city,
    rank() over
    (
      partition by o.order_number
      order by
        case
          when c.client_number = o.client_number then 1
          when c.last_name = o.last_name and c.first_name = o.first_name and c.city = o.city then 2
          when c.last_name = o.last_name and (c.first_name = o.first_name or c.city = o.city) then 3
          when c.last_name = o.last_name then 4
          else 5
        end
    ) as rnk
  from orders o
  left join clients c on c.client_number = o.client_number or c.last_name = o.last_name
) ranked
where rnk = 1
order by order_number;
I hope this gives you an idea of how to write such a query, and that you will be able to adapt the concept to your case.
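To try the ranking idea on a few rows before pointing it at 25M records, here is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25+ for window functions). The table layout and the sample rows are made up for illustration, and SQLite stands in for Oracle here, so treat it as a way to check the logic rather than the performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table clients (client_number text, last_name text, first_name text, city text);
    create table orders  (order_number integer, client_number text,
                          last_name text, first_name text, city text);

    -- hypothetical sample data
    insert into clients values
        ('C1', 'Smith', 'James', 'New York'),
        ('C2', 'Smith', 'Chris', 'Boston'),
        ('C3', 'Jones', 'Mary',  'Boston');

    insert into orders values
        (1, 'C3', NULL,    NULL,    NULL),       -- client number given
        (2, NULL, 'Smith', 'Chris', 'NY'),       -- last and first name match C2
        (3, NULL, 'Smith', 'J.',    'Chicago');  -- only the last name matches
""")

best_matches = conn.execute("""
    with scored as (
        -- tier 1 is the strongest kind of match, tier 5 the weakest
        select o.order_number,
               c.client_number as matched_client,
               case
                   when c.client_number = o.client_number then 1
                   when c.last_name = o.last_name and c.first_name = o.first_name
                        and c.city = o.city then 2
                   when c.last_name = o.last_name
                        and (c.first_name = o.first_name or c.city = o.city) then 3
                   when c.last_name = o.last_name then 4
                   else 5
               end as tier
        from orders o
        left join clients c
               on c.client_number = o.client_number or c.last_name = o.last_name
    ),
    ranked as (
        select scored.*,
               rank() over (partition by order_number order by tier) as rnk
        from scored
    )
    select order_number, matched_client, tier
    from ranked
    where rnk = 1
    order by order_number, matched_client
""").fetchall()

for row in best_matches:
    print(row)
```

Order 3 comes back twice because both Smiths match on last name only; that is the kind of tie that still needs a manual look.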

Related

SQL Select Count Subquery, Joins messing up everything

I've got a task to work with 4 different tables. I think I've got the "logic" correct, but I think I'm failing on joining the various separately working things together.
The CASE somehow returns two rows when the comparison is true; if it isn't, it (correctly) displays just one. It works fine without joins.
The count subquery works by itself, but when I try to tie it in, the result ranges from the same number on every row to far too large numbers (likely multiples of multiples).
Select Distinct RPD_PERSONS.PERSON_ID "id",
RPD_PERSONS.SURN_TXT ||' '|| RPD_PERSONS.NAME_TXT "Name",
Case ADD_ROLE_PERS.ROLE_CODE When 'Manager'
Then 'yes'
Else 'no'
End "Manager",
(
Select Count(LDD_CERTS.Cert_ID)
From LDD_CERTS
Join LDD_PERS_CERTS
On LDD_PERS_CERTS.CERT_ID = LDD_CERTS.CERT_ID
Where MONTHS_BETWEEN(LDD_CERTS.VALID_TO,SYSDATE)>0
And LDD_PERS_CERTS.CERT_CHANGE_TYPE>=0
) "no. of certificates"
From RPD_PERSONS
Join ADD_ROLE_PERS
On ADD_ROLE_PERS.Person_ID = RPD_PERSONS.Person_ID
Where RPD_PERSONS.Partic_ID = 1
Group By RPD_PERSONS.PERSON_ID, RPD_PERSONS.SURN_TXT ||' '|| RPD_PERSONS.NAME_TXT, ADD_ROLE_PERS.ROLE_CODE
Order By RPD_PERSONS.Person_ID;
This is the subquery that, by itself, seems to work perfectly.
Select LDD_PERS_CERTS.PERSON_UID,Count(LDD_CERTS.Cert_ID)
From LDD_CERTS
Join LDD_PERS_CERTS
ON LDD_PERS_CERTS.CERT_ID = LDD_CERTS.CERT_ID
Where MONTHS_BETWEEN(LDD_CERTS.VALID_TO,SYSDATE)>0
AND LDD_PERS_CERTS.CERT_CHANGE_TYPE>=0
Group By LDD_PERS_CERTS.PERSON_UID
order by LDD_PERS_CERTS.PERSON_UID;
You have a lot going on in a fairly short query, so let me try to summarize what I THINK you are trying to get.
You want a list of distinct people within the company with a count of how many ACTIVE certs (not expired) per person. From that, you also want to know if they are in a management position or not (via roles).
Q: For a person who may be a manager, but also an under-manager to a higher-up, do you want to see that person in both roles as typical business structures could have multiple layers of management, OR... Do you only care to see a person once, and if they are a manager OR some other level. What if a person has 3 or more roles, do you want to see them every instance? If your PRIMARY care is Manager Yes or No, the query gets even more simplified.
Now, your query of counts for valid certs. The MONTHS_BETWEEN() function suggests you are running Oracle. Comparing the Valid_To date to SYSDATE this way indicates that Valid_To is intended to be in the future (i.e., the cert is still active). If that is the case, the query is hard to optimize as written, because a predicate that wraps the column in a function call is not sargable, so a plain index on Valid_To is unlikely to be used for it.
Instead, you should only need WHERE Valid_To > SYSDATE, in other words, only the certs that have not yet expired. You MIGHT even be better served by pre-aggregating the counts of still-active certs per Cert ID and then joining to the person certs table, since the person cert check is for all rows where Cert_Change_Type >= 0, which could imply ALL. Under what condition would Cert_Change_Type be less than zero? If never, that WHERE clause is pointless.
Next, your SELECT DISTINCT query needs some adjustment. The scalar subquery in your select list has no reference to the outer person ID, so it just aggregates the total certs; nothing correlates the certs being counted with the person. I can only GUESS that there is some relationship such as
RPD_Persons.Person_id = LDD_Pers_Certs.Person_UID
Having stated all that, I would have the following tables/indexes
table            index
LDD_PERS_CERTS   ( CERT_CHANGE_TYPE, PERSON_UID, CERT_ID )
LDD_CERTS        ( valid_to, cert_id )
RPD_PERSONS      ( partic_id, person_id, surn_txt, name_txt )
ADD_ROLE_PERS    ( person_id, role_code )
I would try something like
Select
  lpc.PERSON_UID,
  -- one row per person: total of that person's still-active certs
  sum( ValCerts.CertCount ) CertCount
From
  ( select
      Cert_id,
      count(*) CertCount
    from
      LDD_CERTS
    where
      Valid_To > sysDate
    group by
      Cert_id ) ValCerts
  JOIN LDD_PERS_CERTS lpc
    on ValCerts.Cert_id = lpc.cert_id
Where
  lpc.CERT_CHANGE_TYPE >= 0
Group By
  lpc.PERSON_UID
Now, if you only care whether a given person is a manager or not, I would pre-query only that, as you were not actually returning a person's SPECIFIC ROLE, just whether they were a manager. My final query might look like
select
  p.PERSON_ID id,
  max( p.SURN_TXT || ' ' || p.NAME_TXT ) Name,
  max( Case when arp.Person_id IS NULL
            then 'no' else 'yes' end ) Manager,
  max( coalesce( certs.CertCount, 0 )) ActiveCertsForUser
from
  RPD_PERSONS p
  LEFT Join ADD_ROLE_PERS arp
    On p.Person_ID = arp.Person_ID
    AND arp.role_code = 'Manager'
  LEFT JOIN
  ( Select
      lpc.PERSON_UID,
      -- one row per person: total of that person's still-active certs
      sum( ValCerts.CertCount ) CertCount
    From
      ( select
          Cert_id,
          count(*) CertCount
        from
          LDD_CERTS
        where
          Valid_To > sysDate
        group by
          Cert_id ) ValCerts
      JOIN LDD_PERS_CERTS lpc
        on ValCerts.Cert_id = lpc.cert_id
        AND lpc.CERT_CHANGE_TYPE >= 0
    Group By
      lpc.PERSON_UID
  ) Certs
    on p.Person_id = Certs.Person_uid
Where
  p.Partic_ID = 1
GROUP BY
  p.PERSON_ID
Now, if p.partic_id = 1 represents only 1 person, then it won't make as much sense to query all people with a given certificate status, etc. But if Partic_id = 1 represents a group of people, such as a given association / division of a company, then it should be fine.
Any questions, let me know and I can revise / update the answer.
CASE issue: there can, presumably, be multiple records in ADD_ROLE_PERS for each person. If a person can have two or more roles running concurrently, then you need to decide what business logic to use to handle this. If a person can only have one active role at a time, presumably there is an "active/disabled" column or effective-date columns that you should be using to identify the active record (or, potentially, there is a data issue).
The subquery will return the same value for every single row in your result set, as it is completely isolated/standalone from your main query. If you want it to produce counts that are relevant to each row, then you will need to connect it to the tables in the main query (look up correlated subqueries if you don't know how to do this).
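To make the difference concrete, here is a small sketch using Python's built-in sqlite3 module. The table and column names are simplified stand-ins for the ones in the question, the link person_uid = person_id is the same guess made above, and SQLite's date('now') plays the role of SYSDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical, simplified versions of RPD_PERSONS, LDD_CERTS, LDD_PERS_CERTS
    create table persons    (person_id integer, name text);
    create table certs      (cert_id integer, valid_to text);
    create table pers_certs (person_uid integer, cert_id integer);

    insert into persons values (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    insert into certs values
        (10, '2999-12-31'),   -- still valid
        (11, '2999-12-31'),   -- still valid
        (12, '2000-01-01');   -- expired
    insert into pers_certs values (1, 10), (1, 11), (1, 12), (2, 12);
""")

# Standalone subquery: no reference to the outer row, so every person
# gets the same total.
uncorrelated = conn.execute("""
    select p.person_id,
           (select count(*)
              from pers_certs pc
              join certs c on c.cert_id = pc.cert_id
             where c.valid_to > date('now')) as active_certs
    from persons p
    order by p.person_id
""").fetchall()

# Correlated subquery: the extra predicate ties the count to the outer person.
correlated = conn.execute("""
    select p.person_id,
           (select count(*)
              from pers_certs pc
              join certs c on c.cert_id = pc.cert_id
             where c.valid_to > date('now')
               and pc.person_uid = p.person_id) as active_certs
    from persons p
    order by p.person_id
""").fetchall()

print(uncorrelated)
print(correlated)
```

The only change between the two queries is the `pc.person_uid = p.person_id` line; without it the inner query never looks at the outer row.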

How do I get a query of clients with many contact numbers? SQL

I have 2 tables, clients and contact numbers. Each client has one or many contact numbers; it's a one-to-many relationship. I need to make an Excel document in which each row has one client and its contact numbers. For example:
client name | contact_number_1 | contact_number_2| ...
I want to make it in POSTGRESQL to be fast. Doesn't matter the way that I make the excel file. I just need the query to make the rest.
Thank you!
If you can parse the result and create the Excel file from there, the most flexible solution is to aggregate the numbers into an array:
select c.client_id,
c.client_name,
array_agg(cn.number) as contact_numbers
from client c
join contact_number cn on cn.client_id = c.client_id
group by c.client_id, c.client_name;
Another alternative is to use string_agg(cn.number, ',') to get a comma-separated list (but the array is more robust against embedded commas in the values).
If you really do need to get the numbers in separate columns, you need to decide on a sensible upper limit of columns, then you can use the first query and extract the array elements as columns:
select client_id,
client_name,
contact_numbers[1] as contact_number_1,
contact_numbers[2] as contact_number_2,
contact_numbers[3] as contact_number_3,
...
from (
select c.client_id,
c.client_name,
array_agg(cn.number) as contact_numbers
from client c
join contact_number cn on cn.client_id = c.client_id
group by c.client_id, c.client_name
) t
If you actually want a dynamic number of columns returned, it gets a bit complicated, because you have to know the maximum number of columns for the returned results, or you hard-code a set number for the highest number you think will exist.
If you can live with having one column represent all of the possible contacts, then you can aggregate them all into a single column:
select c.clientName, STRING_AGG(COALESCE(con.contact_number,''),'|') as contact_numbers
from clients c
left join contacts con on c.clientId = con.clientId
group by c.clientName
order by c.clientName
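If the file is produced by a script anyway, the "dynamic number of columns" part can be left to the client code: fetch one row per client with the numbers as a list, then pad every row to the widest one. A rough sketch in Python, assuming the driver hands the array_agg result back as a Python list (psycopg2 does this for PostgreSQL arrays); the `widen` helper and the sample rows are made up for illustration.

```python
import csv
import io


def widen(rows):
    """Turn (client_name, [numbers]) rows into a header plus equal-width rows."""
    rows = list(rows)
    # The widest client decides how many contact_number_N columns are needed.
    width = max((len(numbers) for _, numbers in rows), default=0)
    header = ["client_name"] + [f"contact_number_{i}" for i in range(1, width + 1)]
    body = [
        [name] + list(numbers) + [""] * (width - len(numbers))
        for name, numbers in rows
    ]
    return [header] + body


# Stand-in for the rows returned by the array_agg query.
sample = [("Acme", ["555-0100", "555-0101"]), ("Bolt", ["555-0199"])]
table = widen(sample)

# CSV opens directly in Excel; a real script would write to a file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(table)
csv_text = buffer.getvalue()
print(csv_text)
```

This keeps the SQL simple and fast (one aggregate query) and moves the column-count decision to where the data is already in memory.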

Finding duplicate entries in any of several columns

I have code that identifies potential duplicate records based on the fact that several rows (with different IDs) have the same value in various other columns. This info gets manually reviewed, so I am not worried about the fact that a husband and wife could legitimately share an email address, for example. An example of the query I am using is this:
SELECT DISTINCT ID, Email
FROM Customers
WHERE Email IS NOT NULL AND Email != '' AND Email IN
(SELECT Email FROM Customers GROUP BY Email HAVING COUNT(DISTINCT ID) > 1)
ORDER BY Email;
Which gives me results like this:
ID   Email
108  bob@hotmail.com
381  bob@hotmail.com
205  mary@gmail.com
772  mary@gmail.com
908  mary@gmail.com
This works great for my purposes, except when I try matching by phone number, which has multiple columns (HomePhone, BusinessPhone, CellPhone). This creates two problems - the first, which has been pretty well documented on this forum, is how to identify rows in which any of three columns contain a matching value (If a value in [row 1 column A, B, or C] matches a column in [row 2 column A, B, or C] then I want to select both rows). The second problem, which I haven't figured out yet and haven't found an answer to, is how to select [ID], [Value that Matched] as my output.
I suppose that I could select all three columns and do some further code magic in my program to make sense of it, but that prevents me from reusing existing code and also seems like the type of hack that a developer would use to keep from admitting that he needs help from a DBA. (Help!) In all seriousness, though, I am stuck trying to find an elegant solution, and any help would be appreciated.
Based on my understanding of the question, you can first use UNION ALL to get the different phone numbers into one column and group by that column to see if there are duplicates. Then join back to the original table to get the customer id.
with cnts as (
select phone
from (select id,homephone phone from customers
union all
select id,businessphone from customers
union all
select id,cellphone from customers) x
group by phone
having count(distinct id) > 1
)
select c.id,cn.phone value_matched
from customers c
join cnts cn on cn.phone in (c.homephone,c.businessphone,c.cellphone)
order by 1,2
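Here is a small runnable check of the unpivot-and-count idea, using Python's built-in sqlite3 module with made-up customers. It is a variation rather than a copy of the query above: it filters out NULL and empty phones before grouping (so blank values are not reported as shared), and it joins back to the unpivoted set instead of the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (id integer, homephone text,
                            businessphone text, cellphone text);

    -- hypothetical sample data
    insert into customers values
        (1, '555-0001', NULL,       '555-0009'),
        (2, '555-0002', '555-0001', NULL),
        (3, '555-0003', NULL,       '555-0009'),
        (4, '555-0004', NULL,       NULL);
""")

matches = conn.execute("""
    with phones as (
        -- unpivot the three phone columns into one
        select id, homephone as phone from customers
        union all
        select id, businessphone from customers
        union all
        select id, cellphone from customers
    ),
    shared as (
        -- phones used by more than one customer id
        select phone
        from phones
        where phone is not null and phone <> ''
        group by phone
        having count(distinct id) > 1
    )
    select distinct p.id, p.phone as value_matched
    from phones p
    join shared s on s.phone = p.phone
    order by value_matched, p.id
""").fetchall()

for row in matches:
    print(row)
```

The output has the same [ID], [value that matched] shape as the email query in the question, so the existing review code could consume it unchanged.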
I would do this with apply:
select c.*
from (select c.*, v.phone, count(*) over (partition by v.phone) as cnt
      from customers c cross apply
           (select distinct v.phone
            from (values (homephone), (businessphone), (cellphone)
                 ) v(phone)
            where v.phone is not null
           ) v(phone)
     ) c
where cnt > 1
order by phone;
The innermost subquery selects the distinct phones for each customer. The count(*) over . . . then counts the number of times that the phone appears (which because of the distinct is for different customers). The final where chooses phones that appear for multiple customers.

SQL JOIN returning multiple rows when I only want one row

I am having a slow brain day...
The tables I am joining:
Policy_Office:
PolicyNumber  OfficeCode
1             A
2             B
3             C
4             D
5             A
Office_Info:
OfficeCode  AgentCode  OfficeName
A           123        Acme
A           456        Acme
A           789        Acme
B           111        Ace
B           222        Ace
B           333        Ace
...         ...        ...
I want to perform a search to return all policies that are affiliated with an office name. For example, if I search for "Acme", I should get two policies: 1 & 5.
My current query looks like this:
SELECT
*
FROM
Policy_Office P
INNER JOIN Office_Info O ON P.OfficeCode = O.OfficeCode
WHERE
O.OfficeName = 'Acme'
But this query returns multiple rows, which I know is because there are multiple matches from the second table.
How do I write the query to only return two rows?
SELECT DISTINCT a.PolicyNumber
FROM Policy_Office a
INNER JOIN Office_Info b
ON a.OfficeCode = b.OfficeCode
WHERE b.officeName = 'Acme'
To further gain more knowledge about joins, kindly visit the link below:
Visual Representation of SQL Joins
A simple join returns every matching pair of rows: there are 2 rows with A in the first table and 3 rows with A in the second, so you get 2 x 3 = 6 results. If you want only the policy number, then you should apply DISTINCT to it.
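The row multiplication is easy to reproduce with the sample data from the question. The sketch below uses Python's built-in sqlite3 module and compares the plain join, the DISTINCT version, and an EXISTS variant that never produces the duplicates in the first place (the EXISTS form is my addition, not something from the answers above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table policy_office (policynumber integer, officecode text);
    create table office_info   (officecode text, agentcode integer, officename text);

    insert into policy_office values (1,'A'), (2,'B'), (3,'C'), (4,'D'), (5,'A');
    insert into office_info values
        ('A', 123, 'Acme'), ('A', 456, 'Acme'), ('A', 789, 'Acme'),
        ('B', 111, 'Ace'),  ('B', 222, 'Ace'),  ('B', 333, 'Ace');
""")

# Plain join: each of the 2 'A' policies pairs with each of the 3 'A' agents.
joined = conn.execute("""
    select p.policynumber
    from policy_office p
    join office_info o on p.officecode = o.officecode
    where o.officename = 'Acme'
    order by p.policynumber
""").fetchall()

# DISTINCT collapses the duplicates after the join.
distinct_rows = conn.execute("""
    select distinct p.policynumber
    from policy_office p
    join office_info o on p.officecode = o.officecode
    where o.officename = 'Acme'
    order by p.policynumber
""").fetchall()

# EXISTS only asks whether a matching office row is there at all.
exists_rows = conn.execute("""
    select p.policynumber
    from policy_office p
    where exists (select 1
                  from office_info o
                  where o.officecode = p.officecode
                    and o.officename = 'Acme')
    order by p.policynumber
""").fetchall()

print(len(joined), distinct_rows, exists_rows)
```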
(using MS-Sqlserver)
I know this thread is 10 years old, but I don't like distinct (in my head it means that the engine gathers all possible data, computes every selected row in each record into a hash and adds it to a tree ordered by that hash; I may be wrong, but it seems inefficient).
Instead, I use CTE and the function row_number(). The solution may very well be a much slower approach, but it's pretty, easy to maintain and I like it:
Given are a person table and a telephone table tied together with a foreign key (in the telephone table). This construct means that a person can have several numbers, but I only want the first, so that each person appears only once in the result set (I ought to be able to concatenate multiple telephone numbers into one string (pivot, I think), but that's another issue).
; -- don't forget this one!
with telephonenumbers
as
(
select [id]
, [person_id]
, [number]
, row_number() over (partition by [person_id] order by [activestart] desc) as rowno
from [dbo].[telephone]
where ([activeuntil] is null or [activeuntil] > getdate())
)
select p.[id]
,p.[name]
,t.[number]
from [dbo].[person] p
left join telephonenumbers t on t.person_id = p.id
and t.rowno = 1
This does the trick (in fact the last line does), and the syntax is readable and easy to expand. The example is simple, but when creating large scripts that join tables left and right (literally), it is difficult to keep unwanted duplicates out of the result, and difficult to identify which tables create them. CTEs work great for me.
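The same pattern can be tried outside SQL Server. Below is a minimal sketch with Python's built-in sqlite3 module (SQLite 3.25+ for row_number()); the tables are cut down to the columns that matter and the rows are invented, so it only illustrates the "number the rows, then join on rowno = 1" step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table person    (id integer, name text);
    create table telephone (id integer, person_id integer,
                            number text, activestart text);

    -- hypothetical sample data: Ann has two numbers, Bob has none
    insert into person values (1, 'Ann'), (2, 'Bob');
    insert into telephone values
        (1, 1, '555-0001', '2020-01-01'),
        (2, 1, '555-0002', '2023-01-01');
""")

one_per_person = conn.execute("""
    with numbered as (
        -- newest number per person gets rowno = 1
        select person_id,
               number,
               row_number() over (partition by person_id
                                  order by activestart desc) as rowno
        from telephone
    )
    select p.id, p.name, t.number
    from person p
    left join numbered t
           on t.person_id = p.id
          and t.rowno = 1
    order by p.id
""").fetchall()

print(one_per_person)
```

Putting `t.rowno = 1` in the ON clause rather than in a WHERE clause is what keeps people without any telephone row in the result.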

Select based on the number of appearances of an id in another table

I have a table B with cids and cities. I also have a table C that has these cids with extra information. I want to list all the cids in table C that are associated with ALL appearances of a given city in Table B.
My current solution relies on counting the number of times the given city appears in Table B and selecting only the cids that appear that many times. I don't know all the SQL syntax yet, but is there a way to select for this kind of pattern?
My current solution:
SELECT Agents.aid
FROM Agents, Customers, Orders
WHERE (Customers.city='Duluth')
AND (Agents.aid = Orders.aid)
AND (Customers.cid = Orders.cid)
GROUP BY Agents.aid
HAVING count(Agents.aid) > 1
It only works because I hard-coded the count I happen to know right now into the HAVING clause.
Thanks for the help. I wasn't sure how to google this problem, since it's pretty specific.
EDIT: I'm pinpointing my problem a bit. I need to know how to determine if EVERY row in a table has a certain value for a field. Declaring a variable and counting the rows in a sub-selection and filtering out my results by IDs that appear that many times works, but it's really ugly.
There HAS to be a way to do this without explicitly count()ing rows. I hope.
Not an answer to your question, but a general improvement.
I'd recommend using JOIN syntax to join your tables together.
This would change your query to be:
SELECT Agents.aid
FROM Agents
INNER JOIN Orders
ON Agents.aid = Orders.aid
INNER JOIN Customers
ON Customers.cid = Orders.cid
WHERE Customers.city='Duluth'
GROUP BY Agents.aid
HAVING count(Agents.aid) > 1
What variant of SQL are you using?
To start with, you can (and should) use JOIN instead of doing it in the WHERE clause, e.g.,
select Agents.aid
from Agents
join Orders on Agents.aid = Orders.aid
join Customers on Customers.cid = Orders.cid
where Customers.city = 'Duluth'
group by Agents.aid
having count(Agents.aid) > 1
After that, I'm afraid I might be a little lost. Using the table names in your example query, what (in English, not pseudocode) are you trying to retrieve? For example, I think your sample query is retrieving the PK for all Agents that have been involved in at least 2 Orders involving Customers in Duluth.
Also, some table definitions for Agents, Orders, and Customers might help (then again, they might be irrelevant).
I'm not sure if I understood your problem, but I think the following query is what you want:
SELECT *
FROM customers b
INNER JOIN orders c USING (cid)
WHERE b.city = 'Duluth'
AND NOT EXISTS (SELECT 1
FROM customers b2
WHERE b2.city = b.city
AND b2.cid <> b.cid);
Probably you will need some indexes on these columns.
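If the goal is "agents who have an order from EVERY customer in a given city" (that reading of the question is my assumption), the usual count-free way to say it is a double NOT EXISTS: keep an agent when there is no customer in that city for whom no order by that agent exists. A small sketch with Python's built-in sqlite3 module and invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table customers (cid text, city text);
    create table agents    (aid text);
    create table orders    (aid text, cid text);

    -- hypothetical sample data
    insert into customers values ('c1', 'Duluth'), ('c2', 'Duluth'), ('c3', 'Dallas');
    insert into agents values ('a1'), ('a2'), ('a3');
    insert into orders values
        ('a1', 'c1'), ('a1', 'c2'),   -- a1 covers both Duluth customers
        ('a2', 'c1'), ('a2', 'c3'),   -- a2 misses c2
        ('a3', 'c3');                 -- a3 has no Duluth orders
""")

covering_agents = conn.execute("""
    select a.aid
    from agents a
    where not exists (
        -- a Duluth customer ...
        select 1
        from customers c
        where c.city = 'Duluth'
          and not exists (
              -- ... with no order placed through this agent
              select 1
              from orders o
              where o.aid = a.aid
                and o.cid = c.cid
          )
    )
    order by a.aid
""").fetchall()

print(covering_agents)
```

One caveat: if the city has no customers at all, every agent passes the test, because there is nothing to miss.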