How can I choose the closest match in SQL Server 2005? - sql

In SQL Server 2005, I have a table of input coming in of successful sales, and a variety of tables with information on known customers, and their details. For each row of sales, I need to match 0 or 1 known customers.
We have the following information coming in from the sales table:
ServiceId,
Address,
ZipCode,
EmailAddress,
HomePhone,
FirstName,
LastName
The customers information includes all of this, as well as a 'LastTransaction' date.
Any of these fields can map back to 0 or more customers. We count a match as being any time that a ServiceId, Address+ZipCode, EmailAddress, or HomePhone in the sales table exactly matches a customer.
The problem is that we have information on many customers, sometimes multiple in the same household. This means that we might have John Doe, Jane Doe, Jim Doe, and Bob Doe in the same house. They would all match on Address+ZipCode and HomePhone, and possibly more than one of them would match on ServiceId as well.
I need some way to elegantly keep track of, in a transaction, the 'best' match of a customer. If one matches 6 fields, and the others only match 5, that customer should be kept as a match to that record. In the case of multiple matching 5, and none matching more, the most recent LastTransaction date should be kept.
Any ideas would be quite appreciated.
Update: To be a little more clear, I am looking for a good way to verify the number of exact matches in the row of data, and choose which rows to associate based on that information. If the last name is 'Doe', it must exactly match the customer last name, to count as a matching parameter, rather than be a very close match.

For SQL Server 2005 and up, try:
;WITH SalesScore AS (
SELECT
s.PK_ID as S_PK
,c.PK_ID AS c_PK
,CASE
WHEN c.PK_ID IS NULL THEN 0
ELSE CASE WHEN s.ServiceId=c.ServiceId THEN 1 ELSE 0 END
+CASE WHEN (s.Address=c.Address AND s.ZipCode=c.ZipCode) THEN 1 ELSE 0 END
+CASE WHEN s.EmailAddress=c.EmailAddress THEN 1 ELSE 0 END
+CASE WHEN s.HomePhone=c.HomePhone THEN 1 ELSE 0 END
END AS Score
FROM Sales s
LEFT OUTER JOIN Customers c ON s.ServiceId=c.ServiceId
OR (s.Address=c.Address AND s.ZipCode=c.ZipCode)
OR s.EmailAddress=c.EmailAddress
OR s.HomePhone=c.HomePhone
)
SELECT
s.*,c.*
FROM (SELECT
S_PK,MAX(Score) AS Score
FROM SalesScore
GROUP BY S_PK
) dt
INNER JOIN Sales s ON dt.s_PK=s.PK_ID
INNER JOIN SalesScore ss ON dt.S_PK=ss.S_PK AND dt.Score=ss.Score
LEFT OUTER JOIN Customers c ON ss.c_PK=c.PK_ID
EDIT
I hate to write so much actual code when no schema was given, because I can't actually run this and be sure it works. However, to answer the question of how to handle ties using the last transaction date, here is a newer version of the above code:
;WITH SalesScore AS (
SELECT
s.PK_ID as S_PK
,c.PK_ID AS c_PK
,CASE
WHEN c.PK_ID IS NULL THEN 0
ELSE CASE WHEN s.ServiceId=c.ServiceId THEN 1 ELSE 0 END
+CASE WHEN (s.Address=c.Address AND s.ZipCode=c.ZipCode) THEN 1 ELSE 0 END
+CASE WHEN s.EmailAddress=c.EmailAddress THEN 1 ELSE 0 END
+CASE WHEN s.HomePhone=c.HomePhone THEN 1 ELSE 0 END
END AS Score
FROM Sales s
LEFT OUTER JOIN Customers c ON s.ServiceId=c.ServiceId
OR (s.Address=c.Address AND s.ZipCode=c.ZipCode)
OR s.EmailAddress=c.EmailAddress
OR s.HomePhone=c.HomePhone
)
SELECT
*
FROM (SELECT
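-- explicit columns: s.* plus c.* would repeat column names such as PK_ID, which a derived table does not allow
s.PK_ID AS S_PK,ss.c_PK,ss.Score,c.LastTransaction
,row_number() over(partition by s.PK_ID order by c.LastTransaction DESC) AS RankValue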
FROM (SELECT
S_PK,MAX(Score) AS Score
FROM SalesScore
GROUP BY S_PK
) dt
INNER JOIN Sales s ON dt.s_PK=s.PK_ID
INNER JOIN SalesScore ss ON dt.S_PK=ss.S_PK AND dt.Score=ss.Score
LEFT OUTER JOIN Customers c ON ss.c_PK=c.PK_ID
) dt2
WHERE dt2.RankValue=1
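The score-then-tie-break idea above can be tried end to end on a toy data set. This is only a minimal sketch, assuming Python's built-in sqlite3 module (SQLite 3.25 or later for ROW_NUMBER) in place of SQL Server 2005; the sample rows and the PK_ID key columns are hypothetical, and only the four match rules named in the question are scored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales (PK_ID INTEGER PRIMARY KEY, ServiceId TEXT, Address TEXT,
                    ZipCode TEXT, EmailAddress TEXT, HomePhone TEXT);
CREATE TABLE Customers (PK_ID INTEGER PRIMARY KEY, ServiceId TEXT, Address TEXT,
                        ZipCode TEXT, EmailAddress TEXT, HomePhone TEXT,
                        LastTransaction TEXT);
INSERT INTO Sales VALUES
  (1, 'S1', '1 Main St', '12345', 'john@example.com',   '555-0100'),
  (2, 'S9', '9 Elm St',  '99999', 'nobody@example.com', '555-0199'),
  (3, 'S8', '1 Main St', '12345', 'x@example.com',      '555-0100');
INSERT INTO Customers VALUES
  (10, 'S1', '1 Main St', '12345', 'john@example.com', '555-0100', '2009-01-01'),
  (11, 'S2', '1 Main St', '12345', 'jane@example.com', '555-0100', '2009-06-01'),
  (12, 'S3', '1 Main St', '12345', 'jim@example.com',  '555-0100', '2009-03-01');
""")

best = conn.execute("""
WITH Scored AS (
    -- one row per (sale, candidate customer), with the number of exact matches
    SELECT s.PK_ID AS S_PK, c.PK_ID AS C_PK, c.LastTransaction,
           CASE WHEN s.ServiceId = c.ServiceId THEN 1 ELSE 0 END
         + CASE WHEN s.Address = c.Address AND s.ZipCode = c.ZipCode THEN 1 ELSE 0 END
         + CASE WHEN s.EmailAddress = c.EmailAddress THEN 1 ELSE 0 END
         + CASE WHEN s.HomePhone = c.HomePhone THEN 1 ELSE 0 END AS Score
    FROM Sales s
    LEFT JOIN Customers c
           ON s.ServiceId = c.ServiceId
           OR (s.Address = c.Address AND s.ZipCode = c.ZipCode)
           OR s.EmailAddress = c.EmailAddress
           OR s.HomePhone = c.HomePhone
), Ranked AS (
    -- highest score first, most recent LastTransaction breaks ties
    SELECT S_PK, C_PK, Score,
           ROW_NUMBER() OVER (PARTITION BY S_PK
                              ORDER BY Score DESC, LastTransaction DESC) AS rn
    FROM Scored
)
SELECT S_PK, C_PK, Score FROM Ranked WHERE rn = 1 ORDER BY S_PK
""").fetchall()

for row in best:
    print(row)
```

Putting both sort keys into a single ROW_NUMBER ordering avoids the separate MAX(Score) step; a sale with no candidate at all still comes back, with a NULL customer key.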

Here's a fairly ugly way to do this, using SQL Server code. Assumptions:
- Column CustomerId exists in the Customer table, to uniquely identify customers.
- Only exact matches are supported (as implied by the question).
SELECT top 1 CustomerId, LastTransaction, count(*) HowMany
from (select Customerid, LastTransaction
from Sales sa
inner join Customers cu
on cu.ServiceId = sa.ServiceId
union all select Customerid, LastTransaction
from Sales sa
inner join Customers cu
on cu.EmailAddress = sa.EmailAddress
union all select Customerid, LastTransaction
from Sales sa
inner join Customers cu
on cu.Address = sa.Address
and cu.ZipCode = sa.ZipCode
union all [etcetera -- repeat for each possible link]
) xx
group by CustomerId, LastTransaction
order by count(*) desc, LastTransaction desc
I dislike using "top 1", but it is quicker to write. (The alternative is to use ranking functions, and that would require either another subquery level or implementing it as a CTE.) Of course, if your tables are large this would fly like a cow unless you had indexes on all your columns.

Frankly I would be wary of doing this at all as you do not have a unique identifier in your data.
John Smith lives with his son John Smith and they both use the same email address and home phone. These are two people but you would match them as one. We run into this all the time with our data and have no solution for automated matching because of it. We identify possible dups and actually physically call and find out if they are dups.

I would probably create a stored function for that (in Oracle) and order by the highest match count:
SELECT * FROM (
SELECT c.*, MATCH_CUSTOMER( c.Id, par1, par2, par3 ) matches FROM Customer c
) WHERE matches >0 ORDER BY matches desc
The function match_customer returns the number of matches based on the input parameters... I guess it is probably slow, as this query will always scan the complete customer table.

For close matches you can also look at a number of string similarity algorithms.
For example, in Oracle there is the UTL_MATCH.JARO_WINKLER_SIMILARITY function:
http://www.psoug.org/reference/utl_match.html

There is also the Levenshtein distance algorithm.
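For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal sketch in Python, using the standard two-row dynamic programming formulation (the function name is just illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from the first i characters of a to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


distance = levenshtein("Doe", "Dow")
print(distance)
```

A distance of 0 means an exact match, so a threshold on this value is one way to define "close", if near matches are ever wanted.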

Related

SQL COUNT By month

I have the following table that registers the days students went to school. I need to count the days they are PRESENT, but also need to count the total of school days for each month. (When ASISTENCIA is either 0 or 1)
This is what I have so far, but it doesn't count the total.
BEGIN
SELECT
u.user_id user_id,
u.user_first_name as names,
u.user_last_name_01 as lastname1,
u.user_last_name_02 as lastname2,
MONTH(a.FECHA_ASISTENCIA) month,
COUNT(*) as absent_days,
p.PHONE as phone,
p.CITY as city,
#EDUCATION_LEVEL_ID
FROM
users u
inner join asistencia a ON u.user_id = a.USER_ID
inner join profile p ON u.rut_SF = p.RUT_SF
WHERE
a.ASISTENCIA = 0 -- NOT PRESENT
AND a.EDUCATION_LEVEL_ID = #EDUCATION_LEVEL_ID
AND YEAR(a.FECHA_ASISTENCIA) = #EDUCATION_LEVEL_YEAR
GROUP BY
u.user_id,
u.user_first_name,
u.user_last_name_01,
u.user_last_name_02,
MONTH(a.FECHA_ASISTENCIA),
p.PHONE,
p.CITY
ORDER BY month
END
ATTENDANCE
USER_ID  DATE        ATTENDANCE  EDUCATION_LEVEL_ID
-------  ----------  ----------  ------------------
123      2021-04-13  0           1
123      2021-04-14  1           1
DESIRED OUTPUT
names  lastname1  lastname2  month  absent_days  total_class_days  city
-----  ---------  ---------  -----  -----------  ----------------  ----
JOHN   SMITH      SMITH      3      10           24                CITY
JOHN   SMITH      SMITH      4      8            24                CITY
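Since every school day already has a row with ASISTENCIA set to 0 or 1, one possible way to get both numbers in a single pass is conditional aggregation: count all rows for the total, and sum a CASE expression for the absences, instead of filtering on ASISTENCIA = 0 in the WHERE clause. A minimal sketch using Python's built-in sqlite3 module, with hypothetical sample rows; strftime stands in for MONTH() and YEAR(), which SQLite does not have, and the joins to users and profile are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE asistencia (USER_ID INT, FECHA_ASISTENCIA TEXT,
                         ASISTENCIA INT, EDUCATION_LEVEL_ID INT);
INSERT INTO asistencia VALUES
  (123, '2021-04-13', 0, 1),
  (123, '2021-04-14', 1, 1),
  (123, '2021-04-15', 0, 1),
  (123, '2021-05-03', 1, 1);
""")

monthly = conn.execute("""
SELECT USER_ID,
       CAST(strftime('%m', FECHA_ASISTENCIA) AS INTEGER) AS month,
       SUM(CASE WHEN ASISTENCIA = 0 THEN 1 ELSE 0 END)   AS absent_days,
       COUNT(*)                                          AS total_class_days
FROM asistencia
WHERE EDUCATION_LEVEL_ID = ?
  AND strftime('%Y', FECHA_ASISTENCIA) = ?
GROUP BY USER_ID, month
ORDER BY USER_ID, month
""", (1, "2021")).fetchall()

for row in monthly:
    print(row)
```

The key change from the original query is that the ASISTENCIA = 0 condition moves out of WHERE and into the CASE, so present days still contribute to the total.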
Without examples of what these tables look like, it is hard to give you a solid answer. However, it appears as though your biggest challenge is that ABSTENTIA does not exist.
This is a common problem for analysis - you need to create rows that do not exist (when the user was absent).
The general approach is to:
1. Create a list of unique users.
2. Create a list of the unique dates you care about.
3. Cross join (Cartesian join) these to create every possible combination of user and date.
4. Outer join the result of step 3 to your attendance data so you can populate a PRESENT flag; you can then see both who was present and who was not.
5. Filter out rows which don't apply (for instance, if a user joined on 3/4/2021, ignore the blank rows before this date).
You can accomplish this with some SQL that looks like this:
WITH GLOBAL_SPINE AS (
SELECT
ROW_NUMBER() OVER (ORDER BY NULL) as INTERVAL_ID,
DATEADD('DAY', (INTERVAL_ID - 1), '2021-01-01'::timestamp_ntz) as SPINE_START,
DATEADD('DAY', INTERVAL_ID, '2021-01-01'::timestamp_ntz) as SPINE_END
FROM TABLE (GENERATOR(ROWCOUNT => 365))
),
GROUPS AS (
SELECT
USERID,
MIN(DESIRED_INTERVAL) AS LOCAL_START,
MAX(DESIRED_INTERVAL) AS LOCAL_END
FROM RASGO.PUBLIC.RASGO_SDK__OP4__AGGREGATE_TRANSFORM__8AB1FEDF90
GROUP BY
USERID
),
GROUP_SPINE AS (
SELECT
USERID,
SPINE_START AS GROUP_START,
SPINE_END AS GROUP_END
FROM GROUPS G
CROSS JOIN LATERAL (
SELECT
SPINE_START, SPINE_END
FROM GLOBAL_SPINE S
WHERE S.SPINE_START BETWEEN G.LOCAL_START AND G.LOCAL_END
)
)
SELECT
G.USERID AS GROUP_BY_USERID,
GROUP_START,
GROUP_END,
T.*
FROM GROUP_SPINE G
LEFT JOIN {{ your_table }} T
ON DESIRED_INTERVAL >= G.GROUP_START
AND DESIRED_INTERVAL < G.GROUP_END
AND G.USERID = T.USERID;
The above script works on Snowflake, but the syntax might be slightly different depending on your RDBMS. There are also some other tweaks you can make regarding when you insert the blank rows, I chose 'local' which means that we begin inserting rows for each user on their very first day. You could change this to global if you wanted to populate data from every single day between 1/1/2021 and 1/1/2022.
I used Rasgo to generate this SQL
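The same cross join plus outer join idea can be shown on a much smaller scale. This is a rough sketch, assuming Python's built-in sqlite3 module and three hypothetical tables (users, calendar, attendance) where a missing attendance row means the user was absent that day; none of these names come from the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INT);
CREATE TABLE calendar (day TEXT);                -- the dates you care about
CREATE TABLE attendance (user_id INT, day TEXT); -- one row per day present
INSERT INTO users VALUES (1), (2);
INSERT INTO calendar VALUES ('2021-04-13'), ('2021-04-14'), ('2021-04-15');
INSERT INTO attendance VALUES (1, '2021-04-13'), (1, '2021-04-15'), (2, '2021-04-14');
""")

summary = conn.execute("""
SELECT u.user_id,
       COUNT(a.user_id)            AS present_days,  -- matched rows only
       COUNT(*) - COUNT(a.user_id) AS absent_days    -- spine rows with no match
FROM users u
CROSS JOIN calendar d                                -- every user/date combination
LEFT JOIN attendance a
       ON a.user_id = u.user_id AND a.day = d.day
GROUP BY u.user_id
ORDER BY u.user_id
""").fetchall()

for row in summary:
    print(row)
```

COUNT(a.user_id) skips the NULLs produced by the outer join, which is what separates present days from the generated "blank" rows.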

SQL aggregate functions and sorting

I am still new to SQL and getting my head around the whole sub-query aggregation to display some results and was looking for some advice:
The tables might look something like:
Customer: (custID, name, address)
Account: (accountID, reward_balance)
Shop: (shopID, name, address)
Relational tables:
Holds (custID*, accountID*)
With (accountID*, shopID*)
How can I find the store that has the least reward_balance?
(The customer info is not required at this point)
I tried:
SELECT accountID AS ACCOUNT_ID, shopID AS SHOP_ID, MIN(reward_balance) AS LOWEST_BALANCE
FROM Account, Shop, With
WHERE With.accountID = Account.accountID
AND With.shopID=Shop.shopID
GROUP BY
Account.accountID,
Shop.shopID
ORDER BY MIN(reward_balance);
This works in a way that is not intended:
ACCOUNT_ID | SHOP_ID | LOWEST_BALANCE
1 | 1 | 10
2 | 2 | 40
3 | 3 | 100
4 | 4 | 1000
5 | 4 | 5000
As you can see Shop_ID 4 actually has a balance of 6000 (1000+5000) as there are two customers registered with it. I think I need to SUM the lowest balance of the shops based on their balance and display it from low-high.
I have been trying to aggregate the data prior to display but this is where I come unstuck:
SELECT shopID AS SHOP_ID, MIN(reward_balance) AS LOWEST_BALANCE
FROM (SELECT accountID, shopID, SUM(reward_balance)
FROM Account, Shop, With
WHERE
With.accountID = Account.accountID
AND With.shopID=Shop.shopID
GROUP BY
Account.accountID,
Shop.shopID;
When I run something like this statement I get an invalid identifier error.
Error at Command Line : 1 Column : 24
Error report -
SQL Error: ORA-00904: "REWARD_BALANCE": invalid identifier
00904. 00000 - "%s: invalid identifier"
So I figured I might have my joining condition incorrect and the aggregate sorting incorrect, and would really appreciate any general advice.
Thanks for the lengthy read!
Approach this problem one step at a time.
We're going to assume (and we should probably check this) that by least reward_balance, that refers to the total of all reward_balance associated with a shop. And we're not just looking for the shop that has the lowest individual reward balance.
First, get all of the individual "reward_balance" for each shop. Looks like the query would need to involve three tables...
SELECT s.shop_id
, a.reward_balance
FROM `shop` s
LEFT
JOIN `with` w
ON w.shop_id = s.shop_id
LEFT
JOIN `account` a
ON a.account_id = w.account_id
That will get us the detail rows, every shop along with the individual reward_balance amounts associated with the shop, if there are any. (We're using outer joins for this query, because we don't see any guarantee that a shop is going to be related to at least one account. Even if it's true for this use case, that's not always true in the more general case.)
Once we have the individual amounts, the next step is to total them for each shop. We can do that using a GROUP BY clause and a SUM() aggregate.
SELECT s.shop_id
, SUM(a.reward_balance) AS tot_reward_balance
FROM `shop` s
LEFT
JOIN `with` w
ON w.shop_id = s.shop_id
LEFT
JOIN `account` a
ON a.account_id = w.account_id
GROUP BY s.shop_id
At this point, with MySQL we could add an ORDER BY clause to arrange the rows in ascending order of tot_reward_balance, and add a LIMIT 1 clause if we only want to return a single row. We can also handle the case when tot_reward_balance is NULL, assigning a zero in place of the NULL.
SELECT s.shop_id
, IFNULL(SUM(a.reward_balance),0) AS tot_reward_balance
FROM `shop` s
LEFT
JOIN `with` w
ON w.shop_id = s.shop_id
LEFT
JOIN `account` a
ON a.account_id = w.account_id
GROUP BY s.shop_id
ORDER BY tot_reward_balance ASC, s.shop_id ASC
LIMIT 1
If there are two (or more) shops with the same least value of tot_reward_balance, this query returns only one of those shops.
Oracle doesn't have a LIMIT clause like MySQL, but we can get an equivalent result using an analytic function (not available in MySQL before version 8.0). We also replace the MySQL IFNULL() function with the Oracle equivalent, NVL()...
SELECT r.shop_id
, r.tot_reward_balance
FROM (
SELECT v.shop_id
, v.tot_reward_balance
, ROW_NUMBER() OVER (ORDER BY v.tot_reward_balance ASC, v.shop_id ASC) AS rn
FROM (
SELECT s.shop_id
, NVL(SUM(a.reward_balance),0) AS tot_reward_balance
FROM shop s
LEFT
JOIN with w
ON w.shop_id = s.shop_id
LEFT
JOIN account a
ON a.account_id = w.account_id
GROUP BY s.shop_id
) v
) r
-- rn is an analytic-function alias, so it is filtered in an outer query; HAVING cannot reference it
WHERE r.rn = 1
Like the MySQL query, this returns at most one row, even when two or more shops have the same "least" total of reward_balance.
If we want to return all of the shops that have the lowest tot_reward_balance, we need to take a slightly different approach.
The best approach to building queries is stepwise refinement; in this case, start by getting all of the individual reward_balance values for each shop. The next step is to aggregate the individual reward_balance values into a total. The final step is to pick out the row(s) with the lowest total reward_balance.
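One possible shape for the "return all tied shops" variant is to compute the per-shop totals once and then keep every row whose total equals the minimum. A minimal sketch using Python's built-in sqlite3 module; the balances for accounts 1 to 5 follow the question, while shop 5 and account 6 are hypothetical additions to create a tie. The table name is quoted because WITH is a keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Account (accountID INT, reward_balance INT);
CREATE TABLE Shop (shopID INT);
CREATE TABLE "With" (accountID INT, shopID INT);
INSERT INTO Account VALUES (1, 10), (2, 40), (3, 100), (4, 1000), (5, 5000), (6, 10);
INSERT INTO Shop VALUES (1), (2), (3), (4), (5);
INSERT INTO "With" VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 4), (6, 5);
""")

lowest = conn.execute("""
WITH totals AS (
    -- step 1 and 2: detail rows, then one total per shop
    SELECT s.shopID, COALESCE(SUM(a.reward_balance), 0) AS tot_reward_balance
    FROM Shop s
    LEFT JOIN "With" w ON w.shopID = s.shopID
    LEFT JOIN Account a ON a.accountID = w.accountID
    GROUP BY s.shopID
)
-- step 3: keep every shop whose total equals the smallest total
SELECT shopID, tot_reward_balance
FROM totals
WHERE tot_reward_balance = (SELECT MIN(tot_reward_balance) FROM totals)
ORDER BY shopID
""").fetchall()

for row in lowest:
    print(row)
```

Because of the outer joins, a shop with no accounts would get a total of 0 and would therefore be reported as the lowest; whether that is wanted depends on the use case.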
In SQL Server, you can try using a CTE:
;with cte_minvalue as
(
select rank() over (order by Sum_Balance) as RowRank,
ShopId,
Sum_Balance
from (SELECT Shop.shopID, SUM(reward_balance) AS Sum_Balance
FROM
[With] -- WITH is a reserved keyword in T-SQL, so the table name has to be bracketed
JOIN Shop ON [With].ShopId = Shop.ShopId
JOIN Account ON [With].AccountId = Account.AccountId
GROUP BY
Shop.shopID)ShopSum
)
select ShopId, Sum_Balance from cte_minvalue where RowRank = 1

SQL: multiple counts from same table

I am having a real problem trying to get a query with the data I need. I have tried a few methods without success. I can get the data with 4 separate queries, I just can't get them into 1 query. All data comes from 1 table. I will list as much info as I can.
My data looks like this. I have a customerID and 3 columns that record who has worked on the record for that customer as well as the assigned acct manager
RecID  Customer  CreatedBy    LastUser     AcctMan
-----  --------  -----------  -----------  -----------
1      1374      Bob Jones    Mary Willis  Bob Jones
2      1375      Mary Willis  Bob Jones    Bob Jones
3      1376      Jay Scott    Mary Willis  Mary Willis
4      1377      Jay Scott    Mary Willis  Jay Scott
5      1378      Bob Jones    Jay Scott    Jay Scott
I want the query to return the following data. See below for a description of how each is obtained.
Employee     Created  Modified  Mod Own  Created Own
-----------  -------  --------  -------  -----------
Bob Jones    2        1         1        1
Mary Willis  1        2         1        0
Jay Scott    2        1         1        1
Created = Number of records created by each Employee
Modified = Number of records where the Employee is listed as LastUser (except where they created the record)
Mod Own = Number of records for each Employee where LastUser = AcctMan (account manager)
Created Own = Number of records created by the Employee where they are the account manager for that customer
I can get each of these from a query, just need to somehow combine them:
Select CreatedBy, COUNT(CreatedBy) as Created
FROM [dbo].[Cust_REc] GROUP By CreatedBy
Select LastUser, COUNT(LastUser) as Modified
FROM [dbo].[Cust_REc] Where LastUser != CreatedBy GROUP By LastUser
Select AcctMan, COUNT(AcctMan) as CreatePort
FROM [dbo].[Cust_REc] Where AcctMan = CreatedBy GROUP By AcctMan
Select AcctMan, COUNT(AcctMan) as ModPort
FROM [dbo].[Cust_REc] Where AcctMan = LastUser AND NOT AcctMan = CreatedBy GROUP By AcctMan
Can someone see a way to do this? I may have to join the table to itself, but my attempts have not given me the correct data.
The following will give you the results you're looking for.
select
e.employee,
create_count=(select count(*) from [dbo].[Cust_REc] c where c.CreatedBy=e.employee),
mod_count=(select count(*) from [dbo].[Cust_REc] c where c.LastUser=e.employee and c.CreatedBy<>e.employee),
create_own_count=(select count(*) from [dbo].[Cust_REc] c where c.CreatedBy=e.employee and c.AcctMan=e.employee),
mod_own_count=(select count(*) from [dbo].[Cust_REc] c where c.LastUser=e.employee and c.AcctMan=e.employee and c.CreatedBy<>e.employee)
from (
select employee=CreatedBy from [dbo].[Cust_REc]
union
select employee=LastUser from [dbo].[Cust_REc]
union
select employee=AcctMan from [dbo].[Cust_REc]
) e
Note: there are other approaches that are more efficient than this but potentially far more complex as well. Specifically, I would bet there is a master Employee table somewhere that would prevent you from having to do the inline view just to get the list of names.
This seems pretty straightforward. Try this:
select a.employee,b.created,c.modified ....
from (select distinct created_by as employee from data) as a
inner join
(select created_by,count(*) as created from data group by created_by) as b
on a.employee = b.created_by
inner join ....
This highly inefficient query may be a rough start to what you are looking for. Once you validate the data then there are things you can do to tidy it up and make it more efficient.
Also, I don't think you need the DISTINCT on the UNION part because the UNION will return DISTINCT values unless UNION ALL is specified.
SELECT
Employees.EmployeeID,
Created =(SELECT COUNT(*) FROM Cust_Rec WHERE Cust_Rec.CreatedBy=Employees.EmployeeID),
Modified =(SELECT COUNT(*) FROM Cust_Rec WHERE Cust_Rec.LastUser=Employees.EmployeeID AND Cust_Rec.CreatedBy<>Employees.EmployeeID),
ModOwn =
CASE WHEN Employees.IsManager=0 THEN NULL ELSE
(SELECT COUNT(*) FROM Cust_Rec WHERE AcctMan=Employees.EmployeeID AND LastUser=Employees.EmployeeID)
END,
CreatedOwn=(SELECT COUNT(*) FROM Cust_Rec WHERE AcctMan=Employees.EmployeeID AND CreatedBy=Employees.EmployeeID)
FROM
(
SELECT
EmployeeID,
IsManager=CASE WHEN EXISTS(SELECT AcctMan FROM Cust_Rec WHERE AcctMan=EmployeeID) THEN 1 ELSE 0 END
FROM
(
SELECT DISTINCT
EmployeeID
FROM
(
SELECT EmployeeID=CreatedBy FROM Cust_Rec
UNION
SELECT EmployeeID=LastUser FROM Cust_Rec
UNION
SELECT EmployeeID=AcctMan FROM Cust_Rec
)AS Z
)AS Y
)
AS Employees
I had the same issue with the Modified column. All the other columns worked okay. DCR example would work well with the join on an employees table if you have it.
SELECT CreatedBy AS [Employee],
COUNT(CreatedBy) AS [Created],
--Couldn't get modified to pull the right results
SUM(CASE WHEN LastUser = AcctMan THEN 1 ELSE 0 END) [Mod Own],
SUM(CASE WHEN CreatedBy = AcctMan THEN 1 ELSE 0 END) [Created Own]
FROM Cust_Rec
GROUP BY CreatedBy
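The Modified column is hard to get from a single GROUP BY CreatedBy because it is keyed on LastUser rather than CreatedBy. One way around that, sketched below under assumptions, is to build the employee list first and then use conditional aggregation over every record for each employee. This uses Python's built-in sqlite3 module with the sample rows from the question; the column aliases and the CTE name emp are illustrative, and Mod Own follows the asker's fourth query (LastUser = AcctMan and not the creator).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cust_Rec (RecID INT, Customer INT, CreatedBy TEXT, LastUser TEXT, AcctMan TEXT);
INSERT INTO Cust_Rec VALUES
  (1, 1374, 'Bob Jones',   'Mary Willis', 'Bob Jones'),
  (2, 1375, 'Mary Willis', 'Bob Jones',   'Bob Jones'),
  (3, 1376, 'Jay Scott',   'Mary Willis', 'Mary Willis'),
  (4, 1377, 'Jay Scott',   'Mary Willis', 'Jay Scott'),
  (5, 1378, 'Bob Jones',   'Jay Scott',   'Jay Scott');
""")

counts = conn.execute("""
WITH emp AS (
    -- every name that appears in any of the three columns, once each
    SELECT CreatedBy AS Employee FROM Cust_Rec
    UNION SELECT LastUser FROM Cust_Rec
    UNION SELECT AcctMan FROM Cust_Rec
)
SELECT e.Employee,
       SUM(CASE WHEN r.CreatedBy = e.Employee THEN 1 ELSE 0 END) AS Created,
       SUM(CASE WHEN r.LastUser = e.Employee
                 AND r.CreatedBy <> e.Employee THEN 1 ELSE 0 END) AS Modified,
       SUM(CASE WHEN r.LastUser = e.Employee AND r.AcctMan = e.Employee
                 AND r.CreatedBy <> e.Employee THEN 1 ELSE 0 END) AS ModOwn,
       SUM(CASE WHEN r.CreatedBy = e.Employee
                 AND r.AcctMan = e.Employee THEN 1 ELSE 0 END) AS CreatedOwn
FROM emp e
CROSS JOIN Cust_Rec r   -- each employee is compared against every record
GROUP BY e.Employee
ORDER BY e.Employee
""").fetchall()

for row in counts:
    print(row)
```

The cross join reads the table once per employee, so on a large table a real Employee table plus indexes on the three name columns would probably matter.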

Big performance loss when using IN(subquery). Why?

I am using SQL Server 2005 and I have noticed something strange when I want to filter some results when using a subquery within an IN clause. For instance, this is my current query and it runs in 70 seconds on average:
select Phone, ZipCode, sum(Calls) as Calls, sum(Sales) as Sales
from Archive
where CustomerID = 20
and ReportDate = '2/3/2011'
and Phone in (
select Phone
from PlanDetails
where Phone is not null
and Length is not null
and PlannedImp > 0
and CustomerID = 20
and (StatusID <> 2 and StatusID <> 7)
and SubcategoryID = 88
)
group by Phone, ZipCode
However, if I break them down into 2 separate queries, they take under 1 second each to run.
select Phone
from PlanDetails
where Phone is not null
and Length is not null
and PlannedImp > 0
and CustomerID = 20
and (StatusID <> 2 and StatusID <> 7)
and SubcategoryID = 88
and
select Phone, ZipCode, sum(Calls) as Calls, sum(Sales) as Sales
from Archive
where CustomerID = 20
and ReportDate = '2/3/2011'
group by Phone, ZipCode
Finally, if I do this it returns the same results as the first query, but in about 2-3 seconds:
select Phone
into #tempTable
from PlanDetails
where Phone is not null
and Length is not null
and PlannedImp > 0
and CustomerID = 20
and (StatusID <> 2 and StatusID <> 7)
and SubcategoryID = 88
select Phone, ZipCode, sum(Calls) as Calls, sum(Sales) as Sales
from Archive
where CustomerID = 20
and ReportDate = '2/3/2011'
and Phone in (
select Phone
from #tempTable
)
group by Phone, ZipCode
For the past few weeks I have been noticing that not just this query is slow, but any query using a (somewhat complicated) subquery inside of an IN clause just ruins the performance. What is the reason for that?
The only indexes that are available for either of these queries to use is a nonclustered index on CustomerID for both tables. I looked at the execution plans of both the slow query and the fast ones, and saw that the nonclustered index seek on the Archive table is what takes the highest percentage of cost by far (80-90%). However, the only difference is that that step in the slow query has a CPU cost of 7.1 while the fast ones have a CPU cost of 1.7.
It depends upon the database system, the version, the settings, etc., but generally what ends up happening is that the database fails (or refuses) to cache that inner query, so it is executed again for every single row of the outer query. You are changing your problem from an O(n) efficiency class to O(n^2).
Quote IN vs. JOIN vs. EXISTS:
We see now that contrary to the popular opinion, IN / EXISTS queries are not less efficient than a JOIN query in SQL Server.
In fact, JOIN queries are less efficient on non-indexed tables, since Semi Join methods allow aggregation and matching against a single hash table, while a JOIN needs to do these two operations in two steps.
Beyond that, indexing and how current the table statistics are play a large role in how the optimizer decides to execute a query.
What if you rewrite the query with a join?
select a.Phone, a.ZipCode, sum(a.Calls) as Calls, sum(a.Sales) as Sales
from Archive a
inner join PlanDetails pd
on a.CustomerID = pd.CustomerID
and a.Phone = pd.Phone
where a.CustomerID = 20
and a.ReportDate = '2/3/2011'
and pd.Length is not null
and pd.PlannedImp > 0
and (pd.StatusID <> 2 and pd.StatusID <> 7)
and pd.SubcategoryID = 88
group by a.Phone, a.ZipCode
I'd propose 2 solutions:
1. Try to rewrite your query using EXISTS instead of IN. It might help if you use older SQL Server versions (if my memory serves me well, prior to SQL Server 2005, EXISTS and IN generated different execution plans).
2. Try to use INNER JOIN (you can also use CTE):
select Archive.Phone, ZipCode, sum(Calls) as Calls, sum(Sales) as Sales
from Archive
INNER JOIN
(
select DISTINCT Phone -- DISTINCT to avoid duplicates
from PlanDetails
where Phone is not null
and Length is not null
and PlannedImp > 0
and CustomerID = 20
and (StatusID <> 2 and StatusID <> 7)
and SubcategoryID = 88
)XX ON (XX.Phone = Archive.Phone)
where Archive.CustomerID = 20 and ReportDate = '2/3/2011'
group by Archive.Phone, ZipCode
Personally, I'd expect the second approach to give you much better results.
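To check that the rewrites are interchangeable before comparing their speed, the three forms (IN, EXISTS, and a join against a DISTINCT subquery) can be run side by side on a small data set. A minimal sketch using Python's built-in sqlite3 module with hypothetical rows; it says nothing about SQL Server's execution plans, only about result equivalence, and it also shows why the DISTINCT matters when PlanDetails lists the same phone more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Archive (CustomerID INT, ReportDate TEXT, Phone TEXT,
                      ZipCode TEXT, Calls INT, Sales INT);
CREATE TABLE PlanDetails (CustomerID INT, Phone TEXT, Length INT,
                          PlannedImp INT, StatusID INT, SubcategoryID INT);
INSERT INTO Archive VALUES
  (20, '2011-02-03', '555-0001', '10001', 5, 1),
  (20, '2011-02-03', '555-0001', '10001', 3, 2),
  (20, '2011-02-03', '555-0002', '10002', 7, 0),
  (20, '2011-02-03', '555-0003', '10003', 4, 4);
INSERT INTO PlanDetails VALUES
  (20, '555-0001', 30, 10, 1, 88),
  (20, '555-0001', 60,  5, 1, 88),  -- same phone listed twice
  (20, '555-0002', 30,  0, 1, 88),  -- filtered out: PlannedImp = 0
  (20, '555-0003', 30, 10, 2, 88);  -- filtered out: StatusID = 2
""")

# Shared fragments so that all variants apply exactly the same filters.
plan_filter = """
    FROM PlanDetails pd
    WHERE pd.Phone IS NOT NULL AND pd.Length IS NOT NULL AND pd.PlannedImp > 0
      AND pd.CustomerID = 20 AND pd.StatusID NOT IN (2, 7) AND pd.SubcategoryID = 88
"""
select_head = "SELECT a.Phone, a.ZipCode, SUM(a.Calls), SUM(a.Sales) FROM Archive a"
archive_filter = "a.CustomerID = 20 AND a.ReportDate = '2011-02-03'"
tail = "GROUP BY a.Phone, a.ZipCode ORDER BY a.Phone"

in_rows = conn.execute(
    f"{select_head} WHERE {archive_filter} "
    f"AND a.Phone IN (SELECT pd.Phone {plan_filter}) {tail}"
).fetchall()
exists_rows = conn.execute(
    f"{select_head} WHERE {archive_filter} "
    f"AND EXISTS (SELECT 1 {plan_filter} AND pd.Phone = a.Phone) {tail}"
).fetchall()
join_rows = conn.execute(
    f"{select_head} JOIN (SELECT DISTINCT pd.Phone {plan_filter}) x "
    f"ON x.Phone = a.Phone WHERE {archive_filter} {tail}"
).fetchall()
# Without DISTINCT, each Archive row joins to both PlanDetails rows.
naive_join_rows = conn.execute(
    f"{select_head} JOIN (SELECT pd.Phone {plan_filter}) x "
    f"ON x.Phone = a.Phone WHERE {archive_filter} {tail}"
).fetchall()

print(in_rows, exists_rows, join_rows, naive_join_rows)
```

On this data the join without DISTINCT doubles the sums, which is the duplication the DISTINCT in the second solution guards against.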

Select exactly one row for each employee using unordered field as criteria

I have a data set that looks like the following.
EMPLID PHONE_TYPE PHONE
------ ---------- --------
100 HOME 111-1111
100 WORK 222-2222
101 HOME 333-3333
102 WORK 444-4444
103 OTHER 555-5555
I want to select exactly one row for each employee using the PHONE_TYPE field to establish preferences. I want the HOME phone number if the employee has one as is the case for employee 100 and 101. If the HOME number is not present, I want the WORK number (employee 102), and as a last resort I'll take the OTHER number as with employee 103. In reality my table has about a dozen values for the PHONE_TYPE field, so I need to be able to extend any solution to include more than just the three values I've shown in the example. Any thoughts? Thanks.
You need to add a phone_types table (Phone_Type TEXT(Whatever), Priority INTEGER). In this table, list each Phone_Type value once and assign a priority to it (in your example, HOME would be 1, WORK 2, OTHER 3 and so on).
Then, create a view that joins the Priority column from Phone_Types to your Phone_Numbers table (imagine we call it Phone_Numbers_Ex).
Now, you have several options for how to get record from Phone_Numbers_Ex with the MIN(Priority) for a given emplID, of which probably the clearest is:
SELECT * FROM Phone_Numbers_Ex P1 WHERE NOT EXISTS
(SELECT * FROM Phone_Numbers_Ex P2 WHERE P2.EmplID = P1.EmplID AND P2.Priority < P1.Priority)
Another way is to declare another view, or inner query, along the lines of SELECT EmplID, MIN(Priority) AS Priority FROM Phone_Numbers_Ex GROUP BY EmplID and then join this back to Phone_Numbers_Ex on both EmplID and Priority.
I forget, does Server 2000 support Coalesce? If it does, I think this will work:
Select Distinct EmplID, Coalesce(
(Select Phone from Employees where emplid = e1.emplid and phone_type = 'HOME'),
(Select Phone from Employees where emplid = e1.emplid and phone_type = 'WORK'),
(Select Phone from Employees where emplid = e1.emplid and phone_type = 'OTHER')
) as Phone
From Employees e1
Your requirements may not be complete if an employee is allowed to have more than one phone number for a given phone type. I've added a phone_number_id just to make things unique and assumed that you would want the lowest id if the person has two phones of the same type. That's pretty arbitrary, but you can replace it with your own business logic.
I've also assumed some kind of a Phone_Types table that includes your priority for which phone number should be used. If you don't already have this table, you should probably add it. If nothing else, it lets you constrain the phone types with a foreign key.
SELECT
PN1.employee_id,
PN1.phone_type,
PN1.phone_number
FROM
Phone_Numbers PN1
INNER JOIN Phone_Types PT1 ON
PT1.phone_type = PN1.phone_type
WHERE
NOT EXISTS
(
SELECT *
FROM
Phone_Numbers PN2
INNER JOIN Phone_Types PT2 ON
PT2.phone_type = PN2.phone_type
WHERE
PN2.employee_id = PN1.employee_id AND
(
(PT2.priority < PT1.priority)
--OR (PT2.priority = PT1.priority AND PN2.phone_number_id < PN1.phone_number_id)
)
)
You could also implement this with a LEFT JOIN instead of the NOT EXISTS or you could use TOP if you were looking for the phone number for a single employee. Just do a TOP 1 ORDER BY priority, phone_number_id.
Finally, if you were to move up to SQL 2005 or SQL 2008, you could use a CTE with ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY priority, phone_number_id). That would allow you to get the top one for all employees by checking that the row number equals 1.
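The priority-table approach with NOT EXISTS can be exercised on the sample data from the question. This is a minimal sketch using Python's built-in sqlite3 module; the Phone_Types priorities, the phone_number_id column, and the extra second HOME number for employee 100 are assumptions added to show the tie-break on the lowest id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Phone_Types (phone_type TEXT PRIMARY KEY, priority INT);
CREATE TABLE Phone_Numbers (phone_number_id INTEGER PRIMARY KEY, employee_id INT,
                            phone_type TEXT, phone_number TEXT);
INSERT INTO Phone_Types VALUES ('HOME', 1), ('WORK', 2), ('OTHER', 3);
INSERT INTO Phone_Numbers VALUES
  (1, 100, 'HOME',  '111-1111'),
  (2, 100, 'WORK',  '222-2222'),
  (3, 101, 'HOME',  '333-3333'),
  (4, 102, 'WORK',  '444-4444'),
  (5, 103, 'OTHER', '555-5555'),
  (6, 100, 'HOME',  '666-6666');  -- hypothetical second HOME number
""")

preferred = conn.execute("""
SELECT pn1.employee_id, pn1.phone_type, pn1.phone_number
FROM Phone_Numbers pn1
JOIN Phone_Types pt1 ON pt1.phone_type = pn1.phone_type
WHERE NOT EXISTS (
    -- any number of the same employee that should win over pn1
    SELECT 1
    FROM Phone_Numbers pn2
    JOIN Phone_Types pt2 ON pt2.phone_type = pn2.phone_type
    WHERE pn2.employee_id = pn1.employee_id
      AND (pt2.priority < pt1.priority
           OR (pt2.priority = pt1.priority
               AND pn2.phone_number_id < pn1.phone_number_id))
)
ORDER BY pn1.employee_id
""").fetchall()

for row in preferred:
    print(row)
```

Adding a new phone type then only needs one more row in Phone_Types, which is what makes this extend to a dozen types without touching the query.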
As an alternative to g.d.d.c's answer, which uses subqueries in the SELECT clause, you could use left joins. You might get better performance, but you should test, of course.
SELECT
e1.emplId,
Coalesce(phoneHome.Phone,phoneWork.Phone,phoneOther.Phone) phone
FROM
employees e1
LEFT JOIN phone phoneHome
ON e1.emplId = phoneHome.emplId
and phoneHome.phone_type = 'HOME'
LEFT JOIN phone phoneWork
ON e1.emplId = phoneWork.emplId
and phoneWork.phone_type = 'WORK'
LEFT JOIN phone phoneOther
ON e1.emplId = phoneOther.emplId
and phoneOther.phone_type = 'OTHER'