Aggregate value by any of two columns - sql

Suppose I have a Customers Table:
Customers
-----------------------------------------------
Id INTEGER
SSN NCHAR(11)
FullName NVARCHAR(100)
LastPurchaseDate DATETIME
There are many stores around the city, and a customer can be registered in any of them, each one giving him a different Id. Wherever he buys, the corresponding Id gets its LastPurchaseDate updated.
Now I need to get the Id corresponding to the latest LastPurchaseDate per person. The problem is that, for various reasons, there can be typos in either the SSN or the FullName. Let's say I have the following data:
Id          SSN         FullName      LastPurchaseDate
----------- ----------- ------------- -----------------
200123      123-45-6789 John Doe      10-09-2015
201978      456-78-9012 Mary Jane     15-08-2015
380789      789-01-2345 Pete Zahut    01-08-2015
389236      123-45-6789 Jhon Doe      23-07-2015
215875      456-87-9012 Mary Jane     30-08-2015
974186      123456789   John Doe      28-04-2015
123758      789-01-2345 Pete Zaut     18-08-2015
A customer is considered to be the same person if it has either the same SSN or the same FullName. So in this sample, customers 200123, 389236 and 974186 are the same person. Therefore, the resulting Ids should be
200123
215875
123758
How can I achieve this?
Edit
So, the match has to be on either SSN or FullName, but it has to be exact; if both fields are different, even by one character, it will be considered a different person. I hope the data will eventually be cleansed, but that will take time, as there is a lot of info to trace and correct.

A first data-cleaning pass could be:
select REPLACE(SSN, '-', '') as SSN,
       Min(Id) as Id,
       Max(FullName) as FullName,
       Max(LastPurchaseDate) as LastPurchaseDate
from Customers
group by REPLACE(SSN, '-', '')
That merges all rows whose SSNs differ only in their dashes. It also assumes that the lowest Id is the real Id, and it takes the max of the name to avoid nulls.
You can refine this further by assuming that the longer name is the better one, using a length function.
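The "same SSN or same FullName" rule in the question is transitive (200123 links to 389236 by SSN and to 974186 by name), so it amounts to finding connected groups of rows. As a rough sketch outside the database, assuming the rows have been fetched into memory and the dates are in dd-mm-yyyy form as in the sample, a small union-find can do the grouping. The function name latest_id_per_person is made up for illustration.

```python
from datetime import datetime

# Sample rows from the question: (Id, SSN, FullName, LastPurchaseDate as dd-mm-yyyy)
rows = [
    (200123, "123-45-6789", "John Doe", "10-09-2015"),
    (201978, "456-78-9012", "Mary Jane", "15-08-2015"),
    (380789, "789-01-2345", "Pete Zahut", "01-08-2015"),
    (389236, "123-45-6789", "Jhon Doe", "23-07-2015"),
    (215875, "456-87-9012", "Mary Jane", "30-08-2015"),
    (974186, "123456789", "John Doe", "28-04-2015"),
    (123758, "789-01-2345", "Pete Zaut", "18-08-2015"),
]


def latest_id_per_person(rows):
    """Group rows that share an exact SSN or FullName; return the newest Id per group."""
    parent = {row[0]: row[0] for row in rows}

    def find(x):
        # Follow parent links to the group's representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every row to the first row seen with the same SSN or the same FullName.
    first_seen = {}
    for cid, ssn, name, _ in rows:
        for key in (("ssn", ssn), ("name", name)):
            if key in first_seen:
                union(first_seen[key], cid)
            else:
                first_seen[key] = cid

    # Within each group, keep the Id with the most recent purchase date.
    best = {}
    for cid, _, _, date in rows:
        purchased = datetime.strptime(date, "%d-%m-%Y")
        root = find(cid)
        if root not in best or purchased > best[root][0]:
            best[root] = (purchased, cid)
    return sorted(cid for _, cid in best.values())


print(latest_id_per_person(rows))
```

On the sample data this yields the three Ids the question asks for. Doing the same thing in pure SQL generally needs a recursive query or an iterative update of a group column, since a plain GROUP BY cannot follow chains of matches.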

Related

Combining two mostly identical rows in SQL

I have a table that contains data like below:
Name  ID    Dept
Joe   1001  Accounting
Joe   1001  Marketing
Mary  1003  Administration
Mary  1009  Accounting
Each row is uniquely identified with a combo of Name and ID. I want the resulting table to combine rows that have same Name and ID and put their dept's together separated by a comma in alpha order. So the result would be:
Name  ID    Dept
Joe   1001  Accounting, Marketing
Mary  1003  Administration
Mary  1009  Accounting
I am not sure how to approach this. So far I have this, which doesn't really do what I need:
SELECT Name, ID, COUNT(*)
FROM employees
GROUP BY Name, ID
I know COUNT(*) is irrelevant here, but I am not sure what to do. Any help is appreciated! By the way, I am using PostgreSQL and I am new to the language.
PostgreSQL has an aggregate function for string concatenation, string_agg; it is covered in the PostgreSQL documentation on aggregate functions. Try the following:
SELECT Name, ID, string_agg(Dept, ', ' ORDER BY Dept ASC) AS Departments
FROM employees
GROUP BY Name, ID
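If it helps to see what that query computes, here is a plain-Python equivalent, not PostgreSQL itself: group by (Name, ID), sort each group's departments, and join them with ", ". The employees list and the function name combine_departments are just illustrative stand-ins for the table.

```python
from collections import defaultdict

# Stand-in for the employees table from the question: (Name, ID, Dept)
employees = [
    ("Joe", 1001, "Marketing"),
    ("Joe", 1001, "Accounting"),
    ("Mary", 1003, "Administration"),
    ("Mary", 1009, "Accounting"),
]


def combine_departments(rows):
    """Mimic string_agg(Dept, ', ' ORDER BY Dept) with GROUP BY Name, ID."""
    grouped = defaultdict(list)
    for name, emp_id, dept in rows:
        grouped[(name, emp_id)].append(dept)
    # One output row per (Name, ID), departments joined in alphabetical order.
    return [
        (name, emp_id, ", ".join(sorted(depts)))
        for (name, emp_id), depts in sorted(grouped.items())
    ]


for row in combine_departments(employees):
    print(row)
```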

Historical sql Table With Bits Of User Information - Make New Table With 1 Entry & All Information

I have a table (customers) that has 43 columns of user information (first name, last name, address, city, state, zip, phone, email, visitDate, lastActive, etc...)
Every night, I'm getting a feed from our clients with the customers that visited them that day. These visits are stored into the customers table without removing the old record. The old record is marked lastActive = 0 and the new one is marked lastActive = 1. Any null fields are stored as "Unknown".
Obviously this results in a very large table that takes a while to query. So, I plan on making a new table that is only the distinct users and their most complete information.
For example: If Bob Smith was imported on January 1st with no phone or email, and then he was imported again on August 1st with a phone, but no email, and then imported again on September 1st with no phone, but an email, my customers table would look something like this:
CustImportID CustomerKey FirstName LastName Phone      Email   visitDate  lastActive
1            1           Bob       Smith    Unknown    Unknown 2016-01-01 0
2            1           Bob       Smith    5551231234 Unknown 2016-08-01 0
3            1           Bob       Smith    Unknown    1#2.io  2016-09-01 1
So my question is this: what's the best way to get the distinct people from the customers table and insert them into the new table, where Bob would be only one entry but I would have values for every field (if every entry has a phone, for example, we would pull the phone from the most recent entry), resulting in something like this:
CustomerKey FirstName LastName Phone      Email  visitDate
1           Bob       Smith    5551231234 1#2.io 2016-09-01
You can use FIRST_VALUE with a trick to ignore 'Unknown' values:
SELECT DISTINCT CustomerKey, FirstName, LastName,
       FIRST_VALUE(Phone) OVER (PARTITION BY CustomerKey
                                ORDER BY CASE
                                            WHEN Phone = 'Unknown' THEN 1
                                            ELSE 0
                                         END,
                                         visitDate DESC) AS Phone,
       FIRST_VALUE(Email) OVER (PARTITION BY CustomerKey
                                ORDER BY CASE
                                            WHEN Email = 'Unknown' THEN 1
                                            ELSE 0
                                         END,
                                         visitDate DESC) AS Email
FROM mytable
FIRST_VALUE is available from SQL Server 2012. It picks the latest field value as specified by the ORDER BY of the OVER clause. Due to the CASE in the ORDER BY clause, 'Unknown' values have the lowest priority. PARTITION BY CustomerKey restarts that ranking for each customer, and DISTINCT collapses the per-row output to one row per customer.
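You can also take the max of each column across all of a customer's records, grouped by customerkey. A plain max(phone) would return 'Unknown' whenever that string sorts after the real value, so turn the placeholder into NULL first and let MAX ignore it:
select customerkey, max(firstname), max(lastname), max(nullif(phone, 'Unknown')), max(nullif(email, 'Unknown')), max(visitdate) from yourtablename group by customerkey
If a customer has two or more valid entries for a field, use row_number instead and pick the most recent value.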
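For anyone who wants to try the "most recent non-Unknown value per field" idea without SQL Server, here is a small sketch using Python's built-in sqlite3 module. It swaps FIRST_VALUE for a correlated subquery per column, which is a different technique with the same intent. The table layout follows the question; the helper name latest_known is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (CustImportID INTEGER PRIMARY KEY, CustomerKey INTEGER, "
    "FirstName TEXT, LastName TEXT, Phone TEXT, Email TEXT, "
    "visitDate TEXT, lastActive INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "Bob", "Smith", "Unknown", "Unknown", "2016-01-01", 0),
        (2, 1, "Bob", "Smith", "5551231234", "Unknown", "2016-08-01", 0),
        (3, 1, "Bob", "Smith", "Unknown", "1#2.io", "2016-09-01", 1),
    ],
)

FIELDS = ["FirstName", "LastName", "Phone", "Email", "visitDate"]


def latest_known(field):
    # Correlated subquery: the newest row for this customer where the field is
    # filled in. It yields NULL if every value for the field is 'Unknown'.
    return (
        f"(SELECT c.{field} FROM customers c "
        f"WHERE c.CustomerKey = k.CustomerKey AND c.{field} <> 'Unknown' "
        f"ORDER BY c.visitDate DESC LIMIT 1) AS {field}"
    )


query = (
    "SELECT k.CustomerKey, "
    + ", ".join(latest_known(f) for f in FIELDS)
    + " FROM (SELECT DISTINCT CustomerKey FROM customers) k"
    + " ORDER BY k.CustomerKey"
)
merged = conn.execute(query).fetchall()
print(merged)
```

The result can be fed straight into an INSERT for the new table. With 43 columns, generating the column list from a list of names, as above, is probably less error-prone than writing each expression by hand.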

How do I make a query for if value exists in row add a value to another field?

I have an Access database and I want to add a value to a column at the end of each row based on which hospital they are in. This is a separate value. For example, the hospital called "St. James Hospital" has the id of "3" in a separate field. How do I do this using a query rather than manually going through the whole database?
Not the best solution, but you can do something like this:
create table new_table as
select id, case when hospital = 'St. James Hospital' then 3 else null end as hospital_id
from old_table
Or, the better option would be to create a table with the columns hospital_name and hospital_id. You can then create a foreign key relationship that will create the mapping for you, and enforce data integrity. A join across the two tables will produce what you want.
Read about this here:
http://net.tutsplus.com/tutorials/databases/sql-for-beginners-part-3-database-relationships/
The answer to your question is an UPDATE combined with a JOIN. I am fairly sure that a quick search would have turned up the question linked below.
Access DB update one table with value from another
You could do this:
update yourTable
set yourFinalColumnWhateverItsNameIs = {your desired value}
where someColumn = 3
Every row in the table that has a 3 in the someColumn column will then have that final column set to your desired value.
If this isn't what you want, please make your question clearer. Are you trying to put the name of the hospital into this table? If so, that is not a good idea and there are better ways to accomplish that.
Furthermore, if every row with a certain value (3) gets this value, you could simply add it to the other (i.e. Hospitals) table. No need to repeat it everywhere in the table that points back to the Hospitals table.
P.S. Here's an example of what I meant:
Let's say you have two tables
HOSPITALS (id, name, city, state)
BIRTHS (id, hospitalid, babysname, gender, mothersname, fathername)
You could get a baby's city of birth without having to include the City column in the Births table, simply by joining the tables on hospitals.id = births.hospitalid.
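As a quick illustration of that join, here is a sketch using Python's built-in sqlite3 module rather than Access. The table and column names follow the two tables described above; the rows themselves are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE hospitals (id INTEGER PRIMARY KEY, name TEXT, city TEXT, state TEXT);
    CREATE TABLE births (
        id INTEGER PRIMARY KEY,
        hospitalid INTEGER REFERENCES hospitals(id),
        babysname TEXT, gender TEXT, mothersname TEXT, fathername TEXT
    );
    -- Invented sample rows, for illustration only.
    INSERT INTO hospitals VALUES (1, 'General Hospital', 'Springfield', 'IL');
    INSERT INTO hospitals VALUES (2, 'Mercy Hospital', 'Portland', 'OR');
    INSERT INTO births VALUES (1, 2, 'Alex', 'M', 'Ann', 'Bill');
    INSERT INTO births VALUES (2, 1, 'Sam', 'F', 'Cara', 'Dan');
    """
)

# The city is stored only once, in hospitals; the join looks it up per birth.
birth_cities = conn.execute(
    """
    SELECT b.babysname, h.city
    FROM births AS b
    INNER JOIN hospitals AS h ON h.id = b.hospitalid
    ORDER BY b.id
    """
).fetchall()
print(birth_cities)
```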
After examining your ACCDB file, I suggest you consider setting up the tables differently.
Table Health_Professionals:
ID First Name Second Name Position hospital_id
1  John       Doe         PI       2
2  Joe        Smith       Co-PI    1
3  Sarah      Johnson     Nurse    3
Table Hospitals:
hospital_id Hospital
1           Beaumont
2           St James
3           Letterkenny Hosptial
A key point is to avoid storing both the hospital ID and name in the Health_Professionals table. Store only the ID. When you need to see the name, use the hospital ID to join with the Hospitals table and get the name from there.
A useful side effect of this design is that if anyone ever misspells a hospital name, e.g. "Hosptial", you need to correct that error in only one place. The same holds true whenever a hospital is intentionally renamed.
Based on those tables, the query below returns this result set.
ID Second Name First Name Position hospital_id Hospital
1  Doe         John       PI       2           St James
3  Johnson     Sarah      Nurse    3           Letterkenny Hosptial
2  Smith       Joe        Co-PI    1           Beaumont
SELECT
    hp.ID,
    hp.[Second Name],
    hp.[First Name],
    hp.Position,
    hp.hospital_id,
    h.Hospital
FROM
    Health_Professionals AS hp
    INNER JOIN Hospitals AS h
    ON hp.hospital_id = h.hospital_id
ORDER BY
    hp.[Second Name],
    hp.[First Name];

MySQL duplicates -- how to specify when two records actually AREN'T duplicates?

I have an interesting problem, and my logic isn't up to the task.
We have a table that sometimes develops duplicate records (for process reasons, and this is unavoidable). Take the following example:
id FirstName LastName PhoneNumber  email
-- --------- -------- ------------ --------------
1  John      Doe      123-555-1234 jdoe#gmail.com
2  Jane      Smith    123-555-1111 jsmith#foo.com
3  John      Doe      123-555-4321 jdoe#yahoo.com
4  Bob       Jones    123-555-5555 bob#bar.com
5  John      Doe      123-555-0000 jdoe#hotmail.com
6  Mike      Roberts  123-555-9999 roberts#baz.com
7  John      Doe      123-555-1717 wally#domain.com
We find the duplicates this way:
SELECT c1.*
FROM `clients` c1
INNER JOIN (
    SELECT `FirstName`, `LastName`, COUNT(*)
    FROM `clients`
    GROUP BY `FirstName`, `LastName`
    HAVING COUNT(*) > 1
) AS c2
ON c1.`FirstName` = c2.`FirstName`
AND c1.`LastName` = c2.`LastName`
This generates the following list of duplicates:
id FirstName LastName PhoneNumber  email
-- --------- -------- ------------ --------------
1  John      Doe      123-555-1234 jdoe#gmail.com
3  John      Doe      123-555-4321 jdoe#yahoo.com
5  John      Doe      123-555-0000 jdoe#hotmail.com
7  John      Doe      123-555-1717 wally#domain.com
As you can see, based on FirstName and LastName, all of the records are duplicates.
At this point, we actually make a phone call to the client to clear up potential duplicates.
After doing so, we learn (for example) that records 1 and 3 are real duplicates, but records 5 and 7 are actually two different people altogether.
So we merge any extraneously linked data from records 1 and 3 into record 1, remove record 3, and leave records 5 and 7 alone.
Now here's where the problem comes in:
The next time we re-run the "duplicates" query, it will contain the following rows:
id FirstName LastName PhoneNumber  email
-- --------- -------- ------------ --------------
1  John      Doe      123-555-4321 jdoe#gmail.com
5  John      Doe      123-555-0000 jdoe#hotmail.com
7  John      Doe      123-555-1717 wally#domain.com
They all appear to be duplicates, even though we've previously recognized that they aren't.
How would you go about identifying that these records aren't duplicates?
My first thought is to build a lookup table identifying which records aren't duplicates of each other (for example, {1,5},{1,7},{5,7}), but I have no idea how to build a query that would be able to use this data.
Further, if another duplicate record shows up, it may be a duplicate of 1, 5, or 7, so we would need them all to show back up in the duplicates list so the customer service person can call the person in the new record to find out which record he may be a duplicate of.
I'm stretched to the limit trying to understand this. Any brilliant geniuses out there that would care to take a crack at this?
Interesting problem. Here's my crack at it.
How about if we approach the problem from a slightly different perspective.
Consider that the system is clean to start with, i.e., all records currently in the system either have unique First + Last name combinations, or the ones sharing a first + last name have already been manually confirmed to be different people.
At the point of entering a NEW user in the system, we have an additional check. Can be implemented as an INSERT Trigger or just another procedure called after the insert is successfully done.
This Trigger / Procedure matches the FIRST + LAST name combination of the "Inserted" record with all existing records in the table.
For all the matching First + Last names, it will create an entry in a matching table (new table) with NewUserID, ExistingMatchingRecordsUserID
From an SQL perspective,
TABLE MatchingTable
COLUMNS 1. NewUserID 2. ExistingUserID
Constraint : Logical PK = NewUserID + ExistingMatchingRecordsUserID
INSERT INTO MatchingTable (NewUserID, ExistingUserID)
SELECT :NewUserId, u.userId FROM User u WHERE u.firstName = 'John' AND u.lastName = 'Doe' AND u.userId <> :NewUserId
Here :NewUserId stands for the Id of the record that was just inserted.
All entries in MatchingTable need resolution.
When say an Admin logs into the system, the admin sees the list of all entries in MatchingTable
e.g.: New User John Doe (ID 345) - 3 potential matches: John Doe - ID 123 / ID 231 / ID 256
The admin will check the data for 345 against the data for 123, 231 and 256 and manually confirm whether it is a duplicate of any of them or of none.
If it is a duplicate, 345 is deleted from the User table (soft / hard delete - whatever suits you).
If NOT, the entries for ID 345 are just removed from MatchingTable (I would go with hard deletes here as this is like a transactional temp table, but again anything is fine).
Additionally, when entries for ID 345 are removed from MatchingTable, all other entries in MatchingTable where ExistingMatchingRecordsUserID = 345 are automatically removed, so that no manual verification is needed for data that has already been verified.
Again, this could be a potential DELETE trigger / Just logic executed additionally on DELETE of MatchingTable. The implementation is subject to preference.
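To make the flow concrete, here is a rough sketch using Python's built-in sqlite3 module instead of MySQL triggers. The table is called users here to avoid the reserved word, and add_user and confirm_distinct are hypothetical helpers standing in for the insert trigger and the admin's "not a duplicate" action.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (userId INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT);
    CREATE TABLE MatchingTable (
        NewUserID INTEGER,
        ExistingUserID INTEGER,
        PRIMARY KEY (NewUserID, ExistingUserID)
    );
    -- Existing, already-verified records (IDs taken from the example above).
    INSERT INTO users VALUES (123, 'John', 'Doe');
    INSERT INTO users VALUES (231, 'John', 'Doe');
    INSERT INTO users VALUES (256, 'John', 'Doe');
    INSERT INTO users VALUES (400, 'Jane', 'Smith');
    """
)


def add_user(user_id, first, last):
    """Insert a user and queue every existing same-name record for review."""
    conn.execute("INSERT INTO users VALUES (?, ?, ?)", (user_id, first, last))
    conn.execute(
        "INSERT INTO MatchingTable (NewUserID, ExistingUserID) "
        "SELECT ?, u.userId FROM users u "
        "WHERE u.firstName = ? AND u.lastName = ? AND u.userId <> ?",
        (user_id, first, last, user_id),
    )


def confirm_distinct(user_id):
    """Admin decided the new user is nobody's duplicate: clear all its entries."""
    conn.execute(
        "DELETE FROM MatchingTable WHERE NewUserID = ? OR ExistingUserID = ?",
        (user_id, user_id),
    )


add_user(345, "John", "Doe")
pending = conn.execute(
    "SELECT ExistingUserID FROM MatchingTable WHERE NewUserID = 345 "
    "ORDER BY ExistingUserID"
).fetchall()
print(pending)  # the three existing John Does awaiting review

confirm_distinct(345)
remaining = conn.execute("SELECT COUNT(*) FROM MatchingTable").fetchone()[0]
print(remaining)
```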
At the expense of adding a single byte per row to your table, you could add a manually_verified BOOL column, with a default of FALSE. Set it to TRUE if you have manually verified the data. Then you can simply query where manually_verified = FALSE.
It's simple, effective, and matches what is actually happening in the business processes: you manually verify the data.
If you want to go a step further, you might want to store when the row was verified and who verified it. Since this might be annoying to store in the main table, you could certainly store it in a separate table, and LEFT JOIN in the verification data. You could even create a view to recreate the appearance of a single master table.
To solve the problem of a new duplicate being added: you would check non-verified data against the entire data set. So that means your main table, c1, would have the condition manually_verified = FALSE, but your INNER JOINed table, c2, does not. This way, the unverified data will still find all potential duplicate matches:
SELECT * FROM clients c1
INNER JOIN clients c2 ON c1.FirstName = c2.FirstName AND c1.LastName = c2.LastName AND c1.id <> c2.id
WHERE c1.manually_verified = FALSE
The possible matches for the duplicates will be in the joined table.
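Here is a small runnable sketch of the manually_verified idea, using Python's built-in sqlite3 module rather than MySQL and storing the flag as 0/1. It assumes records 1, 5 and 7 have already been checked by phone and that a new, unverified John Doe (id 8, invented for the example) arrives afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clients (
        id INTEGER PRIMARY KEY,
        FirstName TEXT,
        LastName TEXT,
        manually_verified INTEGER NOT NULL DEFAULT 0
    );
    -- Records 1, 5 and 7 were verified by phone; Jane has no namesake.
    INSERT INTO clients VALUES (1, 'John', 'Doe', 1);
    INSERT INTO clients VALUES (2, 'Jane', 'Smith', 0);
    INSERT INTO clients VALUES (5, 'John', 'Doe', 1);
    INSERT INTO clients VALUES (7, 'John', 'Doe', 1);
    """
)

# Only unverified rows (c1) are reported, but they are compared against
# every row (c2), verified or not.
QUERY = """
    SELECT c1.id, c2.id
    FROM clients c1
    INNER JOIN clients c2
        ON c1.FirstName = c2.FirstName
       AND c1.LastName = c2.LastName
       AND c1.id <> c2.id
    WHERE c1.manually_verified = 0
    ORDER BY c1.id, c2.id
"""

before = conn.execute(QUERY).fetchall()
conn.execute("INSERT INTO clients (id, FirstName, LastName) VALUES (8, 'John', 'Doe')")
after = conn.execute(QUERY).fetchall()
print(before)
print(after)
```

Before the new record arrives the list is empty; afterwards record 8 shows up paired with 1, 5 and 7, which is the set the customer service person would need to check.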

Remapping/Concatenating in SQL

I'm trying to reorder/group a set of results using SQL. I have a few fields (which for the example have been renamed to something a bit less specific), and each logical group of records has a field which remains constant - the address field. There are also fields which are present for each address, these are the same for every address.
id forename surname   address
1  John     These     Address1
2  Lucy     Values    Address1
3  Jenny    Are       Address1
4  John     All       Address2
5  Lucy     Totally   Address2
6  Jenny    Different Address2
7  Steve    And       Address2
8  Richard  Blah      Address2
address  John  Lucy    Jenny     Steve  Richard
Address1 These Values  Are       (null) (null)
Address2 All   Totally Different And    Blah
For example: John, Lucy, Jenny, Steve and Richard are the only possible names at each address. I know this because it's stored in another location.
Can I select values from the actual records in the left hand image, and return them as a result set like the one on the right? I'm using MySQL if that makes a difference.
Assuming that the column headings "john", "lucy" etc are fixed, you can group by the address field and use if() functions combined with aggregate operators to get your results:
select max(if(forename='john',surname,null)) as john,
       max(if(forename='lucy',surname,null)) as lucy,
       max(if(forename='jenny',surname,null)) as jenny,
       max(if(forename='steve',surname,null)) as steve,
       max(if(forename='richard',surname,null)) as richard,
       address
from tablename
group by address;
It is a bit brittle though.
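One way to soften the brittleness is to generate the column list from the known names instead of typing each one. Below is a sketch with Python's built-in sqlite3 module, so it uses CASE WHEN where the MySQL query uses if(). The table name tenants and the NAMES list are assumptions standing in for the real table and for wherever the names are actually stored.

```python
import sqlite3

# Assumed to come from the "other location" that lists the possible names.
NAMES = ["John", "Lucy", "Jenny", "Steve", "Richard"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tenants (id INTEGER PRIMARY KEY, forename TEXT, "
    "surname TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO tenants VALUES (?, ?, ?, ?)",
    [
        (1, "John", "These", "Address1"),
        (2, "Lucy", "Values", "Address1"),
        (3, "Jenny", "Are", "Address1"),
        (4, "John", "All", "Address2"),
        (5, "Lucy", "Totally", "Address2"),
        (6, "Jenny", "Different", "Address2"),
        (7, "Steve", "And", "Address2"),
        (8, "Richard", "Blah", "Address2"),
    ],
)

# One conditional aggregate per name; each name is bound as a query parameter.
columns = ", ".join(
    "MAX(CASE WHEN forename = ? THEN surname END)" for _ in NAMES
)
query = f"SELECT address, {columns} FROM tenants GROUP BY address ORDER BY address"
pivot = conn.execute(query, NAMES).fetchall()

print(["address"] + NAMES)
for row in pivot:
    print(row)
```

Names with no record at an address come back as None (NULL), matching the (null) cells in the desired output.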
There is also the group_concat function that can be used (within limits) to do something similar, but it will be ordered row-wise rather than column-wise as you appear to require.
eg.
select address, group_concat( concat( forename, ' ', surname ) ) tenants
from tablename
group by address;
I'm not certain, but I think what you're trying to do is GROUP BY.
SELECT address, forename FROM tablename GROUP BY address, forename
if you want to select more columns, make sure they're included in the GROUP BY clause. Also, you can now do aggregate functions, like MAX() or COUNT().
I am not sure about the question, but from what I understand you can do:
SELECT concat(column1,column2,column3) as main_column, address from table;