SQL Join Ignore multiple matches (fuzzy results ok) - sql

I don't know what my problem is called, so I'll just show some sample data. Fuzzy results are fine: this is for approximate evaluation, not detailed accounting, so I don't mind if some data gets overlooked. But I do need every record in TABLE 1, and I would like to avoid the nulls case shown below.
Is this possible?
TABLE 1
acctnum sub fname lname phone
12345 1 john doe xxx-xxx-xxxx
12346 0 jane doe xxx-xxx-xxxx
12347 0 rob roy xxx-xxx-xxxx
12348 0 paul smith xxx-xxx-xxxx
TABLE 2
acctnum sub division
12345 1 EAST
12345 2 WEST
12345 3 NORTH
12346 1 TOP
12346 2 BOTTOM
12347 2 BALLOON
12348 1 NORTH
So if we do a "regular outer" join, we'd get some results like this, since the sub 0's don't match the second table:
TABLE AFTER JOIN
acctnum sub fname lname phone division
12345 1 john doe xxx-xxx-xxxx EAST
12346 0 jane doe xxx-xxx-xxxx null
12347 0 rob roy xxx-xxx-xxxx null
12348 0 paul smith xxx-xxx-xxxx null
But I would rather get
TABLE AFTER JOIN
acctnum sub fname lname phone division
12345 1 john doe xxx-xxx-xxxx EAST
12346 0 jane doe xxx-xxx-xxxx TOP
12347 0 rob roy xxx-xxx-xxxx BALLOON
12348 0 paul smith xxx-xxx-xxxx NORTH
And I'm trying to avoid:
TABLE AFTER JOIN
acctnum sub fname lname phone division
12345 1 john doe xxx-xxx-xxxx EAST
12345 1 john doe xxx-xxx-xxxx WEST
12345 1 john doe xxx-xxx-xxxx NORTH
12346 0 jane doe xxx-xxx-xxxx TOP
12346 0 jane doe xxx-xxx-xxxx BOTTOM
12347 0 rob roy xxx-xxx-xxxx BALLOON
12348 0 paul smith xxx-xxx-xxxx NORTH
So I decided to go with using a union and two if conditions. I'll accept a null for conditions where the sub account is defined in table 1 but not in table 2, and for everything else, I'll just match against the min.

If I'm understanding correctly, it looks like you're trying to join on the sub column if it matches. If there's no match on sub, then you want it to select the "first" row for that acctnum. Is this correct?
If so, you'll need to left join on the full match, then perform another left join on a select statement that determines the division that corresponds to the lowest sub value for that acctnum. The row_number() function can help you with this, like this:
select
    t1.acctnum,
    t1.sub,
    t1.fname,
    t1.lname,
    t1.phone,
    isnull(t2_match.division, t2_first.division) as division
from table1 t1
left join table2 t2_match on t2_match.acctnum = t1.acctnum and t2_match.sub = t1.sub
left join
(
    select
        acctnum,
        sub,
        division,
        row_number() over (partition by acctnum order by sub) as rownum
    from table2
) t2_first on t2_first.acctnum = t1.acctnum and t2_first.rownum = 1 -- keep only the lowest sub per acctnum
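To try the row_number() idea against the sample data from the question, here is a minimal sketch using Python's sqlite3 module. The function name is made up for illustration, coalesce() stands in for isnull() because SQLite has no two-argument isnull(), and the fallback join is restricted to rownum = 1 so each account contributes a single fallback row.

```python
import sqlite3


def division_per_account():
    """Exact sub match first, otherwise the division with the lowest sub."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table table1 (acctnum int, sub int, fname text, lname text, phone text);
        create table table2 (acctnum int, sub int, division text);
        insert into table1 values
            (12345, 1, 'john', 'doe',   'xxx-xxx-xxxx'),
            (12346, 0, 'jane', 'doe',   'xxx-xxx-xxxx'),
            (12347, 0, 'rob',  'roy',   'xxx-xxx-xxxx'),
            (12348, 0, 'paul', 'smith', 'xxx-xxx-xxxx');
        insert into table2 values
            (12345, 1, 'EAST'), (12345, 2, 'WEST'), (12345, 3, 'NORTH'),
            (12346, 1, 'TOP'),  (12346, 2, 'BOTTOM'),
            (12347, 2, 'BALLOON'),
            (12348, 1, 'NORTH');
    """)
    rows = con.execute("""
        select t1.acctnum, t1.sub,
               coalesce(t2_match.division, t2_first.division) as division
        from table1 t1
        left join table2 t2_match
               on t2_match.acctnum = t1.acctnum and t2_match.sub = t1.sub
        left join (
            select acctnum, division,
                   row_number() over (partition by acctnum order by sub) as rownum
            from table2
        ) t2_first
               on t2_first.acctnum = t1.acctnum and t2_first.rownum = 1
        order by t1.acctnum
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in division_per_account():
        print(row)
```

Each TABLE 1 record comes back exactly once, which is the shape the question asks for.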
EDIT
If you don't care at all about which record you get back from table 2 when a matching sub doesn't exist, you could combine two different queries (one that matches the sub and one that just takes the min or max division) with a union.
select
    t1.acctnum,
    t1.sub,
    t1.fname,
    t1.lname,
    t1.phone,
    t2.division
from table1 t1
join table2 t2 on t2.acctnum = t1.acctnum and t2.sub = t1.sub
union
select
    t1.acctnum,
    t1.sub,
    t1.fname,
    t1.lname,
    t1.phone,
    min(t2.division)
from table1 t1
join table2 t2 on t2.acctnum = t1.acctnum
left join table2 t2_match on t2_match.acctnum = t1.acctnum and t2_match.sub = t1.sub
where t2_match.acctnum is null
group by t1.acctnum, t1.sub, t1.fname, t1.lname, t1.phone -- required because of min()
Personally, I don't find the union syntax any more compelling and you now have to maintain the query in two places. For this reason, I'd favor the row_number() approach.

Try this:
SELECT MIN(Table_1.acctnum) AS acctnum, MIN(Table_1.sub) AS sub, MIN(Table_1.fname) AS fname, MIN(Table_1.lname) AS lname, MIN(Table_1.phone) AS phone, MIN(Table_2.division) AS division
FROM Table_1 INNER JOIN Table_2 ON Table_1.acctnum = Table_2.acctnum AND Table_1.sub = Table_2.sub
WHERE Table_1.sub > 0
GROUP BY Table_1.acctnum
UNION
SELECT MIN(Table_1.acctnum) AS acctnum, MIN(Table_1.sub) AS sub, MIN(Table_1.fname) AS fname, MIN(Table_1.lname) AS lname, MIN(Table_1.phone) AS phone, MIN(Table_2.division) AS division
FROM Table_1 INNER JOIN Table_2 ON Table_1.acctnum = Table_2.acctnum
WHERE Table_1.sub = 0
GROUP BY Table_1.acctnum
This is the result:
12345 1 john doe xxxxxxxxxx EAST
12346 0 jane doe xxxxxxxxxx BOTTOM
12347 0 rob roy xxxxxxxxxx BALLOON
12348 0 paul smith xxxxxxxxxx NORTH
If you change min to max, the second row will show TOP instead of BOTTOM.

It may also work for you:
SELECT t1.acctnum, t1.sub, t1.fname, t1.lname, t1.phone,
ISNULL(MAX(t2.division),MAX(t3.division)) as division
FROM table_1 t1
LEFT JOIN table_2 t2 ON (t2.acctnum = t1.acctnum AND t1.sub = t2.sub)
LEFT JOIN table_2 t3 ON (t3.acctnum = t1.acctnum)
GROUP BY t1.acctnum, t1.sub, t1.fname, t1.lname, t1.phone

This will give your desired result, exactly (for the shown data):
Updated to not assume there is always a sub==1 value:
SELECT
T1.acctnum,
T1.sub,
T1.fname,
T1.lname,
T1.phone,
T2.division
FROM
TABLE_1 T1
LEFT JOIN
TABLE_2 T2 ON T1.acctnum = T2.acctnum
AND
T2.sub = (SELECT MIN(T3.sub) FROM TABLE_2 T3 WHERE T1.acctnum = T3.acctnum)
ORDER BY
T1.lname,
T1.fname,
T1.acctnum
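For anyone who wants to check the correlated MIN(sub) join, here is a rough sqlite3 sketch in Python. The function name and the extra account 12349 (which has no rows in TABLE_2) are made up for illustration; everything else is the sample data from the question.

```python
import sqlite3


def lowest_sub_division():
    """Join each account to the TABLE_2 row holding its lowest sub."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table table_1 (acctnum int, sub int, fname text, lname text, phone text);
        create table table_2 (acctnum int, sub int, division text);
        insert into table_1 values
            (12345, 1, 'john', 'doe',   'xxx-xxx-xxxx'),
            (12346, 0, 'jane', 'doe',   'xxx-xxx-xxxx'),
            (12347, 0, 'rob',  'roy',   'xxx-xxx-xxxx'),
            (12348, 0, 'paul', 'smith', 'xxx-xxx-xxxx'),
            (12349, 0, 'no',   'match', 'xxx-xxx-xxxx');  -- hypothetical row
        insert into table_2 values
            (12345, 1, 'EAST'), (12345, 2, 'WEST'), (12345, 3, 'NORTH'),
            (12346, 1, 'TOP'),  (12346, 2, 'BOTTOM'),
            (12347, 2, 'BALLOON'),
            (12348, 1, 'NORTH');
    """)
    rows = con.execute("""
        select t1.acctnum, t1.sub, t2.division
        from table_1 t1
        left join table_2 t2
               on t1.acctnum = t2.acctnum
              and t2.sub = (select min(t3.sub)
                            from table_2 t3
                            where t3.acctnum = t1.acctnum)
        order by t1.acctnum
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in lowest_sub_division():
        print(row)
```

Note that T1.sub plays no part in the join condition, so an account whose own sub is not the lowest one would still be given the lowest-sub division. The LEFT JOIN keeps accounts with no TABLE_2 rows, with a null division.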

Related

Creating one new column out of two existing columns

I have this table t1 with columns c1: old_email and c2: new_email.
The goal: I want to create a new column, or query this table in such a way, that the values from c1 and c2 are merged into one column c3, which I can then use for a subquery in a where clause:
Select * from t2 where t2.email=(select c3 from t1)
name        | old_email          | new_email
------------+--------------------+-----------------
Johnny Go   | JG@yahoo.com       |
Bertie Post | Bertie@hotmail.com | Bertie@gmail.com
Can't you join using both conditions?
select t2.* from
t2 join t1 on t2.email in (t1.old_email, t1.new_email)
or
select t2.*
from t1, t2
where t2.email = t1.old_email
or t2.email = t1.new_email
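A quick way to see the IN-based join at work is a small sqlite3 sketch in Python. The contents of t2 and the function name are assumptions made up for the example; the t1 rows follow the question.

```python
import sqlite3


def match_either_email():
    """Return t2 emails that equal either the old or the new email in t1."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table t1 (name text, old_email text, new_email text);
        create table t2 (email text);
        insert into t1 values
            ('Johnny Go',   'JG@yahoo.com',       null),
            ('Bertie Post', 'Bertie@hotmail.com', 'Bertie@gmail.com');
        -- hypothetical lookup rows, not from the question
        insert into t2 values
            ('Bertie@gmail.com'), ('JG@yahoo.com'), ('nobody@example.com');
    """)
    rows = con.execute("""
        select t2.email, t1.name
        from t2
        join t1 on t2.email in (t1.old_email, t1.new_email)
        order by t2.email
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in match_either_email():
        print(row)
```

A null new_email does no harm here: the IN test simply never matches on the missing value.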
The question is a little ambiguous. I am going to infer that you have a table like the one below, and that you want to match either email address against a separate table of email addresses.
ID | NAME           | OLD_EMAIL                  | NEW_EMAIL
---+----------------+----------------------------+-------------------------
1  | David Lin      | david.lin@example.com      | david@fakegoofmail.com
2  | Christy Thomas | christy.thomas@example.com | christy@fakegoofmail.com
3  | Erin Hill      | erin.hill@example.com      | erin@fakegoofmail.com
4  | Noah Collins   | noah.collins@example.com   | noah@fakegoofmail.com
5  | Andrew Salazar | andrew.salazar@example.com | andrew@fakegoofmail.com
You are going to want to put both old_email and new_email in one column. We can do this with unpivot.
select
p.*
from t1
unpivot(email for email_field in (old_email, new_email)) p;
The result would look like so.
ID | NAME           | EMAIL_FIELD | EMAIL
---+----------------+-------------+---------------------------
1  | David Lin      | OLD_EMAIL   | david.lin@example.com
1  | David Lin      | NEW_EMAIL   | david@fakegoofmail.com
2  | Christy Thomas | OLD_EMAIL   | christy.thomas@example.com
2  | Christy Thomas | NEW_EMAIL   | christy@fakegoofmail.com
3  | Erin Hill      | OLD_EMAIL   | erin.hill@example.com
3  | Erin Hill      | NEW_EMAIL   | erin@fakegoofmail.com
4  | Noah Collins   | OLD_EMAIL   | noah.collins@example.com
4  | Noah Collins   | NEW_EMAIL   | noah@fakegoofmail.com
5  | Andrew Salazar | OLD_EMAIL   | andrew.salazar@example.com
5  | Andrew Salazar | NEW_EMAIL   | andrew@fakegoofmail.com
Now you can join your secondary table of emails to perform email matching with a query like the one below.
with t1_cte as (
select
p.*
from t1
unpivot(email for email_field in (old_email, new_email)) p order by 1
)
select t1.* from t1_cte t1
inner join seperate_table_of_emails t2 -- << your secondary table
on t1.email = t2.email;
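If your database has no UNPIVOT, the same long shape can be built with UNION ALL. The sketch below does that in SQLite through Python's sqlite3 module; it swaps UNION ALL in for UNPIVOT, and the lookup table contents, the table name separate_emails, and the function name are assumptions made up for the example.

```python
import sqlite3


def match_unpivoted_emails():
    """Stack old_email and new_email into one column, then join on it."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table t1 (id int, name text, old_email text, new_email text);
        create table separate_emails (email text);
        insert into t1 values
            (1, 'David Lin',      'david.lin@example.com',      'david@fakegoofmail.com'),
            (2, 'Christy Thomas', 'christy.thomas@example.com', 'christy@fakegoofmail.com');
        -- hypothetical lookup rows
        insert into separate_emails values
            ('david@fakegoofmail.com'), ('christy.thomas@example.com');
    """)
    rows = con.execute("""
        with t1_long as (
            -- one row per (person, email column), skipping missing addresses
            select id, name, 'OLD_EMAIL' as email_field, old_email as email
            from t1 where old_email is not null
            union all
            select id, name, 'NEW_EMAIL', new_email
            from t1 where new_email is not null
        )
        select l.id, l.name, l.email_field, l.email
        from t1_long l
        join separate_emails s on s.email = l.email
        order by l.id, l.email_field
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in match_unpivoted_emails():
        print(row)
```

The email_field column tells you whether the match came from the old or the new address.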

Join multiple tables to return only one result for each record from main table

I am joining three tables. My data was migrated from one system (old) to another (new), and I need to compare the two to find both matches and mismatches. The first table lists the accounts being moved. The two systems use different ID types, so this table maps the old-system ID to the new-system ID for each account that was moved. This is my base population.
ID1 ID2
ABC 123
ABC 123
ABC 123
DEF 456
DEF 456
DEF 456
I then have table 2 which is all the data from the old system.
ID Fname Lname
ABC John Smith
ABC Tom Smith
ABC Kate Smith
DEF Jason Thomas
DEF Ruby Thomas
DEF Alex Johnson
Then table 3 is all the data found in the new system.
ID Fname Lname
123 John Smith
123 Tom Smith
123 Kate Smith
456 Jason Thomas
456 Ruby Thomas
Right now when I join these tables on the ID I get a lot more rows than I need.
When I do my join I receive this:
ID Fname_old Lname_old ID2 Fname_new Lname_new
ABC John Smith 123 John Smith
ABC John Smith 123 Tom Smith
ABC John Smith 123 Kate Smith
I am trying to join them where it only returns the row that matches, and if it can't find a match I should still get the ID from the ID file and the data from table 2(old data) as this is the data that was sent to the new system.
ID1 ID2 Fname_old Lname_old Fname_new Lname_new
ABC 123 John Smith John Smith
ABC 123 Tom Smith Tom Smith
ABC 123 Kate Smith Kate Smith
DEF 456 Jason Thomas Jason Thomas
DEF 456 Ruby Thomas Ruby Thomas
DEF 456 Alex Johnson
The code I am using is:
Select a.ID1, a.ID2, b.fname as fname_old, b.lname as lname_old,
c.fname as fname_new, c.lname as lname_new
from table1 a
left join table2 b
on a.ID1 = b.ID
left join table3 c
on a.ID2 = c.ID
If it's just duplicate rows in your first table, you could try making them distinct in a derived table like below:
Select a.ID1, a.ID2, b.fname as fname_old, b.lname as lname_old,
c.fname as fname_new, c.lname as lname_new
from (SELECT DISTINCT ID1, ID2 FROM table1) a
left join table2 b
on a.ID1 = b.ID
left join table3 c
on a.ID2 = c.ID
You are joining on the ID columns.
ID columns are usually unique, but yours contain multiple identical IDs, so the join pairs every row with every other row that shares the ID.
Since you need to compare data, I suggest you look up MATCH and how it works, as that seems closer to what you are looking for here.
You can get a match using row_number():
Select a.ID1, a.ID2, b.fname as fname_old, b.lname as lname_old,
       c.fname as fname_new, c.lname as lname_new
from (select a.*,
             row_number() over (partition by id1 order by id1) as seqnum
      from table1 a
     ) a left join
     (select b.*,
             row_number() over (partition by id order by id) as seqnum
      from table2 b
     ) b
     on a.ID1 = b.ID and a.seqnum = b.seqnum left join
     (select c.*,
             row_number() over (partition by id order by id) as seqnum
      from table3 c
     ) c
     on a.ID2 = c.ID and a.seqnum = c.seqnum;
Note: This does not preserve the "ordering" of the original values, so any rows can be matched with any other. Why? SQL tables represent unordered sets.
If there is an ordering in the tables, you can use that in the order by clauses to get a match consistent with the ordering.
If you can compare on first name and last name, this code will work.
select DISTINCT a.ID1, a.ID2, b.fname as fname_old, b.lname as lname_old, c.fname as
fname_new, c.lname as lname_new from table2 b
left join table1 a on a.ID1=b.ID
left join table3 c on a.ID2=c.ID and b.Fname=c.Fname and b.Lname=c.Lname
My Result :
ID1 ID2 fname_old lname_old fname_new lname_new
ABC 123 John Smith John Smith
ABC 123 Kate Smith Kate Smith
ABC 123 Tom Smith Tom Smith
DEF 456 Alex Johnson NULL NULL
DEF 456 Jason Thomas Jason Thomas
DEF 456 Ruby Thomas Ruby Thomas
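To reproduce that result, here is a small sqlite3 sketch in Python loaded with the three sample tables from the question. The function name is made up, and the ORDER BY is added only to make the output deterministic.

```python
import sqlite3


def compare_old_and_new():
    """Pair each old-system row with the new-system row of the same name."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table table1 (ID1 text, ID2 text);
        create table table2 (ID text, fname text, lname text);
        create table table3 (ID text, fname text, lname text);
        insert into table1 values
            ('ABC', '123'), ('ABC', '123'), ('ABC', '123'),
            ('DEF', '456'), ('DEF', '456'), ('DEF', '456');
        insert into table2 values
            ('ABC', 'John', 'Smith'),   ('ABC', 'Tom', 'Smith'),
            ('ABC', 'Kate', 'Smith'),   ('DEF', 'Jason', 'Thomas'),
            ('DEF', 'Ruby', 'Thomas'),  ('DEF', 'Alex', 'Johnson');
        insert into table3 values
            ('123', 'John', 'Smith'),   ('123', 'Tom', 'Smith'),
            ('123', 'Kate', 'Smith'),   ('456', 'Jason', 'Thomas'),
            ('456', 'Ruby', 'Thomas');
    """)
    rows = con.execute("""
        select distinct a.ID1, a.ID2,
               b.fname as fname_old, b.lname as lname_old,
               c.fname as fname_new, c.lname as lname_new
        from table2 b
        left join table1 a on a.ID1 = b.ID
        left join table3 c
               on a.ID2 = c.ID and b.fname = c.fname and b.lname = c.lname
        order by a.ID1, b.fname
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in compare_old_and_new():
        print(row)
```

DISTINCT is what collapses the repeated ID rows in table1; the unmatched old-system row (Alex Johnson) is kept with nulls on the new-system side.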
You say this data was transferred from one system to another, so you expect all of it to match. You could hence reduce the query to find only the data that doesn't match, if any.
Here is a SQL standard compliant query. You tagged your request with hive. I don't know about hive, so you may have to adjust the query.
select
t2.id as id1,
t3.id as id2,
t2.fname as fname_old,
t2.lname as lname_old,
t3.fname as fname_new,
t3.lname as lname_new
from table2 t2
full outer join table3 t3
on t3.fname = t2.fname
and t3.lname = t2.lname
and exists (select null from table1 t1 where t1.id1 = t2.id and t1.id2 = t3.id)
where t2.id is null or t3.id is null;
This is a full anti join. It returns all rows that have no exact match in the other table. It doesn't, however, guess which deviating rows may be pairs. You will get a result like this:
ID1 | ID2 | Fname_old | Lname_old | Fname_new | Lname_new
----+-----+-----------+-----------+-----------+----------
DEF | | Alex | Johnson | |
GHI | | Jone | Miller | |
GHI | | Maxx | Miller | |
GHI | | Fritz | Miller | |
| 789 | | | Joan | Miller
| 789 | | | Max | Miller
| 799 | | | Fritz | Miller
As you see, you would have to examine this result manually. But ideally the query shouldn't return any row at all, which would just prove that everything went as expected and nobody (system or person) messed with the data :-)

PostgreSQL: How to join two tables using between date?

I really don't know how to ask this question of mine.
I'll illustrate it using two tables I needed to join.
TABLE_1
Name Date
John 01-01-2016
May 04-08-2015
Rose 10-25-2016
Mary 12-15-2015
Ruby 07-07-2017
TABLE_2
Signatory DateFrom DateTo
President 1 01-01-2015 12-31-2015
President 2 01-01-2016 12-31-2016
RESULT:
Name Date Signatory
John 01-01-2016 President 2
May 04-08-2015 President 1
Rose 10-25-2016 President 2
Mary 12-15-2015 President 1
Ruby 07-07-2017 NULL
All I need is to check whether the Date of Table_1 falls within the DateFrom and DateTo of Table_2 to get the Signatory field.
How can I do that?
Thanks a lot! ^_^
Try this:
SELECT t1.*, t2.Signatory
FROM Table_1 AS t1
LEFT JOIN Table_2 AS t2
ON t1."Date" BETWEEN t2.DateFrom AND t2.DateTo
What you need is just a LEFT JOIN with BETWEEN in the ON clause in order to determine whether Date field of Table_1 falls within any [DateFrom, DateTo] interval of Table_2.
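The range join can be tried out with a short sqlite3 sketch in Python. It runs on SQLite rather than PostgreSQL, so the dates are stored as ISO text (YYYY-MM-DD), which is an assumption of this sketch that makes BETWEEN compare correctly on strings. The function name is made up.

```python
import sqlite3


def signatory_per_row():
    """Left join each Table_1 row to the signatory whose range covers its date."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table Table_1 (Name text, "Date" text);
        create table Table_2 (Signatory text, DateFrom text, DateTo text);
        insert into Table_1 values
            ('John', '2016-01-01'), ('May',  '2015-04-08'),
            ('Rose', '2016-10-25'), ('Mary', '2015-12-15'),
            ('Ruby', '2017-07-07');
        insert into Table_2 values
            ('President 1', '2015-01-01', '2015-12-31'),
            ('President 2', '2016-01-01', '2016-12-31');
    """)
    rows = con.execute("""
        select t1.Name, t1."Date", t2.Signatory
        from Table_1 as t1
        left join Table_2 as t2
               on t1."Date" between t2.DateFrom and t2.DateTo
        order by t1.rowid  -- keep the insertion order of the sample data
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in signatory_per_row():
        print(row)
```

Ruby's date falls outside both ranges, so the LEFT JOIN returns that row with a null Signatory. If the ranges in Table_2 ever overlap, a date inside the overlap would come back once per matching range.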

SQL Query, GROUP/COUNT issue with INNER JOIN

I've got a data set composed primarily of dates, IDs, and addresses, that looks a bit like this:
datadate id address
20150801 Bob 123
20150801 Bob 123
20150801 Dan 345
20150801 Dan 456
20150801 Dan 567
20150801 George 234
20150801 Jim 123
20150801 Jim 123
20150801 John 678
20150801 John 123
20150802 Tom 123
20150802 Tom 234
20150802 Tom 345
My goal is to write a query which identifies any IDs which are associated with multiple distinct addresses for a specific date (or date range). I want the query results to give me the name and distinct addresses. So, for this data set, the results I'd like to see would look like this, for date 8/1/2015:
datadate id address
20150801 Dan 345
20150801 Dan 456
20150801 Dan 567
20150801 John 678
20150801 John 123
The query I've worked up so far is this, but it's not really working for me:
SELECT a.[datadate], a.[id], a.[address], b.[count1]
FROM table1 AS a INNER JOIN (SELECT [id], COUNT([address]) as [count1] FROM table1 GROUP BY [id] having count1 > 1 ) AS b ON a.[id]=b.[id]
WHERE a.[datadate] = '20150801'
ORDER BY a.[id], a.[address];
Any suggestions?
Modifying your existing query a little, you can change your having clause to use count(distinct address) and then join back to the table to get your address values, like this:
SELECT t.datadate
,t.id
,t1.address
FROM (
SELECT datadate
,id
,count(DISTINCT address) address
FROM test
WHERE datadate = '20150801'
GROUP BY datadate,id
HAVING count(DISTINCT address) > 1
) t
INNER JOIN test t1 ON t.datadate = t1.datadate
AND t.id = t1.id;
I tested this on SQL Server, but it should be similar in MS-Access as well.
Edit
I just read your question again and it appears you want all duplicates. In which case I would use exists to see if another row with the same id but a different address exists.
select * from mytable t1
where datadate = '20150801'
and exists (
select 1 from mytable t2
where t2.id = t1.id
and t2.address <> t1.address
and t2.datadate = t1.datadate
)
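Here is a rough sqlite3 sketch in Python of the EXISTS approach, loaded with the sample rows from the question. The function name is made up, and DISTINCT is an addition of this sketch so that repeated (id, address) rows are listed only once.

```python
import sqlite3


def ids_with_multiple_addresses(datadate):
    """List (datadate, id, address) for ids with more than one distinct address."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table mytable (datadate text, id text, address text);
        insert into mytable values
            ('20150801', 'Bob', '123'),    ('20150801', 'Bob', '123'),
            ('20150801', 'Dan', '345'),    ('20150801', 'Dan', '456'),
            ('20150801', 'Dan', '567'),    ('20150801', 'George', '234'),
            ('20150801', 'Jim', '123'),    ('20150801', 'Jim', '123'),
            ('20150801', 'John', '678'),   ('20150801', 'John', '123'),
            ('20150802', 'Tom', '123'),    ('20150802', 'Tom', '234'),
            ('20150802', 'Tom', '345');
    """)
    rows = con.execute("""
        select distinct t1.datadate, t1.id, t1.address
        from mytable t1
        where t1.datadate = ?
          and exists (
              select 1 from mytable t2
              where t2.id = t1.id
                and t2.address <> t1.address
                and t2.datadate = t1.datadate
          )
        order by t1.id, t1.address
    """, (datadate,)).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in ids_with_multiple_addresses("20150801"):
        print(row)
```

Bob and Jim have repeated rows but only one distinct address each, so they are filtered out; only Dan and John remain for 20150801.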

Retrieve all distinct records from table and if any changes happen between two similar distinct record then need to consider both. Using select query

I want to convert table1 into table2. I need to find all distinct records, ignoring mis_date, with one important condition: if a value changes between two otherwise similar records, I want both of them kept as separate records.
Example:
i/p
empId Empname Pancard MisDate
123 alex ads234 31/11/2012
123 alex ads234 31/12/2012
123 alex ads234 31/01/2013
123 alex dds124 29/02/2013
123 alex ads234 31/03/2013
123 alex ads234 31/04/2013
123 alex dds124 30/05/2013
Expected o/p
empId Empname Pancard MisDate
123 alex ads234 31/11/2012
123 alex dds124 29/02/2013
123 alex ads234 31/03/2013
123 alex dds124 30/05/2013
Assuming there's only one row for each MisDate (otherwise you'll have to find another way to specify ordering):
SELECT t1.empId, t1.Empname, t1.Pancard, t1.MisDate
FROM Table1 t1
LEFT OUTER JOIN Table1 t2
ON t2.MisDate = (SELECT MAX(MisDate) FROM Table1 t3 WHERE t3.MisDate < t1.MisDate)
WHERE t2.empId IS NULL
OR t2.empId <> t1.empId OR t2.Empname <> t1.Empname OR t2.Pancard <> t1.Pancard
SQL Fiddle example
This performs a self-join on the previous record, as ordered by MisDate, outputting if it is different or if there is no previous record (it is the first row).
Note: You've got some funky dates. I assume these are just transcription errors and have corrected them in the fiddle.
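The self-join on the previous record can be checked with a minimal sqlite3 sketch in Python. The dates below are assumptions: the invalid ones from the question are replaced with the nearest real month-end and stored as ISO text so they sort correctly. The function name is made up.

```python
import sqlite3


def pancard_changes():
    """Keep a row when it is the first one or differs from the previous row."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        create table Table1 (empId int, Empname text, Pancard text, MisDate text);
        insert into Table1 values
            (123, 'alex', 'ads234', '2012-11-30'),
            (123, 'alex', 'ads234', '2012-12-31'),
            (123, 'alex', 'ads234', '2013-01-31'),
            (123, 'alex', 'dds124', '2013-02-28'),
            (123, 'alex', 'ads234', '2013-03-31'),
            (123, 'alex', 'ads234', '2013-04-30'),
            (123, 'alex', 'dds124', '2013-05-30');
    """)
    rows = con.execute("""
        select t1.empId, t1.Empname, t1.Pancard, t1.MisDate
        from Table1 t1
        left join Table1 t2
               on t2.MisDate = (select max(t3.MisDate)
                                from Table1 t3
                                where t3.MisDate < t1.MisDate)
        where t2.empId is null          -- no previous row: first record
           or t2.empId <> t1.empId
           or t2.Empname <> t1.Empname
           or t2.Pancard <> t1.Pancard  -- something changed since the previous row
        order by t1.MisDate
    """).fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in pancard_changes():
        print(row)
```

As in the query above, the "previous record" is looked up across the whole table, so with more than one employee you would likely want to restrict the subquery and the join to the same empId.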