If one join works per rep id, don't join next - sql

I am matching two datasets that I imported into a Redshift DB: both are at rep id level.
This is my initial query to match the two datasets:
select *
from #t t
join #t2 t2
on lower(trim(t.unique_id))=lower(trim(t2.unique_id))
or lower(trim(t.email))=lower(trim(t2.email))
or lower(trim(split_part(t.first_name,',',1))||trim(split_part(t.last_name,',',1)))=lower(trim(split_part(t2.first_name,',',1))||trim(split_part(t2.last_name,',',1)))
#t is the source of truth I am matching to, and unique_id is supposedly the universal identifier for rep id (an internal identifier), though it only matches about 60% of the time. However, in some cases the #t2 table (incorrectly) has multiple unique_ids per rep, and also incorrectly has multiple emails.
How can I change it so that it is more restrictive, i.e. when a rep gets a match by unique_id, don't also match that rep by the next condition; when it matches by email, don't match it again; and only as a last resort join by first name/last name.
Thank you!

I think there are a few ways to skin this cat. As one option, you could assign a rank to each join condition with a CASE statement, and then pick out the rows that have the minimum rank:
SELECT *
FROM
(
SELECT sub1.*,
min(ranktest) OVER (PARTITION BY t_unique_id) as minrank
FROM
(
-- #t's unique_id is re-aliased so the window partition is unambiguous
select t.unique_id as t_unique_id, t.*, t2.*,
CASE WHEN lower(trim(t.unique_id))=lower(trim(t2.unique_id)) THEN 1
WHEN lower(trim(t.email))=lower(trim(t2.email)) THEN 2
WHEN lower(trim(split_part(t.first_name,',',1))||trim(split_part(t.last_name,',',1)))=lower(trim(split_part(t2.first_name,',',1))||trim(split_part(t2.last_name,',',1))) THEN 3
END as ranktest
from #t t
join #t2 t2
on lower(trim(t.unique_id))=lower(trim(t2.unique_id))
or lower(trim(t.email))=lower(trim(t2.email))
or lower(trim(split_part(t.first_name,',',1))||trim(split_part(t.last_name,',',1)))=lower(trim(split_part(t2.first_name,',',1))||trim(split_part(t2.last_name,',',1)))
) sub1
) sub2
WHERE ranktest = minrank;
You could also do this by querying twice: once to get your data, and once to get the min(ranktest). It will almost definitely be slower, but... it's a little prettier:
WITH subquery AS
(
select t.unique_id as t_unique_id, t.*, t2.*,
CASE WHEN lower(trim(t.unique_id))=lower(trim(t2.unique_id)) THEN 1
WHEN lower(trim(t.email))=lower(trim(t2.email)) THEN 2
WHEN lower(trim(split_part(t.first_name,',',1))||trim(split_part(t.last_name,',',1)))=lower(trim(split_part(t2.first_name,',',1))||trim(split_part(t2.last_name,',',1))) THEN 3
END as ranktest
from #t t
join #t2 t2
on lower(trim(t.unique_id))=lower(trim(t2.unique_id))
or lower(trim(t.email))=lower(trim(t2.email))
or lower(trim(split_part(t.first_name,',',1))||trim(split_part(t.last_name,',',1)))=lower(trim(split_part(t2.first_name,',',1))||trim(split_part(t2.last_name,',',1)))
)
SELECT *
FROM subquery t1
WHERE t1.ranktest = (SELECT min(ranktest) FROM subquery WHERE subquery.t_unique_id = t1.t_unique_id);
Alternatively, you could run this as a UNION ALL, testing for the join differently each time to avoid repeats and only allowing the top most ranked join through:
select *
from #t t
join #t2 t2
on lower(trim(t.unique_id))=lower(trim(t2.unique_id))
UNION ALL
select *
from #t t
join #t2 t2
on lower(trim(t.unique_id))<>lower(trim(t2.unique_id))
AND lower(trim(t.email))=lower(trim(t2.email))
UNION ALL
select *
FROM #t t
join #t2 t2
ON lower(trim(t.unique_id))<>lower(trim(t2.unique_id))
AND lower(trim(t.email))<>lower(trim(t2.email))
AND lower(trim(split_part(t.first_name,',',1))||trim(split_part(t.last_name,',',1)))=lower(trim(split_part(t2.first_name,',',1))||trim(split_part(t2.last_name,',',1)))

Related

Joining and grouping to equate on two tables

I've tried to minify this problem as much as possible. I've got two tables which share some Id's (among other columns)
tbl1.id    tbl2.id
-------    -------
1          1
1          1
2          1
           2
           2
Firstly, I can get each table to resolve to a simple count of how many of each Id there is:
select id, count(*) from tbl1 group by id
select id, count(*) from tbl2 group by id
id | tbl1-count      id | tbl2-count
---------------      ---------------
 1 | 2                1 | 3
 2 | 1                2 | 2
but then I'm at a loss. I'm trying to get the following output, which shows the count from tbl2 for each id divided by the count from tbl1 for the same id:
id | count of id in tbl2 / count of id in tbl1
==========
1 | 1.5
2 | 2
So far I've got this:
select tbl1.Id, tbl2.Id, count(*)
from tbl1
join tbl2 on tbl1.Id = tbl2.Id
group by tbl1.Id, tbl2.Id
which just gives me... well... something nowhere near what I need, to be honest! I was trying count(tbl1.Id), count(tbl2.Id) but get the same multiplied amount (because I'm joining I guess?) - I can't get the individual representations into individual columns where I can do the division.
This takes your naming of the tables into account -- the query from tbl2 needs to be first so the results will include all records from tbl2. The LEFT JOIN will include all results from the first query, but only join those results that exist in tbl1. (Alternatively, you could use a FULL OUTER JOIN, or UNION both results together in the first query.) I also added an IIF to give you an option if there are no records in tbl1 (dividing by NULL would produce NULL anyway, but you can do what you want).
Counts are cast as decimal so that the ratio will be returned as a decimal. You can adjust precision as required.
SELECT tb2.id, tb2.table2Count, tb1.table1Count,
IIF(ISNULL(tb1.table1Count, 0) != 0, tb2.table2Count / tb1.table1Count, null) AS ratio
FROM (
SELECT id, CAST(COUNT(1) AS DECIMAL(18, 5)) AS table2Count
FROM tbl2
GROUP BY id
) AS tb2
LEFT JOIN (
SELECT id, CAST(COUNT(1) AS DECIMAL(18, 5)) AS table1Count
FROM tbl1
GROUP BY id
) AS tb1 ON tb1.id = tb2.id
(A subquery with a LEFT JOIN will allow the query optimizer to determine how to generate the results and will generally outperform a CROSS APPLY, since CROSS APPLY executes a calculation for every record.)
Assuming those expected results are what you want, this is how I would do it:
CREATE TABLE T1 (ID int);
CREATE TABLE T2 (ID int);
GO
INSERT INTO T1 VALUES(1),(1),(2);
INSERT INTO T2 VALUES(1),(1),(1),(2),(2);
GO
SELECT T1.ID AS OutID,
(T2.T2Count * 1.) / COUNT(T1.ID) AS OutCount --Might want a CONVERT to a smaller scale and precision decimal here
FROM T1
CROSS APPLY (SELECT T2.ID, COUNT(T2.ID) AS T2Count
FROM T2
WHERE T2.ID = T1.ID
GROUP BY T2.ID) T2
GROUP BY T1.ID,
T2.T2Count;
GO
DROP TABLE T1;
DROP TABLE T2;
You can aggregate in subqueries and then join:
select t1.id, t2.cnt * 1.0 / t1.cnt
from (select id, count(*) as cnt
from tbl1
group by id
) t1 join
(select id, count(*) as cnt
from tbl2
group by id
) t2
on t1.id = t2.id

All possible combinations of records in table sql server

I have a table
declare @table table(t varchar(50), d varchar(50), activ varchar(10), groupid int, rownum int)
insert into @table values('ALK','ceri', '0.2',1,1)
insert into @table values('ALK','criz', '24',1,2)
insert into @table values('EGFR','erlo', '2',2,3)
insert into @table values('EGFR','gefi', '57',2,4)
insert into @table values('EGFR','ibru', '5.6',2,5)
insert into @table values('EGFR','ceri', '900',2,6)
insert into @table values('EGFR','cetu', 'NULL',2,7)
insert into @table values('EGFR','afat', '10',2,8)
insert into @table values('EGFR','lapa', '10.8',2,9)
insert into @table values('EGFR','pani', 'NULL',2,10)
insert into @table values('ERBB2','pert', 'NULL',3,11)
insert into @table values('ERBB2','tras', 'NULL',3,12)
insert into @table values('ERBB2','lapa', '9.2',3,13)
insert into @table values('ERBB2','ado-', 'NULL',3,14)
insert into @table values('ERBB2','afat', '14',3,15)
insert into @table values('ERBB2','ibru', '9.4',3,16)
In the output I need all combinations by groupid or t, in the format
t,d,t,d,t,d,activ and so on; then I will qualify the best combinations.
Any help will be appreciated. This will show doctors the optimum combination of drugs for cancer patients. The table is dynamic and different for every patient.
Thank you
For all possible combinations, you would use CROSS JOIN:
SELECT * FROM table1 AS t1
CROSS JOIN table2 AS t2
-- note: a CROSS JOIN takes no ON clause
Keep in mind this gives an O(n*m) result set (n^2 when a table is crossed with itself), likely to be huge for large sets of data.
I will use #TT, a temp table loaded with the same rows, in place of the table variable @table so the snippets below stay short.
I also changed the datatype of activ to float, with real NULLs in place of the 'NULL' strings, so that SUM(activ) works.
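A possible setup for #TT under those assumptions (any scratch table loaded with the same 16 rows will do):
-- same data as the question's table variable, with activ as float and real NULLs
create table #TT (t varchar(50), d varchar(50), activ float, groupid int, rownum int);
insert into #TT values
('ALK','ceri', 0.2, 1, 1), ('ALK','criz', 24, 1, 2),
('EGFR','erlo', 2, 2, 3), ('EGFR','gefi', 57, 2, 4),
('EGFR','ibru', 5.6, 2, 5), ('EGFR','ceri', 900, 2, 6),
('EGFR','cetu', NULL, 2, 7), ('EGFR','afat', 10, 2, 8),
('EGFR','lapa', 10.8, 2, 9), ('EGFR','pani', NULL, 2, 10),
('ERBB2','pert', NULL, 3, 11), ('ERBB2','tras', NULL, 3, 12),
('ERBB2','lapa', 9.2, 3, 13), ('ERBB2','ado-', NULL, 3, 14),
('ERBB2','afat', 14, 3, 15), ('ERBB2','ibru', 9.4, 3, 16);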
There are really 3 possible cross joins
-- #1 -- producing 256 rows
select * from #TT as T1
cross join #TT as T2
-- #2 -- produces 104 rows
select * from #TT as T1
cross join #TT as T2
where T1.GroupID = T2.GroupID
-- #3 -- produces 104
select * from #TT as T1
cross join #TT as T2
where T1.t = T2.t
The 1st is a true cross join on the whole table.
The 2nd and 3rd are cross joins filtered on GroupID and t respectively, but they are identical since Group 1 represents t='ALK', and so on. This is easily confirmed: a UNION of #2 and #3 also produces 104 rows.
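For reference, that check looks something like this (UNION de-duplicates, so two identical 104-row sets still come back as 104 rows):
-- confirm queries #2 and #3 return the same set of rows
select * from #TT as T1 cross join #TT as T2 where T1.GroupID = T2.GroupID
union
select * from #TT as T1 cross join #TT as T2 where T1.t = T2.t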
However, select * on a self join is silly as is obvious if you change select * to
select T1.*, '===', T2.*
You can see the columns on the left of '===' are the same as the columns to the right of '==='
Since GroupID is an integer I would write the cross join as
select T1.* from #TT as T1
cross join #TT as T2
where T1.GroupID = T2.GroupID
Now, since the poster wants to rank based on the smallest total activ, I think it makes sense to group the result by GroupID, t, and d, report the sum of activ, and order by GroupID and sum(activ):
-- #4 adding group by and sum -- 16 rows generated
select T1.groupid, T1.t, T1.d, sum(T1.activ) as SumActiv
from #TT as T1
cross join #TT as T2
where T1.groupid = T2.groupid
group by T1.t, T1.groupid, T1.d
order by groupid, sum(T1.Activ)
Now you are getting close except for the fact that no CROSS JOIN is needed at all
-- #5 remove the cross join
select T1.groupid, T1.t, T1.d, sum(T1.activ) as SumActiv
from #TT as T1
group by T1.t, T1.groupid, T1.d
When I remove the cross join portion of the query I get the exact same result. I think we finally have what is wanted, with the possible exception of removing all but the first row for each combination of GroupID and d
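If that last step is wanted, one sketch (assuming "first" means the row with the smallest SumActiv) is a ROW_NUMBER filter on top of query #5:
-- keep only the first row per GroupID/d combination, smallest SumActiv first
select groupid, t, d, SumActiv
from (
select T1.groupid, T1.t, T1.d, sum(T1.activ) as SumActiv,
row_number() over (partition by T1.groupid, T1.d order by sum(T1.activ)) as rn
from #TT as T1
group by T1.t, T1.groupid, T1.d
) x
where rn = 1
order by groupid, SumActiv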

What's the best way to select data only appearing in one of two tables?

If I have two tables such as this:
CREATE TABLE #table1 (id INT, name VARCHAR(10))
INSERT INTO #table1 VALUES (1,'John')
INSERT INTO #table1 VALUES (2,'Alan')
INSERT INTO #table1 VALUES (3,'Dave')
INSERT INTO #table1 VALUES (4,'Fred')
CREATE TABLE #table2 (id INT, name VARCHAR(10))
INSERT INTO #table2 VALUES (1,'John')
INSERT INTO #table2 VALUES (3,'Dave')
INSERT INTO #table2 VALUES (5,'Steve')
And I want to see all rows which only appear in one of the tables, what would be the best way to go about this?
All I can think of is to either do:
(SELECT * from #table1 EXCEPT SELECT * FROM #table2)
UNION
(SELECT * from #table2 EXCEPT SELECT * FROM #table1)
Or something along the lines of:
SELECT id,MAX(name) as name FROM
(
SELECT *,1 as count from #table1 UNION ALL
SELECT *,1 as count from #table2
) data
group by id
HAVING SUM(count) =1
Which would return Alan, Fred and Steve in this case.
But these feel really clunky - is there a more efficient way of approaching this?
select coalesce(t1.id, t2.id) id,
coalesce(t1.name, t2.name) name
from #table1 t1
full outer join #table2 t2
on t1.id = t2.id
where t1.id is null
or t2.id is null
The full outer join guarantees records from both sides of the join. Any record that does not exist on both sides (the ones you are looking for) will have NULLs on one side or the other. That's why we filter for NULL.
The COALESCE is there to guarantee that the non-NULL value is the one displayed.
Finally, it's worth highlighting that matches are detected by id. If you also want matching to consider name, add name to the JOIN; if you want to match by name only, join by name only. This solution (using a JOIN) gives you that flexibility; there is a sketch of the id-plus-name variant below.
BTW, since you provided the CREATE and INSERT code, I actually ran it, and the code above is fully working.
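For instance, a sketch of the id-plus-name variant mentioned above, which treats a row as present in both tables only when id and name both match:
-- rows count as matched only when id AND name agree
select coalesce(t1.id, t2.id) id,
       coalesce(t1.name, t2.name) name
from #table1 t1
full outer join #table2 t2
  on t1.id = t2.id
 and t1.name = t2.name
where t1.id is null
   or t2.id is null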
You can use EXCEPT and INTERSECT:
-- All rows
SELECT * FROM #table1
UNION
SELECT * FROM #table2
EXCEPT -- except
(
-- those in both tables
SELECT * FROM #table1
INTERSECT
SELECT * FROM #table2
)
Not sure if this is any better than your EXCEPT and UNION example...
select id, name
from
(select *, count(*) over(partition by checksum(*)) as cc
from (select *
from #table1
union all
select *
from #table2
) as T
) as T
where cc = 1

SQLServer join two tables

I've got a question for you. I'm having a hard time trying to combine two tables; I can't manage to find the correct query.
I have two tables:
T1: 1 column, has X records
T2: 1 column, has Y records
Note: Y can never be greater than X, but it is often less.
I want to join those tables in order to have a table with two columns,
t3: ColumnFromT1, columnFromT2.
When Y is less than X, the T2 field values get repeated and spread across all my other rows, but I want to get NULL once ALL the values from T2 have been used.
How could I achieve that?
Thanks
You could give each table a row number in a subquery. Then you can left join on that row number. To recycle rows from the second table, take the modulus % of the first table's row number.
Example:
select Sub1.col1
, Sub2.col1
from (
select row_number() over (order by col1) as rn
, *
from @t1
) Sub1
left join
(
select row_number() over (order by col1) as rn
, *
from @t2
) Sub2
on (Sub1.rn - 1) % (select count(*) from @t2) + 1 = Sub2.rn
Test data:
declare #t1 table (col1 int)
declare #t2 table (col1 datetime)
insert @t1 values (1), (2), (3), (4), (5)
insert @t2 values ('2010-01-01'), ('2012-02-02')
This prints:
1 2010-01-01
2 2012-02-02
3 2010-01-01
4 2012-02-02
5 2010-01-01
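If you instead want NULLs once all of T2's rows have been used (as the question describes), a sketch of the same idea without the modulus, joining on the row number directly:
-- no recycling: rows beyond T2's count get NULL
select Sub1.col1
, Sub2.col1
from (
select row_number() over (order by col1) as rn
, *
from @t1
) Sub1
left join
(
select row_number() over (order by col1) as rn
, *
from @t2
) Sub2
on Sub1.rn = Sub2.rn
With the test data above, rows 3, 4 and 5 then come back with NULL.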
You are looking for a LEFT JOIN (http://www.w3schools.com/sql/sql_join_left.asp), e.g. T1 LEFT JOIN T2.
say they both have column CustomerID in common
SELECT *
FROM T1
LEFT JOIN
T2 on t1.CustomerId = T2.CustomerId
This will return all records in T1 and those that match in T2 with nulls for the T2 values where they do not match.
Make sure you are joining the tables on a common column (or common column set if more than one column are necessary to perform the join). If not, you are doing a cartesian join ( http://ezinearticles.com/?What-is-a-Cartesian-Join?&id=3560672 )
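For illustration, a cartesian join is what you get with no join condition at all -- every row of T1 paired with every row of T2, i.e. X * Y rows:
-- cartesian (cross) join: no ON condition, X * Y result rows
SELECT *
FROM T1
CROSS JOIN T2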

How do I compare 2 rows from the same table (SQL Server)?

I need to create a background job that processes a table looking for rows matching on a particular id with different statuses. It will store the row data in a string to compare the data against a row with a matching id.
I know the syntax to get the row data, but I have never tried comparing 2 rows from the same table before. How is it done? Would I need to use variables to store the data from each? Or some other way?
(Using SQL Server 2008)
You can join a table to itself as many times as you require, it is called a self join.
An alias is assigned to each instance of the table (as in the example below) to differentiate one from another.
SELECT a.SelfJoinTableID
FROM dbo.SelfJoinTable a
INNER JOIN dbo.SelfJoinTable b
ON a.SelfJoinTableID = b.SelfJoinTableID
INNER JOIN dbo.SelfJoinTable c
ON a.SelfJoinTableID = c.SelfJoinTableID
WHERE a.Status = 'Status to filter a'
AND b.Status = 'Status to filter b'
AND c.Status = 'Status to filter c'
OK, after 2 years it's finally time to correct the syntax:
SELECT t1.value, t2.value
FROM MyTable t1
JOIN MyTable t2
ON t1.id = t2.id
WHERE t1.id = @id
AND t1.status = @status1
AND t2.status = @status2
Some people find the following alternative syntax easier to see what is going on:
select t1.value,t2.value
from MyTable t1
inner join MyTable t2 on
t1.id = t2.id
where t1.id = @id
SELECT COUNT(*) FROM (SELECT * FROM tbl WHERE id=1 UNION SELECT * FROM tbl WHERE id=2) a
If you get two rows back, they are different; if you get one, they are the same.
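For the count to be meaningful, list only the columns you actually want to compare (col1 and col2 below are placeholder names) rather than SELECT *, since including id itself would always make the two rows differ; a sketch:
-- 1 = the compared columns match, 2 = they differ
SELECT COUNT(*)
FROM (SELECT col1, col2 FROM tbl WHERE id = 1
      UNION
      SELECT col1, col2 FROM tbl WHERE id = 2) a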
SELECT * FROM A AS b INNER JOIN A AS c ON b.a = c.a
WHERE b.a = 'some column value'
I had a situation where I needed to compare each row of a table with the next row ("next" here is relative to my problem specification; in the example below, the next row is determined by the ORDER BY clause inside the row_number() function), so I wrote this:
DECLARE @T TABLE (col1 nvarchar(50));
insert into @T VALUES ('A'),('B'),('C'),('D'),('E')
select I1.col1 Instance_One_Col, I2.col1 Instance_Two_Col from (
select col1,row_number() over (order by col1) as row_num
FROM @T
) AS I1
left join (
select col1,row_number() over (order by col1) as row_num
FROM @T
) AS I2 on I1.row_num = I2.row_num - 1
After that I can compare each row to the next one as I need.
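For example, a sketch building on the query above that flags where the next row's value differs (the output column names are just illustrative):
-- compare each row to the next one; the last row has no "next", so it stays NULL
select I1.col1 as current_value,
       I2.col1 as next_value,
       case when I2.col1 is null then NULL
            when I1.col1 <> I2.col1 then 'different'
            else 'same'
       end as comparison
from (
select col1, row_number() over (order by col1) as row_num
from @T
) as I1
left join (
select col1, row_number() over (order by col1) as row_num
from @T
) as I2 on I1.row_num = I2.row_num - 1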