SQL CTE compare rows in the same table - sql

I have a table with customers from different data sources. It has SSN, License#, and some other unique ID columns, but not every source populates the same IDs. I would like to compare the records on the ID columns (SSN, License, SystemID) and assign a mapped ID whenever the same person is found.
I assume I can use a CTE but am not sure where to start. I am still learning my way around SQL. Any help will be appreciated. Thanks.
This is how the table looks:
Source|RowID|SSN |License|SystemID
A |1 |SSN1|Lic111 |
A |2 | | |Sys666
B |3 |SSN2| |Sys777
C |4 |SSN1| |
D |5 | |Lic999 |
D |6 | |Lic333 |Sys666
E |7 | | |Sys777
Results (added MapCustomerID)
Source|RowID|SSN |License|SystemID|MapCustomerID
A |1 |SSN1|Lic111 | |1
A |2 | | |Sys666 |2
B |3 |SSN2| |Sys777 |3
C |4 |SSN1| | |1
D |5 | |Lic999 | |4
D |6 | |Lic333 |Sys666 |2
E |7 | | |Sys777 |3

Here is what may be a "good-enough" approach to the problem.
Along each of the three dimensions (SSN, License, SystemID), find the minimum row id for that dimension, with special handling of NULLs. The overall customer identifier is then the minimum of these three ids. To make it sequential with no gaps, use dense_rank().
with ids as (
    select t.*,
           (case when SSN is not null
                 then min(RowId) over (partition by SSN)
            end) as SSN_id,
           (case when License is not null
                 then min(RowId) over (partition by License)
            end) as License_id,
           (case when SystemId is not null
                 then min(RowId) over (partition by SystemId)
            end) as SystemId_id
    from t
),
leastid as (
    -- pick the smallest of the three ids, skipping NULLs
    select ids.*,
           (case when SSN_Id <= coalesce(License_Id, SSN_Id) and
                      SSN_Id <= coalesce(SystemId_id, SSN_Id)
                 then SSN_Id
                 when License_Id <= coalesce(SystemId_id, License_Id)
                 then License_Id
                 else SystemId_id
            end) as LeastId
    from ids
)
select Source, RowID, SSN, License, SystemID,
       dense_rank() over (order by LeastId) as MapCustomerId
from leastid;
This is not a complete solution, but it works for your data. It does not work in the following case:
A |1 |SSN1|Lic111 | |1
A |2 |SSN1| |Sys666 |2
A |3 | | |Sys666 |2
All three rows belong to the same customer, but linking row 3 to row 1 requires two "hops" (row 3 to row 2 via SystemID, then row 2 to row 1 via SSN), so the query assigns two different ids.
When I have faced this situation in the past, I have added an extra column to the table and repeatedly used update to set it to the minimum id over the different dimensions. Such iteration quickly connects the different pieces. It is probably possible to write a recursive CTE to do the same thing. But the simpler solution above may solve your problem.
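The iteration described above can be sketched outside the database. This is a minimal Python illustration, not the answerer's actual update statements: the function name `map_customers` and the tuple layout are assumptions made for the example. It keeps lowering each row's group id to the smallest id seen among rows sharing an SSN, License, or SystemID until nothing changes, then renumbers the groups densely.

```python
def map_customers(rows):
    """rows: list of (row_id, ssn, license, system_id); None means missing.

    Returns {row_id: mapped_customer_id} with ids 1, 2, 3, ... (no gaps).
    Hypothetical helper for illustration only.
    """
    group = {r[0]: r[0] for r in rows}  # start: every row is its own group
    changed = True
    while changed:  # repeat until a full pass changes nothing
        changed = False
        for col in (1, 2, 3):  # SSN, License, SystemID
            best = {}  # smallest group id currently seen per key value
            for r in rows:
                key = r[col]
                if key is not None:
                    best[key] = min(best.get(key, group[r[0]]), group[r[0]])
            for r in rows:
                key = r[col]
                if key is not None and best[key] < group[r[0]]:
                    group[r[0]] = best[key]
                    changed = True
    # dense rank over the surviving group ids
    ranks = {g: i + 1 for i, g in enumerate(sorted(set(group.values())))}
    return {row_id: ranks[g] for row_id, g in group.items()}


# The two-hop case from above: all three rows end up in one group.
two_hop = [
    (1, "SSN1", "Lic111", None),
    (2, "SSN1", None, "Sys666"),
    (3, None, None, "Sys666"),
]
print(map_customers(two_hop))
```

Each pass corresponds to one round of updates, so chains that need several hops simply take a few more passes.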
EDIT:
Because I've faced this problem before, I wanted to come up with a single query solution (rather than iterating through updates). This is possible using recursive CTEs. Here is code that seems to work:
with t as (
select 'A' as source, 1 as RowId, 'SSN1' as SSN, 'Lic111' as License, 'ABC' as SystemId union all
select 'A', 2, 'SSN1', NULL, 'Sys666' union all
select 'A', 3, NULL, NULL, 'Sys666' union all
select 'A', 4, NULL, 'Lic222', 'Sys666' union all
select 'A', 5, NULL, 'Lic222', NULL union all
select 'A', 6, NULL, 'Lic444', NULL
),
first as (
select t.*,
(select min(RowId)
from t t2
where t2.SSN = t.SSN or
t2.License = t.License or
t2.SystemId = t.SystemId
) as minrowid
from t
),
cte as (
select rowid, minrowid
from first
union all
select cte.rowid, first.minrowid
from cte join
first
on cte.minrowid = first.rowid and
cte.minrowid > first.minrowid
),
lookup as (
select rowid, min(minrowid) as minrowid,
dense_rank() over (order by min(minrowid)) as MapCustomerId
from cte
group by rowid
)
select t.*, lookup.MapCustomerId
from t join
lookup
on t.rowid = lookup.rowid;

Related

How can I select rows where keeping only those that meet this criteria? sql/hive

I have a table like the following:
+-------+------+
|ID |lang |
+-------+------+
|1 |eng |
|1 |pol |
|2 |eng |
|3 |gro |
|3 |eng |
+-------+------+
I'd like to keep only one row per ID: if an ID is repeated, I keep the non-'eng' row. For example, I would like:
+-------+------+
|ID |lang |
+-------+------+
|1 |pol |
|2 |eng |
|3 |gro |
+-------+------+
Is there a quick, neat way to achieve this? I am using Hive.
If you need a single row per id, use row_number() partitioned by id and ordered by a case expression that encodes your custom ordering logic.
For example, the row_number() below marks one non-'eng' row per id (chosen arbitrarily if there are several) with rn = 1 and numbers the remaining rows for that id 2, 3, 4, and so on, so you can filter for that single row. If an id has only an 'eng' row, that row gets rn = 1. If you prefer some languages over others, add more branches to the case expression, or add another column or expression to the order by.
select id, lang
from ( select id, lang,
row_number() over(partition by id
order by case when lang != 'eng' then 1
else 2
end
) rn
from mytable
) s
where rn=1
If you need to keep all rows for the same id which are not 'eng', use dense_rank() or rank() instead of row_number() with the same over() as above, it will assign 1 to all rows with lang!='eng' per id.
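As a rough check of the rank() variant just described, here is a small sketch that runs the query in SQLite through Python's sqlite3 module. SQLite stands in for Hive here, which is an assumption of the example, and the extra row (3, 'pol') is hypothetical, added only to show several non-'eng' rows surviving for one id. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (id integer, lang text)")
conn.executemany(
    "insert into mytable values (?, ?)",
    # sample data from the question plus one hypothetical row: (3, 'pol')
    [(1, "eng"), (1, "pol"), (2, "eng"), (3, "gro"), (3, "eng"), (3, "pol")],
)

query = """
select id, lang
from (select id, lang,
             rank() over (partition by id
                          order by case when lang != 'eng' then 1 else 2 end) as rnk
      from mytable) s
where rnk = 1
order by id, lang
"""
kept = conn.execute(query).fetchall()
print(kept)
```

Every non-'eng' row ties for rank 1 within its id, and an id that only has an 'eng' row keeps that row.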
Alternatively, with a CTE:
WITH cte_temp AS
(
    SELECT
        Id, Lang,
        -- non-'eng' rows sort first within each Id
        DENSE_RANK() OVER (PARTITION BY Id
                           ORDER BY CASE WHEN Lang <> 'eng' THEN 1 ELSE 2 END) AS rnk
    FROM
        YourTable
)
SELECT Id, Lang
FROM cte_temp
WHERE rnk = 1

Filling in missing data in Snowflake

I have a table in Snowflake like this:
TIME USER ITEM
1 frank 1
2 frank 0
3 frank 0
4 frank 0
5 frank 2
6 alf 5
7 alf 0
8 alf 6
9 alf 0
10 alf 9
I want to be able to replace all the zeroes with the next non-zero value, so in the end I have a table like this:
TIME USER ITEM
1 frank 1
2 frank 2
3 frank 2
4 frank 2
5 frank 2
6 alf 5
7 alf 6
8 alf 6
9 alf 9
10 alf 9
How would I write a query that does that in Snowflake?
You can use the conditional_change_event function for this (it is described in the Snowflake documentation):
with base_table as (
select
t1.*,
conditional_change_event(item) over (order by time desc) event_num
from test_table t1
order by time desc
)
select
t1.time,
t1.user,
t1.item old_item,
coalesce(t2.item, t1.item) new_item
from base_table t1
left join base_table t2 on t1.event_num = t2.event_num + 1 and t1.item = 0
order by t1.time asc
Above SQL Results:
+----+-----+--------+--------+
|TIME|USER |OLD_ITEM|NEW_ITEM|
+----+-----+--------+--------+
|1 |frank|1 |1 |
|2 |frank|0 |2 |
|3 |frank|0 |2 |
|4 |frank|0 |2 |
|5 |frank|2 |2 |
|6 |alf |5 |5 |
|7 |alf |0 |6 |
|8 |alf |6 |6 |
|9 |alf |0 |9 |
|10 |alf |9 |9 |
+----+-----+--------+--------+
You can use lead(ignore nulls):
select t.*,
(case when item = 0
then lead(nullif(item, 0) ignore nulls) over (partition by user order by time)
else item
end) as imputed_item
from t;
You can also phrase this using last_value():
select t.*,
last_value(nullif(item, 0) ignore nulls) over (partition by user order by time desc)
from t;
If you want to use first_value() or last_value() in Snowflake, keep in mind that Snowflake's default window frame for these functions differs from the ANSI standard (see the Snowflake documentation on window frames). The ANSI default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. In Snowflake you have to write a cumulative frame explicitly, because otherwise the default is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. That is why the LAST_VALUE example from the previous answer does not work correctly. Here is one example that does work:
select t.*,
last_value(nullif(item, 0) ignore nulls) over (partition by user order by time desc rows between unbounded preceding and current row)
from t;
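To make the intent of the cumulative frame concrete, here is a plain-Python sketch of the same idea. It is not Snowflake code, and the function name `fill_next_nonzero` is made up for this example. It walks the rows in descending time order, like `order by time desc`, and remembers the most recent non-zero item per user, which is what `last_value(... ignore nulls)` returns when the frame ends at the current row.

```python
def fill_next_nonzero(rows):
    """rows: list of (time, user, item). Zeros are replaced by the next
    non-zero item of the same user; None if no later non-zero item exists."""
    latest_nonzero = {}  # per user: last non-zero item seen while scanning backwards
    filled = []
    for time, user, item in sorted(rows, key=lambda r: r[0], reverse=True):
        if item != 0:
            latest_nonzero[user] = item
        filled.append((time, user, latest_nonzero.get(user)))
    filled.reverse()  # back to ascending time
    return filled


sample = [
    (1, "frank", 1), (2, "frank", 0), (3, "frank", 0), (4, "frank", 0),
    (5, "frank", 2), (6, "alf", 5), (7, "alf", 0), (8, "alf", 6),
    (9, "alf", 0), (10, "alf", 9),
]
for row in fill_next_nonzero(sample):
    print(row)
```

With a frame that spans the whole partition instead, every row would see the same value, which is the failure mode described above.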
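Nothing wrong with the above solutions, but here is a different approach that I think is simpler. Note that min(good.item) returns the smallest later non-zero item rather than the next one, so it matches the expected output only when later values never decrease, as in this sample.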
select * from good
union all
select
bad.time
,bad.user
,min(good.item)
from bad
left outer join
good on good.user=bad.user and good.time>bad.time
group by
1,2
Full COPY|PASTE|RUN SQL:
with cte as (
select * from (
select 1 time, 'frank' user , 1 item union
select 2 time, 'frank' user , 0 item union
select 3 time, 'frank' user , 0 item union
select 4 time, 'frank' user , 0 item union
select 5 time, 'frank' user , 2 item union
select 6 time, 'alf' user , 5 item union
select 7 time, 'alf' user , 0 item union
select 8 time, 'alf' user , 6 item union
select 9 time, 'alf' user , 0 item union
select 10 time, 'alf' user , 9) )
, good as (select * from cte where item<> 0)
, bad as (select * from cte where item= 0)
select * from good
union all
select
bad.time
,bad.user
,min(good.item )
from bad
left outer join
good on good.user=bad.user and good.time>bad.time
group by
1,2

SQL convert some column names to row values and one column name row values to column name

I have a table organized as follows:
Year |Account | Location| Measure1 |Measure2 |Measure3
2020 |123a |A |100 |20% |5
2020 |234b |B |75 |80% |8
2020 |122c |C |80 |78% |9
I want to create records as follows:
Year |Account | Measure |A |B |C
2020 |123a |Measure1 |100 | |
2020 |
2020 |234b |Measure2 | |80% |
2020 |122c |Measure3 | | |9
This forum isn't for solving your programming tasks. But for the fun of it:
select Year, Account, 'Measure1' as Measure, Measure1 as A, null as B, null as C
from my_table
where Location = 'A'
union
select Year, Account, 'Measure2' as Measure, Measure2 as A, null as B, null as C
from my_table
where Location = 'A'
...
union
select Year, Account, 'Measure1' as Measure, null as A, Measure1 as B, null as C
from my_table
where Location = 'B'
...
I would recommend APPLY:
select t.year, t.account, v.*
from t cross apply
(values ('Measure1', Measure1, null, null),
('Measure2', null, Measure2, null),
('Measure3', null, null, Measure3)
) v(measure, a, b, c);
This returns all but the second row. It is not clear what logic you want for that, but you can add:
union all
select distinct t.year, null, null, null, null, null
from t
to add it explicitly.

SQL select top five most recent row and distinct by a specific column

Say I have a table named appModelFlat like the one below, only with a few hundred more rows. It does not have a date field, but I want to find the five most recently created environments (EnvName). There are only 14 possible environments. I want to select the five most recently inserted rows that have different EnvName values. In other words, I want the most recent 5 rows with distinct EnvName (although DISTINCT does not work this way). I know which rows are most recent from their id: the higher the id, the newer the row. Any help on this query would be appreciated.
id|AppName|EnvName|ServerTypeName|ServerId|OS |OSVersion|CPU|Memory|ExtraStorage|MachineDesc |
----------------------------------------------------------------------------------------------------
1 |ASB |DEV |App |1 |Windows|7 |4 |4 |100 |ASB-DEV-App |
----------------------------------------------------------------------------------------------------
5 |AMS |DEV |APP |2 |RedHat |7.2 |4 |4 |50 |AMS-DEV-App |
----------------------------------------------------------------------------------------------------
6 |SPB |TST |App |1 |Windows|7 |2 |8 |50 |SPB-TST-App |
----------------------------------------------------------------------------------------------------
7 |SBI |TST |Oracle |1 |Solaris|11 |4 |8 |100 |SBI-TST-Oracle|
----------------------------------------------------------------------------------------------------
Here is my first attempt although I'm not sure if it is right. It does give me five results.
SELECT DISTINCT top 5 [ID] = ( SELECT TOP 1 [ID] FROM [AppModelFlat] Y WHERE Y.[EnvName] = X.[EnvName])
,[AppName]= ( SELECT TOP 1 [AppName] FROM [AppModelFlat] Y WHERE Y.[EnvName] = X.[EnvName])
,[EnvName]
,[ServerTypeName] = ( SELECT TOP 1 [ServerTypeName] FROM [AppModelFlat] Y WHERE Y.[EnvName] = X.[EnvName])
,[ServerId] = ( SELECT TOP 1 [ServerId] FROM [AppModelFlat] Y WHERE Y.[EnvName] = X.[EnvName])
,[OS] = ( SELECT TOP 1 [OS] FROM [AppModelFlat] Y WHERE Y.[EnvName] = X.[EnvName])
FROM [AppModelFlat] X order by id desc
edit:
For the expected result: let's say I only wanted to select the top 2, since I only gave a few entries here. I would want to get back the following.
5 |AMS |DEV |APP |2 |RedHat |7.2 |4 |4 |50 |AMS-DEV-App |
----------------------------------------------------------------------------------------------------
7 |SBI |TST |Oracle |1 |Solaris|11 |4 |8 |100 |SBI-TST-Oracle|
This is because I get only one row per EnvName, and each row has the highest id for its EnvName.
Use row_number() to get the latest row for each EnvName, then take the top 5 ordered by id descending:
select top 5 *
from (
select *
, rn = row_number() over (partition by EnvName order by id desc)
from appModelFlat
) s
where rn = 1
order by id desc
top with ties version:
select top 5 *
from (
select top 1 with ties *
from appModelFlat
order by row_number() over (partition by EnvName order by id desc)
) s
order by id desc
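The row_number() query above can be tried against the four sample rows with a small sketch like this one. It runs in SQLite through Python's sqlite3 module, so `limit 2` replaces SQL Server's `top`, and only three of the columns are loaded. Both are simplifications made for the example. It needs an SQLite build with window functions (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table appModelFlat (id integer, AppName text, EnvName text)")
conn.executemany(
    "insert into appModelFlat values (?, ?, ?)",
    [(1, "ASB", "DEV"), (5, "AMS", "DEV"), (6, "SPB", "TST"), (7, "SBI", "TST")],
)

query = """
select id, AppName, EnvName
from (select *,
             row_number() over (partition by EnvName order by id desc) as rn
      from appModelFlat) s
where rn = 1          -- newest row per EnvName
order by id desc
limit 2               -- SQLite spelling of "top 2"
"""
latest = conn.execute(query).fetchall()
print(latest)
```

The result matches the two rows the question lists as the expected top 2: ids 7 and 5.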
A simple subquery would also do the trick:
SELECT TOP 5 Records.Id, AppName, Records.EnvName, ServerTypeName, ServerId, OS
FROM AppModelFlat Records
INNER JOIN (SELECT EnvName,
                   MAX(Id) AS Id
            FROM AppModelFlat
            GROUP BY EnvName) Latest ON Records.Id = Latest.Id
ORDER BY Records.Id DESC

How to find rows with the sequence of values in a column using SQL?

Consider the example table name "Person".
Name |Date |Work_Hours
---------------------------
John| 22/1/13 |0
John| 23/1/13 |0
Joseph| 22/1/13 |1
Joseph| 23/1/13 |1
Johnny| 22/1/13 |0
Johnny| 23/1/13 |0
Jim| 22/1/13 |1
Jim| 23/1/13 |0
In the above table, I have to find rows with the sequence of '0' followed by '1' in the column Work_Hours. Please share the idea/Query to do it.
The output I need is
Name |Date |Work_Hours
---------------------------
John| 23/1/13 |0
Joseph| 22/1/13 |1
Johnny| 23/1/13 |0
Jim| 22/1/13 |1
To look into previous or following records, you would usually use the window (analytic) functions LAG and LEAD:
select first_name, work_date, work_hours
from
(
select first_name, work_date, work_hours
, lag(work_hours) over (order by first_name, work_date) as prev_work_hours
, lead(work_hours) over (order by first_name, work_date) as next_work_hours
from person
)
where (work_hours = 0 and next_work_hours = 1) or (work_hours = 1 and prev_work_hours = 0)
order by first_name, work_date;
Something like
select no_hours.Name, no_hours.Date, some_hours.Date
From Person no_hours
inner join Person some_hours
On no_hours.Name = some_hours.name and some_hours.Date > no_hours.date
Where no_hours.work_hours = 0 and some_hours.work_hours = 1
would be a start.
Needless to say, Name is not a good unique identifier.
Also, work hours going from 0 to 1 and back to 0 would still appear, and a 0, 1, 0, 1 pattern would produce several rows.
Use >= no_hours.Date instead if the hours can go from 0 to 1 on the same day.
Perhaps:
SELECT p1.Name,
p1.Date AS Date_1,
p2.Date AS Date_2,
p1.Work_Hours As Work_Hours_1,
p2.Work_Hours As Work_Hours_2
FROM Person p1
INNER JOIN Person p2
on p1.Name=p2.Name
AND p1.Work_Hours=0
AND p2.Work_Hours=1
ORDER BY p1.Name,p1.Date,p2.Date,Work_Hours_1,Work_Hours_2
Your problem (as phrased) is equivalent to asking: Is there a 1 that follows any given row with a 0 for a name?
You can do this with a correlated subquery:
select Name, Date, Work_Hours
from (select t.*,
             (select min(date)
              from Person t2
              where t2.name = t.name and t2.date > t.date and t2.Work_Hours = 1
             ) as DateOfLater1
      from Person t
     ) t
where (DateOfLater1 is not null and work_hours = 0) or
      -- this branch never matches as written: DateOfLater1 is always later than date
      (DateOfLater1 = date and work_hours = 1);