Create column with unique identifier when adjacent fields are identical

The table below is an example of an existing table, except for the activity_instance column, which I'd like to add. It should hold a numeric identifier for each unique combination of the three adjacent columns, numbered per individual (unique_id): when unique_id, activity and date match, the rows get instance 1 for that person, the next distinct combination gets 2, and so on. The same combination can appear more than once, possibly later in the dataset.
The idea is to distinguish which events belong together and which do not. This instance identifier should be unique, also among different cases and activities.
unique_id | activity   | date       | activity_instance
----------+------------+------------+------------------
1234      | activity_a | 2016-04-01 | 1
1234      | activity_a | 2016-04-01 | 1
1234      | activity_b | 2016-04-01 | 2
5678      | activity_a | 2019-09-01 | 1
5678      | activity_a | 2019-09-01 | 1
65431     | activity_c | 2019-09-01 | 1
1234      | activity_a | 2019-09-01 | 3

Use dense_rank(), partitioned by unique_id so the numbering restarts for each person:
select *
     , dense_rank() over (partition by unique_id order by date, activity) as activity_instance
from tablename
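To check the query against the sample rows from the question, here is a minimal sketch that runs it in an in-memory SQLite database from Python. The table name `events` is an assumption (the answer only says `tablename`), and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical table name `events`; rows copied from the question's sample table.
rows = [
    (1234, "activity_a", "2016-04-01"),
    (1234, "activity_a", "2016-04-01"),
    (1234, "activity_b", "2016-04-01"),
    (5678, "activity_a", "2019-09-01"),
    (5678, "activity_a", "2019-09-01"),
    (65431, "activity_c", "2019-09-01"),
    (1234, "activity_a", "2019-09-01"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (unique_id INTEGER, activity TEXT, date TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Ordering by rowid returns the ranks in insertion order, so they line up with `rows`.
instances = [
    r[0]
    for r in conn.execute(
        """
        SELECT DENSE_RANK() OVER (PARTITION BY unique_id ORDER BY date, activity)
        FROM events
        ORDER BY rowid
        """
    )
]
print(instances)  # [1, 1, 2, 1, 1, 1, 3]
```

The ranks match the activity_instance column in the question. Because the rank restarts per unique_id, the value is only unique within one person, not across the whole table.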

Related

hive - Duplicate counts check associated from one to another column

I have a table and am trying to count the distinct values in one column (TNUM) and, among those, the values that are associated with more than one distinct value in another column (ID). The data runs from millions to billions of rows for each value of the partition column TMKEY.
ID TNUM TMKEY
23455 ABCD 1001
23456 ABCD 1001
23455 ABCD 1001
112233 BCDE 1001
113322 BCDE 1001
9009 DDEE 1001
9009 DDEE 1001
1009 FFGG 1001
Desired output:
total_distinct_TNUM_count count_of_TNUM_with_more_than_one_distinct_ID TMKEY
4 2 1001
Here, TNUM DDEE only ever has ID 9009, which is merely duplicated, so it shouldn't be counted among the TNUM values that have more than one distinct ID. All I'm looking for is these grouped counts. Any suggestions, please? With more than 3 to 4 billion rows, my approach (below) is completely different and I'm stuck.
select a.tnum, a.group_id, a.time_week
from (
    select time_week, tnum, count(*) as num_of_rows,
           concat_ws('|', collect_set(id)) as group_id
    from source_table_test1
    where time_week = 1001
    group by tnum, time_week
) as a
where length(a.group_id) > 16 and num_of_rows > 1
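One possible way to get both counts, sketched here as an assumption rather than a tested Hive solution: count the distinct IDs per TNUM first, then aggregate those per-TNUM rows. The sketch runs on SQLite from Python with a hypothetical table name `source_table`; COUNT(DISTINCT ...) and GROUP BY exist in Hive too, but performance on billions of rows is not something this example can show.

```python
import sqlite3

# Hypothetical table name; rows copied from the question's sample.
rows = [
    (23455, "ABCD", 1001), (23456, "ABCD", 1001), (23455, "ABCD", 1001),
    (112233, "BCDE", 1001), (113322, "BCDE", 1001),
    (9009, "DDEE", 1001), (9009, "DDEE", 1001),
    (1009, "FFGG", 1001),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER, tnum TEXT, tmkey INTEGER)")
conn.executemany("INSERT INTO source_table VALUES (?, ?, ?)", rows)

query = """
SELECT COUNT(*) AS total_distinct_tnum_count,
       SUM(CASE WHEN distinct_ids > 1 THEN 1 ELSE 0 END) AS tnum_with_multiple_ids,
       tmkey
FROM (
    -- inner step: one row per (tmkey, tnum) with its number of distinct IDs
    SELECT tmkey, tnum, COUNT(DISTINCT id) AS distinct_ids
    FROM source_table
    WHERE tmkey = 1001
    GROUP BY tmkey, tnum
) per_tnum
GROUP BY tmkey
"""
summary = conn.execute(query).fetchall()
print(summary)  # [(4, 2, 1001)]
```

DDEE and FFGG each have a single distinct ID, so only ABCD and BCDE are counted in the second column.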

Subsetting on dates for a SQL query

Using Snowflake, I am trying to select customers that have no current subscription, excluding every ID that has a current/active contract.
Each ID typically has multiple records, representing that customer's contract/renewal history.
A customer is inactive only if none of its contracts ends after the current date. There are likely multiple past contracts that have lapsed, but the account is still active if any one contract end date goes beyond the current date.
Consider the following table:
Date_Start | Date_End   | Name   | ID
-----------+------------+--------+----
2015-07-03 | 2019-07-03 | Piggly | 001
2019-07-04 | 2025-07-04 | Piggly | 001
2013-10-01 | 2017-12-31 | Doggy  | 031
2018-01-01 | 2018-06-30 | Doggy  | 031
2020-01-01 | 2021-03-14 | Catty  | 022
2021-03-15 | 2024-06-01 | Catty  | 022
1999-06-01 | 2021-06-01 | Horsey | 052
2021-06-02 | 2022-01-01 | Horsey | 052
2022-01-02 | 2022-07-04 | Horsey | 052
The desired output is the non-active customers, i.e. those with no end date beyond Jan 5th 2023 (or the current/an arbitrary date):
Name   | ID
-------+----
Doggy  | 031
Horsey | 052
My first attempt was:
SELECT Name, ID
FROM table
WHERE Date_End < GETDATE()
but the obvious problem is that I'll also be selecting past contracts of customers who haven't expired/churned and who have a contract that goes beyond the current date.
How do I resolve this?

As there are many rows per name and ID, you should aggregate the data and then use a HAVING clause to select only those you are interested in.
SELECT name, id
FROM table
GROUP BY name, id
HAVING MAX(date_end) < GETDATE();
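As a quick check of the aggregation approach on the question's data, here is a sketch on SQLite from Python. The table name `contracts` and the fixed cutoff '2023-01-05' (standing in for GETDATE()) are assumptions made so the result is reproducible.

```python
import sqlite3

# Hypothetical table name `contracts`; rows copied from the question.
rows = [
    ("2015-07-03", "2019-07-03", "Piggly", "001"),
    ("2019-07-04", "2025-07-04", "Piggly", "001"),
    ("2013-10-01", "2017-12-31", "Doggy", "031"),
    ("2018-01-01", "2018-06-30", "Doggy", "031"),
    ("2020-01-01", "2021-03-14", "Catty", "022"),
    ("2021-03-15", "2024-06-01", "Catty", "022"),
    ("1999-06-01", "2021-06-01", "Horsey", "052"),
    ("2021-06-02", "2022-01-01", "Horsey", "052"),
    ("2022-01-02", "2022-07-04", "Horsey", "052"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts (date_start TEXT, date_end TEXT, name TEXT, id TEXT)"
)
conn.executemany("INSERT INTO contracts VALUES (?, ?, ?, ?)", rows)

cutoff = "2023-01-05"  # fixed stand-in for the current date
inactive = conn.execute(
    """
    SELECT name, id
    FROM contracts
    GROUP BY name, id
    HAVING MAX(date_end) < ?   -- latest contract already ended
    ORDER BY name
    """,
    (cutoff,),
).fetchall()
print(inactive)  # [('Doggy', '031'), ('Horsey', '052')]
```

ISO-formatted date strings compare correctly as text, which is why plain TEXT columns are enough for this sketch.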

You can work it out with an EXCEPT operator, if your DBMS supports it:
SELECT DISTINCT Name, ID FROM tab
EXCEPT
SELECT DISTINCT Name, ID FROM tab WHERE Date_end > <your_date>
This removes the active <Name, ID> pairs from the full set of pairs.
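A small sketch of the set-difference idea, run on SQLite from Python with two of the question's customers. The cutoff '2023-01-05' replaces the `<your_date>` placeholder and is an assumption for illustration.

```python
import sqlite3

# Two customers from the question: Piggly is still active, Doggy is not.
rows = [
    ("2015-07-03", "2019-07-03", "Piggly", "001"),
    ("2019-07-04", "2025-07-04", "Piggly", "001"),
    ("2013-10-01", "2017-12-31", "Doggy", "031"),
    ("2018-01-01", "2018-06-30", "Doggy", "031"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (Date_Start TEXT, Date_End TEXT, Name TEXT, ID TEXT)")
conn.executemany("INSERT INTO tab VALUES (?, ?, ?, ?)", rows)

cutoff = "2023-01-05"  # assumed value for <your_date>
inactive = conn.execute(
    """
    SELECT Name, ID FROM tab
    EXCEPT
    SELECT Name, ID FROM tab WHERE Date_End > ?
    """,
    (cutoff,),
).fetchall()
print(inactive)  # [('Doggy', '031')]
```

EXCEPT already returns distinct rows, so the DISTINCT keywords in the query above are harmless but not required.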

How to identify invalid records from a dimension table?

This is my sample data. It's a slowly changing dimension (type 2).
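iddim | idperson | name  | role       | IsActive | start      | end
------+----------+-------+------------+----------+------------+-----------
1     | 1234     | jim   | driver     | 1        | 2022-01-01 | 2022-02-03
2     | 1234     | jim   | driver     | 0        | 2022-02-03 | 9999-12-31
3     | 3456     | tom   | accountant | 1        | 2022-01-01 | 2022-08-30
4     | 4567     | patty | assistant  | 1        | 2022-01-01 | 9999-12-31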
Due to a server error, one of my SSIS packages performed some unexpected actions, and there are now idperson values without a record ending on 9999-12-31 (e.g. Tom).
I need to identify them so I can fix this manually, so that my resulting table will be:
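iddim | idperson | name  | role       | IsActive | start      | end
------+----------+-------+------------+----------+------------+-----------
1     | 1234     | jim   | driver     | 1        | 2022-01-01 | 2022-02-03
2     | 1234     | jim   | driver     | 0        | 2022-02-04 | 9999-12-31
3     | 3456     | tom   | accountant | 1        | 2022-01-01 | 2022-08-30
4     | 4567     | patty | assistant  | 1        | 2022-01-01 | 9999-12-31
5     | 3456     | tom   | accountant | 0        | 2022-08-31 | 9999-12-31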

So, as I understand your requirements, you need to generate records to fill the gaps between the latest end date (per person) and '9999-12-31'. The filler records should have IsActive = 0 and should inherit the latest prior name and role for that idperson.
Perhaps something like the following:
SELECT
    idperson,
    name,
    role,
    IsActive = 0,
    start = DATEADD(day, 1, [end]),
    [end] = '9999-12-31'
FROM (
    SELECT *, Recency = ROW_NUMBER() OVER(PARTITION BY idperson ORDER BY [End] DESC)
    FROM #Data
) D
WHERE Recency = 1 AND [end] < '9999-12-31'
ORDER BY iddim
The Recency value calculated above will be 1 for the latest record per idperson and 2, 3, etc. for records with older end dates. If the latest record isn't end-of-time, a filler record is generated.
See this db<>fiddle for a working example (which includes a few additional test data records).
Note: The two existing jim records in your original posted data have different idperson values, so they are treated as different persons and the first triggers a gap record.
UPDATE: The above was revised to allow for possible name change over time for a given idperson.
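For readers without SQL Server at hand, the same idea can be sketched on SQLite from Python. This is an adaptation, not the answer's exact query: the table name `dim` and the column names `start_date`/`end_date` are assumptions (renamed to avoid quoting the keyword `end`), SQLite's date(..., '+1 day') stands in for DATEADD, and ROW_NUMBER needs SQLite 3.25 or newer.

```python
import sqlite3

# Rows copied from the question's sample data as shown above.
rows = [
    (1, 1234, "jim", "driver", 1, "2022-01-01", "2022-02-03"),
    (2, 1234, "jim", "driver", 0, "2022-02-03", "9999-12-31"),
    (3, 3456, "tom", "accountant", 1, "2022-01-01", "2022-08-30"),
    (4, 4567, "patty", "assistant", 1, "2022-01-01", "9999-12-31"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dim (
        iddim INTEGER, idperson INTEGER, name TEXT, role TEXT,
        IsActive INTEGER, start_date TEXT, end_date TEXT
    )
    """
)
conn.executemany("INSERT INTO dim VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

fillers = conn.execute(
    """
    SELECT idperson, name, role,
           0 AS IsActive,
           date(end_date, '+1 day') AS start_date,  -- day after the last end date
           '9999-12-31' AS end_date
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY idperson ORDER BY end_date DESC) AS recency
        FROM dim
    ) d
    WHERE recency = 1 AND end_date < '9999-12-31'
    """
).fetchall()
print(fillers)  # [(3456, 'tom', 'accountant', 0, '2022-08-31', '9999-12-31')]
```

Only tom's latest record stops short of 9999-12-31, so only one filler row comes back; jim and patty already have an end-of-time record.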

SQL merge entries with start and end dates to entries with single dates

I have two tables; let's call them A and B. Table A has data regarding specific events and has a unique key column pairing of event_date and person. Table B has aggregate data over time and thus has the key columns start_date, end_date and person. The date ranges in table B will never overlap for a given person, so end_date is not strictly necessary for the composite key.
Below are two examples
SELECT event_date, person
FROM A
event_date | person
-----------+-------
2021-10-01 | Alice
2021-10-01 | Bob
2021-10-05 | Bob
2021-11-05 | Bob
SELECT start_date, end_date, person, attribute
FROM B
start_date | end_date   | person | attribute
-----------+------------+--------+------------
2021-10-01 | 2021-11-01 | Alice  | Attribute 1
2021-10-01 | 2021-11-01 | Bob    | Attribute 1
2021-11-01 | 2021-12-01 | Bob    | Attribute 2
I would like to add the attribute column to table A. The merge should determine which date range each event_date falls into and pick the corresponding attribute. The final table after the merge should look like this:
event_date | person | attribute
-----------+--------+------------
2021-10-01 | Alice  | Attribute 1
2021-10-01 | Bob    | Attribute 1
2021-10-05 | Bob    | Attribute 1
2021-11-05 | Bob    | Attribute 2
How would one go about solving this?

You can try to JOIN by BETWEEN dates.
SELECT a.*,b.attribute
FROM A a
JOIN B b
ON a.event_date BETWEEN b.start_date AND b.end_date
AND a.person = b.person
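Here is a sketch of the range join on the question's sample tables, run on SQLite from Python. It also counts how many of Bob's ranges contain the boundary date 2021-11-01, a hypothetical event date that is not in table A, to show how the inclusive BETWEEN behaves when adjacent ranges share a date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (event_date TEXT, person TEXT)")
conn.execute(
    "CREATE TABLE B (start_date TEXT, end_date TEXT, person TEXT, attribute TEXT)"
)
conn.executemany(
    "INSERT INTO A VALUES (?, ?)",
    [("2021-10-01", "Alice"), ("2021-10-01", "Bob"),
     ("2021-10-05", "Bob"), ("2021-11-05", "Bob")],
)
conn.executemany(
    "INSERT INTO B VALUES (?, ?, ?, ?)",
    [("2021-10-01", "2021-11-01", "Alice", "Attribute 1"),
     ("2021-10-01", "2021-11-01", "Bob", "Attribute 1"),
     ("2021-11-01", "2021-12-01", "Bob", "Attribute 2")],
)

merged = conn.execute(
    """
    SELECT a.event_date, a.person, b.attribute
    FROM A a
    JOIN B b
      ON a.event_date BETWEEN b.start_date AND b.end_date
     AND a.person = b.person
    ORDER BY a.event_date, a.person
    """
).fetchall()
for row in merged:
    print(row)

# Hypothetical boundary date: how many of Bob's ranges would it match?
boundary = "2021-11-01"
inclusive_matches = conn.execute(
    "SELECT COUNT(*) FROM B WHERE person = 'Bob' AND ? BETWEEN start_date AND end_date",
    (boundary,),
).fetchone()[0]
half_open_matches = conn.execute(
    "SELECT COUNT(*) FROM B WHERE person = 'Bob' AND ? >= start_date AND ? < end_date",
    (boundary, boundary),
).fetchone()[0]
print(inclusive_matches, half_open_matches)  # 2 1
```

The join reproduces the desired table for the sample data. If an event can fall exactly on a shared boundary date, the half-open condition (`>= start_date AND < end_date`) avoids the duplicate row that BETWEEN would produce.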

Identifying Records Where a String Appears More Than Once

I have a dataset that looks like this:
ID Medication Dose
1 Aspirin 4
1 Tylenol 7
1 Aspirin 2
1 Ibuprofen 1
2 Aspirin 6
2 Aspirin 2
2 Ibuprofen 6
2 Tylenol 4
3 Tylenol 3
3 Tylenol 7
3 Tylenol 2
I would like to write code that identifies patients who have been administered a medication more than once. So for example, ID 1 had Aspirin twice, ID 2 had Aspirin twice and ID 3 had Tylenol three times.
I could be wrong, but I think the easiest way to do this would be to concatenate the Medication values for each ID using code similar to the query below. I'm not quite sure what to do after that, though: is it possible to count whether a string appears twice within a cell?
SELECT DISTINCT ST2.[ID],
SUBSTRING(
(
SELECT ','+ST1.Medication AS [text()]
FROM ED_NOTES_MASTER ST1
WHERE ST1.[ID] = ST2.[ID]
Order BY [ID]
FOR XML PATH ('')
), 1, 200000) [Result]
FROM ED_NOTES_MASTER ST2
I would like the output to look like the following:
ID MEDICATION Aspirin2x Tylenol2x Ibuprofen2x
1 Aspirin, Tylenol , Aspirin YES NO NO
2 Ibuprofen, Aspirin, Aspirin YES NO NO
3 Tylenol, Tylenol ,Tylenol NO YES NO

For the first part of your question (identifying patients who have had a particular medication more than once), you can use GROUP BY to group by ID and medication, and then COUNT to get how many times each medication was given to each patient. For example:
SELECT ID, Medication, COUNT(*) AS amount
FROM ED_NOTES_MASTER
GROUP BY ID, Medication
This gives you a list of all ID - Medication combinations that appear in the table and a count of how many times each combination appears. To limit these results to just the combinations that appear at least twice, add a condition on the count using HAVING. SQL Server does not accept the column alias in HAVING, so the aggregate is repeated:
SELECT ID, Medication, COUNT(*) AS amount
FROM ED_NOTES_MASTER
GROUP BY ID, Medication
HAVING COUNT(*) >= 2
The problem now is formatting the results in the way you want. What you will get from the query above is a list of all patient - medication combinations that came up in the table more than once, like this:
ID | Medication | Count
------+---------------+-------
1 | Aspirin | 2
2 | Aspirin | 2
3 | Tylenol | 3
I'd suggest that you try to work with this format if possible because, as you have found, returning multiple values as a comma-delimited list like your Medication column requires some hacks (although SQL Server 2017 and later provide STRING_AGG for proper group concatenation). If you really need the Aspirin2x etc. columns, take a look at the PIVOT operation in SQL Server.
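The grouping can be tried on the question's data with a short sketch on SQLite from Python. The second query is not PIVOT (SQLite has none); it swaps in conditional aggregation as one possible way to get the Aspirin2x-style YES/NO columns, so treat it as an illustration rather than the SQL Server syntax.

```python
import sqlite3

# Rows copied from the question; table name taken from the question's query.
rows = [
    (1, "Aspirin", 4), (1, "Tylenol", 7), (1, "Aspirin", 2), (1, "Ibuprofen", 1),
    (2, "Aspirin", 6), (2, "Aspirin", 2), (2, "Ibuprofen", 6), (2, "Tylenol", 4),
    (3, "Tylenol", 3), (3, "Tylenol", 7), (3, "Tylenol", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ED_NOTES_MASTER (ID INTEGER, Medication TEXT, Dose INTEGER)")
conn.executemany("INSERT INTO ED_NOTES_MASTER VALUES (?, ?, ?)", rows)

# Long format: one row per patient/medication given at least twice.
repeats = conn.execute(
    """
    SELECT ID, Medication, COUNT(*) AS amount
    FROM ED_NOTES_MASTER
    GROUP BY ID, Medication
    HAVING COUNT(*) >= 2
    ORDER BY ID, Medication
    """
).fetchall()
print(repeats)  # [(1, 'Aspirin', 2), (2, 'Aspirin', 2), (3, 'Tylenol', 3)]

# Wide format: one YES/NO flag per medication, via conditional aggregation.
flags = conn.execute(
    """
    SELECT ID,
           CASE WHEN SUM(CASE WHEN Medication = 'Aspirin' THEN 1 ELSE 0 END) >= 2
                THEN 'YES' ELSE 'NO' END AS Aspirin2x,
           CASE WHEN SUM(CASE WHEN Medication = 'Tylenol' THEN 1 ELSE 0 END) >= 2
                THEN 'YES' ELSE 'NO' END AS Tylenol2x,
           CASE WHEN SUM(CASE WHEN Medication = 'Ibuprofen' THEN 1 ELSE 0 END) >= 2
                THEN 'YES' ELSE 'NO' END AS Ibuprofen2x
    FROM ED_NOTES_MASTER
    GROUP BY ID
    ORDER BY ID
    """
).fetchall()
print(flags)
```

The wide query needs one CASE expression per medication, so it only suits a short, known list of medications.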
