Find duplicate ID and add new sequence ID - sql

I have a table where ID must be unique. There are some IDs that are not unique. How do I generate a new column which adds a sequence to this ID? I want to generate ID_new_generated in the table below
ID Company Name ID_new_generated
1 A 1
1 B 1_2
2 C 2

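You can use a window function (e.g. RANK) to generate a secondary ID over each window defined by the rows that share the same ID, then concatenate it with the ID to create the new one. Note that this numbers the first row too (giving 1_1), and that rows tying on companyName get the same rank.
Something like: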
select
ID
, companyName
, rank() over(partition by ID ORDER BY companyName)
, concat(ID, '_', rank() over(partition by ID ORDER BY companyName)) as new_id
from test;
See this demo: https://www.db-fiddle.com/f/bd6aQKnZ7gcZCQjFpZicrp/0
The exact syntax depends on which SQL dialect you are using.

Assuming you are looking for a solution in SQL Server:
First, add a nullable column ID_Generated as below (in SQL Server, ADD does not take the COLUMN keyword):
ALTER TABLE tablename
ADD ID_Generated varchar(25) NULL
GO
Then use ROW_NUMBER in a CTE and update through it (you can use a temp table instead if you are using MySQL):
;with cte as (
-- number the rows within each ID and count how many rows share that ID
-- (CompanyName is assumed to be the column that orders the duplicates)
SELECT t.ID_Generated,
ROW_NUMBER() over (partition by t.ID order by t.CompanyName) as RowNumber,
COUNT(*) over (partition by t.ID) as RecCount
FROM tablename t
)
-- only IDs that occur more than once receive a sequence number
UPDATE cte
SET ID_Generated = RowNumber
WHERE RecCount > 1

I think you want:
select ID, companyName,
(case when row_number() over (partition by id order by companyname) = 1
then cast(id as varchar(255))
else id || '_' || row_number() over (partition by id order by companyname)
end) as new_id
from test;
|| is the ANSI/ISO standard concatenation operator in SQL. Not all databases support it, so you might need to replace the operator with the one appropriate for your database.
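To check the CASE plus row_number() idea against the sample rows from the question, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in database and reuses the test table and companyName column from the answers above; other databases may need a different cast or concatenation operator.

```python
import sqlite3

# In-memory database holding the three sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (ID INTEGER, companyName TEXT)")
conn.executemany("INSERT INTO test VALUES (?, ?)", [(1, "A"), (1, "B"), (2, "C")])

# Number the rows inside each ID, then only append the suffix from the
# second row onwards so the first occurrence keeps the plain ID.
query = """
SELECT ID, companyName,
       CASE WHEN rn = 1 THEN CAST(ID AS TEXT)
            ELSE ID || '_' || rn
       END AS new_id
FROM (SELECT ID, companyName,
             ROW_NUMBER() OVER (PARTITION BY ID ORDER BY companyName) AS rn
      FROM test)
ORDER BY ID, companyName
"""
new_ids = conn.execute(query).fetchall()
for row in new_ids:
    print(row)
```

Computing row_number() once in a subquery avoids repeating the window expression in both branches of the CASE.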

Related

sql query to get latest record for each id

I have one table. From it I need to get the latest "Date" for each "id". I wrote a query for one id, but I don't know how to apply it to multiple ids (I mean, for each id).
My query for one id is (say table name is tt):
select * from (
SELECT DISTINCT id ,date FROM tt
WHERE Trim(id) ='1000082'
ORDER BY date desc
) where rownum<=1;
If you have just two columns, aggregation is good enough:
select id, max(date) max_date
from mytable
group by id
If you have more columns and you want the entire row that has the latest date for each id, then one option uses a correlated subquery for filtering:
select t.*
from mytable t
where t.date = (select max(t1.date) from mytable t1 where t1.id = t.id)
Or you can use window functions, if your database supports them:
select *
from (select t.*, row_number() over(partition by id order by date desc) rn from mytable t) t
where rn = 1
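As a rough illustration of the window-function variant, the sketch below runs it on a few made-up rows. It assumes SQLite via Python's sqlite3 module, ISO-formatted date strings (so they sort correctly as text), and a hypothetical extra column called note to show that the whole row comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id TEXT, date TEXT, note TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    ("1000082", "2020-01-01", "old"),
    ("1000082", "2020-03-01", "new"),
    ("1000083", "2020-02-01", "only"),
])

# Rank the rows of each id from newest to oldest and keep the newest one.
latest = conn.execute("""
    SELECT id, date, note
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS rn
          FROM mytable t)
    WHERE rn = 1
    ORDER BY id
""").fetchall()
for row in latest:
    print(row)
```

Unlike the correlated subquery, this returns exactly one row per id even when two rows share the latest date.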

how to remove other rows with same Word in the SQL table column

How can I remove the other rows that share the same word in a SQL table column?
For example
StudentUserID SessionID
DSteve 101
DSteve 102
CJohn 101
For reporting purposes we only need the first row for each StudentUserID.
You can get the first row that has no overlap with other rows:
select t.*
from t
where not exists (select 1
from t t2
where (t2.name like '%' + t.name + '%' or
t.name like '%' + t2.name + '%'
) and
t2.SessionID < t.SessionID
);
This seems to be technically what you are asking for. It is not clear that this is actually useful.
EDIT:
For your revised question, I'll use a similar query:
select t.*
from t
where not exists (select 1
from t t2
where t2.StudentUserId = t.StudentUserId and
t2.SessionID < t.SessionID
);
Alternatively, you can do it this way:
WITH b AS (SELECT t.*,
row_number() over(partition by student_name order by student_name ) as _rnk
from t )
SELECT * FROM b WHERE _rnk=1
although the purpose of that reporting is questionable :)
This gives you one row per student name; the others are dropped. Normally, though, you would want a unique id for each student, because there can be multiple John Smiths, etc.
You can use row_number() :
select t.*
from (select t.*,
row_number() over (partition by StudentUserID order by SessionID) as seq
from table t
) t
where seq = 1;
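For what it's worth, the NOT EXISTS form and the row_number() form can be compared side by side on the sample data. This is only a sketch, assuming SQLite via Python's sqlite3 module and a table named t with the two columns from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (StudentUserID TEXT, SessionID INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("DSteve", 101), ("DSteve", 102), ("CJohn", 101)])

# Keep a row only if no earlier session exists for the same student.
first_by_not_exists = conn.execute("""
    SELECT t.StudentUserID, t.SessionID
    FROM t
    WHERE NOT EXISTS (SELECT 1
                      FROM t t2
                      WHERE t2.StudentUserID = t.StudentUserID
                        AND t2.SessionID < t.SessionID)
    ORDER BY t.StudentUserID
""").fetchall()

# Same result by numbering the sessions of each student and keeping number 1.
first_by_row_number = conn.execute("""
    SELECT StudentUserID, SessionID
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY StudentUserID
                                    ORDER BY SessionID) AS seq
          FROM t)
    WHERE seq = 1
    ORDER BY StudentUserID
""").fetchall()

print(first_by_not_exists)
print(first_by_row_number)
```

The two differ only when a student has two rows with the same lowest SessionID: NOT EXISTS keeps both, row_number() keeps one.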

Scalable Solution to get latest row for each ID in BigQuery

I have a quite large table with a field ID and another field collection_time. I want to select the latest record for each ID. Unfortunately the combination (ID, collection_time) is not unique in my data. I want just one of the records with the maximum collection time. I have tried two solutions but neither has worked for me:
First: using query
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) as rn
FROM mytable) where rn=1
This results in a Resources exceeded error, which I guess is caused by the ORDER BY in the query.
Second
Using join between table and latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
This solution does not work for me because (ID, collection_time) is not unique, so the JOIN result contains multiple rows for some IDs.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
SELECT
agg.table.*
FROM (
SELECT
id,
ARRAY_AGG(STRUCT(table)
ORDER BY
collection_time DESC)[SAFE_OFFSET(0)] agg
FROM
`dataset.table` table
GROUP BY
id)
This will do the job and it scales. Because it aggregates the whole row as a struct, you won't have to change the query when the schema changes.
Short and scalable version:
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
Quick and dirty option: combine both of your queries into one. First get all records with the latest collection_time (using your second query), then dedup them using your first query:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
FROM (
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
)
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh)
WITH myTable AS (
SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
SELECT ID,
MAX(collection_time) AS second_time
FROM myTable GROUP BY ID
),
tab2 AS (
SELECT * FROM myTable
),
joint AS (
SELECT tab2.*
FROM tab2 INNER JOIN tab1
ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM joint
)
WHERE rn=1
If you don't mind writing an expression for every column:
SELECT ID,
ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID
I see no one has mentioned window functions with QUALIFY:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp that is accessible in the QUALIFY clause to filter on.
As per your comment, assuming you have a table with unique IDs for which you need to find the latest collection_time, here is another way to do it using a correlated subquery. Give it a try.
SELECT id,
(SELECT Max(collection_time)
FROM mytable B
WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
Here is another solution, which could be more scalable since it avoids multiple scans of the same table (which happen with both the self-join and the correlated subquery in the answers above). This solution only works with standard SQL (uncheck the "Use Legacy SQL" option):
SELECT
ID,
(SELECT AS STRUCT srow.*
FROM UNNEST(t.srows) srow
ORDER BY srow.collection_time DESC
LIMIT 1)
FROM
(SELECT ID, ARRAY_AGG(STRUCT(col1, col2, col3, ...)) srows
FROM id_table
GROUP BY ID) t
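To make the tie problem from the question concrete, here is a small sketch comparing the MAX join with the row-number dedup. It assumes SQLite via Python's sqlite3 module as a local stand-in, so it says nothing about BigQuery's resource limits; the payload column and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (ID INTEGER, collection_time INTEGER, payload TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    (1, 3, "old"), (1, 5, "tie-a"), (1, 5, "tie-b"), (2, 7, "only"),
])

# Joining on the per-ID maximum keeps every row that ties on that maximum.
join_rows = conn.execute("""
    SELECT tab1.ID, tab1.collection_time
    FROM mytable AS tab1
    INNER JOIN (SELECT ID, MAX(collection_time) AS max_time
                FROM mytable GROUP BY ID) AS tab2
      ON tab1.ID = tab2.ID AND tab1.collection_time = tab2.max_time
    ORDER BY tab1.ID
""").fetchall()

# Numbering the rows per ID and keeping number 1 picks a single row per ID.
dedup_rows = conn.execute("""
    SELECT ID, collection_time
    FROM (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY ID
                                    ORDER BY collection_time DESC) AS rn
          FROM mytable)
    WHERE rn = 1
    ORDER BY ID
""").fetchall()

print(join_rows)
print(dedup_rows)
```

Which of the tied rows survives the dedup is arbitrary unless another column is added to the ORDER BY as a tiebreaker.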

How to get the first not null value from a column of values in Big Query?

I am trying to extract the first non-null value from a column of values, ordered by timestamp. Can somebody share their thoughts on this? Thank you.
What have I tried so far?
FIRST_VALUE( column ) OVER ( PARTITION BY id ORDER BY timestamp)
Input :-
id,column,timestamp
1,NULL,10:30 am
1,NULL,10:31 am
1,'xyz',10:32 am
1,'def',10:33 am
2,NULL,11:30 am
2,'abc',11:31 am
Output(expected) :-
1,'xyz',10:30 am
1,'xyz',10:31 am
1,'xyz',10:32 am
1,'xyz',10:33 am
2,'abc',11:30 am
2,'abc',11:31 am
You can modify your SQL like this to get the data you want:
FIRST_VALUE( column )
OVER (
PARTITION BY id
ORDER BY
CASE WHEN column IS NULL then 0 ELSE 1 END DESC,
timestamp
)
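Here is a quick way to try that ordering trick on the sample input. It is a sketch only, assuming SQLite via Python's sqlite3 module rather than BigQuery, with the columns renamed to val and ts (hypothetical names, chosen to avoid reserved words) and the times stored as HH:MM strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER, val TEXT, ts TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    (1, None, "10:30"), (1, None, "10:31"), (1, "xyz", "10:32"),
    (1, "def", "10:33"), (2, None, "11:30"), (2, "abc", "11:31"),
])

# Sort non-null values ahead of nulls inside each id, then by time, so the
# first value of the window is the earliest non-null one.
filled = conn.execute("""
    SELECT id,
           FIRST_VALUE(val) OVER (
               PARTITION BY id
               ORDER BY CASE WHEN val IS NULL THEN 0 ELSE 1 END DESC, ts
           ) AS first_val,
           ts
    FROM data
    ORDER BY id, ts
""").fetchall()
for row in filled:
    print(row)
```

The CASE expression only changes the order inside the window; the final ORDER BY still returns the rows in their original time order.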
Try this old trick of string manipulation:
Select
ID,
Column,
ttimestamp,
LTRIM(Right(CColumn,20)) as CColumn,
FROM
(SELECT
ID,
Column,
ttimestamp,
MIN(Concat(RPAD(IF(Column is null, '9999999999999999',STRING(ttimestamp)),20,'0'),LPAD(Column,20,' '))) OVER (Partition by ID) CColumn
FROM (
SELECT
*
FROM (Select 1 as ID, STRING(NULL) as Column, 0.4375 as ttimestamp),
(Select 1 as ID, STRING(NULL) as Column, 0.438194444444444 as ttimestamp),
(Select 1 as ID, 'xyz' as Column, 0.438888888888889 as ttimestamp),
(Select 1 as ID, 'def' as Column, 0.439583333333333 as ttimestamp),
(Select 2 as ID, STRING(NULL) as Column, 0.479166666666667 as ttimestamp),
(Select 2 as ID, 'abc' as Column, 0.479861111111111 as ttimestamp)
))
As far as I know, Big Query has no options like 'IGNORE NULLS' or 'NULLS LAST'. Given that, this is the simplest solution I could come up with. I would like to see even simpler solutions.
Assuming the input data is in table "original_data",
select w2.id, w1.column, w2.timestamp
from
(select id,column,timestamp
from
(select id,column,timestamp, row_number()
over (partition BY id ORDER BY timestamp) position
FROM original_data
where column is not null
)
where position=1
) w1
right outer join
original_data as w2
on w1.id = w2.id
SELECT id,
(SELECT top(1) column FROM test1 where id=1 and column is not null order by autoID desc) as name
,timestamp
FROM yourTable
Output :-
1,'xyz',10:30 am
1,'xyz',10:31 am
1,'xyz',10:32 am
1,'xyz',10:33 am
2,'abc',11:30 am
2,'abc',11:31 am

SQL query: how to distinct count of a column group by another column

In my table I need to know if each ID has one and only one ID_name. How can I write such query?
I tried:
select ID, count(distinct ID_name) as count_name
from table
group by ID
having count_name > 1
But it takes forever to run.
Any thoughts?
select ID
from YourTable
group by
ID
having count(distinct ID_name) > 1
or
select *
from YourTable yt1
where exists
(
select *
from YourTable yt2
where yt1.ID = yt2.ID
and yt1.ID_Name <> yt2.ID_Name
)
Now, most ID columns are defined as primary key and are unique. So in a regular database you'd expect both queries to return an empty set.
select tt.ID,max(tt.myRank)
from
(
select
ip.ID,ip.ID_name,
DENSE_RANK() over (partition by ip.ID order by ip.ID_name) as myRank
from YourTable ip
) tt
group by tt.ID
This gives you every ID with its total number of distinct ID_name values (DENSE_RANK is used so that repeated rows with the same name are not counted twice).
If you want only those IDs which have more than one name associated, just add a where clause,
e.g.
select tt.ID,max(tt.myRank)
from
(
select
ip.ID,ip.ID_name,
DENSE_RANK() over (partition by ip.ID order by ip.ID_name) as myRank
from YourTable ip
) tt
where tt.myRank > 1
group by tt.ID
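As a sanity check of the HAVING COUNT(DISTINCT ...) approach, here is a minimal sketch on made-up rows. It assumes SQLite via Python's sqlite3 module; the sample values are invented, and it does not address the performance problem on a large table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (ID INTEGER, ID_name TEXT)")
conn.executemany("INSERT INTO YourTable VALUES (?, ?)", [
    (1, "alpha"), (1, "alpha"),   # repeated row, still a single name
    (2, "beta"), (2, "gamma"),    # two different names for the same ID
    (3, "delta"),
])

# IDs that are associated with more than one distinct name.
offenders = conn.execute("""
    SELECT ID, COUNT(DISTINCT ID_name) AS name_count
    FROM YourTable
    GROUP BY ID
    HAVING COUNT(DISTINCT ID_name) > 1
    ORDER BY ID
""").fetchall()
print(offenders)
```

ID 1 is not reported because its two rows carry the same name, which is the distinction between counting rows and counting distinct names.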