How to keep the first row of a certain group based on some condition on Teradata SQL? - sql

I have table in Teradata that looks like this
ID | Date | Values
------------------------
abc | 1Jan2015 | 1
abc | 1Dec2015 | 0
def | 2Feb2015 | 0
def | 2Jul2015 | 0
I want to write a piece of SQL that keeps only the earliest date of each ID. So the result I wanted is
ID | Date | Values
------------------------
abc | 1Jan2015 | 1
def | 2Feb2015 | 0
I know there is top n syntax but it only seems to work on the whole table not within groups.
Basically how do I do a top n within groups?

TOP can be easily rewritten using ROW_NUMBER:
select *
from tab
qualify
row_number() over (partition by id order by date) = 1

You can do this using row_number():
select t.*
from (select t.*,
row_number() over (partition by id order by date) as seqnum
from table t
) t
where seqnum = 1;

Related

Oracle Query that fetches one row per each _id

I have a table like :
ID | Val | Kind
----------------------
1 | a | 2
2 | b | 1
3 | c | 4
3 | c | 33
and I need to fetch one row per each id in Oracle SQL.
any ideas?
You can use row_number() to enumerate the rows. For an arbitrary row:
select t.*
from (select t.*,
row_number() over (partition by id order by id) as seqnum
from t
) t
where seqnum = 1;
As I point out in a comment, though, this is unnecessary based on the data in your question. The ids are already unique.

Reissue ids to records with duplicates

We have a CRM DB which for the last 6 weeks has been creating duplicate CaseID's
I need to go in and give new case id's int he 20000000 range to all of the duplicates.
So I have found all the duplicates like this
SELECT CaseNumber,
COUNT(CaseNumber) AS NumOccurrences
FROM Goldmine.dbo.cases
WHERE CaseNumber > 9000000
GROUP BY CaseNumber
HAVING ( COUNT(CaseNumber) > 1 )
Which brought back this.
I now need to renumber each one of these like so 20000001, 20000002, etc etc
Any help would be great.
By the look of the data you have got overlaps in numbers because there are records which overlap with the "updated" values if we are to increment by 1.
Here is a way to fix this,
with data
as (select *
,count(*) over(partition by x) as cnt
,row_number() over(order by x) as rnk
from t
)
update data
set x = x+rnk;
Initial record set
+-----------+
| orig_data |
+-----------+
| 10000009 |
| 10000009 |
| 10000009 |
| 10000009 |
| 10000010 |
| 10000010 |
| 10000011 |
+-----------+
After update
+-----------+
| after_upd |
+-----------+
| 10000010 |
| 10000011 |
| 10000012 |
| 10000014 |
| 10000015 |
| 10000017 |
+-----------+
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=c4ea8335abb074b8c0143e2f7c767f04
I am going to assume that you are using SQL Server. So you can use updatable CTEs:
WITH dups as (
SELECT c.*,
ROW_NUMBER() OVER (ORDER BY CaseNumber) as seqnum
FROM Goldmine.dbo.cases c
WHERE CaseNumber > 9000000
),
toupdate as (
SELECT d.*, ROW_NUMBER() OVER (PARTITION BY CaseNumber ORDER BY CaseNumber) as inc
FROM dups d
WHERE seqnum > 1
)
UPDATE toupdate
SET CaseNumber = 20000000 + inc;
The first subquery identifies the duplicates by enumerating them. Presumably, you don't want the "first" one to change. So the second CTE selects only the real duplicates and assigns a sequential number. The outer update uses that to assign the new number.

How can I create header records by taking values from one of several line items?

I have a set of sorted line items. They are sorted first by ID then by Date:
| ID | DESCRIPTION | Date |
| --- | ----------- |----------|
| 100 | Red |2019-01-01|
| 101 | White |2019-01-01|
| 101 | White_v2 |2019-02-01|
| 102 | Red_Trim |2019-01-15|
| 102 | White |2019-01-16|
| 102 | Blue |2019-01-20|
| 103 | Red_v3 |2019-01-14|
| 103 | Red_v3 |2019-03-14|
I need to insert rows in a SQL Server table, which represents a project header, so that the first row for each ID provides the Description and Date in the destination table. There should only be one row in the destination table for each ID.
For example, the source table above would result in this at the destination:
| ID | DESCRIPTION | Date |
| --- | ----------- |----------|
| 100 | Red |2019-01-01|
| 101 | White |2019-01-01|
| 102 | Red_Trim |2019-01-15|
| 103 | Red_v3 |2019-01-14|
How do I collapse the source so that I take only the first row for each ID from source?
I prefer to do this with a transformation in SSIS but can use SQL if necessary. Actually, solutions for both methods would be most helpful.
This question is distinct from Trouble using ROW_NUMBER() OVER (PARTITION BY …)
in that this seeks to identify an approach. The asker of that question has adopted one approach, of more than one available as identified by answers here. That question is about how to make that particular approach work.
You can use row_number() :
select t.*
from (select t.*, row_number() over (partition by id order by date) as seq
from table t
) t
where seq = 1;
A correlated subquery will help here:
SELECT *
FROM yourtable t1
WHERE [Date] = (SELECT min([Date]) FROM yourtable WHERE id = t1.id)
use first_value window function
select * from (select *,
first_value(DESCRIPTION) over(partition by id order by Date) as des,
row_number() over(partition by id order by Date) rn
from table
) a where a.rn =1
You can use the ROW_NUMBER() window function to do this. For example:
select *
from (
select
id, description, date,
row_number() over(partition by id order by date) as rn
from t
)
where rn = 1

How to SELECT in SQL based on a value from the same table column?

I have the following table
| id | date | team |
|----|------------|------|
| 1 | 2019-01-05 | A |
| 2 | 2019-01-05 | A |
| 3 | 2019-01-01 | A |
| 4 | 2019-01-04 | B |
| 5 | 2019-01-01 | B |
How can I query the table to receive the most recent values for the teams?
For example, the result for the above table would be ids 1,2,4.
In this case, you can use window functions:
select t.*
from (select t.*, rank() over (partition by team order by date desc) as seqnum
from t
) t
where seqnum = 1;
In some databases a correlated subquery is faster with the right indexes (I haven't tested this with Postgres):
select t.*
from t
where t.date = (select max(t2.date) from t t2 where t2.team = t.team);
And if you wanted only one row per team, then the canonical answer is:
select distinct on (t.team) t.*
from t
order by t.team, t.date desc;
However, that doesn't work in this case because you want all rows from the most recent date.
If your dataset is large, consider the max analytic function in a subquery:
with cte as (
select
id, date, team,
max (date) over (partition by team) as max_date
from t
)
select id
from cte
where date = max_date
Notionally, max is O(n), so it should be pretty efficient. I don't pretend to know the actual implementation on PostgreSQL, but my guess is it's O(n).
One more possibility, generic:
select * from t join (select max(date) date,team from t
group by team) tt
using(date,team)
Window function is the best solution for you.
select id
from (
select team, id, rank() over (partition by team order by date desc) as row_num
from table
) t
where row_num = 1
That query will return this table:
| id |
|----|
| 1 |
| 2 |
| 4 |
If you to get it one row per team, you need to use array_agg function.
select team, array_agg(id) ids
from (
select team, id, rank() over (partition by team order by date desc) as row_num
from table
) t
where row_num = 1
group by team
That query will return this table:
| team | ids |
|------|--------|
| A | [1, 2] |
| B | [4] |

Grouping SQL Results based on order

I have table with data something like this:
ID | RowNumber | Data
------------------------------
1 | 1 | Data
2 | 2 | Data
3 | 3 | Data
4 | 1 | Data
5 | 2 | Data
6 | 1 | Data
7 | 2 | Data
8 | 3 | Data
9 | 4 | Data
I want to group each set of RowNumbers So that my result is something like this:
ID | RowNumber | Group | Data
--------------------------------------
1 | 1 | a | Data
2 | 2 | a | Data
3 | 3 | a | Data
4 | 1 | b | Data
5 | 2 | b | Data
6 | 1 | c | Data
7 | 2 | c | Data
8 | 3 | c | Data
9 | 4 | c | Data
The only way I know where each group starts and stops is when the RowNumber starts over. How can I accomplish this? It also needs to be fairly efficient since the table I need to do this on has 52 Million Rows.
Additional Info
ID is truly sequential, but RowNumber may not be. I think RowNumber will always begin with 1 but for example the RowNumbers for group1 could be "1,1,2,2,3,4" and for group2 they could be "1,2,4,6", etc.
For the clarified requirements in the comments
The rownumbers for group1 could be "1,1,2,2,3,4" and for group2 they
could be "1,2,4,6" ... a higher number followed by a lower would be a
new group.
A SQL Server 2012 solution could be as follows.
Use LAG to access the previous row and set a flag to 1 if that row is the start of a new group or 0 otherwise.
Calculate a running sum of these flags to use as the grouping value.
Code
WITH T1 AS
(
SELECT *,
LAG(RowNumber) OVER (ORDER BY ID) AS PrevRowNumber
FROM YourTable
), T2 AS
(
SELECT *,
IIF(PrevRowNumber IS NULL OR PrevRowNumber > RowNumber, 1, 0) AS NewGroup
FROM T1
)
SELECT ID,
RowNumber,
Data,
SUM(NewGroup) OVER (ORDER BY ID
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM T2
SQL Fiddle
Assuming ID is the clustered index the plan for this has one scan against YourTable and avoids any sort operations.
If the ids are truly sequential, you can do:
select t.*,
(id - rowNumber) as grp
from t
Also you can use recursive CTE
;WITH cte AS
(
SELECT ID, RowNumber, Data, 1 AS [Group]
FROM dbo.test1
WHERE ID = 1
UNION ALL
SELECT t.ID, t.RowNumber, t.Data,
CASE WHEN t.RowNumber != 1 THEN c.[Group] ELSE c.[Group] + 1 END
FROM dbo.test1 t JOIN cte c ON t.ID = c.ID + 1
)
SELECT *
FROM cte
Demo on SQLFiddle
How about:
select ID, RowNumber, Data, dense_rank() over (order by grp) as Grp
from (
select *, (select min(ID) from [Your Table] where ID > t.ID and RowNumber = 1) as grp
from [Your Table] t
) t
order by ID
This should work on SQL 2005. You could also use rank() instead if you don't care about consecutive numbers.