How can I create header records by taking values from one of several line items?

I have a set of sorted line items. They are sorted first by ID then by Date:
| ID | DESCRIPTION | Date |
| --- | ----------- |----------|
| 100 | Red |2019-01-01|
| 101 | White |2019-01-01|
| 101 | White_v2 |2019-02-01|
| 102 | Red_Trim |2019-01-15|
| 102 | White |2019-01-16|
| 102 | Blue |2019-01-20|
| 103 | Red_v3 |2019-01-14|
| 103 | Red_v3 |2019-03-14|
I need to insert rows in a SQL Server table, which represents a project header, so that the first row for each ID provides the Description and Date in the destination table. There should only be one row in the destination table for each ID.
For example, the source table above would result in this at the destination:
| ID | DESCRIPTION | Date |
| --- | ----------- |----------|
| 100 | Red |2019-01-01|
| 101 | White |2019-01-01|
| 102 | Red_Trim |2019-01-15|
| 103 | Red_v3 |2019-01-14|
How do I collapse the source so that I take only the first row for each ID from source?
I prefer to do this with a transformation in SSIS but can use SQL if necessary. Actually, solutions for both methods would be most helpful.
This question is distinct from "Trouble using ROW_NUMBER() OVER (PARTITION BY …)" in that this one seeks to identify an approach. The asker of that question had already adopted one of the several approaches identified in the answers here; that question is about how to make that particular approach work.

You can use row_number():
select t.*
from (select t.*, row_number() over (partition by id order by date) as seq
from table t
) t
where seq = 1;

A correlated subquery will help here:
SELECT *
FROM yourtable t1
WHERE [Date] = (SELECT min([Date]) FROM yourtable WHERE id = t1.id)

Use the first_value window function:
select * from (select *,
first_value(DESCRIPTION) over(partition by id order by Date) as des,
row_number() over(partition by id order by Date) rn
from table
) a where a.rn =1

You can use the ROW_NUMBER() window function to do this. For example:
select *
from (
  select
    id, description, date,
    row_number() over(partition by id order by date) as rn
  from t
) x -- SQL Server requires an alias on a derived table
where rn = 1
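The ROW_NUMBER() approach above can be tried end to end on the question's sample data. The following is a minimal sketch only: it uses Python's built-in sqlite3 module (assuming a bundled SQLite of 3.25 or later, which supports window functions) as a stand-in for SQL Server, and the names line_items, project_header and item_date are hypothetical, not taken from the question.

```python
import sqlite3

# Sample rows copied from the question; table and column names are illustrative.
LINE_ITEMS = [
    (100, "Red", "2019-01-01"),
    (101, "White", "2019-01-01"),
    (101, "White_v2", "2019-02-01"),
    (102, "Red_Trim", "2019-01-15"),
    (102, "White", "2019-01-16"),
    (102, "Blue", "2019-01-20"),
    (103, "Red_v3", "2019-01-14"),
    (103, "Red_v3", "2019-03-14"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_items (id INTEGER, description TEXT, item_date TEXT)")
conn.execute(
    "CREATE TABLE project_header (id INTEGER PRIMARY KEY, description TEXT, item_date TEXT)"
)
conn.executemany("INSERT INTO line_items VALUES (?, ?, ?)", LINE_ITEMS)

# Number the rows within each id by date, then insert only row 1 of each group.
conn.execute(
    """
    INSERT INTO project_header (id, description, item_date)
    SELECT id, description, item_date
    FROM (
        SELECT id, description, item_date,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY item_date) AS rn
        FROM line_items
    ) AS numbered
    WHERE rn = 1
    """
)

headers = conn.execute(
    "SELECT id, description, item_date FROM project_header ORDER BY id"
).fetchall()
for row in headers:
    print(row)
```

The same INSERT ... SELECT shape should carry over to SQL Server with the real table names substituted. If two rows for one ID share the earliest date, ROW_NUMBER() picks one of them arbitrarily unless a tie-breaking column is added to the ORDER BY.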

Related

Pulling multiple entries based on ROW_NUMBER

I got the row_num column from a partition. I want each Type to match with at least one Sent and one Resent. For example, Jon's row is removed below because there is no Resent. Kim's Sheet row is also removed because again, there is no Resent. I tried using a CTE to take all columns for a Code if row_num = 2 but Kim's Sheet row obviously shows up because they're all under one Code. If anyone could help, that'd be great!
Edit: I'm using SSMS 2018. There are multiple Statuses other than Sent and Resent.
What my table looks like:
+-------+--------+--------+---------+---------+
| Code | Name | Type | Status | row_num |
+-------+--------+--------+---------+---------+
| 123 | Jon | Sheet | Sent | 1 |
| 221 | Kim | Sheet | Sent | 1 |
| 221 | Kim | Book | Resent | 1 |
| 221 | Kim | Book | Sent | 2 |
| 221 | Kim | Book | Sent | 3 |
+-------+--------+--------+---------+---------+
What I want it to look like:
+-------+--------+--------+---------+---------+
| Code | Name | Type | Status | row_num |
+-------+--------+--------+---------+---------+
| 221 | Kim | Book | Resent | 1 |
| 221 | Kim | Book | Sent | 2 |
| 221 | Kim | Book | Sent | 3 |
+-------+--------+--------+---------+---------+
Here is my CTE code:
WITH CTE AS
(
SELECT *
FROM #MyTable
)
SELECT *
FROM #MyTable
WHERE Code IN (SELECT Code FROM CTE WHERE row_num = 2)
If sent and resent are the only values for status, then you can use:
select t.*
from t
where exists (select 1
from t t2
where t2.name = t.name and
t2.type = t.type and
t2.status <> t.status
);
You can also phrase this with window functions:
select t.*
from (select t.*,
min(status) over (partition by name, type) as min_status,
max(status) over (partition by name, type) as max_status
from t
) t
where min_status <> max_status;
Both of these can be tweaked if other status values are possible. However, based on your question and sample data, that does not seem necessary.
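If other status values can occur (the question's edit says they do), one possible tweak is to count Sent and Resent rows separately per group rather than comparing the minimum and maximum status. This is a sketch under assumptions, not a tested SQL Server solution: it runs on Python's sqlite3 (SQLite 3.25 or later), the table name my_table is hypothetical, and the 'Draft' row is an invented extra row added only to exercise a third status.

```python
import sqlite3

ROWS = [
    (123, "Jon", "Sheet", "Sent"),
    (221, "Kim", "Sheet", "Sent"),
    (221, "Kim", "Book", "Resent"),
    (221, "Kim", "Book", "Sent"),
    (221, "Kim", "Book", "Sent"),
    # Hypothetical row, not in the question: a third status for Jon's Sheet.
    (123, "Jon", "Sheet", "Draft"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (code INTEGER, name TEXT, type TEXT, status TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", ROWS)

# Keep a (code, name, type) group only if it has at least one Sent row
# and at least one Resent row; other statuses do not affect the decision.
kept = conn.execute(
    """
    SELECT code, name, type, status
    FROM (
        SELECT t.*,
               SUM(CASE WHEN status = 'Sent' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY code, name, type) AS sent_count,
               SUM(CASE WHEN status = 'Resent' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY code, name, type) AS resent_count
        FROM my_table AS t
    ) AS counted
    WHERE sent_count > 0 AND resent_count > 0
    ORDER BY status
    """
).fetchall()
for row in kept:
    print(row)
```

With min/max alone, Jon's Sheet group (Sent plus Draft) would pass the min_status <> max_status test even though it has no Resent row; the explicit counts avoid that.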
FIDDLE
CREATE TABLE Table1(ID integer,Name VARCHAR(10),Type VARCHAR(10),Status VARCHAR(10),row_num integer);
INSERT INTO Table1 VALUES
('123','Jon','Sheet','Sent','1'),
('221','Kim','Sheet','Sent','1'),
('221','Kim','Book','Resent','1'),
('221','Kim','Book','Sent','2'),
('221','Kim','Book','Sent','3');
SELECT t1.*
FROM Table1 t1
WHERE EXISTS (
select 1
from Table1 t2
where t2.Name=t1.Name
and t2.Type=t1.Type
and t2.Status = case when t1.Status='Sent'
then 'Resent'
else 'Sent' end)
It would be easier if you provided a script to create the table and load the test data, but try something like:
with a1 as (
select
name, type,
row_number() over (partition by code, Name, type, status) as rn
from #MyTable
), a2 as (
select * from a1 where rn > 1
)
select t.*
from #MyTable as t
inner join a2 on t.name = a2.name and t.type = a2.type;
Here you calculate another row number, partitioned by code, name, type and status, then fetch the rows where this new row number is > 1, and finally join that back to the original table to get the rows you are interested in.
Syntax may vary on MSSQL, but you should give it a try. And please use better names than me ;-)
This solution is quite generic because it doesn't rely on the specific statuses used; they're not hardcoded. You can easily control what matters by changing the partitions.

Reissue ids to records with duplicates

We have a CRM DB which for the last 6 weeks has been creating duplicate CaseIDs.
I need to go in and give new case IDs in the 20000000 range to all of the duplicates.
So I have found all the duplicates like this:
SELECT CaseNumber,
COUNT(CaseNumber) AS NumOccurrences
FROM Goldmine.dbo.cases
WHERE CaseNumber > 9000000
GROUP BY CaseNumber
HAVING ( COUNT(CaseNumber) > 1 )
Which brought back this.
I now need to renumber each one of these like so: 20000001, 20000002, and so on.
Any help would be great.
Judging by the data, simply incrementing each duplicate by 1 would create new collisions, because some existing records already hold the "updated" values.
Here is a way to fix this:
with data
as (select *
,count(*) over(partition by x) as cnt
,row_number() over(order by x) as rnk
from t
)
update data
set x = x+rnk;
Initial record set
+-----------+
| orig_data |
+-----------+
| 10000009 |
| 10000009 |
| 10000009 |
| 10000009 |
| 10000010 |
| 10000010 |
| 10000011 |
+-----------+
After update
+-----------+
| after_upd |
+-----------+
| 10000010 |
| 10000011 |
| 10000012 |
| 10000014 |
| 10000015 |
| 10000017 |
+-----------+
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=c4ea8335abb074b8c0143e2f7c767f04
I am going to assume that you are using SQL Server. So you can use updatable CTEs:
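WITH dups as (
      SELECT c.*,
             -- number the rows within each CaseNumber so the "first" one gets seqnum = 1
             ROW_NUMBER() OVER (PARTITION BY CaseNumber ORDER BY CaseNumber) as seqnum
      FROM Goldmine.dbo.cases c
      WHERE CaseNumber > 9000000
     ),
     toupdate as (
      -- number the remaining duplicates across the whole set so each new value is unique
      SELECT d.*, ROW_NUMBER() OVER (ORDER BY CaseNumber) as inc
      FROM dups d
      WHERE seqnum > 1
     )
UPDATE toupdate
    SET CaseNumber = 20000000 + inc;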
The first subquery identifies the duplicates by enumerating them. Presumably, you don't want the "first" one to change. So the second CTE selects only the real duplicates and assigns a sequential number. The outer update uses that to assign the new number.
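The idea described here (number the rows within each CaseNumber, leave the first one alone, and give every later one the next value in the 20000000 range) can be checked on a small table. This is a rough sketch only. It uses Python's sqlite3 as a stand-in, and because SQLite has no updatable CTEs, it swaps in a different mechanism: the new numbers are computed with a SELECT and then applied by primary key. The id column and the sample case numbers are assumptions, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: "id" is an assumed surrogate key used to target single rows.
conn.execute("CREATE TABLE cases (id INTEGER PRIMARY KEY, case_number INTEGER)")
conn.executemany(
    "INSERT INTO cases (id, case_number) VALUES (?, ?)",
    [
        (1, 10000009), (2, 10000009), (3, 10000009), (4, 10000009),
        (5, 10000010), (6, 10000010),
        (7, 10000011),
    ],
)

# Step 1: enumerate rows per case_number; rows with seqnum > 1 are the duplicates.
# Step 2: give those duplicates one running sequence across the whole set.
new_numbers = conn.execute(
    """
    WITH dups AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY case_number ORDER BY id) AS seqnum
        FROM cases
        WHERE case_number > 9000000
    )
    SELECT id, 20000000 + ROW_NUMBER() OVER (ORDER BY id) AS new_case_number
    FROM dups
    WHERE seqnum > 1
    """
).fetchall()

# Step 3: apply the precomputed numbers, one row at a time, by primary key.
conn.executemany(
    "UPDATE cases SET case_number = ? WHERE id = ?",
    [(new_case_number, row_id) for row_id, new_case_number in new_numbers],
)
conn.commit()

result = conn.execute("SELECT id, case_number FROM cases ORDER BY id").fetchall()
for row in result:
    print(row)
```

Computing the mapping before updating keeps the numbering independent of rows that have already been changed, at the cost of a round trip through the client.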

How to SELECT in SQL based on a value from the same table column?

I have the following table
| id | date | team |
|----|------------|------|
| 1 | 2019-01-05 | A |
| 2 | 2019-01-05 | A |
| 3 | 2019-01-01 | A |
| 4 | 2019-01-04 | B |
| 5 | 2019-01-01 | B |
How can I query the table to receive the most recent values for the teams?
For example, the result for the above table would be ids 1,2,4.
In this case, you can use window functions:
select t.*
from (select t.*, rank() over (partition by team order by date desc) as seqnum
from t
) t
where seqnum = 1;
In some databases a correlated subquery is faster with the right indexes (I haven't tested this with Postgres):
select t.*
from t
where t.date = (select max(t2.date) from t t2 where t2.team = t.team);
And if you wanted only one row per team, then the canonical answer is:
select distinct on (t.team) t.*
from t
order by t.team, t.date desc;
However, that doesn't work in this case because you want all rows from the most recent date.
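The difference between keeping all rows from the most recent date and keeping one row per team comes down to how ties are ranked. A small sketch on the question's data, using Python's sqlite3 (SQLite 3.25 or later) rather than Postgres, so treat it as an illustration of the window functions and not of Postgres behavior; the helper latest_ids is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, date TEXT, team TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "2019-01-05", "A"),
        (2, "2019-01-05", "A"),
        (3, "2019-01-01", "A"),
        (4, "2019-01-04", "B"),
        (5, "2019-01-01", "B"),
    ],
)


def latest_ids(window_fn):
    """Return ids whose per-team ranking by date (newest first) equals 1."""
    if window_fn not in ("rank", "row_number"):
        raise ValueError(f"unsupported window function: {window_fn}")
    sql = f"""
        SELECT id
        FROM (
            SELECT id,
                   {window_fn}() OVER (PARTITION BY team ORDER BY date DESC) AS seqnum
            FROM t
        ) AS ranked
        WHERE seqnum = 1
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql)]


with_rank = latest_ids("rank")              # ties share rank 1: [1, 2, 4]
with_row_number = latest_ids("row_number")  # exactly one id per team

print(with_rank)
print(len(with_row_number))
```

rank() gives tied rows the same number, so both of team A's rows dated 2019-01-05 survive; row_number() keeps one of them, and which one is not defined unless the ORDER BY includes a tie-breaker.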
If your dataset is large, consider the max analytic function in a subquery:
with cte as (
select
id, date, team,
max (date) over (partition by team) as max_date
from t
)
select id
from cte
where date = max_date
Notionally, max is O(n), so it should be pretty efficient. I don't pretend to know the actual implementation on PostgreSQL, but my guess is it's O(n).
One more possibility, generic:
select * from t join (select max(date) date,team from t
group by team) tt
using(date,team)
A window function is the best solution for you.
select id
from (
select team, id, rank() over (partition by team order by date desc) as row_num
from table
) t
where row_num = 1
That query will return this table:
| id |
|----|
| 1 |
| 2 |
| 4 |
If you want one row per team, you need to use the array_agg function.
select team, array_agg(id) ids
from (
select team, id, rank() over (partition by team order by date desc) as row_num
from table
) t
where row_num = 1
group by team
That query will return this table:
| team | ids |
|------|--------|
| A | [1, 2] |
| B | [4] |

SQL Query: get the unique id/date combos based on latest dates - need speed improvement

Not sure how to title or ask this really. Say I am getting a result set like this on a join of two tables, one contains the Id (C), the other contains the Rating and CreatedDate (R) with a foreign key to the first table:
-----------------------------------
| C.Id | R.Rating | R.CreatedDate |
-----------------------------------
| 2 | 5 | 12/08/1981 |
| 2 | 3 | 01/01/2001 |
| 5 | 1 | 11/11/2011 |
| 5 | 2 | 10/10/2010 |
I want this result set (the newest ones only):
-----------------------------------
| C.Id | R.Rating | R.CreatedDate |
-----------------------------------
| 2 | 3 | 01/01/2001 |
| 5 | 1 | 11/11/2011 |
This is a very large data set, and my method (I won't mention which, so there is no bias) is very slow at this. Any ideas on how to get this set? It doesn't necessarily have to be a single query; this is in a stored procedure.
Thank you!
You need a CTE with a ROW_NUMBER():
WITH CTE AS (
SELECT ID, Rating, CreatedDate, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY CreatedDate DESC) RowID
FROM [TABLESWITHJOIN]
)
SELECT *
FROM CTE
WHERE RowID = 1;
You can use row_number():
select t.*
from (select t.*,
row_number() over (partition by id order by createddate desc) as seqnum
from table t
) t
where seqnum = 1;
If you are using SQL Server 2008 or later, you should consider using windowing functions. For example:
select ID, Rating, CreatedDate from (
select ID, Rating, CreatedDate,
rowseq=ROW_NUMBER() over (partition by ID order by CreatedDate desc)
from MyTable
) x
where rowseq = 1
Also, please understand that while this is an efficient query in and of itself, your overall performance depends even more heavily on the underlying tables and, in particular, the indexes and explain plans that are used when joining the tables in the first place, etc.

How to keep the first row of a certain group based on some condition on Teradata SQL?

I have a table in Teradata that looks like this:
ID | Date | Values
------------------------
abc | 1Jan2015 | 1
abc | 1Dec2015 | 0
def | 2Feb2015 | 0
def | 2Jul2015 | 0
I want to write a piece of SQL that keeps only the earliest date of each ID. So the result I wanted is
ID | Date | Values
------------------------
abc | 1Jan2015 | 1
def | 2Feb2015 | 0
I know there is top n syntax but it only seems to work on the whole table not within groups.
Basically how do I do a top n within groups?
TOP can be easily rewritten using ROW_NUMBER:
select *
from tab
qualify
row_number() over (partition by id order by date) = 1
You can do this using row_number():
select t.*
from (select t.*,
row_number() over (partition by id order by date) as seqnum
from table t
) t
where seqnum = 1;