SQL Query: get the unique id/date combos based on latest dates - need speed improvement

Not sure how to title or ask this really. Say I am getting a result set like this on a join of two tables, one contains the Id (C), the other contains the Rating and CreatedDate (R) with a foreign key to the first table:
-----------------------------------
| C.Id | R.Rating | R.CreatedDate |
-----------------------------------
| 2 | 5 | 12/08/1981 |
| 2 | 3 | 01/01/2001 |
| 5 | 1 | 11/11/2011 |
| 5 | 2 | 10/10/2010 |
I want this result set (the newest ones only):
-----------------------------------
| C.Id | R.Rating | R.CreatedDate |
-----------------------------------
| 2 | 3 | 01/01/2001 |
| 5 | 1 | 11/11/2011 |
This is a very large data set, and my method (I won't mention which, so there is no bias) is very slow. Any ideas on how to get this set? It doesn't necessarily have to be a single query; this is in a stored procedure.
Thank you!

You need a CTE with a ROW_NUMBER():
WITH CTE AS (
SELECT ID, Rating, CreatedDate, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY CreatedDate DESC) RowID
FROM [TABLESWITHJOIN]
)
SELECT *
FROM CTE
WHERE RowID = 1;

You can use row_number():
select t.*
from (select t.*,
row_number() over (partition by id order by createddate desc) as seqnum
from table t
) t
where seqnum = 1;

If you are using SQL Server 2008 or later, you should consider using windowing functions. For example:
select ID, Rating, CreatedDate from (
select ID, Rating, CreatedDate,
rowseq=ROW_NUMBER() over (partition by ID order by CreatedDate desc)
from MyTable
) x
where rowseq = 1
Also, please understand that while this is an efficient query in and of itself, your overall performance depends even more heavily on the underlying tables and, in particular, on the indexes and execution plans that are used when joining the tables in the first place.
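As a rough, runnable sketch of the ROW_NUMBER() approach from the answers above, here is the same query run against SQLite through Python's sqlite3 module (SQLite 3.25 or later is needed for window functions). SQLite stands in for SQL Server only so the example can be executed anywhere. The table names c and r, the column names, the ISO date format, and the index ix_r_cid_created are all assumptions made for illustration; they are not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE c (id INTEGER PRIMARY KEY);
    CREATE TABLE r (
        c_id INTEGER REFERENCES c(id),
        rating INTEGER,
        created_date TEXT  -- ISO dates, so text order matches date order
    );
    -- Hypothetical index: may let the engine read each id's ratings
    -- already ordered by date instead of sorting them.
    CREATE INDEX ix_r_cid_created ON r (c_id, created_date DESC);
    INSERT INTO c VALUES (2), (5);
    INSERT INTO r VALUES
        (2, 5, '1981-12-08'), (2, 3, '2001-01-01'),
        (5, 1, '2011-11-11'), (5, 2, '2010-10-10');
""")

# Number each id's rows from newest to oldest, then keep row 1 per id.
latest = conn.execute("""
    WITH ranked AS (
        SELECT c.id, r.rating, r.created_date,
               ROW_NUMBER() OVER (PARTITION BY c.id
                                  ORDER BY r.created_date DESC) AS row_id
        FROM c
        JOIN r ON r.c_id = c.id
    )
    SELECT id, rating, created_date
    FROM ranked
    WHERE row_id = 1
    ORDER BY id
""").fetchall()

print(latest)  # [(2, 3, '2001-01-01'), (5, 1, '2011-11-11')]
```

Whether an index like this actually helps depends on the engine and the real schema, so it is worth checking the execution plan rather than assuming.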

Related

How can I filter duplicates/repeated fields in bigquery?

I have a table without a primary key, and I am trying to get the events of the earliest date grouped by id.
This is what a small piece of mytable looks like:
|----------|------------------|-------------|
| id | date | events |
|----------|------------------|-------------|
| 1 |2020-04-11 3:44:20| call |
|----------|------------------|-------------|
| 3 |2020-04-21 7:59:06| appointment |
|----------|------------------|-------------|
| 1 |2020-04-17 1:14:32| appointment |
|----------|------------------|-------------|
| 2 |2020-04-10 3:41:17| feedback |
|----------|------------------|-------------|
| 1 |2020-04-23 1:36:13| appointment |
|----------|------------------|-------------|
| 3 |2020-04-12 4:55:38| call |
|----------|------------------|-------------|
This is the result I am looking for:
|----------|------------------|-------------|
| id | date | events |
|----------|------------------|-------------|
| 1 |2020-04-11 3:44:20| call |
|----------|------------------|-------------|
| 2 |2020-04-10 3:41:17| feedback |
|----------|------------------|-------------|
| 3 |2020-04-12 4:55:38| call |
|----------|------------------|-------------|
I am trying to get the events for each id at its MIN(date) only. The problem is that once I SELECT events I also have to add events to the GROUP BY, so I can't GROUP BY id only as I would like to.
I have tried a lot of different versions, but here is one:
SELECT id, MIN(date), events
FROM mydataset.mytable
GROUP BY id, events
Please keep in mind that my table is much larger than this.
Any help would be very much appreciated.
You can use aggregation:
select array_agg(t order by date asc limit 1)[ordinal(1)].*
from mydataset.mytable t
group by t.id;
Or the more traditional method of using row_number():
select t.* except (seqnum)
from (select t.*, row_number() over (partition by id order by date) as seqnum
from mydataset.mytable t
) t
where seqnum = 1;
You could modify what you have as an uncorrelated subquery
select *
from mytable
where (id, date) in (select id, min(date)
from mytable
group by id);
If your DB supports window functions you could also do
select distinct id,
min(date) over(partition by id) date,
first_value(events) over (partition by id order by date asc) events
from mytable;
Outputs
+----+---------------------+----------+
| id | date | events |
+----+---------------------+----------+
| 1 | 2020-04-11 03:44:20 | call |
| 2 | 2020-04-10 03:41:17 | feedback |
| 3 | 2020-04-12 04:55:38 | call |
+----+---------------------+----------+
A join to a derived table might perform better, esp. if id and date are indexed:
select m.*
from mytable m
join (select id, min(date) date
from mytable
group by id ) x
on m.id = x.id
and m.date = x.date
;
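A small sketch of the derived-table join above, run on SQLite through Python's sqlite3 module purely so it can be executed locally (the question itself is about BigQuery, where indexes do not apply). The index name ix_mytable_id_date is hypothetical, and the timestamps are zero-padded so that text comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER, date TEXT, events TEXT);
    -- Hypothetical composite index supporting both MIN(date) per id
    -- and the join back on (id, date).
    CREATE INDEX ix_mytable_id_date ON mytable (id, date);
    INSERT INTO mytable VALUES
        (1, '2020-04-11 03:44:20', 'call'),
        (3, '2020-04-21 07:59:06', 'appointment'),
        (1, '2020-04-17 01:14:32', 'appointment'),
        (2, '2020-04-10 03:41:17', 'feedback'),
        (1, '2020-04-23 01:36:13', 'appointment'),
        (3, '2020-04-12 04:55:38', 'call');
""")

# Find the earliest date per id, then join back to fetch the full row.
earliest = conn.execute("""
    SELECT m.id, m.date, m.events
    FROM mytable m
    JOIN (SELECT id, MIN(date) AS date
          FROM mytable
          GROUP BY id) x
      ON m.id = x.id
     AND m.date = x.date
    ORDER BY m.id
""").fetchall()

for row in earliest:
    print(row)
```

Note that if two rows of the same id share the earliest date, this form returns both of them, whereas the row_number() form returns exactly one.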
To build on Gordon's answer with Jones' comment:
The version below does not require using an alias and allows use of just id in GROUP BY
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY date LIMIT 1)[ORDINAL(1)]
FROM `project.dataset.table` t
GROUP BY id

Reissue ids to records with duplicates

We have a CRM DB which for the last 6 weeks has been creating duplicate CaseIDs.
I need to go in and give new case IDs in the 20000000 range to all of the duplicates.
So I have found all the duplicates like this
SELECT CaseNumber,
COUNT(CaseNumber) AS NumOccurrences
FROM Goldmine.dbo.cases
WHERE CaseNumber > 9000000
GROUP BY CaseNumber
HAVING ( COUNT(CaseNumber) > 1 )
Which brought back this.
I now need to renumber each one of these like so: 20000001, 20000002, and so on.
Any help would be great.
By the look of the data, simply incrementing by 1 will not work: some existing records already hold the values that the duplicates would be "updated" to.
Here is a way to fix this:
with data
as (select *
,count(*) over(partition by x) as cnt
,row_number() over(order by x) as rnk
from t
)
update data
set x = x+rnk;
Initial record set
+-----------+
| orig_data |
+-----------+
| 10000009 |
| 10000009 |
| 10000009 |
| 10000009 |
| 10000010 |
| 10000010 |
| 10000011 |
+-----------+
After update
+-----------+
| after_upd |
+-----------+
| 10000010 |
| 10000011 |
| 10000012 |
| 10000014 |
| 10000015 |
| 10000017 |
+-----------+
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=c4ea8335abb074b8c0143e2f7c767f04
I am going to assume that you are using SQL Server. So you can use updatable CTEs:
WITH dups as (
SELECT c.*,
-- number the rows within each CaseNumber; seqnum = 1 is the "first" row
ROW_NUMBER() OVER (PARTITION BY CaseNumber ORDER BY CaseNumber) as seqnum
FROM Goldmine.dbo.cases c
WHERE CaseNumber > 9000000
),
toupdate as (
-- one running counter across all remaining duplicates
SELECT d.*, ROW_NUMBER() OVER (ORDER BY CaseNumber) as inc
FROM dups d
WHERE seqnum > 1
)
UPDATE toupdate
SET CaseNumber = 20000000 + inc;
The first subquery identifies the duplicates by enumerating them. Presumably, you don't want the "first" one to change. So the second CTE selects only the real duplicates and assigns a sequential number. The outer update uses that to assign the new number.
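The idea described above (leave the first row of each CaseNumber alone, give every extra row a fresh sequential number) can be tried out with the following sketch. It is not the SQL Server statement itself: SQLite has no updatable CTEs, so this version swaps in a different technique, selecting the rows to change by their implicit rowid and applying the updates from Python. The sample case numbers are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases (CaseNumber INTEGER);
    -- Made-up sample: 9000002 appears twice, 9000003 three times.
    INSERT INTO cases VALUES
        (5), (9000001), (9000002), (9000002),
        (9000003), (9000003), (9000003);
""")

# Step 1: enumerate rows within each CaseNumber (seqnum = 1 is kept as-is).
# Step 2: give every remaining duplicate one running number across the set.
to_update = conn.execute("""
    WITH dups AS (
        SELECT rowid AS rid, CaseNumber,
               ROW_NUMBER() OVER (PARTITION BY CaseNumber
                                  ORDER BY rowid) AS seqnum
        FROM cases
        WHERE CaseNumber > 9000000
    )
    SELECT 20000000 + ROW_NUMBER() OVER (ORDER BY CaseNumber, rid) AS new_number,
           rid
    FROM dups
    WHERE seqnum > 1
""").fetchall()

# Apply the new numbers row by row, addressed by rowid.
conn.executemany("UPDATE cases SET CaseNumber = ? WHERE rowid = ?", to_update)
conn.commit()

after = [row[0] for row in
         conn.execute("SELECT CaseNumber FROM cases ORDER BY CaseNumber")]
print(after)
```

On a real table it would be safer to run the SELECT on its own first and review which rows it picks before applying any update.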

How can I create header records by taking values from one of several line items?

I have a set of sorted line items. They are sorted first by ID then by Date:
| ID | DESCRIPTION | Date |
| --- | ----------- |----------|
| 100 | Red |2019-01-01|
| 101 | White |2019-01-01|
| 101 | White_v2 |2019-02-01|
| 102 | Red_Trim |2019-01-15|
| 102 | White |2019-01-16|
| 102 | Blue |2019-01-20|
| 103 | Red_v3 |2019-01-14|
| 103 | Red_v3 |2019-03-14|
I need to insert rows in a SQL Server table, which represents a project header, so that the first row for each ID provides the Description and Date in the destination table. There should only be one row in the destination table for each ID.
For example, the source table above would result in this at the destination:
| ID | DESCRIPTION | Date |
| --- | ----------- |----------|
| 100 | Red |2019-01-01|
| 101 | White |2019-01-01|
| 102 | Red_Trim |2019-01-15|
| 103 | Red_v3 |2019-01-14|
How do I collapse the source so that I take only the first row for each ID from source?
I prefer to do this with a transformation in SSIS but can use SQL if necessary. Actually, solutions for both methods would be most helpful.
This question is distinct from "Trouble using ROW_NUMBER() OVER (PARTITION BY …)" in that this seeks to identify an approach. The asker of that question has adopted one approach, of more than one available as identified by answers here. That question is about how to make that particular approach work.
You can use row_number() :
select t.*
from (select t.*, row_number() over (partition by id order by date) as seq
from table t
) t
where seq = 1;
A correlated subquery will help here:
SELECT *
FROM yourtable t1
WHERE [Date] = (SELECT min([Date]) FROM yourtable WHERE id = t1.id)
Use the first_value window function:
select * from (select *,
first_value(DESCRIPTION) over(partition by id order by Date) as des,
row_number() over(partition by id order by Date) rn
from table
) a where a.rn =1
You can use the ROW_NUMBER() window function to do this. For example:
select *
from (
select
id, description, date,
row_number() over(partition by id order by date) as rn
from t
) x
where rn = 1
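Since the goal in the question is to insert one header row per ID into a destination table, here is a minimal sketch that combines the ROW_NUMBER() filter with INSERT ... SELECT. It runs on SQLite through Python's sqlite3 module only so that it is executable as-is; the table names line_items and project_header are hypothetical, and the SSIS side of the question is not covered here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE line_items (id INTEGER, description TEXT, date TEXT);
    CREATE TABLE project_header (
        id INTEGER PRIMARY KEY,  -- enforces one header row per ID
        description TEXT,
        date TEXT
    );
    INSERT INTO line_items VALUES
        (100, 'Red',      '2019-01-01'),
        (101, 'White',    '2019-01-01'),
        (101, 'White_v2', '2019-02-01'),
        (102, 'Red_Trim', '2019-01-15'),
        (102, 'White',    '2019-01-16'),
        (102, 'Blue',     '2019-01-20'),
        (103, 'Red_v3',   '2019-01-14'),
        (103, 'Red_v3',   '2019-03-14');

    -- Keep only the earliest line item per id and load it as the header.
    INSERT INTO project_header (id, description, date)
    SELECT id, description, date
    FROM (SELECT id, description, date,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
          FROM line_items) AS x
    WHERE rn = 1;
""")

headers = conn.execute(
    "SELECT id, description, date FROM project_header ORDER BY id"
).fetchall()
for row in headers:
    print(row)
```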

Redshift window function for change in column

I have a Redshift table with, among other things, an id and a plan_type column, and I would like a window function that groups wherever the plan_type changes. For example, if this is the data:
| user_id | plan_type | created |
|---------|-----------|------------|
| 1 | A | 2019-01-01 |
| 1 | A | 2019-01-02 |
| 1 | B | 2019-01-05 |
| 2 | A | 2019-01-01 |
| 2 | A | 2019-01-05 |
I would like a result like this where I get the first date that the plan_type was "new":
| user_id | plan_type | created |
|---------|-----------|------------|
| 1 | A | 2019-01-01 |
| 1 | B | 2019-01-05 |
| 2 | A | 2019-01-01 |
Is this possible with window functions?
EDIT
Since I have some garbage in the data where plan_type can sometimes be null, and the accepted solution does not include the first row (since I can't have the OR is not null), I had to make some modifications. Hopefully this will help other people if they have similar issues. The final query is as follows:
SELECT * FROM
(
SELECT
user_id,
plan_type,
created_at,
lag(plan_type) OVER (PARTITION by user_id ORDER BY created_at) as prev_plan,
row_number() OVER (PARTITION by user_id ORDER BY created_at) as rownum
FROM tablename
WHERE plan_type IS NOT NULL
) userHistory
WHERE
userHistory.plan_type <> userHistory.prev_plan
OR userHistory.rownum = 1
ORDER BY created_at;
The plan_type IS NOT NULL filters out bad data at the source table and the outer where clause gets any changes OR the first row of data that would not be included otherwise.
ALSO BE CAREFUL about the created_at timestamp if you are working off your prev_plan field, since it would of course give you the time of the new value!!!
This is a gaps-and-islands problem. I think lag() is the simplest approach:
select user_id, plan_type, created
from (select t.*,
lag(plan_type) over (partition by user_id order by created) as prev_plan_type
from t
) t
where prev_plan_type is null or prev_plan_type <> plan_type;
This assumes that plan types can move back to another value and you want each one.
If not, just use aggregation:
select user_id, plan_type, min(created)
from t
group by user_id, plan_type;
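The difference between the two queries above only shows up when a user returns to an earlier plan. The sketch below runs both on SQLite (3.25 or later) through Python's sqlite3 module; the row where user 1 goes back to plan A on 2019-01-09 is an assumption added to the question's sample data to make that difference visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (user_id INTEGER, plan_type TEXT, created TEXT);
    INSERT INTO t VALUES
        (1, 'A', '2019-01-01'),
        (1, 'A', '2019-01-02'),
        (1, 'B', '2019-01-05'),
        (1, 'A', '2019-01-09'),  -- assumed extra row: user 1 returns to plan A
        (2, 'A', '2019-01-01'),
        (2, 'A', '2019-01-05');
""")

# lag(): keep a row when it is the user's first or differs from the previous plan.
changes = conn.execute("""
    SELECT user_id, plan_type, created
    FROM (SELECT user_id, plan_type, created,
                 LAG(plan_type) OVER (PARTITION BY user_id
                                      ORDER BY created) AS prev_plan_type
          FROM t) AS s
    WHERE prev_plan_type IS NULL OR prev_plan_type <> plan_type
    ORDER BY user_id, created
""").fetchall()

# Aggregation: one row per (user, plan), regardless of how often it recurs.
first_seen = conn.execute("""
    SELECT user_id, plan_type, MIN(created)
    FROM t
    GROUP BY user_id, plan_type
    ORDER BY user_id, plan_type
""").fetchall()

print(changes)     # includes (1, 'A', '2019-01-09'), the switch back to A
print(first_seen)  # only the first time each plan was seen per user
```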
Use the row_number() window function:
select * from
(select *,row_number()over(partition by user_id,plan_type order by created) rn
from tablename
) a where a.rn=1
use lag()
select * from
(
select user_id, plan_type, lag(plan_type) over (partition by user_id order by created) as changes, created
from tablename
)A where plan_type<>changes and changes is not null

How to SELECT in SQL based on a value from the same table column?

I have the following table
| id | date | team |
|----|------------|------|
| 1 | 2019-01-05 | A |
| 2 | 2019-01-05 | A |
| 3 | 2019-01-01 | A |
| 4 | 2019-01-04 | B |
| 5 | 2019-01-01 | B |
How can I query the table to receive the most recent values for the teams?
For example, the result for the above table would be ids 1,2,4.
In this case, you can use window functions:
select t.*
from (select t.*, rank() over (partition by team order by date desc) as seqnum
from t
) t
where seqnum = 1;
In some databases a correlated subquery is faster with the right indexes (I haven't tested this with Postgres):
select t.*
from t
where t.date = (select max(t2.date) from t t2 where t2.team = t.team);
And if you wanted only one row per team, then the canonical answer is:
select distinct on (t.team) t.*
from t
order by t.team, t.date desc;
However, that doesn't work in this case because you want all rows from the most recent date.
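To make the tie behaviour concrete, here is a short sketch comparing rank() with row_number() on the question's data, run on SQLite (3.25 or later) through Python's sqlite3 module as a stand-in for Postgres. The extra id tiebreaker in the row_number() query is an assumption added so that its result is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, date TEXT, team TEXT);
    INSERT INTO t VALUES
        (1, '2019-01-05', 'A'),
        (2, '2019-01-05', 'A'),
        (3, '2019-01-01', 'A'),
        (4, '2019-01-04', 'B'),
        (5, '2019-01-01', 'B');
""")

# rank(): rows tied on the latest date all get rank 1, so ties are kept.
all_latest = [row[0] for row in conn.execute("""
    SELECT id
    FROM (SELECT id, RANK() OVER (PARTITION BY team
                                  ORDER BY date DESC) AS seqnum
          FROM t) AS s
    WHERE seqnum = 1
    ORDER BY id
""")]

# row_number(): exactly one row per team; id breaks ties (assumed tiebreaker).
one_per_team = [row[0] for row in conn.execute("""
    SELECT id
    FROM (SELECT id, ROW_NUMBER() OVER (PARTITION BY team
                                        ORDER BY date DESC, id) AS seqnum
          FROM t) AS s
    WHERE seqnum = 1
    ORDER BY id
""")]

print(all_latest)    # [1, 2, 4]
print(one_per_team)  # [1, 4]
```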
If your dataset is large, consider the max analytic function in a subquery:
with cte as (
select
id, date, team,
max (date) over (partition by team) as max_date
from t
)
select id
from cte
where date = max_date
Notionally, max is O(n), so it should be pretty efficient. I don't pretend to know the actual implementation on PostgreSQL, but my guess is it's O(n).
One more possibility, generic:
select * from t join (select max(date) date,team from t
group by team) tt
using(date,team)
A window function is the best solution here.
select id
from (
select team, id, rank() over (partition by team order by date desc) as row_num
from table
) t
where row_num = 1
That query will return this table:
| id |
|----|
| 1 |
| 2 |
| 4 |
If you want one row per team, you can use the array_agg function.
select team, array_agg(id) ids
from (
select team, id, rank() over (partition by team order by date desc) as row_num
from table
) t
where row_num = 1
group by team
That query will return this table:
| team | ids |
|------|--------|
| A | [1, 2] |
| B | [4] |