How to delete records with a lower version in BigQuery?

Let's say my table contains the following data:
id name version
1 Rahul 1
1 Rahul 2
2 John 1
3 Mike 1
2 John 2
4 Rubel 1
5 David 1
1 Rahul 3
I need to filter out the duplicate records with a lower version, keeping only the highest version for each id. How can this be done?
The output should essentially be:
id name version
1 Rahul 3
2 John 2
3 Mike 1
4 Rubel 1
5 David 1

For this dataset, aggregation seems sufficient:
select id, name, max(version) as max_version
from mytable
group by id, name

You can use not exists as follows:
select id, name, version
from your_table t
Where not exists
(Select 1 from your_table tt
Where tt.id = t.id and tt.version > t.version)
Or you can use analytical function row_number as follows:
Select id, name, version from
(select t.*,
Row_number() over (partition by id order by version desc) as rn
from your_table t) t
Where rn = 1
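Since the title asks about deleting rather than selecting, here is a minimal sketch of the same "no higher version exists" idea turned into a DELETE. It runs against an in-memory SQLite database as a stand-in, so BigQuery's dialect may need small adjustments; the function name delete_lower_versions is hypothetical, and the table name mytable is taken from the first answer.

```python
import sqlite3

# Sample rows from the question: (id, name, version)
rows = [
    (1, "Rahul", 1), (1, "Rahul", 2), (2, "John", 1), (3, "Mike", 1),
    (2, "John", 2), (4, "Rubel", 1), (5, "David", 1), (1, "Rahul", 3),
]

def delete_lower_versions(data):
    """Load data into an in-memory table, delete superseded rows, return the rest."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (id INTEGER, name TEXT, version INTEGER)")
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", data)
    # Delete a row whenever another row with the same id has a higher version.
    con.execute(
        """
        DELETE FROM mytable
        WHERE EXISTS (SELECT 1
                      FROM mytable AS tt
                      WHERE tt.id = mytable.id
                        AND tt.version > mytable.version)
        """
    )
    remaining = con.execute(
        "SELECT id, name, version FROM mytable ORDER BY id"
    ).fetchall()
    con.close()
    return remaining

remaining = delete_lower_versions(rows)
print(remaining)
```

The highest version per id never satisfies the EXISTS condition, so it is the only row left for each id.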

Related

How to set an incrementing flag column for related rows?

I am trying to create a flag column called "Related" to use in reporting to highlight specific rows that are related based on the ID column (1 = related, NULL = not related). The original table "table1" looks like below:
Name ID Related
--------------------------------
Jack 101 NULL
John 101 NULL
Pat 105 NULL
Ben 106 NULL
Jordan 106 NULL
George 300 NULL
Alan 500 NULL
Bill 200 NULL
Bob 200 NULL
I then used this UPDATE statement below:
UPDATE a
SET Related = 1
FROM table1 a
JOIN (SELECT ID FROM table1 GROUP BY ID HAVING COUNT(*) > 1) b
ON a.ID = b.ID
Below is the result of this update statement:
Name ID Related
--------------------------------
Jack 101 1
John 101 1
Pat 105 NULL
Ben 106 1
Jordan 106 1
George 300 NULL
Alan 500 NULL
Bill 200 1
Bob 200 1
This gets me close, but instead of assigning the number 1 to every related row, I need the number to increment for each set of related rows, based on their distinct ID values.
Desired result:
Name ID Related
--------------------------------
Jack 101 1
John 101 1
Pat 105 NULL
Ben 106 2
Jordan 106 2
George 300 NULL
Alan 500 NULL
Bill 200 3
Bob 200 3
This is a possible solution using dense_rank to number your related values and an updateable CTE
with r as (
select id
from t
group by id having Count(*) > 1
),
n as (
select t.id, t.related, Dense_Rank() over (order by r.id) r
from r
join t on t.id = r.id
)
update n set related = r
You can do this without a self-join, just using window functions in a CTE, and updating the CTE directly:
WITH tCounted AS (
SELECT
t.id,
t.related,
c = COUNT(*) OVER (PARTITION BY t.id)
FROM t
),
tWithRelated AS (
SELECT
tCounted.id,
tCounted.related,
rn = DENSE_RANK() OVER (ORDER BY tCounted.id)
FROM tCounted
WHERE c > 1
)
UPDATE tWithRelated
SET related = rn;
Use an updateable CTE - comments explain the logic.
with cte1 as (
select [Name], ID, Related
-- Get the count within the id partition, less 1 as specified
, count(*) over (partition by id) - 1 cnt
-- Get the row number within the id partition
, row_number() over (partition by id order by id) rn
from #Test
), cte2 as (
select [Name], ID, Related, cnt, rn
-- Add 1 *only* if the count is > 0 *and* it's the first row in the id partition
, case when cnt > 0 then sum(case when cnt > 0 and rn = 1 then 1 else 0 end) over (order by id) else null end NewRelated
from cte1
)
update cte2 set Related = NewRelated;
This doesn't assume Related is already null and works for more than 2 rows for any given ID.
It does assume that one can order by the ID column - even though the data provided doesn't do that.
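For readers without SQL Server at hand, the dense-rank idea can be tried out with a rough sketch like the one below. It uses SQLite, which has no updatable CTEs, so the ranking is applied through a correlated subquery instead; that is a swapped-in technique, not what the answers above do. The function name flag_related is hypothetical, and window functions require SQLite 3.25 or later.

```python
import sqlite3

# Sample rows from the question: (Name, ID); Related starts out as NULL
people = [
    ("Jack", 101), ("John", 101), ("Pat", 105), ("Ben", 106), ("Jordan", 106),
    ("George", 300), ("Alan", 500), ("Bill", 200), ("Bob", 200),
]

def flag_related(data):
    """Number each group of rows sharing an ID (1, 2, 3, ...); singletons stay NULL."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE table1 (Name TEXT, ID INTEGER, Related INTEGER)")
    con.executemany("INSERT INTO table1 (Name, ID) VALUES (?, ?)", data)
    # Rank only the IDs that occur more than once; IDs without a match
    # make the scalar subquery return NULL, so Related stays NULL for them.
    con.execute(
        """
        UPDATE table1
        SET Related = (
            SELECT g.rnk
            FROM (SELECT d.ID, DENSE_RANK() OVER (ORDER BY d.ID) AS rnk
                  FROM (SELECT ID FROM table1
                        GROUP BY ID HAVING COUNT(*) > 1) AS d
                 ) AS g
            WHERE g.ID = table1.ID
        )
        """
    )
    result = con.execute(
        "SELECT Name, ID, Related FROM table1 ORDER BY rowid"
    ).fetchall()
    con.close()
    return result

flagged = flag_related(people)
for row in flagged:
    print(row)
```

As with the answers above, the numbering follows the sort order of ID, not the order in which the rows were inserted.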

How to delete rows after the item which equals to exact value?

I have the following dataframe
Block_id step name
1 1 Marie
1 2 Bob
1 3 John
1 4 Lola
2 1 Alex
2 2 John
2 3 Kate
2 4 Herald
3 1 Alec
3 2 Paul
3 3 Rex
As you can see, the data frame is sorted by block_id and then by step. Within each block_id, I want to delete everything from the row where the name is John onward (including the row with John itself). So the desired output would be:
Block_id step name
1 1 Marie
1 2 Bob
2 1 Alex
3 1 Alec
3 2 Paul
3 3 Rex
An updatable CTE with a cumulative conditional COUNT seems to be what you are after:
CREATE TABLE dbo.YourTable (BlockID int,
Step int,
[Name] varchar(10));
GO
INSERT INTO dbo.YourTable
VALUES(1,1,'Marie'),
(1,2,'Bob'),
(1,3,'John'),
(1,4,'Lola'),
(2,1,'Alex'),
(2,2,'John'),
(2,3,'Kate'),
(2,4,'Herald'),
(3,1,'Alec'),
(3,2,'Paul'),
(3,3,'Rex');
GO
WITH CTE AS(
SELECT COUNT(CASE [Name] WHEN 'John' THEN 1 END) OVER (PARTITION BY BlockID ORDER BY Step) AS Johns
FROM dbo.YourTable)
DELETE FROM CTE
WHERE Johns >= 1;
GO
SELECT *
FROM dbo.YourTable;
GO
DROP TABLE dbo.YourTable;
One method uses an updatable CTE:
with todelete as (
select t.*,
min(case when name = 'John' then step end) over (partition by block_id) as john_step
from t
)
delete from todelete
where step >= john_step;
Or, if you prefer, a correlated subquery:
delete from t
where step >= (select min(t2.step)
from t t2
where t2.block_id = t.block_id and t2.name = 'John'
);
For performance, both of these can take advantage of an index on (block_id, name, step).
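The "delete from John onward" logic can also be checked with a small sketch outside SQL Server. The version below uses SQLite and selects the rows to delete through their rowid (a SQLite-specific column) joined to each block's first matching step; the function name delete_from_name_onward and the column names block_id, step, name are assumptions based on the question's layout.

```python
import sqlite3

# Sample rows from the question: (block_id, step, name)
steps = [
    (1, 1, "Marie"), (1, 2, "Bob"), (1, 3, "John"), (1, 4, "Lola"),
    (2, 1, "Alex"), (2, 2, "John"), (2, 3, "Kate"), (2, 4, "Herald"),
    (3, 1, "Alec"), (3, 2, "Paul"), (3, 3, "Rex"),
]

def delete_from_name_onward(data, name="John"):
    """Within each block, delete the first row matching name and every later step."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (block_id INTEGER, step INTEGER, name TEXT)")
    con.executemany("INSERT INTO t VALUES (?, ?, ?)", data)
    # j holds, per block, the first step at which the name appears.
    # Blocks without the name have no row in j, so nothing is deleted there.
    con.execute(
        """
        DELETE FROM t
        WHERE rowid IN (
            SELECT t1.rowid
            FROM t AS t1
            JOIN (SELECT block_id, MIN(step) AS first_step
                  FROM t
                  WHERE name = ?
                  GROUP BY block_id) AS j
              ON j.block_id = t1.block_id
            WHERE t1.step >= j.first_step
        )
        """,
        (name,),
    )
    kept = con.execute(
        "SELECT block_id, step, name FROM t ORDER BY block_id, step"
    ).fetchall()
    con.close()
    return kept

kept = delete_from_name_onward(steps)
print(kept)
```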

SQL Server get distinct counts of name by each ID

I have a dataset like :
ID NAME
1 Aaron
2 Theon
3 Jon Snow
4 Jon Snow
4 Dany
5 Arya
5 Robert
5 Tyrion
I need to add a new column to this that shows the output based on the number of distinct names per ID. So expected output would be:
ID NAME Mapping
1 Aaron 1
2 Theon 1
3 Jon Snow 1
4 Jon Snow 2
4 Dany 2
5 Arya 3
5 Robert 3
5 Tyrion 3
I am confused about how to achieve this, since I tried a CASE expression in which COUNT(DISTINCT name) did not return the right values.
You may try using COUNT as an analytic function:
SELECT
ID,
Name,
COUNT(*) OVER (PARTITION BY ID) Mapping
FROM yourTable
ORDER BY
ID;
Another approach to get COUNT of DISTINCT Name for each ID
SELECT *,
(SELECT Count(DISTINCT NAME)
FROM #table T
WHERE T1.id = T.id) Mapping
FROM #table T1
You can simply use the query below:
SELECT COUNT(DISTINCT NAME)
FROM YOUR_TABLE
GROUP BY ID
Thanks
Another method (specific to SQL Server; elsewhere use INNER JOIN LATERAL):
SELECT *
FROM #table f1
CROSS APPLY
(
select Count(*) Nb from #table f2
where f2.ID=f1.ID
) f3
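The answers above mix COUNT(*) and COUNT(DISTINCT NAME), which agree on the sample data only because no ID repeats a name. The sketch below, run against SQLite, puts both counts side by side; the extra (4, 'Dany') row is a hypothetical duplicate added purely to make the difference visible, and the function name count_names is likewise an assumption.

```python
import sqlite3

# Sample rows from the question: (ID, NAME)
names = [
    (1, "Aaron"), (2, "Theon"), (3, "Jon Snow"), (4, "Jon Snow"), (4, "Dany"),
    (5, "Arya"), (5, "Robert"), (5, "Tyrion"),
    (4, "Dany"),  # hypothetical duplicate, NOT part of the question's data
]

def count_names(data):
    """Return (ID, NAME, row_count, mapping) with mapping = distinct names per ID."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE names (ID INTEGER, NAME TEXT)")
    con.executemany("INSERT INTO names VALUES (?, ?)", data)
    result = con.execute(
        """
        SELECT t1.ID, t1.NAME,
               COUNT(*) OVER (PARTITION BY t1.ID) AS row_count,
               (SELECT COUNT(DISTINCT t.NAME)
                FROM names AS t
                WHERE t.ID = t1.ID) AS mapping
        FROM names AS t1
        ORDER BY t1.ID, t1.NAME
        """
    ).fetchall()
    con.close()
    return result

counted = count_names(names)
for row in counted:
    print(row)
```

For ID 4 the window count reports 3 rows while the correlated subquery reports 2 distinct names, so only the latter matches the "distinct names per ID" requirement once duplicates appear.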

How to write optimized query for multiple prioritize conditional joins in SQL server?

The scenario I'm after is:
Result = Nothing
CollectionOfTables = Tbl1, Tbl2, Tbl3
While(True){
CurrentTable = GetHighestPriorityTable(CollectionOfTables)
If(CurrentTable) = Nothing Then Break Loop;
RemoveCurrentTableFrom(CollectionOfTables)
ForEach ID in CurrentTable as TempRow {
If(Result.DoesntContainsId(ID)) Then Result.AddRow(TempRow)
}
}
Assume I have following three tables.
Table1, Priority 1
Id Name
1 John
2 Mary
3 Elsa
Table2, Priority 2
Id Name
2 Steve
3 Max
4 Peter
Table3, Priority 3
Id Name
4 Frank
5 Harry
6 Mona
Here is the final result I need.
Result
Id Name
1 John
2 Mary
3 Elsa
4 Peter
5 Harry
6 Mona
A few tips to keep in mind.
Number of actual tables is 10.
Number of rows per table exceeds 1 Million.
A join is not required, but because of the amount of data I'm working with, the query must be optimized and use set-based operations in SQL rather than a cursor script.
Here's a way to do it using UNION ALL and ROW_NUMBER():
;With Cte As
(
Select Id, Name, 1 As Prio
From Table1
Union All
Select Id, Name, 2 As Prio
From Table2
Union All
Select Id, Name, 3 As Prio
From Table3
), Ranked As
(
Select Id, Name, Row_Number() Over (Partition By Id Order By Prio) As RN
From Cte
)
Select Id, Name
From Ranked
Where RN = 1
Order By Id Asc;
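To see the priority logic in action on the three sample tables, a minimal sketch along these lines can be run with SQLite (window functions need SQLite 3.25 or later). The function name merge_by_priority is hypothetical, and nothing here says anything about performance on tables with millions of rows.

```python
import sqlite3

# Sample tables from the question, listed in priority order (highest first).
tables = {
    "Table1": [(1, "John"), (2, "Mary"), (3, "Elsa")],
    "Table2": [(2, "Steve"), (3, "Max"), (4, "Peter")],
    "Table3": [(4, "Frank"), (5, "Harry"), (6, "Mona")],
}

def merge_by_priority(source):
    """For each Id, keep the row from the highest-priority table that contains it."""
    con = sqlite3.connect(":memory:")
    for table_name, table_rows in source.items():
        con.execute(f"CREATE TABLE {table_name} (Id INTEGER, Name TEXT)")
        con.executemany(f"INSERT INTO {table_name} VALUES (?, ?)", table_rows)
    result = con.execute(
        """
        WITH Cte AS (
            SELECT Id, Name, 1 AS Prio FROM Table1
            UNION ALL
            SELECT Id, Name, 2 AS Prio FROM Table2
            UNION ALL
            SELECT Id, Name, 3 AS Prio FROM Table3
        ), Ranked AS (
            -- RN = 1 marks the row with the lowest Prio number for each Id
            SELECT Id, Name,
                   ROW_NUMBER() OVER (PARTITION BY Id ORDER BY Prio) AS RN
            FROM Cte
        )
        SELECT Id, Name FROM Ranked WHERE RN = 1 ORDER BY Id
        """
    ).fetchall()
    con.close()
    return result

merged = merge_by_priority(tables)
print(merged)
```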

postgresql - filter out double rows (but not the first and last one)

I have a PostgreSQL problem.
I have a table that looks like this:
id name level timestamp
1 pete 1 100
2 pete 1 200
3 pete 1 500
4 pete 5 900
7 pete 5 1000
9 pete 5 1200
15 pete 2 700
Now I want to delete the lines I don't need. I only want to know the first line where he gets a new level and the last line where he still has that level.
id name level timestamp
1 pete 1 100
3 pete 1 500
15 pete 2 700
4 pete 5 900
9 pete 5 1200
(There are many more columns, such as realmpoints and so on.)
I have a solution if the timestamp is only increasing.
SELECT id, name, level, timestamp
FROM player_testing
WHERE id IN (SELECT MAX(dup.id)
FROM player_testing AS dup
GROUP BY dup.name, dup.level
UNION
SELECT MIN(dup.id)
FROM player_testing AS dup
GROUP BY dup.name, dup.level)
ORDER BY timestamp
But I can't find a way to make it work for my problem.
select id, name, level, timestamp
from (
select id,name,level,timestamp,
row_number() over (partition by name, level order by timestamp) as rn,
count(*) over (partition by name, level) as max_rn
from player_testing
) t
where rn = 1 or rn = max_rn;
Btw: timestamp is a horrible name for a column. For one, it's a reserved word, but more importantly it doesn't document what the column contains. Is it a start_timestamp, an end_timestamp, a valid_until_timestamp, ...?
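The window-function query above can be checked against the sample rows with a short sketch like this one. It uses SQLite as a stand-in for PostgreSQL (window functions need SQLite 3.25 or later); the function name first_and_last_per_level is hypothetical, and the column is still called timestamp only to mirror the question.

```python
import sqlite3

# Sample rows from the question: (id, name, level, timestamp)
levels = [
    (1, "pete", 1, 100), (2, "pete", 1, 200), (3, "pete", 1, 500),
    (4, "pete", 5, 900), (7, "pete", 5, 1000), (9, "pete", 5, 1200),
    (15, "pete", 2, 700),
]

def first_and_last_per_level(data):
    """Keep the earliest and latest row (by timestamp) for each (name, level)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE player_testing "
        "(id INTEGER, name TEXT, level INTEGER, timestamp INTEGER)"
    )
    con.executemany("INSERT INTO player_testing VALUES (?, ?, ?, ?)", data)
    result = con.execute(
        """
        SELECT id, name, level, timestamp
        FROM (
            SELECT id, name, level, timestamp,
                   ROW_NUMBER() OVER (PARTITION BY name, level
                                      ORDER BY timestamp) AS rn,
                   COUNT(*) OVER (PARTITION BY name, level) AS max_rn
            FROM player_testing
        ) AS t
        WHERE rn = 1 OR rn = max_rn
        ORDER BY timestamp
        """
    ).fetchall()
    con.close()
    return result

kept_rows = first_and_last_per_level(levels)
print(kept_rows)
```

Because the rows are ranked by timestamp rather than by id, a level with a single row (level 2 here) is kept once, and the result does not depend on ids growing in step with time.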
Here is an alternative to @a_horse_with_no_name's solution that avoids OVER (PARTITION BY ...) and is thus more generic SQL:
select *
from player_testing as A
where id = (
select min(id)
from player_testing as B
where A.name = B.name
and A.level = B.level
)
or id = (
select max(id)
from player_testing as B
where A.name = B.name
and A.level = B.level
)
Here is the fiddle to show it working: http://sqlfiddle.com/#!2/47bd44/1