How to delete records with lower version in BigQuery?


Let's say my table contains the following data:
id  name   version
1   Rahul  1
1   Rahul  2
2   John   1
3   Mike   1
2   John   2
4   Rubel  1
5   David  1
1   Rahul  3
I need to filter the duplicate records with lower version. How can this be done?
The output essentially should be
id  name   version
1   Rahul  3
2   John   2
3   Mike   1
4   Rubel  1
5   David  1

For this dataset, aggregation seems sufficient:
select id, name, max(version) as max_version
from mytable
group by id, name

You can use NOT EXISTS as follows:
SELECT id, name, version
FROM your_table t
WHERE NOT EXISTS
  (SELECT 1
   FROM your_table tt
   WHERE tt.id = t.id AND tt.version > t.version)
Or you can use the analytic function ROW_NUMBER as follows:
SELECT id, name, version
FROM
  (SELECT t.*,
          ROW_NUMBER() OVER (PARTITION BY id ORDER BY version DESC) AS rn
   FROM your_table t) t
WHERE rn = 1
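The queries above select the highest version per id, while the question title asks about deleting. As a minimal sketch, the same "a higher version exists" test can drive a DELETE. It runs against an in-memory SQLite database through Python's sqlite3 module only so that it is executable anywhere. The table name mytable and the helper names are assumptions for illustration, and the statement has not been verified against BigQuery's DML dialect.

```python
import sqlite3

# Sample rows from the question: (id, name, version).
ROWS = [
    (1, "Rahul", 1), (1, "Rahul", 2), (2, "John", 1), (3, "Mike", 1),
    (2, "John", 2), (4, "Rubel", 1), (5, "David", 1), (1, "Rahul", 3),
]


def delete_lower_versions(conn):
    """Delete every row for which a higher version of the same id exists."""
    conn.execute(
        """
        DELETE FROM mytable
        WHERE EXISTS (
            SELECT 1
            FROM mytable AS newer
            WHERE newer.id = mytable.id
              AND newer.version > mytable.version
        )
        """
    )


def demo():
    """Load the sample data, run the delete, and return the surviving rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER, name TEXT, version INTEGER)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", ROWS)
    delete_lower_versions(conn)
    return conn.execute(
        "SELECT id, name, version FROM mytable ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The row with the highest version per id is never matched by the EXISTS test, so exactly one row per id survives as long as versions are unique within an id.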

Related

How to set an incrementing flag column for related rows?

I am trying to create a flag column called "Related" to use in reporting to highlight specific rows that are related based on the ID column (1 = related, NULL = not related). The original table "table1" looks like below:
Name ID Related
--------------------------------
Jack 101 NULL
John 101 NULL
Pat 105 NULL
Ben 106 NULL
Jordan 106 NULL
George 300 NULL
Alan 500 NULL
Bill 200 NULL
Bob 200 NULL
I then used this UPDATE statement below:
UPDATE a
SET Related = 1
FROM table1 a
JOIN (SELECT ID FROM table1 GROUP BY ID HAVING COUNT(*) > 1) b
ON a.ID = b.ID
Below is the result of this update statement:
Name ID Related
--------------------------------
Jack 101 1
John 101 1
Pat 105 NULL
Ben 106 1
Jordan 106 1
George 300 NULL
Alan 500 NULL
Bill 200 1
Bob 200 1
This gets me close, but instead of assigning the number 1 to every related row, I need the number to increment for each set of related rows, based on their ID values.
Desired result:
Name ID Related
--------------------------------
Jack 101 1
John 101 1
Pat 105 NULL
Ben 106 2
Jordan 106 2
George 300 NULL
Alan 500 NULL
Bill 200 3
Bob 200 3

This is a possible solution using dense_rank to number your related values and an updateable CTE
with r as (
select id
from t
group by id having Count(*) > 1
),
n as (
select t.id, t.related, Dense_Rank() over (order by r.id) r
from r
join t on t.id = r.id
)
update n set related = r

You can do this without a self-join, just using window functions in a CTE, and updating the CTE directly:
WITH tCounted AS (
    SELECT
        t.id,
        t.related,
        c = COUNT(*) OVER (PARTITION BY t.id)
    FROM t
),
tWithRelated AS (
    SELECT
        id,
        related,
        rn = DENSE_RANK() OVER (ORDER BY id)
    FROM tCounted
    WHERE c > 1
)
UPDATE tWithRelated
SET related = rn;

Use an updateable CTE - comments explain the logic.
with cte1 as (
select [Name], ID, Related
-- Get the count within the id partition, less 1 as specified
, count(*) over (partition by id) - 1 cnt
-- Get the row number within the id partition
, row_number() over (partition by id order by id) rn
from #Test
), cte2 as (
select [Name], ID, Related, cnt, rn
-- Add 1 *only* if the count is > 0 *and* it's the first row in the id partition
, case when cnt > 0 then sum(case when cnt > 0 and rn = 1 then 1 else 0 end) over (order by id) else null end NewRelated
from cte1
)
update cte2 set Related = NewRelated;
This doesn't assume Related is already null and works for more than 2 rows for any given ID.
It does assume that one can order by the ID column - even though the data provided doesn't do that.
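The answers above rely on SQL Server's updatable CTEs. As a rough portable sketch of the same DENSE_RANK numbering, the version below swaps the updatable CTE for a correlated subquery in the SET clause and runs on an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later is assumed for window functions). The helper names flag_related and demo are hypothetical.

```python
import sqlite3

# Sample rows from the question: (Name, ID); Related starts out NULL.
ROWS = [
    ("Jack", 101), ("John", 101), ("Pat", 105), ("Ben", 106), ("Jordan", 106),
    ("George", 300), ("Alan", 500), ("Bill", 200), ("Bob", 200),
]


def flag_related(conn):
    """Number each ID that occurs more than once: 1, 2, 3, ... in ID order."""
    conn.execute(
        """
        UPDATE table1
        SET Related = (
            SELECT grp
            FROM (
                -- One row per repeated ID; the window function runs after
                -- GROUP BY / HAVING, so only repeated IDs are numbered.
                SELECT ID, DENSE_RANK() OVER (ORDER BY ID) AS grp
                FROM table1
                GROUP BY ID
                HAVING COUNT(*) > 1
            ) AS numbered
            WHERE numbered.ID = table1.ID
        )
        """
    )


def demo():
    """Load the sample data, apply the flag, and return rows in insert order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (Name TEXT, ID INTEGER, Related INTEGER)")
    conn.executemany("INSERT INTO table1 (Name, ID) VALUES (?, ?)", ROWS)
    flag_related(conn)
    return conn.execute(
        "SELECT Name, ID, Related FROM table1 ORDER BY rowid"
    ).fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

IDs that occur only once find no match in the subquery, so their Related value is set to NULL, which also resets any stale flag.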

How to delete rows after the item which equals to exact value?

I have the following dataframe
Block_id step name
1 1 Marie
1 2 Bob
1 3 John
1 4 Lola
2 1 Alex
2 2 John
2 3 Kate
2 4 Herald
3 1 Alec
3 2 Paul
3 3 Rex
As you can see, the data frame is sorted by block_id and then by step. Within each block_id, I want to delete the row where the name is John and every row after it. So the desired output would be
Block_id step name
1 1 Marie
1 2 Bob
2 1 Alex
3 1 Alec
3 2 Paul
3 3 Rex

An updatable CTE with a cumulative conditional COUNT seems to be what you are after:
CREATE TABLE dbo.YourTable (BlockID int,
Step int,
[Name] varchar(10));
GO
INSERT INTO dbo.YourTable
VALUES(1,1,'Marie'),
(1,2,'Bob'),
(1,3,'John'),
(1,4,'Lola'),
(2,1,'Alex'),
(2,2,'John'),
(2,3,'Kate'),
(2,4,'Herald'),
(3,1,'Alec'),
(3,2,'Paul'),
(3,3,'Rex');
GO
WITH CTE AS(
SELECT COUNT(CASE [Name] WHEN 'John' THEN 1 END) OVER (PARTITION BY BlockID ORDER BY Step) AS Johns
FROM dbo.YourTable)
DELETE FROM CTE
WHERE Johns >= 1;
GO
SELECT *
FROM dbo.YourTable;
GO
DROP TABLE dbo.YourTable;

One method uses an updatable CTE:
with todelete as (
    select t.*,
           min(case when name = 'John' then step end) over (partition by block_id) as john_step
    from t
)
delete from todelete
where step >= john_step;
Or, if you prefer, a correlated subquery:
delete from t
where step >= (select min(t2.step)
               from t t2
               where t2.block_id = t.block_id and t2.name = 'John'
              );
For performance, both of these can take advantage of an index on (block_id, name, step).
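As a portable sketch of the cumulative conditional COUNT idea from this thread, the example below runs the delete on an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later assumed). SQLite cannot delete through a CTE, so this version collects the matching rowids in a subquery instead. The table name t, the column names block_id, step and name, and the helper names are assumptions taken from the question's layout.

```python
import sqlite3

# Sample rows from the question: (block_id, step, name).
ROWS = [
    (1, 1, "Marie"), (1, 2, "Bob"), (1, 3, "John"), (1, 4, "Lola"),
    (2, 1, "Alex"), (2, 2, "John"), (2, 3, "Kate"), (2, 4, "Herald"),
    (3, 1, "Alec"), (3, 2, "Paul"), (3, 3, "Rex"),
]


def delete_from_john_onward(conn):
    """Within each block, delete the first 'John' row and every later step."""
    conn.execute(
        """
        DELETE FROM t
        WHERE rowid IN (
            SELECT rid
            FROM (
                -- Running count of 'John' rows per block, ordered by step:
                -- it becomes >= 1 at the first John and stays there.
                SELECT rowid AS rid,
                       COUNT(CASE WHEN name = 'John' THEN 1 END)
                           OVER (PARTITION BY block_id ORDER BY step) AS johns
                FROM t
            )
            WHERE johns >= 1
        )
        """
    )


def demo():
    """Load the sample data, run the delete, and return the surviving rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (block_id INTEGER, step INTEGER, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ROWS)
    delete_from_john_onward(conn)
    return conn.execute(
        "SELECT block_id, step, name FROM t ORDER BY block_id, step"
    ).fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Blocks without a John keep a running count of 0 throughout, so they are left untouched.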

SQL Server get distinct counts of name by each ID

I have a dataset like :
ID NAME
1 Aaron
2 Theon
3 Jon Snow
4 Jon Snow
4 Dany
5 Arya
5 Robert
5 Tyrion
I need to add a new column to this that shows the output based on the number of distinct names per ID. So expected output would be:
ID NAME Mapping
1 Aaron 1
2 Theon 1
3 Jon Snow 1
4 Jon Snow 2
4 Dany 2
5 Arya 3
5 Robert 3
5 Tyrion 3
I am confused about how to achieve this since I have tried a case statement where count(distinct(name)) does not return the right values.

You may try using COUNT as an analytic function:
SELECT
ID,
Name,
COUNT(*) OVER (PARTITION BY ID) Mapping
FROM yourTable
ORDER BY
ID;

Another approach to get COUNT of DISTINCT Name for each ID
SELECT *,
(SELECT Count(DISTINCT NAME)
FROM #table T
WHERE T1.id = T.id) Mapping
FROM #table T1

You can simply use the query below, which returns one row per ID:
SELECT ID, COUNT(DISTINCT NAME) AS Mapping
FROM YOUR_TABLE
GROUP BY ID
Thanks

Another method (specific to SQL Server; elsewhere use INNER JOIN LATERAL):
SELECT *
FROM #table f1
CROSS APPLY
(
select Count(*) Nb from #table f2
where f2.ID=f1.ID
) f3
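Note that the COUNT(*) variants in this thread count rows, so they match the distinct-name count only while no name repeats within an ID. For engines that do not accept DISTINCT inside a windowed aggregate, one common workaround is to add an ascending and a descending DENSE_RANK and subtract 1. The sketch below tries this on an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later assumed). The table name names_by_id and the extra duplicate row (5, 'Arya') are assumptions added to show the difference, and the trick assumes NAME is never NULL.

```python
import sqlite3

# Rows from the question, plus one hypothetical duplicate (5, "Arya")
# so that row count and distinct-name count differ for ID 5.
ROWS = [
    (1, "Aaron"), (2, "Theon"), (3, "Jon Snow"), (4, "Jon Snow"), (4, "Dany"),
    (5, "Arya"), (5, "Robert"), (5, "Tyrion"), (5, "Arya"),
]


def distinct_names_per_id(conn):
    """Return (ID, NAME, Mapping) where Mapping = distinct names in that ID."""
    return conn.execute(
        """
        SELECT ID,
               NAME,
               -- ascending rank + descending rank - 1 = number of distinct
               -- NAME values in the partition (valid when NAME is not NULL)
               DENSE_RANK() OVER (PARTITION BY ID ORDER BY NAME)
             + DENSE_RANK() OVER (PARTITION BY ID ORDER BY NAME DESC)
             - 1 AS Mapping
        FROM names_by_id
        ORDER BY ID, NAME
        """
    ).fetchall()


def demo():
    """Load the sample data and return the per-row distinct-name counts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE names_by_id (ID INTEGER, NAME TEXT)")
    conn.executemany("INSERT INTO names_by_id VALUES (?, ?)", ROWS)
    return distinct_names_per_id(conn)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

For ID 5 this yields 3 on every row even though the ID now has four rows, whereas COUNT(*) OVER (PARTITION BY ID) would report 4.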

How to write optimized query for multiple prioritize conditional joins in SQL server?

The scenario I'm after is:
Result = Nothing
CollectionOfTables = Tbl1, Tbl2, Tbl3
While(True){
CurrentTable = GetHighestPriorityTable(CollectionOfTables)
If(CurrentTable) = Nothing Then Break Loop;
RemoveCurrentTableFrom(CollectionOfTables)
ForEach ID in CurrentTable as TempRow {
If(Result.DoesntContainsId(ID)) Then Result.AddRow(TempRow)
}
}
Assume I have following three tables.
Table1, Priority 1
Id Name
1 John
2 Mary
3 Elsa
Table2, Priority 2
Id Name
2 Steve
3 Max
4 Peter
Table3, Priority 3
Id Name
4 Frank
5 Harry
6 Mona
Here is the final result I need.
Result
Id Name
1 John
2 Mary
3 Elsa
4 Peter
5 Harry
6 Mona
A few points to keep in mind:
The actual number of tables is 10.
Each table has more than 1 million rows.
A join is not required, but given the amount of data, the query must be optimized and use SQL set operations rather than a cursor script.

Here's a way to do it using UNION ALL and ROW_NUMBER():
;With Cte As
(
Select Id, Name, 1 As Prio
From Table1
Union All
Select Id, Name, 2 As Prio
From Table2
Union All
Select Id, Name, 3 As Prio
From Table3
), Ranked As
(
Select Id, Name, Row_Number() Over (Partition By Id Order By Prio) As RN
From Cte
)
Select Id, Name
From Ranked
Where RN = 1
Order By Id Asc;
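Since the real case has 10 tables, writing the UNION ALL branches by hand gets repetitive. A minimal sketch of generating the same query from a priority-ordered list of table names is shown below, run against an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later assumed). The helper names are hypothetical, the table names come from a fixed list rather than user input, and nothing here measures performance at the stated row counts.

```python
import sqlite3

# Sample data from the question, listed from highest to lowest priority.
TABLES = {
    "Table1": [(1, "John"), (2, "Mary"), (3, "Elsa")],
    "Table2": [(2, "Steve"), (3, "Max"), (4, "Peter")],
    "Table3": [(4, "Frank"), (5, "Harry"), (6, "Mona")],
}


def build_priority_query(table_names):
    """Build the UNION ALL + ROW_NUMBER query; list order defines priority."""
    branches = [
        f"SELECT Id, Name, {prio} AS Prio FROM {table}"
        for prio, table in enumerate(table_names, start=1)
    ]
    return (
        "WITH Cte AS (" + " UNION ALL ".join(branches) + "), "
        "Ranked AS ("
        "SELECT Id, Name, "
        "ROW_NUMBER() OVER (PARTITION BY Id ORDER BY Prio) AS RN FROM Cte"
        ") "
        "SELECT Id, Name FROM Ranked WHERE RN = 1 ORDER BY Id"
    )


def demo():
    """Create the sample tables and return one row per Id by priority."""
    conn = sqlite3.connect(":memory:")
    for table, rows in TABLES.items():
        conn.execute(f"CREATE TABLE {table} (Id INTEGER, Name TEXT)")
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return conn.execute(build_priority_query(list(TABLES))).fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

Each Id keeps the row from the earliest table in the list that contains it, which is the behaviour the pseudocode in the question describes.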

postgresql - filter out double rows (but not the first and last one)

I have a PostgreSQL problem.
I have a table which looks like this:
id name level timestamp
1 pete 1 100
2 pete 1 200
3 pete 1 500
4 pete 5 900
7 pete 5 1000
9 pete 5 1200
15 pete 2 700
Now I want to delete the lines I don't need. I only want to know the first line where he reaches a new level and the last line where he still has this level.
id name level timestamp
1 pete 1 100
3 pete 1 500
15 pete 2 700
4 pete 5 900
9 pete 5 1200
(There are many more columns, such as realmpoints and so on.)
I have a solution that works if the timestamp is only increasing:
SELECT id, name, level, timestamp
FROM player_testing
WHERE id IN (SELECT MAX(dup.id)
             FROM player_testing AS dup
             GROUP BY dup.name, dup.level
             UNION
             SELECT MIN(dup.id)
             FROM player_testing AS dup
             GROUP BY dup.name, dup.level)
ORDER BY timestamp
But I cannot find a way to make it work for my problem.

select id, name, level, timestamp
from (
select id,name,level,timestamp,
row_number() over (partition by name, level order by timestamp) as rn,
count(*) over (partition by name, level) as max_rn
from player_testing
) t
where rn = 1 or rn = max_rn;
Btw: timestamp is a horrible name for a column. For one, it's a SQL keyword (the name of a data type); more importantly, it doesn't document what the column contains. Is it a start_timestamp, an end_timestamp, a valid_until_timestamp, ...?

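Here is an alternative to @a_horse_with_no_name's solution that avoids OVER (PARTITION BY ...) and is thus more generic SQL. Note that it picks the first and last row per level by id rather than by timestamp: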
select *
from player_testing as A
where id = (
select min(id)
from player_testing as B
where A.name = B.name
and A.level = B.level
)
or id = (
select max(id)
from player_testing as B
where A.name = B.name
and A.level = B.level
)
Here is the fiddle to show it working: http://sqlfiddle.com/#!2/47bd44/1
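To check the window-function approach against the sample data from the question, here is a small sketch that runs it on an in-memory SQLite database via Python's sqlite3 module (SQLite 3.25 or later assumed, standing in for PostgreSQL). The helper names are hypothetical, and the column names are kept as in the question.

```python
import sqlite3

# Sample rows from the question: (id, name, level, timestamp).
ROWS = [
    (1, "pete", 1, 100), (2, "pete", 1, 200), (3, "pete", 1, 500),
    (4, "pete", 5, 900), (7, "pete", 5, 1000), (9, "pete", 5, 1200),
    (15, "pete", 2, 700),
]


def first_and_last_per_level(conn):
    """Keep the earliest and latest row (by timestamp) per name and level."""
    return conn.execute(
        """
        SELECT id, name, level, timestamp
        FROM (
            SELECT id, name, level, timestamp,
                   ROW_NUMBER() OVER (PARTITION BY name, level
                                      ORDER BY timestamp) AS rn,
                   COUNT(*) OVER (PARTITION BY name, level) AS max_rn
            FROM player_testing
        ) AS t
        WHERE rn = 1 OR rn = max_rn
        ORDER BY timestamp
        """
    ).fetchall()


def demo():
    """Load the sample data and return the filtered rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE player_testing "
        "(id INTEGER, name TEXT, level INTEGER, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO player_testing VALUES (?, ?, ?, ?)", ROWS)
    return first_and_last_per_level(conn)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

A level with a single row, such as level 2 here, satisfies both conditions at once and is returned only once.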
