Delete rows based on group by - sql

Let's say I have the following table:
Id | QuestionId
----------------
1 | 'MyQuestionId'
1 | NULL
2 | NULL
2 | NULL
It should behave like so:
Find all the results of the same Id
If ANY of them has QuestionId IS NOT NULL, do not touch any rows with that Id.
Only if ALL the results for the same Id have QuestionId IS NULL, delete all the rows with that Id.
So in this case it should only delete rows with Id=2.
I haven't found an example for such a case anywhere. I've tried some options with rank, count, group by, but nothing worked. Can you help me?

You can use an updatable CTE or derived table for this, and calculate the count using a window function.
WITH cte AS (
SELECT t.*,
CountNonNulls = COUNT(t.QuestionId) OVER (PARTITION BY t.Id)
FROM YourTable t
)
DELETE cte
WHERE CountNonNulls = 0;
Note that this query does not contain any self-joins at all.
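On engines that do not allow deleting through a CTE, the same rule can be expressed with GROUP BY and HAVING. The following is only a sketch under stated assumptions: it uses Python's sqlite3 module with an in-memory database as a stand-in for SQL Server, it reuses the table name YourTable from the answer above, and unlike the CTE version it reads the table twice. It relies on COUNT(column) counting only non-NULL values.

```python
import sqlite3


def delete_all_null_groups(conn):
    """Delete every row whose Id has no non-NULL QuestionId at all."""
    conn.execute(
        """
        DELETE FROM YourTable
        WHERE Id IN (
            SELECT Id
            FROM YourTable
            GROUP BY Id
            HAVING COUNT(QuestionId) = 0  -- COUNT(col) skips NULLs
        )
        """
    )


# In-memory stand-in for the table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (Id INTEGER, QuestionId TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?)",
    [(1, "MyQuestionId"), (1, None), (2, None), (2, None)],
)

delete_all_null_groups(conn)
remaining = conn.execute("SELECT Id, QuestionId FROM YourTable").fetchall()
print(remaining)
```

Both rows with Id=1 survive, including the one whose QuestionId is NULL, because the group as a whole has a non-NULL value.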

Related

Select multiple columns having distinct just in 3 of them

I have a table from which I need to return about 14 column values, but only one row for each set of duplicates on some of the columns.
The second problem is that among the duplicates I need to keep the one that has the biggest int in one of the columns that is not required to be unique.
Since the table is fairly big, I am looking for advice on doing this in the most efficient way.
Should I be doing a GROUP BY?
My table looks roughly like this (I have simplified the number of columns):
ID (UniqueIdentifier) | ACCID (UniqueIdentifier) | DateTime (DateTime) | distance (int) | type (int)
28761188-0886-E911-822F-DD1FA635D450 | 1238FD8A-BD00-411A-A81C-0F6F5C026BCC | 2019-06-03 14:04:41.000 | 2 | 3
41761188-0886-E911-822F-DD1FA635D450 | 1238FD8A-BD00-411A-A81C-0F6F5C026BCC | 2019-06-03 14:04:41.000 | 1 | 3
I should select only one row per unique combination of ACCID and DateTime. The ID column is the primary key, so it will never be duplicated, and I need to keep the row with the biggest distance.
You can use the ROW_NUMBER() window function, as in:
select *
from (
select
id,
accid,
datetime,
distance,
type,
row_number() over(partition by accid, datetime order by distance desc) as rn
from t
) x
where rn = 1
If you want to show multiple "ties", then replace ROW_NUMBER() by RANK().
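The difference between ROW_NUMBER() and RANK() on ties can be seen in a small sketch. This is an illustration only, not the answerer's code: it runs on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later) rather than SQL Server, it orders by distance as the question asks, and the short ids and the helper top_per_group are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id TEXT PRIMARY KEY, accid TEXT, datetime TEXT, "
    "distance INTEGER, type INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        ("a", "acc1", "2019-06-03 14:04:41", 2, 3),
        ("b", "acc1", "2019-06-03 14:04:41", 1, 3),
        ("c", "acc2", "2019-06-03 15:00:00", 5, 1),
        ("d", "acc2", "2019-06-03 15:00:00", 5, 2),  # ties with "c" on distance
    ],
)


def top_per_group(conn, ranking_function):
    """Return the ids ranked first within each (accid, datetime) group."""
    # Only these two fixed names are ever interpolated into the SQL text.
    assert ranking_function in ("ROW_NUMBER", "RANK")
    sql = f"""
        SELECT id
        FROM (
            SELECT id,
                   {ranking_function}() OVER (
                       PARTITION BY accid, datetime
                       ORDER BY distance DESC
                   ) AS rn
            FROM t
        ) x
        WHERE rn = 1
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql)]


with_row_number = top_per_group(conn, "ROW_NUMBER")  # one row per group
with_rank = top_per_group(conn, "RANK")              # keeps both tied rows
print(with_row_number, with_rank)
```

With ROW_NUMBER() the tied group yields exactly one of "c" or "d", and which one is not defined unless the ORDER BY includes a tie-breaker such as id.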
I would suggest a correlated subquery with the right index as the fastest method:
select t.*
from t
where t.id = (select top (1) t2.id
from t t2
where t2.ACCID = t.ACCID
and t2.DateTime = t.DateTime
order by t2.distance desc
) ;
The best index is on (ACCID, DateTime, distance desc, id).

Get any not null value of other fields in aggregations

I want to aggregate on some fields and get any not null value on others. To be more precise the query looks something like:
SELECT id, any_value(field1), any_value(field2) FROM mytable GROUP BY ID
and the columns are like:
ID | field1 | field 2
-----------------
id | null | 3
id | 1 | null
id | null | null
id | 2 | 4
and the output can be like (id, 1,4) or (id,2,4) or ... but not something like (id, 1, null)
I can't find in the docs if any_value() is guaranteed to return a not null row if there is one (although it did so in my experiments) or may return a row with null value even if there are some not null values.
Does any_value() perform the task I described? If not, what way do you suggest for doing it?
This is sort of a guess, but have you tried:
SELECT id, MIN(field1), MAX(field2)
FROM mytable
GROUP BY id;
This ignores NULL values, although the two values returned for an id may come from different rows.
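A quick way to check the NULL handling of MIN and MAX is to run the query against the sample rows from the question. The sketch below assumes SQLite via Python's sqlite3 module as a stand-in engine, and it adds one hypothetical group named "empty" whose values are all NULL to show the edge case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id TEXT, field1 INTEGER, field2 INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("id", None, 3),
        ("id", 1, None),
        ("id", None, None),
        ("id", 2, 4),
        ("empty", None, None),  # hypothetical group with no values at all
    ],
)

# Aggregates skip NULL inputs; they return NULL only when every input is NULL.
result = conn.execute(
    "SELECT id, MIN(field1), MAX(field2) FROM mytable GROUP BY id ORDER BY id"
).fetchall()
print(result)  # [('empty', None, None), ('id', 1, 4)]
```

The pair (1, 4) does not exist as a row in the table, which is acceptable here because the question allows any non-NULL value per column.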
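You can use analytical functions as well.
Below is the query (SQL Server). It sorts NULLs last with a CASE expression, because ISNULL(field1, 'ZZZ') fails with a conversion error when the column is numeric:
select id, field1, field2
from (select id, field1, field2, row_number()
over (partition by id order by case when field1 is null then 1 else 0 end, case when field2 is null then 1 else 0 end) as RNK from mytable) aa
where aa.RNK = 1;
This will return only one existing row per id, so a column can still come back NULL when no single row has every column populated. You can change the order in the ORDER BY clause if you are looking for the maximum value in any column.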
This could be achieved by aggregating to array with 'ignore nulls' specified and taking the first element of the resulting array. Unlike MIN/MAX solution, you can use it with structs
SELECT
id,
ARRAY_AGG(field1 IGNORE NULLS LIMIT 1)[SAFE_OFFSET(0)],
FROM
mytable
GROUP BY
id

Performance issue on selecting n newest rows in subselect

I have a database with courses. Each course contains a set of nodes, and some nodes contains a set of answers from students. The Answer table looks (simplified) like this:
Answer
id | courseId | nodeId | answer
------------------------------------------------
1 | 1 | 1 | <- text ->
2 | 2 | 2 | <- text ->
3 | 1 | 1 | <- text ->
4 | 1 | 3 | <- text ->
5 | 2 | 2 | <- text ->
.. | .. | .. | ..
When a teacher opens a course (e.g. courseId = 1), I want to pick the node that has received the most answers lately. I can do this using the following query:
with Answers as
(
select top 50 id, nodeId from Answer A where courseId=1 order by id desc
)
select top 1 nodeId from Answers group by nodeId order by count(id) desc
or equally using this query:
select top 1 nodeId from
(select top 50 id, nodeId from Answer A where courseId=1 order by id desc) Answers
group by nodeId order by count(id) desc
In both queries the newest 50 answers (with the highest ids) are selected and then grouped by nodeId so I can pick the one with the highest frequency. My problem, however, is that the query is very slow. If I only run the subselect, it takes less than a second, and grouping 50 rows should be fast, but when I run the entire query it takes about 10 seconds! My guess is that SQL Server does the select and grouping first, and afterwards does the top 50 and top 1, which in this case leads to terrible performance.
So, how can I rewrite the query to be efficient?
You can add indexes to make your queries more efficient. For this query:
with Answers as (
select top 50 id, nodeId
from Answer A
where courseId = 1
order by id desc
)
select top 1 nodeId
from Answers
group by nodeId
order by count(id) desc;
The best index is Answer(courseId, id, nodeid).
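As a rough illustration of that index and query shape, here is a sketch that runs on SQLite through Python's sqlite3 module instead of SQL Server, so TOP becomes LIMIT. The index name IX_Answer_course_id_node and the generated data are hypothetical, and the sketch only checks that the query returns the expected node; it says nothing about timings on a real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Answer (id INTEGER PRIMARY KEY, courseId INTEGER, "
    "nodeId INTEGER, answer TEXT)"
)
# Covering index in the column order suggested above (name is made up).
conn.execute(
    "CREATE INDEX IX_Answer_course_id_node ON Answer (courseId, id, nodeId)"
)

# Synthetic data: odd ids belong to course 1, even ids to course 2.
# Answers with id > 250 all go to node 7; older ones are spread over nodes 0..4.
rows = [
    (i, 1 if i % 2 else 2, 7 if i > 250 else i % 5, "text")
    for i in range(1, 301)
]
conn.executemany("INSERT INTO Answer VALUES (?, ?, ?, ?)", rows)

most_active_node = conn.execute(
    """
    WITH Answers AS (
        SELECT id, nodeId
        FROM Answer
        WHERE courseId = 1
        ORDER BY id DESC
        LIMIT 50            -- SQLite spelling of TOP 50
    )
    SELECT nodeId
    FROM Answers
    GROUP BY nodeId
    ORDER BY COUNT(id) DESC
    LIMIT 1
    """
).fetchone()[0]
print(most_active_node)
```

The index leads with courseId so the filter is a seek, followed by id so the newest 50 rows can be read in order, and nodeId is included so the base table does not need to be visited.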
To be more insightful we'd need to see the indexes on that table and the execution plans you're getting (one plan for the inner query on its own, one plan for the full query).
I'd even recommend doing the same analysis again having added the index mentioned elsewhere on this page.
Without that information the only things we can recommend are trial and error.
For example, try avoiding using TOP (this shouldn't matter, but we're guessing while we can't see your indexes and execution plans)
WITH
Answers AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY id DESC) AS rowId,
id,
nodeId
FROM
Answer
WHERE
courseId = 1
),
top50 AS
(
SELECT
nodeId,
COUNT(*) AS row_count
FROM
Answers
WHERE
rowId <= 50
GROUP BY
nodeId
),
ranked AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY row_count DESC, nodeId DESC) AS ordinal,
nodeID
FROM
top50
)
SELECT
nodeID
FROM
ranked
WHERE
ordinal = 1
This is massively over the top and functionally the same as the query in your question, but it is sufficiently different that it may get a different execution plan.
Alternatively (and not very nice), just put the results of your inner query in to a table variable, then run the outer query on the table variable.
I still expect, however, that adding the index will be the least-worst option.

How to find first duplicate row in a table sql server

I am working on SQL Server. I have a table that contains around 75000 records. Among them there are several duplicate records, so I wrote a query to find out how many times each record is repeated:
SELECT [RETAILERNAME],COUNT([RETAILERNAME]) as Repeated FROM [Stores] GROUP BY [RETAILERNAME]
It gives me result like,
---------------------------
RETAILERNAME | Repeated
---------------------------
X | 4
---------------------------
Y | 6
---------------------------
Z | 10
---------------------------
Among the 4 records of X, I need to take only the first one.
So I want to retrieve all fields from the first row of each set of duplicates. That is, taking all records whose RETAILERNAME='X' gives some number of duplicate records, and I need only the first row from them.
Please guide me.
You could try using ROW_NUMBER.
Something like
;WITH Vals AS (
SELECT *,
-- ordering by the partition column picks an arbitrary row; order by an id or date column to define "first"
ROW_NUMBER() OVER(PARTITION BY [RETAILERNAME] ORDER BY [RETAILERNAME]) RowID
FROM [Stores]
)
SELECT *
FROM Vals
WHERE RowID = 1
You can then also remove the duplicates if need be (BUT BE CAREFUL THIS IS PERMANENT)
;WITH Vals AS (
SELECT [RETAILERNAME],
ROW_NUMBER() OVER(PARTITION BY [RETAILERNAME] ORDER BY [RETAILERNAME]) RowID
FROM Stores
)
DELETE
FROM Vals
WHERE RowID > 1;
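Which row counts as "first" depends entirely on the ORDER BY inside ROW_NUMBER. The sketch below makes that explicit by ordering on a key column. It is an assumption-heavy illustration: the columns StoreId and City are hypothetical (the question does not list the columns of Stores), and it runs on SQLite via Python's sqlite3 module (3.25 or later for window functions) rather than SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# StoreId and City are made-up columns used only for this illustration.
conn.execute(
    "CREATE TABLE Stores (StoreId INTEGER PRIMARY KEY, RETAILERNAME TEXT, City TEXT)"
)
conn.executemany(
    "INSERT INTO Stores VALUES (?, ?, ?)",
    [(1, "X", "A"), (2, "X", "B"), (3, "Y", "C"), (4, "X", "D"), (5, "Y", "E")],
)

# Number the rows of each retailer by StoreId, then keep row 1 of each group.
first_rows = conn.execute(
    """
    SELECT StoreId, RETAILERNAME, City
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY RETAILERNAME
                   ORDER BY StoreId       -- defines which duplicate is "first"
               ) AS RowID
        FROM Stores
    ) AS numbered
    WHERE RowID = 1
    ORDER BY StoreId
    """
).fetchall()
print(first_rows)  # [(1, 'X', 'A'), (3, 'Y', 'C')]
```

Ordering by an insertion date or any other column works the same way, as long as it distinguishes the duplicates.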
You can write the query as follows:
SELECT TOP 1 [RETAILERNAME] FROM [Stores] GROUP BY [RETAILERNAME]
HAVING your condition
WITH cte
AS (SELECT *,
Row_number()
OVER(
partition BY [retailername]
ORDER BY [retailername]) AS RowRank
FROM [Stores])
SELECT *
FROM cte
WHERE RowRank = 1

SQL Query to get all rows with duplicate values but are not part of the same group

The database schema is organized as follows:
ID | GroupID | VALUE
--------------------
1 | 1 | A
2 | 1 | A
3 | 2 | B
4 | 3 | B
In this example, I want to GET all Rows with duplicate VALUE, but are not part of the same group. So the desired result set should be IDs (3, 4), because they are not in the same group (2, 3) but still have the same VALUE (B).
I'm having trouble writing a SQL Query and would appreciate any guidance. Thanks.
So far, I'm using SQL Count, but can't figure out what to do with the GroupId.
SELECT *
FROM TABLE T
HAVING COUNT(T.VALUE) > 1
GROUP BY ID, GroupId, VALUE
The simplest method for this is using EXISTS:
SELECT
ID
FROM
MyTable T1
WHERE
EXISTS (SELECT 1
FROM MyTable
WHERE Value = t1.Value
AND GroupID <> t1.GroupID)
Here is one method. First you have to identify the values that appear in more than one group and then use that information to find the right rows in the original table:
select *
from t
where value in (SELECT value
FROM t
GROUP BY VALUE
HAVING COUNT(distinct groupid) > 1
)
order by value
Actually, I prefer a slight variant in this case, by changing the HAVING clause:
HAVING min(groupid) <> max(groupid)
This works when you are looking for more than one group and should be faster than the COUNT DISTINCT version.
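The two HAVING variants can be compared on the sample data from the question. This is a minimal sketch that uses SQLite through Python's sqlite3 module as a stand-in engine; the helper ids_for_having is hypothetical and exists only to run the same outer query with each condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, groupid INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, "A"), (2, 1, "A"), (3, 2, "B"), (4, 3, "B")],
)


def ids_for_having(having_clause):
    """Run the IN (... GROUP BY value HAVING ...) query with a fixed condition."""
    sql = f"""
        SELECT id
        FROM t
        WHERE value IN (
            SELECT value
            FROM t
            GROUP BY value
            HAVING {having_clause}
        )
        ORDER BY id
    """
    return [row[0] for row in conn.execute(sql)]


# A value spans more than one group exactly when its smallest and largest
# groupid differ, so both conditions select the same values.
via_count_distinct = ids_for_having("COUNT(DISTINCT groupid) > 1")
via_min_max = ids_for_having("MIN(groupid) <> MAX(groupid)")
print(via_count_distinct, via_min_max)  # [3, 4] [3, 4]
```

Value A is excluded by both conditions because its two rows share groupid 1.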
SELECT DISTINCT ALL_.*
FROM (SELECT ID, GROUPID, VALUE
FROM TABLE_
GROUP BY ID, GROUPID, VALUE) GROUPED,
TABLE_ ALL_
WHERE GROUPED.VALUE = ALL_.VALUE
AND GROUPED.GROUPID <> ALL_.GROUPID