Need a SQL delete script for group by and having clause - sql

I'm trying to build a Delete query for the below script which has multiple columns in group by.
select Test_1,Test_2,Test_3,Test_4,Test_5,Test_6,Test_7,Test_8,Test_9,
Test_10,Test_11,Test_12,Test_13,Test_14,Test_15,Test_16,Test_17,Test_18,Test_19,Test_20,
Test_21,Test_22,Test_23,Test_24,Test_25,Test_26,Test_27,Test_28,Test_29,Test_30 from Test_Table
where PROS_Test='236458'
group by Test_1,Test_2,Test_3,Test_4,Test_5,Test_6,Test_7,Test_8,Test_9,
Test_10,Test_11,Test_12,Test_13,Test_14,Test_15,Test_16,Test_17,Test_18,Test_19,Test_20,
Test_21,Test_22,Test_23,Test_24,Test_25,Test_26,Test_27,Test_28,Test_29,Test_30
having count(*)>1
The select query returns 102100 rows; I need the equivalent delete query.

Try a delete statement with a subquery:
delete tt
from (
select row_number() over (partition by Test_1,Test_2,Test_3,Test_4,Test_5,Test_6,Test_7,Test_8,Test_9,
Test_10,Test_11,Test_12,Test_13,Test_14,Test_15,Test_16,Test_17,Test_18,Test_19,Test_20,
Test_21,Test_22,Test_23,Test_24,Test_25,Test_26,Test_27,Test_28,Test_29,Test_30 order by Test_1) rn
from Test_Table
where PROS_Test='236458'
) tt where rn > 1

You can create a CTE from your query and use it in the delete command:
with cte as (
select Test_1,Test_2,Test_3,Test_4,Test_5,Test_6,Test_7,Test_8,Test_9,
Test_10,Test_11,Test_12,Test_13,Test_14,Test_15,Test_16,Test_17,Test_18,
Test_19,Test_20,Test_21,Test_22,Test_23,Test_24,Test_25,Test_26,
Test_27,Test_28,Test_29,Test_30 from Test_Table
where PROS_Test='236458'
group by Test_1,Test_2,Test_3,Test_4,Test_5,Test_6,Test_7,Test_8,Test_9,
Test_10,Test_11,Test_12,Test_13,Test_14,Test_15,Test_16,Test_17,Test_18,
Test_19,Test_20,Test_21,Test_22,Test_23,Test_24,Test_25,Test_26,
Test_27,Test_28,Test_29,Test_30
having count(*) > 1
)
delete t
from Test_Table as t
INNER JOIN cte as c ON t.Test_1 = c.Test_1 ...
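Add the matching conditions for the remaining columns (Test_2 through Test_30) to the ON clause after the Test_1 match. Note that this join deletes every row in each duplicated group, not only the extra copies.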
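A minimal sketch of the keep-one-copy variant on a toy table. It assumes a hypothetical table demo_dupes with only two of the thirty columns and an assumed surrogate key row_id (the original Test_Table may not have one), and it uses a plain DELETE ... WHERE ... IN so it does not depend on the T-SQL delete-from-subquery syntax:

```sql
-- Hypothetical stand-in for Test_Table: two grouping columns instead of thirty.
CREATE TABLE demo_dupes (
    row_id    INTEGER PRIMARY KEY,  -- assumed surrogate key, not in the original table
    pros_test VARCHAR(10),
    test_1    INTEGER,
    test_2    INTEGER
);

INSERT INTO demo_dupes (row_id, pros_test, test_1, test_2) VALUES
    (1, '236458', 1, 1),
    (2, '236458', 1, 1),
    (3, '236458', 1, 1),
    (4, '236458', 2, 2),
    (5, '999999', 1, 1),
    (6, '999999', 1, 1);

-- Number the rows inside each duplicate group, then delete everything after the first.
DELETE FROM demo_dupes
WHERE row_id IN (
    SELECT row_id
    FROM (
        SELECT row_id,
               ROW_NUMBER() OVER (PARTITION BY test_1, test_2 ORDER BY row_id) AS rn
        FROM demo_dupes
        WHERE pros_test = '236458'
    ) ranked
    WHERE rn > 1
);

SELECT row_id FROM demo_dupes ORDER BY row_id;
-- remaining row_id values: 1, 4, 5, 6
```

Rows 5 and 6 survive because the filter on pros_test keeps them out of the numbered set.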

Related

SQL - Removing Row Groups

I have a table with the following information:
Is there a way to remove all groups which have multiple IDs? For example group 3 would be removed because it consists of ID 1 and 2.
Thank you!
A simple, portable and efficient approach is not exists:
select t.*
from mytable t
where not exists (
select 1
from mytable t1
where t1.group = t.group and t1.id <> t.id
)
For performance, consider an index on (group, id).
Side note: group is a SQL keyword (as in group by), hence not a good choice for a column name.
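The index suggestion and the keyword side note can be sketched together on a toy table. The table name group_members, the column name grp (used instead of the reserved word group), the index name, and the sample rows are all assumptions for illustration:

```sql
CREATE TABLE group_members (
    id  INTEGER NOT NULL,
    grp INTEGER NOT NULL
);

INSERT INTO group_members (id, grp) VALUES
    (1, 3), (2, 3),  -- group 3 has two different IDs
    (3, 4),          -- group 4 has a single ID
    (1, 5);          -- group 5 has a single ID

-- Composite index so the NOT EXISTS probe can be answered from the index.
CREATE INDEX idx_group_members_grp_id ON group_members (grp, id);

SELECT t.id, t.grp
FROM group_members t
WHERE NOT EXISTS (
    SELECT 1
    FROM group_members t1
    WHERE t1.grp = t.grp AND t1.id <> t.id
);
-- returns (3, 4) and (1, 5); both rows of group 3 are filtered out
```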
You can use the query below to remove all groups that have multiple IDs:
Delete from <your_table_name> where Group in (select Group from <your_table_name> group by Group having count(distinct ID) > 1)
The inner query returns each Group that has more than one distinct ID.
select * from temp where group in (
select group from temp group by group having count(distinct id) > 1)
delete from temp where group in (
select group from temp group by group having count(distinct id) > 1)
Try the query below, which keeps only the groups that have a single distinct ID:
select id,group from table where group in
(
select group from(
select group,count(distinct id) as cn from table group by 1 having cn=1) a
)
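For the delete variants, a minimal sketch on made-up rows, assuming a hypothetical table temp_groups with a column grp in place of the reserved word group. MySQL rejects a DELETE whose subquery reads the target table directly, so there the subquery would need to be wrapped in a derived table:

```sql
CREATE TABLE temp_groups (
    id  INTEGER NOT NULL,
    grp INTEGER NOT NULL
);

INSERT INTO temp_groups (id, grp) VALUES
    (1, 3), (2, 3),  -- two different IDs in group 3
    (3, 4),
    (1, 5), (1, 5);  -- the same ID twice in group 5

-- Remove every group that contains more than one distinct ID.
DELETE FROM temp_groups
WHERE grp IN (
    SELECT grp
    FROM temp_groups
    GROUP BY grp
    HAVING COUNT(DISTINCT id) > 1
);

SELECT id, grp FROM temp_groups ORDER BY grp, id;
-- leaves (3, 4), (1, 5), (1, 5): group 5 repeats one ID, so it is kept
```

Counting distinct IDs rather than rows is what keeps group 5 in place here.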

Select duplicated data from table

Query
select * from table1
where having count(reference)>1
I want to select all rows that have duplicate data. Any idea why my query is not working?
Below is my expected result:
You can make use of window function count to find number of rows per id and reference and then filter to get those which have count more than 1.
;with cte as (
select t.*, count(*) over (partition by id, reference) cnt
from table1 t
)
select * from cte where cnt > 1;
In the above solution, I have assumed that name and id have a one-to-one correspondence (which is true for your given data). If that's not the case, add name to the partition by clause too:
;with cte as (
select t.*, count(*) over (partition by name, id, reference) cnt
from table1 t
)
select * from cte where cnt > 1;
I might actually approach this by using a subquery with GROUP BY:
SELECT t1.*
FROM table1 t1
INNER JOIN
(
SELECT Name, ID, reference
FROM table1
GROUP BY Name, ID, reference
HAVING COUNT(*) > 1
) t2
ON t1.Name = t2.Name AND
t1.ID = t2.ID AND
t1.reference = t2.reference
Try this: first I compute the count per partition, then I keep the rows with a count greater than 1.
select No, Name, ID, Reference
from (select count(*) over (partition by name, ID, reference) cnt, table1.* from table1) t
where cnt>1
The easy way (although maybe not the best for performance) would be:
select * from table1 where reference in (
select reference from table1 group by reference having count(*)>1
)
The subselect finds the duplicated references, and the outer select returns all the rows for those references.
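As for why the original query fails: HAVING filters the groups produced by GROUP BY and cannot follow a bare WHERE keyword. A small sketch of the subselect approach on made-up rows; the data is hypothetical, and the first column is named row_no here as an assumption to avoid keyword clashes:

```sql
CREATE TABLE table1 (
    row_no    INTEGER,
    name      VARCHAR(20),
    id        INTEGER,
    reference VARCHAR(20)
);

INSERT INTO table1 (row_no, name, id, reference) VALUES
    (1, 'Ann', 10, 'R1'),
    (2, 'Ann', 10, 'R1'),
    (3, 'Bob', 20, 'R2'),
    (4, 'Cid', 30, 'R3'),
    (5, 'Cid', 30, 'R3');

-- The inner query finds references that occur more than once;
-- the outer query returns every row carrying one of them.
SELECT *
FROM table1
WHERE reference IN (
    SELECT reference
    FROM table1
    GROUP BY reference
    HAVING COUNT(*) > 1
)
ORDER BY row_no;
-- returns rows 1, 2, 4 and 5; row 3 (reference R2) appears only once
```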

Scalable Solution to get latest row for each ID in BigQuery

I have a fairly large table with a field ID and another field collection_time. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data, and I want just one of the records with the maximum collection time. I have tried two solutions, but neither has worked for me:
First: using query
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) as rn
FROM mytable) where rn=1
This results in Resources exceeded error that I guess is because of ORDER BY in the query.
Second
Using join between table and latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
This solution does not work for me because (ID, collection_time) is not unique, so the JOIN result would contain multiple rows for some IDs.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
SELECT
agg.table.*
FROM (
SELECT
id,
ARRAY_AGG(STRUCT(table)
ORDER BY
collection_time DESC)[SAFE_OFFSET(0)] agg
FROM
`dataset.table` table
GROUP BY
id)
This does the job and scales well; because it carries whole rows, you won't have to change the query when the schema changes.
Short and scalable version:
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
Quick and dirty option: combine both of your queries into one. First get all records with the latest collection_time (using your second query), then dedup them using your first query:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
FROM (
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
)
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh)
WITH myTable AS (
SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
SELECT ID,
MAX(collection_time) AS second_time
FROM myTable GROUP BY ID
),
tab2 AS (
SELECT * FROM myTable
),
joint AS (
SELECT tab2.*
FROM tab2 INNER JOIN tab1
ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM joint
)
WHERE rn=1
If you don't care about writing a piece of code for every column:
SELECT ID,
ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID
I see no one has mentioned window functions with QUALIFY:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp that the QUALIFY clause can filter on. Rows tied on the maximum collection_time are all returned.
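If exactly one row per id is needed even when collection_time ties, QUALIFY can filter on ROW_NUMBER instead. This is a sketch in BigQuery standard SQL with inline sample rows; the payload column and the data are assumptions, and since it sorts within each partition it may hit the same resource limits as the ROW_NUMBER query in the question:

```sql
WITH my_table AS (
  SELECT 1 AS id, 10 AS collection_time, 'a' AS payload UNION ALL
  SELECT 1, 20, 'b' UNION ALL
  SELECT 1, 20, 'c' UNION ALL  -- ties with the previous row on (id, collection_time)
  SELECT 2, 5, 'd'
)
SELECT *
FROM my_table
WHERE TRUE  -- harmless filter; older BigQuery versions required one alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY collection_time DESC) = 1;
-- returns one row per id; for id 1 it is one of the two rows with collection_time 20
```

Which of the tied rows is kept is arbitrary unless another column is added to the ORDER BY.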
As per your comment, assuming you have a table of unique IDs (id_table) for which you need the latest collection_time, here is another way to do it using a correlated subquery. Give it a try.
SELECT id,
(SELECT Max(collection_time)
FROM mytable B
WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
Another solution, which could be more scalable since it avoids multiple scans of the same table (which will happen with both self-join and correlated subquery in above answers). This solution only works with standard SQL (uncheck "Use Legacy SQL" option):
SELECT
ID,
(SELECT AS STRUCT srow.*
FROM UNNEST(t.srows) srow
ORDER BY srow.collection_time DESC
LIMIT 1) AS latest
FROM
(SELECT ID, ARRAY_AGG(STRUCT(collection_time, col1, col2, col3, ...)) srows
FROM id_table
GROUP BY ID) t

SQL query: how to distinct count of a column group by another column

In my table I need to know if each ID has one and only one ID_name. How can I write such query?
I tried:
select ID, count(distinct ID_name) as count_name
from table
group by ID
having count_name > 1
But it takes forever to run.
Any thoughts?
select ID
from YourTable
group by
ID
having count(distinct ID_name) > 1
or
select *
from YourTable yt1
where exists
(
select *
from YourTable yt2
where yt1.ID = yt2.ID
and yt1.ID_Name <> yt2.ID_Name
)
Now, most ID columns are defined as primary key and are unique. So in a regular database you'd expect both queries to return an empty set.
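Because the question only asks whether an ID has more than one name, one alternative worth trying is to compare MIN and MAX instead of counting distinct values. This is a swapped-in technique, not one proposed above, and it ignores NULL names; whether it is actually faster depends on the engine and the available indexes. The table id_names and its rows are hypothetical:

```sql
CREATE TABLE id_names (
    id      INTEGER,
    id_name VARCHAR(20)
);

INSERT INTO id_names (id, id_name) VALUES
    (1, 'alpha'),
    (1, 'alpha'),  -- same name repeated: still one distinct name
    (2, 'beta'),
    (2, 'gamma'),  -- two different names for id 2
    (3, 'delta');

-- If the smallest and largest name differ, the ID has at least two distinct names.
SELECT id
FROM id_names
GROUP BY id
HAVING MIN(id_name) <> MAX(id_name);
-- returns only id 2
```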
select tt.ID,max(tt.myRank)
from
(
select
ip.ID,ip.ID_name,
DENSE_RANK() over (partition by ip.ID order by ip.ID_name) as myRank
from YourTable ip
) tt
group by tt.ID
This gives you every ID with its total number of distinct ID_name values.
If you want only those IDs that have more than one name associated, just add a where clause,
e.g.
select tt.ID,max(tt.myRank)
from
(
select
ip.ID,ip.ID_name,
DENSE_RANK() over (partition by ip.ID order by ip.ID_name) as myRank
from YourTable ip
) tt
where tt.myRank > 1
group by tt.ID

How to get Original Rows filtered by a HAVING Condition?

What is the method in T-SQL to select the original values limited by a HAVING condition? For example, if I have
A|B
10|1
11|2
10|3
How would I get all the values of B (not an average or some other summary stat), grouped by A, having a count (occurrences of A) greater than or equal to 2?
Actually, you have several options to choose from
1. You could make a subquery out of your original having statement and join it back to your table
SELECT *
FROM YourTable yt
INNER JOIN (
SELECT A
FROM YourTable
GROUP BY
A
HAVING COUNT(*) >= 2
) cnt ON cnt.A = yt.A
2. another equivalent solution would be to use a WITH clause
;WITH cnt AS (
SELECT A
FROM YourTable
GROUP BY
A
HAVING COUNT(*) >= 2
)
SELECT *
FROM YourTable yt
INNER JOIN cnt ON cnt.A = yt.A
3. or you could use an IN statement
SELECT *
FROM YourTable yt
WHERE A IN (SELECT A FROM YourTable GROUP BY A HAVING COUNT(*) >= 2)
A self join will work:
select B
from table
join(
select A
from table
group by 1
having count(1)>1
)s
using(A);
You can use window function (no joins, only one table scan):
select * from (
select *, cnt=count(*) over(partition by A) from table
) as a
where cnt >= 2
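On the sample data from the question, the window-function approach can be checked with a short script. This sketch uses the standard COUNT(*) OVER (...) AS cnt spelling rather than the T-SQL cnt = ... form, and the table name ab_demo is an assumption:

```sql
CREATE TABLE ab_demo (
    a INTEGER,
    b INTEGER
);

INSERT INTO ab_demo (a, b) VALUES
    (10, 1),
    (11, 2),
    (10, 3);

-- Attach the per-A row count to every row, then keep rows whose A occurs at least twice.
SELECT a, b
FROM (
    SELECT a, b, COUNT(*) OVER (PARTITION BY a) AS cnt
    FROM ab_demo
) AS counted
WHERE cnt >= 2
ORDER BY b;
-- returns (10, 1) and (10, 3)
```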