How to deduplicate rows without an additional table? - sql

I have a table containing some duplicates, which I can find with e.g.
-- table definition: t(a,b,value)
select a, b
from t
group by a, b
having count(*) > 1;
I could do
create table x as
select a, b, min(value)
from t
group by a, b;
delete from t;
insert into t select * from x;
drop table x;
but this requires creating a table x, which becomes impractical for huge tables.

Assuming you want to retain the tuple having the smallest value for a given (a, b) pair, you may try:
DELETE
FROM yourTable
WHERE EXISTS (SELECT 1 FROM yourTable t
              WHERE t.a = yourTable.a AND
                    t.b = yourTable.b AND
                    t.value < yourTable.value);
This deletes every row for which another row with the same (a, b) and a smaller value exists, so only the minimum of each group (and any row without duplicates) survives.
The above query might benefit from an index on (a, b, value). But if you don't already have this index then suggesting it is something of a moot point, as you would have to rebuild the entire index anyway.
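A quick way to sanity-check the delete on a throwaway copy of the data, here sketched with Python's sqlite3 (table and column names follow the question's t(a, b, value)):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(a INT, b INT, value INT)")
con.executemany("INSERT INTO t VALUES (?,?,?)",
                [(1, 1, 10), (1, 1, 20), (2, 2, 5)])

# Delete every row for which a row with the same (a, b) and a smaller
# value exists: only the minimum per group (and unique rows) survive.
con.execute("""
    DELETE FROM t
    WHERE EXISTS (SELECT 1 FROM t AS s
                  WHERE s.a = t.a AND s.b = t.b AND s.value < t.value)
""")

print(con.execute("SELECT a, b, value FROM t ORDER BY a").fetchall())
# [(1, 1, 10), (2, 2, 5)]
```

Note that ties (two rows sharing the same minimum value) are both retained, since neither has a strictly smaller sibling.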

but this needs creating a table x which for huge tables becomes impractical
On the contrary, creating a new table with all the distinct rows is the preferred method for huge tables with a large number of duplicates.
Create the new table x with the exact same schema as t:
CREATE TABLE x(a ... REFERENCES ...., b ..., value ...);
Disable foreign key constraints checks to speed up the process:
PRAGMA foreign_keys = OFF;
Insert the distinct rows of t to x:
INSERT INTO x(a, b, value)
SELECT a, b, MIN(value)
FROM t
GROUP BY a, b;
Drop the table t:
DROP TABLE t;
Rename the table x as t:
ALTER TABLE x RENAME TO t;
Finally reenable foreign key constraints checks:
PRAGMA foreign_keys = ON;
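The whole sequence can be exercised end-to-end with Python's sqlite3; a minimal sketch (no foreign keys involved here, so the PRAGMA steps are omitted):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(a INT, b INT, value INT)")
con.executemany("INSERT INTO t VALUES (?,?,?)",
                [(1, 1, 10), (1, 1, 20), (2, 2, 5), (2, 2, 5)])

# Rebuild the table from the distinct (a, b) groups, keeping MIN(value),
# then swap the new table into place under the old name.
con.executescript("""
    CREATE TABLE x(a INT, b INT, value INT);
    INSERT INTO x(a, b, value)
        SELECT a, b, MIN(value) FROM t GROUP BY a, b;
    DROP TABLE t;
    ALTER TABLE x RENAME TO t;
""")

print(con.execute("SELECT a, b, value FROM t ORDER BY a").fetchall())
# [(1, 1, 10), (2, 2, 5)]
```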

Related

Count non-null values from multiple columns at once without manual entry in SQL

I have a SQL table with about 50 columns, the first represents unique users and the other columns represent categories which are scored 1-10.
Here is an idea of what I'm working with
user  a     b     c
abc   5     null  null
xyz   null  6     null
I am interested in counting the number of non-null values per column.
Currently, my queries are:
SELECT col_name, COUNT(col_name) AS count
FROM table
WHERE col_name IS NOT NULL
Is there a way to count non-null values for each column in one query, without having to manually enter each column name?
The desired output would be:
column  count
a       1
b       1
c       0
Consider the approach below (no knowledge of column names is required at all, with the exception of user):
select column, countif(value != 'null') non_null_count
from your_table t,
unnest(array(
select as struct trim(arr[offset(0)], '"') column, trim(arr[offset(1)], '"') value
from unnest(split(trim(to_json_string(t), '{}'))) kv,
unnest([struct(split(kv, ':') as arr)])
where trim(arr[offset(0)], '"') != 'user'
)) rec
group by column
Applied to the sample data in your question, this produces the desired output shown above.
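The trick above is serializing each row to JSON so that the column names themselves become data. The same idea can be illustrated outside BigQuery in plain Python (a toy sketch of the principle, not of BigQuery's execution model):

```python
import json
from collections import Counter

# Sample rows matching the question's table.
rows = [{"user": "abc", "a": 5, "b": None, "c": None},
        {"user": "xyz", "a": None, "b": 6, "c": None}]

counts = Counter()
for row in rows:
    # Serialize the row and walk its key/value pairs: no column names
    # are hard-coded anywhere except the excluded "user" key.
    for key, value in json.loads(json.dumps(row)).items():
        if key != "user":
            counts[key] += value is not None

print(dict(counts))  # {'a': 1, 'b': 1, 'c': 0}
```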
I didn't do this in BigQuery but in SQL Server; however, BigQuery supports UNPIVOT as well. Basically you're transposing your columns into rows and then doing a simple aggregate to see how many records have data in each column. My example is below and should work in BigQuery without much or any tweaking.
Here is the table I created:
CREATE TABLE example(
user_name char(3),
a integer,
b integer,
c integer
);
INSERT INTO example(user_name, a, b, c)
VALUES('abc', 5, null, null);
INSERT INTO example(user_name, a, b, c)
VALUES('xyz', null, 6, null);
INSERT INTO example(user_name, a, b, c)
VALUES('tst', 3, 6, 1);
And here is the UNPIVOT I did:
select count(*) as amount, col
from
(select user_name, a, b, c from example) e
unpivot
(blah for col in (a, b, c)
) as unpvt
group by col
Note that I added an extra record to the table to make sure it was working properly. Again, the syntax may be slightly different in BigQuery, but I think this should get you most of the way there.
Here's a link to my db-fiddle - https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=deaa0e92a4ef1de7d4801e458652816b
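For engines without UNPIVOT, the same transpose can be emulated with UNION ALL: one SELECT per column, then COUNT (which skips NULLs) per column name. A sketch with Python's sqlite3 against the question's sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE example(user_name TEXT, a INT, b INT, c INT)")
con.executemany("INSERT INTO example VALUES (?,?,?,?)",
                [("abc", 5, None, None), ("xyz", None, 6, None)])

# Manual unpivot: stack each column as (name, value) rows, then count
# the non-null values per column name.
rows = con.execute("""
    SELECT col, COUNT(val) AS cnt
    FROM (SELECT 'a' AS col, a AS val FROM example
          UNION ALL SELECT 'b', b FROM example
          UNION ALL SELECT 'c', c FROM example)
    GROUP BY col
    ORDER BY col
""").fetchall()
print(rows)  # [('a', 1), ('b', 1), ('c', 0)]
```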

Performance difference between Select count(ID) and Select count(*)

There are two queries below: the first returns the count of the ID column excluding NULL values, and the second returns the count of all rows in the table, including rows where ID is NULL.
select COUNT(ID) from TableName
select COUNT(*) from TableName
My confusion: is there any performance difference?
TL/DR: The plans might not be the same; you should test on appropriate data, make sure you have the correct indexes, and then choose the best solution based on your investigations.
The query plans might not be the same depending on the indexing and the nullability of the column which is used in the COUNT function.
In the following example I create a table and fill it with one million rows.
All the columns have been indexed except column 'b'.
The conclusion is that some of these queries do result in the same execution plan but most of them are different.
This was tested on SQL Server 2014, I do not have access to an instance of 2012 at this moment. You should test this yourself to figure out the best solution.
create table t1(id bigint identity,
dt datetime2(7) not null default(sysdatetime()),
a char(800) null,
b char(800) null,
c char(800) null);
-- We will use these 4 indexes. Only column 'b' does not have any supporting index on it.
alter table t1 add constraint [pk_t1] primary key NONCLUSTERED (id);
create clustered index cix_dt on t1(dt);
create nonclustered index ix_a on t1(a);
create nonclustered index ix_c on t1(c);
insert into T1 (a, b, c)
select top 1000000
a = case when low = 1 then null else left(REPLICATE(newid(), low), 800) end,
b = case when low between 1 and 10 then null else left(REPLICATE(newid(), 800-low), 800) end,
c = case when low between 1 and 192 then null else left(REPLICATE(newid(), 800-low), 800) end
from master..spt_values
cross join (select 1 from master..spt_values) m(ock)
where type = 'p';
checkpoint;
-- All rows, no matter if any columns are null or not
-- Uses primary key index
select count(*) from t1;
-- All not null,
-- Uses primary key index
select count(id) from t1;
-- Some values of 'a' are null
-- Uses the index on 'a'
select count(a) from t1;
-- Some values of b are null
-- Uses the clustered index
select count(b) from t1;
-- No values of dt are null and the table has a clustered index on 'dt'
-- Uses the primary key index, not the clustered index as one might expect.
select count(dt) from t1;
-- Most values of c are null
-- Uses the index on c
select count(c) from t1;
Now what would happen if we were more explicit about what we want our count to do? If we tell the query planner we want only rows where the column is not null, will that change anything?
-- Homework!
-- What happens if we explicitly count only rows where the column is not null? What if we add a filtered index to support this query?
-- Hint: It will once again be different than the other queries.
create index ix_c2 on t1(c) where c is not null;
select count(*) from t1 where c is not null;
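The plan differences above are SQL Server specific, but the counting semantics themselves are standard SQL and easy to verify; here a small check with Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TableName(ID INT)")
con.executemany("INSERT INTO TableName VALUES (?)", [(1,), (2,), (None,)])

# COUNT(*) counts all rows; COUNT(ID) skips rows where ID is NULL.
total = con.execute("SELECT COUNT(*) FROM TableName").fetchone()[0]
non_null = con.execute("SELECT COUNT(ID) FROM TableName").fetchone()[0]
print(total, non_null)  # 3 2
```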

Optimize query with many OR statements in WHERE clause

What is the best way to write a query that gives a result equivalent to this:
SELECT X,Y,* FROM TABLE
WHERE (X = 1 AND Y = 2) OR (X = 2235 AND Y = 324) OR...
Table has clustered index (X, Y).
Table is huge (millions of rows) and there can be hundreds of OR conditions.
You can create another table with columns X and Y, insert the values into that table, and then join with the original table:
create table XY_Values(X int, Y int)
Insert into XY_Values values
(1,2),
(2235,324),
...
Then
SELECT X,Y,* FROM TABLE T
join XY_Values V
on T.X=V.X
and T.Y=V.Y
You could create an index on (X, Y) on XY_Values, which will boost performance.
You could also create XY_Values as a table variable.
I think you can fill a temp table with the hundreds of X and Y values and join against it. Like:
DECLARE @Temp TABLE
(
X int,
Y int
)
Prefill this with your search requirements, then join.
(Or use another physical table that stores the search settings.)
This will do better:
select t.*
from table t
join (select 1 as x,2 as y
union
...) t1 on t.x=t1.x and t.y=t1.y
If you use too many OR conditions, the execution plan won't use indexes.
It is better to create multiple statements and merge the results using UNION ALL:
SELECT X,Y,*
FROM TABLE
WHERE (X = 1 AND Y = 2)
union all
SELECT X,Y,*
FROM TABLE
WHERE (X = 2235 AND Y = 324)
union all...
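The lookup-table approach can be sketched with Python's sqlite3 (table and column names are illustrative): load the (X, Y) pairs into a small side table and replace the OR chain with a single join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE big(X INT, Y INT, payload TEXT)")
con.executemany("INSERT INTO big VALUES (?,?,?)",
                [(1, 2, "hit1"), (2235, 324, "hit2"), (9, 9, "miss")])

# The hundreds of (X, Y) pairs go into a small lookup table, indexed
# so the join can seek rather than scan...
con.execute("CREATE TABLE XY_Values(X INT, Y INT)")
con.executemany("INSERT INTO XY_Values VALUES (?,?)", [(1, 2), (2235, 324)])
con.execute("CREATE INDEX ix_xy ON XY_Values(X, Y)")

# ...and a single join replaces the OR chain.
rows = con.execute("""
    SELECT T.X, T.Y, T.payload
    FROM big T JOIN XY_Values V ON T.X = V.X AND T.Y = V.Y
    ORDER BY T.X
""").fetchall()
print(rows)  # [(1, 2, 'hit1'), (2235, 324, 'hit2')]
```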

sql - select row id based on two column values in same row as id

Using a SELECT, I want to find the row ID from a table of 3 columns (each value is unique/dissimilar and is populated by separate tables). Only the ID is auto-incremented.
I have a middle table I reference that has 3 values: ID, A, B.
A is based on data from another table.
B is based on data from another table.
How can I select the row ID when I only know the value of A and B, and A and B are not the same value?
Do you mean that columns A and B are foreign keys?
Does this work?
SELECT [ID]
FROM tbl
WHERE A = #a AND B = #b
SELECT ID FROM table WHERE A=value1 and B=value2
It's not very clear. Do you mean this:
SELECT ID
FROM middletable
WHERE A = knownA
AND B = knownB
Or this?
SELECT ID
FROM middletable
WHERE A = knownA
AND B <> A
Or perhaps "I know A" means you have a list of values for A, which come from another table?
SELECT ID
FROM middletable
WHERE A IN
( SELECT otherA FROM otherTable ...)
AND B IN
( SELECT otherB FROM anotherTable ...)

insert 2 column primary key

I want to insert a 2-column primary key (X and Y) into table A. X is inserted from table B and Y is a fixed value.
INSERT INTO A
(X,Y)
SELECT W
FROM table B, '1'
You were close:
INSERT INTO A
(X,Y)
SELECT W, '1'
FROM B
Define the static value -- in this example, the text "1" -- as a column in the SELECT clause.
Only one FROM clause is allowed per SELECT clause; additional SELECTs need to be within parentheses, as subqueries.
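The corrected statement can be verified with Python's sqlite3 (table and column names follow the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE B(W INT)")
con.execute("CREATE TABLE A(X INT, Y TEXT, PRIMARY KEY (X, Y))")
con.executemany("INSERT INTO B VALUES (?)", [(10,), (20,)])

# The fixed value '1' is simply another expression in the SELECT list.
con.execute("INSERT INTO A(X, Y) SELECT W, '1' FROM B")

print(con.execute("SELECT X, Y FROM A ORDER BY X").fetchall())
# [(10, '1'), (20, '1')]
```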