Easiest way to select distinct with least number of null - sql

I want to create a view over a table that has 500k rows and 10 columns. The table contains duplicate ids whose rows carry different amounts of information, because some of the columns are NULL. My objective is to keep one row per id in case of duplicates, specifically the one with the fewest NULL values.
Let me explain with a quick example. I am working with a query similar to this.
CREATE TABLE test (ID INT, b char(1), c char (1), d char(1))
INSERT INTO test(ID,b,c,d) VALUES
(1,NULL,NULL,NULL),
(1,'B', NULL,NULL),
(1,'B','C',NULL),
(1,'B','C','D'),
(2,'E','F',NULL),
(2,'E',NULL,NULL),
(3,NULL,NULL,NULL),
(3,'G',NULL,NULL)
SELECT DISTINCT ID,b,c,d FROM test
DROP TABLE test
The result is
ID b c d
--------------------
1 NULL NULL NULL
1 B NULL NULL
1 B C NULL
1 B C D
2 E F NULL
2 E NULL NULL
3 NULL NULL NULL
3 G NULL NULL
However, the output I want to see is
ID b c d
--------------------
1 B C D
2 E F NULL
3 G NULL NULL
So, for each id that has duplicates, I want the row with the fewest NULLs. How can I do this?
Thank you very much

If you want the row with the fewest NULLs, then you would basically count them and sort in ascending order:
select t.*
from test t
order by ( (case when b is null then 1 else 0 end) +
           (case when c is null then 1 else 0 end) +
           (case when d is null then 1 else 0 end)
         ) asc
fetch first 1 row only;
However, if you want one row per id with a non-NULL value in each column (if available) then @maSTAShuFu's answer is appropriate.
EDIT:
If you want one row per id, then simply use row_number():
select t.*
from (select t.*,
             row_number() over (partition by id
                                order by ( (case when b is null then 1 else 0 end) +
                                           (case when c is null then 1 else 0 end) +
                                           (case when d is null then 1 else 0 end)
                                         ) asc
                               ) as seqnum
      from test t
     ) t
where seqnum = 1;
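The ranking idea can be checked against the sample data from the question. This is only a sketch under stated assumptions: it runs the SQL through Python's sqlite3 module with an in-memory database (not the asker's engine), partitions by ID, and sorts the NULL count in ascending order so that the most complete row gets seqnum = 1.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (ID INT, b CHAR(1), c CHAR(1), d CHAR(1));
INSERT INTO test (ID, b, c, d) VALUES
  (1, NULL, NULL, NULL),
  (1, 'B',  NULL, NULL),
  (1, 'B',  'C',  NULL),
  (1, 'B',  'C',  'D'),
  (2, 'E',  'F',  NULL),
  (2, 'E',  NULL, NULL),
  (3, NULL, NULL, NULL),
  (3, 'G',  NULL, NULL);
""")

# Rank the rows of each ID by their NULL count (fewest first), keep rank 1.
best_rows = conn.execute("""
SELECT ID, b, c, d
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (
           PARTITION BY ID
           ORDER BY (CASE WHEN b IS NULL THEN 1 ELSE 0 END) +
                    (CASE WHEN c IS NULL THEN 1 ELSE 0 END) +
                    (CASE WHEN d IS NULL THEN 1 ELSE 0 END)
         ) AS seqnum
  FROM test t
) ranked
WHERE seqnum = 1
ORDER BY ID
""").fetchall()

for row in best_rows:
    print(row)
```

If two rows of the same ID tie on NULL count, which of them is returned is arbitrary unless a further ORDER BY key is added.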

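Using MAX with GROUP BY. Note that MAX is taken for each column independently, so the result can combine values from different rows of the same ID.
SELECT
    ID,
    MAX(b) AS b,
    MAX(c) AS c,
    MAX(d) AS d
FROM test
GROUP BY ID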
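To see how the MAX approach behaves, here is a small sketch using Python's sqlite3 module. The data is hypothetical and deliberately differs from the question's sample: ID 1 has two rows that each hold a different non-NULL column. It assumes the aggregation is grouped by ID.

```python
import sqlite3

# Assumption: in-memory SQLite database with made-up rows for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (ID INT, b CHAR(1), c CHAR(1), d CHAR(1));
INSERT INTO test (ID, b, c, d) VALUES
  (1, 'B',  NULL, NULL),
  (1, NULL, 'C',  NULL),
  (2, 'E',  'F',  NULL),
  (2, 'E',  NULL, NULL);
""")

# MAX ignores NULLs and works per column, so each output column is filled
# from whichever row of the group has a value.
merged = conn.execute("""
SELECT ID, MAX(b) AS b, MAX(c) AS c, MAX(d) AS d
FROM test
GROUP BY ID
ORDER BY ID
""").fetchall()

for row in merged:
    print(row)
```

For ID 1 the output row (1, 'B', 'C', None) does not exist in the table; it is stitched together from two rows. Whether that is acceptable depends on whether you need an actual stored row or just the most complete information per ID.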

Related

How to get specific records in Postgres

In Postgres I have two tables:
Table A { int keyA, Text name}
Table B { int keyB, int keyA, char mark, date start, date end}
Mark from Table B could be 'X', 'Y', 'Z'.
I want to get every 'X' record with its dates, but only one record from 'Y' and 'Z'. Also, if there are 'X', 'Y' and 'Z' records for the same keyA, I want only the 'X' records.
From:
keyB  keyA  mark  start       end
----------------------------------------
1     1     X     15-01-2023  16-01-2023
2     1     X     17-01-2023  18-01-2023
3     1     Y     null        null
4     1     Z     null        null
5     2     Y     null        null
6     2     Z     null        null
7     2     Y     null        null
8     3     Z     null        null
9     3     Y     null        null
10    4     X     19-01-2023  20-01-2023
I want to get
keyB  keyA  mark  start       end
----------------------------------------
1     1     X     15-01-2023  16-01-2023
2     1     X     17-01-2023  17-01-2023
5     2     Y     null        null
8     3     Z     null        null
10    4     X     19-01-2023  20-01-2023
I tried:
Select A.name,
(select b2.start from B b2 where b2.keyA = A.keyA and b2.mark = 'X') as Start,
(select b2.end from B b2 where b2.keyA = A.keyA and b2.mark = 'X') as End
from A order by name;
Order is important. I need to have name first.
There is a problem. The subqueries return more than one record, so I would have to add limit 1. But I want to get every X, not only one.
If I do this
Select A.name, B.start, B.end
from A inner join B on A.keyA = B.keyA
I'll have X, Y, Z and as I mentioned I want only X or one from Y or Z.
Any idea how I should solve this?
Use the row_number function with your join query as follows (the query assumes tables named tableA and tableB, with the date columns named start_dt and end_dt):
select name, keyB, keyA, mark, start_dt, end_dt
from
(
select A.name, B.*,
row_number() over (partition by B.keyA order by case when B.mark='X' then 1 else 2 end, B.keyb) rn
from tableB B join tableA A
on B.keyA = A.keyA
) T
where mark = 'X' or rn = 1
order by keyb
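The query can be tried on the question's data with a short sketch in Python's sqlite3 module. Assumptions: SQLite replaces Postgres, the table and column names (tableA, tableB, start_dt, end_dt) follow the answer's query, and the names n1 to n4 in tableA are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite database; dates kept as plain text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tableA (keyA INT, name TEXT);
CREATE TABLE tableB (keyB INT, keyA INT, mark TEXT, start_dt TEXT, end_dt TEXT);
INSERT INTO tableA VALUES (1, 'n1'), (2, 'n2'), (3, 'n3'), (4, 'n4');
INSERT INTO tableB VALUES
  (1, 1, 'X', '2023-01-15', '2023-01-16'),
  (2, 1, 'X', '2023-01-17', '2023-01-18'),
  (3, 1, 'Y', NULL, NULL),
  (4, 1, 'Z', NULL, NULL),
  (5, 2, 'Y', NULL, NULL),
  (6, 2, 'Z', NULL, NULL),
  (7, 2, 'Y', NULL, NULL),
  (8, 3, 'Z', NULL, NULL),
  (9, 3, 'Y', NULL, NULL),
  (10, 4, 'X', '2023-01-19', '2023-01-20');
""")

# Within each keyA, 'X' rows sort first; rn = 1 is therefore an 'X' row when
# one exists, otherwise the Y/Z row with the lowest keyB.
picked = conn.execute("""
SELECT keyB, mark
FROM (
  SELECT A.name, B.*,
         ROW_NUMBER() OVER (
           PARTITION BY B.keyA
           ORDER BY CASE WHEN B.mark = 'X' THEN 1 ELSE 2 END, B.keyB
         ) AS rn
  FROM tableB B
  JOIN tableA A ON B.keyA = A.keyA
) T
WHERE mark = 'X' OR rn = 1
ORDER BY keyB
""").fetchall()

print(picked)
```

The filter keeps every 'X' row, plus the first row of any keyA group that has no 'X' at all, which matches the keyB values 1, 2, 5, 8 and 10 in the desired output.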

Select the greatest occurrence from a column, based on date if frequencies are the same

I have the following dataset with let's say ID = {1,[...],5} and Col1 = {a,b,c,Null} :
ID  Col1  Date
----------------------
1   a     01/10/2022
1   a     02/10/2022
1   a     03/10/2022
2   b     01/10/2022
2   c     02/10/2022
2   c     03/10/2022
3   a     01/10/2022
3   b     02/10/2022
3   Null  03/10/2022
4   c     01/10/2022
5   b     01/10/2022
5   Null  02/10/2022
5   Null  03/10/2022
I would like to group my rows by ID, compute new columns that show the number of occurrences of each value, and compute a new column that shows a string depending on the most frequent value of Col1: most a = Hi, most b = Hello, most c = Welcome, most Null = Unknown. If multiple values other than Null have the same frequency, the most recent one based on date wins.
Here is the dataset I need :
ID  nb_a  nb_b  nb_c  nb_Null  greatest
---------------------------------------
1   3     0     0     0        Hi
2   0     1     2     0        Welcome
3   1     1     0     1        Hello
4   0     0     1     0        Welcome
5   0     1     0     2        Unknown
I have to do this in a compute recipe in Dataiku. The group by is handled by the group by section of the recipe, while the rest of the query needs to be done in the "custom aggregations" section of the recipe. I'm having trouble with the "if frequencies are equal, then most recent" part of the code.
My SQL code looks like this :
CASE WHEN SUM(CASE WHEN Col1 = 'a' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'b' THEN 1 ELSE 0 END)
      AND SUM(CASE WHEN Col1 = 'a' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'c' THEN 1 ELSE 0 END)
     THEN 'Hi'
     WHEN SUM(CASE WHEN Col1 = 'b' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'a' THEN 1 ELSE 0 END)
      AND SUM(CASE WHEN Col1 = 'b' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'c' THEN 1 ELSE 0 END)
     THEN 'Hello'
     WHEN SUM(CASE WHEN Col1 = 'c' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'a' THEN 1 ELSE 0 END)
      AND SUM(CASE WHEN Col1 = 'c' THEN 1 ELSE 0 END) >
          SUM(CASE WHEN Col1 = 'b' THEN 1 ELSE 0 END)
     THEN 'Welcome'
     -- etc., repeat for the other cases
END
But surely there must be a better way to do this right? And I have no idea how to include the most recent one when frequencies are the same.
Thank you for your help and sorry if my message isn't clear.
I reproduced this in Azure Synapse using a SQL script. Below is the approach.
The sample table is created as follows. Note that missing values are stored as the string 'Null', not as SQL NULL.
Create table tab1 (id int, col1 varchar(50), date_column date)
Insert into tab1 values(1,'a','2021-10-01')
Insert into tab1 values(1,'a','2021-10-02')
Insert into tab1 values(1,'a','2021-10-03')
Insert into tab1 values(2,'b','2021-10-01')
Insert into tab1 values(2,'c','2021-10-02')
Insert into tab1 values(2,'c','2021-10-03')
Insert into tab1 values(3,'a','2021-10-01')
Insert into tab1 values(3,'b','2021-10-02')
Insert into tab1 values(3,'Null','2021-10-03')
Insert into tab1 values(4,'c','2021-10-01')
Insert into tab1 values(5,'b','2021-10-01')
Insert into tab1 values(5,'Null','2021-10-02')
Insert into tab1 values(5,'Null','2021-10-03')
Step:1
This query finds the count of rows for each combination of id and col1, and the maximum date within each combination of id and col1.
select
distinct id,col1,
count(*) over (partition by id,col1) as count,
case when col1='Null' then null else max(date_column) over (partition by id,col1) end as max_date
from tab1
Step:2
A row number is calculated within each id, ordered by decreasing count and then by decreasing max_date. This way, when two or more values have the same frequency, the value with the latest date comes first.
select *, row_number() over (partition by id order by count desc, max_date desc) as row_num from
(select
distinct id,col1,
count(*) over (partition by id,col1) as count,
case when col1='Null' then null else max(date_column) over (partition by id,col1) end as max_date
from tab1)q1
Step:3
Rows with row_num=1 are kept, and the value of the greatest column is assigned with the logic
most a = Hi, most b = Hello, most c = Welcome, most Null = Unknown.
Full Query
select id,
[greatest]=case when col1='a' then 'Hi'
when col1='b' then 'Hello'
when col1='c' then 'Welcome'
else 'Unknown'
end
from
(select *, row_number() over (partition by id order by count desc, max_date desc) as row_num from
(select
distinct id,col1,
count(*) over (partition by id,col1) as count,
case when col1='Null' then null else max(date_column) over (partition by id,col1) end as max_date
from tab1)q1
)q2 where row_num=1
With this approach, when frequencies are equal, the value with the most recent date is chosen.
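The same count-then-rank idea can be sketched in a form that runs locally with Python's sqlite3 module. Assumptions that differ from the answer: missing values are stored as real SQL NULLs instead of the string 'Null', a GROUP BY replaces the DISTINCT plus window count, and the T-SQL `[greatest]=` alias is written as a standard `AS greatest`. The sketch also relies on SQLite sorting NULL dates last under DESC.

```python
import sqlite3

# Assumption: in-memory SQLite database with ISO-formatted date strings.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab1 (id INT, col1 TEXT, date_column TEXT);
INSERT INTO tab1 VALUES
  (1, 'a',  '2021-10-01'), (1, 'a',  '2021-10-02'), (1, 'a',  '2021-10-03'),
  (2, 'b',  '2021-10-01'), (2, 'c',  '2021-10-02'), (2, 'c',  '2021-10-03'),
  (3, 'a',  '2021-10-01'), (3, 'b',  '2021-10-02'), (3, NULL, '2021-10-03'),
  (4, 'c',  '2021-10-01'),
  (5, 'b',  '2021-10-01'), (5, NULL, '2021-10-02'), (5, NULL, '2021-10-03');
""")

greatest = conn.execute("""
WITH freq AS (
  -- One row per (id, col1): frequency and latest date; NULL values get no
  -- date so they lose every tie against a real value.
  SELECT id, col1, COUNT(*) AS cnt,
         CASE WHEN col1 IS NULL THEN NULL ELSE MAX(date_column) END AS max_date
  FROM tab1
  GROUP BY id, col1
),
ranked AS (
  SELECT freq.*,
         ROW_NUMBER() OVER (PARTITION BY id
                            ORDER BY cnt DESC, max_date DESC) AS row_num
  FROM freq
)
SELECT id,
       CASE col1 WHEN 'a' THEN 'Hi'
                 WHEN 'b' THEN 'Hello'
                 WHEN 'c' THEN 'Welcome'
                 ELSE 'Unknown' END AS greatest
FROM ranked
WHERE row_num = 1
ORDER BY id
""").fetchall()

print(greatest)
```

For id 3 the values a, b and NULL each occur once; b wins because its date is the latest among the non-NULL values, which gives Hello as in the desired output.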

PostgreSQL - Removing NULLS row and column from conditional aggregation results

I have a query for a multidimensional table using conditional aggregation
select A,
SUM(case when D = 3 then D end) as SUM_D1,
SUM(case when D = 4 then D end) as SUM_D2
The result:
A SUM_D1 SUM_D2
-------------------
a1 100 NULL
a1 200 NULL
a3 NULL NULL
a4 NULL NULL
However, I would like to hide all NULL rows and columns as follows:
A SUM_D1
-----------
a1 100
a1 200
I have looked for similar problems, but none of the answers give the result I expect.
Any help is much appreciated,
Thank you
I think this does what you want:
select A,
coalesce(sum(case when D = 3 then D end),
sum(case when D = 4 then D end)
) as sum_d
from t
group by A
having sum(case when d in (3, 4) then 1 else 0 end) > 0;
Note that this returns only one column -- as in your example. If both "3" and "4" are in the data, then the value is for the "3"s.
If you want a query that returns a variable number of columns, then you need to use dynamic SQL -- or some other method. SQL queries return a fixed number of columns.
One method would be to return the values as an array:
select a,
array_agg(d order by d) as ds,
array_agg(sumd order by d) as sumds
from (select a, d, sum(d) as sumd
from t
where d in (3, 4)
group by a, d
) d
group by a;
To filter all-NULL rows you can wrap the query in a derived table and filter in the outer query:
select *
from
(
select A,
SUM(case when D = 3 then D end) as SUM_D1,
SUM(case when D = 4 then D end) as SUM_D2
...
) as dt
where SUM_D1 is not null
or SUM_D2 is not null
Of course, if you have simple conditions like the ones in your example, you had better filter before aggregation:
select A,
SUM(case when D = 3 then D end) as SUM_D1,
SUM(case when D = 4 then D end) as SUM_D2
...
where D in (3,4)
Now at least one calculation will return a value for every group, so there is no need to check for all-NULL rows.
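A minimal sketch of filtering before aggregation, run through Python's sqlite3 module. The table t and its rows are hypothetical, since the question does not show the underlying data.

```python
import sqlite3

# Assumption: in-memory SQLite database with a made-up table t(A, D).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (A TEXT, D INTEGER);
INSERT INTO t VALUES ('a1', 3), ('a1', 3), ('a2', 4), ('a3', 5), ('a4', NULL);
""")

# Rows with D outside (3, 4) are dropped before grouping, so groups that
# would produce only NULL sums never appear in the result.
filtered = conn.execute("""
SELECT A,
       SUM(CASE WHEN D = 3 THEN D END) AS SUM_D1,
       SUM(CASE WHEN D = 4 THEN D END) AS SUM_D2
FROM t
WHERE D IN (3, 4)
GROUP BY A
ORDER BY A
""").fetchall()

print(filtered)
```

Groups a3 and a4 disappear because none of their rows survive the WHERE clause. An individual column can still be NULL within a surviving row, as SUM_D2 is for a1.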
To filter all-NULL columns you need some dynamic SQL:
materialize the data in a temporary table using Insert/Select
scan each column for all-NULL: select 1 from temp having count(SUM_D1) > 0
dynamically create the Select list based on this
run the Select
But why do you think you need this? It will be confusing for a user to run the same Stored Procedure and receive a different number of columns for each run.
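The steps listed above can be sketched from client code instead of a stored procedure. This is an illustration only, using Python's sqlite3 module; the table t, its rows, and the temporary table name agg are hypothetical, and the column names come from a fixed list so that building the statement with string formatting is safe here.

```python
import sqlite3

# Assumption: in-memory SQLite database with a made-up table t(A, D).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (A TEXT, D INTEGER);
INSERT INTO t VALUES ('a1', 3), ('a1', 3), ('a2', 3), ('a3', 5), ('a4', NULL);

-- Step 1: materialize the aggregated data in a temporary table.
CREATE TEMP TABLE agg AS
SELECT A,
       SUM(CASE WHEN D = 3 THEN D END) AS SUM_D1,
       SUM(CASE WHEN D = 4 THEN D END) AS SUM_D2
FROM t
GROUP BY A;
""")

# Step 2: COUNT(col) counts only non-NULL values, so 0 means "all NULL".
candidates = ["SUM_D1", "SUM_D2"]
kept = [col for col in candidates
        if conn.execute(f"SELECT COUNT({col}) FROM agg").fetchone()[0] > 0]

# Steps 3 and 4: build the select list from the surviving columns and run it,
# also dropping rows where every surviving column is NULL.
if kept:
    select_list = ", ".join(["A"] + kept)
    not_all_null = " OR ".join(f"{col} IS NOT NULL" for col in kept)
    rows = conn.execute(
        f"SELECT {select_list} FROM agg WHERE {not_all_null} ORDER BY A"
    ).fetchall()
else:
    rows = []

print(kept, rows)
```

With this data only SUM_D1 survives, so the final statement returns two columns. A different data set could return three, which is exactly the varying result shape the answer warns about.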
I may have misinterpreted your question because the solution seems so simple:
select A,
SUM(case when D = 3 then D end) as SUM_D1,
SUM(case when D = 4 then D end) as SUM_D2
where D is not null
This is not what you want, is it? :-)
NULLs appear for the rows that do not match any condition in the case expression. If you want to suppress them, you can use HAVING with IS NOT NULL on the aggregate:
select A,
SUM(case when D = 3 then D end) as SUM_D1,
SUM(case when D = 4 then D end) as SUM_D2
from
Table1
group by
A
having
SUM(case when D = 3 or D = 4 then D end) is not null

Merge multiple columns into one column with multiple rows

In PostgreSQL, how can I merge multiple columns into one column with multiple rows?
The columns are all boolean, so I want to:
Filter for true values only
Replace the true value (1) with the name of the column (A, B or C)
I have this table:
ID | A | B | C
1 0 1 0
2 1 1 0
3 0 0 1
4 1 0 1
5 1 0 0
6 0 1 1
I want to get this table:
ID | Letter
1 B
2 A
2 B
3 C
4 A
4 C
5 A
6 B
6 C
I think you need something like this:
SELECT ID, 'A' as Letter FROM table WHERE A=1
UNION ALL
SELECT ID, 'B' as Letter FROM table WHERE B=1
UNION ALL
SELECT ID, 'C' as Letter FROM table WHERE C=1
ORDER BY ID, Letter
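The UNION ALL approach can be verified on the question's table with a short sketch in Python's sqlite3 module. Assumptions: SQLite stands in for PostgreSQL, the booleans are stored as 0/1 integers as in the question, and the table is named flags (a hypothetical name, since table itself is a reserved word).

```python
import sqlite3

# Assumption: in-memory SQLite database; flags is a made-up table name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flags (ID INT, A INT, B INT, C INT);
INSERT INTO flags VALUES
  (1, 0, 1, 0),
  (2, 1, 1, 0),
  (3, 0, 0, 1),
  (4, 1, 0, 1),
  (5, 1, 0, 0),
  (6, 0, 1, 1);
""")

# One SELECT per column: each emits the column's name for the rows where
# that column is true, and UNION ALL stacks the three result sets.
letters = conn.execute("""
SELECT ID, 'A' AS Letter FROM flags WHERE A = 1
UNION ALL
SELECT ID, 'B' AS Letter FROM flags WHERE B = 1
UNION ALL
SELECT ID, 'C' AS Letter FROM flags WHERE C = 1
ORDER BY ID, Letter
""").fetchall()

for row in letters:
    print(row)
```

Each ID appears once per true column, so an ID with two flags set yields two rows.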
SELECT ID,
(CASE
WHEN TABLE.A = 1 then 'A'
WHEN TABLE.B = 1 then 'B'
WHEN TABLE.C = 1 then 'C'
ELSE NULL END) AS LETTER
from TABLE
You may try this, where t1 is the source table and t2 is the target table.
insert into t2 select id, 'A' from t1 where A=1;
insert into t2 select id, 'B' from t1 where B=1;
insert into t2 select id, 'C' from t1 where C=1;
If you care about the order, then you can do this.
insert into t3 select id, letter from t2 order by id, letter;
Without UNION
You can use a single query to get the desired output.
select id,
       regexp_split_to_table(
           concat_ws(',',
                     case when a = 0 then null else 'a' end,
                     case when b = 0 then null else 'b' end,
                     case when c = 0 then null else 'c' end),
           ',') as l
from c1;
This uses regexp_split_to_table() and concat_ws().

sql excluding certain results

Let's say I have a data set of
A B
-- --
a 1
b 1
c 1
d 1
d 2
e 1
f 1
f 2
g 1
How would I exclude a row with a value of 1 in column B if column B has values of both 1 and 2 for the same value in column A?
I want my results to look like this
A B
-- --
a 1
b 1
c 1
d 2
e 1
f 2
g 1
This checks explicitly for the values 1 and 2 and relies on the fact that there are exactly two of them. You could potentially make this less cumbersome if it's safe to assume that you always want the highest value.
select
tbl.A,
tbl.B
from
Table1 tbl
left outer join (
select
A
from
Table1
where
B in (1,2)
group by
A
having
count(B) = 2
) mlt on tbl.A = mlt.A
where
(
mlt.A is not null
and tbl.B = 2
) or (
mlt.A is null
and tbl.B = 1
)
Figure out all the A values that have both 1 and 2.
Match those to the table on the A value.
If A is in the subquery, use the B = 2 record. If it isn't, use the B = 1 record.
select
* from tbl where a IN
(
select
a from tbl
group by a
having count(*)>1
)
and b!=1
UNION ALL
select
* from tbl where a IN
(
select
a from tbl
group by a
having count(*)=1
)
For the example data and desired result, the simplest query to achieve the result would be a GROUP BY operation and an aggregate function.
SELECT d.A
, MAX(d.B) AS B
FROM my_data_set d
GROUP BY d.A
ORDER BY d.A
If we are only interested in rows that have a 1 or 2 in column B, we can add a WHERE clause
SELECT d.A
, MAX(d.B) AS B
FROM my_data_set d
WHERE d.B IN (1,2)
GROUP BY d.A
ORDER BY d.A
With the example data, the output is the same.
Both of these statements achieve the specified result. (There is only a single row returned for each distinct value in A.)
Or, for the same example data, we can return the same result set with a more literal implementation of the specification.
To exclude rows with 1 when there is a row with 2 for the same value of A, we can use a NOT EXISTS predicate and a correlated subquery.
SELECT d.A
, d.B
FROM my_data_set d
WHERE ( d.B = 2 )
OR ( d.B = 1 AND
NOT EXISTS ( SELECT 1
FROM my_data_set e
WHERE e.A = d.A
AND e.B = 2
)
)
ORDER BY d.A, d.B
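As a quick check, the GROUP BY/MAX query and the NOT EXISTS query can both be run against the example data. The sketch below uses Python's sqlite3 module with an in-memory database, which is an assumption; the table name my_data_set follows the queries above.

```python
import sqlite3

# Assumption: in-memory SQLite database loaded with the question's rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data_set (A TEXT, B INTEGER)")
conn.executemany(
    "INSERT INTO my_data_set VALUES (?, ?)",
    [("a", 1), ("b", 1), ("c", 1), ("d", 1), ("d", 2),
     ("e", 1), ("f", 1), ("f", 2), ("g", 1)],
)

# Variant 1: keep the highest B per value of A.
by_max = conn.execute("""
SELECT d.A, MAX(d.B) AS B
FROM my_data_set d
GROUP BY d.A
ORDER BY d.A
""").fetchall()

# Variant 2: keep B = 2 rows, and B = 1 rows only when no B = 2 row
# exists for the same A.
by_not_exists = conn.execute("""
SELECT d.A, d.B
FROM my_data_set d
WHERE d.B = 2
   OR (d.B = 1 AND NOT EXISTS (SELECT 1
                               FROM my_data_set e
                               WHERE e.A = d.A
                                 AND e.B = 2))
ORDER BY d.A, d.B
""").fetchall()

print(by_max)
print(by_not_exists)
```

Both variants agree on this data. They would diverge if B could take values other than 1 and 2, or if a value of A had duplicate rows, because MAX always collapses each A to a single row.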