SQL - efficient way to aggregate boolean values (postgresql)

Let's assume a table with 3 columns (originally it's a big table): id, is_deleted, date.
I have to check whether the given ids are deleted or not and create a new column with this value (TRUE or FALSE).
Let's simplify it to the table below (before):
id  is_deleted  date
A   False       03-07-2022
A   True        04-07-2022
B   False       05-07-2022
B   False       06-07-2022
C   True        07-07-2022
(after):
id  is_deleted  date        deleted
A   False       03-07-2022  TRUE
A   True        04-07-2022  TRUE
B   False       05-07-2022  FALSE
B   False       06-07-2022  FALSE
C   True        07-07-2022  TRUE
So the rows with ids A and C should have TRUE in the new column.
A given id can have more than one TRUE value in the is_deleted column. If an id has at least one TRUE value, all rows with that id should be marked as deleted (TRUE in the new column).
I need to do this inside this table, without GROUP BY, because with GROUP BY I would have to create another CTE and join to it, which complicates the query and hurts performance.
I just want to create a single column in this table with the new deleted value.
I've found the bool_or function, but it doesn't work as a window function in Redshift. My code:
bool_or(is_deleted) over(partition by id) as is_del
I can't use the max or sum functions on a boolean.
Casting bool to int worsens the performance.
Is there any other way to do this using booleans while keeping good performance?
Thank you.

It should be possible to emulate such behaviour with MIN/MAX functions and explicit casting:
SELECT MAX(is_deleted::INT) OVER (PARTITION BY id)
FROM ...;
-- if all is_deleted are false, then result is 0, 1 otherwise
If the result should be boolean, then use either
MAX(is_deleted::INT) OVER (PARTITION BY id) = 1
or
(MAX(is_deleted::INT) OVER (PARTITION BY id))::BOOLEAN
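The cast-then-MAX window idea can be tried out with a small script. This is only a sketch: SQLite (through Python's sqlite3 module) stands in for Redshift/PostgreSQL so that it runs anywhere, the table name events is hypothetical, and CAST(... AS INTEGER) replaces the ::INT shorthand.

```python
import sqlite3


def deleted_flags():
    """Return (id, date, deleted) rows using MAX(...) OVER (PARTITION BY id)."""
    # Assumption: SQLite stands in for Redshift/PostgreSQL; "events" is a
    # hypothetical table name. SQLite stores booleans as 0/1 integers.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id TEXT, is_deleted BOOLEAN, date TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("A", 0, "03-07-2022"),
            ("A", 1, "04-07-2022"),
            ("B", 0, "05-07-2022"),
            ("B", 0, "06-07-2022"),
            ("C", 1, "07-07-2022"),
        ],
    )
    rows = conn.execute(
        """
        SELECT id,
               date,
               MAX(CAST(is_deleted AS INTEGER)) OVER (PARTITION BY id) = 1 AS deleted
        FROM events
        ORDER BY id, date
        """
    ).fetchall()
    conn.close()
    # The comparison comes back as 0 or 1, so convert it to a Python bool.
    return [(id_, date, bool(deleted)) for id_, date, deleted in rows]


if __name__ == "__main__":
    for row in deleted_flags():
        print(row)
```

Every row of an id gets the same flag, and no GROUP BY or extra join is involved.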

Here are two different ways you could check (the syntax below is Oracle-style: NVL, dual and the materialize hint):
1. With EXISTS, which works well when the table has many rows per id:
SELECT
    id
  , is_deleted
  , date
  , NVL((SELECT 'TRUE' FROM dual WHERE EXISTS (SELECT 1 FROM yourtabletable yt2 WHERE
        yt2.id = yt1.id
        AND yt2.is_deleted = 'True')
    ), 'FALSE') deleted
FROM
    yourtabletable yt1;
2. With a WITH clause, where you can use hints such as /*+ materialize */:
WITH tmp AS (
    -- DISTINCT keeps one row per id, so the scalar subquery below returns at most one row
    SELECT /*+ materialize */ DISTINCT id, 'TRUE' deleted FROM yourtabletable WHERE is_deleted = 'True'
)
SELECT
    id
  , is_deleted
  , date
  , NVL((SELECT deleted FROM tmp yt2 WHERE
        yt2.id = yt1.id
    ), 'FALSE') deleted
FROM
    yourtabletable yt1;

If I understand the problem, then I would think that for each unique id value you should be looking at the is_deleted value that has the latest (maximum) date value. In this way even though there may be a row where is_deleted is true, if there is another row for the same id value with a later date that has is_deleted as false, then false should be the final status. If this isn't how the new deleted column should be computed, then just ignore this answer, please.
Schema (PostgreSQL v15)
CREATE TABLE Table1
("id" varchar(1), "is_deleted" bool, "date" timestamp)
;
INSERT INTO Table1
("id", "is_deleted", "date")
VALUES
('A', False, '2022-03-07 00:00:00'),
('A', True, '2022-04-07 00:00:00'),
('A', True, '2022-04-09 00:00:00'), /* another True row for A */
('B', False, '2022-05-07 00:00:00'),
('B', False, '2022-06-07 00:00:00'),
('C', True, '2022-07-07 00:00:00')
;
Query #1
with latest_is_deleted as (
    select t.* from
    (select t.id, t.is_deleted as deleted, row_number() over (partition by id order by date desc) as seqnum
     from Table1 t
    ) t
    where seqnum = 1
)
select t.*, l.deleted from
Table1 t join latest_is_deleted l on t.id = l.id;
id  is_deleted  date                      deleted
A   false       2022-03-07T00:00:00.000Z  true
A   true        2022-04-07T00:00:00.000Z  true
A   true        2022-04-09T00:00:00.000Z  true
B   false       2022-05-07T00:00:00.000Z  false
B   false       2022-06-07T00:00:00.000Z  false
C   true        2022-07-07T00:00:00.000Z  true

This is one approach that returns all records with their respective deleted values.
select a.*,
       case when b.id is not null then 'TRUE' else 'FALSE' end as deleted
from table1 a
left join (select distinct id from table1 where is_deleted is true) b on (a.id = b.id)
order by 1, 3;
I have created a sample schema here: https://www.db-fiddle.com/f/4k32Eb1t2DSUQ6FkzKBMXi/0
Feel free to customize it with your data.
CREATE TABLE Table1
("id" varchar(1), "is_deleted" bool, "date" timestamp);
INSERT INTO Table1
("id", "is_deleted", "date")
VALUES
('A', False, '2022-03-07 00:00:00'),
('A', True, '2022-04-07 00:00:00'),
('A', True, '2022-04-09 00:00:00'), /* another True row for A */
('B', False, '2022-05-07 00:00:00'),
('B', False, '2022-06-07 00:00:00'),
('C', True, '2022-07-07 00:00:00')
;
INSERT INTO Table1
("id", "is_deleted", "date")
VALUES
('D', False, '2022-03-07 00:00:00'),
('D', false, '2022-04-06 00:00:00');
INSERT INTO Table1
("id", "is_deleted", "date")
VALUES
('C', False, '2022-03-07 00:00:00');
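The LEFT JOIN against the distinct deleted ids can be checked with the sample rows above. A minimal sketch, assuming SQLite as a stand-in for PostgreSQL (booleans stored as 0/1, dates shortened to plain text); the function name is hypothetical.

```python
import sqlite3


def flag_with_left_join():
    """Flag every row whose id has at least one deleted row, via LEFT JOIN."""
    # Assumption: SQLite stands in for PostgreSQL; booleans are stored as 0/1.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id TEXT, is_deleted INTEGER, date TEXT)")
    conn.executemany(
        "INSERT INTO table1 VALUES (?, ?, ?)",
        [
            ("A", 0, "2022-03-07"),
            ("A", 1, "2022-04-07"),
            ("A", 1, "2022-04-09"),  # another deleted row for A
            ("B", 0, "2022-05-07"),
            ("B", 0, "2022-06-07"),
            ("C", 1, "2022-07-07"),
            ("C", 0, "2022-03-07"),
            ("D", 0, "2022-03-07"),
            ("D", 0, "2022-04-06"),
        ],
    )
    rows = conn.execute(
        """
        SELECT a.id, a.date, b.id IS NOT NULL AS deleted
        FROM table1 a
        LEFT JOIN (SELECT DISTINCT id FROM table1 WHERE is_deleted = 1) b
               ON a.id = b.id
        ORDER BY a.id, a.date
        """
    ).fetchall()
    conn.close()
    return [(id_, date, bool(deleted)) for id_, date, deleted in rows]


if __name__ == "__main__":
    for row in flag_with_left_join():
        print(row)
```

Because the derived table holds each deleted id only once, the two deleted rows for A do not multiply the output: nine rows go in and nine come out.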

In your case, I think a UNION ALL of two subqueries could yield better performance than window functions, especially if your table has an index on the id and is_deleted columns.
SELECT
d1.*,
TRUE AS deleted
FROM <your table> d1
WHERE EXISTS (SELECT 1
FROM <your table> d2
WHERE d1.id = d2.id AND is_deleted)
UNION ALL
SELECT
d1.*,
FALSE AS deleted
FROM <your table> d1
WHERE NOT EXISTS (SELECT 1
FROM <your table> d2
WHERE d1.id = d2.id AND is_deleted);

This select statement should give the needed output:
select
    yt1.id,
    yt1.is_deleted,
    yt1.date,
    yt2.id is not null as deleted
from yourtabletable yt1
-- join against distinct ids so that an id with several deleted rows does not duplicate output rows
left join (select distinct id from yourtabletable where is_deleted) yt2 on yt2.id = yt1.id

Related

Group by absorb NULL unless it's the only value

I'm trying to group by a primary column and a secondary column. I want to ignore NULL in the secondary column unless it's the only value.
CREATE TABLE #tempx1 ( Id INT, [Foo] VARCHAR(10), OtherKeyId INT );
INSERT INTO #tempx1 ([Id],[Foo],[OtherKeyId]) VALUES
(1, 'A', NULL),
(2, 'B', NULL),
(3, 'B', 1),
(4, 'C', NULL),
(5, 'C', 1),
(6, 'C', 2);
I'm trying to get output like
Foo OtherKeyId
A NULL
B 1
C 1
C 2
This question is similar, but takes the MAX of the column I want, so it ignores other non-NULL values and won't work.
I tried to work out something based on this question, but I don't quite understand what that query does and can't get my output to work
-- Doesn't include Foo='A', creates duplicates for 'B' and 'C'
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [Foo] ORDER BY [OtherKeyId]) rn1
FROM #tempx1
)
SELECT c1.[Foo], c1.[OtherKeyId], c1.rn1
FROM cte c1
INNER JOIN cte c2 ON c2.[OtherKeyId] = c1.[OtherKeyId] AND c2.rn1 = c1.rn1
This is for a modern SQL Server: Microsoft SQL Server 2019

You can use a GROUP BY expression with a HAVING clause, like the one below:
SELECT [Foo],[OtherKeyId]
FROM #tempx1 t
GROUP BY [Foo],[OtherKeyId]
HAVING SUM(CASE WHEN [OtherKeyId] IS NULL THEN 0 END) IS NULL
OR ( SELECT COUNT(*) FROM #tempx1 WHERE [Foo] = t.[Foo] ) = 1

Hmmm . . . I think you want filtering:
select t.*
from #tempx1 t
where t.otherkeyid is not null or
not exists (select 1
from #tempx1 t2
where t2.foo = t.foo and t2.otherkeyid is not null
);
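The filtering idea (keep a NULL row only when its Foo has no non-NULL row) can be run against the question's sample data. This is a sketch only: SQLite stands in for SQL Server, so the temp table #tempx1 becomes a regular table named tempx1, and the function name is hypothetical.

```python
import sqlite3


def absorb_nulls():
    """Return (Foo, OtherKeyId) pairs, dropping NULLs unless they are the only value."""
    # Assumption: SQLite stands in for SQL Server; "#tempx1" becomes "tempx1".
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tempx1 (Id INTEGER, Foo TEXT, OtherKeyId INTEGER)")
    conn.executemany(
        "INSERT INTO tempx1 VALUES (?, ?, ?)",
        [
            (1, "A", None),
            (2, "B", None),
            (3, "B", 1),
            (4, "C", None),
            (5, "C", 1),
            (6, "C", 2),
        ],
    )
    rows = conn.execute(
        """
        SELECT t.Foo, t.OtherKeyId
        FROM tempx1 t
        WHERE t.OtherKeyId IS NOT NULL
           OR NOT EXISTS (SELECT 1
                          FROM tempx1 t2
                          WHERE t2.Foo = t.Foo AND t2.OtherKeyId IS NOT NULL)
        ORDER BY t.Foo, t.OtherKeyId
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in absorb_nulls():
        print(row)
```

On this data the rows that survive are A with NULL, B with 1, and C with 1 and 2, which matches the output requested in the question.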

My actual problem is a bit more complicated than presented here. I ended up using the idea from Barbaros Özhan's solution of counting the number of items. This results in two inner queries on the data set with two different GROUP BYs. I'm able to get the results I need on my real dataset using a query like the following:
SELECT
a.[Foo],
b.[OtherKeyId]
FROM (
SELECT
[Foo],
COUNT([OtherKeyId]) [C]
FROM #tempx1 t
GROUP BY [Foo]
) a
JOIN (
SELECT
[Foo],
[OtherKeyId]
FROM #tempx1 t
GROUP BY [Foo], [OtherKeyId]
) b ON b.[Foo] = a.[Foo]
WHERE
(b.[OtherKeyId] IS NULL AND a.[C] = 0)
OR (b.[OtherKeyId] IS NOT NULL AND a.[C] > 0)

Updating multiple rows with a conditional where clause in Postgres?

I'm trying to update multiple rows in a single query, as I have many rows to update at once. My query has a where clause that applies only to certain rows.
For example, I have the following query:
update mytable as m set
column_a = c.column_a,
column_b = c.column_b
from (values
(1, 12, 6, TRUE),
(2, 1, 45, FALSE),
(3, 56, 3, TRUE)
) as c(id, column_a, column_b, additional_condition)
where c.id = m.id
and CASE c.additional_condition when TRUE m.status != ALL(array['active', 'inactive']) end;
The last condition in the where clause (m.status != ALL(array['active', 'inactive'])) should only be applied to rows that have TRUE in c.additional_condition. Otherwise, the condition should not be applied.
Is it possible to achieve this in Postgres?

I think that this is what you want:
and CASE
when c.additional_condition THEN m.status != ALL(array['active', 'inactive'])
else TRUE
end
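To see which rows such a predicate lets through, the CASE ... ELSE TRUE condition can be run as a SELECT. This is a sketch under assumptions: SQLite stands in for Postgres, so NOT IN replaces != ALL(array[...]) and 1/0 replace TRUE/FALSE, and the status values stored in mytable are invented for illustration.

```python
import sqlite3


def ids_passing_condition():
    """Return the ids that the conditional WHERE clause would allow to be updated."""
    # Assumption: SQLite stands in for Postgres; the statuses in mytable are
    # hypothetical sample data, not taken from the question.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO mytable VALUES (?, ?)",
        [(1, "active"), (2, "active"), (3, "archived")],
    )
    rows = conn.execute(
        """
        WITH c(id, column_a, column_b, additional_condition) AS (
            VALUES (1, 12, 6, 1),
                   (2, 1, 45, 0),
                   (3, 56, 3, 1)
        )
        SELECT m.id
        FROM mytable m
        JOIN c ON c.id = m.id
        WHERE CASE
                  WHEN c.additional_condition
                      THEN m.status NOT IN ('active', 'inactive')
                  ELSE 1  -- no extra condition for this row
              END
        ORDER BY m.id
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(ids_passing_condition())
```

Row 1 is filtered out because its extra condition is switched on and its status is 'active'; row 2 passes because the condition is switched off; row 3 passes because its status is neither 'active' nor 'inactive'.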

I think the logic you want is:
where c.id = m.id and
      ( (not c.additional_condition) or m.status <> 'active' )
You can use IN or arrays for multiple values:
where c.id = m.id and
      ( (not c.additional_condition) or m.status not in ('active', 'inactive') )
I don't see a particular benefit to using arrays, unless you are passing the value in as an array.

SELECT Query should return the records of one of the OR conditions in Postgresql

Sample Code :
CREATE TABLE Foo
(
Id SERIAL NOT NULL,
Name TEXT NOT NULL,
DefaultValue BOOLEAN NOT NULL,
Active BOOLEAN NOT NULL
);
INSERT INTO Foo(Name, DefaultValue, Active)
VALUES ('aa', TRUE, FALSE);
INSERT INTO Foo(Name, DefaultValue, Active)
VALUES ('bb', TRUE, FALSE);
I need a query similar to this to get my required result:
SELECT *
FROM Foo
WHERE Active = TRUE OR DefaultValue = TRUE;
And the query should return the records of aa, bb.
After adding these two records:
INSERT INTO Foo(Name, DefaultValue, Active)
VALUES ('cc', FALSE, TRUE);
INSERT INTO Foo(Name, DefaultValue, Active)
VALUES ('dd', FALSE, TRUE);
The query should return the records of cc, dd.
I only need the records matching the first condition if any exist. Otherwise I want the records matching the second condition.
Is there a simpler approach for achieving this result in PostgreSQL? Thanks in advance.

You can use window functions to determine if any of the rows are active:
SELECT *
FROM (SELECT f.*,
             BOOL_OR(Active) OVER () as any_active
      FROM Foo f
     ) f
WHERE (any_active AND Active) OR
      (NOT any_active AND DefaultValue);

If you want only rows that are active if there is at least one row that is active, and otherwise the default rows, then the following can already be enough:
with by_precedence as (
select
*
, case when Active = True then 0
when defaultValue = True then 1
else 2
end as precedence
from foo
)
select *
from by_precedence b
where b.precedence in (select min(precedence) from by_precedence)
order by precedence
Otherwise, can you please edit your question and add more examples where the above doesn't do what you want?
id  name  defaultvalue  active  precedence
3   cc    f             t       0
4   dd    f             t       0

Update column based on IF Else Condition

I have two tables, A and B.
Table A
ID_number (PK),
first_name,
L_Name
Table B
ID_number,
Email_id,
Flag
Several people have multiple email IDs and are already flagged as X in table B.
I am trying to find the people who have one or more email IDs but were never flagged.
E.g. John Clark might have 2 emails in table B, but was never flagged.

Simply use not exists:
select a.*
from a
where not exists (select 1
from b
where b.id_number = a.id_number and b.flag = 'X'
);

You may want to perform an update, but your question seems to be only about selecting (probably to update based on select). It should be something like this:
SELECT A.L_Name
FROM A
WHERE NOT EXISTS (
SELECT 1
FROM B
WHERE B.ID_number = A.ID_number AND B.Flag = 'X'
)
Or the LEFT JOIN version:
SELECT A.L_Name
FROM A
LEFT JOIN B ON B.ID_number = A.ID_number AND B.Flag = 'X'
WHERE B.ID_number IS NULL
Usually, the first version is faster than the second one.
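Both versions can be compared side by side on a tiny data set. A minimal sketch, assuming SQLite as the engine; the people and email addresses are invented sample data, and the function name is hypothetical.

```python
import sqlite3


def never_flagged():
    """Return the never-flagged last names from the NOT EXISTS and LEFT JOIN queries."""
    # Assumption: SQLite as the engine; all rows below are made-up sample data.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE A (ID_number INTEGER PRIMARY KEY, first_name TEXT, L_Name TEXT);
        CREATE TABLE B (ID_number INTEGER, Email_id TEXT, Flag TEXT);
        INSERT INTO A VALUES (1, 'John', 'Clark'), (2, 'Jane', 'Doe');
        INSERT INTO B VALUES (1, 'john@example.com', NULL),
                             (1, 'jclark@example.com', NULL),
                             (2, 'jane@example.com', 'X'),
                             (2, 'jdoe@example.com', NULL);
        """
    )
    not_exists = conn.execute(
        """
        SELECT A.L_Name
        FROM A
        WHERE NOT EXISTS (SELECT 1
                          FROM B
                          WHERE B.ID_number = A.ID_number AND B.Flag = 'X')
        ORDER BY A.L_Name
        """
    ).fetchall()
    left_join = conn.execute(
        """
        SELECT A.L_Name
        FROM A
        LEFT JOIN B ON B.ID_number = A.ID_number AND B.Flag = 'X'
        WHERE B.ID_number IS NULL
        ORDER BY A.L_Name
        """
    ).fetchall()
    conn.close()
    return [r[0] for r in not_exists], [r[0] for r in left_join]


if __name__ == "__main__":
    print(never_flagged())
```

Both queries return only Clark here, since Doe has one flagged email. Note that either query would also return a person who has no rows in B at all.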

Forget Table A...
SELECT DISTINCT ID_number FROM table_b t1
WHERE NOT EXISTS(
SELECT NULL FROM table_b t2 WHERE t1.ID_number=t2.ID_number AND t2.flag='X'
)

Judging by your responses in the comments, I believe this is what you are looking for:
--drop table update_test;
create table update_test
(
id_num number,
email_id number,
flag varchar2(1) default null
);
insert into update_test values (1, 1, null);
insert into update_test values (1, 2, null);
insert into update_test values (2, 3, null);
insert into update_test values (2, 7, null);
insert into update_test values (3, 2, null);
insert into update_test values (3, 3, 'X');
insert into update_test values (3, 7, null);
select * from update_test;
select id_num, min(email_id)
from update_test
group by id_num;
update update_test ut1
set flag = case
when email_id = (
select min(email_id)
from update_test ut2
where ut2.id_num = ut1.id_num
) then 'X'
else null end
where id_num not in (
select id_num
from update_test
where Flag is not null);
The last update statement sets the Flag field on the record with the lowest email_id in each id_num group. If any record in an id_num group already has Flag set, that group is skipped.

TSQL Order By - List of hard-coded values

I have a query that returns among others a Record Status column. The record status column has several values like: "Active", "Deleted", etc ...
I need to order the results by "Active", then "Deleted", then etc ...
I am currently creating CTEs to bring each set of records then UNION ALL. Is there a better and dynamic way of getting the query done?
Thank you,

You can use CASE here:
ORDER BY CASE WHEN Status = 'Active' THEN 0 ELSE 1 END ASC
But if you have more status values and want to sort Active first, then Deleted:
ORDER BY CASE WHEN Status = 'Active' THEN 0
WHEN Status = 'Deleted' THEN 1
ELSE 2
END ASC
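The CASE-based ordering can be verified on a few rows. A minimal sketch, assuming SQLite in place of SQL Server; the table records and its contents are hypothetical.

```python
import sqlite3


def ordered_statuses():
    """Return rows sorted Active first, then Deleted, then everything else."""
    # Assumption: SQLite stands in for SQL Server; "records" and its rows are
    # hypothetical sample data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (name TEXT, Status TEXT)")
    conn.executemany(
        "INSERT INTO records VALUES (?, ?)",
        [
            ("r1", "Deleted"),
            ("r2", "Draft"),
            ("r3", "Active"),
            ("r4", "Deleted"),
            ("r5", "Active"),
        ],
    )
    rows = conn.execute(
        """
        SELECT name, Status
        FROM records
        ORDER BY CASE WHEN Status = 'Active' THEN 0
                      WHEN Status = 'Deleted' THEN 1
                      ELSE 2
                 END ASC,
                 name ASC  -- tie-breaker so the order within a status is stable
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in ordered_statuses():
        print(row)
```

The second sort key is an addition for determinism; without it, rows that share a status may come back in any order.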

For more status values, you can do this:
WITH StatusOrders
AS
(
SELECT StatusOrderID, StatusName
FROM (VALUES(1, 'Active'),
(2, 'Deleted'),
...
(n, 'last status')) AS Statuses(StatusOrderID, StatusName)
)
SELECT *
FROM YourTable t
INNER JOIN StatusOrders s ON t.StatusName = s.StatusName
ORDER BY s.StatusOrderID;

WITH
cteRiskStatus
AS
(
SELECT RiskStatusID, RiskStatusName
FROM (VALUES(1, 'Active'),
(2, 'Draft'),
(3, 'Occured'),
(4, 'Escalated'),
(5, 'Closed'),
(6, 'Expired'),
(7, 'Deleted')) AS RiskStatuses(RiskStatusID, RiskStatusName)
)
SELECT * FROM cteRiskStatus
Thanks
