SQL query to get repeating column value that have other columns in a certain condition

Let's say we have a table with the schema below.
create table result
(
id int,
task_id int,
test_name string,
test_result string
);
And the data populated in this table looks like this.
insert into result
values (1, 1, 'test_a', 'pass'),
(2, 1, 'test_b', 'fail'),
(3, 1, 'test_c', 'pass'),
(4, 1, 'test_d', 'pass'),
(5, 2, 'test_a', 'pass'),
(6, 2, 'test_b', 'pass'),
(7, 2, 'test_c', 'pass'),
(8, 2, 'test_d', 'pass');
Basically, a single task has multiple test result entries. I want to retrieve the task_id for which test_b failed but all the other tests passed. So in this example it should return only task_id 1.
I've tried EXISTS and HAVING, but they don't seem to work in this case. I'm new to SQL. How can I implement this?

I would just use aggregation with a having clause:
select task_id
from result
group by task_id
having sum(case when test_name = 'test_b' and test_result = 'fail' then 1 else 0 end) = 1 and
sum(case when test_result = 'pass' then 1 else 0 end) = count(*) - 1;
The first condition validates that test_b failed. The second counts the number of passes, which should be one less than the number of rows for the task.
If your database supports except (or minus), you can use set-based operations:
select task_id
from result
where test_name = 'test_b' and test_result = 'fail'
except
select task_id
from result
where test_name <> 'test_b' and test_result = 'fail'
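Both queries above can be checked against the question's sample data. The following is a minimal sketch that assumes an in-memory SQLite database (via Python's sqlite3 module) as a stand-in for the asker's unnamed platform, and it swaps the non-standard string column type for text:

```python
import sqlite3

# In-memory SQLite database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table result (id int, task_id int, test_name text, test_result text);
insert into result values
  (1, 1, 'test_a', 'pass'), (2, 1, 'test_b', 'fail'),
  (3, 1, 'test_c', 'pass'), (4, 1, 'test_d', 'pass'),
  (5, 2, 'test_a', 'pass'), (6, 2, 'test_b', 'pass'),
  (7, 2, 'test_c', 'pass'), (8, 2, 'test_d', 'pass');
""")

# Aggregation: test_b failed once, and every other row for the task passed.
having_query = """
select task_id
from result
group by task_id
having sum(case when test_name = 'test_b' and test_result = 'fail' then 1 else 0 end) = 1
   and sum(case when test_result = 'pass' then 1 else 0 end) = count(*) - 1
"""

# Set-based: tasks where test_b failed, minus tasks where any other test failed.
except_query = """
select task_id from result
where test_name = 'test_b' and test_result = 'fail'
except
select task_id from result
where test_name <> 'test_b' and test_result = 'fail'
"""

having_ids = [row[0] for row in conn.execute(having_query)]
except_ids = [row[0] for row in conn.execute(except_query)]
print(having_ids, except_ids)
```

On this data both lists contain only task 1. Other platforms may differ in details such as minus versus except.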

Maybe select the distinct task IDs that have a fail result:
select distinct [task_id], [test_result]
from [result]
where [test_result] = 'fail'
Note that this query will scan the entire table unless there is an index on test_result.

The following query first counts the test results per task, the number of times test_b did not pass, and the number of passes. The outer select then ensures that test_b failed and that all the other tests passed.
select task_id from (
select
task_id,
count(test_result) numberoftakers,
sum(case when test_result<>'pass' AND test_name='test_b' then 1 else 0 end) numberoffailb,
sum(case when test_result='pass' then 1 else 0 end) numberofallpasses
from result
group by task_id) a
where numberoftakers=numberoffailb+numberofallpasses and numberoffailb=1

Assuming that (task_id, test_name) is a unique key of your table, you can indeed use (not) exists, along with a correlated subquery which ensures that no other record with the same task_id failed.
select task_id
from result r
where
test_name = 'test_b'
and test_result = 'fail'
and not exists (
select 1
from result r1
where
r1.task_id = r.task_id
and r1.id != r.id
and r1.test_result = 'fail'
)
The left join anti-join pattern also comes to mind:
select r.task_id
from result r
left join result r1
on r1.task_id = r.task_id
and r1.id != r.id
and r1.test_result = 'fail'
where
r.test_name = 'test_b'
and r.test_result = 'fail'
and r1.id is null
Demo on DB Fiddle - Both queries return:
| task_id |
| :------ |
| 1 |

Related

How do I report the results for a test that matches ID's across different sources?

An ID can exist in multiple SOURCE's. I need to test whether the VALUE for the same ID matches across different SOURCES. If they didn't all match across all sources, it should return FALSE.
CREATE TABLE example_table
(
SOURCE varchar(255),
ID varchar(255),
VALUE varchar(255)
);
INSERT INTO example_table (SOURCE, ID, VALUE)
VALUES ('A', 1, 55), ('A', 2, 36), ('B', 1, 55), ('B', 2, 34);
With the code above, I would like the query to return the following:
ID MATCH
1 TRUE
2 FALSE
This is a bit of a "big data" problem, as there are millions of ID's and around 50 or so sources. The query is being written for Vertica 9.2.

You can use aggregation:
select et.id,
(case when min(value) = max(value) then 'true' else 'false' end) as match
from example_table et
group by et.id;
You can simplify this to:
select et.id,
(min(value) = max(value)) as match
from example_table et
group by et.id;
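The min/max comparison can be tried on the question's four rows. This is a minimal sketch that assumes SQLite (through Python's sqlite3) rather than Vertica. The output column is named is_match here, a hypothetical name chosen because MATCH is a keyword in SQLite:

```python
import sqlite3

# In-memory SQLite stand-in for the Vertica table described in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table example_table (source varchar(255), id varchar(255), value varchar(255));
insert into example_table (source, id, value)
values ('A', '1', '55'), ('A', '2', '36'), ('B', '1', '55'), ('B', '2', '34');
""")

# If the smallest and largest value for an id are equal, all values are equal.
match_rows = conn.execute("""
select et.id,
       case when min(value) = max(value) then 'true' else 'false' end as is_match
from example_table et
group by et.id
order by et.id
""").fetchall()
print(match_rows)
```

The single group by pass avoids joining the table to itself, which matters with millions of ids and around 50 sources.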

Using a self join and GROUP BY, you can write the query as below:
-- MIN over 'true'/'false' yields 'false' as soon as any other source disagrees
SELECT t1.id, t1.value, MIN(CASE WHEN t1.value = t2.value THEN 'true' ELSE 'false' END) AS "MATCH"
FROM example_table t1 INNER JOIN example_table t2
ON t1.id = t2.id
WHERE t1.source != t2.source
GROUP BY t1.id, t1.value;
Output:
id | value | match
1 55 true
2 34 false
2 36 false

Self join SQL query to return parents that have at least two same children

I have this table setup.
create table holdMyBeer
(
Id int,
Name varchar(20)
)
insert into holdMyBeer
values (1, 'park'), (1, 'washington'), (1, 'virginia'),
(2, 'harbor'), (2, 'premier'), (2, 'park'),
(3, 'park'), (3, 'washington'), (3, 'virginia'), (3, 'Ball');
I am looking for the id's (parents) that at least have park, washington and virginia as name(child).
I have the answer on Fiddle. http://sqlfiddle.com/#!6/e7346/1 but there must be a better way to do this.

This concept is called conditional aggregation. I am grouping on Id and then using a HAVING clause to check whether there is at least one entry for each of park, washington and virginia. This should answer your question.
SELECT Id
FROM holdMyBeer
GROUP BY Id
HAVING SUM( CASE WHEN Name = 'park' THEN 1 ELSE 0 END ) >= 1 AND
SUM( CASE WHEN Name = 'washington' THEN 1 ELSE 0 END ) >= 1 AND
SUM( CASE WHEN Name = 'virginia' THEN 1 ELSE 0 END ) >= 1;
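An alternative formulation, not the query from this answer, is to filter to the wanted names first and then require that all three distinct names are present. The following is a minimal sketch assuming SQLite through Python's sqlite3, run on the question's sample rows:

```python
import sqlite3

# Sample data from the question, loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table holdMyBeer (Id int, Name varchar(20));
insert into holdMyBeer values
  (1, 'park'), (1, 'washington'), (1, 'virginia'),
  (2, 'harbor'), (2, 'premier'), (2, 'park'),
  (3, 'park'), (3, 'washington'), (3, 'virginia'), (3, 'Ball');
""")

# Keep only the three names of interest, then demand all three per Id.
# count(distinct ...) keeps the check correct if a name is repeated for an Id.
parent_ids = [row[0] for row in conn.execute("""
select Id
from holdMyBeer
where Name in ('park', 'washington', 'virginia')
group by Id
having count(distinct Name) = 3
order by Id
""")]
print(parent_ids)
```

Ids 1 and 3 qualify. Id 2 only has park, so its distinct count is 1.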

SQL - most efficient way to find if a pair of row does NOT exist

I can't seem to find a similar situation to mine online. I have a table for 'orders' called Order, and a table for details on those orders, called 'order detail'. The definition of a certain type of order is if it has 1 of two pairs of order details (Value-Unit pairs). So, my order detail table might look like this:
order_id | detail
---------|-------
1 | X
1 | Y
1 | Z
2 | X
2 | Z
2 | B
3 | A
3 | Z
3 | B
The two pairs that go together are (X & Y) and (A & B). What is an efficient way of retrieving only those order_ids that DO NOT contain either one of these pairs? e.g. For the above table, I need to receive only the order_id 2.
The only solution I can come up with is essentially to use two queries and perform a self join:
select distinct o.order_id
from orders o
where o.order_id not in (
select distinct order_id
from order_detail od1 where od1.detail=X
join order_detail od2 on od2.order_id = od1.order_id and od2.detail=Y
)
and o.order_id not in (
select distinct order_id
from order_detail od1 where od1.detail=A
join order_detail od2 on od2.order_id = od1.order_id and od2.detail=B
)
The problem is that performance is an issue, my order_detail table is HUGE, and I am quite inexperienced in query languages. Is there a faster way to do this with a lower cardinality? I also have zero control over the schema of the tables, so I can't change anything there.

First and foremost I'd like to emphasise that finding the most efficient query is a combination of a good query and a good index. Far too often I see questions here where people look for magic to happen in only one or the other.
E.g. of a variety of solutions, yours is the slowest (after fixing syntax errors) when there are no indexes, but is quite a bit better with an index on (detail, order_id).
Please also note that you have the actual data and table structures. You'll need to experiment with various combinations of queries and indexes to find what works best; not least because you haven't indicated what platform you're using and results are likely to vary between platforms.
[/rant-off]
Query
Without further ado, Gordon Linoff has provided some good suggestions. There's another option likely to offer similar performance. You said you can't control the schema; but you can use a sub-query to transform the data into a 'friendlier structure'.
Specifically, if you pivot the data so that you have a row per order_id and a column for each detail you want to check, where each cell counts how many rows that order has for that detail, then your query is simply: where (x=0 or y=0) and (a=0 or b=0). The following uses a SQL Server temporary table to demonstrate with sample data. The queries below work regardless of duplicate id, val pairs.
/*Set up sample data*/
create table #t (
id int,
val char(1)
)
insert #t(id, val)
values (1, 'x'), (1, 'y'), (1, 'z'),
(2, 'x'), (2, 'z'), (2, 'b'),
(3, 'a'), (3, 'z'), (3, 'b')
/*Option 1 manual pivoting*/
select t.id
from (
select o.id,
sum(case when o.val = 'a' then 1 else 0 end) as a,
sum(case when o.val = 'b' then 1 else 0 end) as b,
sum(case when o.val = 'x' then 1 else 0 end) as x,
sum(case when o.val = 'y' then 1 else 0 end) as y
from #t o
group by o.id
) t
where (x = 0 or y = 0) and (a = 0 or b = 0)
/*Option 2 using Sql Server PIVOT feature*/
select t.id
from (
select id ,[a],[b],[x],[y]
from (select id, val from #t) src
pivot (count(val) for val in ([a],[b],[x],[y])) pvt
) t
where (x = 0 or y = 0) and (a = 0 or b = 0)
It's interesting to note that the query plans for options 1 and 2 above are slightly different. This suggests the possibility of different performance characteristics over large data sets.
Indexes
Note that the above will likely process the whole table. So there is little to be gained from indexes. However, if the table has "long rows", an index on only the 2 columns you're working with means that less data needs to be read from disk.
The query structure you provided is likely to benefit from an index such as (detail, order_id). This is because the server can more efficiently check the NOT IN sub-query conditions. How beneficial this is will depend on the distribution of data in your table.
As a side note I tested various query options including a fixed version of yours and Gordon's. (Only a small data size though.)
Without the above index, your query was slowest in the batch.
With the above index, Gordon's second query was slowest.
Alternative Queries
Your query (fixed):
select distinct o.id
from #t o
where o.id not in (
select od1.id
from #t od1
inner join #t od2 on
od2.id = od1.id
and od2.val='Y'
where od1.val= 'X'
)
and o.id not in (
select od1.id
from #t od1
inner join #t od2 on
od2.id = od1.id
and od2.val='a'
where od1.val= 'b'
)
Mixture between Gordon's first and second query. Fixes the duplicate issue in the first and the performance in the second:
select id
from #t od
group by id
having ( sum(case when val in ('X') then 1 else 0 end) = 0
or sum(case when val in ('Y') then 1 else 0 end) = 0
)
and( sum(case when val in ('A') then 1 else 0 end) = 0
or sum(case when val in ('B') then 1 else 0 end) = 0
)
Using INTERSECT and EXCEPT:
select id
from #t
except
(
select id
from #t
where val = 'a'
intersect
select id
from #t
where val = 'b'
)
except
(
select id
from #t
where val = 'x'
intersect
select id
from #t
where val = 'y'
)

I would use aggregation and having:
select order_id
from order_detail od
group by order_id
having sum(case when detail in ('X', 'Y') then 1 else 0 end) < 2 and
sum(case when detail in ('A', 'B') then 1 else 0 end) < 2;
This assumes that orders do not have duplicate rows with the same detail. If that is possible:
select order_id
from order_detail od
group by order_id
having count(distinct case when detail in ('X', 'Y') then detail end) < 2 and
count(distinct case when detail in ('A', 'B') then detail end) < 2;
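The difference between the two queries shows up once a detail is repeated. The following is a minimal sketch assuming SQLite through Python's sqlite3. It uses the question's sample rows plus one extra duplicate row (2, 'X') that is not in the original data and is added only for illustration:

```python
import sqlite3

# Question's sample rows, plus a hypothetical duplicate (2, 'X').
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table order_detail (order_id int, detail varchar(10));
insert into order_detail values
  (1, 'X'), (1, 'Y'), (1, 'Z'),
  (2, 'X'), (2, 'Z'), (2, 'B'), (2, 'X'),
  (3, 'A'), (3, 'Z'), (3, 'B');
""")

# Plain sum: the duplicated X makes order 2 look like it has both X and Y.
sum_ids = [row[0] for row in conn.execute("""
select order_id
from order_detail
group by order_id
having sum(case when detail in ('X', 'Y') then 1 else 0 end) < 2
   and sum(case when detail in ('A', 'B') then 1 else 0 end) < 2
""")]

# count(distinct ...): duplicates collapse, so order 2 is kept.
distinct_ids = [row[0] for row in conn.execute("""
select order_id
from order_detail
group by order_id
having count(distinct case when detail in ('X', 'Y') then detail end) < 2
   and count(distinct case when detail in ('A', 'B') then detail end) < 2
""")]
print(sum_ids, distinct_ids)
```

With the duplicate present, the sum version returns no orders, while the count(distinct ...) version still returns order 2.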

SQL Query to Update Object Count Based on Event Name

Imagine I am an owner of many bookstores. I keep a database of all events that occur in all of my many bookstores. Two events of note are "Book Added" and "Book Removed", for when a book is added to the inventory of a store, and when it is sold from a store. An example schema would be bookstore_id, event_name, time.
Now say I have a second table, which maintains the current state of each bookstore, so the schema would be bookstore_id, num_books.
I want to be able to use the first table to get the count of all the "Book Added" events per bookstore, subtract the count of all the "Book Removed" events per bookstore, and then update the number of books in each bookstore in the second table.
The only way I can think to do it requires using a cursor, but I'm assuming there's a more "SQL-esque" way to do it that is more set-based and doesn't require a cursor.

You can count the events by using a GROUP BY clause.
If we create two derived tables that count the added and the removed books respectively, we can simply subtract the results and update the parent table. This will look like:
UPDATE b
SET b.numbooks = AddedBooks.BooksAdded - RemovedBooks.BooksRemoved
FROM dbo.Books b
INNER JOIN (SELECT be.book_id, count(*) AS BooksAdded
FROM dbo.BookEvents be
WHERE be.event = 'BookAdded'
GROUP BY be.book_id, be.event) AS AddedBooks
ON b.bookid = AddedBooks.book_id
INNER JOIN (SELECT be.book_id, count(*) AS BooksRemoved
FROM dbo.BookEvents be
WHERE be.event = 'BookRemoved'
GROUP BY be.book_id, be.event) AS RemovedBooks
ON b.bookid = RemovedBooks.book_id

select bookstore_id
, sum(case when event_name = 'Book Removed' then -1 else 1 end) as "num books"
from bookstores
group by bookstore_id
If there are more than two event types:
select bookstore_id
, sum(case when event_name = 'Book Removed' then -1
when event_name = 'Book Added' then 1
end) as "num books"
from bookstores
group by bookstore_id
And I would just make it a view unless you run into performance issues.

We can use CTEs to get the counts individually and then combine them.
With CTE_Add AS
(
Select Bkstr_ID, Count(event_Name) As Added From temp Where event_Name = 'Added' Group by Bkstr_ID
), CTE_Rem As
(
Select Bkstr_ID, Count(event_Name) As Removed From temp Where event_Name = 'Removed' Group by Bkstr_ID
)
-- Coalesce covers stores that have no 'Removed' events
Select A.Bkstr_ID, Added - Coalesce(Removed, 0) As NumBooks
From CTE_Add A
Left Join CTE_Rem R On A.Bkstr_ID = R.Bkstr_ID
This will give you the ID and the count for each store.
Instead of the select, you can use an insert statement.

I'd use SUM(CASE WHEN ...). Below is an example.
If object_id('tempdb..#BoookStores') Is Not Null Drop Table #BoookStores
Create Table #BoookStores (bookstore_id int, num_books int)
/* We have 3 stores */
Insert #BoookStores (bookstore_id, num_books)
Values (1, 0), (2, 0), (3, 0)
If object_id('tempdb..#Events') Is Not Null Drop Table #Events
Create Table #Events (bookstore_id int, event_name varchar(10), time dateTime Default(GetDate()) )
Insert #Events (bookstore_id, event_name)
Values
(1, 'Added'), (1, 'Added'), (1, 'Added'), (1, 'Added'), -- Added 4 books to 1. store
(2, 'Added'), (2, 'Added'), (2, 'Added'), -- Added 3 books to 2. store
(3, 'Added'), (3, 'Added'), -- Added 2 books to 3. store
/* removed 2 books from each stores */
(1, 'Removed'), (1, 'Removed'),
(2, 'Removed'), (2, 'Removed'),
(3, 'Removed'), (3, 'Removed')
/* Calculate adds and removes. Update the results */
;With Tmp As (
Select E.bookstore_id,
Sum(Case When E.event_name = 'Added' Then 1 Else 0 End) As AddCount,
Sum(Case When E.event_name = 'Removed' Then 1 Else 0 End) As RemoveCount
From #Events E
Group By E.bookstore_id
)
Update BS Set num_books = T.AddCount-T.RemoveCount
From #BoookStores BS
Inner Join Tmp T On T.bookstore_id = BS.bookstore_id
/* check results*/
Select * From #BoookStores BS

Something like this will get you in the ballpark. Similar logic could be used for INSERT.
UPDATE tableA
SET tableA.num_books = tableB.num_books
FROM secondTable AS TableA
INNER JOIN (
SELECT bookstore_id,
SUM(CASE
WHEN event_name = 'Book Added'
THEN 1
ELSE 0
END) - SUM(CASE
WHEN event_name = 'Book Removed'
THEN 1
ELSE 0
END
) AS num_books
FROM firstTable
GROUP BY bookstore_id
) TableB ON TableA.bookstore_id = tableB.bookstore_id

You can try a query like below:
update t1
set num_books=inventory
FROM bs t1 LEFT JOIN
(select bookstore_id,SUM(case when event_name like 'A' then 1 when event_name like 'R' then -1 else NULL end) as inventory
from bse
group by bookstore_id) t2
on t1.bookstore_id=t2.bookstore_id

UPDATE bsc
SET bsc.num_books = bse.num_books
FROM bookstorecounts bsc
JOIN (SELECT bookstore_id,
SUM(CASE event_name
WHEN 'Book Removed' THEN -1
WHEN 'Book Added' THEN 1
END) AS num_books
FROM bookstoreevents
GROUP BY bookstore_id
) bse ON bsc.bookstore_id = bse.bookstore_id
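The UPDATE ... FROM form above is SQL Server syntax. On platforms without it, the same set-based update can be written with a correlated subquery. The following is a minimal sketch assuming SQLite through Python's sqlite3. The table names are reused from the query above, and the sample rows are made up for illustration:

```python
import sqlite3

# Hypothetical sample data: store 3 has no events at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table bookstoreevents (bookstore_id int, event_name text, time text);
create table bookstorecounts (bookstore_id int, num_books int);
insert into bookstorecounts values (1, 0), (2, 0), (3, 0);
insert into bookstoreevents (bookstore_id, event_name) values
  (1, 'Book Added'), (1, 'Book Added'), (1, 'Book Added'), (1, 'Book Removed'),
  (2, 'Book Added'), (2, 'Book Removed');
""")

# One statement, no cursor: each store's count is recomputed from its events.
# coalesce turns the NULL sum for a store without events into 0.
conn.execute("""
update bookstorecounts
set num_books = (
    select coalesce(sum(case event_name
                          when 'Book Added' then 1
                          when 'Book Removed' then -1
                          else 0
                        end), 0)
    from bookstoreevents e
    where e.bookstore_id = bookstorecounts.bookstore_id
)
""")

counts = conn.execute(
    "select bookstore_id, num_books from bookstorecounts order by bookstore_id"
).fetchall()
print(counts)
```

Unlike the inner join version, this form also resets stores that have no events, which may or may not be what you want.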

SQL query to get count based on filtered status

I have a table which has two columns, CustomerId and Status (A, B, C).
A customer can have multiple statuses in different rows.
I need to get the count of each status based on the following rules:
If a customer has statuses A and B, they should be counted in status A.
If a customer has both B and C, they should be counted in status B.
If a customer has all three, they fall in status A.
What I need is a table with status and count.
Could someone please help?
I know that someone will ask me to write my query first, but I couldn't work out how to implement this logic in a query.

You could play with different variations of this:
select customerId,
case when HasA+HasB+HasC = 3 then 'A'
when HasA+HasB = 2 then 'A'
when HasB+HasC = 2 then 'B'
when HasA+HasC = 2 then 'A'
when HasA is null and HasB is null and HasC is not null then 'C'
when HasB is null and HasC is null and HasA is not null then 'A'
when HasC is null and HasA is null and HasB is not null then 'B'
end as overallStatus
from
(
select customerId,
max(case when Status = 'A' then 1 end) HasA,
max(case when Status = 'B' then 1 end) HasB,
max(case when Status = 'C' then 1 end) HasC
from tableName
group by customerId
) as t;
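If the rules really amount to a fixed priority of A over B over C (an assumption; the question does not say what happens with A and C only, and the query above maps that case to A), then the overall status is just the alphabetically smallest status per customer. The following is a minimal sketch assuming SQLite through Python's sqlite3, with a hypothetical table name customer_status and made-up rows:

```python
import sqlite3

# Hypothetical table and rows: customer 4 has a single status, C.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customer_status (CustomerId int, Status text);
insert into customer_status values
  (1, 'A'), (1, 'B'),
  (2, 'B'), (2, 'C'),
  (3, 'A'), (3, 'B'), (3, 'C'),
  (4, 'C');
""")

# Inner query: one overall status per customer (min picks A before B before C).
# Outer query: the status/count table the question asks for.
status_counts = conn.execute("""
select overallStatus, count(*) as total
from (
    select CustomerId, min(Status) as overallStatus
    from customer_status
    group by CustomerId
) t
group by overallStatus
order by overallStatus
""").fetchall()
print(status_counts)
```

This only works while the status codes sort in priority order. For other codes, the explicit case expression above is the safer choice.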

I like to use Cross Apply for this type of query as it allows for use of the calculated status in the Group By clause.
Here's my solution with some sample data.
Create Table #Table (Customerid int, Stat varchar(1))
INSERT INTO #Table (Customerid, Stat )
VALUES
(1, 'a'),
(1 , 'b'),
(2, 'b'),
(2 , 'c'),
(3, 'a'),
(3 , 'b'),
(3, 'c')
SELECT
ca.StatusGroup
, COUNT(DISTINCT Customerid) as Total
FROM
#Table t
CROSS APPLY
(VALUES
(
CASE WHEN
EXISTS
(SELECT 1 FROM #Table x where x.Customerid = t.CustomerID and x.Stat = 'a')
AND EXISTS
(SELECT 1 FROM #Table x where x.Customerid = t.CustomerID and x.Stat = 'b')
THEN 'A'
WHEN
EXISTS
(SELECT 1 FROM #Table x where x.Customerid = t.CustomerID and x.Stat = 'b')
AND EXISTS
(SELECT 1 FROM #Table x where x.Customerid = t.CustomerID and x.Stat = 'c')
THEN 'B'
ELSE t.stat
END
)
) ca (StatusGroup)
GROUP BY ca.StatusGroup
I edited this to deal with customers who only have one status, in which case it will return A, B or C depending on the customer's status.
