SQL simplifying an except query - sql

I have a database with around 50 million entries showing the status of a device for a given day, simplified to the form:
id | status
-------------
1 | Off
1 | Off
1 | On
2 | Off
2 | Off
3 | Off
3 | Off
3 | On
...
Each id is guaranteed to have at least 2 rows with an 'Off' status, but doesn't have to have an 'On' status. I'm trying to get a list of only the ids that do not have an 'On' status. For example, for the data set above I'd want the query to return only '2'.
The current query is:
SELECT DISTINCT id FROM table
EXCEPT
SELECT DISTINCT id FROM table WHERE status <> 'Off'
Which seems to work, but it's having to iterate over the entire table twice which ends up taking ~10-12 minutes to run per query. Is there a simpler way to do this with only a single query?

You can use WHERE NOT EXISTS instead:
Select Distinct Id
From Table A
Where Not Exists
(
Select *
From Table B
Where A.Id = B.Id
And B.Status = 'On'
)
I would also recommend looking at the indexes on the Status column. 10-12 minutes to run is excessively long. Even with 50m records, with proper indexing, a query like this shouldn't take longer than a second.
To add an index to the column, you can run this (I'm assuming SQL Server, your syntax may vary):
Create NonClustered Index Ix_YourTable_Status On YourTable (Status Asc);

You can use conditional aggregation.
select id
from table
group by id
having count(case when status='On' then 1 end)=0
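As a quick sanity check, here is a minimal sketch that runs the original EXCEPT query, the NOT EXISTS form and the conditional aggregation against the sample rows from the question. It uses Python's built-in sqlite3 module rather than the asker's database, and the table name device_status is made up for the example, so it only confirms that the three forms agree on the result, not how they perform on 50 million rows.

```python
import sqlite3

# In-memory SQLite database; "device_status" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_status (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO device_status VALUES (?, ?)",
    [(1, "Off"), (1, "Off"), (1, "On"),
     (2, "Off"), (2, "Off"),
     (3, "Off"), (3, "Off"), (3, "On")],
)

except_sql = """
    SELECT DISTINCT id FROM device_status
    EXCEPT
    SELECT DISTINCT id FROM device_status WHERE status <> 'Off'
"""
not_exists_sql = """
    SELECT DISTINCT a.id
    FROM device_status a
    WHERE NOT EXISTS (
        SELECT 1 FROM device_status b
        WHERE b.id = a.id AND b.status = 'On'
    )
"""
having_sql = """
    SELECT id
    FROM device_status
    GROUP BY id
    HAVING COUNT(CASE WHEN status = 'On' THEN 1 END) = 0
"""

# Run each variant and collect the ids it returns.
results = {
    name: sorted(row[0] for row in conn.execute(sql))
    for name, sql in [
        ("except", except_sql),
        ("not_exists", not_exists_sql),
        ("having", having_sql),
    ]
}
print(results)
```

All three variants should return only id 2 for this data.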

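You can also use a self join (an anti-join). The status filter has to go in the ON clause; in the WHERE clause it would discard the NULL rows you are looking for:
SELECT DISTINCT A.Id
FROM Table A
LEFT JOIN Table B ON A.Id=B.Id AND B.Status='On'
WHERE B.Id IS NULL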
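With a LEFT JOIN anti-join it matters where the status filter sits. The sketch below (Python's sqlite3, hypothetical table name device_status, sample rows from the question) compares the filter in the ON clause with the filter in the WHERE clause.

```python
import sqlite3

# In-memory SQLite database; "device_status" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_status (id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO device_status VALUES (?, ?)",
    [(1, "Off"), (1, "Off"), (1, "On"),
     (2, "Off"), (2, "Off"),
     (3, "Off"), (3, "Off"), (3, "On")],
)

# Filter in ON: ids without an 'On' row get a NULL b side and survive.
filter_in_on_sql = """
    SELECT DISTINCT a.id
    FROM device_status a
    LEFT JOIN device_status b ON a.id = b.id AND b.status = 'On'
    WHERE b.id IS NULL
"""
# Filter in WHERE: b.status = 'On' and b.id IS NULL can never both hold.
filter_in_where_sql = """
    SELECT DISTINCT a.id
    FROM device_status a
    LEFT JOIN device_status b ON a.id = b.id
    WHERE b.status = 'On' AND b.id IS NULL
"""

ids_filter_in_on = [row[0] for row in conn.execute(filter_in_on_sql)]
ids_filter_in_where = [row[0] for row in conn.execute(filter_in_where_sql)]
print(ids_filter_in_on, ids_filter_in_where)
```

Only the ON-clause version returns id 2; the WHERE-clause version returns no rows at all.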

Related

Postgres, groupBy and count for table and relations at the same time

I have a table called 'users' that has the following structure:
id (PK) | campaign_id | createdAt
-----------------------------------------------
1 | 123 | 2022-07-14T10:30:01.967Z
2 | 1234 | 2022-07-14T10:30:01.967Z
3 | 123 | 2022-07-14T10:30:01.967Z
4 | 123 | 2022-07-14T10:30:01.967Z
At the same time I have a table that tracks clicks per user:
id (PK) | user_id(FK) | createdAt
-----------------------------------------------
1 | 1 | 2022-07-14T10:30:01.967Z
2 | 2 | 2022-07-14T10:30:01.967Z
3 | 2 | 2022-07-14T10:30:01.967Z
4 | 2 | 2022-07-14T10:30:01.967Z
Both of these tables hold up to millions of records, so I need the most efficient query to group the data per campaign_id.
The result I am looking for would look like this:
campaign_id | total_users | total_clicks
----------------------------------------
123 | 3 | 1
1234 | 1 | 3
Unfortunately I have no idea how to achieve this with performance in mind. Most importantly, I need to use WHERE or HAVING to limit the query to a certain time range by createdAt.
Note: PostgreSQL is not my forte, nor is SQL, but I'm learning by spending some time on your question. Have a go with an INNER JOIN of two separate SELECT statements:
SELECT * FROM
(
SELECT campaign_id, COUNT (t1."id(PK)") total_users FROM t1 GROUP BY campaign_id
) tbl1
INNER JOIN
(
SELECT campaign_id, COUNT (t2."user_id(FK)") total_clicks FROM t2 INNER JOIN t1 ON t1."id(PK)" = t2."user_id(FK)" GROUP BY campaign_id
) tbl2
USING(campaign_id)
See an online fiddle. I believe this is now also ready for a WHERE clause in both SELECT statements to filter by "createdAt". I'm pretty sure someone else will come up with something better.
Good luck.
Hope this will help you.
select u.campaign_id,
count(distinct u.id) users_count,
count(c.user_id) clicks_count
from
users u left join clicks c on u.id=c.user_id
group by 1;
See here query output
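To add the time range the question asks for, one option is sketched below. It uses Python's sqlite3 with the sample rows instead of PostgreSQL, and it assumes that both users and clicks should be limited to the same createdAt window, which the question does not spell out. The click filter goes in the ON clause so that users without clicks in the window are still counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, campaign_id INTEGER, createdAt TEXT);
    CREATE TABLE clicks (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id), createdAt TEXT);
""")
ts = "2022-07-14T10:30:01.967Z"
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(1, 123, ts), (2, 1234, ts), (3, 123, ts), (4, 123, ts)])
conn.executemany("INSERT INTO clicks VALUES (?, ?, ?)",
                 [(1, 1, ts), (2, 2, ts), (3, 2, ts), (4, 2, ts)])

# Assumption: the same half-open window applies to users and clicks.
# ISO-8601 strings in one format compare correctly as text in SQLite.
query = """
    SELECT u.campaign_id,
           COUNT(DISTINCT u.id) AS total_users,
           COUNT(c.id)          AS total_clicks
    FROM users u
    LEFT JOIN clicks c
           ON c.user_id = u.id
          AND c.createdAt >= :range_start AND c.createdAt < :range_end
    WHERE u.createdAt >= :range_start AND u.createdAt < :range_end
    GROUP BY u.campaign_id
    ORDER BY u.campaign_id
"""
window = {"range_start": "2022-07-14T00:00:00.000Z",
          "range_end": "2022-07-15T00:00:00.000Z"}
campaign_totals = conn.execute(query, window).fetchall()
print(campaign_totals)
```

For the sample data this gives 3 users and 1 click for campaign 123, and 1 user and 3 clicks for campaign 1234. On the real tables, indexes covering createdAt and clicks.user_id would likely matter more than the exact query shape.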

Aggregate and count after left join

I am aggregating columns of a table to find the count of unique values. For example, aggregating the status shows that out of 5 alerts there are 2 in open status and 3 that are closed. The simplified table looks like this:
create table alerts (
id,
status,
owner_id
);
The query below uses grouping sets to aggregate multiple columns at once. This approach works well.
with aggs as (
select status
from alerts
where alerts.owner_id = 'x'
)
select status, count(*)
from aggs
group by grouping sets(
(),
(status)
);
The output at its simplest could look like this:
status | count
--------+-------
| 1
new | 1
However, now I need to aggregate additional columns from another table. This table (shown below) can have zero or more rows associated to the first table (alerts:users 1:N).
create table users (
id,
alert_id,
name
);
I have tried updating the query to use a left join but this approach incorrectly inflates the counts of the alert columns.
with aggs as (
select alerts.status, users.name
from alerts
left join users on alerts.id = users.alert_id
where alerts.owner_id = 'x'
-- and additional filtering by columns in the users table
)
select status, name, count(*)
from aggs
group by grouping sets(
(),
(status),
(name)
);
Below is an example of the incorrect results. Since there are 3 rows in the users table, the count for the status column is now 3 but should be 1.
status | name | count
--------+-------------------------+-------
| | 3
| user1 | 1
| user2 | 1
| user3 | 1
new | | 3
How can I perform this aggregation to include the columns from the table with a many-to-one relationship without inflating the counts? In the future I will likely need to aggregate more columns from other tables with a many-to-one relationship and need a solution that will still work with several left joins. All help is much appreciated.
edit: link to db-fiddle https://www.db-fiddle.com/f/buGD2DuJiqf9LGF9rw5EgT/2
Do you just want to count the number of alerts? If so, use count(distinct):
count(distinct alert_id)
Of course, you need this in aggs, so the select would include:
alerts.id as alert_id
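A minimal sketch of the count(distinct) idea follows, using Python's sqlite3 with made-up sample rows (one alert, three users). SQLite has no GROUPING SETS, so only the (status) grouping is shown; the same COUNT(DISTINCT alert_id) expression should carry over to the grouping-sets query in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE alerts (id INTEGER, status TEXT, owner_id TEXT);
    CREATE TABLE users  (id INTEGER, alert_id INTEGER, name TEXT);
    -- Hypothetical sample data: one alert with three associated users.
    INSERT INTO alerts VALUES (1, 'new', 'x');
    INSERT INTO users  VALUES (1, 1, 'user1'), (2, 1, 'user2'), (3, 1, 'user3');
""")

status_counts = conn.execute("""
    WITH aggs AS (
        SELECT alerts.id AS alert_id, alerts.status, users.name
        FROM alerts
        LEFT JOIN users ON alerts.id = users.alert_id
        WHERE alerts.owner_id = 'x'
    )
    SELECT status,
           COUNT(*)                 AS joined_rows,  -- inflated by the join
           COUNT(DISTINCT alert_id) AS alert_count   -- one per alert
    FROM aggs
    GROUP BY status
""").fetchall()
print(status_counts)
```

The joined row count is 3 while the distinct alert count stays at 1, which is the number the question expects for the status column.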

Simple WHERE clause but keep extracted rows and fill them with null values

I have a table which basically looks like this one:
Date | Criteria
12-04-2016 123
12-05-2016 1234
...
Now I want to select those rows with values in the column 'Criteria' within a given range but I want to keep the extracted rows. The extracted rows should get the value 'null' for the column 'Criteria'. So for example, if I want to select the row with 'Criteria = 123' my result should look like this:
Date | Criteria
12-04-2016 123
12-05-2016 null
Currently I am using this query to get the result:
SELECT b.date, a.criteria
FROM (SELECT id, date, criteria FROM ABC WHERE criteria > 100 and criteria < 200) a
FULL OUTER JOIN ABC b ON a.id = b.id ORDER BY a.criteria
Someone told me that full outer joins perform very badly. In addition, my table has about 400000 records and the query is used quite often. Does anyone have an idea how to speed up my query? By the way, I am using an Oracle 11g database.
Do you just want a case expression?
SELECT date,
(case when criteria > 100 and criteria < 200 then criteria end) as criteria
FROM ABC;
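Here is a small sketch of the CASE approach on the two sample rows, run with Python's sqlite3 instead of Oracle. The column is named event_date here only to avoid any clash with the DATE keyword; that name and the id values are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "event_date" is a hypothetical column name standing in for "date".
conn.execute("CREATE TABLE ABC (id INTEGER, event_date TEXT, criteria INTEGER)")
conn.executemany("INSERT INTO ABC VALUES (?, ?, ?)",
                 [(1, "12-04-2016", 123), (2, "12-05-2016", 1234)])

# A CASE without ELSE yields NULL for rows outside the range,
# so every row is kept and the table is scanned only once.
masked_rows = conn.execute("""
    SELECT event_date,
           CASE WHEN criteria > 100 AND criteria < 200 THEN criteria END AS criteria
    FROM ABC
    ORDER BY id
""").fetchall()
print(masked_rows)
```

The first row keeps its value 123 and the second comes back with a NULL (None in Python) criteria, with no join involved.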

How to obtain the number of tuples returned by each SELECT of a UNION query in PostgreSQL?

I have a query consisting of a union of two SELECT queries. Each SELECT query returns an info. I'm interested in obtaining the number of tuples returned by each SELECT query.
I want to do something like :
SELECT table1.info AS info, num_result_tuple()
FROM table1, table2
WHERE table1.info < table2.info
UNION
SELECT table3.info AS info, num_result_tuple()
FROM table3, table4
WHERE table3.info < table4.info
where num_result_tuple() represents the number of tuples found by each simple query.
Is there such a function in PostgreSQL, or is there another way to achieve this?
You can use a window function for this:
SELECT table1.info AS info, count(*) over () as part_count
FROM table1, table2
WHERE table1.info < table2.info
UNION
SELECT table3.info AS info, count(*) over ()
FROM table3, table4
WHERE table3.info < table4.info;
You probably want UNION ALL instead of UNION. UNION will remove duplicate rows between the two select parts. If you know that you can't have any duplicates (or you do want to return them) UNION ALL will be faster.
If you need to remove duplicates from the original query, adding the count() will change the result. Imagine the following partial results:
First query:
info | part_count
-----+-----------
a | 1
Second query:
info | part_count
-----+-----------
a | 2
b | 2
The union would return
info | part_count
-----+-----------
a | 1
a | 2
b | 2
The query without the counts would have returned
info
----
a
b
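The effect described above can be reproduced with a small sketch. It uses Python's sqlite3 (window functions need SQLite 3.25 or newer) and two hypothetical single-column tables, first_part and second_part, standing in for the results of the two joins in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the result of each SELECT in the question.
    CREATE TABLE first_part  (info TEXT);
    CREATE TABLE second_part (info TEXT);
    INSERT INTO first_part  VALUES ('a');
    INSERT INTO second_part VALUES ('a'), ('b');
""")

# COUNT(*) OVER () attaches the row count of each part to every row of that part.
with_counts = conn.execute("""
    SELECT info, COUNT(*) OVER () AS part_count FROM first_part
    UNION
    SELECT info, COUNT(*) OVER () FROM second_part
    ORDER BY info, part_count
""").fetchall()

without_counts = conn.execute("""
    SELECT info FROM first_part
    UNION
    SELECT info FROM second_part
    ORDER BY info
""").fetchall()

print(with_counts)
print(without_counts)
```

With the counts, 'a' appears twice because the two rows differ in part_count, so UNION no longer treats them as duplicates.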

Select data from a table where only the first two columns are distinct

Background
I have a table which has six columns. The first three columns create the pk. I'm tasked with removing one of the pk columns.
I selected (using distinct) the data into a temp table (excluding the third column), and tried inserting all of that data back into the original table with the third column being '11' for every row as this is what I was instructed to do. (this column is going to be removed by a DBA after I do this)
However, when I went to insert this data back into the original table I get a pk constraint error. (shocking, I know)
The other three columns are just date columns, so the distinct select didn't create a unique pk for each record. What I'm trying to achieve is just calling a distinct on the first two columns, and then just arbitrarily selecting the three other columns as it doesn't matter which dates I choose (at least not on dev).
What I've tried
I found the following post which seems to achieve what I want:
How do I (or can I) SELECT DISTINCT on multiple columns?
I tried the answers from both Joel and Erwin.
Attempt 1:
However, with Joel's answer the set returned is too large; the inner join isn't doing what I thought it would do. Selecting distinct col1 and col2 returns 400 rows, but when I use his solution 600 rows are returned. I checked the data and there were in fact duplicate pk's. Here is my attempt at duplicating Joel's answer:
select a.emp_no,
a.eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no, modify_dte,
modify_by_emp_no
from tempdb.guest.temp_part_time_evaluator b
inner join
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
) a
ON b.emp_no = a.emp_no AND b.eec_planning_unit_cde = a.eec_planning_unit_cde
Now, if I execute just the inner select statement 400 rows are returned. If I select the whole query 600 rows are returned? Isn't inner join supposed to only show the intersection of the two sets?
Attempt 2:
I also tried the answer from Erwin. This one has a syntax error and I'm having trouble googling the spec on the where clause (specifically, the trick he is using with (emp_no, eec_planning_unit_cde))
Here is the attempt:
select emp_no,
eec_planning_unit_cde,
'11' as area, create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
where (emp_no, eec_planning_unit_cde) IN
(
select emp_no, eec_planning_unit_cde
from tempdb.guest.temp_part_time_evaluator
group by emp_no, eec_planning_unit_cde
)
Now, I realize that the post I referenced is for PostgreSQL. Doesn't T-SQL have something similar? Trying to google parentheses isn't working too well.
Overview of Questions:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
A select distinct is based on all selected columns, so it does not guarantee that the first two are distinct. Group by the two key columns and aggregate the rest instead:
select pk1, pk2, '11', max(c1), max(c2), max(c3)
from table
group by pk1, pk2
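A minimal sketch of the GROUP BY approach, run with Python's sqlite3 on made-up rows (the table name evaluator and the shortened column names are assumptions for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, shortened version of the temp table from the question.
conn.execute("""
    CREATE TABLE evaluator (
        emp_no INTEGER, unit_cde TEXT, area TEXT,
        create_dte TEXT, modify_dte TEXT
    )
""")
conn.executemany("INSERT INTO evaluator VALUES (?, ?, ?, ?, ?)", [
    (1, "A", "10", "2020-01-01", "2020-06-01"),
    (1, "A", "12", "2020-02-01", "2020-03-01"),  # same key pair, different area
    (2, "B", "10", "2020-03-01", "2020-04-01"),
])

# One row per (emp_no, unit_cde); each date column is aggregated independently.
grouped_rows = conn.execute("""
    SELECT emp_no, unit_cde, '11' AS area,
           MAX(create_dte) AS create_dte,
           MAX(modify_dte) AS modify_dte
    FROM evaluator
    GROUP BY emp_no, unit_cde
    ORDER BY emp_no
""").fetchall()
print(grouped_rows)
```

Note that for key (1, 'A') the two MAX values come from different source rows. That is fine when any date will do, as the question says, but it is worth knowing.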
You could TRY this:
SELECT a.emp_no,
a.eec_planning_unit_cde,
'11' as area,
b.create_dte,
b.create_by_emp_no,
b.modify_dte,
b.modify_by_emp_no
FROM
(
SELECT emp_no, eec_planning_unit_cde
FROM tempdb.guest.temp_part_time_evaluator
GROUP BY emp_no, eec_planning_unit_cde
) a
JOIN tempdb.guest.temp_part_time_evaluator b
ON a.emp_no = b.emp_no AND a.eec_planning_unit_cde = b.eec_planning_unit_cde
That gives you the distinct key pairs, but if the other columns differ between rows with the same pair, the join still returns every one of those rows, so you might have to try a more brute-force approach and keep only the first row per pair:
SELECT a.emp_no,
a.eec_planning_unit_cde,
'11' as area,
a.create_dte,
a.create_by_emp_no,
a.modify_dte,
a.modify_by_emp_no
FROM
(
-- number the rows within each (emp_no, eec_planning_unit_cde) pair
SELECT ROW_NUMBER() OVER(PARTITION BY emp_no, eec_planning_unit_cde ORDER BY create_dte) rownumber,
emp_no,
eec_planning_unit_cde,
create_dte,
create_by_emp_no,
modify_dte,
modify_by_emp_no
FROM tempdb.guest.temp_part_time_evaluator
) a
WHERE rownumber = 1
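For ROW_NUMBER() to pick one row per key pair, the window needs a PARTITION BY on the key columns; with only an ORDER BY, rownumber = 1 matches a single row in the whole table. Below is a sketch with Python's sqlite3 (SQLite 3.25 or newer for window functions) on made-up rows; the table name evaluator and the shortened column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, shortened version of the temp table from the question.
conn.execute("""
    CREATE TABLE evaluator (
        emp_no INTEGER, unit_cde TEXT, area TEXT,
        create_dte TEXT, modify_dte TEXT
    )
""")
conn.executemany("INSERT INTO evaluator VALUES (?, ?, ?, ?, ?)", [
    (1, "A", "10", "2020-01-01", "2020-06-01"),
    (1, "A", "12", "2020-02-01", "2020-03-01"),  # same key pair, different area
    (2, "B", "10", "2020-03-01", "2020-04-01"),
])

# Number the rows inside each key pair and keep the first one.
first_rows = conn.execute("""
    SELECT emp_no, unit_cde, '11' AS area, create_dte, modify_dte
    FROM (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY emp_no, unit_cde
                   ORDER BY create_dte
               ) AS rownumber,
               emp_no, unit_cde, create_dte, modify_dte
        FROM evaluator
    ) ranked
    WHERE rownumber = 1
    ORDER BY emp_no
""").fetchall()
print(first_rows)
```

Each output row is taken whole from one source row (the earliest create_dte per pair here), so the date columns stay consistent with each other.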
I'll reply one by one:
Why doesn't inner join return an intersection of two sets? From googling this is what I thought it was supposed to do
An inner join doesn't compute an intersection. Let's suppose these tables:
T1       T2
n | s    n | s
1 | A    2 | X
2 | B    2 | Y
2 | C
3 | D
If you join both tables on the numeric column, you don't get the intersection (2 rows). Every matching pair of rows produces an output row, so you get 4:
select *
from t1 inner join t2
on t1.n = t2.n;
| N | S |
---------
| 2 | B |
| 2 | B |
| 2 | C |
| 2 | C |
And with the approach from your second query:
select *
from t1
where t1.n in (select n from t2);
| N | S |
---------
| 2 | B |
| 2 | C |
Is there another way to achieve the same method that I was trying in attempt 2 in t-sql?
Yes, this subquery:
select *
from t1
where exists (
select 1
from t2
where t2.n = t1.n
);
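The difference between the join and the semi-join forms can be checked with a short sketch using Python's sqlite3 and the T1/T2 rows from above. EXISTS is used as the counterpart to IN; with a two-column key, the correlated subquery would simply compare both columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (n INTEGER, s TEXT);
    CREATE TABLE t2 (n INTEGER, s TEXT);
    INSERT INTO t1 VALUES (1, 'A'), (2, 'B'), (2, 'C'), (3, 'D');
    INSERT INTO t2 VALUES (2, 'X'), (2, 'Y');
""")

# Inner join: one output row per matching (t1, t2) pair.
join_rows = conn.execute("""
    SELECT t1.n, t1.s
    FROM t1 INNER JOIN t2 ON t1.n = t2.n
    ORDER BY t1.s
""").fetchall()

# IN: each t1 row appears at most once, however many matches t2 has.
in_rows = conn.execute("""
    SELECT n, s FROM t1
    WHERE n IN (SELECT n FROM t2)
    ORDER BY s
""").fetchall()

# EXISTS: same semi-join result as IN.
exists_rows = conn.execute("""
    SELECT n, s FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.n = t1.n)
    ORDER BY s
""").fetchall()

print(len(join_rows), in_rows, exists_rows)
```

The join returns 4 rows, while IN and EXISTS both return the 2 matching rows of t1.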
It doesn't matter to me which one of these I use, or if I use another solution... how should I go about this?
Use @JTC's second query.