oracle12c,sql,difference between count(*) and sum() - sql

Tell me the difference between sql1 and sql2:
sql1:
select count(1)
from table_1 a
inner join table_2 b on a.key = b.key where a.id in (
select id from table_1 group by id having count(1) > 1
)
sql2:
select sum(a) from (
select count(1) as a
from table_1 a
inner join table_2 b on a.key = b.key group by a.id having count(1) > 1
)
Why is the output not the same?

The two queries look alike, but they do quite different things. Let's check the first one:
select count(1)
from table_1 a
inner join table_2 b
on a.key = b.key
where a.id in (
select id from table_1 group by id having count(1) > 1
) ;
You are first making an inner join:
select count(1)
from table_1 a
inner join table_2 b
on a.key = b.key
In this case count(1) and count(*) are equivalent (count(id) gives the same number only if id is never NULL). You are counting the rows the join produces: every pair of rows from table_1 and table_2 that share the same key value.
After that, you are enforcing this:
where a.id in (
select id from table_1 group by id having count(1) > 1
)
In other words, only rows whose id appears at least twice in table_1 are kept. Note that this count looks at table_1 alone, not at the joined rows.
And lastly, you are doing this:
select count(1)
In other words, counting those rows. Translated into English, you have done this:
pair every record of table_1 with the records of table_2 that have the same key, and keep only the pairs that match
from that result, keep only the rows whose table_1 id appears more than once in table_1
count that result
Let's see what happens with the second query:
select sum(a) from (
select count(1) as a
from table_1 a
inner join table_2 b
on a.key = b.key
group by a.id
having count(1) > 1
);
You are making the same inner join:
select count(1) as a
from table_1 a
inner join table_2 b
on a.key = b.key
but, you are grouping it by the id of the table:
group by a.id
and then keeping only the groups that contain more than one joined row:
having count(1) > 1
The result so far is one row per table_1 id, holding the number of joined rows for that id, and only for ids with at least two joined rows. This is where the two queries diverge: here count(1) > 1 is evaluated on the joined rows, while in the first query it is evaluated on table_1 alone. An id that occurs once in table_1 but matches two rows in table_2 passes this filter and fails the first one; an id that occurs twice in table_1 but produces only one joined row does the opposite.
And lastly, you sum those per-id counts.
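To see the difference concretely, here is a minimal sketch using Python's built-in sqlite3 module rather than Oracle. The sample rows are made up, the join column is renamed to k to avoid quoting the word key, and the inner count is aliased cnt; none of that comes from the question.

```python
import sqlite3

# In-memory SQLite database; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER, k TEXT);
    CREATE TABLE table_2 (k TEXT);
    -- id 1 occurs twice in table_1, id 2 only once
    INSERT INTO table_1 VALUES (1, 'k1'), (1, 'k2'), (2, 'k3');
    -- 'k1' matches once, 'k2' never, 'k3' twice
    INSERT INTO table_2 VALUES ('k1'), ('k3'), ('k3');
""")

# sql1: the "> 1" test counts rows of table_1 alone.
sql1 = conn.execute("""
    SELECT count(1)
    FROM table_1 a
    INNER JOIN table_2 b ON a.k = b.k
    WHERE a.id IN (SELECT id FROM table_1 GROUP BY id HAVING count(1) > 1)
""").fetchone()[0]

# sql2: the "> 1" test counts joined rows per id.
sql2 = conn.execute("""
    SELECT sum(cnt) FROM (
        SELECT count(1) AS cnt
        FROM table_1 a
        INNER JOIN table_2 b ON a.k = b.k
        GROUP BY a.id
        HAVING count(1) > 1
    )
""").fetchone()[0]

print(sql1, sql2)  # 1 2
```

With this data sql1 keeps only id 1 (twice in table_1, one joined row) and sql2 keeps only id 2 (once in table_1, two joined rows), so the totals differ.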

When you use count(*) you count ALL the rows. The SUM() function is an aggregate function that returns the sum of all or distinct values in a set of values.

Related

Efficiently get nearest id based on counting a column in one-to-many table

I have a relational table (one-to-many) and I need to efficiently compute the similarity between ids given their associated items. The table looks something like this:
id item
1 A2231
1 A2134
2 A2134
2 B2313
...
What I need is the number of items each pair of ids has in common:
a_id b_id count_items
1 2 1
1 3 0
2 1 1
...
I have written a query, but it is O(n²), and it fails because it runs out of spool space.
SELECT A.ID AS a_id, B.ID AS b_id, COUNT(B.item) AS count_items
FROM Tab AS A LEFT JOIN Tab AS B --same table
ON (A.item = B.item)
GROUP BY A.ID, B.ID
EDIT:
n_rows ~ 50MM
n_items ~ 100K
n_ids ~ 170K
combinations id/item are unique
Is there a way to efficiently accomplish this?
Thanks in advance!
I would start by just using an inner join:
SELECT A.ID, B.ID, COUNT(*) AS count_items
FROM Tab A JOIN
Tab B --same table
ON A.item = B.item
GROUP BY A.ID, B.ID;
Next, if your table has duplicates, then this might work:
with t as (
select distinct id, item
from tab
)
select a.id, b.id, count(*)
from t a join
t b
on a.item = b.item
group by a.id, b.id;
And finally, if you want all pairs of ids, including pairs with no items in common, then:
with t as (
select distinct id, item
from tab
)
select i1.id, i2.id, count(b.id)
from (select distinct id from tab) i1 cross join
(select distinct id from tab) i2 left join
t a
on a.id = i1.id left join
t b
on b.id = i2.id and a.item = b.item
group by i1.id, i2.id;
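As a quick check of the all-pairs query, here is a small sketch run through Python's sqlite3 module. The engine, the column aliases, and the extra id 3 (which shares no item with the others) are assumptions added for illustration; it shows the shape of the result only and says nothing about how the query behaves at 50MM rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab (id INTEGER, item TEXT);
    INSERT INTO tab VALUES
        (1, 'A2231'), (1, 'A2134'),
        (2, 'A2134'), (2, 'B2313'),
        (3, 'C0001');  -- hypothetical id sharing nothing with the others
""")

# Every pair of ids, with 0 for pairs that have no item in common.
pairs = conn.execute("""
    WITH t AS (SELECT DISTINCT id, item FROM tab)
    SELECT i1.id AS a_id, i2.id AS b_id, count(b.id) AS count_items
    FROM (SELECT DISTINCT id FROM tab) i1
    CROSS JOIN (SELECT DISTINCT id FROM tab) i2
    LEFT JOIN t a ON a.id = i1.id
    LEFT JOIN t b ON b.id = i2.id AND a.item = b.item
    GROUP BY i1.id, i2.id
    ORDER BY i1.id, i2.id
""").fetchall()

for row in pairs:
    print(row)
```

Pairs of an id with itself are included and report that id's own item count; add a condition such as i1.id <> i2.id if those are not wanted.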

MSSQL 2012 - Returning multiple columns in a subquery

I'd like to return multiple columns with a sub query.
E.G,
select a.name, a.age
from table1 a, ( select b.race, b.weight from table2 b where dateDiff(dd, b.date1, b.date2 ) < 30 )
where a.age > 24
Some of you have said "Just use a join" - I do not want the dateDiff in the subquery affecting the results of the parent query. Again, my real query is more complex than this, but this should be sufficient to explain my issue.
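Use a left join for this. The filter stays inside the derived table, so the rows of the parent query are kept, and the subquery columns come back as NULL where nothing matches: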
SELECT a.name, b.score, ...
FROM (select id, name, ... from table1 where ???) a
LEFT JOIN (select id, score, ... from table2 where ???) b on (a.id = b.id)
WHERE clause
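A filled-in version of that template might look like the sketch below, run with Python's sqlite3 module instead of SQL Server. The id join column, the sample rows, and the julianday() arithmetic standing in for dateDiff(dd, ...) are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT, age INTEGER);
    CREATE TABLE table2 (id INTEGER, race TEXT, weight INTEGER,
                         date1 TEXT, date2 TEXT);
    INSERT INTO table1 VALUES (1, 'Ann', 30), (2, 'Bob', 40), (3, 'Cy', 20);
    INSERT INTO table2 VALUES
        (1, 'x', 70, '2020-01-01', '2020-01-10'),  -- 9 days apart
        (2, 'y', 80, '2020-01-01', '2020-03-01');  -- 60 days apart
""")

rows = conn.execute("""
    SELECT a.name, a.age, b.race, b.weight
    FROM (SELECT id, name, age FROM table1 WHERE age > 24) a
    LEFT JOIN (
        SELECT id, race, weight
        FROM table2
        -- SQLite stand-in for dateDiff(dd, date1, date2) < 30
        WHERE julianday(date2) - julianday(date1) < 30
    ) b ON a.id = b.id
    ORDER BY a.id
""").fetchall()

print(rows)  # [('Ann', 30, 'x', 70), ('Bob', 40, None, None)]
```

Bob stays in the result even though his table2 row fails the date filter; only his race and weight are NULL.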

SQL joined by last date

This question has been asked here more than once, but I couldn't find what I was looking for. I want to join two tables, where the row taken from the joined table is the most recent one by date and time. Up to here everything is fine.
My trouble starts when the joined table has more than two records. Let me show you a sample:
table_a
-------
id
name
description
created
updated
table_b
-------
id
table_a_id
name
description
created
updated
What I have done at the beginning was:
SELECT a.id, b.updated
FROM table_a AS a
LEFT JOIN (SELECT table_a_id, max (updated) as updated
FROM table_b GROUP BY table_a_id ) AS b
ON a.id = b.table_a_id
Up to here I get the columns a.id and b.updated. I need all the table_b columns, but when I try to add another column to the subquery, Postgres tells me I must add it to the GROUP BY clause, and then the result is not what I am looking for.
I am trying to find a way to get this list.
DISTINCT ON is your friend. Here is a solution with correct syntax:
SELECT a.id, b.updated, b.col1, b.col2
FROM table_a as a
LEFT JOIN (
SELECT DISTINCT ON (table_a_id)
table_a_id, updated, col1, col2
FROM table_b
ORDER BY table_a_id, updated DESC
) b ON a.id = b.table_a_id;
Or, to get the whole row from table_b:
SELECT a.id, b.*
FROM table_a as a
LEFT JOIN (
SELECT DISTINCT ON (table_a_id)
*
FROM table_b
ORDER BY table_a_id, updated DESC
) b ON a.id = b.table_a_id;
Detailed explanation for this technique as well as alternative solutions under this closely related question:
Select first row in each GROUP BY group?
Try:
SELECT a.id, b.*
FROM table_a AS a
LEFT JOIN (SELECT t.*,
row_number() over (partition by table_a_id
order by updated desc) rn
FROM table_b t) AS b
ON a.id = b.table_a_id and b.rn=1
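The row_number() variant is portable beyond Postgres, so it can be tried with Python's sqlite3 module (window functions need SQLite 3.25 or later). This is only a sketch: the trimmed-down columns and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, name TEXT);
    CREATE TABLE table_b (id INTEGER, table_a_id INTEGER,
                          name TEXT, updated TEXT);
    INSERT INTO table_a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
    INSERT INTO table_b VALUES
        (10, 1, 'old',  '2020-01-01'),
        (11, 1, 'new',  '2020-02-01'),
        (12, 2, 'only', '2020-01-15');
""")

# rn = 1 marks the most recently updated table_b row per table_a_id.
rows = conn.execute("""
    SELECT a.id, b.id, b.name, b.updated
    FROM table_a AS a
    LEFT JOIN (
        SELECT t.*,
               row_number() OVER (PARTITION BY table_a_id
                                  ORDER BY updated DESC) AS rn
        FROM table_b t
    ) AS b ON a.id = b.table_a_id AND b.rn = 1
    ORDER BY a.id
""").fetchall()

for row in rows:
    print(row)
```

Because rn = 1 is part of the ON clause and not a WHERE clause, table_a rows without any table_b row (id 3 here) are still returned with NULLs.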
You can use Postgres's distinct on syntax:
select a.id, b.*
from table_a as a left join
(select distinct on (table_a_id) table_a_id, . . .
from table_b
order by table_a_id, updated desc
) b
on a.id = b.table_a_id
Where the . . . is, you should put in the columns that you want.

SQL query to find record with ID not in another table

I have two tables that share a primary key, and I want to find the rows of one that have no counterpart in the other. For example,
Table1 has columns (ID, Name) and sample data: (1 ,John), (2, Peter), (3, Mary)
Table2 has columns (ID, Address) and sample data: (1, address2), (2, address2)
How do I write a SQL query that fetches the rows of Table1 whose ID is not in Table2? In this case, (3, Mary) should be returned.
PS: The ID is the primary key for those two tables.
Try this
SELECT ID, Name
FROM Table1
WHERE ID NOT IN (SELECT ID FROM Table2)
Use LEFT JOIN
SELECT a.*
FROM table1 a
LEFT JOIN table2 b
on a.ID = b.ID
WHERE b.id IS NULL
There are basically 3 approaches to that: not exists, not in and left join / is null.
LEFT JOIN with IS NULL
SELECT l.*
FROM t_left l
LEFT JOIN
t_right r
ON r.value = l.value
WHERE r.value IS NULL
NOT IN
SELECT l.*
FROM t_left l
WHERE l.value NOT IN
(
SELECT value
FROM t_right r
)
NOT EXISTS
SELECT l.*
FROM t_left l
WHERE NOT EXISTS
(
SELECT NULL
FROM t_right r
WHERE r.value = l.value
)
Which one is better? The answer is best broken down by RDBMS vendor. Generally speaking, be careful with select ... where ... in (select ...) when the number of rows returned by the subquery is unknown, and keep in mind that NOT IN returns no rows at all if the subquery yields a NULL. Some vendors also limit the size of a literal IN list; Oracle, for example, allows at most 1,000 expressions in one (that limit applies to literal lists, not to subqueries). The best thing to do is try all three and compare the execution plans.
Specifically for PostgreSQL, the execution plans of NOT EXISTS and LEFT JOIN / IS NULL are the same. I personally prefer the NOT EXISTS option because it shows the intent better: after all, you want the records in A whose pk does not exist in B.
Old but still gold, specific to PostgreSQL though: https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/
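The three approaches can be compared side by side on the question's sample data. The sketch below uses Python's sqlite3 module; the refs table with a nullable ID is a hypothetical addition, there only to show how NOT IN behaves when the subquery returns a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Table2 (ID INTEGER PRIMARY KEY, Address TEXT);
    INSERT INTO Table1 VALUES (1, 'John'), (2, 'Peter'), (3, 'Mary');
    INSERT INTO Table2 VALUES (1, 'address2'), (2, 'address2');
    -- Hypothetical nullable column, only to show the NOT IN pitfall
    CREATE TABLE refs (ID INTEGER);
    INSERT INTO refs VALUES (1), (NULL);
""")

queries = {
    "left join / is null": """
        SELECT a.ID, a.Name FROM Table1 a
        LEFT JOIN Table2 b ON a.ID = b.ID
        WHERE b.ID IS NULL""",
    "not in": """
        SELECT ID, Name FROM Table1
        WHERE ID NOT IN (SELECT ID FROM Table2)""",
    "not exists": """
        SELECT a.ID, a.Name FROM Table1 a
        WHERE NOT EXISTS (SELECT NULL FROM Table2 b WHERE b.ID = a.ID)""",
}
results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}

# A NULL in the subquery makes NOT IN unknown for every non-matching row.
not_in_with_null = conn.execute(
    "SELECT ID FROM Table1 WHERE ID NOT IN (SELECT ID FROM refs)"
).fetchall()
not_exists_with_null = conn.execute(
    "SELECT ID FROM Table1 a WHERE NOT EXISTS "
    "(SELECT NULL FROM refs r WHERE r.ID = a.ID) ORDER BY ID"
).fetchall()

print(results)
print(not_in_with_null, not_exists_with_null)
```

On primary-key columns all three forms agree; against the nullable refs table, NOT IN comes back empty while NOT EXISTS still returns ids 2 and 3.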
Fast Alternative
I ran some tests (on Postgres 9.5) using two tables with ~2M rows each. The query below performed at least 5x better than the other queries proposed:
-- Count
SELECT count(*) FROM (
(SELECT id FROM table1) EXCEPT (SELECT id FROM table2)
) t1_not_in_t2;
-- Get full row
SELECT table1.* FROM (
(SELECT id FROM table1) EXCEPT (SELECT id FROM table2)
) t1_not_in_t2 JOIN table1 ON t1_not_in_t2.id=table1.id;
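For reference, the same EXCEPT pattern can be tried on a toy dataset with Python's sqlite3 module. This is a sketch under assumptions: the rows are borrowed from the question, SQLite does not accept the parentheses around each SELECT so they are dropped, and nothing here reproduces the timing claim above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, address TEXT);
    INSERT INTO table1 VALUES (1, 'John'), (2, 'Peter'), (3, 'Mary');
    INSERT INTO table2 VALUES (1, 'address2'), (2, 'address2');
""")

# Count of ids present in table1 but not in table2.
count = conn.execute("""
    SELECT count(*) FROM (
        SELECT id FROM table1 EXCEPT SELECT id FROM table2
    ) t1_not_in_t2
""").fetchone()[0]

# Join back to table1 to get the full rows for those ids.
full_rows = conn.execute("""
    SELECT table1.* FROM (
        SELECT id FROM table1 EXCEPT SELECT id FROM table2
    ) t1_not_in_t2
    JOIN table1 ON t1_not_in_t2.id = table1.id
""").fetchall()

print(count, full_rows)  # 1 [(3, 'Mary')]
```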
Keeping in mind the points made in @John Woo's comment/link above, this is how I typically would handle it:
SELECT t1.ID, t1.Name
FROM Table1 t1
WHERE NOT EXISTS (
SELECT TOP 1 NULL
FROM Table2 t2
WHERE t1.ID = t2.ID
)
SELECT COUNT(ID) FROM tblA a
WHERE a.ID NOT IN (SELECT b.ID FROM tblB b) --For count
SELECT ID FROM tblA a
WHERE a.ID NOT IN (SELECT b.ID FROM tblB b) --For results

SQL Select Distinct with Conditional

Table1 has columns (id, a, b, c, group). There are several rows that have the same group, but id is always unique. I would like to SELECT group,a,b FROM Table1 WHERE the group is distinct. However, I would like the returned data to be from the row with the greatest id for that group.
Thus, if we have the rows
(id=10, a=6, b=40, c=3, group=14)
(id=5, a=21, b=45, c=31, group=230)
(id=4, a=42, b=65, c=2, group=230)
I would like to return these 2 rows:
[group=14, a=6,b=40] and
[group=230, a=21,b=45] (because id=5 > id=4)
Is there a simple SELECT statement to do this?
Try:
select grp, a, b
from table1 where id in
(select max(id) from table1 group by grp)
You can do it using a self join or an inner-select. Here's inner select:
select `group`, a, b from Table1 AS T1
where id=(select max(id) from Table1 AS T2 where T1.`group` = T2.`group`)
And self-join method:
select T1.`group`, T2.a, T2.b from
(select max(id) as id,`group` from Table1 group by `group`) T1
join Table1 as T2 on T1.id=T2.id
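The self-join method can be checked against the question's three sample rows. The sketch below uses Python's sqlite3 module and names the column grp (as one of the other answers does) so the reserved word group needs no quoting; both choices are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER,
                         c INTEGER, grp INTEGER);
    INSERT INTO table1 VALUES
        (10, 6, 40, 3, 14),
        (5, 21, 45, 31, 230),
        (4, 42, 65, 2, 230);
""")

# The derived table m holds the greatest id per group; joining back on id
# fetches the a and b values from exactly that row.
rows = conn.execute("""
    SELECT t.grp, t.a, t.b
    FROM table1 AS t
    JOIN (SELECT grp, max(id) AS id FROM table1 GROUP BY grp) AS m
      ON t.id = m.id
    ORDER BY t.grp
""").fetchall()

print(rows)  # [(14, 6, 40), (230, 21, 45)]
```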
Use two selects. Your inner select gets the greatest id of each group:
SELECT MAX(id) AS id FROM YourTable GROUP BY [GROUP]
Your outer select joins to this result.
Think about it logically: the inner select gets the subset of ids you need.
The outer select inner joins to this subset and can fetch further columns.
SELECT [group], a, b FROM YourTable INNER JOIN
(SELECT MAX(id) AS id FROM YourTable GROUP BY [GROUP]) t
ON t.id = YourTable.id
SELECT mi.*
FROM (
SELECT DISTINCT grouper
FROM mytable
) md
JOIN mytable mi
ON mi.id =
(
SELECT id
FROM mytable mo
WHERE mo.grouper = md.grouper
ORDER BY
id DESC
LIMIT 1
)
If your table is MyISAM or id is not a PRIMARY KEY, then make sure you have a composite index on (grouper, id).
If your table is InnoDB and id is a PRIMARY KEY, then a simple index on grouper will suffice (id, being a PRIMARY KEY, will be implicitly included).
This will use an INDEX FOR GROUP-BY to build the list of distinct groupers, and for each grouper it will use the index access to find the maximal id.
I don't know how to do it in MySQL, but the following code will work for MSSQL:
SELECT Y.* FROM
(
SELECT DISTINCT [group], MAX(id) ID
FROM Table1
GROUP BY [group]
) X
INNER JOIN Table1 Y ON X.ID=Y.ID