Optimalization of select containing Union - sql

I have and easy select:
define account_id = 7
select * from A where ACCOUNT_ID = &account_id
UNION
select * from B where ACCOUNT_ID = &account_id;
I would like to have account_id as input from another select and I did it this way:
select * from A where ACCOUNT_ID in(select accound_id from ACCOUNTS where EMAIL like 'aa#aa.com') -- id 7 returned
UNION
select * from B where ACCOUNT_ID in(select accound_id from ACCOUNTS where EMAIL like 'aa#aa.com')
How could be this optimalized to call select accound_id from ACCOUNTS where EMAIL like 'aa#aa.com' only once?

My first question is whether the union can be replaced by the union all. So, my first attempt would be to use exists and union all:
select a.*
from a
where exists (select 1
from accounts aa
where aa.account_id = a.account_id and
aa.email = 'aa#aa.com'
)
union all
select b.*
from b
where exists (select 1
from accounts aa
where aa.account_id = b.account_id and
aa.email = 'aa#aa.com'
);
For this structure, you want an index on accounts(account_id, email). The exists simply looks up the values in the index. This does require scanning a and b.
If the query is returning a handful of rows and you want to remove duplicates, then union and replace union all. If it is returning a large set of rows -- and there are not duplicates in each table and there is an easy way to identify the duplicates -- then you can instead do:
with cte_a as (
select a.*
from a
where exists (select 1
from accounts aa
where aa.account_id = a.account_id and
aa.email = 'aa#aa.com'
)
)
select cte_a.*
from ctea_a
union all
select b.*
from b
where exists (select 1
from accounts aa
where aa.account_id = b.account_id and
aa.email = 'aa#aa.com'
) and
not exists (select 1
from cte_a
where cte_a.? = b.? -- whatever is needed to identify duplicates
);

This is where WITH comes in handy
WITH ids AS (select account_id from ACCOUNTS where EMAIL like 'aa#aa.com')
select * from A where ACCOUNT_ID in ids
UNION ALL
select * from B where ACCOUNT_ID in ids;
I also changed it to UNION ALL, because it's much faster.

Related

SQL Server - Exclude Records from other tables

I used the search function which brought me to the following solution.
Starting Point is the following: I have one table A which stores all data.
From that table I select a certain amount of records and store it in table B.
In a new statement I want to select new records from table A that do not appear in table B and store them in table c. I tried to solve this with a AND ... NOT IN statement.
But I still receive records in table C that are in table B.
Important: I can only work with select statements, each statement needs to start with select as well.
Does anybody have an idea where the problem in the following statement could be:
Select *
From
(Select TOP 10000 *
FROM [table_A]
WHERE Email like '%#domain_A%'
AND Id NOT IN (SELECT Id
FROM [table_B]))
Union
(Select TOP 7500 *
FROM table_A]
WHERE Email like '%#domain_B%'
AND Id NOT IN (SELECT Id
FROM [table_B]))
Union
(SELECT TOP 5000 *
FROM [table_A]
WHERE Email like '%#domain_C%'
AND Id NOT IN (SELECT Id
FROM [table_B]))
Try NOT EXISTS instead of NOT IN
SELECT
*
FROM TableA A
WHERE NOT EXISTS
(
SELECT 1 FROM TableB WHERE Id = A.Id
)
So Basically the idea here is to select everything from table A that doesnt exists in table B and Insert all that into Table C?
INSERT INTO Table_C
SELECT a.colum1, a.column2,......
FROM [table_A]
LEFT JOIN [table_B] ON a.id = b.ID
WHERE a.Email like '%#domain_A%' AND b.id IS NULL
Thank you guys all for your feedback, from which I learned a lot.
I was able to fix the statement with your help. Above is the statement which is working now with the desired results:
Select Id
From
(Select TOP 10000 * FROM Table_A
WHERE Email like '%#domain_a%'
AND Id NOT IN (SELECT Id
FROM Table_B)
order by No desc) t1
Union
Select Id
From
(Select TOP 7500 * FROM Table_A
WHERE Email like '%#domain_b%'
AND Id NOT IN (SELECT Id
FROM Table_B)
order by No desc) t2
Union
Select Id
From
(SELECT TOP 5000 * FROM Table_A
WHERE Email like '%#domain_c%'
AND Id NOT IN (SELECT Id
FROM Table_B)
order by No desc) t3

selecting incremental data from multiple tables in Hive

I have five tables(A,B,C,D,E) in Hive database and I have to union the data from these tables based on logic over column "id".
The condition is :
Select * from A
UNION
select * from B (except ids not in A)
UNION
select * from C (except ids not in A and B)
UNION
select * from D(except ids not in A,B and C)
UNION
select * from E(except ids not in A,B,C and D)
Have to insert this data into final table.
One way is to create a the target table (target)and append it with data for each UNION stage and then using this table for joining with the other UNION stage.
This would be the part of my .hql file :
insert into target
(select * from A
UNION
select B.* from
A
RIGHT OUTER JOIN B
on A.id=B.id
where ISNULL(A.id));
INSERT INTO target
select C.* from
target
RIGHT outer JOIN C
ON target.id=C.id
where ISNULL(target.id);
INSERT INTO target
select D.* from
target
RIGHT OUTER JOIN D
ON target.id=D.id
where ISNULL(target.id);
INSERT INTO target
select E.* from
target
RIGHT OUTER JOIN E
ON target.id=E.id
where ISNULL(target.id);
Is there a better to make this happen ? I assume we anyway have to do the
multiple joins/lookups .I am looking forward for best approach to achieve this
in
1) Hive with Tez
2) Spark-sql
Many Thanks in advance
If id is unique within each table, then row_number can be used instead of rank.
select *
from (select *
,rank () over
(
partition by id
order by src
) as rnk
from (
select 1 as src,* from a
union all select 2 as src,* from b
union all select 3 as src,* from c
union all select 4 as src,* from d
union all select 5 as src,* from e
) t
) t
where rnk = 1
;
I think I would try to do this as:
with ids as (
select id, min(which) as which
from (select id, 1 as which from a union all
select id, 2 as which from b union all
select id, 3 as which from c union all
select id, 4 as which from d union all
select id, 5 as which from e
) x
)
select a.*
from a join ids on a.id = ids.id and ids.which = 1
union all
select b.*
from b join ids on b.id = ids.id and ids.which = 2
union all
select c.*
from c join ids on c.id = ids.id and ids.which = 3
union all
select d.*
from d join ids on d.id = ids.id and ids.which = 4
union all
select e.*
from e join ids on e.id = ids.id and ids.which = 5;

Value present in more than one table

I have 3 tables. All of them have a column - id. I want to find if there is any value that is common across the tables. Assuming that the tables are named a.b and c, if id value 3 is present is a and b, there is a problem. The query can/should exit at the first such occurrence. There is no need to probe further. What I have now is something like
( select id from a intersect select id from b )
union
( select id from b intersect select id from c )
union
( select id from a intersect select id from c )
Obviously, this is not very efficient. Database is PostgreSQL, version 9.0
id is not unique in the individual tables. It is OK to have duplicates in the same table. But if a value is present in just 2 of the 3 tables, that also needs to be flagged and there is no need to check for existence in he third table, or check if there are more such values. One value, present in more than one table, and I can stop.
Although id is not unique within any given table, it should be unique across the tables; a union of distinct id should be unique, so:
select id from (
select distinct id from a
union all
select distinct id from b
union all
select distinct id from c) x
group by id
having count(*) > 1
Note the use of union all, which preserves duplicates (plain union removes duplicates).
I would suggest a simple join:
select a.id
from a join
b
on a.id = b.id join
c
on a.id = c.id
limit 1;
If you have a query that uses union or group by (or order by, but that is not relevant here), then you need to process all the data before returning a single row. A join can start returning rows as soon as the first values are found.
An alternative, but similar method is:
select a.id
from a
where exists (select 1 from b where a.id = b.id) and
exists (select 1 from c where a.id = c.id);
If a is the smallest table and id is indexes in b and c, then this could be quite fast.
Try this
select id from
(
select distinct id, 1 as t from a
union all
select distinct id, 2 as t from b
union all
select distinct id, 3 as t from c
) as t
group by id having count(t)=3
It is OK to have duplicates in the same table.
The query can/should exit at the first such occurrence.
SELECT 'OMG!' AS danger_bill_robinson
WHERE EXISTS (SELECT 1
FROM a,b,c -- maybe there is a place for old-style joins ...
WHERE a.id = b.id
OR a.id = c.id
OR c.id = b.id
);
Update: it appears the optimiser does not like carthesian joins with 3 OR conditions. The below query is a bit faster:
SELECT 'WTF!' AS danger_bill_robinson
WHERE exists (select 1 from a JOIN b USING (id))
OR exists (select 1 from a JOIN c USING (id))
OR exists (select 1 from c JOIN b USING (id))
;

Logical AND between table elements in T-SQL

I have n tables all with the same fields: Username and Value. The same Username can have multiple registers on each table but the combination Username/Value is unique on each one.
I want to join the tables into a single one which contains all the users who appear on all the tables with all the different (Username/Value) pairs.
Example
Table A: {(User1,Value1);(User1,Value2);(User2,Value2);(User3,Value4)]
Table B: {(User1,Value4);(User3,Value5)]
Table C: {(User1,Value5);(User1,Value2);(User2,Value7);(User3,Value8)]
Desired output
Table D: {(User1,Value1);(User1,Value2);(User1,Value4);(User1,Value5);(User3,Value4);(User3,Value5);(User3,Value8)}
Now I'm doing multiple joins (using perl) like this
SELECT *
INTO $target_table
FROM (SELECT *
FROM $table1
WHERE bname IN (SELECT DISTINCT bname FROM $table2)
UNION
SELECT *
FROM $table2
WHERE bname IN (SELECT DISTINCT bname FROM $table1)
) UN
and then doing the same join between a third table and target_table and so on, but I think it should be a better way.
Any hints?
You can use UNION for this:
SELECT username, value
FROM $table1
UNION
SELECT username, value
FROM $table2
...
SELECT username, value
FROM $tablex
SQL Fiddle Demo
This will return you distinct records. If you are interested in duplicates, use UNION ALL.
Given your edits, it appears you only want to return records if the user is in all the tables.
Breaking that down, you need to do a few things. First, combine all your records together again, but this time denote which table each are coming from. Then you need to know the count of tables each user is in. Finally you need to check that number against the overall number of tables.
Here's one way using a few CTEs:
WITH CTE AS (
SELECT username, value, 1 AS tbl
FROM t1
UNION
SELECT username, value, 2 AS tbl
FROM t2
UNION
SELECT username, value, 3 AS tbl
FROM t3
),
CTECnt AS (
SELECT username, COUNT(DISTINCT tbl) tblCnt
FROM CTE
GROUP BY username
),
CTEMaxCnt AS (
SELECT COUNT(DISTINCT tbl) MaxCnt
FROM CTE
)
SELECT C.username, C.value
FROM CTE C
JOIN CTECnt C2 ON C.username = C2.username
JOIN CTEMaxCnt C3 ON C2.tblCnt = C3.MaxCnt
Another SQL Fiddle Demo
With Combined As
(
Select 'A' As TableName, Username, Value
From TableA
Union All
Select 'B', Username, Value
From TableB
Union All
Select 'C', Username, Value
From TableC
)
Select C.Username, C.Value
From Combined As C
Join (
Select C1.Username
From Combined As C1
Group By C1.Username
Having Count(Distinct C1.TableName) = 3
) As Z
On Z.Username = C.Username
Group By C.Username, C.Value
SQL Fiddle version

SQL query to merge 2 tables with additional conditions?

I have 2 identical tables: user_id, name, age, date_added.
USER_ID column may contain multiple duplicate IDs.
Need to merge those 2 tables into 1 with the following condition.
If there are multiple records with identical 'name' for the same user then need to keep only the LATEST (by date_added) record.
This script will be used with MSSQL 2005, but would also appreciate if somebody comes up with version that does not use ROW_NUMBER(). Need this script to reload a broken table once, performance is not critical.
example:
table1:
1,'john',21,01/01/2010
1,'john',15,01/01/2005
1,'john',71,01/01/2001
table2:
1,'john',81,01/01/2007
1,'john',15,01/01/2005
1,'john',11,01/01/2008
result:
1,'john',21,01/01/2010
UPDATE:
I think that I've found my own solution. It is based on an answer for my previous question given by Larry Lustig and Joe Stefanelli.
with tmp2 as
(
SELECT * FROM table1
UNION
SELECT * FROM table2
)
SELECT * FROM tmp2 c1
WHERE (SELECT COUNT(*) FROM tmp2 c2
WHERE c2.user_id = c1.user_id AND
c2.name = c1.name AND
c2.date_added >= c1.date_added) <= 1
Could you please help me to convert this query to the one without 'WITH' clause?
Here's a variant of #Andomar's answer:
; with all_users as
(
select *
from table1 u1
union all
select *
from table2 u2
)
, ranker as (
select *,
rank() over (partition by userid order by recordtime) as [r]
)
select * from ranker where [r] = 1
Just in the interests of giving a different approach...
WITH distinctlist
As (SELECT user_id,
name
FROM table1
UNION
SELECT user_id,
name
FROM table2)
SELECT C.*
FROM distinctlist d
CROSS APPLY (SELECT TOP 1 *
FROM (SELECT TOP 1 *
FROM table1
WHERE user_id = d.user_id
AND name = d.name
ORDER BY date_added DESC
UNION ALL
SELECT TOP 1 *
FROM table1
WHERE user_id = d.user_id
AND name = d.name
ORDER BY date_added DESC) T
ORDER BY date_added DESC) C
You could use not exists, like:
; with all_users as
(
select *
from table1 u1
union all
select *
from table2 u2
)
select *
from all_users u1
where not exists
(
select *
from all_users u2
where u1.name = u2.name
and u1.record_time < u2.record_time
)
If the database doesn't support CTE's, expand all_users in the two places it is used.
P.S. If there are only three columns, and no more, you could use an even simpler solution:
select name
, MAX(record_time)
from (
select *
from table1 u1
union all
select *
from table2 u2
) sub
group by
name