Query where two columns are in the result of nested query - sql

I'm writing a query like this:
select * from myTable where X in (select X from Y) and XX in (select X from Y)
Values from columns X and XX both have to be in the result of the same query: select X from Y.
I think this query is executed twice, so it's senseless. Is there a more efficient way to write this query? Maybe a temp table?

Actually no, there isn't a smarter way to write this (without visiting Y twice), given that the Y.X values that myTable.X and myTable.XX match may not come from the same row.
As an alternative, the EXISTS form of the query is
select *
from myTable A
where exists (select * from Y where A.X = Y.X)
and exists (select * from Y where A.XX = Y.X)
If Y contains X values of 1,2,3,4,5, and myTable.X = 2 and myTable.XX = 4, they both exist (on different records in Y) and the record from myTable should be shown in the output.
EDIT: This answer previously stated that you could rewrite this using _EXISTS_ clauses, which would work faster than _IN_. As Martin has pointed out, this is not true (certainly not for SQL Server 2005 and above). See links
http://explainextended.com/2009/06/16/in-vs-join-vs-exists/
http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/
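To check that the IN and EXISTS forms really select the same rows, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. The sample rows and the id column are assumptions made up for the illustration; only the two queries come from the question and the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Y (X INTEGER);
    INSERT INTO Y VALUES (1), (2), (3), (4), (5);
    -- id is a hypothetical key added only to make the output readable
    CREATE TABLE myTable (id INTEGER, X INTEGER, XX INTEGER);
    INSERT INTO myTable VALUES (1, 2, 4), (2, 2, 9), (3, 7, 4);
""")

# Original form: two IN subqueries against Y
in_rows = conn.execute("""
    SELECT id FROM myTable
    WHERE X IN (SELECT X FROM Y) AND XX IN (SELECT X FROM Y)
    ORDER BY id
""").fetchall()

# EXISTS form: two correlated subqueries against Y
exists_rows = conn.execute("""
    SELECT id FROM myTable A
    WHERE EXISTS (SELECT * FROM Y WHERE A.X = Y.X)
      AND EXISTS (SELECT * FROM Y WHERE A.XX = Y.X)
    ORDER BY id
""").fetchall()

print(in_rows, exists_rows)
```

Only the row with X = 2 and XX = 4 qualifies in both forms, even though 2 and 4 sit on different rows of Y.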

It will probably not be particularly efficient to try to write this query by only referencing Y once. However, given that you are using SQL Server 2008, there are variations that can be used:
Select ...
From MyTable As T
Where Exists (
Select 1
From Y
Where Y.X = T.X
Intersect
Select 1
From Y
Where Y.X = T.XX
)
Addition
Actually, I can think of a way you could do it without using Y more than once (Nothing was said about using MyTable more than once). However, this is more for academic reasons as I think that using my first solution will likely perform better:
Select ...
From MyTable As T
Where Exists (
Select 1
From Y
Where Exists(
Select 1
From MyTable As T1
Where T1.X = Y.X
Intersect
Select 1
From MyTable As T2
Where T2.XX = Y.X
)
And Y.X In(T.X, T.XX)
)

WITH
w_tmp AS(
SELECT x
FROM y
)
SELECT *
FROM myTable
WHERE x IN (SELECT x FROM w_tmp)
AND xx IN (SELECT x FROM w_tmp)
(I've read this in the Oracle docs, but I think MS SQL Server is able to do this optimization too.)
This way the optimizer knows for sure that you are running the same query and can create a temporary table to cache the results (but it is still up to the optimizer to decide whether it is worth it; for tiny queries, the overhead of creating a temp table can be too high).
Also (and this is actually far more important to me), when the subquery is 50 lines long, it is easier for a human to see that the same thing is used in both places, much like factoring long functions into subroutines.
Docs on MSDN

Not sure what the problem is, but isn't a simple JOIN the answer?
SELECT t.*
FROM myTable t
JOIN Y y1 ON y1.X = t.X
JOIN Y y2 ON y2.X = t.XX
or
SELECT t.*
FROM myTable t, Y y1, Y y2
WHERE y1.X = t.X AND y2.X = t.XX
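One thing worth checking before swapping IN for a JOIN: if Y.X is not unique, the join repeats the myTable row once per matching Y row, while IN returns it once. A small sketch with Python's sqlite3 module (the data and the id column are assumptions for the illustration, and SQLite stands in for SQL Server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Y (X INTEGER);
    INSERT INTO Y VALUES (2), (2), (4);   -- the value 2 appears twice
    CREATE TABLE myTable (id INTEGER, X INTEGER, XX INTEGER);
    INSERT INTO myTable VALUES (1, 2, 4);
""")

# JOIN form: one output row per combination of matching Y rows
join_rows = conn.execute("""
    SELECT t.id
    FROM myTable t
    JOIN Y y1 ON y1.X = t.X
    JOIN Y y2 ON y2.X = t.XX
""").fetchall()

# IN form: each myTable row appears at most once
in_rows = conn.execute("""
    SELECT id FROM myTable
    WHERE X IN (SELECT X FROM Y) AND XX IN (SELECT X FROM Y)
""").fetchall()

print(join_rows, in_rows)
```

If Y.X can contain duplicates, the JOIN version would need DISTINCT (or a de-duplicated Y) to match the IN version.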
ADDED: if there is a strong need to eliminate a second query for Y, let's reverse the logic:
;WITH A(X)
AS (
-- this will select all values that can be found in Y and myTable X and XX fields.
SELECT Y.X -- if there are a lot of dups, add DISTINCT
FROM Y, myTable
WHERE Y.X IN (myTable.X, myTable.XX)
)
-- now join back to the original table and filter.
SELECT myTable.*
FROM myTable
-- similar to what has been mentioned before
WHERE EXISTS(SELECT TOP 1 * from A where A.X = myTable.X)
AND EXISTS(SELECT TOP 1 * from A where A.X = myTable.XX)
If you don't like WITH, you can use SELECT ... INTO to store the result in a temporary table instead.

Related

How to speed up sql query execution?

The task is to execute the sql query:
select * from x where user in (select user from x where id = '1')
The subquery returns about 1000 ids, so it takes a long time.
Maybe this question has been asked before, but how can I speed it up? (If it is possible to speed it up, please write it for PL/SQL and T-SQL, or at least one of them.)
I would start by rewriting the in condition to exists:
select *
from x
where exists (select 1 from x x1 where x1.user = x.user and x1.id = 1)
Then, consider an index on x(user, id) - or x(id, user) (you can try both and see if one offers a better improvement than the other).
Another possibility is to use window functions:
select *
from (
select x.*, max(case when id = 1 then 1 else 0 end) over(partition by user) flag
from x
) x
where flag = 1
This might, or might not, perform better than the exists solution, depending on various factors.
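For the EXISTS rewrite to match the original IN query, the subquery has to be correlated on the inner alias (x1.user against the outer x.user). A minimal sketch with Python's sqlite3 module, using made-up rows, to confirm the two forms agree; it says nothing about relative speed on PL/SQL or T-SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "user" is quoted because it is a reserved word in some databases
    CREATE TABLE x (id INTEGER, "user" TEXT);
    INSERT INTO x VALUES (1, 'a'), (2, 'a'), (3, 'b');
""")

in_rows = conn.execute("""
    SELECT id FROM x
    WHERE "user" IN (SELECT "user" FROM x WHERE id = 1)
    ORDER BY id
""").fetchall()

exists_rows = conn.execute("""
    SELECT id FROM x
    WHERE EXISTS (SELECT 1 FROM x x1
                  WHERE x1."user" = x."user" AND x1.id = 1)
    ORDER BY id
""").fetchall()

print(in_rows, exists_rows)
```

Both return the rows of user 'a' (ids 1 and 2), because that is the user attached to id 1.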
ids are usually unique. Is it sufficient to do this?
select x.*
from x
where id in ( . . . );
You would want an index on id, if it is not already the primary key of the table.

Shorthand to compare multiple columns against same condition in SQL Server?

Does a shorthand exist that allows you to compare multiple columns against the same condition in the WHERE clause?
SELECT *
FROM [Table]
WHERE [Date1] BETWEEN x AND y
OR [Date2] BETWEEN x AND y
OR [Date3] BETWEEN x and y
OR [Date4] BETWEEN x and y
It's not the end of the world to copy and paste this condition and replace [Date x] with each column, but it sure isn't fun.
You can also write the query like this (in SQL Server 2008 or later):
SELECT * FROM [Table]
WHERE EXISTS (
SELECT *
FROM (VALUES (Date1),(Date2),(Date3),(Date4)) v (TheDate)
WHERE TheDate BETWEEN x AND y
)
However, I don't see any benefit in doing so (in terms of performance or readability).
Of course, things would be different if you need to write Date1=x OR Date2=x OR Date3=x OR Date4=x, because in this case you can simply write x IN (Date1, Date2, Date3, Date4).
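The equality shorthand can be tried out quickly. This is a sketch using Python's sqlite3 module instead of SQL Server; the table name Events, the id column and the dates are invented for the illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Events (id INTEGER, Date1 TEXT, Date2 TEXT, Date3 TEXT, Date4 TEXT);
    INSERT INTO Events VALUES
        (1, '2020-01-01', '2020-01-10', '2020-01-15', '2020-02-01'),
        (2, '2019-05-01', '2019-06-01', '2019-07-01', '2019-08-01');
""")

target = "2020-01-15"

# Long form: one equality test per column
or_rows = conn.execute("""
    SELECT id FROM Events
    WHERE Date1 = ? OR Date2 = ? OR Date3 = ? OR Date4 = ?
""", (target,) * 4).fetchall()

# Shorthand: the value on the left, the columns in the IN list
in_rows = conn.execute("""
    SELECT id FROM Events
    WHERE ? IN (Date1, Date2, Date3, Date4)
""", (target,)).fetchall()

print(or_rows, in_rows)
```

Both queries return only the first row, since its Date3 equals the target.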
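You could use cross apply and values, but the result is even more cumbersome than the code you have right now. Note also that it tests whether the span from the earliest to the latest of the four dates overlaps [x, y], which is not quite the same as one of the dates falling inside that range: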
SELECT *
FROM [Table]
CROSS APPLY
(
SELECT MIN([Date]) As MinDate,
MAX([Date]) As MaxDate
FROM (VALUES ([Date1]), ([Date2]), ([Date3]), ([Date4])) VALS([Date])
) AS D -- the derived table needs an alias in SQL Server
WHERE MinDate <= y
AND MaxDate >= x
AND x <= y
With that being said, I agree with Sean Lange's comment - Seems like the table structure is ill-designed and all these dates values should be in a different table, referenced by this table with a one-to-many relationship.

Why does RANDOM() in a SQLite CTE JOIN behave differently to other RDBMSs?

RANDOM() values in a Common Table Expression (CTE) join aren't behaving as expected in SQLite.
SQL:
WITH
tbl1(n) AS (SELECT 1 UNION ALL SELECT 2),
tbl2(n, r) AS (SELECT n, RANDOM() FROM tbl1)
SELECT * FROM tbl2 t1 CROSS JOIN tbl2 t2;
Sample SQLite results:
n r n r
1 7058971975145008000 1 8874103142384122000
1 1383551786055205600 2 8456124381892735000
2 2646187515714600000 1 7558324128446983000
2 -1529979429149869800 2 7003770339419606000
The random numbers in each column are all different. But a CROSS JOIN repeats rows - so I expected 2 pairs of the same number in each column - which is the case in PostgreSQL, Oracle 11g and SQL Server 2014 (when using a row-based seed).
Sample PostgreSQL / Oracle 11g / SQL Server 2014 results:
n r n r
1 0.117551110684872 1 0.117551110684872
1 0.117551110684872 2 0.221985165029764
2 0.221985165029764 1 0.117551110684872
2 0.221985165029764 2 0.221985165029764
Questions
Can the behaviour in SQLite be explained? Is it a bug?
Is there a way for Table B in a CTE (based on Table A in the same CTE) to have an additional column of randomly generated numbers, which will remain fixed when used in a JOIN?
Your question is rather long and rambling -- not a single question. But, it is interesting and I learned something.
This statement is not true:
SQL Server assigns a random seed to the RAND() function: When used in
a SELECT, it is only seeded once rather than for each row.
SQL Server has the concept of run-time constant functions. These are functions that are pulled from the compiled query and executed once per expression at the beginning of the query. The most prominent examples are getdate() (and related date/time functions) and rand().
You can readily see this if you run:
select rand(), rand()
from (values (1), (2), (3)) v(x);
Each column has the same values, but the values between the columns are different.
Most databases -- including SQLite -- have the more intuitive interpretation of rand()/random(). (As a personal note, a "random" function that returns the same value on each row is highly counter-intuitive.) Each time it is called you get a different value. For SQL Server, you would typically use an expression using newid():
select rand(), rand(), rand(checksum(newid()))
from (values (1), (2), (3)) v(x);
As for your second question, it appears that SQLite materializes recursive CTEs. So this does what you want:
WITH tbl1(n) AS (
SELECT 1 UNION ALL SELECT 2
),
tbl2(n, r) AS (
SELECT n, RANDOM()
FROM tbl1
union all
select *
from tbl2
where 1=0
)
SELECT *
FROM tbl2 t1 CROSS JOIN tbl2 t2;
I have seen no documentation that this is the case, so use at your own risk. Here is a DB-Fiddle.
And, for the record, this seems to work in SQL Server as well. I just learned something!
EDIT:
As suggested in the comment, the materialization may not always happen. It does seem to apply to two references at the same level:
WITH tbl1(n) AS (
SELECT 1 UNION ALL SELECT 2),
tbl2(n, r) AS (
SELECT n, RANDOM()
FROM tbl1
union all
select *
from tbl2
where 1=0
)
SELECT t2a.r, count(*)
FROM tbl2 t2a left JOIN
tbl2 t2b
on t2a.r = t2b.r
GROUP BY t2a.r;
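Since the CTE materialization is undocumented, a more conservative sketch is to write the random values into a temporary table first and join that table with itself. This swaps the CTE for a temp table (not what the question asked for, strictly speaking), but stored values cannot change between references. It is shown here through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# RANDOM() is evaluated once per row when the table is created,
# and the results are stored.
conn.execute("""
    CREATE TEMP TABLE tbl2 AS
    SELECT n, RANDOM() AS r
    FROM (SELECT 1 AS n UNION ALL SELECT 2)
""")

rows = conn.execute("""
    SELECT t1.n, t1.r, t2.n, t2.r
    FROM tbl2 t1 CROSS JOIN tbl2 t2
""").fetchall()

# Collect every random value seen for each n, on either side of the join
seen = {}
for n1, r1, n2, r2 in rows:
    seen.setdefault(n1, set()).add(r1)
    seen.setdefault(n2, set()).add(r2)

for row in rows:
    print(row)
```

Each n ends up paired with exactly one random value across all four joined rows.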

Returning the lowest integer not in a list in SQL

Suppose you have a table T(A) with only positive integers allowed, like:
1,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18
In the above example, the result is 10. We always can use ORDER BY and DISTINCT to sort and remove duplicates. However, to find the lowest integer not in the list, I came up with the following SQL query:
select list.x + 1
from (select x from (select distinct a as x from T order by a)) as list, T
where list.x + 1 not in T limit 1;
My idea is to start a counter at 1 and check whether that counter is in the list: if it is not, return it; otherwise increment and look again. However, I have to start that counter at 1 and then increment. The query works in most cases, but there are some corner cases, such as a missing 1. How can I accomplish that in SQL, or should I go in a completely different direction to solve this problem?
Because SQL works on sets, the intermediate SELECT DISTINCT a AS x FROM t ORDER BY a is redundant.
The basic technique of looking for a gap in a column of integers is to find where the current entry plus 1 does not exist. This requires a self-join of some sort.
Your query is not far off, but I think it can be simplified to:
SELECT MIN(a) + 1
FROM t
WHERE a + 1 NOT IN (SELECT a FROM t)
The NOT IN acts as a sort of self-join. This won't produce anything from an empty table, but should be OK otherwise.
SQL Fiddle
select min(y.a) as a
from
t x
right join
(
select a + 1 as a from t
union
select 1
) y on y.a = x.a
where x.a is null
It will work even in an empty table
SELECT min(t.a) - 1
FROM t
LEFT JOIN t t1 ON t1.a = t.a - 1
WHERE t1.a IS NULL
AND t.a > 1; -- exclude 0
This finds the smallest number greater than 1 whose next-smaller number is not in the table, and returns that missing number.
This works even for a missing 1, as long as 2 is present; if the table starts higher (say 3, 5), it returns 2 rather than 1, and it returns NULL when there is no gap at all. There are multiple answers checking in the opposite direction. All of them would fail with a missing 1.
SQL Fiddle.
You can do the following, although you may also want to define a range - in which case you might need a couple of UNIONs
SELECT x.id+1
FROM my_table x
LEFT
JOIN my_table y
ON x.id+1 = y.id
WHERE y.id IS NULL
ORDER
BY x.id LIMIT 1;
You can always create a table with all of the numbers from 1 to X and then join that table with the table you are comparing. Then just find the TOP value in your SELECT statement that isn't present in the table you are comparing
SELECT TOP 1 table_with_all_numbers.number, table_with_missing_numbers.number
FROM table_with_all_numbers
LEFT JOIN table_with_missing_numbers
ON table_with_missing_numbers.number = table_with_all_numbers.number
WHERE table_with_missing_numbers.number IS NULL
ORDER BY table_with_all_numbers.number ASC;
In SQLite 3.8.3 or later, you can use a recursive common table expression to create a counter.
Here, we stop counting when we find a value not in the table:
WITH RECURSIVE counter(c) AS (
SELECT 1
UNION ALL
SELECT c + 1 FROM counter WHERE c IN t)
SELECT max(c) FROM counter;
(This works for an empty table or a missing 1.)
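Here is a small sketch that runs the recursive counter through Python's sqlite3 module against the list from the question, an empty table, and a table with 1 missing. The helper name lowest_missing is made up for the example, and the IN test is written with an explicit subquery instead of the SQLite-only "IN t" shorthand:

```python
import sqlite3

def lowest_missing(values):
    """Return the lowest positive integer not present in values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    # Count upward from 1; keep going only while the current value is in t
    (result,) = conn.execute("""
        WITH RECURSIVE counter(c) AS (
            SELECT 1
            UNION ALL
            SELECT c + 1 FROM counter WHERE c IN (SELECT a FROM t)
        )
        SELECT max(c) FROM counter
    """).fetchone()
    conn.close()
    return result

question_list = [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18]
print(lowest_missing(question_list))
print(lowest_missing([]))
print(lowest_missing([2, 3]))
```

The counter stops at the first value that is not in the table, so the last value it produced is the answer.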
This query ranks (starting from rank 1) each distinct number in ascending order and selects the lowest rank that's less than its number. If no rank is lower than its number (i.e. there are no gaps in the table) the query returns the max number + 1.
select coalesce(min(number),1) from (
select min(cnt) number
from (
select
number,
(select count(*) from (select distinct number from numbers) b where b.number <= a.number) as cnt
from (select distinct number from numbers) a
) t1 where number > cnt
union
select max(number) + 1 number from numbers
) t1
http://sqlfiddle.com/#!7/720cc/3
Just another method, using EXCEPT this time:
SELECT a + 1 AS missing FROM T
EXCEPT
SELECT a FROM T
ORDER BY missing
LIMIT 1;
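As written, the EXCEPT query only considers candidates of the form a + 1, so it cannot return 1. One possible variation, which is my own assumption and not part of the answer above, is to seed the candidate list with 0 so that 1 is always a candidate. A sketch with Python's sqlite3 module (the helper name is hypothetical):

```python
import sqlite3

def lowest_missing_except(values):
    """Lowest positive integer not in values, via EXCEPT seeded with 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    # Candidates are (a + 1) for every a in t plus the seed 0,
    # minus everything already present in t.
    (result,) = conn.execute("""
        SELECT a + 1 AS missing FROM (SELECT a FROM t UNION SELECT 0)
        EXCEPT
        SELECT a FROM t
        ORDER BY missing
        LIMIT 1
    """).fetchone()
    conn.close()
    return result

print(lowest_missing_except([1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12]))
print(lowest_missing_except([3, 5]))
print(lowest_missing_except([]))
```

With the seed in place there is always at least one candidate left (the maximum plus one), so the query also returns a row for an empty table.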

Nested queries in Hive SQL

I have a database, and I use a query to produce an intermediate table like this:
id a b
xx 1 2
yy 7 11
and I would like to calculate the standard deviations of b for the users who have a < avg(a)
I calculate avg(a) that way and it works fine:
select avg(select a from (query to produce intermediate table)) from table;
But the query:
select stddev_pop(b)
from (query to produce intermediate table)
where a < (select avg(select a
from (query to produce intermediate table))
from table);
Returns me an error, and more precisely, I am told that the "a" from avg(select a from...) is not recognised. This makes me really confused, as it works in the previous query.
I would be grateful if somebody could help.
EDIT:
I stored the result of my query to generate the intermediary table into a temporary table, but still run into the same problem.
The non working query becomes:
select stddev_pop(b) from temp where a < (select avg(a) from temp);
while this works:
select avg(a) from temp;
OK, a colleague helped me to do it. I'll post the answer in case someone runs into the same problem:
select stddev_pop(b)
from temp x
join (select avg(a) as average from temp) y
where x.a < y.average;
Basically hive doesn't do caching of a table as a variable.
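The join trick itself is not specific to Hive: the aggregate becomes a one-row derived table that every row is compared against. Below is a sketch using Python's sqlite3 module. SQLite has no stddev_pop, so the filtered b values are fetched and the population standard deviation is computed with statistics.pstdev; the table name scores and the third row are assumptions added for the illustration.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (id TEXT, a REAL, b REAL);
    INSERT INTO scores VALUES ('xx', 1, 2), ('yy', 7, 11), ('zz', 2, 6);
""")

# Join every row against the single-row average instead of
# using a subquery in the WHERE clause.
filtered_b = [b for (b,) in conn.execute("""
    SELECT x.b
    FROM scores x
    JOIN (SELECT avg(a) AS average FROM scores) y
    WHERE x.a < y.average
    ORDER BY x.b
""")]

result = statistics.pstdev(filtered_b)
print(filtered_b, result)
```

Here avg(a) is 10/3, so the rows with a = 1 and a = 2 pass the filter and the standard deviation is taken over b = 2 and b = 6.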
You likely need to move your parentheses in your WHERE clause. Try this:
select stddev_pop(b)
from (query to produce intermediate table)
where c < ( select avg(a)
from (query to produce intermediate table)
);
And, your question refers to a column c; did you mean a?
UPDATE: I saw a similar question with MySQL today; sorry I don't know Hive. See if this works:
select stddev_pop(b)
from temp
where a < ( select *
from (select avg(a) from temp) x
);
OK, first of all, Hive doesn't support subqueries anywhere other than the FROM clause.
So you can't use a subquery in the WHERE clause; you have to build the intermediate result in the FROM clause and use it from there.
Even if you create a temp table and then refer to it in your WHERE clause, Hive has to run the fetching query again to read it, so that is not supported either.
Bob, I think Hive will not support this:
select stddev_pop(b)
from temp
where a < ( select *
from (select avg(a) from temp) x
);
but this should work:
select stddev_pop(b)
from temp x
join (select avg(a) as average from temp) y
where x.a < y.average;
If we physically create a temp table and store the result of select avg(a) as average from temp in it, then we can refer to that table as well.