Nested SELECT with a WHERE clause in Spark - sql

I have a problem with running a Spark SQL query which uses a nested select with a "where in" clause. In the query below table1 represents a temporary table which comes from a more complicated query. In the end I want to substitute table1 with this query.
select * from (select * from table1) as table2
where (product, price)
in (select product, min(price) from table2 group by product)
The Spark error I get says:
AnalysisException: 'Table or view not found: table2;
How could I possibly change the query to make it work as intended?

The derived table (i.e. (select * from table1) as table2) is not needed. Its alias is only visible to the query level that defines it, so it cannot be used as a table name in the FROM clause of a subquery inside IN or WHERE. You can use a correlated subquery instead:
select t1.*
from table1 t1
where t1.price = (select min(t2.price) from table1 t2 where t2.product = t1.product);
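To see the correlated subquery in action, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module. SQLite is only a stand-in for Spark SQL here, and the sample rows in table1 are made up for illustration.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (product TEXT, price REAL)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("apple", 3.0), ("apple", 2.0), ("pear", 5.0), ("pear", 4.0)],
)

# Correlated subquery: each outer row is compared against the
# minimum price of the same product.
cheapest = conn.execute(
    """
    SELECT t1.product, t1.price
    FROM table1 t1
    WHERE t1.price = (SELECT MIN(t2.price)
                      FROM table1 t2
                      WHERE t2.product = t1.product)
    ORDER BY t1.product
    """
).fetchall()
print(cheapest)
```

Each product keeps only its cheapest row; if two rows tie for the minimum price, both are returned.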

Related

How to write subquery like: column=(select xx from table) in Hive?

I have a scenario, for example:
with tmp as (select name from table1)
select * from table2 b
where b.name=(select max(name) from tmp)
However, Hive can't recognize this syntax, so is there any legal syntax for this?
After searching, I learned that a join can achieve this:
select table2.* from table2
join (select max(name) as name from tmp) t2
where table2.name = t2.name
But I don't want to use a join, as the join will be very slow; I just want to treat the result as a reference value.
In MySQL, for example, you can store the result in a variable:
set @max_date := (select max(date) from some_table);
select * from some_other_table where date > @max_date;
Hive can achieve a similar effect by storing the query result in a shell variable; see "HiveQL: Using query results as variables".
Does Hive support such a feature in SQL mode?
In Hive you can achieve it as below:
select * from table2 b
where b.name=(select max(name) from table1)
Another way:
You can also create a temporary table in Hive, which lets you replicate the query above.
CREATE TEMPORARY TABLE tmp AS SELECT name FROM table1;
SELECT * FROM table2 b WHERE b.name=(SELECT max(name) FROM tmp);
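The temporary-table variant can be tried out quickly with Python's sqlite3 module. This is only a sketch: SQLite stands in for Hive (whose subquery support varies by version), and the rows and the extra note column are invented for the example.

```python
import sqlite3

# Hypothetical data; SQLite is a stand-in for Hive here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (name TEXT)")
conn.execute("CREATE TABLE table2 (name TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)", [("alice",), ("carol",), ("bob",)]
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?)",
    [("bob", "x"), ("carol", "y"), ("dave", "z")],
)

# Materialize the intermediate result once, then use it in a scalar subquery.
conn.execute("CREATE TEMPORARY TABLE tmp AS SELECT name FROM table1")
rows = conn.execute(
    "SELECT * FROM table2 b WHERE b.name = (SELECT MAX(name) FROM tmp)"
).fetchall()
print(rows)
```

The scalar subquery yields a single value (the largest name in tmp), so no explicit join is written in the outer query.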

Can in line views in oracle sql contain "not in" clause in the query?

For example
select *
from t1
inner join (select * from t2 where t2.id not in (select ID from t2 where city="Paris"))
I tried searching Google. There are a lot of examples, but none of them uses NOT IN. Also, no restrictions are specified for an inline view.
Oracle calls subqueries in the FROM clause "inline views".
These are generic SELECT queries. They can contain NOT IN with subqueries. The problem with your query is a lack of ON clause and the use of double quotes for a string constant:
select *
from t1 inner join
     (select *
      from t2
      where t2.id not in (select ID from t2 where city = 'Paris')  -- single quotes for the string constant
     ) t2
     on t1.? = t2.?  -- the ON clause that was missing
Note: I would discourage you from using NOT IN with subqueries, because they do not work as expected if any returned values are NULL. (If that is the case, then no rows are returned.)
I advise using NOT EXISTS instead.
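The NULL pitfall is easy to reproduce. The sketch below uses Python's sqlite3 module as a stand-in for Oracle; the two small tables and their rows are hypothetical, with one NULL id deliberately placed among the 'Paris' rows.

```python
import sqlite3

# Hypothetical tables; SQLite stands in for Oracle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER)")
conn.execute("CREATE TABLE t2 (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?)",
    [(1, "Paris"), (None, "Paris"), (2, "Rome")],
)

# NOT IN: the subquery returns {1, NULL}. "2 NOT IN (1, NULL)" evaluates
# to NULL rather than TRUE, so every row is filtered out.
not_in = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM t1 "
        "WHERE id NOT IN (SELECT id FROM t2 WHERE city = 'Paris') "
        "ORDER BY id"
    )
]

# NOT EXISTS: the NULL row simply never matches, so it causes no harm.
not_exists = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM t1 "
        "WHERE NOT EXISTS (SELECT 1 FROM t2 "
        "                  WHERE t2.city = 'Paris' AND t2.id = t1.id) "
        "ORDER BY id"
    )
]
print(not_in, not_exists)
```

With the NULL present, the NOT IN version returns nothing, while NOT EXISTS returns the ids that have no 'Paris' row.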

WHERE + NOT EXIST + 2 Columns

I have a query that should return all records in T1 that are not linked to records in T2:
SELECT DISTINCT fldID, fldValue FROM T1
WHERE NOT EXISTS
(
SELECT T1.fldID, T1.fldValue
FROM T2
JOIN T1 ON T2.fldID = T1.fldPtr
)
But it returns an empty set; there should be one record.
If I use query like this (clause on one field):
SELECT DISTINCT fldID FROM T1
WHERE fldID NOT IN
(
SELECT T1.fldID
FROM T2
JOIN T1 ON T2.fldID = T1.fldPtr
)
It returns correct result.
But SQL Server does not support the syntax
WHERE ( fldID, fldValue ) NOT IN ....
Please help me figure out how to compose a query that checks several columns.
Thanks!
You can also use EXCEPT for this:
SELECT DISTINCT fldID, fldValue FROM T1
EXCEPT
SELECT T1.fldID, T1.fldValue
FROM T2
JOIN T1 ON T2.fldID = T1.fldPtr
A more efficient and elegant query that will work with every database is:
SELECT T1.*
FROM T1
LEFT JOIN T2
ON T2.fldID = T1.fldPtr
AND T2.flrValue = T1.flrValue
WHERE T2.fldID IS NULL
The LEFT JOIN attempts to match using both criteria, then the WHERE clause filters the joins, and only non-joins have NULL values for the LEFT JOINed table.
This approach is IMHO pretty much the industry standard for finding non-matches. It is usually more efficient than a NOT EXISTS(), although several databases optimize a NOT EXISTS() to this query anyway.
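As a rough illustration of the two-column anti-join, the following sketch runs it with Python's sqlite3 module. SQLite stands in for SQL Server, the rows are invented, and the value column is assumed to be called fldValue in both tables (the thread spells it inconsistently).

```python
import sqlite3

# Hypothetical schema and data; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (fldID INTEGER, fldPtr INTEGER, fldValue TEXT)")
conn.execute("CREATE TABLE T2 (fldID INTEGER, fldValue TEXT)")
conn.executemany(
    "INSERT INTO T1 VALUES (?, ?, ?)",
    [(1, 10, "a"), (2, 20, "b"), (3, 30, "c")],
)
conn.executemany("INSERT INTO T2 VALUES (?, ?)", [(10, "a"), (20, "x")])

# Anti-join: match on both columns, then keep only the T1 rows for which
# the LEFT JOIN found no partner (T2 columns come back as NULL).
unmatched = conn.execute(
    """
    SELECT T1.fldID, T1.fldValue
    FROM T1
    LEFT JOIN T2
      ON T2.fldID = T1.fldPtr
     AND T2.fldValue = T1.fldValue
    WHERE T2.fldID IS NULL
    ORDER BY T1.fldID
    """
).fetchall()
print(unmatched)
```

Row 1 matches on both columns and is dropped; row 2 matches on the pointer but not the value, and row 3 matches nothing, so both are returned.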
Use both of those columns in the subquery join:
SELECT DISTINCT fldID, fldValue FROM T1
WHERE NOT EXISTS
(
SELECT *
FROM T2
JOIN T1 ON T2.fldID = T1.fldPtr
AND T1.fldValue = T2.flrValue
)
Something like (I think, as I'm not sure I 100% understand your question):
SELECT DISTINCT fldID FROM T1
WHERE fldID NOT IN
(
SELECT T1.fldID
FROM T2
JOIN T1 ON T2.fldID = T1.fldPtr
WHERE T2.flrValue = T1.flrValue
)
If you have the same structure in both tables, you can use the EXCEPT operator: http://technet.microsoft.com/en-us/library/ms188055.aspx
In the more general case, you have to use a left join and look for NULL values in the second table.
Try the query below.
select DISTINCT fldID
from Table1
WHERE cast(fldID as varchar(100))+'~'+cast(flrValue as varchar)
NOT IN (select cast(fldID as varchar(100))+'~'+cast(flrValue as varchar) from table2)
This is a simpler query. It returns all T1.fldID values that are not linked to records in T2:
SELECT DISTINCT T1.fldID
FROM T1
LEFT JOIN T2 ON T2.fldID = T1.fldPtr
WHERE T2.fldID IS NULL
Using IN to exclude a large number of values is terrible for performance. Try the following:
SELECT T1.*
FROM T1
LEFT JOIN T2 ON T2.fldID = T1.fldPtr AND T1.fldValue = T2.fldvalue
WHERE T2.fldID IS NULL
(from my comment:) you do not have to reference t1 again in the subquery. Doing so would cause a logic of the form select all the records from t1 that don't exist in t1 ..., which is always empty, just like select all blue balls that are not blue, or select all odd numbers that are even ...
The first query should be:
SELECT DISTINCT fldID, fldValue
FROM T1
WHERE NOT EXISTS (
SELECT * FROM T2
WHERE T2.fldID = T1.fldPtr
);
And: in your original query, the subquery is uncorrelated. The t1 in the subquery shadows the t1 in the main query, so the subquery does not refer to any table or alias from the main query: it returns either True (some row exists) or False, and the result is totally unrelated to the rows in the main query. (Yet another good reason to use aliases instead of real table names in your queries.)
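The difference between the uncorrelated and the correlated form can be checked with a small experiment. This sketch uses Python's sqlite3 module in place of SQL Server, with made-up rows in which exactly one T1 record has no counterpart in T2.

```python
import sqlite3

# Hypothetical data: T1 row 2 points at a T2 row that does not exist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T1 (fldID INTEGER, fldPtr INTEGER, fldValue TEXT)")
conn.execute("CREATE TABLE T2 (fldID INTEGER)")
conn.executemany(
    "INSERT INTO T1 VALUES (?, ?, ?)", [(1, 10, "a"), (2, 99, "b")]
)
conn.execute("INSERT INTO T2 VALUES (10)")

# Uncorrelated: the inner T1 shadows the outer one, so the subquery
# returns the same (non-empty) result for every outer row.
uncorrelated = conn.execute(
    """
    SELECT DISTINCT fldID, fldValue FROM T1
    WHERE NOT EXISTS (
        SELECT T1.fldID, T1.fldValue
        FROM T2
        JOIN T1 ON T2.fldID = T1.fldPtr
    )
    """
).fetchall()

# Correlated: the subquery refers to the outer T1 row.
correlated = conn.execute(
    """
    SELECT DISTINCT fldID, fldValue FROM T1
    WHERE NOT EXISTS (
        SELECT * FROM T2
        WHERE T2.fldID = T1.fldPtr
    )
    """
).fetchall()
print(uncorrelated, correlated)
```

The uncorrelated version comes back empty as soon as any T1 row has a partner in T2, while the correlated version returns the one unlinked record.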

Column ambiguously defined in subquery using rownums

I have to execute SQL written by users and show its results. An example SQL could be this:
SELECT t1.*, t2.* FROM table1 t1, table2 t2 WHERE t1.id = t2.id
This SQL works fine as it is, but I need to manually add pagination and show the rownum, so the SQL ends up like this.
SELECT z.*
FROM(
SELECT y.*, ROWNUM rn
FROM (
SELECT t1.*, t2.* FROM table1 t1, table2 t2 WHERE t1.id = t2.id
) y
WHERE ROWNUM <= 50) z
WHERE rn > 0
This throws an exception: "ORA-00918: column ambiguously defined" because both table1 and table2 contain a column with the same name ("id").
What could be the best way to avoid this?
Regards.
UPDATE
In the end, we had to go the ugly way and parse each incoming SQL statement before executing it. Basically, we resolved the asterisks to discover which fields we needed to add, and aliased every field with a unique id. This introduced a performance penalty, but our client understood it was the only option given the requirements.
I will accept Lex's answer, as it's the solution we ended up using.
I think you have to specify aliases for (at least one of) table1.id and table2.id, and possibly for any other identically named columns as well.
So instead of SELECT t1.*, t2.* FROM table1 t1, table2 t2 use something like:
SELECT t1.id t1id, t2.id t2id [rest of columns] FROM table1 t1, table2 t2
I'm not familiar with Oracle syntax, but I think you'll get the idea.
I was searching for an answer to something similar. I was referencing an aliased subquery that had a couple of NULL columns. I had to alias the NULL columns because I had more than one:
select a.*, t2.column, t2.column, t2.column
from (select t1.column, t1.column, NULL, NULL, t1.column from t1
where t1.column='VALUE') a
left outer join t2 on t2.column=a.column;
Once I aliased the NULL columns in the subquery, it worked fine.
If you could modify the query syntactically (or get the users to do so) to use explicit JOIN syntax with the USING clause, this would automatically fix the problem at hand:
SELECT *
FROM table1 t1
JOIN table2 t2 USING (id)
The USING clause does the same as ON t1.id = t2.id (or the implicit JOIN you have in the question), except that only one id column remains in the result, thereby eliminating your problem.
You would still run into problems if there are more columns with identical names that are not included in the USING clause. Aliases as described by @Lex are indispensable then.
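The effect of USING on the result columns can be observed directly. The sketch below uses Python's sqlite3 module rather than Oracle, with two hypothetical two-column tables, and compares the column list produced by SELECT * under USING and under ON.

```python
import sqlite3

# Hypothetical tables that share an "id" column; SQLite stands in for Oracle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, a TEXT)")
conn.execute("CREATE TABLE table2 (id INTEGER, b TEXT)")
conn.execute("INSERT INTO table1 VALUES (1, 'left')")
conn.execute("INSERT INTO table2 VALUES (1, 'right')")

# With USING, SELECT * yields the join column only once.
cur = conn.execute("SELECT * FROM table1 t1 JOIN table2 t2 USING (id)")
using_cols = [d[0] for d in cur.description]

# With ON, both id columns appear in the result.
cur = conn.execute("SELECT * FROM table1 t1 JOIN table2 t2 ON t1.id = t2.id")
on_cols = [d[0] for d in cur.description]
print(using_cols, on_cols)
```

Only the USING variant produces a duplicate-free column list, which is what the outer pagination query needs.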
Use a NULL-replacement function (NVL) to fix this.
SELECT z.*
FROM(
SELECT y.*, ROWNUM rn
FROM (
SELECT t1.*, t2.* FROM table1 t1, table2 t2 WHERE
NVL(t1.id,0) = NVL(t2.id,0)
) y
WHERE ROWNUM <= 50) z
WHERE rn > 0

How to convert a SQL subquery to a join

I have two tables with a 1:n relationship: "content" and "versioned-content-data" (for example, an article entity and all the versions created of that article). I would like to create a view that displays the top version of each "content".
Currently I use this query (with a simple subquery):
SELECT
t1.id,
t1.title,
t1.contenttext,
t1.fk_idothertable,
t1.version
FROM mytable as t1
WHERE (version = (SELECT MAX(version) AS topversion
FROM mytable
WHERE (fk_idothertable = t1.fk_idothertable)))
The subquery is actually a query to the same table that extracts the highest version of a specific item. Notice that the versioned items will have the same fk_idothertable.
In SQL Server I tried to create an indexed view of this query but it seems I'm not able since subqueries are not allowed in indexed views. So... here's my question... Can you think of a way to convert this query to some sort of query with JOINs?
It seems like indexed views cannot contain:
subqueries
common table expressions
derived tables
HAVING clauses
I'm desperate. Any other ideas are welcome :-)
Thanks a lot!
This probably won't help if table is already in production but the right way to model this is to make version = 0 the permanent version and always increment the version of OLDER material. So when you insert a new version you would say:
UPDATE thetable SET version = version + 1 WHERE id = :id
INSERT INTO thetable (id, version, title, ...) VALUES (:id, 0, :title, ...)
Then this query would just be
SELECT id, title, ... FROM thetable WHERE version = 0
No subqueries, no MAX aggregation. You always know what the current version is. You never have to select max(version) in order to insert the new record.
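A minimal sketch of this "version 0 is current" scheme, using Python's sqlite3 module: the add_version helper and the sample titles are hypothetical, and the column list is trimmed to id, version, and title.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thetable (id INTEGER, version INTEGER, title TEXT)")


def add_version(conn, item_id, title):
    """Hypothetical helper: age all existing versions, insert the new one as 0."""
    with conn:  # run both statements in one transaction
        conn.execute(
            "UPDATE thetable SET version = version + 1 WHERE id = ?",
            (item_id,),
        )
        conn.execute(
            "INSERT INTO thetable (id, version, title) VALUES (?, 0, ?)",
            (item_id, title),
        )


add_version(conn, 1, "draft")
add_version(conn, 1, "final")
add_version(conn, 2, "other")

# The current version of every item is a plain filter, with no MAX or subquery.
current = conn.execute(
    "SELECT id, title FROM thetable WHERE version = 0 ORDER BY id"
).fetchall()
history = conn.execute(
    "SELECT version, title FROM thetable WHERE id = 1 ORDER BY version"
).fetchall()
print(current, history)
```

The trade-off is that every insert also updates all older rows of the same item, so writes get more expensive as histories grow.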
Maybe something like this?
SELECT
t2.id,
t2.title,
t2.contenttext,
t2.fk_idothertable,
t2.version
FROM mytable t1, mytable t2
WHERE t1.fk_idothertable = t2.fk_idothertable
GROUP BY t2.fk_idothertable, t2.version
HAVING t2.version=MAX(t1.version)
Just a wild guess...
You might be able to turn the MAX into an aliased derived table that does the GROUP BY.
It might look something like this:
SELECT
t1.id,
t1.title,
t1.contenttext,
t1.fk_idothertable,
t1.version
FROM mytable as t1 JOIN
(SELECT fk_idothertable, MAX(version) AS topversion
FROM mytable
GROUP BY fk_idothertable) as t2
ON t1.version = t2.topversion
AND t1.fk_idothertable = t2.fk_idothertable
I think FerranB was close but didn't quite have the grouping right:
with
latest_versions as (
select
max(version) as latest_version,
fk_idothertable
from
mytable
group by
fk_idothertable
)
select
t1.id,
t1.title,
t1.contenttext,
t1.fk_idothertable,
t1.version
from
mytable as t1
join latest_versions on (t1.version = latest_versions.latest_version
and t1.fk_idothertable = latest_versions.fk_idothertable);
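To show why the join needs both the version and the foreign key, here is a small sketch with Python's sqlite3 module. SQLite stands in for SQL Server, the rows are invented, and the contenttext column is left out for brevity.

```python
import sqlite3

# Hypothetical rows: item 100 has versions 1 and 2, item 200 has version 1 only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable "
    "(id INTEGER, title TEXT, fk_idothertable INTEGER, version INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [(1, "A v1", 100, 1), (2, "A v2", 100, 2), (3, "B v1", 200, 1)],
)

LATEST = """
    SELECT t1.id
    FROM mytable AS t1
    JOIN (SELECT fk_idothertable, MAX(version) AS topversion
          FROM mytable
          GROUP BY fk_idothertable) AS t2
      ON {condition}
    ORDER BY t1.id
"""

# Joining on the version alone lets row 1 (version 1 of item 100) pair up
# with item 200's top version, which is also 1.
version_only = [
    r[0] for r in conn.execute(LATEST.format(condition="t1.version = t2.topversion"))
]

# Joining on both columns keeps exactly one row per item.
both = [
    r[0]
    for r in conn.execute(
        LATEST.format(
            condition="t1.version = t2.topversion "
            "AND t1.fk_idothertable = t2.fk_idothertable"
        )
    )
]
print(version_only, both)
```

The version-only join leaks an outdated row into the result, while the two-column join returns just the top version of each item.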
If SQL Server accepts a LIMIT clause, I think the following should work:
SELECT
t1.id,
t1.title,
t1.contenttext,
t1.fk_idothertable,
t1.version
FROM mytable as t1 ORDER BY t1.version DESC LIMIT 1;
(DESC sorts in descending order; LIMIT 1 keeps only the first row, and a
DBMS usually optimizes well when it sees LIMIT.)
I don't know how efficient this would be, but:
SELECT t1.*, t2.version
FROM mytable AS t1
JOIN (
SELECT mytable.fk_idothertable, MAX(mytable.version) AS version
FROM mytable
GROUP BY mytable.fk_idothertable
) t2 ON t1.fk_idothertable = t2.fk_idothertable AND t1.version = t2.version
Like this. I assume that the 'mytable' in the subquery was a different actual table, so I called it mytable2. If it was the same table, this will still work, but then I imagine that fk_idothertable will just be 'id'.
SELECT
t1.id,
t1.title,
t1.contenttext,
t1.fk_idothertable,
t1.version
FROM mytable as t1
INNER JOIN (SELECT MAX(Version) AS topversion,fk_idothertable FROM mytable2 GROUP BY fk_idothertable) t2
ON t1.id = t2.fk_idothertable AND t1.version = t2.topversion
Hope this helps