How to run a subquery with a non-equality clause on Spark?

I have this query and I want to execute it on Spark:
SELECT A.PFR,
A.MFR,
A.MST,
(SELECT COUNT(*)
FROM Table1 T2
WHERE A.PFR = T2.PFR
AND A.MFR = T2.MFR
AND A.MST >= T2.MST) AS RANK
FROM Table1 A
But Spark doesn't support correlated subqueries with a non-equality clause.
I get this error:
The correlated scalar subquery can only contain equality predicates
So I tried to use GROUP BY, but I didn't get the correct results (I have the input and the expected output):
SELECT A.PFR,
A.MFR,
A.MST,
B.countRank
FROM Table1 A
LEFT OUTER JOIN
(SELECT PFR,
MFR,
MST,
COUNT(MFR) countRank
FROM Table1 B
GROUP BY PFR,
MFR,
MST) B ON B.PFR = A.PFR
Is there a way to convert this query to a join query?
Thanks in advance.

Just use rank():
SELECT A.PFR, A.MFR, A.MST,
RANK() OVER (PARTITION BY PFR, MFR
ORDER BY MST DESC
) as rank
FROM Table1 A;
If rank() doesn't do exactly what you want, then perhaps row_number() or dense_rank() will work.
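One detail worth checking is how ties are counted: the original correlated COUNT(*) counts every row whose MST is less than or equal to the current one, which is what a running COUNT(*) OVER (... ORDER BY MST) with a RANGE frame does, while RANK() gives tied rows the lowest position. Below is a minimal sketch comparing the two on made-up rows. It runs on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later), not on Spark, and the sample data and the RNK alias are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (PFR TEXT, MFR TEXT, MST INTEGER)")
# Hypothetical sample rows, including a tie on MST = 20.
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [("p1", "m1", 10), ("p1", "m1", 20), ("p1", "m1", 20),
     ("p1", "m1", 30), ("p2", "m1", 5)],
)

# The original correlated scalar subquery with the >= predicate.
correlated = conn.execute("""
    SELECT A.PFR, A.MFR, A.MST,
           (SELECT COUNT(*)
              FROM Table1 T2
             WHERE A.PFR = T2.PFR
               AND A.MFR = T2.MFR
               AND A.MST >= T2.MST) AS RNK
    FROM Table1 A
    ORDER BY 1, 2, 3
""").fetchall()

# Window version: the RANGE frame includes all peers of the current row,
# so tied rows are counted exactly like the >= predicate counts them.
windowed = conn.execute("""
    SELECT PFR, MFR, MST,
           COUNT(*) OVER (PARTITION BY PFR, MFR
                          ORDER BY MST
                          RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RNK
    FROM Table1
    ORDER BY 1, 2, 3
""").fetchall()

print(correlated)
print(windowed)
```

On this data both queries give the tied rows a count of 3, whereas RANK() ordered by MST ascending would give them 2.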

Related

Using table from select result in its own nested select statement during join

This might be a duplicate or poor description of the problem, sorry if so. Is it possible to use sub from the following query in a nested select statement in the join as shown below:
SELECT
*
FROM
(
SELECT
several_values_from_several_tables
FROM
table1 t1
JOIN multiple_tables ON t1_and_eachother
WHERE
several_conditions_including_nested_select
ORDER BY
multiple_columns
) sub
INNER JOIN (
SELECT
some_id, MAX(some_time_value) AS max_time
FROM
sub
GROUP BY
some_id
) sub2 ON sub.some_id = sub2.some_id
WHERE
sub.time = sub2.time
I'd like to use sub in the joins select statement to avoid having to repeat the same select statement as it is quite large and expensive (I also have a second join on a different timestamp which would result in doing the same expensive query 3 times). I guess that sub is not created until joins are performed and where clause is to be executed. If anyone has found a solution or workaround to achieve the same result in a single query before I'd greatly appreciate some pointers.
A CTE (Common Table Expression, a.k.a. the WITH clause or subquery factoring clause) can be used for that, e.g.:
WITH
sub
AS
( SELECT several_values_from_several_tables
FROM table1 t1 JOIN multiple_tables ON t1_and_eachother
WHERE several_conditions_including_nested_select
ORDER BY multiple_columns),
sub2
AS
( SELECT some_id, MAX (some_time_value) AS max_time
FROM sub
GROUP BY some_id)
SELECT *
FROM sub s JOIN sub2 s2 ON s.some_id = s2.some_id
WHERE s.time = s2.max_time
In this case, you should avoid the self-join and use an analytic function:
SELECT *
FROM (
SELECT several_values_from_several_tables,
MAX(some_time_value) OVER (PARTITION BY some_id) AS max_time
FROM table1 t1
JOIN multiple_tables ON t1_and_eachother
WHERE several_conditions_including_nested_select
ORDER BY
multiple_columns
)
WHERE some_time_value = max_time
Note: You could also use the RANK or DENSE_RANK analytic functions instead of MAX.
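To see that the CTE self-join and the analytic MAX() return the same rows, here is a small sketch run on SQLite through Python's sqlite3 module (SQLite 3.25 or later for window functions). The events table, its payload column, and the sample rows are hypothetical stand-ins for the large query in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the expensive "sub" query.
conn.execute(
    "CREATE TABLE events (some_id INTEGER, some_time_value INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, 10, "a"), (1, 30, "b"), (2, 5, "c"), (2, 7, "d"), (2, 6, "e")],
)

# CTE version: "sub" is written once and referenced twice.
cte_rows = conn.execute("""
    WITH sub AS (SELECT some_id, some_time_value, payload FROM events),
         sub2 AS (SELECT some_id, MAX(some_time_value) AS max_time
                    FROM sub
                   GROUP BY some_id)
    SELECT s.some_id, s.some_time_value, s.payload
      FROM sub s JOIN sub2 s2 ON s.some_id = s2.some_id
     WHERE s.some_time_value = s2.max_time
     ORDER BY s.some_id
""").fetchall()

# Analytic version: no self-join, the maximum is attached to every row.
window_rows = conn.execute("""
    SELECT some_id, some_time_value, payload
      FROM (SELECT some_id, some_time_value, payload,
                   MAX(some_time_value) OVER (PARTITION BY some_id) AS max_time
              FROM events) t
     WHERE some_time_value = max_time
     ORDER BY some_id
""").fetchall()

print(cte_rows)
print(window_rows)
```

Whether the database evaluates the CTE once or inlines it twice depends on the engine, so the analytic form is usually the safer bet when the inner query is expensive.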

SQL MAX function where not all attributes are in the group by

So my current problem is that I have two tables that look like this:
table1(name, num_patient, quant, inst)
table2(inst_name, num_region)
I want to find the patient with the max quantity per region.
I first had the idea of doing something like this:
SELECT num_region, num_patient, MAX(quant)
FROM
(SELECT num_patient, quant, num_region
FROM table1
INNER JOIN table2
ON table1.inst = table2.inst_name) AS joined_tables
GROUP BY num_region;
But this doesn't work since either num_patient has to be on the GROUP BY (and this way it doesn't return the max value by region anymore) or I have to remove it from the SELECT (also doesn't work because I need the name of each patient). I have tried to fix my issue with a WHERE quant = MAX() statement but couldn't get it to work. Is there any workaround to this?
Use DISTINCT ON:
SELECT DISTINCT ON (num_region) num_patient, quant, num_region
FROM table1 t1 JOIN
table2 t2
ON t1.inst = t2.inst_name
ORDER BY num_region, quant DESC;
DISTINCT ON is a convenient Postgres extension. It returns one row per combination of the keys listed in the DISTINCT ON clause, picking the first row according to the ORDER BY.
Being an extension, not all databases support this functionality -- even databases derived from Postgres. The traditional method would use ROW_NUMBER():
SELECT t.*
FROM (SELECT num_patient, quant, num_region,
ROW_NUMBER() OVER (PARTITION BY num_region ORDER BY quant DESC) as seqnum
FROM table1 t1 JOIN
table2 t2
ON t1.inst = t2.inst_name
) t
WHERE seqnum = 1;
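The ROW_NUMBER() variant can be tried without a Postgres instance. The sketch below runs it on SQLite through Python's sqlite3 module (SQLite 3.25 or later); the table layouts follow the question, but the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (name TEXT, num_patient INTEGER, quant INTEGER, inst TEXT)"
)
conn.execute("CREATE TABLE table2 (inst_name TEXT, num_region INTEGER)")
# Hypothetical data: institutions i1 and i2 are in region 100, i3 in region 200.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [("ann", 1, 5, "i1"), ("bob", 2, 9, "i1"), ("cy", 3, 4, "i2"), ("dee", 4, 2, "i3")],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?)",
    [("i1", 100), ("i2", 100), ("i3", 200)],
)

# Number the rows of each region by descending quantity, then keep row 1.
top_per_region = conn.execute("""
    SELECT t.num_region, t.num_patient, t.quant
      FROM (SELECT num_patient, quant, num_region,
                   ROW_NUMBER() OVER (PARTITION BY num_region
                                      ORDER BY quant DESC) AS seqnum
              FROM table1 t1
              JOIN table2 t2 ON t1.inst = t2.inst_name) t
     WHERE seqnum = 1
     ORDER BY t.num_region
""").fetchall()

print(top_per_region)
```

If two patients tie for the maximum in a region, ROW_NUMBER() keeps an arbitrary one of them; RANK() would keep both.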
This is a duplicate of the DISTINCT ON question I linked.
SELECT distinct on (num_region) num_patient, quant, num_region
FROM table1
INNER JOIN table2
ON table1.inst = table2.inst_name
ORDER BY num_region, quant desc

entry cannot be referenced in this part of the query (subquery) Error

I'm getting the following error on my query:
There is an entry for table "table1", but it cannot be referenced from this part of the query.
This is my query:
SELECT id
FROM property_import_image_results table1
LEFT JOIN (
SELECT created_at
FROM property_import_image_results
WHERE external_url = table1.external_url
ORDER BY created_at DESC NULLS LAST
LIMIT 1
) as table2 ON (pimr.created_at = table2.created_at)
WHERE table2.created_at is NULL
You need a lateral join to be able to reference the outer table in the sub-select for the join.
You are also referencing an alias pimr in the join condition, which isn't available anywhere in the query. So you need to change that to table1 in the join condition.
You should also give the table in the inner query an alias to avoid confusion:
SELECT id
FROM property_import_image_results table1
LEFT JOIN LATERAL (
SELECT p2.created_at
FROM property_import_image_results p2
WHERE p2.external_url = table1.external_url
ORDER BY p2.created_at DESC NULLS LAST
LIMIT 1
) as table2 ON (table1.created_at = table2.created_at)
WHERE table2.created_at is NULL
Edit
This kind of query can also be solved using window functions:
select id
from (
select id, created_at,
max(created_at) over (partition by external_url) as max_created
FROM property_import_image_results
) t
where created_at <> max_created;
This might be faster than the join-based approach, but it's hard to tell; lateral joins are quite efficient as well. The window-function version has the advantage that you can add any column you like to the result because no grouping is required.
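A quick way to check the window-function approach is to run it on a few rows. The sketch below uses SQLite through Python's sqlite3 module (SQLite 3.25 or later) rather than Postgres, with invented sample rows; the inner query has to select created_at so that the outer filter can compare it against max_created. Rows with a NULL created_at are not covered here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE property_import_image_results "
    "(id INTEGER, external_url TEXT, created_at TEXT)"
)
# Hypothetical rows: three imports of u1, one import of u2.
conn.executemany(
    "INSERT INTO property_import_image_results VALUES (?, ?, ?)",
    [(1, "u1", "2020-01-01"), (2, "u1", "2020-02-01"),
     (3, "u2", "2020-01-15"), (4, "u1", "2020-01-20")],
)

# Attach the newest created_at of each external_url to every row,
# then keep the rows that are not the newest one.
stale_ids = [row[0] for row in conn.execute("""
    SELECT id
      FROM (SELECT id, created_at,
                   MAX(created_at) OVER (PARTITION BY external_url) AS max_created
              FROM property_import_image_results) t
     WHERE created_at <> max_created
     ORDER BY id
""")]

print(stale_ids)
```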

How to use order by and rownum without subselect?

I need to build a query with an ORDER BY and ROWNUM, but without using a subselect.
I need it to get the first row of the ordered query.
In other words, I want the result of
select * from (
SELECT CAMP1,ORDERCAMP
FROM TABLENAME
ORDER BY ORDERCAMP) where rownum=1;
but without using a subselect. Is it possible?
I'm on Oracle 11. You could say this is my whole query:
SELECT T1.CAMP_ID,
T2.CAMP,
(SELECT OT.CAMP
FROM OTHERTABLE OT
WHERE OT.FK_TO_TABLE1=T1.CAMP_ID
ORDER BY OT.ORDERCAMP
)
FROM TABLE1 T1,
TABLE2 T2
WHERE T1.FK_TO_T2=T2.PK;
The subquery returns more than one row, and I can't use another subquery like
SELECT T1.CAMP_ID,
T2.CAMP,
(SELECT *
FROM
(SELECT OT.CAMP
FROM OTHERTABLE OT
WHERE OT.FK_TO_TABLE1=T1.CAMP_ID
ORDER BY OT.ORDERCAMP
)
WHERE ROWNUM=1
)
FROM TABLE1 T1,
TABLE2 T2
WHERE T1.FK_TO_T2=T2.PK;
SELECT CAMP1,ORDERCAMP FROM TABLE2 ORDER BY ORDERCAMP
This fails because T1.CAMP_ID is an invalid identifier in the third-level subquery.
I hope I have explained myself enough.
Your current query (without the invalid ORDER BY) gets ORA-01427: single-row subquery returns more than one row. You can nest subqueries, but you can only refer back one level when joining; so if you did:
SELECT T1.CAMP_ID, T2.CAMP,
(SELECT CAMP
FROM
(SELECT OT.CAMP
FROM OTHERTABLE OT
WHERE OT.FK_TO_TABLE1=T1.CAMP_ID
ORDER BY OT.ORDERCAMP
)
WHERE ROWNUM = 1)
FROM TABLE1 T1, TABLE2 T2 WHERE T1.FK_TO_T2=T2.PK;
... then you would get ORA-00904: "T1"."CAMP_ID": invalid identifier. Hence your question, presumably.
What you could do instead is join to the third table, and use the analytic ROW_NUMBER() function to assign the row number, and then use an outer select wrapped around the whole thing to only find the records with the lowest ORDERCAMP:
SELECT CAMP_ID, CAMP, OT_CAMP
FROM (
SELECT T1.CAMP_ID, T2.CAMP, OT.CAMP AS OT_CAMP,
ROW_NUMBER() OVER (PARTITION BY T1.CAMP_ID ORDER BY OT.ORDERCAMP) AS RN
FROM TABLE2 T2
JOIN TABLE1 T1 ON T1.FK_TO_T2=T2.PK
JOIN OTHERTABLE OT ON OT.FK_TO_TABLE1=T1.CAMP_ID
)
WHERE RN = 1;
The ROW_NUMBER() can partition on the T1.CAMP_ID primary key value, or anything else that is unique.
SQL Fiddle demo, including the inner query run on its own so you can see the RN numbers assigned before the outer filter is applied.
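The ROW_NUMBER() approach can also be reproduced locally. This is a sketch on SQLite through Python's sqlite3 module (SQLite 3.25 or later), not on Oracle 11, and the sample rows are made up; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE2 (PK INTEGER, CAMP TEXT)")
conn.execute("CREATE TABLE TABLE1 (CAMP_ID INTEGER, FK_TO_T2 INTEGER)")
conn.execute(
    "CREATE TABLE OTHERTABLE (FK_TO_TABLE1 INTEGER, CAMP TEXT, ORDERCAMP INTEGER)"
)
# Hypothetical data: CAMP_ID 10 has two child rows, CAMP_ID 11 has one.
conn.execute("INSERT INTO TABLE2 VALUES (1, 't2a')")
conn.executemany("INSERT INTO TABLE1 VALUES (?, ?)", [(10, 1), (11, 1)])
conn.executemany(
    "INSERT INTO OTHERTABLE VALUES (?, ?, ?)",
    [(10, "x", 2), (10, "y", 1), (11, "z", 5)],
)

# Number the child rows of each CAMP_ID by ORDERCAMP and keep the first one.
first_children = conn.execute("""
    SELECT CAMP_ID, CAMP, OT_CAMP
      FROM (SELECT T1.CAMP_ID, T2.CAMP, OT.CAMP AS OT_CAMP,
                   ROW_NUMBER() OVER (PARTITION BY T1.CAMP_ID
                                      ORDER BY OT.ORDERCAMP) AS RN
              FROM TABLE2 T2
              JOIN TABLE1 T1 ON T1.FK_TO_T2 = T2.PK
              JOIN OTHERTABLE OT ON OT.FK_TO_TABLE1 = T1.CAMP_ID) numbered
     WHERE RN = 1
     ORDER BY CAMP_ID
""").fetchall()

print(first_children)
```

Because these are inner joins, a TABLE1 row with no matching OTHERTABLE row drops out of the result, unlike the scalar subquery in the question, which would return NULL for it.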
Another approach is to use the aggregate KEEP DENSE_RANK FIRST function:
SELECT T1.CAMP_ID, T2.CAMP,
MAX(OT.CAMP) KEEP (DENSE_RANK FIRST ORDER BY OT.ORDERCAMP) AS OT_CAMP
FROM TABLE2 T2
JOIN TABLE1 T1 ON T1.FK_TO_T2=T2.PK
JOIN OTHERTABLE OT ON OT.FK_TO_TABLE1=T1.CAMP_ID
GROUP BY T1.CAMP_ID, T2.CAMP;
This is a bit shorter and doesn't need an inner query. I'm not sure if there's any real advantage of one over the other.
In Oracle 12c and later, you can do:
SELECT CAMP1, ORDERCAMP
FROM TABLENAME
ORDER BY ORDERCAMP
FETCH FIRST 1 ROWS ONLY;
Otherwise, I think you need a subquery of some sort.
In databases other than Oracle you could use LIMIT (e.g. MySQL, PostgreSQL) or SELECT TOP 1 (SQL Server); Oracle 11 supports neither:
SELECT CAMP1, ORDERCAMP FROM TABLENAME ORDER BY ORDERCAMP LIMIT 1

SQL Server ROW_NUMBER Left Join + when you don't know column names

I'm writing a page that builds a query for non-DB users, runs it, and returns the results to them.
I am using row_number to handle custom pagination.
How do I do a left join and a row_number in a subquery when I don't know the specific columns I need to return? I tried to use * but I get this error:
The column '' was specified multiple times
Here is the query I tried:
SELECT * FROM
(SELECT ROW_NUMBER() OVER (ORDER BY Test) AS ROW_NUMBER, *
FROM table1 a
LEFT JOIN table2 b
ON a.ID = b.ID) x
WHERE ROW_NUMBER BETWEEN 1 AND 50
Your query is going to fail in SQL Server regardless of the row_number() call. The * returns all columns, including a.id and b.id. These both have the same name. This is fine for a query, but for a subquery, all columns need distinct names.
You can use row_number() for an arbitrary ordering by using a "subquery with constant" in the order by clause:
SELECT * FROM
(SELECT ROW_NUMBER() OVER (ORDER BY (select NULL)) AS ROW_NUMBER, *
FROM table1 a
LEFT JOIN table2 b
ON a.ID = b.ID) x
WHERE ROW_NUMBER BETWEEN 1 AND 50 ;
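This removes the dependency on the name of the ordering column (assuming no column is named ROW_NUMBER). Duplicate column names such as ID still have to be aliased or left out of the select list.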
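Try qualifying the ORDER BY column and the wildcards. Note that in SQL Server this still fails as long as both tables share a column name such as ID: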
SELECT * FROM
(SELECT ROW_NUMBER() OVER (ORDER BY a.Test) AS ROW_NUMBER, a.*,b.*
FROM table1 a
LEFT JOIN table2 b
ON a.ID = b.ID) x
WHERE ROW_NUMBER BETWEEN 1 AND 50
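Since the page generates the query anyway, one option is to generate the select list too, giving every column a unique alias so the subquery never exposes two columns with the same name. The sketch below shows the idea on SQLite through Python's sqlite3 module (SQLite 3.25 or later), reading column names from PRAGMA table_info; on SQL Server the names would come from a catalog view such as INFORMATION_SCHEMA.COLUMNS instead. The aliased_columns helper, the rn alias, and the sample tables are hypothetical.

```python
import sqlite3

def aliased_columns(conn, table, alias):
    """Build 'a."ID" AS "a_ID", ...' for every column of a trusted table name."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    return ", ".join(f'{alias}."{c}" AS "{alias}_{c}"' for c in cols)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (ID INTEGER, Test TEXT)")
conn.execute("CREATE TABLE table2 (ID INTEGER, Note TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, "b"), (2, "a"), (3, "c")])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, "n1"), (3, "n3")])

# Every column gets a table-prefixed alias, so a.ID and b.ID no longer clash.
select_list = (
    aliased_columns(conn, "table1", "a") + ", " + aliased_columns(conn, "table2", "b")
)
sql = f"""
    SELECT * FROM (
        SELECT ROW_NUMBER() OVER (ORDER BY a.Test) AS rn, {select_list}
          FROM table1 a
          LEFT JOIN table2 b ON a.ID = b.ID
    ) x
    WHERE rn BETWEEN 1 AND 2
    ORDER BY rn
"""
cur = conn.execute(sql)
names = [d[0] for d in cur.description]  # column names of the result set
page = cur.fetchall()

print(names)
print(page)
```

The table names are interpolated straight into the SQL text here, so in a real page they should come from a whitelist rather than from user input.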