Spark SQL statement broadcast - sql

Is there a way to use broadcast in Spark SQL statement?
For example:
SELECT
Column
FROM
broadcast (Table 1)
JOIN
Table 2
ON
Table1.key = Table2.key
And in my case, Table 1 is also a sub query.

Below is the syntax for Broadcast join:
SELECT /*+ BROADCAST(Table 2) */ COLUMN
FROM Table 1 join Table 2
on Table1.key= Table2.key
To check if broadcast join occurs or not you can check in Spark UI port number 18080 in the SQL tab.
The reason we need to ensure whether broadcast join is actually working is because earlier we were using the below syntax:
/* BROADCASTJOIN(Table2) */ which did not throw syntax error but in the UI it was performing sort merge join
Hence it is essential to ensure our joins are working as expected

In Spark 2.2 or later you can use planner hints:
SELECT /*+ MAPJOIN(Table1) */ COLUMN
FROM Table1 JOIN Table2
ON Table1.key = Table2.key

Related

How to run the subquery first in presto

I have the following query:
select *
from Table1
where NUMid in (select NUMid
from Table2
where email = 'xyz#gmail.com')
My intention is to get the list of all the NUMids from table2 having an email value equal to xyz#gmail.com and use those list of NUMids to query from Table1.
In presto, the query is running the outer query first. Is there a way to run and store the result of inner query and then use it in the outer query in presto?
The optimizer can do what it likes. In this case, it should be running the inner query once and then essentially doing a JOIN (technically a "semi-join") operation.
In many databases, exists with appropriate indexes solves the performance problem.
If you want to ensure that the subquery is evaluated only once, you can move it to the ON clause. The correct equivalent query looks like:
select t1.*
from Table1 t1 join
(select distinct t2.NUMid
from Table2 t2
where t2.email = 'xyz#gmail.com'
) t2
on t1.NUMid = t2.NUMid;
The select distinct is important for the join code to be equivalent to the in code. However, if you know there are no duplicates, this is more colloquially written without a subquery:
select t1.*
from Table1 t1 join
Table2 t2
on t1.NUMid = t2.NUMid
where t2.email = 'xyz#gmail.com'
Presto and Trino (formerly known as PrestoSQL) execute that query as a "semi join" operation: it builds an in-memory index with the rows coming from the inner query and probes the rows of the outer query against that index. If value is present, the row from the outer query is emitted, otherwise, it's filtered out.
In recent versions of Trino, there's a feature called "dynamic filtering", which allows the query engine to dynamically filter and prune data for the outer query at the source based on information obtained dynamically from the inner query. You can read more about it in these blog posts:
Dynamic filtering for highly-selective join optimization
Dynamic partition pruning

Hive - Multiple sub-queries in where clause is failing

I am trying to create a table by checking two sub-query expressions within the where clause but my query fails with the below error :
Unsupported sub query expression. Only 1 sub query expression is
supported
Code snippet is as follows (Not the exact code. Just for better understanding) :
Create table winners row format delimited fields terminated by '|' as
select
games,
players
from olympics
where
exists (select 1 from dom_sports where dom_sports.players = olympics.players)
and not exists (select 1 from dom_sports where dom_sports.games = olympics.games)
If I execute same command with only one sub-query in where clause it is getting executed successfully. Having said that is there any alternative to achieve the same in a different way ?
Of course. You can use left join.
Inner join will act as exists. and left join + where clause will mimic the not exists.
There can be issue with granularity but that depends on your data.
select distinct
olympics.games,
olympics.players
from olympics
inner join dom_sports dom_sports on dom_sports.players = olympics.players
left join dom_sports dom_sports2 where dom_sports2.games = olympics.games
where dom_sports2.games is null

driving_site hint for multiple remote tables

I have a query of the following format. It uses two remote tables and a local table.
SELECT *
FROM table1#db2 t1 INNER JOIN table2#db2 t2 -- two large remote tables on the same DB
ON t1.id = t2.id
WHERE t1.prop = '1'
AND t2.prop = '2'
AND t1.prop2 IN (SELECT val FROM tinylocaltable)
I'm wondering how to properly use the DRIVING_SITE query hint to push the bulk of the work to db2 (i.e. ensure the join and conditions are applied on db2). Most of the examples I see of DRIVING_SITE reference only one remote table. Is SELECT /*+DRIVING_SITE(t1)*/ * sufficient or do I need to list both remote tables (t1 and t2) in the hint? If the latter, what is the proper syntax?
(If you're wondering why this isn't being executed on db2 to start with, it's because this is actually one UNION ALL section of a larger query, where the other UNION ALL sections use the local DB).
The DRIVING_SITE hint instructs the optimizer to execute the query at a different site than that selected by the database
Your query uses
FROM table1#db2 t1 INNER JOIN table2#db2 t2
where both tables are on the same "different site", so
SELECT /*+ DRIVING_SITE(t1)*/
should be OK (in my opinion. Can't find anything in documentation that would suggest different).

Querying a Partitioned table in BigQuery using a reference from a joined table

I would like to run a query that partitions table A using a value from table B.
For example:
#standard SQL
select A.user_id
from my_project.xxx A
inner join my_project.yyy B
on A._partitiontime = timestamp(B.date)
where B.date = '2018-01-01'
This query will scan all the partitions in table A and will not take into consideration the date I specified in the where clause (for partitioning purposes). I have tried running this query in several different ways but all produced the same result - scanning all partitions in table A.
Is there any way around it?
Thanks in advance.
With BigQuery scripting (Beta now), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then in subsequent query, scripting variable is used as a filter to prune the partitions to be scanned.
DECLARE date_filter ARRAY<DATETIME>
DEFAULT (SELECT ARRAY_AGG(date) FROM B WHERE ...);
select A.user_id
from my_project.xxx A
inner join my_project.yyy B
on A._partitiontime = timestamp(B.date)
where A._partitiontime IN UNNEST(date_filter)
The doc says this about your use case:
Express the predicate filter as closely as possible to the table
identifier. Complex queries that require the evaluation of multiple
stages of a query in order to resolve the predicate (such as inner
queries or subqueries) will not prune partitions from the query.
The following query does not prune partitions (note the use of a subquery):
#standardSQL
SELECT
t1.name,
t2.category
FROM
table1 t1
INNER JOIN
table2 t2
ON
t1.id_field = t2.field2
WHERE
t1.ts = (SELECT timestamp from table3 where key = 2)

Performance of join vs pre-select on MsSQL

I can do the same query in two ways as following, will #1 be more efficient as we don't have join?
1
select table1.* from table1
inner join table2 on table1.key = table2.key
where table2.id = 1
2
select * from table1
where key = (select key from table2 where id=1)
These are doing two different things. The second will return an error if more than one row is returned by the subquery.
In practice, my guess is that you have an index on table2(id) or table2(id, key), and that id is unique in table2. In that case, both should be doing index lookups and the performance should be very comparable.
And, the general answer to performance question is: try them on your servers with your data. That is really the only way to know if the performance difference makes a difference in your environment.
I executed these two statements after running set statistics io on (on SQL Server 2008 R2 Enterprise - which supposedly has the best optimization compared to Standard).
select top 5 * from x2 inner join ##prices on
x1.LIST_PRICE = ##prices.i1
and
select top 5 * from x2 where LIST_PRICE in (select i1 from ##prices)
and the statistics matched exactly. I have always preferred the first type of join but the second allows me to select just that part and see what rows are being returned.
I was taught that joins vs subqueries are mostly equivalent when it comes to performance. I would also look at the resulting query plans to see if one is better then the other. The query plans matched exactly.
MS SQL Server is smart enough to understand that it is the same action in such a simple query.
However if you have more than 1 record in subquery then you'll probably use IN. In is slow operation and it will never work faster than JOIN. It can be the same but never faster.
The best option for your case is to use EXISTS. It will be always faster or the same as JOIN or IN operation. Example:
select * from table1 t1
where EXISTS (select * from table2 t2 where id=1 AND t1.key = t2.key)