SparkSQL query performance improvement with CLUSTER BY - apache-spark-sql

I am new to SparkSQL and I primarily work with writing SparkSQL queries. We often need to JOIN big tables in our queries, and it did not take long to face performance issues with them (e.g. joins, aggregates, etc.).
While searching for remedies online, I recently came across the terms COALESCE(), REPARTITION(), DISTRIBUTE BY, CLUSTER BY, etc., and the fact that they could probably be used to enhance the performance of slow-running SparkSQL queries.
Unfortunately, I could not find enough examples around for me to understand them clearly and start applying them to my queries. I am primarily looking for examples explaining their syntax, hints and usage scenarios.
Can anyone please help me out here and provide SparkSQL query examples of their usage and when to use them? E.g.
syntax
hint syntax
tips
scenarios
Note: I only have access to writing SparkSQL Queries but don't have access to PySpark-SQL.
Any help is much appreciated.
Thanks

coalesce
coalesce(expr1, expr2, ...) - Returns the first non-null argument if one exists; otherwise, null.
Examples:
SELECT coalesce(NULL, 1, NULL);
1
Since: 1.0.0
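Note that this null-handling function is different from the coalesce you may have seen mentioned in a performance context: since Spark 2.4, Spark SQL also accepts a COALESCE hint inside a query to reduce the number of partitions without a full shuffle. A minimal sketch, assuming a table or view named df:
SELECT /*+ COALESCE(5) */ * FROM df
This asks Spark to merge the result down to 5 partitions. Because it only merges existing partitions rather than reshuffling all rows, it is cheaper than a repartition, but it cannot increase the partition count.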
Distribute By and REPARTITION
Repartitions a DataFrame by the given expressions. The number of partitions is equal to spark.sql.shuffle.partitions. Note that in Spark, when a DataFrame is partitioned by some expression, all the rows for which this expression is equal are on the same partition (but not necessarily vice-versa)!
This is how it looks in practice. Let’s say we have a DataFrame with two columns: key and value.
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df DISTRIBUTE BY key
Equivalent in DataFrame API:
df.repartition(2, $"key")
(In the Scala API the partition count comes before the column expressions.)
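In a SQL-only environment you can also request the same repartitioning with a hint rather than a DISTRIBUTE BY clause. A sketch, assuming Spark 2.4+ for the numeric form (column arguments in the hint require Spark 3.0+):
SELECT /*+ REPARTITION(2, key) */ * FROM df
A typical usage scenario is to distribute both sides of a large join by the join key, so that rows with equal keys end up in the same partition.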
Cluster By
This is just a shortcut for using distribute by and sort by together on the same set of expressions.
In SQL:
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df CLUSTER BY key
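That is, the statement above is equivalent to spelling out both clauses:
SET spark.sql.shuffle.partitions = 2
SELECT * FROM df DISTRIBUTE BY key SORT BY key
This is useful when downstream processing benefits from rows being both co-located and sorted within each partition; note that SORT BY sorts per partition, not globally (a global sort would be ORDER BY).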

Related

How to time and benchmark Bigquery Standard SQL scripts?

I have different versions of a BigQuery Standard SQL script (i.e. I follow the Standard API here). I am trying to find Matlab-style timeit or warmed-up timing measures to benchmark the different scripts:
Version A: very readable, modular code
WITH a AS (
    SELECT * FROM SOURCE
),
a_ AS (
    SELECT ... FROM a
)
SELECT * FROM a_
Version B: very unreadable code built from nested subqueries, but claimed to be efficient
SELECT * FROM (SELECT * FROM (SELECT * FROM SOURCE))
How can I time and benchmark different Standard SQL queries in Google Bigquery?
Possible perspectives to address
Do I need to warm up BQ Standard SQL queries like in Matlab?
What are the common performance differences between versions A and B? Any pros and cons? How can you demonstrate that in BigQuery?
Is any documentation or recommendation available for the two different approaches (overusing subqueries vs. modular coding)?
I summarise the comments below and comment on the speedup opportunities from the perspective of costs.
Version A demonstrates CTEs (Common Table Expressions), while Version B demonstrates construction with nested subqueries. The user Used_By_Already commented that the difference should not matter for performance, and so offers no speedup opportunities:
"No. That is the inverse of what I tried to convey. You can take almost any subquery and make it a cte, or vice versa take the type of ctes in your question and make it a subquery. The net effect it most likely to be zero or unmeasurable."
What matters is prefiltering (more here, as pointed out by Mikhail Berlyant in a comment): you can make your queries more efficient by prefiltering the data early.
As a side note, earlier filtering, and hence less computation, may not affect what you are charged (more here), because charging is based on the input data scanned, not on the amount of computation. So query optimisation in BigQuery makes little sense in terms of monetary cost, even though it may save you some time.
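As a sketch of the prefiltering point, push the filter into the earliest CTE rather than applying it at the end (event_date is a hypothetical column here; BigQuery can often push simple filters down by itself, but it is safer not to rely on that in complex queries):
WITH a AS (
    SELECT * FROM SOURCE
    WHERE event_date >= '2021-01-01'
)
SELECT * FROM a
rather than:
WITH a AS (
    SELECT * FROM SOURCE
)
SELECT * FROM a WHERE event_date >= '2021-01-01'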

SQL index on a sum of columns - is that possible?

Given two tables, T1 with column a and T2 with column b, is it possible to apply an index to the sum of columns T1.a + T2.b? I recently got a question involving this index and was quite surprised, as the question was not whether it was possible (which I believe it is not), but rather whether it would speed up some example query.
If it is possible, what exactly are we indexing? Would it be helpful in queries like WHERE T1.a + T2.b = 3, or in some other cases? Thanks!
Yes, most (not all) database systems allow you to create an index on the result of an expression, so creating an index on the sum of two columns is possible in those systems. Note, though, that an ordinary expression index covers a single table only; an expression spanning two tables, such as T1.a + T2.b, generally needs an indexed or materialized view instead.
Would it be helpful in queries like WHERE T1.a+T2.b = 3 or in some other cases?
That depends completely on the query and what plan the compiler decides to use to evaluate the query. If you filter on the sum of two columns, and there are relatively few records that meet that criteria, then yes, an index will reduce the amount of scanning that needs to be done to find matching records.
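To illustrate what such an index looks like where expression indexes are supported (PostgreSQL syntax; for simplicity this hypothetical example puts both columns in a single table t, since a plain index cannot span two tables):
CREATE INDEX t_sum_idx ON t ((a + b));
-- a predicate written the same way as the indexed expression can then use it:
SELECT * FROM t WHERE a + b = 3;
The planner will only consider the index when the query's expression matches the indexed expression as written.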
Depending on your SQL product, it could be possible to index a view which can contain a group by to get the persisted summary values.
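A rough sketch of that indexed-view idea in SQL Server, with hypothetical table and join-key names (indexed views carry a long list of restrictions, e.g. SCHEMABINDING is mandatory and COUNT_BIG(*) is required when grouping):
CREATE VIEW dbo.v_sums WITH SCHEMABINDING AS
SELECT t1.a + t2.b AS ab_sum, COUNT_BIG(*) AS cnt
FROM dbo.T1 AS t1
INNER JOIN dbo.T2 AS t2 ON t1.id = t2.id
GROUP BY t1.a + t2.b;

CREATE UNIQUE CLUSTERED INDEX IX_v_sums ON dbo.v_sums (ab_sum);
Once the unique clustered index exists, the summed values are persisted, and queries against the view (or, on Enterprise edition, matching queries against the base tables) can seek on ab_sum.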
HOWEVER
This is a local optimization (google "no free lunch"), in that it will result in faster select performance for you at the expense of slower inserts and updates for others.

Spark sql queries vs dataframe functions

To get good performance with Spark, I'm wondering whether it is better to run queries as SQL via SQLContext, or via DataFrame functions like df.select().
Any idea? :)
There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.
Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modifications with every supported language. With HiveContext, these can also be used to expose some functionalities which can be inaccessible in other ways (for example UDF without Spark wrappers).
Ideally, Spark's Catalyst optimizer should compile both kinds of query to the same execution plan, so the performance should be the same; how you write the query is just a matter of style.
In reality, there can be a difference. According to a report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), SQL outperformed DataFrames in a case that required grouping records with their total counts and sorting descending by record name.
By using DataFrames, one can break the SQL into multiple statements/queries, which helps with debugging, easy enhancements and code maintenance.
Breaking complex SQL queries into simpler ones and assigning each result to a DF brings better understanding.
By splitting a query into multiple DFs, the developer gains the advantage of using cache and repartition (to distribute data evenly across the partitions using a unique/close-to-unique key).
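If you are restricted to SQL (as the original asker is), the same splitting technique is available without the DataFrame API, via temporary views and CACHE TABLE. A minimal sketch with hypothetical table and column names:
CREATE OR REPLACE TEMPORARY VIEW grouped AS
SELECT key, COUNT(*) AS cnt
FROM big_table
GROUP BY key;

CACHE TABLE grouped;

SELECT * FROM grouped ORDER BY cnt DESC;
Each intermediate view can be inspected, cached and reused, mirroring what you would do with intermediate DataFrames.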
The only thing that matters is what kind of underlying algorithm is used for grouping.
HashAggregation is more efficient than SortAggregation. SortAggregation sorts the rows and then gathers together the matching rows: O(n log n).
HashAggregation creates a HashMap using the grouping columns as the key and the rest of the columns as the values in the map.
Spark SQL uses HashAggregation where possible (i.e. when the aggregation buffer values are of mutable types): O(n)
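You can see which strategy you are getting by checking the physical plan for a HashAggregate or SortAggregate node, e.g. (the table df and its columns are hypothetical here):
EXPLAIN SELECT key, COUNT(*) FROM df GROUP BY key;
Aggregations whose buffers hold mutable primitive types (counts, sums, averages over numeric columns) typically compile to HashAggregate, while immutable buffer types such as strings can force the slower SortAggregate.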

Surprising SQL speed increase

I’ve just found out that the execution plan performance of the following two select statements is massively different:
select * from your_large_table
where LEFT(some_string_field, 4) = '2505'
select * from your_large_table
where some_string_field like '2505%'
The relative costs shown in the execution plans are 98% and 2% respectively. Bit of a difference in speed, then. I was actually shocked when I saw it.
I've always done LEFT(xxx) = 'yyy' as it reads well.
I actually found this out by checking the LINQ-generated SQL against my hand-crafted SQL. I assumed the LIKE comparison would be slower, but it is in fact much, much faster.
My question is: why is LEFT() slower than LIKE '..%'? They are, after all, identical?
Also, is there a CPU hit by using LEFT()?
More generally speaking, you should never apply a function to the column side of a comparison in a WHERE clause. If you do, SQL Server won't use an index; it has to evaluate the function for every row of the table. The goal is to make sure that your WHERE clause is "sargable".
Some other examples:
Bad: Select ... WHERE isNull(FullName,'') = 'Ed Jones'
Fixed: Select ... WHERE ((FullName = 'Ed Jones') OR (FullName IS NULL))
Bad: Select ... WHERE SUBSTRING(DealerName,4) = 'Ford'
Fixed: Select ... WHERE DealerName Like 'Ford%'
Bad: Select ... WHERE DateDiff(mm,OrderDate,GetDate()) >= 30
Fixed: Select ... WHERE OrderDate < DateAdd(mm,-30,GetDate())
Bad: Select ... WHERE Year(OrderDate) = 2003
Fixed: Select ... WHERE OrderDate >= '2003-1-1' AND OrderDate < '2004-1-1'
It looks like the expression LEFT(some_string_field, 4) is evaluated for every row of a full table scan, while the "like" expression will use the index.
Optimizing "like" to use an index if it is a front-anchored pattern is a much easier optimization than analyzing arbitrary expressions involving string functions.
There's a huge impact from using function calls in WHERE clauses, as SQL Server must calculate the result for each row. On the other hand, LIKE is a built-in language feature which is highly optimized.
If you use a function on an indexed column, the DB no longer uses the index (at least with Oracle, anyway).
So I am guessing that your example field some_string_field has an index on it which doesn't get used for the query with LEFT.
Why do you say they are identical? They might solve the same problem, but their approach is different. At least it seems like that...
The query using LEFT optimizes the test, since it already knows the length of the prefix and so on; in a C/C++ program, or in the absence of an index, an algorithm using LEFT to implement this particular LIKE behaviour would be the fastest. But in contrast to most non-declarative languages, on a SQL database a lot of optimizations are done for you. For example, LIKE is probably implemented by first looking for the % sign; if it notices that the % is the last character in the string, the query can be optimized in much the same way as you did using LEFT, but directly using an index.
So, indeed, I think you were right after all: they probably are identical in their approach. The only difference is that the DB server can use an index in the LIKE query, because there is no function transforming the column value into something unknown in the WHERE clause.
What happened here is either that the RDBMS is not capable of using an index on the LEFT() predicate while being capable of using it on the LIKE, or it simply made the wrong call about which would be the more appropriate access method.
Firstly, it may be true for some RDBMSs that applying a function to a column prevents an index-based access method from being used, but that is not a universal truth, nor is there any logical reason why it needs to be. An index-based access method (such as Oracle's full index scan or fast full index scan) might be beneficial but in some cases the RDBMS is not capable of the operation in the context of a function-based predicate.
Secondly, the optimiser may simply get the arithmetic wrong in estimating the benefits of the different available access methods. Assuming that the system can perform an index-based access method, it first has to estimate the number of rows that will match the predicate, either from statistics on the table, statistics on the column, by sampling the data at parse time, or by using a heuristic rule (e.g. "assume 5% of rows will match"). Then it has to assess the relative costs of a full table scan and the available index-based methods. Sometimes it will get the arithmetic wrong, sometimes the statistics will be misleading or inaccurate, and sometimes the heuristic rules will not be appropriate for the data set.
The key point is to be aware of a number of issues:
What operations can your RDBMS support?
What would be the most appropriate operation in the case you are working with?
Is the system's choice correct?
What can be done to either allow the system to perform a more efficient operation (eg. add a missing not null constraint, update the statistics etc)?
In my experience this is not a trivial task, and is often best left to experts. Or on the other hand, just post the problem to Stackoverflow -- some of us find this stuff fascinating, dog help us.
As @BradC mentioned, you shouldn't use functions in a WHERE clause if you have indexes and want to take advantage of them.
If you read the section entitled "Use LIKE instead of LEFT() or SUBSTRING() in WHERE clauses when Indexes are present" from these SQL Performance Tips, there are more examples.
It also hints at questions you'll encounter on the MCSE SQL Server 2012 exams if you're interested in taking those too. :-)

Working around UDF Performance Issues - Manual caching

My system does some pretty heavy processing, and I've been attacking its performance in order to run more test runs in shorter times.
I have quite a few cases where a UDF has to be called on, say, 5 million rows (and I pretty much thought there was no way around it).
Well, it turns out there is a way to work around it, and it gives huge performance improvements when UDFs are called over a set of distinct parameters somewhat smaller than the total set of rows.
Consider a UDF that takes a set of inputs and returns a result based on complex logic. For a set of inputs over 5m rows, there may be only, say, 100,000 distinct inputs, so it will only produce 100,000 distinct result tuples (my particular cases vary from interest rates to complex code assignments, but they are all discrete). The fundamental point of this technique is that you can determine whether the trick will work simply by running the SELECT DISTINCT.
I found that by doing something like this:
INSERT INTO PreCalcs
SELECT param1
      ,param2
      ,dbo.udf_result(param1, param2) AS result
FROM (
    SELECT DISTINCT param1, param2 FROM big_table
) AS distinct_params   -- the derived table needs an alias
When PreCalcs is suitably indexed, the combination of that with:
SELECT big_table.param1
      ,big_table.param2
      ,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
    ON PreCalcs.param1 = big_table.param1
   AND PreCalcs.param2 = big_table.param2
You get a HUGE boost in performance. Apparently, just because something is deterministic, it doesn't mean SQL Server is caching the past calls and re-using them, as one might think.
The only thing you have to watch out for is where NULLs are allowed; then you need to fix up your joins carefully:
SELECT big_table.param1
      ,big_table.param2
      ,PreCalcs.result
FROM big_table
INNER JOIN PreCalcs
    ON (PreCalcs.param1 = big_table.param1
        OR COALESCE(PreCalcs.param1, big_table.param1) IS NULL)
   AND (PreCalcs.param2 = big_table.param2
        OR COALESCE(PreCalcs.param2, big_table.param2) IS NULL)
Hope this helps; any similar tricks with UDFs, or refactorings of queries for performance, are welcome.
I guess the question is: why is manual caching like this necessary? Isn't that the point of the server knowing that the function is deterministic? And if it makes such a big difference, and if UDFs are so expensive, why doesn't the optimizer just do it in the execution plan?
Yes, the optimizer will not memoize UDFs for you. Your trick is very nice in cases where you can collapse the output set down in this way.
Another technique that can improve performance, if your UDF's parameters are indices into other tables and the UDF selects values from those tables to calculate the scalar result, is to rewrite your scalar UDF as a table-valued UDF that selects the result value over all your potential parameters.
I've used this approach when the tables the UDF query was based on were subject to a lot of inserts and updates, the query involved was relatively complex, and the number of rows the original UDF had to be applied to was large. You can achieve a great improvement in performance in this case, as the table-valued UDF only needs to be run once and runs as an optimized, set-oriented query.
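A rough T-SQL sketch of that rewrite, with hypothetical names; the scalar logic is assumed to combine values looked up from two parameter tables:
-- inline table-valued function producing results for every parameter combination
CREATE FUNCTION dbo.udf_result_set()
RETURNS TABLE
AS
RETURN
(
    SELECT r.param1,
           f.param2,
           r.rate * f.factor AS result   -- stand-in for the complex logic
    FROM dbo.rates AS r
    CROSS JOIN dbo.factors AS f
);

-- applied once as a set-oriented join instead of millions of scalar calls:
SELECT b.param1, b.param2, s.result
FROM big_table AS b
INNER JOIN dbo.udf_result_set() AS s
    ON s.param1 = b.param1
   AND s.param2 = b.param2;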
How would SQL Server know that you have 100,000 discrete combinations within 5 million rows?
By using the PreCalcs table, you are simply running the UDF over 100k rows rather than 5 million rows, before expanding back out again.
No optimiser in existence would be able to divine this useful information.
The scalar udf is a black box.
For a more practical solution, I'd use a computed, persisted column that does the UDF call.
So it's available in all queries and can be indexed/included.
This suits OLTP more, maybe... I query a table to get trading cash and positions in real time in many different ways, so this approach suits me, avoiding the UDF math overhead every time.
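A minimal sketch of the persisted-computed-column approach in T-SQL (this assumes dbo.udf_result is deterministic and created WITH SCHEMABINDING, which SQL Server requires before the column can be persisted and indexed):
ALTER TABLE big_table
ADD result AS dbo.udf_result(param1, param2) PERSISTED;

CREATE INDEX IX_big_table_result ON big_table (result);
The UDF then runs once when a row is inserted or updated, instead of once per row per query.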