I am using Spark 2.4 with Java.
In my Spark job I am computing aggregations such as avg and percentiles.
They are written with a GROUP BY clause, but it is painfully slow.
Hence I tried to write the same aggregations with a window function (PARTITION BY and ORDER BY clauses), but that turned out even slower.
Window functions are supposed to be faster, right? So how do I tune them for better performance?
Any good resources or notes would be highly appreciated.
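For concreteness, a minimal sketch of the two formulations as I read the question (employees and its columns are made-up names for illustration):

-- GROUP BY version: Spark can pre-aggregate on each executor
-- (a partial aggregate) before shuffling, so only partial results move.
select dept, avg(salary) as avg_salary
from employees
group by dept;

-- Window version: every input row has to be shuffled into its partition
-- and kept, because the result is attached to each row. If you only need
-- one row per group, this does strictly more work than the GROUP BY form.
select dept, avg(salary) over (partition by dept) as avg_salary
from employees;

So if a single row per group is all you need, the GROUP BY form is usually the one worth tuning, rather than replacing it with a window.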
Related
I wonder how this query executes successfully. As we know, the HAVING clause executes before the SELECT one, so how does an alias defined in the SELECT statement work in the HAVING condition without giving any error?
"As we know 'having' clause execute before the select one"
This affirmation is wrong. The HAVING clause is used to apply filters on aggregation functions (such as SUM, AVG, COUNT, MIN and MAX). Since those need to be calculated BEFORE any filter is applied, the SELECT list has, in fact, already been evaluated by the time the HAVING clause starts to be processed.
Even if the previous paragraph were not true, it is important to consider that SQL statements are interpreted as a whole before any processing. Because of this, the order of the clauses does not really matter: the interpreter can link all references so that they make sense at runtime.
So it would be perfectly feasible to put the HAVING clause before the SELECT, or in any other part of the statement, because this is just a syntax decision. The HAVING clause currently sits after the GROUP BY clause because someone decided that this syntax makes more sense in SQL.
Finally, you should consider that allowing you to reference something by an alias is much more a language feature than a rationale for how the statement is processed.
The order of execution is:
Getting Data (From, Join)
Row Filter (Where)
Grouping (Group by)
Group Filter (Having)
Return Expressions (Select)
Order & Paging (Order by & Limit / Offset)
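As an illustration of that order, here is an annotated query (employees and its columns are hypothetical names):

select dept, count(*) as cnt   -- (5) return expressions
from employees                 -- (1) getting data
where active = 1               -- (2) row filter
group by dept                  -- (3) grouping
having count(*) > 10           -- (4) group filter
order by cnt desc;             -- (6) order & paging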
I still don't get what you are asking about. Syntactically your SELECT query is correct, but whether it gives the correct result we cannot know.
The Spark SQL engine is obviously different from a normal SQL engine because it is a distributed SQL engine. The normal SQL order of execution does not apply here, because when you execute a query via Spark SQL, the engine converts it into an optimized DAG before it is distributed across your worker nodes. The worker nodes then perform map, shuffle, and reduce tasks before the result is aggregated and returned to the driver node. Read more about the Spark DAG here.
Therefore, there is more than just one round of selecting, filtering, and aggregating happening before any result is returned. You can see this yourself by clicking on the Spark job view in the Databricks query result panel and then selecting Associated SQL Query.
So, when it comes to Spark SQL, I recommend we refer to the Spark documentation, which clearly states that the HAVING clause can refer to an aggregation function by its alias.
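For example, a query along these lines (table and column names are made up for illustration) is accepted by Spark SQL:

select dept, avg(salary) as avg_salary
from employees
group by dept
having avg_salary > 50000;  -- alias from the select list, referenced in HAVING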
In Spark one can write SQL queries as well as use the Spark API functions. reduceByKey should always be preferred over groupByKey, as it prevents excess shuffling.
I would like to know: when you run SQL queries by registering a DataFrame, how can we use reduceByKey? In SQL there is only GROUP BY, no reduce-by. Does it internally optimise a GROUP BY into something like reduceByKey?
I got it. I actually ran an explain to understand the physical plan: it first executes a partial_sum function and only afterwards the final sum, which implies that it performs a sum within each executor first and only then shuffles the partial results across the cluster.
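To reproduce this (employees is a placeholder table name), you can ask Spark SQL for the physical plan directly:

-- in spark-sql, or via spark.sql("explain ...") from Java
explain
select dept, sum(salary)
from employees
group by dept;

-- In the printed plan, look for two HashAggregate operators: one with
-- partial_sum(salary), appearing after the Exchange in the output (the
-- per-executor pre-aggregation), and one with sum(salary) at the top
-- (the final merge after the shuffle).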
I ran into a Hive query calculating a count distinct without grouping, which runs very slowly. So I was wondering how this functionality is implemented in Hive: is there a UDAFCountDistinct for this?
Hive 1.2.0+ provides an auto-rewrite optimization for count(distinct). Check this setting:
hive.optimize.distinct.rewrite=true;
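On older versions you can apply the same idea by hand: rewrite the single-reducer count(distinct) as a two-stage aggregation (my_table and col are placeholder names):

-- slow: one reducer has to de-duplicate every value
select count(distinct col) from my_table;

-- usually much faster: the de-duplication runs as a parallel group by
select count(*) from (
  select col from my_table group by col
) t;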
I am trying to find out at what point windowed functions are evaluated in SQL.
I know they can be used in the SELECT and ORDER BY clauses, so I am inclined to think they happen after ORDER BY, but before TOP.
Window functions happen when the optimizer decides that they should happen. This is best understood by looking at the query plan.
SQL Server advertises the logical processing order of queries. This is used to explain scoping rules (in particular). It is not related to how the query is actually executed.
Clearly, the rules for window functions are:
The effect is after the FROM, WHERE, GROUP BY, and HAVING clauses are processed.
The effect is not related to the ORDER BY (even if you use ORDER BY (SELECT NULL)).
TOP does not affect the processing.
The processing occurs before SELECT DISTINCT.
I think the conclusion is that they are parsed in the SELECT or ORDER BY as with other expressions in those clauses; there is no separate place for them.
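A quick way to see the SELECT DISTINCT rule in action (employees is a hypothetical table):

select distinct department,
       count(*) over (partition by department) as dept_rows
from employees;

-- count(*) over (...) is computed against the underlying rows of each
-- department, and only afterwards does DISTINCT collapse the duplicates,
-- which is why every department ends up as a single row carrying its
-- full row count.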
Are there any specialized databases - rdbms, nosql, key-value, or anything else - that are optimised for running fast aggregate queries or map-reduces like this over very large data sets:
select date, count(*)
from Sales
where [various combinations of filters]
group by date
So far I've run benchmarks on MongoDB and SQL Server, but I'm wondering if there's a more specialized solution, preferably one that can scale data horizontally.
In my experience, the real issue has less to do with aggregate query performance, which I find good in all major databases I've tried, than it has to do with the way queries are written.
I've lost count of the number of times I've seen enormous report queries with huge amounts of joins and inline subquery aggregates all over the place.
Off the top of my head, the typical steps to make these things faster are:
Use window functions where available and applicable (i.e. the over () operator). There's absolutely no point in refetching data multiple times.
Use common table expressions (with queries) where available and applicable (i.e. sets that you know will be reasonably small).
Use temporary tables for large intermediary results, and create indexes on them (and analyze them) before using them.
Work on small result sets by filtering rows earlier when possible: select id, aggregate from (aggregate on id) where id in (?) group by id can be made much faster by rewriting it as select id, aggregate from (aggregate on id where id in (?)) group by id (see the concrete sketch after this list).
Use union/except/intersect all rather than union/except/intersect where applicable. This removes pointless sorting of result sets.
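A concrete version of the filter-early rewrite mentioned above, with placeholder names (sales, id, amount):

-- slower: the inner query aggregates every id before the filter applies
select id, total
from (
  select id, sum(amount) as total
  from sales
  group by id
) t
where id in (1, 2, 3);

-- usually much faster: filter first, aggregate only the ids you need
select id, sum(amount) as total
from sales
where id in (1, 2, 3)
group by id;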
As a bonus, the first three steps all tend to make the report queries more readable and thus more maintainable.
Oracle, DB2 Warehouse Edition, and to a lesser degree SQL Server Enterprise are all very good at these aggregate queries -- of course these are expensive solutions, and it depends very much on your budget and business case whether it's worth it.
Pretty much any OLAP database will do; this is exactly the type of thing they're designed for.
OLAP data cubes are designed for this. You denormalize data into forms that can be computed on quickly. The denormalization and precomputation steps can take time, so these databases are typically built only for reporting and are kept separate from the real-time transactional data.
For certain kinds of data (large volumes, time series) kx.com provides probably the best solution: kdb+. If it looks like your kind of data, give it a try. Note: they don't use SQL, but rather a more general, more powerful, and more crazy set-theoretical language.