List of aggregation functions in Spark SQL

I'm looking for a list of pre-defined aggregation functions in Spark SQL. I have in mind something analogous to Presto Aggregate Functions.
I Ctrl+F'd around a little in the SQL API docs to no avail... it's also hard to tell at a glance which functions are for aggregation vs. not. For example, if I didn't know avg is an aggregation function I'd be hard pressed to tell it is one (in a way that's actually scalable to the full set of functions):
avg - avg(expr) - Returns the mean calculated from values of a group.
If such a list doesn't exist, can someone at least confirm to me that there's no pre-defined function like any/bool_or or all/bool_and to determine if any or all of a boolean column in a group are true (or false)?
For now, my workaround is
select grp_col, count(if(bool_col, true, NULL)) > 0 any_agg
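Spelled out as a full statement (just a sketch - the table name t is a placeholder), with the matching all/bool_and variant:
-- t, grp_col, bool_col are placeholder names
select
  grp_col,
  count(if(bool_col, true, null)) > 0 as any_agg,  -- at least one row in the group is true
  count(if(bool_col, null, true)) = 0 as all_agg   -- no row in the group is false or null, i.e. all true
from t
group by grp_col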

Just take a look at the Aggregate functions section of the Spark docs

The list of functions is here under RelationalGroupedDataset - specifically the APIs that return DataFrame (not RelationalGroupedDataset):
https://spark.apache.org/docs/latest/api/scala/index.html?org/apache/spark/sql/RelationalGroupedDataset.html#org.apache.spark.sql.RelationalGroupedDataset
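As a rough aid from within Spark SQL itself (a sketch; the exact output format varies by Spark version), you can enumerate the registered functions and inspect one in detail - the EXTENDED description includes the implementing class, which for aggregates lives under the ...expressions.aggregate package:
-- list every registered function, then look one up in detail
SHOW FUNCTIONS;
DESCRIBE FUNCTION EXTENDED avg;  -- the Class line points at org.apache.spark.sql.catalyst.expressions.aggregate.Average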

Related

Qlik sense: How to aggregate strings into single row in script

I am trying to aggregate strings that belong to the same product code in one row. Which Qlik sense aggregation function should I use?
I am able to aggregate integers in an example like this, but it fails for string aggregation.
Have you tried MaxString()? It is a string aggregation function.
As x3ja mentioned, you can use an aggregation function in charts that will work for strings, including:
MaxString()
Only()
Concat()
These can result in the type of thing you're looking for.
It's worth noting, though, that this sort of problem is almost always an issue with the underlying data model. Depending on what your source data looks like, you should consider investigating your use of Join and/or Concatenate. You can see more info on how to use those functions on this Qlik Help page.
A very basic example of this is using a Join to properly combine the data so that all of it shows up in a single record, without needing any aggregations in the table chart.

CURRENT in BigQuery?

I've noticed that CURRENT is a reserved keyword for BigQuery at: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical.
What exactly does CURRENT do? I've only seen it as a prefix for things such as CURRENT_TIME(), CURRENT_DATE(), and other such stuff but have never seen it by itself. Is this just reserved for future usage or do any SQL statements contain that as a keyword?
Just to add on to the comment of @Jaytiger:
The CURRENT keyword seems to be reserved as part of the SQL:2016 spec (en.wikipedia.org/wiki/SQL_reserved_words), and you can check its usage in other DBMS implementations like Oracle and SQL Server. Hope this is helpful.
stackoverflow.com/questions/49110728/where-current-of-in-pl-sql
In BigQuery, CURRENT is used (as part of CURRENT ROW) when defining frame_start and frame_end in window functions.
A window function, also known as an analytic function, computes values
over a group of rows and returns a single result for each row.
A common usage for this is calculating a cumulative sum for each category in the table. See BigQuery window function examples for reference.
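For illustration, here is a minimal sketch (the orders table and its columns are made-up names) in which CURRENT ROW marks the upper boundary of the window frame for a running total:
-- cumulative sum per category; orders, category, order_date, amount are placeholder names
SELECT
  category,
  order_date,
  SUM(amount) OVER (
    PARTITION BY category
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM orders;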

How to write Window functions using Druid?

For example, I want to write window functions like sum(...) over (window).
Since the OVER clause is not supported by Druid, how do I achieve the same using the Druid native query API or the SQL API?
You should use a GroupBy query. As Druid is a time series database, you have to specify the interval (window) you want to query data from. You can use aggregation methods over this data, for example a SUM() aggregation.
If you want, you can also do extra filtering within your aggregation, like "only sum records where city = paris". You could also apply the SUM aggregation only to records that exist in a certain time window within your selected interval.
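For example, in Druid SQL this could look like the sketch below (the sales datasource and its columns are placeholder names; __time is Druid's built-in time column):
-- daily SUM over a one-month interval, plus a filtered SUM for one city
SELECT
  TIME_FLOOR(__time, 'P1D') AS "day",
  SUM(amount) AS total_amount,
  SUM(amount) FILTER (WHERE city = 'paris') AS paris_amount
FROM sales
WHERE __time >= TIMESTAMP '2023-01-01'
  AND __time <  TIMESTAMP '2023-02-01'
GROUP BY 1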
If you are a PHP user then maybe this package is handy for you: https://github.com/level23/druid-client#sum
We have tried to implement an easy way to query such data.

Spark SQL - How to avoid sort-based-aggregation with string aggregated columns

I use Spark SQL 2.2.0.
When executing a query such as:
spark.sql("select COL1, min(STRING_COL2) from TB1 group by COL1").explain()
Spark will use sort aggregate since STRING_COL2 is a string column. In most cases sort-based aggregation is much more expensive than hash-based aggregation.
Specifying a string column in the GROUP BY clause will not force sort-based aggregation.
If you replace min(STRING_COL2) with sort_array(collect_set(STRING_COL2))[0], Spark will use ObjectHashAggregation, which is much better than SortAggregate (two times faster in my case).
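Spelled out, the rewritten query is:
-- same grouping as before, but the string aggregation now goes through collect_set
select COL1, sort_array(collect_set(STRING_COL2))[0] as min_string_col2
from TB1
group by COL1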
However, collecting a set of distinct values, sorting it, and finally taking the first value requires more memory and consumes more CPU resources than just comparing two values (as MIN is supposed to do). In addition, ObjectHashAggregation will fall back to SortAggregate if too many entries are aggregated.
How can I avoid the heavy sort without increasing memory consumption?
Why are MIN and MAX of string columns not supported by HashAggregate?
When will it be supported?
Thanks.
Hash-based aggregation is the default, but it may fall back to sort-based aggregation when there are too many keys in the GROUP BY, exceeding the buffer size of hash-based aggregation.
See this blog.
Possibly too late to answer. Anyway, what I found from the code is that only a limited set of functions uses object hash aggregation, and the list can be found in the link below. So functions like min and max still use sort aggregate if the data type is immutable (e.g. String).
The reason functions like min and max do not derive from TypedImperativeAggregate, the way functions like percentile etc. do, is that they accept expressions, not just column names.
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Expression-TypedImperativeAggregate.html
You can set spark.sql.execution.useObjectHashAggregateExec=true first; it has been available since Spark 2.2.0. If it doesn't work, you can try a higher Spark version, and also check whether the string column (STRING_COL2) is too long.
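For reference, the flag mentioned above can be set for the current session in Spark SQL like so:
SET spark.sql.execution.useObjectHashAggregateExec=true;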

What is the use case for Merge function SQL Clr?

I am writing a CLR user-defined aggregate function to implement median. While I understand all the other functions I have to implement, I cannot understand what the use of the Merge function is.
I have a vague idea that if the aggregate function is partially evaluated (i.e. evaluated for some rows of a group in one place and the remaining rows in another), then the partial values need to be merged. If that is the case, is there a way to test this?
Please let me know if any of the above is not clear or if you need any further information.
Your vague idea is correct.
From Requirements for CLR User-Defined Aggregates
This method can be used to merge another instance of this aggregate
class with the current instance. The query processor uses this method
to merge multiple partial computations of an aggregation.
The parameter to Merge is another instance of your aggregate class, and you should merge the aggregated data from that instance into your current instance.
You can have a look at the sample string concatenation aggregate. Its Merge method adds the concatenated strings from the parameter to the current instance of the aggregate class.
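One way to exercise Merge (just a sketch, and not guaranteed - the optimizer decides whether to split the work; dbo.Median, dbo.BigSampleTable and the column names are hypothetical) is to run your aggregate over a table big enough to get a parallel plan, since each worker then builds a partial aggregate that the query processor combines via Merge:
-- hypothetical CLR aggregate over a large table; check the actual plan for parallelism
SELECT GroupId, dbo.Median(Amount) AS MedianAmount
FROM dbo.BigSampleTable
GROUP BY GroupId
OPTION (MAXDOP 4);  -- allow a parallel plan, which produces partial aggregates that get merged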