Spark groupBy aggregation result joined back to the initial data frame

I realize that I am often tempted to do the following:
val df_mean = df.groupBy("category").agg(mean("column1") as "mean")
val df_with_mean = df.join(df_mean, Seq("category"))
So basically I want all the rows of my initial dataframe to have a column which is the mean value of their category.
Is it the correct way to achieve that? Any better idea?

It is correct (it yields the expected results) and idiomatic. The DataFrame DSL is just a wrapper around SQL, and the equivalent standard SQL solution can be expressed as follows:
WITH means AS (SELECT category, avg(column1) AS mean FROM df GROUP BY category)
SELECT df.category, df.column1, means.mean
FROM df JOIN means ON df.category = means.category
You can easily check that this generates the same execution plan as df_with_mean.
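For instance, a minimal sketch of that check, assuming a Spark 2.x-style SparkSession named spark (on 1.x you would use sqlContext.sql and registerTempTable instead):
df.createOrReplaceTempView("df")

val sqlVersion = spark.sql("""
  WITH means AS (SELECT category, avg(column1) AS mean FROM df GROUP BY category)
  SELECT df.category, df.column1, means.mean
  FROM df JOIN means ON df.category = means.category
""")

// Both calls should print the same aggregate-then-join physical plan.
df_with_mean.explain()
sqlVersion.explain()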
It is possible to express the same logic using window functions:
SELECT *, avg(column1) OVER w AS mean FROM df
WINDOW w AS (
  PARTITION BY category
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
with the DSL equivalent:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

val w = Window.partitionBy($"category").rowsBetween(Long.MinValue, Long.MaxValue)
df.select($"*", avg($"column1").over(w).alias("mean"))
but in general Spark doesn't perform particularly well with an UNBOUNDED FOLLOWING frame.

Related

Get max(timestamp) for each group with repetition (IBM DB2)

ANSWER: Use the LEAD and PARTITION BY window functions; see Gordon's answer below.
I have the following dataset:
I want to get rows 1,3,5,7 in the result set.
RESULT SET SHOULD LOOK LIKE:
11/10/2020 19:36:11.548955 IN_REVIEW
11/8/2020 19:36:11.548955 EXPIRED
11/6/2020 19:36:11.548955 IN_REVIEW
11/4/2020 19:36:11.548955 ACTIVE
Use window functions. LEAD() gets the value from the "next" row, so filter only when the value changes:
SELECT t.*
FROM (SELECT t.*,
             LEAD(interac_Reg_stat) OVER (PARTITION BY Acct_No ORDER BY xcn_tmstmp) AS next_interac_Reg_stat
      FROM TABLE t
     ) t
WHERE interac_Reg_stat <> next_interac_Reg_stat OR
      next_interac_Reg_stat IS NULL;
Well, since you are grouping by interac_Reg_stat, you will never get 2 separate rows for the IN_REVIEW status. To get the result you want, you will need to add a sub-query to find all items with a certain interac_Reg_stat that are before another specific interac_Reg_stat.
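As an aside for readers coming from the Spark questions in this collection, the same LEAD-and-compare idiom carries over to the Spark DataFrame DSL. A hypothetical sketch, assuming a DataFrame named events with the question's column names:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

// Look one row ahead within each account, ordered by timestamp, and keep
// only the rows where the status is about to change (or the history ends).
val w = Window.partitionBy("Acct_No").orderBy("xcn_tmstmp")

val changes = events
  .withColumn("next_stat", lead(col("interac_Reg_stat"), 1).over(w))
  .where((col("interac_Reg_stat") =!= col("next_stat")) || col("next_stat").isNull)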

Stream Analytics UDF - using the output from one UDF in another

The following code results in my GT2HP value being null in the follow-on UDFs:
SELECT
UDF.GT2HP(Collect()) as GT2HP,
UDF.LPLPReturns(Collect()) as LPLPReturns,
UDF.LPGasHeater(Collect()) as LPGasHeater,
UDF.HPRaisedSW(Collect(), AVG(GT2HP)) as HPRaisedSW,
UDF.HPCustomerDemand(Collect(), AVG(GT2HP)) as HPCustomerDemand
INTO SQLDWUKSTEAMLOSS
FROM IotHubInput
WHERE IoTHub.ConnectionDeviceId = 'uk-iotedge'
GROUP BY TumblingWindow(second, 60)
The following code works:
SELECT
UDF.GT2HP(Collect()) as GT2HP,
UDF.LPLPReturns(Collect()) as LPLPReturns,
UDF.LPGasHeater(Collect()) as LPGasHeater,
UDF.HPRaisedSW(Collect(), UDF.GT2HP(Collect())) as HPRaisedSW,
UDF.HPCustomerDemand(Collect(), UDF.GT2HP(Collect())) as HPCustomerDemand
INTO SQLDWUKSTEAMLOSS
FROM IotHubInput
WHERE IoTHub.ConnectionDeviceId = 'uk-iotedge'
GROUP BY TumblingWindow(second, 60)
Obviously the second query is far more computationally expensive than the first, and I'd like to avoid it if possible.
I'd like to use the output of the first UDF in my follow-on UDFs, but it seems to pass on null. All the select expressions appear to be evaluated in parallel rather than serially, which probably explains the null.
Is there a way to use the output of one UDF in another UDF?
The reason the GT2HP column referenced in AVG(GT2HP) is always null comes down to SQL semantics.
Columns in the SELECT clause can only refer to sources referenced in FROM, and since there is no IotHubInput.GT2HP, it is interpreted as null.
If you separate your query into multiple steps, as Vignesh suggested, you will end up with the first step just computing the COLLECT over the 60-second window:
SELECT Collect() AS c
FROM IotHubInput
WHERE IoTHub.ConnectionDeviceId = 'uk-iotedge'
GROUP BY TumblingWindow(second, 60)
Let's name it step1. Now, since you are grouping only by a window, you will have just one value of column c every 60 sec.
Any aggregation over this is not necessary unless you increase the size of the window to aggregate more than one value...
So the AVG in the AVG(GT2HP) is unnecessary.
The second step then will be:
SELECT
c,
GT2HP = UDF.GT2HP(c)
FROM step1
Let's call this step step2.
Now the final selection will be:
SELECT
GT2HP,
UDF.LPLPReturns(c) as LPLPReturns,
UDF.LPGasHeater(c) as LPGasHeater,
UDF.HPRaisedSW(c, GT2HP) as HPRaisedSW,
UDF.HPCustomerDemand(c, GT2HP) as HPCustomerDemand
INTO SQLDWUKSTEAMLOSS
FROM step2
And putting it all together:
WITH step1 AS (
SELECT Collect() AS c
FROM IotHubInput
WHERE IoTHub.ConnectionDeviceId = 'uk-iotedge'
GROUP BY TumblingWindow(second, 60)
),
step2 AS (
SELECT
c,
GT2HP = UDF.GT2HP(c)
FROM step1
)
SELECT
GT2HP,
UDF.LPLPReturns(c) as LPLPReturns,
UDF.LPGasHeater(c) as LPGasHeater,
UDF.HPRaisedSW(c, GT2HP) as HPRaisedSW,
UDF.HPCustomerDemand(c, GT2HP) as HPCustomerDemand
INTO SQLDWUKSTEAMLOSS
FROM step2
You can write it as two statements: the first selects Collect() and avg() with a GROUP BY; the second select uses those results to call the UDFs.

SPARK SQL Equivalent of Qualify + Row_number statements

Does anyone know the best way for Apache Spark SQL to achieve the same results as the standard SQL QUALIFY + rank or row_number statements?
For example:
I have a Spark Dataframe called statement_data with 12 monthly records each for 100 unique account_numbers, therefore 1200 records in total
Each monthly record has a field called "statement_date" that can be used for determining the most recent record
I want my final result to be a new Spark Dataframe with the 3 most recent records (as determined by statement_date descending) for each of the 100 unique account_numbers, therefore 300 final records in total.
In standard Teradata SQL, I can do the following:
select * from statement_data
qualify row_number ()
over(partition by acct_id order by statement_date desc) <= 3
Apache Spark SQL does not have a standalone qualify function that I'm aware of; maybe I'm screwing up the syntax, or I can't find documentation showing that qualify exists.
It is fine if I need to do this in two steps as long as those two steps are:
A select query or alternative method to assign rank/row numbering for each account_number's records
A select query where I'm selecting all records with rank <= 3 (i.e. choose 1st, 2nd, and 3rd most recent records).
EDIT 1 - 7/23 2:09pm:
The initial solution provided by zero323 was not working for me in Spark 1.4.1 with Spark SQL 1.4.1 dependency installed.
EDIT 2 - 7/23 3:24pm:
It turns out the error was related to using SQL Context objects for my query instead of Hive Context. I am now able to run the below solution correctly after adding the following code to create and use a Hive Context:
final JavaSparkContext sc2;
final HiveContext hc2;
DataFrame df;
hc2 = TestHive$.MODULE$;
sc2 = new JavaSparkContext(hc2.sparkContext());
....
// Initial Spark/SQL contexts to set up Dataframes
SparkConf conf = new SparkConf().setAppName("Statement Test");
...
DataFrame stmtSummary =
hc2.sql("SELECT * FROM (SELECT acct_id, stmt_end_dt, stmt_curr_bal, row_number() over (partition by acct_id order by stmt_curr_bal DESC) rank_num FROM stmt_data) tmp WHERE rank_num <= 3");
There is no qualify (it is usually useful to check the parser source), but you can use a subquery like this:
SELECT * FROM (
    SELECT *, row_number() OVER (
        PARTITION BY acct_id ORDER BY statement_date DESC
    ) rank FROM df
) tmp WHERE rank <= 3
See also SPARK : failure: ``union'' expected but `(' found
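For completeness, the same pattern in the DataFrame DSL might look like the following. This is a sketch only, assuming Spark 2.x naming (in 1.4 the function was called rowNumber) and that statement_data is the DataFrame from the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank each account's statements from newest to oldest, then keep the top 3.
val w = Window.partitionBy("acct_id").orderBy(col("statement_date").desc)

val top3 = statement_data
  .withColumn("rank", row_number().over(w))
  .where(col("rank") <= 3)
  .drop("rank")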

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example, with the following sample data:
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the WHERE clause, but I can't even get a start.
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
    SELECT *, min(place) OVER (PARTITION BY athlete
                               ORDER BY day
                               ROWS 3 PRECEDING) AS best
    FROM t
) sub
WHERE best > 1
->SQLfiddle
Uses the aggregate function min() as a window function to get the minimum place of the last three rows plus the current one.
The then-trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or subquery to put a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers; I substituted day for your date.
This assumes at most one competition per day, else you have to define how to deal with peers in the timeline, or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
  "date", athlete, place
FROM (
  SELECT
    "date",
    place,
    athlete,
    1 <> ALL (array_agg(place) OVER w) AS include_row
  FROM Table1
  WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
;with CTE as
(
    select row_number() over (partition by athlete order by date) rn
         , *
    from Table1
)
select *
from CTE cur
where not exists
      (
          select *
          from CTE prev
          where prev.place = 1
            and prev.athlete = cur.athlete
            and prev.rn between cur.rn - 3 and cur.rn
      )
Live example at SQL Fiddle.
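As a closing aside, the min-over-frame approach from the first answer also translates to the Spark DataFrame DSL featured earlier in this collection. A hypothetical sketch, assuming a DataFrame results with day, name and finish_pos columns:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min}

// Frame covering the current row plus the 3 preceding ones per athlete,
// ordered by date: the equivalent of ROWS BETWEEN 3 PRECEDING AND CURRENT ROW.
val w = Window.partitionBy("name").orderBy("day").rowsBetween(-3, 0)

val noRecentWin = results
  .withColumn("best", min(col("finish_pos")).over(w))
  .where(col("best") > 1)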

BigQuery: GROUP BY clause for QUANTILES

Based on the BigQuery query reference, QUANTILES does not currently allow any kind of grouping by another column. I am mainly interested in getting medians grouped by a certain column. The only workaround I see right now is to generate a quantile query per distinct group member, where the group member is a condition in the WHERE clause.
For example, I use the query below for every distinct value in column-y to get the desired result.
SELECT QUANTILE(<column-x>, 1001)
FROM <table>
WHERE <column-y> = <each distinct value in column-y>
Does the BigQuery team plan on adding functionality to allow grouping for quantiles in the future?
Is there a better way to get what I am trying to get here?
Thanks
With the recently announced percentile_cont() window function you can get medians.
Look at the example in the announcement blog post:
http://googlecloudplatform.blogspot.com/2013/06/google-bigquery-bigger-faster-smarter-analytics-functions.html
SELECT MAX(median) AS median, room FROM (
SELECT percentile_cont(0.5) OVER (PARTITION BY room ORDER BY data) AS median, room
FROM [io_sensor_data.moscone_io13]
WHERE sensortype='temperature'
)
GROUP BY room
While there are efficient algorithms to compute quantiles, they are somewhat memory-intensive, and trying to do multiple quantile calculations in a single query gets expensive.
There are plans to improve QUANTILES, but I don't know what the timeline is.
Do you need median? Can you filter outliers and do an average of the remainder?
If your per-group size is fixed, you may be able to hack it using a combination of ORDER BY, NEST() and NTH(). For instance, if there are 9 distinct values of f2 per value of f1, for the median:
select f1, nth(5, f2) within record from (
    select f1, nest(f2) f2 from (
        select f1, f2 from table
        group by f1, f2
        order by f2
    ) group by f1
);
Not sure if the sorted order in the subquery is guaranteed to survive the second GROUP BY, but it worked in a simple test I tried.