Spark SQL: get the value of a column when another column is max value inside a groupBy().agg() - dataframe

I have a dataframe that looks like this:
|-- value: int (nullable = true)
|-- date: date (nullable = true)
I'd like to return the value from the row that has the latest date in the dataframe.
Does this problem change if I need to make a groupBy and agg?
My actual problem looks like this:
val result = df
  .filter(df("date") >= someDate && df("date") <= someOtherDate)
  // Here I want to put valueFromColumn4 where date is max after the filter
I know I can get these values by creating a second dataframe and then making a join. But I'd like to avoid the join operation if possible.
Input sample:
Column 1 | Column 2 | Date | Column 4
A        | 1        | 2006 | 5
A        | 5        | 2018 | 2
A        | 3        | 2000 | 3
B        | 13       | 2007 | 4
Output sample (filter is date >= 2006, date <= 2018):
Column 1 | Column 2 | Date | Column 4
A        | 1        | 2018 | 2   <- I got 2 from the row which has the highest date
B        | 13       | 2007 | 4

A solution would be to use a struct to bind the value and the date together. It would look like this:
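import org.apache.spark.sql.functions.{col, max, struct}
val result = df
  .filter(df("date") >= someDate && df("date") <= someOtherDate)
  .withColumn("s", struct(df("date") as "date", df(valueFromColumn4) as "value"))
  // valueFromColumn1 is assumed to hold the name of the grouping column (Column 1 in the sample)
  .groupBy(valueFromColumn1)
  .agg(
    // since date is the first field of the struct,
    // this selects the struct that maximizes date, and the associated value.
    max(col("s")) as "s"
    // any other aggregates you need go here
  )
  .withColumn("date", col("s.date"))
  .withColumn(valueFromColumn4, col("s.value"))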
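The trick works because Spark orders structs field by field, from left to right, which is the same way Python orders tuples. The following is a plain-Python sketch of that logic only (it is not Spark code); the rows mirror the sample input, and the names `rows` and `latest` are made up for the illustration.

```python
# Plain-Python model of max(struct(date, value)) per group; not Spark code.
rows = [
    ("A", 1, 2006, 5),
    ("A", 5, 2018, 2),
    ("A", 3, 2000, 3),
    ("B", 13, 2007, 4),
]

latest = {}
for col1, _col2, date, col4 in rows:
    if 2006 <= date <= 2018:      # the filter from the question
        pair = (date, col4)       # date comes first, so it decides the ordering
        latest[col1] = max(latest.get(col1, pair), pair)

print(latest)  # {'A': (2018, 2), 'B': (2007, 4)}
```

Putting the value first in the tuple would instead pick the largest value and carry its date along, which is why the field order in the struct matters.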

You can use either groupBy with a struct or a Window.

What you want to do is order rows within a group of data (here, grouped on Column1). This is a perfect use case for a window function, which performs a calculation over a group of records (a window).
Here we can partition the window on Column1 and order each partition by date, descending, so that the row with the latest date comes first. Let's define windowedPartition as:
val windowedPartition = Window.partitionBy("col1").orderBy(col("date").desc)
Then we can apply row_number over this window and keep the row ranked first in each partition. (I have not added the filtering logic to the code below, as I think it does not bring any complexity and will not affect the solution.)
Working code :
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val data = Seq(("a", 1, 2006, 5), ("a", 5, 2018, 2), ("a", 3, 2000, 3), ("b", 13, 2007, 4)).toDF("col1", "col2", "date", "col4")
data: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 2 more fields]
scala> data.show
+----+----+----+----+
|col1|col2|date|col4|
+----+----+----+----+
|   a|   1|2006|   5|
|   a|   5|2018|   2|
|   a|   3|2000|   3|
|   b|  13|2007|   4|
+----+----+----+----+
scala> val windowedPartition = Window.partitionBy("col1").orderBy(col("date").desc)
windowedPartition: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@39613474
scala> data.withColumn("row_number", row_number().over(windowedPartition)).show
+----+----+----+----+----------+
|col1|col2|date|col4|row_number|
+----+----+----+----+----------+
|   b|  13|2007|   4|         1|
|   a|   5|2018|   2|         1|
|   a|   1|2006|   5|         2|
|   a|   3|2000|   3|         3|
+----+----+----+----+----------+
scala> data.withColumn("row_number", row_number().over(windowedPartition)).where(col("row_number") === 1).show
+----+----+----+----+----------+
|col1|col2|date|col4|row_number|
+----+----+----+----+----------+
|   b|  13|2007|   4|         1|
|   a|   5|2018|   2|         1|
+----+----+----+----+----------+
scala> data.withColumn("row_number", row_number().over(windowedPartition)).where(col("row_number") === 1).drop(col("row_number")).show
+----+----+----+----+
|col1|col2|date|col4|
+----+----+----+----+
|   b|  13|2007|   4|
|   a|   5|2018|   2|
+----+----+----+----+
I believe this is a more scalable solution than the struct approach: if the number of columns increases, each extra column would have to be added to the struct, whereas here the whole row is kept automatically.
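To make the partition, order, and "keep rank 1" steps concrete, here is a plain-Python model of the same logic (not Spark code). The helper name `top_row_per_group` is invented for this sketch, and the column positions are hard-coded for the sample data.

```python
from itertools import groupby
from operator import itemgetter

data = [("a", 1, 2006, 5), ("a", 5, 2018, 2), ("a", 3, 2000, 3), ("b", 13, 2007, 4)]

def top_row_per_group(rows):
    """Keep the whole row with the highest date (index 2) for each col1 (index 0)."""
    result = []
    ordered = sorted(rows, key=itemgetter(0))                    # partitionBy("col1")
    for _key, group in groupby(ordered, key=itemgetter(0)):
        ranked = sorted(group, key=itemgetter(2), reverse=True)  # orderBy(date desc)
        result.append(ranked[0])                                 # row_number == 1
    return result

print(top_row_per_group(data))  # [('a', 5, 2018, 2), ('b', 13, 2007, 4)]
```

The complete row survives, so no extra work is needed when more columns are added.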
One question though: in your output, shouldn't the value in col2 be 5 (for col1 = A)? How does col2 become 1?


Consolidate each row of dataframe returning a dataframe into output dataframe

I am looking for help with a scenario where I have a Scala dataframe PARENT. I need to:
1. Loop through each record in the PARENT dataframe.
2. Query the records from a database based on a filter using the ID value of the parent (the output of this step is a dataframe).
3. Append a few attributes from the parent to the queried dataframe.
Parent dataframe:
id parentname
1 X
2 Y
Queried Dataframe for id 1
id queryid name
1 23 lobo
1 45 sobo
1 56 aobo
Queried Dataframe for id 2
id queryid name
2 53 lama
2 67 dama
2 56 pama
Final output required :
id parentname queryid name
1 X 23 lobo
1 X 45 sobo
1 X 56 aobo
2 Y 53 lama
2 Y 67 dama
2 Y 56 pama
I tried using foreachPartition with an inner foreach to loop through each record, and got the error below.
error: Unable to find encoder for type org.apache.spark.sql.DataFrame. An implicit Encoder[org.apache.spark.sql.DataFrame] is needed to store org.apache.spark.sql.DataFrame instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I need this to scale. Any help is really appreciated.
The solution is pretty straightforward: you just need to join your parentDF with the other dataframe.
Since you care about scalability: if your "otherDF" is quite small (for example, fewer than 10K rows with 2-3 columns), you should consider using a broadcast join: parentDF.join(broadcast(otherDF), Seq("id"), "left").
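The idea behind a broadcast join is that the small side is shipped whole to every worker and turned into an in-memory lookup, so the large side never has to be shuffled. The following plain-Python sketch models that idea on the sample data (it is not Spark code, and `broadcast_left_join` is a name made up for the illustration).

```python
from collections import defaultdict

parent = [(1, "X"), (2, "Y")]
other = [(1, 23, "lobo"), (1, 45, "sobo"), (1, 56, "aobo"),
         (2, 53, "lama"), (2, 67, "dama"), (2, 56, "pama")]

def broadcast_left_join(left, small):
    """Left join on the first field; `small` is the side that fits in memory."""
    lookup = defaultdict(list)
    for key, *rest in small:                 # build the hash table once
        lookup[key].append(tuple(rest))
    width = len(small[0]) - 1 if small else 0
    joined = []
    for key, *left_rest in left:             # stream the other side past it
        for match in lookup.get(key, [(None,) * width]):
            joined.append((key, *left_rest, *match))
    return joined

for row in broadcast_left_join(parent, other):
    print(row)
```

A parent id with no match keeps its row and gets None in the joined columns, which is what the "left" join type asks for.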
You can use the .join method on a dataframe for this one.
Some example code would be something like this:
val df = Seq((1, "X"), (2, "Y")).toDF("id", "parentname")
df.show()
+---+----------+
| id|parentname|
+---+----------+
|  1|         X|
|  2|         Y|
+---+----------+
val df2 = Seq((1, 23, "lobo"), (1, 45, "sobo"), (1, 56, "aobo"), (2, 53, "lama"), (2, 67, "dama"), (2, 56, "pama")).toDF("id", "queryid", "name")
df2.show()
+---+-------+----+
| id|queryid|name|
+---+-------+----+
|  1|     23|lobo|
|  1|     45|sobo|
|  1|     56|aobo|
|  2|     53|lama|
|  2|     67|dama|
|  2|     56|pama|
+---+-------+----+
val output = df.join(df2, Seq("id"))
output.show()
+---+----------+-------+----+
| id|parentname|queryid|name|
+---+----------+-------+----+
|  1|         X|     23|lobo|
|  1|         X|     45|sobo|
|  1|         X|     56|aobo|
|  2|         Y|     53|lama|
|  2|         Y|     67|dama|
|  2|         Y|     56|pama|
+---+----------+-------+----+
Hope this helps! :)

Drop null column in a spark dataframe and print column name

I have this dataframe :
|brand |Timestamp |Weight |
|BR1 |1632899456|null |
|BR1 |1632901256|null |
|BR300 |1632901796|null |
|BR300 |1632899155|null |
|BR200 |1632899155|null |
And this list which contains the name of the columns:
val column_names : Seq[String] = Seq("brand", "Timestamp", "Weight")
I would like to go through this list, check whether the corresponding column contains only null values, drop the column if that is the case, and log a message containing the name of the column that was dropped.
In this case, the result would be :
|brand |Timestamp |
|BR1 |1632899456|
|BR1 |1632901256|
|BR300 |1632901796|
|BR300 |1632899155|
|BR200 |1632899155|
I am using Spark version 3.2.1 and SQLContext, with scala language
You can use Dataset.summary, which returns a DataFrame with statistics about every column. Use that DataFrame to find the columns whose min and max are both null, i.e. the columns that contain only nulls, then drop those columns from the original DataFrame.
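case class Test(field1: String, field2: String)
val df = List(Test("1",null), Test("2",null), Test("3",null)).toDF("field1", "field2")
df.show()
+------+------+
|field1|field2|
+------+------+
|     1|  null|
|     2|  null|
|     3|  null|
+------+------+
scala> df.summary("mean", "min", "max").show()
+-------+------+------+
|summary|field1|field2|
+-------+------+------+
|   mean|   2.0|  null|
|    min|     1|  null|
|    max|     3|  null|
+-------+------+------+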
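The names of the all-null columns can be found with the min function, which ignores nulls and therefore returns null only when every value in the column is null. These names can then be printed, or the columns dropped:
import org.apache.spark.sql.functions.min
val column_names = Seq("brand", "Timestamp", "Weight")
// null is typed as String so that Spark can infer a schema for the tuple
val df = List(("1", null: String, 1), ("2", null: String, 2), ("3", null: String, 3)).toDF("brand", "Timestamp", "Weight")
// one min(...) aggregate per column, computed in a single pass
val minColumns = column_names.map(name => min(name).alias(name))
val minValuesRow = df.select(minColumns: _*).first
val nullColumnNames = column_names
  .zipWithIndex
  .filter({ case (_, index) => minValuesRow.isNullAt(index) })
  .map(_._1)
nullColumnNames.foreach(name => println(s"Dropping all-null column: $name"))
val cleaned = df.drop(nullColumnNames: _*)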
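The check itself is simple: a column is dropped only if every one of its values is null. As a plain-Python model of that rule (not Spark code; the function name `drop_all_null_columns` and the log message are illustrative assumptions):

```python
import logging

logging.basicConfig(level=logging.INFO)

def drop_all_null_columns(rows, column_names):
    """Return (kept_names, kept_rows), logging every column that holds only None."""
    keep = []
    for index, name in enumerate(column_names):
        if all(row[index] is None for row in rows):
            logging.info("Dropping column %s: it contains only nulls", name)
        else:
            keep.append(index)
    kept_names = [column_names[i] for i in keep]
    kept_rows = [tuple(row[i] for i in keep) for row in rows]
    return kept_names, kept_rows

rows = [("BR1", 1632899456, None), ("BR300", 1632901796, None)]
names, data = drop_all_null_columns(rows, ["brand", "Timestamp", "Weight"])
print(names)  # ['brand', 'Timestamp']
```

A column with even one non-null value is kept, which matches the behaviour of a min aggregate that skips nulls.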

How to get whole row's size in df using scala

My DataFrame has multiple columns. I need to add a new column holding the size of the whole row, i.e. the sum of the sizes of all columns. Is there a simple way to do this efficiently? Thanks.
Here is the sample:
val DataFrame = Seq(("Alice", "He is girl"), ("Bob", "She is girl"), ("Ben", null)).toDF("name","string")
I want to add a column to df that sums the length of each column. This sample has only two columns, but my actual df has a hundred.
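val df = Seq(("Alice", "He is girl"),
  ("Bob", "She is girl"), ("Ben", null)).toDF("name","string")
df.show()
+-----+-----------+
| name|     string|
+-----+-----------+
|Alice| He is girl|
|  Bob|She is girl|
|  Ben|       null|
+-----+-----------+
Get rid of null values:
val dfNoNull = df.na.fill("")
dfNoNull.show()
+-----+-----------+
| name|     string|
+-----+-----------+
|Alice| He is girl|
|  Bob|She is girl|
|  Ben|           |
+-----+-----------+
Create a list of columns with the length function applied to each of them:
val cols = dfNoNull.columns.map(x => length(col(x)))
Select data based on these columns/expressions:
val dfColCounts = dfNoNull.select(cols: _*)
dfColCounts.show()
+------------+--------------+
|length(name)|length(string)|
+------------+--------------+
|           5|            10|
|           3|            11|
|           3|             0|
+------------+--------------+
Get these new column names:
val countCols = dfColCounts.columns.map(x => col(x))
Apply reduce to sum up all column values, which are ints by now:
val dfPerRowCounts = dfColCounts
  .withColumn("countPerRow", countCols.reduce(_ + _))
dfPerRowCounts.select("countPerRow").show()
+-----------+
|countPerRow|
+-----------+
|         15|
|         14|
|          3|
+-----------+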
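The arithmetic behind those steps is just "length of each field, nulls counted as 0, summed per row". A plain-Python sketch of that rule, useful for checking the expected numbers (not Spark code; `row_length` is a made-up helper name):

```python
def row_length(row):
    """Sum the string lengths of all fields, counting None as 0."""
    return sum(len(value) for value in row if value is not None)

rows = [("Alice", "He is girl"), ("Bob", "She is girl"), ("Ben", None)]
print([row_length(r) for r in rows])  # [15, 14, 3]
```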

Count types for every time difference from the time of one specific type within a time range with a granularity of one second in pyspark

I have the following time-series data in a DataFrame in pyspark:
(id, timestamp, type)
- the id column can be any integer value, and many rows with the same id can exist in the table
- the timestamp column is a timestamp represented by an integer (for simplification)
- the type column is a string variable where each distinct string represents one category. One special category out of all is 'A'
My question is the following:
Is there any way to compute (with SQL or pyspark DataFrame operations) the counts of every type, for all the time differences from the timestamps of all the rows of type='A', within a time range (e.g. [-5,+5]) and with a granularity of 1 second?
For example, for the following DataFrame:
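ts_df = sc.parallelize([
    (1, 'A', 100), (2, 'A', 1000), (3, 'A', 10000),
    (1, 'b', 99), (1, 'b', 99), (1, 'b', 99),
    (2, 'b', 999), (2, 'b', 999), (2, 'c', 999), (2, 'c', 999), (1, 'd', 999),
    (3, 'c', 9999), (3, 'c', 9999), (3, 'd', 9999),
    (1, 'b', 98), (1, 'b', 98),
    (2, 'b', 998), (2, 'c', 998),
    (3, 'c', 9998),
]).toDF(["id", "type", "ts"])
ts_df.show()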
| id|type| ts|
| 1| A| 100|
| 2| A| 1000|
| 3| A|10000|
| 1| b| 99|
| 1| b| 99|
| 1| b| 99|
| 2| b| 999|
| 2| b| 999|
| 2| c| 999|
| 2| c| 999|
| 1| d| 999|
| 3| c| 9999|
| 3| c| 9999|
| 3| d| 9999|
| 1| b| 98|
| 1| b| 98|
| 2| b| 998|
| 2| c| 998|
| 3| c| 9998|
for a time difference of -1 second the result should be:
# result for time difference = -1 sec
# b: 5
# c: 4
# d: 2
while for a time difference of -2 seconds the result should be:
# result for time difference = -2 sec
# b: 3
# c: 2
# d: 0
and so on for any time difference within the time range, with a granularity of 1 second.
I tried many different approaches, mostly using groupBy, but nothing seems to work.
My main difficulty is expressing the time difference from each row of type=A, even for one specific time difference.
Any suggestions would be greatly appreciated!
If I only had to do it for one specific time difference, time_difference, I could do it the following way:
from pyspark.sql import functions as F
time_difference = -1
df_type_A = ts_df.where(F.col("type") == 'A').selectExpr("ts as fts")
res = df_type_A.join(ts_df, on=df_type_A.fts + time_difference == ts_df.ts)\
    .groupBy("type").count()  # end of the chain was lost in the source; a per-type count is assumed
The returned res DataFrame gives me exactly what I want for one specific time difference. I could create a loop and solve the problem by repeating the same query over and over again.
However, is there a more efficient way than that?
EDIT2 (solution)
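This is how I did it in the end:
df1 = sc.parallelize([
    ...  # rows elided in the source
]).toDF(["id","type","ts"])
df2 = sc.parallelize([
    ...  # rows elided in the source
]).toDF(["id","type","ts"]).selectExpr("id as fid","ts as fts","type as ftype")
# the join condition is garbled in the source; a cross join is assumed here
df3 = df2.crossJoin(df1).withColumn("td", F.col("ts")-F.col("fts"))
df4 = df3.groupBy([F.col("type"),F.col("td")]).count()
I will update with performance details as soon as I have any.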
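Grouping by both type and td yields every time difference in one pass instead of one query per difference. The following plain-Python model of that computation (not PySpark) reproduces the expected counts from the question. It assumes that every non-A row is paired with every A row regardless of id, which is what the expected results imply, and that A rows themselves are not counted; `count_by_type_and_td` is a name invented for the sketch.

```python
from collections import Counter

rows = [
    (1, "A", 100), (2, "A", 1000), (3, "A", 10000),
    (1, "b", 99), (1, "b", 99), (1, "b", 99),
    (2, "b", 999), (2, "b", 999), (2, "c", 999), (2, "c", 999), (1, "d", 999),
    (3, "c", 9999), (3, "c", 9999), (3, "d", 9999),
    (1, "b", 98), (1, "b", 98), (2, "b", 998), (2, "c", 998), (3, "c", 9998),
]

def count_by_type_and_td(rows, anchor="A", lo=-5, hi=5):
    """Count rows per (type, td), where td = ts - (timestamp of an anchor row)."""
    anchors = [ts for _id, typ, ts in rows if typ == anchor]
    counts = Counter()
    for _id, typ, ts in rows:
        if typ == anchor:
            continue
        for fts in anchors:              # the cross join
            td = ts - fts
            if lo <= td <= hi:           # keep only the requested range
                counts[(typ, td)] += 1
    return counts

counts = count_by_type_and_td(rows)
print(counts[("b", -1)], counts[("c", -1)], counts[("d", -1)])  # 5 4 2
print(counts[("b", -2)], counts[("c", -2)], counts[("d", -2)])  # 3 2 0
```

In Spark the range filter matters for performance: without it the cross join materialises one row per (A row, other row) pair.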
Another way to solve this problem would be:
1. Split the existing dataframe into two dataframes: one with the type A rows and one without.
2. Add a new column to the dataframe without A, holding the sum of "ts" and time_difference.
3. Join both dataframes, group by type and count.
Here is the code:
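from pyspark.sql.functions import lit
time_difference = 1
# rows of type A; only the timestamp is needed on this side
ts_df_A = (
    ts_df
    .filter(ts_df["type"] == "A")
    .select("ts")
)
# all other rows, with ts shifted by time_difference
ts_df_td = (
    ts_df
    .withColumn("ts_plus_td", lit(ts_df['ts'] + time_difference))
    .filter(ts_df["type"] != "A")
    .drop("ts")
)
joined_df = ts_df_A.join(ts_df_td, ts_df_A["ts"] == ts_df_td["ts_plus_td"])
agg_df = joined_df.groupBy("type").count()
agg_df.show()
+----+-----+
|type|count|
+----+-----+
|   d|    2|
|   c|    4|
|   b|    5|
+----+-----+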
Let me know if this is what you are looking for?
Hussain Bohra

how to update a row based on another row with same id

With Spark dataframe, I want to update a row value based on other rows with same id.
For example,
I have records below,
I want to get the result as below
To summarize, the value column is null in some rows, I want to update them if there is another row with same id which has valid value.
In SQL, I can simply write an update statement with an inner join, but I didn't find an equivalent in Spark SQL.
update combineCols a
inner join combineCols b
on a.id = b.id
set a.value = b.value
(this is how I do it in SQL)
Let's use the SQL method to solve this issue:
myValues = [(1,10),(1,None),(1,None),(2,20),(2,None),(2,None)]
df = sqlContext.createDataFrame(myValues,['id','value'])
df.createOrReplaceTempView('table_view')
df = sqlContext.sql(
    'select id, sum(value) over (partition by id) as value from table_view'
)
df.show()
| id|value|
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
Caveat: this code assumes that there is only one non-null value for any particular id. When we aggregate values over a partition, we have to use an aggregation function, and I have used sum. If there are 2 non-null values for an id, they will be summed up. If an id could have multiple non-null values, then it's better to use min/max, so that we get one of the values rather than their sum.
'select id, max(value) over (partition by id) as value from table_view'
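To see what the max variant does to each row, here is a plain-Python model of "max(value) over (partition by id)" (not Spark code; `fill_with_group_max` is an invented name). Like the SQL aggregates, it ignores nulls when computing the maximum.

```python
def fill_with_group_max(rows):
    """Replace every value with the largest non-null value seen for its id."""
    best = {}
    for key, value in rows:
        if value is not None:
            best[key] = value if key not in best else max(best[key], value)
    return [(key, best.get(key)) for key, _value in rows]

rows = [(1, 10), (1, None), (1, None), (2, 20), (2, None), (2, None)]
print(fill_with_group_max(rows))
# [(1, 10), (1, 10), (1, 10), (2, 20), (2, 20), (2, 20)]
```

Note that every row of an id is overwritten, not only the null ones: for (1, 10), (1, 30), (1, None) all three rows end up with 30, where sum would have produced 40.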
You can use window to do this(in pyspark):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# create dataframe
df = sc.parallelize([
]).toDF(('id', 'value'))
window = Window.partitionBy('id').orderBy(F.desc('value'))
df \
    .withColumn('value', F.first('value').over(window)) \
    .show()
| id|value|
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
You can use the same functions in scala.