Consolidate each row of a dataframe returning a dataframe into one output dataframe

I am looking for help in a scenario where I have a Scala dataframe PARENT. I need to:
1. loop through each record in the PARENT dataframe
2. query the records from a database, filtered on the ID value of the parent (the output of this step is a dataframe)
3. append a few attributes from the parent to the queried dataframe
Ex:
ParentDF
id parentname
1 X
2 Y
Queried Dataframe for id 1
id queryid name
1 23 lobo
1 45 sobo
1 56 aobo
Queried Dataframe for id 2
id queryid name
2 53 lama
2 67 dama
2 56 pama
Final output required :
id parentname queryid name
1 X 23 lobo
1 X 45 sobo
1 X 56 aobo
2 Y 53 lama
2 Y 67 dama
2 Y 56 pama
Update1:
I tried using foreachPartition with a foreach inside to loop through each record, and got the error below.
error: Unable to find encoder for type org.apache.spark.sql.DataFrame. An implicit Encoder[org.apache.spark.sql.DataFrame] is needed to store org.apache.spark.sql.DataFrame instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
falttenedData.map(row=>{
I need to do this in a scalable way, please. Any help is really appreciated.

The solution is pretty straightforward: you just need to join your parentDF with the other one.
parentDF.join(
  otherDF,
  Seq("id"),
  "left"
)
Since you care about scalability: in case your otherDF is quite small (for example fewer than 10K rows with 2-3 columns), you should consider using a broadcast join: parentDF.join(broadcast(otherDF), Seq("id"), "left").
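If the per-id data really lives in a database, as in the original scenario, a rough sketch of the whole flow could look like the code below: load the queried table once, then join instead of looping row by row. The JDBC URL, table name, and credentials are placeholders (assumptions, not taken from the question).
import org.apache.spark.sql.functions.broadcast

// load the table that was being queried per id in one go (placeholder connection details)
val otherDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // placeholder URL
  .option("dbtable", "child_table")                  // placeholder table name
  .option("user", "user")                            // placeholder credentials
  .option("password", "password")
  .load()
  .select("id", "queryid", "name")

// one distributed join replaces the per-row loop
val joined = parentDF.join(otherDF, Seq("id"), "left")

// if otherDF is small, a broadcast hint avoids shuffling parentDF
val joinedSmall = parentDF.join(broadcast(otherDF), Seq("id"), "left")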

You can use the .join method on a dataframe for this one.
Some example code would be something like this:
val df = Seq((1, "X"), (2, "Y")).toDF("id", "parentname")
df.show
+---+----------+
| id|parentname|
+---+----------+
| 1| X|
| 2| Y|
+---+----------+
val df2 = Seq((1, 23, "lobo"), (1, 45, "sobo"), (1, 56, "aobo"), (2, 53, "lama"), (2, 67, "dama"), (2, 56, "pama")).toDF("id", "queryid", "name")
df2.show
+---+-------+----+
| id|queryid|name|
+---+-------+----+
| 1| 23|lobo|
| 1| 45|sobo|
| 1| 56|aobo|
| 2| 53|lama|
| 2| 67|dama|
| 2| 56|pama|
+---+-------+----+
val output=df.join(df2, Seq("id"))
output.show
+---+----------+-------+----+
| id|parentname|queryid|name|
+---+----------+-------+----+
| 1| X| 23|lobo|
| 1| X| 45|sobo|
| 1| X| 56|aobo|
| 2| Y| 53|lama|
| 2| Y| 67|dama|
| 2| Y| 56|pama|
+---+----------+-------+----+
Hope this helps! :)

Related

Transforming data frame in Spark scala

I have a data frame where I need to do some transformation. Col_x and Col_y are the columns that need to be worked on. Their suffixes, X and Y, should become the values of a new column Col_D, and the values of Col_x and Col_y should be split into different rows. I went through the pivot table option but it does not seem to work. Is there a way I can transform the data efficiently in Spark Scala?
ColA ColB Col_x Col_y
a 1 10 20
b 2 30 40
Table required:
ColA ColB ColC Col_D
a 1 10 X
a 1 20 Y
b 2 30 X
b 2 40 Y
You can use the stack function:
val df = Seq(("a", 1, 10, 20), ("b", 2, 30, 40)).toDF("ColA", "ColB", "Col_x", "Col_y")
df.selectExpr("ColA", "ColB", "stack(2, 'X', Col_x, 'Y', Col_y) as (ColD, ColC)")
  .show()
+----+----+----+----+
|ColA|ColB|ColD|ColC|
+----+----+----+----+
| a| 1| X| 10|
| a| 1| Y| 20|
| b| 2| X| 30|
| b| 2| Y| 40|
+----+----+----+----+

How to efficiently split a dataframe in Spark based on a condition?

I have a situation like this with a Spark dataframe:
id | value
1  | 0
1  | 3
2  | 4
1  | 0
2  | 2
3  | 0
4  | 1
Now, what I want is to efficiently split this single dataframe into 3 different ones, such that each extracted dataframe covers the rows between two 0s in the "value" column (with each zero marking the beginning of a new dataframe), using Apache Spark, so that I would obtain this result:
Dataframe 1 (rows from the first 0 value to the last value before the next 0):
id | value
1  | 0
1  | 3
2  | 4
Dataframe 2 (rows from the second 0 value to the last value before the 3rd 0):
id | value
1  | 0
2  | 2
Dataframe 3:
id | value
3  | 0
4  | 1
As samkart said, there is no efficient/easy way to break data up based on the order of rows. Still, if you are using Spark v3.2+ you can leverage pandas on PySpark to do it the Spark way, like below:
import pyspark.pandas as ps
from pyspark.sql import functions as F
from pyspark.sql import Window

pdf = ps.read_csv("/FileStore/tmp4/pand.txt")
sdf = pdf.to_spark(index_col='index')
sdf = sdf.withColumn("run", F.sum(F.when(F.col("value") == 0, 1).otherwise(0)).over(Window.orderBy("index")))
toval = sdf.agg(F.max(F.col("run"))).collect()[0][0]
for x in range(1, toval + 1):
    globals()[f"sdf{x}"] = sdf.filter(F.col("run") == x).drop("index", "run")
For the above data it will create 3 dataframes, sdf1, sdf2, and sdf3, as below:
sdf1.show()
sdf2.show()
sdf3.show()
#output
+---+-----+
| id|value|
+---+-----+
| 1| 0|
| 1| 3|
| 2| 4|
+---+-----+
+---+-----+
| id|value|
+---+-----+
| 1| 0|
| 2| 2|
+---+-----+
+---+-----+
| id|value|
+---+-----+
| 3| 0|
| 4| 1|
+---+-----+
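For completeness, here is a rough Scala sketch of the same idea (a running count of zeros over an ordered window defines the group each row belongs to). It assumes the data already carries an explicit ordering column, here called index, since Spark dataframes have no inherent row order; that column is an assumption, not part of the original data.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// sample data with an explicit ordering column (run in spark-shell, where spark.implicits._ is in scope)
val df = Seq(
  (0L, 1, 0), (1L, 1, 3), (2L, 2, 4),
  (3L, 1, 0), (4L, 2, 2),
  (5L, 3, 0), (6L, 4, 1)
).toDF("index", "id", "value")

// running count of zeros: every 0 in "value" starts a new group
val withRun = df.withColumn(
  "run",
  sum(when($"value" === 0, 1).otherwise(0)).over(Window.orderBy("index"))
)

// materialise one dataframe per group, as the pandas-on-Spark version above does
val maxRun = withRun.agg(max($"run")).first().getLong(0)
val splits = (1L to maxRun).map(r => withRun.filter($"run" === r).drop("index", "run"))
splits.foreach(_.show())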

Spark SQL: get the value of a column when another column is max value inside a groupBy().agg()

I have a dataframe that looks like this:
root
|-- value: int (nullable = true)
|-- date: date (nullable = true)
I'd like to return the value for the row with the latest date in the dataframe.
Does this problem change if I need to make a groupBy and agg?
My actual problem looks like this:
val result = df
  .filter(df("date") >= somedate && df("date") <= some other date)
  .groupBy(valueFromColumn1)
  .agg(
    max(date),
    min(valueFromColumn2),
    // Here I want to put valueFromColumn4 where date is max after the filter
  )
I know I can get these values by creating a second dataframe and then making a join. But I'd like to avoid the join operation if possible.
Input sample:
Column 1 | Column 2 | Date | Column 4
A        | 1        | 2006 | 5
A        | 5        | 2018 | 2
A        | 3        | 2000 | 3
B        | 13       | 2007 | 4
Output sample (filter is date >= 2006, date <= 2018):
Column 1 | Column 2 | Date | Column 4
A        | 1        | 2018 | 2    <- I got 2 from the first row, which has the highest date
B        | 13       | 2007 | 4
A solution would be to use a struct to bind the value and the date together. It would look like this:
val result = df
  .filter(df("date") >= somedate && df("date") <= some other date)
  .withColumn("s", struct(df("date") as "date", df(valueFromColumn4) as "value"))
  .groupBy(valueFromColumn1)
  .agg(
    // since date is the first value of the struct,
    // this selects the tuple that maximizes date, and the associated value.
    max(col("s")) as "s",
    min(col(valueFromColumn2))
  )
  .withColumn("date", col("s.date"))
  .withColumn(valueFromColumn4, col("s.value"))
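To make this concrete, here is a minimal, self-contained sketch of the same struct trick using the sample data from the question; col1, col2, date, and col4 stand in for the real column names, and it is meant to be run in spark-shell:
import org.apache.spark.sql.functions._

val data = Seq(
  ("A", 1, 2006, 5), ("A", 5, 2018, 2), ("A", 3, 2000, 3), ("B", 13, 2007, 4)
).toDF("col1", "col2", "date", "col4")

val result = data
  .filter($"date" >= 2006 && $"date" <= 2018)
  // bind date and col4 together so that max() keeps them as one pair
  .withColumn("s", struct($"date" as "date", $"col4" as "value"))
  .groupBy("col1")
  .agg(
    max($"s") as "s",      // struct comparison looks at date first
    min($"col2") as "col2"
  )
  .select($"col1", $"col2", $"s.date" as "date", $"s.value" as "col4")

result.show()
With this input it returns (A, 1, 2018, 2) and (B, 13, 2007, 4), which matches the expected output in the question.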
You can use either groupBy with a struct:
df
  .groupBy()
  .agg(max(struct($"date", $"value")).as("latest"))
  .select($"latest.*")
or with Window:
df
  .withColumn("rnk", row_number().over(Window.orderBy($"date".desc)))
  .where($"rnk" === 1).drop($"rnk")
The operation you want to do is ordering within a group of data (here, grouped on Column1). This is a perfect use case for a window function, which performs a calculation over a group of records (a window).
Here we can partition the window on Column1 and pick the maximum date from each such window. Let's define windowedPartition as:
val windowedPartition = Window.partitionBy("col1").orderBy(col("date").desc)
Then we can apply this window function on our data set to select the row with the highest rank. (I have not added the filtering logic in the code below, as I think it does not bring any complexity here and will not affect the solution.)
Working code :
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val data = Seq(("a" , 1, 2006, 5), ("a", 5, 2018, 2), ("a", 3, 2000, 3), ("b", 13, 2007, 4)).toDF("col1", "col2", "date", "col4")
data: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 2 more fields]
scala> data.show
+----+----+----+----+
|col1|col2|date|col4|
+----+----+----+----+
| a| 1|2006| 5|
| a| 5|2018| 2|
| a| 3|2000| 3|
| b| 13|2007| 4|
+----+----+----+----+
scala> val windowedPartition = Window.partitionBy("col1").orderBy(col("date").desc)
windowedPartition: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@39613474
scala> data.withColumn("row_number", row_number().over(windowedPartition)).show
+----+----+----+----+----------+
|col1|col2|date|col4|row_number|
+----+----+----+----+----------+
| b| 13|2007| 4| 1|
| a| 5|2018| 2| 1|
| a| 1|2006| 5| 2|
| a| 3|2000| 3| 3|
+----+----+----+----+----------+
scala> data.withColumn("row_number", row_number().over(windowedPartition)).where(col("row_number") === 1).show
+----+----+----+----+----------+
|col1|col2|date|col4|row_number|
+----+----+----+----+----------+
| b| 13|2007| 4| 1|
| a| 5|2018| 2| 1|
+----+----+----+----+----------+
scala> data.withColumn("row_number", row_number().over(windowedPartition)).where(col("row_number") === 1).drop(col("row_number")).show
+----+----+----+----+
|col1|col2|date|col4|
+----+----+----+----+
| b| 13|2007| 4|
| a| 5|2018| 2|
+----+----+----+----+
I believe this will be a more scalable solution than the struct approach, since if the number of columns increases we might have to add those columns to the struct as well; in this solution that case is taken care of.
One question though: in your output, shouldn't the value of col2 be 5 (for col1 = A)? How is the value of col2 changing to 1?

Count types for every time difference from the time of one specific type within a time range with a granularity of one second in pyspark

I have the following time-series data in a DataFrame in pyspark:
(id, timestamp, type)
the id column can be any integer value, and many rows with the same id can exist in the table
the timestamp column is a timestamp represented by an integer (for simplification)
the type column is a string variable where each distinct string in the column represents one category; one special category out of all of them is 'A'
My question is the following:
Is there any way to compute (with SQL or PySpark DataFrame operations) the counts of every type, for all the time differences from the timestamps of the rows with type='A', within a time range (e.g. [-5, +5]) and with a granularity of 1 second?
For example, for the following DataFrame:
ts_df = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(1,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
ts_df.show()
+---+----+-----+
| id|type| ts|
+---+----+-----+
| 1| A| 100|
| 2| A| 1000|
| 3| A|10000|
| 1| b| 99|
| 1| b| 99|
| 1| b| 99|
| 2| b| 999|
| 2| b| 999|
| 2| c| 999|
| 2| c| 999|
| 1| d| 999|
| 3| c| 9999|
| 3| c| 9999|
| 3| d| 9999|
| 1| b| 98|
| 1| b| 98|
| 2| b| 998|
| 2| c| 998|
| 3| c| 9998|
+---+----+-----+
for a time difference of -1 second the result should be:
# result for time difference = -1 sec
# b: 5
# c: 4
# d: 2
while for a time difference of -2 seconds the result should be:
# result for time difference = -2 sec
# b: 3
# c: 2
# d: 0
and so on and so forth for any time difference within the time range, with a granularity of 1 second.
I tried many different ways, mostly using groupBy, but nothing seems to work.
I am mostly having difficulty expressing the time difference from each row of type='A', even if I only have to do it for one specific time difference.
Any suggestions would be greatly appreciated!
EDIT:
If I only have to do it for one specific time difference time_difference, then I could do it the following way:
time_difference = -1
df_type_A = ts_df.where(F.col("type") == 'A').selectExpr("ts as fts")
res = df_type_A.join(ts_df, on=df_type_A.fts + time_difference == ts_df.ts)\
    .drop("ts", "fts").groupBy(F.col("type")).count()
The returned res DataFrame gives me exactly what I want for one specific time difference. I can create a loop and solve the problem by repeating the same query over and over again.
However, is there any more efficient way than that?
EDIT2 (solution)
So this is how I did it in the end:
df1 = sc.parallelize([
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(2,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
df1.show()
df2 = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
]).toDF(["id","type","ts"]).selectExpr("id as fid","ts as fts","type as ftype")
df2.show()
df3 = df2.join(df1, on=df1.id==df2.fid).withColumn("td", F.col("ts")-F.col("fts"))
df3.show()
df4 = df3.groupBy([F.col("type"),F.col("td")]).count()
df4.show()
Will update with performance details as soon as I have any.
Thanks!
Another way to solve this problem would be to:
1. Divide the existing dataframe into two dataframes: one with A and one without A.
2. Add a new column to the without-A dataframe, which is the sum of "ts" and time_difference.
3. Join both dataframes, group by, and count.
Here is the code:
from pyspark.sql.functions import lit

time_difference = 1
ts_df_A = (
    ts_df
    .filter(ts_df["type"] == "A")
    .drop("id")
    .drop("type")
)
ts_df_td = (
    ts_df
    .withColumn("ts_plus_td", lit(ts_df['ts'] + time_difference))
    .filter(ts_df["type"] != "A")
    .drop("ts")
)
joined_df = ts_df_A.join(ts_df_td, ts_df_A["ts"] == ts_df_td["ts_plus_td"])
agg_df = joined_df.groupBy("type").count()
>>> agg_df.show()
+----+-----+
|type|count|
+----+-----+
| d| 2|
| c| 4|
| b| 5|
+----+-----+
Let me know if this is what you are looking for?
Thanks,
Hussain Bohra

How to filter a dates column with a condition from another column in PySpark?

Assume I have the following data frame named table_df in PySpark:
sid | date | label
------------------
1033| 20170521 | 0
1033| 20170520 | 0
1033| 20170519 | 1
1033| 20170516 | 0
1033| 20170515 | 0
1033| 20170511 | 1
1033| 20170511 | 0
1033| 20170509 | 0
.....................
The data frame table_df contains different IDs in different rows, the above is simply one typical case of ID.
For each ID and for each date with label 1, I would like to find the date with label 0 that is the closest and before.
For the above table, with ID 1033, date=20170519, label 1, the date of label 0 that is closest and before is 20170516.
And with ID 1033, date=20170511, label 1, the date of label 0 that is closest and before is 20170509 .
So, finally using groupBy and some complicated operations, I will obtain the following table:
sid | filtered_date |
-------------------------
1033| 20170516 |
1033| 20170509 |
-------------
Any help is highly appreciated. I tried but could not find any smart ways.
Thanks
We can use a window partitioned by sid and ordered by date, and find the difference with the next row:
df.show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170521| 0|
|1033|20170520| 0|
|1033|20170519| 1|
|1033|20170516| 0|
|1033|20170515| 0|
|1033|20170511| 1|
|1033|20170511| 0|
|1033|20170509| 0|
+----+--------+-----+
from pyspark.sql import Window
from pyspark.sql import functions as F
w = Window.partitionBy('sid').orderBy('date')
df.withColumn('diff',F.lead('label').over(w) - df['label']).where(F.col('diff') == 1).drop('diff').show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170509| 0|
|1033|20170516| 0|
+----+--------+-----+