Transforming a data frame in Spark Scala

I have a data frame where I need to do some transformation. Col_x and Col_y are the columns that need to be worked on. Their suffixes, X and Y, should become the values of a new column Col_D, and the values of Col_x and Col_y should be split into separate rows (column ColC below). I have gone through the pivot option but it does not seem to work here. Is there a way I can transform the data efficiently in Spark Scala?
ColA ColB Col_x Col_y
a    1    10    20
b    2    30    40
Table required:
ColA ColB ColC Col_D
a    1    10   X
a    1    20   Y
b    2    30   X
b    2    40   Y

You can use the stack function:
val df = // input
df.selectExpr("ColA", "ColB", "stack(2, 'X', Col_x, 'Y', Col_y) as (ColD, ColC)")
.show()
+----+----+----+----+
|ColA|ColB|ColD|ColC|
+----+----+----+----+
| a| 1| X| 10|
| a| 1| Y| 20|
| b| 2| X| 30|
| b| 2| Y| 40|
+----+----+----+----+

Related

Consolidate each row of dataframe returning a dataframe into output dataframe

I am looking for help in a scenario where I have a Scala dataframe PARENT. I need to:
1. loop through each record in the PARENT dataframe,
2. query the records from a database based on a filter using the ID value of the parent (the output of this step is a dataframe), and
3. append a few attributes from the parent to the queried dataframe.
Ex:
ParentDF
id parentname
1 X
2 Y
Queried Dataframe for id 1
id queryid name
1 23 lobo
1 45 sobo
1 56 aobo
Queried Dataframe for id 2
id queryid name
2 53 lama
2 67 dama
2 56 pama
Final output required :
id parentname queryid name
1 X 23 lobo
1 X 45 sobo
1 X 56 aobo
2 Y 53 lama
2 Y 67 dama
2 Y 56 pama
Update 1:
I tried using foreachPartition with a foreach inside it to loop through each record, and got the error below.
error: Unable to find encoder for type org.apache.spark.sql.DataFrame. An implicit Encoder[org.apache.spark.sql.DataFrame] is needed to store org.apache.spark.sql.DataFrame instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
falttenedData.map(row=>{
I need to do this in a scalable way. Any help is really appreciated.
The solution is pretty straightforward: you just need to join your parentDF with the other one.
parentDF.join(
otherDF,
Seq("id"),
"left"
)
Since you care about scalability: in case your otherDF is quite small (fewer than 10K rows, for example, with 2-3 columns), you should consider a broadcast join (import org.apache.spark.sql.functions.broadcast): parentDF.join(broadcast(otherDF), Seq("id"), "left").
You can use the .join method on a dataframe for this one.
Some example code would be something like this:
val df = Seq((1, "X"), (2, "Y")).toDF("id", "parentname")
df.show
+---+----------+
| id|parentname|
+---+----------+
| 1| X|
| 2| Y|
+---+----------+
val df2 = Seq((1, 23, "lobo"), (1, 45, "sobo"), (1, 56, "aobo"), (2, 53, "lama"), (2, 67, "dama"), (2, 56, "pama")).toDF("id", "queryid", "name")
df2.show
+---+-------+----+
| id|queryid|name|
+---+-------+----+
| 1| 23|lobo|
| 1| 45|sobo|
| 1| 56|aobo|
| 2| 53|lama|
| 2| 67|dama|
| 2| 56|pama|
+---+-------+----+
val output=df.join(df2, Seq("id"))
output.show
+---+----------+-------+----+
| id|parentname|queryid|name|
+---+----------+-------+----+
| 1| X| 23|lobo|
| 1| X| 45|sobo|
| 1| X| 56|aobo|
| 2| Y| 53|lama|
| 2| Y| 67|dama|
| 2| Y| 56|pama|
+---+----------+-------+----+
Hope this helps! :)

PySpark pivot as SQL query

Looking to write the full-SQL equivalent of a pivot implemented in PySpark. The code below creates a pandas DataFrame.
import pandas as pd
df = pd.DataFrame({
'id': ['a','a','a','b','b','b','b','c','c'],
'name': ['up','down','left','up','down','left','right','up','down'],
'count': [6,7,5,3,4,2,9,12,4]})
# id name count
# 0 a up 6
# 1 a down 7
# 2 a left 5
# 3 b up 3
# 4 b down 4
# 5 b left 2
# 6 b right 9
# 7 c up 12
# 8 c down 4
Code below then converts to a pyspark DataFrame and implements a pivot on the name column.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
ds = spark.createDataFrame(df)
dp = ds.groupBy('id').pivot('name').max().toPandas()
# id down left right up
# 0 c 4 NaN NaN 12
# 1 b 4 2.0 9.0 3
# 2 a 7 5.0 NaN 6
Trying to do the equivalent of ds.groupBy('id').pivot('name').max() in full-SQL, ie something like
ds.createOrReplaceTempView('ds')
dp = spark.sql(f"""
SELECT * FROM ds
PIVOT
(MAX(count)
FOR
...)""").toPandas()
Taking reference from Spark SQL PIVOT:
Pivot
spark.sql(f"""
SELECT * FROM ds
PIVOT (
MAX(count)
FOR name in ('up','down','left','right')
)""").show()
+---+---+----+----+-----+
| id| up|down|left|right|
+---+---+----+----+-----+
| c| 12| 4|null| null|
| b| 3| 4| 2| 9|
| a| 6| 7| 5| null|
+---+---+----+----+-----+
Dynamic Approach
You can dynamically create the PIVOT clause; I tried to create a general wrapper around it:
def pivot_by(inp_df, by):
    distinct_by = inp_df[by].unique()
    distinct_name_str = ''
    for i, name in enumerate(distinct_by):
        if i == 0:
            distinct_name_str += f'\'{name}\''
        else:
            distinct_name_str += f',\'{name}\''
    final_str = f'FOR {by} in ({distinct_name_str})'
    return final_str
pivot_clause_str = pivot_by(df,'name')
### O/p - FOR name in ('up','down','left','right')
spark.sql(f"""
SELECT * FROM ds
PIVOT (
MAX(count)
{pivot_clause_str}
)""").show()
+---+---+----+----+-----+
| id| up|down|left|right|
+---+---+----+----+-----+
| c| 12| 4|null| null|
| b| 3| 4| 2| 9|
| a| 6| 7| 5| null|
+---+---+----+----+-----+
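If the data only exists as a Spark DataFrame (no pandas copy), the distinct values can be collected from it instead; a small sketch of that variant (pivot_by_spark is a hypothetical helper name):
def pivot_by_spark(inp_sdf, by):
    # Collect the distinct values of the pivot column from the Spark DataFrame.
    distinct_vals = [row[by] for row in inp_sdf.select(by).distinct().collect()]
    quoted = ','.join(f"'{v}'" for v in distinct_vals)
    return f'FOR {by} in ({quoted})'

pivot_clause_str = pivot_by_spark(ds, 'name')
### O/p - FOR name in ('up','down','left','right')  (order may differ)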
Dynamic Approach Usage
pivot_clause_str = pivot_by(df,'id')
##O/p - FOR id in ('a','b','c')
ds.createOrReplaceTempView('ds')
spark.sql(f"""
SELECT * FROM ds
PIVOT (
MAX(count)
{pivot_clause_str}
)""").show()
+-----+----+---+----+
| name| a| b| c|
+-----+----+---+----+
| down| 7| 4| 4|
| left| 5| 2|null|
| up| 6| 3| 12|
|right|null| 9|null|
+-----+----+---+----+

how to create & sort by an ordered categorical variable in pyspark

I'm migrating some code from pandas to pyspark. My source dataframe looks like this:
a b c
0 1 insert 1
1 2 update 1
2 3 seed 1
3 4 insert 2
4 5 update 2
5 6 delete 2
6 7 snapshot 1
and the operation (in python / pandas) that I'm applying is:
df.b = pd.Categorical(df.b, ordered=True, categories=['insert', 'seed', 'update', 'snapshot', 'delete'])
df.sort_values(['c', 'b'])
resulting in the output dataframe:
a b c
0 1 insert 1
2 3 seed 1
1 2 update 1
6 7 snapshot 1
3 4 insert 2
4 5 update 2
5 6 delete 2
I'm unsure how best to set up ordered categoricals using PySpark. My initial approach creates a new column using case-when and then attempts to use it for sorting (a sketch of that follow-up sort appears after the snippet):
from pyspark.sql.functions import when, col

df = df.withColumn(
    "_precedence",
    when(col("b") == "insert", 1)
    .when(col("b") == "seed", 2)
    .when(col("b") == "update", 3)
    .when(col("b") == "snapshot", 4)
    .when(col("b") == "delete", 5)
)
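Presumably that helper column is then used for the sort, something along these lines (a sketch, not part of the original snippet):
# Sort by c, then by the hand-built precedence, and drop the helper column afterwards.
df.orderBy("c", "_precedence").drop("_precedence").show()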
You can use a map:
from pyspark.sql.functions import create_map, lit, col
categories=['insert', 'seed', 'update', 'snapshot', 'delete']
# per #HaleemurAli, adjusted the below list comprehension to create map
map1 = create_map([val for (i, c) in enumerate(categories) for val in (lit(c), lit(i))])
#Column<b'map(insert, 0, seed, 1, update, 2, snapshot, 3, delete, 4)'>
df.orderBy('c', map1[col('b')]).show()
+---+---+--------+---+
| id| a| b| c|
+---+---+--------+---+
| 0| 1| insert| 1|
| 2| 3| seed| 1|
| 1| 2| update| 1|
| 6| 7|snapshot| 1|
| 3| 4| insert| 2|
| 4| 5| update| 2|
| 5| 6| delete| 2|
+---+---+--------+---+
To reverse the order on column b: df.orderBy('c', map1[col('b')].desc()).show()
You could also do this using coalesce with your when statements.
from pyspark.sql import functions as F
categories=['insert', 'seed', 'update', 'snapshot', 'delete']
cols = [F.when(F.col("b") == x, F.lit(y)) for x, y in zip(categories, range(1, len(categories) + 1))]
df.orderBy("c",F.coalesce(*cols)).show()
#+---+--------+---+
#| a| b| c|
#+---+--------+---+
#| 1| insert| 1|
#| 3| seed| 1|
#| 2| update| 1|
#| 7|snapshot| 1|
#| 4| insert| 2|
#| 5| update| 2|
#| 6| delete| 2|
#+---+--------+---+

Selecting 'Exclusive Rows' from a PySpark Dataframe

I have a PySpark dataframe like this:
+----------+-----+
|account_no|types|
+----------+-----+
| 1| K|
| 1| A|
| 1| S|
| 2| M|
| 2| D|
| 2| S|
| 3| S|
| 3| S|
| 4| M|
| 5| K|
| 1| S|
| 6| S|
+----------+-----+
and I am trying to pick the account numbers for which exclusively 'S' exists.
For example, even though '1' has type 'S', I will not pick it because it also has other types; but I will pick '3' and '6' because they have only the type 'S'.
What I am doing right now is (roughly as sketched after the list):
- First, get all accounts for which 'K' exists and remove them; in this example that removes '1' and '5'.
- Second, find all accounts for which 'D' exists and remove them, which removes '2'.
- Third, find all accounts for which 'M' exists and remove '4' ('2' also has 'M', but it was already removed at step 2).
- Fourth, find all accounts for which 'A' exists and remove them.
So now '1', '2', '4' and '5' are removed, and I am left with '3' and '6', which have exclusively 'S'.
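A rough sketch of that removal chain (hypothetical code, using one left anti join per unwanted type) might look like this:
from pyspark.sql import functions as F

# Remove every account that has any of the non-'S' types, one type at a time.
remaining = df
for t in ["K", "D", "M", "A"]:
    accounts_with_t = df.filter(F.col("types") == t).select("account_no").distinct()
    remaining = remaining.join(accounts_with_t, "account_no", "left_anti")
remaining.show()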
But this is a long process, how do I optimize it?
Thank you
Another alternative is counting distinct values over a window and then filtering where the distinct count == 1 and types == 'S'. For ordering you can assign a monotonically increasing id and then orderBy it.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

W = Window.partitionBy('account_no')
out = (df.withColumn("idx", F.monotonically_increasing_id())
         .withColumn("Distinct", F.approx_count_distinct(F.col("types")).over(W)).orderBy("idx")
         .filter("Distinct==1 AND types =='S'")).drop('idx', 'Distinct')
out.show()
+----------+-----+
|account_no|types|
+----------+-----+
| 3| S|
| 3| S|
| 6| S|
+----------+-----+
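Note that approx_count_distinct is, as its name says, approximate. If an exact count is needed, a variant of the same idea (a small sketch) is:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

W = Window.partitionBy("account_no")
# Exact distinct count per account via a collected set instead of an approximation.
out_exact = (df.withColumn("n_types", F.size(F.collect_set("types").over(W)))
               .filter("n_types == 1 AND types == 'S'")
               .drop("n_types"))
out_exact.show()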
One way to do this is to use window functions. First we get a sum of the number of 'S' rows in each account_no grouping. Then we compare that to the total number of entries for that group in the filter; if they match, we keep the account.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy("account_no")
w1=Window().partitionBy("account_no").orderBy("types")
df.withColumn("sum_S", F.sum(F.when(F.col("types")=='S', F.lit(1)).otherwise(F.lit(0))).over(w))\
.withColumn("total", F.max(F.row_number().over(w1)).over(w))\
.filter('total=sum_S').drop("total","Sum_S").show()
#+----------+-----+
#|account_no|types|
#+----------+-----+
#| 6| S|
#| 3| S|
#| 3| S|
#+----------+-----+
You can simply detect the number of distinct types an account has and then keep the 'S' accounts which have only 1 distinct type.
Here is my code for it:
from pyspark.sql.functions import countDistinct, col
data = [(1, 'k'),
(1, 'a'),
(1, 's'),
(2, 'm'),
(2, 'd'),
(2, 's'),
(3, 's'),
(3, 's'),
(4, 'm'),
(5, 'k'),
(1, 's'),
(6, 's')]
df = spark.createDataFrame(data, ['account_no', 'types']).distinct()
exclusive_s_accounts = (df.groupBy('account_no').agg(countDistinct('types').alias('distinct_count'))
.join(df, 'account_no')
.where((col('types') == 's') & (col('distinct_count') == 1))
.drop('distinct_count'))
Another alternative approach could be to collect all the types under one column and then apply filter operations to exclude accounts that have non-'S' values.
from pyspark.sql.functions import concat_ws, collect_list, col
df = spark.read.csv("/Users/Downloads/account.csv", header=True, inferSchema=True, sep=",")
type_df = df.groupBy("account_no").agg(concat_ws(",", collect_list("types")).alias("all_types")).select(col("account_no"), col("all_types"))
+----------+---------+
|account_no|all_types|
+----------+---------+
| 1| K,A,S,S|
| 6| S|
| 3| S,S|
| 5| K|
| 4| M|
| 2| M,D,S|
+----------+---------+
Further filtering using a regular expression (S_status is true when the account has any non-'S' type):
only_s_df = type_df.withColumn("S_status", col("all_types").rlike("K|A|M|D"))
only_s_df.show()
+----------+---------+----------+
|account_no|all_types|S_status |
+----------+---------+----------+
| 1| K,A,S,S| true|
| 6| S| false|
| 3| S,S| false|
| 5| K| true|
| 4| M| true|
| 2| M,D,S| true|
+----------+---------+----------+
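To actually keep only the exclusive-'S' accounts, a final filter on S_status (a small follow-up sketch, reusing only_s_df and col from above) would be:
only_s_df.filter(~col("S_status")).select("account_no").show()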
Hope this helps you get the answer and continue the processing from there.

Partition PySpark DataFrame depending on unique values in column (Custom Partitioning)

I have a PySpark data frame in which I have separate columns for names, types, days and values. An example of the dataframe can be seen below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
| name4| b| 1| 145|
| name5| b| 1| 185|
| name6| c| 1| 155|
| name7| c| 1| 160|
| name8| a| 2| 120|
| name9| a| 2| 110|
|name10| b| 2| 125|
|name11| b| 2| 185|
|name12| c| 3| 195|
+------+----+---+-----+
For a selected value of Type, I want to create separate dataframes depending on the unique values of the column titled Day. Let's say I have chosen 'a' as my preferred Type. In the example above, there are three unique values of Day (1, 2, 3). For each unique value of Day that has a row with the chosen Type 'a' (that is, days 1 and 2 in the data above), I want to create a dataframe containing all rows with the chosen Type and Day. In the example above, I will get two dataframes, which will look as below:
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name1| a| 1| 140|
| name2| a| 1| 180|
| name3| a| 1| 150|
+------+----+---+-----+
and
+------+----+---+-----+
| Name|Type|Day|Value|
+------+----+---+-----+
| name8| a| 2| 120|
| name9| a| 2| 110|
+------+----+---+-----+
How can I do this? In the actual data that I will be working with, I have millions of columns, so I want to know the most efficient way to achieve this.
You can use the below mentioned code to generate the example given above.
from pyspark.sql import *

Stats = Row("Name", "Type", "Day", "Value")
stat1 = Stats('name1', 'a', 1, 140)
stat2 = Stats('name2', 'a', 1, 180)
stat3 = Stats('name3', 'a', 1, 150)
stat4 = Stats('name4', 'b', 1, 145)
stat5 = Stats('name5', 'b', 1, 185)
stat6 = Stats('name6', 'c', 1, 155)
stat7 = Stats('name7', 'c', 1, 160)
stat8 = Stats('name8', 'a', 2, 120)
stat9 = Stats('name9', 'a', 2, 110)
stat10 = Stats('name10', 'b', 2, 125)
stat11 = Stats('name11', 'b', 2, 185)
stat12 = Stats('name12', 'c', 3, 195)

# Build the example dataframe from the rows above.
df = spark.createDataFrame([stat1, stat2, stat3, stat4, stat5, stat6,
                            stat7, stat8, stat9, stat10, stat11, stat12])
You can just use df.repartition("Type", "Day")
See the Spark documentation for DataFrame.repartition.
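If the goal is literally one dataframe per Day for a chosen Type (rather than physical partitioning), a filter-based sketch (assuming the preferred Type is 'a' and df is the example dataframe above) could look like this:
from pyspark.sql import functions as F

chosen_type = "a"  # hypothetical choice of the preferred Type
typed = df.filter(F.col("Type") == chosen_type)

# Collect the distinct days that contain the chosen type, then build one filtered dataframe per day.
days = [row["Day"] for row in typed.select("Day").distinct().collect()]
per_day = {day: typed.filter(F.col("Day") == day) for day in days}

per_day[1].show()  # rows with Type 'a' and Day 1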
When I validate using the following function, I get the output shown below.
def validate(partition):
    count = 0
    for row in partition:
        print(row)
        count += 1
    print(count)
My data
+------+--------------------+-------+-------+
|amount| trans_date|user_id|row_num|
+------+--------------------+-------+-------+
| 99.1|2019-06-04T00:00:...| 101| 1|
| 89.27|2019-06-04T00:00:...| 102| 2|
| 89.1|2019-03-04T00:00:...| 102| 3|
| 73.11|2019-09-10T00:00:...| 103| 4|
|-69.81|2019-09-11T00:00:...| 101| 5|
| 12.51|2018-12-14T00:00:...| 101| 6|
| 43.23|2018-09-11T00:00:...| 101| 7|
+------+--------------------+-------+-------+
After df.repartition("user_id") I get the following:
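The validate function is presumably applied per partition, roughly like this (a sketch of the call that is not shown here):
# Hypothetical invocation: repartition by user_id, then run validate on each partition of the underlying RDD.
df.repartition("user_id").rdd.foreachPartition(validate)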
Output
Row(amount=73.11, trans_date='2019-09-10T00:00:00.000+05:30', user_id='103', row_num=4)
1
Row(amount=89.27, trans_date='2019-06-04T00:00:00.000+05:30', user_id='102', row_num=2)
Row(amount=89.1, trans_date='2019-03-04T00:00:00.000+05:30', user_id='102', row_num=3)
2
Row(amount=99.1, trans_date='2019-06-04T00:00:00.000+05:30', user_id='101', row_num=1)
Row(amount=-69.81, trans_date='2019-09-11T00:00:00.000+05:30', user_id='101', row_num=5)
Row(amount=12.51, trans_date='2018-12-14T00:00:00.000+05:30', user_id='101', row_num=6)
Row(amount=43.23, trans_date='2018-09-11T00:00:00.000+05:30', user_id='101', row_num=7)
4