pyspark: rearrange nested array of struct sequence

pyspark: rearrange nested array of struct sequence - dataframe

I've a dataframe in this format and I would like to rearrange the fields inside item column.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: long (nullable = true)
So this is the desired format I'm looking for.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- c: string (nullable = true)
| | |-- b: long (nullable = true)

You can use transform function:
from pyspark.sql import functions as F
result = df.withColumn(
"item",
F.expr("transform(item, x -> struct(x.a as a, x.c as c, x.b as b))")
)

Related

Spark - Merge two columns of array struct type

I have a dataframe of schema -
|-- A: string (nullable = true)
|-- B: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- C: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
I want to merge column B & C (array_union). But array_union is not working because of different data types of these columns. Structs of B & C have pretty much same columns except z. I don't care about z - whether it is present or not - in their merged output.
What would be a good way to achieve this?

Sure, drop Z in B and then array_join()
new = (df1.withColumn('B',expr("transform(B,s->struct(s.key as key,s.x as x, s.y as y))"))#drop Z
.withColumn('D', array_union(col('B'),col('C')))#array_join
.drop('B','C')#Drop B and C if not needed
).printSchema()
root
|-- A: string (nullable = false)
|-- D: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)

Transform the column 'C' like this and use the array_union after:
import pyspark.sql.functions as f
df = (df
.withColumn('z', f.expr("transform(C, element -> cast(1 AS double))"))
.withColumn('C', f.expr("transform(C, (element, idx) -> struct(element_at(C.x, idx + 1) AS x, element_at(C.y, idx + 1) AS y, element_at(z, idx + 1) AS z))"))
.drop('z')
)

Flatten dataframe with nested struct ArrayType using pyspark

I have a dataframe with this schema
root
|-- AUTHOR_ID: integer (nullable = false)
|-- NAME: string (nullable = true)
|-- Books: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- BOOK_ID: integer (nullable = false)
| | |-- Chapters: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- NAME: string (nullable = true)
| | | | |-- NUMBER_PAGES: integer (nullable = true)
How to flat all columns into one level with Pyspark ?

Using inline function:
df2 = (df.selectExpr("AUTHOR_ID", "NAME", "inline(Books)")
.selectExpr("*", "inline(Chapters)")
.drop("Chapters")
)
Or explode:
from pyspark.sql import functions as F
df2 = (df.withColumn("Books", F.explode("Books"))
.select("*", "Books.*")
.withColumn("Chapters", F.explode("Chapters"))
.select("*", "Chapters.*")
)

Change schema of dataframe to other schema

I hava dataframe that looks like this
df.printSchema()
root
|-- id: integer (nullable = true)
|-- data: struct (nullable = true)
| |-- foo01 string (nullable = true)
| |-- bar01 string (nullable = true)
| |-- foo02 string (nullable = true)
| |-- bar02 string (nullable = true)
and I want to transform it to
root
|-- id: integer (nullable = true)
|-- foo: struct (nullable = true)
| |-- foo01 string (nullable = true)
| |-- foo02 string (nullable = true)
|-- bar: struct (nullable = true)
| |-- bar01 string (nullable = true)
| |-- bar02 string (nullable = true)
What is the best way to go about this?

You can simply use the struct Pyspark function.
from pyspark.sql.functions import struct
new_df = df.select(
'id',
struct('data.foo01', 'data.foo02').alias('foo'),
struct('data.bar01', 'data.bar02').alias('bar'),
)
Additional note related to the struct Pyspark function: It can either take a list of string column names to only move columns into the struct or if you need a list of expressions.

You can use struct function with select as below:
from pyspark.sql import functions as F
finalDF = df.select( "id",
F.struct("data.foo01", "data.foo02").alias("foo"),
F.struct("data.bar01", "data.bar02").alias("bar")
)
finalDF.printSchema
Schema:
root
|-- id: string (nullable = true)
|-- foo: struct (nullable = false)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = false)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)

How to update the column value in pyspark?

In my scenario the column my_url can be at first level or inside the nested column
how to change the value of the column recursively.the nested column can be a StructType or ArrayType and my_url
either it can be at 2nd level
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- websites: struct (nullable = true)
| |-- cb_url: string (nullable = true)
| |-- domain_url: string (nullable = true)
| |-- email: string (nullable = true)
| |-- facebook_url: string (nullable = true)
| |-- homepage_url: string (nullable = true)
| |-- linkedin_url: string (nullable = true)
| |-- my_url: string (nullable = true)
| |-- phone: string (nullable = true)
| |-- twitter_url: string (nullable = true)
or it can be at 1st level:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- my_url: string (nullable = true)
|-- facebook_url: string (nullable = true)
or like the below
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- investments: struct (nullable = true)
| |-- investment_list: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- funding_round_info: struct (nullable = true)
| | | | |-- announced_on: timestamp (nullable = true)
| | | | |-- my_url: string (nullable = true)
it can be at any level.

translate from spark function calls to SQL

I have a Dataset with the schema below.
root
|-- acct_id: long (nullable = true)
|-- firm_bnkg_id: integer (nullable = true)
|-- tagged: long (nullable = true)
|-- transactions: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- mo_yr_buckt: string (nullable = false)
| | |-- acct_id: long (nullable = false)
| | |-- eff_dt: date (nullable = true)
| | |-- extn_txn_cd: string (nullable = true)
| | |-- mntr_txn_am: double (nullable = true)
| | |-- cr_dr_in: string (nullable = true)
| | |-- txn_desc_tx: string (nullable = true)
| | |-- txn_auth_dt: date (nullable = false)
| | |-- txn_auth_ts: string (nullable = false)
| | |-- tagged: long (nullable = true)
| | |-- firm_bnkg_id: integer (nullable = false)
| | |-- txn_pst_sq_nb: string (nullable = false)
| | |-- pst_dt: integer (nullable = false)
|-- prty_ol_prfl_id: long (nullable = true)
|-- prod_cd: string (nullable = true)
|-- acct_type_cd: string (nullable = true)
|-- acct_state_cd: string (nullable = true)
Now I want to change the current code to a SQL statement. The current code is like this:
val result = ds.select(col("*"), explode(col("transactions")).as("txn"))
.where("IsValidUDF(txn) = TRUE").groupBy("prty_ol_prfl_id")
.agg(collect_list("txn").as("transactions"))
which produces the following schema:
root
|-- acct_id: long (nullable = true)
|-- firm_bnkg_id: integer (nullable = true)
|-- tagged: long (nullable = true)
|-- transactions: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- mo_yr_buckt: string (nullable = false)
| | |-- acct_id: long (nullable = false)
| | |-- eff_dt: date (nullable = true)
| | |-- extn_txn_cd: string (nullable = true)
| | |-- mntr_txn_am: double (nullable = true)
| | |-- cr_dr_in: string (nullable = true)
| | |-- txn_desc_tx: string (nullable = true)
| | |-- txn_auth_dt: date (nullable = false)
| | |-- txn_auth_ts: string (nullable = false)
| | |-- tagged: long (nullable = true)
| | |-- firm_bnkg_id: integer (nullable = false)
| | |-- txn_pst_sq_nb: string (nullable = false)
| | |-- pst_dt: integer (nullable = false)
|-- prty_ol_prfl_id: long (nullable = true)
|-- prod_cd: string (nullable = true)
|-- acct_type_cd: string (nullable = true)
|-- acct_state_cd: string (nullable = true)
The IsValidUDF just checks the column tagged for certain values.
Any help would be appreciated.
Thanks

The translaton of your code to a spark sql statement is:
val new_df = spark.sql("""
WITH temp AS(
SELECT *, explode(transactions) AS txn FROM df
)
SELECT first(id) id, collect_list(txn) AS TRANSACTIONS FROM temp WHERE IsValidUDF(txn) = TRUE GROUP BY id
""")
(just replace first(id) with first(.) with every column you want to have in the resulting dataframe.
Beforehand make sur that your udf is registered:
spark.udf.register("IsValidUDF", is_valid_udf)
Here is the complete code with a toy example:
import org.apache.spark.sql.Row
// Toy example
val df = Seq((0, List(66,1) ),(1, List(98, 2)),(2, List(90))).toDF("id", "transactions")
df.createOrReplaceTempView("df")
val is_valid_udf = udf((r: Int) => r > 50)
// register udf
spark.udf.register("IsValidUDF", is_valid_udf)
// query
val new_df = spark.sql("""
WITH temp AS(
SELECT *, explode(transactions) AS txn FROM df
)
SELECT first(id) id, collect_list(txn) AS TRANSACTIONS FROM temp WHERE IsValidUDF(txn) = TRUE GROUP BY id
""")
Output:
+---+------------+
| id|TRANSACTIONS|
+---+------------+
| 1| [98]|
| 2| [90]|
| 0| [66]|
+---+------------+
which is the original dataframe with transactions > 50 removed.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

pyspark: rearrange nested array of struct sequence - dataframe

You can use transform function: from pyspark.sql import functions as F result = df.withColumn( "item", F.expr("transform(item, x -> struct(x.a as a, x.c as c, x.b as b))") )

Related

Spark - Merge two columns of array struct type

Flatten dataframe with nested struct ArrayType using pyspark

Change schema of dataframe to other schema

How to update the column value in pyspark?

translate from spark function calls to SQL

Categories

Resources