In my scenario the column my_url can be at the first level or inside a nested column. How do I change the value of that column recursively? The nested column can be a StructType or an ArrayType, and my_url can be at the 2nd level:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- websites: struct (nullable = true)
| |-- cb_url: string (nullable = true)
| |-- domain_url: string (nullable = true)
| |-- email: string (nullable = true)
| |-- facebook_url: string (nullable = true)
| |-- homepage_url: string (nullable = true)
| |-- linkedin_url: string (nullable = true)
| |-- my_url: string (nullable = true)
| |-- phone: string (nullable = true)
| |-- twitter_url: string (nullable = true)
or it can be at the 1st level:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- my_url: string (nullable = true)
|-- facebook_url: string (nullable = true)
or like the one below:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- investments: struct (nullable = true)
| |-- investment_list: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- funding_round_info: struct (nullable = true)
| | | | |-- announced_on: timestamp (nullable = true)
| | | | |-- my_url: string (nullable = true)
it can be at any level.
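One way is to walk the schema recursively and rebuild every struct and array along the way. Below is a minimal sketch, assuming the DataFrame is named df, Spark 3.1+ (for F.transform with a lambda), and a hypothetical mask function standing in for whatever change you actually need:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType

# hypothetical transformation applied to every my_url column
def mask(c):
    return F.regexp_replace(c, ".", "*")

def rewrite(col, dtype):
    # rebuild structs field by field, recursing into arrays
    if isinstance(dtype, StructType):
        return F.struct(*[
            (mask(col[f.name]) if f.name == "my_url"
             else rewrite(col[f.name], f.dataType)).alias(f.name)
            for f in dtype.fields
        ])
    if isinstance(dtype, ArrayType):
        return F.transform(col, lambda x: rewrite(x, dtype.elementType))
    return col

df2 = df.select(*[
    (mask(F.col(f.name)) if f.name == "my_url"
     else rewrite(F.col(f.name), f.dataType)).alias(f.name)
    for f in df.schema.fields
])
Because the recursion is driven by the schema, it handles my_url at any depth, including inside arrays of structs such as investments.investment_list.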
Related
I have a schema like this:
root
|-- first_name: string (nullable = true)
|-- medical_group: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- phone_number: string (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- zip: string (nullable = true)
| | |-- secondary_phone_number: string (nullable = true)
| | |-- address1: string (nullable = true)
| |-- offices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- address: struct (nullable = true)
| | | | |-- phone_number: string (nullable = true)
| | | | |-- city: string (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- address2: string (nullable = true)
| | | | |-- zip: string (nullable = true)
| | | | |-- secondary_phone_number: string (nullable = true)
| | | | |-- address1: string (nullable = true)
| | | |-- name: string (nullable = true)
| |-- unique_id: string (nullable = true)
| |-- name: string (nullable = true)
|-- unique_id: string (nullable = true)
|-- last_name: string (nullable = true)
And I'm trying to groupBy/aggregate the data by two keys (the top-level unique_id and medical_group.unique_id). The aggregation primarily removes duplicate rows from the DataFrame, and also aggregates the array of offices.
agg_list = [
    first("first_name").alias("first_name"),
    first("last_name").alias("last_name"),
    first("medical_group").alias("medical_group"),
    # Maybe this? collect_list("medical_group.offices").alias("mg.offices")
]
provider_structs_grouped = \
    provider_structs_structure \
    .groupBy(col('unique_id'), col('medical_group.unique_id')) \
    .agg(*agg_list)
I thought I could create a temporary column that uses collect_list, update the nested struct's offices value, and then drop the temporary column, but I was struggling to update the nested struct's value.
Question: How can I aggregate/collect_list the offices, and update the nested offices array with that latest value? (Or perhaps there's a better way?)
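One hedged approach, assuming Spark 3.1+ for Column.withField: collect the offices arrays into a temporary column, flatten them, and then overwrite the nested field afterwards:
from pyspark.sql import functions as F

agg_list = [
    F.first("first_name").alias("first_name"),
    F.first("last_name").alias("last_name"),
    F.first("medical_group").alias("medical_group"),
    # gather each group's offices arrays and flatten them into one array
    F.flatten(F.collect_list("medical_group.offices")).alias("all_offices"),
]

provider_structs_grouped = (
    provider_structs_structure
    .groupBy(F.col("unique_id"), F.col("medical_group.unique_id"))
    .agg(*agg_list)
    # overwrite the nested offices field with the aggregated array
    .withColumn("medical_group",
                F.col("medical_group").withField("offices", F.col("all_offices")))
    .drop("all_offices")
)
withField rebuilds the struct with just that one field replaced, which avoids re-specifying every other field of medical_group by hand.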
I have a dataframe in this format and I would like to rearrange the fields inside the item column.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: long (nullable = true)
So this is the desired format I'm looking for.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- c: long (nullable = true)
| | |-- b: string (nullable = true)
You can use the transform function:
from pyspark.sql import functions as F
result = df.withColumn(
    "item",
    F.expr("transform(item, x -> struct(x.a as a, x.c as c, x.b as b))")
)
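On Spark 3.1+ the same reordering can also be written with the native Python transform function instead of a SQL expression; a minimal sketch:
from pyspark.sql import functions as F

# rebuild each array element as a struct with the fields in the new order
result = df.withColumn(
    "item",
    F.transform("item", lambda x: F.struct(
        x["a"].alias("a"),
        x["c"].alias("c"),
        x["b"].alias("b"),
    ))
)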
I have a spark dataframe with the following schema:
root
|-- stat_chiamate: struct (nullable = true)
| |-- chiamate_ricevute: struct (nullable = true)
| | |-- h_0: string (nullable = true)
| | |-- h_1: string (nullable = true)
| | |-- h_10: string (nullable = true)
| | |-- h_11: string (nullable = true)
| | |-- h_12: string (nullable = true)
| | |-- h_13: string (nullable = true)
| | |-- h_14: string (nullable = true)
| | |-- h_15: string (nullable = true)
| | |-- h_16: string (nullable = true)
| | |-- h_17: string (nullable = true)
| | |-- h_18: string (nullable = true)
| | |-- h_19: string (nullable = true)
| | |-- h_2: string (nullable = true)
| | |-- h_20: string (nullable = true)
| | |-- h_21: string (nullable = true)
| | |-- h_22: string (nullable = true)
| | |-- h_23: string (nullable = true)
| | |-- h_3: string (nullable = true)
| | |-- h_4: string (nullable = true)
| | |-- h_5: string (nullable = true)
| | |-- h_6: string (nullable = true)
| | |-- h_7: string (nullable = true)
| | |-- h_8: string (nullable = true)
| | |-- h_9: string (nullable = true)
| | |-- n_totale: string (nullable = true)
I want a dataframe like:
root
|-- stat_chiamate: struct (nullable = true)
| |-- chiamate_ricevute: array (nullable = true)
| | |-- element: string (containsNull = true)
where chiamate_ricevute is a list of the field values, for example:
h_0= 0
h_1= 1
h_2= 2
.
.
.
h_23=23
n_totale=412
I want:
[0,1,2,...,23] <-- I don't want the n_totale value
In my code I use df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()[:-1], but that gives me only the bare field names; how can I use them?
df = df.select(F.array(*[
    field for field in df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()
    if field.startswith("h_")
]).alias("CIRCO"))
You could use the schema of the dataframe, and in particular the schema of your struct, to extract all the field names except n_totale and then wrap them into an array.
from pyspark.sql import functions as f
fields = ['stat_chiamate.chiamate_ricevute.' + field.name
          for field in df.schema['stat_chiamate'].dataType['chiamate_ricevute'].dataType
          if field.name != 'n_totale']
result = df.select(f.array(*fields).alias("chiamate_ricevute"))
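Note that the struct fields come back in schema order, which here is lexicographic (h_0, h_1, h_10, h_11, ...). If the array must follow the hour order instead, a hedged variant sorts the names numerically first:
# sort the h_* field names by their numeric suffix before building the array
hour_fields = sorted(
    [field.name
     for field in df.schema['stat_chiamate'].dataType['chiamate_ricevute'].dataType
     if field.name != 'n_totale'],
    key=lambda name: int(name.split('_')[1]))
result = df.select(
    f.array(*['stat_chiamate.chiamate_ricevute.' + name for name in hour_fields])
     .alias("chiamate_ricevute"))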
I have a dataframe that looks like this:
df.printSchema()
root
|-- id: integer (nullable = true)
|-- data: struct (nullable = true)
| |-- foo01: string (nullable = true)
| |-- bar01: string (nullable = true)
| |-- foo02: string (nullable = true)
| |-- bar02: string (nullable = true)
and I want to transform it to
root
|-- id: integer (nullable = true)
|-- foo: struct (nullable = true)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = true)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)
What is the best way to go about this?
You can simply use the struct PySpark function.
from pyspark.sql.functions import struct
new_df = df.select(
    'id',
    struct('data.foo01', 'data.foo02').alias('foo'),
    struct('data.bar01', 'data.bar02').alias('bar'),
)
Additional note on the struct PySpark function: it can take either a list of string column names, if you just need to move existing columns into the struct, or a list of column expressions if the fields need transforming.
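If data has many fields, a variant can derive the groups from the schema instead of listing every column by hand; a minimal sketch, assuming each field name starts with the prefix of its target struct:
from pyspark.sql.functions import struct

# build one struct per prefix from the fields of the data column
prefixes = ["foo", "bar"]
new_df = df.select(
    "id",
    *[
        struct(*[f"data.{f.name}" for f in df.schema["data"].dataType
                 if f.name.startswith(p)]).alias(p)
        for p in prefixes
    ],
)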
You can use the struct function with select as below:
from pyspark.sql import functions as F
finalDF = df.select(
    "id",
    F.struct("data.foo01", "data.foo02").alias("foo"),
    F.struct("data.bar01", "data.bar02").alias("bar")
)
finalDF.printSchema()
Schema:
root
|-- id: integer (nullable = true)
|-- foo: struct (nullable = false)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = false)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)
I have a Dataset with the schema below.
root
|-- acct_id: long (nullable = true)
|-- firm_bnkg_id: integer (nullable = true)
|-- tagged: long (nullable = true)
|-- transactions: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- mo_yr_buckt: string (nullable = false)
| | |-- acct_id: long (nullable = false)
| | |-- eff_dt: date (nullable = true)
| | |-- extn_txn_cd: string (nullable = true)
| | |-- mntr_txn_am: double (nullable = true)
| | |-- cr_dr_in: string (nullable = true)
| | |-- txn_desc_tx: string (nullable = true)
| | |-- txn_auth_dt: date (nullable = false)
| | |-- txn_auth_ts: string (nullable = false)
| | |-- tagged: long (nullable = true)
| | |-- firm_bnkg_id: integer (nullable = false)
| | |-- txn_pst_sq_nb: string (nullable = false)
| | |-- pst_dt: integer (nullable = false)
|-- prty_ol_prfl_id: long (nullable = true)
|-- prod_cd: string (nullable = true)
|-- acct_type_cd: string (nullable = true)
|-- acct_state_cd: string (nullable = true)
Now I want to change the current code to a SQL statement. The current code is like this:
val result = ds.select(col("*"), explode(col("transactions")).as("txn"))
.where("IsValidUDF(txn) = TRUE").groupBy("prty_ol_prfl_id")
.agg(collect_list("txn").as("transactions"))
which produces the following schema:
root
|-- acct_id: long (nullable = true)
|-- firm_bnkg_id: integer (nullable = true)
|-- tagged: long (nullable = true)
|-- transactions: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- mo_yr_buckt: string (nullable = false)
| | |-- acct_id: long (nullable = false)
| | |-- eff_dt: date (nullable = true)
| | |-- extn_txn_cd: string (nullable = true)
| | |-- mntr_txn_am: double (nullable = true)
| | |-- cr_dr_in: string (nullable = true)
| | |-- txn_desc_tx: string (nullable = true)
| | |-- txn_auth_dt: date (nullable = false)
| | |-- txn_auth_ts: string (nullable = false)
| | |-- tagged: long (nullable = true)
| | |-- firm_bnkg_id: integer (nullable = false)
| | |-- txn_pst_sq_nb: string (nullable = false)
| | |-- pst_dt: integer (nullable = false)
|-- prty_ol_prfl_id: long (nullable = true)
|-- prod_cd: string (nullable = true)
|-- acct_type_cd: string (nullable = true)
|-- acct_state_cd: string (nullable = true)
The IsValidUDF just checks the column tagged for certain values.
Any help would be appreciated.
Thanks
The translation of your code to a Spark SQL statement is:
val new_df = spark.sql("""
  WITH temp AS (
    SELECT *, explode(transactions) AS txn FROM df
  )
  SELECT first(id) id, collect_list(txn) AS TRANSACTIONS
  FROM temp
  WHERE IsValidUDF(txn) = TRUE
  GROUP BY id
""")
(Just replace first(id) with first(...) for every column you want to have in the resulting dataframe.)
Beforehand, make sure that your UDF is registered:
spark.udf.register("IsValidUDF", is_valid_udf)
Here is the complete code with a toy example:
import org.apache.spark.sql.functions.udf
// Toy example
val df = Seq((0, List(66, 1)), (1, List(98, 2)), (2, List(90))).toDF("id", "transactions")
df.createOrReplaceTempView("df")
val is_valid_udf = udf((r: Int) => r > 50)
// register udf
spark.udf.register("IsValidUDF", is_valid_udf)
// query
val new_df = spark.sql("""
  WITH temp AS (
    SELECT *, explode(transactions) AS txn FROM df
  )
  SELECT first(id) id, collect_list(txn) AS TRANSACTIONS
  FROM temp
  WHERE IsValidUDF(txn) = TRUE
  GROUP BY id
""")
Output:
+---+------------+
| id|TRANSACTIONS|
+---+------------+
| 1| [98]|
| 2| [90]|
| 0| [66]|
+---+------------+
which is the original dataframe with transactions ≤ 50 removed.