In my scenario the column my_url can be at the first level or inside a nested column. How do I change the value of that column recursively? The nested column can be a StructType or an ArrayType, and my_url can be at the 2nd level:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- websites: struct (nullable = true)
| |-- cb_url: string (nullable = true)
| |-- domain_url: string (nullable = true)
| |-- email: string (nullable = true)
| |-- facebook_url: string (nullable = true)
| |-- homepage_url: string (nullable = true)
| |-- linkedin_url: string (nullable = true)
| |-- my_url: string (nullable = true)
| |-- phone: string (nullable = true)
| |-- twitter_url: string (nullable = true)
or it can be at the 1st level:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- my_url: string (nullable = true)
|-- facebook_url: string (nullable = true)
or like the one below:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- investments: struct (nullable = true)
| |-- investment_list: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- funding_round_info: struct (nullable = true)
| | | | |-- announced_on: timestamp (nullable = true)
| | | | |-- my_url: string (nullable = true)
it can be at any level.
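One way is to walk the schema recursively and rebuild every struct and array along the way. Below is a minimal sketch, assuming the DataFrame is named df, Spark 3.1+ (for F.transform with a lambda), and a hypothetical mask function standing in for whatever change you actually need:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType

# hypothetical transformation applied to every my_url column
def mask(c):
    return F.regexp_replace(c, ".", "*")

def rewrite(col, dtype):
    # rebuild structs field by field, recursing into arrays
    if isinstance(dtype, StructType):
        return F.struct(*[
            (mask(col[f.name]) if f.name == "my_url"
             else rewrite(col[f.name], f.dataType)).alias(f.name)
            for f in dtype.fields
        ])
    if isinstance(dtype, ArrayType):
        return F.transform(col, lambda x: rewrite(x, dtype.elementType))
    return col

df2 = df.select(*[
    (mask(F.col(f.name)) if f.name == "my_url"
     else rewrite(F.col(f.name), f.dataType)).alias(f.name)
    for f in df.schema.fields
])
Because the recursion is driven by the schema, it handles my_url at any depth, including inside arrays of structs such as investments.investment_list.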
Related
I have a schema like this:
root
|-- first_name: string (nullable = true)
|-- medical_group: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- phone_number: string (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- zip: string (nullable = true)
| | |-- secondary_phone_number: string (nullable = true)
| | |-- address1: string (nullable = true)
| |-- offices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- address: struct (nullable = true)
| | | | |-- phone_number: string (nullable = true)
| | | | |-- city: string (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- address2: string (nullable = true)
| | | | |-- zip: string (nullable = true)
| | | | |-- secondary_phone_number: string (nullable = true)
| | | | |-- address1: string (nullable = true)
| | | |-- name: string (nullable = true)
| |-- unique_id: string (nullable = true)
| |-- name: string (nullable = true)
|-- unique_id: string (nullable = true)
|-- last_name: string (nullable = true)
And I'm trying to groupBy/aggregate the data by two keys (the top-level unique_id and medical_group.unique_id). The aggregation primarily removes duplicate rows from the DataFrame, and also aggregates the array of offices.
agg_list = [
    first("first_name").alias("first_name"),
    first("last_name").alias("last_name"),
    first("medical_group").alias("medical_group"),
    # Maybe this? collect_list("medical_group.offices").alias("mg.offices")
]
provider_structs_grouped = \
    provider_structs_structure \
    .groupBy(col('unique_id'), col('medical_group.unique_id')) \
    .agg(*agg_list)
I thought I could create a temporary column that uses collect_list, update the nested struct's offices value, and then drop the temporary column, but I was struggling to update the nested struct's value.
Question: How can I aggregate/collect_list the offices, and update the nested offices array with that latest value? (Or perhaps there's a better way?)
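One hedged approach, assuming Spark 3.1+ for Column.withField: collect the offices arrays into a temporary column, flatten them, and then overwrite the nested field afterwards:
from pyspark.sql import functions as F

agg_list = [
    F.first("first_name").alias("first_name"),
    F.first("last_name").alias("last_name"),
    F.first("medical_group").alias("medical_group"),
    # gather each group's offices arrays and flatten them into one array
    F.flatten(F.collect_list("medical_group.offices")).alias("all_offices"),
]

provider_structs_grouped = (
    provider_structs_structure
    .groupBy(F.col("unique_id"), F.col("medical_group.unique_id"))
    .agg(*agg_list)
    # overwrite the nested offices field with the aggregated array
    .withColumn("medical_group",
                F.col("medical_group").withField("offices", F.col("all_offices")))
    .drop("all_offices")
)
withField rebuilds the struct with just that one field replaced, which avoids re-specifying every other field of medical_group by hand.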
I have a dataframe in this format and I would like to rearrange the fields inside the item column.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: long (nullable = true)
So this is the desired format I'm looking for.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- c: long (nullable = true)
| | |-- b: string (nullable = true)
You can use the transform function:
from pyspark.sql import functions as F
result = df.withColumn(
    "item",
    F.expr("transform(item, x -> struct(x.a as a, x.c as c, x.b as b))")
)
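On Spark 3.1+ the same reordering can also be written with the native Python transform function instead of a SQL expression; a minimal sketch:
from pyspark.sql import functions as F

# rebuild each array element as a struct with the fields in the new order
result = df.withColumn(
    "item",
    F.transform("item", lambda x: F.struct(
        x["a"].alias("a"),
        x["c"].alias("c"),
        x["b"].alias("b"),
    ))
)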
I have a spark dataframe with the following schema:
root
|-- stat_chiamate: struct (nullable = true)
| |-- chiamate_ricevute: struct (nullable = true)
| | |-- h_0: string (nullable = true)
| | |-- h_1: string (nullable = true)
| | |-- h_10: string (nullable = true)
| | |-- h_11: string (nullable = true)
| | |-- h_12: string (nullable = true)
| | |-- h_13: string (nullable = true)
| | |-- h_14: string (nullable = true)
| | |-- h_15: string (nullable = true)
| | |-- h_16: string (nullable = true)
| | |-- h_17: string (nullable = true)
| | |-- h_18: string (nullable = true)
| | |-- h_19: string (nullable = true)
| | |-- h_2: string (nullable = true)
| | |-- h_20: string (nullable = true)
| | |-- h_21: string (nullable = true)
| | |-- h_22: string (nullable = true)
| | |-- h_23: string (nullable = true)
| | |-- h_3: string (nullable = true)
| | |-- h_4: string (nullable = true)
| | |-- h_5: string (nullable = true)
| | |-- h_6: string (nullable = true)
| | |-- h_7: string (nullable = true)
| | |-- h_8: string (nullable = true)
| | |-- h_9: string (nullable = true)
| | |-- n_totale: string (nullable = true)
I want a dataframe like:
root
|-- stat_chiamate: struct (nullable = true)
| |-- chiamate_ricevute: array (nullable = true)
| | |-- element: string (containsNull = true)
where chiamate_ricevute is a list of the field values, for example:
h_0= 0
h_1= 1
h_2= 2
.
.
.
h_23=23
n_totale=412
I want:
[0,1,2,...,23] <-- I don't want the n_totale value
In my code I use df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()[:-1], but that gives me only the bare field names; how can I use them?
df = df.select(F.array(*[
    field for field in df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()
    if field.startswith("h_")
]).alias("CIRCO"))
You could use the schema of the dataframe, and in particular the schema of your struct, to extract all the field names except n_totale and then wrap them into an array.
from pyspark.sql import functions as f
fields = ['stat_chiamate.chiamate_ricevute.' + field.name
          for field in df.schema['stat_chiamate'].dataType['chiamate_ricevute'].dataType
          if field.name != 'n_totale']
result = df.select(f.array(*fields).alias("chiamate_ricevute"))
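Note that the struct fields come back in schema order, which here is lexicographic (h_0, h_1, h_10, h_11, ...). If the array must follow the hour order instead, a hedged variant sorts the names numerically first:
# sort the h_* field names by their numeric suffix before building the array
hour_fields = sorted(
    [field.name
     for field in df.schema['stat_chiamate'].dataType['chiamate_ricevute'].dataType
     if field.name != 'n_totale'],
    key=lambda name: int(name.split('_')[1]))
result = df.select(
    f.array(*['stat_chiamate.chiamate_ricevute.' + name for name in hour_fields])
     .alias("chiamate_ricevute"))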
I have a dataframe that looks like this:
df.printSchema()
root
|-- id: integer (nullable = true)
|-- data: struct (nullable = true)
| |-- foo01: string (nullable = true)
| |-- bar01: string (nullable = true)
| |-- foo02: string (nullable = true)
| |-- bar02: string (nullable = true)
and I want to transform it to
root
|-- id: integer (nullable = true)
|-- foo: struct (nullable = true)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = true)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)
What is the best way to go about this?
You can simply use the struct PySpark function.
from pyspark.sql.functions import struct
new_df = df.select(
    'id',
    struct('data.foo01', 'data.foo02').alias('foo'),
    struct('data.bar01', 'data.bar02').alias('bar'),
)
Additional note on the struct PySpark function: it can take either a list of string column names, if you just need to move existing columns into the struct, or a list of column expressions if the fields need transforming.
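If data has many fields, a variant can derive the groups from the schema instead of listing every column by hand; a minimal sketch, assuming each field name starts with the prefix of its target struct:
from pyspark.sql.functions import struct

# build one struct per prefix from the fields of the data column
prefixes = ["foo", "bar"]
new_df = df.select(
    "id",
    *[
        struct(*[f"data.{f.name}" for f in df.schema["data"].dataType
                 if f.name.startswith(p)]).alias(p)
        for p in prefixes
    ],
)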
You can use the struct function with select as below:
from pyspark.sql import functions as F
finalDF = df.select(
    "id",
    F.struct("data.foo01", "data.foo02").alias("foo"),
    F.struct("data.bar01", "data.bar02").alias("bar")
)
finalDF.printSchema()
Schema:
root
|-- id: integer (nullable = true)
|-- foo: struct (nullable = false)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = false)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)
I have a Dataset with the schema below.
root
|-- acct_id: long (nullable = true)
|-- firm_bnkg_id: integer (nullable = true)
|-- tagged: long (nullable = true)
|-- transactions: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- mo_yr_buckt: string (nullable = false)
| | |-- acct_id: long (nullable = false)
| | |-- eff_dt: date (nullable = true)
| | |-- extn_txn_cd: string (nullable = true)
| | |-- mntr_txn_am: double (nullable = true)
| | |-- cr_dr_in: string (nullable = true)
| | |-- txn_desc_tx: string (nullable = true)
| | |-- txn_auth_dt: date (nullable = false)
| | |-- txn_auth_ts: string (nullable = false)
| | |-- tagged: long (nullable = true)
| | |-- firm_bnkg_id: integer (nullable = false)
| | |-- txn_pst_sq_nb: string (nullable = false)
| | |-- pst_dt: integer (nullable = false)
|-- prty_ol_prfl_id: long (nullable = true)
|-- prod_cd: string (nullable = true)
|-- acct_type_cd: string (nullable = true)
|-- acct_state_cd: string (nullable = true)
Now I want to change the current code to a SQL statement. The current code is like this:
val result = ds.select(col("*"), explode(col("transactions")).as("txn"))
.where("IsValidUDF(txn) = TRUE").groupBy("prty_ol_prfl_id")
.agg(collect_list("txn").as("transactions"))
which produces the following schema:
root
|-- acct_id: long (nullable = true)
|-- firm_bnkg_id: integer (nullable = true)
|-- tagged: long (nullable = true)
|-- transactions: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- mo_yr_buckt: string (nullable = false)
| | |-- acct_id: long (nullable = false)
| | |-- eff_dt: date (nullable = true)
| | |-- extn_txn_cd: string (nullable = true)
| | |-- mntr_txn_am: double (nullable = true)
| | |-- cr_dr_in: string (nullable = true)
| | |-- txn_desc_tx: string (nullable = true)
| | |-- txn_auth_dt: date (nullable = false)
| | |-- txn_auth_ts: string (nullable = false)
| | |-- tagged: long (nullable = true)
| | |-- firm_bnkg_id: integer (nullable = false)
| | |-- txn_pst_sq_nb: string (nullable = false)
| | |-- pst_dt: integer (nullable = false)
|-- prty_ol_prfl_id: long (nullable = true)
|-- prod_cd: string (nullable = true)
|-- acct_type_cd: string (nullable = true)
|-- acct_state_cd: string (nullable = true)
The IsValidUDF just checks the column tagged for certain values.
Any help would be appreciated.
Thanks
The translation of your code to a Spark SQL statement is:
val new_df = spark.sql("""
  WITH temp AS (
    SELECT *, explode(transactions) AS txn FROM df
  )
  SELECT first(id) id, collect_list(txn) AS TRANSACTIONS
  FROM temp
  WHERE IsValidUDF(txn) = TRUE
  GROUP BY id
""")
(Just replace first(id) with first(...) for every column you want to have in the resulting dataframe.)
Beforehand, make sure that your UDF is registered:
spark.udf.register("IsValidUDF", is_valid_udf)
Here is the complete code with a toy example:
import org.apache.spark.sql.functions.udf
// Toy example
val df = Seq((0, List(66, 1)), (1, List(98, 2)), (2, List(90))).toDF("id", "transactions")
df.createOrReplaceTempView("df")
val is_valid_udf = udf((r: Int) => r > 50)
// register udf
spark.udf.register("IsValidUDF", is_valid_udf)
// query
val new_df = spark.sql("""
  WITH temp AS (
    SELECT *, explode(transactions) AS txn FROM df
  )
  SELECT first(id) id, collect_list(txn) AS TRANSACTIONS
  FROM temp
  WHERE IsValidUDF(txn) = TRUE
  GROUP BY id
""")
Output:
+---+------------+
| id|TRANSACTIONS|
+---+------------+
| 1| [98]|
| 2| [90]|
| 0| [66]|
+---+------------+
which is the original dataframe with transactions ≤ 50 removed.