I have a schema like this:
root
|-- first_name: string (nullable = true)
|-- medical_group: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- phone_number: string (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- zip: string (nullable = true)
| | |-- secondary_phone_number: string (nullable = true)
| | |-- address1: string (nullable = true)
| |-- offices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- address: struct (nullable = true)
| | | | |-- phone_number: string (nullable = true)
| | | | |-- city: string (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- address2: string (nullable = true)
| | | | |-- zip: string (nullable = true)
| | | | |-- secondary_phone_number: string (nullable = true)
| | | | |-- address1: string (nullable = true)
| | | |-- name: string (nullable = true)
| |-- unique_id: string (nullable = true)
| |-- name: string (nullable = true)
|-- unique_id: string (nullable = true)
|-- last_name: string (nullable = true)
And I'm trying to groupBy/aggregate the data by key1 and key2. The aggregation process is to remove duplicate rows in the DataFrame primarily, as well as aggregate the array of offices.
agg_list = [
first("first_name").alias("first_name"),
first("last_name").alias("last_name"),
first("medical_group").alias("medical_group"),
# Maybe this? collect_list("medical_group.offices").alias("mg.offices")
]
provider_structs_grouped = \
provider_structs_structure \
.groupBy(col('unique_id'), col('medical_group.unique_id')) \
.agg(*agg_list)
I thought I could create a temporary column that uses collect_list, update the nested struct's offices value, and then drop the temporary column, but I was struggling to update the nested struct's value.
Question: How can I aggregate/collect_list the offices, and update the nested offices array with that latest value? (Or perhaps there's a better way?)
Related
I have a JSON Schema with keys in camel case and I am trying to convert all data type to lower case.
I am facing issue with the ArrayType.
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.types.{DataType, StructType}
import spark.implicits._
val spark: SparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
var sample_schema = spark.read.json("path").schema
def columnsToLowercase(schema: StructType): StructType = {
def recurRename(schema: StructType): Seq[StructField] =
schema.fields.map {
case StructField(name, dtype: StructType, nullable, meta) =>
StructField(name.toLowerCase, StructType(recurRename(dtype)), nullable, meta)
case StructField(name, dtype, nullable, meta) =>
StructField(name.toLowerCase, dtype, nullable, meta)
}
StructType(recurRename(schema))
}
val jsonDFrame: DataFrame = spark.read.schema(columnsToLowercase(sample_schema)).json("path")
Sample Schema:
root
|-- id: string (nullable = true)
|-- master: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- provInfo: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- claimInfo: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- claimId: string (nullable = true)
| | | |-- demoInfo: struct (nullable = true)
| | | | |-- family: struct (nullable = true)
| | | | | |-- outOrder: struct (nullable = true)
| | | | | | |-- LocOut: boolean (nullable = true)
| | | | | | |-- found: boolean (nullable = true)
| |-- claimAddr: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- address: string (nullable = true)
|-- system: string (nullable = true)
You should be able to lowercase fields nested in ArrayType by adding another case clause. For array columns, you also need to check its sub-elements type:
def columnsToLowercase(schema: StructType): StructType = {
// ....
case StructField(name, dtype: ArrayType, nullable, meta) => dtype.elementType match {
case s: StructType => StructField(name.toLowerCase, ArrayType(StructType(recurRename(s)), true), nullable, meta)
case dt => StructField(name.toLowerCase, dt, nullable, meta)
}
//....
}
Applying on your schema:
df.printSchema
//root
// |-- id: string (nullable = true)
// |-- master: struct (nullable = true)
// | |-- provInfo: struct (nullable = true)
// | | |-- claimInfo: array (nullable = true)
// | | | |-- element: struct (containsNull = true)
// | | | | |-- claimId: string (nullable = true)
// | | |-- demoInfo: struct (nullable = true)
// | | | |-- family: struct (nullable = true)
// | | | | |-- outOrder: struct (nullable = true)
// | | | | | |-- LocOut: boolean (nullable = false)
// | | | | | |-- found: boolean (nullable = false)
// | |-- claimAddr: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- address: string (nullable = true)
// |-- system: string (nullable = true)
columnsToLowercase(df.schema).printTreeString()
//root
// |-- id: string (nullable = true)
// |-- master: struct (nullable = true)
// | |-- provinfo: struct (nullable = true)
// | | |-- claiminfo: array (nullable = true)
// | | | |-- element: struct (containsNull = true)
// | | | | |-- claimid: string (nullable = true)
// | | |-- demoinfo: struct (nullable = true)
// | | | |-- family: struct (nullable = true)
// | | | | |-- outorder: struct (nullable = true)
// | | | | | |-- locout: boolean (nullable = false)
// | | | | | |-- found: boolean (nullable = false)
// | |-- claimaddr: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- address: string (nullable = true)
// |-- system: string (nullable = true)
I have a spark dataframe with the following schema:
stat_chiamate
|
chiamate_ricevute: struct (nullable = true)
| | |-- h_0: string (nullable = true)
| | |-- h_1: string (nullable = true)
| | |-- h_10: string (nullable = true)
| | |-- h_11: string (nullable = true)
| | |-- h_12: string (nullable = true)
| | |-- h_13: string (nullable = true)
| | |-- h_14: string (nullable = true)
| | |-- h_15: string (nullable = true)
| | |-- h_16: string (nullable = true)
| | |-- h_17: string (nullable = true)
| | |-- h_18: string (nullable = true)
| | |-- h_19: string (nullable = true)
| | |-- h_2: string (nullable = true)
| | |-- h_20: string (nullable = true)
| | |-- h_21: string (nullable = true)
| | |-- h_22: string (nullable = true)
| | |-- h_23: string (nullable = true)
| | |-- h_3: string (nullable = true)
| | |-- h_4: string (nullable = true)
| | |-- h_5: string (nullable = true)
| | |-- h_6: string (nullable = true)
| | |-- h_7: string (nullable = true)
| | |-- h_8: string (nullable = true)
| | |-- h_9: string (nullable = true)
| | |-- n_totale: string (nullable = true)
I want a dataframe like:
stat_chiamate: struct (nullable = true)
|
chiamate_ricevute: Array
|-- element(String)
where chiamate_ricevute is a list of value of field for example:
h_0= 0
h_1= 1
h_2= 2
.
.
.
h_23=23
n_totale=412
I want:
[0,1,2....,23] <-- I don't want n_totale values
In my code i use df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()[:-1] but i have only single fieldsName but how i can use them?
df=df.select(F.array(*[field for field in
df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames() if field.startswith("h_")]).alias("CIRCO"))
You could use the schema of the dataframe, and in particular the schema of your struct to extract all the field names but n_totale and then wrap them into an array.
from pyspark.sql import functions as f
fields = ['chiamate_ricevute.' + field.name for field in df.schema[0].dataType
if field.name != 'n_totale']
result = df.select(f.array(fields).alias("chiamate_ricevute"))
I have a Dataset with the schema below.
root
|-- acct_id: long (nullable = true)
|-- firm_bnkg_id: integer (nullable = true)
|-- tagged: long (nullable = true)
|-- transactions: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- mo_yr_buckt: string (nullable = false)
| | |-- acct_id: long (nullable = false)
| | |-- eff_dt: date (nullable = true)
| | |-- extn_txn_cd: string (nullable = true)
| | |-- mntr_txn_am: double (nullable = true)
| | |-- cr_dr_in: string (nullable = true)
| | |-- txn_desc_tx: string (nullable = true)
| | |-- txn_auth_dt: date (nullable = false)
| | |-- txn_auth_ts: string (nullable = false)
| | |-- tagged: long (nullable = true)
| | |-- firm_bnkg_id: integer (nullable = false)
| | |-- txn_pst_sq_nb: string (nullable = false)
| | |-- pst_dt: integer (nullable = false)
|-- prty_ol_prfl_id: long (nullable = true)
|-- prod_cd: string (nullable = true)
|-- acct_type_cd: string (nullable = true)
|-- acct_state_cd: string (nullable = true)
Now I want to change the current code to a SQL statement. The current code is like this:
val result = ds.select(col("*"), explode(col("transactions")).as("txn"))
.where("IsValidUDF(txn) = TRUE").groupBy("prty_ol_prfl_id")
.agg(collect_list("txn").as("transactions"))
which produces the following schema:
root
|-- acct_id: long (nullable = true)
|-- firm_bnkg_id: integer (nullable = true)
|-- tagged: long (nullable = true)
|-- transactions: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- mo_yr_buckt: string (nullable = false)
| | |-- acct_id: long (nullable = false)
| | |-- eff_dt: date (nullable = true)
| | |-- extn_txn_cd: string (nullable = true)
| | |-- mntr_txn_am: double (nullable = true)
| | |-- cr_dr_in: string (nullable = true)
| | |-- txn_desc_tx: string (nullable = true)
| | |-- txn_auth_dt: date (nullable = false)
| | |-- txn_auth_ts: string (nullable = false)
| | |-- tagged: long (nullable = true)
| | |-- firm_bnkg_id: integer (nullable = false)
| | |-- txn_pst_sq_nb: string (nullable = false)
| | |-- pst_dt: integer (nullable = false)
|-- prty_ol_prfl_id: long (nullable = true)
|-- prod_cd: string (nullable = true)
|-- acct_type_cd: string (nullable = true)
|-- acct_state_cd: string (nullable = true)
The IsValidUDF just checks the column tagged for certain values.
Any help would be appreciated.
Thanks
The translaton of your code to a spark sql statement is:
val new_df = spark.sql("""
WITH temp AS(
SELECT *, explode(transactions) AS txn FROM df
)
SELECT first(id) id, collect_list(txn) AS TRANSACTIONS FROM temp WHERE IsValidUDF(txn) = TRUE GROUP BY id
""")
(just replace first(id) with first(.) with every column you want to have in the resulting dataframe.
Beforehand make sur that your udf is registered:
spark.udf.register("IsValidUDF", is_valid_udf)
Here is the complete code with a toy example:
import org.apache.spark.sql.Row
// Toy example
val df = Seq((0, List(66,1) ),(1, List(98, 2)),(2, List(90))).toDF("id", "transactions")
df.createOrReplaceTempView("df")
val is_valid_udf = udf((r: Int) => r > 50)
// register udf
spark.udf.register("IsValidUDF", is_valid_udf)
// query
val new_df = spark.sql("""
WITH temp AS(
SELECT *, explode(transactions) AS txn FROM df
)
SELECT first(id) id, collect_list(txn) AS TRANSACTIONS FROM temp WHERE IsValidUDF(txn) = TRUE GROUP BY id
""")
Output:
+---+------------+
| id|TRANSACTIONS|
+---+------------+
| 1| [98]|
| 2| [90]|
| 0| [66]|
+---+------------+
which is the original dataframe with transactions > 50 removed.
This is the schema of the data and wanted to extract 'from' in this.
Tried using the
df3 =df.select(df.transcript.data.from.alias("Type"))
and getting invalid syntax error.
How to extract this.
root
|-- contactId: long (nullable = true)
|-- mediaLegId: string (nullable = true)
|-- transcript: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action: string (nullable = true)
| | |-- data: struct (nullable = true)
| | | |-- chatId: string (nullable = true)
| | | |-- customerInfo: struct (nullable = true)
| | | | |-- customerIdentifierToken: string (nullable = true)
| | | | |-- customerIdentifierType: string (nullable = true)
| | | | |-- customerName: string (nullable = true)
| | | | |-- initialQuestion: string (nullable = true)
| | | |-- entryPoint: string (nullable = true)
| | | |-- from: string (nullable = true)
| | | |-- lang: string (nullable = true)
| | | |-- parkDuration: long (nullable = true)
| | | |-- parkNote: string (nullable = true)
| | | |-- participant: struct (nullable = true)
| | | | |-- disconnectReason: string (nullable = true)
| | | | |-- displayName: string (nullable = true)
| | | | |-- participantId: string (nullable = true)
| | | | |-- preferences: struct (nullable = true)
| | | | | |-- language: string (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- userName: string (nullable = true)
| | | |-- reconnected: boolean (nullable = true)
| | | |-- relatedData: string (nullable = true)
| | | |-- text: string (nullable = true)
| | | |-- timestamp: long (nullable = true)
| | | |-- transcriptText: string (nullable = true)
| | | |-- transferNote: string (nullable = true)
| | | |-- transcriptText: string (nullable = true)
| | | |-- transferNote: string (nullable = true)
Try using it like this
from pyspark.sql import functions as F
df.select(F.explode("transcript").alias('transcript')).select('transcript.*').select("data.*").select("from").show()
I had a DataFrame and here's the schema. Numbers of element is unknown but some of the elements(for example element1 and element3) must exist and uniqueness
root
|-- context: struct (nullable = true)
|---|-- key: string (nullable = true)
| |-- data: struct (nullable = true)
| | |-- dimensions: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- element1: string (nullable = true)
| | | | |-- element2: string (nullable = true)
| | | | |-- element3: string (nullable = true)
| | | | |-- *** : string (nullable = true)
| | | | |-- elementN: string (nullable = true)
How can I transform it to schema like this?
root
|-- context: struct (nullable = true)
|---|-- key: string (nullable = true)
|---|-- element1: string (nullable = true)
|---|-- element3: string (nullable = true)
Thanks a lot.
Can you please try the explode function. These are following links, please go through them.
Extract columns in nested Spark DataFrame
Extract value from structure within an array of arrays in spark using scala