I have a JSON schema with keys in camelCase and I am trying to convert all field names to lowercase.
I am facing an issue with ArrayType.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{ArrayType, DataType, StructField, StructType}

val spark: SparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

val sample_schema = spark.read.json("path").schema
def columnsToLowercase(schema: StructType): StructType = {
  def recurRename(schema: StructType): Seq[StructField] =
    schema.fields.map {
      case StructField(name, dtype: StructType, nullable, meta) =>
        StructField(name.toLowerCase, StructType(recurRename(dtype)), nullable, meta)
      case StructField(name, dtype, nullable, meta) =>
        StructField(name.toLowerCase, dtype, nullable, meta)
    }
  StructType(recurRename(schema))
}
val jsonDFrame: DataFrame = spark.read.schema(columnsToLowercase(sample_schema)).json("path")
Sample Schema:
root
|-- id: string (nullable = true)
|-- master: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- provInfo: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- claimInfo: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- claimId: string (nullable = true)
| | | |-- demoInfo: struct (nullable = true)
| | | | |-- family: struct (nullable = true)
| | | | | |-- outOrder: struct (nullable = true)
| | | | | | |-- LocOut: boolean (nullable = true)
| | | | | | |-- found: boolean (nullable = true)
| |-- claimAddr: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- address: string (nullable = true)
|-- system: string (nullable = true)
You should be able to lowercase fields nested in an ArrayType by adding another case clause. For array columns, you also need to match on the element type:
def columnsToLowercase(schema: StructType): StructType = {
  // ....
  case StructField(name, dtype: ArrayType, nullable, meta) => dtype.elementType match {
    case s: StructType =>
      StructField(name.toLowerCase, ArrayType(StructType(recurRename(s)), dtype.containsNull), nullable, meta)
    // keep the ArrayType wrapper when the elements are not structs
    case _ => StructField(name.toLowerCase, dtype, nullable, meta)
  }
  // ....
}
Applying it on a DataFrame with a similar schema:
df.printSchema
//root
// |-- id: string (nullable = true)
// |-- master: struct (nullable = true)
// | |-- provInfo: struct (nullable = true)
// | | |-- claimInfo: array (nullable = true)
// | | | |-- element: struct (containsNull = true)
// | | | | |-- claimId: string (nullable = true)
// | | |-- demoInfo: struct (nullable = true)
// | | | |-- family: struct (nullable = true)
// | | | | |-- outOrder: struct (nullable = true)
// | | | | | |-- LocOut: boolean (nullable = false)
// | | | | | |-- found: boolean (nullable = false)
// | |-- claimAddr: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- address: string (nullable = true)
// |-- system: string (nullable = true)
columnsToLowercase(df.schema).printTreeString()
//root
// |-- id: string (nullable = true)
// |-- master: struct (nullable = true)
// | |-- provinfo: struct (nullable = true)
// | | |-- claiminfo: array (nullable = true)
// | | | |-- element: struct (containsNull = true)
// | | | | |-- claimid: string (nullable = true)
// | | |-- demoinfo: struct (nullable = true)
// | | | |-- family: struct (nullable = true)
// | | | | |-- outorder: struct (nullable = true)
// | | | | | |-- locout: boolean (nullable = false)
// | | | | | |-- found: boolean (nullable = false)
// | |-- claimaddr: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- address: string (nullable = true)
// |-- system: string (nullable = true)
Related
I have a schema like this:
root
|-- first_name: string (nullable = true)
|-- medical_group: struct (nullable = true)
| |-- address: struct (nullable = true)
| | |-- phone_number: string (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- zip: string (nullable = true)
| | |-- secondary_phone_number: string (nullable = true)
| | |-- address1: string (nullable = true)
| |-- offices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- address: struct (nullable = true)
| | | | |-- phone_number: string (nullable = true)
| | | | |-- city: string (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- address2: string (nullable = true)
| | | | |-- zip: string (nullable = true)
| | | | |-- secondary_phone_number: string (nullable = true)
| | | | |-- address1: string (nullable = true)
| | | |-- name: string (nullable = true)
| |-- unique_id: string (nullable = true)
| |-- name: string (nullable = true)
|-- unique_id: string (nullable = true)
|-- last_name: string (nullable = true)
And I'm trying to groupBy/aggregate the data by two keys (unique_id and medical_group.unique_id). The aggregation is primarily meant to remove duplicate rows from the DataFrame, and also to aggregate the array of offices.
from pyspark.sql.functions import col, collect_list, first

agg_list = [
    first("first_name").alias("first_name"),
    first("last_name").alias("last_name"),
    first("medical_group").alias("medical_group"),
    # Maybe this? collect_list("medical_group.offices").alias("mg.offices")
]
provider_structs_grouped = \
provider_structs_structure \
.groupBy(col('unique_id'), col('medical_group.unique_id')) \
.agg(*agg_list)
I thought I could create a temporary column that uses collect_list, update the nested struct's offices value, and then drop the temporary column, but I was struggling to update the nested struct's value.
Question: How can I aggregate/collect_list the offices, and update the nested offices array with that latest value? (Or perhaps there's a better way?)
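One way to realize the temporary-column idea is a sketch like the following, assuming Spark 3.1+ (for Column.withField) and the column names from the question: collect the offices arrays across the group, flatten them into one array, and write that array back into the medical_group struct before dropping the temporary column.
from pyspark.sql import functions as F

agg_list = [
    F.first("first_name").alias("first_name"),
    F.first("last_name").alias("last_name"),
    F.first("medical_group").alias("medical_group"),
    # temporary column: one flat array with every office in the group
    F.flatten(F.collect_list("medical_group.offices")).alias("offices_tmp"),
]

provider_structs_grouped = (
    provider_structs_structure
    .groupBy(F.col("unique_id"), F.col("medical_group.unique_id"))
    .agg(*agg_list)
    # write the aggregated array back into the nested struct (Spark 3.1+)
    .withColumn("medical_group",
                F.col("medical_group").withField("offices", F.col("offices_tmp")))
    .drop("offices_tmp")
)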
I have a dataframe with this schema
root
|-- AUTHOR_ID: integer (nullable = false)
|-- NAME: string (nullable = true)
|-- Books: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- BOOK_ID: integer (nullable = false)
| | |-- Chapters: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- NAME: string (nullable = true)
| | | | |-- NUMBER_PAGES: integer (nullable = true)
How can I flatten all columns into one level with PySpark?
Using the inline function:
df2 = (df.selectExpr("AUTHOR_ID", "NAME", "inline(Books)")
.selectExpr("*", "inline(Chapters)")
.drop("Chapters")
)
Or explode:
from pyspark.sql import functions as F
df2 = (df.withColumn("Books", F.explode("Books"))
         .select("*", "Books.*").drop("Books")
         .withColumn("Chapters", F.explode("Chapters"))
         .select("*", "Chapters.*").drop("Chapters")
      )
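Note that both versions end up with two NAME columns, the author's and the chapter's, because the Chapters elements also contain a NAME field. If that ambiguity matters downstream, one option (the AUTHOR_NAME/CHAPTER_NAME aliases here are illustrative) is to rename before and after flattening:
df2 = (df.selectExpr("AUTHOR_ID", "NAME as AUTHOR_NAME", "inline(Books)")
         .selectExpr("*", "inline(Chapters)")
         .drop("Chapters")
         .withColumnRenamed("NAME", "CHAPTER_NAME")
      )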
I have a spark dataframe with the following schema:
root
|-- stat_chiamate: struct (nullable = true)
| |-- chiamate_ricevute: struct (nullable = true)
| | |-- h_0: string (nullable = true)
| | |-- h_1: string (nullable = true)
| | |-- h_10: string (nullable = true)
| | |-- h_11: string (nullable = true)
| | |-- h_12: string (nullable = true)
| | |-- h_13: string (nullable = true)
| | |-- h_14: string (nullable = true)
| | |-- h_15: string (nullable = true)
| | |-- h_16: string (nullable = true)
| | |-- h_17: string (nullable = true)
| | |-- h_18: string (nullable = true)
| | |-- h_19: string (nullable = true)
| | |-- h_2: string (nullable = true)
| | |-- h_20: string (nullable = true)
| | |-- h_21: string (nullable = true)
| | |-- h_22: string (nullable = true)
| | |-- h_23: string (nullable = true)
| | |-- h_3: string (nullable = true)
| | |-- h_4: string (nullable = true)
| | |-- h_5: string (nullable = true)
| | |-- h_6: string (nullable = true)
| | |-- h_7: string (nullable = true)
| | |-- h_8: string (nullable = true)
| | |-- h_9: string (nullable = true)
| | |-- n_totale: string (nullable = true)
I want a dataframe like:
root
|-- stat_chiamate: struct (nullable = true)
| |-- chiamate_ricevute: array (nullable = true)
| | |-- element: string (containsNull = true)
where chiamate_ricevute is the list of the field values, for example:
h_0 = 0
h_1 = 1
h_2 = 2
...
h_23 = 23
n_totale = 412
I want:
[0, 1, 2, ..., 23] <-- I don't want the n_totale value
In my code I use df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()[:-1], which gives me the individual field names, but how can I use them?
df = df.select(F.array(*[field for field in df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()
                         if field.startswith("h_")]).alias("CIRCO"))
You could use the schema of the dataframe, and in particular the schema of your struct, to extract all the field names except n_totale and then wrap them into an array.
from pyspark.sql import functions as f
fields = ['chiamate_ricevute.' + field.name for field in df.schema[0].dataType
          if field.name != 'n_totale']
result = df.select(f.array(fields).alias("chiamate_ricevute"))
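Two caveats worth noting: df.schema[0].dataType assumes the struct of interest sits in the first top-level column, and schema order is lexicographic (h_0, h_1, h_10, h_11, ...), not numeric. A sketch that navigates the nesting explicitly and sorts the fields numerically, assuming the stat_chiamate/chiamate_ricevute layout shown above:
from pyspark.sql import functions as f

# Navigate to the inner struct explicitly (layout from the question):
inner = df.schema["stat_chiamate"].dataType["chiamate_ricevute"].dataType

# Sort the h_* fields numerically so the array comes out as h_0 .. h_23:
names = sorted((fld.name for fld in inner.fields if fld.name != "n_totale"),
               key=lambda n: int(n.split("_")[1]))

result = df.select(
    f.array(*["stat_chiamate.chiamate_ricevute." + n for n in names])
     .alias("chiamate_ricevute"))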
This is the schema of the data, and I want to extract the 'from' field from it.
I tried:
df3 = df.select(df.transcript.data.from.alias("Type"))
and I get an invalid syntax error.
How can I extract it?
root
|-- contactId: long (nullable = true)
|-- mediaLegId: string (nullable = true)
|-- transcript: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action: string (nullable = true)
| | |-- data: struct (nullable = true)
| | | |-- chatId: string (nullable = true)
| | | |-- customerInfo: struct (nullable = true)
| | | | |-- customerIdentifierToken: string (nullable = true)
| | | | |-- customerIdentifierType: string (nullable = true)
| | | | |-- customerName: string (nullable = true)
| | | | |-- initialQuestion: string (nullable = true)
| | | |-- entryPoint: string (nullable = true)
| | | |-- from: string (nullable = true)
| | | |-- lang: string (nullable = true)
| | | |-- parkDuration: long (nullable = true)
| | | |-- parkNote: string (nullable = true)
| | | |-- participant: struct (nullable = true)
| | | | |-- disconnectReason: string (nullable = true)
| | | | |-- displayName: string (nullable = true)
| | | | |-- participantId: string (nullable = true)
| | | | |-- preferences: struct (nullable = true)
| | | | | |-- language: string (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- userName: string (nullable = true)
| | | |-- reconnected: boolean (nullable = true)
| | | |-- relatedData: string (nullable = true)
| | | |-- text: string (nullable = true)
| | | |-- timestamp: long (nullable = true)
| | | |-- transcriptText: string (nullable = true)
| | | |-- transferNote: string (nullable = true)
Try it like this. `from` is a reserved keyword in Python, so the attribute syntax df.transcript.data.from is rejected by the parser before Spark ever sees it; reference the columns by their string names instead. Since transcript is an array, explode it first:
from pyspark.sql import functions as F

(df.select(F.explode("transcript").alias("transcript"))
   .select("transcript.*")
   .select("data.*")
   .select("from")
   .show())
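If you don't need one row per message, a minimal alternative without explode, assuming Spark's dotted field access on an array of structs (it returns one array of values per row):
from pyspark.sql import functions as F

# `from` only breaks the Python attribute syntax; inside a string-based
# column reference it is a perfectly valid field name:
df3 = df.select(F.col("transcript.data.from").alias("Type"))
df3.show()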
I had a DataFrame with the following schema. The number of elements is unknown, but some of them (for example element1 and element3) are guaranteed to exist and to be unique.
root
|-- context: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- data: struct (nullable = true)
| | |-- dimensions: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- element1: string (nullable = true)
| | | | |-- element2: string (nullable = true)
| | | | |-- element3: string (nullable = true)
| | | | |-- *** : string (nullable = true)
| | | | |-- elementN: string (nullable = true)
How can I transform it to schema like this?
root
|-- context: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- element1: string (nullable = true)
| |-- element3: string (nullable = true)
Thanks a lot.
You can try the explode function. These links cover similar cases; please go through them:
Extract columns in nested Spark DataFrame
Extract value from structure within an array of arrays in spark using scala
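For reference, a minimal PySpark sketch of the reshaping, assuming each element1/element3 value lives in the first entry of the dimensions array (adjust the indexing if the values are spread across entries):
from pyspark.sql import functions as F

dims = F.col("context.data.dimensions")
result = df.select(
    F.struct(
        F.col("context.key").alias("key"),
        dims.getItem(0).getField("element1").alias("element1"),
        dims.getItem(0).getField("element3").alias("element3"),
    ).alias("context")
)
result.printSchema()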