PySpark - Remove Illegal Hive Character from schema

PySpark - Remove Illegal Hive Character from schema - hive

I have a really large pyspark dataframe which gets data from json files. This is an example of the schema
|-- Col1: array (nullable = true)
| |-- element: double (containsNull = true)
|-- Col2: struct (nullable = true)
| |-- Col2-Col1: string (nullable = true)
| |-- Col2-Col2: string (nullable = true)
| |-- Col2-Col3: string (nullable = true)
When I do the following, I'm not able to get all the column names within the struct.
df.columns
out: ['Col1', 'Col2']
I need to replace all the hyphens with an underscore so that I can write it to Hive. Hive does not accept '-', '[', '/' etc. within the column name.
For example,
The column names should change to
|-- Col1: array (nullable = true)
| |-- element: double (containsNull = true)
|-- Col2: struct (nullable = true)
| |-- Col2_Col1: string (nullable = true)
| |-- Col2_Col2: string (nullable = true)
| |-- Col2_Col3: string (nullable = true)
The code needs to be generic enough such that many columns can be renamed without hard coding the values.

Related

pyspark: rearrange nested array of struct sequence

I've a dataframe in this format and I would like to rearrange the fields inside item column.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: long (nullable = true)
So this is the desired format I'm looking for.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- c: string (nullable = true)
| | |-- b: long (nullable = true)

You can use transform function:
from pyspark.sql import functions as F
result = df.withColumn(
"item",
F.expr("transform(item, x -> struct(x.a as a, x.c as c, x.b as b))")
)

Adding up two structs by field

I have a DataFrame schema as follows (in PySpark):
root
|-- ID: string (nullable = true)
|-- 2022: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
|-- 2021: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
I want to add 2021 and 2022 columns into one column AGG such that it will contain the sum of 2021 and 2022 for each row in the DataFrame.
I tried the following:
df.select(
'ID',
(F.col("2021") + F.col("2022")).alias("AGG")
).printSchema()
Where the desired output schema should be:
root
|-- ID: string (nullable = true)
|-- AGG: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
But it seems that spark only supports adding numeric types.
Is there a way to add them implicitly without writing it directly for each field in the struct?

No, you can't addition to struct columns that way. You need to create a new struct column by summing the nested fields.
You can use a list comprehension over the list of inner fields to create a new column like this:
from pyspark.sql import functions as F
def add_struct(common_fields, s1, s2):
return F.struct(*[
(s1[f] + s2[f]).alias(f) for f in common_fields
])
# get list of struct fields from schema
fields = df.select("2022.*").columns
df.select(
'ID',
add_struct(fields, F.col("2021"), F.col("2022")).alias("AGG")
)

Writing Pyspark Dataframe to TFrecords file

I have a dataframe with schema, and want to convert this into tfRecords
root
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col3: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col4: array (nullable = true)
| |-- element: float (containsNull = true)
|-- col5: array (nullable = true)
| |-- element: float (containsNull = true)
|-- col6: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- col7: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col8: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col9: array (nullable = true)
| |-- element: string (containsNull = true)
I'm using spark tensorflow connector
df.write.mode("overwrite").format("tfrecords").option("recordType", "Example").save("targetpath.tf")
Error which I'm getting while saving the data into tfrecords
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps
I have tried similar approach in databricks community edition as well , also got the similar erro
Can anyone help here ?

The most probable cause (judging from Maven Central information) is that you're using connector compiled for Scala 2.11 on the Databricks runtime that uses Scala 2.12.
Either you need to use DBR 6.4 for that conversion, or compile connector for Scala 2.12 & use.

How to change all columns data types in StructType or ArrayType columns?

I have a DataFrame including some columns with StructType and ArrayType. I want to cast all IntegerType columns to DoubleType. I found some solutions for this problem. For example this answer does the thing similar to what I want. But the problem is, it does not change the data types of columns that are nested in a StructType or ArrayType column.
For example I have a DataFrame with below schema:
|-- carCategories: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- payerId: integer (nullable = true)
|-- percentage: integer (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: integer (nullable = true)
| |-- min: integer (nullable = true)
After executing below script:
val doubleSchema = df.schema.fields.map{f =>
f match{
case StructField(name:String, _:IntegerType, _, _) => col(name).cast(DoubleType)
case _ => col(f.name)
}
}
df.select(doubleSchema:_*).printSchema
It turns out like this:
|-- carCategories: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- payerId: double (nullable = true)
|-- percentage: double (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: integer (nullable = true)
| |-- min: integer (nullable = true)
As you can see some columns are converted to DoubleType, but the columns that are in ArrayType and StructType are not converted.
I want final schema be like this:
|-- carCategories: array (nullable = true)
| |-- element: double (containsNull = true)
|-- payerId: double (nullable = true)
|-- percentage: double (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: double (nullable = true)
| |-- min: double (nullable = true)
How can I achieve such a thing?
Thank you in advance

You can add case clauses to handle ArrayType and StructType like this:
def castIntToDouble(schema: StructType): Seq[Column] = {
schema.fields.map { f =>
f.dataType match {
case IntegerType => col(f.name).cast(DoubleType)
case StructType(_) =>
col(f.name).cast(
f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
)
case dt: ArrayType =>
dt.elementType match {
case IntegerType => col(f.name).cast(ArrayType(DoubleType))
case StructType(_) =>
col(f.name).cast(
f.dataType.simpleString.replace(s":${IntegerType.simpleString}",s":${DoubleType.simpleString}")
)
case _ => col(f.name)
}
case _ => col(f.name)
}
}
}
The function uses casting with DLL strings format when the type of column is StructType or array of nested structs. e.g. if you have to cast the struct column ratio which has type struct<max:int,min:int> without having to recreate the the whole struct you'd do:
df.withColumn("ratio", col("ratio").cast("struct<max:double,min:double>"))
Now applying this to your input example:
val df = (
Seq((Seq(1, 2, 3), 34, 87, "pending", (65, 22)))
.toDF("carCategories","payerId","percentage","plateNumberStatus","ratio")
.withColumn("ratio", col("ratio").cast("struct<max:int,min:int>"))
)
df.select(castIntToDouble(df.schema):_*).printSchema
//root
// |-- carCategories: array (nullable = true)
// | |-- element: double (containsNull = true)
// |-- payerId: double (nullable = false)
// |-- percentage: double (nullable = false)
// |-- plateNumberStatus: string (nullable = true)
// |-- ratio: struct (nullable = true)
// | |-- max: double (nullable = true)
// | |-- min: double (nullable = true)

Change schema of dataframe to other schema

I hava dataframe that looks like this
df.printSchema()
root
|-- id: integer (nullable = true)
|-- data: struct (nullable = true)
| |-- foo01 string (nullable = true)
| |-- bar01 string (nullable = true)
| |-- foo02 string (nullable = true)
| |-- bar02 string (nullable = true)
and I want to transform it to
root
|-- id: integer (nullable = true)
|-- foo: struct (nullable = true)
| |-- foo01 string (nullable = true)
| |-- foo02 string (nullable = true)
|-- bar: struct (nullable = true)
| |-- bar01 string (nullable = true)
| |-- bar02 string (nullable = true)
What is the best way to go about this?

You can simply use the struct Pyspark function.
from pyspark.sql.functions import struct
new_df = df.select(
'id',
struct('data.foo01', 'data.foo02').alias('foo'),
struct('data.bar01', 'data.bar02').alias('bar'),
)
Additional note related to the struct Pyspark function: It can either take a list of string column names to only move columns into the struct or if you need a list of expressions.

You can use struct function with select as below:
from pyspark.sql import functions as F
finalDF = df.select( "id",
F.struct("data.foo01", "data.foo02").alias("foo"),
F.struct("data.bar01", "data.bar02").alias("bar")
)
finalDF.printSchema
Schema:
root
|-- id: string (nullable = true)
|-- foo: struct (nullable = false)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = false)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

PySpark - Remove Illegal Hive Character from schema - hive

Related

pyspark: rearrange nested array of struct sequence

Adding up two structs by field

Writing Pyspark Dataframe to TFrecords file

How to change all columns data types in StructType or ArrayType columns?

Change schema of dataframe to other schema

Categories

Resources