Writing a PySpark DataFrame to a TFRecords file - tensorflow

I have a dataframe with the following schema and want to convert it to TFRecords:
root
|-- col1: string (nullable = true)
|-- col2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col3: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col4: array (nullable = true)
| |-- element: float (containsNull = true)
|-- col5: array (nullable = true)
| |-- element: float (containsNull = true)
|-- col6: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- col7: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col8: array (nullable = true)
| |-- element: string (containsNull = true)
|-- col9: array (nullable = true)
| |-- element: string (containsNull = true)
I'm using the spark-tensorflow-connector:
df.write.mode("overwrite").format("tfrecords").option("recordType", "Example").save("targetpath.tf")
This is the error I get while saving the data as TFRecords:
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)Lscala/collection/mutable/ArrayOps
I tried the same approach in Databricks Community Edition and got a similar error.
Can anyone help?

The most probable cause (judging from the Maven Central information) is that you're using a connector compiled for Scala 2.11 on a Databricks runtime that uses Scala 2.12. A NoSuchMethodError on scala.Predef$.refArrayOps is a typical symptom of that kind of binary mismatch.
You either need to use DBR 6.4 for the conversion, or compile the connector for Scala 2.12 and use that build.
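One way to catch this kind of mismatch early is to compare the Scala suffix of the connector artifact with the Scala version of the cluster. The sketch below is only illustrative: the helper names are hypothetical, the artifact id `spark-tensorflow-connector_2.11` is used as an assumed example, and the runtime Scala version has to be supplied by you (for instance from the runtime's release notes).

```python
import re

def scala_suffix(artifact_id):
    """Return the Scala binary version encoded in a Maven artifact id, or None."""
    match = re.search(r"_(\d+\.\d+)$", artifact_id)
    return match.group(1) if match else None

def is_binary_compatible(artifact_id, runtime_scala_version):
    """True when the artifact's Scala suffix equals the runtime's major.minor version."""
    suffix = scala_suffix(artifact_id)
    runtime_binary = ".".join(runtime_scala_version.split(".")[:2])
    return suffix is not None and suffix == runtime_binary

# Assumed example: a 2.11 build of the connector on a Scala 2.12 runtime
print(is_binary_compatible("spark-tensorflow-connector_2.11", "2.12.10"))
```

If the check returns False, the jar was most likely built for a different Scala line than the cluster runs.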

Related

pyspark: rearrange nested array of struct sequence

I have a dataframe in this format and I would like to rearrange the fields inside the item column.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: long (nullable = true)
This is the desired format:
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- c: long (nullable = true)
| | |-- b: string (nullable = true)
You can use the transform function (a Spark SQL higher-order function, available since Spark 2.4):
from pyspark.sql import functions as F

result = df.withColumn(
    "item",
    F.expr("transform(item, x -> struct(x.a as a, x.c as c, x.b as b))")
)
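If the field order should not be hard-coded, the expression string can be generated from a list of names. This is a minimal sketch: `reorder_struct_expr` is a hypothetical helper, and it assumes the field names are plain identifiers that need no backtick quoting.

```python
def reorder_struct_expr(array_col, field_order):
    """Build a Spark SQL transform() expression that rebuilds every struct
    element of `array_col` with its fields in the order given."""
    fields = ", ".join(f"x.{name} as {name}" for name in field_order)
    return f"transform({array_col}, x -> struct({fields}))"

expr = reorder_struct_expr("item", ["a", "c", "b"])
print(expr)  # transform(item, x -> struct(x.a as a, x.c as c, x.b as b))
```

The resulting string can be passed to F.expr in the same withColumn call.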

Error while performing aggregate functions in Spark: ArrayType cannot be cast to org.apache.spark.sql.types.StructType

I am creating a Spark DataFrame from a JSON file containing GPS data. When I try to calculate the average of a column, I get the following error:
Py4JJavaError: An error occurred while calling o470.collectToPython.
: java.lang.ClassCastException: org.apache.spark.sql.types.ArrayType cannot be cast to org.apache.spark.sql.types.StructType
I don't understand this error, since I don't have an ArrayType. This is my schema:
root
|-- LastUpdateData: string (nullable = true)
|-- DataGenerated: string (nullable = true)
|-- Delay: long (nullable = true)
|-- GPSQuality: long (nullable = true)
|-- Lat: double (nullable = true)
|-- Line: string (nullable = true)
|-- Lon: double (nullable = true)
|-- Route: string (nullable = true)
|-- Speed: long (nullable = true)
|-- VehicleCode: string (nullable = true)
|-- VehicleId: long (nullable = true)
|-- VehicleService: string (nullable = true)
StructType(List(StructField(LastUpdateData,StringType,true),StructField(DataGenerated,StringType,true),StructField(Delay,LongType,true),StructField(GPSQuality,LongType,true),StructField(Lat,DoubleType,true),StructField(Line,StringType,true),StructField(Lon,DoubleType,true),StructField(Route,StringType,true),StructField(Speed,LongType,true),StructField(VehicleCode,StringType,true),StructField(VehicleId,LongType,true),StructField(VehicleService,StringType,true)))
Here is my code:
df.agg({"Delay": "avg"}).collect()
Try the column-function form of the aggregation instead:
from pyspark.sql import functions

# Compute the average value of the Delay column
delay_df = df.agg(functions.avg("Delay"))
# View the output
delay_df.show()

How to change all columns data types in StructType or ArrayType columns?

I have a DataFrame that includes some columns of StructType and ArrayType. I want to cast all IntegerType columns to DoubleType. I found some solutions for this problem; for example, this answer does something similar to what I want. The problem is that it does not change the data types of columns nested in a StructType or ArrayType column.
For example, I have a DataFrame with the schema below:
|-- carCategories: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- payerId: integer (nullable = true)
|-- percentage: integer (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: integer (nullable = true)
| |-- min: integer (nullable = true)
After executing the script below:
val doubleSchema = df.schema.fields.map { f =>
  f match {
    case StructField(name: String, _: IntegerType, _, _) => col(name).cast(DoubleType)
    case _ => col(f.name)
  }
}
df.select(doubleSchema: _*).printSchema
It turns out like this:
|-- carCategories: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- payerId: double (nullable = true)
|-- percentage: double (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: integer (nullable = true)
| |-- min: integer (nullable = true)
As you can see, the top-level columns are converted to DoubleType, but the fields nested inside ArrayType and StructType columns are not.
I want the final schema to look like this:
|-- carCategories: array (nullable = true)
| |-- element: double (containsNull = true)
|-- payerId: double (nullable = true)
|-- percentage: double (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: double (nullable = true)
| |-- min: double (nullable = true)
How can I achieve such a thing?
Thank you in advance
You can add case clauses to handle ArrayType and StructType like this:
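import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

def castIntToDouble(schema: StructType): Seq[Column] = {
  schema.fields.map { f =>
    f.dataType match {
      // Top-level integer column: cast directly
      case IntegerType => col(f.name).cast(DoubleType)
      // Struct column: cast via its type string with ":int" rewritten to ":double"
      case StructType(_) =>
        col(f.name).cast(
          f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
        )
      case dt: ArrayType =>
        dt.elementType match {
          // Array of integers
          case IntegerType => col(f.name).cast(ArrayType(DoubleType))
          // Array of structs: same type-string rewrite as above
          case StructType(_) =>
            col(f.name).cast(
              f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
            )
          case _ => col(f.name)
        }
      // Everything else is left untouched
      case _ => col(f.name)
    }
  }
}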
The function casts with a DDL-format type string when the column is a StructType or an array of structs. For example, to cast the struct column ratio, which has type struct<max:int,min:int>, without having to recreate the whole struct, you would do:
df.withColumn("ratio", col("ratio").cast("struct<max:double,min:double>"))
Now applying this to your input example:
val df = (
  Seq((Seq(1, 2, 3), 34, 87, "pending", (65, 22)))
    .toDF("carCategories", "payerId", "percentage", "plateNumberStatus", "ratio")
    .withColumn("ratio", col("ratio").cast("struct<max:int,min:int>"))
)
df.select(castIntToDouble(df.schema): _*).printSchema
//root
// |-- carCategories: array (nullable = true)
// | |-- element: double (containsNull = true)
// |-- payerId: double (nullable = false)
// |-- percentage: double (nullable = false)
// |-- plateNumberStatus: string (nullable = true)
// |-- ratio: struct (nullable = true)
// | |-- max: double (nullable = true)
// | |-- min: double (nullable = true)
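The same idea can also be written as a recursive walk over the schema instead of a string replacement, which additionally covers cases such as an integer array nested inside a struct. Below is a minimal sketch in Python rather than Scala. It works on the nested dict form of a schema, which in PySpark is what `df.schema.jsonValue()` returns. `ints_to_doubles` and the sample dict are hypothetical, and map types are deliberately left unchanged.

```python
def ints_to_doubles(data_type):
    """Recursively replace "integer" with "double" in a schema given as
    nested dicts/strings (the JSON form of a Spark schema)."""
    if data_type == "integer":
        return "double"
    if isinstance(data_type, dict):
        kind = data_type.get("type")
        if kind == "struct":
            fields = [dict(f, type=ints_to_doubles(f["type"])) for f in data_type["fields"]]
            return dict(data_type, fields=fields)
        if kind == "array":
            return dict(data_type, elementType=ints_to_doubles(data_type["elementType"]))
    # Strings other than "integer", maps, etc. are returned as they are
    return data_type

def field(name, data_type):
    """Small helper to build a field entry for the sample schema."""
    return {"name": name, "type": data_type, "nullable": True, "metadata": {}}

sample = {"type": "struct", "fields": [
    field("carCategories", {"type": "array", "elementType": "integer", "containsNull": True}),
    field("payerId", "integer"),
    field("plateNumberStatus", "string"),
    field("ratio", {"type": "struct", "fields": [field("max", "integer"), field("min", "integer")]}),
]}

converted = ints_to_doubles(sample)
print(converted["fields"][3]["type"]["fields"][0]["type"])  # double
```

In PySpark the converted dict can be turned back into a schema with StructType.fromJson(converted), and each top-level column can then be cast to the data type of the matching field.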

Change schema of dataframe to other schema

I have a dataframe that looks like this:
df.printSchema()
root
|-- id: integer (nullable = true)
|-- data: struct (nullable = true)
| |-- foo01: string (nullable = true)
| |-- bar01: string (nullable = true)
| |-- foo02: string (nullable = true)
| |-- bar02: string (nullable = true)
and I want to transform it to
root
|-- id: integer (nullable = true)
|-- foo: struct (nullable = true)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = true)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)
What is the best way to go about this?
You can simply use PySpark's struct function.
from pyspark.sql.functions import struct

new_df = df.select(
    'id',
    struct('data.foo01', 'data.foo02').alias('foo'),
    struct('data.bar01', 'data.bar02').alias('bar'),
)
Additional note on the struct function: it accepts either column names as strings, if you only need to move existing columns into the struct, or Column expressions, if you need to transform the values first.
You can use the struct function with select, as below:
from pyspark.sql import functions as F

finalDF = df.select(
    "id",
    F.struct("data.foo01", "data.foo02").alias("foo"),
    F.struct("data.bar01", "data.bar02").alias("bar")
)
finalDF.printSchema()
Schema:
root
|-- id: string (nullable = true)
|-- foo: struct (nullable = false)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = false)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)
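When the struct has many fields, the grouping can be derived from the field names instead of being typed out. The sketch below assumes that fields belong together when they share the same name once trailing digits are stripped; `group_fields_by_prefix` is a hypothetical helper, not part of PySpark.

```python
import re

def group_fields_by_prefix(field_names):
    """Group names by their prefix with trailing digits removed,
    e.g. foo01 and foo02 both go under "foo"."""
    groups = {}
    for name in field_names:
        prefix = re.sub(r"\d+$", "", name)
        groups.setdefault(prefix, []).append(name)
    return groups

groups = group_fields_by_prefix(["foo01", "bar01", "foo02", "bar02"])
print(groups)  # {'foo': ['foo01', 'foo02'], 'bar': ['bar01', 'bar02']}
```

The field names can be read with df.select("data.*").columns, and each group can then be passed to F.struct with the "data." prefix added and aliased to the group key.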

PySpark - Remove Illegal Hive Character from schema

I have a really large PySpark dataframe which gets its data from JSON files. This is an example of the schema:
|-- Col1: array (nullable = true)
| |-- element: double (containsNull = true)
|-- Col2: struct (nullable = true)
| |-- Col2-Col1: string (nullable = true)
| |-- Col2-Col2: string (nullable = true)
| |-- Col2-Col3: string (nullable = true)
When I do the following, I'm not able to get all the column names within the struct.
df.columns
out: ['Col1', 'Col2']
I need to replace all the hyphens with an underscore so that I can write it to Hive. Hive does not accept '-', '[', '/' etc. within the column name.
For example, the column names should change to:
|-- Col1: array (nullable = true)
| |-- element: double (containsNull = true)
|-- Col2: struct (nullable = true)
| |-- Col2_Col1: string (nullable = true)
| |-- Col2_Col2: string (nullable = true)
| |-- Col2_Col3: string (nullable = true)
The code needs to be generic enough such that many columns can be renamed without hard coding the values.
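One generic approach is to rewrite the names in the schema itself rather than column by column. The sketch below is a starting point only, not a tested Hive integration. It works on the nested dict form of the schema (what `df.schema.jsonValue()` returns in PySpark) and replaces every character that is not a letter, digit, or underscore. That character set is an assumption about what Hive accepts, and `sanitize_names` and the sample dict are hypothetical.

```python
import re

# Assumption: anything other than letters, digits and underscore is illegal
ILLEGAL = re.compile(r"[^0-9A-Za-z_]")

def sanitize_names(data_type):
    """Recursively replace illegal characters in all field names of a
    schema given as nested dicts (the JSON form of a Spark schema)."""
    if isinstance(data_type, dict):
        kind = data_type.get("type")
        if kind == "struct":
            fields = [
                dict(f, name=ILLEGAL.sub("_", f["name"]), type=sanitize_names(f["type"]))
                for f in data_type["fields"]
            ]
            return dict(data_type, fields=fields)
        if kind == "array":
            return dict(data_type, elementType=sanitize_names(data_type["elementType"]))
    # Primitive types (plain strings) and maps are returned unchanged
    return data_type

def field(name, data_type):
    """Small helper to build a field entry for the sample schema."""
    return {"name": name, "type": data_type, "nullable": True, "metadata": {}}

sample = {"type": "struct", "fields": [
    field("Col1", {"type": "array", "elementType": "double", "containsNull": True}),
    field("Col2", {"type": "struct", "fields": [
        field("Col2-Col1", "string"), field("Col2-Col2", "string"), field("Col2-Col3", "string"),
    ]}),
]}

clean = sanitize_names(sample)
print([f["name"] for f in clean["fields"][1]["type"]["fields"]])
```

The cleaned dict can be converted back with StructType.fromJson(clean). Casting each top-level column to the data type of the matching new field and aliasing it to the new name should then apply the renames, since struct-to-struct casts match fields by position as far as I know. Note that two different names can collapse into the same sanitized name, so check for duplicates before writing.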