I am converting a SAS program into Python. One of the operations is merging two tables: one table has 25 columns and the other has 2 columns. SAS merges these without any issue, whereas a PySpark union requires both DataFrames to have the same number of columns. How should I approach this problem? A full join gives a different number of records.
Table 1:
df.printSchema()
root
|-- Port1: string (nullable = true)
|-- Port12: string (nullable = true)
|-- Col1: string (nullable = true)
|-- Col2: string (nullable = true)
|-- Col3: string (nullable = true)
|-- Col4: string (nullable = true)
|-- Col5: string (nullable = true)
|-- Col6: string (nullable = true)
|-- Col7: string (nullable = true)
|-- Col8: string (nullable = true)
|-- Col9: string (nullable = true)
|-- Col10: double (nullable = true)
|-- Col11: string (nullable = true)
|-- Col12: string (nullable = true)
|-- Col13: double (nullable = true)
|-- Col14: double (nullable = true)
|-- Col15: double (nullable = true)
|-- Col16: double (nullable = true)
|-- Col17: double (nullable = true)
|-- Col18: double (nullable = true)
|-- Col19: string (nullable = true)
|-- Col20: string (nullable = true)
|-- Col21: double (nullable = true)
|-- Col22: integer (nullable = true)
|-- Col23: string (nullable = true)
Table 2:
df2.printSchema()
root
|-- Port1: string (nullable = true)
|-- Port2: string (nullable = true)
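Not part of the original post, but one way to approach this: a SAS match-merge by a key column is closer to a join than a union in PySpark. Below is a minimal sketch, assuming Port1 is the shared BY variable (the post does not name the key, so that is an assumption).

from pyspark.sql import functions as F

# Keep every row of the 25-column table and attach Port2 from the 2-column
# table wherever Port1 matches (roughly the SAS MERGE ... BY Port1 behaviour
# when the large table drives the output).
merged = df.join(df2, on="Port1", how="left")

# If the record count differs from the SAS output, duplicate Port1 values in
# df2 are a likely cause: each duplicate key multiplies the matching rows.
df2.groupBy("Port1").count().filter(F.col("count") > 1).show()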
Related
I am creating a Spark DataFrame from a JSON file containing GPS data. When I try to calculate the average of a column, I get the following error:
Py4JJavaError: An error occurred while calling o470.collectToPython.
: java.lang.ClassCastException: org.apache.spark.sql.types.ArrayType cannot be cast to org.apache.spark.sql.types.StructType
I don't understand this error, since I don't have an ArrayType. This is my schema:
root
|-- LastUpdateData: string (nullable = true)
|-- DataGenerated: string (nullable = true)
|-- Delay: long (nullable = true)
|-- GPSQuality: long (nullable = true)
|-- Lat: double (nullable = true)
|-- Line: string (nullable = true)
|-- Lon: double (nullable = true)
|-- Route: string (nullable = true)
|-- Speed: long (nullable = true)
|-- VehicleCode: string (nullable = true)
|-- VehicleId: long (nullable = true)
|-- VehicleService: string (nullable = true)
StructType(List(StructField(LastUpdateData,StringType,true),StructField(DataGenerated,StringType,true),StructField(Delay,LongType,true),StructField(GPSQuality,LongType,true),StructField(Lat,DoubleType,true),StructField(Line,StringType,true),StructField(Lon,DoubleType,true),StructField(Route,StringType,true),StructField(Speed,LongType,true),StructField(VehicleCode,StringType,true),StructField(VehicleId,LongType,true),StructField(VehicleService,StringType,true)))
Here is my code:
df.agg({"Delay": "avg"}).collect()
Try the following.
from pyspark.sql import functions
#returns the average value in the Delay Column
delay_df = df.agg(functions.avg("Delay"))
#view the output
delay_df.show()
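If you need the average as a plain Python number rather than a one-row DataFrame, a small follow-up (not part of the original answer):

# The aggregation returns a single-row DataFrame; pull the value out of it.
avg_delay = delay_df.first()[0]
print(avg_delay)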
I have a DataFrame including some columns with StructType and ArrayType. I want to cast all IntegerType columns to DoubleType. I found some solutions for this problem; for example, this answer does something similar to what I want. But the problem is that it does not change the data types of columns that are nested in a StructType or ArrayType column.
For example, I have a DataFrame with the below schema:
|-- carCategories: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- payerId: integer (nullable = true)
|-- percentage: integer (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: integer (nullable = true)
| |-- min: integer (nullable = true)
After executing the below script:
val doubleSchema = df.schema.fields.map { f =>
  f match {
    case StructField(name: String, _: IntegerType, _, _) => col(name).cast(DoubleType)
    case _ => col(f.name)
  }
}
df.select(doubleSchema:_*).printSchema
It turns out like this:
|-- carCategories: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- payerId: double (nullable = true)
|-- percentage: double (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: integer (nullable = true)
| |-- min: integer (nullable = true)
As you can see, some columns are converted to DoubleType, but the columns nested in ArrayType and StructType are not.
I want the final schema to be like this:
|-- carCategories: array (nullable = true)
| |-- element: double (containsNull = true)
|-- payerId: double (nullable = true)
|-- percentage: double (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: double (nullable = true)
| |-- min: double (nullable = true)
How can I achieve such a thing?
Thank you in advance
You can add case clauses to handle ArrayType and StructType like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

def castIntToDouble(schema: StructType): Seq[Column] = {
  schema.fields.map { f =>
    f.dataType match {
      case IntegerType => col(f.name).cast(DoubleType)
      case StructType(_) =>
        col(f.name).cast(
          f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
        )
      case dt: ArrayType =>
        dt.elementType match {
          case IntegerType => col(f.name).cast(ArrayType(DoubleType))
          case StructType(_) =>
            col(f.name).cast(
              f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
            )
          case _ => col(f.name)
        }
      case _ => col(f.name)
    }
  }
}
The function casts using the DDL string format when the column type is StructType or an array of nested structs. E.g., to cast the struct column ratio, which has type struct&lt;max:int,min:int&gt;, without having to recreate the whole struct, you'd do:
df.withColumn("ratio", col("ratio").cast("struct<max:double,min:double>"))
Now applying this to your input example:
val df = (
  Seq((Seq(1, 2, 3), 34, 87, "pending", (65, 22)))
    .toDF("carCategories", "payerId", "percentage", "plateNumberStatus", "ratio")
    .withColumn("ratio", col("ratio").cast("struct<max:int,min:int>"))
)
df.select(castIntToDouble(df.schema):_*).printSchema
//root
// |-- carCategories: array (nullable = true)
// | |-- element: double (containsNull = true)
// |-- payerId: double (nullable = false)
// |-- percentage: double (nullable = false)
// |-- plateNumberStatus: string (nullable = true)
// |-- ratio: struct (nullable = true)
// | |-- max: double (nullable = true)
// | |-- min: double (nullable = true)
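For reference, a rough PySpark equivalent of the same idea is sketched below. It is an assumption based on the Scala answer, reusing the DDL-string trick for nested columns, not code from the original thread:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType, IntegerType, StructType

def cast_int_to_double(df):
    cols = []
    for f in df.schema.fields:
        dt = f.dataType
        if isinstance(dt, IntegerType):
            # Plain integer column: cast directly.
            cols.append(F.col(f.name).cast(DoubleType()).alias(f.name))
        elif isinstance(dt, (StructType, ArrayType)):
            # Nested column: rewrite its DDL string, turning int fields into double.
            ddl = dt.simpleString().replace(":int", ":double").replace("<int>", "<double>")
            cols.append(F.col(f.name).cast(ddl).alias(f.name))
        else:
            cols.append(F.col(f.name))
    return df.select(*cols)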
I have a dataframe that looks like this:
df.printSchema()
root
|-- id: integer (nullable = true)
|-- data: struct (nullable = true)
| |-- foo01: string (nullable = true)
| |-- bar01: string (nullable = true)
| |-- foo02: string (nullable = true)
| |-- bar02: string (nullable = true)
and I want to transform it to
root
|-- id: integer (nullable = true)
|-- foo: struct (nullable = true)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = true)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)
What is the best way to go about this?
You can simply use the PySpark struct function.
from pyspark.sql.functions import struct
new_df = df.select(
'id',
struct('data.foo01', 'data.foo02').alias('foo'),
struct('data.bar01', 'data.bar02').alias('bar'),
)
Additional note on the PySpark struct function: it can take either a list of string column names, to simply move those columns into the struct, or a list of column expressions if you need to transform the values.
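For example, a hypothetical variant that passes column expressions instead of plain names (the alias names and the upper call here are purely illustrative):

from pyspark.sql.functions import col, struct, upper

new_df = df.select(
    'id',
    # Column expressions let you rename or transform fields while building the struct.
    struct(col('data.foo01').alias('first'), upper(col('data.foo02')).alias('second')).alias('foo'),
)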
You can use the struct function with select as below:
from pyspark.sql import functions as F
finalDF = df.select( "id",
F.struct("data.foo01", "data.foo02").alias("foo"),
F.struct("data.bar01", "data.bar02").alias("bar")
)
finalDF.printSchema()
Schema:
root
|-- id: string (nullable = true)
|-- foo: struct (nullable = false)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = false)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)
In my scenario the column my_url can be at the first level or inside a nested column. How do I change the value of the column recursively? The nested column can be a StructType or an ArrayType.
my_url can be at the 2nd level:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- websites: struct (nullable = true)
| |-- cb_url: string (nullable = true)
| |-- domain_url: string (nullable = true)
| |-- email: string (nullable = true)
| |-- facebook_url: string (nullable = true)
| |-- homepage_url: string (nullable = true)
| |-- linkedin_url: string (nullable = true)
| |-- my_url: string (nullable = true)
| |-- phone: string (nullable = true)
| |-- twitter_url: string (nullable = true)
or it can be at 1st level:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- my_url: string (nullable = true)
|-- facebook_url: string (nullable = true)
or like the below
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- investments: struct (nullable = true)
| |-- investment_list: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- funding_round_info: struct (nullable = true)
| | | | |-- announced_on: timestamp (nullable = true)
| | | | |-- my_url: string (nullable = true)
It can be at any level.
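One possible approach, not taken from an answer in this thread, is to walk the schema recursively and rebuild any struct or array that contains my_url. Below is a sketch in PySpark, assuming Spark 3.1+ for functions.transform and using lit("***") as a placeholder for whatever new value you actually need:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructType

NEW_VALUE = F.lit("***")  # placeholder replacement value (an assumption)

def rewrite(col, dtype):
    # Rebuild a column, replacing every nested field named "my_url".
    if isinstance(dtype, StructType):
        return F.struct(*[
            (NEW_VALUE if f.name == "my_url" and isinstance(f.dataType, StringType)
             else rewrite(col.getField(f.name), f.dataType)).alias(f.name)
            for f in dtype.fields
        ])
    if isinstance(dtype, ArrayType):
        # Apply the rewrite to every element of the array (Spark 3.1+).
        return F.transform(col, lambda x: rewrite(x, dtype.elementType))
    return col

rewritten = df.select(*[
    (NEW_VALUE if f.name == "my_url" else rewrite(F.col(f.name), f.dataType)).alias(f.name)
    for f in df.schema.fields
])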
I have a dataframe in which every column's datatype is string. The condition is: column names starting with Data-C are StringType columns, Data-D are DateType columns, and Data-N are DoubleType columns. I am trying to update their datatypes in the following way:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import sparkSession.sqlContext.implicits._

val diff_set = Seq("col7", "col8", "col15", "Data-C-col1", "Data-C-col3", "Data-N-col2", "Data-N-col4", "Data-D-col16", "Data-D-col18", "Data-D-col20").toSet
var df = (1 to 10).toDF
df = df.select(df.columns.map(c => col(c).as(c)) ++ diff_set.map(c => lit(null).cast("string").as(c)): _*)
df.printSchema()

// This foreach loop yields slow performance
df.columns.foreach(x => {
  if (x.startsWith("Data-C")) {
    df = df.withColumn(x, col(x).cast(StringType))
  } else if (x.startsWith("Data-D")) {
    df = df.withColumn(x, col(x).cast(DateType))
  } else if (x.startsWith("Data-N")) {
    df = df.withColumn(x, col(x).cast(DoubleType))
  }
})

df.printSchema()
Can this be done more elegantly and more efficiently (performance-wise) in Scala Spark?
Check the below code.
scala> df.printSchema
root
|-- value: integer (nullable = false)
|-- Data-C-col1: string (nullable = true)
|-- Data-D-col18: string (nullable = true)
|-- Data-N-col4: string (nullable = true)
|-- Data-N-col2: string (nullable = true)
|-- col15: string (nullable = true)
|-- Data-D-col16: string (nullable = true)
|-- Data-D-col20: string (nullable = true)
|-- col8: string (nullable = true)
|-- col7: string (nullable = true)
|-- Data-C-col3: string (nullable = true)
val colum_datatype_mapping = Map(
  "Data-C" -> "string",
  "Data-D" -> "date",
  "Data-N" -> "double"
)

val columns = df.columns.map { c =>
  val key = c.split("-").init.mkString("-")
  if (colum_datatype_mapping.contains(key))
    col(c).cast(colum_datatype_mapping(key))
  else
    col(c)
}
scala> df.select(columns:_*).printSchema
root
|-- value: integer (nullable = false)
|-- Data-C-col1: string (nullable = true)
|-- Data-D-col18: date (nullable = true)
|-- Data-N-col4: double (nullable = true)
|-- Data-N-col2: double (nullable = true)
|-- col15: string (nullable = true)
|-- Data-D-col16: date (nullable = true)
|-- Data-D-col20: date (nullable = true)
|-- col8: string (nullable = true)
|-- col7: string (nullable = true)
|-- Data-C-col3: string (nullable = true)