I have a DataFrame that looks like this:
df.printSchema()
root
|-- id: integer (nullable = true)
|-- data: struct (nullable = true)
| |-- foo01: string (nullable = true)
| |-- bar01: string (nullable = true)
| |-- foo02: string (nullable = true)
| |-- bar02: string (nullable = true)
and I want to transform it to
root
|-- id: integer (nullable = true)
|-- foo: struct (nullable = true)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = true)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)
What is the best way to go about this?
You can simply use the PySpark struct function.
from pyspark.sql.functions import struct
new_df = df.select(
    'id',
    struct('data.foo01', 'data.foo02').alias('foo'),
    struct('data.bar01', 'data.bar02').alias('bar'),
)
Additional note on the PySpark struct function: it can take either a list of column names (strings), if you only need to move existing columns into the struct, or a list of column expressions, if you need to transform the values on the way in.
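If the struct has many fields, listing them by hand gets tedious. Below is a minimal sketch that groups field names by their alphabetic prefix and builds one SQL struct expression per group. The helper name struct_exprs_by_prefix and the "prefix = leading letters" rule are assumptions made for illustration, not something from the question; the resulting strings are meant to be passed to df.selectExpr.

```python
import re
from collections import defaultdict


def struct_exprs_by_prefix(parent, field_names):
    """Build one `struct(...) as <prefix>` SQL expression per field-name prefix.

    Hypothetical helper: assumes the prefix is the leading run of letters,
    e.g. "foo01" -> "foo".
    """
    groups = defaultdict(list)
    for name in field_names:
        match = re.match(r"[A-Za-z_]+", name)
        prefix = match.group(0) if match else name
        groups[prefix].append(name)

    # Explicit aliases keep the nested field names unchanged inside each struct.
    return [
        "struct({}) as {}".format(
            ", ".join(f"{parent}.{n} as {n}" for n in names), prefix
        )
        for prefix, names in groups.items()
    ]


exprs = struct_exprs_by_prefix("data", ["foo01", "bar01", "foo02", "bar02"])
for e in exprs:
    print(e)

# With a live DataFrame, the field names could come from
# df.select("data.*").columns, and the result be used as:
# new_df = df.selectExpr("id", *exprs)
```

Field names containing spaces or other special characters would need backtick quoting, which this sketch does not handle.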
You can use the struct function with select, as below:
from pyspark.sql import functions as F
finalDF = df.select(
    "id",
    F.struct("data.foo01", "data.foo02").alias("foo"),
    F.struct("data.bar01", "data.bar02").alias("bar")
)
finalDF.printSchema()
Schema:
root
|-- id: string (nullable = true)
|-- foo: struct (nullable = false)
| |-- foo01: string (nullable = true)
| |-- foo02: string (nullable = true)
|-- bar: struct (nullable = false)
| |-- bar01: string (nullable = true)
| |-- bar02: string (nullable = true)
Related
I have a DataFrame in this format and I would like to rearrange the fields inside the item column.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: long (nullable = true)
So this is the desired format I'm looking for.
root
|-- order: string (nullable = true)
|-- dt: struct (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- c: long (nullable = true)
| | |-- b: string (nullable = true)
You can use the transform function:
from pyspark.sql import functions as F
result = df.withColumn(
    "item",
    F.expr("transform(item, x -> struct(x.a as a, x.c as c, x.b as b))")
)
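When the element struct has more than a handful of fields, the transform expression can be generated from the desired field order instead of typed out. The sketch below only builds the SQL string; the helper name reorder_struct_array_expr is hypothetical, and it assumes plain field names that need no backtick quoting.

```python
def reorder_struct_array_expr(column, field_order):
    """Return a `transform` SQL expression that rebuilds each array element
    as a struct with its fields in `field_order`."""
    fields = ", ".join(f"x.{f} as {f}" for f in field_order)
    return f"transform({column}, x -> struct({fields}))"


expr = reorder_struct_array_expr("item", ["a", "c", "b"])
print(expr)

# With a live DataFrame this would be applied as:
# result = df.withColumn("item", F.expr(expr))
```

For the example above this produces the same expression string as the hand-written one.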
I have a DataFrame schema as follows (in PySpark):
root
|-- ID: string (nullable = true)
|-- 2022: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
|-- 2021: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
I want to add 2021 and 2022 columns into one column AGG such that it will contain the sum of 2021 and 2022 for each row in the DataFrame.
I tried the following:
df.select(
'ID',
(F.col("2021") + F.col("2022")).alias("AGG")
).printSchema()
Where the desired output schema should be:
root
|-- ID: string (nullable = true)
|-- AGG: struct (nullable = true)
| |-- val_1: int (nullable = true)
| |-- val_2: double (nullable = true)
| |-- val_3: double (nullable = true)
But it seems that Spark only supports adding numeric types.
Is there a way to add them implicitly without writing it directly for each field in the struct?
No, you can't add struct columns that way. You need to create a new struct column by summing the nested fields one by one.
You can use a list comprehension over the list of inner fields to create a new column like this:
from pyspark.sql import functions as F
def add_struct(common_fields, s1, s2):
    return F.struct(*[
        (s1[f] + s2[f]).alias(f) for f in common_fields
    ])

# get list of struct fields from schema
fields = df.select("2022.*").columns

df.select(
    'ID',
    add_struct(fields, F.col("2021"), F.col("2022")).alias("AGG")
)
I have a DataFrame that includes some columns of StructType and ArrayType. I want to cast all IntegerType columns to DoubleType. I found some solutions for this problem; for example, this answer does something similar to what I want. The problem is that it does not change the data types of columns nested inside a StructType or ArrayType column.
For example I have a DataFrame with below schema:
|-- carCategories: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- payerId: integer (nullable = true)
|-- percentage: integer (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: integer (nullable = true)
| |-- min: integer (nullable = true)
After executing below script:
val doubleSchema = df.schema.fields.map { f =>
  f match {
    case StructField(name: String, _: IntegerType, _, _) => col(name).cast(DoubleType)
    case _ => col(f.name)
  }
}
df.select(doubleSchema: _*).printSchema
It turns out like this:
|-- carCategories: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- payerId: double (nullable = true)
|-- percentage: double (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: integer (nullable = true)
| |-- min: integer (nullable = true)
As you can see, some columns are converted to DoubleType, but the columns nested inside ArrayType and StructType are not.
I want the final schema to look like this:
|-- carCategories: array (nullable = true)
| |-- element: double (containsNull = true)
|-- payerId: double (nullable = true)
|-- percentage: double (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: double (nullable = true)
| |-- min: double (nullable = true)
How can I achieve such a thing?
Thank you in advance
You can add case clauses to handle ArrayType and StructType like this:
def castIntToDouble(schema: StructType): Seq[Column] = {
  schema.fields.map { f =>
    f.dataType match {
      case IntegerType => col(f.name).cast(DoubleType)
      case StructType(_) =>
        col(f.name).cast(
          f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
        )
      case dt: ArrayType =>
        dt.elementType match {
          case IntegerType => col(f.name).cast(ArrayType(DoubleType))
          case StructType(_) =>
            col(f.name).cast(
              f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
            )
          case _ => col(f.name)
        }
      case _ => col(f.name)
    }
  }
}
The function casts with a DDL-format type string when the column is a StructType or an array of nested structs. For example, to cast the struct column ratio, which has type struct<max:int,min:int>, without recreating the whole struct, you would do:
df.withColumn("ratio", col("ratio").cast("struct<max:double,min:double>"))
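One caveat with the plain string replacement used in the function above: replacing ":int" also touches any type name that merely starts with "int" (such as "interval"), and it misses int elements of arrays nested inside a struct (as in array<int>). The sketch below shows only the string rewrite, in Python, using a stricter regular expression. The helper name int_to_double_ddl is hypothetical, and it assumes type strings in the compact form produced by simpleString, with no spaces.

```python
import re

# Match "int" only where a type name can stand: at the very start, or right
# after ':', '<' or ',', and only when followed by ',', '>' or the end of the
# string. Field names that happen to be called "int" are followed by ':' and
# are therefore left alone.
_INT_TYPE = re.compile(r"(?:(?<=[:<,])|^)int(?=[,>]|$)")


def int_to_double_ddl(type_string):
    """Rewrite every `int` type in a compact DDL type string to `double`."""
    return _INT_TYPE.sub("double", type_string)


print(int_to_double_ddl("struct<max:int,min:int>"))
print(int_to_double_ddl("struct<a:array<int>,b:bigint,c:interval>"))

# With a live DataFrame, the result would be used as the cast target, e.g.:
# df.withColumn("ratio", col("ratio").cast(int_to_double_ddl("struct<max:int,min:int>")))
```

This is still a textual rewrite, so field names containing characters such as ':' or ',' could confuse it; walking the schema types recursively would be the more robust route.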
Now applying this to your input example:
val df = (
  Seq((Seq(1, 2, 3), 34, 87, "pending", (65, 22)))
    .toDF("carCategories", "payerId", "percentage", "plateNumberStatus", "ratio")
    .withColumn("ratio", col("ratio").cast("struct<max:int,min:int>"))
)
df.select(castIntToDouble(df.schema): _*).printSchema
//root
// |-- carCategories: array (nullable = true)
// | |-- element: double (containsNull = true)
// |-- payerId: double (nullable = false)
// |-- percentage: double (nullable = false)
// |-- plateNumberStatus: string (nullable = true)
// |-- ratio: struct (nullable = true)
// | |-- max: double (nullable = true)
// | |-- min: double (nullable = true)
In my scenario, the column my_url can be at the first level or inside a nested column.
How can I change the value of the column recursively? The nested column can be a StructType or an ArrayType.
my_url can be at the 2nd level:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- websites: struct (nullable = true)
| |-- cb_url: string (nullable = true)
| |-- domain_url: string (nullable = true)
| |-- email: string (nullable = true)
| |-- facebook_url: string (nullable = true)
| |-- homepage_url: string (nullable = true)
| |-- linkedin_url: string (nullable = true)
| |-- my_url: string (nullable = true)
| |-- phone: string (nullable = true)
| |-- twitter_url: string (nullable = true)
or it can be at 1st level:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- my_url: string (nullable = true)
|-- facebook_url: string (nullable = true)
or like the one below:
root
|-- _id: struct (nullable = true)
| |-- oid: string (nullable = true)
|-- investments: struct (nullable = true)
| |-- investment_list: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- funding_round_info: struct (nullable = true)
| | | | |-- announced_on: timestamp (nullable = true)
| | | | |-- my_url: string (nullable = true)
It can be at any level.
I am converting a SAS program into Python.
One of the operations is merging two tables:
one of the tables has 25 columns and the other one has 2 columns.
SAS merges these without any issues,
whereas a PySpark union needs both tables to have the same number of columns.
How should I approach this problem?
A full join gives a different number of records.
Table 1:
df.printSchema()
|-- Port1: string (nullable = true)
|-- Port12: string (nullable = true)
|-- Col1: string (nullable = true)
|-- Col2: string (nullable = true)
|-- Col3: string (nullable = true)
|-- Col4: string (nullable = true)
|-- Col5: string (nullable = true)
|-- Col6: string (nullable = true)
|-- Col7: string (nullable = true)
|-- Col8: string (nullable = true)
|-- Col9: string (nullable = true)
|-- Col10: double (nullable = true)
|-- Col11: string (nullable = true)
|-- Col12: string (nullable = true)
|-- Col13: double (nullable = true)
|-- Col14: double (nullable = true)
|-- Col15: double (nullable = true)
|-- Col16: double (nullable = true)
|-- Col17: double (nullable = true)
|-- Col18: double (nullable = true)
|-- Col19: string (nullable = true)
|-- Col20: string (nullable = true)
|-- Col21: double (nullable = true)
|-- Col22: integer (nullable = true)
|-- Col23: string (nullable = true)
Table 2 :
df2.printSchema()
root
|-- Port1: string (nullable = true)
|-- Port2: string (nullable = true)