Spark DataFrame output replaces column names with "0", "1" when arrays_zip data originated nested - apache-spark-sql

I used the Spark SQL function arrays_zip combined with flatten to transform data from an array of structs of inner arrays of the same length into an array of structs. printSchema shows exactly what I want. However, the df output loses the original column names and replaces them with the generic column names "0", "1", "2", etc., whether written in Parquet or Avro format. I would like to output the original column names.
So as not to reveal my company's business, the following are similar but much simplified examples.
scala> c2.printSchema
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- month: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- num: array (nullable = true)
| | | |-- element: long (containsNull = true)
scala> c2.show(false)
+----------------------------------------------+
|cal |
+----------------------------------------------+
|[[[Jan, Feb, Mar], [1, 2, 3]], [[April], [4]]]|
+----------------------------------------------+
I would like to transform it to
scala> newC2.show(false)
+------------------------------------------+
|cal |
+------------------------------------------+
|[[Jan, 1], [Feb, 2], [Mar, 3], [April, 4]]|
+------------------------------------------+
with
scala> newC2.printSchema
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- month: string (nullable = true)
| | |-- num: long (nullable = true)
I know arrays_zip only works well on top-level arrays. Therefore, I flatten them to the top level. The following code works in this example:
val newC2 = c2
  .withColumn("month", flatten(col("cal.month")))
  .withColumn("num", flatten(col("cal.num")))
  .withColumn("cal", arrays_zip(col("month"), col("num")))
  .drop("month", "num")
It generates exactly the data and schema I want. However, it outputs all columns generically as "0", "1", "2", etc.:
newC2.write.option("header", false).parquet("c2_parquet")
I tried another example that has the original month array and num array at the top level. There I can arrays_zip without flatten and get the same schema and data shown above, and in that case it outputs the original field names correctly.
I tried adding an alias to the flattened data. That does not work. I even tried manipulating the columns like this (assume the field storing the result of arrays_zip is 'zipped'):
val columns: Array[Column] = inner.fields.map(_.name).map{x => col("zipped").getField(x).alias(x)}
val newB3 = newB2.withColumn("b", array(struct(columns:_*))).drop("zipped")
It ends up generating the original schema ("month", array of string, and "num", array of long).
To duplicate the problem, you can use this JSON input:
{"cal":[{"month":["Jan","Feb","Mar"],"num":[1,2,3]},{"month":["April"],"num":[4]}]}
The following JSON is for the top-level arrays_zip example:
{"month":["Jan","Feb","Mar"],"num":[1,2,3]}
How does Spark internally decide what field names to use? How can I get it to work? Please advise.

Since Spark 2.4, the schema transformation can be achieved using Higher Order Functions. In Scala the query can look like this:
import org.apache.spark.sql.functions.{expr, flatten}
val result = df
.withColumn("cal", flatten(expr("TRANSFORM(cal, x -> zip_with(x.month, x.num, (month, num) -> (month,num)))")))
After applying it to your sample data, I get this schema:
result.printSchema()
root
|-- cal: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- month: string (nullable = true)
| | |-- num: long (nullable = true)
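The combined TRANSFORM + zip_with + flatten semantics can be sketched on the equivalent plain-Python data (dicts standing in for structs); this is only an illustration of what the expression computes, not Spark code:

```python
# Plain-Python sketch of TRANSFORM(cal, x -> zip_with(x.month, x.num, ...))
# followed by flatten, on the sample row from the question.
cal = [
    {"month": ["Jan", "Feb", "Mar"], "num": [1, 2, 3]},
    {"month": ["April"], "num": [4]},
]

zipped = [
    {"month": m, "num": n}
    for x in cal                            # TRANSFORM over the outer array
    for m, n in zip(x["month"], x["num"])   # zip_with over the inner arrays, then flatten
]
print(zipped)
# [{'month': 'Jan', 'num': 1}, {'month': 'Feb', 'num': 2},
#  {'month': 'Mar', 'num': 3}, {'month': 'April', 'num': 4}]
```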

Related

How to add the index of the array as a field to an array of structs in pyspark dataframe

I have a dataframe containing an array of structs. I would like to add the index of the array as a field within the struct. Is this possible?
So structure would go from:
|-- my_array_column: array
| |-- element: struct
| | |-- field1: string
| | |-- field2: string
to:
|-- my_array_column: array
| |-- element: struct
| | |-- field1: string
| | |-- field2: string
| | |-- index of element: integer
Many thanks
For Spark 3.1+, you can use the transform function with withField to update each struct element of the array column like this:
from pyspark.sql import functions as F
df = df.withColumn(
    "my_array_column",
    F.transform("my_array_column", lambda x, i: x.withField("index", i))
)
For older versions, you'll have to recreate the whole struct element in order to add a field:
df = df.withColumn(
    "my_array_column",
    F.expr("transform(my_array_column, (x, i) -> struct(x.field1 as field1, x.field2 as field2, i as index))")
)

How to filter a struct array in a spark dataframe?

I have the following code and outputs.
import org.apache.spark.sql.functions.{collect_list, struct}
import sqlContext.implicits._
val df = Seq(
("john", "tomato", 1.99),
("john", "carrot", 0.45),
("bill", "apple", 0.99),
("john", "banana", 1.29),
("bill", "taco", 2.59)
).toDF("name", "food", "price")
df.groupBy($"name")
.agg(collect_list(struct($"food", $"price")).as("foods"))
.show(false)
df.printSchema
Output and Schema:
+----+---------------------------------------------+
|name|foods |
+----+---------------------------------------------+
|john|[[tomato,1.99], [carrot,0.45], [banana,1.29]]|
|bill|[[apple,0.99], [taco,2.59]] |
+----+---------------------------------------------+
root
|-- name: string (nullable = true)
|-- foods: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- food: string (nullable = true)
| | |-- price: double (nullable = false)
I want to filter based on df("foods.price") > 1.00. How do I filter this to get the output below?
+----+---------------------------------------------+
|name|foods |
+----+---------------------------------------------+
|john|[[banana,1.29], [tomato,1.99]] |
|bill|[[taco,2.59]] |
+----+---------------------------------------------+
I have tried df.filter($"foods.food" > 1.00), but this does not work as I'm getting an error. Anything else I can try?
You are trying to apply a filter on an array column, hence it throws an error because the syntax is wrong. You can apply the filter on price first and then do the aggregation as needed:
val cf = df.filter("price > 1.0").groupBy($"name").agg(collect_list(struct($"food", $"price")).as("foods"))
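The filter-before-aggregation idea can be checked against the question's sample rows in plain Python; this is just a stand-in for the Spark filter/groupBy/collect_list pipeline, not Spark code:

```python
# Sample rows from the question: (name, food, price).
rows = [
    ("john", "tomato", 1.99),
    ("john", "carrot", 0.45),
    ("bill", "apple", 0.99),
    ("john", "banana", 1.29),
    ("bill", "taco", 2.59),
]

# filter("price > 1.0") first, then group by name and collect (food, price) pairs.
foods = {}
for name, food, price in rows:
    if price > 1.0:
        foods.setdefault(name, []).append((food, price))
print(foods)
# {'john': [('tomato', 1.99), ('banana', 1.29)], 'bill': [('taco', 2.59)]}
```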

How to convert array<struct> to array<string>

I have a dataframe with the below schema
root
|-- ColA: array
| |-- element: struct
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
The schema of the expected dataframe would be,
root
|-- ColA: array
| |-- element: string (containsNull = true)
Is it possible to convert an array of struct to array of strings?
Any help would be much appreciated.
Thank you.
Assuming you need the name field, you can just select that nested field of the array column as usual:
from pyspark.sql import functions as F

(df
    .withColumn('ColA', F.col('ColA.name'))
    .printSchema()
)
# Input
+----------------+
| ColA|
+----------------+
|[{1, A}, {2, B}]|
+----------------+
# Output
+------+
| ColA|
+------+
|[A, B]|
+------+
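Selecting ColA.name projects each struct to its name field; the same projection on the equivalent plain-Python data (dicts standing in for structs, values from the example above) looks like this, for illustration only:

```python
# ColA.name: array<struct<id,name>> -> array<string>, keeping only the name field.
col_a = [{"id": "1", "name": "A"}, {"id": "2", "name": "B"}]
names = [x["name"] for x in col_a]
print(names)
# ['A', 'B']
```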
You can use TRANSFORM expression:
from pyspark.sql import functions as F

expr = "TRANSFORM(ColA, x -> to_json(x))"
df = df.withColumn('ColA', F.expr(expr))
df.printSchema()
df.show(truncate=False)
Output:
root
|-- ColA: array (nullable = true)
| |-- element: string (containsNull = true)
+------------------------------------------------------+
|ColA |
+------------------------------------------------------+
|[{"id":"1","name":"Hello"}, {"id":"2","name":"World"}]|
+------------------------------------------------------+

Schema Conversion from String datatype to Array(Map(Array)) datatype in Pyspark

We are reading data from DynamoDB, so we get the datatype as string, but we want to write the string data as array(map(array)).
String data:
{"policy_details":[{"cdhid":" 123","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}, {"cdhid":" 1234","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}]}
Output required:
The string data type needs to be converted to ARRAY(MAP(ARRAY)).
We have tried with below schema:
ArrayType([
StructField("policy_num", MapType(ArrayType([
StructField("cdhid", StringType(), True),
StructField("role_id", StringType(), True),
StructField("role_desc", StringType(), True)
])))
])
We get the following error:
elementType [StructField(cdhid,StringType,true),
StructField(role_id,StringType,true),
StructField(role_desc,StringType,true)] should be an instance of < class 'pyspark.sql.types.DataType' >
Regarding your data, the schema you want is not the one that fits.
The schema of your data is:
from pyspark.sql import types as T
schm = T.StructType([
    T.StructField("policy_details", T.ArrayType(T.StructType([
        T.StructField("cdhid", T.StringType(), True),
        T.StructField("p2cid", T.StringType(), True),
        T.StructField("roleDesc", T.StringType(), True),
        T.StructField("positionnum", T.StringType(), True),
    ])), True)
])
Then, you just need to use the from_json function.
from pyspark.sql import functions as F
df.show()
+--------------------+
| db_data|
+--------------------+
|{"policy_details"...|
+--------------------+
new_df = df.select(F.from_json("db_data", schm).alias("data"))
new_df.printSchema()
root
|-- data: struct (nullable = true)
| |-- policy_details: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- cdhid: string (nullable = true)
| | | |-- p2cid: string (nullable = true)
| | | |-- roleDesc: string (nullable = true)
| | | |-- positionnum: string (nullable = true)
EDIT: If you want to use MapType, you can replace the schema with:
schm = T.StructType([
    T.StructField(
        "policy_details",
        T.ArrayType(T.MapType(
            T.StringType(),
            T.StringType()
        )),
        True
    )
])
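What from_json does with that string can be previewed in plain Python with json.loads: each policy_details element parses to a dict, i.e. a map of string to string, matching ArrayType(MapType(StringType, StringType)). This is only an illustration on the question's sample string, not Spark code:

```python
import json

# The raw string value from the db_data column in the question.
db_data = ('{"policy_details":[{"cdhid":" 123","p2cid":" NA",'
           '"roleDesc":" NA","positionnum":"NA"},'
           '{"cdhid":" 1234","p2cid":" NA",'
           '"roleDesc":" NA","positionnum":"NA"}]}')

data = json.loads(db_data)
# policy_details is an array of maps (dicts), one per policy entry.
print(len(data["policy_details"]), data["policy_details"][0]["cdhid"])
# 2  123
```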

reading csv from pyspark specifying schema wrong types

I am trying to output csv from a pyspark df and then re-input it, but when I specify the schema for a column that is an array, it says that some of the rows are False.
Here is my df
avg(rating) belongs_to_collection budget \
0 2.909946 False 5000000
1 3.291962 False 18000000
2 3.239811 False 8000000
3 3.573318 False 1500000
4 3.516590 False 40000000
genres original_language
0 ['Drama', 'Romance'] en
1 ['Comedy'] en
2 ['Drama', 'Family'] en
3 ['Crime', 'Drama', 'Mystery', 'Thriller'] en
4 ['Crime', 'Drama', 'Thriller'] en
I first output to csv: df.drop('id').toPandas().to_csv('mergedDf.csv',index=False)
I tried reading using df = spark.read.csv('mergedDf.csv',schema=schema), but I get this error: 'CSV data source does not support array<string> data type.;'
So, I tried reading from pandas and then converting to spark df, but it tells me that the column that contains a list has a boolean value.
df = pd.read_csv('mergedDf.csv')
df = spark.createDataFrame(df,schema=schema)
TypeError: field genres: ArrayType(StringType,true) can not accept object False in type <class 'bool'>
However, when I check whether any of the rows are == False, I find that none of them are.
I checked :
df[df['genres']=="False"] and df[df['genres']==False]
Unfortunately, Spark's csv reader doesn't yet support complex datatypes like "array". You would have to handle the logic of casting the string column into an array column yourself.
Use pandas to write the spark dataframe as csv with header.
df.drop('id').toPandas().to_csv('mergedDf.csv',index=False,header=True)
df1 = spark.read.option('header','true').option("inferSchema","true").csv('mergedDf.csv')
df1.printSchema()
df1.show(10,False)
When you read the csv back with Spark, the array column will be converted to string type:
root
|-- avg(rating): double (nullable = true)
|-- belongs_to_collection: boolean (nullable = true)
|-- budget: integer (nullable = true)
|-- genres: string (nullable = true)
|-- original_language: string (nullable = true)
+-----------+---------------------+--------+-----------------------------------------+-----------------+
|avg(rating)|belongs_to_collection|budget |genres |original_language|
+-----------+---------------------+--------+-----------------------------------------+-----------------+
|2.909946 |false |5000000 |['Drama', 'Romance'] |en |
|3.291962 |false |18000000|['Comedy'] |en |
|3.239811 |false |8000000 |['Drama', 'Family'] |en |
|3.573318 |false |1500000 |['Crime', 'Drama', 'Mystery', 'Thriller']|en |
|3.51659 |false |40000000|['Crime', 'Drama', 'Thriller'] |en |
+-----------+---------------------+--------+-----------------------------------------+-----------------+
Split the string column to create an array to get back your original format.
from pyspark.sql.functions import col, regexp_replace, split

df2 = df1.withColumn('genres', split(regexp_replace(col('genres'), r"[\[\]']", ''), ', ').cast('array<string>'))
df2.printSchema()
root
|-- avg(rating): double (nullable = true)
|-- belongs_to_collection: boolean (nullable = true)
|-- budget: integer (nullable = true)
|-- genres: array (nullable = true)
| |-- element: string (containsNull = true)
|-- original_language: string (nullable = true)
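Since each genres cell holds a Python-style list literal after the pandas round trip, the string-to-array cast can also be sketched in plain Python with ast.literal_eval, which parses the literal directly instead of stripping brackets and quotes with a regex; this is an illustration of the conversion, not the Spark code above:

```python
import ast

# What a genres CSV cell contains after the pandas -> csv -> spark round trip.
cell = "['Drama', 'Romance']"

# Parse the list literal back into a real list of strings.
genres = ast.literal_eval(cell)
print(genres)
# ['Drama', 'Romance']
```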