I have a dataframe with the following structure:
root
|-- pk: string (nullable = true)
|-- sk: string (nullable = true)
|-- tags: string (nullable = true)
Sample data that I am dealing with:
+--------+---------+-----------------------------------------------------------------------------------------------------+
|pk |sk |tags |
+--------+---------+-----------------------------------------------------------------------------------------------------+
|123-8aab|464a-af2f|[{"type": "version","value": "2"},{"type": "version","value": "1"},{"type":"xyz","value": "version"}]|
|125-5afs|464a-af2f|[{"type": "version","value": "2"},{"type": "version","value": "1"}] |
|562-8608|4c4d-464a|[{"type": "version","value": "2"},{"type":"xyz","value": "version"}] |
|793-439b|4c4d-464a|[{"type": "version","value": "2"}] |
+--------+---------+-----------------------------------------------------------------------------------------------------+
The tags column is JSON and I am struggling to get the correct data out of it.
What I have so far:
from pyspark.sql.functions import col, from_json

tags_schema = spark.read.json(df_component.select('tags').rdd.map(lambda row: row[0])).schema
df_component = df_component.withColumn('tags', from_json(col('tags'), tags_schema))
df_component.printSchema()
root
|-- pk: string (nullable = true)
|-- sk: string (nullable = true)
|-- tags: struct (nullable = true)
| |-- type: string (nullable = true)
| |-- value: string (nullable = true)
After running the above code, most of the values come back null. Below is the sample data:
+--------+---------+------------+
|pk |sk |tags |
+--------+---------+------------+
|123-8aab|464a-af2f|null |
|125-5afs|464a-af2f|null |
|562-8608|4c4d-464a|null |
|793-439b|4c4d-464a|[version, 2]|
+--------+---------+------------+
Any help would be appreciated.
Your schema identifies tags as a struct; however, the data inside tags is an array of structs.
Try using the following as the tags schema:
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from pyspark.sql.functions import col, from_json

# from the example you have an array of structs, each struct having a type and a value
tags_schema = ArrayType(StructType([
    StructField("type", StringType(), True),
    StructField("value", StringType(), True),
]), True)

df_component = df_component.withColumn("tagdata", from_json(col("tags"), tags_schema))
Debugging the conversion:
# show the original and the parsed column
df_component.show()
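Once tagdata has been parsed into an array of structs, a small follow-up sketch like the one below (column names taken from the example above) can flatten each tag into its own row:
from pyspark.sql.functions import explode

# one row per tag, with its type and value pulled out of the struct
df_component.select("pk", "sk", explode("tagdata").alias("tag")) \
    .select("pk", "sk", "tag.type", "tag.value") \
    .show(truncate=False)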
In a dataframe in AWS Glue, I want to update a nested field inside an Array or Struct attribute in PySpark when a particular condition is satisfied.
I know how to update a primitive attribute conditionally, like this:
dataframe.withColumn('primitive_attribute', when(col("primaryKey") == "someValue", lit(None)).otherwise(col('primitive_attribute')))
My expected output format is:
root
|-- id: string (nullable = true)
|-- data: struct (nullable = true)
| |-- addresses: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- street: string (nullable = true)
| | | |-- city: string (nullable = true)
| | | |-- pincode: string (nullable = true)
Here I want to set all pincodes in addresses to null for some specific ids.
But is there any way to update a nested field inside an Array or Struct?
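One possible approach, sketched below for Spark 3.1+, combines Column.withField with the transform higher-order function; the ids used here are only placeholders:
from pyspark.sql import functions as F

# Rebuild data.addresses with pincode set to null, but only for the chosen ids
# ("id-1" and "id-2" are illustrative values, not from the original data).
dataframe = dataframe.withColumn(
    "data",
    F.when(
        F.col("id").isin("id-1", "id-2"),
        F.col("data").withField(
            "addresses",
            F.transform(
                F.col("data.addresses"),
                lambda a: a.withField("pincode", F.lit(None).cast("string"))
            )
        )
    ).otherwise(F.col("data"))
)
This leaves street and city untouched and only overwrites pincode for rows whose id matches.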
So I have a dataframe that I gather from a table in my database. When I read it from the database, the JSON column becomes a string in my dataframe; no problem, I convert it using:
from pyspark.sql.functions import from_json

df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
Now I get the following schema for my dataframe. I have removed a lot of irrelevant properties from the JSON, but trust me, there are a lot:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- Id: string (nullable = true)
| | |-- ... many more properties
Something is wrong here: inside the nested object object, I have two properties with the same name. This is because some of the older entries in the database had the Id of object as a string type, but that was later changed to a long type. Because of this I cannot write my dataframe using:
df.write.format("delta").option('mergeSchema', 'true').option("inferSchema", "true").option("overwriteSchema", "true").mode("overwrite").save(output_folder)
I get the following error:
AnalysisException: Found duplicate column(s) in the metadata update: json.object.id;
How do I go about fixing this? Ideally I would like to merge these two columns by casting the string-typed Id to a long type and merging them, so I could get a schema like:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- ... many more properties
Note how the root of the JSON also has a property called Id; I would like that one to be left untouched!
Can someone help me with this?
Instead of modifying and removing the duplicate column with the same name after having used:
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
I went with a solution that applies a regex substitution to JsonCol beforehand:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"([0-9]+)"', '"Id":$1'))
This strips the quotes around the Id value while keeping the captured digits ($1), which means it will be interpreted as a long instead of a string.
Putting it all together:
from pyspark.sql.functions import col, from_json, regexp_replace

df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"([0-9]+)"', '"Id":$1'))
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
This gave me my expected output, and afterwards I was able to save it properly:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
I am trying to extract elements from a wrapped array, but it is not working.
When I try to access the _image element, it throws an error.
Below is the schema of the structure.
I have tried to flatten the image values using explode, but it is not working.
Can someone help me extract elements from the wrapped array below?
root
|-- gender: string (nullable = true)
|-- item_number: string (nullable = true)
|-- price: double (nullable = true)
|-- size: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _description: string (nullable = true)
| | |-- color_swatch: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _VALUE: string (nullable = true)
| | | | |-- _image: string (nullable = true)
Example of the nested array data:
[[Medium,WrappedArray([Red,red_cardigan.jpg], [Burgundy,burgundy_cardigan.jpg])],
[Large,WrappedArray([Red,red_cardigan.jpg], [Burgundy,burgundy_cardigan.jpg])]]
Code:
val flatDF1 = xmlDF.select(
  col("_description").as("description"),
  col("_product_image").as("product_image"),
  col("catalog_item._gender").as("gender"),
  col("catalog_item.item_number").as("item_number"),
  col("catalog_item.price").as("price"),
  col("catalog_item.size").as("size"))
explode is the right built-in function to use.
Given the following structure, and supposing you want to use the DataFrame API:
case class ColorSwatch(_VALUE: String, _image: String)
case class Size(_description: String, color_swatch: Seq[ColorSwatch])
case class Cart(gender: String, item_number: String, price: Double, size: Seq[Size])
we can write:
import sparkSession.implicits._
import org.apache.spark.sql.functions.explode
val df = sparkSession.emptyDataset[Cart].toDF
df.select(explode($"size").as("size_elem"))
.select(explode($"size_elem.color_swatch").as("color_elem"))
.select($"color_elem._image")
.show(false)
Here I'm only selecting the image element, but we can also retrieve all the related attributes (gender, description, _value, ...).
Note: you can also use the Dataset API with a functional approach (flatMapping your structures).
I am trying to read a JSON into a dataframe and then into a dataset.
I am facing the issue below. Could someone quickly help me?
The personDF dataframe was created successfully, and below is its schema.
scala> personDF.printSchema();
root
|-- personDF: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- line1: string (nullable = true)
| |-- postalCode: string (nullable = true)
| |-- state: string (nullable = true)
| |-- type1: string (nullable = true)
I created a case class to capture the above DF.
scala> case class address1(city:String,line1:String,postalCode:String,state:String,type1:String)
defined class address1
Below is the data that personDF has at the moment
scala> personzDF.show()
+--------------------+
| personDF|
+--------------------+
|[CENTERPORT,5 PRO...|
|[HUNTINGTON,94 JA...|
|[RIVERHEAD,9 PATT...|
|[NORTHPORT,50 LIS...|
|[NORTHPORT,24 LAU...|
|[NORTHPORT,340 SC...|
|[GREENLAWN,166 BR...|
|[MELVILLE,1 MERID...|
+--------------------+
Finally, while creating the dataset, I get the error below.
scala> val ds = personDF.as[address1]
<console>:32: error: overloaded method value as with alternatives:
(alias: Symbol)org.apache.spark.sql.DataFrame <and>
(alias: String)org.apache.spark.sql.DataFrame
does not take type parameters
val ds = personDF.as[address1]
I did a Google search and could not find the reason.
Thanks,
Sivaram
Actually, personDF is a column of struct type (itself named personDF), so it does not have the flat structure you are expecting; that is what your printSchema() shows:
|-- personDF: struct (nullable = true)
Are you trying to convert personzDF to the address1 type? Then try something like this:
val ds = personzDF.map(rec => rec.split(","))
.map(rec => address1(rec(0), rec(1), rec(2), rec(3), rec(4)))
.toDF()
Hope it helps.
Here is the parquet file schema:
root
|-- id: string (nullable = false)
|-- samples: map (nullable = false)
| |-- key: string
| |-- value: struct (valueContainsNull = false)
| | |-- name: string (nullable = false)
| | |-- some_other_value: string (nullable = false)
I'd like to fetch all ids and the matched samples where the sample's key equals some value. For example:
select id, samples where key(samples) in ("abc") // Note: pseudo SQL
How do I write a query like this in Apache Spark?
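A possible PySpark sketch, assuming the schema above (the key "abc" and the file path are illustrative), filters on the map keys and then pulls out the matching entry:
from pyspark.sql import functions as F

df = spark.read.parquet("/path/to/samples.parquet")  # illustrative path

matched = (
    df.where(F.array_contains(F.map_keys(F.col("samples")), "abc"))
      .select("id", F.col("samples")["abc"].alias("matched_sample"))
)
matched.show(truncate=False)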