Remove duplicate column name in a Pyspark Dataframe from a json column nested object - apache-spark-sql

So I have a dataframe that I gather from a table in my database. When I read it from the database, the JSON column becomes a string in my dataframe, so I convert it using:
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
Now I get the following schema for my dataframe (I have removed a lot of irrelevant properties from the JSON, but trust me, there are a lot):
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- Id: string (nullable = true)
| | |-- ... many more properties
Something is wrong here: inside the nested object, object, I have two properties with the same name. This is because some of the older entries in the database had the Id of object as a string type, but that was later changed to a long type. Because of this I cannot write my dataframe using:
df.write.format("delta").option('mergeSchema', 'true').option("inferSchema", "true").option("overwriteSchema", "true").mode("overwrite").save(output_folder)
I get the following error:
AnalysisException: Found duplicate column(s) in the metadata update: json.object.id;
How do I go about fixing this? Ideally I would like to merge these two columns by casting the string-typed Id to a long type and merging them, so I could get a schema like:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- ... many more properties
Note how the root of the schema also has a property called Id; I would like for that one to be left untouched!
Can someone help me with this?

Instead of modifying and removing the duplicate column with the same name after having used:
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
I went with a solution where I used regex substitution on the JsonCol beforehand:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))
This removes the quotes on each side of the Id value (the capture group keeps the digits and $1 writes them back unquoted), which means it will be interpreted as a long instead of a string.
Putting it all together:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
This gave me my expected output, and afterwards I was able to save it properly:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
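For anyone who wants to sanity-check the substitution before running it on real data, here is a minimal sketch (the sample JSON values are made up, not taken from the original table):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

spark = SparkSession.builder.getOrCreate()

# one row where Id arrives as a JSON string, one where it is already a number (sample values)
sample = spark.createDataFrame(
    [('{"object":{"Id":"42"}}',), ('{"object":{"Id":7}}',)],
    ["JsonCol"],
)

# $1 writes the captured digits back without the surrounding quotes
sample.withColumn(
    "JsonCol", regexp_replace(col("JsonCol"), r'"Id":"([0-9]+)"', '"Id":$1')
).show(truncate=False)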

Related

Update a nested field inside Array or Struct attribute conditionally in PySpark

In a dataframe in AWS Glue I want to update a nested field inside an Array or Struct attribute in PySpark, satisfying a particular condition.
I know how to update a primitive attribute conditionally like this:
dataframe.withColumn('primitive_attribute', when(col("primaryKey") == "someValue", lit(None)).otherwise(col('primitive_attribute')))
My expected output format is:
root
|-- id: string (nullable = true)
|-- data: struct (nullable = true)
| |-- addresses: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- street: string (nullable = true)
| | | |-- city: string (nullable = true)
| | | |-- pincode: string (nullable = true)
Here I want to update all pincodes in addresses as null for some specific ids.
But is there any way to update a nested field inside an Array or Struct?
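For what it's worth, on Spark 3.1+ one way to approach this is to combine transform with Column.withField; the following is only a sketch under that assumption, with made-up id values:
from pyspark.sql.functions import col, lit, when, transform

# null out pincode on every address, but only for the listed ids (hypothetical values)
dataframe = dataframe.withColumn(
    "data",
    when(
        col("id").isin("id-1", "id-2"),
        col("data").withField(
            "addresses",
            transform(
                col("data.addresses"),
                lambda addr: addr.withField("pincode", lit(None).cast("string")),
            ),
        ),
    ).otherwise(col("data")),
)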

Query JSON data column using Spark DataFrames

I have a dataframe with the following structure:
root
|-- pk: string (nullable = true)
|-- sk: string (nullable = true)
|-- tags: string (nullable = true)
Sample data that I am dealing with:
+--------+---------+-----------------------------------------------------------------------------------------------------+
|pk |sk |tags |
+--------+---------+-----------------------------------------------------------------------------------------------------+
|123-8aab|464a-af2f|[{"type": "version","value": "2"},{"type": "version","value": "1"},{"type":"xyz","value": "version"}]|
|125-5afs|464a-af2f|[{"type": "version","value": "2"},{"type": "version","value": "1"}] |
|562-8608|4c4d-464a|[{"type": "version","value": "2"},{"type":"xyz","value": "version"}] |
|793-439b|4c4d-464a|[{"type": "version","value": "2"}] |
+--------+---------+-----------------------------------------------------------------------------------------------------+
The column tags is JSON and I am struggling to get the correct data for the column.
What I have so far:
tags_schema = spark.read.json(df_component.select('tags').rdd.map(lambda row: row[0])).schema
df_component = df_component.withColumn('tags', from_json(col('tags'), tags_schema))
df_component.printSchema()
root
|-- pk: string (nullable = true)
|-- sk: string (nullable = true)
|-- tags: struct (nullable = true)
| |-- type: string (nullable = true)
| |-- value: string (nullable = true)
After running the above code, most of the values come back null. Below is the sample output:
+--------+---------+------------+
|pk |sk |tags |
+--------+---------+------------+
|123-8aab|464a-af2f|null |
|125-5afs|464a-af2f|null |
|562-8608|4c4d-464a|null |
|793-439b|4c4d-464a|[version, 2]|
+--------+---------+------------+
Any help would be appreciated.
Your schema identifies tags as a struct; however, the data inside tags is an array of structs.
Try using the following as the tags schema:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# from the example you have an array of structs, with each struct having type and value
tags_schema = ArrayType(StructType([
    StructField("type", StringType(), True),
    StructField("value", StringType(), True),
]), True)
df_component = df_component.withColumn("tagdata", from_json(col('tags'), tags_schema))
Debugging the conversion:
# show original and new column
df_component.show()
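Once the array is parsed, a common follow-up is to explode it so each tag becomes its own row; a sketch of that:
from pyspark.sql.functions import col, explode

# one row per tag, with type and value pulled out as plain columns
df_component.select(
    "pk", "sk", explode(col("tagdata")).alias("tag")
).select(
    "pk", "sk", col("tag.type"), col("tag.value")
).show(truncate=False)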

extract elements from wrappedarray in spark data frame

I am trying to extract elements from a wrapped array, but it is not working.
When I try to access the _image element it throws an error.
Below is the schema of the structure.
I have tried to flatten the image values using explode, but it is not working.
Can someone help me extract elements from the wrapped array below?
root
|-- gender: string (nullable = true)
|-- item_number: string (nullable = true)
|-- price: double (nullable = true)
|-- size: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _description: string (nullable = true)
| | |-- color_swatch: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _VALUE: string (nullable = true)
| | | | |-- _image: string (nullable = true)
Example of the nested array data:
[[Medium,WrappedArray([Red,red_cardigan.jpg],Burgundy,burgundy_cardigan.jpg])],
[Large,WrappedArray([Red,red_cardigan.jpg], [Burgundy,burgundy_cardigan.jpg])]]
Code:
val flatDF1 = xmlDF.select(
  col("_description").as("description"),
  col("_product_image").as("product_image"),
  col("catalog_item._gender").as("gender"),
  col("catalog_item.item_number").as("item_number"),
  col("catalog_item.price").as("price"),
  col("catalog_item.size").as("size"))
Explode is the right built-in function to use.
Given the following structure, and supposing you want to use the DataFrame API:
case class ColorSwatch(_VALUE: String, _image: String)
case class Size(_description: String, color_swatch: Seq[ColorSwatch])
case class Cart(gender: String, item_number: String, price: Double, size: Seq[Size])
we can write:
import sparkSession.implicits._
import org.apache.spark.sql.functions.explode
val df = sparkSession.emptyDataset[Cart].toDF
df.select(explode($"size").as("size_elem"))
.select(explode($"size_elem.color_swatch").as("color_elem"))
.select($"color_elem._image")
.show(false)
Here I'm only selecting the image element, but we can also retrieve all the related attributes (gender, description, _value, ...).
Note: you can also use the Dataset API with a functional approach (consisting of flatMapping your structures).
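For readers working in PySpark rather than Scala, the same chain of explodes looks roughly like this (xml_df standing in for a DataFrame with the schema shown above):
from pyspark.sql.functions import col, explode

(xml_df
    .select(explode(col("size")).alias("size_elem"))
    .select(explode(col("size_elem.color_swatch")).alias("color_elem"))
    .select(col("color_elem._image"))
    .show(truncate=False))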

Query parquet file by specific key in nested column map with Apache Spark

Here is the parquet file schema:
root
|-- id: string (nullable = false)
|-- samples: map (nullable = false)
| |-- key: string
| |-- value: struct (valueContainsNull = false)
| | |-- name: string (nullable = false)
| | |-- some_other_value: string (nullable = false)
I'd like to fetch all ids and the matched samples where the sample's key equals some value. For example:
select id, samples where key(samples) in ("abc") // Note: pseudo SQL
How do I write a query like this in Apache Spark?
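One way to approach this in PySpark, sketched here as an assumption rather than a definitive answer, is to filter on the map's keys and then pull the matched entry out with element_at (the file path is hypothetical):
from pyspark.sql.functions import array_contains, col, element_at, map_keys

df = spark.read.parquet("/path/to/file.parquet")  # hypothetical path

# keep rows whose samples map contains the key "abc", then extract that entry
result = (
    df.filter(array_contains(map_keys(col("samples")), "abc"))
      .select("id", element_at(col("samples"), "abc").alias("sample"))
)
result.show(truncate=False)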

Spark SQL - How to select on dates stored as UTC millis from the epoch?

I have been searching and have not found a solution as to how one might query on dates stored as UTC milliseconds from the epoch using Spark SQL. The schema I have pulled in from a NoSQL datasource (JSON from MongoDB) has the target date as:
|-- dateCreated: struct (nullable = true)
||-- $date: long (nullable = true)
The complete schema is as follows:
scala> accEvt.printSchema
root
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- appId: integer (nullable = true)
|-- cId: long (nullable = true)
|-- data: struct (nullable = true)
| |-- expires: struct (nullable = true)
| | |-- $date: long (nullable = true)
| |-- metadata: struct (nullable = true)
| | |-- another key: string (nullable = true)
| | |-- class: string (nullable = true)
| | |-- field: string (nullable = true)
| | |-- flavors: string (nullable = true)
| | |-- foo: string (nullable = true)
| | |-- location1: string (nullable = true)
| | |-- location2: string (nullable = true)
| | |-- test: string (nullable = true)
| | |-- testKey: string (nullable = true)
| | |-- testKey2: string (nullable = true)
|-- dateCreated: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- id: integer (nullable = true)
|-- originationDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- processedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- receivedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
and my goal is to write queries along the lines of:
SELECT COUNT(*) FROM myTable WHERE dateCreated BETWEEN [dateStoredAsLong0] AND [dateStoredAsLong1]
My process thus far has been:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext#29200d25
scala> val accEvt = sqlContext.jsonFile("/home/bkarels/mongoexport/accomplishment_event.json")
...
14/10/29 15:03:38 INFO SparkContext: Job finished: reduce at JsonRDD.scala:46, took 4.668981083 s
accEvt: org.apache.spark.sql.SchemaRDD =
SchemaRDD[6] at RDD at SchemaRDD.scala:103
scala> accEvt.registerAsTable("accomplishmentEvent")
(At this point the following baseline query executes successfully.)
scala> sqlContext.sql("select count(*) from accomplishmentEvent").collect.foreach(println)
...
[74475]
Now, the voodoo that I cannot get right is how to form my select statement to reason about the dates. For example, the following executes without error, but returns zero rather than the count of all records as it should (74475).
scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate >= '1970-01-01'").collect.foreach(println)
...
[0]
I have also tried some ugliness like:
scala> val now = new java.util.Date()
now: java.util.Date = Wed Oct 29 15:05:15 CDT 2014
scala> val today = now.getTime
today: Long = 1414613115743
scala> val thirtydaysago = today - (30 * 24 * 60 * 60 * 1000)
thirtydaysago: Long = 1416316083039
scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate <= %s and processedDate >= %s".format(today,thirtydaysago)).collect.foreach(println)
As recommended, I've selected on a named field to ensure that works. So:
scala> sqlContext.sql("select receivedDate from accomplishmentEvent limit 10").collect.foreach(println)
returns:
[[1376318850033]]
[[1376319429590]]
[[1376320804289]]
[[1376320832835]]
[[1376320832960]]
[[1376320835554]]
[[1376320914480]]
[[1376321041899]]
[[1376321109341]]
[[1376321121469]]
Then, extending this to try to get some kind of date comparison working, I have tried:
scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.date > '1970-01-01' limit 5").collect.foreach(println)
Results in the error:
java.lang.RuntimeException: No such field date in StructType(ArrayBuffer(StructField($date,LongType,true)))
...
Prefixing the field name with $, as also suggested, results in a different kind of error:
scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5").collect.foreach(println)
java.lang.RuntimeException: [1.69] failure: ``UNION'' expected but ErrorToken(illegal character) found
select actualConsumerId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5
Clearly I am not getting how to select on dates stored in this way - can anyone help me fill in this gap?
I am newer to both Scala and Spark, so forgive me if this is an elementary question, but my searches have turned up empty on the forums and Spark documentation.
Thank you.
Your JSON is not flat, so the fields below the top level need to be addressed using qualified names, such as dateCreated.$date. Your specific date fields are all of long type, so you'll need to do numerical comparisons on them and it looks like you were on the right track for doing those.
An additional problem is that your field names have "$" characters, and Spark SQL won't let you query on them. One solution is that instead of reading the JSON directly as a SchemaRDD (as you have done) you first read it as an RDD[String], use the map method to perform the Scala string manipulations of your choice, and then use SQLContext's jsonRDD method to create the SchemaRDD.
val lines = sc.textFile(...)
// you may want something less naive than global replacement of all "$" chars
val linesFixed = lines.map(s => s.replaceAllLiterally("$", ""))
val accEvt = sqlContext.jsonRDD(linesFixed)
I've tested this with Spark 1.1.0.
For reference, the lack of quoting capability in Spark SQL has been noted in this bug report and perhaps others, and it seems that the fix was recently checked in, but it will take some time to make it into a release.