Use a higher-order function to add a calculated field to an array of structs in a Spark DataFrame query

I have a table with a column of array of structs with this schema:
root
|-- id: long (nullable = true)
|-- mainColumn: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- fieldA: string (nullable = true)
| | |-- fieldB: string (nullable = true)
| | |-- fieldC: string (nullable = true)
| | |-- fieldD: string (nullable = true)
|-- randomOtherColumn: short (nullable = true)
I need to create a query that returns the entire table but applies a function (an encryption function) to fieldA of every entry. I have tried using the transform() function, but I also need to return the other fields (fieldB, fieldC, etc.).
SELECT
    x.id,
    x.randomOtherColumn,
    transform(y -> ???)
FROM
    my_table x
Something like this would be simple in JavaScript with the spread operator:
df.map(x => ({
    id: x.id,
    randomOtherColumn: x.randomOtherColumn,
    mainColumn: x.mainColumn.map(y => ({ ...y, fieldA: encrypt(y.fieldA) }))
}))

You need to recreate the inner struct elements of the array like this:
SELECT
    id,
    randomOtherColumn,
    transform(mainColumn, x -> struct(encrypt(x.fieldA) as fieldA, x.fieldB as fieldB, x.fieldC as fieldC, x.fieldD as fieldD)) as mainColumn
FROM
    my_table
Or, using the DataFrame API with the Column method withField (available since Spark 3.1):
import org.apache.spark.sql.functions.{col, transform}

val df1 = df.withColumn(
  "mainColumn",
  transform(col("mainColumn"), x => x.withField("fieldA", encrypt(x("fieldA"))))
)
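Note that encrypt above stands for whatever encryption routine you already have, exposed as a Column expression. If it is a plain Scala function, it needs to be wrapped as a UDF first, and registered if you also want to call it from the SQL variant; a minimal sketch with a placeholder transformation (Base64, which is encoding, not real encryption):

import org.apache.spark.sql.functions.udf

// Placeholder only; substitute your real cipher here.
val encryptString = (s: String) => java.util.Base64.getEncoder.encodeToString(s.getBytes("UTF-8"))

val encrypt = udf(encryptString)              // usable in Column expressions, as above
spark.udf.register("encrypt", encryptString)  // usable from the SQL version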
As you also tagged pyspark:
from pyspark.sql import functions as F
df1 = df.withColumn(
    "mainColumn",
    F.transform("mainColumn", lambda x: x.withField("fieldA", encrypt(x["fieldA"])))
)

Related

Remove duplicate column name in a Pyspark Dataframe from a json column nested object

So I have a dataframe that I gather from a table in my database. When I read it from the database, the JSON column becomes a string in my dataframe; no problem, I convert it using:
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
Now I get the following schema for my dataframe (I have removed a lot of irrelevant properties from the JSON, but trust me, there are a lot):
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- Id: string (nullable = true)
| | |-- ... many more properties
Something is wrong here: inside the nested object (named object) I have two properties with the same name. This is because some of the older entries in the database had the Id of object as a string type, but that was later changed to a long type. Because of this I cannot write my dataframe using:
df.write.format("delta").option('mergeSchema', 'true').option("inferSchema", "true").option("overwriteSchema", "true").mode("overwrite").save(output_folder)
I get the following error:
AnalysisException: Found duplicate column(s) in the metadata update: json.object.id;
How do I go about fixing this? Ideally I would like to merge these two columns, by casting the string-typed Id to a long type and merging them, so I could get a schema like:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- ... many more properties
Note how the root of the JSON also has a property called Id; I would like that one to be untouched!
Can someone help me with this?
Instead of modifying and removing the duplicate column with the same name after having used:
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
I went with a solution where I used regex substitution on the JsonCol beforehand:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))
This strips the double quotes around the numeric Id value (e.g. "Id":"123" becomes "Id":123), which means it will be interpreted as a long instead of a string.
Putting it all together:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
This gave me my expected output, and afterwards I was able to save it properly:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)

extract elements from wrappedarray in spark data frame

I am trying to extract elements from a wrapped array, but it is not working.
When I try to access the _image element, it throws an error.
Below is the schema of the structure.
I have tried to flatten the image values using explode, but it is not working.
Can someone help me extract elements from the wrapped array below?
root
|-- gender: string (nullable = true)
|-- item_number: string (nullable = true)
|-- price: double (nullable = true)
|-- size: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _description: string (nullable = true)
| | |-- color_swatch: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _VALUE: string (nullable = true)
| | | | |-- _image: string (nullable = true)
Example of the nested array data:
[[Medium,WrappedArray([Red,red_cardigan.jpg], [Burgundy,burgundy_cardigan.jpg])],
[Large,WrappedArray([Red,red_cardigan.jpg], [Burgundy,burgundy_cardigan.jpg])]]
Code:
val flatDF1 = xmlDF.select(
  col("_description").as("description"),
  col("_product_image").as("product_image"),
  col("catalog_item._gender").as("gender"),
  col("catalog_item.item_number").as("item_number"),
  col("catalog_item.price").as("price"),
  col("catalog_item.size").as("size"))
explode is the right built-in function to use.
Given the following structure, and supposing you want to use the DataFrame API:
case class ColorSwatch(_VALUE: String, _image: String)
case class Size(_description: String, color_swatch: Seq[ColorSwatch])
case class Cart(gender: String, item_number: String, price: Double, size: Seq[Size])
we can write :
import sparkSession.implicits._
import org.apache.spark.sql.functions.explode
val df = sparkSession.emptyDataset[Cart].toDF
df.select(explode($"size").as("size_elem"))
.select(explode($"size_elem.color_swatch").as("color_elem"))
.select($"color_elem._image")
.show(false)
Here I'm only selecting the image element, but we can also retrieve all the related attributes (gender, description, _value, ...).
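For instance, to keep the top-level and intermediate attributes alongside each image (a sketch reusing the df and imports defined above; column names follow the schema in the question):

df.select($"gender", $"item_number", $"price", explode($"size").as("size_elem"))
  .select($"gender", $"item_number", $"price",
    $"size_elem._description".as("description"),
    explode($"size_elem.color_swatch").as("color_elem"))
  .select($"gender", $"item_number", $"price", $"description",
    $"color_elem._VALUE".as("value"),
    $"color_elem._image".as("image"))
  .show(false)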
Note: you can also use the Dataset API with a functional approach (flatMapping your structures).

Spark elasticsearch connector: how to select _id field?

I'm using the spark elasticsearch connector and I want to extract from ES some fields plus the _id.
myquery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myquery)
.option("pushdown", "true")
.load("myindex/mytype")
.limit(10)
.select("myfield","_id")
Unfortunately, the _id field is not recognized:
AnalysisException: u'cannot resolve \'`_id`\' given input columns:
[query, size, #version, #timestamp,
sourceinfo, signaletic, document, metadata, fields, aggs]
With this mapping I can select, for example, document.{fieldA}, sourceinfo.{fieldB}, etc., but not _id. It's as if it were at a level too high in the mapping.
Any idea how to get this particular field?
You can access _id and the other document metadata by setting es.read.metadata to true.
For example in your code:
myquery = """{"query":..., """
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myquery)
.option("pushdown", "true")
.option("es.read.metadata",True) # Set it to True
.load("myindex/mytype")
.limit(10)
.select("myfield","_id")
When you do
df.printSchema()
it will print something like this (in my case):
|-- user_id: string (nullable = true)
|-- user_rt: string (nullable = true)
|-- user_rt_id: string (nullable = true)
|-- username: string (nullable = true)
|-- video: long (nullable = true)
|-- _metadata: map (nullable = true) # this map column will be added
| |-- key: string
| |-- value: string (valueContainsNull = true)
You can further extract properties like _index, _type, _id, _score, and sort from the _metadata column.
To access the _id key from the _metadata column (a map type) I have used the following; you can use whatever you think is best:
df.createOrReplaceTempView('temp_table_name')
spark.sql("""
SELECT
_metadata._id as reference_id
FROM temp_table_name
""")
Hope this solves your problem
Some special characters, like _ and ₹, are not supported in Spark SQL; please try replacing the special characters with another name, e.g. "Id".

Issue while creating spark datasets

I am trying to read a JSON file into a dataframe and then into a dataset.
I am facing the issue below. Could someone quickly help me?
The personDF dataframe was created successfully and below is its schema.
scala> personDF.printSchema();
root
|-- personDF: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- line1: string (nullable = true)
| |-- postalCode: string (nullable = true)
| |-- state: string (nullable = true)
| |-- type1: string (nullable = true)
I created a case class to capture the above DF.
scala> case class address1(city:String,line1:String,postalCode:String,state:String,type1:String)
defined class address1
Below is the data that personDF has at the moment
scala> personzDF.show()
+--------------------+
| personDF|
+--------------------+
|[CENTERPORT,5 PRO...|
|[HUNTINGTON,94 JA...|
|[RIVERHEAD,9 PATT...|
|[NORTHPORT,50 LIS...|
|[NORTHPORT,24 LAU...|
|[NORTHPORT,340 SC...|
|[GREENLAWN,166 BR...|
|[MELVILLE,1 MERID...|
+--------------------+
Finally while I am creating dataset, I am getting the below error.
scala> val ds = personDF.as[address1]
<console>:32: error: overloaded method value as with alternatives:
(alias: Symbol)org.apache.spark.sql.DataFrame <and>
(alias: String)org.apache.spark.sql.DataFrame
does not take type parameters
val ds = personDF.as[address1]
I did a Google search and could not find the reason.
Thanks,
Sivaram
Actually personDF has a single struct-typed column (also named personDF), so it is not in the structure you are expecting, as your printSchema() shows:
|-- personDF: struct (nullable = true)
Are you trying to convert personzDF to the address1 type? Then try something like this:
val ds = personzDF.map(rec => rec.split(","))
.map(rec => address1(rec(0), rec(1), rec(2), rec(3), rec(4)))
.toDF()
Hope it helps.
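If the end goal is simply a Dataset[address1], another option is to flatten the nested struct and then convert; a sketch assuming Spark 2.x (where as[T] is available on DataFrames) and that the struct's field names match the case class:

import spark.implicits._

val ds = personDF
  .select("personDF.*") // promote the struct's fields to top-level columns
  .as[address1]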

Spark SQL - How to select on dates stored as UTC millis from the epoch?

I have been searching and have not found a solution as to how one might query on dates stored as UTC milliseconds from the epoch using Spark SQL. The schema I have pulled in from a NoSQL datasource (JSON from MongoDB) has the target date as:
|-- dateCreated: struct (nullable = true)
| |-- $date: long (nullable = true)
The complete schema is as follows:
scala> accEvt.printSchema
root
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- appId: integer (nullable = true)
|-- cId: long (nullable = true)
|-- data: struct (nullable = true)
| |-- expires: struct (nullable = true)
| | |-- $date: long (nullable = true)
| |-- metadata: struct (nullable = true)
| | |-- another key: string (nullable = true)
| | |-- class: string (nullable = true)
| | |-- field: string (nullable = true)
| | |-- flavors: string (nullable = true)
| | |-- foo: string (nullable = true)
| | |-- location1: string (nullable = true)
| | |-- location2: string (nullable = true)
| | |-- test: string (nullable = true)
| | |-- testKey: string (nullable = true)
| | |-- testKey2: string (nullable = true)
|-- dateCreated: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- id: integer (nullable = true)
|-- originationDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- processedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- receivedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
and my goal is to write queries along the lines of:
SELECT COUNT(*) FROM myTable WHERE dateCreated BETWEEN [dateStoredAsLong0] AND [dateStoredAsLong1]
My process thus far has been:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@29200d25
scala> val accEvt = sqlContext.jsonFile("/home/bkarels/mongoexport/accomplishment_event.json")
...
14/10/29 15:03:38 INFO SparkContext: Job finished: reduce at JsonRDD.scala:46, took 4.668981083 s
accEvt: org.apache.spark.sql.SchemaRDD =
SchemaRDD[6] at RDD at SchemaRDD.scala:103
scala> accEvt.registerAsTable("accomplishmentEvent")
(At this point the following baseline query executes successfully)
scala> sqlContext.sql("select count(*) from accomplishmentEvent").collect.foreach(println)
...
[74475]
Now, the voodoo that I cannot get right is how to form my select statement to reason about the dates. For example, the following executes without error, but returns zero rather than the count of all records as it should (74475).
scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate >= '1970-01-01'").collect.foreach(println)
...
[0]
I have also tried some ugliness like:
scala> val now = new java.util.Date()
now: java.util.Date = Wed Oct 29 15:05:15 CDT 2014
scala> val today = now.getTime
today: Long = 1414613115743
scala> val thirtydaysago = today - (30 * 24 * 60 * 60 * 1000)
thirtydaysago: Long = 1416316083039
scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate <= %s and processedDate >= %s".format(today,thirtydaysago)).collect.foreach(println)
As recommended, I've selected on a named field to ensure that works. So:
scala> sqlContext.sql("select receivedDate from accomplishmentEvent limit 10").collect.foreach(println)
returns:
[[1376318850033]]
[[1376319429590]]
[[1376320804289]]
[[1376320832835]]
[[1376320832960]]
[[1376320835554]]
[[1376320914480]]
[[1376321041899]]
[[1376321109341]]
[[1376321121469]]
Then, extending this to try to get some kind of date filtering working, I have tried:
scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.date > '1970-01-01' limit 5").collect.foreach(println)
Results in the error:
java.lang.RuntimeException: No such field date in StructType(ArrayBuffer(StructField($date,LongType,true)))
...
Prefixing our field name with $ as also suggested results in a different kind of error:
scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5").collect.foreach(println)
java.lang.RuntimeException: [1.69] failure: ``UNION'' expected but ErrorToken(illegal character) found
select actualConsumerId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5
Clearly I am not getting how to select on dates stored in this way - can anyone help me fill in this gap?
I am newer to both Scala and Spark, so forgive me if this is an elementary question, but my searches have turned up empty on the forums and Spark documentation.
Thank you.
Your JSON is not flat, so the fields below the top level need to be addressed using qualified names, such as dateCreated.$date. Your specific date fields are all of long type, so you'll need to do numerical comparisons on them and it looks like you were on the right track for doing those.
An additional problem is that your field names have "$" characters, and Spark SQL won't let you query on them. One solution is that instead of reading the JSON directly as a SchemaRDD (as you have done) you first read it as an RDD[String], use the map method to perform the Scala string manipulations of your choice, and then use SQLContext's jsonRDD method to create the SchemaRDD.
val lines = sc.textFile(...)
// you may want something less naive than global replacement of all "$" chars
val linesFixed = lines.map(s => s.replaceAllLiterally("$", ""))
val accEvt = sqlContext.jsonRDD(linesFixed)
I've tested this with Spark 1.1.0.
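With the "$" characters stripped, the nested date fields can then be addressed with qualified names and compared numerically; a sketch reusing the variables from the question (note the field is now date rather than $date, and the offset uses a Long literal to avoid the Int overflow visible in the thirtydaysago value above):

accEvt.registerAsTable("accomplishmentEvent")

val today = new java.util.Date().getTime
val thirtyDaysAgo = today - (30L * 24 * 60 * 60 * 1000) // Long literal avoids Int overflow

sqlContext.sql(
  "select count(*) from accomplishmentEvent where dateCreated.date >= %s and dateCreated.date <= %s"
    .format(thirtyDaysAgo, today)
).collect.foreach(println)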
For reference, the lack of quoting capability in Spark SQL has been noted in this bug report and perhaps others, and it seems that the fix was recently checked in, but it will take some time to make it into a release.