Spark SQL - How to select on dates stored as UTC millis from the epoch?

I have been searching and have not found a solution as to how one might query on dates stored as UTC milliseconds from the epoch using Spark SQL. The schema I have pulled in from a NoSQL datasource (JSON from MongoDB) has the target date as:
|-- dateCreated: struct (nullable = true)
| |-- $date: long (nullable = true)
The complete schema is as follows:
scala> accEvt.printSchema
root
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- appId: integer (nullable = true)
|-- cId: long (nullable = true)
|-- data: struct (nullable = true)
| |-- expires: struct (nullable = true)
| | |-- $date: long (nullable = true)
| |-- metadata: struct (nullable = true)
| | |-- another key: string (nullable = true)
| | |-- class: string (nullable = true)
| | |-- field: string (nullable = true)
| | |-- flavors: string (nullable = true)
| | |-- foo: string (nullable = true)
| | |-- location1: string (nullable = true)
| | |-- location2: string (nullable = true)
| | |-- test: string (nullable = true)
| | |-- testKey: string (nullable = true)
| | |-- testKey2: string (nullable = true)
|-- dateCreated: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- id: integer (nullable = true)
|-- originationDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- processedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- receivedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
and my goal is to write queries along the lines of:
SELECT COUNT(*) FROM myTable WHERE dateCreated BETWEEN [dateStoredAsLong0] AND [dateStoredAsLong1]
My process thus far has been:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@29200d25
scala> val accEvt = sqlContext.jsonFile("/home/bkarels/mongoexport/accomplishment_event.json")
...
14/10/29 15:03:38 INFO SparkContext: Job finished: reduce at JsonRDD.scala:46, took 4.668981083 s
accEvt: org.apache.spark.sql.SchemaRDD =
SchemaRDD[6] at RDD at SchemaRDD.scala:103
scala> accEvt.registerAsTable("accomplishmentEvent")
(At this point the following baseline query executes successfully)
scala> sqlContext.sql("select count(*) from accomplishmentEvent").collect.foreach(println)
...
[74475]
Now, the voodoo that I cannot get right is how to form my select statement to reason about the dates. For example, the following executes without error, but returns zero rather than the count of all records as it should (74475).
scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate >= '1970-01-01'").collect.foreach(println)
...
[0]
I have also tried some ugliness like:
scala> val now = new java.util.Date()
now: java.util.Date = Wed Oct 29 15:05:15 CDT 2014
scala> val today = now.getTime
today: Long = 1414613115743
scala> val thirtydaysago = today - (30 * 24 * 60 * 60 * 1000)
thirtydaysago: Long = 1416316083039
scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate <= %s and processedDate >= %s".format(today,thirtydaysago)).collect.foreach(println)
As recommended, I've selected on a named field to ensure that works. So:
scala> sqlContext.sql("select receivedDate from accomplishmentEvent limit 10").collect.foreach(println)
returns:
[[1376318850033]]
[[1376319429590]]
[[1376320804289]]
[[1376320832835]]
[[1376320832960]]
[[1376320835554]]
[[1376320914480]]
[[1376321041899]]
[[1376321109341]]
[[1376321121469]]
Extending that to try to get some kind of date comparison working, I have tried:
scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.date > '1970-01-01' limit 5").collect.foreach(println)
Results in the error:
java.lang.RuntimeException: No such field date in StructType(ArrayBuffer(StructField($date,LongType,true)))
...
Prefixing our field name with $, as also suggested, results in a different kind of error:
scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5").collect.foreach(println)
java.lang.RuntimeException: [1.69] failure: ``UNION'' expected but ErrorToken(illegal character) found
select actualConsumerId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5
Clearly I am not getting how to select on dates stored in this way - can anyone help me fill in this gap?
I am newer to both Scala and Spark, so forgive me if this is an elementary question, but my searches have turned up empty on the forums and Spark documentation.
Thank you.

Your JSON is not flat, so fields below the top level need to be addressed using qualified names, such as dateCreated.$date. Your specific date fields are all of long type, so you'll need to do numerical comparisons on them, and it looks like you were on the right track for doing those.
An additional problem is that your field names contain "$" characters, and Spark SQL won't let you query on them. One solution is that instead of reading the JSON directly as a SchemaRDD (as you have done), you first read it as an RDD[String], use the map method to perform the Scala string manipulations of your choice, and then use SQLContext's jsonRDD method to create the SchemaRDD.
val lines = sc.textFile(...)
// you may want something less naive than global replacement of all "$" chars
val linesFixed = lines.map(s => s.replaceAllLiterally("$", ""))
val accEvt = sqlContext.jsonRDD(linesFixed)
I've tested this with Spark 1.1.0.
For reference, the lack of quoting capability in Spark SQL has been noted in this bug report and perhaps others; it seems that the fix was recently checked in, but it will take some time to make it into a release.
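Once the cleaned-up SchemaRDD is registered as a table, the nested date fields can be addressed with qualified names (the stripped "$date" becomes "date") and compared as plain longs, since the values are epoch milliseconds. A rough sketch, reusing the table name from the question; note the 30L literal, which avoids the Int overflow in the thirty-day computation above:
accEvt.registerAsTable("accomplishmentEvent")

// epoch-millisecond bounds; 30L forces Long arithmetic
val today = System.currentTimeMillis
val thirtyDaysAgo = today - 30L * 24 * 60 * 60 * 1000

// after stripping "$", the nested field is dateCreated.date
sqlContext.sql(
  "select count(*) from accomplishmentEvent where dateCreated.date >= %s and dateCreated.date <= %s"
    .format(thirtyDaysAgo, today)
).collect.foreach(println)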

Related

Remove duplicate column name in a Pyspark Dataframe from a json column nested object

So I have a dataframe that I gather from a table in my database. When I read it from the database, the JSON column becomes a string in my dataframe; no problem, I convert it using:
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
Now I get the following schema for my dataframe. I have removed a lot of irrelevant properties from the JSON, but trust me, there are a lot:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- Id: string (nullable = true)
| | |-- ... many more properties
Something is wrong here: inside the nested object, object, I have two properties with the same name. This is because some of the older entries in the database had the Id of object as a string type, but that was later changed to a long type. Because of this I cannot write my dataframe using:
df.write.format("delta").option('mergeSchema', 'true').option("inferSchema", "true").option("overwriteSchema", "true").mode("overwrite").save(output_folder)
I get the following error:
AnalysisException: Found duplicate column(s) in the metadata update: json.object.id;
How do I go about fixing this? Ideally I would like to merge these two columns by casting the string-typed Id to a long type and merging them, so I could get a schema like:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)
| | |-- ... many more properties
Note how the root of the json also has a property called Id, I would like for that to be untouched!
Can someone help me with this?
Instead of modifying and removing the duplicate column with the same name after having used:
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
I went with a solution where I used regex substitution on the JsonCol beforehand:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))
This removes the " on each side of the Id, which means they will be interpreted as long instead of as a string.
Putting it all together:
df = df.withColumn("JsonCol", regexp_replace(col("JsonCol"), r'"Id":"[0-9]+"', '"Id":[0-9]+'))
df_json = spark.read.json(df.rdd.map(lambda r: r.JsonCol))
df = df.withColumn("json_data", from_json("JsonCol", df_json.schema)).drop("JsonCol")
This gave me my expected output, and afterwards I was able to save it properly:
root
|-- Id: integer (nullable = true)
|-- json_data: struct (nullable = false)
| |-- object: struct (nullable = true)
| | |-- Id: long (nullable = true)

extract elements from wrappedarray in spark data frame

I am trying to extract elements from a wrapped array, but it is not working.
When I try to access the _image element, it throws an error.
Below is the schema of the structure.
I have tried to flatten the image values using explode, but it is not working.
Can someone help me extract elements from the wrapped array below?
root
|-- gender: string (nullable = true)
|-- item_number: string (nullable = true)
|-- price: double (nullable = true)
|-- size: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _description: string (nullable = true)
| | |-- color_swatch: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- _VALUE: string (nullable = true)
| | | | |-- _image: string (nullable = true)
Example of the nested array data:
[[Medium,WrappedArray([Red,red_cardigan.jpg], [Burgundy,burgundy_cardigan.jpg])],
[Large,WrappedArray([Red,red_cardigan.jpg], [Burgundy,burgundy_cardigan.jpg])]]
Code:
val flatDF1 = xmlDF.select(
  col("_description").as("description"),
  col("_product_image").as("product_image"),
  col("catalog_item._gender").as("gender"),
  col("catalog_item.item_number").as("item_number"),
  col("catalog_item.price").as("price"),
  col("catalog_item.size").as("size"))
Explode is the right built-in function to use.
Given the following structure, and supposing you want to use the DataFrame API:
case class ColorSwatch(_VALUE: String, _image: String)
case class Size(_description: String, color_swatch: Seq[ColorSwatch])
case class Cart(gender: String, item_number: String, price: Double, size: Seq[Size])
we can write:
import sparkSession.implicits._
import org.apache.spark.sql.functions.explode
val df = sparkSession.emptyDataset[Cart].toDF
df.select(explode($"size").as("size_elem"))
.select(explode($"size_elem.color_swatch").as("color_elem"))
.select($"color_elem._image")
.show(false)
Here I'm only selecting the image element, but we can also retrieve all related attributes (gender, description, _value, ...).
Note: You can also use the Dataset API with a functional approach (flatMapping your structures), as sketched below.
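A rough sketch of that Dataset alternative, assuming the same case classes as above; carts stands in for your real Dataset[Cart]:
import sparkSession.implicits._

// typed, functional flattening: Dataset[Cart] => Dataset[String] of image names
val carts = sparkSession.emptyDataset[Cart]   // stand-in for your real data
val images = carts.flatMap(cart =>
  cart.size.flatMap(s => s.color_swatch.map(_._image)))
images.show(false)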

How to select nested arrays and maps in a Parquet file using Spark SQL

I need to select elements in deeply nested data structures in Parquet files. The schema of the Parquet file is as follows:
root
|-- descriptor_type: string (nullable = true)
|-- src_date: long (nullable = true)
|-- downloaded: long (nullable = true)
|-- exit_nodes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- fingerprint: string (nullable = true)
| | |-- published: long (nullable = true)
| | |-- last_status: long (nullable = true)
| | |-- exit_adresses: map (nullable = true)
| | | |-- key: string
| | | |-- value: long (valueContainsNull = true)
An entry in the dataset, serialized as JSON, looks like this:
{
"descriptor_type": "tordnsel 1.0",
"src_date": 1472781720000,
"downloaded": 1472781720000,
"exit_nodes": [
{
"fingerprint": "CECCFA65F3EB16CA8C0F9EAC9050C348515E26C5",
"published": 1472713568000,
"last_status": 1472716961000,
"exit_adresses": {
"178.217.187.39": 1472717419000
}
},
...
I'm using Spark 2.0 integrated in SnappyData 0.6, in which the Parquet file is addressed like this:
snappy> CREATE EXTERNAL TABLE stage USING PARQUET OPTIONS (path './testdata.parquet.snappy');
Selecting the first row yields the following result:
snappy> select * from stage limit 1;
descriptor_type|src_date |downloaded |exit_nodes
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
tordnsel 1.0 |1472781720000 |1472781720000 |5704000060110000e011000060120000d812000058130000d813000058140000d014000050150000d015000050160000d016000048170000c81700004018000&
The field 'exit_nodes' contains only one long string, not the array of structs that I was naïvely hoping for.
I can select a specific element in the 'exit_nodes' array by index:
snappy> select exit_nodes[0].fingerprint, exit_nodes[0].published, exit_nodes[0].exit_adresses from stage limit 1;
EXIT_NODES[0].FINGERPRINT |EXIT_NODES[0].PUBLISHED|EXIT_NODES[0].EXIT_ADRESSES
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
3D28E5FBD0C670C004E59D6CFDE7305BC8948FA8 |1472750744000 |15000000010000000800000037382e3134322e31392e3231330100000008000000b057f0e656010000
With the 'exit_adresses' map I had no luck though:
snappy> select exit_nodes[0].exit_adresses.key from stage limit 1;
EXIT_NODES[0].EXIT_ADRESSES[KEY]
--------------------------------
NULL
So the questions are:
- how to select the keys and values in one 'exit_adresses' map?
- how to select all records in the 'element's array, or all key-value pairs in the nested map, to import them from the Parquet file into an RDBMS?
I don't have a direct answer to this, but IMHO there is no further support for querying nested Parquet types beyond this Spark PR-360.
This pretty much covers all you can do:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
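For what it's worth, a common way to flatten both the nested array and the map in plain Spark SQL is LATERAL VIEW explode. A sketch of what that could look like against stock Spark 2.x (not verified against SnappyData):
// register the Parquet file as a temp view, then explode the array and the map
spark.read.parquet("./testdata.parquet.snappy").createOrReplaceTempView("stage")

spark.sql("""
  SELECT node.fingerprint, node.published, addr, seen
  FROM stage
  LATERAL VIEW explode(exit_nodes) n AS node
  LATERAL VIEW explode(node.exit_adresses) a AS addr, seen
""").show(false)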

Issue while creating spark datasets

I am trying to read a JSON file into a dataframe and then into a dataset.
I am facing the issue below. Could someone quickly help me?
The personDF dataframe was created successfully, and below is its schema.
scala> personDF.printSchema();
root
|-- personDF: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- line1: string (nullable = true)
| |-- postalCode: string (nullable = true)
| |-- state: string (nullable = true)
| |-- type1: string (nullable = true)
I created a case class to capture the above DF.
scala> case class address1(city:String,line1:String,postalCode:String,state:String,type1:String)
defined class address1
Below is the data that personDF has at the moment
scala> personzDF.show()
+--------------------+
| personDF|
+--------------------+
|[CENTERPORT,5 PRO...|
|[HUNTINGTON,94 JA...|
|[RIVERHEAD,9 PATT...|
|[NORTHPORT,50 LIS...|
|[NORTHPORT,24 LAU...|
|[NORTHPORT,340 SC...|
|[GREENLAWN,166 BR...|
|[MELVILLE,1 MERID...|
+--------------------+
Finally, while creating the dataset, I am getting the error below.
scala> val ds = personDF.as[address1]
<console>:32: error: overloaded method value as with alternatives:
(alias: Symbol)org.apache.spark.sql.DataFrame <and>
(alias: String)org.apache.spark.sql.DataFrame
does not take type parameters
val ds = personDF.as[address1]
I did a Google search and could not find the reason.
Thanks,
Sivaram
Actually personDF is an array of structType personDF, but it is not in the structure you are expecting, as shown by your printSchema():
|-- personDF: struct (nullable = true)
Are you trying to convert personzDF to the address1 type? Then try it like this:
val ds = personzDF.map(rec => rec.split(","))
.map(rec => address1(rec(0), rec(1), rec(2), rec(3), rec(4)))
.toDF()
Hope it helps.
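An alternative sketch, assuming Spark 2.x (where as[T] does take a type parameter) and that the goal is a typed Dataset of the nested personDF struct: flatten the struct into top-level columns first, then convert.
import spark.implicits._

// address1 as defined in the question
case class address1(city: String, line1: String, postalCode: String, state: String, type1: String)

val ds = personDF
  .select("personDF.*")   // lifts city, line1, postalCode, state, type1 to the top level
  .as[address1]
ds.show()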

Query parquet file by specific key in nested column map with Apache Spark

Here is the parquet file schema:
root
|-- id: string (nullable = false)
|-- samples: map (nullable = false)
| |-- key: string
| |-- value: struct (valueContainsNull = false)
| | |-- name: string (nullable = false)
| | |-- some_other_value: string (nullable = false)
I'd like to fetch all ids and the matching samples where the sample's key equals some value. For example:
select id, samples where key(samples) in ("abc") // Note: pseudo SQL
How do I write a query like this with Apache Spark?
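One possible sketch with the DataFrame API, assuming the file has been loaded into a DataFrame named df (the path below is hypothetical); getItem returns null when the map has no such key, so filtering on it keeps only the matching rows:
import org.apache.spark.sql.functions.col

val df = spark.read.parquet("/path/to/samples.parquet")   // hypothetical path

// keep only rows whose samples map contains the key "abc", and return the matched value
val matched = df
  .select(col("id"), col("samples").getItem("abc").as("matched_sample"))
  .where(col("matched_sample").isNotNull)
matched.show(false)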