Spark dataframe statements work differently inside class definition - sql

Trying to create a spark-scala class for working with a calendar reference table.
I use sql against a Hadoop database to create a Spark dataframe:
scala> val dfCyccal = sql(sqlCyccal)
dfCyccal: org.apache.spark.sql.DataFrame = [DT_WORKDAY: date, NWKD: int ... 4 more fields]
scala> dfCyccal.printSchema
root
|-- DT_WORKDAY: date (nullable = true)
|-- NWKD: integer (nullable = true)
|-- DT_PREV_WD: date (nullable = true)
|-- DT_NEXT_WD: date (nullable = true)
|-- DT_MNTHEND: date (nullable = true)
|-- ACCTG_MNTH: date (nullable = true)
scala> dfCyccal.show(5)
+----------+----+----------+----------+----------+----------+
|DT_WORKDAY|NWKD|DT_PREV_WD|DT_NEXT_WD|DT_MNTHEND|ACCTG_MNTH|
+----------+----+----------+----------+----------+----------+
|2004-01-29|  20|2003-12-30|2004-02-27|2004-01-29|2004-01-01|
|2004-01-30|   1|2003-12-31|2004-03-02|2004-02-27|2004-02-01|
|2004-02-02|   2|2004-01-02|2004-03-03|2004-02-27|2004-02-01|
|2004-02-03|   3|2004-01-05|2004-03-04|2004-02-27|2004-02-01|
|2004-02-04|   4|2004-01-06|2004-03-05|2004-02-27|2004-02-01|
+----------+----+----------+----------+----------+----------+
only showing top 5 rows
I then set reference constants for the extract:
scala> val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min('DT_WORKDAY)).first().getDate(0)
MIN_DT_WORKDAY: java.sql.Date = 2004-01-29
scala> val MAX_DT_WORKDAY : java.sql.Date = dfCyccal.agg(max('DT_WORKDAY)).first().getDate(0)
MAX_DT_WORKDAY: java.sql.Date = 2020-12-01
Problem is, when I try to encapsulate this in a class definition, I get a different result:
class CYCCAL(parameters for SQL) {
...
val dfCyccal = sql(sqlCyccal).persist;
<console>:143: error: not found: value min
val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min('DT_WORKDAY)).first().getDate(0)
^
<console>:144: error: not found: value max
val MAX_DT_WORKDAY : java.sql.Date = dfCyccal.agg(max('DT_WORKDAY)).first().getDate(0)
}; // end of CYCCAL
How does the Class setup change the operations on the DataFrame?

They work the same. spark-shell just imports many objects by default including:
import org.apache.spark.sql.functions._
which are missing in your own code.

This worked. I had to add the following INSIDE the Class definition:
import org.apache.spark.sql.functions.{min,max};
I also had to change the notation on the column from
val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min('DT_WORKDAY)).first().getDate(0)
to
val MIN_DT_WORKDAY : java.sql.Date = dfCyccal.agg(min("DT_WORKDAY")).first().getDate(0)
The original was a Scala Symbol; without the implicit conversions that spark-shell also pre-imports (via import spark.implicits._), it could not be applied as a Column argument to these functions.
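For reference, here is a minimal self-contained sketch of the class with those imports in place (the class shape and constructor parameters are placeholders, since the original SQL and parameters are not shown; it assumes the SparkSession is passed in):
import org.apache.spark.sql.{DataFrame, SparkSession}

class Cyccal(spark: SparkSession, sqlCyccal: String) {

  // spark-shell pre-imports these; inside your own class you must add them
  import org.apache.spark.sql.functions.{min, max}
  import spark.implicits._

  val dfCyccal: DataFrame = spark.sql(sqlCyccal).persist()

  val MIN_DT_WORKDAY: java.sql.Date =
    dfCyccal.agg(min($"DT_WORKDAY")).first().getDate(0)

  val MAX_DT_WORKDAY: java.sql.Date =
    dfCyccal.agg(max($"DT_WORKDAY")).first().getDate(0)
}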

Related

Spark: Select specific Index of value of type Any

So I have a DataFrame in which one of the columns is of type WrappedArray(JSON). In each of these, the JSON has the format [String, String].
I have successfully accessed the inside of the array, and now I have a column of type Any in which each value holds a [String, String].
The point is: I just want to take the first of these two strings, but if I try something like column(0), it raises an error since Any has no index. How can I access this value?
My code for now is:
val schema = StructType(Seq(
  StructField("productId", StringType, true),
  StructField("name", StringType, true)
))
df.withColumn("column", from_json($"column"(0), schema) )
And the schema of my df:
root
|-- customerId: string (nullable = true)
|-- column: struct (nullable = true)
| |-- productId: string (nullable = true)
| |-- name: string (nullable = true)
|-- date: date (nullable = true)
I managed to solve the problem myself. The answer was quite obvious: Instead of creating a column of type struct which contains two values, I created a MapType which had the same values.
My final code:
df.withColumn("column", from_json($"column"(0), MapType(StringType, StringType)) )
And then, for accessing the keys and values of the new column:
.select("column.productId", "column.name")

Cannot resolve column from input columns after second join

I am trying to join a DataFrame with two separate DataFrames that have a common parent, but I'm getting an error.
These are the schemas:
demoDf
root
|-- id: integer (nullable = true)
parentDf
root
|-- id: integer (nullable = true)
|-- value: string (nullable = true)
This is the Scala code:
val demoDf: DataFrame
val parentDf: DataFrame
val childDf1 = parentDf.filter(col("id") > 50)
val childDf2 = parentDf.filter(col("id") < 50)
demoDf.alias("demoDf")
.join(childDf1.alias("childDf1"), $"demoDf.id" === $"childDf1.id")
.join(childDf2.alias("childDf2"), $"demoDf.id" === $"childDf2.id")
The error I get is the following. As you can see, demoDf.id is present among the input columns:
org.apache.spark.sql.AnalysisException: cannot resolve '`demoDf.id`' given input columns: [demoDf.id, childDf1.id, childDf1.value, childDf2.id, childDf2.value];;
The expected output should be a proper join, and I wonder what I'm missing. On a side note, if I simply join childDf1 with demoDf it works; this error comes up only after I add the second join.

Issue while creating spark datasets

I am trying to read a JSON file into a DataFrame and then convert it to a Dataset.
I am facing the issue below; could someone help me?
The personDF DataFrame was created successfully, and below is its schema.
scala> personDF.printSchema();
root
|-- personDF: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- line1: string (nullable = true)
| |-- postalCode: string (nullable = true)
| |-- state: string (nullable = true)
| |-- type1: string (nullable = true)
I created a case class to capture the above DF.
scala> case class address1(city:String,line1:String,postalCode:String,state:String,type1:String)
defined class address1
Below is the data that personDF has at the moment
scala> personzDF.show()
+--------------------+
| personDF|
+--------------------+
|[CENTERPORT,5 PRO...|
|[HUNTINGTON,94 JA...|
|[RIVERHEAD,9 PATT...|
|[NORTHPORT,50 LIS...|
|[NORTHPORT,24 LAU...|
|[NORTHPORT,340 SC...|
|[GREENLAWN,166 BR...|
|[MELVILLE,1 MERID...|
+--------------------+
Finally, while creating the Dataset, I get the error below.
scala> val ds = personDF.as[address1]
<console>:32: error: overloaded method value as with alternatives:
(alias: Symbol)org.apache.spark.sql.DataFrame <and>
(alias: String)org.apache.spark.sql.DataFrame
does not take type parameters
val ds = personDF.as[address1]
I did a Google search and could not find the reason.
Thanks,
Sivaram
Actually, personDF wraps everything in a single struct column (also named personDF), so the rows are not in the flat structure your case class expects, as your printSchema() shows:
|-- personDF: struct (nullable = true)
Are you trying to convert personzDF to the address1 type? Then try something like this:
val ds = personzDF.map(rec => rec.split(","))
  .map(rec => address1(rec(0), rec(1), rec(2), rec(3), rec(4)))
  .toDF()
Hope it helps.
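As an alternative, hedged sketch (not part of the answer above): on Spark 2.x with spark.implicits._ in scope, flattening the nested struct makes the top-level column names match the address1 case class, so the Dataset conversion works directly:
import spark.implicits._

// expand the single struct column into top-level columns: city, line1, postalCode, state, type1
val ds = personDF
  .select("personDF.*")
  .as[address1]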

How to query JSON data column using Spark DataFrames?

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic DataFrame for this using Spark and the spark-cassandra-connector:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
I'm struggling, though, to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the JSON string and return the blob data, something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, the schema can be determined using the schema_of_json function (please note that this assumes that an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use the from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("k", StringType, true),
  StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object, which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
It extracts the fields as individual strings, which can be further cast to the expected types.
The path argument is expressed using dot syntax, with a leading $. denoting the document root (since the code above uses string interpolation, $ has to be escaped, hence $$.).
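For example, a small hedged sketch of that follow-up cast, reusing the k/v fields from the example above (get_json_object returns strings, so the numeric field is cast explicitly):
import org.apache.spark.sql.functions.get_json_object

df.select(
  $"*",
  get_json_object($"jsonData", "$.k").alias("k"),
  get_json_object($"jsonData", "$.v").cast("double").alias("v"))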
Spark <= 1.5
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
  ("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
  ("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that the blob field cannot be represented in JSON; otherwise you can omit the splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map {
  case Row(key: String, json: String) =>
    s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use a UDF to parse the JSON and output a struct or map column. For example, something like this:
import net.liftweb.json.parse
case class KV(k: String, v: Int)
val parseJson = udf((s: String) => {
  implicit val formats = net.liftweb.json.DefaultFormats
  parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
Here are the docs for spark.read.json(): Scala API / Python API
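For example, a hedged Scala sketch of the samplingRatio option mentioned above, inferring the schema from roughly 10% of the JSON strings (assumes Spark 2.2+ and spark.implicits._ in scope for .as[String]):
import org.apache.spark.sql.functions.from_json

val json_schema = spark.read
  .option("samplingRatio", "0.1")
  .json(df.select("jsonData").as[String])
  .schema

df.withColumn("jsonData", from_json($"jsonData", json_schema))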
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// You can define whatever struct type your JSON contains
val schema = StructType(Seq(
  StructField("key", StringType, true),
  StructField("value", DoubleType, true)
))
df.withColumn("jsonData", from_json(col("jsonData"), schema))
The underlying JSON string is:
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data into Cassandra:
sqlContext.read.json(rdd)
  .select("column_name1", "column_name2", "column_name3") // or whatever field names are in your JSON
  .write.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
  .mode(SaveMode.Append)
  .save()
I use the following (available since 2.2.0; I am assuming that your JSON string column is at column index 0):
import org.apache.spark.sql.{DataFrame, SparkSession}

def parse(df: DataFrame, spark: SparkSession): DataFrame = {
  import spark.implicits._
  // pull the JSON strings out of column 0 and let Spark infer the schema
  val stringDf = df.map(row => row.getString(0))
  spark.read.json(stringDf)
}
It will automatically infer the schema of your JSON. It is documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html

Spark SQL - How to select on dates stored as UTC millis from the epoch?

I have been searching and have not found a solution as to how one might query on dates stored as UTC milliseconds from the epoch using Spark SQL. The schema I have pulled in from a NoSQL datasource (JSON from MongoDB) has the target date as:
|-- dateCreated: struct (nullable = true)
| |-- $date: long (nullable = true)
The complete schema is as follows:
scala> accEvt.printSchema
root
|-- _id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- appId: integer (nullable = true)
|-- cId: long (nullable = true)
|-- data: struct (nullable = true)
| |-- expires: struct (nullable = true)
| | |-- $date: long (nullable = true)
| |-- metadata: struct (nullable = true)
| | |-- another key: string (nullable = true)
| | |-- class: string (nullable = true)
| | |-- field: string (nullable = true)
| | |-- flavors: string (nullable = true)
| | |-- foo: string (nullable = true)
| | |-- location1: string (nullable = true)
| | |-- location2: string (nullable = true)
| | |-- test: string (nullable = true)
| | |-- testKey: string (nullable = true)
| | |-- testKey2: string (nullable = true)
|-- dateCreated: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- id: integer (nullable = true)
|-- originationDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- processedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- receivedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
and my goal is to write queries along the lines of:
SELECT COUNT(*) FROM myTable WHERE dateCreated BETWEEN [dateStoredAsLong0] AND [dateStoredAsLong1]
My process thus far has been:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@29200d25
scala> val accEvt = sqlContext.jsonFile("/home/bkarels/mongoexport/accomplishment_event.json")
...
14/10/29 15:03:38 INFO SparkContext: Job finished: reduce at JsonRDD.scala:46, took 4.668981083 s
accEvt: org.apache.spark.sql.SchemaRDD =
SchemaRDD[6] at RDD at SchemaRDD.scala:103
scala> accEvt.registerAsTable("accomplishmentEvent")
(At this point the following baseline query executes successfully)
scala> sqlContext.sql("select count(*) from accomplishmentEvent").collect.foreach(println)
...
[74475]
Now, the voodoo that I cannot get right is how to form my select statement to reason about the dates. For example, the following executes without error, but returns zero rather than the count of all records as it should (74475).
scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate >= '1970-01-01'").collect.foreach(println)
...
[0]
I have also tried some ugliness like:
scala> val now = new java.util.Date()
now: java.util.Date = Wed Oct 29 15:05:15 CDT 2014
scala> val today = now.getTime
today: Long = 1414613115743
scala> val thirtydaysago = today - (30 * 24 * 60 * 60 * 1000)
thirtydaysago: Long = 1416316083039
scala> sqlContext.sql("select count(*) from accomplishmentEvent where processedDate <= %s and processedDate >= %s".format(today,thirtydaysago)).collect.foreach(println)
As recommended, I've selected on a named field to ensure that works. So:
scala> sqlContext.sql("select receivedDate from accomplishmentEvent limit 10").collect.foreach(println)
returns:
[[1376318850033]]
[[1376319429590]]
[[1376320804289]]
[[1376320832835]]
[[1376320832960]]
[[1376320835554]]
[[1376320914480]]
[[1376321041899]]
[[1376321109341]]
[[1376321121469]]
Then, extending this to try to get some kind of date handling working, I tried:
scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.date > '1970-01-01' limit 5").collect.foreach(println)
Results in the error:
java.lang.RuntimeException: No such field date in StructType(ArrayBuffer(StructField($date,LongType,true)))
...
Prefixing our field name with $ as also suggested results in a different kind of error:
scala> sqlContext.sql("select cId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5").collect.foreach(println)
java.lang.RuntimeException: [1.69] failure: ``UNION'' expected but ErrorToken(illegal character) found
select actualConsumerId from accomplishmentEvent where receivedDate.$date > '1970-01-01' limit 5
Clearly I am not getting how to select on dates stored in this way - can anyone help me fill in this gap?
I am newer to both Scala and Spark, so forgive me if this is an elementary question, but my searches have turned up empty on the forums and Spark documentation.
Thank you.
Your JSON is not flat, so the fields below the top level need to be addressed using qualified names, such as dateCreated.$date. Your specific date fields are all of long type, so you'll need to do numerical comparisons on them and it looks like you were on the right track for doing those.
An additional problem is that your field names have "$" characters, and Spark SQL won't let you query on them. One solution is that instead of reading the JSON directly as a SchemaRDD (as you have done) you first read it as an RDD[String], use the map method to perform the Scala string manipulations of your choice, and then use SQLContext's jsonRDD method to create the SchemaRDD.
val lines = sc.textFile(...)
// you may want something less naive than global replacement of all "$" chars
val linesFixed = lines.map(s => s.replaceAllLiterally("$", ""))
val accEvt = sqlContext.jsonRDD(linesFixed)
I've tested this with Spark 1.1.0.
For reference, the lack of quoting capability in Spark SQL has been noted in this bug report (and perhaps others), and it seems the fix was recently checked in, but it will take some time to make it into a release.
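Pulling the pieces together, a hedged end-to-end sketch of the suggested approach (Spark 1.1-era API; the file path is the one from the question, and the field name dateCreated.date assumes the global "$" stripping shown above):
val lines = sc.textFile("/home/bkarels/mongoexport/accomplishment_event.json")
val linesFixed = lines.map(_.replaceAllLiterally("$", ""))
val accEvt = sqlContext.jsonRDD(linesFixed)
accEvt.registerAsTable("accomplishmentEvent")

// the dates are epoch milliseconds, so compare them as longs; use a Long
// literal (30L) so the millisecond arithmetic does not overflow Int
val now = System.currentTimeMillis()
val thirtyDaysAgo = now - 30L * 24 * 60 * 60 * 1000

sqlContext.sql(
  s"select count(*) from accomplishmentEvent " +
  s"where dateCreated.date >= $thirtyDaysAgo and dateCreated.date <= $now"
).collect.foreach(println)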