Query Spark dataframe on max column value

I have a Hive external partitioned table with the following data structure:
hdfs://my_server/stg/my_table/project=foo/project_version=2.0/dt=20210105/file1.parquet
hdfs://my_server/stg/my_table/project=foo/project_version=2.0/dt=20210110/file2.parquet
hdfs://my_server/stg/my_table/project=foo/project_version=2.1/dt=20210201/file3.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.0/dt=20210103/file4.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.1/dt=20210210/file5.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.1/dt=20210311/file6.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file7.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file8.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file9.parquet
I want to return a dataframe containing data for the foo project, for the max project_version.
I want to avoid reading records for any other project.
I'm unable to query this table directly due to limitations in our ETL process, so I have tried reading directly from the parquet files:
val df_foo = spark.read.parquet("hdfs://my_server/stg/my_table/project=foo")
df_foo.printSchema
root
|-- clientid: string (nullable = true)
|-- some_field_i_care_about: string (nullable = true)
|-- project_version: double (nullable = true)
|-- dt: string (nullable = true)
df_foo.groupBy("project_version", "dt").count().show
+---------------+--------+------+
|project_version| dt| count|
+---------------+--------+------+
| 2.0|20210105|187234|
| 2.0|20210110|188356|
| 2.1|20210201|188820|
+---------------+--------+------+
val max_version = df_foo.groupBy().max("project_version")
max_version.show
+--------------------+
|max(project_version)|
+--------------------+
| 2.1|
+--------------------+
val df_foo_latest = df_foo.filter($"project_version" === max_version).count()
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [max(project_version): double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
The project_version column is a double, and the max_version value is also a double, so why can't I compare these values in the filter?
Thanks for any help.

max_version is of type org.apache.spark.sql.DataFrame, not Double. You have to extract the value from the DataFrame.
Check the code below.
scala> val max_version = df.groupBy().agg(max("project_version").as("version")).as[Double].collect.head
max_version: Double = 2.1
scala> val df_foo = Seq((2.0,20210105,187234),(2.0,20210110,188356),(2.1,20210201,188820)).toDF("project_version","dt","count")
df_foo: org.apache.spark.sql.DataFrame = [project_version: double, dt: int ... 1 more field]
scala> val max_version = df_foo.groupBy().agg(max("project_version").as("version")).as[Double].collect.head
max_version: Double = 2.1
scala> val df_foo_latest = df_foo.filter($"project_version" === max_version).count()
df_foo_latest: Long = 1
scala> val df_foo_latest = df_foo.filter($"project_version" === max_version)
df_foo_latest: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [project_version: double, dt: int ... 1 more field]
scala> df_foo_latest.count
res1: Long = 1
scala> df_foo_latest.show(false)
+---------------+--------+------+
|project_version|dt |count |
+---------------+--------+------+
|2.1 |20210201|188820|
+---------------+--------+------+
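For reference, the same scalar can also be pulled out without the typed Dataset conversion; a minimal sketch on the same df_foo (spark-shell style, $ implicits in scope):
import org.apache.spark.sql.functions.max
// agg without groupBy yields a one-row DataFrame; read its single value directly.
val maxVersion: Double = df_foo.agg(max($"project_version")).first().getDouble(0)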
Instead of extracting the value from the DataFrame, try an inner join. It's much safer.
scala> val max_version = df_foo.groupBy().max("project_version")
max_version: org.apache.spark.sql.DataFrame = [max(project_version): double]
scala> val max_version = df_foo.groupBy().agg(max("project_version").as("project_version"))
scala> val df_foo_latest = df_foo.join(max_version, Seq("project_version"), "inner")
scala> df_foo_latest.show(false)
+---------------+--------+------+
|project_version|dt |count |
+---------------+--------+------+
|2.1 |20210201|188820|
+---------------+--------+------+
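A further option, not shown above, is a window function, which finds the max and filters in a single query; a sketch, assuming df_foo as read in the question and spark.implicits._ in scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
// Global window: every row sees the overall max project_version.
// Caveat: an empty partitionBy moves all rows into one partition for the window step.
val w = Window.partitionBy()
val df_foo_latest = df_foo
  .withColumn("max_version", max($"project_version").over(w))
  .filter($"project_version" === $"max_version")
  .drop("max_version")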

Related

How to convert a DataFrame map column to a struct column?

Suppose we have a DataFrame with a column of map type.
val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df.show()
// +--------------------+
// | mapColumn|
// +--------------------+
// |{foo -> 1, bar -> 2}|
// +--------------------+
What is the most straightforward way to convert it to a struct (or, equivalently, define a new column with the same keys and values but as a struct type)? See the following spark-shell (2.4.5) session for an insanely inefficient way of going about it:
val jsonStr = df.select(to_json($"mapColumn")).collect()(0)(0).asInstanceOf[String]
spark.read.json(Seq(jsonStr).toDS()).show()
// +---+---+
// |bar|foo|
// +---+---+
// | 2| 1|
// +---+---+
Now, obviously collect() is very inefficient, and this is generally an awful way to do things in Spark. But what is the preferred way to accomplish this conversion? named_struct and struct both take a sequence of parameter values to construct the results, but I can't find any way to "unwrap" the map key/values to pass them to these functions.
I would use the explode function:
+--------------------+
| mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
df.select(explode('mapColumn)).select(struct('*).as("struct"))
output:
+--------+
| struct|
+--------+
|[foo, 1]|
|[bar, 2]|
+--------+
root
|-- struct: struct (nullable = false)
| |-- key: string (nullable = false)
| |-- value: integer (nullable = false)
I see #chlebek's answer, but in case it should be kept in one row you can use a UDF:
scala> val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
df: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>]
scala> df.show
+--------------------+
| mapColumn|
+--------------------+
|[foo -> 1, bar -> 2]|
+--------------------+
scala> case class KeyValue(key: String, value: String)
defined class KeyValue
scala> val toArrayOfStructs = udf((value: Map[String, String]) => value.map {
| case (k, v) => KeyValue(k, v)
| }.toArray )
toArrayOfStructs: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StructType(StructField(key,StringType,true), StructField(value,StringType,true)),true),Some(List(MapType(StringType,StringType,true))))
scala> df.withColumn("alfa", toArrayOfStructs(col("mapColumn")))
res4: org.apache.spark.sql.DataFrame = [mapColumn: map<string,int>, alfa: array<struct<key:string,value:string>>]
scala> res4.show
+--------------------+--------------------+
| mapColumn| alfa|
+--------------------+--------------------+
|[foo -> 1, bar -> 2]|[[foo, 1], [bar, 2]]|
+--------------------+--------------------+
scala> res4.printSchema
root
|-- mapColumn: map (nullable = false)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- alfa: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
Your method doesn't seem to work with more rows, and especially if they have different maps, like this one:
val df = Seq(
(Map("foo"->1, "bar"->2)),
(Map("foo"->3, "baz"->4))
).toDF("mapColumn")
df.show()
// +--------------------+
// | mapColumn|
// +--------------------+
// |{foo -> 1, bar -> 2}|
// |{foo -> 3, baz -> 4}|
// +--------------------+
Your script would return...
val jsonStr = df.select(to_json($"mapColumn")).collect()(0)(0).asInstanceOf[String]
spark.read.json(Seq(jsonStr).toDS()).show()
// +---+---+
// |bar|foo|
// +---+---+
// | 2| 1|
// +---+---+
Solutions
map to columns:
val json_col = to_json($"mapColumn")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.withColumn("_c", from_json(json_col, json_schema)).select("_c.*")
df2.show()
// +----+----+---+
// | bar| baz|foo|
// +----+----+---+
// | 2|null| 1|
// |null| 4| 3|
// +----+----+---+
map to struct (field names: "key", "value"):
val df2 = df.select(explode(map_entries($"mapColumn")).as("struct"))
df2.show()
// +--------+
// | struct|
// +--------+
// |{foo, 1}|
// |{bar, 2}|
// |{foo, 3}|
// |{baz, 4}|
// +--------+
map to struct (field names: "foo", "bar", "baz"):
val json_col = to_json($"mapColumn")
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.select(from_json(json_col, json_schema).as("struct"))
df2.show()
// +------------+
// | struct|
// +------------+
// |{2, null, 1}|
// |{null, 4, 3}|
// +------------+
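If the key set is known up front, the JSON round-trip can be skipped entirely; a minimal sketch, assuming the keys are foo, bar and baz (missing keys simply become null):
import org.apache.spark.sql.functions.{col, struct}
val keys = Seq("foo", "bar", "baz")
val df2 = df.select(
  struct(keys.map(k => col("mapColumn").getItem(k).as(k)): _*).as("struct")
)
df2.show()
// +------------+
// |      struct|
// +------------+
// |{1, 2, null}|
// |{3, null, 4}|
// +------------+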
Define a case class:
case class Bean56(foo: Int, bar: Int)
// Define a bean Encoder
import org.apache.spark.sql.Encoders
val personEncoder = Encoders.bean[Bean56](classOf[Bean56])
val df = spark.sql("""select map("foo", 1, "bar", 2) AS mapColumn""")
//Map the output to required bean
val Bean56s = df.map(row => {
  val map = row.getMap[String, Int](0)
  Bean56(map.getOrElse("foo", -1), map.getOrElse("bar", -1))
})(personEncoder) // Supply implicit Encoder of the bean
Bean56s.foreach(println(_)) // Print the bean
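A side note on the encoder: Bean56 is a Scala case class rather than a Java bean, so the product encoder that comes with spark.implicits._ is the more usual fit; a minimal sketch, assuming those implicits are in scope:
import spark.implicits._ // provides the product encoder for case classes
val bean56s = df.map { row =>
  val m = row.getMap[String, Int](0)
  Bean56(m.getOrElse("foo", -1), m.getOrElse("bar", -1))
}
bean56s.show()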

Spark dataframe transpose of columns

I have a ticket transaction system. A sample dataframe looks like below. Every day has 2 records showing how many tickets, and the value of tickets, were booked through each channel (only 2 channels are possible: passenger, agent).
date,channel,ticket_qty,ticket_amount
20011231,passenger,500,2500
20011231,agent,100,1100
20020101,passenger,450,2000
20020101,agent,120,1500
I want to make it a single record per date, removing the channel column, like below:
date,passenger_ticket_qty,passenger_ticket_amount,agent_ticket_qty,agent_ticket_amount
20011231,500,2500,100,1100
20020101,450,2000,120,1500
I have achieved it in the way below.
val pas_df = spark.read.option("header", "true").csv(filepath)
  .filter($"channel" === "passenger")
val agent_df = spark.read.option("header", "true").csv(filepath)
  .filter($"channel" === "agent")
val df = pas_df.as("pdf").join(agent_df.as("adf"), $"adf.date" === $"pdf.date")
.select($"pdf.date" as date,
$"pdf.ticket_qty" as passenger_ticket_qty,
$"pdf.ticket_amount" as passenger_ticket_amount,
$"adf.ticket_qty" agent_ticket_qty,
$"adf.ticket_amount" as agent_ticket_amount)
This works fine, but it takes around 3 hours since the file has 40 years of records.
Is there a better way to get this done without a join?
Thanks in advance.
Perhaps this is useful.
Load the data provided:
val data =
"""
|date,channel,ticket_qty,ticket_amount
|20011231,passenger,500,2500
|20011231,agent,100,1100
|20020101,passenger,450,2000
|20020101,agent,120,1500
""".stripMargin
val stringDS = data.split(System.lineSeparator())
// .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
.toSeq.toDS()
val df = spark.read
.option("sep", ",")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS)
df.show(false)
df.printSchema()
/**
* +--------+---------+----------+-------------+
* |date |channel |ticket_qty|ticket_amount|
* +--------+---------+----------+-------------+
* |20011231|passenger|500 |2500 |
* |20011231|agent |100 |1100 |
* |20020101|passenger|450 |2000 |
* |20020101|agent |120 |1500 |
* +--------+---------+----------+-------------+
*
* root
* |-- date: integer (nullable = true)
* |-- channel: string (nullable = true)
* |-- ticket_qty: integer (nullable = true)
* |-- ticket_amount: integer (nullable = true)
*/
Use pivot and first
df.groupBy("date")
.pivot("channel")
.agg(
first("ticket_qty").as("ticket_qty"),
first("ticket_amount").as("ticket_amount")
).show(false)
/**
* +--------+----------------+-------------------+--------------------+-----------------------+
* |date |agent_ticket_qty|agent_ticket_amount|passenger_ticket_qty|passenger_ticket_amount|
* +--------+----------------+-------------------+--------------------+-----------------------+
* |20011231|100 |1100 |500 |2500 |
* |20020101|120 |1500 |450 |2000 |
* +--------+----------------+-------------------+--------------------+-----------------------+
*/
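Since only two channels are possible, passing them explicitly to pivot avoids the extra pass Spark otherwise needs to discover the distinct pivot values; a sketch on the same df:
import org.apache.spark.sql.functions.first
df.groupBy("date")
  .pivot("channel", Seq("passenger", "agent"))
  .agg(
    first("ticket_qty").as("ticket_qty"),
    first("ticket_amount").as("ticket_amount")
  ).show(false)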

Scala Spark Dataframe: how to explode an array of Int and array of struct at the same time

I'm new to Scala/Spark and I'm trying to explode a dataframe that has an array column and an array-of-struct column, so that I end up with no arrays and no structs.
Here's an example
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(4,5,6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
df.printSchema
df.show
df has the following schema
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
and the data looks like
+---+---------+--------------------+
| id| before| after|
+---+---------+--------------------+
| 1|[4, 5, 6]|[[07:00, 07:30, 7...|
+---+---------+--------------------+
How do I explode the dataframe so I get the following schema
|-- id: string (nullable = true)
|-- before: integer (containsNull = false)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- after_area: string (nullable = true)
The resulting data should have 3 rows and 5 columns
+---+------+----------------+--------------+----------+
| id|before|after_start_time|after_end_time|after_area|
+---+------+----------------+--------------+----------+
|  1|     4|           07:00|         07:30|        70|
|  1|     5|           08:00|         08:30|        80|
|  1|     6|           09:00|         09:30|        90|
+---+------+----------------+--------------+----------+
I'm using Spark 2.3.0 (arrays_zip is not available), and the only solutions I can find are either for exploding two arrays of String or one array of struct.
Use arrays_zip to combine the two arrays, then explode to explode the zipped array column, and use as to rename the required columns.
Since arrays_zip is not available in Spark 2.3, I created a UDF to perform the same operation.
val arrays_zip = udf((before:Seq[Int],after: Seq[Area]) => before.zip(after))
Execution time with built in (spark 2.4.2) arrays_zip - Time taken: 1146 ms
Execution time with arrays_zip UDF - Time taken: 1165 ms
Check the code below.
scala> df.show(false)
+---+---------+------------------------------------------------------------+
|id |before |after |
+---+---------+------------------------------------------------------------+
|1 |[4, 5, 6]|[[07:00, 07:30, 70], [08:00, 08:30, 80], [09:00, 09:30, 90]]|
+---+---------+------------------------------------------------------------+
scala>
df
  .select(
    $"id",
    explode(
      arrays_zip($"before", $"after")
        .cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
    ).as("before_after")
  )
  .select(
    $"id",
    $"before_after.before".as("before"),
    $"before_after.after.start_time".as("after_start_time"),
    $"before_after.after.end_time".as("after_end_time"),
    $"before_after.after.area"
  )
  .printSchema
root
|-- id: string (nullable = true)
|-- before: integer (nullable = true)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- area: string (nullable = true)
Output
scala>
df
  .select(
    $"id",
    explode(
      arrays_zip($"before", $"after")
        .cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
    ).as("before_after")
  )
  .select(
    $"id",
    $"before_after.before".as("before"),
    $"before_after.after.start_time".as("after_start_time"),
    $"before_after.after.end_time".as("after_end_time"),
    $"before_after.after.area"
  )
  .show(false)
+---+------+----------------+--------------+----+
|id |before|after_start_time|after_end_time|area|
+---+------+----------------+--------------+----+
|1 |4 |07:00 |07:30 |70 |
|1 |5 |08:00 |08:30 |80 |
|1 |6 |09:00 |09:30 |90 |
+---+------+----------------+--------------+----+
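Another UDF-free route that also works on Spark 2.3 is to posexplode the integer array and index the struct array by the generated position; a sketch, assuming the same df and spark-shell implicits:
import org.apache.spark.sql.functions.{expr, posexplode}
val result = df
  .select($"id", posexplode($"before").as(Seq("pos", "before")), $"after")
  .select(
    $"id",
    $"before",
    expr("after[pos].start_time").as("after_start_time"),
    expr("after[pos].end_time").as("after_end_time"),
    expr("after[pos].area").as("after_area")
  )
result.show(false)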
To handle a more complex struct you can:
Declare two case classes, Area (input) and Area2 (output)
Map each row to an Area2 bean
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import scala.collection.mutable

object ExplodeTwoArrays {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    import spark.implicits._
    val df = Seq((
      "1", Seq(4, 5, 6),
      Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
    )).toDF("id", "before", "after")
    val outDf = df.map(row => {
      val id = row.getString(0)
      val beforeArray: Seq[Int] = row.getSeq[Int](1)
      val afterArray: mutable.WrappedArray[Area2] =
        row.getAs[mutable.WrappedArray[GenericRowWithSchema]](2) // Need to map Array(Struct) to something compatible
          .zipWithIndex // Required to iterate with indices
          .map { case (element, i) =>
            Area2(element.getAs[String]("start_time"),
              element.getAs[String]("end_time"),
              element.getAs[String]("area"),
              beforeArray(i))
          }
      (id, afterArray) // Return row(id, Array(Area2(...)))
    }).toDF("id", "after")
    outDf.printSchema()
    outDf.show()
  }
}
case class Area(start_time: String, end_time: String, area: String)
case class Area2(start_time: String, end_time: String, area: String, before: Int)

Schema Conversion from String datatype to Array(Map(Array)) datatype in Pyspark

We are reading data from DynamoDB, so we get the datatype as string, but we want to write the string data type as array(map(array)).
String data:
{"policy_details":[{"cdhid":" 123","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}, {"cdhid":" 1234","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}]}
Output required:
The string data type needs to be converted to ARRAY(MAP(ARRAY)).
We have tried with the schema below:
ArrayType([
StructField("policy_num", MapType(ArrayType([
StructField("cdhid", StringType(), True),
StructField("role_id", StringType(), True),
StructField("role_desc", StringType(), True)
])))
])
getting below issue:
elementType [StructField(cdhid,StringType,true),
StructField(role_id,StringType,true),
StructField(role_desc,StringType,true)] should be an instance of < class 'pyspark.sql.types.DataType' >
Regarding your data, the schema you want is not the one that fits.
The schema of your data is:
from pyspark.sql import types as T
schm = T.StructType([T.StructField("policy_details",T.ArrayType(T.StructType([
T.StructField("cdhid", T.StringType(), True),
T.StructField("p2cid", T.StringType(), True),
T.StructField("roleDesc", T.StringType(), True),
T.StructField("positionnum", T.StringType(), True),
])), True)])
Then, you just need to use the from_json function.
from pyspark.sql import functions as F
df.show()
+--------------------+
| db_data|
+--------------------+
|{"policy_details"...|
+--------------------+
new_df = df.select(F.from_json("db_data", schm).alias("data"))
new_df.printSchema()
root
|-- data: struct (nullable = true)
| |-- policy_details: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- cdhid: string (nullable = true)
| | | |-- p2cid: string (nullable = true)
| | | |-- roleDesc: string (nullable = true)
| | | |-- positionnum: string (nullable = true)
EDIT: If you want to use MapType, you can replace the schema with:
schm = T.StructType([
T.StructField(
"policy_details",
T.ArrayType(T.MapType(
T.StringType(),
T.StringType()
)),
True
)
])

where clause not working in spark sql dataframe

I've created a dataframe which contains 3 columns: zip, lat, lng.
I want to select the lat and lng values where zip = 00650.
So, I tried using:
sqlContext.sql("select lat,lng from census where zip=00650").show()
But it returns an ArrayIndexOutOfBoundsException because the result does not have any values in it.
If I remove the where clause it runs fine.
Can someone please explain what I am doing wrong?
Update:
dataframe schema:
root
|-- zip: string (nullable = true)
|-- lat: string (nullable = true)
|-- lng: string (nullable = true)
The first rows are:
+-----+---------+-----------+
| zip| lat| lng|
+-----+---------+-----------+
|00601|18.180555| -66.749961|
|00602|18.361945| -67.175597|
|00603|18.455183| -67.119887|
|00606|18.158345| -66.932911|
|00610|18.295366| -67.125135|
|00612|18.402253| -66.711397|
|00616|18.420412| -66.671979|
|00617|18.445147| -66.559696|
|00622|17.991245| -67.153993|
|00623|18.083361| -67.153897|
|00624|18.064919| -66.716683|
|00627|18.412600| -66.863926|
|00631|18.190607| -66.832041|
|00637|18.076713| -66.947389|
|00638|18.295913| -66.515588|
|00641|18.263085| -66.712985|
|00646|18.433150| -66.285875|
|00647|17.963613| -66.947127|
|00650|18.349416| -66.578079|
+-----+---------+-----------+
As you can see in your schema, zip is of type String, so your query should be something like this:
sqlContext.sql("select lat, lng from census where zip = '00650'").show()
Update:
If you are using Spark 2 then you can do this:
import sparkSession.sqlContext.implicits._
val dataFrame = Seq(("10.023", "75.0125", "00650"),("12.0246", "76.4586", "00650"), ("10.023", "75.0125", "00651")).toDF("lat","lng", "zip")
dataFrame.printSchema()
dataFrame.select("*").where(dataFrame("zip") === "00650").show()
dataFrame.registerTempTable("census")
sparkSession.sqlContext.sql("SELECT lat, lng FROM census WHERE zip = '00650'").show()
output:
root
|-- lat: string (nullable = true)
|-- lng: string (nullable = true)
|-- zip: string (nullable = true)
+-------+-------+-----+
| lat| lng| zip|
+-------+-------+-----+
| 10.023|75.0125|00650|
|12.0246|76.4586|00650|
+-------+-------+-----+
+-------+-------+
| lat| lng|
+-------+-------+
| 10.023|75.0125|
|12.0246|76.4586|
+-------+-------+
I resolved my issue using an RDD rather than a DataFrame. It provided the desired results:
val data = sc.textFile("/home/ishan/Desktop/c").map(_.split(","))
val arr=data.filter(_.contains("00650")).take(1)
arr.foreach{a => a foreach println}
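For completeness, the same result is available from the DataFrame API once the zip value is quoted as a string; a sketch, assuming the census table registered in the question is still available:
import org.apache.spark.sql.functions.col
val latLng = sqlContext.table("census")
  .filter(col("zip") === "00650")
  .select("lat", "lng")
latLng.show()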