Spark dataframe transpose of columns - apache-spark-sql

I have a ticket transaction system. A sample dataframe looks like below. Every day has 2 records showing how many tickets and what value of tickets were booked through each channel (only 2 channels are possible: passenger, agent).
date,channel,ticket_qty,ticket_amount
20011231,passenger,500,2500
20011231,agent,100,1100
20020101,passenger,450,2000
20020101,agent,120,1500
I want to make it a single record per date, removing the channel column, like below:
date,passenger_ticket_qty,passenger_ticket_amount,agent_ticket_qty,agent_ticket_amount
20011231,500,2500,100,1100
20020101,450,2000,120,1500
I have achieved it in the below way:
val pas_df = spark.read.option("header", "true").csv(filepath)
  .filter($"channel" === "passenger")
val agent_df = spark.read.option("header", "true").csv(filepath)
  .filter($"channel" === "agent")
val df = pas_df.as("pdf").join(agent_df.as("adf"), $"adf.date" === $"pdf.date")
  .select($"pdf.date".as("date"),
    $"pdf.ticket_qty".as("passenger_ticket_qty"),
    $"pdf.ticket_amount".as("passenger_ticket_amount"),
    $"adf.ticket_qty".as("agent_ticket_qty"),
    $"adf.ticket_amount".as("agent_ticket_amount"))
This works as intended, but it takes around 3 hours since the file holds 40 years of records.
Is there a better way to get this done without a join?
Thanks in Advance.

Perhaps this is useful-
Load the data provided
val data =
"""
|date,channel,ticket_qty,ticket_amount
|20011231,passenger,500,2500
|20011231,agent,100,1100
|20020101,passenger,450,2000
|20020101,agent,120,1500
""".stripMargin
val stringDS = data.split(System.lineSeparator())
  // .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()
val df = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS)
df.show(false)
df.printSchema()
/**
* +--------+---------+----------+-------------+
* |date |channel |ticket_qty|ticket_amount|
* +--------+---------+----------+-------------+
* |20011231|passenger|500 |2500 |
* |20011231|agent |100 |1100 |
* |20020101|passenger|450 |2000 |
* |20020101|agent |120 |1500 |
* +--------+---------+----------+-------------+
*
* root
* |-- date: integer (nullable = true)
* |-- channel: string (nullable = true)
* |-- ticket_qty: integer (nullable = true)
* |-- ticket_amount: integer (nullable = true)
*/
Use pivot and first
df.groupBy("date")
.pivot("channel")
.agg(
first("ticket_qty").as("ticket_qty"),
first("ticket_amount").as("ticket_amount")
).show(false)
/**
* +--------+----------------+-------------------+--------------------+-----------------------+
* |date |agent_ticket_qty|agent_ticket_amount|passenger_ticket_qty|passenger_ticket_amount|
* +--------+----------------+-------------------+--------------------+-----------------------+
* |20011231|100 |1100 |500 |2500 |
* |20020101|120 |1500 |450 |2000 |
* +--------+----------------+-------------------+--------------------+-----------------------+
*/
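One tweak that may help at this scale: when the pivot values are not listed, Spark runs an extra job just to collect the distinct channel values. Since only two channels are possible, passing them explicitly skips that pass (a sketch on the same dataframe):
df.groupBy("date")
  .pivot("channel", Seq("passenger", "agent")) // explicit values avoid the extra distinct-collection job
  .agg(
    first("ticket_qty").as("ticket_qty"),
    first("ticket_amount").as("ticket_amount")
  )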

Related

Spark- check intersect of two string columns

I have a dataframe below where colA and colB contain strings. I'm trying to check whether colB contains any substring of the values in colA. The values can contain commas or spaces, but as long as any part of colB's string overlaps with colA's, it is a match. For example, row 1 below has an overlap ("bc"), and row 2 does not.
I was thinking of splitting the values into arrays, but the delimiters are not constant. Could someone please shed some light on how to do this? Many thanks for your help.
+---+-------+-----------+
| id|colA | colB |
+---+-------+-----------+
| 1|abc d | bc, z |
| 2|abcde | hj f |
+---+-------+-----------+
You could split using a regex and then create a UDF to check for substrings.
Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
data = [
    {"id": 1, "A": "abc d", "B": "bc, z, d"},
    {"id": 2, "A": "abc-d", "B": "acb, abc"},
    {"id": 3, "A": "abcde", "B": "hj f ab"},
]
df = spark.createDataFrame(data)

split_regex = r"((,)?\s|[-])"
df = df.withColumn("A", F.split(F.col("A"), split_regex))
df = df.withColumn("B", F.split(F.col("B"), split_regex))


def mapper(a, b):
    # collect every token of B that appears as a substring of some token of A
    result = []
    for ele_b in b:
        for ele_a in a:
            if ele_b in ele_a:
                result.append(ele_b)
    return result


df = df.withColumn(
    "result", F.udf(mapper, ArrayType(StringType()))(F.col("A"), F.col("B"))
)
Result:
root
|-- A: array (nullable = true)
| |-- element: string (containsNull = true)
|-- B: array (nullable = true)
| |-- element: string (containsNull = true)
|-- id: long (nullable = true)
|-- result: array (nullable = true)
| |-- element: string (containsNull = true)
+--------+-----------+---+-------+
|A |B |id |result |
+--------+-----------+---+-------+
|[abc, d]|[bc, z, d] |1 |[bc, d]|
|[abc, d]|[acb, abc] |2 |[abc] |
|[abcde] |[hj, f, ab]|3 |[ab] |
+--------+-----------+---+-------+
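If you are on Spark 2.4+, the same substring check can also be written without a UDF using SQL higher-order functions. A minimal Scala sketch, assuming the question's colA/colB layout and the same delimiter regex as above:
import org.apache.spark.sql.functions._

val df = Seq((1, "abc d", "bc, z"), (2, "abcde", "hj f")).toDF("id", "colA", "colB")
val splitRegex = "((,)?\\s|[-])"

df.withColumn("A", split($"colA", splitRegex))
  .withColumn("B", split($"colB", splitRegex))
  // keep every token of colB that is a substring of some token of colA
  .withColumn("result", expr("filter(B, y -> exists(A, x -> instr(x, y) > 0))"))
  .show(false)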
You can use a custom UDF to implement the intersect logic as below -
Data Preparation
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
import pandas as pd

sc = SparkContext.getOrCreate()
sql = SQLContext(sc)

data = {"id": [1, 2],
        "colA": ["abc d", "abcde"],
        "colB": ["bc, z", "hj f"]}

mypd = pd.DataFrame(data)
sparkDF = sql.createDataFrame(mypd)
sparkDF.show()
+---+-----+-----+
| id| colA| colB|
+---+-----+-----+
| 1|abc d|bc, z|
| 2|abcde| hj f|
+---+-----+-----+
UDF
def str_intersect(x, y):
    # character-level intersection of the two strings
    res = set(x) & set(y)
    if res:
        return ''.join(res)
    else:
        return None


str_intersect_udf = F.udf(lambda x, y: str_intersect(x, y), StringType())

sparkDF.withColumn('intersect', str_intersect_udf(F.col('colA'), F.col('colB'))).show()
+---+-----+-----+---------+
| id| colA| colB|intersect|
+---+-----+-----+---------+
| 1|abc d|bc, z| bc |
| 2|abcde| hj f| null|
+---+-----+-----+---------+
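Note that set(x) & set(y) above intersects individual characters, not tokens. If whole-token overlap is what you need, array_intersect (Spark 2.4+) avoids the UDF entirely; a Scala sketch on sample data shaped like the question's columns (exact token matches only, so a partial match like "bc" inside "abc" would not count):
import org.apache.spark.sql.functions._

val df = Seq((1, "abc d", "bc, z, d"), (2, "abcde", "hj f")).toDF("id", "colA", "colB")

df.withColumn("a_tokens", split($"colA", "[,\\s-]+"))
  .withColumn("b_tokens", split($"colB", "[,\\s-]+"))
  .withColumn("intersect", array_intersect($"a_tokens", $"b_tokens"))
  .show(false)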

query spark dataframe on max column value

I have a Hive external partitioned table with the following structure:
hdfs://my_server/stg/my_table/project=foo/project_version=2.0/dt=20210105/file1.parquet
hdfs://my_server/stg/my_table/project=foo/project_version=2.0/dt=20210110/file2.parquet
hdfs://my_server/stg/my_table/project=foo/project_version=2.1/dt=20210201/file3.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.0/dt=20210103/file4.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.1/dt=20210210/file5.parquet
hdfs://my_server/stg/my_table/project=bar/project_version=2.1/dt=20210311/file6.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file7.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file8.parquet
hdfs://my_server/stg/my_table/project=big_project/project_version=1.1/dt=20210401/file9.parquet
I want to return a dataframe containing data for the foo project, for the max version.
I want to avoid reading records for any other project.
I'm unable to query this table directly due to limitations in our ETL process, so I have tried reading directly from the parquet files:
val df_foo = spark.read.parquet("hdfs://my_server/stg/my_table/project=foo")
df_foo.printSchema
root
|-- clientid: string (nullable = true)
|-- some_field_i_care_about: string (nullable = true)
|-- project_version: double (nullable = true)
|-- dt: string (nullable = true)
df_foo.groupBy("project_version", "dt").count().show
+---------------+--------+------+
|project_version| dt| count|
+---------------+--------+------+
| 2.0|20210105|187234|
| 2.0|20210110|188356|
| 2.1|20210201|188820|
+---------------+--------+------+
val max_version = df_foo.groupBy().max("project_version")
max_version.show
+--------------------+
|max(project_version)|
+--------------------+
| 2.1|
+--------------------+
val df_foo_latest = df_foo.filter($"project_version" === max_version).count()
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.Dataset [max(project_version): double]
at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:77)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:163)
at scala.util.Try.getOrElse(Try.scala:79)
The project_version column is a double, and the max_version value is also a double, so why can't I compare these values in the filter?
Thanks for any help.
max_version is of type org.apache.spark.sql.DataFrame, not Double. You have to extract the value from the DataFrame.
Check below code.
scala> val df_foo = Seq((2.0,20210105,187234),(2.0,20210110,188356),(2.1,20210201,188820)).toDF("project_version","dt","count")
df_foo: org.apache.spark.sql.DataFrame = [project_version: double, dt: int ... 1 more field]
scala> val max_version = df_foo.groupBy().agg(max("project_version").as("version")).as[Double].collect.head
max_version: Double = 2.1
scala> val df_foo_latest = df_foo.filter($"project_version" === max_version).count()
df_foo_latest: Long = 1
scala> val df_foo_latest = df_foo.filter($"project_version" === max_version)
df_foo_latest: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [project_version: double, dt: int ... 1 more field]
scala> df_foo_latest.count
res1: Long = 1
scala> df_foo_latest.show(false)
+---------------+--------+------+
|project_version|dt |count |
+---------------+--------+------+
|2.1 |20210201|188820|
+---------------+--------+------+
Instead of extracting the value from the DataFrame, try to use an inner join. It's much safer.
scala> val max_version = df_foo.groupBy().max("project_version")
max_version: org.apache.spark.sql.DataFrame = [max(project_version): double]
scala> val max_version = df_foo.groupBy().agg(max("project_version").as("project_version"))
scala> val df_foo_latest = df_foo.join(max_version, Seq("project_version"), "inner")
scala> df_foo_latest.show(false)
+---------------+--------+------+
|project_version|dt |count |
+---------------+--------+------+
|2.1 |20210201|188820|
+---------------+--------+------+
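Since the aggregated dataframe holds exactly one row, you can also hint Spark to broadcast it so the join stays a cheap map-side operation (a sketch along the same lines):
import org.apache.spark.sql.functions.{broadcast, max}

val max_version = df_foo.groupBy().agg(max("project_version").as("project_version"))
// one-row dataframe, so broadcasting keeps the join from shuffling df_foo
val df_foo_latest = df_foo.join(broadcast(max_version), Seq("project_version"), "inner")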

Update array of structs - Spark

I have the following Spark Delta table structure:
+---+------------------------------------------------------+
|id |addresses |
+---+------------------------------------------------------+
|1 |[{"Address":"ABC", "Street": "XXX"}, {"Address":"XYZ", "Street": "YYY"}]|
+---+------------------------------------------------------+
Here the addresses column is an array of structs.
I need to update the Address inside each element of the array from that element's Street attribute value, without dropping the other elements in the list.
So, "ABC" should be updated to "XXX" and "XYZ" should be updated to "YYY".
You can assume I have many attributes in the struct, like street, zipcode, etc., so I want to leave them untouched and just update the value of Address from the Street attribute.
How can I do this in Spark or Databricks or SQL?
Schema:
|-- id: string (nullable = true)
|-- addresses: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- Address: string (nullable = true)
|    |    |-- Street: string (nullable = true)
Cheers!
Please check below code.
scala> vdf.show(false)
+---+--------------+
|id |addresses |
+---+--------------+
|1 |[[ABC], [XYZ]]|
+---+--------------+
scala> vdf.printSchema
root
|-- id: integer (nullable = false)
|-- addresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Address: string (nullable = true)
scala> val new_address = array(struct(lit("AAA").as("Address")))
scala> val except_first = array_except($"addresses",array($"addresses"(0)))
scala> val addresses = array_union(new_address,except_first).as("addresses")
scala> vdf.select($"id",addresses).select($"id",$"addresses",to_json($"addresses").as("json_addresses")).show(false)
+---+--------------+-------------------------------------+
|id |addresses |json_addresses |
+---+--------------+-------------------------------------+
|1 |[[AAA], [XYZ]]|[{"Address":"AAA"},{"Address":"XYZ"}]|
+---+--------------+-------------------------------------+
Updated
scala> vdf.withColumn("addresses", explode($"addresses"))
         .groupBy($"id")
         .agg(collect_list(struct($"addresses.Street".as("Address"), $"addresses.Street")).as("addresses"))
         .withColumn("json_data", to_json($"addresses"))
         .show(false)
+---+------------------------+-------------------------------------------------------------------+
|id |addresses |json_data |
+---+------------------------+-------------------------------------------------------------------+
|1 |[[XXX, XXX], [YYY, YYY]]|[{"Address":"XXX","Street":"XXX"},{"Address":"YYY","Street":"YYY"}]|
+---+------------------------+-------------------------------------------------------------------+
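If you are on Spark 3.1+ (which recent Databricks runtimes include), the higher-order transform together with withField can rewrite just the Address field and leave every other attribute of each struct untouched. A minimal sketch, assuming the structs carry both Address and Street as in the question:
import org.apache.spark.sql.functions._

val updated = vdf.withColumn(
  "addresses",
  // copy Street into Address element by element; all other struct fields are preserved as-is
  transform($"addresses", a => a.withField("Address", a.getField("Street")))
)
updated.show(false)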

Regex extraction in SQL

I have the following data format from which I'm trying to extract the id part,
{"memberurn"=urn:li:member:10000012}
This is my code,
CAST(regexp_extract(key.memberurn, 'urn:li:member:(\\d+)', 1) AS BIGINT) AS member_id
In the output, member_id is NULL.
What am I doing wrong here?
Try this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import LongType
spark = SparkSession.builder \
    .appName('practice') \
    .getOrCreate()
sc = spark.sparkContext

df = sc.parallelize([
    [""" {"memberurn"=urn:li:member:10000012}"""]]).toDF(["a"])
df.show(truncate=False)
+-------------------------------------+
|a |
+-------------------------------------+
| {"memberurn"=urn:li:member:10000012}|
+-------------------------------------+
df1 = df.withColumn("id", F.regexp_extract(F.col('a'), '(urn:li:member:)(\d+)', 2))
df2 = df1.withColumn("id", df1["id"].cast(LongType()))
df2.show()
+-------------------------------------+--------+
|a |id |
+-------------------------------------+--------+
| {"memberurn"=urn:li:member:10000012}|10000012|
+-------------------------------------+--------+
print(df2.printSchema())
root
|-- a: string (nullable = true)
|-- id: long (nullable = true)
In Scala -
Using regexp_extract
val df = spark.range(1).withColumn("memberurn", lit("urn:li:member:10000012"))
df.withColumn("member_id",
expr("""CAST(regexp_extract(memberurn, 'urn:li:member:(\\d+)', 1) AS BIGINT)"""))
.show(false)
/**
* +---+----------------------+---------+
* |id |memberurn |member_id|
* +---+----------------------+---------+
* |0 |urn:li:member:10000012|10000012 |
* +---+----------------------+---------+
*/
Using substring_index
df.withColumn("member_id",
substring_index($"memberurn", ":", -1).cast("bigint"))
.show(false)
/**
* +---+----------------------+---------+
* |id |memberurn |member_id|
* +---+----------------------+---------+
* |0 |urn:li:member:10000012|10000012 |
* +---+----------------------+---------+
*/
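One thing worth checking on the original query: regexp_extract returns an empty string when the pattern does not match, and CAST('' AS BIGINT) yields NULL, so a NULL member_id usually means the pattern never matched the stored value (for example because of extra backslash escaping). Verifying the expression against a literal sample (a sketch) helps narrow that down:
val sample = Seq("""{"memberurn"=urn:li:member:10000012}""").toDF("memberurn")
sample.selectExpr("""CAST(regexp_extract(memberurn, 'urn:li:member:(\\d+)', 1) AS BIGINT) AS member_id""").show(false)
// a non-NULL result here means the expression is fine and the problem lies in the stored data or its escaping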

Scala Spark Dataframe: how to explode an array of Int and array of struct at the same time

I'm new to Scala/Spark and I'm trying to explode a dataframe that has an array column and an array-of-struct column so that I end up with no arrays and no structs.
Here's an example
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
"1", Seq(4,5,6),
Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
df.printSchema
df.show
df has the following schema
root
|-- id: string (nullable = true)
|-- before: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- after: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- start_time: string (nullable = true)
| | |-- end_time: string (nullable = true)
| | |-- area: string (nullable = true)
and the data looks like
+---+---------+--------------------+
| id| before| after|
+---+---------+--------------------+
| 1|[4, 5, 6]|[[07:00, 07:30, 7...|
+---+---------+--------------------+
How do I explode the dataframe so I get the following schema
|-- id: string (nullable = true)
|-- before: integer (containsNull = false)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- after_area: string (nullable = true)
The resulting data should have 3 rows and 5 columns
+---+------+----------------+--------------+----------+
| id|before|after_start_time|after_end_time|after_area|
+---+------+----------------+--------------+----------+
|  1|     4|           07:00|         07:30|        70|
|  1|     5|           08:00|         08:30|        80|
|  1|     6|           09:00|         09:30|        90|
+---+------+----------------+--------------+----------+
I'm using Spark 2.3.0 (arrays_zip is not available), and the only solutions I can find are either for exploding two arrays of strings or one array of structs.
Use arrays_zip to combine the two arrays, then explode to flatten the array column, and use as to rename the required columns.
Since arrays_zip is not available in Spark 2.3, create a UDF to perform the same operation:
val arrays_zip = udf((before:Seq[Int],after: Seq[Area]) => before.zip(after))
Execution time with built in (spark 2.4.2) arrays_zip - Time taken: 1146 ms
Execution time with arrays_zip UDF - Time taken: 1165 ms
Check below code.
scala> df.show(false)
+---+---------+------------------------------------------------------------+
|id |before |after |
+---+---------+------------------------------------------------------------+
|1 |[4, 5, 6]|[[07:00, 07:30, 70], [08:00, 08:30, 80], [09:00, 09:30, 90]]|
+---+---------+------------------------------------------------------------+
scala> df
         .select(
           $"id",
           explode(
             arrays_zip($"before", $"after")
               .cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
           ).as("before_after")
         )
         .select(
           $"id",
           $"before_after.before".as("before"),
           $"before_after.after.start_time".as("after_start_time"),
           $"before_after.after.end_time".as("after_end_time"),
           $"before_after.after.area"
         )
         .printSchema
root
|-- id: string (nullable = true)
|-- before: integer (nullable = true)
|-- after_start_time: string (nullable = true)
|-- after_end_time: string (nullable = true)
|-- area: string (nullable = true)
Output
scala> df
         .select(
           $"id",
           explode(
             arrays_zip($"before", $"after")
               .cast("array<struct<before:int,after:struct<start_time:string,end_time:string,area:string>>>")
           ).as("before_after")
         )
         .select(
           $"id",
           $"before_after.before".as("before"),
           $"before_after.after.start_time".as("after_start_time"),
           $"before_after.after.end_time".as("after_end_time"),
           $"before_after.after.area"
         )
         .show(false)
+---+------+----------------+--------------+----+
|id |before|after_start_time|after_end_time|area|
+---+------+----------------+--------------+----+
|1 |4 |07:00 |07:30 |70 |
|1 |5 |08:00 |08:30 |80 |
|1 |6 |09:00 |09:30 |90 |
+---+------+----------------+--------------+----+
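If you would rather avoid the UDF on Spark 2.3, posexplode can pair the two arrays by position using built-ins only. A sketch, assuming before and after always have the same length:
df.select($"id", posexplode($"before").as(Seq("pos", "before")), $"after")
  .select(
    $"id",
    $"before",
    expr("after[pos].start_time").as("after_start_time"),
    expr("after[pos].end_time").as("after_end_time"),
    expr("after[pos].area").as("after_area")
  )
  .show(false)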
To handle a more complex struct you can:
Declare two beans, Area (input) and Area2 (output)
Map each row to an Area2 bean
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import scala.collection.mutable

object ExplodeTwoArrays {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    import spark.implicits._

    val df = Seq((
      "1", Seq(4, 5, 6),
      Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
    )).toDF("id", "before", "after")

    val outDf = df.map(row => {
      val id = row.getString(0)
      val beforeArray: Seq[Int] = row.getSeq[Int](1)
      val afterArray: mutable.WrappedArray[Area2] =
        row.getAs[mutable.WrappedArray[GenericRowWithSchema]](2) // Need to map Array(Struct) to something compatible
          .zipWithIndex // Required to iterate with indices
          .map { case (element, i) =>
            Area2(element.getAs[String]("start_time"),
              element.getAs[String]("end_time"),
              element.getAs[String]("area"),
              beforeArray(i))
          }
      (id, afterArray) // Return row(id, Array(Area2(...)))
    }).toDF("id", "after")

    outDf.printSchema()
    outDf.show()
  }
}

case class Area(start_time: String, end_time: String, area: String)
case class Area2(start_time: String, end_time: String, area: String, before: Int)
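Note that outDf above still carries after as an array of Area2, so one more explode is needed to reach the flat 3-row layout asked for (a sketch):
outDf
  .select($"id", explode($"after").as("a"))
  .select(
    $"id",
    $"a.before",
    $"a.start_time".as("after_start_time"),
    $"a.end_time".as("after_end_time"),
    $"a.area".as("after_area")
  )
  .show(false)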