Spark Sql Rank Issue - apache-spark-sql

I am trying to apply the same rank when the audit_timestamp is the same.
I am using Spark 1.5 on CDH 5.5.
import com.databricks.spark.avro._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val loc = "/data/published/logs/a.avro"
val df = sqlContext.read.avro(loc)
val concatDF = df.withColumn("concat_log", concat(df("service_key"), df("event_start_date_time")))
val md5DF = concatDF.withColumn("nk_log", md5(concatDF("concat_log"))).drop("concat_log")
val windowFunction = Window.partitionBy(md5DF("nk_log")).orderBy(md5DF("audit_timestamp").desc)
val rankDF = md5DF.withColumn("logs_number", rank().over(windowFunction))
rankDF.filter(rankDF("event_start_date_time") === "2018-06-05T15:00:00Z")
  .select("nk_log", "event_start_date_time", "service_key", "event_sequence", "audit_timestamp", "logs_number")
  .show(100, false)
Actual output:
+--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
|nk_log                          |event_start_date_time|service_key|event_sequence|audit_timestamp    |logs_number|
+--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |371           |2018-06-10 10:05:38|1          |
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |362           |2018-06-10 10:05:38|1          |
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |386           |2018-06-08 10:05:37|1          |
Expected output:
+--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
|nk_log                          |event_start_date_time|service_key|event_sequence|audit_timestamp    |logs_number|
+--------------------------------+---------------------+-----------+--------------+-------------------+-----------+
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |371           |2018-06-10 10:05:38|1          |
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |362           |2018-06-10 10:05:38|1          |
|00ac0985bcd5197d82961a4bbf2d2709|2018-06-05T15:00:00Z |3839       |386           |2018-06-08 10:05:37|2          |
I don't know what the issue is here. How do I get the expected output?
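For what it is worth, the expected 1, 1, 2 numbering matches dense-rank semantics (rank would jump to 3 after a two-way tie). A minimal sketch of that variant, assuming the same md5DF and windowFunction as above, would be:
// denseRank() in Spark 1.5 (renamed to dense_rank() from Spark 1.6 onward)
// gives tied rows the same value and then continues with the next value, i.e. 1, 1, 2.
val denseRankDF = md5DF.withColumn("logs_number", denseRank().over(windowFunction))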

Related

Transform list of map to dataframe

I have the following data:
d = Some(List(Map(id -> 1, n -> Hi), Map(id -> 2, n -> Hello)))
I would like to transform it into a dataframe like the following:
+---+-----+
| id|    n|
+---+-----+
|  1|   Hi|
|  2|Hello|
+---+-----+
I tried the following:
import spark.implicits._
val df = d
.map( m => (m.get("id"),m.get("n")))
.toDF("id", "n")
But I'm getting:
error: value get is not a member of Any
.map( m => (m.get("id"),m.get("n")))
Your top level here is an Option, and I think that's the reason why you can't handle it with a single map. I managed to do it with something like this:
import spark.implicits._
val d = Some(List(Map("id" -> "1", "n" -> "Hi"), Map("id" -> "2", "n" -> "Hello")))
val data = d.fold(List.empty[(Option[String], Option[String])])(_.map(m => (m.get("id"), m.get("n"))))
val df = data.toDF("id", "n")
df.show()
Output:
+---+-----+
| id| n|
+---+-----+
| 1| Hi|
| 2|Hello|
+---+-----+
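For reference, a shorter variant that unwraps the Option first (my sketch, not from the answer above) would be:
// getOrElse flattens the Option to a plain List; missing map keys fall back
// to empty strings here purely for illustration.
val data2 = d.getOrElse(List.empty)
  .map(m => (m.getOrElse("id", ""), m.getOrElse("n", "")))
data2.toDF("id", "n").show()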

data frame parsing column scala

I have a problem parsing a DataFrame:
val result = df_app_clickstream.withColumn(
"attributes",
explode(expr(raw"transform(attributes, x -> str_to_map(regexp_replace(x, '{\\}',''), ' '))"))
).select(
col("userId"),
col("attributes").getField("campaign_id").alias("app_campaign_id"),
col("attributes").getField("channel_id").alias("app_channel_id")
)
result.show()
I have input like this:
-------------------------------------------------------------------------------
| userId | attributes |
-------------------------------------------------------------------------------
| f6e8252f-b5cc-48a4-b348-29d89ee4fa9e |{'campaign_id':082,'channel_id':'Chnl'}|
-------------------------------------------------------------------------------
and I need to get output like this:
--------------------------------------------------------------------
| userId | campaign_id | channel_id|
--------------------------------------------------------------------
| f6e8252f-b5cc-48a4-b348-29d89ee4fa9e | 082 | Facebook |
--------------------------------------------------------------------
but I get an error.
You can try the solution below:
import org.apache.spark.sql.functions._
import spark.implicits._
val data = Seq(("f6e8252f-b5cc-48a4-b348-29d89ee4fa9e", """{'campaign_id':082, 'channel_id':'Chnl'}""")).toDF("user_id", "attributes")
val out_df = data.withColumn("splitted_col", split(regexp_replace(col("attributes"), "'|\\}|\\{", ""), ","))
  .withColumn("campaign_id", split(element_at(col("splitted_col"), 1), ":")(1))
  .withColumn("channel_id", split(element_at(col("splitted_col"), 2), ":")(1))
out_df.show(truncate = false)
+------------------------------------+----------------------------------------+-----------------------------------+-----------+----------+
|user_id                             |attributes                              |splitted_col                       |campaign_id|channel_id|
+------------------------------------+----------------------------------------+-----------------------------------+-----------+----------+
|f6e8252f-b5cc-48a4-b348-29d89ee4fa9e|{'campaign_id':082, 'channel_id':'Chnl'}|[campaign_id:082,  channel_id:Chnl]|082        |Chnl      |
+------------------------------------+----------------------------------------+-----------------------------------+-----------+----------+
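As an alternative, here is a sketch (my addition, assuming the same data DataFrame as above) that builds a real map column with str_to_map via expr instead of splitting by hand:
// Strip quotes and braces, then let str_to_map split on ', ' between pairs
// and ':' between key and value; getItem pulls out individual values.
val map_df = data.withColumn("attr_map",
    expr("""str_to_map(regexp_replace(attributes, "['{}]", ''), ', ', ':')"""))
  .select(
    col("user_id"),
    col("attr_map").getItem("campaign_id").alias("app_campaign_id"),
    col("attr_map").getItem("channel_id").alias("app_channel_id"))
map_df.show(truncate = false)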

How to use separate key lists to perform a join between two DataFrames?

I want to join two different DataFrames (dfA and dfB) built as follows:
dfA.show()
+-----+-------+-------+
| id_A| name_A|address|
+-----+-------+-------+
| 1| AAAA| Paris|
| 4| DDDD| Sydney|
+-----+-------+-------+
dfB.show()
+-----+-------+---------+
| id_B| name_B| job|
+-----+-------+---------+
| 1| AAAA| Analyst|
| 2| AERF| Engineer|
| 3| UOPY| Gardener|
| 4| DDDD| Insurer|
+-----+-------+---------+
I need to use the following lists in order to do the join:
val keyListA = List("id_A", "name_A")
val keyListB = List("id_B", "name_B")
A simple solution would be:
val join = dfA.join(
  dfB,
  dfA("id_A") === dfB("id_B") &&
  dfA("name_A") === dfB("name_B"),
  "left_outer")
Is there a syntax that would allow you to do this join by using the keyListA and keyListB lists ?
If you really want to build your join expression from lists of column names:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._
val dfA: DataFrame = ???
val dfB: DataFrame = ???
val keyListA = List("id_A", "name_A", "property1_A", "property2_A", "property3_A")
val keyListB = List("id_B", "name_B", "property1_B", "property2_B", "property3_B")
def joinExprsFrom(keyListA: List[String], keyListB: List[String]): Column =
  keyListA
    .zip(keyListB)
    .map { case (fromA, fromB) => col(fromA) === col(fromB) }
    .reduce((acc, expr) => acc && expr)

dfA.join(
  dfB,
  joinExprsFrom(keyListA, keyListB),
  "left_outer")
You need to make sure keyListA and keyListB are the same size and non-empty.
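A sketch of that precondition check (my addition, not part of the answer) could be:
// Fail fast before building the join expression if the lists cannot be zipped.
def joinExprsFromChecked(keyListA: List[String], keyListB: List[String]): Column = {
  require(keyListA.nonEmpty, "key lists must not be empty")
  require(keyListA.size == keyListB.size, "key lists must have the same size")
  joinExprsFrom(keyListA, keyListB)
}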

How to convert dataframe value into Map[String,List[String]]?

I want to convert the dataframe below into Map[String,List[String]]. I have changed the initial dataframe to get the Name column in List format (using collect_list), but I am not able to convert it into Map[String,List[String]].
DataFrame
+---------+-------+
|City     |Name   |
+---------+-------+
|Mumbai   |[A,B]  |
|Pune     |[C,D]  |
|Delhi    |[A,D]  |
+---------+-------+
Expected Output:
Map(Mumbai -> List(A,B), Pune -> List(C,D), Delhi-> List(A,D))
You can convert to an RDD and collect it as a Map as below:
import spark.implicits._
val df = Seq(
  ("Mumbai", List("A", "B")),
  ("Pune", List("C", "D")),
  ("Delhi", List("A", "D"))
).toDF("city", "name")
val map: collection.Map[String, List[String]] = df.rdd
  .map(row => (row.getAs[String]("city"), row.getAs[Seq[String]]("name").toList))
  .collectAsMap()
Hope this helps!
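A variant that stays on the DataFrame API (my sketch, not from the answer above) collects the rows directly:
// collect() brings the rows to the driver; getSeq returns the array column
// as a Seq, which is converted to a List before building the Map.
val cityMap: Map[String, List[String]] =
  df.collect()
    .map(row => row.getString(0) -> row.getSeq[String](1).toList)
    .toMap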

How to extract values from key value map?

I have a column of type map, where the key and value change. I am trying to extract the value and create a new column.
Input:
+---------------+
|symbols        |
+---------------+
|[3pea -> 3PEA] |
|[barello -> BA]|
|[]             |
|[]             |
+---------------+
Expected output:
+-------+
|symbols|
+-------+
|3PEA   |
|BA     |
|       |
|       |
+-------+
Here is what I tried so far using a udf:
def map_value=udf((inputMap:Map[String,String])=> {inputMap.map(x=>x._2)
})
java.lang.UnsupportedOperationException: Schema for type scala.collection.immutable.Iterable[String] is not supported
Since the Spark Scala and SQL APIs v2.3 and the PySpark API v2.4, you can use the Spark SQL function map_values.
The following is in PySpark; Scala would be very similar.
Setup (assuming a working SparkSession as spark):
from pyspark.sql import functions as F
df = (
    spark.read.json(sc.parallelize(["""[
        {"key": ["3pea"], "value": ["3PEA"] },
        {"key": ["barello"], "value": ["BA"] }
    ]"""]))
    .select(F.map_from_arrays(F.col("key"), F.col("value")).alias("symbols"))
)
df.printSchema()
df.show()
root
 |-- symbols: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
+---------------+
| symbols|
+---------------+
| [3pea -> 3PEA]|
|[barello -> BA]|
+---------------+
df.select((F.map_values(F.col("symbols"))[0]).alias("map_vals")).show()
+--------+
|map_vals|
+--------+
| 3PEA|
| BA|
+--------+
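For completeness, a Scala sketch of the same map_values call (assuming an equivalent df with a symbols map column) would be:
import org.apache.spark.sql.functions._
// map_values returns the map's values as an array; take the first element.
df.select(map_values(col("symbols"))(0).alias("map_vals")).show()
A different approach, shown below, skips map_values and splits the rendered string instead: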
import org.apache.spark.sql.functions._
import spark.implicits._
val m = Seq(Array("A -> abc"), Array("B -> 0.11856755943424617"), Array("C -> kqcams"))
val df = m.toDF("map_data")
df.show
// Simulate your data I think.
val df2 = df.withColumn("xxx", split(concat_ws("", $"map_data"), "-> "))
  .select($"xxx".getItem(1).as("map_val")).drop("xxx")
df2.show(false)
results in:
+--------------------+
| map_data|
+--------------------+
| [A -> abc]|
|[B -> 0.118567559...|
| [C -> kqcams]|
+--------------------+
+-------------------+
|map_val |
+-------------------+
|abc |
|0.11856755943424617|
|kqcams |
+-------------------+