How to create a date range mapping in Spark? - apache-spark-sql

I have a spark dataframe with following structure.
+-------+-------------------+
|country| date_published|
+-------+-------------------+
| UK|2020-04-15 00:00:00|
| UK|2020-04-14 00:00:00|
| UK|2020-04-09 00:00:00|
| UK|2020-04-08 00:00:00|
| UK|2020-04-07 00:00:00|
| UK|2020-04-06 00:00:00|
| UK|2020-04-03 00:00:00|
| UK|2020-04-02 00:00:00|
| UK|2020-04-01 00:00:00|
| UK|2020-03-31 00:00:00|
| UK|2020-03-30 00:00:00|
| UK|2020-03-27 00:00:00|
| UK|2020-03-26 00:00:00|
| UK|2020-03-25 00:00:00|
| UK|2020-03-24 00:00:00|
| UK|2020-03-23 00:00:00|
| UK|2020-03-20 00:00:00|
| UK|2020-03-19 00:00:00|
| UK|2020-03-18 00:00:00|
| UK|2020-03-17 00:00:00|
+-------+-------------------+
I want to create a date mapping based on this data. Conditions:
1. All dates from 2020-01-01 up to the latest date_published should be mapped as "YTD".
2. All dates going back one year from the latest date_published (i.e. back to 2019-04-15) should be mapped as "LAST_1_YEAR".
3. All dates from 2019-01-01 till 2019-04-15 (last year's date as of today) should be mapped as "YTD_LAST_YEAR".
4. All dates in the year before 2019-04-15 should be mapped as "YEAR_AGO_1_YEAR".
We can create two columns, e.g. ytd_map (conditions 1 and 3) and last_year_map (conditions 2 and 4).
There could be other countries in the data, and the above conditions should work for them as well, relative to each country's own latest date_published.
The approach I tried is to create a dataframe with the max date_published for each country, but I am not sure how to filter the dataframe for each country separately.
df_data = df_data_cleaned.select("date_published","country").distinct().orderBy(F.desc("date_published"))
df_max_dt = df_data.groupBy("country").agg(F.max(F.col("date_published")))
df_max_dt.collect()
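One way to avoid filtering the dataframe per country is to join the per-country max back onto the original dataframe and derive the mapping columns with the DataFrame API. Below is a minimal sketch of that idea, reusing df_data_cleaned from above; the other variable and column names (df_joined, df_mapped, max_date_published, max_date_last_year, start_of_year) are only illustrative:

from pyspark.sql import functions as F

# per-country latest date_published
df_max_dt = df_data_cleaned.groupBy("country").agg(
    F.max("date_published").alias("max_date_published")
)

# join the per-country max back so every row carries its own country's latest date
df_joined = df_data_cleaned.join(df_max_dt, on="country", how="left")

df_mapped = (
    df_joined
    .withColumn("max_date_last_year", F.add_months("max_date_published", -12))
    .withColumn("start_of_year", F.date_trunc("year", F.col("max_date_published")))
    .withColumn(
        "MAT_MAPPING",
        F.when(F.col("date_published") >= F.col("max_date_last_year"), "LAST_1_YEAR")
         .when(F.col("date_published") >= F.add_months("max_date_last_year", -12), "YEAR_AGO_1_YEAR")
         .otherwise(""),
    )
    # YTD_MAPPING can be derived the same way from start_of_year
    # and date_trunc('year', max_date_last_year)
)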

I tried the following and it is working as of now:
spark.sql("select country,\
date_published,\
(case when date_published >= max_date_published_last_year then 'LAST_1_YEAR'\
when date_published <= max_date_published_last_year and date_published >= add_months(max_date_published_last_year, -12) then 'YEAR_AGO_1_YEAR' else '' end) as MAT_MAPPING,\
(case when date_published >= date_published_start_of_year then 'YTD'\
when date_published <= max_date_published_last_year and date_published >= date_published_start_of_last_year\
then 'YTD_LAST_YEAR'\
else '' end) as YTD_MAPPING from\
(select t.country, t.date_published, t.date_published_ya, t.max_date_published_current_year,\
cast(add_months(t.max_date_published_current_year, -12) as timestamp) as max_date_published_last_year,\
date_trunc('year', max_date_published_current_year) AS date_published_start_of_year,\
date_trunc('year', cast(add_months(t.max_date_published_current_year, -12) as timestamp)) AS date_published_start_of_last_year\
from\
(select country,\
date_published,cast(add_months(date_published, -12) as timestamp) as date_published_ya,\
max(date_published)over(partition by country order by date_published desc) max_date_published_current_year from df_mintel_time) t) t2")
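
Note that for the spark.sql call above to find df_mintel_time, the source DataFrame has to be registered as a temp view under that name first, e.g. (assuming df_data_cleaned is the DataFrame being queried):

df_data_cleaned.createOrReplaceTempView("df_mintel_time")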

Related

How can I get all daily dates between two given dates in a DataFrame

I have this DataFrame of Spark:
+-------------------+-------------------+
| first | last |
+-------------------+-------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00|
+-------------------+-------------------+
I need to get the dates from first (exclusive) to last (inclusive), day by day.
My desired output would be:
+-------------------+-------------------+-----------------------------------------------------------------+
| first | last | array_dates
+-------------------+-------------------+-----------------------------------------------------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00| [2022-11-04 00:00:00, 2022-11-05 00:00:00, 2022-11-06 00:00:00] |
+-------------------+-------------------+-----------------------------------------------------------------+
Since Spark 2.4, you can use the built-in sequence Spark function as follows:
import org.apache.spark.sql.functions.{col, date_add, expr, sequence}

val result = df.withColumn(
  "array_dates",
  sequence(
    date_add(col("first"), 1),
    col("last"),
    expr("INTERVAL 1 DAY")
  )
)
With the following input df:
+-------------------+-------------------+
|first |last |
+-------------------+-------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00|
+-------------------+-------------------+
You get the following output result:
+-------------------+-------------------+---------------------------------------------------------------+
|first |last |array_dates |
+-------------------+-------------------+---------------------------------------------------------------+
|2022-11-03 00:00:00|2022-11-06 00:00:00|[2022-11-04 00:00:00, 2022-11-05 00:00:00, 2022-11-06 00:00:00]|
+-------------------+-------------------+---------------------------------------------------------------+
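For reference, a roughly equivalent PySpark sketch (assuming the same input df with timestamp columns first and last):

from pyspark.sql import functions as F

result = df.withColumn(
    "array_dates",
    # start one day after `first`, stop at `last`, stepping one day at a time
    F.expr("sequence(first + interval 1 day, last, interval 1 day)"),
)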

First and last not null fields over partitions

I have a table like this:
+-------+--------------------------+-----+-----+
|EventID|EventTime                 |AttrA|AttrB|
+-------+--------------------------+-----+-----+
|1      |2022-10-01 00:00:01.000000|null |null |
|1      |2022-10-01 00:00:02.000000|a    |null |
|1      |2022-10-01 00:00:03.000000|b    |1    |
|1      |2022-10-01 00:00:04.000000|null |null |
|2      |2022-10-01 00:01:01.000000|aa   |11   |
|2      |2022-10-01 00:01:02.000000|bb   |null |
|2      |2022-10-01 00:01:03.000000|null |null |
|2      |2022-10-01 00:01:04.000000|aa   |22   |
+-------+--------------------------+-----+-----+
and I want to jump across the records to return the first and last non-null AttrA and AttrB values for each EventID based on the EventTime. Each EventID can have multiple records, so we can't know in advance where the non-nulls may be. The desired results would be:
+-------+----------+---------+----------+---------+
|EventID|FirstAttrA|LastAttrA|FirstAttrB|LastAttrB|
+-------+----------+---------+----------+---------+
|1      |a         |b        |1         |1        |
|2      |aa        |aa       |11        |22       |
+-------+----------+---------+----------+---------+
What I did was add row_number() OVER (PARTITION BY event_id ORDER BY event_time ASC), and again with DESC, and then use multiple CTEs like this:
WITH enhanced_table AS
(
    SELECT
        eventID,
        attrA,
        attrB,
        row_number() OVER (PARTITION BY event_id ORDER BY event_time ASC) as rn,
        row_number() OVER (PARTITION BY event_id ORDER BY event_time DESC) as reversed_rn
),
first_events_with_attrA AS
(
    SELECT
        eventID,
        FIRST(attrA) OVER (PARTITION BY eventID ORDER BY rn ASC) AS url
    FROM enhanced_table
    WHERE attrA IS NOT NULL
)...
But I need one CTE that scans the table again for each case I want (4 CTEs in total for this example). It works, but it is slow.
Is there a more efficient way to grab the values I am interested in?
No need to build row numbers; you can directly use the native Spark SQL functions FIRST and LAST with isIgnoreNull set to true to achieve the intended results.
Data Preparation
# imports needed for this snippet (the `sql` object used below is assumed to be an existing SparkSession)
from io import StringIO
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

s = StringIO("""
EventID,EventTime,AttrA,AttrB
1,2022-10-01 00:00:01.000000,,
1,2022-10-01 00:00:02.000000,a,
1,2022-10-01 00:00:03.000000,b,1
1,2022-10-01 00:00:04.000000,,
2,2022-10-01 00:01:01.000000,aa,11
2,2022-10-01 00:01:02.000000,bb,
2,2022-10-01 00:01:03.000000,,
2,2022-10-01 00:01:04.000000,aa,22
"""
)
inp_schema = StructType([
StructField('EventID',IntegerType(),True)
,StructField('EventTime',StringType(),True)
,StructField('AttrA',StringType(),True)
,StructField('AttrB',DoubleType(),True)
]
)
df = pd.read_csv(s,delimiter=',')
sparkDF = sql.createDataFrame(df,schema=inp_schema)\
.withColumn('AttrA',F.when(F.isnan(F.col('AttrA')),None).otherwise(F.col('AttrA')))\
.withColumn('AttrB',F.when(F.isnan(F.col('AttrB')),None).otherwise(F.col('AttrB')))
sparkDF.show(truncate=False)
+-------+--------------------------+-----+-----+
|EventID|EventTime |AttrA|AttrB|
+-------+--------------------------+-----+-----+
|1 |2022-10-01 00:00:01.000000|null |null |
|1 |2022-10-01 00:00:02.000000|a |null |
|1 |2022-10-01 00:00:03.000000|b |1.0 |
|1 |2022-10-01 00:00:04.000000|null |null |
|2 |2022-10-01 00:01:01.000000|aa |11.0 |
|2 |2022-10-01 00:01:02.000000|bb |null |
|2 |2022-10-01 00:01:03.000000|null |null |
|2 |2022-10-01 00:01:04.000000|aa |22.0 |
+-------+--------------------------+-----+-----+
First & Last
sparkDF.registerTempTable("INPUT")
sql.sql("""
SELECT
EventID,
FIRST(AttrA,True) as First_AttrA,
LAST(AttrA,True) as Last_AttrA,
FIRST(AttrB,True) as First_AttrB,
LAST(AttrB,True) as Last_AttrB
FROM INPUT
GROUP BY 1
""").show()
+-------+-----------+----------+-----------+----------+
|EventID|First_AttrA|Last_AttrA|First_AttrB|Last_AttrB|
+-------+-----------+----------+-----------+----------+
| 1| a| b| 1.0| 1.0|
| 2| aa| aa| 11.0| 22.0|
+-------+-----------+----------+-----------+----------+
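The same aggregation can be written with the DataFrame API. A minimal sketch using the sparkDF prepared above; note that, as with the SQL version, first/last depend on the row order within each group, so the input may need to be sorted by EventTime beforehand for deterministic results:

from pyspark.sql import functions as F

agg_df = sparkDF.groupBy("EventID").agg(
    F.first("AttrA", ignorenulls=True).alias("First_AttrA"),
    F.last("AttrA", ignorenulls=True).alias("Last_AttrA"),
    F.first("AttrB", ignorenulls=True).alias("First_AttrB"),
    F.last("AttrB", ignorenulls=True).alias("Last_AttrB"),
)
agg_df.show()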

How to use Window.unboundedPreceding, Window.unboundedFollowing on Distinct datetime

I have data like below
|Id |DateTime                     |products|
|---|-----------------------------|--------|
|1  |2017-08-24T00:00:00.000+0000 |1       |
|1  |2017-08-24T00:00:00.000+0000 |2       |
|1  |2017-08-24T00:00:00.000+0000 |3       |
|1  |2016-05-24T00:00:00.000+0000 |1       |
I am using Window.unboundedPreceding, Window.unboundedFollowing as below to get the second most recent datetime.
sorted_times = Window.partitionBy('Id').orderBy(F.col('ModifiedTime').desc()).rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('ModifiedTime')).over(sorted_times).getItem(1))
But I get the results below: the second_recent value is picked from the second row, which has the same datetime as the first row.
|Id |DateTime                     |secondtime                   |Products|
|---|-----------------------------|-----------------------------|--------|
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |1       |
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |2       |
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |3       |
|1  |2016-05-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |1       |
Please help me find the second latest datetime among the distinct datetimes.
Thanks in advance
Use collect_set instead of collect_list for no duplicates:
df3 = data.withColumn(
"second_recent",
F.collect_set(F.col('LastModifiedTime')).over(sorted_times)[1]
)
df3.show(truncate=False)
#+-----+----------------------------+--------+----------------------------+
#|VipId|LastModifiedTime |products|second_recent |
#+-----+----------------------------+--------+----------------------------+
#|1 |2017-08-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|2 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|3 |2016-05-24T00:00:00.000+0000|
#|1 |2016-05-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#+-----+----------------------------+--------+----------------------------+
Another way is to use an unordered window and sort the array before taking second_recent:
from pyspark.sql import functions as F, Window
df3 = data.withColumn(
"second_recent",
F.sort_array(
F.collect_set(F.col('LastModifiedTime')).over(Window.partitionBy('VipId')),
False
)[1]
)
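Putting it together with the column names from the question (Id, DateTime), a minimal sketch, assuming data is the original DataFrame from the question:

from pyspark.sql import functions as F, Window

w = Window.partitionBy("Id")
df3 = data.withColumn(
    "second_recent",
    # distinct datetimes per Id, sorted descending; index 1 is the second most recent
    F.sort_array(F.collect_set("DateTime").over(w), asc=False)[1],
)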

Transposing in Spark SQL

For the following table structure:
+------------------------------------------------------+
| timestamp | value1 | value2 | ... value100 |
+------------------------------------------------------+
|1/1/1 00:00:00 | 1 | 2 | 100 |
+------------------------------------------------------+
How could I transpose it into a structure like this using Spark SQL syntax?
+---------------------------------------+
| timestamp | id | value |
+---------------------------------------+
|1/1/1 00:00:00 | value1 | 1 |
|1/1/1 00:00:00 | value2 | 2 |
|1/1/1 00:00:00 | ... value100 | 100 |
+---------------------------------------+
In Python or R this would be relatively straightforward, and UNPIVOT doesn't seem to be applicable here.
A more concise approach would be to use STACK
Data Preparation
sparkDF = sql.createDataFrame([("20201021T00:00:00+0530",10,97,23,214),
("20211011T00:00:00+0530",23,8218,9192,827),
("20200212T00:00:00+0300",51,981,18,10),
("20211021T00:00:00+0530",10,2197,871,108),
("20211021T00:00:00+0900",128,9812,98,192),
("20211021T00:00:00-0500",218,487,21,51)
]
,['timestamp','value1','value2','value3','value4'])
# sparkDF.show(truncate=False)
sparkDF.createOrReplaceTempView("sparkDF")
sql.sql("""
SELECT
timestamp
,STACK(4,'value1',value1
,'value2',value2
,'value3',value3
,'value4',value4
) as (id,value)
FROM sparkDF
""").show()
+--------------------+------+-----+
| timestamp| id|value|
+--------------------+------+-----+
|20201021T00:00:00...|value1| 10|
|20201021T00:00:00...|value2| 97|
|20201021T00:00:00...|value3| 23|
|20201021T00:00:00...|value4| 214|
|20211011T00:00:00...|value1| 23|
|20211011T00:00:00...|value2| 8218|
|20211011T00:00:00...|value3| 9192|
|20211011T00:00:00...|value4| 827|
|20200212T00:00:00...|value1| 51|
|20200212T00:00:00...|value2| 981|
|20200212T00:00:00...|value3| 18|
|20200212T00:00:00...|value4| 10|
|20211021T00:00:00...|value1| 10|
|20211021T00:00:00...|value2| 2197|
|20211021T00:00:00...|value3| 871|
|20211021T00:00:00...|value4| 108|
|20211021T00:00:00...|value1| 128|
|20211021T00:00:00...|value2| 9812|
|20211021T00:00:00...|value3| 98|
|20211021T00:00:00...|value4| 192|
+--------------------+------+-----+
Stack String
You can also build the stack_str programmatically, depending on the columns you want:
col_len = 4
stack_str = ''

for i in range(col_len):
    if i == 0:
        stack_str += f'\'value{i+1}\',value{i+1}'
    else:
        stack_str += f',\'value{i+1}\',value{i+1}'

stack_str = f"STACK({col_len},{stack_str}) as (id,value)"
stack_str
"STACK(4,'value1',value1,'value2',value2,'value3',value3,'value4',value4) as (id,value)"
sql.sql(f"""
SELECT
timestamp
,{stack_str}
FROM sparkDF
""").show()
+--------------------+------+-----+
| timestamp| id|value|
+--------------------+------+-----+
|20201021T00:00:00...|value1| 10|
|20201021T00:00:00...|value2| 97|
|20201021T00:00:00...|value3| 23|
|20201021T00:00:00...|value4| 214|
|20211011T00:00:00...|value1| 23|
|20211011T00:00:00...|value2| 8218|
|20211011T00:00:00...|value3| 9192|
|20211011T00:00:00...|value4| 827|
|20200212T00:00:00...|value1| 51|
|20200212T00:00:00...|value2| 981|
|20200212T00:00:00...|value3| 18|
|20200212T00:00:00...|value4| 10|
|20211021T00:00:00...|value1| 10|
|20211021T00:00:00...|value2| 2197|
|20211021T00:00:00...|value3| 871|
|20211021T00:00:00...|value4| 108|
|20211021T00:00:00...|value1| 128|
|20211021T00:00:00...|value2| 9812|
|20211021T00:00:00...|value3| 98|
|20211021T00:00:00...|value4| 192|
+--------------------+------+-----+
You could do the same using regular SQL as follows
select timestamp
,'value1' as id
,value1 as value
from table
union all
select timestamp
,'value2' as id
,value2 as value
from table
union all
select timestamp
,'value3' as id
,value3 as value
from table
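If you are on Spark 3.4 or later, the DataFrame API also offers an unpivot (alias melt) method that expresses the same reshape without hand-writing the STACK string; a sketch against the sparkDF created in the data preparation above:

melted = sparkDF.unpivot(
    ids=["timestamp"],
    values=["value1", "value2", "value3", "value4"],
    variableColumnName="id",
    valueColumnName="value",
)
melted.show()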

Transform row wise postgres data to grouped column wise data

I have stock market data which looks like
instrument_symbol|close |timestamp |
-----------------|------|-------------------|
IOC |134.15|2019-08-05 00:00:00|
YESBANK | 83.75|2019-08-05 00:00:00|
IOC |135.25|2019-08-02 00:00:00|
YESBANK | 88.3|2019-08-02 00:00:00|
IOC |136.95|2019-08-01 00:00:00|
YESBANK | 88.4|2019-08-01 00:00:00|
IOC | 139.3|2019-07-31 00:00:00|
YESBANK | 91.2|2019-07-31 00:00:00|
YESBANK | 86.05|2019-07-30 00:00:00|
IOC | 133.5|2019-07-30 00:00:00|
IOC |138.25|2019-07-29 00:00:00|
I want to transform it to
timestamp, IOC, YESBANK
2019-08-05 00:00:00 134.15 83.75
2019-08-02 00:00:00 135.25 88.3
......
.....
...
format.
Is there some Postgres query to do this? Or do we have to do this programmatically?
You can use conditional aggregation. In Postgres, I like the filter syntax:
select "timestamp",
max(close) filter (where instrument_symbol = 'IOC') as ioc,
max(close) filter (where instrument_symbol = 'YESBANK') as yesbank
from t
group by "timestamp"
order by 1 desc;
Use conditional aggregation.
select "timestamp" :: date, max( case
when instrument_symbol = 'IOC'
then close end ) as ioc,
max( case
when instrument_symbol = 'YESBANK'
then close end ) as yesbank FROM t
group by "timestamp" :: date
order by 1 desc