Rolling join using dates in PySpark?

I'm trying to do a join between two PySpark dataframes, joining on a key; however, the date of the first table should always come after the date of the second table. As an example, here are the two tables we're trying to join:
Table 1:
Date1        value1  key
13 Feb 2020  1       a
01 Mar 2020  2       a
31 Mar 2020  3       a
15 Apr 2020  4       a
Table 2:
Date2        value2  key
10 Feb 2020  11      a
15 Mar 2020  22      a
After the join, the result should be something like this:
Date1        value1  value2  key
13 Feb 2020  1       11      a
01 Mar 2020  2       null    a
31 Mar 2020  3       22      a
15 Apr 2020  4       null    a
Any ideas?

This is an interesting join. My approach is to join on the key first, keep only the rows where Date1 comes after Date2, select the earliest such Date1 for each Date2, and then join back to the first table.
from pyspark.sql import functions as F, Window

# Clean up the date format first
df3 = df1.withColumn('Date1', F.to_date('Date1', 'dd MMM yyyy'))
df4 = df2.withColumn('Date2', F.to_date('Date2', 'dd MMM yyyy'))

result = (df3.join(df4, 'key')
             # keep only pairs where Date1 comes after Date2
             .filter('Date1 > Date2')
             # for each Date2, keep the earliest matching Date1
             .withColumn('rn', F.row_number().over(Window.partitionBy('Date2').orderBy('Date1')))
             .filter('rn = 1')
             .drop('key', 'rn', 'Date2')
             # bring back all rows of table 1; unmatched rows get a null value2
             .join(df3, ['Date1', 'value1'], 'right')
         )
result.show()
+----------+------+------+---+
|Date1 |value1|value2|key|
+----------+------+------+---+
|2020-02-13|1 |11 |a |
|2020-03-01|2 |null |a |
|2020-03-31|3 |22 |a |
|2020-04-15|4 |null |a |
+----------+------+------+---+
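Note that the window above partitions only by Date2 and drops key before the join back, which is fine for this single-key example. With multiple keys, a sketch of the same idea that keeps the key together would be (not tested, same assumptions as above):
result = (df3.join(df4, 'key')
             .filter('Date1 > Date2')
             # earliest Date1 per (key, Date2) pair
             .withColumn('rn', F.row_number().over(
                 Window.partitionBy('key', 'Date2').orderBy('Date1')))
             .filter('rn = 1')
             .drop('rn', 'Date2')
             .join(df3, ['key', 'Date1', 'value1'], 'right')
         )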

You can try the window lag function. This is Scala, but the Python version will be similar (see the sketch after the output below).
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rename columns so the two datasets can be unioned, and add an extra col to identify the source dataset
val df1A = df1.toDF("Date", "value", "key").withColumn("df", lit(1))
val df2A = df2.toDF("Date", "value", "key").withColumn("df", lit(2))

df1A.unionAll(df2A)
  // for each row, look at the previous row (by date) within the same key
  .withColumn("value2", lag(array('value, 'df), 1) over Window.partitionBy('key).orderBy(to_date('Date, "dd MMM yyyy")))
  // keep only rows from the first dataset
  .filter('df === 1)
  // keep the lagged value only if it came from the second dataset
  .withColumn("value2", when(element_at('value2, 2) === 2, element_at('value2, 1)))
  .drop("df")
  .show
Output:
+-----------+-----+---+------+
| Date|value|key|value2|
+-----------+-----+---+------+
|13 Feb 2020| 1| a| 11|
|01 Mar 2020| 2| a| null|
|31 Mar 2020| 3| a| 22|
|15 Apr 2020| 4| a| null|
+-----------+-----+---+------+
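For reference, a rough PySpark translation of the same lag idea might look like this (an untested sketch, assuming the dataframes are named df1 and df2 as above):
from pyspark.sql import functions as F, Window

# rename columns so the two dataframes can be unioned, and tag each with its source
df1a = df1.toDF('Date', 'value', 'key').withColumn('df', F.lit(1))
df2a = df2.toDF('Date', 'value', 'key').withColumn('df', F.lit(2))

w = Window.partitionBy('key').orderBy(F.to_date('Date', 'dd MMM yyyy'))

result = (df1a.unionAll(df2a)
              # previous row (by date) within the same key, carrying its source tag
              .withColumn('value2', F.lag(F.array('value', 'df'), 1).over(w))
              # keep only rows that came from the first dataframe
              .filter('df = 1')
              # keep the lagged value only if the previous row came from the second dataframe
              .withColumn('value2', F.when(F.element_at('value2', 2) == 2,
                                           F.element_at('value2', 1)))
              .drop('df'))
result.show()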

Related

Filling rows from calendar table with previous values

I'm new to SQL, coming over from Python and R, and using Spark SQL with Databricks. I'm trying to complete a basic query and would appreciate guidance, especially guidance that explains the underlying concepts of SQL as they relate to my question.
I have a calendar table with complete, consecutive dates, and a data table with date_added, user_id, sales, and price columns. The data table has incomplete dates, since not every user is active on every date. Below are examples of each table.
Calendar Table
date
2020-01-01
2020-01-02
2020-01-03
2020-01-04
2020-01-05
2020-01-06
Data Table
date_added  user_id  sales  price
2020-01-02  01       1      4.00
2020-01-05  01       3      4.00
2020-01-02  02       1      5.00
2020-01-03  02       1      5.00
2020-01-05  02       2      5.00
2020-01-03  03       2      1.00
2020-01-05  03       5      1.00
I am looking to create a new table where every calendar date within a certain range (the active dates) is present for every user, and nulls in every column except sales are filled with the following value in that column. Something along these lines:
date        user_id  sales  price
2020-01-02  01       1      4.00
2020-01-03  01       null   4.00
2020-01-04  01       null   4.00
2020-01-05  01       3      4.00
2020-01-02  02       1      5.00
2020-01-03  02       1      5.00
2020-01-04  02       null   5.00
2020-01-05  02       2      5.00
2020-01-02  03       null   1.00
2020-01-03  03       2      1.00
2020-01-04  03       null   1.00
2020-01-05  03       5      1.00
Any guidance on how I might produce this output is appreciated. I've tried to use a LEFT JOIN on the dates, but without success. I know that the UNION operator is used to concatenate tables on top of one another, but I don't know how I would apply that method here.
You can cross join the users with the calendar table, then left join with the data table:
spark.sql("""
    SELECT date, dates.user_id, sales, COALESCE(data.price, dates.price) AS price
    FROM (
        SELECT user_id, price, date
        FROM (SELECT user_id, FIRST(price) AS price FROM data_table GROUP BY user_id)
        CROSS JOIN calendar_table
        WHERE date >= (SELECT MIN(date_added) FROM data_table)
          AND date <= (SELECT MAX(date_added) FROM data_table)
    ) dates
    LEFT JOIN data_table data
        ON dates.user_id = data.user_id
        AND dates.date = data.date_added
""").show()
Output:
+----------+-------+-----+-----+
|date |user_id|sales|price|
+----------+-------+-----+-----+
|2020-01-02|01 |1 |4.0 |
|2020-01-03|01 |null |4.0 |
|2020-01-04|01 |null |4.0 |
|2020-01-05|01 |3 |4.0 |
|2020-01-02|02 |1 |5.0 |
|2020-01-03|02 |1 |5.0 |
|2020-01-04|02 |null |5.0 |
|2020-01-05|02 |2 |5.0 |
|2020-01-02|03 |null |1.0 |
|2020-01-03|03 |2 |1.0 |
|2020-01-04|03 |null |1.0 |
|2020-01-05|03 |5 |1.0 |
+----------+-------+-----+-----+
You can also generate the dates without a calendar table by using the sequence function. See my other answer here.
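For example, a minimal sketch of generating the date range inline (the range is hardcoded here for illustration; in practice you would derive it from MIN/MAX of date_added):
dates = spark.sql("""
    SELECT explode(sequence(to_date('2020-01-02'), to_date('2020-01-05'), interval 1 day)) AS date
""")
dates.show()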
Let your original dataframe be df1. Then you can get the min and max date for each user_id, expand that into a full date range, and call the result df2.
from pyspark.sql import functions as f
from pyspark.sql import Window

w = Window.partitionBy('user_id').orderBy(f.desc('date_added'))

# build the full date range per user_id
df2 = df1.groupBy('user_id') \
         .agg(f.sequence(f.min('date_added'), f.max('date_added')).alias('date_added')) \
         .withColumn('date_added', f.explode('date_added'))

# join back to the original data and fill the price from the user's latest known value
df2.join(df1, ['user_id', 'date_added'], 'left') \
   .withColumn('price', f.first('price').over(w)) \
   .orderBy('user_id', 'date_added') \
   .show()
+-------+----------+-----+-----+
|user_id|date_added|sales|price|
+-------+----------+-----+-----+
| 1|2020-01-02| 1| 4.0|
| 1|2020-01-03| null| 4.0|
| 1|2020-01-04| null| 4.0|
| 1|2020-01-05| 3| 4.0|
| 2|2020-01-02| 1| 5.0|
| 2|2020-01-03| 1| 5.0|
| 2|2020-01-04| null| 5.0|
| 2|2020-01-05| 2| 5.0|
| 3|2020-01-03| 2| 1.0|
| 3|2020-01-04| null| 1.0|
| 3|2020-01-05| 5| 1.0|
+-------+----------+-----+-----+
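The question phrases the fill as taking the following value in the column; in this sample the price is constant per user, so first('price') over a descending window is enough. If price could change over time, a forward-looking fill might be closer to that wording (a sketch, assuming that is the desired semantics):
# for each row, take the next non-null price at or after the current date
w_fwd = (Window.partitionBy('user_id')
               .orderBy('date_added')
               .rowsBetween(Window.currentRow, Window.unboundedFollowing))

df2.join(df1, ['user_id', 'date_added'], 'left') \
   .withColumn('price', f.first('price', ignorenulls=True).over(w_fwd)) \
   .orderBy('user_id', 'date_added') \
   .show()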

Convert the yearly columns to rows and add a year column in PySpark

For example:
Input:
ID  SALESYR2019  SALESYR2020
1   10           50
2   20           100
Desired output:
ID  SALESYR  SALES
1   2019     10
2   2019     20
1   2020     50
2   2020     100
What you are trying to do is unpivot the table; here is how you can do it:
from pyspark.sql.functions import expr

data = [
    {"id": 1, "SALESYR2019": 10, "SALESYR2020": 50},
    {"id": 2, "SALESYR2019": 20, "SALESYR2020": 100},
]
df = spark.createDataFrame(data)

# stack() turns the two yearly columns into (SALESYR, SALES) rows
unpivotExpr = "stack(2, '2019', SALESYR2019, '2020', SALESYR2020) as (SALESYR, SALES)"
unPivotDF = df.select("id", expr(unpivotExpr))
unPivotDF.show(truncate=False)
Output
>>> unPivotDF.show(truncate=False)
+---+-------+-----+
|id |SALESYR|SALES|
+---+-------+-----+
|1 |2019 |10 |
|1 |2020 |50 |
|2 |2019 |20 |
|2 |2020 |100 |
+---+-------+-----+
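If you are on a recent Spark version (3.4+, an assumption about your environment), there is also a built-in DataFrame.unpivot method that does the same thing; the year then needs to be stripped out of the column names:
from pyspark.sql import functions as F

unpivoted = (df.unpivot("id", ["SALESYR2019", "SALESYR2020"], "SALESYR", "SALES")
               # the variable column holds the original column names, so strip the prefix
               .withColumn("SALESYR", F.regexp_replace("SALESYR", "SALESYR", "")))
unpivoted.show()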

Generate repeating N row number for a PySpark DataFrame

I want to create a new column in a PySpark DataFrame with the row number repeated N times, irrespective of the other columns in the data frame.
Original data:
name year
A 2010
A 2011
A 2011
A 2013
A 2014
A 2015
A 2016
A 2018
B 2018
B 2019
I want to add a new column with the row number repeated N times; consider N = 3.
Expected Output:
name year rownumber
A 2010 1
A 2011 1
A 2011 1
A 2013 2
A 2014 2
A 2015 2
A 2016 3
A 2018 3
B 2018 3
B 2019 4
You can try row_number with integer division:
from pyspark.sql import functions as F, Window

n = 3
df.withColumn(
    "rounum",
    ((F.row_number().over(Window.orderBy(F.lit(0))) - 1) / n).cast("Integer") + 1
).show()
+----+----+------+
|name|year|rounum|
+----+----+------+
| A|2010| 1|
| A|2011| 1|
| A|2011| 1|
| A|2013| 2|
| A|2014| 2|
| A|2015| 2|
| A|2016| 3|
| A|2018| 3|
| B|2018| 3|
| B|2019| 4|
+----+----+------+
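Note that a window with no partition, as above, moves all rows into a single partition, and ordering by a constant makes the row order non-deterministic across runs. If the data has a natural ordering, you might order on that instead (a sketch, assuming ordering by name and year is acceptable):
w = Window.orderBy("name", "year")  # still a single partition, but a deterministic order
df.withColumn("rounum",
              ((F.row_number().over(w) - 1) / n).cast("Integer") + 1).show()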

Identify series which starts with a value and ends with the same value in Sql (Oracle) with overlapping time

How can I calculate view-time for each video view?
Details:
I'm using Oracle 11g.
Platform ID | Action  | Total Length of video | Action Time
1           | START   | 0   | 31 jul 2012 13:30:33
1           | PAUSED  | 218 | 31 jul 2012 13:30:58
1           | PLAYING | 218 | 31 jul 2012 13:34:34
1           | IDLE    | 218 | 31 jul 2012 13:37:51
1           | START   | 0   | 13 sep 2012 11:44:15
1           | PAUSED  | 167 | 13 sep 2012 11:45:50
1           | START   | 0   | 22 aug 2012 13:19:44
1           | PAUSED  | 167 | 22 aug 2012 13:20:24
2           | START   | 0   | 22 aug 2012 13:23:37
2           | IDLE    | 172 | 22 aug 2012 13:26:29
2           | START   | 0   | 22 aug 2012 13:26:33
2           | STOP    | 172 | 22 aug 2012 13:29:20
2           | START   | 0   | 22 aug 2012 13:29:25
2           | IDLE    | 276 | 22 aug 2012 13:34:03
Problem statement: above is data about video readership; all the actions performed by a user on a given video are captured as shown. A user can perform the following actions on a video:
START: Started a video
PAUSED: Paused a video.
PLAYING: Start playing a paused video.
IDLE: Video is IDLE with no action.
STOP: Video is stopped.
I want to calculate the % of video viewed per *session.
*Session: a session is the group of rows from a START action up to (but not including) the next START action for the same PlatformID, with the data ordered by PlatformID and Action Time in ascending order.
Platform ID | Action  | Total Length of video | Action Time          | Session
1           | START   | 0   | 31 jul 2012 13:30:33 | 1
1           | PAUSED  | 218 | 31 jul 2012 13:30:58 | 1
1           | PLAYING | 218 | 31 jul 2012 13:34:34 | 1
1           | IDLE    | 218 | 31 jul 2012 13:37:51 | 1
1           | START   | 0   | 13 sep 2012 11:44:15 | 2
1           | PAUSED  | 167 | 13 sep 2012 11:45:50 | 2
1           | START   | 0   | 22 aug 2012 13:19:44 | 3
1           | PAUSED  | 167 | 22 aug 2012 13:20:24 | 3
2           | START   | 0   | 22 aug 2012 13:23:37 | 4
2           | IDLE    | 172 | 22 aug 2012 13:26:29 | 4
3           | START   | 0   | 22 aug 2012 13:26:33 | 5
3           | STOP    | 172 | 22 aug 2012 13:29:20 | 5
4           | START   | 0   | 22 aug 2012 13:29:25 | 6
4           | IDLE    | 276 | 22 aug 2012 13:34:03 | 6
4           | PLAYING | 276 | 22 aug 2012 13:36:05 | 6
4           | PAUSED  | 276 | 22 aug 2012 13:38:12 | 6
View time per session can be calculated as the Action Time of the session's last action (which can be anything, e.g. IDLE) minus the Action Time of its START action.
I am not able to generate a series for the session as shown in the Session column above. Below is the query I tried.
In a nutshell: I need to generate a session series that starts at each Action='START' and ends when the next Action='START' is encountered, with the rows ordered by PlatformID and Action Time.
SELECT X.*,
       X.Action AS Action_Series,
       Row_Number() Over(Partition by PlatformID Order by
           CASE WHEN Action='START' THEN 0
                WHEN Action='START' AND PrevAction != 'START' THEN 1
                ELSE 0
           END) AS Series_val
FROM (
    SELECT PlatformID, action, total_length, action_date_time,
           Dense_Rank() Over(Partition by 1 Order by PlatformID, action_date_time) AS ID,
           Lag(Action)  Over(Partition by 1 Order by PlatformID, action_date_time) AS PrevAction,
           Lead(Action) Over(Partition by 1 Order by PlatformID, action_date_time) AS NextAction
    FROM audiovideostats
) X
ORDER BY ID;

Sql Server Aggregation or Pivot Table Query

I'm trying to write a query that will tell me the number of customers who had a certain number of transactions each week. I don't know where to start with the query, but I'd assume it involves an aggregate or pivot function. I'm working in SQL Server Management Studio.
Currently the data looks like this, where the first column is the customer id and each subsequent column is a week:
|Customer| 1 | 2 | 3 | 4 |
--------------------------
|001     | 1 | 0 | 2 | 2 |
|002     | 0 | 2 | 1 | 0 |
|003     | 0 | 4 | 1 | 1 |
|004     | 1 | 0 | 0 | 1 |
I'd like to see a return like the following:
|Visits | 1 | 2 | 3 | 4 |
-------------------------
|0      | 2 | 2 | 1 | 0 |
|1      | 2 | 0 | 2 | 2 |
|2      | 0 | 1 | 1 | 1 |
|4      | 0 | 1 | 0 | 0 |
What I want is a count of customers by number of transactions each week. E.g. during the 1st week, 2 customers (002 and 003) had 0 transactions, 2 customers (001 and 004) had 1 transaction, and zero customers had more than 1 transaction.
The query below will get you the result you want, but note that it has the column names hard coded. It's easy to add more week columns, but if the number of columns is unknown then you might want to look into a solution using dynamic SQL (which would require accessing the information schema to get the column names). It's not that hard to turn it into a fully dynamic version though.
select
    Visits
  , coalesce([1], 0) as Week1
  , coalesce([2], 0) as Week2
  , coalesce([3], 0) as Week3
  , coalesce([4], 0) as Week4
from (
    select *, count(*) c
    from (
        select '1' W, week1 Visits from t union all
        select '2' W, week2 Visits from t union all
        select '3' W, week3 Visits from t union all
        select '4' W, week4 Visits from t
    ) a
    group by W, Visits
) x
pivot ( max(c) for W in ([1], [2], [3], [4]) ) as pvt;
In the query your table is called t and the output is:
Visits  Week1  Week2  Week3  Week4
0       2      2      1      1
1       2      0      2      2
2       0      1      1      1
4       0      1      0      0