Converting pandas data frame logic to PySpark data frame based logic

Given a data frame with 4 columns: group, start_date, available_stock, used_stock, I have to figure out how long the stock will last for a given group and date. Let's say we have a data frame with the following data:
+---------+------------+-----------------+------------+
| group   | start_date | available stock | used_stock |
+---------+------------+-----------------+------------+
| group 1 | 01/12/2019 | 100             | 80         |
| group 1 | 08/12/2019 | 60              | 10         |
| group 1 | 15/12/2019 | 60              | 10         |
| group 1 | 22/12/2019 | 150             | 200        |
| group 2 | 15/12/2019 | 80              | 90         |
| group 2 | 22/12/2019 | 150             | 30         |
| group 3 | 22/12/2019 | 50              | 50         |
+---------+------------+-----------------+------------+
Steps:
Sort each group by start_date, so we get something like the above data set.
For each group, starting from the smallest date, check whether used_stock is greater than or equal to available stock. If it is, the end date is the same as start_date.
If the condition is false, add the next date's used_stock to the current used_stock value. Continue until the accumulated used_stock is greater than or equal to available stock, at which point the end date is the start_date of the last used_stock row that was added.
If no such row is found, the end date is null.
After applying the above steps to every row, we should get something like:
+---------+------------+-----------------+------------+------------+
| group   | start_date | available stock | used_stock | end_date   |
+---------+------------+-----------------+------------+------------+
| group 1 | 01/12/2019 | 100             | 80         | 15/12/2019 |
| group 1 | 08/12/2019 | 60              | 10         | 22/12/2019 |
| group 1 | 15/12/2019 | 60              | 10         | 22/12/2019 |
| group 1 | 22/12/2019 | 150             | 200        | 22/12/2019 |
| group 2 | 15/12/2019 | 80              | 90         | 15/12/2019 |
| group 2 | 22/12/2019 | 150             | 30         | null       |
| group 3 | 22/12/2019 | 50              | 50         | 22/12/2019 |
+---------+------------+-----------------+------------+------------+
The above logic was prebuilt in pandas, then tweaked and applied in the Spark application as a grouped map pandas_udf. I want to move away from the pandas_udf approach to a pure Spark data frame based approach, to check whether there will be any performance improvements. I would appreciate any help with this, or any improvements to the given logic that would reduce the overall execution time.

With Spark 2.4+, you can use the Spark SQL builtin function aggregate:
aggregate(array_argument, zero_expression, merge, finish)
and implement the logic in the merge and finish expressions.
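For intuition, aggregate folds an array into a single value, much like Python's reduce. A minimal toy example runnable through spark.sql (the optional finish lambda post-processes the final accumulator):

SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x) AS total;
-- total = 6
SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x, acc -> acc * 10) AS total;
-- total = 60

For this problem, we collect each row's current-and-following (start_date, used_stock) pairs into an array and fold over it, see below: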
from pyspark.sql.functions import collect_list, struct, to_date, expr
from pyspark.sql import Window

w1 = Window.partitionBy('group').orderBy('start_date').rowsBetween(0, Window.unboundedFollowing)

# SQL expression to calculate end_date using the aggregate function:
end_date_expr = """
    aggregate(
        /* argument: array of (start_date, used_stock) structs collected over w1 */
        data,
        /* zero expression: initialize the accumulator and specify its datatype,
         * which is struct<end_date:date,total:double> */
        (date(NULL) as end_date, double(0) as total),
        /* merge: use acc.total to keep the accumulated sum of used_stock;
         * this works similarly to Python's reduce function */
        (acc, y) ->
            IF(acc.total >= `available stock`,
               (acc.end_date as end_date, acc.total as total),
               (y.start_date as end_date, acc.total + y.used_stock as total)),
        /* finish: post-process and retrieve only end_date */
        z -> IF(z.total >= `available stock`, z.end_date, NULL)
    )
"""

df.withColumn('start_date', to_date('start_date', 'dd/MM/yyyy')) \
  .withColumn('data', collect_list(struct('start_date', 'used_stock')).over(w1)) \
  .withColumn('end_date', expr(end_date_expr)) \
  .select("group", "start_date", "`available stock`", "used_stock", "end_date") \
  .show(truncate=False)
+-------+----------+---------------+----------+----------+
|group  |start_date|available stock|used_stock|end_date  |
+-------+----------+---------------+----------+----------+
|group 1|2019-12-01|100            |80        |2019-12-15|
|group 1|2019-12-08|60             |10        |2019-12-22|
|group 1|2019-12-15|60             |10        |2019-12-22|
|group 1|2019-12-22|150            |200       |2019-12-22|
|group 2|2019-12-15|80             |90        |2019-12-15|
|group 2|2019-12-22|150            |30        |null      |
|group 3|2019-12-22|50             |50        |2019-12-22|
+-------+----------+---------------+----------+----------+
Note: this could be less efficient if many of the groups contain a large number of rows (e.g. 1000+), while most rows need to scan only a limited number of following rows (e.g. fewer than 20) to find the first one satisfying the condition. In that case, you can set up two Window specs and do the calculation in two rounds:
from pyspark.sql.functions import collect_list, struct, to_date, col, when, expr

# 1st scan: look only at up to the N following rows, which covers the majority of
# rows whose end_date satisfies the condition
N = 20
w2 = Window.partitionBy('group').orderBy('start_date').rowsBetween(0, N)

# 2nd scan: covers the full partition, but applies only to rows whose end_date is NULL
w1 = Window.partitionBy('group').orderBy('start_date').rowsBetween(0, Window.unboundedFollowing)

df.withColumn('start_date', to_date('start_date', 'dd/MM/yyyy')) \
  .withColumn('data', collect_list(struct('start_date', 'used_stock')).over(w2)) \
  .withColumn('end_date', expr(end_date_expr)) \
  .withColumn('data',
      when(col('end_date').isNull(), collect_list(struct('start_date', 'used_stock')).over(w1))) \
  .selectExpr(
      "group",
      "start_date",
      "`available stock`",
      "used_stock",
      "IF(end_date is NULL, {0}, end_date) AS end_date".format(end_date_expr)
  ).show(truncate=False)

Related

How to use ID JOIN instead of DATEDIFF()

Write a SQL query to find the ids of all dates with a higher temperature compared to the previous date (yesterday).
Try it out if you want: https://leetcode.com/problems/rising-temperature/
Input:
Weather table:
+----+------------+-------------+
| id | recordDate | temperature |
+----+------------+-------------+
| 1  | 2015-01-01 | 10          |
| 2  | 2015-01-02 | 25          |
| 3  | 2015-01-03 | 20          |
| 4  | 2015-01-04 | 30          |
+----+------------+-------------+
Output:
+----+
| id |
+----+
| 2  |
| 4  |
+----+
Here's my code:
SELECT w_2.id AS "Id"
FROM Weather w_1
JOIN Weather w_2
ON w_1.id + 1 = w_2.id
WHERE w_1.temperature < w_2.temperature
But my code isn't accepted, even though its output looks exactly like the expected output.
I know the answer is:
SELECT w2.id
FROM Weather w1, Weather w2
WHERE w2.temperature > w1.temperature
AND DATEDIFF(w2.recordDate, w1.recordDate) = 1
But I tried not to use DATEDIFF, because that function is not available in PostgreSQL.
The queries are not equivalent. You should join the table on recordDate, not on id.
SELECT w_2.id AS "Id"
FROM Weather w_1
JOIN Weather w_2
ON w_1.recordDate + 1 = w_2.recordDate
WHERE w_1.temperature < w_2.temperature
Do not assume that Id is sequential and ordered in the same way as recordDate, although the sample data may suggest this.
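If you prefer the date arithmetic to be explicit, the join condition can also be written with an interval (a sketch; in PostgreSQL, date + integer adds days, so both forms are equivalent here):

SELECT w_2.id AS "Id"
FROM Weather w_1
JOIN Weather w_2
ON w_2.recordDate = w_1.recordDate + INTERVAL '1 day'
WHERE w_1.temperature < w_2.temperature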

join two views and detect missing entries where the matching condition is in the next row of the other view/table (using SQLite)

I am running a science test and logging my data inside two sqlite tables.
I have selected the data needed into two separate and independent views (RX and TX views).
Now I need to analyze the measurements and create a 3rd table view with the results, with the following points in mind:
1- For each test at TX side (Table 1) there might be a corresponding entry at RX side (Table 2).
2- If the time stamp at RX side is less than the time stamp in the next row of the TX table view, we consider them to be associated with one record in the 3rd view/table and calculate the time difference; OTHERWISE it is a miss.
Question: How should I write the SQL query in SQLite to produce the analysis and test result given in Table 3?
Thanks a lot in advance.
TX View - Table (1)
id | time         | measurement
-------------------------------
1  | 09:40:10.221 | 100
2  | 09:40:15.340 | 60
3  | 09:40:21.100 | 80
4  | 09:40:25.123 | 90
5  | 09:40:29.221 | 45

RX View - Table (2)
time         | measurement
---------------------------
09:40:15.7   | 65
09:40:21.560 | 80
09:40:30.414 | 50

Test Result View - Table (3)
id | TxTime       | RxTime       | delta_time(s) | delta_value
---------------------------------------------------------------
1  | 09:40:10.221 | NULL         | NULL          | NULL (i.e. missed)
2  | 09:40:15.340 | 09:40:15.7   | 0.360         | 5
3  | 09:40:21.100 | 09:40:21.560 | 0.460         | 0
4  | 09:40:25.123 | NULL         | NULL          | NULL (i.e. missed)
5  | 09:40:29.221 | 09:40:30.414 | 1.193         | 5
Use the window function LEAD() to get the next time for each row in TX, and join the views on your conditions:
SELECT t.id, t.time TxTime, r.time RxTime,
       ROUND((julianday(r.time) - julianday(t.time)) * 24 * 60 * 60, 3) [delta_time(s)],
       r.measurement - t.measurement delta_value
FROM (
    SELECT *, LEAD(time) OVER (ORDER BY time) next
    FROM TX
) t
LEFT JOIN RX r ON r.time >= t.time AND (r.time < t.next OR t.next IS NULL)
Results:
id | TxTime       | RxTime       | delta_time(s) | delta_value
---------------------------------------------------------------
1  | 09:40:10.221 | null         | null          | null
2  | 09:40:15.340 | 09:40:15.7   | 0.36          | 5
3  | 09:40:21.100 | 09:40:21.560 | 0.46          | 0
4  | 09:40:25.123 | null         | null          | null
5  | 09:40:29.221 | 09:40:30.414 | 1.193         | 5
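Note that LEAD() requires SQLite 3.25 or later. On an older SQLite, a correlated subquery can play the same role (a sketch under that assumption, using the same view names):

SELECT t.id, t.time TxTime, r.time RxTime,
       ROUND((julianday(r.time) - julianday(t.time)) * 24 * 60 * 60, 3) [delta_time(s)],
       r.measurement - t.measurement delta_value
FROM (
    SELECT tx.*,
           (SELECT MIN(t2.time) FROM TX t2 WHERE t2.time > tx.time) next
    FROM TX tx
) t
LEFT JOIN RX r ON r.time >= t.time AND (r.time < t.next OR t.next IS NULL)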

how to join tables on cases where none of function(a) in b

Say in MonetDB (specifically, the embedded version from the "MonetDBLite" R package) I have a table "events" containing entity ID codes and event start and end dates, of the format:
| id  | start_date | end_date   |
| 1   | 2010-01-01 | 2010-03-30 |
| 1   | 2010-04-01 | 2010-06-30 |
| 2   | 2018-04-01 | 2018-06-30 |
| ... | ...        | ...        |
The table is approximately 80 million rows of events, attributable to approximately 2.5 million unique entities (ID values). The dates appear to align nicely with calendar quarters, but I haven't thoroughly checked them so assume they can be arbitrary. However, I have at least sense-checked them for end_date > start_date.
I want to produce a table "nonevent_qtrs" listing calendar quarters where an ID has no event recorded, e.g.:
| id  | last_doq   |
| 1   | 2010-09-30 |
| 1   | 2010-12-31 |
| ... | ...        |
| 1   | 2018-06-30 |
| 2   | 2010-03-30 |
| ... | ...        |
(doq = day of quarter)
If the extent of an event spans any days of the quarter (including the first and last dates), then I wish for it to count as having occurred in that quarter.
To help with this, I have produced a "calendar table"; a table of quarters "qtrs", covering the entire span of dates present in "events", and of the format:
| first_doq  | last_doq   |
| 2010-01-01 | 2010-03-30 |
| 2010-04-01 | 2010-06-30 |
| ...        | ...        |
And tried using a non-equi merge like so:
create table nonevents as
select
    id,
    last_doq
from events
full outer join qtrs
    on start_date > last_doq
    or end_date < first_doq
group by
    id,
    last_doq
But this is a) terribly inefficient and b) certainly wrong, since most IDs are listed as being non-eventful for all quarters.
How can I produce the table "nonevent_qtrs" I described, which contains a list of quarters for which each ID had no events?
If it's relevant, the ultimate use case is to calculate runs of non-events for time-till-event analysis and prediction. It feels like run-length encoding will be required, so if there's a more direct approach than what I've described above, I'm all ears. The only reason I'm focusing on non-event runs to begin with is to try to limit the size of the cross product. I've also considered producing something like:
| id  | last_doq   | event |
| 1   | 2010-01-31 | 1     |
| ... | ...        | ...   |
| 1   | 2018-06-30 | 0     |
| ... | ...        | ...   |
But although more useful this may not be feasible due to the size of the data involved. A wide format:
| id  | 2010-01-31 | ... | 2018-06-30 |
| 1   | 1          | ... | 0          |
| 2   | 0          | ... | 1          |
| ... | ...        | ... | ...        |
would also be handy, but since MonetDB is a column store, I'm not sure whether this is more or less efficient.
Let me assume that you have a table of quarters with the start date and end date of each quarter; that is your qtrs table, with first_doq and last_doq. You really need this if you want the quarters that don't exist. After all, how far back or forward in time do you want to go?
Then, you can generate all id/quarter combinations and filter out the ones that exist:
select i.id, q.*
from (select distinct id from events) i cross join
     qtrs q left join
     events e
     on e.id = i.id and
        e.start_date <= q.last_doq and
        e.end_date >= q.first_doq
where e.id is null;
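To materialize this as the nonevent_qtrs table described in the question, wrap it in create table as and keep only the requested columns (a sketch using the question's qtrs column names):

create table nonevent_qtrs as
select i.id, q.last_doq
from (select distinct id from events) i cross join
     qtrs q left join
     events e
     on e.id = i.id and
        e.start_date <= q.last_doq and
        e.end_date >= q.first_doq
where e.id is null;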

How to find two consecutive rows sorted by date, containing a specific value?

I have a table with the following structure and data in it:
| ID | Date | Result |
|---- |------------ |-------- |
| 1 | 30/04/2020 | + |
| 1 | 01/05/2020 | - |
| 1 | 05/05/2020 | - |
| 2 | 03/05/2020 | - |
| 2 | 04/05/2020 | + |
| 2 | 05/05/2020 | - |
| 2 | 06/05/2020 | - |
| 3 | 01/05/2020 | - |
| 3 | 02/05/2020 | - |
| 3 | 03/05/2020 | - |
| 3 | 04/05/2020 | - |
I'm trying to write an SQL query (I'm using SQL Server) which returns the date of the first two consecutive negative results for a given ID.
For example, for ID no. 1, the first two consecutive negative results are on 01/05 and 05/05.
The first two consecutive negative results for ID no. 2 are on 05/05 and 06/05.
The first two consecutive negative results for ID no. 3 are on 01/05 and 02/05.
So the query should produce the following result:
| ID | FirstNegativeDate |
|---- |------------------- |
| 1 | 01/05 |
| 2 | 05/05 |
| 3 | 01/05 |
Please note that the dates aren't necessarily one day apart. Sometimes, two consecutive negative tests may be several days apart. But they should still be considered as "consecutive negative tests". In other words, two negative tests are not 'consecutive' only if there is a positive test result in between them.
How can this be done in SQL? I've done some reading, and it looks like the PARTITION BY clause may be required, but I'm not sure how it works.
This is a gaps-and-islands problem, where you want the start of the first island of '-'s that contains at least two rows.
I would recommend lead() and aggregation:
select id, min(date) first_negative_date
from (
    select t.*, lead(result) over (partition by id order by date) lead_result
    from mytable t
) t
where result = '-' and lead_result = '-'
group by id
Use the LEAD or LAG function over a partition by ID, ordered by your Date column. Then simply check where the LEAD/LAG column equals Result, and filter down to the first qualifying date per ID. (An image originally attached here showed what LEAD/LAG returns.)
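A sketch of that approach, assuming the same mytable as in the answer above; carrying the previous row's date alongside the previous result makes it easy to report the first date of each pair:

select id, min(prev_date) as FirstNegativeDate
from (
    select t.*,
           lag(result) over (partition by id order by date) as prev_result,
           lag(date) over (partition by id order by date) as prev_date
    from mytable t
) t
where result = '-' and prev_result = '-'
group by id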

Comparing every row in table with the master row

I have a Redshift table with a single VARCHAR column named "Test" and several float columns. The "Test" column has unique values; one of them is "Control", the others are not hardcoded.
The table has ~10 rows (not static) and ~10 columns.
I need to generate a Looker report which shows the original data and the difference between the corresponding float columns of "Control" and the other tests.
Input Example:
Test    | Metric_1 | Metric_2
-----------------------------
Control | 10       | 100
A       | 12       | 120
B       | 8        | 80
The desirable report:
         | Control | A   | A-Control | B  | B-Control
---------|---------|-----|-----------|----|----------
Metric_1 | 10      | 12  | 2         | 8  | -2
Metric_2 | 100     | 120 | 20        | 80 | -20
To calculate the difference between each row and "Control", I tried:
SELECT T.test,
       T.metric_1 - Control.metric_1 AS DIFF1,
       T.metric_2 - Control.metric_2 AS DIFF2,
       ...
FROM T, (SELECT * FROM T WHERE test = 'Control') AS Control
I can do part of the work in Looker (it can transpose) and part in SQL, but I still cannot figure out how to build this report.
You could transpose the test dimension to build part of it:

         | Control | A   | B  |
---------|---------|-----|----|
Metric_1 | 10      | 12  | 8  |
Metric_2 | 100     | 120 | 80 |

Then operate on top of this result using table calculations.
You can use the functions pivot_where() or pivot_index().
For example, pivot_where(test = 'A', metric) - pivot_where(test = 'Control', metric)
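If you would rather compute the differences in SQL before Looker transposes, the self-join from the question can be completed along these lines (a sketch that assumes the question's table T and its two example metric columns):

SELECT t.test,
       t.metric_1,
       t.metric_1 - c.metric_1 AS metric_1_diff,
       t.metric_2,
       t.metric_2 - c.metric_2 AS metric_2_diff
FROM T t
CROSS JOIN (SELECT metric_1, metric_2 FROM T WHERE test = 'Control') c

Since the Control subquery returns exactly one row, the cross join simply attaches the control values to every test row; Looker then only has to transpose the result.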