Indexing a PySpark column based on intervals defined in other columns

I have a PySpark DataFrame where I want to insert an ID column based on the column pad_change.
My DataFrame looks something like this:
+-------+----------+----------+--------------+
|TOOL_ID|pad_change|      DATE|        Pad_ID|
+-------+----------+----------+--------------+
|  59628|         1|2021-05-22|PAD_2021-05-22|
|  59628|         0|2021-05-23|          null|
|  59628|         0|2021-05-23|          null|
|  59628|         0|2021-05-24|          null|
|  59628|         0|2021-05-25|          null|
|  59628|         0|2021-05-26|          null|
|  59628|         0|2021-05-27|          null|
|  59628|         0|2021-05-27|          null|
|  59628|         0|2021-05-28|          null|
|  59628|         0|2021-05-29|          null|
|  59628|         1|2021-06-02|PAD_2021-06-02|
|  59628|         0|2021-06-02|          null|
|  59628|         0|2021-06-02|          null|
|  59628|         0|2021-06-02|          null|
|  59628|         0|2021-06-03|          null|
|  59628|         0|2021-06-04|          null|
|  59628|         0|2021-06-04|          null|
|  59628|         0|2021-06-04|          null|
|  59628|         0|2021-06-04|          null|
|  59628|         1|2021-06-05|PAD_2021-06-05|
|  59628|         0|2021-06-06|          null|
|  59628|         0|2021-06-07|          null|
|  59628|         0|2021-06-08|          null|
|  59628|         0|2021-06-08|          null|
|  59628|         0|2021-06-09|          null|
|  59628|         0|2021-06-09|          null|
|  59628|         0|2021-06-10|          null|
+-------+----------+----------+--------------+
I would like the Pad_ID column to keep the value set where pad_change = 1 until the next time pad_change changes from 0 to 1.
Expected Output:
+-------+----------+----------+--------------+
|TOOL_ID|pad_change|      DATE|        Pad_ID|
+-------+----------+----------+--------------+
|  59628|         1|2021-05-22|PAD_2021-05-22|
|  59628|         0|2021-05-23|PAD_2021-05-22|
|  59628|         0|2021-05-23|PAD_2021-05-22|
|  59628|         0|2021-05-24|PAD_2021-05-22|
|  59628|         0|2021-05-25|PAD_2021-05-22|
|  59628|         0|2021-05-26|PAD_2021-05-22|
|  59628|         0|2021-05-27|PAD_2021-05-22|
|  59628|         0|2021-05-27|PAD_2021-05-22|
|  59628|         0|2021-05-28|PAD_2021-05-22|
|  59628|         0|2021-05-29|PAD_2021-05-22|
|  59628|         1|2021-06-02|PAD_2021-06-02|
|  59628|         0|2021-06-02|PAD_2021-06-02|
|  59628|         0|2021-06-02|PAD_2021-06-02|
|  59628|         0|2021-06-02|PAD_2021-06-02|
|  59628|         0|2021-06-03|PAD_2021-06-02|
|  59628|         0|2021-06-04|PAD_2021-06-02|
|  59628|         0|2021-06-04|PAD_2021-06-02|
|  59628|         0|2021-06-04|PAD_2021-06-02|
|  59628|         0|2021-06-04|PAD_2021-06-02|
|  59628|         1|2021-06-05|PAD_2021-06-05|
|  59628|         0|2021-06-06|PAD_2021-06-05|
|  59628|         0|2021-06-07|PAD_2021-06-05|
|  59628|         0|2021-06-08|PAD_2021-06-05|
|  59628|         0|2021-06-08|PAD_2021-06-05|
|  59628|         0|2021-06-09|PAD_2021-06-05|
|  59628|         0|2021-06-09|PAD_2021-06-05|
|  59628|         0|2021-06-10|PAD_2021-06-05|
+-------+----------+----------+--------------+
Is there a way to do this in PySpark?

You can try the solution below, which creates two windows. The first builds a cumulative sum of pad_change per TOOL_ID to act as a group identifier; that helper column is then combined with TOOL_ID in a second window, over which the first Pad_ID value is taken. Finally the helper column (Win_) is dropped.
from pyspark.sql import functions as F, Window as W

# Running frame per TOOL_ID; rangeBetween keeps rows sharing the same DATE in the same frame.
w = (W.partitionBy("TOOL_ID")
     .orderBy("DATE")
     .rangeBetween(W.unboundedPreceding, W.currentRow))

# One partition per pad change, identified by the cumulative sum Win_.
w1 = W.partitionBy("TOOL_ID", "Win_").orderBy("DATE")

out = (df.withColumn("Win_", F.sum("pad_change").over(w))
       .withColumn("Pad_ID", F.first("Pad_ID", ignorenulls=True).over(w1))  # ignorenulls makes the pick deterministic
       .drop("Win_"))
out.show(30)
+-------+----------+----------+--------------+
|TOOL_ID|pad_change|      DATE|        Pad_ID|
+-------+----------+----------+--------------+
|  59628|         1|2021-05-22|PAD_2021-05-22|
|  59628|         0|2021-05-23|PAD_2021-05-22|
|  59628|         0|2021-05-23|PAD_2021-05-22|
|  59628|         0|2021-05-24|PAD_2021-05-22|
|  59628|         0|2021-05-25|PAD_2021-05-22|
|  59628|         0|2021-05-26|PAD_2021-05-22|
|  59628|         0|2021-05-27|PAD_2021-05-22|
|  59628|         0|2021-05-27|PAD_2021-05-22|
|  59628|         0|2021-05-28|PAD_2021-05-22|
|  59628|         0|2021-05-29|PAD_2021-05-22|
|  59628|         1|2021-06-02|PAD_2021-06-02|
|  59628|         0|2021-06-02|PAD_2021-06-02|
|  59628|         0|2021-06-02|PAD_2021-06-02|
|  59628|         0|2021-06-02|PAD_2021-06-02|
|  59628|         0|2021-06-03|PAD_2021-06-02|
|  59628|         0|2021-06-04|PAD_2021-06-02|
|  59628|         0|2021-06-04|PAD_2021-06-02|
|  59628|         0|2021-06-04|PAD_2021-06-02|
|  59628|         0|2021-06-04|PAD_2021-06-02|
|  59628|         1|2021-06-05|PAD_2021-06-05|
|  59628|         0|2021-06-06|PAD_2021-06-05|
|  59628|         0|2021-06-07|PAD_2021-06-05|
|  59628|         0|2021-06-08|PAD_2021-06-05|
|  59628|         0|2021-06-08|PAD_2021-06-05|
|  59628|         0|2021-06-09|PAD_2021-06-05|
|  59628|         0|2021-06-09|PAD_2021-06-05|
|  59628|         0|2021-06-10|PAD_2021-06-05|
+-------+----------+----------+--------------+
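As an alternative (a sketch, not part of the original answer): the same result can be obtained without the helper column by forward-filling Pad_ID directly with last(..., ignorenulls=True) over a running window, assuming the same DataFrame df as above:
from pyspark.sql import functions as F, Window as W

# Running frame per TOOL_ID; rangeBetween (as in the answer above) keeps rows
# that share a DATE in the same frame, so ties still see the pad_change row.
w_fill = (W.partitionBy("TOOL_ID")
          .orderBy("DATE")
          .rangeBetween(W.unboundedPreceding, W.currentRow))

# Forward-fill: carry the most recent non-null Pad_ID onto the pad_change = 0 rows.
out2 = df.withColumn("Pad_ID", F.last("Pad_ID", ignorenulls=True).over(w_fill))
out2.show(30)
This relies on Pad_ID being non-null exactly on the pad_change = 1 rows, which is the case in the sample data.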

Related

How to fill a column to differentiate a set of rows from other rows in a group in Impala?

I have the following table in Impala.
|LogTime|ClientId|IsNewSession|
|1 |123 |1 |
|2 |123 | |
|3 |123 | |
|3 |666 |1 |
|4 |666 | |
|10 |123 |1 |
|23 |666 |1 |
|24 |666 | |
|25 |444 |1 |
|26 |444 | |
I want to make a new table as follows:
|LogTime|ClientId|IsNewSession|SessionId|
|1 |123 |1 |1 |
|2 |123 | |1 |
|3 |123 | |1 |
|3 |666 |1 |1 |
|4 |666 | |1 |
|10 |123 |1 |2 |
|23 |666 |1 |2 |
|24 |666 | |2 |
|25 |444 |1 |1 |
|26 |444 | |1 |
Basically, I want to build a SessionId column that assigns a unique session ID to each set of rows between one IsNewSession = 1 marker and the next, grouped by ClientId, so that different sessions can be distinguished per ClientId.
I've already created the IsNewSession column for this purpose, but I'm not sure how to iterate over the rows to build the SessionId column.
Any help would be greatly appreciated!
You can use a cumulative sum:
select t.*,
       sum(isnewsession) over (partition by clientid order by logtime) as sessionid
from t;
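This is the same cumulative-sum idea as in the PySpark answer above. For comparison, a minimal PySpark sketch, assuming a DataFrame df with the columns from this table (sum() ignores the null IsNewSession rows, so they simply inherit the running count of session starts):
from pyspark.sql import functions as F, Window as W

# Cumulative count of session starts per client, ordered by log time.
w_sess = (W.partitionBy("ClientId")
          .orderBy("LogTime")
          .rowsBetween(W.unboundedPreceding, W.currentRow))

df_with_sessions = df.withColumn("SessionId", F.sum("IsNewSession").over(w_sess))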

How to subtract or sum data and store the values automatically?

I have a table named transaction.
acc_type holds the type of each transaction:
income accounts = 101, 102, 103, 104, 105, 106
outcome accounts = 111, 112, 113, 114, 115, 116
In the example table below, 101 is an income account and 111 is an outcome account.
___________________________________________
|id|mem_id|name|date |acc_type|amount |
-------------------------------------------|
| 1|m01 |A |1/6/2020|101 |150 |
| 2|m01 |A |2/6/2020|101 |200 |
| 3|m01 |A |3/6/2020|111 |100 |
| 4|m02 |B |3/6/2020|101 |150 |
| 5|m02 |B |3/6/2020|111 | 50 |
| 6|m02 |B |4/6/2020|101 |100 |
| 7|m03 |C |4/6/2020|102 |500 |
| 8|m03 |C |5/6/2020|112 | 50 |
| 9|m03 |C |5/6/2020|112 |100 |
--------------------------------------------
So, for rows with the same mem_id, amounts with an income acc_type (such as 101) should be added to the running balance, while amounts with an outcome acc_type (such as 111) should be subtracted.
Is it possible to create an up-to-date balance table like this:
_______________________________________________________
|id|mem_id|name|date |acc_type|amount |balance |
-------------------------------------------|------------
| 1|m01 |A |1/6/2020|101 |150 |150 |
| 2|m01 |A |2/6/2020|101 |200 |350 |
| 3|m01 |A |3/6/2020|111 |100 |250 |
| 4|m02 |B |3/6/2020|101 |150 |150 |
| 5|m02 |B |3/6/2020|111 | 50 |100 |
| 6|m02 |B |4/6/2020|101 |100 |200 |
| 7|m03 |C |4/6/2020|102 |500 |500 |
| 8|m03 |C |5/6/2020|112 | 50 |450 |
| 9|m03 |C |5/6/2020|112 |100 |350 |
--------------------------------------------------------
If yes, what is the best method to apply (a view, trigger, stored procedure, or transaction)? I need this balance table, and especially the balance value of every account type for each member.
I'm using 10.4.6-MariaDB and VB.NET for the interface.
Thanks.

SQL: Need to SUM column for each type

How can I find the SUM of all scores for the minimum date of each lesson_id please:
-----------------------------------------------------------
|id |uid |group_id |lesson_id |game_id |score |date |
-----------------------------------------------------------
|1 |145 |1 |1 |0 |40 |1391627323 |
|2 |145 |1 |1 |0 |80 |1391627567 |
|3 |145 |1 |2 |0 |40 |1391627323 |
|4 |145 |1 |3 |0 |30 |1391627323 |
|5 |145 |1 |3 |0 |90 |1391627567 |
|6 |145 |1 |4 |0 |20 |1391628000 |
|7 |145 |1 |5 |0 |35 |1391628000 |
-----------------------------------------------------------
I need output:
-------------------
|sum_first_scores |
-------------------
|165 |
-------------------
I have this so far, which lists the score for each minimum date, per lesson, but I need to sum those results as above:
SELECT lesson_id, MIN(date), score AS first_score FROM cdu_user_progress
WHERE cdu_user_progress.uid = 145
GROUP BY lesson_id
You can identify the first score as the one where no earlier record exists. Then just take the sum:
select sum(score)
from cdu_user_progress eup
where eup.uid = 145 and
      not exists (select 1
                  from cdu_user_progress eup2
                  where eup2.uid = eup.uid and
                        eup2.lesson_id = eup.lesson_id and
                        eup2.date < eup.date
                 );
This assumes that the minimum date for the lesson id has only one score.

SQL: triple-nested many to many query

I'm trying to fix my nested query. I have these tables:
cdu_groups_blocks
------------------------
|id |group_id |block_id|
------------------------
|1 |1 |1 |
|2 |1 |2 |
|3 |1 |3 |
------------------------
cdu_blocks: cdu_blocks_sessions:
-------------------------- ---------------------------
|id |name |enabled | |id |block_id |session_id |
-------------------------- ---------------------------
|1 |block_1 |1 | |1 |1 |1 |
|2 |block_2 |1 | |2 |1 |2 |
|3 |block_3 |1 | |3 |2 |3 |
-------------------------- |4 |2 |4 |
|5 |3 |5 |
|6 |3 |6 |
---------------------------
cdu_sessions: cdu_sessions_lessons
-------------------------- ----------------------------
|id |name |enabled | |id |session_id |lesson_id |
-------------------------- ----------------------------
|1 |session_1 |1 | |1 |1 |1 |
|2 |session_2 |1 | |2 |1 |2 |
|3 |session_3 |1 | |3 |2 |3 |
|4 |session_4 |0 | |4 |4 |4 |
|5 |session_5 |1 | |5 |4 |5 |
|6 |session_6 |0 | |6 |5 |6 |
-------------------------- ----------------------------
cdu_lessons:
--------------------------
|id |name |enabled |
--------------------------
|1 |lesson_1 |1 |
|2 |lesson_2 |1 |
|3 |lesson_3 |1 |
|4 |lesson_4 |1 |
|5 |lesson_5 |0 |
|6 |lesson_6 |0 |
--------------------------
It's a many-to-many which links to another many-to-many which links to another many-to-many.
Essentially I want to get all lesson_id(s) associated with a particular group_id.
So far I have this, but it's throwing up various SQL errors:
SELECT b.* FROM
(
SELECT block_id, group_id FROM cdu_groups_blocks
JOIN cdu_blocks ON cdu_blocks.id = cdu_groups_blocks.block_id
WHERE group_id = $group_id
AND enabled = 1
) AS b
INNER JOIN
(
SELECT l.* FROM
(
SELECT session_id, block_id FROM cdu_blocks_sessions
JOIN cdu_sessions ON cdu_sessions.id = cdu_blocks_sessions.session_id
AND enabled = 1
) AS s
INNER JOIN
(
SELECT lesson_id, session_id FROM cdu_sessions_lessons
JOIN cdu_lessons ON cdu_lessons.id = cdu_sessions_lessons.lesson_id
WHERE enabled = 1
) AS l
WHERE s.session_id = l.session_id
) AS sl
WHERE sl.block_id = g.block_id
Any help would be much appreciated!
sl.block_id comes from the s subquery (the first select inside your sl subselect), so it has to be selected explicitly there. Change:
SELECT l.* FROM ...
to
SELECT l.*, s.block_id FROM ...

SQL - get rows where column is greater than certain amount

I need to get the sum of the first scores for each lesson_id, but I also need the overall min and max scores across all lesson_ids, as well as some other info:
cdu_groups:
----------------
|id |name |
----------------
|1 |group_1 |
|2 |group_2 |
----------------
cdu_user_progress145:
-----------------------------------------------------------
|id |uid |group_id |lesson_id |game_id |score |date |
-----------------------------------------------------------
|1 |145 |1 |1 |0 |40 |1391627323 |
|2 |145 |1 |1 |0 |80 |1391627567 |
|3 |145 |1 |2 |0 |40 |1391627323 |
|4 |145 |1 |3 |0 |30 |1391627323 |
|5 |145 |1 |3 |0 |90 |1391627567 |
|6 |145 |1 |4 |0 |20 |1391627323 |
|7 |145 |1 |5 |0 |35 |1391627323 |
-----------------------------------------------------------
I need this output:
-----------------------------------------------------------------
|name |group_id |min_score |max_score |... |sum_first_scores |
-----------------------------------------------------------------
|group_1 |1 |20 |90 |... |165 |
-----------------------------------------------------------------
SELECT
cdu_groups.*,
MAX(score) AS max_score,
MIN(score) AS min_score,
COUNT(DISTINCT(lesson_id)) AS scored_lesson_count,
COUNT(DISTINCT CASE WHEN score >= 75 then lesson_Id ELSE NULL END) as passed_lesson_count,
SUM(first_scores.first_score) AS sum_first_scores
FROM cdu_user_progress
JOIN cdu_groups ON cdu_groups.id = cdu_user_progress.group_id
JOIN
(
SELECT lesson_id, MIN(date), score AS first_score FROM cdu_user_progress
WHERE cdu_user_progress.uid = 145
GROUP BY lesson_id
) AS first_scores ON first_scores.lesson_id = cdu_user_progress.lesson_id
WHERE cdu_user_progress.uid = 145
I'm getting this error though:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SUM(first_scores.first_score) AS sum_first_scores FROM cdu_user_progress ' at line 7