Apache Spark window, choose previous last item based on some condition - SQL

I have input data with columns id, pid, pname, ppid, where id can be thought of as time, pid is a process id, pname is the process name, and ppid is the id of the parent process that created pid.
+---+---+-----+----+
| id|pid|pname|ppid|
+---+---+-----+----+
| 1| 1| 5| -1|
| 2| 1| 7| -1|
| 3| 2| 9| 1|
| 4| 2| 11| 1|
| 5| 3| 5| 1|
| 6| 4| 7| 2|
| 7| 1| 9| 3|
+---+---+-----+----+
Now I need to find ppname (parent process name), which is the pname of the most recent previous row satisfying previous.pid == current.ppid.
Expected result for the example above:
+---+---+-----+----+------+
| id|pid|pname|ppid|ppname|
+---+---+-----+----+------+
| 1| 1| 5| -1| -1|
| 2| 1| 7| -1| -1| no item found above with pid=-1
| 3| 2| 9| 1| 7| last pid = 1(ppid) above, pname=7
| 4| 2| 11| 1| 7|
| 5| 3| 5| 1| 7|
| 6| 4| 7| 2| 11| last pid = 2(ppid) above, pname=11
| 7| 1| 9| 3| 5| last pid = 3(ppid) above, pname=5
+---+---+-----+----+------+
I could self-join on pid == ppid, take the difference between the ids, and pick the row with the minimum positive difference, then join back again for the cases where no positive diff was found (the -1 case).
But I am thinking that is almost a cross join, which I may not be able to afford since I have 100M rows.
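A window-only idea that avoids the self-join (a hedged sketch, not a tested answer): duplicate each row into an "emit" event keyed by pid and a "lookup" event keyed by ppid, then take the last non-null pname per key over strictly earlier ids. The names key, etype and emit_pname are invented for the sketch, and unionByName(..., allowMissingColumns=True) assumes Spark 3.1+.
from pyspark.sql import functions as F, Window

# "emit" events: at time `id`, process `pid` carries name `pname`
emit = df.select('id', F.col('pid').alias('key'),
                 F.col('pname').alias('emit_pname'), F.lit(1).alias('etype'))

# "lookup" events: the original rows, keyed by the parent pid they want to resolve
lookup = df.select('id', 'pid', 'pname', 'ppid',
                   F.col('ppid').alias('key'), F.lit(0).alias('etype'))

# columns missing on either side become null; requires Spark 3.1+
events = emit.unionByName(lookup, allowMissingColumns=True)

# lookup events (etype 0) sort before emit events that share the same id,
# so a row only ever sees emissions from strictly earlier ids
w = (Window.partitionBy('key')
           .orderBy('id', 'etype')
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

result = (events
          .withColumn('ppname', F.last('emit_pname', ignorenulls=True).over(w))
          .where(F.col('etype') == 0)                       # keep only the original rows
          .select('id', 'pid', 'pname', 'ppid', 'ppname')   # null ppname = nothing found (the -1 case)
          .orderBy('id'))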

SQL grouped running sum

I have some data like this
data = [("1","1"), ("1","1"), ("1","1"), ("2","1"), ("2","1"), ("3","1"), ("3","1"), ("4","1"),]
df = spark.createDataFrame(data=data, schema=["id","imp"])
df.createOrReplaceTempView("df")
+---+---+
| id|imp|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
| 2| 1|
| 2| 1|
| 3| 1|
| 3| 1|
| 4| 1|
+---+---+
I want the count of IDs grouped by ID, its running sum, and the total sum. This is the code I'm using:
query = """
select id,
count(id) as count,
sum(count(id)) over (order by count(id) desc) as running_sum,
sum(count(id)) over () as total_sum
from df
group by id
order by count desc
"""
spark.sql(query).show()
+---+-----+-----------+---------+
| id|count|running_sum|total_sum|
+---+-----+-----------+---------+
| 1| 3| 3| 8|
| 2| 2| 7| 8|
| 3| 2| 7| 8|
| 4| 1| 8| 8|
+---+-----+-----------+---------+
The problem is with the running_sum column. For some reason it groups the tied counts together while summing and shows 7 for both ID 2 and ID 3.
This is the result I'm expecting
+---+-----+-----------+---------+
| id|count|running_sum|total_sum|
+---+-----+-----------+---------+
| 1| 3| 3| 8|
| 2| 2| 5| 8|
| 3| 2| 7| 8|
| 4| 1| 8| 8|
+---+-----+-----------+---------+
You should do the running sum in an outer query. Note also that with an ORDER BY and no explicit frame, the window defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which treats tied (peer) rows as one unit; that is why both count-2 rows get 7. An explicit ROWS frame gives a row-by-row running sum.
spark.sql('''
    select *,
        sum(cnt) over (order by id rows between unbounded preceding and current row) as run_sum,
        sum(cnt) over (partition by '1') as tot_sum
    from (
        select id, count(id) as cnt
        from df
        group by id)
'''). \
    show()
# +---+---+-------+-------+
# | id|cnt|run_sum|tot_sum|
# +---+---+-------+-------+
# | 1| 3| 3| 8|
# | 2| 2| 5| 8|
# | 3| 2| 7| 8|
# | 4| 1| 8| 8|
# +---+---+-------+-------+
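If it helps, a hedged single-query variant of the original SQL (untested sketch): give the running sum an explicit ROWS frame and a tiebreak on id so the two count-2 rows get distinct running sums.
spark.sql('''
    select id,
           count(id) as count,
           sum(count(id)) over (order by count(id) desc, id
                                rows between unbounded preceding and current row) as running_sum,
           sum(count(id)) over () as total_sum
    from df
    group by id
    order by count desc
''').show()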
Using dataframe API
import sys

from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

df. \
    groupBy('id'). \
    agg(func.count('id').alias('cnt')). \
    withColumn('run_sum',
               func.sum('cnt').over(wd.partitionBy().orderBy('id').rowsBetween(-sys.maxsize, 0))). \
    withColumn('tot_sum', func.sum('cnt').over(wd.partitionBy())). \
    show()
# +---+---+-------+-------+
# | id|cnt|run_sum|tot_sum|
# +---+---+-------+-------+
# | 1| 3| 3| 8|
# | 2| 2| 5| 8|
# | 3| 2| 7| 8|
# | 4| 1| 8| 8|
# +---+---+-------+-------+

Sum of row values within a window range in a Spark dataframe

I have a dataframe as shown below, where the count column holds the number of values of A that have to be added together to get a new column.
+---+----+----+------------+
| ID|date| A| count|
+---+----+----+------------+
| 1| 1| 10| null|
| 1| 2| 10| null|
| 1| 3|null| null|
| 1| 4| 20| 1|
| 1| 5|null| null|
| 1| 6|null| null|
| 1| 7| 60| 2|
| 1| 7| 60| null|
| 1| 8|null| null|
| 1| 9|null| null|
| 1| 10|null| null|
| 1| 11| 80| 3|
| 2| 1| 10| null|
| 2| 2| 10| null|
| 2| 3|null| null|
| 2| 4| 20| 1|
| 2| 5|null| null|
| 2| 6|null| null|
| 2| 7| 60| 2|
+---+----+----+------------+
The expected output is as shown below.
+---+----+----+-----+-------+
| ID|date| A|count|new_col|
+---+----+----+-----+-------+
| 1| 1| 10| null| null|
| 1| 2| 10| null| null|
| 1| 3| 10| null| null|
| 1| 4| 20| 2| 30|
| 1| 5| 10| null| null|
| 1| 6| 10| null| null|
| 1| 7| 60| 3| 80|
| 1| 7| 60| null| null|
| 1| 8|null| null| null|
| 1| 9|null| null| null|
| 1| 10| 10| null| null|
| 1| 11| 80| 2| 90|
| 2| 1| 10| null| null|
| 2| 2| 10| null| null|
| 2| 3|null| null| null|
| 2| 4| 20| 1| 20|
| 2| 5|null| null| null|
| 2| 6| 20| null| null|
| 2| 7| 60| 2| 80|
+---+----+----+-----+-------+
I tried with a window function as follows:
val w2 = Window.partitionBy("ID").orderBy("date")
val newDf = df
  .withColumn("new_col", when(col("A").isNotNull && col("count").isNotNull,
    sum(col("A")).over(Window.partitionBy("ID").orderBy("date")
      .rowsBetween(Window.currentRow - col("count"), Window.currentRow))))
But I am getting the error below:
error: overloaded method value - with alternatives:
(x: Long)Long <and>
(x: Int)Long <and>
(x: Char)Long <and>
(x: Short)Long <and>
(x: Byte)Long
cannot be applied to (org.apache.spark.sql.Column)
It seems like the Column value provided inside the window frame is causing the issue.
Any idea how to resolve this error to achieve the requirement, or any alternative solutions?
Any leads appreciated!
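Frame bounds have to be literals rather than Columns, so one possible workaround (sketched here in PySpark to stay consistent with the rest of this page; the Scala API is analogous) is to collect the running non-null values of A per ID and sum the trailing count + 1 entries with SQL array functions. This assumes new_col is meant to be A plus the previous count non-null values of A, and the helper column a_hist is invented for the sketch.
from pyspark.sql import functions as F, Window

# running history of non-null A values within each ID, ordered by date
w_hist = (Window.partitionBy('ID').orderBy('date')
                .rowsBetween(Window.unboundedPreceding, Window.currentRow))

new_df = (df
          .withColumn('a_hist', F.collect_list('A').over(w_hist))   # collect_list skips nulls
          .withColumn('new_col',
                      F.when(F.col('A').isNotNull() & F.col('count').isNotNull(),
                             F.expr("aggregate(slice(a_hist, greatest(size(a_hist) - `count`, 1), `count` + 1), "
                                    "0L, (acc, x) -> acc + x)")))
          .drop('a_hist'))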

SQL or PySpark - Get the last time a column had a different value for each ID

I am using pyspark so I have tried both pyspark code and SQL.
I am trying to get the last time the ADDRESS column held a different value, grouped by USER_ID. The rows are ordered by TIME. Take the table below:
+---+-------+-------+----+
| ID|USER_ID|ADDRESS|TIME|
+---+-------+-------+----+
| 1| 1| A| 10|
| 2| 1| B| 15|
| 3| 1| A| 20|
| 4| 1| A| 40|
| 5| 1| A| 45|
+---+-------+-------+----+
The correct new column I would like is as below:
+---+-------+-------+----+---------+
| ID|USER_ID|ADDRESS|TIME|LAST_DIFF|
+---+-------+-------+----+---------+
| 1| 1| A| 10| null|
| 2| 1| B| 15| 10|
| 3| 1| A| 20| 15|
| 4| 1| A| 40| 15|
| 5| 1| A| 45| 15|
+---+-------+-------+----+---------+
I have tried using different windows but none ever seem to get exactly what I want. Any ideas?
A simplified version of #jxc's answer.
from pyspark.sql.functions import *
from pyspark.sql import Window
# Window definition
w = Window.partitionBy(col('user_id')).orderBy(col('id'))

# Getting the previous time and classifying rows into groups
grp_df = df.withColumn('grp', sum(when(lag(col('address')).over(w) == col('address'), 0).otherwise(1)).over(w)) \
           .withColumn('prev_time', lag(col('time')).over(w))

# Window definition with groups
w_grp = Window.partitionBy(col('user_id'), col('grp')).orderBy(col('id'))
grp_df.withColumn('last_addr_change_time', min(col('prev_time')).over(w_grp)).show()
Use lag with a running sum to assign group labels whenever the column value changes (based on the defined window), and grab the time from the previous row, which is used in the next step.
Once you have the groups, use the running minimum of the previous time to get the last timestamp at which the column value changed. (I suggest looking at the intermediate results to understand the transformations better.)
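For the sample table above, the intermediate frame (before the final running-min column is added) should look roughly like this, worked out by hand rather than run:
grp_df.show()
# +---+-------+-------+----+---+---------+
# | ID|USER_ID|ADDRESS|TIME|grp|prev_time|
# +---+-------+-------+----+---+---------+
# |  1|      1|      A|  10|  1|     null|
# |  2|      1|      B|  15|  2|       10|
# |  3|      1|      A|  20|  3|       15|
# |  4|      1|      A|  40|  3|       20|
# |  5|      1|      A|  45|  3|       40|
# +---+-------+-------+----+---+---------+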
One way using two Window specs:
from pyspark.sql.functions import when, col, lag, sum as fsum
from pyspark.sql import Window
w1 = Window.partitionBy('USER_ID').orderBy('ID')
w2 = Window.partitionBy('USER_ID').orderBy('g')
# create a new sub-group label based on the values of ADDRESS and Previous ADDRESS
df1 = df.withColumn('g', fsum(when(col('ADDRESS') == lag('ADDRESS').over(w1), 0).otherwise(1)).over(w1))
# group by USER_ID and the above sub-group label and calculate the sum of time in the group as diff
# calculate the last_diff and then join the data back to the df1
df2 = df1.groupby('USER_ID', 'g').agg(fsum('Time').alias('diff')).withColumn('last_diff', lag('diff').over(w2))
df1.join(df2, on=['USER_ID', 'g']).show()
+-------+---+---+-------+----+----+---------+
|USER_ID| g| ID|ADDRESS|TIME|diff|last_diff|
+-------+---+---+-------+----+----+---------+
| 1| 1| 1| A| 10| 10| null|
| 1| 2| 2| B| 15| 15| 10|
| 1| 3| 3| A| 20| 105| 15|
| 1| 3| 4| A| 40| 105| 15|
| 1| 3| 5| A| 45| 105| 15|
+-------+---+---+-------+----+----+---------+
df_new = df1.join(df2, on=['USER_ID', 'g']).drop('g', 'diff')

Counting number of nulls in pyspark dataframe by row

So I want to count the number of nulls in a dataframe, row by row.
Please note there are 50+ columns. I know I could do a case/when statement to do this, but I would prefer a neater solution.
For example, a subset:
columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, 2, 0, None),(2, None, 1, None),(3,None,9, 1)]
df=spark.createDataFrame(vals,columns)
df.show()
+---+-----+-----+-----+
| id|item1|item2|item3|
+---+-----+-----+-----+
| 1| 2| 'A'| null|
| 2| null| 1| null|
| 3| null| 9| 'C'|
+---+-----+-----+-----+
After running the code, the desired output is:
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
| 1| 2| 'A'| null| 1|
| 2| null| 1| null| 2|
| 3| null| 9| 'C'| 1|
+---+-----+-----+-----+--------+
EDIT: Not all non-null values are ints.
Convert nulls to 1 and non-nulls to 0, then sum across all the columns:
df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
| 1| 2| 0| null| 1|
| 2| null| 1| null| 2|
| 3| null| 9| 1| 1|
+---+-----+-----+-----+--------+
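Note that this relies on Python's builtin sum, which a from pyspark.sql.functions import * would shadow. A hedged alternative is to fold the per-column null flags explicitly:
from functools import reduce
from operator import add

from pyspark.sql import functions as F

# one 0/1 flag per source column, added up row-wise
null_flags = [F.col(c).isNull().cast('int') for c in df.columns]
df.withColumn('numNulls', reduce(add, null_flags)).show()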

Getting Memory Error in PySpark during Filter & GroupBy computation

This is the error:
Job aborted due to stage failure: Task 12 in stage 37.0 failed 4 times, most recent failure: Lost task 12.3 in stage 37.0 (TID 325, 10.139.64.5, executor 20): ExecutorLostFailure (executor 20 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages
So is there an alternative, more efficient way to apply those functions without causing an out-of-memory error? I have billions of rows to compute.
Input Dataframe on which filtering is to be done:
+------+-------+-------+------+-------+-------+
|Pos_id|level_p|skill_p|Emp_id|level_e|skill_e|
+------+-------+-------+------+-------+-------+
| 1| 2| a| 100| 2| a|
| 1| 2| a| 100| 3| f|
| 1| 2| a| 100| 2| d|
| 1| 2| a| 101| 4| a|
| 1| 2| a| 101| 5| b|
| 1| 2| a| 101| 1| e|
| 1| 2| a| 102| 5| b|
| 1| 2| a| 102| 3| d|
| 1| 2| a| 102| 2| c|
| 2| 2| d| 100| 2| a|
| 2| 2| d| 100| 3| f|
| 2| 2| d| 100| 2| d|
| 2| 2| d| 101| 4| a|
| 2| 2| d| 101| 5| b|
| 2| 2| d| 101| 1| e|
| 2| 2| d| 102| 5| b|
| 2| 2| d| 102| 3| d|
| 2| 2| d| 102| 2| c|
| 2| 4| b| 100| 2| a|
| 2| 4| b| 100| 3| f|
+------+-------+-------+------+-------+-------+
Filtering Code:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql import functions as sf

function = udf(lambda item, items: 1 if item in items else 0, IntegerType())

df_result = new_df.withColumn('result', function(sf.col('skill_p'), sf.col('skill_e')))
df_filter = df_result.filter(sf.col("result") == 1)
df_filter.show()

res = df_filter.groupBy("Pos_id", "Emp_id").agg(
    sf.collect_set("skill_p").alias("SkillsMatch"),
    sf.sum("result").alias("SkillsMatchedCount"))
res.show()
This needs to be done on billions of rows.
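Not a tested answer, but one direction worth trying as a sketch: replace the Python UDF with built-in expressions so rows never have to be serialized to Python workers, and filter before grouping. The contains call below assumes skill_e is a plain string (mirroring the item in items check); if it were an array column, the commented array_contains line would be the equivalent.
from pyspark.sql import functions as sf

# substring membership, like `item in items` on strings
matched = new_df.filter(sf.col('skill_e').contains(sf.col('skill_p')))
# if skill_e were an array column instead:
# matched = new_df.filter(sf.expr('array_contains(skill_e, skill_p)'))

res = (matched
       .groupBy('Pos_id', 'Emp_id')
       .agg(sf.collect_set('skill_p').alias('SkillsMatch'),
            sf.count(sf.lit(1)).alias('SkillsMatchedCount')))   # every kept row had result == 1, so a count replaces sum(result)
res.show()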