Reset counter on window functions - SQL

I have a dataset like the one below, and I want to create a new column C that acts as a counter/row number. It should reset every time column B is 0, partitioned by the value of column A.
I need to do this using Spark SQL / SQL only (I can already do it with PySpark).
>>> rdd = sc.parallelize([
... [1, 0], [1, 1],[1, 1], [1, 0], [1, 1],
... [1, 1], [2, 1], [2, 1], [3, 0], [3, 1], [3, 1], [3, 1]])
>>> df = rdd.toDF(['A', 'B'])
>>>
>>> df.show()
+---+---+
| A| B|
+---+---+
| 1| 0|
| 1| 1|
| 1| 1|
| 1| 0|
| 1| 1|
| 1| 1|
| 2| 1|
| 2| 1|
| 3| 0|
| 3| 1|
| 3| 1|
| 3| 1|
+---+---+
What I would like to achieve
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 0| 1|
| 1| 1| 2|
| 1| 1| 3|
| 1| 0| 1|
| 1| 1| 2|
| 1| 1| 3|
| 2| 1| 1|
| 2| 1| 2|
| 3| 0| 1|
| 3| 1| 2|
| 3| 1| 3|
| 3| 1| 4|
+---+---+---+
What I have so far
>>> spark.sql('''
... select *, row_number() over(partition by A order by A) as C from df
... ''').show()
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 0| 1|
| 1| 1| 2|
| 1| 1| 3|
| 1| 0| 4|
| 1| 1| 5|
| 1| 1| 6|
| 3| 0| 1|
| 3| 1| 2|
| 3| 1| 3|
| 3| 1| 4|
| 2| 1| 1|
| 2| 1| 2|
+---+---+---+

SQL tables represent unordered sets, so you need a column that specifies the ordering of the data.
With such a column you can accumulate the 0 values, because they mark the start of each new group. So:
select df.*,
       row_number() over (partition by A, grp order by <ordering column>) as C
from (select df.*,
             sum(case when B = 0 then 1 else 0 end) over (partition by A order by <ordering column>) as grp
      from df
     ) df
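For a runnable end-to-end version of this idea on the sample dataframe, here is a minimal sketch. It assumes we materialize the ordering column ourselves; monotonically_increasing_id() is used purely for illustration, since it only reflects the current physical row order, and a real dataset should carry its own ordering column (a timestamp, a sequence number, ...).
from pyspark.sql import functions as F

# Attach an explicit ordering column for the window functions to use.
df.withColumn('ord', F.monotonically_increasing_id()).createOrReplaceTempView('df_ord')

spark.sql('''
    select A, B,
           row_number() over (partition by A, grp order by ord) as C
    from (
        select *,
               sum(case when B = 0 then 1 else 0 end) over (partition by A order by ord) as grp
        from df_ord
    ) t
    order by A, ord
''').show()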


SQL grouped running sum

I have some data like this
data = [("1","1"), ("1","1"), ("1","1"), ("2","1"), ("2","1"), ("3","1"), ("3","1"), ("4","1"),]
df =spark.createDataFrame(data=data,schema=["id","imp"])
df.createOrReplaceTempView("df")
+---+---+
| id|imp|
+---+---+
| 1| 1|
| 1| 1|
| 1| 1|
| 2| 1|
| 2| 1|
| 3| 1|
| 3| 1|
| 4| 1|
+---+---+
I want the count of IDs grouped by ID, its running sum, and the total sum. This is the code I'm using:
query = """
select id,
count(id) as count,
sum(count(id)) over (order by count(id) desc) as running_sum,
sum(count(id)) over () as total_sum
from df
group by id
order by count desc
"""
spark.sql(query).show()
+---+-----+-----------+---------+
| id|count|running_sum|total_sum|
+---+-----+-----------+---------+
| 1| 3| 3| 8|
| 2| 2| 7| 8|
| 3| 2| 7| 8|
| 4| 1| 8| 8|
+---+-----+-----------+---------+
The problem is with the running_sum column. For some reason it treats the two rows with count 2 together while summing and shows 7 for both ID 2 and ID 3.
This is the result I'm expecting
+---+-----+-----------+---------+
| id|count|running_sum|total_sum|
+---+-----+-----------+---------+
| 1| 3| 3| 8|
| 2| 2| 5| 8|
| 3| 2| 7| 8|
| 4| 1| 8| 8|
+---+-----+-----------+---------+
The issue is the default window frame: with an ORDER BY and no explicit frame, the window uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows tied on count(id) are peers and receive the same cumulative value. You should do the running sum in an outer query with an explicit ROWS frame:
spark.sql('''
    select *,
           sum(cnt) over (order by id rows between unbounded preceding and current row) as run_sum,
           sum(cnt) over (partition by '1') as tot_sum
    from (
        select id, count(id) as cnt
        from data_tbl
        group by id
    )
    '''). \
    show()
# +---+---+-------+-------+
# | id|cnt|run_sum|tot_sum|
# +---+---+-------+-------+
# | 1| 3| 3| 8|
# | 2| 2| 5| 8|
# | 3| 2| 7| 8|
# | 4| 1| 8| 8|
# +---+---+-------+-------+
Using the dataframe API:
import sys
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

data_sdf. \
    groupBy('id'). \
    agg(func.count('id').alias('cnt')). \
    withColumn('run_sum',
               func.sum('cnt').over(wd.partitionBy().orderBy('id').rowsBetween(-sys.maxsize, 0))
               ). \
    withColumn('tot_sum', func.sum('cnt').over(wd.partitionBy())). \
    show()
# +---+---+-------+-------+
# | id|cnt|run_sum|tot_sum|
# +---+---+-------+-------+
# | 1| 3| 3| 8|
# | 2| 2| 5| 8|
# | 3| 2| 7| 8|
# | 4| 1| 8| 8|
# +---+---+-------+-------+
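If you would rather keep the original single-query shape, the same fix can be written with an explicit ROWS frame plus a deterministic tie-breaker in the window's ORDER BY. A sketch against the df temp view from the question:
spark.sql('''
    select id,
           count(id) as count,
           sum(count(id)) over (order by count(id) desc, id
                                rows between unbounded preceding and current row) as running_sum,
           sum(count(id)) over () as total_sum
    from df
    group by id
    order by count desc
''').show()
With the tie-breaking id in the ORDER BY and the ROWS frame, the tied counts for IDs 2 and 3 accumulate separately (3, 5, 7, 8) instead of being treated as peers.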

Map Spark DF to (row_number, column_number, value) format

I have a DataFrame of the following shape:
1 2
5 9
How can I convert it to (row_num, col_num, value) format?
0 0 1
0 1 2
1 0 5
1 1 9
Is there any way to apply some function or any mapper?
Thanks in advance
Check the code below.
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> val colExpr = array(df.columns.zipWithIndex.map(c => struct(lit(c._2).as("col_name"),col(c._1).as("value"))):_*)
colExpr: org.apache.spark.sql.Column = array(named_struct(col_name, 0 AS `col_name`, NamePlaceholder(), a AS `value`), named_struct(col_name, 1 AS `col_name`, NamePlaceholder(), b AS `value`))
scala> df.withColumn("row_number",lit(row_number().over(Window.orderBy(lit(1)))-1)).withColumn("data",explode(colExpr)).select($"row_number",$"data.*").show(false)
+----------+--------+-----+
|row_number|col_name|value|
+----------+--------+-----+
|0 |0 |1 |
|0 |1 |2 |
|1 |0 |5 |
|1 |1 |9 |
+----------+--------+-----+
You can do it by transposing the data as:
from pyspark.sql.functions import *
from pyspark.sql import Window
df = spark.createDataFrame([(1,2),(5,9)],['col1','col2'])
#renaming the columns based on their position
df = df.toDF(*list(map(lambda x: str(x),[*range(len(df.columns))])))
#Transposing the dataframe as required
col_list = ','.join([f'{i},`{i}`'for i in df.columns])
rows = len(df.columns)
df.withColumn('row_id', lit(row_number().over(Window.orderBy(lit(1))) - 1)) \
  .select('row_id', expr(f'''stack({rows},{col_list}) as (col_id,col_value)''')) \
  .show()
+------+------+---------+
|row_id|col_id|col_value|
+------+------+---------+
| 0| 0| 1|
| 0| 1| 2|
| 1| 0| 5|
| 1| 1| 9|
+------+------+---------+
In PySpark, row_number() and posexplode will be helpful. Try this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst= sqlContext.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)],schema=['col1','col2','col3'])
tst1= tst.withColumn("row_number",F.row_number().over(Window.orderBy(F.lit(1)))-1)
#%%
tst_arr = tst1.withColumn("arr",F.array(tst.columns))
tst_new = tst_arr.select('row_number','arr').select('row_number',F.posexplode('arr'))
results:
In [47]: tst_new.show()
+----------+---+---+
|row_number|pos|col|
+----------+---+---+
| 0| 0| 1|
| 0| 1| 7|
| 0| 2| 80|
| 1| 0| 1|
| 1| 1| 8|
| 1| 2| 40|
| 2| 0| 1|
| 2| 1| 5|
| 2| 2|100|
| 3| 0| 5|
| 3| 1| 8|
| 3| 2| 90|
| 4| 0| 7|
| 4| 1| 6|
| 4| 2| 50|
| 5| 0| 0|
| 5| 1| 3|
| 5| 2| 60|
+----------+---+---+
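Window.orderBy(F.lit(1)) with no partitionBy makes Spark move every row into a single partition just to assign the row numbers, which it warns about on larger data. As a hedged alternative sketch, the row index can come from zipWithIndex instead (the index simply follows whatever order the underlying RDD currently holds); this reuses the tst dataframe from above:
from pyspark.sql import functions as F

# Number the rows without a global window, then posexplode the per-row array.
rdd_indexed = tst.rdd.zipWithIndex()                      # -> (Row(col1, col2, col3), index)
tst_idx = rdd_indexed.map(lambda x: (x[1], *x[0])).toDF(['row_number'] + tst.columns)
tst_idx.select('row_number', F.posexplode(F.array(*tst.columns)).alias('pos', 'col')).show()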

How to get the previous row's value into a new column when Flag is 0, and the current row's value when Flag is 1, in a PySpark dataframe

Scenario 1, when Flag is 1: for that row, copy trx_date to destination.
Scenario 2, when Flag is 0: for that row, copy the previous destination value.
Input:
+-----------+----+----------+
|customer_id|Flag| trx_date|
+-----------+----+----------+
| 1| 1| 12/3/2020|
| 1| 0| 12/4/2020|
| 1| 1| 12/5/2020|
| 1| 1| 12/6/2020|
| 1| 0| 12/7/2020|
| 1| 1| 12/8/2020|
| 1| 0| 12/9/2020|
| 1| 0|12/10/2020|
| 1| 0|12/11/2020|
| 1| 1|12/12/2020|
| 2| 1| 12/1/2020|
| 2| 0| 12/2/2020|
| 2| 0| 12/3/2020|
| 2| 1| 12/4/2020|
+-----------+----+----------+
Output :
+-----------+----+----------+-----------+
|customer_id|Flag| trx_date|destination|
+-----------+----+----------+-----------+
| 1| 1| 12/3/2020| 12/3/2020|
| 1| 0| 12/4/2020| 12/3/2020|
| 1| 1| 12/5/2020| 12/5/2020|
| 1| 1| 12/6/2020| 12/6/2020|
| 1| 0| 12/7/2020| 12/6/2020|
| 1| 1| 12/8/2020| 12/8/2020|
| 1| 0| 12/9/2020| 12/8/2020|
| 1| 0|12/10/2020| 12/8/2020|
| 1| 0|12/11/2020| 12/8/2020|
| 1| 1|12/12/2020| 12/12/2020|
| 2| 1| 12/1/2020| 12/1/2020|
| 2| 0| 12/2/2020| 12/1/2020|
| 2| 0| 12/3/2020| 12/1/2020|
| 2| 1| 12/4/2020| 12/4/2020|
+-----------+----+----------+-----------+
Code to generate the Spark dataframe:
df = spark.createDataFrame([(1,1,'12/3/2020'),(1,0,'12/4/2020'),(1,1,'12/5/2020'),
(1,1,'12/6/2020'),(1,0,'12/7/2020'),(1,1,'12/8/2020'),(1,0,'12/9/2020'),(1,0,'12/10/2020'),
(1,0,'12/11/2020'),(1,1,'12/12/2020'),(2,1,'12/1/2020'),(2,0,'12/2/2020'),(2,0,'12/3/2020'),
(2,1,'12/4/2020')], ["customer_id","Flag","trx_date"])
A PySpark way to do this: after getting trx_date into a date type, first take the incremental sum of Flag to create the groupings we need, so that we can use the first function over a window partitioned by those groupings. We can then use date_format to get both columns back to the desired date format. I assumed your format was MM/dd/yyyy; if it is different, change it to dd/MM/yyyy in the code.
df.show() #sample data
#+-----------+----+----------+
#|customer_id|Flag| trx_date|
#+-----------+----+----------+
#| 1| 1| 12/3/2020|
#| 1| 0| 12/4/2020|
#| 1| 1| 12/5/2020|
#| 1| 1| 12/6/2020|
#| 1| 0| 12/7/2020|
#| 1| 1| 12/8/2020|
#| 1| 0| 12/9/2020|
#| 1| 0|12/10/2020|
#| 1| 0|12/11/2020|
#| 1| 1|12/12/2020|
#| 2| 1| 12/1/2020|
#| 2| 0| 12/2/2020|
#| 2| 0| 12/3/2020|
#| 2| 1| 12/4/2020|
#+-----------+----+----------+
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().orderBy("customer_id","trx_date")
w1=Window().partitionBy("Flag2").orderBy("trx_date").rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)
df.withColumn("trx_date", F.to_date("trx_date", "MM/dd/yyyy"))\
.withColumn("Flag2", F.sum("Flag").over(w))\
.withColumn("destination", F.when(F.col("Flag")==0, F.first("trx_date").over(w1)).otherwise(F.col("trx_date")))\
.withColumn("trx_date", F.date_format("trx_date","MM/dd/yyyy"))\
.withColumn("destination", F.date_format("destination", "MM/dd/yyyy"))\
.orderBy("customer_id","trx_date").drop("Flag2").show()
#+-----------+----+----------+-----------+
#|customer_id|Flag| trx_date|destination|
#+-----------+----+----------+-----------+
#| 1| 1|12/03/2020| 12/03/2020|
#| 1| 0|12/04/2020| 12/03/2020|
#| 1| 1|12/05/2020| 12/05/2020|
#| 1| 1|12/06/2020| 12/06/2020|
#| 1| 0|12/07/2020| 12/06/2020|
#| 1| 1|12/08/2020| 12/08/2020|
#| 1| 0|12/09/2020| 12/08/2020|
#| 1| 0|12/10/2020| 12/08/2020|
#| 1| 0|12/11/2020| 12/08/2020|
#| 1| 1|12/12/2020| 12/12/2020|
#| 2| 1|12/01/2020| 12/01/2020|
#| 2| 0|12/02/2020| 12/01/2020|
#| 2| 0|12/03/2020| 12/01/2020|
#| 2| 1|12/04/2020| 12/04/2020|
#+-----------+----+----------+-----------+
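Note that the first window above (w) has no partitionBy, so Spark moves all rows into a single partition to compute the running sum of Flag. A variant sketch that keeps the work partitioned per customer, with the same logic otherwise (MM/dd/yyyy assumed, as above):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("customer_id").orderBy("trx_date")
w1 = Window.partitionBy("customer_id", "Flag2").orderBy("trx_date") \
           .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

(df.withColumn("trx_date", F.to_date("trx_date", "MM/dd/yyyy"))
   .withColumn("Flag2", F.sum("Flag").over(w))
   .withColumn("destination", F.when(F.col("Flag") == 0, F.first("trx_date").over(w1))
                               .otherwise(F.col("trx_date")))
   .withColumn("trx_date", F.date_format("trx_date", "MM/dd/yyyy"))
   .withColumn("destination", F.date_format("destination", "MM/dd/yyyy"))
   .orderBy("customer_id", "trx_date").drop("Flag2").show())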
You can use window functions. I am unsure whether Spark SQL supports the standard IGNORE NULLS option to lag().
If it does, you can just do:
select t.*,
       case when flag = 1
            then trx_date
            else lag(case when flag = 1 then trx_date end ignore nulls)
                 over (partition by customer_id order by trx_date)
       end as destination
from mytable t
Else, you can build groups with a window sum first:
select customer_id,
       flag,
       trx_date,
       case when flag = 1
            then trx_date
            else min(trx_date) over (partition by customer_id, grp order by trx_date)
       end as destination
from (
    select t.*, sum(flag) over (partition by customer_id order by trx_date) as grp
    from mytable t
) t
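A hedged sketch that adapts this second query to the sample dataframe from the question: it adds a temp view plus a to_date() helper column (trx_dt) so the window ordering and the min() are chronological rather than string-based, and formats the result back with date_format (MM/dd/yyyy assumed, as in the other answer).
from pyspark.sql import functions as F

df.withColumn('trx_dt', F.to_date('trx_date', 'MM/dd/yyyy')).createOrReplaceTempView('mytable')

spark.sql('''
    select customer_id, Flag, trx_date,
           date_format(case when Flag = 1 then trx_dt
                            else min(trx_dt) over (partition by customer_id, grp)
                       end, 'MM/dd/yyyy') as destination
    from (
        select t.*, sum(Flag) over (partition by customer_id order by trx_dt) as grp
        from mytable t
    ) t
    order by customer_id, trx_dt
''').show()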
You can achieve this in the following way if you are considering the dataframe API:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Convert the date format while creating the window itself
window = Window().orderBy("customer_id", f.to_date('trx_date', 'MM/dd/yyyy'))
df1 = df.withColumn('destination', f.when(f.col('Flag') == 1, f.col('trx_date'))). \
    withColumn('destination', f.last(f.col('destination'), ignorenulls=True).over(window))
df1.show()
+-----------+----+----------+-----------+
|customer_id|Flag| trx_date|destination|
+-----------+----+----------+-----------+
| 1| 1| 12/3/2020| 12/3/2020|
| 1| 0| 12/4/2020| 12/3/2020|
| 1| 1| 12/5/2020| 12/5/2020|
| 1| 1| 12/6/2020| 12/6/2020|
| 1| 0| 12/7/2020| 12/6/2020|
| 1| 1| 12/8/2020| 12/8/2020|
| 1| 0| 12/9/2020| 12/8/2020|
| 1| 0|12/10/2020| 12/8/2020|
| 1| 0|12/11/2020| 12/8/2020|
| 1| 1|12/12/2020| 12/12/2020|
| 2| 1| 12/1/2020| 12/1/2020|
| 2| 0| 12/2/2020| 12/1/2020|
| 2| 0| 12/3/2020| 12/1/2020|
| 2| 1| 12/4/2020| 12/4/2020|
+-----------+----+----------+-----------+
Hope it helps.

PySpark: How to find and convert the top 5 values in each row to 1 and all the rest to 0?

I have a dataframe and I need to find the maximum 5 values in each row, convert only those values to 1, and convert all the rest to 0 while maintaining the dataframe structure, i.e. the column names should remain the same.
I tried using toLocalIterator, converting each row to a list, and then converting the top 5 values in each list to 1.
But it gives me a java.lang.OutOfMemoryError when I run the code on a large dataset.
While looking at the logs I found that a task of very large size (around 25000 KB) is submitted, while the max recommended size is 100 KB.
Is there a better way to find and convert the top 5 values to a certain value (1 in this case) and all the rest to 0 that would use less memory?
EDIT 1:
For example, if I have these 10 columns and 5 rows as the input
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+----+----+----+----+----+----+----+----+----+----+
|0.74| 0.9|0.52|0.85|0.18|0.23| 0.3| 0.0| 0.1|0.07|
|0.11|0.57|0.81|0.81|0.45|0.48|0.86|0.38|0.41|0.45|
|0.03|0.84|0.17|0.96|0.09|0.73|0.25|0.05|0.57|0.66|
| 0.8|0.94|0.06|0.44| 0.2|0.89| 0.9| 1.0|0.48|0.14|
|0.73|0.86|0.68| 1.0|0.78|0.17|0.11|0.19|0.18|0.83|
+----+----+----+----+----+----+----+----+----+----+
this is what i want as the output
+---+---+---+---+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10|
+---+---+---+---+---+---+---+---+---+---+
| 1| 1| 1| 1| 0| 0| 1| 0| 0| 0|
| 0| 1| 1| 1| 0| 1| 1| 0| 0| 0|
| 0| 1| 0| 1| 0| 1| 0| 0| 1| 1|
| 1| 1| 0| 0| 0| 1| 1| 1| 0| 0|
| 1| 1| 0| 1| 1| 0| 0| 0| 0| 1|
+---+---+---+---+---+---+---+---+---+---+
As you can see, I want to find the top (max) 5 values in each row, convert them to 1 and the rest of the values to 0, while maintaining the structure, i.e. the rows and columns.
This is what I am using (which gives me the OutOfMemoryError):
rowPredList = []
for row in prob_df.rdd.toLocalIterator():
    rowPredDict = {}
    for cat in categories:
        rowPredDict[cat] = row[cat]
    sorted_row = sorted(rowPredDict.items(), key=lambda kv: kv[1], reverse=True)
    #print(rowPredDict)
    rowPredDict = rowPredDict.fromkeys(rowPredDict, 0)
    rowPredDict[sorted_row[0:5][0][0]] = 1
    rowPredDict[sorted_row[0:5][1][0]] = 1
    rowPredDict[sorted_row[0:5][2][0]] = 1
    rowPredDict[sorted_row[0:5][3][0]] = 1
    rowPredDict[sorted_row[0:5][4][0]] = 1
    #print(count, sorted_row[0:2][0][0], ",", sorted_row[0:2][1][0])
    rowPredList.append(rowPredDict)
    #count = count + 1
I don't have enough data volume for performance testing, but could you try the approach below using Spark's array function APIs?
1. Prepare the dataset:
import pyspark.sql.functions as f
l1 = [(0.74,0.9,0.52,0.85,0.18,0.23,0.3,0.0,0.1,0.07),
(0.11,0.57,0.81,0.81,0.45,0.48,0.86,0.38,0.41,0.45),
(0.03,0.84,0.17,0.96,0.09,0.73,0.25,0.05,0.57,0.66),
(0.8,0.94,0.06,0.44,0.2,0.89,0.9,1.0,0.48,0.14),
(0.73,0.86,0.68,1.0,0.78,0.17,0.11,0.19,0.18,0.83)]
df = spark.createDataFrame(l1).toDF('col_1','col_2','col_3','col_4','col_5','col_6','col_7','col_8','col_9','col_10')
df.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 0.74| 0.9| 0.52| 0.85| 0.18| 0.23| 0.3| 0.0| 0.1| 0.07|
| 0.11| 0.57| 0.81| 0.81| 0.45| 0.48| 0.86| 0.38| 0.41| 0.45|
| 0.03| 0.84| 0.17| 0.96| 0.09| 0.73| 0.25| 0.05| 0.57| 0.66|
| 0.8| 0.94| 0.06| 0.44| 0.2| 0.89| 0.9| 1.0| 0.48| 0.14|
| 0.73| 0.86| 0.68| 1.0| 0.78| 0.17| 0.11| 0.19| 0.18| 0.83|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
2. Get the top 5 values for each row
Follow these steps on df:
Create an array from the columns and sort its elements
Get the first 5 elements into a new column called all
UDF to get the max 5 elements from the sorted array:
Note: Spark >= 2.4.0 has a slice function which can do a similar task. I am currently on 2.2, so I am creating a UDF, but if you have 2.4 or higher you can give slice a try (see the sketch after this answer's output).
import pyspark.sql.types as t

def get_n_elements_(arr, n):
    return arr[:n]

get_n_elements = f.udf(get_n_elements_, t.ArrayType(t.DoubleType()))
df_all = df.withColumn('all', get_n_elements(f.sort_array(f.array(df.columns), False), f.lit(5)))
df_all.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|all |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
|0.74 |0.9 |0.52 |0.85 |0.18 |0.23 |0.3 |0.0 |0.1 |0.07 |[0.9, 0.85, 0.74, 0.52, 0.3] |
|0.11 |0.57 |0.81 |0.81 |0.45 |0.48 |0.86 |0.38 |0.41 |0.45 |[0.86, 0.81, 0.81, 0.57, 0.48]|
|0.03 |0.84 |0.17 |0.96 |0.09 |0.73 |0.25 |0.05 |0.57 |0.66 |[0.96, 0.84, 0.73, 0.66, 0.57]|
|0.8 |0.94 |0.06 |0.44 |0.2 |0.89 |0.9 |1.0 |0.48 |0.14 |[1.0, 0.94, 0.9, 0.89, 0.8] |
|0.73 |0.86 |0.68 |1.0 |0.78 |0.17 |0.11 |0.19 |0.18 |0.83 |[1.0, 0.86, 0.83, 0.78, 0.73] |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+------------------------------+
3. Create dynamic SQL and execute it with selectExpr
sql_stmt = ''' case when array_contains(all, {0}) then 1 else 0 end AS `{0}` '''
df_all.selectExpr(*[sql_stmt.format(c) for c in df.columns]).show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|col_10|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
| 1| 1| 1| 1| 0| 0| 1| 0| 0| 0|
| 0| 1| 1| 1| 0| 1| 1| 0| 0| 0|
| 0| 1| 0| 1| 0| 1| 0| 0| 1| 1|
| 1| 1| 0| 0| 0| 1| 1| 1| 0| 0|
| 1| 1| 0| 1| 1| 0| 0| 0| 0| 1|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+
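For Spark 2.4+, the same per-row top-5 array can be built without a UDF by combining sort_array with slice, as mentioned in the note above. A sketch on the same df; top5 is just a scratch column name used here:
import pyspark.sql.functions as f

# Sort each row's values descending and keep the first 5, then flag membership per column.
df_top = df.withColumn('top5', f.slice(f.sort_array(f.array(*df.columns), asc=False), 1, 5))
df_top.selectExpr(
    *[f"case when array_contains(top5, `{c}`) then 1 else 0 end as `{c}`" for c in df.columns]
).show()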
You can do that easily like this. For example, say we want to do that task for the value column: first sort by the value column, take the 5th value, and change the values using a when condition.
df2 = sc.parallelize([("fo", 100,20),("rogerg", 110,56),("franre", 1080,297),("f11", 10100,217),("franci", 10,227),("fran", 1002,5),("fran231cis", 10007,271),("franc3is", 1030,2)]).toDF(["name", "salary","value"])
df2 = df2.orderBy("value",ascending=False)
+----------+------+-----+
| name|salary|value|
+----------+------+-----+
| franre| 1080| 297|
|fran231cis| 10007| 271|
| franci| 10| 227|
| f11| 10100| 217|
| rogerg| 110| 56|
| fo| 100| 20|
| fran| 1002| 5|
| franc3is| 1030| 2|
+----------+------+-----+
from pyspark.sql.functions import when
maxx = df2.take(5)[4]["value"]
dff = df2.select(when(df2['value'] >= maxx, 1).otherwise(0).alias("value"), "name", "salary")
dff.show()
+---+----------+------+
|value| name|salary|
+---+----------+------+
| 1| franre| 1080|
| 1|fran231cis| 10007|
| 1| franci| 10|
| 1| f11| 10100|
| 1| rogerg| 110|
| 0| fo| 100|
| 0| fran| 1002|
| 0| franc3is| 1030|
+---+----------+------+

Counting number of nulls in pyspark dataframe by row

So I want to count the number of nulls in a dataframe by row.
Please note there are 50+ columns. I know I could do a case/when statement to do this, but I would prefer a neater solution.
For example, a subset:
columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, 2, 0, None),(2, None, 1, None),(3,None,9, 1)]
df=spark.createDataFrame(vals,columns)
df.show()
+---+-----+-----+-----+
| id|item1|item2|item3|
+---+-----+-----+-----+
| 1| 2| 'A'| null|
| 2| null| 1| null|
| 3| null| 9| 'C'|
+---+-----+-----+-----+
After running the code, the desired output is:
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
| 1| 2| 'A'| null| 1|
| 2| null| 1| null| 2|
| 3| null| 9| 'C'| 1|
+---+-----+-----+-----+--------+
EDIT: Not all non-null values are ints.
Convert null to 1 and everything else to 0, then sum across all the columns (the sum here is Python's built-in sum, which folds the per-column expressions into a single Column):
df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()
+---+-----+-----+-----+--------+
| id|item1|item2|item3|numNulls|
+---+-----+-----+-----+--------+
| 1| 2| 0| null| 1|
| 2| null| 1| null| 2|
| 3| null| 9| 1| 1|
+---+-----+-----+-----+--------+
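Since id can never be null here, summing over every column works; a slightly more explicit variant sketch restricts the count to the item columns only:
from pyspark.sql import functions as F

# Count nulls across the item columns, leaving the id key out of the sum.
item_cols = [c for c in df.columns if c != 'id']
df.withColumn('numNulls', sum(F.col(c).isNull().cast('int') for c in item_cols)).show()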