I have a table like
CId| RId| No
1| 10| 100
1| 20| 20
1| 30| 10
2| 10| 200
2| 30| 20
3| 40| 25
Here, RId encodes the category: "NoToAttend" (10), "NoNotToAttend" (20), "NoWait" (30), "Backup" (40), etc.
I need to have a result table that will look like
CId| "NoToAttend"| "NoNotToAttend"| "NoWait"| "Backup"
1| 100| 20| 10| null
2| 200| null| 20| null
3| null| null| null| 25
I am not sure how to use PIVOT and need help with this.

You can use the PIVOT function and just alias your columns:
SELECT pvt.CId,
    [NoToAttend] = pvt.[10],
    [NoNotToAttend] = pvt.[20],
    [NoWait] = pvt.[30],
    [Backup] = pvt.[40]
FROM YourTable -- placeholder: the question does not name the table
PIVOT
( SUM([No])
  FOR RId IN ([10], [20], [30], [40])
) pvt;
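For engines without a PIVOT operator, the same reshaping can be written as conditional aggregation. The sketch below is only an illustration of that alternative, not the T-SQL answer itself: it uses Python's built-in sqlite3 module, and the table name attendance is made up because the question does not give one.

```python
import sqlite3

# In-memory table holding the sample rows from the question.
# "attendance" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attendance (CId INTEGER, RId INTEGER, [No] INTEGER)")
conn.executemany(
    "INSERT INTO attendance VALUES (?, ?, ?)",
    [(1, 10, 100), (1, 20, 20), (1, 30, 10), (2, 10, 200), (2, 30, 20), (3, 40, 25)],
)

# One SUM(CASE ...) per output column. A CId with no row for a given RId
# gets NULL, because SUM over only NULL inputs is NULL.
query = """
SELECT CId,
       SUM(CASE WHEN RId = 10 THEN [No] END) AS NoToAttend,
       SUM(CASE WHEN RId = 20 THEN [No] END) AS NoNotToAttend,
       SUM(CASE WHEN RId = 30 THEN [No] END) AS NoWait,
       SUM(CASE WHEN RId = 40 THEN [No] END) AS Backup
FROM attendance
GROUP BY CId
ORDER BY CId
"""
pivot_rows = conn.execute(query).fetchall()
for row in pivot_rows:
    print(row)
```

Each printed tuple is one CId followed by the four category totals, with None where the category has no row.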


Pyspark sum of columns after union of dataframe

How can I sum all columns after unioning two DataFrames?
I have this first df with one row per user:
df = sqlContext.createDataFrame([("2022-01-10", 3, 2,"a"),("2022-01-10",3,4,"b"),("2022-01-10", 1,3,"c")], ["date", "value1", "value2", "userid"])
| date|value1|value2|userid|
|2022-01-10| 3| 2| a|
|2022-01-10| 3| 4| b|
|2022-01-10| 1| 3| c|
The date value will always be today's date.
And I have another df, this time with multiple rows per userid, one for each day:
df2 = sqlContext.createDataFrame([("2022-01-01", 13, 12,"a"),("2022-01-02",13,14,"b"),("2022-01-03", 11,13,"c"),
("2022-01-04", 3, 2,"a"),("2022-01-05",3,4,"b"),("2022-01-06", 1,3,"c"),
("2022-01-10", 31, 21,"a"),("2022-01-07",31,41,"b"),("2022-01-09", 11,31,"c")], ["date", "value3", "value4", "userid"])
| date|value3|value4|userid|
|2022-01-01| 13| 12| a|
|2022-01-02| 13| 14| b|
|2022-01-03| 11| 13| c|
|2022-01-04| 3| 2| a|
|2022-01-05| 3| 4| b|
|2022-01-06| 1| 3| c|
|2022-01-10| 31| 21| a|
|2022-01-07| 31| 41| b|
|2022-01-09| 11| 31| c|
After unioning the two of them with this function, here is what I get:
import pyspark.sql.functions as f

def union_different_tables(df1, df2):
    columns_df1 = df1.columns
    columns_df2 = df2.columns
    data_types_df1 = [i.dataType for i in df1.schema.fields]
    data_types_df2 = [i.dataType for i in df2.schema.fields]
    # Add every column that exists only in df1 to df2 as typed nulls
    for col, _type in zip(columns_df1, data_types_df1):
        if col not in df2.columns:
            df2 = df2.withColumn(col, f.lit(None).cast(_type))
    # And the other way round
    for col, _type in zip(columns_df2, data_types_df2):
        if col not in df1.columns:
            df1 = df1.withColumn(col, f.lit(None).cast(_type))
    union = df1.unionByName(df2)
    return union
| date|value1|value2|userid|value3|value4|
|2022-01-10| 3| 2| a| null| null|
|2022-01-10| 3| 4| b| null| null|
|2022-01-10| 1| 3| c| null| null|
|2022-01-01| null| null| a| 13| 12|
|2022-01-02| null| null| b| 13| 14|
|2022-01-03| null| null| c| 11| 13|
|2022-01-04| null| null| a| 3| 2|
|2022-01-05| null| null| b| 3| 4|
|2022-01-06| null| null| c| 1| 3|
|2022-01-10| null| null| a| 31| 21|
|2022-01-07| null| null| b| 31| 41|
|2022-01-09| null| null| c| 11| 31|
What I want is the sum of each df2 column (there are 10 of them in the real case) up to today's date for each userid, so one row per user:
| date|value1|value2|userid|value3|value4|
|2022-01-10| 3| 2| a| 47 | 35 |
|2022-01-10| 3| 4| b| 47 | 59 |
|2022-01-10| 1| 3| c| 23 | 47 |
Since I have to do this operation for multiple tables, here is what I tried:
from pyspark.sql import Window

user_window = Window.partitionBy(['userid']).orderBy('date')
list_tables = [df2]
list_col_original = df.columns
for table in list_tables:
    df = union_different_tables(df, table)
    list_column = list(set(table.columns) - set(list_col_original))
    df = df.select('userid',
                   *[f.sum(f.col(col_name)).over(user_window).alias(col_name) for col_name in list_column])
| c| 13| 11|
| c| 16| 12|
| c| 47| 23|
| c| 47| 23|
| b| 14| 13|
| b| 18| 16|
| b| 59| 47|
| b| 59| 47|
| a| 12| 13|
| a| 14| 16|
| a| 35| 47|
| a| 35| 47|
But that gives me a sort of cumulative sum, and I didn't find a way to keep all the other columns in the resulting df.
The only constraint is that I can't use any join: my DataFrames are very large and any join takes too long to compute.
Do you know how I can fix my code to get the result I want?
After the union of df1 and df2, you can group by userid and sum all columns except date, for which you take the max.
Note that for the union part, you can use DataFrame.unionByName with allowMissingColumns=True (available since Spark 3.1) when the data types match and only the set of columns differs:
df = df1.unionByName(df2, allowMissingColumns=True)
Then group by and agg:
import pyspark.sql.functions as F
result = df.groupBy("userid").agg(
    F.max("date").alias("date"),  # keep the latest date per user
    *[F.sum(c).alias(c) for c in df.columns if c not in ("date", "userid")]
)
result.show()
#|userid| date|value1|value2|value3|value4|
#| a|2022-01-10| 3| 2| 47| 35|
#| b|2022-01-10| 3| 4| 47| 59|
#| c|2022-01-10| 1| 3| 23| 47|
This assumes the second DataFrame contains no dates later than today's date in the first one. Otherwise, you'll need to filter df2 before the union.
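To make the aggregation concrete without a Spark session, here is a plain-Python sketch of the same idea: per userid, take the maximum date and a null-skipping sum of every value column. It only emulates what the group-by does on the sample rows; the function name aggregate_per_user is made up for this illustration.

```python
from collections import defaultdict

# Rows as they look after the union; None stands for Spark's null.
# Column order: date, value1, value2, userid, value3, value4
union_rows = [
    ("2022-01-10", 3, 2, "a", None, None),
    ("2022-01-10", 3, 4, "b", None, None),
    ("2022-01-10", 1, 3, "c", None, None),
    ("2022-01-01", None, None, "a", 13, 12),
    ("2022-01-02", None, None, "b", 13, 14),
    ("2022-01-03", None, None, "c", 11, 13),
    ("2022-01-04", None, None, "a", 3, 2),
    ("2022-01-05", None, None, "b", 3, 4),
    ("2022-01-06", None, None, "c", 1, 3),
    ("2022-01-10", None, None, "a", 31, 21),
    ("2022-01-07", None, None, "b", 31, 41),
    ("2022-01-09", None, None, "c", 11, 31),
]
value_names = ["value1", "value2", "value3", "value4"]


def aggregate_per_user(rows):
    """Per userid: max of date, and a sum of each value column that skips None."""
    dates = {}
    sums = defaultdict(lambda: dict.fromkeys(value_names))  # None until a value is seen
    for date, v1, v2, userid, v3, v4 in rows:
        # ISO-formatted date strings compare correctly as plain strings
        dates[userid] = max(dates.get(userid, date), date)
        for name, value in zip(value_names, (v1, v2, v3, v4)):
            if value is not None:
                current = sums[userid][name]
                sums[userid][name] = value if current is None else current + value
    return {user: {"date": dates[user], **sums[user]} for user in sorted(dates)}


per_user = aggregate_per_user(union_rows)
for user, row in per_user.items():
    print(user, row)
```

Like the Spark aggregation, this is a single pass over the unioned rows and needs no join.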

Pyspark Dataframe Merge Rows by eliminating null values

I have a PySpark DataFrame like this one:
| 3| 1| null| 124,21| null| null|
| 5| 2| null| 124,23| null| null|
| 5| 2| null| 124,26| null| null|
| 6| 4| null| 124,24| null| null|
| 3| 1| null| null| 6764| null|
| 5| 2| null| null| 6772| null|
| 5| 2| null| null| 6782| null|
| 6| 4| null| null| 6932| null|
| 3| 1| null| null| null| 1|
| 5| 2| null| null| null| 1|
| 5| 2| null| null| null| 1|
| 6| 4| null| null| null| 1|
| 3| 1| 17:18:04| null| null| null|
| 5| 2| 18:22:40| null| null| null|
| 5| 2| 18:25:29| null| null| null|
| 6| 4| 18:32:18| null| null| null|
and I want to merge its rows so that it looks like this (for example):
| 3| 1| 17:18:04| 124,21| 6764| 1|
| 5| 2| 18:22:40| 124,23| 6772| 1|
| 5| 2| 18:25:29| 124,26| 6782| 1|
| 6| 4| 18:32:18| 124,24| 6932| 1|
I tried to use:
df = df.groupBy('id').agg(*[f.first(x,ignorenulls=True) for x in df.columns])
However, this gives me only the first value of each column, and I need all the records: a single ID has several registered timestamps and several registered values, which I am now losing.
Thanks for the advice.
I'm not sure if this is what you wanted, but essentially you can do a collect_list for each id and column, and explode all resulting lists. In this way, you can have multiple entries per id.
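from functools import reduce
import pyspark.sql.functions as F
# Collect each value column into a list per key, then explode the lists one column at a time
df2 = reduce(
    lambda x, y: x.withColumn(y, F.explode_outer(y)),
    df.columns[2:],
    df.groupBy('id_product', 'value').agg(*[F.collect_list(c).alias(c) for c in df.columns[2:]])
)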
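To see what collecting the non-null values per key and then exploding each list in turn produces, here is a small plain-Python emulation on some of the question's rows. It is a sketch of the behaviour, not Spark code: collect_then_explode is a made-up helper, and the four value columns are left unnamed because the question does not show their headers. Exploding the lists one after another yields every combination of their elements, which the emulation expresses with itertools.product.

```python
from itertools import product

# (key1, key2, then four value columns); None stands for null.
rows = [
    (3, 1, None, "124,21", None, None),
    (5, 2, None, "124,23", None, None),
    (5, 2, None, "124,26", None, None),
    (3, 1, None, None, 6764, None),
    (5, 2, None, None, 6772, None),
    (5, 2, None, None, 6782, None),
    (3, 1, None, None, None, 1),
    (5, 2, None, None, None, 1),
    (5, 2, None, None, None, 1),
    (3, 1, "17:18:04", None, None, None),
    (5, 2, "18:22:40", None, None, None),
    (5, 2, "18:25:29", None, None, None),
]


def collect_then_explode(rows, n_keys=2):
    # Step 1: per key, collect the non-null values of each value column
    groups = {}
    for row in rows:
        key, values = row[:n_keys], row[n_keys:]
        lists = groups.setdefault(key, [[] for _ in values])
        for lst, value in zip(lists, values):
            if value is not None:
                lst.append(value)
    # Step 2: explode every list; an empty list becomes a single None,
    # which mirrors the "outer" variant of explode
    out = []
    for key, lists in groups.items():
        for combo in product(*[lst or [None] for lst in lists]):
            out.append(key + combo)
    return out


exploded = collect_then_explode(rows)
print(len(exploded))
```

A key with one value per column comes back as a single merged row, while a key with two values in each of the four columns expands to 16 rows, so a pairing or de-duplication step may still be needed when a key has several records.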

SQL or Pyspark - Get the last time a column had a different value for each ID

I am using pyspark so I have tried both pyspark code and SQL.
I am trying to get the last time at which the ADDRESS column had a different value, per USER_ID. The rows are ordered by TIME. Take the table below:
| ID|USER_ID|ADDRESS|TIME|
| 1| 1| A| 10|
| 2| 1| B| 15|
| 3| 1| A| 20|
| 4| 1| A| 40|
| 5| 1| A| 45|
The correct new column I would like is as below:
| 1| 1| A| 10| null|
| 2| 1| B| 15| 10|
| 3| 1| A| 20| 15|
| 4| 1| A| 40| 15|
| 5| 1| A| 45| 15|
I have tried using different windows but none ever seem to get exactly what I want. Any ideas?
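A simplified version of @jxc's answer.
from pyspark.sql.functions import col, lag, when, sum as fsum, min as fmin
from pyspark.sql import Window
# Window definition
w = Window.partitionBy(col('user_id')).orderBy(col('id'))
# Get the previous time and classify rows into groups
grp_df = df.withColumn('grp', fsum(when(lag(col('address')).over(w) == col('address'), 0).otherwise(1)).over(w)) \
           .withColumn('prev_time', lag(col('time')).over(w))
# Window definition with groups
w_grp = Window.partitionBy(col('user_id'), col('grp')).orderBy(col('id'))
# Running minimum of the previous time within each group
# ('prev_time' and 'last_change_time' are illustrative column names)
grp_df.withColumn('last_change_time', fmin(col('prev_time')).over(w_grp)).show()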
Use lag with running sum to assign groups when there is a change in the column value (based on the defined window). Get the time from the previous row, which will be used in the next step.
Once you have the groups, use the running minimum to get the last timestamp of the column value change. (I suggest looking at the intermediate results to understand the transformations better.)
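The lag, running-sum and running-minimum steps described above can be traced in plain Python on the sample rows. This is only a sketch of the logic, assuming the rows are already sorted by id within each user; last_change_times is a made-up name and nothing here is Spark API.

```python
# (ID, USER_ID, ADDRESS, TIME), sorted by ID within each user
rows = [
    (1, 1, "A", 10),
    (2, 1, "B", 15),
    (3, 1, "A", 20),
    (4, 1, "A", 40),
    (5, 1, "A", 45),
]


def last_change_times(rows):
    """For each row, return the time of the row just before the current run of
    identical addresses started (None for a user's first run)."""
    prev = {}      # user_id -> (address, time) of that user's previous row
    run_min = {}   # user_id -> running minimum of prev_time in the current group
    out = []
    for _id, user, address, time in rows:
        prev_address, prev_time = prev.get(user, (None, None))
        if prev_address != address:
            # The address changed (or this is the first row): a new group
            # starts, so the running minimum restarts at this row's prev_time.
            run_min[user] = prev_time
        elif prev_time is not None:
            current = run_min.get(user)
            run_min[user] = prev_time if current is None else min(current, prev_time)
        out.append(run_min.get(user))
        prev[user] = (address, time)
    return out


change_times = last_change_times(rows)
print(change_times)  # [None, 10, 15, 15, 15]
```

Within a group the previous-row times only grow, so their running minimum stays at the time of the row just before the group began, which is the value the question asks for.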
One way using two Window specs:
from pyspark.sql.functions import when, col, lag, sum as fsum
from pyspark.sql import Window
w1 = Window.partitionBy('USER_ID').orderBy('ID')
w2 = Window.partitionBy('USER_ID').orderBy('g')
# create a new sub-group label based on the values of ADDRESS and Previous ADDRESS
df1 = df.withColumn('g', fsum(when(col('ADDRESS') == lag('ADDRESS').over(w1), 0).otherwise(1)).over(w1))
# group by USER_ID and the above sub-group label and calculate the sum of time in the group as diff
# calculate the last_diff and then join the data back to the df1
df2 = df1.groupby('USER_ID', 'g').agg(fsum('Time').alias('diff')).withColumn('last_diff', lag('diff').over(w2))
df1.join(df2, on=['USER_ID', 'g']).show()
|USER_ID| g| ID|ADDRESS|TIME|diff|last_diff|
| 1| 1| 1| A| 10| 10| null|
| 1| 2| 2| B| 15| 15| 10|
| 1| 3| 3| A| 20| 105| 15|
| 1| 3| 4| A| 40| 105| 15|
| 1| 3| 5| A| 45| 105| 15|
df_new = df1.join(df2, on=['USER_ID', 'g']).drop('g', 'diff')

Apache spark window, chose previous last item based on some condition

My input data has the columns id, pid, pname and ppid: id (which can be thought of as time), pid (process id), pname (process name) and ppid (the id of the parent process that created pid).
| id|pid|pname|ppid|
| 1| 1| 5| -1|
| 2| 1| 7| -1|
| 3| 2| 9| 1|
| 4| 2| 11| 1|
| 5| 3| 5| 1|
| 6| 4| 7| 2|
| 7| 1| 9| 3|
Now I need to find ppname (the parent process name): the pname of the last preceding row whose pid equals the current row's ppid.
Expected result for the example above:
| id|pid|pname|ppid|ppname|
| 1| 1| 5| -1| -1|
| 2| 1| 7| -1| -1| no item found above with pid=-1
| 3| 2| 9| 1| 7| last pid = 1(ppid) above, pname=7
| 4| 2| 11| 1| 7|
| 5| 3| 5| 1| 7|
| 6| 4| 7| 2| 11| last pid = 2(ppid) above, pname=11
| 7| 1| 9| 3| 5| last pid = 3(ppid) above, pname=5
I could self-join on pid == ppid, take the difference between the ids, and pick the row with the smallest positive difference, then join back again to cover the cases where no positive difference exists (the -1 case).
But I think this is almost a cross join, which I may not be able to afford since I have 100M rows.
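The lookup rule itself can be stated without any join: walk the rows in id order and remember the latest pname seen for each pid. The sketch below shows that rule in plain Python on the sample data. It is a single-machine illustration only (add_parent_names is a made-up helper) and says nothing about how to distribute the work over 100M rows in Spark.

```python
# (id, pid, pname, ppid), sorted by id
rows = [
    (1, 1, 5, -1),
    (2, 1, 7, -1),
    (3, 2, 9, 1),
    (4, 2, 11, 1),
    (5, 3, 5, 1),
    (6, 4, 7, 2),
    (7, 1, 9, 3),
]


def add_parent_names(rows, missing=-1):
    """Append ppname: the pname of the latest earlier row whose pid == ppid."""
    last_pname = {}  # pid -> most recent pname seen so far
    out = []
    for _id, pid, pname, ppid in rows:
        # Look up before updating, so only strictly earlier rows count
        out.append((_id, pid, pname, ppid, last_pname.get(ppid, missing)))
        last_pname[pid] = pname
    return out


with_parent = add_parent_names(rows)
for row in with_parent:
    print(row)
```

On the sample rows the appended ppname values are -1, -1, 7, 7, 7, 11, 5, matching the expected table.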

Counting number of nulls in pyspark dataframe by row

I want to count the number of nulls in each row of a DataFrame.
Please note that there are 50+ columns. I know I could write a case/when statement for this, but I would prefer a neater solution.
For example, a subset:
columns = ['id', 'item1', 'item2', 'item3']
vals = [(1, 2, 0, None),(2, None, 1, None),(3,None,9, 1)]
| id|item1|item2|item3|
| 1| 2| 'A'| null|
| 2| null| 1| null|
| 3| null| 9| 'C'|
After running the code, the desired output is:
| id|item1|item2|item3|numNulls|
| 1| 2| 'A'| null| 1|
| 2| null| 1| null| 2|
| 3| null| 9| 'C'| 1|
EDIT: Not all non null values are ints.
Convert each null to 1 and everything else to 0, then add the columns up. Note that this relies on Python's built-in sum, not pyspark.sql.functions.sum:
df.withColumn('numNulls', sum(df[col].isNull().cast('int') for col in df.columns)).show()
| id|item1|item2|item3|numNulls|
| 1| 2| 0| null| 1|
| 2| null| 1| null| 2|
| 3| null| 9| 1| 1|
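The same null-to-1 trick can be checked in plain Python on the sample values from the question, which may help when reasoning about the expected counts before running it on Spark. This is an emulation on tuples, not DataFrame code.

```python
columns = ["id", "item1", "item2", "item3"]
vals = [(1, 2, 0, None), (2, None, 1, None), (3, None, 9, 1)]

# For each row, turn every value into 1 if it is None and 0 otherwise, then add up
num_nulls = [sum(int(value is None) for value in row) for row in vals]
print(num_nulls)  # [1, 2, 1]
```

Because the test is only "is this value null", it works the same way for strings, numbers, or any other column type.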