Get first value from consecutive rows PySpark - dataframe

I am using PySpark and I want to keep only the first status (ordered by date) when the same status appears in consecutive rows; a status can occur more than once overall, but it should not appear more than once in a row. Here is an example of what I have:
status | created_when | GP
A      | 2022-10-10   | A1
B      | 2022-10-12   | A1
B      | 2022-10-13   | A1
C      | 2022-10-13   | A1
C      | 2022-10-14   | A1
B      | 2022-12-15   | A1
C      | 2022-12-16   | A1
D      | 2022-12-17   | A1
A      | 2022-12-18   | A1
This is what I need:
status | created_when | GP
A      | 2022-10-10   | A1
B      | 2022-10-12   | A1
C      | 2022-10-13   | A1
B      | 2022-12-15   | A1
C      | 2022-12-16   | A1
D      | 2022-12-17   | A1
A      | 2022-12-18   | A1
I think I need something like the following, but I don't know how to implement it:
when(row[status] == row[status] + 1) then row[status]
Thank you for your help.

You can use a window function ordered by date (that's why the order of your dataframe matters) and check whether each row has the same status as the previous one: you can do this with the lag function.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.orderBy('date')
df = (df
      .withColumn('status_lag', F.lag('status').over(w))
      .filter((F.col('status_lag') != F.col('status')) | (F.col('status_lag').isNull()))
      .drop('status_lag')
)
+------+----------+
|status| date|
+------+----------+
| A|2022-10-10|
| B|2022-10-12|
| C|2022-10-13|
| B|2022-12-15|
| C|2022-12-16|
| D|2022-12-17|
| A|2022-12-18|
+------+----------+

I just found the answer to the problem. I realized I have to add another column to do the partitionBy.
w = Window.partitionBy("GP").orderBy("created_when")
df_1 = df_0.withColumn("lag", F.lag("status").over(w))\
           .where((F.col("status") != F.col("lag")) | (F.col("lag").isNull()))
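For reference, here is a minimal end-to-end sketch with the sample data from the question; it assumes an active SparkSession named spark is available:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

data = [
    ('A', '2022-10-10', 'A1'), ('B', '2022-10-12', 'A1'), ('B', '2022-10-13', 'A1'),
    ('C', '2022-10-13', 'A1'), ('C', '2022-10-14', 'A1'), ('B', '2022-12-15', 'A1'),
    ('C', '2022-12-16', 'A1'), ('D', '2022-12-17', 'A1'), ('A', '2022-12-18', 'A1'),
]
df_0 = spark.createDataFrame(data, ['status', 'created_when', 'GP'])

# Keep a row only when its status differs from the previous row within the same GP
w = Window.partitionBy('GP').orderBy('created_when')
df_1 = (df_0
        .withColumn('lag', F.lag('status').over(w))
        .where((F.col('status') != F.col('lag')) | (F.col('lag').isNull()))
        .drop('lag'))
df_1.show()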


Pivoting data without column

Starting from a df imported from Excel like this:
Code | Material | Text      | QTY
A1   | X222     | Model3    | 1
A2   | 4027721  | Gruoup1   | 1
A2   | 4647273  | Gruoup1.1 | 4
A1   | 573828   | Gruoup1.2 | 1
I want to create a new pivot table like this:
Code | Qty
A1   | 2
A2   | 5
I tried the following commands but they do not work:
df.pivot(index='Code', columns='',values='Qty')
df_pivot = df ("Code").Qty([sum, max])
You don't need pivot but groupby:
out = df.groupby('Code', as_index=False)['QTY'].sum()
# Or
out = df.groupby('Code')['QTY'].agg(['sum', 'max']).reset_index()
Output:
>>> out
  Code  sum  max
0   A1    2    1
1   A2    5    4
The equivalent code with pivot_table:
out = (df.pivot_table('QTY', 'Code', aggfunc=['sum', 'max'])
         .droplevel(1, axis=1).reset_index())
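For a quick check, a reproducible sketch with the sample data from the question (values copied as-is):

import pandas as pd

df = pd.DataFrame({
    'Code': ['A1', 'A2', 'A2', 'A1'],
    'Material': ['X222', '4027721', '4647273', '573828'],
    'Text': ['Model3', 'Gruoup1', 'Gruoup1.1', 'Gruoup1.2'],
    'QTY': [1, 1, 4, 1],
})

out = df.groupby('Code', as_index=False)['QTY'].sum()
print(out)
#   Code  QTY
# 0   A1    2
# 1   A2    5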

In Dask, how would I remove data that is not repeated across all values of another column?

I'm trying to find a set of data that exists across multiple instances of a column's value.
As an example, let's say I have a DataFrame with the following values:
+-------------+------------+----------+
| hardware_id | model_name | data_v |
+-------------+------------+----------+
| a | 1 | 0.595150 |
+-------------+------------+----------+
| b | 1 | 0.285757 |
+-------------+------------+----------+
| c | 1 | 0.278061 |
+-------------+------------+----------+
| d | 1 | 0.578061 |
+-------------+------------+----------+
| a | 2 | 0.246565 |
+-------------+------------+----------+
| b | 2 | 0.942299 |
+-------------+------------+----------+
| c | 2 | 0.658126 |
+-------------+------------+----------+
| a | 3 | 0.160283 |
+-------------+------------+----------+
| b | 3 | 0.180021 |
+-------------+------------+----------+
| c | 3 | 0.093628 |
+-------------+------------+----------+
| d | 3 | 0.033813 |
+-------------+------------+----------+
What I'm trying to get would be a DataFrame with all elements except the rows that contain a hardware_id of d, since they do not occur at least once per model_name.
I'm using Dask as my original data size is on the order of 7 GB, but if I need to drop down to Pandas that is also feasible. I'm very happy to hear any suggestions.
I have tried splitting the dataframe into individual dataframes based on the model_name attribute, then running a loop:
import numpy as np
import pandas as pd
import dask.dataframe as dd

models = ['1','1','1','2','2','2','3','3','3','3']
frame_1 = dd.from_pandas(
    pd.DataFrame({'hardware_id': ['a','b','c','a','b','c','a','b','c','d'],
                  'model_name': models,
                  'data_v': np.random.rand(len(models))}),
    npartitions=1)

# Split into one dataframe per model_name
model_splits = []
for i in range(1, 4):
    model_splits.append(frame_1[frame_1['model_name'].eq(str(i))])

# Filter each split against the hardware_ids of the splits processed so far
aggregate_list = []
while len(model_splits) > 0:
    data = model_splits.pop()
    for other_models in aggregate_list:
        data = data[data.hardware_id.isin(other_models.hardware_id.to_bag())]
    aggregate_list.append(data)
final_data = dd.concat(aggregate_list)
However, this is beyond inefficient, and I'm not entirely sure that my logic is sound.
Any suggestions on how to achieve this?
Thanks!
One way to accomplish this is to treat it as a groupby-aggregation problem.
Pandas
First, we set up the data:
import pandas as pd
import numpy as np
np.random.seed(12)
models = ['1','1','1','2','2','2','3','3','3','3']
df = pd.DataFrame(
    {'hardware_id': ['a','b','c','a','b','c','a','b','c','d'],
     'model_name': models,
     'data_v': np.random.rand(len(models))
    }
)
Then, collect the unique values of your model_name column.
unique_model_names = df.model_name.unique()
unique_model_names
array(['1', '2', '3'], dtype=object)
Next, we'll do several related steps at once. Our goal is to figure out which hardware_ids co-occur with the entire unique set of model_names. First we do a groupby aggregation to get the unique model_names per hardware_id. This returns an array, but we convert it to a tuple so it can be compared in the next step. At this point, every hardware ID is associated with a tuple of its unique models. Next, we check whether that tuple exactly matches our unique model names, using isin. If it doesn't, we know the condition should be False (exactly what we get).
agged = df.groupby("hardware_id", as_index=False).agg({"model_name": "unique"})
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])
agged
hardware_id model_name all_present_mask
0 a (1, 2, 3) True
1 b (1, 2, 3) True
2 c (1, 2, 3) True
3 d (3,) False
Finally, we can use this to get our list of "valid" hardware IDs, and then filter our initial dataframe.
relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = df.loc[
    df.hardware_id.isin(relevant_ids)
]
result
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Dask
We can do essentially the same thing, but we need to be a little clever with our calls to compute.
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)
unique_model_names = ddf.model_name.unique()

agged = ddf.groupby("hardware_id").model_name.unique().reset_index()
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])

relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id

result = ddf.loc[
    ddf.hardware_id.isin(relevant_ids.compute())  # can't pass a dask Series to `isin`
]
result.compute()
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Note that you would probably want to persist agged and relevant_ids if you have the memory available, to avoid some redundant computation.
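Continuing from the Dask snippet above, a minimal sketch of what persisting those intermediates might look like (assuming enough memory is available):

# Persist the intermediate results in memory so later steps reuse them
# instead of recomputing the groupby from scratch.
agged = agged.persist()
relevant_ids = relevant_ids.persist()

result = ddf.loc[ddf.hardware_id.isin(relevant_ids.compute())]
result.compute()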

Effectively removing rows from a Pandas DataFrame with groupby and temporal conditions?

I have a dataframe with tens of millions of rows:
| userId | pageId | bannerId | timestap |
|--------+--------+----------+---------------------|
| A | P1 | B1 | 2020-10-10 01:00:00 |
| A | P1 | B1 | 2020-10-10 01:00:10 |
| B | P1 | B1 | 2020-10-10 01:00:00 |
| B | P2 | B2 | 2020-10-10 02:00:00 |
What I'd like to do is remove all rows where, for the same userId, pageId, bannerId combination, the timestamp is within n minutes of the previous occurrence of that same combination.
What I'm doing now:
# Get all instances of `userId, pageId, bannerId` that repeats,
# although, not all of them will have repeated within the `n` minute
# threshold I'm interested in.
groups = df.groupby(['userId', 'pageId', 'bannerId']).userId.count()
# Iterate through each group, and manually check if the repetition was
# within `n` minutes. Keep track of all IDs to be removed.
to_remove = []
for user_id, page_id, banner_id in groups.index:
    sub = df.loc[
        (df.userId == user_id) &
        (df.pageId == page_id) &
        (df.bannerId == banner_id)
    ].sort_values('timestamp')
    # Now that each occurrence is listed chronologically,
    # check time diff.
    sub = sub.loc[
        ((sub.timestamp.shift(1) - sub.timestamp) / pd.Timedelta(minutes=1)).abs() <= n
    ]
    if sub.shape[0] > 0:
        to_remove += sub.index.tolist()
This does work as I'd like. Only issue is that with the large amount of data I have, it takes hours to complete.
To get a more instructive result, I used a slightly longer
source DataFrame:
userId pageId bannerId timestap
0 A P1 B1 2020-10-10 01:00:00
1 A P1 B1 2020-10-10 01:04:10
2 A P1 B1 2020-10-10 01:05:00
3 A P1 B1 2020-10-10 01:08:20
4 A P1 B1 2020-10-10 01:09:30
5 A P1 B1 2020-10-10 01:11:00
6 B P1 B1 2020-10-10 01:00:00
7 B P2 B2 2020-10-10 02:00:00
Note: timestap column is of datetime type.
Start by defining a "filtering" function for a group
of timestap values (for some combination of userId,
pageId and bannerId):
def myFilter(grp, nMin):
    prevTs = np.nan
    grp = grp.sort_values()
    res = []
    for ts in grp:
        if pd.isna(prevTs) or (ts - prevTs) / pd.Timedelta(1, 'm') >= nMin:
            prevTs = ts
            res.append(ts)
    return res
Then set the time threshold (the number of minutes):
nMin = 5
And the last thing is to generate the result:
result = df.groupby(['userId', 'pageId', 'bannerId'])\
    .timestap.apply(myFilter, nMin).explode().reset_index()
For my data sample, the result is:
userId pageId bannerId timestap
0 A P1 B1 2020-10-10 01:00:00
1 A P1 B1 2020-10-10 01:05:00
2 A P1 B1 2020-10-10 01:11:00
3 B P1 B1 2020-10-10 01:00:00
4 B P2 B2 2020-10-10 02:00:00
Note that "ordinary" diff is not enough, because eg. starting from the
row with timestamp 01:05:00, two following rows (01:08:20 and 01:09:30)
should be dropped, as they are within 5 minutes limit from 01:05:00.
So it is not enough to look at the previous row only.
Starting from some row you should "mark for drop" all following rows until
you find a row with the timestamp more or at least equally distant from the
"start row" than the limit.
And in this case just this rows becomes the starting row for analysis of
following rows (within the current group).
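To make that concrete, here is a small sketch (reusing the myFilter function above and the userId A timestamps from the sample) contrasting a plain row-to-row diff with the stateful filter:

import pandas as pd
import numpy as np

ts = pd.Series(pd.to_datetime([
    '2020-10-10 01:00:00', '2020-10-10 01:04:10', '2020-10-10 01:05:00',
    '2020-10-10 01:08:20', '2020-10-10 01:09:30', '2020-10-10 01:11:00']))

# A plain diff compares each row only to the row directly before it; here every
# row after 01:00:00 is within 5 minutes of its immediate predecessor,
# so only 01:00:00 would be kept.
naive_keep = ts[ts.diff().isna() | (ts.diff() >= pd.Timedelta(minutes=5))]

# The stateful filter compares each row to the last *kept* row,
# so it keeps 01:00:00, 01:05:00 and 01:11:00 (matching the result above).
stateful_keep = myFilter(ts, 5)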

Manipulate map using different ways?

I am hoping to find an elegant way of sorting a map by value first and then by the key.
For example:
B | 50
A | 50
C | 50
E | 10
D | 100
F | 99
I have the following code:
// Making the map into a list first
List<Map.Entry<String, Integer>> sortingList = new LinkedList<>(processMap.entrySet());
// Create a comparator that would compare the values of the map
Comparator<Map.Entry<String, Integer>> c = Comparator.comparingInt(entry -> entry.getValue());
// Sort the list in descending order
sortingList.sort(c.reversed());
I don't need the result to be a map again, so this is sufficient; however, my result is:
D | 100
F | 99
B | 50
A | 50
C | 50
E | 10
I would like to sort not just by value, but also by the key, so the result becomes:
D | 100
F | 99
A | 50
B | 50
C | 50
E | 10
I have researched some possible solutions, but the problem is that my values need to be in descending order while my keys have to be ascending...
Hoping someone can help me.
Try this:
Comparator<Map.Entry<String, Integer>> c = Comparator.comparing(Map.Entry<String, Integer>::getValue)
.reversed()
.thenComparing(Map.Entry::getKey);

Pandas: how to groupby on concatenated dataframes with same column names?

How do I properly concat (or maybe this calls for .merge()?) N dataframes with the same column names, so that I can groupby them with a distinguishable key per source dataframe? For example:
dfs = {
    'A': df1,  # columns are C1, C2, C3
    'B': df2,  # same columns C1, C2, C3
}
gathered_df = pd.concat(dfs.values()).groupby(['C2'])['C3']\
    .count()\
    .sort_values(ascending=False)\
    .reset_index()
I want to get something like
|         | A          | B          |
|---------|------------|------------|
| C2_val1 | count_perA | count_perB |
| C2_val2 | count_perA | count_perB |
| C2_val3 | count_perA | count_perB |
I think you need reset_index to create columns from the MultiIndex and then add the level_0 column to the groupby to distinguish the dataframes. Last, reshape by unstack:
gathered_df = pd.concat(dfs).reset_index().groupby(['C2','level_0'])['C3'].count().unstack()
See also: What is the difference between size and count in pandas?
Sample:
df1 = pd.DataFrame({'C1': [1, 2, 3],
                    'C2': [4, 5, 5],
                    'C3': [7, 8, np.nan]})
df2 = df1.mul(10).fillna(1)
df2.C2 = df1.C2
print (df1)
C1 C2 C3
0 1 4 7.0
1 2 5 8.0
2 3 5 NaN
print (df2)
C1 C2 C3
0 10 4 70.0
1 20 5 80.0
2 30 5 1.0
dfs = {
'A': df1,
'B': df2
}
gathered_df = pd.concat(dfs).reset_index().groupby(['C2','level_0'])['C3'].count().unstack()
gathered_df.index.name = None
gathered_df.columns.name = None
print (gathered_df)
A B
4 1 1
5 1 2
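For comparison, a quick sketch with the same dfs: swapping count() for size() also counts the row where df1's C3 is NaN (see the linked question above):

gathered_size = (pd.concat(dfs).reset_index()
                   .groupby(['C2','level_0'])['C3'].size().unstack())
print(gathered_size)
# level_0  A  B
# C2
# 4        1  1
# 5        2  2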