Pandas cumsum only if positive else zero - pandas

I am building a cumulative sum table in which no expense can be counted until there has been some income.
This is what I have:
Incoming  Outgoing  Total
0         150       -150
10        20        -160
100       30        -90
50        70        -110
Required output:
Incoming  Outgoing  Total
0         150       0
10        20        0
100       30        70
50        70        50
I've tried
df.clip(lower=0)
and
df['new_column'].apply(lambda x : df['outgoing']-df['incoming'] if df['incoming']>df['outgoing'])
Neither of these works.
Is there another way?

Update:
A more straightforward approach inspired by your code using clip and without numpy:
diff = df['Incoming'].sub(df['Outgoing'])
df['Total'] = diff.mul(diff.ge(0).cumsum().clip(0, 1)).cumsum()
print(df)
# Output:
   Incoming  Outgoing  Total
0         0       150      0
1        10        20      0
2       100        30     70
3        50        70     50
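To see why the one-liner works, the intermediate steps can be computed separately. This is a minimal sketch on the question's data; the names `seen_positive` and `masked` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'Incoming': [0, 10, 100, 50],
                   'Outgoing': [150, 20, 30, 70]})

# Per-row balance change: -150, -10, 70, -20
diff = df['Incoming'].sub(df['Outgoing'])
# 0 until the first non-negative difference, 1 from that row onwards
seen_positive = diff.ge(0).cumsum().clip(0, 1)
# Differences before the first non-negative row are zeroed out
masked = diff.mul(seen_positive)
df['Total'] = masked.cumsum()
print(df)
```

Note that only the rows before the first non-negative difference are masked; later negative differences still reduce the total.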
Old answer:
Find the row where the balance is positive for the first time then compute the cumulative sum from this point:
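import numpy as np
# Position of the first row where Incoming covers Outgoing
start = np.where(df['Incoming'] - df['Outgoing'] >= 0)[0][0]
# Cumulative balance from that row on; earlier rows are filled with 0
df['Total'] = df.iloc[start:]['Incoming'].sub(df.iloc[start:]['Outgoing']) \
                .cumsum().reindex(df.index, fill_value=0)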
Output:
>>> df
   Incoming  Outgoing  Total
0         0       150      0
1        10        20      0
2       100        30     70
3        50        70     50
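Both versions above only zero out the rows before the first non-negative difference. If the intent is instead that the balance may never drop below zero at any later point either (an assumption; the question's sample data cannot distinguish the two readings), one possible sketch subtracts the running minimum of the plain cumulative sum. The helper name `floored_cumsum` is hypothetical.

```python
import pandas as pd

def floored_cumsum(diff: pd.Series) -> pd.Series:
    """Running balance that is reset to 0 whenever it would go negative."""
    s = diff.cumsum()
    # Lowest point reached so far, capped at 0 so a positive start is not shifted
    lowest = s.cummin().clip(upper=0)
    return s - lowest

df = pd.DataFrame({'Incoming': [0, 10, 100, 50],
                   'Outgoing': [150, 20, 30, 70]})
df['Total'] = floored_cumsum(df['Incoming'] - df['Outgoing'])
print(df)
```

On the question's data this gives the same totals as the answers above; the results would differ only if the balance dipped below zero again after the first positive row.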

IIUC, you can use np.where to flag the rows where Incoming is at least Outgoing in a helper column and forward-fill it. Then, wherever the helper column is not null (checked with notnull()), take the difference and apply cumsum() to the result:
df['t'] = np.where(df['Incoming'].ge(df['Outgoing']), 0, np.nan)
df['t'] = df['t'].ffill()
df['Total'] = np.where(df['t'].notnull(), df['Incoming'].sub(df['Outgoing']), df['t'])
df['Total'] = df['Total'].cumsum()
df = df.drop(columns='t')
This will give back:
   Incoming  Outgoing  Total
0         0       150    NaN
1        10        20    NaN
2       100        30   70.0
3        50        70   50.0
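The NaN values differ from the required output, which shows 0 in those rows. A small follow-up step, sketched here on the question's data, fills them and converts the column back to integers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Incoming': [0, 10, 100, 50],
                   'Outgoing': [150, 20, 30, 70]})

# Helper column: NaN before the first row where Incoming >= Outgoing, 0 afterwards
df['t'] = np.where(df['Incoming'].ge(df['Outgoing']), 0, np.nan)
df['t'] = df['t'].ffill()
df['Total'] = np.where(df['t'].notnull(), df['Incoming'].sub(df['Outgoing']), df['t'])
# cumsum skips the leading NaN values; fillna turns them into 0
df['Total'] = df['Total'].cumsum().fillna(0).astype(int)
df = df.drop(columns='t')
print(df)
```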

Related

Pandas Dataframe to subtract the values with Previous Executed Value

I want to subtract the first row's count from the total test case count,
and then subtract each subsequent count from the previous result.
**Input:**
Date Count
17-10-2022 20
18-10-2022 18
19-10-2022 15
20-10-2022 10
21-10-2022 5
**Code:**
df['Date'] = pd.to_datetime(df['Date'])
edate = df['Date'].max().strftime('%Y-%m-%d')
sdate = df['Date'].min().strftime('%Y-%m-%d')
df['Date'] = pd.to_datetime(df['Date']).apply(lambda x: x.date())
df=df.groupby(['Date'])['Date'].count().reset_index(name='count')
df['result'] = Test_case - df['count'].iloc[0]
df['result'] = df['result'] - df['count'].shift(1)
**Output generating:**
Date count result
0 2022-10-17 20 NaN
1 2022-10-18 18 40.0
**Expected Output:**
Date Count Result
17-10-2022 20 60(80-20) - 80 is the total Test case count for example
18-10-2022 18 42(60-18)
19-10-2022 15 27(42-15)
20-10-2022 10 17(27-10)
21-10-2022 5 12(17-5)
If 80 is an arbitrary number, use the following code:
n = 80
df.assign(Result=df['Count'].cumsum().mul(-1).add(n))
output:
Date Count Result
0 17-10-2022 20 60
1 18-10-2022 18 42
2 19-10-2022 15 27
3 20-10-2022 10 17
4 21-10-2022 5 12
You can change n as needed.
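For reference, a self-contained version of the same idea, built from the question's input table. The value n = 80 is the example total from the question; writing the expression as `n - cumsum` is just an equivalent spelling of the chained `mul(-1).add(n)`.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['17-10-2022', '18-10-2022', '19-10-2022', '20-10-2022', '21-10-2022'],
    'Count': [20, 18, 15, 10, 5],
})
n = 80  # total test case count, taken from the question's example
# Remaining = total minus everything executed so far
df['Result'] = n - df['Count'].cumsum()
print(df)
```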

Pandas DataFrame subtract values

I'm new to Python.
I have a data frame (df) with the following structure:
ID  rate  Sequential number
a   150   1
a   150   1
a   50    2
b   250   1
c   25    1
d   25    1
d   40    2
d   30    3
The IDs are customers, rate is the monthly rate, and Sequential number increases by 1 each time the customer changes the monthly rate.
I want to do the following:
For every ID, find the row with the maximum Sequential number and take its rate, find the row with the minimum Sequential number and take its rate, then subtract the second rate from the first.
In the end I want an additional column in my data frame with the difference of the rates. Maybe the loop could do the following:
for id in df()
find max() in column Sequential number and get value in rates -
min () in column Sequential number and get value in rates
return difference
The new df_new should be this:
ID  rate  Sequential number  rate_diff
a   150   1                  0
a   150   1                  0
a   50    2                  -100
b   250   1                  0
c   25    1                  0
d   25    1                  0
d   40    2                  0
d   30    3                  5
If an ID has only one entry, the rate_diff should be 0
I already tried a lambda function:
df['diff_rate'] = df.groupby('ID')['rate'].transform(lambda x : x-x.min())
but this returns
ID  rate  Sequential number  rate_diff
a   150   1                  100
a   150   1                  100
a   50    2                  0
b   250   1                  0
c   25    1                  0
d   25    1                  0
d   40    2                  15
d   30    3                  10
Maybe one of you has a small workaround for this! :-)
One approach with indexing:
g = df.groupby('ID')['Sequential number']
IMAX = g.idxmax()
IMIN = g.idxmin()
df['rate_diff'] = 0
df.loc[IMAX, 'rate_diff'] = (df.loc[IMAX, 'rate'].to_numpy()
                             - df.loc[IMIN, 'rate'].to_numpy()
                             )
Another with groupby.transform+where:
g = df.sort_values(by=['ID', 'Sequential number']).groupby('ID')
m = g['Sequential number'].idxmax()
df['rate_diff'] = (g['rate'].transform(lambda x: x.iloc[-1]-x.iloc[0])
.where(df.index.isin(m), 0)
)
output:
  ID  rate  Sequential number  rate_diff
0  a   150                  1          0
1  a   150                  1          0
2  a    50                  2       -100
3  b   250                  1          0
4  c    25                  1          0
5  d    25                  1          0
6  d    40                  2          0
7  d    30                  3          5
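A self-contained sketch of the indexing approach, so it can be run as is. The frame is rebuilt from the output table above (with rate 30 in the last row), and lowercase `imax`/`imin` are used in place of the answer's names.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': list('aaabcddd'),
    'rate': [150, 150, 50, 250, 25, 25, 40, 30],
    'Sequential number': [1, 1, 2, 1, 1, 1, 2, 3],
})

g = df.groupby('ID')['Sequential number']
imax = g.idxmax()  # row label of the highest sequential number per ID
imin = g.idxmin()  # row label of the lowest sequential number per ID

df['rate_diff'] = 0
# Both lookups are ordered by ID, so the arrays line up element by element
df.loc[imax, 'rate_diff'] = (df.loc[imax, 'rate'].to_numpy()
                             - df.loc[imin, 'rate'].to_numpy())
print(df)
```

For an ID with a single row, idxmax and idxmin point at the same row, so the difference is 0 as required.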

Adding extra n rows at the end of a dataframe of a certain value

I have a dataframe that currently has 22 rows:
index value
0 23
1 22
2 19
...
21 20
To this dataframe, I want to add 78 rows to make it exactly 100 rows. So I need to fill loc[22:99] with a certain value, let's say 100.
I tried something like this:
uncon_dstn_2021['balance'].loc[22:99] = 100
but it did not work. Any ideas?
You can use reindex:
out = df.reindex(df.index.tolist() + list(range(22, 99+1)), fill_value = 100)
You can also use pd.concat:
df1 = pd.concat([df, pd.DataFrame({'balance': [100]*(100-len(df))})], ignore_index=True)
print(df1)
# Output
balance
0 1
1 14
2 11
3 11
4 10
.. ...
96 100
97 100
98 100
99 100
[100 rows x 1 columns]
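A compact, runnable variant of the reindex idea, assuming the frame has a default 0..21 integer index. The `balance` values below are placeholders, not the asker's data.

```python
import pandas as pd

# Placeholder frame with 22 rows and a default RangeIndex (0..21)
df = pd.DataFrame({'balance': range(1, 23)})

# Ask for index labels 0..99; labels that do not exist yet get fill_value
out = df.reindex(range(100), fill_value=100)
print(out)
```

If the index is not a plain 0..n-1 range, the pd.concat version with ignore_index=True is likely the safer choice.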

How to split numbers in pandas column into deciles?

I have a column in a pandas dataset with random values ranging between 100 and 500.
I need to create a new column 'deciles' from it, like a ranking with a total of 20 deciles. I need to assign a rank number out of 20 based on the value.
10 to 20 - is the first decile, number 1
20 to 30 - is the second decile, number 2
x = np.random.randint(100,501,size=(1000)) # column of 1000 rows with values ranging between 100 and 500
df['credit_score'] = x
df['credit_decile_rank'] = df['credit_score'].map( lambda x: int(x/20) )
df.head()
Use integer division by 10:
df = pd.DataFrame({
'credit_score':[4,15,24,55,77,81],
})
df['credit_decile_rank'] = df['credit_score'] // 10
print (df)
credit_score credit_decile_rank
0 4 0
1 15 1
2 24 2
3 55 5
4 77 7
5 81 8
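The same integer-division idea can be adapted to the range in the question. The sketch below assumes scores from 100 to 500 split into 20 equal-width bins of 20 points each, numbered from 1; the names `low`, `width`, `n_bins` and the column `credit_rank` are illustrative, not from the question.

```python
import pandas as pd

df = pd.DataFrame({'credit_score': [100, 119, 120, 300, 499, 500]})

# Assumed layout: 100..500 split into 20 equal-width bins
low, width, n_bins = 100, 20, 20
rank = (df['credit_score'] - low) // width + 1
# The top score (500) would land in bin 21, so cap it at the last bin
df['credit_rank'] = rank.clip(upper=n_bins)
print(df)
```

If bins with equal numbers of rows are wanted instead of equal widths, pd.qcut is the usual tool, but that is a different binning rule.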

Forcing dataframe recalculation after a change of a specific cell

I start with a simple
df = pd.DataFrame({'units':[30,20]})
And I get
units
0 30
1 20
I then add a row to total the column:
my_sum = df.sum()
df = df.append(my_sum, ignore_index=True)
Finally, I add a column to calculate percentages off of the 'units' column:
df['pct'] = df.units / df.units[:-1].sum()
ending with this:
units pct
0 30 0.6
1 20 0.4
2 50 1.0
So far so good, but now the question: I want to change the middle number of units from 20 to, for example, 40. I can use this:
df.iloc[1, 0] = 40
or
df.iat[1, 0] = 40
which changes the cell, but the calculated values in the last row and the second column don't change to reflect it:
units pct
0 30 0.6
1 40 0.4
2 50 1.0
How do I force these calculated values to adjust following the change in that particular cell?
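Make a function that calculates it (DataFrame.append was removed in pandas 2.0, so this version uses pd.concat):
def f(df):
    # Append the column totals as a new row, then compute each row's share
    total = df.sum().to_frame().T
    return pd.concat([df, total], ignore_index=True).assign(
        pct=lambda d: d.units / d.units.iat[-1])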
df.iat[1, 0] = 40
f(df)
units pct
0 30 0.428571
1 40 0.571429
2 70 1.000000