Pandas DataFrame subtract values - pandas

I'm new to Python.
I have a data frame (df) which has the following structure:
ID  rate  Sequential number
a   150   1
a   150   1
a   50    2
b   250   1
c   25    1
d   25    1
d   40    2
d   30    3
The IDs are customers, rate is the monthly rate, and Sequential number always increases by 1 whenever the customer changes the monthly rate.
I want to do the following:
for every ID, find the maximum value in the column Sequential number and take the associated value in the column rate, then find the minimum value in the column Sequential number and take its associated rate, and subtract the two rates.
At the end I want an additional column in my data frame with the difference of the rates. Maybe a loop could do the following:
for id in df:
    find max() in column Sequential number and get value in rate
    - min() in column Sequential number and get value in rate
    return difference
The new df_new should be this
ID  rate  Sequential number  rate_diff
a   150   1                  0
a   150   1                  0
a   50    2                  -100
b   250   1                  0
c   25    1                  0
d   25    1                  0
d   40    2                  0
d   30    3                  5
If an ID has only one entry, the rate_diff should be 0
I already tried a lambda function:
df['rate_diff'] = df.groupby('ID')['rate'].transform(lambda x: x - x.min())
but this returns
ID  rate  Sequential number  rate_diff
a   150   1                  100
a   150   1                  100
a   50    2                  0
b   250   1                  0
c   25    1                  0
d   25    1                  0
d   40    2                  15
d   30    3                  5
Maybe one of you has a small workaround for this! :-)

One approach with indexing:
g = df.groupby('ID')['Sequential number']
IMAX = g.idxmax()   # index of the row with the highest Sequential number per ID
IMIN = g.idxmin()   # index of the row with the lowest Sequential number per ID
df['rate_diff'] = 0
df.loc[IMAX, 'rate_diff'] = (df.loc[IMAX, 'rate'].to_numpy()
                             - df.loc[IMIN, 'rate'].to_numpy())
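For reference, the frame from the question can be rebuilt like this (values copied from the first table above) to test the snippets in this answer:
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'a', 'a', 'b', 'c', 'd', 'd', 'd'],
                   'rate': [150, 150, 50, 250, 25, 25, 40, 30],
                   'Sequential number': [1, 1, 2, 1, 1, 1, 2, 3]})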
Another with groupby.transform + where:
g = df.sort_values(by=['ID', 'Sequential number']).groupby('ID')
m = g['Sequential number'].idxmax()   # index of the last (highest Sequential number) row per ID
df['rate_diff'] = (g['rate'].transform(lambda x: x.iloc[-1] - x.iloc[0])   # last rate minus first rate
                   .where(lambda s: s.index.isin(m), 0))                   # keep it only on that last row
output:
ID rate Sequential number rate_diff
0 a 150 1 0
1 a 150 1 0
2 a 50 2 -100
3 b 250 1 0
4 c 25 1 0
5 d 25 1 0
6 d 40 2 0
7 d 30 3 5

Related

Getting groups by group index

I want to access a group by its group index. My dataframe is given below:
import pandas as pd
from io import StringIO
import numpy as np
data = """
id,name
100,A
100,B
100,C
100,D
100,pp;
212,E
212,F
212,ds
212,G
212, dsds
212, sas
300,Endüstrisi`
"""
df = pd.read_csv(StringIO(data))
I want to group by 'id' and access the groups by group index.
dfg=df.groupby('id',sort=False,as_index=False)
dfg.get_group(0)
I was expecting this to return the first group (the group for id = 100).
You need to pass a value of id:
dfg=df.groupby('id',sort=False)
a = dfg.get_group(100)
print (a)
id name
0 100 A
1 100 B
2 100 C
3 100 D
4 100 pp;
Or pass the id of the first row:
dfg = df.groupby('id', sort=False)
a = dfg.get_group(df.loc[0, 'id'])   # id value of the first row
print (a)
id name
0 100 A
1 100 B
2 100 C
3 100 D
4 100 pp;
If you need to enumerate the groups, you can use GroupBy.ngroup:
dfg = df.groupby('id', sort=False)
a = df[dfg.ngroup() == 0]   # ngroup() numbers the groups 0, 1, 2, ... in order of appearance
print (a)
id name
0 100 A
1 100 B
2 100 C
3 100 D
4 100 pp;
Detail:
print (dfg.ngroup())
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 1
8 1
9 1
10 1
11 2
dtype: int64
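If you need this lookup repeatedly, the ngroup approach can be wrapped in a small helper; nth_group below is just a hypothetical convenience function, not an existing pandas method:
def nth_group(frame, by, n):
    # return the rows of the n-th group, counted in order of first appearance (sort=False)
    return frame[frame.groupby(by, sort=False).ngroup() == n]

print (nth_group(df, 'id', 1))   # the second group, id 212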
EDIT: Another idea, if you need to select groups by position (all id values form consecutive blocks), is to compare against the unique values of id, selected by position:
ids = df['id'].unique()
print (ids)
[100 212 300]
print (df[df['id'].eq(ids[0])])
id name
0 100 A
1 100 B
2 100 C
3 100 D
4 100 pp;
print (df[df['id'].eq(ids[1])])
id name
5 212 E
6 212 F
7 212 ds
8 212 G
9 212 dsds
10 212 sas

Creating 2 "cartridges" of cumulative sum with conditions using SQL

I need to create 2 cumulative sums based on the value type, for example:
I have incoming stock units of 2 types, A and B, and I also have records of outgoing stock units.
If we have enough stock of type "A" it should be taken out of type A; if not, it should be taken out of type B. So basically I need to create the columns "A_stock" and "B_stock" below, representing the current balance of each type.
I tried using a cumulative sum but I'm having trouble with the condition... is there a way to write this query without using a loop? (Vertica DB)
In the table below, A_stock and B_stock are the final result I need to create:
ID  Type  In   OUT   A_stock  B_stock  Order_id
1   A     100  0     100      0        1
1   B     50   0     100      50       2
1   A     100  0     200      50       3
1   -     0    -200  0        50       4
1   -     0    -10   0        40       5
1   B     50   0     0        90       6
1   A     40   0     40       90       7
1   -     0    -20   20       90       8
2   A     30   0     30       0        1
2   B     20   0     30       20       2
2   A     10   0     40       20       3
2   -     0    -20   20       20       4
You can use window functions - but you need a column that defines the ordering of the rows; Order_id from your table works:
select t.*,
       -- "in" and "out" are quoted because IN is a reserved word
       sum(case when type = 'A' then "in" + "out" else 0 end) over(partition by id order by order_id) a_stock,
       sum(case when type = 'B' then "in" + "out" else 0 end) over(partition by id order by order_id) b_stock
from mytable t
This assumes that you want the stock on a per-id basis; if that's not the case, just remove the partition clause from the over() clause.

Pandas column merging on condition

This is my pandas df:
Id Protein A_Egg B_Meat C_Milk Category
A 10 10 20 0 egg
B 20 10 0 10 milk
C 20 10 10 10 meat
D 25 20 10 0 egg
I wish to merge the Protein column with another column based on "Category".
My desired output is:
Id Protein_final
A 20
B 30
C 30
D 45
Ideally I would show how I am approaching this, but I am frankly clueless!!
EDIT: Also, how do I handle it if the Category is blank or doesn't match one of the columns? (In that case the final value should be the same as the initial value in the Protein column.)
Use DataFrame.lookup with some preprocessing: strip the prefix before _ from the column names and lowercase them, then add the looked-up values to the Protein column:
# rename A_Egg/B_Meat/C_Milk to egg/meat/milk so the names line up with Category
arr = df.rename(columns=lambda x: x.split('_')[-1].lower()).lookup(df.index, df['Category'])
df['Protein'] += arr
print (df)
Id Protein A_Egg B_Meat C_Milk Category
0 A 20 10 20 0 egg
1 B 30 10 0 10 milk
2 C 30 10 10 10 meat
3 D 45 20 10 0 egg
If you only need 2 columns at the end:
df = df[['Id','Protein']]
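Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, a sketch based on the factorize/reindex pattern from the pandas deprecation notes could replace the lookup call (the frame is rebuilt here so the snippet runs on its own):
import numpy as np
import pandas as pd

# frame from the question, rebuilt for a self-contained example
df = pd.DataFrame({'Id': list('ABCD'),
                   'Protein': [10, 20, 20, 25],
                   'A_Egg': [10, 10, 10, 20],
                   'B_Meat': [20, 0, 10, 10],
                   'C_Milk': [0, 10, 10, 0],
                   'Category': ['egg', 'milk', 'meat', 'egg']})

renamed = df.rename(columns=lambda x: x.split('_')[-1].lower())
codes, uniques = pd.factorize(df['Category'])
df['Protein'] += renamed.reindex(uniques, axis=1).to_numpy()[np.arange(len(df)), codes]
print (df[['Id', 'Protein']])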
You can melt the dataframe, filter for rows where Category equals the variable column, and sum into the final column:
(
    df
    .melt(["Id", "Protein", "Category"])
    .assign(variable=lambda x: x.variable.str[2:].str.lower(),
            Protein_final=lambda x: x.Protein + x.value)
    .query("Category == variable")
    .filter(["Id", "Protein_final"])
)
Id Protein_final
0 A 20
3 D 45
6 C 30
9 B 30
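The EDIT (a blank Category, or one with no matching column) isn't handled by either snippet above. A rough sketch on the original frame (before Protein is modified) could fall back to adding 0, so Protein keeps its initial value in those rows, assuming that is the intended behaviour:
renamed = df.rename(columns=lambda x: x.split('_')[-1].lower())
extra = [renamed.at[i, c] if c in renamed.columns else 0    # 0 when Category is blank or unknown
         for i, c in zip(df.index, df['Category'].fillna(''))]
df['Protein_final'] = df['Protein'] + extra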

How to split numbers in pandas column into deciles?

I have a column in a pandas dataset of random values ranging between 100 and 500.
I need to create a new column 'deciles' out of it, like a ranking with a total of 20 buckets. I need to assign a rank number out of 20 based on the value.
10 to 20 - is the first decile, number 1
20 to 30 - is the second decile, number 2
import numpy as np
import pandas as pd

x = np.random.randint(100, 501, size=1000)   # 1000 values between 100 and 500
df['credit_score'] = x
df['credit_decile_rank'] = df['credit_score'].map(lambda x: int(x/20))
df.head()
Use integer division by 10:
df = pd.DataFrame({
'credit_score':[4,15,24,55,77,81],
})
df['credit_decile_rank'] = df['credit_score'] // 10
print (df)
credit_score credit_decile_rank
0 4 0
1 15 1
2 24 2
3 55 5
4 77 7
5 81 8
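If the goal is really 20 equal-width buckets over the whole 100-500 range (as the question describes) rather than fixed bins of width 10, pd.cut may be closer to the intent. A small sketch, with the 1-20 labels being an assumed convention:
import numpy as np
import pandas as pd

df = pd.DataFrame({'credit_score': np.random.randint(100, 501, size=1000)})
# 20 equal-width bins over the observed range, labelled 1..20
df['credit_decile_rank'] = pd.cut(df['credit_score'], bins=20, labels=range(1, 21))
print (df.head())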

Pandas - create new rows with constraints and forward fill them with existing data

I have two dataframes. We'll call them main and valid_dates
main:
name from amount days
A 7/31/18 200 1
B 7/31/18 300 1
C 7/30/18 200 1
D 7/27/18 100 3
......
G 7/17/18 50 1
H 7/13/18 150 4
valid_dates:
date
7/13/18
7/16/18
7/17/18
7/27/18
7/30/18
7/31/18
Here's where it gets complicated. I need to expand the rows where days > 1, but I can't use a non-valid date.
output:
name from amount days
A 7/31/18 200 1
B 7/31/18 300 1
C 7/30/18 200 1
D 7/27/18 100 3
......
G 7/17/18 50 1
H 7/16/18 150 1
H 7/13/18 150 3
alternate (and equally valid) output:
name from amount days rep_days rep_date
A 7/31/18 200 1 1 7/31/18
B 7/31/18 300 1 1 7/31/18
C 7/30/18 200 1 1 7/30/18
D 7/27/18 100 3 3 7/27/18
......
G 7/17/18 50 1 1 7/17/18
H 7/13/18 150 4 1 7/16/18
H 7/13/18 150 4 3 7/13/18
To clarify what happened:
- 7/27 + 3 = 7/30. However, no date strictly between 7/27 and 7/30 was in valid_dates, so that entry is left alone, with 7/27 representing 3 days.
- 7/13 + 4 = 7/17. The only date strictly between 7/13 and 7/17 in valid_dates is 7/16. Therefore a 7/16 entry will be added, and it'll represent one day. 7/13 has to represent 3. The rest of the row data is duplicated.
- Going by the above example: if 7/15 AND 7/16 were in valid_dates, then a 7/15 and a 7/16 entry would be added, each representing one day. 7/13 would represent 2. The rest of the row data is duplicated.
You can assume that where days > 1, from + days will never be greater than another entry in the from column.
I realize this may be confusing so let me know if you have any questions.
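A literal translation of these rules into pandas could look like the sketch below. The splitting rule (cut each from .. from + days span at the valid dates strictly inside it, and emit the later pieces first) is my reading of the examples above, and the column names follow the question:
import pandas as pd

def expand_rows(main, valid_dates):
    valid = pd.to_datetime(valid_dates['date']).sort_values().tolist()
    out = []
    for _, row in main.iterrows():
        start = pd.to_datetime(row['from'])
        end = start + pd.Timedelta(days=int(row['days']))
        # valid dates strictly inside the span split it into sub-spans
        cuts = [start] + [d for d in valid if start < d < end] + [end]
        for lo, hi in reversed(list(zip(cuts[:-1], cuts[1:]))):   # later pieces first, as in the sample output
            piece = row.copy()
            piece['from'] = lo
            piece['days'] = (hi - lo).days
            out.append(piece)
    return pd.DataFrame(out).reset_index(drop=True)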