Pandas DataFrame: how to iterate over rows and perform calculations on their values

I've started working with pandas DataFrames and am trying to figure out how to handle the task below.
I have an Excel spreadsheet that needs to be imported into a pandas DataFrame, and the calculations below need to be done to populate the PercentageOnSale, Bonus and EmployeesIncome columns.
If the sum of all SalesValues for an EmployeeID is less than 5000, the PercentageOnSale should be 5% of SalesValue.
If the sum of all SalesValues for an EmployeeID is equal to or more than 5000, the PercentageOnSale should be 7% of SalesValue.
If the sum of all SalesValues for an EmployeeID is more than 10,000, the PercentageOnSale should be 7% of SalesValue and additionally a Bonus of 3% should be calculated.
EmployeesIncome is the sum of the PercentageOnSale and Bonus columns.
[sample Excel view]

You could try groupby-apply as follows:
import pandas as pd

# Data
df = pd.DataFrame({"EmployeeID": [1, 1, 2, 3, 1, 3, 5, 1],
                   "ProductSold": ["P1", "P2", "P3", "P1", "P2", "P3", "P1", "P2"],
                   "SalesValue": [3000, 3500, 4000, 3000, 5000, 3000, 3000, 4000]})

# Calculations
def calculate(x):
    total = x['SalesValue'].sum()
    # Calculate Bonus: 3% on every sale once the employee's total exceeds 10,000
    x['Bonus'] = 0
    if total > 10000:
        x['Bonus'] = 0.03 * x['SalesValue']
    # Calculate PercentageOnSale: 5% below a 5000 total, otherwise 7%
    if total < 5000:
        x['PercentageOnSale'] = 0.05 * x['SalesValue']
    else:
        x['PercentageOnSale'] = 0.07 * x['SalesValue']
    # Total income per sale
    x['EmployeesIncome'] = x['PercentageOnSale'] + x['Bonus']
    return x

df_final = df.groupby('EmployeeID').apply(calculate)
The output is as follows:
   EmployeeID ProductSold  SalesValue  Bonus  PercentageOnSale  EmployeesIncome
0           1          P1        3000   90.0             210.0            300.0
1           1          P2        3500  105.0             245.0            350.0
2           2          P3        4000    0.0             200.0            200.0
3           3          P1        3000    0.0             210.0            210.0
4           1          P2        5000  150.0             350.0            500.0
5           3          P3        3000    0.0             210.0            210.0
6           5          P1        3000    0.0             150.0            150.0
7           1          P2        4000  120.0             280.0            400.0
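If the groupby-apply turns out to be slow on larger data, here is a vectorized sketch (my own addition, not from the question) using groupby-transform plus numpy.where; assuming the thresholds above, it should produce the same columns without a Python-level apply:

import numpy as np

# Broadcast each employee's total sales back onto that employee's rows
totals = df.groupby('EmployeeID')['SalesValue'].transform('sum')

# Bonus only where the employee's total exceeds 10,000
df['Bonus'] = np.where(totals > 10000, 0.03 * df['SalesValue'], 0.0)

# 5% below a 5000 total, otherwise 7%
df['PercentageOnSale'] = np.where(totals < 5000, 0.05, 0.07) * df['SalesValue']

df['EmployeesIncome'] = df['PercentageOnSale'] + df['Bonus']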

Related

How can I group a continuous column (0-1) into equal-sized buckets? Scala Spark

I have a DataFrame column that I want to split into equal-sized buckets. The values in this column are floats between 0 and 1. The data is skewed, so most values fall in the 0.90s and at 1. I want:
Bucket 10: all 1's (the size of this bucket will differ from buckets 2-9 and bucket 1)
Buckets 2-9: any values > 0 and < 1 (equal-sized)
Bucket 1: all 0's (the size of this bucket will differ from buckets 2-9 and bucket 10)
Example:
continuous_number_col  Bucket
0.001                  2
0.95                   9
1                      10
0                      1
This should be how it looks when I groupBy("Bucket"). The counts of buckets 1 and 10 aren't significant here; they will just be in their own buckets. And the count of 75 will differ in practice; it's just used as an example.
Bucket  Count  Values
1       1000   0
2       75     0.01 - 0.50
3       75     0.51 - 0.63
4       75     0.64 - 0.71
5       75     0.72 - 0.83
6       75     0.84 - 0.89
7       75     0.90 - 0.92
8       75     0.93 - 0.95
9       75     0.95 - 0.99
10      2000   1
I've tried using the QuantileDiscretizer() function like this:
val df = {
  rawDf
    // Taking 1's and 0's out for the moment
    .filter(col("continuous_number_col") =!= 1 && col("continuous_number_col") =!= 0)
}

val discretizer = new QuantileDiscretizer()
  .setInputCol("continuous_number_col")
  .setOutputCol("bucket_result")
  .setNumBuckets(8)

val result = discretizer.fit(df).transform(df)
However, this gives me the following bucket counts, which are not equal:
bucket_result     count
7.0            20845806
6.0            21096698
5.0            21538813
4.0            21222511
3.0            21193393
2.0            21413413
1.0            21032666
0.0            21681424
Hopefully this gives enough context to what I'm trying to do. Thanks in advance.
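As a hedged sketch of the bucketing logic itself (written in pandas for brevity; pd.qcut plays the role QuantileDiscretizer plays in Spark, and the filter-then-bucket idea should carry over), with the beta-distributed toy data made up here to mimic the skew described:

import numpy as np
import pandas as pd

# Toy data skewed toward 1, plus exact 0's and 1's, mimicking the question
rng = np.random.default_rng(0)
s = pd.Series(np.concatenate([np.zeros(100), rng.beta(8, 2, size=800), np.ones(200)]))

bucket = pd.Series(index=s.index, dtype=float)
bucket[s == 0] = 1    # bucket 1: all 0's
bucket[s == 1] = 10   # bucket 10: all 1's

# Buckets 2-9: equal-count quantile bins over the strictly interior values only
interior = s[(s > 0) & (s < 1)]
bucket[interior.index] = pd.qcut(interior, q=8, labels=list(range(2, 10))).astype(int)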

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates all values within the group based on that quantile percentage. So if one group has q=0.8, I want the lowest 80% of values given 1, and the upper 20% of values given 0.
So, given data like this:
I want objects 1, 2 and 5 to get result 1 and the other three result 0. In total my data consists of 7,000,000 rows with 14,000 groups. I tried doing this with groupby.quantile, but that requires a constant quantile, whereas my data has a different one for each group.
Setup:
import numpy as np
import pandas as pd

num = 7_000_000
grp_num = 14_000

qua = np.around(np.random.uniform(size=grp_num), 2)

df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)

def func(grp):
    grp["Quantile"] = qua[grp.Group]
    return grp

df = df.groupby("Group").apply(func)
Answer (this is basically a for loop, so for performance you can try applying numba to it):
def func2(grp):
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])

df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True).astype(int)
print(df)
Outputs:
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s
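If that is still too slow, a hedged alternative is to reduce the Python-level work to one scalar per group: compute a single threshold per group, map it back onto the rows, and leave the 7,000,000-row comparison fully vectorized:

# One quantile threshold per group (cheap: one scalar per group)
thresholds = df.groupby("Group").apply(lambda g: g["Value"].quantile(g["Quantile"].iloc[0]))

# Map the thresholds back to the rows and compare in one vectorized pass
df["result"] = (df["Value"] < df["Group"].map(thresholds)).astype(int)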

Sorting Pandas data frame with groupby and conditions

I'm trying to sort a data frame based on groups meeting conditions.
I'm getting a syntax error from the way I'm sorting the groups, and I'm losing the initial order of the data frame before attempting the above.
This is the order of sorting that I'm trying to achieve:
1) Sort on the First and Test columns.
2) For Test==1 groups, sort on Secondary, then on the Final column.
3) For Test==0 groups, sort on the Final column only.
import pandas as pd

df = pd.DataFrame({"First": [100,100,100,100,100,100,200,200,200,200,200],
                   "Test": [1,1,1,0,0,0,0,1,1,1,0],
                   "Secondary": [.1,.1,.1,.2,.2,.3,.3,.3,.3,.4,.4],
                   "Final": [1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})

def sorter(x):
    if x["Test"] == 1:
        x.sort_values(['Secondary', 'Final'], inplace=True)
    else:
        x = x.sort_values('Final', inplace=True)

df = df.sort_values(["First", "Test"], ascending=[False, False]).reset_index(drop=True)
df.groupby(['First', 'Test']).apply(lambda x: sorter(x))
df
Expected result:
First Test Secondary Final
200 1 0.4 10.1
200 1 0.3* 9.9*
200 1 0.3* 8.8*
200 0 0.4 11.11*
200 0 0.3 7.7*
100 1 0.5 2.2
100 1 0.1* 3.3*
100 1 0.1* 1.1*
100 0 0.3 6.6*
100 0 0.2 5.5*
100 0 0.2 4.4*
You can try sorting each group in descending order; with respect to the sequence you gave, the sort keys within a group change depending on Test. Will this work for you?
df = pd.DataFrame({"First": [100,100,100,100,100,100,200,200,200,200,200],
                   "Test": [1,1,1,0,0,0,0,1,1,1,0],
                   "Secondary": [.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],
                   "Final": [1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})

df = df.groupby(['First', 'Test']).apply(
    lambda x: x.sort_values(['First', 'Test', 'Secondary', 'Final'], ascending=False)
    if x.iloc[0]['Test'] == 1
    else x.sort_values(['First', 'Test', 'Final'], ascending=False)
).reset_index(drop=True)

df.sort_values(['First', 'Test'], ascending=[True, False])
Out:
Final First Secondary Test
3 2.20 100 0.5 1
4 3.30 100 0.1 1
5 1.10 100 0.1 1
0 6.60 100 0.1 0
1 5.50 100 0.4 0
2 4.40 100 0.9 0
8 10.10 200 0.4 1
9 9.90 200 0.3 1
10 8.80 200 0.3 1
6 11.11 200 0.4 0
7 7.70 200 0.3 0
The trick was to sort subsets separately and replace the values in the original df.
This came up in other solutions to pandas sorting problems.
import pandas as pd

df = pd.DataFrame({"First": [100,100,100,100,100,100,200,200,200,200,200],
                   "Test": [1,1,1,0,0,0,0,1,1,1,0],
                   "Secondary": [.1,.5,.1,.9,.4,.1,.3,.3,.3,.4,.4],
                   "Final": [1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9,10.10,11.11]})

# First sort everything on all four keys, descending
df.sort_values(['First', 'Test', 'Secondary', 'Final'], ascending=False, inplace=True)

# Then re-sort only the Test==0 rows on Final and write them back in place
index_subset = df[df["Test"] == 0].index
sorted_subset = df[df["Test"] == 0].sort_values(['First', 'Final'], ascending=False)
df.loc[index_subset, :] = sorted_subset.values
print(df)
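A hedged variant that avoids the write-back entirely: neutralize Secondary for the Test==0 rows with a temporary key column (SecKey is a made-up helper name), so a single sort_values call yields the same order:

# SecKey equals Secondary where Test==1 and a constant 0 where Test==0,
# so within Test==0 groups the sort falls through to Final alone
tmp = df.assign(SecKey=df["Secondary"].where(df["Test"] == 1, 0))
result = (tmp.sort_values(['First', 'Test', 'SecKey', 'Final'], ascending=False)
             .drop(columns='SecKey'))
print(result)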

Forcing DataFrame recalculation after a change to a specific cell

I start with a simple
df = pd.DataFrame({'units':[30,20]})
And I get
units
0 30
1 20
I then add a row to total the column:
my_sum = df.sum()
df = df.append(my_sum, ignore_index=True)
Finally, I add a column to calculate percentages off of the 'units' column:
df['pct'] = df.units / df.units[:-1].sum()
ending with this:
units pct
0 30 0.6
1 20 0.4
2 50 1.0
So far so good. But now the question: I want to change the middle number of units from 20 to, for example, 40. I can use this:
df.iloc[1, 0] = 40
or
df.iat[1, 0] = 40
which changes the cell, but the calculated values in both the last row and the pct column don't change to reflect it:
units pct
0 30 0.6
1 40 0.4
2 50 1.0
How do I force these calculated values to adjust following the change in that particular cell?
Make a function that calculates it:
def f(df):
    # note: DataFrame.append was removed in pandas 2.0; see the pd.concat sketch below
    return df.append(df.sum(), ignore_index=True).assign(
        pct=lambda d: d.units / d.units.iat[-1])

df.iat[1, 0] = 40
f(df)
units pct
0 30 0.428571
1 40 0.571429
2 70 1.000000
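One caveat, as noted in the comment above: DataFrame.append was removed in pandas 2.0. A hedged equivalent of the same function using pd.concat:

import pandas as pd

def f(df):
    # Build the total row explicitly and concatenate it, then recompute pct
    total = df.sum().to_frame().T
    return pd.concat([df, total], ignore_index=True).assign(
        pct=lambda d: d.units / d.units.iat[-1])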

"Indexing" a price series to a starting time point (index level = 100) with pandas data frame : P(i,t) / P(i)

I have a pandas data frame where datetime is the index of the data frame (I use t=0 for simplification; in fact it is something like 20170101 09:30:00):
datetime Stock A Stock B
t=0 5 20
t=1 6 30
t=2 8 25
t=3 4 20
and I would like to return:
datetime Stock A Stock B
t=0 100 100
t=1 120 150
t=2 160 125
t=3 80 100
in mathematical terms: Index(i, t) = 100 * P(i, t) / P(i, 0).
I tried
df_norm = df[0:] / df[0:1]
print(df_norm)
which gives me an error.
edit1: I tried option 3, which works fine (I couldn't try it on NaNs yet, but at least it does not create a NaN for the first observation, which pct_change does). I also noticed that after performing it, datetime is no longer the set index, which is easy to fix by just re-assigning it.
Now I am trying to wrap it in a function, but I think the index is causing a problem (actually the same error as with my first attempt):
def norming(x):
    return x.assign(**x.drop('datetime', 1).pipe(
        lambda d: d.div(d.shift().bfill()).cumprod()))
edit2: if my datetime column is the index, i.e.
df_norm.set_index(['datetime'], inplace=True)
I get an error; what would I need to change?
Option 1
df.set_index('datetime').pct_change().fillna(0) \
  .add(1).cumprod().mul(100).reset_index()
datetime Stock A Stock B
0 t=0 100.0 100.0
1 t=1 120.0 150.0
2 t=2 160.0 125.0
3 t=3 80.0 100.0
Option 2
import numpy as np

def idx_me(a):
    a = np.asarray(a)
    r = np.append(1, a[1:] / a[:-1])
    return r.cumprod() * 100

df.assign(**df.drop('datetime', 1).apply(idx_me))
datetime Stock A Stock B
0 t=0 100.0 100.0
1 t=1 120.0 150.0
2 t=2 160.0 125.0
3 t=3 80.0 100.0
Option 3
df.assign(**df.drop('datetime', 1).pipe(
    lambda d: d.div(d.shift().bfill()).cumprod().mul(100)))
datetime Stock A Stock B
0 t=0 100.0 100.0
1 t=1 120.0 150.0
2 t=2 160.0 125.0
3 t=3 80.0 100.0
Seems like:
p = 100 / df.iloc[0, 1:]
df.iloc[:, 1:] *= p
df
Out[1413]:
datetime StockA StockB
0 t=0 100 100
1 t=1 120 150
2 t=2 160 125
3 t=3 80 100
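And for the edit2 case, where datetime is already the index, a hedged one-liner should suffice, since dividing a frame by its first row aligns on the columns:

# Divide every row by the t=0 row and rebase to 100
df_norm = df.div(df.iloc[0]).mul(100)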