Pandas aggregation on a group, row-wise

I have a table like this
score
3
4
5
6
1
I want an HHI value corresponding to every row, computed over the remaining rows, e.g.
HHI(i) = sum of (x / sum(x))^2 over all scores x excluding score(i)
Can I do row-wise aggregation on remaining rows in pandas?
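One way to get this leave-one-out HHI is to drop each row in turn, normalise the remaining scores, and sum the squared shares. A minimal sketch (the helper name hhi_excluding is my own, not from the question):

```python
import pandas as pd

df = pd.DataFrame({"score": [3, 4, 5, 6, 1]})

def hhi_excluding(s):
    # For each index i, drop row i, turn the remaining scores into
    # shares of their own total, and sum the squared shares.
    out = []
    for i in s.index:
        rest = s.drop(i)
        shares = rest / rest.sum()
        out.append((shares ** 2).sum())
    return pd.Series(out, index=s.index)

df["HHI"] = hhi_excluding(df["score"])
```

For the first row this leaves scores 4, 5, 6, 1 (total 16), giving (4/16)^2 + (5/16)^2 + (6/16)^2 + (1/16)^2 = 78/256.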

Related

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but does not return multiple values of column names in case there is more than one instance of minimum values. How can I change this to return multiple column names in case there is more than one instance of the minimum value?
Please note, I want row-wise results, i.e. the minimum column names for each row.
Thanks!
Try the code below and see if the output format is what you anticipated; it produces the intended result at least.
The result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
Get column name where value is something in pandas dataframe might be helpful also.
EDIT: adding the full code context and the resulting output.
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the overall minimum value, then compare the frame's values to that minimum and keep the columns that contain a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
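If you prefer to avoid the explicit loop, a vectorised variant of the row-wise version compares each row against its own minimum. A sketch using the same sample frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 0, 6, 5, 4],
                   "B": [8, 0, 9, 2, 7],
                   "C": [9, 1, 2, 4, 7],
                   "D": [5, 7, 4, 2, 9]})

# Boolean mask marking, per row, every cell equal to that row's minimum,
# then collect the matching column names for each row.
mask = df.eq(df.min(axis=1), axis=0)
mins = mask.apply(lambda r: list(df.columns[r]), axis=1)
```

For row 0 (values 5 8 9 5) this yields ['A', 'D'], and for row 1 (values 0 0 1 7) it yields ['A', 'B'].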

Select only a number of rows from a pandas Dataframe based on a condition

I want to sample n rows for each distinct value in the column named club
columns = ['long_name','age','dob','height_cm','weight_kg','club']
teams = ['Real Madrid','FC Barcelona','Chelsea','CA Osasuna','Paris Saint-Germain','FC Bayern München','Atlético Madrid','Manchester City','Liverpool','Hull City']
playersDataDB = playersData.loc[playersData['club'].isin(teams)][columns]
playersDataDB.head()
In the code above I have selected my desired columns for the players belonging to the selected teams.
The output of this code is a 299 rows × 6 columns DataFrame, meaning that I am getting every player from those teams, but I want just 16 of them from each club.
Not sure what your dataframe looks like, but you could group by club and then use head(16) to keep only the first 16 rows of each group.
df.groupby('club').head(16)
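Note that head(16) keeps the first 16 rows of each group rather than a random draw. If a random sample per club is wanted, GroupBy.sample (available since pandas 1.1) does it directly. A sketch on toy data (the column names mirror the question; the values are made up):

```python
import pandas as pd
import numpy as np

# Toy stand-in for playersDataDB: 20 players for each of two clubs.
df = pd.DataFrame({
    "club": np.repeat(["Chelsea", "Liverpool"], 20),
    "long_name": [f"player_{i}" for i in range(40)],
})

# Draw 16 random rows per club; random_state makes the draw repeatable.
sampled = df.groupby("club").sample(n=16, random_state=0)
```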
You can use isin like this:
playersDataDB = playersData[playersData['club'].isin(teams)]
playersDataDB.head()

Pandas Dataframe: Groupby on first two columns and count the occurence for first column [duplicate]

This question already has answers here:
Pandas, groupby and count
(3 answers)
Closed 2 years ago.
I have a dataset resulting from a groupby:
CUSTID TRANSACTION_ID COUNT
CU_1 TR_1 1
CU_1 TR_2 1
CU_1 TR_3 1
CU_2 TR_4 1
CU_2 TR_5 1
I needed to have result as
CUSTID TOTAL_COUNT
CU_1 3
CU_2 2
Run just:
df.groupby('CUSTID').COUNT.sum()
You need to group by a single column (CUSTID) only, then,
from each group, take COUNT column and compute its sum().
An additional step may be to give the resulting Series the
required name. If that matters, append .rename('TOTAL_COUNT')
to the above code.
Yet another step may be to convert this Series into a DataFrame.
To do it, append .to_frame() to the above code.
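Putting the three steps above together on the sample data, a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "CUSTID": ["CU_1", "CU_1", "CU_1", "CU_2", "CU_2"],
    "TRANSACTION_ID": ["TR_1", "TR_2", "TR_3", "TR_4", "TR_5"],
    "COUNT": [1, 1, 1, 1, 1],
})

# Group by CUSTID, sum COUNT, rename the result, and convert to a DataFrame.
result = (df.groupby("CUSTID")["COUNT"]
            .sum()
            .rename("TOTAL_COUNT")
            .to_frame())
```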

Deleting/Selecting rows from pandas based on conditions on multiple columns

From a pandas dataframe, I need to delete specific rows based on a condition applied on two columns of the dataframe.
The dataframe is
0 1 2 3
0 -0.225730 -1.376075 0.187749 0.763307
1 0.031392 0.752496 -1.504769 -1.247581
2 -0.442992 -0.323782 -0.710859 -0.502574
3 -0.948055 -0.224910 -1.337001 3.328741
4 1.879985 -0.968238 1.229118 -1.044477
5 0.440025 -0.809856 -0.336522 0.787792
6 1.499040 0.195022 0.387194 0.952725
7 -0.923592 -1.394025 -0.623201 -0.738013
I need to delete some rows where the absolute difference between column 1 and column 2 is less than a threshold t.
abs(column1.iloc[index]-column2.iloc[index]) < t
I have seen examples where conditions are applied individually on column values but did not find anything where a row is deleted based on a condition applied on multiple columns.
First select the columns by position with DataFrame.iloc, subtract, take Series.abs, compare against the threshold with the inverted operator (< becomes >= or >), and filter by boolean indexing:
df = df[(df.iloc[:, 0]-df.iloc[:, 1]).abs() >= t]
If need select columns by names, here 0 and 1:
df = df[(df[0]-df[1]).abs() >= t]
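A runnable sketch with random data and an assumed threshold of t = 0.5:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.randn(8, 4))  # integer column labels 0..3
t = 0.5  # threshold; this value is an assumption

# Keep only rows where |col 0 - col 1| is at least t.
kept = df[(df[0] - df[1]).abs() >= t]
```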

Grouping by column and then doing a boxplot by the index in pandas

I have a large dataframe which I would like to group by some column and examine graphically the distribution per group using a boxplot. I found that df.boxplot() will do it for each column of the dataframe and put it in one plot, just as I need.
The problem is that after a groupby operation my data is all in one column, with the group labels in the index, so I can't call boxplot on the result.
here is an example:
df = pd.DataFrame({'a': np.random.rand(10), 'b': [x % 2 for x in range(10)]})
df
a b
0 0.273548 0
1 0.378765 1
2 0.190848 0
3 0.646606 1
4 0.562591 0
5 0.409250 1
6 0.637074 0
7 0.946864 1
8 0.203656 0
9 0.276929 1
Now I want to group by column b and boxplot the distribution of both groups in one boxplot. How can I do that?
You can use the by argument of boxplot. Is that what you are looking for?
df.boxplot(column='a', by='b')
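A self-contained sketch of that call (using the non-interactive Agg backend so it runs headless; matplotlib must be installed for pandas plotting):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render without opening a window
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.rand(10), "b": [x % 2 for x in range(10)]})

# One box per distinct value of 'b', all in a single figure.
ax = df.boxplot(column="a", by="b")
```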