Deleting/Selecting rows from pandas based on conditions on multiple columns - pandas

From a pandas dataframe, I need to delete specific rows based on a condition applied on two columns of the dataframe.
The dataframe is
0 1 2 3
0 -0.225730 -1.376075 0.187749 0.763307
1 0.031392 0.752496 -1.504769 -1.247581
2 -0.442992 -0.323782 -0.710859 -0.502574
3 -0.948055 -0.224910 -1.337001 3.328741
4 1.879985 -0.968238 1.229118 -1.044477
5 0.440025 -0.809856 -0.336522 0.787792
6 1.499040 0.195022 0.387194 0.952725
7 -0.923592 -1.394025 -0.623201 -0.738013
I need to delete some rows where the difference between column 1 and columns 2 is less than threshold t.
abs(column1.iloc[index]-column2.iloc[index]) < t
I have seen examples where conditions are applied individually on column values but did not find anything where a row is deleted based on a condition applied on multiple columns.

First select columns by DataFrame.iloc for positions, subtract, get Series.abs, compare by thresh with inverse opearator like < to >= or > and filter by boolean indexing:
df = df[(df.iloc[:, 0]-df.iloc[:, 1]).abs() >= t]
If need select columns by names, here 0 and 1:
df = df[(df[0]-df[1]).abs() >= t]

Related

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but does not return multiple values of column names in case there is more than one instance of minimum values. How can I change this to return multiple column names in case there is more than one instance of the minimum value?
Please note, I want row wise results, i.e. minimum column names for each row.
Thanks!
try the code below and see if it's in the output format you'd anticipated. it produces the intended result at least.
result will be stored in mins.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
mins[i] = list(r[r == r[mins[i]]].index)
Get column name where value is something in pandas dataframe might be helpful also.
EDIT: adding an image of the output and the full code context.
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the min value, then compare the values to the min and get the columns that have a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']

How can I delete a group of rows if they don't satisfy a condition?

I have a dataframe with stock option information. I want to filter this dataframe in order to have exactly 8 options per date. The problem is that some dates have only 6 or 7 options. I want to write a code where I delete entirely this group of options.The option dataframe that I want to filter
Take this small dataframe as an example:
dates = ['2013-01-01','2013-01-01','2013-01-01','2013-01-02','2013-01-02','2013-01-03','2013-01-03','2013-01-03']
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=list('ABCD'))
In this particular case I want to drop the rows indexed in date '2013-01-02' since I only want dates who have 3 consecutive rows.
First group by count on index
odf = df.groupby(df.index).count()
filter the dataframe and get the resulting index
idx = odf[odf['A'] == 3].index
select by index
df.loc[idx]

Excluding specfic columns in Pandas for column based computations

Year A B C D
1900 1 2 3 4
1901 2 3 4 5
I have a dataset which aligns with the above format.
When i want to perform calculations on column values the year is getting added to the column values and distorting the result. For example
df['mean'] = df.mean(axis='columns')
In the above example i just want to exclude year from calculations. I have 100 plus columns in my data frame and i cannot manually use each of the columns . 'year' is also the Index for my dataframe
I realized the problem and solution
df.set_index(['Year']
df['mean'] = df.mean(axis='columns')
This did not work
But when i added inplace = True , it worked.
'df.set_index(['Year'],inplace = True)'
df['mean'] = df.mean(axis='columns')
You can also drop the year column and create a new dataframe and after applying the mean to individual columns we can add the year column.
df2 = df.drop('Year')
df2['Mean']=df.mean(axis='columns')
df2.concat(df.Year,df2)

Pandas aggregation on a group per row wise

I have a table like this
score
3
4
5
6
1
I want an HHI value corresponding to every row. e.g.
HHI(i) = sum of squares(x/sum(x)) where x!=score(i)
Can I do row-wise aggregation on remaining rows in pandas?

grouping by column and then doing a boxplot by the index in pandas

I have a large dataframe which I would like to group by some column and examine graphically the distribution per group using a boxplot. I found that df.boxplot() will do it for each column of the dataframe and put it in one plot, just as I need.
The problem is that after a groupby operation, my data is all in one column with the group labels in the index , so i can't call boxplot on the result.
here is an example:
df = DataFrame({'a':rand(10),'b':[x%2 for x in range(10)]})
df
a b
0 0.273548 0
1 0.378765 1
2 0.190848 0
3 0.646606 1
4 0.562591 0
5 0.409250 1
6 0.637074 0
7 0.946864 1
8 0.203656 0
9 0.276929 1
Now I want to group by column b and boxplot the distribution of both groups in one boxplot. How can I do that?
You can use the by argument of boxplot. Is that what you are looking for?
df.boxplot(column='a', by='b')