Excluding specific columns in Pandas for column-based computations - pandas

Year A B C D
1900 1 2 3 4
1901 2 3 4 5
I have a dataset in the above format.
When I perform calculations across column values, the Year column is included in the computation and distorts the result. For example:
df['mean'] = df.mean(axis='columns')
In the example above I just want to exclude Year from the calculation. I have 100-plus columns in my dataframe, so I cannot list each column manually. 'Year' is also the index for my dataframe.

I realized the problem and the solution.
df.set_index(['Year'])
df['mean'] = df.mean(axis='columns')
This did not work, because set_index returns a new DataFrame by default and leaves the original unchanged.
But when I added inplace=True, it worked:
df.set_index(['Year'], inplace=True)
df['mean'] = df.mean(axis='columns')
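If you prefer to avoid inplace=True, the same fix works by reassigning the result, since set_index returns a new frame (a minimal sketch of the equivalent pattern):
df = df.set_index('Year')  # keep the returned frame
df['mean'] = df.mean(axis='columns')  # Year is now the index, so it is excluded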

You can also drop the Year column into a new dataframe, compute the mean there, and then add the Year column back. Note that drop works on rows by default, so the column axis must be specified, and concat is a top-level pandas function, not a DataFrame method:
df2 = df.drop(columns='Year')
df2['Mean'] = df2.mean(axis='columns')
df2 = pd.concat([df['Year'], df2], axis=1)
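If you only need the mean column, the same idea fits in one line: drop Year on a temporary copy and compute on that, leaving df itself untouched:
df['mean'] = df.drop(columns='Year').mean(axis='columns')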

Related

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but it returns only one column name even when several columns share the minimum value. How can I change it to return all column names that hold the minimum?
Please note, I want row-wise results, i.e. the minimum column names for each row.
Thanks!
Try the code below and see if the output format is what you anticipated; it produces the intended result at least.
The result will be stored in mins.
mins = df.idxmin(axis="columns")  # one column name per row
for i, r in df.iterrows():
    # replace it with the list of every column tied for that row's minimum
    mins[i] = list(r[r == r[mins[i]]].index)
The question Get column name where value is something in pandas dataframe might also be helpful.
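A vectorized alternative (a sketch, not from the original answer) compares each row against its own minimum and collects the matching column labels, avoiding the Python-level loop:
m = df.eq(df.min(axis='columns'), axis='index')  # True where a cell equals its row minimum
mins = m.apply(lambda r: list(df.columns[r]), axis='columns')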
EDIT: adding the output and the full code context.
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying numpy array to get the min value, then compare the values to the min and get the columns that have a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']

How can I delete a group of rows if they don't satisfy a condition?

I have a dataframe with stock option information. I want to filter this dataframe so that there are exactly 8 options per date. The problem is that some dates have only 6 or 7 options, and I want to delete such groups of options entirely.
Take this small dataframe as an example:
import numpy as np
import pandas as pd

dates = ['2013-01-01','2013-01-01','2013-01-01','2013-01-02','2013-01-02','2013-01-03','2013-01-03','2013-01-03']
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=list('ABCD'))
In this particular case I want to drop the rows indexed with date '2013-01-02', since I only want to keep dates that have 3 rows.
First group by the index and count:
odf = df.groupby(df.index).count()
Then filter the counts and get the resulting index:
idx = odf[odf['A'] == 3].index
Finally select by index:
df.loc[idx]
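The same result in a single step, as a sketch, using GroupBy.filter to keep only groups of the required size:
df = df.groupby(level=0).filter(lambda g: len(g) == 3)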

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
So far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
We usually do head after sort_values. Done in two steps here so that all columns are kept and the grouping key is built from the sorted frame (an array key is matched to rows by position, so it must come from the frame being grouped):
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
nlargest can be applied to each group, passing the column to look for the largest values in.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
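Note that the result of the apply carries a MultiIndex with the year as its first level; if you want the original hourly index back, you can drop that level (a sketch):
top = df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
top = top.droplevel(0)  # remove the year level added by groupby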
Get the index of your query and use it as a mask on your original df. The grouped result has a MultiIndex (year, original index), so extract the second level before selecting:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(1)
df.loc[idx]
(or something to that extent, I can't test now without any test data)

Deleting/Selecting rows from pandas based on conditions on multiple columns

From a pandas dataframe, I need to delete specific rows based on a condition applied on two columns of the dataframe.
The dataframe is
0 1 2 3
0 -0.225730 -1.376075 0.187749 0.763307
1 0.031392 0.752496 -1.504769 -1.247581
2 -0.442992 -0.323782 -0.710859 -0.502574
3 -0.948055 -0.224910 -1.337001 3.328741
4 1.879985 -0.968238 1.229118 -1.044477
5 0.440025 -0.809856 -0.336522 0.787792
6 1.499040 0.195022 0.387194 0.952725
7 -0.923592 -1.394025 -0.623201 -0.738013
I need to delete the rows where the absolute difference between column 1 and column 2 is less than a threshold t:
abs(column1.iloc[index]-column2.iloc[index]) < t
I have seen examples where conditions are applied individually on column values but did not find anything where a row is deleted based on a condition applied on multiple columns.
First select the columns by position with DataFrame.iloc, subtract, take Series.abs, compare against the threshold with the inverted operator (< becomes >= or >), and filter by boolean indexing:
df = df[(df.iloc[:, 0]-df.iloc[:, 1]).abs() >= t]
If you need to select the columns by name, here 0 and 1:
df = df[(df[0]-df[1]).abs() >= t]
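A minimal, self-contained usage sketch (t = 1.0 is an arbitrary example value):
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(8, 4))
t = 1.0  # example threshold
df = df[(df[0] - df[1]).abs() >= t]  # keep rows where |col 0 - col 1| >= t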

How to select data from multiple dataframes

I'm a beginner with pandas. I have two dataframes. The first is called DATA_DF; it contains many fields, and I'm interested in DATA_DF['Date effet'], which is of datetime type.
The other dataframe is called TAUX_DF; it contains years, and every year has a value:
TAUX_DF =
Année <10 ans >10 ans
1987 2,8168% 3,4664%
1988 2,8168% 3,4664%
1989 2,8168% 3,4664%
1990 2,8168% 3,4664%
I want to create a new column in DATA_DF called DATA_DF['Taux technique'].
It should take the year from DATA_DF['Date effet'].dt.year, match it against the year in TAUX_DF['Année'], and pick the value the way this Excel formula does (SI, RECHERCHEV and ANNEE are the French IF, VLOOKUP and YEAR):
=SI(G5>120;RECHERCHEV(ANNEE(C5);Taux!$A$2:$C$29;3;FAUX);RECHERCHEV(ANNEE(C5);Taux!$A$2:$C$29;2;FAUX))
DATA_DF['Année'] = DATA_DF['Date effet'].dt.year  # add a year column to DATA_DF so it can be merged with TAUX_DF
DATA_DF = pd.merge(DATA_DF, TAUX_DF, on='Année', how='left')
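The merge brings in both rate columns; the Excel formula then chooses between them based on G5 > 120. A sketch of that last step, assuming a hypothetical duration column 'Durée' standing in for Excel's G5:
import numpy as np

# 'Durée' is an assumed column name playing the role of G5 (duration in months)
DATA_DF['Taux technique'] = np.where(DATA_DF['Durée'] > 120,
                                     DATA_DF['>10 ans'],
                                     DATA_DF['<10 ans'])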