How can I delete a group of rows if they don't satisfy a condition? - pandas

I have a dataframe with stock option information. I want to filter this dataframe so that there are exactly 8 options per date. The problem is that some dates have only 6 or 7 options; I want to delete such a group of options entirely.
Take this small dataframe as an example:
import numpy as np
import pandas as pd

dates = ['2013-01-01', '2013-01-01', '2013-01-01', '2013-01-02',
         '2013-01-02', '2013-01-03', '2013-01-03', '2013-01-03']
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=list('ABCD'))
In this particular case I want to drop the rows indexed by date '2013-01-02', since I only want dates that have 3 rows.

First, group by the index and count rows per date:
odf = df.groupby(df.index).count()
Then filter the counts and take the resulting index:
idx = odf[odf['A'] == 3].index
Finally, select those dates by index:
df.loc[idx]
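The same result can be had in one step with GroupBy.filter, which drops whole groups that fail a condition. A minimal sketch, assuming the example dataframe above and a required group size of 3:
# keep only dates whose group has exactly 3 rows
df.groupby(level=0).filter(lambda g: len(g) == 3)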

Related

Pandas: how to group by day and another column

I am getting the daily counts of rows from a dataframe using
df = df.groupby(by=df['startDate'].dt.date).count()
How can I modify this so I can also group by another column 'unitName'?
Thank you
Use a list with GroupBy.size:
df = df.groupby([df['startDate'].dt.date, 'unitName']).size()
If you need to count only non-missing values, e.g. in column col, use DataFrameGroupBy.count:
df = df.groupby([df['startDate'].dt.date, 'unitName'])['col'].count()
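A minimal sketch on made-up data (the one missing value in col is an assumption) showing how size and count differ:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'startDate': pd.to_datetime(['2023-01-01 08:00', '2023-01-01 09:00', '2023-01-02 10:00']),
    'unitName': ['A', 'A', 'B'],
    'col': [1.0, np.nan, 2.0],
})
# size counts every row per (day, unitName) group, NaNs included
print(df.groupby([df['startDate'].dt.date, 'unitName']).size())
# count counts only the non-missing values of 'col'
print(df.groupby([df['startDate'].dt.date, 'unitName'])['col'].count())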

Excluding specific columns in Pandas for column-based computations

Year A B C D
1900 1 2 3 4
1901 2 3 4 5
I have a dataset which aligns with the above format.
When I perform calculations on the column values, the year gets included and distorts the result. For example
df['mean'] = df.mean(axis='columns')
In the above example I just want to exclude Year from the calculation. I have 100-plus columns in my dataframe and cannot list each of them manually. 'Year' is also supposed to be the index of my dataframe.
I realized the problem and the solution. This did not work, because set_index returns a new dataframe by default:
df.set_index(['Year'])
df['mean'] = df.mean(axis='columns')
But when I added inplace=True, it worked:
df.set_index(['Year'], inplace=True)
df['mean'] = df.mean(axis='columns')
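A more idiomatic alternative to inplace=True is to reassign the result, since set_index returns a new dataframe:
df = df.set_index('Year')
df['mean'] = df.mean(axis='columns')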
You can also drop the Year column into a new dataframe, apply the mean to the remaining columns, and then add the Year column back:
df2 = df.drop(columns='Year')
df2['Mean'] = df2.mean(axis='columns')
df2 = pd.concat([df['Year'], df2], axis=1)
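A minimal sketch of that approach on the small table above:
import pandas as pd

df = pd.DataFrame({'Year': [1900, 1901], 'A': [1, 2], 'B': [2, 3], 'C': [3, 4], 'D': [4, 5]})
df2 = df.drop(columns='Year')            # exclude Year from the computation
df2['Mean'] = df2.mean(axis='columns')   # row-wise mean over A..D only
df2 = pd.concat([df['Year'], df2], axis=1)
print(df2)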

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year and lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 'totaldemand' rows per year?
We usually do head after sort_values. Note that the grouping key must be built from the sorted frame, and selecting only ['totaldemand'] would drop the other columns, so:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
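A minimal sketch on made-up hourly data (the index span and the totaldemand and other columns are assumptions):
import numpy as np
import pandas as pd

idx = pd.date_range('2019-01-01', periods=24 * 400, freq='h')  # ~400 days, spans two years
df = pd.DataFrame({'totaldemand': np.random.rand(len(idx)),
                   'other': np.random.rand(len(idx))}, index=idx)
df_sorted = df.sort_values('totaldemand', ascending=False)
print(df_sorted.groupby(df_sorted.index.year).head(10))  # full rows, top 10 per year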
nlargest can be applied to each group, passing the column in which to look for the largest values. So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
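With recent pandas, where groupby(...).apply defaults to group_keys=True, this prepends the year as an outer index level; drop it if you want the original timestamps back:
df.groupby(df.index.year).apply(lambda grp: grp.nlargest(3, 'totaldemand')).droplevel(0)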
Get the index of your query and use it as a mask on your original df:
idx = df.groupby(df.index.year)['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(1)
df.loc[idx]
(or something to that extent; I can't test right now without any test data)
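As a side note, SeriesGroupBy has nlargest built in, so a sketch of the same idea without apply would be:
idx = df.groupby(df.index.year)['totaldemand'].nlargest(10).index.get_level_values(1)
df.loc[idx]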

How to group row-wise in a dataframe in pandas

I have a dataframe and I want to group it into chunks of 1000 rows each, then find the max and min of each group, and finally save both the min and max rows of each group in a new dataframe.
I have tried various options but with no luck; I am fairly new to pandas.
df.groupby(df['col_name'] // 1000).apply(max)
and another one was,
k = 1000
res = df.col_name[::-1].rolling(k, 1).max().shift(1)[::-1]
pd.concat([df.col_name, res], axis=1)
Suppose I want 3 rows in each group; the new dataframe should then hold the min and max of each group.
Use groupby with agg and stack:
df.groupby(df.index // 3).agg(['min', 'max']).stack().reset_index(drop=True)
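A minimal sketch on made-up data (the value column is an assumption), grouping every 3 rows:
import pandas as pd

df = pd.DataFrame({'value': [5, 1, 3, 9, 7, 8, 2, 6, 4]})
# df.index // 3 maps rows 0-2 to group 0, 3-5 to group 1, 6-8 to group 2
out = df.groupby(df.index // 3).agg(['min', 'max']).stack().reset_index(drop=True)
print(out)  # two rows (min, max) per group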

Deleting/Selecting rows from pandas based on conditions on multiple columns

From a pandas dataframe, I need to delete specific rows based on a condition applied on two columns of the dataframe.
The dataframe is
0 1 2 3
0 -0.225730 -1.376075 0.187749 0.763307
1 0.031392 0.752496 -1.504769 -1.247581
2 -0.442992 -0.323782 -0.710859 -0.502574
3 -0.948055 -0.224910 -1.337001 3.328741
4 1.879985 -0.968238 1.229118 -1.044477
5 0.440025 -0.809856 -0.336522 0.787792
6 1.499040 0.195022 0.387194 0.952725
7 -0.923592 -1.394025 -0.623201 -0.738013
I need to delete the rows where the difference between column 1 and column 2 is less than a threshold t:
abs(column1.iloc[index]-column2.iloc[index]) < t
I have seen examples where conditions are applied to individual columns, but have not found anything where a row is deleted based on a condition involving multiple columns.
First select the columns by position with DataFrame.iloc, subtract them, take Series.abs, compare against the threshold with the inverted operator (< becomes >= or >), and filter with boolean indexing:
df = df[(df.iloc[:, 0]-df.iloc[:, 1]).abs() >= t]
If you need to select the columns by name, here 0 and 1:
df = df[(df[0]-df[1]).abs() >= t]
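A minimal sketch on a fresh random frame shaped like the one above (the seed and the threshold t = 0.5 are assumptions):
import numpy as np
import pandas as pd

np.random.seed(0)                        # assumed seed, for reproducibility
df = pd.DataFrame(np.random.randn(8, 4))
t = 0.5                                  # assumed threshold
print(df[(df.iloc[:, 0] - df.iloc[:, 1]).abs() >= t])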