Delete All Rows with Year != 2021 in Pandas

I have a huge pandas df with hourly data from the years 1991-2021, and I need to drop all rows where the year is not 2021 (the current year). My dataframe has a column "year" with values ranging from 1991 to 2021. I am using the line of code below, but it does not seem to do anything to dataframe df1. Is there a better way to delete all rows where year != 2021?
trimmed_df1 = df1.drop(df1[df1.year != '2021'].index)
My data is a 4532472 x 10 dataframe in this format:
df1.columns.values
Out[20]:
array(['plant_name', 'business_name', 'business_code',
       'maint_region_name', 'power_kwh', 'wind_speed_ms', 'mos_time',
       'dataset', 'month', 'year'], dtype=object)

This should do the job:
>>> trimmed_df1 = df1.query('year == 2021').reset_index()
Note the condition is year == 2021 (the rows you want to keep), not !=. If the "year" column holds strings, compare against the string instead: df1.query('year == "2021"'). Maybe you don't even need to reset the index - it's up to you.

Instead of deleting rows, why not use a .loc[] call to select the rows you do want?
trimmed_df1 = df1.loc[df1.year == 2021]
(Compare against '2021' instead if the column is stored as strings.)
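A likely reason the original drop appeared to do nothing is a dtype mismatch: comparing an integer "year" column against the string '2021'. A minimal sketch of checking the dtype and normalizing it before filtering (df1 as in the question):
# See how the year column is actually stored (int64 vs. object)
print(df1['year'].dtype)
# Normalize to int so the comparison matches either way
trimmed_df1 = df1[df1['year'].astype(int) == 2021]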

Related

How can I delete a group of rows if they don't satisfy a condition?

I have a dataframe with stock option information. I want to filter this dataframe so that I have exactly 8 options per date. The problem is that some dates have only 6 or 7 options, and I want to delete those groups of options entirely.
Take this small dataframe as an example:
import numpy as np
import pandas as pd

dates = ['2013-01-01','2013-01-01','2013-01-01','2013-01-02','2013-01-02','2013-01-03','2013-01-03','2013-01-03']
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=list('ABCD'))
In this particular case I want to drop the rows indexed at '2013-01-02', since I only want dates that have 3 consecutive rows.
First, group by the index and count:
odf = df.groupby(df.index).count()
Then filter on that count and grab the resulting index:
idx = odf[odf['A'] == 3].index
Finally, select by index:
df.loc[idx]
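If you prefer a one-step version, GroupBy.filter keeps only the groups that satisfy a condition; a minimal sketch on the same example frame:
# Keep only the dates that appear exactly 3 times
filtered = df.groupby(df.index).filter(lambda g: len(g) == 3)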

How to count specific value occurrences for x rows in a Pandas DataFrame?

I have a Pandas DataFrame:
df = pd.read_csv("file.csv")
I need to count occurrences in column 'gender', but not across the whole frame - only for the first 5 entries (I would love to see code for any interval of rows, say from the 10th to the 20th, etc.).
Currently I am familiar only with this:
df[df['gender'] == 1].shape[0]
and something more complicated with a lambda:
A2_object = df.apply(lambda x: True if x['gender'] == 1 else False, axis=1)
A2 = len(A2_object[A2_object == True].index)
I am learning, and I see that loops don't work on a dataframe the same way they do on lists or dictionaries.
I am trying something like this:
df[df['gender'] == 1 and df.index < 5].shape[0]
I love this post, but can't get my head around the examples.
As @Quang Hoang posted, I needed to use slicing for indices, and in dataframe terms that means .iloc. Thank you, Sir.
Answer: df.iloc[start:end]['gender'].eq(1).sum()
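A couple of concrete slices with that pattern (note that the earlier attempt fails because Python's and does not work on boolean Series; you would need (df['gender'] == 1) & (df.index < 5), but the .iloc slice sidesteps this entirely):
# Count gender == 1 in the first 5 rows
df.iloc[:5]['gender'].eq(1).sum()
# Count gender == 1 in rows 10 through 19 (the end bound is exclusive)
df.iloc[10:20]['gender'].eq(1).sum()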

How to resample a dataframe with different functions applied to each column if we have more than 20 columns?

I know this question has been asked before. The answer is as follows:
df.resample('M').agg({'col1': np.sum, 'col2': np.mean})
But I have 27 columns, and I want to sum the first 25 and average the remaining two. Do I really have to write out 'col1': np.sum, ..., 'col25': np.sum for all 25 columns and then 'col26': np.mean, 'col27': np.mean for the last two?
My dataframe contains hourly data and I want to convert it to monthly data. I tried something like this, but it is nonsense:
for i in col_list:
    df = df.resample('M').agg({i-2: np.sum, 'col26': np.mean, 'col27': np.mean})
Is there any shortcut for this situation?
You can try this instead of a for loop:
sum_col = ['col1','col2','col3','col4', ...]
sum_df = df.resample('M')[sum_col].sum()
mean_col = ['col26','col27']
mean_df = df.resample('M')[mean_col].mean()
df = sum_df.join(mean_df)
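The real shortcut, though, is to build the agg mapping programmatically instead of typing 27 entries; a minimal sketch assuming the 25 sum columns come first in column order:
# Sum the first 25 columns, average the last two
agg_map = {c: 'sum' for c in df.columns[:25]}
agg_map.update({c: 'mean' for c in df.columns[-2:]})
df = df.resample('M').agg(agg_map)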

How to populate a column in a dataframe with a function

I'm very, very new to Python and pandas, so my question is very basic.
I have a simple dataframe with an index and a column 'Years':
Years = pd.DataFrame({'Years': range(1900,2000,1)})
To this dataframe I need to add a column that performs a specific calculation for each year, say: Year i = X*i.
The year itself (i.e. 1900, 1991, etc.) doesn't matter as such; the point is that each "i" belongs to a specific year.
I hope you can help me resolve this. Thanks very much.
Does this solve your problem?
Years = pd.DataFrame({'Years': range(1900,2000,1)})
Years['calculation'] = 0
for row in Years.index:
    Years.loc[row, 'calculation'] = 10**row
(Using .loc avoids the chained-assignment warning you would get from Years['calculation'][row] = ....) You can also refer to the previous row's index in the formula, e.g.:
for row in Years.index:
    Years.loc[row, 'calculation'] = 10 * (row - 1)
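If you actually need the previous row's value (not just the previous index), Series.shift() is the idiomatic, loop-free route; a minimal sketch on the same frame (the prev_value column name is just for illustration):
# shift(1) aligns each row with the value from the row above (NaN for the first row)
Years['prev_value'] = Years['calculation'].shift(1)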
df = pd.DataFrame({'Years': range(1900,2000,1)})
df.head()
To do calculations, all you have to do is this:
df['new_column'] = df['Years'] * 3
df.head()
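If the per-year calculation is a more general function rather than simple arithmetic, the same column can be filled with .apply; a sketch where X is an assumed constant standing in for the question's "Year i = X*i":
X = 10  # hypothetical constant from the question
df['new_column'] = df['Years'].apply(lambda i: X * i)
For plain arithmetic like this, though, the vectorized form above (df['Years'] * 3) is simpler and faster.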

pandas / numpy arithmetic mean in csv file

I have a csv file which contains 3000 rows and 5 columns, and it constantly has more rows appended to it on a weekly basis.
What I'm trying to do is find the arithmetic mean of the last column over the last 1000 rows, every week. (So when new rows are added weekly, it will just take the average of the most recent 1000 rows.)
How should I construct the pandas or numpy array to achieve this?
df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
#How should I write the next line of code to get the average of the most recent 1000 rows?
I'm on a different machine than the one my pandas is installed on, so I'm going from memory, but I think what you'll want to do is...
df = pd.read_csv("fds.csv", index_col=False, header=0)
#Let's pretend your 5th column has a name (header) of `Results`
df_1 = df['Results']
last_thousand = df_1.tail(1000)
np.mean(last_thousand)
A little bit quicker using mean():
df = pd.read_csv("fds.csv", header = 0)
results = df.tail(1000).mean()
results will contain the mean of each column within the last 1000 rows. If you want more statistics, you can also use describe():
results = df.tail(1000).describe().unstack()
So basically I needed to use the pandas tail function. My code below works.
import numpy
import pandas as pd

df = pd.read_csv("fds.csv", index_col=False, header=0)
df_1 = df['Results']
numpy.average(df_1.tail(1000))
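Since rows keep arriving weekly, it may help to wrap the recipe in a small reusable function; a sketch assuming the same fds.csv layout and 'Results' column (the function name is just for illustration):
def mean_of_last_rows(path, column='Results', n=1000):
    # Re-read the growing file and average the most recent n rows
    df = pd.read_csv(path, index_col=False, header=0)
    return df[column].tail(n).mean()

weekly_avg = mean_of_last_rows("fds.csv")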