Delete rows in a dataframe by a range of dates - pandas

I have a dataframe with a column 'date' (YYYY-MM-DD HH:MM:SS) and datetime64 type.
I want to drop/eliminate rows by selecting ranges of dates. How can I do this on python/pandas?
Thank you so much in advance

(I cannot post comments, thus I dare to put an answer) The following questions also refer to deleting or filtering a data frame based on the value of a given column:
Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError
Deleting DataFrame row in Pandas based on column value
Basically, you can pass a boolean array to the index operator [ ] of the data frame, this returns the filtered data frame. Here the pandas v1.0.1 (!) documentation of how to index data frames. Also this question is helpful.

Related

pandas set_index() is creating duplicate columns

I have to format a dataframe so some code can receive it. It has to be in a specific format. The raw dataframe produced gives me a multi-index format. When I pass to the code it gives an IndexError because it was expecting 1 index.
Raw dataframe
I make a copy of the dataframe and remove the index.
ticker_data_2 = ticker_data.copy().reset_index()
Removed index
I need the timestamp column to be the index, so I reset the index to be timestamp. But now I have 2 columns named timestamp. Set index is supposed to remove the timestamp column and place it as the index, not make a copy.
ticker_data_2.set_index(ticker_data_2['timestamp'], inplace=True)
Duplicate timestamp columns
How do I fix to make it so only timestamp shows as the index, and not have a second timestamp column.

Find the average of a column based on another Pandas?

I'm working on a jupyter notebook, and I would like to get the average 'pcnt_change' based on 'day_of_week'. How do I do this?
A simple groupby call would do the trick here.
If df is the pandas dataframe:
df.groupby('day_of_week').mean()
would return a dataframe with average of all numeric columns in the dataframe with day_of_week as index. If you want only certain column(s) to be returned, select only the needed columns on the groupby call (for e.g.,
df[['open_price', 'high_price', 'day_of_week']].groupby('day_of_week').mean()

How to convert pandas dataframe to single index after aggregation?

I have been playing with aggregation in pandas dataframe. Considering the following dataframe:
df=pd.DataFrame({'a':[1,2,3,4,5,6,7,8],
'batch':['q','q','q','w','w','w','w','e'],
'c':[4,1,3,4,5,1,3,2]})
I have to do aggregation on the batch column with mean for column a and min for column c.
I used the following method to do the aggregation:
agg_dict = {'a':{'a':'mean'},'c':{'c':'min'}}
aggregated_df = df.groupby("batch").agg(agg_dict)
The problem is that I want the final data frame to have the same columns as the original data frame with the slight difference of having the aggregated values present in each of the columns.
The result of the above aggregation is a multi-index data frame, and am not sure how to convert it to an individual data frame?
I followed the link: Reverting from multiindex to single index dataframe in pandas . But, this didn't work, and the final output was still a multi-index data frame.
Great, if someone could help
you can try the following code df.groupby('batch').aggregate({'c':'min','a':mean})

Pandas Dataframe: How to get the cell instead of is value

I have a task to compare two dataframe with same columns name but different size, we can call it previous and current. I am trying to get the difference between (previous and current) in the Quantity and Booked Columns and highlight it as yellow. The common key between the two dataframe would be the 'SN' columns
I have coded out the following
for idx, rows in df_n.iterrows():
if rows["Quantity"] == rows['Available'] + rows['Booked']:
continue
else:
rows["Quantity"] = rows["Quantity"] - rows['Available'] - rows['Booked']
df_n.loc[idx, 'Quantity'].style.applymap('background-color: yellow')
# pdb.set_trace()
if (df_o['Booked'][df_o['SN'] == rows["SN"]] != rows['Booked']).bool():
df_n.loc[idx, 'Booked'].style.apply('background-color: yellow')
I realise I have a few problems here and need some help
df_n.loc[idx, 'Quantity'] returns value instead of a dataframe type. How can I get a dataframe from one cell. Do I have to pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns ='Quantity'). Will this create a copy or will update the reference?
How do I compare the SN of both dataframe, looking for a better way to compare. One thing I could think of is to use set index for both dataframe and when finished using them, reset them back?
My dataframe:
Previous dataframe
Current Dataframe
df_n.loc[idx, 'Quantity'] returns value instead of a dataframe type.
How can I get a dataframe from one cell. Do I have to
pd.DataFrame(data=df_n.loc[idx, 'Quantity'], index=idx, columns
='Quantity'). Will this create a copy or will update the reference?
To create a DataFrame from one cell you can try: df_n.loc[idx, ['Quantity']].to_frame().T
How do I compare the SN of both dataframe, looking for a better way to
compare. One thing I could think of is to use set index for both
dataframe and when finished using them, reset them back?
You can use df_n.merge(df_o, on='S/N') to merge dataframes and 'compare' columns.

pandas : Indexing for thousands of rows in dataframe

I initially had 100k rows in my dataset. I read the csv using pandas into a dataframe called data. I tried to do a subset selection of 51 rows using .loc. My index labels are numeric values 0, 1, 2, 3 etc. I tried using this command -
data = data.loc['0':'50']
But the results were weird, it took all the rows from 0 to 49999, looks like it is taking rows till the index value starts with 50.
Similarly, I tried with this command - new_data = data.loc['0':'19']
and the result was all the rows, starting from 0 till 18999.
Could this be a bug in pandas?
You want to use .iloc in place of .loc, since you are selecting data from the dataframe via numeric indices.
For example:
data.iloc[:50,:]
Keep in mind that your indices are of numeric-type, not string-type, so querying with a string (as you have done in your OP) attempts to match string-wise comparisons.