I'm working in a Jupyter notebook, and I would like to get the average 'pcnt_change' based on 'day_of_week'. How do I do this?
A simple groupby call would do the trick here.
If df is the pandas dataframe:
df.groupby('day_of_week').mean()
would return a dataframe with the average of all numeric columns, with day_of_week as the index. (In newer pandas versions you may need mean(numeric_only=True) if the dataframe contains non-numeric columns.) If you only want certain column(s) returned, select the needed columns before the aggregation, e.g.:
df[['open_price', 'high_price', 'day_of_week']].groupby('day_of_week').mean()
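For the exact question asked, a minimal sketch (assuming the column names from the question):
df.groupby('day_of_week')['pcnt_change'].mean()
This returns a Series with the average pcnt_change for each day_of_week.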
import pandas as pd

df = pd.read_csv('../input/tipping/tips.csv')
df_1 = df.groupby(['day','time'])
df_1.head()
Guys, what am I missing here? It returns the previous dataframe to me as if no groupby was applied.
We can print it using the following:
df_1 = df.groupby(['day','time']).apply(print)
groupby doesn't work the way you are assuming, by the sounds of it. Calling head on the grouped dataframe returns the first 5 rows of each group, concatenated in the original row order, so the result looks just like the original dataframe. You can use #tlentali's approach to print out each group, but note that df_1 will not be assigned the grouped dataframe that way: since print returns None, apply collects None once per group instead.
The approach below gives a lot of control over how to show or display the groups and their keys.
It might also help you understand how the grouped dataframe structure in pandas works.
df_1 = df.groupby(['day', 'time'])

# for each (day, time) key and its grouped data
for key, group in df_1:
    # show the (day, time) key
    print(key)
    # display the head of the grouped data
    print(group.head())
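If you only need one specific group, get_group avoids the loop entirely; a short sketch, assuming a ('Sun', 'Dinner') key exists in the tips data:
# fetch a single group by its (day, time) key -- assumes this key exists
print(df_1.get_group(('Sun', 'Dinner')).head())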
I have been playing with aggregation in pandas dataframe. Considering the following dataframe:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8],
                   'batch': ['q', 'q', 'q', 'w', 'w', 'w', 'w', 'e'],
                   'c': [4, 1, 3, 4, 5, 1, 3, 2]})
I have to do aggregation on the batch column with mean for column a and min for column c.
I used the following method to do the aggregation:
agg_dict = {'a':{'a':'mean'},'c':{'c':'min'}}
aggregated_df = df.groupby("batch").agg(agg_dict)
The problem is that I want the final dataframe to have the same columns as the original dataframe, just with the aggregated values in each column.
The result of the above aggregation is a dataframe with multi-index columns, and I'm not sure how to convert it to a single-index dataframe.
I followed the link Reverting from multiindex to single index dataframe in pandas, but this didn't work: the final output was still a multi-index dataframe.
It would be great if someone could help.
You can try the following code:
df.groupby('batch').agg({'a': 'mean', 'c': 'min'})
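To get flat, single-level columns directly (which is what the question asks for), named aggregation, available since pandas 0.25, is a sketch worth trying:
# named aggregation: output column = (input column, aggregation function)
aggregated_df = df.groupby('batch', as_index=False).agg(a=('a', 'mean'), c=('c', 'min'))
print(aggregated_df)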
I have a dataframe grouped by two columns.
Now I want to plot Date vs Confirmed in seaborn.
Is there a good way to do it?
grouped_series = cases.groupby(['Country/Region', 'ObservationDate'])[['Confirmed', 'Deaths', 'Recovered']].sum()
print(grouped_series)
You can change the aggregation to group by the datetimes only:
cases.groupby(['ObservationDate'])['Confirmed'].sum().plot()
Or, if you need summed values per ObservationDate and Country/Region:
cases.groupby(['Country/Region','ObservationDate'])['Confirmed'].sum().unstack(0).plot()
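If you specifically want seaborn rather than the pandas plot method, reset the index so the group keys become plain columns; a minimal sketch, assuming seaborn is installed and the column names above:
import seaborn as sns

summed = cases.groupby(['Country/Region', 'ObservationDate'], as_index=False)['Confirmed'].sum()
sns.lineplot(data=summed, x='ObservationDate', y='Confirmed', hue='Country/Region')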
I have a dataframe at hourly level with several columns. For every year in the dataframe, I want to extract the entire rows (containing all columns) for the top 10 values of a specific column.
So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 values of 'totaldemand' per year?
We usually do head after sort_values (compute the year key from the sorted frame, and skip the column selection so all columns are kept):
df_sorted = df.sort_values('totaldemand', ascending=False)
df = df_sorted.groupby(df_sorted.index.year).head(10)
nlargest can be applied to each group, passing the column in which to look for the largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
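Note that apply prepends the year as an extra index level. If you want the original (hourly) index back, droplevel strips it; a short sketch:
top10 = df.groupby(df.index.year).apply(lambda grp: grp.nlargest(10, 'totaldemand'))
top10 = top10.droplevel(0)  # drop the year level, keeping the original index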
Get the index of your query and use it as a mask on your original df:
idx = df.groupby(df.index.year)['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(1)
df.loc[idx]
(or something to that extent, I can't test now without any test data)
I am trying to convert the following pandas line into pyspark:
df = df.groupby('ID', as_index=False).head(1)
Now, I am familiar with the pyspark df.groupby("col1", "col2") method, as well as the following to get whatever the first element is within a group:
df = df.withColumn("row_num", row_number().over(Window.partitionBy("ID").orderBy("SOME_DATE_COLUMN"))).where(col("row_num") < 2)
However, without an orderBy argument, this grouping and fetching of the first element in each group doesn't work (and I am literally trying to convert from pandas to spark, whatever the pandas line does):
An error occurred while calling o2547.withColumn.
: org.apache.spark.sql.AnalysisException: Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table
Looking at the pandas groupby documentation, I cannot grasp what groupby does without a following sort/agg function applied to the groups; i.e. what is the default order within a group from which .head(1) fetches the first element?
It depends on the order of your pandas dataframe before the groupby. From the pandas groupby documentation:
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
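A tiny demonstration of what this means in pandas (hypothetical data, just to illustrate the order guarantee):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 1, 2], 'v': ['a', 'b', 'c', 'd']})
print(df.groupby('ID', as_index=False).head(1))  # keeps the first occurrence per ID, in original row order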
Converting the pandas behaviour exactly to pyspark is impossible, as pyspark dataframes aren't ordered. But if your data source can provide a row number or something like that, it is possible, as in the sketch below.
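For example, one hedged sketch is to tag each row with monotonically_increasing_id() at load time and order the window by that. The ids reflect partition layout rather than a guaranteed original order, so treat this as an approximation:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# tag rows in load order (approximate: ids are increasing but tied to partition layout)
df = df.withColumn("load_order", F.monotonically_increasing_id())

w = Window.partitionBy("ID").orderBy("load_order")
df = (df.withColumn("row_num", F.row_number().over(w))
        .where(F.col("row_num") == 1)
        .drop("row_num", "load_order"))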