Pandas: how to group by day and another column

I am getting the daily counts of rows from a dataframe using
df = df.groupby(by=df['startDate'].dt.date).count()
How can I modify this so I can also group by another column 'unitName'?
Thank you

Use a list with GroupBy.size:
df = df.groupby([df['startDate'].dt.date, 'unitName']).size()
If you need to count non-missing values, e.g. in column col, use DataFrameGroupBy.count:
df = df.groupby([df['startDate'].dt.date, 'unitName'])['col'].count()
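A minimal, runnable sketch with made-up data (the column names startDate, unitName and col are taken from the question and answer above) showing the difference between size and count:

import pandas as pd
import numpy as np

# hypothetical sample data matching the question's column names
df = pd.DataFrame({
    'startDate': pd.to_datetime(['2023-01-01 08:00', '2023-01-01 12:00',
                                 '2023-01-01 15:00', '2023-01-02 09:00']),
    'unitName': ['A', 'A', 'B', 'A'],
    'col': [1.0, np.nan, 3.0, 4.0],
})

# size() counts every row in each (day, unitName) group
print(df.groupby([df['startDate'].dt.date, 'unitName']).size())

# count() counts only the non-missing values of 'col' in each group
print(df.groupby([df['startDate'].dt.date, 'unitName'])['col'].count())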

Related

python: aggregate columns in pivot table with multiindex structure

If I have a multi-index pivot table like this:
What would be the way to aggregate the total 'sum' and 'count' across all dates?
I want to see an additional column with the totals for each row in the table.
Thanks to #Nik03 for the idea. The concat method returns the required DataFrame, but with a single index level. To add it back to the original DataFrame, create the columns first and then move them to the front:
table_to_show = pd.concat([table_to_record.filter(like='sum').sum(1),
                           table_to_record.filter(like='count').sum(1)], axis=1)
table_to_show.columns = ['sum', 'count']
table_to_record['total_sum'] = table_to_show['sum']
table_to_record['total_count'] = table_to_show['count']
# move the two total columns to the front of the frame
column_1st = table_to_record.pop('total_sum')
column_2nd = table_to_record.pop('total_count')
table_to_record.insert(0, 'total_sum', column_1st)
table_to_record.insert(1, 'total_count', column_2nd)
and here is the result:
One way:
df1 = pd.concat([df.filter(like='sum').sum(1),
                 df.filter(like='mean').sum(1)], axis=1)
df1.columns = ['sum', 'mean']
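For context, here is a minimal, self-contained sketch of what the filter/sum step used in both answers does; the flat column labels below are made up, standing in for the question's MultiIndex columns:

import pandas as pd

# hypothetical frame standing in for the pivot table: each date contributes
# a '..._sum' and a '..._count' column
table_to_record = pd.DataFrame({
    'jan_sum': [10, 20], 'jan_count': [2, 4],
    'feb_sum': [30, 40], 'feb_count': [3, 5],
}, index=['x', 'y'])

# filter(like='sum') keeps every column whose label contains 'sum';
# .sum(1) then adds the kept columns row-wise, giving the per-row total
table_to_show = pd.concat([table_to_record.filter(like='sum').sum(1),
                           table_to_record.filter(like='count').sum(1)], axis=1)
table_to_show.columns = ['sum', 'count']
print(table_to_show)
#    sum  count
# x   40      5
# y   60      9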

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year, and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 values of 'totaldemand' per year?
We usually do head after sort_values. Note that you should not select only the 'totaldemand' column afterwards, otherwise you lose the other columns:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
nlargest can be applied to each group, passing the column in which to look for the largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
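A small, self-contained sketch with made-up hourly data (only the column name totaldemand is taken from the question) showing that whole rows are kept:

import numpy as np
import pandas as pd

# hypothetical hourly data spanning a bit more than two years
idx = pd.date_range('2019-01-01', periods=20000, freq='h')
df = pd.DataFrame({'totaldemand': np.random.default_rng(0).random(len(idx)),
                   'othercol': np.arange(len(idx))}, index=idx)

# top 3 rows per year by 'totaldemand'; all columns are preserved
top = df.groupby(df.index.year).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
print(top)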
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(1)
df.loc[idx]
(or something to that extent, I can't test right now without any test data)

PySpark Aggregation on Comma Separated Column

I have a huge DataFrame with two of many columns: "NAME", "VALUE". One of the row values for the "NAME" column is "X,Y,V,A".
I want to transpose my DataFrame so that the "NAME" values become columns and the averages of "VALUE" become the row values.
I used the pivot function:
df1 = df.groupby('DEVICE', 'DATE').pivot('NAME').avg('VALUE')
All NAME values except for "X,Y,V,A" work well with the above. I am not sure how to separate the four values of "X,Y,V,A" and aggregate on each individual value.
IIUC, you need to split and explode the string first:
from pyspark.sql.functions import split, explode
df = df.withColumn("NAME", explode(split("NAME", ",")))
Now you can group and pivot:
df1 = df.groupby('DEVICE', 'DATE').pivot('NAME').avg('VALUE')
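A self-contained sketch with made-up rows (the column names DEVICE, DATE, NAME and VALUE come from the question) showing the split/explode followed by the pivot:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode

spark = SparkSession.builder.getOrCreate()

# hypothetical sample rows mirroring the question's columns
df = spark.createDataFrame(
    [("dev1", "2020-01-01", "X,Y,V,A", 8.0),
     ("dev1", "2020-01-01", "X", 2.0),
     ("dev1", "2020-01-01", "B", 5.0)],
    ["DEVICE", "DATE", "NAME", "VALUE"])

# split the comma-separated NAME into an array, then explode to one row per name
df = df.withColumn("NAME", explode(split("NAME", ",")))

# pivot as before: one column per NAME, averaging VALUE
df1 = df.groupby('DEVICE', 'DATE').pivot('NAME').avg('VALUE')
df1.show()

Note that exploding replicates the row's VALUE for each split name, so that single "X,Y,V,A" reading contributes to the averages of X, Y, V and A individually.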

renaming columns after group by and sum in pandas dataframe

This is my group by command:
pdf_chart_data1 = pdf_chart_data.groupby('sell').value.agg(['sum']).rename(
    columns={'sum': 'valuesum', 'sell': 'selltime'}
)
I am able to change the column name for value but not for 'sell'.
Please help to resolve this issue.
You cannot rename it because it is the index. You can add as_index=False to return a DataFrame, or add reset_index:
pdf_chart_data1 = (pdf_chart_data.groupby('sell', as_index=False)['value'].sum()
                    .rename(columns={'value': 'valuesum', 'sell': 'selltime'}))
Or:
pdf_chart_data1 = (pdf_chart_data.groupby('sell')['value'].sum()
                    .reset_index()
                    .rename(columns={'value': 'valuesum', 'sell': 'selltime'}))
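If your pandas version has named aggregation (0.25+), it can name the output column directly, so only the grouping column still needs renaming; a minimal sketch with made-up data using the question's column names:

import pandas as pd

# made-up data with the question's column names
pdf_chart_data = pd.DataFrame({'sell': ['a', 'a', 'b'], 'value': [1, 2, 3]})

# named aggregation: the keyword 'valuesum' becomes the output column name
pdf_chart_data1 = (pdf_chart_data
                   .groupby('sell', as_index=False)
                   .agg(valuesum=('value', 'sum'))
                   .rename(columns={'sell': 'selltime'}))
print(pdf_chart_data1)
#   selltime  valuesum
# 0        a         3
# 1        b         6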
df = df.groupby('col1')['col1'].count()
df1 = df.to_frame().rename(columns={'col1': 'new_name'}).reset_index()
If you join two groupbys on the same index, where one uses nunique (the number of unique items) and the other uses unique (the list of unique items), you get two columns called Sport. Using as_index=False I was able to rename the second Sport column with rename, then concat the two together, sort descending on SportCount and display the top five counts.
grouped = df.groupby('NOC', as_index=False)
Nsport = grouped['Sport'].nunique()\
    .rename(columns={'Sport': 'SportCount'})
Nsport = Nsport.set_index('NOC')
country_grouped = df.groupby('NOC')
Nsport2 = country_grouped['Sport'].unique()
df2 = pd.concat([Nsport, Nsport2], join='inner', axis=1).reindex(Nsport.index)
df2 = df2.sort_values(by=["SportCount"], ascending=False)
print(df2.columns)
for key, item in df2.head(5).iterrows():
    print(key, item)
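As a side note, the same result can be sketched in one step with agg, assuming df has 'NOC' and 'Sport' columns as above (the sample rows below are made up):

import pandas as pd

# made-up rows with the answer's column names
df = pd.DataFrame({'NOC': ['USA', 'USA', 'USA', 'FRA', 'FRA'],
                   'Sport': ['Swimming', 'Athletics', 'Swimming', 'Fencing', 'Judo']})

# nunique -> count of distinct sports, unique -> list of distinct sports
df2 = (df.groupby('NOC')['Sport']
         .agg(['nunique', 'unique'])
         .rename(columns={'nunique': 'SportCount'})
         .sort_values('SportCount', ascending=False))
print(df2.head(5))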

Pandas - select dataframe columns if statistic is greater than certain value

I have a pandas dataframe df. I would like to select the columns which have a standard deviation greater than 1. Here is what I tried:
df2 = df[df.std() >1]
df2 = df.loc[df.std() >1]
Both generated an error. What am I doing wrong?
Use df.loc[:, df.std() > 1] and it will fix it.
The first part, :, refers to the rows and the second part, df.std() > 1, refers to the columns.
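A minimal sketch with made-up data (the column names are hypothetical):

import pandas as pd

# made-up data: column 'a' has a large spread, 'b' is nearly constant
df = pd.DataFrame({'a': [0, 10, 20, 30], 'b': [1.0, 1.1, 0.9, 1.0]})

# boolean mask over the columns: keep those whose standard deviation exceeds 1
df2 = df.loc[:, df.std() > 1]
print(df2.columns.tolist())   # ['a']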