How to get row total in pandas

I am trying to get a row total and a column total from my dataframe. I have no issue with the column total. However, my row total is adding up all the job descriptions rather than showing the total.
Here's my code:
Newdata = data.groupby(['Job Description', 'AgeBand'])['AgeBand'].count().reset_index(name='count')
Newdata = Newdata.sort_values(by=['AgeBand'], ascending=True)
df = Newdata.pivot_table(index='Job Description', values='count', columns='AgeBand').reset_index()
df.loc['Total', :] = df.sum(axis=0)
df.loc[:, 'Total'] = df.sum(axis=1)
df = df.fillna(0).astype(int, errors='ignore')
df

First preselect the columns you wish to add row-wise, then use df.sum(axis=1).
I think you're after:
df.loc[:,'Total'] = df.loc[:,'20-29':'UP TO 20'].sum(axis=1)
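For a self-contained illustration, here is a rough sketch of the same idea; the age-band labels, job descriptions, and counts are made up, and your real column names may differ:
import pandas as pd

# Hypothetical pivoted counts per Job Description and AgeBand
df = pd.DataFrame({'Job Description': ['Analyst', 'Engineer'],
                   'UP TO 20': [1, 0],
                   '20-29': [5, 3],
                   '30-39': [2, 4]})

age_cols = ['UP TO 20', '20-29', '30-39']   # numeric columns only
df['Total'] = df[age_cols].sum(axis=1)      # row totals, text column excluded
df.loc['Total', age_cols + ['Total']] = df[age_cols + ['Total']].sum(axis=0)
print(df.fillna(''))
Selecting the numeric age-band columns explicitly keeps the 'Job Description' strings out of the row-wise sum.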

Related

Pandas how to group by day and other column

I am getting the daily counts of rows from a dataframe using
df = df.groupby(by=df['startDate'].dt.date).count()
How can I modify this so I can also group by another column 'unitName'?
Thank you
Pass a list of grouping keys and use GroupBy.size:
df = df.groupby([df['startDate'].dt.date, 'unitName']).size()
If you need to count non-missing values in a particular column, e.g. col, use DataFrameGroupBy.count:
df = df.groupby([df['startDate'].dt.date, 'unitName'])['col'].count()
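For a quick, self-contained sketch (the dates and unit names below are invented):
import pandas as pd

# Hypothetical log: two units, two days
df = pd.DataFrame({'startDate': pd.to_datetime(['2023-01-01 08:00', '2023-01-01 09:30',
                                                '2023-01-02 10:00', '2023-01-02 11:15']),
                   'unitName': ['A', 'B', 'A', 'A']})

counts = df.groupby([df['startDate'].dt.date, 'unitName']).size()
print(counts)
# startDate   unitName
# 2023-01-01  A           1
#             B           1
# 2023-01-02  A           2
# dtype: int64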

Pyspark dataframe in each group fill in zero/null values with previous rows' values with aggregation on other columns

I have a dataframe like the one below, and I want a new column "expected_target" based on where "var_2" is zero. I want it filled with the difference of "var_2" and "var_1" from the previous row. This should happen within each group of "id", and there may be any number of zero values in "var_2".
data = [(1, 0, 1, 4, 4),
        (1, 0, 1, 4, 4),
        (1, 0, 1, 4, 4),
        (1, 1, 1, 4, 4),
        (1, 1, 2, 0, 3),
        (1, 1, 2, 0, 2),
        (1, 1, 2, 1, 1),
        (2, 0, 1, 24, 24),
        (2, 0, 1, 24, 24),
        (2, 0, 1, 24, 24),
        (2, 1, 1, 24, 24),
        (2, 1, 2, 0, 23),
        (2, 1, 2, 0, 22),
        (2, 1, 2, 0, 21),
        (2, 1, 2, 21, 20)]
cols = ['id', 'id_2', 'var_1', 'var_2', 'expected_target']
data_df = spark.createDataFrame(data=data, schema=cols)
display(data_df)
Please help!

Pandas sum corresponding values based on values in another column

df1 contains Itemlist1 and Itemlist2 where each cell can contain any number of items. df2 contains Price and Cost of each item.
I want to obtain a final df with two new columns, Totalprice and Totalcost, added to df1. Totalprice and Totalcost are the sums of the prices and costs of all the items in each row of df1.
I managed to arrive at df3, where each item is put in a separate cell. Any suggestions from here, please? Thank you.
From your df3, replace each item code with its cost or price, then sum with axis=1:
cost_dict = dict(zip(df2.Itemcode, df2.Cost))
price_dict = dict(zip(df2.Itemcode, df2.Price))
df1['totalcost'] = df3.replace(cost_dict).sum(axis=1)
df1['totalprice'] = df3.replace(price_dict).sum(axis=1)
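A minimal sketch of that approach, with made-up frames (the Itemcode values, prices, costs, and the two-column shape of df3 are just for illustration):
import pandas as pd

# Hypothetical price/cost lookup (df2) and exploded item table (df3)
df2 = pd.DataFrame({'Itemcode': ['a', 'b', 'c'],
                    'Price': [10, 20, 30],
                    'Cost': [1, 2, 3]})
df3 = pd.DataFrame({'item1': ['a', 'b'],
                    'item2': ['c', None]})   # None where a row has fewer items

cost_dict = dict(zip(df2.Itemcode, df2.Cost))
price_dict = dict(zip(df2.Itemcode, df2.Price))

totalcost = df3.replace(cost_dict).sum(axis=1)    # row 0: 1 + 3 = 4, row 1: 2
totalprice = df3.replace(price_dict).sum(axis=1)  # row 0: 10 + 30 = 40, row 1: 20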

Frequency of Value Column Given a Count Column

A dataframe has two columns ['Value', 'Count']. Value contains non-unique values. Count contains the number of occurrences of Value. I want to plot Value vs. the sum of Count. Although this code works, I feel it doesn't utilize the power of pandas. What am I missing?
df = pd.DataFrame({'Value':[1,3,2,1],'Count':[5,2,1,4]})
gdf = df.groupby('Value')
sumdf = pd.DataFrame({'Value':k,'Sum':g['Count'].sum()} for k,g in gdf)
sumdf['Pct'] = sumdf['Sum'] / sumdf['Sum'].sum() * 100
sumdf.plot(x='Value',y='Pct',kind='bar',title='Frequency of Value')
Here's a one-liner:
ax = (df.groupby('Value')['Count'].sum() / df['Count'].sum() * 100).plot.bar(title='Frequency of Value')
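For the sample frame in the question, the grouped percentages behind that plot work out as:
pct = df.groupby('Value')['Count'].sum() / df['Count'].sum() * 100
print(pct)
# Value
# 1    75.000000
# 2     8.333333
# 3    16.666667
# Name: Count, dtype: float64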

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
So far I have run the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values of that specific column for each year, and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 'totaldemand' values per year?
We usually do head after sort_values:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)
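As a rough, self-contained sketch of that approach (the hourly index, demand values, and the extra 'region' column are all invented):
import numpy as np
import pandas as pd

# Hypothetical hourly data spanning two years, with an extra column to keep
idx = pd.date_range('2020-01-01', '2021-12-31 23:00', freq='h')
rng = np.random.default_rng(0)
df = pd.DataFrame({'totaldemand': rng.integers(100, 1000, len(idx)),
                   'region': rng.choice(['N', 'S'], len(idx))}, index=idx)

top = df.sort_values('totaldemand', ascending=False)
top10_per_year = top.groupby(top.index.year).head(10)   # full rows, all columns kept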
nlargest can be applied to each group, passing the column in which to look for the largest values. So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
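Note that grp here is the entire per-group frame, so all columns are kept. The result usually gains an extra index level holding the year; if that is unwanted, it can be dropped (small usage sketch):
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand')).droplevel(0)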
Get the index of your query and use it as a mask on your original df:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that extent, I can't test right now without any test data)