Frequency of Value Column Given a Count Column - pandas

A dataframe has two columns, ['Value', 'Count']. Value contains non-unique values. Count contains the number of occurrences of Value. I want to plot Value vs sum of Count. Although this code works, I feel it doesn't utilize the power of pandas. What am I missing?
df = pd.DataFrame({'Value':[1,3,2,1],'Count':[5,2,1,4]})
gdf = df.groupby('Value')
sumdf = pd.DataFrame({'Value':k,'Sum':g['Count'].sum()} for k,g in gdf)
sumdf['Pct'] = sumdf['Sum'] / sumdf['Sum'].sum() * 100
sumdf.plot(x='Value',y='Pct',kind='bar',title='Frequency of Value')

Here's a one-liner:
ax = (df.groupby('Value')['Count'].sum() / df['Count'].sum() * 100).plot.bar(title='Frequency of Value')

python: aggregate columns in pivot table with multiindex structure

If I have a multi-index pivot table like this:
what would be the way to aggregate the total 'sum' and 'count' across all dates?
I want to see additional columns with totals for every row in the table.
Thanks to #Nik03 for the idea. The concat approach returns the required data frame, but with a single index level. To add it to the original dataframe, you first have to create the columns and assign the new dataframe's columns to them:
table_to_show = pd.concat([table_to_record.filter(like='sum').sum(1), table_to_record.filter(like='count').sum(1)], axis=1)
table_to_show.columns = ['sum', 'count']
table_to_record['total_sum'] = table_to_show['sum']
table_to_record['total_count'] = table_to_show['count']
column_1st = table_to_record.pop('total_sum')
column_2nd = table_to_record.pop('total_count')
table_to_record.insert(0, 'total_sum', column_1st)
table_to_record.insert(1,'total_count', column_2nd)
One way:
df1 = pd.concat([df.filter(like='sum').sum(1), df.filter(like='mean').sum(1)], axis=1)
df1.columns = ['sum', 'mean']
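The original pivot table is not shown, so the following is only a sketch under assumed structure: a pivot built with aggfunc=['sum', 'count'], whose columns form a two-level MultiIndex of (aggregation, date). The column names item, date and amount are hypothetical. Under that assumption, selecting the top level directly avoids the string matching done by filter, and the totals keep two column levels so they can be concatenated in front of the original table.

```python
import pandas as pd

# Hypothetical sample data; the real table in the question is not shown.
df = pd.DataFrame({
    "item": ["a", "a", "b", "b"],
    "date": ["2021-01", "2021-02", "2021-01", "2021-02"],
    "amount": [1, 2, 3, 4],
})

# Columns become a MultiIndex: ('sum', date) and ('count', date).
table = df.pivot_table(index="item", columns="date", values="amount",
                       aggfunc=["sum", "count"])

# Row-wise totals across all dates, kept as two-level columns.
totals = pd.DataFrame({
    ("total", "sum"): table["sum"].sum(axis=1),
    ("total", "count"): table["count"].sum(axis=1),
})

# Put the totals in front of the per-date columns.
result = pd.concat([totals, table], axis=1)
print(result)
```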

How to operate over subset of row on pandas data frame?

def getDF(threshold):
    df = pd.read_pickle(filename)
    df['threshold'] = float(threshold)
    df.set_index('date')
    df['anomaly'] = [any values of the row] > df['threshold']
I have the above function that needs to set the anomaly column if any of the floats, columns 0 - 9, are greater than the threshold. I know how to do this on one column but what about multiple?
I could probably do this with brute force the long way, but I'm sure there is a pandas way of doing it much faster.
Thank you for your time.
You can first calculate the row-wise maximum, and then check if this maximum is greater than the threshold:
df['anomaly'] = df[['column1', 'column2', 'column3']].max(axis=1) > df['threshold']
If the threshold is however a single value, then you can simply use the value itself:
df['anomaly'] = df[['column1', 'column2', 'column3']].max(axis=1) > threshold
or for the first ten columns:
df['anomaly'] = df.iloc[:,:10].max(axis=1) > threshold
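As a small runnable illustration of the idea above, here is a sketch with made-up data (the column names s0 and s1 and the values are assumptions, not from the question). It also shows an equivalent formulation: comparing every value with the threshold and then asking whether any comparison in the row is true.

```python
import pandas as pd

# Hypothetical frame with two float columns instead of ten.
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=3),
    "s0": [0.1, 0.9, 0.2],
    "s1": [0.3, 0.2, 0.4],
}).set_index("date")

threshold = 0.5
value_cols = ["s0", "s1"]

# A row is anomalous if its largest value exceeds the threshold ...
by_max = df[value_cols].max(axis=1) > threshold

# ... which is the same as "any value in the row exceeds the threshold".
df["anomaly"] = df[value_cols].gt(threshold).any(axis=1)
print(df)
```

Both forms give the same flags; the any version reads closer to the wording of the question.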

how to get row total in pandas

I am trying to get row totals and column totals from my dataframe. I have no issue with the column totals. However, my row total adds up all the job descriptions rather than showing the total.
here's my code:
Newdata= data.groupby(['Job Description','AgeBand'])['AgeBand'].count().reset_index(name="count")
Newdata= Newdata.sort_values(by = ['AgeBand'],ascending=True)
df=Newdata.pivot_table(index='Job Description', values = 'count', columns = 'AgeBand').reset_index()
df.loc['Total',:]= df.sum(axis=0)
df.loc[:,'Total'] = df.sum(axis=1)
df=df.fillna(0).astype(int, errors='ignore')
df
First preselect the columns you wish to add row wise, then use df.sum(axis=1).
I think you're after:
df.loc[:,'Total'] = df.loc[:,'20-29':'UP TO 20'].sum(axis=1)
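If only the counts per job and age band are needed, pd.crosstab with margins=True is a different route from the code in the question: it builds the count table and both totals in one call. The sketch below uses made-up data with the question's column names; it is an alternative under those assumptions, not a fix of the original pipeline.

```python
import pandas as pd

# Hypothetical data using the column names from the question.
data = pd.DataFrame({
    "Job Description": ["Clerk", "Clerk", "Nurse", "Nurse", "Nurse"],
    "AgeBand": ["20-29", "30-39", "20-29", "20-29", "30-39"],
})

# margins=True appends a 'Total' row and a 'Total' column of counts.
table = pd.crosstab(data["Job Description"], data["AgeBand"],
                    margins=True, margins_name="Total")
print(table)
```

Because 'Job Description' stays in the index rather than becoming a regular column, the text labels never take part in the sums.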

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. I want to extract the entire rows (containing all columns) of the 10 top values of a specific column for every year in my dataframe.
So far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem here is that I only get the top 10 values for each year of that specific column and I lose the other columns. How can I do this operation while keeping the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?
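We usually call head after sort_values. Sort first, then group the sorted frame by the year of its own index, without selecting a single column so that all columns are kept:
df = df.sort_values('totaldemand', ascending=False)
df = df.groupby(df.index.year).head(10)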
nlargest can be applied to each group, passing the column to look for largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.
Get the index of your query and use it to select rows from your original df. The result of the groupby has the year as an extra index level, so drop it first:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.droplevel(0)
df.loc[idx]
(or something to that effect, I can't test now without any test data)
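Since no test data was given, here is a minimal sketch of the sort-then-head approach on invented data (the temperature column and all values are assumptions). It uses n = 2 rather than 10 so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical frame with a DatetimeIndex spanning two years.
index = pd.to_datetime([
    "2020-01-01", "2020-06-01", "2020-12-01",
    "2021-01-01", "2021-06-01", "2021-12-01",
])
df = pd.DataFrame({
    "totaldemand": [10, 30, 20, 5, 15, 25],
    "temperature": [1, 2, 3, 4, 5, 6],
}, index=index)

n = 2  # use 10 for the case in the question

# Sort by demand, then take the first n rows of each year.
# The grouping key comes from the sorted frame so that it lines up row by row.
ordered = df.sort_values("totaldemand", ascending=False)
top = ordered.groupby(ordered.index.year).head(n).sort_index()
print(top)
```

Every column of the original rows is preserved, and the final sort_index only restores chronological order.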

Sample Pandas dataframe based on values in column

I have a large dataframe that I want to sample based on the values in the target column, which is binary: 0/1
I want to extract equal number of rows that have 0's and 1's in the "target" column. I was thinking of using the pandas sampling function but not sure how to declare the equal number of samples I want from both classes for the dataframe based on the target column.
I was thinking of using something like this:
df.sample(n=10000, weights='target', random_state=1)
Not sure how to edit it to get 10k records with 5k 1's and 5k 0's in the target column. Any help is appreciated!
You can group the data by target and then sample:
df = pd.DataFrame({'col':np.random.randn(12000), 'target':np.random.randint(low = 0, high = 2, size=12000)})
new_df = df.groupby('target').apply(lambda x: x.sample(n=5000)).reset_index(drop = True)
new_df.target.value_counts()
1 5000
0 5000
Edit: Use DataFrameGroupBy.sample
You get similar results using DataFrameGroupBy.sample (available since pandas 1.1):
new_df = df.groupby('target').sample(n=5000)
You can use the DataFrameGroupBy.sample method as follows:
sample_df = df.groupby("target").sample(n=5000, random_state=1)
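Also found weighted sampling to be a workable method, with one caveat: giving every row the same weight, as in np.where(df['target'] == 1, .5, .5), is just uniform sampling. To draw roughly equal numbers from both classes, weight each row by the inverse of its class size:
df['weights'] = 1 / df.groupby('target')['target'].transform('count')
sample_df = df.sample(frac=.1, random_state=111, weights='weights')
Change the value of frac depending on the percent of data you want back from the original dataframe. The split between classes is only approximately equal, since the rows are drawn at random.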
Another option is to sample each class separately: run df0.sample(n=5000) and df1.sample(n=5000), then combine the two results into a dfsample dataframe with pd.concat. You can create df0 and df1 by filtering df on the target column with boolean masks, for example df[df['target'] == 0]. If you provide sample data I can help you construct that logic.
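A minimal sketch of the split-sample-combine approach, on synthetic data with a smaller per-class count than the 5000 in the question (the data, n_per_class and the random seeds are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced data: 700 rows of class 0, 300 rows of class 1.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col": rng.normal(size=1000),
    "target": np.repeat([0, 1], [700, 300]),
})

n_per_class = 100  # 5000 in the question

# Split by class with boolean masks, then sample each part separately.
df0 = df[df["target"] == 0].sample(n=n_per_class, random_state=1)
df1 = df[df["target"] == 1].sample(n=n_per_class, random_state=1)

# Combine and shuffle so the classes are not stacked one after the other.
dfsample = (pd.concat([df0, df1])
            .sample(frac=1, random_state=1)
            .reset_index(drop=True))
print(dfsample["target"].value_counts())
```

Unlike weighted sampling, this gives exactly n_per_class rows per class, provided each class has at least that many rows.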