Matplotlib plot with x-axis as binned data and y-axis as the mean value of various variables in the bin? - pandas

My apologies if this is rather basic; I can't seem to find a good answer yet because everything refers only to histograms. I have circular data, with a degrees value as the index. I am using pd.cut() to create bins of a few degrees in order to summarize the dataset. Then, I use df.groupby() and .mean() to calculate mean values of all columns for the respective bins.
Now - I would like to plot this, with the bins on the x-axis, and lines for the columns.
I tried to iterate over the columns, adding them as:
for i in df.columns:
    ax.plot(df.index, df[i])
However, this gives me the error: "float() argument must be a string or a number, not 'pandas._libs.interval.Interval'"
Therefore, I assume it wants the x-axis values to be numbers or strings and not intervals. Is there a way I can make this work?
To get the dataframe containing the mean values of each variable with respect to bins, I used:
bins = np.arange(0,360,5)
df = df.groupby(pd.cut(df['Dir'], bins)).mean()
Here is what df looks like at the point of plotting - each column includes mean values for each variable 0,1,2 etc. for each bin, which I would like plotted on y-axis, and "Dir" is the index with bins.
0 1 2 3 4 5
Dir
(0, 5] 37.444135 2922.848675 3244.325904 4203.001446 36.262371 37.493497
(5, 10] 42.599494 3248.194328 3582.355759 4061.098517 36.351476 37.148341
(10, 15] 47.277694 2374.379517 2709.435714 2932.064076 36.537377 36.878293
(15, 20] 52.345712 2626.774240 2659.391040 3087.324800 36.114965 36.603918
(20, 25] 57.318976 2207.845000 2228.002353 2811.066176 36.279392 37.165979
(25, 30] 62.454386 2436.117405 2839.255696 3329.441772 36.762896 37.861577
(30, 35] 67.705955 3138.968411 3462.831977 4007.180620 36.462313 37.560977
(35, 40] 72.554786 2554.552620 2548.955581 3079.570159 36.256386 36.819579
(40, 45] 77.501479 2862.703066 2965.408491 2857.901887 36.170788 36.140976
(45, 50] 82.386679 2973.858188 2539.348967 2000.606359 36.067776 37.210645

We have multiple options: we can obtain the middle of each bin from the `mid` attribute of an `IntervalIndex`, as shown below. You can also access the left and right edges of the bins through the `left` and `right` attributes, as described here. Let me know if you need any further help.
df = pd.DataFrame(data={'x': np.random.uniform(low=0, high=10, size=10),
                        'y': np.random.exponential(size=10)})
bins = range(0, 360, 5)
df['bin'] = pd.cut(df['x'], bins)
agg_df = df.groupby(by='bin').mean()
# this is the important step: we obtain the midpoints from the interval index
mids = pd.IntervalIndex(agg_df.index.get_level_values('bin')).mid
# to apply for plots (note: plot the aggregated frame, not the raw one):
for col in agg_df.columns:
    plt.plot(mids, agg_df[col])
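If it helps, here is a minimal self-contained sketch (data and column names made up, mirroring the question) of the three accessors on the resulting `IntervalIndex` - `mid`, `left` and `right`:

```python
import numpy as np
import pandas as pd

# made-up data in the same shape as the question: a degrees column
# plus one measured variable
rng = np.random.default_rng(0)
df = pd.DataFrame({'Dir': rng.uniform(0, 360, 500),
                   'speed': rng.exponential(size=500)})

bins = np.arange(0, 365, 5)  # edges 0, 5, ..., 360 -> 72 bins
agg = df.groupby(pd.cut(df['Dir'], bins), observed=False)['speed'].mean()

idx = agg.index.categories  # an IntervalIndex
mids = idx.mid       # bin centres: 2.5, 7.5, 12.5, ...
lefts = idx.left     # lower edges: 0, 5, 10, ...
rights = idx.right   # upper edges: 5, 10, 15, ...
```

Any of the three works as a numeric x-axis for `ax.plot`; `mid` is usually the most honest choice for binned means.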

Related

Histogram as stacked bar chart based on categories

I have data with a numeric and categorical variable. I want to show a histogram of the numeric column, where each bar is stacked by the categorical variable. I tried to do this with ax.hist(data, histtype='bar', stacked=True), but couldn't quite get it to work.
If my data is
df = pd.DataFrame({'age': np.random.normal(45, 5, 100),
                   'job': np.random.choice(['engineer', 'barista',
                                            'quantity surveyor'], size=100)})
I've organised it like this:
df['binned_age'] = pd.qcut(df.age, 5)
df.groupby('binned_age')['job'].value_counts().plot(kind='bar')
Which gives me a bar chart divided the way I want, but side by side, not stacked, and without different colours for each category.
Is there a way to stack this plot? Or just to do a regular histogram, but stacked by category?
IIUC, you will need to reshape your dataset first - I will do that using pivot_table, with len as the aggregator, since that gives you the frequency.
Then you can use a similar code to the one you provided above.
df_reshaped = df.pivot_table(index='binned_age', columns='job', values='age',
                             aggfunc=len, fill_value=0)
df_reshaped.plot(kind='bar', stacked=True, ylabel='Frequency', xlabel='Age binned',
                 title='Age group frequency by Job', rot=45)
This produces the stacked bar chart: one colour per job, stacked within each age bin.
You can use the documentation to tailor the chart to your needs
df['age_bin'] = pd.cut(df['age'], [20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
df.groupby('age_bin')['job'].value_counts().unstack().plot(kind='bar', stacked=True)
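For what it's worth, the `ax.hist(..., stacked=True)` route mentioned in the question also works, as long as you pass one array per category. A sketch using the question's made-up data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
df = pd.DataFrame({'age': np.random.normal(45, 5, 100),
                   'job': np.random.choice(['engineer', 'barista',
                                            'quantity surveyor'], size=100)})

fig, ax = plt.subplots()
# one array of ages per job, in a fixed order so the legend is stable
jobs = sorted(df['job'].unique())
groups = [df.loc[df['job'] == j, 'age'] for j in jobs]
counts, edges, patches = ax.hist(groups, bins=10, histtype='bar',
                                 stacked=True, label=jobs)
ax.legend()
```

Matplotlib handles the stacking itself here; the list-of-arrays input is what `stacked=True` needs.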

Sample random row from df.groupby("column1")["column2"].max() and not first one if multiple candidates

What would be the correct way to return n random max values from a groupby?
I have a dataframe containing audio events, with the following columns:
audio
start_time
end_time
duration
labelling confidence (1 to 5)
label ("Ambulance", "Engine", ...)
I have multiple events/rows for each label and I have 26 labels in total.
What I would like to achieve is to get one event per label with max confidence.
Let's say we have 7 events that have label "Ambulance" and they have the following labelling confidence: 2, 5, 5, 4, 4, 3, 5.
The max confidence is 5 in this case, which gives us 3 selectable events.
I would like to get one of the three at random.
Doing the following with pandas: df.groupby("label").max() will return the first row with max labelling confidence. I would like it to be a random selection.
Many thanks in advance
Cheers
Antoine
Edit: following a comment from the OP, the simplest solution is to shuffle the data frame before picking the max rows:
# Some random data
labels = list('ABCDE')
repeats = np.random.randint(1, 6, len(labels))
df = pd.DataFrame({
    'label': np.repeat(labels, repeats),
    'confidence': np.random.randint(1, 6, repeats.sum())
})
# Shuffle the data frame, then sort so that within each `label` the
# max-`confidence` rows come first; taking the first row per group
# then yields a random row among the tied maxima
(
    df.sample(frac=1)
      .sort_values(['label', 'confidence'], ascending=[True, False])
      .groupby('label')
      .head(1)
)
If you are running this in IPython / Jupyter Notebook, watch the index of the resulting data frame to see the randomness of the result.
Here is how I finally managed to do it:
shuffled_df = df.sample(frac=1)
filtered_df = shuffled_df.loc[shuffled_df.groupby("label")["confidence"].idxmax()]
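On recent pandas (1.1+), `GroupBy.sample` gives the same result without shuffling the whole frame: filter to the rows tied at each label's maximum, then sample one per label. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'label': ['Ambulance'] * 7 + ['Engine'] * 3,
                   'confidence': [2, 5, 5, 4, 4, 3, 5, 1, 3, 3]})

# keep only the rows tied at each label's maximum confidence ...
top = df[df['confidence'] == df.groupby('label')['confidence'].transform('max')]
# ... then draw one of them at random per label
picked = top.groupby('label').sample(n=1)
```

For the "Ambulance" group above, `picked` holds one of the three confidence-5 rows, chosen at random.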

Combining Pandas Subplots into a Single Figure

I'm having trouble understanding Pandas subplots - and how to create axes so that all subplots are shown (not overwritten by subsequent plots).
For each "Site", I want to make a time-series plot of all columns in the dataframe.
The "Sites" here are 'shark' and 'unicorn', both with 2 variables. The output should be 4 plotted lines - the time-indexed plots for Var1 and Var2 at each site.
Make Time-Indexed Data with Nans:
df = pd.DataFrame({
    # some ways to create random data
    'Var1': np.random.randn(100),
    'Var2': np.random.randn(100),
    'Site': np.random.choice(['unicorn', 'shark'], 100),
    # a date range and set of random dates
    'Date': pd.date_range('1/1/2011', periods=100, freq='D'),
    # 'f': np.random.choice(pd.date_range('1/1/2011', periods=365,
    #                                     freq='D'), 100, replace=False)
})
df.set_index('Date', inplace=True)
df['Var2'] = df.Var2.cumsum()
df.loc['2011-01-31':'2011-04-01', 'Var1'] = np.nan
Make a figure with a sub-plot for each site:
fig, ax = plt.subplots(len(df.Site.unique()), 1)
counter = 0
for site in df.Site.unique():
    print(site)
    sitedat = df[df.Site == site]
    sitedat.plot(subplots=True, ax=ax[counter], sharex=True)
    ax[0].title = site  # Set title of the plot to the name of the site
    counter = counter + 1
plt.show()
However, this is not working as written: the second sub-plot ends up overwriting the first. In my actual use case, I have 14 sites in each dataframe, as well as a variable number of columns ('Var1', 'Var2', ...). Thus, I need a solution that does not require creating each axis (ax0, ax1, ...) by hand.
As a bonus, I would love a title of each 'site' above that set of plots.
The current code over-writes the first 'Site' plot with the second. What I missing with the axes here?!
When you are using DataFrame.plot(..., subplots=True) you need to provide the correct number of axes that will be used for each column (and with the right geometry, if using layout=). In your example, you have 2 columns, so plot() needs two axes, but you are only passing one in ax=, therefore pandas has no choice but to delete all the axes and create the appropriate number of axes itself.
Therefore, you need to pass an array of axes of length corresponding to the number of columns you have in your dataframe.
# the grouper function is from itertools' cookbook
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

fig, axs = plt.subplots(len(df.Site.unique()) * (len(df.columns) - 1), 1, sharex=True)
for (site, sitedat), axList in zip(df.groupby('Site'), grouper(axs, len(df.columns) - 1)):
    sitedat.plot(subplots=True, ax=axList)
    axList[0].set_title(site)
plt.tight_layout()
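If the grouper recipe feels heavy, the same layout can be built as a 2-D axes grid - one row per site, one column per variable - and indexed by row. A sketch using the question's column names (data made up here):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({'Var1': rng.standard_normal(100),
                   'Var2': rng.standard_normal(100),
                   'Site': rng.choice(['unicorn', 'shark'], 100)},
                  index=pd.date_range('1/1/2011', periods=100, freq='D'))

sites = df['Site'].unique()
value_cols = [c for c in df.columns if c != 'Site']
# one row of axes per site, one column per variable
fig, axs = plt.subplots(len(sites), len(value_cols), sharex=True, squeeze=False)
for row, site in enumerate(sites):
    df.loc[df['Site'] == site, value_cols].plot(subplots=True, ax=axs[row],
                                                legend=False)
    axs[row][0].set_title(site)
```

`squeeze=False` keeps `axs` two-dimensional even for a single site, so `axs[row]` is always a valid row of axes to hand to pandas.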

Pandas: Update information based on bins / cutting

I'm working on a dataset which has a large amount of missing information.
I understand I could use fillna, but I'd like to base my updates on the binned values of another column.
Selection of missing data:
missing = train[train['field'].isnull()]
Bin the data (this works correctly):
filter_values = [0, 42, 63, 96, 118, 160]
labels = [1,2,3,4,5]
out = pd.cut(missing['field2'], bins = filter_values, labels=labels)
counts = pd.value_counts(out)
print(counts)
Now, based on the bin assignments, I would like to set the corresponding bin label as the value of train['field'] for all rows assigned to that bin.
IIUC:
You just need to fillna
train['field'] = train['field'].fillna(out)

How to ensure you get a label for zero counts in pandas pd.cut

I am analyzing a DataFrame and getting timing counts which I want to put into specific buckets (0-10 seconds, 10-30 seconds, etc).
Here is a simplified example:
import pandas as pd
filter_values = [0, 10, 20, 30] # Bucket Values for pd.cut
#Sample Times
df1 = pd.DataFrame([1, 3, 8, 20], columns = ['filtercol'])
#Use cut to get counts for each bucket
out = pd.cut(df1.filtercol, bins = filter_values)
counts = pd.value_counts(out)
print(counts)
The above prints:
(0, 10] 3
(10, 20] 1
dtype: int64
You will notice it does not show any values for (20, 30]. This is a problem because I want to put this into my output as zero. I can handle it using the following code:
bucket1 = bucket2 = bucket3 = 0
if '(0, 10]' in counts:
    bucket1 = counts['(0, 10]']
if '(10, 20]' in counts:
    bucket2 = counts['(10, 20]']
if '(20, 30]' in counts:
    bucket3 = counts['(20, 30]']
print(bucket1, bucket2, bucket3)
But I want a simpler cleaner approach where I can use:
print(counts['(0, 10]'], counts['(10, 20]'], counts['(20, 30]'])
Ideally where the print is based on the values in filter_values so they are only in one place in the code. Yes I know I can change the print to use filter_values[0]...
Lastly when using cut is there a way to specify infinity so the last bucket is all values greater than say 60?
Cheers,
Stephen
You can reindex by the categorical's categories:
In [11]: pd.value_counts(out).reindex(out.cat.categories, fill_value=0)
Out[11]:
(0, 10] 3
(10, 20] 1
(20, 30] 0
dtype: int64
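As for the final question: `pd.cut` happily accepts `np.inf` as the last edge, so a catch-all "greater than 60" bucket is just another bin boundary. A quick sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 25, 61, 1000])
out = pd.cut(s, bins=[0, 10, 20, 30, 60, np.inf])
# value_counts on a categorical keeps zero-count bins;
# the (60.0, inf] bucket catches both 61 and 1000
counts = out.value_counts().sort_index()
```

Because the result is categorical, `value_counts` reports every bin, including the empty (10, 20] and (30, 60] ones, which also answers the zero-counts part of the question.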