Pandas: Creating a histogram from string counts

I need to create a histogram from a dataframe column that contains the values 'Low', 'Medium', or 'High'. When I try to do the usual df.column.hist(), I get the following error.
ex3.Severity.value_counts()
Out[85]:
Low 230
Medium 21
High 16
dtype: int64
ex3.Severity.hist()
TypeError Traceback (most recent call last)
<ipython-input-86-7c7023aec2e2> in <module>()
----> 1 ex3.Severity.hist()
C:\Users\C06025A\Anaconda\lib\site-packages\pandas\tools\plotting.py in hist_series(self, by, ax, grid, xlabelsize, xrot, ylabelsize, yrot, figsize, bins, **kwds)
2570 values = self.dropna().values
2571
->2572 ax.hist(values, bins=bins, **kwds)
2573 ax.grid(grid)
2574 axes = np.array([ax])
C:\Users\C06025A\Anaconda\lib\site-packages\matplotlib\axes\_axes.py in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
5620 for xi in x:
5621 if len(xi) > 0:
->5622 xmin = min(xmin, xi.min())
5623 xmax = max(xmax, xi.max())
5624 bin_range = (xmin, xmax)
TypeError: unorderable types: str() < float()

ex3.Severity.value_counts().plot(kind='bar')
Is what you actually want.
When you do:
ex3.Severity.value_counts().hist()
it gets the axes the wrong way round i.e. it tries to partition your y axis (counts) into bins, and then plots the number of string labels in each bin.
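A minimal sketch of the bar-chart approach, using made-up data with the same counts as in the question. The `ex3` frame here is a stand-in, and the explicit `reindex` to a Low/Medium/High order is my own addition, not something the question requires:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the asker's data: 230 Low, 21 Medium, 16 High
ex3 = pd.DataFrame({"Severity": ["Low"] * 230 + ["Medium"] * 21 + ["High"] * 16})

# Count each label first, then fix the bar order explicitly
counts = ex3["Severity"].value_counts().reindex(["Low", "Medium", "High"])

# One bar per label; the bar height is the count for that label
ax = counts.plot(kind="bar")
ax.set_ylabel("count")
```

In an interactive session you would follow this with `plt.show()`.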

Just an updated answer (as this comes up a lot.) Pandas has a nice module for styling dataframes in many ways, such as the case mentioned above....
ex3.Severity.value_counts().to_frame().style.bar()
...will print the dataframe with bars built in (like sparklines, in Excel terminology). This is nice for quick analysis in Jupyter notebooks.
See the pandas styling docs.

This is a matplotlib issue: hist() cannot order the string values. However, you can achieve the desired result by drawing bars at integer positions and labeling the x-ticks:
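import matplotlib.pyplot as plt
import pandas as pd
# emulate your ex3.Severity.value_counts()
data = {'Low': 2, 'Medium': 4, 'High': 5}
df = pd.Series(data)
# draw bars at integer positions, then label the ticks with the strings
plt.bar(range(len(df)), df.values, align='center')
plt.xticks(range(len(df)), df.index.values, size='small')
plt.show()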

You assumed that, because your data is composed of strings, calling hist() on it would automatically perform the value_counts(). This is not the case, hence the error. You need to count the values first and then plot the counts:
ex3.Severity.value_counts().plot(kind='bar')

Related

Matplotlib issue match # of values to # of labels -- ValueError: 'label' must be of length 'x'

I have a df called high that looks like this:
white black asian native NH_PI latin
0 10239 26907 1079 670 80 1101
I'm trying to create a simple pie chart with matplotlib. I've looked at multiple examples and other SO pages like this one, but I keep getting this error:
Traceback (most recent call last):
File "I:\Sustainability & Resilience\Food Policy\Interns\Lara Haase\data_exploration.py", line 62, in <module>
plt.pie(sizes, explode=None, labels = high.columns, autopct='%1.1f%%', shadow=True, startangle=140)
File "C:\Python27\ArcGIS10.6\lib\site-packages\matplotlib\pyplot.py", line 3136, in pie
frame=frame, data=data)
File "C:\Python27\ArcGIS10.6\lib\site-packages\matplotlib\__init__.py", line 1819, in inner
return func(ax, *args, **kwargs)
File "C:\Python27\ArcGIS10.6\lib\site-packages\matplotlib\axes\_axes.py", line 2517, in pie
raise ValueError("'label' must be of length 'x'")
ValueError: 'label' must be of length 'x'
I've tried multiple different ways to make sure the labels and values match up. There are 6 of each, but I can't understand why Python disagrees with me.
Here is one way I've tried:
plt.pie(high.values, explode=None, labels = high.columns, autopct='%1.1f%%', shadow=True, startangle=140)
And another way:
labels = list(high.columns)
sizes = list(high.values)
plt.pie(sizes, explode=None, labels = labels, autopct='%1.1f%%', shadow=True, startangle=140)
I have also tried with .loc:
labels = list(high.columns)
sizes = high.loc[[0]]
print(labels)
print(sizes)
plt.pie(sizes, explode=None, labels = labels, autopct='%1.1f%%', shadow=True, startangle=140)
But no matter what I've tried, I keep getting that same error. Any thoughts?
Just to expand on @ScottBoston's post,
Plotting a pie chart from a data frame with one row is not possible unless you reshape the data into a single column or series.
An operation I typically use is .stack(),
df = df.stack()
.stack() is very similar to .T, but returns a series with the column names as a second index level. This is handy when you have multiple rows and want to retain the original indexing. The result of df.stack() is:
0 white 10239
black 26907
asian 1079
native 670
NH_PI 80
latin 1101
dtype: int64
After I stack() a data frame, I typically assign a name to a series using:
df.name = 'Race'
Setting a name is not required, but helps when you are actually trying to plot the data using pd.DataFrame.plot.pie.
If the data frame df had more than one row of data, you could then plot a pie chart for each row using .groupby:
for name, group in df.groupby(level=0):
    group.index = group.index.droplevel(0)
    group.plot.pie(autopct='%1.1f%%', shadow=True, startangle=140)
Since the first level of the index only provides the positional index from the input data, I drop that level to make the labels on the plot appear as desired.
If you don't want to use pandas to make the pie chart, this worked for me:
plt.pie(df.squeeze().values, labels=df.columns.tolist(),autopct='%1.1f%%', shadow=True, startangle=140)
This attempt didn't work because high.values is a 2-D array of shape (1, 6), so matplotlib sees a single entry in x but six labels.
#attempt 1
plt.pie(high.values, explode=None, labels = high.columns, autopct='%1.1f%%', shadow=True, startangle=140)
This attempt didn't work because list(high.values) returns a list with an array as the first element.
#attempt 2
labels = list(high.columns)
sizes = list(high.values)
plt.pie(sizes, explode=None, labels = labels, autopct='%1.1f%%', shadow=True, startangle=140)
The last attempt didn't work because high.loc[[0]] returns a dataframe. Matplotlib does not know how to parse a dataframe as an input.
#attempt 3
labels = list(high.columns)
sizes = high.loc[[0]]
print(labels)
print(sizes)
plt.pie(sizes, explode=None, labels = labels, autopct='%1.1f%%', shadow=True, startangle=140)
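To see why the lengths disagree, it may help to look at the shapes directly. The sketch below rebuilds `high` from the numbers in the question and flattens the single row before plotting. Using `ravel()` is just one way to do this, chosen here for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Rebuild the one-row frame from the question
high = pd.DataFrame(
    [[10239, 26907, 1079, 670, 80, 1101]],
    columns=["white", "black", "asian", "native", "NH_PI", "latin"],
)

print(high.values.shape)       # 2-D: one row, six columns
sizes = high.values.ravel()    # flatten to a 1-D array of six values
labels = high.columns.tolist()
print(sizes.shape, len(labels))

fig, ax = plt.subplots()
# Six values and six labels, so the length check passes
wedges, texts, autotexts = ax.pie(sizes, labels=labels, autopct="%1.1f%%", startangle=140)
```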
You can try this, using pandas dataframe plot:
df.T.plot.pie(y=0, autopct='%1.1f%%', shadow=True, startangle=140, figsize=(10,8), legend=False)

How to show following data with colors and color bar. What will be suitable command for this? [duplicate]

I want to make a scatterplot (using matplotlib) where the points are shaded according to a third variable. I've got very close with this:
plt.scatter(w, M, c=p, marker='s')
where w and M are the data points and p is the variable I want to shade with respect to.
However I want to do it in greyscale rather than colour. Can anyone help?
There's no need to manually set the colors. Instead, specify a grayscale colormap...
import numpy as np
import matplotlib.pyplot as plt
# Generate data...
x = np.random.random(10)
y = np.random.random(10)
# Plot...
plt.scatter(x, y, c=y, s=500) # s is the marker size
plt.gray()
plt.show()
Or, if you'd prefer a wider range of colormaps, you can also specify the cmap kwarg to scatter. There are several different grayscale colormaps pre-made (e.g. gray, gist_yarg, binary, etc.). To use the reversed version of any of them, just specify its "_r" variant, e.g. gray_r instead of gray.
import matplotlib.pyplot as plt
import numpy as np
# Generate data...
x = np.random.random(10)
y = np.random.random(10)
plt.scatter(x, y, c=y, s=500, cmap='gray')
plt.show()
In matplotlib, grey colors can be given as a string holding a numerical value between 0 and 1.
For example, c = '0.1'.
You can then convert your third variable to a value inside this range and use it to color your points.
In the following example I used the y position of the point as the value that determines the color:
from matplotlib import pyplot as plt
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [125, 32, 54, 253, 67, 87, 233, 56, 67]
color = [str(item/255.) for item in y]
plt.scatter(x, y, s=500, c=color)
plt.show()
Sometimes you may need to set the color of each point based on a categorical x-value. For example, you may have a dataframe with 3 types of variables and some data points, and you want to do the following:
Plot points corresponding to Physical variable 'A' in RED.
Plot points corresponding to Physical variable 'B' in BLUE.
Plot points corresponding to Physical variable 'C' in GREEN.
In this case, you may have to write a short function that maps the x-values to the corresponding color names as a list, and then pass that list to the plt.scatter command.
x=['A','B','B','C','A','B']
y=[15,30,25,18,22,13]
# Function to map the colors as a list from the input list of x variables
def pltcolor(lst):
    cols=[]
    for l in lst:
        if l=='A':
            cols.append('red')
        elif l=='B':
            cols.append('blue')
        else:
            cols.append('green')
    return cols
# Create the colors list using the function above
cols=pltcolor(x)
plt.scatter(x=x,y=y,s=500,c=cols) #Pass on the list created by the function here
plt.grid(True)
plt.show()
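A pretty straightforward solution is also this one (the 'cmo.deep' colormap comes from the cmocean package, which must be imported first; any built-in colormap name works the same way):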
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8,8))
p = ax.scatter(x, y, c=y, cmap='cmo.deep')
fig.colorbar(p,ax=ax,orientation='vertical',label='labelname')

TypeError: scatter() got multiple values for argument 'c'

I am trying to do hierarchical clustering on my MFCC array 'signal_mfcc', which is an ndarray with dimensions (198, 12). I take that to mean 198 audio frames (observations) and 12 coefficients (dimensions).
I am using a random threshold of '250' with 'distance' for the criterion as shown below:
thresh = 250
print(signal_mfcc.shape)
clusters = hcluster.fclusterdata(signal_mfcc, thresh, criterion="distance")
With the specified threshold, the output variable 'clusters' is a sequence [1 1 1 ... 1] of length 198, i.e. shape (198,), which I assume means all the data points are assigned to a single cluster.
Then, I am using pyplot to plot scatter() with the following code:
# plotting
print(*(signal_mfcc.T).shape)
plt.scatter(*np.transpose(signal_mfcc), c=clusters)
plt.axis("equal")
title = "threshold: %f, number of clusters: %d" % (thresh, len(set(clusters)))
plt.title(title)
plt.show()
The output is:
plt.scatter(*np.transpose(signal_mfcc), c=clusters)
TypeError: scatter() got multiple values for argument 'c'
The scatter plot does not show. Any clues as to what may have gone wrong?
Thanks in advance!
From this SO thread, you can see why you have this error.
From the scatter documentation, c is the 2nd optional argument and the 4th argument in total. This error means that unpacking np.transpose(signal_mfcc) yields at least 4 items (12 here, one per coefficient), so the 4th one is already bound to c. Since you also pass c as a keyword, it is defined twice and Python cannot choose which one is correct.
Example :
def temp(n, c=0):
    pass
temp(*[1, 2], c=1)
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# TypeError: temp() got multiple values for argument 'c'
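One possible fix, sketched with random stand-in data rather than real MFCCs: pass exactly two coefficient columns as x and y, so nothing spills into the later positional parameters. Which two coefficients to plot is an arbitrary choice here, and the cluster labels are made up instead of coming from fclusterdata:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
signal_mfcc = rng.normal(size=(198, 12))  # stand-in for the real (198, 12) MFCC array
clusters = rng.integers(1, 4, size=198)   # hypothetical cluster labels, one per frame

fig, ax = plt.subplots()
# Only two columns are passed positionally, so c is supplied exactly once
points = ax.scatter(signal_mfcc[:, 0], signal_mfcc[:, 1], c=clusters)
ax.axis("equal")
ax.set_title("number of clusters: %d" % len(set(clusters)))
```

To look at all 12 dimensions at once you would need some projection down to 2-D first; that choice is outside what the error itself is about.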

Matplotlib how to divide an histogram by a constant number

I would like to apply a custom normalization to histograms in matplotlib. In particular, I have two histograms and I would like to divide each of them by a given number (the number of generated events).
I don't want the "normal" normalization, because that makes the area equal to 1. What I want is basically to divide the value of each bin by a given number N, so that if my histogram has 2 bins, one with 5 entries and one with 3, the resulting "normalized" (or "divided") histogram would have 5/N entries in the first bin and 3/N in the second.
I searched far and wide and found nothing really helpful. Do you have any handy solution? This is my code, working with pandas:
num_bins = 128
list_1 = dataframe_1['E']
list_2 = dataframe_2['E']
fig, ax = plt.subplots()
ax.set_xlabel('Proton energy [MeV]')
ax.set_ylabel('Normalized frequency')
ax.set_title('Proton energy distribution')
n, bins, patches = ax.hist(list_1, num_bins, density=1, alpha=0.5, color='red', ec='red', label='label_1')
n, bins, patches = ax.hist(list_2, num_bins, density=1, alpha=0.5, color='blue', ec='blue', label='label_2')
plt.legend(loc='upper center', fontsize='x-large')
fig.savefig('NiceTitle.pdf')
plt.close('all')
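One approach that may fit here is the weights argument of hist: giving every entry a weight of 1/N makes each bin hold (count)/N instead of the raw count. A minimal sketch with made-up energies and a made-up N (neither comes from the question; `energies` and `n_generated` are hypothetical names):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
energies = rng.uniform(0.0, 10.0, size=800)  # stand-in for dataframe_1['E']
n_generated = 1000                           # hypothetical number of generated events
num_bins = 128

fig, ax = plt.subplots()
# Each entry contributes 1/N, so a bin with k entries has height k/N
weights = np.full(len(energies), 1.0 / n_generated)
n, bins, patches = ax.hist(energies, num_bins, weights=weights,
                           alpha=0.5, color='red', ec='red', label='label_1')
ax.set_ylabel('Entries / N')
```

Note that density is left at its default here, since combining it with weights would rescale the area back to 1.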

Pandas dataframe - multiplying DF's elementwise on same dates - something wrong?

I've been banging my head over this, I just cannot seem to get it right and I don't understand what is the problem... So I tried to do the following:
#!/usr/bin/env python
import matplotlib.pyplot as plt
import numpy as np
import quandl
btc_usd_price_kraken = quandl.get('BCHARTS/KRAKENUSD', returns="pandas")
btc_usd_price_kraken.replace(0, np.nan, inplace=True)
plt.plot(btc_usd_price_kraken.index, btc_usd_price_kraken['Weighted Price'])
plt.grid(True)
plt.title("btc_usd_price_kraken")
plt.show()
eur_usd_price = quandl.get('BUNDESBANK/BBEX3_D_USD_EUR_BB_AC_000', returns="pandas")
eur_dkk_price = quandl.get('ECB/EURDKK', returns="pandas")
usd_dkk_price = eur_dkk_price / eur_usd_price
btc_dkk = btc_usd_price_kraken['Weighted Price'] * usd_dkk_price
plt.plot(btc_dkk.index, btc_dkk) # WHY IS THIS [4785 rows x 1340 columns] ???
plt.grid(True)
plt.title("Historic value of 1 BTC converted to DKK")
plt.show()
As you can see in the comment, I don't understand why I get a result (which I'm trying to plot) that has size: [4785 rows x 1340 columns] ?
Anyway, the code results in a lot of error messages, something like e.g.
> Traceback (most recent call last): File
> "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_qt5agg.py",
> line 197, in __draw_idle_agg
> FigureCanvasAgg.draw(self) File "/usr/lib/python3.6/site-packages/matplotlib/backends/backend_agg.py",
...
> return _from_ordinalf(x, tz) File "/usr/lib/python3.6/site-packages/matplotlib/dates.py", line 254, in
> _from_ordinalf
> dt = datetime.datetime.fromordinal(ix).replace(tzinfo=UTC) ValueError: ordinal must be >= 1
I have read some posts, and I know that when multiplying, pandas automatically does an elementwise multiplication only on data pairs where the date is the same. So if one DF has a time series for e.g. 1999-2017 and the other only for e.g. 2012-2015, only the common dates in 2012-2015 will be multiplied, i.e. the intersection of the two data sets. So this question is about understanding the error message(s) and finding a solution. The whole problem is related to calculating the btc_dkk variable and plotting it (which is the price of Bitcoin in the currency DKK).
This should work:
usd_dkk_price.multiply(btc_usd_price_kraken['Weighted Price'], axis='index').dropna()
You are multiplying on columns, not on the index. This happens because you are multiplying a dataframe by a series; had you selected the single column of usd_dkk_price, so that both operands were series, it would not have happened. Afterwards, just drop the rows with NaN.
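The alignment behaviour can be reproduced with two tiny made-up objects. The labels and numbers below are invented, the column name `rate` is hypothetical, and plain string labels stand in for the real dates to keep the sketch simple:

```python
import pandas as pd

# Five "dates" for the exchange rate, only two of them for the BTC price
days_long = ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04", "2017-01-05"]
days_short = ["2017-01-03", "2017-01-04"]

rates = pd.DataFrame({"rate": [6.5, 7.0, 7.5, 8.0, 8.5]}, index=days_long)
price = pd.Series([1000.0, 1100.0], index=days_short, name="Weighted Price")

# DataFrame * Series matches the Series index against the DataFrame's COLUMNS,
# so the result gains one column per date label and contains only NaN
wrong = rates * price
print(wrong.shape)

# axis='index' matches on the row labels instead; dropna keeps the common dates
right = rates.multiply(price, axis="index").dropna()
print(right)
```

This is the same effect as in the question, where the product grew to one column per date instead of staying a single column.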