Using pd.cut to create bins for a graph, but bin values are not coming out as expected - pandas

Here is the code I'm running:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
titanic = sns.load_dataset("titanic")
y =titanic.groupby([titanic.fare//1,'sex']).survived.mean().reset_index() #grouping by 'fare' rounded to an integer and 'sex' and then getting the survivability
x =pd.cut(y.fare, (0,17,35,70,300,515)) #I'm not sure if my format is correct but this is how I cut up the fare values
y['Fare_bins']= x # adding the newly created bins to a new column "Fare_bins' in original dataframe.
#graphing with seaborn
sns.set(style="whitegrid")
# factorplot and its size argument were renamed to catplot and height in seaborn 0.9
g = sns.catplot(x='Fare_bins', y='survived', col='sex', kind='bar', data=y,
height=4, aspect=2.5, palette="muted")
g.despine(left=True)
g.set_ylabels("Survival Probability")
g.set_xlabels('Fare')
plt.show()
The problem I'm having is that Fare_values are showing up as (0,17].
The left side is a circle bracket and the right side is square bracket.
If possible I would like to have something like this:
(0-17) or [0-17]
Next, there seems to be a gap between the bars. I was expecting them to be adjoined. Two graphs are shown, so I don't expect all of the bars to be adjoined, but the first 5 bars (first graph) should touch each other, and so should the last 5 bars (second graph).
How can I go about fixing these two issues?

It seems I can add labels.
Just by adding labels to the "cut" method parameters, I can display the Fare_values as I want.
x =pd.cut(y.fare, (0,17,35,70,300,515), labels = ('(0-17)', '(17-35)', '(35-70)', '(70-300)','(300-515)') )
As for the brackets showing around the fare_value groups,
according to the documentation:
right : bool, optional
Indicates whether the bins include the rightmost edge or not. If right == True (the default), then the bins [1,2,3,4] indicate (1,2], (2,3], (3,4].
Still not sure if it's possible to join the bars though.
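To make the bracket behaviour concrete, here is a minimal sketch on a handful of made-up fare values (the sample numbers are my own, not taken from the titanic dataset). It compares the default right-closed bins, the same bins with custom labels, and left-closed bins obtained with right=False.

```python
import pandas as pd

# Made-up fares, chosen so that one value (17.0) sits exactly on a bin edge.
fares = pd.Series([5.0, 17.0, 20.0, 69.9, 120.0, 512.0])
edges = (0, 17, 35, 70, 300, 515)
labels = ['(0-17)', '(17-35)', '(35-70)', '(70-300)', '(300-515)']

# Default (right=True): bins are (0, 17], (17, 35], ... so 17.0 lands in the first bin.
default_bins = pd.cut(fares, edges)

# Same binning, but each bin is displayed with a custom label.
labelled = pd.cut(fares, edges, labels=labels)

# right=False: bins are [0, 17), [17, 35), ... so 17.0 moves to the second bin.
left_closed = pd.cut(fares, edges, right=False)
```

Note that the labels only change how a bin is displayed. Which values fall on which side of an edge is still decided by right.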

Related

How to overlay hatches on shapefile with condition?

I've been trying to plot hatches (like this pattern, "//") on polygons of a shapefile, based on a condition: every polygon whose "Sig" value is greater than or equal to 0.05 should get a hatch pattern. Unfortunately the resulting map doesn't meet my requirements.
So I first plot the "AMOTL" variable and then want to plot the hatches (variable Sig) on top of it wherever the value is greater than or equal to 0.05. I have used the following code:
import geopandas as gpd
import contextily as ctx
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.ticker as ticker
from matplotlib.patches import Ellipse, Polygon
data = gpd.read_file("mapsignif.shp")
Sig = data.loc[data["Sig"].ge(0.05)]
data.loc[data["AMOTL"].eq(0), "AMOTL"] = np.nan
ax = data.plot(
figsize=(12, 10),
column="AMOTL",
legend=True,
cmap="bwr",
vmin = -1,
vmax= 1,
missing_kwds={"color":"white"},
)
Sig.plot(
ax=ax,
hatch='//'
)
map = Basemap(
llcrnrlon=-50,
llcrnrlat=30,
urcrnrlon=50.0,
urcrnrlat=85.0,
resolution="i",
lat_0=39.5,
lon_0=1,
)
map.fillcontinents(color="lightgreen")
map.drawcoastlines()
map.drawparallels(np.arange(10,90,20),labels=[1,1,1,1])
map.drawmeridians(np.arange(-180,180,30),labels=[1,1,0,1])
Now the problem is that my original image (on which I want to plot the hatches) is different from the image resulting from the above code:
Original Image -
Resultant image from above code:
I basically want to plot hatches on that first image. This topic is similar to correlation plots where you have places with hatches (if the p-value is greater than 0.05). The first image plots the correlation variable and some of them are significant (defined by Sig). So I want to plot the Sig variable on top of the AMOTL. I've tried variations of the code, but still can't get through.
Would be grateful for some assistance... Here's my file - https://drive.google.com/file/d/10LPNjBtQMdQMw6XmXdJEg6Uq4icx_LD6/view?usp=sharing
I’d bet this is the culprit:
data.loc[data["Sig"].ge(0.05), "Sig"].plot(
column="Sig", hatch='//'
)
In this line, you’re selecting only the 'Sig' column, eliminating all spatial data in the 'geometry' column and returning a pandas.Series instead of a geopandas.GeoDataFrame. In order to plot a data column using the geometries column for your shapes you must maintain at least both of those columns in the object you call .plot on.
So instead, don’t select the column:
data.loc[data["Sig"].ge(0.05)].plot(
column="Sig", hatch='//'
)
You are already telling geopandas to plot the "Sig" column by using the column argument to .plot - no need to limit the actual data too.
Also, when overlaying a plot on an existing axis, be sure to pass in the axis object:
data.loc[data["Sig"].ge(0.05)].plot(
column="Sig", hatch='//', ax=ax
)
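The indexing rule behind this answer can be checked with plain pandas. The sketch below uses an ordinary DataFrame as a stand-in for the GeoDataFrame, with placeholder strings in the "geometry" column (all values are invented for illustration). It is only meant to show which columns survive each selection, not to draw anything.

```python
import pandas as pd

# Stand-in for a GeoDataFrame; the "geometry" entries are placeholder strings.
data = pd.DataFrame({
    "Sig": [0.01, 0.20, 0.07],
    "AMOTL": [0.5, -0.3, 0.1],
    "geometry": ["poly_a", "poly_b", "poly_c"],
})
mask = data["Sig"].ge(0.05)

# Selecting rows AND a single column returns a Series: the geometry is gone.
only_sig = data.loc[mask, "Sig"]

# Selecting rows only keeps every column, including the geometry.
with_geometry = data.loc[mask]
```

With a real GeoDataFrame the second form is the one to call .plot(ax=ax, hatch='//') on, since it still carries the shapes.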

How to plot a line that is partially colorized?

As shown in the figure,
How can I plot a line that has different colors based on a specific value of x?
The simplest solution here may be to slice your data at the corresponding index of x_lim found by np.where :
from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0,2*np.pi,100)
y = np.cos(x)*np.exp(-x/2)
# specify your x limitation
x_lim = np.pi
# find the first index where the condition x >= x_lim holds
x_lim_idx = np.where(x>=x_lim)[0][0]
# plot sliced data
plt.plot(x[:x_lim_idx],y[:x_lim_idx],'r')
plt.plot(x[x_lim_idx:],y[x_lim_idx:],'b')
which gives for x_lim = np.pi :
And if the remaining gap between the two lines bothers you, for instance when x is coarsely discretized, you can close it by making the two slices overlap.
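A minimal sketch of the overlapping slices, assuming the same x, y and x_lim as above: the first slice is extended by one sample so that both segments share the point at the split index.

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
y = np.cos(x) * np.exp(-x / 2)
x_lim = np.pi

# First index where x >= x_lim holds.
idx = np.where(x >= x_lim)[0][0]

# The left slice runs up to and including idx, the right slice starts at idx,
# so the two segments meet at the same point and no gap is drawn.
x_left, y_left = x[:idx + 1], y[:idx + 1]
x_right, y_right = x[idx:], y[idx:]
```

Plotting then works as before, for example plt.plot(x_left, y_left, 'r') followed by plt.plot(x_right, y_right, 'b').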

How to plot Series with selective ticks?

I have a Series that I would like to plot as a bar chart: pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts()
Since I have many bars I only want to display some (equidistant) ticks.
However, unless I actively work against it, pyplot will print the wrong labels. E.g. if I leave out set_xticklabels in the code below I get
where every element from the index is taken and just displayed with the specified distance.
This code does what I want:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
s = pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts().sort_index()
mi,ma = min(s.index), max(s.index)
s = s.reindex(range(mi,ma+1,1), fill_value=0)
distance = 10
a = s.plot(kind='bar')
condition = lambda t: int(t[1].get_text()) % distance == 0
ticks_,labels_=zip(*filter(condition, zip(a.get_xticks(), a.get_xticklabels())))
a.set_xticks(ticks_)
a.set_xticklabels(labels_)
plt.show()
But I still feel like I'm being unnecessarily clever here. Am I missing a function? Is this the best way of doing that?
Consider not using a pandas bar plot if you intend to plot numeric values, because pandas bar plots are categorical in nature.
A matplotlib bar plot is numeric in nature, so with it there is no need to tinker with any ticks at all.
s = pd.Series([-4,2, 3,3, 4,5,9,20]).value_counts().sort_index()
plt.bar(s.index, s)
I think you overcomplicated it. You can simply use the following. You just need to find the relationship between the ticks and the ticklabels.
a = s.plot(kind='bar')
xticks = np.arange(0, ma + 1, distance)  # labels to show: 0, 10, 20, ...
plt.xticks(xticks - mi, xticks)  # the bar for index value v sits at position v - mi
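The relationship between ticks and tick labels can be worked out without plotting anything. This sketch assumes the reindexed Series from the question, so a pandas bar plot draws the bar for index value v at position v - mi. The names first, labels and positions are my own.

```python
import numpy as np
import pandas as pd

s = pd.Series([-4, 2, 3, 3, 4, 5, 9, 20]).value_counts().sort_index()
mi, ma = s.index.min(), s.index.max()
s = s.reindex(range(mi, ma + 1), fill_value=0)

distance = 10
# Smallest multiple of `distance` that is >= mi, then every multiple up to ma.
first = int(np.ceil(mi / distance)) * distance
labels = np.arange(first, ma + 1, distance)

# Categorical bar positions are 0, 1, 2, ... so value v is drawn at v - mi.
positions = labels - mi
```

These arrays can then be passed to a.set_xticks(positions) and a.set_xticklabels(labels) on the Axes returned by s.plot(kind='bar').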

Get real range in colormap with LogLocator

The following code
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import ticker
n = 50
A = np.tile(np.linspace(-26,-2,n),(n,1))
plt.figure()
plt.contourf(A)
plt.colorbar()
B = np.tile(np.logspace(-26,-2,n),(n,1))
plt.figure()
plt.contourf(B,locator=ticker.LogLocator())
plt.colorbar()
plt.show()
produces these two plots:
For the linear case (first image), every color in the colorbar is present in the image, and the min and max values of A lie respectively in the first and last color bin (going bottom to top).
For the log case (second image), the colorbar's min and max values don't make sense to me anymore.
The minimum of B is 10^-26, so this value lies at the border between the first and second color bin of the colormap, but there are none of these two first colors in the image.
The maximum of B is 10^-2, and it lies at the border between the third-to-last and the second-to-last color bins, so it could be considered to be in either.
But then, why is the last (yellow) color bin here, especially since there is no yellow in the image ?
So I find the default behavior of the colormap limits (for the LogLocator) weird because it is not representative of the real (or at least approximate) data range (like in the linear case), and it adds color bins (in this case 3 : 2 below the min, and 1 above the max) that are not present in the image.
Is this a bug or is there something I didn't understand ?
@ImportanceOfBeingErnest's answer below gives the output that I want, but it just feels like I shouldn't have to do this, and that I should be able to expect the same behavior from the LogLocator color mapper as from the colormap with linear values.
If you want to have specific intervals in your contour plot, you need to decide on them and supply them to the contouring function via the levels argument.
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import ticker
n = 50
A = np.tile(np.logspace(-26,-2,n),(n,1))
levels = 10.**np.arange(-26,-1,4)
plt.figure()
plt.contourf(A,levels=levels, locator=ticker.LogLocator())
plt.colorbar()
plt.show()
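If you would rather not hard-code the exponents, the levels can be derived from the data range. This is a sketch under my own assumptions: all values are strictly positive, one level every 4 decades is wanted (the step variable), and the logarithms are rounded before taking floor and ceil to guard against floating-point noise.

```python
import numpy as np

n = 50
B = np.tile(np.logspace(-26, -2, n), (n, 1))

# Decade exponents that bracket the data; rounding first avoids results such as
# -26.000000000000004 pushing the floor one decade too low.
lo = int(np.floor(np.round(np.log10(B.min()), 9)))
hi = int(np.ceil(np.round(np.log10(B.max()), 9)))

step = 4  # assumption: one contour level every 4 decades
levels = 10.0 ** np.arange(lo, hi + 1, step)
```

The resulting array can be passed as levels=levels to plt.contourf, exactly like the hand-written levels above.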

return values of subplot

I am currently trying to get acquainted with the matplotlib.pyplot library. After seeing quite a few examples and tutorials, I noticed that the subplots function also has some return values, which are usually used later on. However, on the matplotlib website I was unable to find any specification of what exactly is returned, and none of the examples are the same (although it usually seems to be an ax object). Can you give me some pointers as to what is returned and how I can use it? Thanks in advance!
The documentation says that matplotlib.pyplot.subplots returns an instance of Figure and either a single Axes or an array of Axes (which one depends on the number of subplots).
Common use is:
import matplotlib.pyplot as plt
import numpy as np
f, axes = plt.subplots(1,2) # 1 row containing 2 subplots.
# Plot random points on one subplots.
axes[0].scatter(np.random.randn(10), np.random.randn(10))
# Plot histogram on the other one.
axes[1].hist(np.random.randn(100))
# Adjust the size and layout through the Figure-object.
f.set_size_inches(10, 5)
f.tight_layout()
Generally, matplotlib.pyplot.subplots() returns a figure instance and either a single Axes object or an array of Axes objects.
Since you haven't posted the code you are experimenting with, I will illustrate it with 2 test cases:
case 1 : when the number of subplots (the grid dimensions) is specified
import matplotlib.pyplot as plt #importing pyplot of matplotlib
import numpy as np
x = [1, 3, 5, 7]
y = [2, 4, 6, 8]
fig, axes = plt.subplots(2, 1)
axes[0].scatter(x, y)
axes[1].boxplot(x)  # boxplot takes the data as its first argument; a second positional argument would be read as notch
plt.tight_layout()
plt.show()
Here we have given the grid dimensions, (2,1) in this case, which means number of rows r = 2 and number of columns c = 1.
In this case, subplots returns the figure instance along with an array of axes whose length equals the total number of subplots, r*c, which is 2 here.
case 2 : when the number of subplots (the grid dimensions) is not specified
import matplotlib.pyplot as plt #importing pyplot of matplotlib
import numpy as np
x = [1, 3, 5, 7]
y = [2, 4, 6, 8]
fig, axes = plt.subplots()
#size has not been mentioned and hence only one subplot
#is returned by the subplots() method, along with an instance of a figure
axes.scatter(x, y)
#axes.boxplot(x, y)
plt.tight_layout()
plt.show()
In this case, no size or dimension has been mentioned explicitly, therefore only one subplot is created, apart from the figure instance.
You can also control the dimensions of the subplots by using the squeeze keyword. See documentation. It is an optional argument, having default value as True.
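A quick way to see what comes back is to inspect the second return value for a few grid shapes. The sketch below selects the non-interactive Agg backend only so that it runs without a display. The variable names are my own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# No dimensions given: a single Axes object, not an array.
fig1, ax_single = plt.subplots()

# One row, two columns: squeezed to a 1-D array of shape (2,).
fig2, ax_row = plt.subplots(1, 2)

# Two rows, two columns: a 2-D array of shape (2, 2).
fig3, ax_grid = plt.subplots(2, 2)

# squeeze=False: always a 2-D array, even for a single subplot.
fig4, ax_always_2d = plt.subplots(squeeze=False)

plt.close("all")
```

With squeeze=False the same indexing, axes[row, col], works for every grid size, which can be convenient in code that handles a variable number of subplots.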
'matplotlib.pyplot.subplots()' returns two objects:
The figure instance.
The 'axes'.
'matplotlib.pyplot.subplots()' takes many arguments, as shown below:
matplotlib.pyplot.subplots(nrows=1, ncols=1, *, sharex=False, sharey=False, squeeze=True, subplot_kw=None, gridspec_kw=None, **fig_kw)
The first two arguments are nrows, the number of rows in the subplot grid, and ncols, the number of columns in the subplot grid. If 'nrows' and 'ncols' are not declared explicitly, each defaults to 1.
Now, on to the objects that are created:
(1) The figure instance is simply the figure that holds all the plots.
(2) The 'axes' object contains all the information about each subplot.
Let's understand through an example:
Here, 4 subplots are being created at the positions of (0,0),(0,1),(1,0),(1,1).
Now, let's suppose I want a scatterplot at position (0,0). To do that, I draw the scatterplot on the "axes[0,0]" object, which holds all the information about the scatterplot and reflects it in the figure instance.
The same thing will happen for all the other three positions.
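The example this answer refers to is not reproduced here. A minimal sketch consistent with its description (a 2x2 grid with a scatterplot at position (0,0)) could look like the following. The data and the plot types in the other three cells are my own choices, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

x = [1, 3, 5, 7]
y = [2, 4, 6, 8]

# 2 rows x 2 columns: axes is a 2-D array indexed as axes[row, col].
fig, axes = plt.subplots(nrows=2, ncols=2)

axes[0, 0].scatter(x, y)  # position (0,0): scatterplot
axes[0, 1].plot(x, y)     # position (0,1): line plot
axes[1, 0].bar(x, y)      # position (1,0): bar chart
axes[1, 1].hist(y)        # position (1,1): histogram

fig.tight_layout()
```

Each call draws only on the Axes it is invoked on, while the single fig object holds all four.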
Hope this helps. Let me know your thoughts.