When plotting using matplotlib, I ran into an interesting issue where the y axis is scaled by a very inconvenient quantity. Here's a MWE that demonstrates the problem:
import numpy as np
import matplotlib.pyplot as plt
l = np.linspace(0.5,2,2**10)
a = (0.696*l**2)/(l**2 - 9896.2e-9**2)
plt.plot(l,a)
plt.show()
When I run this, I get a figure that looks like this picture
The y-axis clearly is scaled by a silly quantity even though the y data are all between 1 and 2.
This is similar to the question:
Axis numerical offset in matplotlib
I'm not satisfied with the answer to that question: it makes no sense to me why I need to go through the convoluted process of changing axis settings when the data are between 1 and 2 (EDIT: between 0 and 1). Why does this happen? Why does matplotlib use such a bizarre scaling?
The data in the plot are all between 0.696000000017 and 0.696000000273. For such cases it makes sense to use some kind of offset.
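To see how narrow that range is, the values can be inspected directly. The sketch below uses only NumPy and the same expression as in the question; the names y_min, y_max and span are introduced here for illustration.

```python
import numpy as np

# Same data as in the question
l = np.linspace(0.5, 2, 2**10)
a = (0.696 * l**2) / (l**2 - 9896.2e-9**2)

y_min, y_max = a.min(), a.max()
span = y_max - y_min  # total variation of the y data

print(f"min  = {y_min:.12f}")
print(f"max  = {y_max:.12f}")
print(f"span = {span:.3e}")
```

The variation is many orders of magnitude smaller than the values themselves, so tick labels with only a few digits would all read 0.696. That is the situation in which the default formatter factors out a common offset.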
If you don't want that, you can use your own formatter:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker
l = np.linspace(0.5,2,2**10)
a = (0.696*l**2)/(l**2 - 9896.2e-9**2)
plt.plot(l,a)
fmt = matplotlib.ticker.StrMethodFormatter("{x:.12f}")
plt.gca().yaxis.set_major_formatter(fmt)
plt.show()
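If the goal is only to get rid of the offset text, another option is to switch the offset off on the default formatter instead of replacing it. This is a sketch and not part of the original answer: it assumes the y axis still uses matplotlib's default ScalarFormatter, and the Agg backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

l = np.linspace(0.5, 2, 2**10)
a = (0.696 * l**2) / (l**2 - 9896.2e-9**2)

fig, ax = plt.subplots()
ax.plot(l, a)
# Ask the default ScalarFormatter not to factor out an offset on the y axis
ax.ticklabel_format(axis="y", useOffset=False)
fig.canvas.draw()  # forces the tick labels to be computed
```

With data this tightly clustered the resulting labels may get long, which is the trade-off the offset notation is meant to avoid.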
I managed to make a displot as I intended with seaborn and the only thing I want to change is the bars' outline width. Specifically, I want to make it thinner. Here's the code and a sample of how the dataframe is composed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data_final = pd.merge(data, data_filt)
q = sns.displot(data=data_final[data_final['cond_state'] == True], y='Brand', hue='Style', multiple='stack')
plt.title('Sample of brands and their offering of ramen styles')
I'm specifying that the plot should only use rows where the cond_state is True. Here is a sample of the data_final dataframe.
Here is what the plot currently looks like.
I've tried various approaches published online, but most of them use the deprecated distplot instead of displot. There also doesn't seem to be a parameter for changing the bars' outline width in the seaborn documentation for displot and FacetGrid.
The documentation for seaborn's displot function doesn't list this parameter, but extra keyword arguments are passed through to the underlying matplotlib artists. Passing linewidth=0.25 to seaborn.displot should therefore solve your problem.
The plot below shows the correlation for one column. The problem is that the numbers are not readable, because there are many columns in it.
How can I show only the 5 or 6 most important columns instead of all of them, including those with very low importance?
plt.figure(figsize=(20,3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:].T, annot=True,
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
You can limit the cells shown via .iloc[1:7]. If you also want to show the strongest negative values, you could create a second plot with .iloc[-6:]. To have both together, you could use NumPy's np.r_ index helper and write .iloc[np.r_[1:4, -3:0]].
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(np.random.rand(7, 27), columns=['price'] + [*'abcdefghijklmnopqrstuvwxyz'])
plt.figure(figsize=(20, 3))
sns.heatmap(df.corr()[['price']].sort_values('price', ascending=False).iloc[1:7].T,
annot=True, annot_kws={'rotation':90, 'size': 20},
cmap='Spectral_r', vmax=0.9, vmin=-0.31)
plt.show()
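The np.r_ variant mentioned above is easier to follow on a small example. The Series below is hypothetical and stands in for one sorted column of df.corr().

```python
import numpy as np
import pandas as pd

# Hypothetical correlations, already sorted in descending order
corr = pd.Series(
    [1.0, 0.8, 0.6, 0.4, 0.1, -0.2, -0.5, -0.7],
    index=["price", "a", "b", "c", "d", "e", "f", "g"],
)

idx = np.r_[1:4, -3:0]   # positions 1, 2, 3 followed by the last three
subset = corr.iloc[idx]  # skips 'price' itself, keeps both extremes

print(list(idx))
print(list(subset.index))
```

np.r_ simply concatenates the two slices into one integer array, so .iloc receives the three strongest positive entries after 'price' and the three most negative ones.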
annot can also be a list of labels. Using this, you can define a string matrix that you use to display the desired numbers and set the others to an empty string.
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(0)
import seaborn as sns; sns.set_theme()
import pandas as pd
from string import ascii_letters
# generate random data
rs = np.random.RandomState(33)
df = pd.DataFrame(data=rs.normal(size=(100, 26)),
columns=list(ascii_letters[26:]))
importance_index = 5 # until which idx to hide values
data = df.corr()[['A']].sort_values('A', ascending=False).iloc[1:].T
labels = data.astype(str) # make a str-copy
labels.iloc[0,:importance_index] = ' ' # mask columns that you want to hide
sns.heatmap(data, annot=labels, cmap='Spectral_r', vmax=0.9, vmin=-0.31, fmt='', annot_kws={'rotation':90})
plt.show()
The output on some random data:
This works, but it has its limits, particularly with setting fmt='' (you can no longer use it to conveniently format decimals and need to do that manually). I would also question whether this approach is even the best one to take here. I think consistency in plots is quite important. I would rather evaluate whether we can rotate the heatmap labels (I've included that above) or leave them out completely, since they are technically redundant due to the color-coding. Alternatively, you could plot only the cells with the "important" values.
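Formatting the labels by hand, as mentioned above, could look like the following sketch. It uses only pandas and NumPy; the threshold name and value are assumptions made here, and selecting cells by magnitude rather than by position is a variation on the code above.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(33)
df = pd.DataFrame(rs.normal(size=(100, 6)), columns=list("ABCDEF"))

data = df.corr()[["A"]].sort_values("A", ascending=False).iloc[1:].T

threshold = 0.1  # hypothetical cutoff for "important" correlations

# Format every cell by hand, since fmt='' turns off seaborn's own formatting
labels = data.apply(lambda col: col.map("{:.2f}".format))
# Blank out the labels of cells whose magnitude is below the cutoff
labels = labels.where(data.abs() >= threshold, "")

print(labels)
```

The resulting frame can be passed as annot=labels together with fmt='' in the sns.heatmap call.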
I put multiple heatmaps in one figure with matplotlib, but I cannot get the layout right. Here is my code.
import matplotlib; matplotlib.use('agg')
import matplotlib.pyplot as plt
import numpy as np
x = np.random.rand(6,240,240)
y = np.random.rand(6,240,240)
t = np.random.rand(6,240,240)
plt.subplots_adjust(wspace=0.2, hspace=0.3)
c=1
for i in range(6):
    ax=plt.subplot(6,3,c)
    plt.imshow(x[i])
    ax.set_title("x"+str(i))
    c+=1
    ax=plt.subplot(6,3,c)
    plt.imshow(y[i])
    ax.set_title("y"+str(i))
    c+=1
    ax=plt.subplot(6,3,c)
    plt.imshow(t[i])
    ax.set_title("t"+str(i))
    c+=1
plt.tight_layout()
plt.savefig("test.png")
test.png looks like this.
I want to
make each heatmap bigger
reduce the margin between the heatmaps in each row.
I tried to adjust this with "subplots_adjust", but it doesn't work.
Additional information
Following ImportanceOfBeingErnest's comment, I removed tight_layout(). It generated this.
This makes each heatmap bigger, but the titles overlap the subplots. I would still like to make each heatmap bigger and reduce the margin within each row.
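No resolution is recorded for this question here. One direction to try is sketched below, under the assumption that the small panels come mainly from squeezing six rows into the default figure size. The figsize value and the choice to hide the ticks are guesses to tune, not a confirmed fix.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np

x = np.random.rand(6, 240, 240)
y = np.random.rand(6, 240, 240)
t = np.random.rand(6, 240, 240)

# A tall figure gives each of the 6 rows room; constrained layout
# keeps titles from overlapping neighbouring panels
fig, axes = plt.subplots(6, 3, figsize=(9, 18), constrained_layout=True)

for i in range(6):
    for j, (name, arr) in enumerate((("x", x), ("y", y), ("t", t))):
        ax = axes[i, j]
        ax.imshow(arr[i])
        ax.set_title(f"{name}{i}")
        ax.set_xticks([])  # dropping ticks frees space between panels
        ax.set_yticks([])
```

The figure can then be written out with fig.savefig("test.png") as in the question.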
I am trying to make a cubic spline interpolation and for some reason, the interpolation drops off in the middle of it. It's very mysterious and I can't find any mention of similar occurrences anywhere online.
This is for my dissertation so I have excluded some labels etc. to keep it obscure intentionally, but all the relevant code is as follows. For context, this is an astronomy related plot.
from scipy.interpolate import CubicSpline
import numpy as np
import matplotlib.pyplot as plt
W = np.array([0.435,0.606,0.814,1.05,1.25,1.40,1.60])
sum_all = np.array([sum435,sum606,sum814,sum105,sum125,sum140,sum160])
sum_can = np.array([sumc435,sumc606,sumc814,sumc105,sumc125,sumc140,sumc160])
fall = CubicSpline(W,sum_all)
newallx=np.arange(0.435,1.6,0.001)
newally=fall(newallx)
fcan = CubicSpline(W,sum_can)
newcanx=np.arange(0.435,1.6,0.001)
newcany=fcan(newcanx)
#----plot
plt.plot(newallx,newally)
plt.plot(newcanx,newcany)
plt.plot(W,sum_all,marker='o',color='r',linestyle='')
plt.plot(W,sum_can,marker='o',color='b',linestyle='')
plt.yscale("log")
plt.ylabel("Flux S$_v$ [erg s$^-$$^1$ cm$^-$$^2$ Hz$^-$$^1$]")
plt.xlabel(r"Wavelength [n$\lambda$]")
plt.show()
The plot that I get from that comes out like this, with a clear gap in the interpolation:
And in case you are wondering, these are the values in the sum_all and sum_can arrays (I assume it doesn't matter, but just in case you want the numbers to plot it yourself):
sum_all:
[ 3.87282732e+32 8.79993191e+32 1.74866333e+33 1.59946687e+33
9.08556547e+33 6.70458731e+33 9.84832359e+33]
sum_can:
[ 2.98381061e+28 1.26194810e+28 3.30328780e+28 2.90254609e+29
3.65117723e+29 3.46256846e+29 3.64483736e+29]
The gap happens between [0.606,1.26194810e+28] and [0.814,3.30328780e+28]. If I change the intervals from 0.001 to something higher, it's obvious that the plot doesn't actually break off but merely dips below 0 on the y-axis (but the plot is continuous). So why does it do that? Surely that's not a correct interpolation? Just looking with our eyes, that's clearly not a well-interpolated connection between those two points.
Any tips or comments would be extremely appreciated. Thank you so much in advance!
The reason for the breakdown can be better observed on a linear scale.
We see that the spline actually passes below 0, which is undefined on a log scale.
So I would suggest first taking the logarithm of the data, performing the spline interpolation on the log-scaled data, and then transforming back by raising 10 to the power of the result.
from scipy.interpolate import CubicSpline
import numpy as np
import matplotlib.pyplot as plt
W = np.array([0.435,0.606,0.814,1.05,1.25,1.40,1.60])
sum_all = np.array([ 3.87282732e+32, 8.79993191e+32, 1.74866333e+33, 1.59946687e+33,
9.08556547e+33, 6.70458731e+33, 9.84832359e+33])
sum_can = np.array([ 2.98381061e+28, 1.26194810e+28, 3.30328780e+28, 2.90254609e+29,
3.65117723e+29, 3.46256846e+29, 3.64483736e+29])
fall = CubicSpline(W,np.log10(sum_all))
newallx=np.arange(0.435,1.6,0.001)
newally=fall(newallx)
fcan = CubicSpline(W,np.log10(sum_can))
newcanx=np.arange(0.435,1.6,0.01)
newcany=fcan(newcanx)
plt.plot(newallx,10**newally)
plt.plot(newcanx,10**newcany)
plt.plot(W,sum_all,marker='o',color='r',linestyle='')
plt.plot(W,sum_can,marker='o',color='b',linestyle='')
plt.yscale("log")
plt.ylabel("Flux S$_v$ [erg s$^-$$^1$ cm$^-$$^2$ Hz$^-$$^1$]")
plt.xlabel(r"Wavelength [n$\lambda$]")
plt.show()
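The difference between the two approaches can also be checked numerically without plotting. This is a small sketch using the sum_can values from the question; the names linear_y and log_y are introduced here.

```python
import numpy as np
from scipy.interpolate import CubicSpline

W = np.array([0.435, 0.606, 0.814, 1.05, 1.25, 1.40, 1.60])
sum_can = np.array([2.98381061e+28, 1.26194810e+28, 3.30328780e+28, 2.90254609e+29,
                    3.65117723e+29, 3.46256846e+29, 3.64483736e+29])

xs = np.arange(0.435, 1.6, 0.001)

# Spline fitted to the raw values: free to overshoot below zero
linear_y = CubicSpline(W, sum_can)(xs)
# Spline fitted in log10 space, then mapped back: positive by construction
log_y = 10 ** CubicSpline(W, np.log10(sum_can))(xs)

print("raw spline goes negative:", bool((linear_y < 0).any()))
print("log-space spline stays positive:", bool((log_y > 0).all()))
```

Because 10 raised to any real number is positive, the back-transformed curve can never leave the range that a log axis can display.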
The following code when graphed looks really messy at the moment. The reason is I have too many values for 'fare'. 'Fare' ranges from [0-500] with most of the values within the first 100.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
titanic = sns.load_dataset("titanic")
y =titanic.groupby([titanic.fare//1,'sex']).survived.mean().reset_index()
sns.set(style="whitegrid")
g = sns.catplot(x='fare', y= 'survived', col = 'sex', kind ='bar' ,data= y,
height=4, aspect =2.5 , palette="muted")  # factorplot/size were renamed to catplot/height in later seaborn releases
g.despine(left=True)
g.set_ylabels("Survival Probability")
g.set_xlabels('Fare')
plt.show()
I would like to try slicing up the 'fare' of the plots into subsets but would like to see all the graphs at the same time on one screen. I was wondering if this is possible without having to resort to groupby.
I will have to play around with the values of 'fare' to see what I would want each graph to represent, but as a sample let's break up the graph into these 'fare' values.
[0-18]
[18-35]
[35-70]
[70-300]
[300-500]
So the total would be 10 graphs on one page, because of the juxtaposition with the opposite sex.
Is it possible with Seaborn? Do I need to do a lot of configuring with matplotlib? Thanks.
Actually I wrote a little blog post about this a while ago. If you are plotting histograms you can use the by keyword:
import matplotlib.pyplot as plt
import seaborn as sns  # seaborn.apionly has been removed from seaborn
sns.set() #rescue matplotlib's styles from the early '90s
data = sns.load_dataset('titanic')
data.hist(by='class', column = 'fare')
plt.show()
Otherwise if you're just plotting value-counts, you have to roll your own grid:
def categorical_hist(self, column, by, layout=None, legend=None, **params):
    from math import sqrt, ceil
    if layout is None:
        # default to a square grid large enough for every value of `column`
        s = ceil(sqrt(self[column].unique().size))
        layout = (s, s)
    # count values of `column` per group, then draw one bar subplot per value
    return self.groupby(by)[column]\
        .value_counts()\
        .sort_index()\
        .unstack()\
        .plot.bar(subplots=True, layout=layout, legend=legend, **params)

categorical_hist(data, by='class', column='embark_town')
Edit: If you want the survival rate by fare range, you could do something like this:
import pandas as pd
data.groupby(pd.cut(data.fare, 10)).apply(lambda x: x.survived.sum() / len(x))
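The same idea can be applied to the specific fare ranges listed in the question, split by sex. This is a sketch on a hypothetical miniature table rather than the real titanic data; the names edges, names, fare_bin and rate are introduced here.

```python
import pandas as pd

# Hypothetical miniature of the titanic data, just to show the mechanics
df = pd.DataFrame({
    "fare":     [5, 10, 20, 30, 50, 80, 400, 7],
    "sex":      ["male", "female", "male", "female", "male", "female", "male", "female"],
    "survived": [0, 1, 0, 1, 1, 1, 0, 0],
})

edges = [0, 18, 35, 70, 300, 500]  # fare ranges listed in the question
names = ["0-18", "18-35", "35-70", "70-300", "300-500"]
df["fare_bin"] = pd.cut(df["fare"], bins=edges, labels=names, include_lowest=True)

# The mean of the 0/1 survived column is the survival rate per (bin, sex)
rate = (df.groupby(["fare_bin", "sex"], observed=False)["survived"]
          .mean()
          .unstack("sex"))
print(rate)
```

Each row of rate is one fare range and each column one sex, which corresponds to the 5 x 2 grid of panels described in the question. Combinations with no passengers show up as NaN.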