I have a pandas dataframe with three columns (A,B,C). I have drawn a regression line of A vs B using
sns.lmplot(x='A', y='B', data = df, x_bins=10, ci=None)
I am using 10 bins and no confidence interval as I have a large number (~5million) datapoints.
I would like to show the value of C on this plot. C has nothing to do with the regression of A against B. I would just like to show C by making the marker size of each bin equal to the average value of C within that bin.
It seems seaborn doesn't have a markersize parameter that can be set equal to a column of the dataframe. Is this even possible?
I came across this stackexchange post which suggests using scatter_kws={"s": 100} to set the marker size. However, when I tried scatter_kws={"s": df['C']} it threw an error.
If this is not possible in seaborn, are there any alternative solutions?
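One possible workaround, sketched here under assumptions rather than taken from seaborn itself: aggregate the data per bin with pandas, then draw the binned points with matplotlib's scatter, whose s argument accepts one size per point. The quantile bins from pd.qcut only approximate what x_bins=10 does, the synthetic df stands in for the real data, and the line is an ordinary least-squares fit on the raw data via np.polyfit.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the real dataframe (assumption: columns A, B, C).
rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.normal(size=10_000)})
df["B"] = 2.0 * df["A"] + rng.normal(size=len(df))
df["C"] = 50 + 40 * df["A"].abs() + rng.uniform(0, 10, size=len(df))

# Split A into 10 quantile bins and average A, B and C inside each bin.
bin_labels = pd.qcut(df["A"], q=10).rename("bin")
binned = df.groupby(bin_labels, observed=True)[["A", "B", "C"]].mean()

# Fit the regression line on the raw (unbinned) data.
slope, intercept = np.polyfit(df["A"], df["B"], deg=1)

fig, ax = plt.subplots()
# One marker per bin; its area (in points squared) is the mean of C in that bin.
ax.scatter(binned["A"], binned["B"], s=binned["C"])
xs = np.array([df["A"].min(), df["A"].max()])
ax.plot(xs, slope * xs + intercept)
ax.set_xlabel("A")
ax.set_ylabel("B")
```

Call plt.show() to display the figure. If the values of C are very small or very large, rescale them before passing them as s, since s is interpreted as a marker area.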
I have a dataset with 37 columns and 230k rows.
I am trying to use seaborn to plot a histogram of every column.
I have not yet cleaned my data.
Here is my code:
for i in X.columns:
    plt.figure()
    ax = sns.histplot(data=df, x=i)
I also got this error: File C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\function_base.py:135 in linspace y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta))
Is there any solution for this?
It may be due to the size of your dataset. So you can try to draw one histogram at a time.
I think there is an inconsistency in your code: you loop over the columns of the dataframe X but you draw the columns of the dataframe df. It is more consistent like this:
for i in df.columns:
    plt.figure()
    ax = sns.histplot(data=df, x=i)
I solved the problem by setting the number of bins explicitly. The default, bins='auto', was the cause: on a large dataset with high variance it can produce a huge number of bins and make the computation fail.
The code that solved my issue is below:
for i in X.columns:
    plt.figure()
    ax = sns.histplot(data=df, x=i, bins=50)
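To see why the automatic rule can get out of hand, here is a small sketch. seaborn documents that the bins argument of histplot is passed to numpy.histogram_bin_edges, so the bin count can be inspected directly with NumPy. The column x is made up for illustration (uncleaned data with one extreme outlier), and the exact number of automatic bins depends on the NumPy version.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical uncleaned column: 100,000 ordinary values plus one extreme outlier.
x = np.append(rng.normal(size=100_000), 1e4)

# Bin edges chosen by the automatic rule versus a fixed number of bins.
auto_edges = np.histogram_bin_edges(x, bins="auto")
fixed_edges = np.histogram_bin_edges(x, bins=50)

print(len(auto_edges) - 1, "bins with bins='auto'")
print(len(fixed_edges) - 1, "bins with bins=50")
```

The automatic rule picks a bin width from the bulk of the data, but the bins have to cover the full range, so a single outlier multiplies the number of bins. Fixing bins, or cleaning the data first, keeps the count bounded.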
I was wondering if anyone could shed some light on how I can average this data:
I have a .nc file with data (dimensions: 2029,64,32) which relates to time, latitude and longitude. Using these commands I can plot individual timesteps:
timestep = data.variables['precip'][0]
plt.imshow(timestep)
plt.colorbar()
plt.show()
Giving a graph in this format for the 0th timestep:
I was wondering if there was any way to average this first dimension (the snapshots in time).
If you are looking to take a mean over all times, try using np.mean where you use the axis keyword to say which axis you want to average.
time_averaged = np.mean(data.variables['precip'], axis = 0)
If you have NaN values then np.mean will give NaN for that lon/lat point. If you'd rather ignore them then use np.nanmean.
If you want to do specific times only, e.g. the first 1000 time steps, then you could do
time_averaged = np.mean(data.variables['precip'][:1000,:,:], axis = 0)
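A self-contained sketch of the same idea, using a random array in place of the .nc file (the array name precip and its reduced size of 20 time steps are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for data.variables['precip']: dimensions (time, latitude, longitude).
precip = rng.random((20, 64, 32))
precip[3, 10, 5] = np.nan  # one missing value at a single time step

# Average over axis 0 (time); the result has one value per lat/lon point.
time_mean = np.mean(precip, axis=0)

# Same average, but NaN entries are ignored instead of propagated.
time_nanmean = np.nanmean(precip, axis=0)

# Average over the first 10 time steps only.
first_ten_mean = np.mean(precip[:10], axis=0)

print(time_mean.shape)  # (64, 32)
```

The averaged array can be passed to plt.imshow exactly like a single timestep.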
I think if you're using pandas and numpy, this may help you. See the documentation for more details.
import pandas as pd
import numpy as np
data = np.array([10,5,8,9,15,22,26,11,15,16,18,7])
d = pd.Series(data)
print(d.rolling(4).mean())
I want to draw a scatterplot of x against y in which the point size depends on a numeric variable and the color of each point depends on a categorical variable.
First, I tried this with plt.scatter:
Graph 1
Then I tried lmplot, but the point sizes differ from those in the first graph.
I think the two graphs should be the same. Why aren't they?
The point size is different in each graph.
Graph 2
Your question is not very descriptive, but I guess you want to control the size of the markers. Here is more documentation.
Here is a starting point for you.
A numeric variable can also be assigned to size to apply a semantic mapping to the areas of the points:
import seaborn as sns
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="size", size="size")
For seaborn scatterplot:
df = sns.load_dataset("anscombe")
sp = sns.scatterplot(x="x", y="y", hue="dataset", data=df)
And to change the size of the points you use the s parameter.
sp = sns.scatterplot(x="x", y="y", hue="dataset", data=df, s=100)
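Putting the two ideas together, here is a minimal sketch with a categorical hue and a numeric size. The dataframe and its column names (x, y, weight, group) are invented for the example. In matplotlib, s is a marker area in points squared and plt.scatter uses the values as given, whereas seaborn rescales a numeric size variable into the range given by sizes, which is one likely reason the same data can look different in the two libraries.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Made-up data: two coordinates, a numeric weight and a categorical group.
df = pd.DataFrame({
    "x": rng.normal(size=60),
    "y": rng.normal(size=60),
    "weight": rng.uniform(1, 10, size=60),
    "group": rng.choice(["a", "b", "c"], size=60),
})

# hue colors the points by category; size scales them by the numeric column.
# sizes=(20, 200) sets the smallest and largest marker size seaborn will use.
ax = sns.scatterplot(data=df, x="x", y="y",
                     hue="group", size="weight", sizes=(20, 200))
```

To match a plt.scatter figure more closely, adjust the sizes range until the markers are comparable.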
I'm trying to use seaborn dataframe functionality (e.g. passing column names to x, y and hue plot parameters) for my timeseries (in pandas datetime format) plots.
x should come from a timeseries column (converted from a pd.Series of strings with pd.to_datetime)
y should come from a float column
hue comes from a categorical column that I calculated.
There are multiple streams in the same series that I am trying to separate (and use the hue for separating them visually), and therefore they should not be connected by a line (like in a scatterplot)
I have tried the following plot types, each with a different problem:
sns.scatterplot: gets the plotting and the labels right but has problems with the x limits, and I could not set them correctly with plt.xlim() using data.Dates.min and data.Dates.max
sns.lineplot: gets the limits and the labels right, but I could not find a setting to disable the lines between the individual datapoints as in matplotlib. I tried setting the markers and the dashes parameters to no avail.
sns.stripplot: my last try; it plotted the datapoints correctly and got the x limits right but messed up the tick labels
Example input data for easy reproduction:
import numpy as np
import pandas as pd
dates = pd.to_datetime(('2017-11-15',
'2017-11-29',
'2017-12-15',
'2017-12-28',
'2018-01-15',
'2018-01-30',
'2018-02-15',
'2018-02-27',
'2018-03-15',
'2018-03-27',
'2018-04-13',
'2018-04-27',
'2018-05-15',
'2018-05-28',
'2018-06-15',
'2018-06-28',
'2018-07-13',
'2018-07-27'))
values = np.random.randn(len(dates))
clusters = np.random.randint(1, size=len(dates))
D = {'Dates': dates, 'Values': values, 'Clusters': clusters}
data = pd.DataFrame(D)
To each of the functions I am passing the same arguments:
sns.OneOfThePlottingFunctions(x='Dates',
y='Values',
hue='Clusters',
data=data)
plt.show()
So to recap, what I want is a plot that uses seaborn's pandas functionality and plots points (not lines) with correct x limits and readable x labels :)
Any help would be greatly appreciated.
ax = sns.scatterplot(x='Dates', y='Values', hue='Clusters', data=data)
ax.set_xlim(data['Dates'].min(), data['Dates'].max())
Suppose I have multiple time-dependent variables and I want to plot them all together, stacked one on top of another like the image below. How would I do so in matplotlib? Currently, when I try plotting them, they appear as multiple independent plots.
EDIT:
I have a Pandas dataframe with K columns corresponding to dependent variables and N rows corresponding to observed values for those K variables.
Sample code:
df = get_representation(mat)  # df is the Pandas dataframe
for i in range(len(df.columns)):
    plt.plot(df.iloc[:, i])
    plt.show()
I would like to plot them all one on top of another.
You could just stack all the curves by shifting each curve vertically:
df = get_representation(mat)  # df is the Pandas dataframe
for i in range(len(df.columns)):
    plt.plot(df.iloc[:, i] + shift * i)
plt.show()
Here shift denotes the average distance between the curves.
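A runnable sketch of the shifting approach, with random data in place of get_representation(mat). The choice of shift below (the overall range of the data, so that neighbouring curves cannot overlap) is an assumption made for the example; any spacing that suits the data works.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Stand-in dataframe: N=200 observations of K=4 variables.
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["v0", "v1", "v2", "v3"])

# Assumed spacing: the global data range, which guarantees no overlap.
values = df.to_numpy()
shift = values.max() - values.min()

fig, ax = plt.subplots()
for i, name in enumerate(df.columns):
    # Offset each curve vertically by i * shift so they stack in one axes.
    ax.plot(df.index, df[name] + shift * i, label=name)

# Put one tick at each curve's zero level and label it with the column name.
ax.set_yticks([shift * i for i in range(len(df.columns))])
ax.set_yticklabels(df.columns)
```

Call plt.show() once, after the loop, so that all curves end up in the same figure.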