How to display all the information of a scatterplot dot with a pick event when using Seaborn and FacetGrid - dataframe

I have a Pandas DataFrame (below I create a random one to mimic mine).
I use Seaborn's FacetGrid and scatterplot to plot the data the way I want: Epsilon1 as a function of no, with sub-category A distinguished by separate subplots and sub-category B by color. This part of the code works correctly.
Then I want the user to be able to click on any dot to display, in the IPython console and in the status bar of the Matplotlib figure (as here), all the information about that dot, that is, all the values of the corresponding dataframe row. Something like:
'no':5, 'Date':1997-12-15 03:50:41, 'A':A6, 'B':B4, 'Epsilon1':0.670635, 'Epsilon2':0.756461, 'Epsilon3':0.530825
My first tests with the pick event (not shown here) were all unsuccessful.
In particular, I can't get anywhere with the function onpick(event), because I do not understand why print(event.ind) gives me a list of integers...
Here is my code:
import pandas as pd
import numpy as np
import seaborn as sns
import random
# size of the database
n = 1000
nA = 6
nB = 5
no = np.arange(n)
date = np.random.randint(1e9, size=n).astype('datetime64[s]')
A = [''.join(['A',str(random.randint(1, nA))]) for j in range(n)]
B = [''.join(['B',str(random.randint(1, nB))]) for j in range(n)]
Epsilon1 = np.random.random_sample((n,))
Epsilon2 = np.random.random_sample((n,))
Epsilon3 = np.random.random_sample((n,))
data = pd.DataFrame({'no':no,
                     'Date':date,
                     'A':A,
                     'B':B,
                     'Epsilon1':Epsilon1,
                     'Epsilon2':Epsilon2,
                     'Epsilon3':Epsilon3})
def onpick(event):
    print(event.ind)
def plot_Epsilon1_seaborn():
    sns.set_theme()
    g = sns.FacetGrid(data,
                      col="A",
                      col_wrap=4,
                      hue='B',
                      hue_order=data['B'].sort_values().drop_duplicates().to_list(),
                      palette="viridis",
                      col_order=data['A'].sort_values().drop_duplicates().to_list())
    g.map(sns.scatterplot,
          'no',
          'Epsilon1',
          picker=True)
    g.add_legend()
    g.fig.canvas.mpl_connect("pick_event", onpick)
if __name__ == '__main__':
    plot_Epsilon1_seaborn()
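On the list of integers: for a scatter collection, Matplotlib's pick event reports in event.ind the positions of the hit points within the picked artist, and several points can lie under one click, hence a list. FacetGrid.map calls the plotting function once per facet and hue subset, so each subset is presumably its own collection and these positions are not dataframe row labels. Below is a minimal sketch in plain Matplotlib, assuming the x column (no) holds unique integers so it can be used to find the row again; rows_for_pick is a hypothetical helper and the toy frame only stands in for the real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe (values are made up).
data = pd.DataFrame({"no": np.arange(5),
                     "A": ["A1", "A1", "A2", "A2", "A1"],
                     "B": ["B1", "B2", "B1", "B2", "B3"],
                     "Epsilon1": [0.1, 0.4, 0.2, 0.9, 0.5]})

def rows_for_pick(df, artist, ind):
    """Hypothetical helper: map picked positions back to dataframe rows.

    `ind` indexes the points of `artist` only, so read the x values of the
    picked points from the artist and match them against the unique 'no' column.
    """
    xs = np.asarray(artist.get_offsets())[ind, 0]
    return df[df["no"].isin(np.rint(xs).astype(int))]

def onpick(event):
    # Print every matching row as a dict, one line per picked dot.
    for _, row in rows_for_pick(data, event.artist, event.ind).iterrows():
        print(row.to_dict())

# One subset drawn as one collection, as a facet would be.
subset = data[data["A"] == "A1"]
fig, ax = plt.subplots()
sc = ax.scatter(subset["no"], subset["Epsilon1"], picker=True)
fig.canvas.mpl_connect("pick_event", onpick)

# Position 2 inside the subset is the row with no == 4, not row 2.
picked = rows_for_pick(data, sc, [2])
```

The same onpick could be connected to g.fig.canvas in the FacetGrid code, since it relies only on event.artist and event.ind.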

Related

Equivalent of Hist()'s Layout hyperparameter in Sns.Pairplot?

I am trying to find an equivalent of hist()'s figsize and layout parameters for sns.pairplot().
I have a pairplot that gives me nice scatterplots between the X's and y. However, it is laid out horizontally and, to my knowledge, there is no equivalent layout parameter to arrange the plots vertically. 4 plots per row would be great.
This is my current sns.pairplot():
sns.pairplot(X_train,
             x_vars = X_train.select_dtypes(exclude=['object']).columns,
             y_vars = ["SalePrice"])
This is what I would like it to look like: Source
num_mask = train_df.dtypes != object
num_cols = train_df.loc[:, num_mask[num_mask == True].keys()]
num_cols.hist(figsize = (30,15), layout = (4,10))
plt.show()
What you want to achieve isn't currently supported by sns.pairplot, but you can use one of the other figure-level functions (sns.displot, sns.catplot, ...). sns.lmplot creates a grid of scatter plots. For this to work, the dataframe needs to be in "long form".
Here is a simple example. sns.lmplot has parameters to leave out the regression line (fit_reg=False), to set the height of the individual subplots (height=...), to set its aspect ratio (aspect=..., where the subplot width will be height times aspect ratio), and many more. If all y ranges are similar, you can use the default sharey=True.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some test data with different y-ranges
np.random.seed(20230209)
X_train = pd.DataFrame({"".join(np.random.choice([*'uvwxyz'], np.random.randint(3, 8))):
                            np.random.randn(100).cumsum() + np.random.randint(100, 1000) for _ in range(10)})
X_train['SalePrice'] = np.random.randint(10000, 100000, 100)
# convert the dataframe to long form
# 'SalePrice' will get excluded automatically via `melt`
compare_columns = X_train.select_dtypes(exclude=['object']).columns
long_df = X_train.melt(id_vars='SalePrice', value_vars=compare_columns)
# create a grid of scatter plots
g = sns.lmplot(data=long_df, x='SalePrice', y='value', col='variable', col_wrap=4, sharey=False)
g.set(ylabel='')
plt.show()
Here is another example, with histograms of the mpg dataset:
import matplotlib.pyplot as plt
import seaborn as sns
mpg = sns.load_dataset('mpg')
compare_columns = mpg.select_dtypes(exclude=['object']).columns
mpg_long = mpg.melt(value_vars=compare_columns)
g = sns.displot(data=mpg_long, kde=True, x='value', common_bins=False, col='variable', col_wrap=4, color='crimson',
                facet_kws={'sharex': False, 'sharey': False})
g.set(xlabel='')
plt.show()
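To make "long form" concrete, here is a small sketch of what melt does to a wide frame. The column names LotArea and YearBuilt and all values are made up for illustration.

```python
import pandas as pd

# Wide form: one column per variable (toy values).
wide = pd.DataFrame({"SalePrice": [100, 200],
                     "LotArea": [1.0, 2.0],
                     "YearBuilt": [1990, 2005]})

# Long form: one row per (SalePrice, variable) pair, with the measurement in 'value'.
long_df = wide.melt(id_vars="SalePrice", value_vars=["LotArea", "YearBuilt"])
print(long_df)
```

The resulting 'variable' column is what col='variable' uses to split the data into one subplot per original column.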

Time series plot of categorical or binary variables in pandas or matplotlib

I have data that represent a time series of categorical variables. I want to display the transitions in categories below a traditional line plot of related continuous time series to show off context as time evolves. I'd like to know the best way to do this. My attempt was in terms of Rectangles. The appearance is a bit weird, and importantly the axis labels for the x axis don't render as dates.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from pandas.plotting import register_matplotlib_converters
import matplotlib.dates as mdates
register_matplotlib_converters()
t0 = pd.DatetimeIndex(["2017-06-01 00:00","2017-06-17 00:00","2017-07-03 00:00","2017-08-02 00:00","2017-08-09 00:00","2017-09-01 00:00"])
t1 = pd.DatetimeIndex(["2017-06-01 00:00","2017-08-15 00:00","2017-09-01 00:00"])
df0 = pd.DataFrame({"cat":[0,2,1,2,0,1]},index = t0)
df1 = pd.DataFrame({"op":[0,1,0]},index=t1)
# Create new plot
fig,ax = plt.subplots(1,figsize=(8,3))
data_layout = {
    "cat" : {0: ('bisque','Low'),
             1: ('lightseagreen','Medium'),
             2: ('rebeccapurple','High')},
    "op" : {0: ('darkturquoise','Open'),
            1: ('tomato','Close')}
}
vars =("cat","op")
dfs = [df0,df1]
all_ticks = []
leg = []
for j,(v,d) in enumerate(zip(vars,dfs)):
    dvals = d[v][:].astype("d")
    normal = mpl.colors.Normalize(vmin=0, vmax=2.)
    # to_numpy() replaces as_matrix(), which was removed from pandas
    colors = plt.cm.Set1(0.75*normal(dvals.to_numpy()))
    handles = []
    # one rectangle per interval between consecutive timestamps
    for i in range(len(d)-1):
        s = d[v].index.to_pydatetime()
        level = d[v].iloc[i]
        base = d[v].index[i]
        w = s[i+1] - s[i]
        patch=mpl.patches.Rectangle((base,float(j)),width=w,color=data_layout[v][level][0],height=1,fill=True)
        ax.add_patch(patch)
    # one legend entry per category level
    for lev in data_layout[v]:
        print(data_layout[v][level])
        handles.append(mpl.patches.Patch(color=data_layout[v][lev][0],label=data_layout[v][lev][1]))
    all_ticks.append(j+0.5)
    leg.append( plt.legend(handles=handles,loc = (3-3*j+1)))
plt.axhline(y=1.,linewidth=3,color="gray")
plt.xlim(pd.Timestamp(2017,6,1).to_pydatetime(),pd.Timestamp(2017,9,1).to_pydatetime())
plt.ylim(0,2)
ax.add_artist(leg[0]) # two legends on one axis
ax.format_xdata = mdates.DateFormatter('%Y-%m-%d') # This fails
plt.yticks(all_ticks,vars)
plt.show()
This produces a plot with no dates on the x-axis and with jittery lines. How do I fix this? Is there a better way entirely?
Here is a way to display dates on the x-axis.
In your code, replace the line that fails with this one:
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
I don't remember what the result should look like, though. Can you show us the end result again?
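As a self-contained illustration of that formatter call, here is a minimal sketch with made-up data (not the rectangles from the question). Note that ax.format_xdata only concerns the cursor readout in the status bar, whereas the tick labels come from the axis's major formatter.

```python
import datetime
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Made-up daily series over the same date range as in the question.
start = datetime.datetime(2017, 6, 1)
days = [start + datetime.timedelta(days=k) for k in range(93)]
values = list(range(len(days)))

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(days, values)

# Tick labels are produced by the major formatter of the x-axis.
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
fig.autofmt_xdate()  # rotate the labels so they do not overlap

# The formatter turns Matplotlib's internal date number into a string.
label = ax.xaxis.get_major_formatter()(mdates.date2num(datetime.datetime(2017, 6, 17)))
```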

pick_event in Jupyter with matplotlib scatter plot

I really like the simplicity with how ipywidgets.interactive works with pandas dataframe but I am having trouble getting data when a point in a scatter plot is selected.
I have looked at some examples that use matplotlib.widgets etc. but none that use it with interactive in Jupyter. It looks like this technique would be described here but it comes up just short:
http://minrk-ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html
Here is an ipynb of what I am trying to accomplish:
from ipywidgets import interactive
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.widgets import Button
from matplotlib.text import Annotation
from io import StringIO
data_ssv = """tone_amp_0 tone_freq_0 SNR
75.303 628.0 68.374
84.902 8000.0 61.292
92.856 288.0 70.545
70.000 2093.0 35.036
76.511 6834.0 66.952 """
data = pd.read_table(StringIO(data_ssv), sep=r"\s+", header=0)
col_names=list(data.columns.values)
plottable_col=( ['tone_amp_0', 'tone_freq_0', 'SNR'] )
def annotate(axis, text, x, y):
    text_annotation = Annotation(text, xy=(x, y), xycoords='data')
    axis.add_artist(text_annotation)
def onpick(event):
    ind = event.ind
    label_pos_x = event.mouseevent.xdata
    label_pos_y = event.mouseevent.ydata
    offset = 0 # just in case two dots are very close, this offset keeps the labels from appearing on top of each other
    for i in ind: # if the dots are too close to one another, matplotlib returns a list of all the dots clicked
        label = "gen_labels" # generated_labels[i]
        print( "index", i, label ) # step 4: log it for debugging purposes
        ax=plt.gca()
        annotate(ax,label,label_pos_x + offset,label_pos_y + offset)
        ax.figure.canvas.draw_idle()
        offset += 0.01 # shift the offset in case more than one dot is affected by the click
def update_plot(X='tone_amp_0', Y='tone_freq_0', Z='SNR'):
    plt.scatter( data.loc[:, [X]],data.loc[:, [Y]], marker='.', edgecolors='none', c=data.loc[:,[Z]], picker=True, cmap='RdYlGn' )
    plt.title(X+' vs '+Y); plt.xlabel(X); plt.ylabel(Y); plt.colorbar().set_label(Z, labelpad=+1)
    plt.grid(); plt.show()
    plt.gcf().canvas.mpl_connect('pick_event', onpick)
interactive(update_plot, X=plottable_col, Y=plottable_col, Z=plottable_col)
When I select a data point nothing is happening. Not sure how to debug this or understand what I am doing wrong. Can someone point out what I am doing wrong here?
Try putting a semicolon at the end of plt.gcf().canvas.mpl_connect('pick_event', onpick).

Visualizing class labels in self-organizing map plot or iris dataset

I am trying to produce a visualization of the SOM mapping for the Iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris).
My code so far:
from sklearn.datasets import load_iris
from mvpa2.suite import *
import pandas as pd
import numpy as np
df = pd.read_csv(filepath_or_buffer='data/iris.data', header=None, sep=',')
df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end
# split the data table into feature data x and class labels y
x = df.ix[:,0:4].values # the first 4 columns are the features
y = df.ix[:,4].values # the last column is the class label
t = np.zeros(len(y), dtype=int)
t[y == 'Iris-setosa'] = 0
t[y == 'Iris-versicolor'] = 1
t[y == 'Iris-virginica'] = 2
som = SimpleSOMMapper((240, 320), 100, learning_rate=0.05)
som.train(x)
pl.imshow(som.K, origin='lower')
mapped = som(x)
for i, m in enumerate(mapped):
    pl.text(m[1], m[0], t[i], ha='center', va='center',
            bbox=dict(facecolor='white', alpha=0.5, lw=0))
pl.show()
which produces this mapping:
Is there any way to customize the palette so it looks nicer, like this one (taken from https://github.com/JustGlowing/minisom)?
Basically I am trying to use a nicer palette (perhaps with fewer colors) and mark the class labels in a nicer way.
Thank you.
I will answer my own question: it turns out that I forgot to slice my data:
pl.imshow(som.K[:,:,0], origin='lower')
Everything looks fine now:
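Why the slice matters: imshow treats an M x N x 4 array as RGBA pixels, whereas a 2-D array is passed through a colormap. The sketch below assumes som.K holds one plane per input feature (four for Iris), which is an assumption about the mapper's layout, so a random array stands in for it. The cmap value is only an illustrative choice for the palette part of the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for som.K: a small grid with 4 values per node (random, not a trained SOM).
K = np.random.default_rng(0).random((24, 32, 4))

# Slicing one feature plane gives a 2-D array that a colormap can be applied to.
plane = K[:, :, 0]

fig, (ax_rgba, ax_plane) = plt.subplots(1, 2)
ax_rgba.imshow(K, origin='lower')  # 4 channels are read as RGBA colors
im = ax_plane.imshow(plane, origin='lower', cmap='bone_r')  # one plane, explicit palette
fig.colorbar(im, ax=ax_plane)
```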

Update data point labels in bokeh plot

I use bokeh in an IPython notebook and would like to have a button next to a plot to switch the labels of the data points on or off. I found a solution using IPython.html.widgets.interact, but this solution resets the plot on each update, including zooming and padding.
This is the minimal working code example:
from numpy.random import random
from bokeh.plotting import figure, show, output_notebook
from IPython.html.widgets import interact
def plot(label_flag):
    p = figure()
    N = 10
    x = random(N)+2
    y = random(N)+2
    labels = range(N)
    p.scatter(x, y)
    if label_flag:
        pass
        p.text(x, y, labels)
    output_notebook()
    show(p)
interact(plot, label_flag=True)
p.s. If there is an easy way to do this in matplotlib I would also switch back again.
By using bokeh.models.ColumnDataSource to store and change the plot's data, I was able to achieve what I wanted.
One caveat: I found no way to make it work without a refresh unless I call output_notebook twice, in two different cells. If I remove either of the two output_notebook calls, the GUI of the tools button breaks, or changing a setting again resets the plot.
from numpy.random import random
from bokeh.plotting import figure, show, output_notebook
from IPython.html.widgets import interact
from bokeh.models import ColumnDataSource
output_notebook()
## <-- new cell -->
p = figure()
N = 10
x_data = random(N)+2
y_data = random(N)+2
labels = range(N)
source = ColumnDataSource(
    data={
        'x':x_data,
        'y':y_data,
        'desc':labels
    }
)
p.scatter('x', 'y', source=source)
p.text('x', 'y', 'desc', source=source)
output_notebook()
def update_plot(label_flag=True):
    if label_flag:
        source.data['desc'] = range(N)
    else:
        source.data['desc'] = ['']*N
    show(p)
interact(update_plot, label_flag=True)