How can I specify multiple variables for the hue parameters when plotting with seaborn? - pandas

When using seaborn, is there a way I can include multiple variables (columns) for the hue parameter? Another way to ask this question would be how can I group my data by multiple variables before plotting them on a single x,y axis plot?
I want to do something like the following, but currently I am not able to specify two variables for the hue parameter:
sns.relplot(x='#', y='Attack', hue=['Legendary', 'Stage'], data=df)
For example, assume I have a pandas DataFrame like the one below, containing a Pokemon database obtained via this tutorial.
I want to plot the pokedex # on the x-axis and the Attack on the y-axis. However, I want the data to be grouped by both Stage and Legendary. Using matplotlib, I wrote a custom function that groups the dataframe by ['Legendary','Stage'] and then iterates through each group for the plotting (see results below). Although my custom function works as intended, I was hoping this could be achieved more simply with seaborn. I am guessing there must be other people who have attempted to visualize more than 3 variables in a single plot using seaborn?
fig, ax = plt.subplots()
grouping_variables = ['Stage','Legendary']
group_1 = df.groupby(grouping_variables)
for group_1_label, group_1_df in group_1:
    ax.scatter(group_1_df['#'], group_1_df['Attack'], label=group_1_label)
ax_legend = ax.legend(title=grouping_variables)
Edit 1:
Note: In the example I provided, I grouped the data by only two variables (e.g. Legendary and Stage). However, other situations may require an arbitrary number of variables (e.g. 5 variables).

You can leverage the fact that hue accepts either a column name, or a sequence of the same length as your data, listing the color categories to assign each data point to. So...
sns.relplot(x='#', y='Attack', hue='Stage', data=df)
... is basically the same as:
sns.relplot(x='#', y='Attack', hue=df['Stage'], data=df)
You typically wouldn't use the latter, since it's just more typing to achieve the same thing, unless you want to construct a custom sequence on the fly:
sns.relplot(x='#', y='Attack', data=df,
            hue=df[['Legendary', 'Stage']].apply(tuple, axis=1))
The way you build the sequence that you pass via hue is entirely up to you. The only requirements are that it must have the same length as your data and, if it is array-like, that it must be one-dimensional. So you can't just pass hue=df[['Legendary', 'Stage']]; you have to somehow concatenate the columns into one. I chose tuple as the simplest and most versatile way, but if you want more control over the formatting, build a Series of strings. I'll save it into a separate variable here for better readability and so that I can assign it a name (which will be used as the legend title), but you don't have to:
hue = df[['Legendary', 'Stage']].apply(
    lambda row: f"{row.Legendary}, {row.Stage}", axis=1)
hue.name = 'Legendary, Stage'
sns.relplot(x='#', y='Attack', hue=hue, data=df)
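For the arbitrary-number-of-columns case mentioned in the question's edit, the same idea can be wrapped in a small helper. This is only a sketch: combine_columns and the toy DataFrame are made up for illustration and are not part of seaborn or pandas.

```python
import pandas as pd


def combine_columns(df, columns, sep=", "):
    """Join any number of columns into one string Series (hypothetical helper)."""
    # Cast everything to str first so booleans and integers can be joined.
    combined = df[columns].astype(str).apply(sep.join, axis=1)
    # The Series name ends up as the legend title when passed as hue.
    combined.name = sep.join(columns)
    return combined


# Toy stand-in for the Pokemon data; the values are invented.
toy = pd.DataFrame({
    "#": [1, 2, 3],
    "Attack": [49, 62, 82],
    "Stage": [1, 2, 3],
    "Legendary": [False, False, True],
})

hue = combine_columns(toy, ["Legendary", "Stage"])
```

The resulting Series has one entry per row, so it should be usable as hue=hue in the relplot call above, with any number of column names in the list.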

To use the hue parameter of seaborn.relplot, consider concatenating the needed groups into a single column and then plotting with the new variable:
def run_plot(df, flds):
    # Create a new column of concatenated values
    df['_'.join(flds)] = pd.Series(df.reindex(flds, axis='columns')
                                     .astype('str')
                                     .values.tolist()
                                   ).str.join('_')
    # Plot with hue set to the combined column (use the df argument, not a global)
    sns.relplot(x='#', y='Attack', hue='_'.join(flds), data=df, aspect=1.5)
    plt.show()
    plt.clf()
    plt.close()
To demonstrate with random data:
Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
### DATA
np.random.seed(22320)
random_df = pd.DataFrame({'#': np.arange(1,501),
                          'Name': np.random.choice(['Bulbasaur', 'Ivysaur', 'Venusaur',
                                                    'Charmander', 'Charmeleon'], 500),
                          'HP': np.random.randint(1, 100, 500),
                          'Attack': np.random.randint(1, 100, 500),
                          'Defense': np.random.randint(1, 100, 500),
                          'Sp. Atk': np.random.randint(1, 100, 500),
                          'Sp. Def': np.random.randint(1, 100, 500),
                          'Speed': np.random.randint(1, 100, 500),
                          'Stage': np.random.randint(1, 3, 500),
                          'Legend': np.random.choice([True, False], 500)
                          })
Plots
run_plot(random_df, ['Legend', 'Stage'])
run_plot(random_df, ['Legend', 'Stage', 'Name'])

In seaborn's scatterplot(), you can combine the hue= and style= parameters to produce a different marker and color for each combination.
Example (taken verbatim from the documentation):
tips = sns.load_dataset("tips")
ax = sns.scatterplot(x="total_bill", y="tip", data=tips)
ax = sns.scatterplot(x="total_bill", y="tip",
                     hue="day", style="time", data=tips)

Related

Seaborn time series plotting: a different problem for each function

I'm trying to use seaborn dataframe functionality (e.g. passing column names to x, y and hue plot parameters) for my timeseries (in pandas datetime format) plots.
x should come from a timeseries column (converted from a pd.Series of strings with pd.to_datetime)
y should come from a float column
hue comes from a categorical column that I calculated.
There are multiple streams in the same series that I am trying to separate (and use the hue for separating them visually), and therefore they should not be connected by a line (like in a scatterplot)
I have tried the following plot types, each with a different problem:
sns.scatterplot: gets the plotting and the labels right but has problems with the x limits, and I could not set them correctly with plt.xlim() using data.Dates.min and data.Dates.max
sns.lineplot: gets the limits and the labels right, but I could not find a setting to disable the lines between the individual data points like in matplotlib. I tried setting the markers and the dashes parameters to no avail.
sns.stripplot: my last try; it plotted the data points correctly and got the x limits right but messed up the tick labels
Example input data for easy reproduction:
dates = pd.to_datetime(('2017-11-15',
'2017-11-29',
'2017-12-15',
'2017-12-28',
'2018-01-15',
'2018-01-30',
'2018-02-15',
'2018-02-27',
'2018-03-15',
'2018-03-27',
'2018-04-13',
'2018-04-27',
'2018-05-15',
'2018-05-28',
'2018-06-15',
'2018-06-28',
'2018-07-13',
'2018-07-27'))
values = np.random.randn(len(dates))
clusters = np.random.randint(1, size=len(dates))
D = {'Dates': dates, 'Values': values, 'Clusters': clusters}
data = pd.DataFrame(D)
To each of the functions I am passing the same arguments:
sns.OneOfThePlottingFunctions(x='Dates',
                              y='Values',
                              hue='Clusters',
                              data=data)
plt.show()
So to recap, what I want is a plot that uses seaborn's pandas functionality and plots points (not lines) with correct x limits and readable x labels :)
Any help would be greatly appreciated.
ax = sns.scatterplot(x='Dates', y='Values', hue='Clusters', data=data)
ax.set_xlim(data['Dates'].min(), data['Dates'].max())
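Setting the limits to exactly the first and last date puts the outermost points right on the frame, so a small margin may help. The sketch below uses a plain matplotlib scatter in place of sns.scatterplot only to stay self-contained; the same set_xlim call should apply to the Axes that seaborn returns. The 7-day pad and the shortened data are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to get a plot window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Shortened stand-in for the question's data.
dates = pd.to_datetime(["2017-11-15", "2017-12-15", "2018-01-15", "2018-02-15"])
data = pd.DataFrame({"Dates": dates, "Values": np.arange(4.0)})

pad = pd.Timedelta(days=7)  # assumed margin, pick whatever suits the data
xmin = data["Dates"].min() - pad
xmax = data["Dates"].max() + pad

fig, ax = plt.subplots()
ax.scatter(data["Dates"], data["Values"])
ax.set_xlim(xmin, xmax)
fig.autofmt_xdate()  # rotate the date labels so they stay readable
```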

sns.clustermap ticks are missing

I'm trying to visualize what the filters are learning in a CNN text classification model. To do this, I extracted the feature maps of text samples right after the convolutional layer, and for the size-3 filter I got a tensor of size (filter_num)*(length_of_sentences).
df = pd.DataFrame(-np.random.randn(50,50), index = range(50), columns= range(50))
g= sns.clustermap(df,row_cluster=True,col_cluster=False)
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0) # ytick rotate
g.cax.remove() # remove colorbar
plt.show()
This code results in a plot where I can't see all the ticks on the y-axis. Seeing them is necessary because I need to know which filters learn which information. Is there any way to properly show all the ticks on the y-axis?
kwargs from sns.clustermap get passed on to sns.heatmap, which has an option yticklabels, whose documentation states (emphasis mine):
If True, plot the column names of the dataframe. If False, don’t plot the column names. If list-like, plot these alternate labels as the xticklabels. If an integer, use the column names but plot only every n label. If “auto”, try to densely plot non-overlapping labels.
Here, the easiest option is to set it to an integer n, so that it plots every n-th label. We want every label, so we set it to 1, i.e.:
g = sns.clustermap(df, row_cluster=True, col_cluster=False, yticklabels=1)
In your complete example:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
df = pd.DataFrame(-np.random.randn(50,50), index=range(50), columns=range(50))
g = sns.clustermap(df, row_cluster=True, col_cluster=False, yticklabels=1)
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0) # ytick rotate
g.cax.remove() # remove colorbar
plt.show()

Adding Arbitrary points on pandas time series using Dataframe.plot function

I have been trying to plot some time series graphs using the pandas DataFrame plot function. I want to add markers at some arbitrary points on the plot to show anomalous points. The code I used:
df1 = pd.DataFrame({'Entropy Values' : MeanValues}, index=DateRange)
df1.plot(linestyle = '-')
I have a list of Dates on which I need to add markers, such as:
Dates = ['15:45:00', '15:50:00', '15:55:00', '16:00:00']
I had a look at this link matplotlib: Set markers for individual points on a line. Does DF.plot have a similar functionality?
I really appreciate the help. Thanks!
DataFrame.plot passes all keyword arguments it does not recognize to the matplotlib plotting method. To put markers at a few points in the plot you can use the markevery argument. Here is an example:
import pandas as pd
df = pd.DataFrame({'A': range(10), 'B': range(10)}).set_index('A')
df.plot(linestyle='-', markevery=[1, 5, 7, 8], marker='o', markerfacecolor='r')
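In your case, note that markevery takes integer positions (or a slice) rather than index labels, so Dates would first have to be translated into positions in the index of df1. Assuming every entry of Dates matches an index label exactly, that would be something like
positions = [df1.index.get_loc(d) for d in Dates]
df1.plot(linestyle='-', markevery=positions, marker='o', markerfacecolor='r')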
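Since markevery works with integer positions rather than index labels, one way to mark specific timestamps is to look up their positions in the index first. The following is a sketch under assumptions: the index, the values and the full timestamps are invented stand-ins for the question's DateRange, MeanValues and Dates.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to get a plot window
import pandas as pd

# Made-up stand-in for the question's DateRange / MeanValues.
index = pd.date_range("2024-01-01 15:30:00", periods=10, freq="5min")
df1 = pd.DataFrame({"Entropy Values": [3, 4, 2, 8, 5, 9, 1, 4, 6, 2]}, index=index)

# Points to highlight, written as full timestamps so they match the index.
dates = pd.to_datetime([
    "2024-01-01 15:45:00",
    "2024-01-01 15:50:00",
    "2024-01-01 15:55:00",
    "2024-01-01 16:00:00",
])

# get_indexer returns -1 for labels that are not in the index; skip those.
positions = [int(p) for p in df1.index.get_indexer(dates) if p >= 0]

ax = df1.plot(linestyle="-", marker="o", markerfacecolor="r", markevery=positions)
```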

return values of subplot

Currently I am trying to get acquainted with the matplotlib.pyplot library. After having seen quite a few examples and tutorials, I noticed that the subplots function also has some return values which are usually used later on. However, on the matplotlib website I was unable to find any specification of what exactly is returned, and none of the examples are the same (although it usually seems to be an ax object). Can you give me some pointers as to what is returned and how I can use it? Thanks in advance!
In the documentation it says that matplotlib.pyplot.subplots returns an instance of Figure and an array of (or a single) Axes (array or not depends on the number of subplots).
Common use is:
import matplotlib.pyplot as plt
import numpy as np
f, axes = plt.subplots(1,2) # 1 row containing 2 subplots.
# Plot random points on one subplots.
axes[0].scatter(np.random.randn(10), np.random.randn(10))
# Plot histogram on the other one.
axes[1].hist(np.random.randn(100))
# Adjust the size and layout through the Figure-object.
f.set_size_inches(10, 5)
f.tight_layout()
Generally, matplotlib.pyplot.subplots() returns a Figure instance and either a single Axes object or an array of Axes objects.
Since you haven't posted the code you are experimenting with, I will illustrate with two test cases:
Case 1: the number of subplots (grid dimensions) is specified
import matplotlib.pyplot as plt #importing pyplot of matplotlib
import numpy as np
x = [1, 3, 5, 7]
y = [2, 4, 6, 8]
fig, axes = plt.subplots(2, 1)
axes[0].scatter(x, y)
axes[1].boxplot([x, y])  # one box per dataset; boxplot(x, y) would treat y as the notch argument
plt.tight_layout()
plt.show()
Here we have specified the grid as (2, 1), which means the number of rows is r = 2 and the number of columns is c = 1.
In this case, subplots returns the figure instance along with an array of axes whose length equals the total number of subplots, r*c, which is 2 here.
Case 2: the number of subplots (grid dimensions) is not specified
import matplotlib.pyplot as plt #importing pyplot of matplotlib
import numpy as np
x = [1, 3, 5, 7]
y = [2, 4, 6, 8]
fig, axes = plt.subplots()
#size has not been mentioned and hence only one subplot
#is returned by the subplots() method, along with an instance of a figure
axes.scatter(x, y)
#axes.boxplot(x, y)
plt.tight_layout()
plt.show()
In this case, no size or dimension has been mentioned explicitly, therefore only one subplot is created, apart from the figure instance.
You can also control the shape of the returned axes object with the squeeze keyword. See the documentation. It is an optional argument with a default value of True.
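A quick way to see what comes back in each case is to inspect the returned objects directly. This is a small sketch; the variable names are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to get plot windows
import matplotlib.pyplot as plt

# Default call: a single Axes object, not an array.
fig1, ax1 = plt.subplots()

# One row, three columns: a 1-D array of Axes.
fig2, ax2 = plt.subplots(1, 3)

# Two rows, two columns: a 2-D array of Axes.
fig3, ax3 = plt.subplots(2, 2)

# squeeze=False: always a 2-D array, even for a single subplot.
fig4, ax4 = plt.subplots(squeeze=False)

print(type(ax1).__name__, ax2.shape, ax3.shape, ax4.shape)
```

With squeeze=False the indexing code stays the same (axes[row, col]) no matter how many rows and columns are requested, which can be convenient when the grid size is computed at run time.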
Actually, matplotlib.pyplot.subplots() returns two objects:
The figure instance.
The axes.
matplotlib.pyplot.subplots() takes many arguments; its signature is:
matplotlib.pyplot.subplots(nrows=1, ncols=1, *, sharex=False, sharey=False, squeeze=True, subplot_kw=None, gridspec_kw=None, **fig_kw)
The first two arguments are nrows, the number of rows in the subplot grid, and ncols, the number of columns in the subplot grid. If nrows and ncols are not given explicitly, each defaults to 1.
Now, on to the objects that are created:
(1) The figure instance is the figure that holds all the plots.
(2) The axes object contains the information about each individual subplot.
Let's understand this through an example:
Suppose 4 subplots are created in a 2x2 grid, at the positions (0,0), (0,1), (1,0) and (1,1).
Now suppose I want a scatterplot at position (0,0). I draw the scatterplot on the axes[0,0] object, which holds all the information about that scatterplot and shows it in the figure instance.
The same applies to the other three positions.
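The 2x2 case described above could look like the following sketch. The data and the choice of plot type for each position are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to get a plot window
import matplotlib.pyplot as plt

x = [1, 3, 5, 7]
y = [2, 4, 6, 8]

# nrows=2, ncols=2 gives a 2-D array of Axes indexed as axes[row, col].
fig, axes = plt.subplots(nrows=2, ncols=2)

axes[0, 0].scatter(x, y)      # scatterplot at position (0, 0)
axes[0, 1].plot(x, y)         # line plot at position (0, 1)
axes[1, 0].hist(y, bins=4)    # histogram at position (1, 0)
axes[1, 1].bar(x, y)          # bar chart at position (1, 1)

# Layout is adjusted through the Figure object.
fig.tight_layout()
```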
Hope this helps, and let me know your thoughts.

Colors for pandas timeline graphs with many series

I am using pandas for graphing data for a cluster of nodes. I find that pandas is repeating color values for the different series, which makes them indistinguishable.
I tried giving custom color values like this, passing my_colors to the colors field in plot:
my_colors = []
for node in nodes_list:
    my_colors.append(rand_color())
rand_color() is defined as follows:
def rand_color():
    from random import randrange
    return "#%s" % "".join([hex(randrange(16, 255))[2:] for i in range(3)])
But here I also need to avoid color values that are too close to distinguish. I sometimes have as many as 60 nodes (series). Would a hard-coded list of color values be the best option?
You can get a list of colors from any colormap defined in Matplotlib, and even custom colormaps, by:
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> colors = plt.cm.Paired(np.linspace(0,1,60))
Plotting an example with these colors:
>>> plt.scatter( range(60), [0]*60, color=colors )
<matplotlib.collections.PathCollection object at 0x04ED2830>
>>> plt.axis("off")
(-10.0, 70.0, -0.0015, 0.0015)
>>> plt.show()
I found the "Paired" colormap to be especially useful for this kind of thing, but you can use any other available or custom colormap.
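One caveat, as far as I can tell: in recent Matplotlib versions "Paired" is a qualitative map with only 12 entries, so sampling it at 60 points would repeat colors. A sketch of the same sampling idea with a continuous map follows; the choice of "hsv" and the variable names are assumptions, and endpoint=False is used because hsv starts and ends at red.

```python
import matplotlib.pyplot as plt
import numpy as np

n_series = 60

# Sample a continuous colormap at evenly spaced positions in [0, 1).
cmap = plt.get_cmap("hsv")
colors = cmap(np.linspace(0, 1, n_series, endpoint=False))

# Each row is one RGBA color with components between 0 and 1.
print(colors.shape)
```

The rows of colors can then be used in place of the my_colors list from the question. With 60 series, neighboring hues will still be fairly similar, so combining colors with different line styles or markers may be worth considering.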