matplotlib scatter plot: How to use the data= argument - matplotlib

The matplotlib documentation for scatter() states:
In addition to the above described arguments, this function can take a data keyword argument. If such a data argument is given, the following arguments are replaced by data[<arg>]:
All arguments with the following names: ‘s’, ‘color’, ‘y’, ‘c’, ‘linewidths’, ‘facecolor’, ‘facecolors’, ‘x’, ‘edgecolors’.
However, I cannot figure out how to get this to work.
The minimal example
import matplotlib.pyplot as plt
import numpy as np
data = np.random.random(size=(3, 2))
props = {'c': ['r', 'g', 'b'],
's': [50, 100, 20],
'edgecolor': ['b', 'g', 'r']}
plt.scatter(data[:, 0], data[:, 1], data=props)
plt.show()
produces a plot with the default color and sizes, instead of the supplied one.
Has anyone used that functionality?

This seems to be an overlooked feature added about two years ago. The release notes have a short example (https://matplotlib.org/users/prev_whats_new/whats_new_1.5.html#working-with-labeled-data-like-pandas-dataframes). Besides this question and a short blog post (https://tomaugspurger.github.io/modern-6-visualization.html), that's all I could find.
Basically, any dict-like object ("labeled data" as the docs call it) is passed in the data argument, and plot parameters are specified based on its keys. For example, you can create a structured array with fields a, b, and c
coords = np.random.randn(250, 3).view(dtype=[('a', float), ('b', float), ('c', float)])
You would normally create a plot of a vs b using
pyplot.plot(coords['a'], coords['b'], 'x')
but using the data argument it can be done with
pyplot.plot('a', 'b','x', data=coords)
The label b can be confused with a style string setting the line to blue, but the third argument clears up that ambiguity. It's not limited to x and y data either:
pyplot.scatter(x='a', y='b', c='c', data=coords)
This sets the point color based on column 'c'.
It looks like this feature was added for pandas dataframes, and handles them better than other objects. Additionally, it seems to be poorly documented and somewhat unstable: using x and y keyword arguments fails with the plot command but works fine with scatter, and the error messages are not helpful. That being said, it gives a nice shorthand when the data you want to plot has labels.
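Applied to the example in the question: data= only takes effect for arguments that are given as string keys. Arrays passed directly are used as-is, and keys in the data object that no argument names are never looked up, which is consistent with data=props having no visible effect there. A minimal sketch, where the dictionary `table` and its key names are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
points = rng.random(size=(3, 2))

# Hypothetical labeled data: any dict-like object with string keys should work.
table = {
    'xs': points[:, 0],
    'ys': points[:, 1],
    'colors': ['r', 'g', 'b'],
    'sizes': [50, 100, 20],
}

fig, ax = plt.subplots()
# Each string argument is looked up in `table` and replaced by table[<key>].
pc = ax.scatter('xs', 'ys', c='colors', s='sizes', data=table)
```

Distinct key names such as 'colors' also avoid the ambiguity between a key called 'c' and the color abbreviation 'c'.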

In reference to your example, I think the following does what you want:
plt.scatter(data[:, 0], data[:, 1], **props)
That bit in the docs is confusing to me, and looking at the sources, scatter in axes/_axes.py seems to do nothing with this data argument. Remaining kwargs end up as arguments to a PathCollection, maybe there is a bug there.
You could also set these parameters after scatter with the various set methods in PathCollection, e.g.:
pc = plt.scatter(data[:, 0], data[:, 1])
pc.set_sizes([500,100,200])

Related

Stratigraphic column in matplotlib

My goal is to create a stratigraphic column (colored stacked rectangles) using matplotlib like the example below.
Data is in this format:
depth = [1,2,3,4,5,6,7,8,9,10] #depth (feet) below ground surface
lithotype = [4,4,4,5,5,5,6,6,6,2] #lithology type. 4 = clay, 6 = sand, 2 = silt
I tried matplotlib.patches.Rectangle but it's cumbersome. Wondering if someone has another suggestion.
IMHO, using Rectangle is neither difficult nor cumbersome.
# pyplot.get_cmap is used because matplotlib.cm.get_cmap was removed in recent Matplotlib releases
from matplotlib.pyplot import show, subplots, get_cmap
from matplotlib.patches import Rectangle as r
# a simplification is to use, for the lithology types, a qualitative colormap
# here I use Paired, but other qualitative colormaps are displayed in
# https://matplotlib.org/stable/tutorials/colors/colormaps.html#qualitative
qcm = get_cmap('Paired')
# the data, augmented with type descriptions
# note that depths start from zero
depth = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] # depth (feet) below ground surface
lithotype = [4, 4, 4, 5, 5, 5, 6, 1, 6, 2] # lithology type.
types = {1:'swiss cheese', 2:'silt', 4:'clay', 5:'silty sand', 6:'sand'}
# prepare the figure
fig, ax = subplots(figsize = (4, 8))
w = 2 # a conventional width, used to size the x-axis and the rectangles
ax.set(xlim=(0,2), xticks=[]) # size the x-axis, no x ticks
ax.set_ylim(ymin=0, ymax=depth[-1])
ax.invert_yaxis()
fig.suptitle('Soil Behaviour Type')
fig.subplots_adjust(right=0.5)
# plot a series of dots, that eventually will be covered by the Rectangles
# so that we can draw a legend
for lt in set(lithotype):
    ax.scatter(lt, depth[1], color=qcm(lt), label=types[lt], zorder=0)
fig.legend(loc='center right')
ax.plot((1,1), (0,depth[-1]), lw=0)
# do the rectangles
for d0, d1, lt in zip(depth, depth[1:], lithotype):
    ax.add_patch(
        r((0, d0),    # coordinates of upper left corner
          w, d1-d0,   # conventional width on x, thickness of the layer
          facecolor=qcm(lt), edgecolor='k'))
# That's all, folks!
show()
As you can see, placing the rectangles is not complicated; what is indeed cumbersome is properly preparing the Figure and the Axes.
I know that I omitted part of the qualifying details from my solution, but I hope these omissions won't stop you from profiting from my answer.
I made a package called striplog for handling this sort of data and making these kinds of plots.
The tool can read CSV, LAS, and other formats directly (if the format is rather particular), but we can also construct a Striplog object manually. First let's set up the basic data:
depth = [1,2,3,4,5,6,7,8,9,10]
lithotype = [4,4,4,5,5,5,6,6,6,2]
KEY = {2: 'silt', 4: 'clay', 5: 'mud', 6: 'sand'}
Now you need to know that a Striplog is composed of Interval objects, each of which can have one or more Component elements:
from striplog import Striplog, Component, Interval
intervals = []
for top, base, lith in zip(depth, depth[1:], lithotype):
    comp = Component({'lithology': KEY[lith]})
    iv = Interval(top, base, components=[comp])
    intervals.append(iv)
s = Striplog(intervals).merge_neighbours() # Merge like with like.
This results in Striplog(3 Intervals, start=1.0, stop=10.0). Now we'd like to make a plot using an appropriate Legend object.
from striplog import Legend
legend_csv = u"""colour, width, component lithology
#F7E9A6, 3, Sand
#A68374, 2.5, Silt
#99994A, 2, Mud
#666666, 1, Clay"""
legend = Legend.from_csv(text=legend_csv)
s.plot(legend=legend, aspect=2, label='lithology')
Which gives:
Admittedly the plotting is a little limited, but it's just matplotlib so you can always add more code. To be honest, if I were to build this tool today, I think I'd probably leave the plotting out entirely; it's often easier for the user to do their own thing.
Why go to all this trouble? Fair question. striplog lets you merge zones, make thickness or lithology histograms, make queries ("show me sandstone beds thicker than 2 m"), make 'flags', export LAS or CSV, and even do Markov chain sequence analysis. But even if it's not what you're looking for, maybe you can recycle some of the plotting code! Good luck.
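To picture what the "merge like with like" step does to the data above, here is a plain-Python sketch that needs no striplog. It assumes the merge simply collapses runs of adjacent intervals that share a lithology; it is not striplog's implementation, and the names `steps` and `merged` are made up for illustration.

```python
from itertools import groupby

depth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
lithotype = [4, 4, 4, 5, 5, 5, 6, 6, 6, 2]
KEY = {2: 'silt', 4: 'clay', 5: 'mud', 6: 'sand'}

# One (top, base, lithology) tuple per depth step, mirroring the Interval loop.
steps = [(top, base, KEY[lith])
         for top, base, lith in zip(depth, depth[1:], lithotype)]

# Collapse runs of consecutive steps that share a lithology.
merged = []
for lith, run in groupby(steps, key=lambda step: step[2]):
    run = list(run)
    merged.append((run[0][0], run[-1][1], lith))

print(merged)  # [(1, 4, 'clay'), (4, 7, 'mud'), (7, 10, 'sand')]
```

The three merged runs span 1 to 10, which matches the Striplog(3 Intervals, start=1.0, stop=10.0) summary quoted above. The last lithotype entry (silt) is never used, because zip stops at the nine depth pairs.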

Use of plt.plot vs plt.scatter with two variables (x and f(x,y))

I am new to Python and Stack Overflow, so please bear with me.
I was trying to plot using plt.plot and plt.scatter. The former works perfectly well while the latter does not. Below is the relevant part of the code:
import numpy as np
import matplotlib.pyplot as plt
def vis_cal(u, a):
    return np.exp(2*np.pi*1j*u*np.cos(a))
u = np.array([[1, 2, 3, 4]])
u = u.reshape((4,1))
a = np.array([[-np.pi, -np.pi/6]])
plt.figure(figsize=(10, 8))
plt.xlabel("Baseline")
plt.ylabel("Vij (Visibility)")
plt.scatter(u, vis_cal(u, a), 'o', color='blue', label="Vij_ind")
plt.legend(loc="lower left")
plt.show()
This returns an error: ValueError: x and y must be the same size
My questions here are
Why doesn't the difference in array size matter to plt.plot, while it does matter to plt.scatter?
Does this mean that if I want to use plt.scatter I always need to make sure that the arrays have the same size, and otherwise I need to use plt.plot?
Thank you very much
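A sketch of what is likely going on, offered under stated assumptions rather than as a definitive answer. Here u has shape (4, 1) and a has shape (1, 2), so vis_cal(u, a) broadcasts to shape (4, 2). plt.plot accepts a single x column against several y columns and draws one line per column, whereas plt.scatter requires x and y to contain the same number of elements. Note also that the third positional argument of scatter is the marker size s, not a format string such as 'o'. Plotting only the real part and the names `u_full`, `lines` and `points` are choices made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

def vis_cal(u, a):
    return np.exp(2 * np.pi * 1j * u * np.cos(a))

u = np.array([1, 2, 3, 4]).reshape((4, 1))   # shape (4, 1)
a = np.array([[-np.pi, -np.pi / 6]])         # shape (1, 2)
vis = vis_cal(u, a)                          # broadcasts to shape (4, 2)

fig, (ax_plot, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# plot() pairs the single x column with each y column: one line per column.
lines = ax_plot.plot(u, vis.real, 'o')

# scatter() needs x and y of equal size, so repeat u once per column of vis.
u_full = np.broadcast_to(u, vis.shape)
points = ax_scatter.scatter(u_full.ravel(), vis.real.ravel(), color='blue')
```

So scatter can still be used; the x values just have to be expanded to match the y values first.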

How can I specify multiple variables for the hue parameters when plotting with seaborn?

When using seaborn, is there a way I can include multiple variables (columns) for the hue parameter? Another way to ask this question would be how can I group my data by multiple variables before plotting them on a single x,y axis plot?
I want to do something like the call below. However, currently I am not able to specify two variables for the hue parameter:
sns.relplot(x='#', y='Attack', hue=['Legendary', 'Stage'], data=df)
For example, assume I have a pandas DataFrame like below containing a Pokemon database obtained via this tutorial.
I want to plot the pokedex # on the x-axis and the Attack on the y-axis. However, I want the data to be grouped by both Stage and Legendary. Using matplotlib, I wrote a custom function that groups the dataframe by ['Legendary','Stage'] and then iterates through each group for the plotting (see results below). Although my custom function works as intended, I was hoping this could be achieved simply with seaborn. I am guessing there must be other people who have attempted to visualize more than 3 variables in a single plot using seaborn?
fig, ax = plt.subplots()
grouping_variables = ['Stage','Legendary']
group_1 = df.groupby(grouping_variables)
for group_1_label, group_1_df in group_1:
    ax.scatter(group_1_df['#'], group_1_df['Attack'], label=group_1_label)
ax_legend = ax.legend(title=grouping_variables)
Edit 1:
Note: In the example I provided, I grouped the data by only two variables (ex: Legendary and Stage). However, other situations may require an arbitrary number of variables (ex: 5 variables).
You can leverage the fact that hue accepts either a column name, or a sequence of the same length as your data, listing the color categories to assign each data point to. So...
sns.relplot(x='#', y='Attack', hue='Stage', data=df)
... is basically the same as:
sns.relplot(x='#', y='Attack', hue=df['Stage'], data=df)
You typically wouldn't use the latter, since it's just more typing to achieve the same thing, unless you want to construct a custom sequence on the fly:
sns.relplot(x='#', y='Attack', data=df,
            hue=df[['Legendary', 'Stage']].apply(tuple, axis=1))
How you build the sequence that you pass via hue is entirely up to you. The only requirements are that it must have the same length as your data and, if it is array-like, it must be one-dimensional. So you can't just pass hue=df[['Legendary', 'Stage']]; you have to somehow concatenate the columns into one. I chose tuple as the simplest and most versatile way, but if you want more control over the formatting, build a Series of strings. I'll save it into a separate variable here for better readability and so that I can assign it a name (which will be used as the legend title), but you don't have to:
hue = df[['Legendary', 'Stage']].apply(
    lambda row: f"{row.Legendary}, {row.Stage}", axis=1)
hue.name = 'Legendary, Stage'
sns.relplot(x='#', y='Attack', hue=hue, data=df)
To use the hue parameter of seaborn.relplot, consider concatenating the needed groups into a single column and then running the plot on the new variable:
def run_plot(df, flds):
    # CREATE NEW COLUMN OF CONCATENATED VALUES
    df['_'.join(flds)] = pd.Series(df.reindex(flds, axis='columns')
                                     .astype('str')
                                     .values.tolist()
                                   ).str.join('_')
    # PLOT WITH hue (use the df argument, not the global random_df)
    sns.relplot(x='#', y='Attack', hue='_'.join(flds), data=df, aspect=1.5)
    plt.show()
    plt.clf()
    plt.close()
To demonstrate with random data
Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
### DATA
np.random.seed(22320)
random_df = pd.DataFrame({'#': np.arange(1,501),
                          'Name': np.random.choice(['Bulbasaur', 'Ivysaur', 'Venusaur',
                                                    'Charmander', 'Charmeleon'], 500),
                          'HP': np.random.randint(1, 100, 500),
                          'Attack': np.random.randint(1, 100, 500),
                          'Defense': np.random.randint(1, 100, 500),
                          'Sp. Atk': np.random.randint(1, 100, 500),
                          'Sp. Def': np.random.randint(1, 100, 500),
                          'Speed': np.random.randint(1, 100, 500),
                          'Stage': np.random.randint(1, 3, 500),
                          'Legend': np.random.choice([True, False], 500)
                          })
Plots
run_plot(random_df, ['Legend', 'Stage'])
run_plot(random_df, ['Legend', 'Stage', 'Name'])
In seaborn's scatterplot(), you can combine a hue= and a style= parameter to produce different markers and different colors for each combination.
Example (taken verbatim from the documentation):
tips = sns.load_dataset("tips")
ax = sns.scatterplot(x="total_bill", y="tip", data=tips)
ax = sns.scatterplot(x="total_bill", y="tip",
hue="day", style="time", data=tips)

How to customize x and y axes in sympy?

I've created a plot in sympy and would like to customize the x- and y-axis. I want to make them red and dashed. I've looked around and tried some things, but nothing seems to work, such as:
plt.axhline(linewidth = 1, linestyle = '--', color = 'red')
plt.axvline(linewidth = 1, linestyle = '--', color = 'red')
Are there some ways to do this that would actually work?
Thanks in advance!
Sympy's source code of */sympy/plotting/plot.py has this comment:
Especially if you need publication ready graphs and this module is
not enough for you - just get the _backend attribute and add
whatever you want directly to it. In the case of matplotlib (the
common way to graph data in python) just copy _backend.fig which
is the figure and _backend.ax which is the axis and work on them
as you would on any other matplotlib object.
This means that, in general, Sympy's plots can be tweaked by modifying the underlying Axes object, which can be accessed using the _backend attribute of the Sympy plot instance.
To address your specific requests, each Axes contains an OrderedDict of Spine objects. The ones that you want to modify are the 'bottom' and the 'left' ones (to modify these objects you have to use their set_* methods):
In [33]: from sympy import *
    ...: x = symbols('x')
    ...: p = plot(sin(x))
    ...: for spine in ('bottom', 'left'):
    ...:     p._backend.ax.spines[spine].set_linestyle((0, (5, 10)))
    ...:     p._backend.ax.spines[spine].set_edgecolor('red')
    ...: p._backend.fig.savefig('Figure_1.png')
produces
Note: if one uses p.save('...') then the figure is reset and they'll miss any tweaking they've made, hence I used the savefig method of the underlying Figure object, accessed again using the _backend attribute.

return values of subplot

I am currently trying to get acquainted with the matplotlib.pyplot library. After having seen quite a few examples and tutorials, I noticed that the subplots function also has some return values which are usually used later on. However, on the matplotlib website I was unable to find any specification of what exactly is returned, and none of the examples are the same (although it usually seems to be an ax object). Can you give me some pointers as to what is returned and how I can use it? Thanks in advance!
The documentation says that matplotlib.pyplot.subplots returns an instance of Figure and an array of (or a single) Axes (array or not depends on the number of subplots).
Common use is:
import matplotlib.pyplot as plt
import numpy as np
f, axes = plt.subplots(1,2) # 1 row containing 2 subplots.
# Plot random points on one subplots.
axes[0].scatter(np.random.randn(10), np.random.randn(10))
# Plot histogram on the other one.
axes[1].hist(np.random.randn(100))
# Adjust the size and layout through the Figure-object.
f.set_size_inches(10, 5)
f.tight_layout()
Generally, the matplotlib.pyplot.subplots() returns a figure instance and an object or an array of Axes objects.
Since you haven't posted the code you are working with, I will illustrate with two test cases:
Case 1: when the number of subplots (the grid dimensions) is specified
import matplotlib.pyplot as plt #importing pyplot of matplotlib
import numpy as np
x = [1, 3, 5, 7]
y = [2, 4, 6, 8]
fig, axes = plt.subplots(2, 1)
axes[0].scatter(x, y)
axes[1].boxplot([x, y])  # one box per data set; boxplot's second positional argument is not y data
plt.tight_layout()
plt.show()
Here we have given the number of subplots needed, (2, 1) in this case, which means the number of rows is r = 2 and the number of columns is c = 1.
In this case, subplots() returns the figure instance along with an array of axes whose length equals the total number of subplots, r*c, which is 2 here.
Case 2: when the number of subplots (the grid dimensions) is not specified
import matplotlib.pyplot as plt #importing pyplot of matplotlib
import numpy as np
x = [1, 3, 5, 7]
y = [2, 4, 6, 8]
fig, axes = plt.subplots()
#size has not been mentioned and hence only one subplot
#is returned by the subplots() method, along with an instance of a figure
axes.scatter(x, y)
#axes.boxplot(x, y)
plt.tight_layout()
plt.show()
In this case, no size or dimension has been given explicitly, so only one subplot is created, in addition to the figure instance.
You can also control the dimensions of the returned axes array with the squeeze keyword. See the documentation. It is an optional argument with a default value of True.
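A short sketch of how squeeze changes the shape of what comes back; the variable names are only for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line for interactive use
import matplotlib.pyplot as plt

fig_a, ax_a = plt.subplots()                     # squeeze=True (default): a single Axes object
fig_b, ax_b = plt.subplots(squeeze=False)        # always a 2D array, here with shape (1, 1)
fig_c, ax_c = plt.subplots(1, 3)                 # squeezed to a 1D array with shape (3,)
fig_d, ax_d = plt.subplots(1, 3, squeeze=False)  # kept 2D, with shape (1, 3)

print(ax_b.shape, ax_c.shape, ax_d.shape)
```

Passing squeeze=False is convenient when the grid size is computed at run time, because the result can then always be indexed as axes[row, col].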
Actually, 'matplotlib.pyplot.subplots()' returns two objects:
The figure instance.
The 'axes'.
'matplotlib.pyplot.subplots()' takes many arguments; its signature is given below:
matplotlib.pyplot.subplots(nrows=1, ncols=1, *, sharex=False, sharey=False, squeeze=True, subplot_kw=None, gridspec_kw=None, **fig_kw)
The first two arguments are nrows, the number of rows to create in the subplot grid, and ncols, the number of columns in the subplot grid. If 'nrows' and 'ncols' are not declared explicitly, each takes the value 1 by default.
Now, on to the objects that are created:
(1) The figure instance is the figure that will hold all the plots.
(2) The 'axes' object contains all the information about each subplot.
Let's understand through an example:
Here, 4 subplots are being created at the positions of (0,0),(0,1),(1,0),(1,1).
Now suppose that I want a scatterplot at position (0,0). I add the scatterplot to the "axes[0,0]" object, which holds all the information about the scatterplot and reflects it in the figure instance.
The same thing will happen for all the other three positions.
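The 2x2 case described above can be sketched as follows. The random data and the choice of plot type for each position are made up for illustration and are not the original answer's example:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
fig, axes = plt.subplots(nrows=2, ncols=2)  # axes is a 2x2 array of Axes objects

axes[0, 0].scatter(rng.normal(size=20), rng.normal(size=20))  # position (0, 0)
axes[0, 1].plot(np.arange(10), np.arange(10) ** 2)            # position (0, 1)
axes[1, 0].hist(rng.normal(size=100))                         # position (1, 0)
axes[1, 1].bar(['a', 'b', 'c'], [3, 1, 2])                    # position (1, 1)
fig.tight_layout()
```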
Hope this helps; let me know your thoughts about this.