Merging legends when creating a facet chart in Altair

I am trying to create a facet grid of plots in Altair. I have two different data frames that have a common x axis, and a common category for the facet, but each with a different column for determining the color. To plot, I merge these into a single data frame. The problem I am having is that the legend is being displayed on each plot individually. I want the legend just to appear once, at the side of the facet. Here is a simple example of what I am trying to do and the current results.
import pandas as pd
import altair as alt
df1 = pd.DataFrame({'x':[1,2,3,1,2,3,1,2,3],
'y1':[6,7,8,1,3,5,9,8,7],
'cat':['A','A','A','B','B','B','C','C','C'],
'E1':[120,120,120,200,200,200,80,80,80]})
df2 = pd.DataFrame({'x':[1,2,3,4,1,2,3,4,1,2,3,4,5],
'y2':[6,8,8,9,2,4,6,8,9,7,5,4,3],
'cat':['A','A','A','A','B','B','B','B','C','C','C','C','C'],
'E2':[2,1,3,2,1,1,3,2,3,2,2,2,3]})
merged = pd.merge(df2,df1, how='outer', on=['cat','x'])
p1 = alt.Chart(merged).mark_line().encode(
x='x:Q',
y='y1:Q',
color=alt.Color('E1:Q', scale=alt.Scale(scheme='viridis'), bin=alt.Bin(maxbins=5))
)
p2 = alt.Chart(merged).mark_circle().encode(
x='x:Q',
y='y2:Q',
color=alt.Color('E2:N', scale=alt.Scale(domain=[1,2,3],range=['black','red','blue']))
)
alt.layer(p1 + p2).facet('cat:N')

Related

How to build columns in Plotly with multiple values sorted by value?

I have a dataframe, built by the code below, with three columns: date, system and number. When I build a bar graph in Plotly I get two bars whose segments I cannot sort by value; they are automatically sorted by name.
import pandas as pd
import numpy as np
data = [('2022-10-01','Pay1',644), ('2022-10-01','Pay2',1460), ('2022-10-01','Pay3',1221), ('2022-10-01','Pay4',1623),\
('2022-10-01','Pay5',1904), ('2022-10-01','Pay6',1853), ('2022-10-01','Pay7',1826), ('2022-10-01','Pay8',247),\
('2022-10-01','Pay9',713), ('2022-10-01','Pay10',1159), ('2022-10-02','Pay1',755), ('2022-10-02','Pay2',786),\
('2022-10-02','Pay3',623), ('2022-10-02','Pay4',1766), ('2022-10-02','Pay5',1141), ('2022-10-02','Pay6',362),\
('2022-10-02','Pay7',1097), ('2022-10-02','Pay8',655), ('2022-10-02','Pay9',1569), ('2022-10-02','Pay10',796)]
data = pd.DataFrame(data,columns=['date','system','number'])
import plotly.express as px
fig = px.bar(data, x='date', y='number',
color='system')
fig.show()
I want each bar to be sorted by value, from smallest to largest.
The expected graph is a stacked bar chart in which each categorical value keeps the same color and the segments of each bar are ordered by numerical value. To keep the colors consistent, create a dictionary that maps each value of the system column to a color from the default discrete color sequence, and add a color column to each data frame. Then extract one data frame per date, sort it by number, and loop through it row by row, adding one trace per row.
import plotly.graph_objects as go
import plotly.express as px
colors = px.colors.qualitative.Plotly
system_name = data['system'].unique()
colors_dict = {k:v for k,v in zip(system_name, colors)}
# print(colors_dict)
fig = go.Figure()
# .copy() so the color column is added to an independent frame, not a view
dff = data.query('date =="2022-10-01"').copy()
dff = dff.sort_values('number',ascending=False)
dff['color'] = dff['system'].map(colors_dict)
for row in dff.itertuples():
    fig.add_trace(go.Bar(x=[row.date], y=[row.number], name=row.system, marker_color=row.color))
dfm = data.query('date =="2022-10-02"').copy()
dfm = dfm.sort_values('number',ascending=False)
dfm['color'] = dfm['system'].map(colors_dict)
for row in dfm.itertuples():
    fig.add_trace(go.Bar(x=[row.date], y=[row.number], name=row.system, marker_color=row.color))
fig.update_layout(barmode='stack')
names = set()
fig.for_each_trace(
lambda trace:
trace.update(showlegend=False)
if (trace.name in names) else names.add(trace.name))
fig.show()
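The two per-date blocks above repeat the same steps, so they could probably be folded into a single loop. The following is a minimal sketch under that assumption, not the answerer's code: `ordered_rows` and `build_figure` are hypothetical helper names, the six-row sample frame is made up, and the Plotly imports are kept inside `build_figure` so that the ordering logic can be inspected on its own.

```python
import pandas as pd

# Made-up sample in the same shape as the question's data.
data = pd.DataFrame(
    [('2022-10-01', 'Pay1', 644), ('2022-10-01', 'Pay2', 1460),
     ('2022-10-01', 'Pay3', 1221), ('2022-10-02', 'Pay1', 755),
     ('2022-10-02', 'Pay2', 786), ('2022-10-02', 'Pay3', 623)],
    columns=['date', 'system', 'number'])


def ordered_rows(df, ascending=False):
    """Sort by date, then by number within each date."""
    return df.sort_values(['date', 'number'], ascending=[True, ascending])


def build_figure(df):
    """Add one bar trace per row so the stacking order follows the sort."""
    import plotly.express as px
    import plotly.graph_objects as go

    palette = px.colors.qualitative.Plotly
    systems = df['system'].unique()
    colors = {name: palette[i % len(palette)] for i, name in enumerate(systems)}

    fig = go.Figure()
    seen = set()
    for row in ordered_rows(df).itertuples():
        # Show each system in the legend only the first time it appears.
        fig.add_trace(go.Bar(x=[row.date], y=[row.number], name=row.system,
                             marker_color=colors[row.system],
                             showlegend=row.system not in seen))
        seen.add(row.system)
    fig.update_layout(barmode='stack')
    return fig


order = ordered_rows(data)
print(order.to_string(index=False))
```

Calling `build_figure(data).show()` should then draw the chart. Traces are stacked in the order they are added, so the first row of each date ends up at the bottom of its bar.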

How to change legend labels in scatter matrix

I have a scatter matrix that I want to change the labels for. In the legend on the right-hand side, I want the blue color 1 to say Mystery and the red color 2 to say Science. I also want each graph's axes labeled with the short names [Spicy, Savory, and Sweet]. I tried using a dict to relabel, but then my charts came out wrong.
import plotly.express as px
fig = px.scatter_matrix(df,
dimensions=["Q12_Spicy", "Q12_Sav", "Q12_Sweet", ],color="Q11_Ans"
)
fig.show()
You can create a new column called Q11_Labels that maps 1 to Mystery and 2 to Science from the Q11_Ans column, and pass color='Q11_Labels' to the px.scatter_matrix function. If you still want the legend to display the original column name, you can pass a dictionary to the labels parameter of the px.scatter_matrix function with labels={"Q11_Labels":"Q11_Ans"}.
Then you can extend this dictionary to map the other column names to display names as well, so that [Spicy, Savory, Sweet] are displayed instead of [Q12_Spicy, Q12_Sav, Q12_Sweet].
import numpy as np
import pandas as pd
import plotly.express as px
## recreate random data with the same columns
np.random.seed(42)
df = pd.DataFrame(
np.random.randint(0,100,size=(100, 3)),
columns=["Q12_Spicy", "Q12_Sav", "Q12_Sweet"]
)
df["Q11_Ans"] = np.random.randint(1,3,size=100)
df["Q11_Ans"] = df["Q11_Ans"].astype("category")
df = df.sort_values(by="Q11_Ans")
## remap the values of 1 and 2 to their meanings, then pass this as the color
df["Q11_Labels"] = df["Q11_Ans"].map({1: "Mystery", 2: "Science"})
## pass a dictionary to the labels parameter
fig = px.scatter_matrix(df,
dimensions=["Q12_Spicy", "Q12_Sav", "Q12_Sweet"],color="Q11_Labels",
labels = {"Q12_Spicy":"Spicy","Q12_Sav":"Savory","Q12_Sweet":"Sweet", "Q11_Labels":"Q11_Ans"}
)
fig.show()
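Before plotting, it can be worth checking that every code in Q11_Ans actually has a label, since `Series.map` returns NaN for values that are missing from the dictionary. A small sketch with made-up values follows; the code 3 is deliberately left unmapped and the variable names are only illustrative.

```python
import pandas as pd

answers = pd.Series([1, 2, 2, 3, 1], name="Q11_Ans")
labels = answers.map({1: "Mystery", 2: "Science"})

# Codes without an entry in the dictionary become NaN.
unmapped = answers[labels.isna()].unique().tolist()
print(labels.tolist())
print("unmapped codes:", unmapped)
```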

colormap with pandas dataframe plot function

I have data from multiple sites that record a sharp change in the monitored parameter. How could I plot the data for all these sites using value-dependent colors to enhance the visualization?
import numpy as np
import pandas as pd
import string
# site names
cols = string.ascii_uppercase
# number of days
ndays = 3
# index
index = pd.date_range('2018-05-01', periods=3*24*60, freq='T')
# simulated daily data
d1 = np.random.randn(len(index)//ndays, len(cols))
d2 = np.random.randn(len(index)//ndays, len(cols))+2
d3 = np.random.randn(len(index)//ndays, len(cols))-2
data=np.concatenate([d1, d2, d3])
df = pd.DataFrame(data=data, index=index, columns=list(cols))
df.plot(legend=False)
Each site (column) gets assigned one color in the above code. Is there a way to map the parameter values to different colors?
I guess one alternative is using colormaps option from scatter plot function: How to use colormaps to color plots of Pandas DataFrames
fig, ax = plt.subplots(figsize=(12,6))
collection = [plt.scatter(range(len(df)), df[col], c=df[col], s=25, cmap=cmap, edgecolor='None') for col in df.columns]
However, if I plot over time (i.e., x=df.index) things appear not to work as expected.
Is there another alternative, or a suggestion for how to better visualize the sudden change in the time series?
In what follows I will use only 3 columns and hourly data in order to make the plots look less messy. The examples work as well with the original data.
cols = string.ascii_uppercase[:3]
ndays = 3
index = pd.date_range('2018-05-01', periods=3*24, freq='H')
# simulated daily data
d1 = np.random.randn(len(index)//ndays, len(cols))
d2 = np.random.randn(len(index)//ndays, len(cols))+2
d3 = np.random.randn(len(index)//ndays, len(cols))-2
data=np.concatenate([d1, d2, d3])
df = pd.DataFrame(data=data, index=index, columns=list(cols))
df.plot(legend=False)
The pandas way
You are out of luck: DataFrame.plot.scatter does not work with datetime-like data due to a long-standing bug.
The matplotlib way
Matplotlib's scatter can handle datetime-like data but the x-axis does not scale as expected.
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])
plt.gcf().autofmt_xdate()
This looks like a bug to me but I could not find any reports. You can work around this by manually adjusting the x-limits.
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])
start, end = df.index[[0, -1]]
xmargin = (end - start) * plt.gca().margins()[0]
plt.xlim(start - xmargin, end + xmargin)
plt.gcf().autofmt_xdate()
Unfortunately the x-axis formatter is not as nice as the pandas one.
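The limit adjustment can be pulled into a small helper. This is a sketch rather than part of the original answer: `padded_xlim` is a hypothetical name, and the 5% default margin is an assumed value standing in for `plt.gca().margins()[0]`.

```python
import pandas as pd


def padded_xlim(index, margin=0.05):
    """Return (left, right) limits widened by `margin` of the index span."""
    start, end = index[0], index[-1]
    pad = (end - start) * margin
    return start - pad, end + pad


index = pd.date_range('2018-05-01', '2018-05-03')  # three daily timestamps
left, right = padded_xlim(index)
print(left, right)
```

The result could then be passed as `plt.xlim(left, right)` after the scatter calls.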
The pandas way, revisited
I discovered this trick by chance and I do not understand why it works. If you plot a pandas series indexed by the same datetime data before calling matplotlib's scatter, the autoscaling issue disappears and you get the nice pandas formatting.
So I made an invisible plot of the first column and then the scatter plot.
df.iloc[:, 0].plot(lw=0) # invisible plot
for col in df.columns:
    plt.scatter(df.index, df[col], c=df[col])

Creating dataframe boxplot from dataframe with row and column multiindex

I have the following Pandas data frame and I'm trying to create a boxplot of the "dur" value for both client and server, organized by qdepth (qdepth on the x-axis, duration on the y-axis, with two variables, client and server). It seems like I need to get client and server as columns. I haven't been able to figure this out by trying combinations of unstack and reset_index.
Here's some dummy data I recreated since you didn't post yours aside from an image:
qdepth,mode,runid,dur
1,client,0x1b7bd6ef955979b6e4c109b47690c862,7.0
1,client,0x45654ba030787e511a7f0f0be2db21d1,30.0
1,server,0xb760550f302d824630f930e3487b4444,19.0
1,server,0x7a044242aec034c44e01f1f339610916,95.0
2,client,0x51c88822b28dfa006bf38603d74f9911,15.0
2,client,0xd5a9028fddf9a400fd8513edbdc58de0,49.0
2,server,0x3943710e587e3932adda1cad8eaf2aeb,30.0
2,server,0xd67650fd984a48f2070de426e0a942b0,93.0
Load the data: df = pd.read_clipboard(sep=',', index_col=[0,1,2])
Option 1:
df.unstack(level=1).boxplot()
Option 2:
df.unstack(level=[0,1]).boxplot()
Option 3:
Using seaborn:
import seaborn as sns
sns.boxplot(x="qdepth", hue="mode", y="dur", data=df.reset_index(),)
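To see why Options 1 and 2 give different plots, it helps to look at the columns each unstack call produces, since DataFrame.boxplot draws one box per column. The sketch below rebuilds the dummy data directly; the short run ids (r1 to r8) are placeholders for the long hexadecimal ids, and the variable names are only illustrative.

```python
import pandas as pd

# Same layout as the dummy data; run ids shortened to placeholders.
df = pd.DataFrame({
    'qdepth': [1, 1, 1, 1, 2, 2, 2, 2],
    'mode': ['client', 'client', 'server', 'server'] * 2,
    'runid': ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8'],
    'dur': [7.0, 30.0, 19.0, 95.0, 15.0, 49.0, 30.0, 93.0],
}).set_index(['qdepth', 'mode', 'runid'])

by_mode = df.unstack(level=1)             # one column per mode
by_depth_mode = df.unstack(level=[0, 1])  # one column per (qdepth, mode)

print(by_mode.columns.tolist())
print(by_depth_mode.columns.tolist())
```

Option 1 therefore pools all queue depths into two boxes, while Option 2 yields one box for each combination of qdepth and mode.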
Update:
To answer your comment, here's a very approximate way (could be used as a starting point) to recreate the seaborn option using only pandas and matplotlib:
import matplotlib as mpl
import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=1,ncols=1, figsize=(12,6))
#bp = df.unstack(level=[0,1])['dur'].boxplot(ax=ax, return_type='dict')
bp = df.reset_index().boxplot(column='dur',by=['qdepth','mode'], ax=ax, return_type='dict')['dur']
# Now fill the boxes with desired colors
boxColors = ['darkkhaki', 'royalblue']
numBoxes = len(bp['boxes'])
for i in range(numBoxes):
    box = bp['boxes'][i]
    boxX = []
    boxY = []
    # The outline of each box is a line with five vertices
    for j in range(5):
        boxX.append(box.get_xdata()[j])
        boxY.append(box.get_ydata()[j])
    boxCoords = list(zip(boxX, boxY))
    # Alternate between Dark Khaki and Royal Blue
    k = i % 2
    boxPolygon = mpl.patches.Polygon(boxCoords, facecolor=boxColors[k])
    ax.add_patch(boxPolygon)
plt.show()

Overlaying actual data on a boxplot from a pandas dataframe

I am using Seaborn to make boxplots from pandas dataframes. Seaborn boxplots seem to essentially read the dataframes the same way as the pandas boxplot functionality (so I hope the solution is the same for both -- but I can just use the dataframe.boxplot function as well). My dataframe has 12 columns and the following code generates a single plot with one boxplot for each column (just like the dataframe.boxplot() function would).
fig, ax = plt.subplots()
sns.set_style("darkgrid", {"axes.facecolor":"darkgrey"})
pal = sns.color_palette("husl",12)
sns.boxplot(data=dataframe, palette=pal)
Can anyone suggest a simple way of overlaying all the values (by columns) while making a boxplot from dataframes?
I will appreciate any help with this.
This hasn't been added to the seaborn.boxplot function yet, but there's something similar in the seaborn.violinplot function, which has other advantages:
x = np.random.randn(30, 6)
sns.violinplot(x, inner="point")
sns.despine(trim=True)
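Another route, not taken from the answers here, is to draw the raw observations with a second seaborn call on the same axes. The sketch below assumes seaborn's `stripplot` is available and uses made-up random data; `box_with_points` and `tidy` are hypothetical names, and the seaborn import is kept inside the function so the reshaping step can be checked on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.normal(size=(30, 3)), columns=['A', 'B', 'C'])

# Long format: one row per observation, labeled with its source column.
tidy = wide.melt(var_name='column', value_name='value')


def box_with_points(long_df):
    """Draw a boxplot and overlay the raw observations on the same axes."""
    import seaborn as sns

    ax = sns.boxplot(data=long_df, x='column', y='value', color='lightgrey')
    sns.stripplot(data=long_df, x='column', y='value', color='black',
                  size=3, ax=ax)
    return ax


print(tidy.head())
```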
Here is a general solution for a boxplot of the entire dataframe. It should work for both seaborn and pandas, as they are both matplotlib-based under the hood. I will use the pandas plot as the example, assuming import matplotlib.pyplot as plt is already in place. As you already have the ax, it would make more sense to use ax.text(...) instead of plt.text(...).
In [35]:
print(df)
         V1        V2        V3        V4        V5
0  0.895739  0.850580  0.307908  0.917853  0.047017
1  0.931968  0.284934  0.335696  0.153758  0.898149
2  0.405657  0.472525  0.958116  0.859716  0.067340
3  0.843003  0.224331  0.301219  0.000170  0.229840
4  0.634489  0.905062  0.857495  0.246697  0.983037
5  0.573692  0.951600  0.023633  0.292816  0.243963
[6 rows x 5 columns]
In [34]:
df.boxplot()
# order='F' flattens column by column, matching the x positions from np.repeat
for x, y in zip(np.repeat(np.arange(df.shape[1])+1, df.shape[0]),
                df.values.ravel(order='F')):
    plt.text(x, y, '%.3f' % y, ha='center', va='center')
For a single series in the dataframe, a few small changes are necessary:
In [35]:
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
for x, y in zip(np.repeat(1, sub_df.shape[0]), sub_df.values):
    plt.text(x, y, '%.3f' % y, ha='center', va='center')
Making scatter plots is also similar:
#for the whole thing (order='F' flattens column by column to match the x positions)
df.boxplot()
plt.scatter(np.repeat(np.arange(df.shape[1])+1, df.shape[0]), df.values.ravel(order='F'), marker='+', alpha=0.5)
#for just one column
sub_df=df.V1
pd.DataFrame(sub_df).boxplot()
plt.scatter(np.repeat(1, sub_df.shape[0]), sub_df.values, marker='+', alpha=0.5)
To overlay anything on a boxplot, we first need to work out where each box is plotted along the x-axis. They appear to be at 1, 2, 3, 4, and so on. Therefore, we want the values in the first column to be plotted at x=1, the second column at x=2, and so on.
An efficient way of doing this is to use np.repeat to repeat 1, 2, 3, 4, ... n times each, where n is the number of observations. Then we can make a plot using those numbers as x coordinates. Since that array is one-dimensional, the y coordinates need to be a flattened view of the data taken column by column, for example df.values.ravel(order='F').
Overlaying the text strings needs another step (a loop), as we can only place one x value, one y value and one text string at a time.
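A tiny made-up frame makes the pairing of x positions and values easy to check. Because np.repeat emits all of the first column's positions before moving on, the values have to be flattened column by column (`order='F'`) to line up; the default row-major `ravel()` would interleave the columns instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]], columns=['V1', 'V2'])

n_rows, n_cols = df.shape
xs = np.repeat(np.arange(n_cols) + 1, n_rows)  # box positions, one per value
ys = df.values.ravel(order='F')                # V1 values, then V2 values

for x, y in zip(xs, ys):
    print(x, y)
```

Every value from V1 is paired with x=1 and every value from V2 with x=2, which is what the overlay needs.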
I have the following trick:
data = np.random.randn(6,5)
df = pd.DataFrame(data,columns = list('ABCDE'))
Now assign a dummy column to df:
df['Group'] = 'A'
print(df)
          A         B         C         D         E Group
0  0.590600  0.226287  1.552091 -1.722084  0.459262     A
1  0.369391 -0.037151  0.136172 -0.772484  1.143328     A
2  1.147314 -0.883715 -0.444182 -1.294227  1.503786     A
3 -0.721351  0.358747  0.323395  0.165267 -1.412939     A
4 -1.757362 -0.271141  0.881554  1.229962  2.526487     A
5 -0.006882  1.503691  0.587047  0.142334  0.516781     A
Then group by the dummy column and call boxplot() on the result, and you are done.
df.groupby('Group').boxplot()