Plotting subdivided bar graph from a dataframe - pandas

I have a Dataframe as given below
df = pd.DataFrame(
{
'Dates': ['2021-04-11', '2021-06-08', '2021-06-08', '2021-06-09', '2021-06-10', '2021-07-18', '2021-07-18'],
'Results': ['Negative', 'Invalid', 'Negative','Negative','Negative', 'Negative', 'Positive' ],
'Size': [1, 1, 1, 1, 4, 21, 2]
})
I want to plot a subdivided bar graph as shown below in the given picture.
[![enter image description here][1]][1]
I have been stuck on this for a while now. I tried using groupby(), sum(), size() etc., but could not figure it out.
Any help is greatly appreciated.
Thanks.
[1]: https://i.stack.imgur.com/HrKRY.jpg

Just reshape it by setting indices on date and results, then unstacking Results. After that, plot.
df.set_index(['Dates', 'Results']).sort_index().unstack().plot.bar(stacked=True)
If you don't want it with a column multi-index, and instead want "Invalid", "Negative", "Positive", just create an intermediate data frame from the unstack (I'll call it temp) and assign temp.columns = temp.columns.get_level_values(1). Then call temp.plot.bar(stacked=True).
To specify size using the OOP matplotlib interface:
f, ax = plt.subplots(figsize=(WIDTH, HEIGHT))
temp.plot.bar(ax=ax, stacked=True)
Save the figure to file using f.savefig(PATH).
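Putting the steps above together, a minimal sketch might look like the following. It selects the Size column before unstacking, so the resulting columns are already flat ("Invalid", "Negative", "Positive"). The figure size and the output file name are placeholder assumptions, not values from the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Dates": ["2021-04-11", "2021-06-08", "2021-06-08", "2021-06-09",
              "2021-06-10", "2021-07-18", "2021-07-18"],
    "Results": ["Negative", "Invalid", "Negative", "Negative",
                "Negative", "Negative", "Positive"],
    "Size": [1, 1, 1, 1, 4, 21, 2],
})

# One row per date, one column per result; missing combinations become 0
temp = df.set_index(["Dates", "Results"])["Size"].unstack(fill_value=0)

fig, ax = plt.subplots(figsize=(8, 4))  # (width, height) in inches, assumed values
temp.plot.bar(ax=ax, stacked=True)
ax.set_ylabel("Size")
fig.tight_layout()
# fig.savefig("stacked_bar.png")  # hypothetical output path
```

Each date becomes one bar, and each result category becomes one coloured slice of that bar.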

Related

Same Label displayed for Multiple Data Sets? (MatPlotLib, Python 3.9)

Is it possible to assign data sets to existing data labels with MatPlotLib?
Even if the data sets are assigned the same label name & color, in the actual resulting plot, in the legend they are separate, and have different colors. I'm using Python 3.9.
Let's say we have the following code:
import matplotlib.pyplot as plt
dataX = [[0, 1, 2], [0, 1, 2]]
dataY = [[0, 1, 2], [2, 1, 0]]
dataLabel = "Test Data"
for i in range(len(dataX)):
    plt.plot(dataX[i], dataY[i], label = dataLabel)
plt.legend(loc = 'best')
plt.show()
plt.close()
This results in the following plot:
And what I want is for both of these data lines to be the same color, and have them both be labeled as the same "Test Data" label instance.
Is what I'm trying to achieve here even possible with MatPlotLib?
Thanks for reading my post, any help is appreciated!
Define a color:
color = 'blue'
Pass this color to every plot call, and set the label on the first line only.
line = plt.plot(dataX[0], dataY[0], label=dataLabel, c=color)
for i in range(1, len(dataX)):
    plt.plot(dataX[i], dataY[i], c=color)
plt.legend(loc='best', handles=[line[0]])
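An alternative sketch, not taken from the answer above: keep labelling every line and collapse the duplicate labels when building the legend. It relies on `Axes.get_legend_handles_labels()` and on a dict keeping a single handle per label. The colour "blue" is an assumed choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in an interactive session
import matplotlib.pyplot as plt

dataX = [[0, 1, 2], [0, 1, 2]]
dataY = [[0, 1, 2], [2, 1, 0]]
dataLabel = "Test Data"

fig, ax = plt.subplots()
for xs, ys in zip(dataX, dataY):
    # Every line gets the same label and the same colour
    ax.plot(xs, ys, label=dataLabel, color="blue")

handles, labels = ax.get_legend_handles_labels()
unique = dict(zip(labels, handles))  # one handle survives per distinct label
legend = ax.legend(list(unique.values()), list(unique.keys()), loc="best")
```

This scales to several groups of lines, since each distinct label ends up with exactly one legend entry.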

Plotting ways (linestrings) over a map in Python

This is my second try at the same question and I really hope that someone can help me...
Even though some really nice people tried to help me, there is a lot I couldn't figure out despite their help.
From the beginning:
I created a dataframe. This dataframe is huge and gives information about travellers in a city. The dataframe looks like this. This is only the head.
In origin and destination you have the ids of the city locations, and in move how many people travelled from origin to destination. Longitude and latitude give the exact point, and the linestring is the combination of the two points.
I created the linestring with this code:
erg2['Linestring'] = erg2.apply(lambda x: LineString([(x['latitude_origin'], x['longitude_origin']), (x['latitude_destination'], x['longitude_destination'])]), axis = 1)
Now my question is how to plot the routes over a map. Even though I tried all the examples from the geopandas documentation etc., I can't work it out myself.
I can't show you what I already plotted because it doesn't make sense, and I guess it would be smarter to start plotting from scratch.
You can see that the column move contains some 0 values. This means that no one travelled this route, so I don't need to plot those.
I have to plot the lines with the information about where the traveller started (origin) and where he went (destination).
I also need to style the lines differently depending on the number of movements.
with this plotting code
fig = px.line_mapbox(erg2, lat="latitude_origin", lon="longitude_origin", color="move",
hover_name= gdf["origin"] + " - " + gdf["destination"],
center =dict(lon=13.41053,lat=52.52437), zoom=3, height=600
)
fig.update_layout(mapbox_style="stamen-terrain", mapbox_zoom=4, mapbox_center_lat = 52.52437,
margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Maybe someone has an idea???
I tried it with this code:
import requests, io, json
import geopandas as gpd
import shapely.geometry
import pandas as pd
import numpy as np
import itertools
import plotly.express as px
# get some public addressess - hospitals. data that has GPS lat / lon
dfhos = pd.read_csv(io.StringIO(requests.get("http://media.nhschoices.nhs.uk/data/foi/Hospital.csv").text),
sep="¬",engine="python",).loc[:, ["OrganisationName", "Latitude", "Longitude"]]
a = np.arange(len(dfhos))
np.random.shuffle(a)
# establish N links between hospitals
N = 10
df = (
pd.DataFrame({0:a[0:N], 1:a[25:25+N]}).merge(dfhos,left_on=0,right_index=True)
.merge(dfhos,left_on=1, right_index=True, suffixes=("_origin", "_destination"))
)
# build a geopandas data frame that has LineString between two hospitals
gdf = gpd.GeoDataFrame(
data=df,
geometry=df.apply(
lambda r: shapely.geometry.LineString(
[(r["Longitude_origin"], r["Latitude_origin"]),
(r["Longitude_destination"], r["Latitude_destination"]) ]), axis=1)
)
# sample code https://plotly.com/python/lines-on-mapbox/#lines-on-mapbox-maps-from-geopandas
lats = []
lons = []
names = []
for feature, name in zip(gdf.geometry, gdf["OrganisationName_origin"] + " - " + gdf["OrganisationName_destination"]):
    if isinstance(feature, shapely.geometry.linestring.LineString):
        linestrings = [feature]
    elif isinstance(feature, shapely.geometry.multilinestring.MultiLineString):
        linestrings = feature.geoms
    else:
        continue
    for linestring in linestrings:
        x, y = linestring.xy
        lats = np.append(lats, y)
        lons = np.append(lons, x)
        names = np.append(names, [name]*len(y))
        lats = np.append(lats, None)
        lons = np.append(lons, None)
        names = np.append(names, None)
fig = px.line_mapbox(lat=lats, lon=lons, hover_name=names)
fig.update_layout(mapbox_style="stamen-terrain",
mapbox_zoom=4,
mapbox_center_lon=gdf.total_bounds[[0,2]].mean(),
mapbox_center_lat=gdf.total_bounds[[1,3]].mean(),
margin={"r":0,"t":0,"l":0,"b":0}
)
which looks like the perfect code, but I can't really use it for my data.
I am very new to coding, so please be a bit patient ;)
Thanks a lot in advance.
All the best
I previously answered this question in "How to plot visualize a Linestring over a map with Python?". I suggested that you update that question, and I still recommend that you do.
Line strings are, IMHO, not the way to go. Plotly does not use line strings, so encoding to line strings only to decode them back into numpy arrays is extra complexity. Check out the examples in the official documentation: https://plotly.com/python/lines-on-mapbox/. It is very clear there that geopandas is just a source that has to be encoded into numpy arrays.
data
Your sample data appears to fit in a single DataFrame and has no need for geopandas or line strings.
Almost all of your sample data is unusable, since every row where origin and destination differ has a move of zero, which you note should be excluded.
import pandas as pd
import numpy as np
import plotly.express as px
df = pd.DataFrame({"origin": [88, 88, 88, 88, 88, 87],
"destination": [88, 89, 110, 111, 112, 83],
"move": [20, 0, 5, 0, 0, 10],
"longitude_origin": [13.481016, 13.481016, 13.481016, 13.481016, 13.481016, 13.479667],
"latitude_origin": [52.457055, 52.457055, 52.457055, 52.457055, 52.457055, 52.4796],
"longitude_destination": [13.481016, 13.504075, 13.613772, 13.586891, 13.559341, 13.481016],
"latitude_destination": [52.457055, 52.443923, 52.533194, 52.523562, 52.507418, 52.457055]})
solution
I have further refined the line_array() function from the simplified solution I previously provided, so that it can also be used to encode the hover and color parameters.
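# lines in plotly are delimited by None
def line_array(data, cols=[], empty_val=None):
    if isinstance(data, pd.DataFrame):
        vals = data.loc[:, cols].values
    elif isinstance(data, pd.Series):
        a = data.values
        # duplicate the single column so each row has a start and an end value
        vals = np.pad(a.reshape(a.shape[0], -1), [(0, 0), (0, 1)], mode="edge")
    # append the delimiter column, then flatten row by row
    # (sized from vals itself rather than from the global df)
    return np.pad(vals, [(0, 0), (0, 1)], constant_values=empty_val).reshape(
        1, (len(vals) * 3))[0]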
# only draw lines where move > 0 and destination is different to origin
df = df.loc[df["move"].gt(0) & (df["origin"]!=df["destination"])]
lons = line_array(df, ["longitude_origin", "longitude_destination"])
lats = line_array(df, ["latitude_origin", "latitude_destination"])
fig = px.line_mapbox(
lat=lats,
lon=lons,
hover_name=line_array(
df.loc[:, ["origin", "destination"]].astype(str).apply(" - ".join, axis=1)
),
hover_data={
"move": line_array(df, ["move", "move"], empty_val=-99),
"origin": line_array(df, ["origin", "origin"], empty_val=-99),
},
color=line_array(df, ["origin", "origin"], empty_val=-99),
).update_traces(visible=False, selector={"name": "-99"})
fig.update_layout(
mapbox={
"style": "stamen-terrain",
"zoom": 9.5,
"center": {"lat": lats[0], "lon": lons[0]},
},
margin={"r": 0, "t": 0, "l": 0, "b": 0},
)
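To make the encoding concrete: Plotly draws a single trace from flat coordinate arrays and breaks the line wherever it meets None. The sketch below builds such arrays with plain NumPy for the two usable rows of the sample data. `none_delimited` is a hypothetical helper written for illustration; it is not the answer's `line_array`.

```python
import numpy as np

def none_delimited(starts, ends):
    """Interleave values as [s0, e0, None, s1, e1, None, ...]."""
    starts = np.asarray(starts, dtype=object)
    ends = np.asarray(ends, dtype=object)
    gaps = np.full(len(starts), None, dtype=object)
    # Stack as columns (start, end, gap), then flatten row by row
    return np.column_stack([starts, ends, gaps]).ravel()

# Two segments: 88 -> 110 and 87 -> 83 from the sample data
lons = none_delimited([13.481016, 13.479667], [13.613772, 13.481016])
lats = none_delimited([52.457055, 52.4796], [52.533194, 52.457055])
```

Arrays of this shape can then be passed as the `lat` and `lon` arguments of `px.line_mapbox`, as in the answer.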

Customizing legend with scatterplot

I struggle with customizing the legend of my scatterplot. Here is a snapshot :
And here is a code sample :
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
my_df = pd.DataFrame([[5, 3, 1], [2, 1, 2], [3, 4, 1], [1, 2, 1]],
columns=["DUMMY_CT", "FOO_CT", "CI_CT"])
g = sns.scatterplot(x="DUMMY_CT", y="FOO_CT", data=my_df, size="CI_CT")
g.set_title("Number of Baz", weight="bold")
g.set_xlabel("Dummy count")
g.set_ylabel("Foo count")
g.get_legend().set_title("Baz count")
Also, I work in a Jupyter-lab notebook with Python 3, if it helps.
The red thingy issue
First things first, I wish to hide the name of the CI_CT variable (outlined in red on the picture). After exploring the documentation all afternoon, I found the get_legend_handles_labels method (see here), which produces the following:
>>> g.get_legend_handles_labels()
([<matplotlib.collections.PathCollection at 0xfaaba4a8>,
<matplotlib.collections.PathCollection at 0xfaa3ff28>,
<matplotlib.collections.PathCollection at 0xfaa3f6a0>,
<matplotlib.collections.PathCollection at 0xfaa3fe48>],
['CI_CT', '0', '1', '2'])
Where I can spot my dear CI_CT string. However, I'm unable to change this name or to hide it completely. I found a dirty way, that basically consists in not using efficiently the dataframe passed as a data parameter. Here is the scatterplot call :
g = sns.scatterplot(x="DUMMY_CT", y="FOO_CT", data=my_df, size=my_df["CI_CT"].values)
Result here :
It works, but is there a cleaner way to achieve this?
The green thingy issue
Displaying a 0 level in this legend is incorrect, since there is no zero value in the column CI_CT of my_df. It is therefore misleading for readers, who might assume the smaller dots represent a value of 0 or 1. I wish to set up a defined scale, the way one can for the x and y axes. However, I cannot achieve it. Any idea?
TL;DR : A broader question that could solve everything
Those adventures make me wonder if there is a way to handle the data you can pass to the scatterplots with hue and size parameters in a clean, x-and-y-axis way. Is it actually possible?
Please pardon my English, and please let me know if the question is too broad or incorrectly labelled.
The "green thing issue", namely that there is one more legend entry than there are sizes, is solved by specifying legend="full".
g = sns.scatterplot(..., legend="full")
The "red thing issue" is more tricky. The problem here is that seaborn misuses a normal legend label as a headline for the legend. An option is indeed to supply the values directly instead of the name of the column, to prevent seaborn from using that column name.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
my_df = pd.DataFrame([[5, 3, 1], [2, 1, 2], [3, 4, 1], [1, 2, 1]],
columns=["DUMMY_CT", "FOO_CT", "CI_CT"])
g = sns.scatterplot(x="DUMMY_CT", y="FOO_CT", data=my_df, size=my_df["CI_CT"].values, legend="full")
g.set_title("Number of Baz", weight="bold")
g.set_xlabel("Dummy count")
g.set_ylabel("Foo count")
g.get_legend().set_title("Baz count")
plt.show()
If you really must use the column name itself, a hacky solution is to crawl into the legend and remove the label you don't want.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
my_df = pd.DataFrame([[5, 3, 1], [2, 1, 2], [3, 4, 1], [1, 2, 1]],
columns=["DUMMY_CT", "FOO_CT", "CI_CT"])
g = sns.scatterplot(x="DUMMY_CT", y="FOO_CT", data=my_df, size="CI_CT", legend="full")
g.set_title("Number of Baz", weight="bold")
g.set_xlabel("Dummy count")
g.set_ylabel("Foo count")
g.get_legend().set_title("Baz count")
#Hack to remove the first legend entry (which is the undesired title)
vpacker = g.get_legend()._legend_handle_box.get_children()[0]
vpacker._children = vpacker.get_children()[1:]
plt.show()
I finally managed to get the result I wanted, but in an ugly way. It might be useful to someone, but I would not advise doing this.
The solution to fix the scale in the legend consists of shifting all the CI_CT column values into the negatives (to keep the order and the consistency of marker sizes). Then, the values displayed in the legend are corrected to undo that shift (inspiration from here).
However, I did not find any better way to make the "CI_CT" text disappear from the legend without leaving an atrociously huge blank space.
Here is the sample of code and the result.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
my_df = pd.DataFrame([[5, 3, 1], [2, 1, 2], [3, 4, 1], [1, 2, 1]], columns=["DUMMY_CT", "FOO_CT", "CI_CT"])
# Subtracting the maximal value of CI_CT from each value
max_val = my_df["CI_CT"].agg("max")
my_df["CI_CT"] = my_df.apply(lambda x : x["CI_CT"] - max_val, axis=1)
# scatterplot declaration
g = sns.scatterplot(x="DUMMY_CT", y="FOO_CT", data=my_df, size=my_df["CI_CT"].values)
g.set_title("Number of Baz", weight="bold")
g.set_xlabel("Dummy count")
g.set_ylabel("Foo count")
g.get_legend().set_title("Baz count")
# Correcting legend values
l = g.legend_
for t in l.texts:
    t.set_text(int(t.get_text()) + max_val)
# Restoring the DF
my_df["CI_CT"] = my_df.apply(lambda x : x["CI_CT"] + max_val, axis=1)
I'm still looking for a better way to achieve this.
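If dropping down to plain matplotlib is acceptable, a different technique from the seaborn answers above is `PathCollection.legend_elements(prop="sizes")`, which builds size entries only from the marker sizes actually present. The sketch below is an assumption-laden illustration: the scale factor of 40 is arbitrary, and the `func` argument maps marker areas back to the original CI_CT values for the labels.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

my_df = pd.DataFrame([[5, 3, 1], [2, 1, 2], [3, 4, 1], [1, 2, 1]],
                     columns=["DUMMY_CT", "FOO_CT", "CI_CT"])

SCALE = 40  # arbitrary marker-area factor, chosen for illustration
fig, ax = plt.subplots()
sc = ax.scatter(my_df["DUMMY_CT"].to_numpy(), my_df["FOO_CT"].to_numpy(),
                s=my_df["CI_CT"].to_numpy() * SCALE)

# One legend entry per distinct marker size; labels are mapped back to CI_CT
handles, labels = sc.legend_elements(prop="sizes", func=lambda s: s / SCALE)
legend = ax.legend(handles, labels, title="Baz count")

ax.set_title("Number of Baz", weight="bold")
ax.set_xlabel("Dummy count")
ax.set_ylabel("Foo count")
```

Because the legend is built by hand, there is no column-name entry to hide and no 0 level unless the data contains one.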

seaborn or matplotlib line chart, line color depending on variable

I have a pandas dataframe with three columns, Date(timestamp), Color('red' or 'blue') and Value(int).
I am currently getting a line chart from it with the following code:
import matplotlib.pyplot as plt
import pandas as pd
Dates=['01/01/2014','02/01/2014','03/01/2014','04/01/2014','05/01/2014','06/01/2014','07/01/2014']
Values=[3,4,6,5,4,5,4]
Colors=['red','red','blue','blue','blue','red','red']
df=pd.DataFrame({'Dates':Dates,'Values':Values,'Colors':Colors})
df['Dates']=pd.to_datetime(df['Dates'],dayfirst=True)
grouped = df.groupby('Colors')
fig, ax = plt.subplots()
for key, group in grouped:
    group.plot(ax=ax, x="Dates", y="Values", label=key, color=key)
plt.show()
I'd like the line color to depend on the 'Colors' column. How can I achieve that?
I have seen here a similar question for scatterplots, but it doesn't seem I can apply the same solution to a time series line chart.
My output is currently this:
I am trying to achieve something like this (one line only, but several colors)
As I said you could find the answer from the link I attached in the comment:
Dates = ['01/01/2014', '02/01/2014', '03/01/2014', '03/01/2014', '04/01/2014', '05/01/2014']
Values = [3, 4, 6, 6, 5, 4]
Colors = ['red', 'red', 'red', 'blue', 'blue', 'blue']
df = pd.DataFrame({'Dates': Dates, 'Values': Values, 'Colors': Colors})
df['Dates'] = pd.to_datetime(df['Dates'], dayfirst=True)
grouped = df.groupby('Colors')
fig, ax = plt.subplots(1)
for key, group in grouped:
    group.plot(ax=ax, x="Dates", y="Values", label=key, color=key)
Where the color changes, you need to add an extra point (here the duplicated 03/01/2014 row) so that the line stays continuous.
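The same idea can also be expressed by drawing the line segment by segment, so that each segment between two consecutive rows takes the colour of its starting row. This is a sketch of that variant using the question's data; it is not the answer's exact code, and colouring by the starting row is an assumed convention.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Dates": ["01/01/2014", "02/01/2014", "03/01/2014", "04/01/2014",
              "05/01/2014", "06/01/2014", "07/01/2014"],
    "Values": [3, 4, 6, 5, 4, 5, 4],
    "Colors": ["red", "red", "blue", "blue", "blue", "red", "red"],
})
df["Dates"] = pd.to_datetime(df["Dates"], dayfirst=True)

fig, ax = plt.subplots()
for i in range(len(df) - 1):
    # Rows i and i+1 share an endpoint with their neighbours, so the line is continuous
    segment = df.iloc[i:i + 2]
    ax.plot(segment["Dates"], segment["Values"], color=df["Colors"].iloc[i])
fig.autofmt_xdate()
```

Unlike the groupby approach, this never joins two distant points of the same colour, because every segment spans only adjacent rows.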

How do I assign multiple labels at once in matplotlib?

I have the following dataset:
x = [0, 1, 2, 3, 4]
y = [ [0, 1, 2, 3, 4],
[5, 6, 7, 8, 9],
[9, 8, 7, 6, 5] ]
Now I plot it with:
import matplotlib.pyplot as plt
plt.plot(x, y)
However, I want to label the 3 y-datasets with this command, which raises an error when .legend() is called:
lineObjects = plt.plot(x, y, label=['foo', 'bar', 'baz'])
plt.legend()
File "./plot_nmos.py", line 33, in <module>
plt.legend()
...
AttributeError: 'list' object has no attribute 'startswith'
When I inspect the lineObjects:
>>> lineObjects[0].get_label()
['foo', 'bar', 'baz']
>>> lineObjects[1].get_label()
['foo', 'bar', 'baz']
>>> lineObjects[2].get_label()
['foo', 'bar', 'baz']
Question
Is there an elegant way to assign multiple labels by just using the .plot() method?
You can iterate over your line objects list, so labels are individually assigned. An example with the built-in python iter function:
lineObjects = plt.plot(x, y)
plt.legend(iter(lineObjects), ('foo', 'bar', 'baz'))
Edit: after updating to matplotlib 1.1.1, it looks like plt.plot(x, y), with y as a list of lists (as provided by the author of the question), doesn't work anymore. One-step plotting without iterating over the y arrays is still possible, though, after converting y to a numpy.array (assuming numpy (http://numpy.scipy.org/) has been imported beforehand).
In this case, use plt.plot(x, y) (if the data in the 2D y array are arranged as columns [axis 1]) or plt.plot(x, y.transpose()) (if the data in the 2D y array are arranged as rows [axis 0])
Edit 2: as pointed out by @pelson (see the comment below), the iter function is unnecessary and a simple plt.legend(lineObjects, ('foo', 'bar', 'baz')) works perfectly
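A minimal sketch of the one-step variant described in these edits, assuming y is stored with one dataset per row and therefore needs transposing before plotting:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this in an interactive session
import matplotlib.pyplot as plt
import numpy as np

x = [0, 1, 2, 3, 4]
y = np.array([[0, 1, 2, 3, 4],
              [5, 6, 7, 8, 9],
              [9, 8, 7, 6, 5]])  # one dataset per row

fig, ax = plt.subplots()
lineObjects = ax.plot(x, y.T)  # transpose so each column becomes one line
legend = ax.legend(lineObjects, ("foo", "bar", "baz"))
```

`ax.plot` returns one Line2D per column, so the handles and the labels pair up in order.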
It is not possible to plot those two arrays against each other directly (at least with version 1.1.1), so you must loop over your y arrays. My advice would be to loop over the labels at the same time:
import matplotlib.pyplot as plt
x = [0, 1, 2, 3, 4]
y = [ [0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [9, 8, 7, 6, 5] ]
labels = ['foo', 'bar', 'baz']
for y_arr, label in zip(y, labels):
    plt.plot(x, y_arr, label=label)
plt.legend()
plt.show()
Edit: @gcalmettes pointed out that with numpy arrays, it is possible to plot all the lines at the same time (by transposing them). See @gcalmettes' answer & comments for details.
I came across the same problem and have now found a very easy solution! Hopefully that's not too late for you. No iterator, just assign your result to a structure...
from numpy import *
from matplotlib.pyplot import *
from numpy.random import *
a = rand(4,4)
a
>>> array([[ 0.33562406, 0.96967617, 0.69730654, 0.46542408],
[ 0.85707323, 0.37398595, 0.82455736, 0.72127002],
[ 0.19530943, 0.4376796 , 0.62653007, 0.77490795],
[ 0.97362944, 0.42720348, 0.45379479, 0.75714877]])
[b,c,d,e] = plot(a)
legend([b,c,d,e], ["b","c","d","e"], loc=1)
show()
Looks like this:
The best current solution is:
lineObjects = plt.plot(x, y) # y describes 3 lines
plt.legend(['foo', 'bar', 'baz'])
You can give the labels while plotting the curves
import pylab as plt
x = [0, 1, 2, 3, 4]
y = [ [0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [9, 8, 7, 6, 5] ]
labels=['foo', 'bar', 'baz']
colors=['r','g','b']
# loop over data, labels and colors
for i in range(len(y)):
    plt.plot(x,y[i],'o-',color=colors[i],label=labels[i])
plt.legend()
plt.show()
In case of numpy matrix plot assign multiple legends at once for each column
I would like to answer this question based on plotting a matrix that has two columns.
Say you have a 2 column matrix Ret
then one may use this code to assign multiple labels at once
import pandas as pd, numpy as np, matplotlib.pyplot as plt
pd.DataFrame(Ret).plot()
plt.xlabel('time')
plt.ylabel('Return')
plt.legend(['Bond Ret','Equity Ret'], loc=0)
plt.show()
I hope this helps
This problem comes up for me often when I have a single set of x values and multiple y values in the columns of an array. I really don't want to plot the data in a loop, and multiple calls to ax.legend/plt.legend are not really an option, since I want to plot other stuff, usually in an equally annoying format.
Unfortunately, plt.setp is not helpful here. In newer versions of matplotlib, it just converts your entire list/tuple into a string, and assigns the whole thing as a label to all the lines.
I've therefore made a utility function to wrap calls to ax.plot/plt.plot in:
def set_labels(artists, labels):
    for artist, label in zip(artists, labels):
        artist.set_label(label)
You can call it something like
x = np.arange(5)
y = np.random.randint(10, size=(5, 3))
fig, ax = plt.subplots()
set_labels(ax.plot(x, y), 'ABC')
This way you get to specify all your normal artist parameters to plot, without having to see the loop in your code. An alternative is to put the whole call to plot into a utility that just unpacks the labels, but that would require a lot of duplication to figure out how to parse multiple datasets, possibly with different numbers of columns, and spread out across multiple arguments, keyword or otherwise.
I used the following to show labels for a dataframe without using the dataframe plot:
lines_ = plot(df)
legend(lines_, df.columns) # df.columns is a list of labels
If you're using a DataFrame, you can also iterate over the columns of the data you want to plot:
# Plot figure
fig, ax = plt.subplots(figsize=(5,5))
# Data
data = data
# Plot
for i in data.columns:
    _ = ax.plot(data[i], label=i)
_ = ax.legend()
plt.show()