Plotting time series box and whisker plot with missing date values for origin destination pairs - pandas

I have the following data set:
df.head(7)
Origin Dest Date Quantity
0 Atlanta LA 2021-09-09 1
1 Atlanta LA 2021-09-11 4
2 Atlanta Chicago 2021-09-16 1
3 Atlanta Seattle 2021-09-27 12
4 Seattle LA 2021-09-29 2
5 Seattle Atlanta 2021-09-13 2
6 Seattle Newark 2021-09-17 7
In short, this table represents the number of items (Quantity) that were sent from a given origin to a given destination on a given date. The table contains 1 month of data. This table was read with:
shipments = pd.read_csv('shipments.csv', parse_dates=['Date'])
Note that this is a sparse table: if Quantity=0 for a particular (Origin, Dest, Date) combination, the row is not included in the table. For example, no items were sent from Atlanta to LA on 2021-09-10, so that row is missing from the data.
I would like to visualize this data using time series box and whisker plots. The x-axis of my graph should show the day, and Quantity should be on the y-axis. A boxplot should represent the various percentiles aggregated over all (origin-destination) pairs.
Similarly, would it be possible to create a graph which, instead of every day, only shows Monday-Sunday on the x-axis (and hence shows the results per day of the week)?
To generate the rows with missing data I used the following code:
table = pd.pivot_table(data=shipments, index='Date', columns=['Origin','Dest'], values='Quantity', fill_value=0)
idx = pd.date_range('2021-09-06','2021-10-10')
table = table.reindex(idx,fill_value=0)

You could transpose the table dataframe, and use that as input for a sns.boxplot. And you could create a similar table for the day of the week. Note that with many zeros, the boxplot might look a bit strange.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# first create some test data, somewhat similar to the given data
N = 1000
cities = ['Atlanta', 'LA', 'Chicago', 'Seattle', 'Newark']
shipments = pd.DataFrame({'Origin': np.random.choice(cities, N),
'Dest': np.random.choice(cities, N),
'Date': np.random.choice(pd.date_range('2021-09-06', '2021-10-10'), N),
'Quantity': (np.random.uniform(1, 4, N) ** 3).astype(int)})
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 5), gridspec_kw={'width_ratios': [3, 1]})
# create boxplots for each day
table_month = pd.pivot_table(data=shipments, index='Date', columns=['Origin', 'Dest'], values='Quantity', fill_value=0)
idx = pd.date_range('2021-09-06', '2021-10-10')
table_month = table_month.reindex(idx, fill_value=0)
sns.boxplot(data=table_month.T, ax=ax1)
labels = [day.strftime('%d\n%b %Y') if i == 0 or day.day == 1 else day.strftime('%d')
for i, day in enumerate(table_month.index)]
ax1.set_xticklabels(labels)
# create boxplots for each day of the week
table_dow = pd.pivot_table(data=shipments, index=shipments['Date'].dt.dayofweek, columns=['Origin', 'Dest'],
values='Quantity', fill_value=0)
table_dow = table_dow.reindex(range(7), fill_value=0)
sns.boxplot(data=table_dow.T, ax=ax2)
labels = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
ax2.set_xticklabels(labels)
ax2.set_xlabel('') # remove superfluous x label
fig.tight_layout()
plt.show()
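One thing to watch in the day-of-week table: `pd.pivot_table` averages only the rows that exist in the sparse data, so days on which a pair shipped nothing do not pull the weekday average down. A minimal sketch of an alternative, assuming you want those zero days counted, is to aggregate the zero-filled daily table by weekday. The three-row data set below is hypothetical and only mimics the shape of the question's table.

```python
import pandas as pd

# hypothetical miniature data set in the same shape as the question's table
shipments = pd.DataFrame({
    'Origin': ['Atlanta', 'Atlanta', 'Seattle'],
    'Dest': ['LA', 'LA', 'LA'],
    'Date': pd.to_datetime(['2021-09-06', '2021-09-13', '2021-09-07']),
    'Quantity': [2, 4, 6],
})

# daily table with explicit zeros, built the same way as in the question
table = pd.pivot_table(data=shipments, index='Date', columns=['Origin', 'Dest'],
                       values='Quantity', fill_value=0)
idx = pd.date_range('2021-09-06', '2021-09-19')  # two full weeks, Monday to Sunday
table = table.reindex(idx, fill_value=0)

# one row per weekday (0 = Monday), averaged over every calendar day, zeros included
dow_mean = table.groupby(table.index.dayofweek).mean()
print(dow_mean)
```

Passing `dow_mean.T` to `sns.boxplot` in place of `table_dow.T` would then draw one box per weekday from these zero-aware averages.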


Creating a Pie Chart on a single row but Multiple Columns in Matplotlib

Issue
I have cumulative totals in row 751 in my dataframe
I want to create a pie chart with numbers and % on just line 751
This is my code
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib notebook
data = pd.read_csv('cleaned_df.csv')
In my .csv I have the following Columns
A,B,C,D,E,F
Rows under Columns(Letters) Rows( Numbers )
A= 123456
B= 234567
C= 345678
D= 456789
E= 56789
F= 123454
Let's say I want to create a pie chart with only columns B, D and F and the last row of numbers, which would be row 6 (678994).
How do I go about that ?
Possible solution is the following:
import matplotlib.pyplot as plt
import pandas as pd
# set test data and create dataframe
data = {"Date": ["01/01/2022", "01/02/2022", "01/03/2022", "01/04/2022", ], "Male": [1, 2, 3, 6], "Female": [2, 2, 3, 7], "Unknown": [3, 2, 4, 9]}
df = pd.DataFrame(data)
Returns (where 3 is the target row for chart)
# set target row index, use 751 in your case
target_row_index = 3
# make the pie circular by setting the aspect ratio to 1
plt.figure(figsize=plt.figaspect(1))
# specify data for chart
values = df.iloc[target_row_index, 1:]
labels = df.columns[1:]
# define function to format values on chart
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{p:.2f}% ({v:d})'.format(p=pct,v=val)
    return my_autopct
plt.pie(values, labels=labels, autopct=make_autopct(values))
plt.show()
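The question asks for only columns B, D and F of the last row, which the answer above does not select explicitly. A minimal sketch of that selection step follows. The two-row frame is hypothetical; its last row reuses the example numbers from the question.

```python
import pandas as pd

# hypothetical frame; the last row holds the cumulative totals
df = pd.DataFrame({
    'A': [1, 123456], 'B': [2, 234567], 'C': [3, 345678],
    'D': [4, 456789], 'E': [5, 56789], 'F': [6, 123454],
})

wanted = ['B', 'D', 'F']                 # columns to show in the pie
values = df.loc[df.index[-1], wanted]    # last row, selected columns only
shares = values / values.sum() * 100     # percentages the pie would display
print(shares.round(2))
```

`values` and `values.index` can then be passed to `plt.pie(values, labels=values.index, autopct=make_autopct(values))` exactly as in the answer.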

Overlaying boxplots on the relative bin of a histogram

Taking the dataset 'tip' as an example
total_bill  tip   smoker  day  time    size
16.99       1.01  No      Sun  Dinner  2
10.34       1.66  No      Sun  Dinner  3
21.01       3.50  No      Sun  Dinner  3
23.68       3.31  No      Sun  Dinner  2
24.59       3.61  No      Sun  Dinner  4
what I'm trying to do is represent the distribution of the variable 'total_bill' and relate each of its bins to the distribution of the variable 'tip' linked to it. In this example, this graph is meant to answer the question: "What is the distribution of tips left by customers as a function of the bill they paid?"
I have more or less achieved the graph I wanted to obtain (but there is a problem. At the end I explain what it is).
And the procedure I adopted is this:
Dividing 'total_bill' into bins.
tips['bins_total_bill'] = pd.cut(tips.total_bill, 10)
tips.head()
total_bill  tip   smoker  day  time    size  bins_total_bill
16.99       1.01  No      Sun  Dinner  2     (12.618, 17.392]
10.34       1.66  No      Sun  Dinner  3     (7.844, 12.618]
21.01       3.50  No      Sun  Dinner  3     (17.392, 22.166]
23.68       3.31  No      Sun  Dinner  2     (22.166, 26.94]
24.59       3.61  No      Sun  Dinner  4     (22.166, 26.94]
Create a pd.Series with:
Index: pd.Interval of the total_bill bins
Values: number of occurrences
s = tips['bins_total_bill'].value_counts(sort=False)
s
(3.022, 7.844] 7
(7.844, 12.618] 42
(12.618, 17.392] 68
(17.392, 22.166] 51
(22.166, 26.94] 31
(26.94, 31.714] 19
(31.714, 36.488] 12
(36.488, 41.262] 7
(41.262, 46.036] 3
(46.036, 50.81] 4
Name: bins_total_bill, dtype: int64
Combine the barplot and the boxplot
fig, ax1 = plt.subplots(dpi=200)
ax2 = ax1.twinx()
sns.barplot(ax=ax1, x = s.index, y = s.values)
sns.boxplot(ax=ax2, x='bins_total_bill', y='tip', data=tips)
sns.stripplot(ax=ax2, x='bins_total_bill', y='tip', data=tips, size=5, color="yellow", edgecolor='red', linewidth=0.3)
#Title and axis labels
ax1.tick_params(axis='x', rotation=90)
ax1.set_ylabel('Number of bills')
ax2.set_ylabel('Tips [$]')
ax1.set_xlabel("Mid value of total_bill bins [$]")
ax1.set_title("Tips ~ Total_bill distribution")
#Reference lines average(tip) + add yticks + Legend
avg_tip = np.mean(tips.tip)
ax2.axhline(y=avg_tip, color='red', linestyle="--", label="avg tip")
ax2.set_yticks(list(ax2.get_yticks()) + [avg_tip]) # append avg(tip) as an extra tick instead of shifting every tick
ax2.legend(loc='best')
#Set labels axis x
ax1.set_xticklabels(list(map(lambda s: round(s.mid,2), s.index)))
It has to be said that this graph has a problem! As the x-axis is categorical, I cannot, for example, add a vertical line at the mean value of 'total_bill'.
How can I fix this to get the correct result?
I also wonder if there is a correct and more streamlined approach than the one I have adopted.
I thought of this method, which is more compact than the previous one (it can probably be done better) and overcomes the problem of scaling on the x-axis.
I split 'total_bill' into bins and add the column to Df
tips['bins_total_bill'] = pd.cut(tips.total_bill, 10)
Group column 'tip' by previously created bins
obj_gby_tips = tips.groupby('bins_total_bill')['tip']
gby_tip = dict(list(obj_gby_tips))
Create dictionary with:
keys: midpoint of each bins interval
values: gby tips for each interval
mid_total_bill_bins = list(map(lambda bins: bins.mid, list(gby_tip.keys())))
gby_tips = gby_tip.values()
tip_gby_total_bill_bins = dict(zip(mid_total_bill_bins, gby_tips))
Create chart by passing to each rectangle of the boxplot the
centroid of each respective bins
fig, ax1 = plt.subplots(dpi=200)
ax2 = ax1.twinx()
bp_values = list(tip_gby_total_bill_bins.values())
bp_pos = list(tip_gby_total_bill_bins.keys())
l1 = sns.histplot(tips.total_bill, bins=10, ax=ax1)
l2 = ax2.boxplot(bp_values, positions=bp_pos, manage_ticks=False, patch_artist=True, widths=2)
#Average tips as hline
avg_tip = np.mean(tips.tip)
ax2.axhline(y=avg_tip, color='red', linestyle="--", label="avg tip")
ax2.set_yticks(list(ax2.get_yticks()) + [avg_tip]) #add value of avg(tip) to y-axis
#Average total_bill as vline
avg_total_bill=np.mean(tips.total_bill)
ax1.axvline(x=avg_total_bill, color='orange', linestyle="--", label="avg tot_bill")
then the result.
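The code above hardcodes `widths=2` and builds the positions through two intermediate dictionaries. A more compact sketch, assuming the same `pd.cut` binning, takes the positions and widths directly from each interval's `mid` and `length`. The `bills` frame is synthetic stand-in data (not the real tips data set), and the 0.6 width factor is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the tips data set (hypothetical values)
rng = np.random.default_rng(0)
bills = pd.DataFrame({'total_bill': rng.uniform(3, 50, 200)})
bills['tip'] = bills['total_bill'] * rng.uniform(0.1, 0.2, 200)

bins = pd.cut(bills['total_bill'], 10)
groups = bills.groupby(bins, observed=True)['tip']  # observed=True skips empty bins

positions = [interval.mid for interval, _ in groups]           # box centre on the numeric x-axis
widths = [interval.length * 0.6 for interval, _ in groups]     # box width relative to the bin width
values = [tip.to_numpy() for _, tip in groups]                 # tips falling in each bin
```

These three lists can be passed as `ax2.boxplot(values, positions=positions, widths=widths, manage_ticks=False)`, so the boxes scale with the bins if the number of bins changes.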

Calculate statistics on subset of a dataframe based on values in dataframe (latitude and longitude)

I am looking to calculate summary statistics on subsets of a dataframe, where each subset is defined relative to specific values within a row.
For example, I have a dataframe that has latitude and longitude and number of people.
df = pd.DataFrame({'latitude': [40.991919 , 40.992001 , 40.991602, 40.989903, 40.987759],
'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
'people': [1,2,3,4,5]})
I want to know the total people within .05 miles from each row. This can be easily created with a loop, but as the space starts to increase this becomes unusable.
Current/Sample:
from geopy.distance import distance
def distance_calc (row, focus_lat, focus_long):
    start = (row['latitude'], row['longitude'])
    stop = (focus_lat, focus_long)
    return distance(start, stop).miles
df['total_people_within_05'] = 0
df['total_rows_within_05'] = 0
for index, row in df.iterrows():
    focus_lat = df['latitude'][index]
    focus_long = df['longitude'][index]
    new_df = df.copy()
    new_df['distance'] = new_df.apply (lambda row: (distance_calc(row, focus_lat, focus_long)),axis=1)
    df.at[index, 'total_people_within_05'] = new_df.loc[new_df.distance<=.05]['people'].sum()
    df.at[index, 'total_rows_within_05'] = new_df.loc[new_df.distance<=.05].shape[0]
Is there any pythonic way to do this?
Cartesian product to itself to get all combinations. This will be expensive on larger datasets. This generates N^2 rows, so in this case 25 rows
calculate distance on each of these combinations
filter query() to distances required
groupby() to get total number of people. Also generate a list of indexes included in total for helping with transparency
finally join() this back together and you have what you want
import geopy.distance as gd
df = pd.DataFrame({'latitude': [40.991919 , 40.992001 , 40.991602, 40.989903, 40.987759],
'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
'people': [1,2,3,4,5]})
df = df.join((df.reset_index().assign(foo=1).merge(df.reset_index().assign(foo=1), on="foo")
.assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_x,r.longitude_x),
(r.latitude_y,r.longitude_y)).miles, axis=1))
.query("distance<=0.05")
.rename(columns={"people_y":"nearby"})
.groupby("index_x").agg({"nearby":"sum","index_y":lambda x: list(x)})
))
print(df.to_markdown())
   latitude  longitude  people  nearby  index_y
0  40.9919   -106.049   1       6       [0, 1, 2]
1  40.992    -106.049   2       6       [0, 1, 2]
2  40.9916   -106.049   3       6       [0, 1, 2]
3  40.9899   -106.05    4       4       [3]
4  40.9878   -106.049   5       5       [4]
Update - use combinations instead of Cartesian product
It's been bugging me that a Cartesian product is a huge overhead, when all that is required is to calculate distances between valid combinations
make use of itertools.combinations() to make a list of valid combinations of indexes
calculate distances between this minimum set
filter down to only distances we're interested in
now build permutations of this smaller set to provide a simple join to actual data
join and aggregate
import itertools
# get distances between all valid combinations
dfd = (pd.DataFrame(list(itertools.combinations(df.index, 2)))
.merge(df, left_on=0, right_index=True)
.merge(df, left_on=1, right_index=True, suffixes=("_0","_1"))
.assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_0,r.longitude_0),
(r.latitude_1,r.longitude_1)).miles, axis=1))
.loc[:,[0,1,"distance"]]
# filter down to close proximities
.query("distance <= 0.05")
)
# build all valid permutations of close by combinations
dfnppl = (pd.DataFrame(itertools.permutations(pd.concat([dfd[0],dfd[1]]).unique(), 2))
.merge(df.loc[:,"people"], left_on=1, right_index=True)
)
# bring it all together
df = (df.reset_index().rename(columns={"index":0}).merge(dfnppl, on=0, suffixes=("","_near"), how="left")
.groupby(0).agg({**{c:"first" for c in df.columns}, **{"people_near":"sum"}})
)
0  latitude  longitude  people  people_near
0  40.9919   -106.049   1       5
1  40.992    -106.049   2       4
2  40.9916   -106.049   3       3
3  40.9899   -106.05    4       0
4  40.9878   -106.049   5       0
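Both answers above still call geopy once per pair through `apply`, which dominates the runtime. As a sketch of a different technique (not the geodesic calculation geopy performs), the full pairwise distance matrix can be computed in one step with NumPy broadcasting and the haversine formula. This assumes a spherical Earth with a mean radius of 3958.8 miles, so distances differ slightly from geopy's, and it still needs memory for an N x N matrix. The function name `haversine_matrix` is made up for this example.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # assumption: mean Earth radius, spherical approximation

def haversine_matrix(lat, lon):
    """Pairwise great-circle distances in miles between all points."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

df = pd.DataFrame({'latitude': [40.991919, 40.992001, 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1, 2, 3, 4, 5]})

within = haversine_matrix(df['latitude'], df['longitude']) <= 0.05  # boolean N x N mask
df['total_people_within_05'] = within @ df['people'].to_numpy()     # sum of people per row
df['total_rows_within_05'] = within.sum(axis=1)                     # neighbours, including the row itself
print(df)
```

Like the loop in the question, each row counts itself, because its distance to itself is zero.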

Pandas df histo, format my x ticker and include empty

I got this pandas df:
index TIME
12:07 2019-06-03 12:07:28
10:04 2019-06-04 10:04:25
11:14 2019-06-09 11:14:25
...
I use this command to plot a histogram of the number of occurrences in each 15-minute period:
df['TIME'].groupby([df["TIME"].dt.hour, df["TIME"].dt.minute]).count().plot(kind="bar")
My plot looks like this:
How can I get x ticks like 10:15 instead of (10, 15), and how can I add the missing x ticks (9:15, 9:30, ...) to get a complete timeline?
You can resample your TIME column to 15-minute intervals and count the number of rows. Then plot a regular bar chart.
import numpy as np
import pandas as pd
import matplotlib.ticker
# pd.np was removed from pandas, so use numpy directly
df = pd.DataFrame({'TIME': pd.to_datetime('2019-01-01') + pd.to_timedelta(np.random.rand(100) * 3, unit='h')})
df = df[df.TIME.dt.minute > 15] # make gap
ax = df.resample('15min', on='TIME').count().plot.bar(rot=0) # '15min' is the current spelling of the older '15T' alias
ticklabels = [x.get_text()[-8:-3] for x in ax.get_xticklabels()]
ax.xaxis.set_major_formatter(matplotlib.ticker.FixedFormatter(ticklabels))
(for details about formatting datetime ticklabels of pandas bar plots see this SO question)
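Since the timestamps in the question span several dates, resampling produces one bin per 15 minutes of calendar time rather than per time of day. A minimal sketch of a time-of-day variant, assuming that is what the question wants: floor each timestamp to its 15-minute slot, format it as HH:MM, and reindex against a complete list of slots so that empty ones show up as zero. The fourth timestamp and the 09:00 to 13:00 window are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'TIME': pd.to_datetime([
    '2019-06-03 12:07:28', '2019-06-04 10:04:25',
    '2019-06-09 11:14:25', '2019-06-10 12:10:00',  # last row is hypothetical
])})

# label every timestamp with the start of its 15-minute slot, ignoring the date
slots = df['TIME'].dt.floor('15min').dt.strftime('%H:%M')
counts = slots.value_counts()

# complete list of slots for the window of interest (the date used here is irrelevant)
all_slots = pd.date_range('2019-01-01 09:00', '2019-01-01 13:00', freq='15min').strftime('%H:%M')
counts = counts.reindex(all_slots, fill_value=0)
print(counts)
```

`counts.plot(kind='bar')` then gives tick labels such as 10:15 directly, with no gaps in the timeline.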

How to plot a time series having only business days without jumps between the missing days [duplicate]

ax.plot_date((dates, dates), (highs, lows), '-')
I'm currently using this command to plot financial highs and lows using Matplotlib. It works great, but how do I remove the blank spaces in the x-axis left by days without market data, such as weekends and holidays?
I have lists of dates, highs, lows, closes and opens. I can't find any examples of creating a graph with an x-axis that shows dates but doesn't enforce a constant scale.
One of the advertised features of scikits.timeseries is "Create time series plots with intelligently spaced axis labels".
You can see some example plots here. In the first example (shown below) the 'business' frequency is used for the data, which automatically excludes holidays and weekends and the like. It also masks missing data points, which you see as gaps in this plot, rather than linearly interpolating them.
Up to date answer (2018) with Matplotlib 2.1.2, Python 2.7.12
The function equidate_ax handles everything you need for a simple date x-axis with equidistant spacing of data points. Realised with ticker.FuncFormatter based on this example.
from __future__ import division
from matplotlib import pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np
import datetime
def equidate_ax(fig, ax, dates, fmt="%Y-%m-%d", label="Date"):
    """
    Sets all relevant parameters for an equidistant date-x-axis.
    Tick Locators are not affected (set automatically)
    Args:
        fig: pyplot.figure instance
        ax: pyplot.axis instance (target axis)
        dates: iterable of datetime.date or datetime.datetime instances
        fmt: Display format of dates
        label: x-axis label
    Returns:
        None
    """
    N = len(dates)
    def format_date(index, pos):
        index = np.clip(int(index + 0.5), 0, N - 1)
        return dates[index].strftime(fmt)
    ax.xaxis.set_major_formatter(FuncFormatter(format_date))
    ax.set_xlabel(label)
    fig.autofmt_xdate()
#
# Some test data (with python dates)
#
dates = [datetime.datetime(year, month, day) for year, month, day in [
(2018,2,1), (2018,2,2), (2018,2,5), (2018,2,6), (2018,2,7), (2018,2,28)
]]
y = np.arange(6)
# Create plots. Left plot is default with a gap
fig, [ax1, ax2] = plt.subplots(1, 2)
ax1.plot(dates, y, 'o-')
ax1.set_title("Default")
ax1.set_xlabel("Date")
# Right plot will show equidistant series
# x-axis must be the indices of your dates-list
x = np.arange(len(dates))
ax2.plot(x, y, 'o-')
ax2.set_title("Equidistant Placement")
equidate_ax(fig, ax2, dates)
I think you need to "artificially synthesize" the exact form of plot you want by using xticks to set the tick labels to the strings representing the dates (of course placing the ticks at equispaced intervals even though the dates you're representing aren't equispaced) and then using a plain plot.
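The idea in the paragraph above can be sketched as follows: plot against the integer position of each data point and attach the date strings as tick labels. The helper names `equispaced_ticks` and `plot_equispaced` are made up for this example, and `ax` is assumed to be an ordinary Matplotlib Axes.

```python
import datetime

def equispaced_ticks(dates, step=1, fmt="%Y-%m-%d"):
    """Return integer tick positions and matching date labels, one per `step` points."""
    positions = list(range(0, len(dates), step))
    labels = [dates[i].strftime(fmt) for i in positions]
    return positions, labels

def plot_equispaced(ax, dates, values, step=1):
    """Plot values at positions 0..N-1 so that missing days leave no gap."""
    ax.plot(range(len(dates)), values, "o-")
    positions, labels = equispaced_ticks(dates, step)
    ax.set_xticks(positions)
    ax.set_xticklabels(labels, rotation=45, ha="right")

# irregularly spaced dates: a weekend and a three-week gap
dates = [datetime.date(2018, 2, d) for d in (1, 2, 5, 6, 7, 28)]
positions, labels = equispaced_ticks(dates, step=2)
print(positions, labels)
```

Calling `plot_equispaced(ax, dates, y)` on an Axes from `plt.subplots()` draws the series; a larger `step` thins out the labels for long series.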
I will typically use NumPy's NaN (not a number) for values that are invalid or not present. They are represented by Matplotlib as gaps in the plot and NumPy is part of pylab/Matplotlib.
>>> import pylab
>>> xs = pylab.arange(10.) + 733632. # valid date range
>>> ys = [1,2,3,2,pylab.nan,2,3,2,5,2.4] # some data (one undefined)
>>> pylab.plot_date(xs, ys, ydate=False, linestyle='-', marker='')
[<matplotlib.lines.Line2D instance at 0x0378D418>]
>>> pylab.show()
I ran into this problem again and was able to create a decent function to handle this issue, especially concerning intraday datetimes. Credit to @Primer for this answer.
def plot_ts(ts, step=5, figsize=(10,7), title=''):
    """
    plot timeseries ignoring date gaps
    Params
    ------
    ts : pd.DataFrame or pd.Series
    step : int, display interval for ticks
    figsize : tuple, figure size
    title: str
    """
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot(range(ts.dropna().shape[0]), ts.dropna())
    ax.set_title(title)
    ax.set_xticks(np.arange(len(ts.dropna())))
    ax.set_xticklabels(ts.dropna().index.tolist())
    # tick visibility, can be slow for 200,000+ ticks
    xticklabels = ax.get_xticklabels() # generate list once to speed up function
    for i, label in enumerate(xticklabels):
        if not i%step==0:
            label.set_visible(False)
    fig.autofmt_xdate()
You can simply change dates to strings:
import matplotlib.pyplot as plt
import datetime
f = plt.figure(1, figsize=(10,5))
ax = f.add_subplot(111)
today = datetime.datetime.today().date()
yesterday = today - datetime.timedelta(days=1)
three_days_later = today + datetime.timedelta(days=3)
x_values = [yesterday, today, three_days_later]
y_values = [75, 80, 90]
x_values = [f'{x:%Y-%m-%d}' for x in x_values]
ax.bar(x_values, y_values, color='green')
plt.show()
scikits.timeseries functionality has largely been moved to pandas, so you can now resample a dataframe to only include the values on weekdays.
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> s = pd.Series(list(range(10)), pd.date_range('2015-09-01','2015-09-10'))
>>> s
2015-09-01 0
2015-09-02 1
2015-09-03 2
2015-09-04 3
2015-09-05 4
2015-09-06 5
2015-09-07 6
2015-09-08 7
2015-09-09 8
2015-09-10 9
>>> s.resample('B', label='right', closed='right').last()
2015-09-01 0
2015-09-02 1
2015-09-03 2
2015-09-04 3
2015-09-07 6
2015-09-08 7
2015-09-09 8
2015-09-10 9
and then plot the resampled series as normal:
s.resample('B', label='right', closed='right').last().plot()
plt.show()
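If the only goal is to drop Saturdays and Sundays from data that already has one value per day, a boolean mask on the index is a simpler alternative to resampling. This is a minimal sketch; unlike a business-day calendar with holidays configured, it knows nothing about public holidays.

```python
import pandas as pd

s = pd.Series(list(range(10)), pd.date_range('2015-09-01', '2015-09-10'))

# dayofweek is 0 for Monday ... 6 for Sunday, so < 5 keeps Monday to Friday
weekdays_only = s[s.index.dayofweek < 5]
print(weekdays_only)
```

The result holds the same eight values as the resampled series shown above, because 2015-09-05 and 2015-09-06 fall on a weekend.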
Just use mplfinance
https://github.com/matplotlib/mplfinance
import mplfinance as mpf
# df = 'ohlc dataframe'
mpf.plot(df)