Unexplained "drops" in Savgol smoothing with higher polynomial for trends, stock, energy data (all kinds of time series basically!) - smoothing

I have been trying to smooth curves with a Savitzky-Golay filter (savgol_filter from SciPy) and, in several of my attempts, raising the polynomial degree resulted in "drops" like the one shown below. This example uses Google Trends data, but I had similar problems with stock data and electricity consumption data. Any lead as to why it behaves this way, or how to solve it (so that I can raise the polynomial degree), would be highly appreciated.
Image below: "Sample output".
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
from pytrends.request import TrendReq
pytrends = TrendReq(hl='en-US', tz=360)
from scipy.signal import savgol_filter
kw_list = ["Carbon footprint"]
pytrends.build_payload(kw_list, timeframe='2004-12-14 2019-12-25', geo='', gprop='')
da1 = pytrends.interest_over_time()
#(drop last one for Savgol as need odd number, used to have 196 records)
Y3 = da1["Carbon footprint"]
fig = plt.figure(figsize=(18,9))
l = Y3.shape[0]
l = l if l%2 == 1 else l-1
# window = odd number closest to size of data
ax1 = plt.subplot(2,1,1)
ax1 = sns.lineplot(data=Y3, color="navy")
#Savgol with polynomial order = 7 is fine (but misses the initial plateau)
Y3_smooth = savgol_filter(Y3,l, 7)
ax1 = sns.lineplot(x=da1.index.to_pydatetime(),y=Y3_smooth, color="red")
plt.title(f"red = with Savgol, polynomial order = 7, window = {l}", fontsize=18)
ax2 = plt.subplot(2,1,2)
ax2 = sns.lineplot(data=Y3, color="navy")
#Savgol with polynomial order = 9 or more has a weird drop
Y3_smooth = savgol_filter(Y3,l, 10)
ax2 = sns.lineplot(x=da1.index.to_pydatetime(),y=Y3_smooth, color="red")
plt.title(f"red = with Savgol, polynomial order = 10, window = {l}", fontsize=18)
[Image: sample output]

If anyone is interested, I found a workaround that uses a different smoothing method, a Gaussian filter. It works well, including at the beginning and end of the series, and allows fine tuning of the degree of smoothing through sigma.
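import numpy as np
from scipy.ndimage import gaussian_filter1d  # scipy.ndimage.filters is deprecated
def smooth(y, sigma=2):
    # Cast to float first: the filter returns the input dtype, so integer data would be truncated
    y_smooth = gaussian_filter1d(np.asarray(y, dtype=float), sigma)
    return y_smooth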
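As a rough illustration of how sigma tunes the workaround above, here is a minimal sketch on synthetic data. The noisy sine series, the seed and the roughness helper are assumptions made up for this example, not part of the original post; only gaussian_filter1d comes from the workaround.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Synthetic stand-in for a noisy time series (hypothetical data)
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 181)
noisy = 50 + 30 * np.sin(t) + rng.normal(0, 8, t.size)

# A larger sigma averages over more neighbours and gives a smoother curve
mild = gaussian_filter1d(noisy, sigma=2)
strong = gaussian_filter1d(noisy, sigma=6)

def roughness(y):
    """Mean absolute step between consecutive points (hypothetical helper)."""
    return np.abs(np.diff(y)).mean()

print(roughness(noisy), roughness(mild), roughness(strong))
```

The output has the same length as the input, so it can be plotted against the original index without trimming the ends.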

Related

Equivalent of Hist()'s Layout hyperparameter in Sns.Pairplot?

I am trying to find an equivalent of hist()'s figsize and layout parameters for sns.pairplot().
I have a pairplot that gives me nice scatterplots between the X's and y. However, all plots are laid out in a single horizontal row and, to my knowledge, there is no equivalent layout parameter to wrap them onto several rows. Four plots per row would be great.
This is my current sns.pairplot():
sns.pairplot(X_train,
             x_vars=X_train.select_dtypes(exclude=['object']).columns,
             y_vars=["SalePrice"])
This is what I would like it to look like: Source
num_mask = train_df.dtypes != object
num_cols = train_df.loc[:, num_mask[num_mask == True].keys()]
num_cols.hist(figsize = (30,15), layout = (4,10))
plt.show()
What you want to achieve isn't currently supported by sns.pairplot, but you can use one of the other figure-level functions (sns.displot, sns.catplot, ...). sns.lmplot creates a grid of scatter plots. For this to work, the dataframe needs to be in "long form".
Here is a simple example. sns.lmplot has parameters to leave out the regression line (fit_reg=False), to set the height of the individual subplots (height=...), to set its aspect ratio (aspect=..., where the subplot width will be height times aspect ratio), and many more. If all y ranges are similar, you can use the default sharey=True.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# create some test data with different y-ranges
np.random.seed(20230209)
X_train = pd.DataFrame({"".join(np.random.choice([*'uvwxyz'], np.random.randint(3, 8))):
                            np.random.randn(100).cumsum() + np.random.randint(100, 1000) for _ in range(10)})
X_train['SalePrice'] = np.random.randint(10000, 100000, 100)
# convert the dataframe to long form
# 'SalePrice' will get excluded automatically via `melt`
compare_columns = X_train.select_dtypes(exclude=['object']).columns
long_df = X_train.melt(id_vars='SalePrice', value_vars=compare_columns)
# create a grid of scatter plots
g = sns.lmplot(data=long_df, x='SalePrice', y='value', col='variable', col_wrap=4, sharey=False)
g.set(ylabel='')
plt.show()
Here is another example, with histograms of the mpg dataset:
import matplotlib.pyplot as plt
import seaborn as sns
mpg = sns.load_dataset('mpg')
compare_columns = mpg.select_dtypes(exclude=['object']).columns
mpg_long = mpg.melt(value_vars=compare_columns)
g = sns.displot(data=mpg_long, kde=True, x='value', common_bins=False, col='variable', col_wrap=4, color='crimson',
                facet_kws={'sharex': False, 'sharey': False})
g.set(xlabel='')
plt.show()
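To make the "long form" requirement mentioned above more concrete, here is a minimal sketch of what melt produces. The three-row frame and its column names (LotArea, YearBuilt) are made-up illustration data, not taken from the question.

```python
import pandas as pd

# Hypothetical wide-form data: one column per feature, plus the target
wide = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "SalePrice": [208500, 181500, 223500],
})

# Explicitly leave the target out of the columns to be stacked
value_cols = [c for c in wide.columns if c != "SalePrice"]

# Long form: one row per (SalePrice, feature name, feature value)
long_df = wide.melt(id_vars="SalePrice", value_vars=value_cols)
print(long_df)
```

Each feature becomes a group of rows labelled in the variable column, which is what col='variable' together with col_wrap=4 then splits into a wrapped grid of subplots.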

Plotting audio data properties over long time periods

Using Python matplotlib I would like to plot sensor data over a period of several hours. The signal arrives via an audio card and is sampled in short chunks of data. In the example below, amplitude and RMS are plotted.
In order to plot RMS and other properties over much larger time periods than shown here, downsampling is perhaps needed. I am not sure how to accomplish that and would appreciate any further advice. The intention is to run the code on a Raspberry Pi.
Update 1. A very minimal example is shown for getting a longer time view of RMS.
There is a noticeable, considerable delay in the response to audio signals, in particular when adding more plots to the figure.
I also tried using FuncAnimation without blitting, because I would like to show a real-time axis, and this is equally slow. Using PyQt should give better results.
import datetime as dt
import pyaudio
import matplotlib.pyplot as plt
import numpy as np
mic = pyaudio.PyAudio()
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
CHUNK = int(RATE/20)  # 1/20 s of audio per read
stream = mic.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True,
                  output=True,
                  frames_per_buffer=CHUNK)
fig = plt.figure()
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)
ax1.set_xlabel("Samples = 2*Chunk length ")
ax1.set_ylabel("Amplitude")
ax1.set_title('Audio example')
fig.tight_layout(pad=3.0)
x = np.arange(0, 2 * CHUNK, 2)
ax1.set_ylim(-10e3, 10e3)
ax1.set_xlim(0, CHUNK)
line1, = ax1.plot(x, np.random.rand(CHUNK))
ts = []
rs = []
while True:
    data = stream.read(CHUNK)
    data = np.frombuffer(data, np.int16)
    # Convert to float before squaring (np.float no longer exists in NumPy; int16 would overflow)
    d = data.astype(float)
    rms2 = np.sqrt(np.mean(d**2))
    #print(rms2)
    # Add x and y to lists (UTC timestamps, to match the axis label)
    ts.append(dt.datetime.now(dt.timezone.utc))
    rs.append(rms2)
    # Draw x and y lists
    ax2.clear()
    ax2.plot(ts, rs, color='black')
    # Format plot
    ax2.set_xlabel("Time in UTC")
    ax2.set_ylabel("RMS values")
    ax2.set_title('RMS')
    line1.set_ydata(data)
    plt.setp(ax2.get_xticklabels(), ha="right", rotation=45)
    fig.gca().relim()
    fig.gca().autoscale_view()
    #fig.canvas.draw()
    #fig.canvas.flush_events()
    plt.pause(0.01)
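One possible way to approach the downsampling asked about above is to merge the per-chunk RMS values into coarser blocks before plotting. This is only a sketch under assumptions: downsample_rms and its factor parameter are hypothetical names introduced here, and it relies on all chunks having the same length, in which case the RMS of a block is the root of the mean of the squared chunk RMS values.

```python
import numpy as np

def downsample_rms(rms_values, factor):
    """Combine every `factor` consecutive RMS values into one RMS value.

    Hypothetical helper; assumes every input value covers an equally long chunk.
    """
    rms = np.asarray(rms_values, dtype=float)
    n_blocks = len(rms) // factor  # an incomplete last block is dropped
    blocks = rms[:n_blocks * factor].reshape(n_blocks, factor)
    # RMS values of equal-length chunks combine as the root of the mean of squares
    return np.sqrt(np.mean(blocks**2, axis=1))

chunk_rms = np.array([3.0, 4.0, 0.0, 0.0, 1.0])
coarse = downsample_rms(chunk_rms, 2)
print(coarse)
```

With CHUNK = RATE/20 as in the code above, a factor of 20 would give one RMS value per second, so several hours of data shrink to a few thousand points per hour.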

Time series plot of categorical or binary variables in pandas or matplotlib

I have data that represent a time series of categorical variables. I want to display the transitions in categories below a traditional line plot of related continuous time series to show off context as time evolves. I'd like to know the best way to do this. My attempt was in terms of Rectangles. The appearance is a bit weird, and importantly the axis labels for the x axis don't render as dates.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from pandas.plotting import register_matplotlib_converters
import matplotlib.dates as mdates
register_matplotlib_converters()
t0 = pd.DatetimeIndex(["2017-06-01 00:00","2017-06-17 00:00","2017-07-03 00:00","2017-08-02 00:00","2017-08-09 00:00","2017-09-01 00:00"])
t1 = pd.DatetimeIndex(["2017-06-01 00:00","2017-08-15 00:00","2017-09-01 00:00"])
df0 = pd.DataFrame({"cat":[0,2,1,2,0,1]},index = t0)
df1 = pd.DataFrame({"op":[0,1,0]},index=t1)
# Create new plot
fig,ax = plt.subplots(1,figsize=(8,3))
data_layout = {
    "cat": {0: ('bisque', 'Low'),
            1: ('lightseagreen', 'Medium'),
            2: ('rebeccapurple', 'High')},
    "op": {0: ('darkturquoise', 'Open'),
           1: ('tomato', 'Close')}
}
vars = ("cat", "op")
dfs = [df0, df1]
all_ticks = []
leg = []
for j, (v, d) in enumerate(zip(vars, dfs)):
    dvals = d[v][:].astype("d")
    normal = mpl.colors.Normalize(vmin=0, vmax=2.)
    # as_matrix() was removed from pandas; to_numpy() replaces it (colors is unused below)
    colors = plt.cm.Set1(0.75*normal(dvals.to_numpy()))
    handles = []
    # one rectangle per interval between consecutive timestamps
    for i in range(len(d) - 1):
        s = d[v].index.to_pydatetime()
        level = d[v].iloc[i]
        base = d[v].index[i]
        w = s[i+1] - s[i]
        patch = mpl.patches.Rectangle((base, float(j)), width=w, color=data_layout[v][level][0], height=1, fill=True)
        ax.add_patch(patch)
    # one legend entry per category level
    for lev in data_layout[v]:
        print(data_layout[v][lev])
        handles.append(mpl.patches.Patch(color=data_layout[v][lev][0], label=data_layout[v][lev][1]))
    all_ticks.append(j+0.5)
    leg.append(plt.legend(handles=handles, loc=(3-3*j+1)))
plt.axhline(y=1.,linewidth=3,color="gray")
plt.xlim(pd.Timestamp(2017,6,1).to_pydatetime(),pd.Timestamp(2017,9,1).to_pydatetime())
plt.ylim(0,2)
ax.add_artist(leg[0]) # two legends on one axis
ax.format_xdata = mdates.DateFormatter('%Y-%m-%d') # This fails
plt.yticks(all_ticks,vars)
plt.show()
This produces a plot with no dates on the x axis and with jittery lines. How do I fix this? Is there a better way entirely?
This is a way to display dates on the x-axis:
In your code, replace the line that fails with this one:
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
But I don't remember what it should look like; can you show us the end result again?
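For reference, here is a small self-contained sketch that combines the formatter line above with one possible alternative to hand-built Rectangles, namely broken_barh on date numbers. It is an assumption about what a "better way" could look like, not the method from the answer; the data reuses the "cat" series from the question, and the Agg backend is chosen only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, only for this sketch
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

t0 = pd.DatetimeIndex(["2017-06-01", "2017-06-17", "2017-07-03",
                       "2017-08-02", "2017-08-09", "2017-09-01"])
cats = [0, 2, 1, 2, 0, 1]
colors = {0: "bisque", 1: "lightseagreen", 2: "rebeccapurple"}

# Convert timestamps to Matplotlib date numbers so bar widths are plain floats
starts = mdates.date2num(t0.to_pydatetime())

fig, ax = plt.subplots(figsize=(8, 2))
for i in range(len(starts) - 1):
    # one bar per interval: (start, width) in x, (bottom, height) in y
    ax.broken_barh([(starts[i], starts[i + 1] - starts[i])], (0, 1),
                   facecolors=colors[cats[i]])

# Tell the axis that x values are dates and format the tick labels
ax.xaxis_date()
formatter = mdates.DateFormatter("%Y-%m-%d")
ax.xaxis.set_major_formatter(formatter)
fig.autofmt_xdate()
```

The last category value only marks the end of the final interval, so six timestamps give five bars. Call fig.savefig(...) or plt.show() to see the result.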

Python keeps overwriting hist on previous plot but doesn't save it with the desired plot

I am saving two separate figures, each of which should contain two plots together.
The first figure is fine, but the second one is not drawn on a new plot; it is drawn on top of the previous one, and in the saved figure I only find one of the plots:
This is the first figure, which I get correctly:
import scipy.stats as s
import numpy as np
import os
import pandas as pd
import openpyxl as pyx
import matplotlib
matplotlib.rcParams["backend"] = "TkAgg"
#matplotlib.rcParams['backend'] = "Qt4Agg"
#matplotlib.rcParams['backend'] = "nbAgg"
import matplotlib.pyplot as plt
import math
data = [336256, 620316, 958846, 1007830, 1080401]
pdf = np.array([0.00449982, 0.0045293, 0.00455894, 0.02397463,
                0.02395788, 0.02394114])
fig, ax = plt.subplots();
fig = plt.figure(figsize=(40,30))
x = np.linspace(np.min(data), np.max(data), 100);
plt.plot(x, s.exponweib.pdf(x, *s.exponweib.fit(data, 1, 1, loc=0, scale=2)))
plt.hist(data, bins=np.linspace(data[0], data[-1], 100), density=True, alpha=1)  # density= replaces the removed normed= keyword
text1= ' Weibull'
plt.savefig(text1+ '.png' )
datar =np.asarray(data)
mu, sigma = datar.mean() , datar.std() # mean and standard deviation
normal_std = np.sqrt(np.log(1 + (sigma/mu)**2))
normal_mean = np.log(mu) - normal_std**2 / 2
hs = np.random.lognormal(normal_mean, normal_std, 1000)
print(hs.max()) # some finite number
print(hs.mean()) # about 136519
print(hs.std()) # about 50405
count, bins, ignored = plt.hist(hs, 100, density=True)  # density= replaces the removed normed= keyword
x = np.linspace(min(bins), max(bins), 10000)
pdfT = []
for el in range(len(x)):
    pdfTmp = (math.exp(-(np.log(x[el]) - normal_mean)**2 / (2 * normal_std**2)))
    pdfT += [pdfTmp]
pdf = np.asarray(pdfT)
This is the second set:
fig, ax = plt.subplots();
fig = plt.figure(figsize=(40,40))
plt.plot(x, pdf, linewidth=2, color='r')
plt.hist(data, bins=np.linspace(data[0], data[-1], 100), density=True, alpha=1)  # density= replaces the removed normed= keyword
text= ' Lognormal '
plt.savefig(text+ '.png' )
The first plot saves the histogram together with the curve. The second one, however, only saves the curve.
Update 1: looking at this question, I found out that clearing the plot history keeps the figures from getting mixed up, but the two plots of my second set (the lognormal one) are still not saved together: I only get the curve and not the histogram.
This is happening because you have set normed=True (density=True in current Matplotlib), which means that the area under the histogram is normalized to 1. Since your bins are very wide, the heights of the histogram bars are very small (in this case so small that they are not visible next to the curve).
If you use
n, bins, _ = plt.hist(data, bins=np.linspace(data[0], data[-1], 100), density=True, alpha=1)
n will contain the heights of the bars, and you can confirm this yourself.
Also have a look at the documentation for plt.hist.
So if you set normed (density) to False, the histogram will be visible.
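The effect of the normalization can be checked without plotting. The following sketch uses np.histogram (which computes the same bar heights as plt.hist) on the data list and bins from the question; the variable names are chosen here for illustration.

```python
import numpy as np

data = [336256, 620316, 958846, 1007830, 1080401]
bins = np.linspace(data[0], data[-1], 100)

# Raw counts per bin versus density-normalized heights
counts, edges = np.histogram(data, bins=bins)
heights, _ = np.histogram(data, bins=bins, density=True)
widths = np.diff(edges)

print(counts.max())               # tallest bar without normalization
print(heights.max())              # tallest bar with normalization
print((heights * widths).sum())   # total area of the normalized histogram
```

Each bin is several thousand units wide, so the normalized heights have to be tiny for the total area to be 1, while the red curve in the question reaches values close to 1 and therefore dwarfs the bars.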
Edit: number of bins
import numpy as np
import matplotlib.pyplot as plt
rand_data = np.random.uniform(0, 1.0, 100)
fig = plt.figure()
ax_1 = fig.add_subplot(211)
ax_1.hist(rand_data, bins=10)
ax_2 = fig.add_subplot(212)
ax_2.hist(rand_data, bins=100)
plt.show()
This gives you two plots (their exact shape varies, since the data is random) that show how the number of bins changes the histogram.
A histogram visualises the distribution of your data along one dimension, so I am not sure what you mean by number of inputs and bins.

Pandas boxplot side by side for different DataFrame

There are nice examples online of plotting boxplots side by side. However, with my data stored in two different pandas DataFrames, and already having some subplots, I have not managed to get my boxplots next to each other instead of overlapping.
My code is as follows:
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
mpl.use('agg')
fig, axarr = plt.subplots(3,sharex=True,sharey=True,figsize=(9,6))
month = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
percentiles = [90,95,98]
nr = 0
for p in percentiles:
    future_data = pd.DataFrame(np.random.randint(0,30,size=(30,12)),columns = month)
    present_data = pd.DataFrame(np.random.randint(0,30,size=(30,12)),columns = month)
    # as_matrix() was removed from pandas; to_numpy() replaces it
    Future = future_data.to_numpy()
    Present = present_data.to_numpy()
    pp = axarr[nr].boxplot(Present,patch_artist=True, showfliers=False)
    fp = axarr[nr].boxplot(Future, patch_artist=True, showfliers=False)
    nr += 1
The result looks as follows:
[Image: overlapping boxplots]
Could you help me make sure the boxes are next to each other, so I can compare them without being bothered by the overlap?
Thank you!
EDIT: I have reduced the code somewhat so it can run like this.
You need to position your boxes manually, i.e. provide the positions as an array to the positions argument of boxplot. Here it makes sense to shift one set by -0.2 and the other by +0.2 relative to their integer positions. You can then adjust the widths so that the boxes do not overlap: each width should be no larger than the distance between the two positions (0.4 here).
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
fig, axarr = plt.subplots(3,sharex=True,sharey=True,figsize=(9,6))
month = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
percentiles = [90,95,98]
nr = 0
for p in percentiles:
    future_data = pd.DataFrame(np.random.randint(0,30,size=(30,12)),columns = month)
    present_data = pd.DataFrame(np.random.randint(0,30,size=(30,12)),columns = month)
    # as_matrix() was removed from pandas; to_numpy() replaces it
    Future = future_data.to_numpy()
    Present = present_data.to_numpy()
    # shift the two sets by -0.2 and +0.2 around each integer position
    pp = axarr[nr].boxplot(Present,patch_artist=True, showfliers=False,
                           positions=np.arange(Present.shape[1])-.2, widths=0.4)
    fp = axarr[nr].boxplot(Future, patch_artist=True, showfliers=False,
                           positions=np.arange(Present.shape[1])+.2, widths=0.4)
    nr += 1
axarr[-1].set_xticks(np.arange(len(month)))
axarr[-1].set_xticklabels(month)
axarr[-1].set_xlim(-0.5,len(month)-.5)
plt.show()
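The -0.2 / +0.2 shift used above generalises to more than two DataFrames. As a rough sketch, the offsets and the matching box width could be computed as follows; grouped_positions and total_width are hypothetical names introduced for this illustration, not part of matplotlib.

```python
import numpy as np

def grouped_positions(n_categories, n_groups, total_width=0.8):
    """Return (positions, box_width) for n_groups boxes around each integer tick.

    Hypothetical helper: total_width is the share of each tick interval
    that the group of boxes may occupy.
    """
    box_width = total_width / n_groups
    centers = np.arange(n_categories)
    # offsets are symmetric around 0, e.g. [-0.2, +0.2] for two groups
    offsets = (np.arange(n_groups) - (n_groups - 1) / 2) * box_width
    positions = [centers + off for off in offsets]
    return positions, box_width

positions, width = grouped_positions(n_categories=12, n_groups=2)
print(positions[0][:3], positions[1][:3], width)
```

For two groups this reproduces the positions and widths=0.4 from the answer; positions[k] and width can then be passed as positions= and widths= in the k-th boxplot call.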