IPython / pandas: Is there a canonical way to detect rapid changes in a timeseries?

Noob data analyst, analyzing some gas concentrations over a timeseries of a couple of thousand points (so small). I graphed it with Matplotlib, and there are some easy to see points where things change rapidly.
What is the canonical / easiest way to home in on those points?

import pandas as pd
from numpy import diff, concatenate

# acquire data here; two columns: 'Year' and 'Recon' (the measured concentration)
ff = pd.DataFrame(columns=('Year', 'Recon'))

# first difference of the signal, padded with a leading 0 to keep the original length
fd = diff(ff['Recon'])
ff['diff'] = concatenate([[0], fd])

# smooth the differences with rolling means of two window sizes
# (the original pd.rolling_mean was removed in pandas 0.20; use .rolling().mean())
ff['rolling10'] = ff['diff'].rolling(10).mean()
ff['rolling5'] = ff['diff'].rolling(5).mean()

ff.plot('Year', ['rolling5', 'rolling10'], subplots=False)
But note: my test data was evenly sampled. Count-based rolling windows like the ones above don't handle irregular time series, though there are workarounds: Pandas: rolling mean by time interval
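One such workaround, as a minimal sketch: newer pandas versions accept a time-offset window, which copes with irregular sampling directly (assuming the timestamps can be parsed into a DatetimeIndex; column names follow the snippet above):

import pandas as pd

# irregularly sampled toy series
idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05',
                      '2020-01-06', '2020-01-12', '2020-01-13'])
ff = pd.DataFrame({'Recon': [1.0, 1.1, 1.3, 2.5, 2.6, 2.7]}, index=idx)

# first difference, then a rolling mean over a 7-day window measured in time,
# not in number of samples
ff['diff'] = ff['Recon'].diff().fillna(0)
ff['rolling7d'] = ff['diff'].rolling('7D').mean()
print(ff)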

Related

How to average data for a variable over a number of timesteps

I was wondering if anyone could shed some light on how I can average this data:
I have a .nc file with data (dimensions: 2029,64,32) which relates to time, latitude and longitude. Using these commands I can plot individual timesteps:
import matplotlib.pyplot as plt

# data is the netCDF4 Dataset opened elsewhere; 'precip' has dimensions (time, lat, lon)
timestep = data.variables['precip'][0]
plt.imshow(timestep)
plt.colorbar()
plt.show()
Giving a graph in this format for the 0th timestep:
I was wondering if there was any way to average this first dimension (the snapshots in time).
If you are looking to take a mean over all times, use np.mean with the axis keyword to say which axis you want to average over.
time_averaged = np.mean(data.variables['precip'], axis=0)
If you have NaN values then np.mean will give NaN for that lon/lat point. If you'd rather ignore them then use np.nanmean.
If you want to do specific times only, e.g. the first 1000 time steps, then you could do
time_averaged = np.mean(data.variables['precip'][:1000, :, :], axis=0)
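A minimal self-contained sketch of the same idea, with a synthetic array standing in for the netCDF variable (shapes chosen to mirror the question):

import numpy as np

# synthetic (time, lat, lon) array in place of data.variables['precip']
precip = np.random.rand(2029, 64, 32)
precip[0, 0, 0] = np.nan  # pretend one grid point is missing at one time step

time_averaged = np.nanmean(precip, axis=0)           # mean over time, ignoring NaNs
first_1000_avg = np.nanmean(precip[:1000], axis=0)   # only the first 1000 time steps

print(time_averaged.shape)   # (64, 32): one value per lat/lon point
print(first_1000_avg.shape)  # (64, 32)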
If you're using pandas and NumPy, this may help you; see the pandas rolling documentation for more details:
import pandas as pd
import numpy as np

data = np.array([10, 5, 8, 9, 15, 22, 26, 11, 15, 16, 18, 7])
d = pd.Series(data)
# rolling mean over a 4-sample window; the first 3 values are NaN
print(d.rolling(4).mean())

Exponential moving average in pandas

I was having a bit of trouble making an exponential moving average for a pandas data frame. I managed to make a simple moving average but I'm not sure how I can make one that is exponential. I was wondering if there's a function in pandas or maybe another module that can help with this. Ideally the exponential moving average would be in another column in my data frame. This is my code below:
import pandas as pd
import datetime as dt
import yfinance as yf

# Get initial parameters
start = dt.date(2020, 1, 1)
end = dt.date.today()
ticker = 'SPY'

# Download the price data
df = yf.download(ticker, start, end, progress=False)

# Simple moving average over a 75-day window
df['SMA'] = df['Adj Close'].rolling(window=75, min_periods=1).mean()
Thanks
Use the ewm method:
df['EMA'] = df['Adj Close'].ewm(span=75, min_periods=1).mean()
NB: check the parameters' documentation carefully; there is no window argument any more, you should use one of com, span, halflife or alpha instead.
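As a rough illustration of how span maps to the smoothing factor (a toy price series here instead of the yfinance download above; with adjust=False the pandas result matches the textbook recursion):

import pandas as pd

prices = pd.Series([10.0, 11.0, 12.0, 11.5, 13.0])

span = 3
alpha = 2 / (span + 1)  # span maps to this smoothing factor

# pandas EMA in its recursive form
ema_pandas = prices.ewm(span=span, adjust=False).mean()

# manual recursion: ema_t = alpha * x_t + (1 - alpha) * ema_{t-1}
ema_manual = [prices.iloc[0]]
for x in prices.iloc[1:]:
    ema_manual.append(alpha * x + (1 - alpha) * ema_manual[-1])

print(ema_pandas.tolist())
print(ema_manual)  # same values as ema_pandas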

Fastest way to find nearest nonzero value in array from columns in pandas dataframe

I am looking for the nearest nonzero cell in a numpy 3d array based on the i,j,k coordinates stored in a pandas dataframe. My solution below works, but it is slower than I would like. I know my optimization skills are lacking, so I am hoping someone can help me find a faster option.
It takes 2 seconds to find the nearest non-zero for a 100 x 100 x 100 binary array, and I have hundreds of files, so any speed enhancements would be much appreciated!
import numpy as np
import pandas as pd
import time

a = np.random.randint(0, 2, size=(100, 100, 100))

# df with the i,j,k coordinates of interest
df = pd.DataFrame(np.random.randint(100, size=(100, 3)).tolist(),
                  columns=['i', 'j', 'k'])

def find_nearest(a, df):
    t0 = time.time()
    nzi = np.nonzero(a)  # tuple of three index arrays, one (k, i, j) triple per nonzero cell
    for i, r in df.iterrows():
        # squared Euclidean distance from this point to every nonzero cell
        dist = ((r['k'] - nzi[0])**2 +
                (r['i'] - nzi[1])**2 +
                (r['j'] - nzi[2])**2)
        nidx = dist.argmin()
        df.loc[i, ['nk', 'ni', 'nj']] = (nzi[0][nidx],
                                         nzi[1][nidx],
                                         nzi[2][nidx])
    print(time.time() - t0)
    return df
The problem that you are trying to solve looks like a nearest-neighbor search.
The worst-case complexity of the current code is O(n m), with n the number of points to search and m the number of neighbour candidates. With n = 100 and m = 100**3 = 1,000,000, this means on the order of a hundred million iterations. To solve this efficiently, one can use a better algorithm.
The common way to solve this kind of problem is to put all the candidates into a spatial tree data structure (such as a quadtree, an octree or a k-d tree). Such a data structure lets you locate the elements nearest to a query location in O(log m) time, so the overall complexity of this method is O(n log m)! SciPy already implements k-d trees.
Vectorization generally also helps to speed up the computation.
from scipy.spatial import KDTree
import numpy as np
import pandas as pd
import time

def find_nearest_fast(a, df):
    t0 = time.time()
    # coordinates of the nonzero cells, one (k, i, j) row per candidate
    candidates = np.array(np.nonzero(a)).transpose().copy()
    tree = KDTree(candidates, leafsize=1024, compact_nodes=False)
    # query points in the same (k, i, j) order used by the slow version
    searched = np.array([df['k'], df['i'], df['j']]).transpose()
    distances, indices = tree.query(searched)
    nearestPoints = candidates[indices, :]
    df[['nk', 'ni', 'nj']] = nearestPoints
    print(time.time() - t0)
    return df
This implementation is 16 times faster on my machine. Note that the results can differ slightly, since a given input point can have several nearest candidates at the same distance.
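For completeness, a usage sketch with synthetic inputs shaped like the ones in the question, calling the function defined above:

import numpy as np
import pandas as pd

a = np.random.randint(0, 2, size=(100, 100, 100))
df = pd.DataFrame(np.random.randint(100, size=(100, 3)),
                  columns=['i', 'j', 'k'])

df = find_nearest_fast(a, df)
print(df.head())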

Graph CSV data that is represented horizontally rather than vertically

Context: I have combined numerous CSV's into one representing use case vs usage over a period of time.
The way the data is currently represented is attached.
What I am trying to do is, for each use case, graph across row A (1, 1.1, 1.9, 4.0.11435, 4.1.11436 and so on...), creating a line plot to show progression over time.
What I have so far:
import pandas as pd
import matplotlib.pyplot as plt

plot_df = pd.read_csv("results.csv")
milestones = plot_df.columns[1:]

# plot the first use case (row 0) across the milestone columns
row = plot_df.iloc[0]
row.plot(kind='line')
plt.show()
Any help is appreciated.
Thank you
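One way to get one line per use case, sketched under the assumption that the first CSV column holds the use-case names (the actual layout comes from the attached file, so treat this as a starting point):

import pandas as pd
import matplotlib.pyplot as plt

# assume the first column identifies the use case; the remaining columns are milestones
plot_df = pd.read_csv("results.csv", index_col=0)

# transpose so milestones become the x axis and each use case becomes one line
plot_df.T.plot(kind='line', marker='o')
plt.xlabel("milestone")
plt.ylabel("usage")
plt.legend(title="use case")
plt.show()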

Histogram in Bokeh charts takes a looong time

I am trying to move from matplotlib to bokeh. However, I am finding some annoying features. The last one I encountered was that it took several minutes to make a histogram of about 1.5M entries - it would have taken a fraction of a second with Matplotlib. Is that normal? And if so, what's the reason?
from bokeh.charts import Histogram, show
from bokeh.io import output_notebook
import pandas as pd

output_notebook()

jd1 = pd.read_csv("somefile.csv")
p = Histogram(jd1['QTY'], bins=50)
show(p)
I'm not sure offhand what might be going on with Histogram in your case. Without the data file it's impossible to try to reproduce or debug. In any case, bokeh.charts does not really have a maintainer at the moment, so I would actually just recommend using bokeh.plotting to create your histogram. The bokeh.plotting API is stable (for several years now) and extensively documented. It's a few more lines of code, but not many:
import numpy as np
from bokeh.plotting import figure, show, output_notebook

output_notebook()

# synthesize example data: 10 million samples from a normal distribution
measured = np.random.normal(0, 0.5, 10000000)
hist, edges = np.histogram(measured, density=True, bins=50)

p = figure(title="Normal Distribution (μ=0, σ=0.5)")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color=None)
show(p)
As you can see, that takes (on my laptop) about half a second for a 10 million point histogram, including generating the synthetic data and binning it.
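Adapted to the data frame from the question, the same pattern would look roughly like this (the CSV file and the QTY column come from the question; everything else follows the example above):

import numpy as np
import pandas as pd
from bokeh.plotting import figure, show, output_notebook

output_notebook()

jd1 = pd.read_csv("somefile.csv")

# bin the column with numpy, then draw the 50 bars as quads
hist, edges = np.histogram(jd1['QTY'].dropna(), bins=50)

p = figure(title="QTY histogram")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color=None)
show(p)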