Issues with Decomposing Trend, Seasonal, and Residual Time Series Elements - dataframe

I am quite a newbie to time series analysis and this might be a stupid question.
I am trying to generate the trend, seasonal, and residual time series components; however, my timestamp index actually consists of strings (let's say 'window1', 'window2', 'window3'). Now, when I try to apply seasonal_decompose(data, model='multiplicative'), it returns the error 'Index' object has no attribute 'inferred_freq', which is pretty understandable.
However, how can I get around this issue while keeping strings as the time series index?

Basically, you need to specify the freq parameter here.
Suppose you have the following dataset:
s = pd.Series([102,200,322,420], index=['window1', 'window2', 'window3','window4'])
s
>>>window1 102
window2 200
window3 322
window4 420
dtype: int64
Now specify the freq parameter; in this case I used freq=1:
import matplotlib.pyplot as plt
import statsmodels.api as sm

plt.style.use('default')
plt.figure(figsize=(16, 8))
# passing s.values avoids the string-index problem; freq is the length of the seasonal cycle
sm.tsa.seasonal_decompose(s.values, freq=1).plot()
result = sm.tsa.stattools.adfuller(s, maxlag=1)  # optional stationarity check, see the note on maxlag below
plt.show()
I am not allowed to post an image, but I hope this code solves your problem. Also, maxlag by default gave an error for my dataset, therefore I used maxlag=1. If you are not sure about its value, use the default value for maxlag.
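For what it's worth, in newer statsmodels releases the freq keyword of seasonal_decompose was renamed to period, so the equivalent call there would look roughly like this (a sketch using the same toy series as above):
import pandas as pd
import statsmodels.api as sm

s = pd.Series([102, 200, 322, 420], index=['window1', 'window2', 'window3', 'window4'])
# passing .values sidesteps the string index; period plays the role of the old freq argument
result = sm.tsa.seasonal_decompose(s.values, model='multiplicative', period=1)
print(result.trend, result.seasonal, result.resid)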

Related

How do you deal with datetime obj when applying ANN models?

How do you deal with datetime objects when applying ANN models? I have thought of writing a function which iterates through the column, but there has to be a cleaner way to do this, right?
dataset.info()
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Unnamed: 0   299 non-null    int64
 1   ZIP          299 non-null    int64
 2   START_TIME   299 non-null    datetime64[ns]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
raises
TypeError: float() argument must be a string or a number, not 'Timestamp'
and other attempts end in
TypeError: float() argument must be a string or a number, not 'datetime.time' (in relation to a scatter plot)
ValueError: could not convert string to float: '2022-03-16 11:55:00'
I would suggest the following steps:
First, convert the strings to datetime.datetime objects:
from datetime import datetime
t = datetime.strptime("2022-03-16 11:55:00", "%Y-%m-%d %H:%M:%S")
Then extract the necessary components to pass as inputs to the network:
x1,x2,x3 = t.month, t.hour, t.minute
As an aside, I noticed you are directly scaling the time components. Instead, do some pre-processing that suits the problem. For example, extract sine and cosine features of the time components rather than using or scaling them directly; sine and cosine components preserve the distance between time points (hour 23 ends up close to hour 0).
import numpy as np
# map the hour onto the unit circle so that hour 23 lands next to hour 0
hour_cos = np.cos(2 * np.pi * t.hour / 24)
hour_sin = np.sin(2 * np.pi * t.hour / 24)
Extract other periodic components as necessary for the problem. For example, if you are looking at a weather variable, sine and cosine of the hour and the month are typically useful; if you are looking at sales, sine and cosine of day of month, month, and day of week are useful.
Update: from the comments I noticed you mentioned that you are predicting decibel levels. Assuming you are already factoring in spatial input variables, you should definitely try a sine/cosine transformation, provided the events generating the sounds exhibit a periodic pattern. Again, this is an assumption and might not be completely true.
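Applied to the dataframe from the question, that could look roughly like this (the helper name is made up; START_TIME is the column shown in the dataset.info() output above):
import numpy as np
import pandas as pd

# illustrative helper: replace the datetime column by cyclical hour/month features
def add_cyclical_time_features(df, col='START_TIME'):
    ts = pd.to_datetime(df[col])
    df['hour_sin'] = np.sin(2 * np.pi * ts.dt.hour / 24)
    df['hour_cos'] = np.cos(2 * np.pi * ts.dt.hour / 24)
    df['month_sin'] = np.sin(2 * np.pi * (ts.dt.month - 1) / 12)
    df['month_cos'] = np.cos(2 * np.pi * (ts.dt.month - 1) / 12)
    return df.drop(columns=[col])

dataset = add_cyclical_time_features(dataset)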
dataset['START_TIME'] = pd.to_datetime(dataset['START_TIME']).apply(lambda x: x.value)  # .value is the integer nanosecond timestamp
Seems like a clean way of doing it, but I'm still open to alternatives.

how to get the number of periods used by pd.corr()

How can I figure out the quality of a correlation obtained with pd.corr()?
And by quality, I mean how much of the data is overlapping and thus used by .corr(). Let's illustrate this with a short example:
Let's set up a sample dataframe:
import numpy as np
import pandas as pd

time1 = pd.date_range(start='2020-01-01', end='2020-01-05', freq='D')
time2 = pd.date_range(start='2020-01-04', end='2020-01-08', freq='D')
time3 = pd.date_range(start='2020-01-01', end='2020-01-08', freq='D')
df1 = pd.DataFrame(np.random.randn(len(time1)), index=time1, columns=['first'])
df2 = pd.DataFrame(np.random.randn(len(time2)), index=time2, columns=['second'])
df3 = pd.DataFrame(np.random.randn(len(time3)), index=time3, columns=['third'])
df4 = pd.DataFrame(np.random.randn(len(time3)), index=time3, columns=['forth'])
df = pd.concat([df1, df2, df3, df4], axis=1)
As you can see, first and second share only two data points, whereas third and forth have data for the whole period, and thus share all of it with each other and five data points with first and second respectively.
Or in a graphic:
Obviously, it'll be different each time because of np.random.randn, but in this special case it nicely illustrates my point. If we look at the correlation matrix, df.corr():
            first    second     third     forth
first    1.000000  1.000000  0.034076 -0.023059
second   1.000000  1.000000 -0.021810  0.928713
third    0.034076 -0.021810  1.000000  0.458744
forth   -0.023059  0.928713  0.458744  1.000000
We not only see the expected perfect correlation on the diagonal, but apparently first and second also correlate perfectly (same gradient in the plot). Mathematically this is true; however, assuming all columns observe the same kind of phenomenon, I would call the correlations among the others, especially between third and forth, better, since much more data went into them.
Whether first and second really correlate more strongly than the others will only show once I have the same amount of data for them.
Now, with such a simple dataframe it's quite trivial to obtain this information, but with a dataframe of a few hundred entries and times starting in the 1930s, this simple visual identification becomes tedious, if not impossible.
How do I get the overlap information behind the correlation matrix in a straightforward and simple way?
pd.corr() has the option min_periods, so I could set min_periods=3 to discard the low-overlap correlation in this example, but with real data I'd rather not exclude anything and instead simply have information about the number of periods that were used to obtain each correlation.
I haven't yet found a quick and easy way to get the numbers, but for my application a quick visual overview also works.
With the test data above this is not really needed, but let's keep using it for illustrative purposes.
A simple plt.imshow(df, aspect='auto') works quite well. But since I'm only interested in data vs. no data (so any number vs. NaN), multiplying by zero works better: NaN is left as is and painted white by imshow, while any number becomes zero and thus gets a color. So:
plt.imshow(df * 0, aspect='auto')
Or with some real data:
Not perfect (so I'm not going to accept this answer), but a good first step to identify which data are good and which are less so.
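If the actual counts are wanted rather than a picture, one small sketch seems to do it: .corr() uses pairwise-complete observations by default, so the number of periods behind each entry is just the matrix product of the non-NaN indicator matrix with itself.
notna = df.notna().astype(int)
overlap = notna.T.dot(notna)  # entry (i, j): number of rows where both column i and column j have data
print(overlap)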

Random projection in Python Pandas using a dataframe containing NaN values

I have a dataframe data containing real values and some NaN values. I'm trying to perform locality-sensitive hashing using random projections to reduce the dimension to 25 components, specifically with the sklearn.random_projection.GaussianRandomProjection class. However, when I run:
tx = random_projection.GaussianRandomProjection(n_components = 25)
data25 = tx.fit_transform(data)
I get Input contains NaN, infinity or a value too large for dtype('float64'). Is there a workaround for this? I tried changing all the NaN values to a value that never appears in my dataset, such as -1. How valid would my output be in that case? I'm not an expert on the theory behind locality-sensitive hashing/random projections, so any insight would be helpful as well. Thanks.
NA / NaN values (not-available / not-a-number) are, I have found, just plain troublesome.
You don't want to just substitute an arbitrary value like -1. If you are inclined to do that, use one of the imputer classes instead. Otherwise, you are likely to substantially change the distances between points, and you likely want to preserve distances as much as possible if you are using random projection:
The dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.
However, this may or may not result in reasonable values for learning. As far as I know, imputation is an open field of study, which (for instance) this gentleman has specialized in studying.
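For example, a minimal mean-imputation sketch with scikit-learn (the class is called SimpleImputer in current versions, Imputer in older ones; data is the dataframe from the question):
from sklearn.impute import SimpleImputer
# replace each NaN with the mean of its column instead of a sentinel like -1
data_filled = SimpleImputer(strategy='mean').fit_transform(data)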
If you have enough examples, consider dropping rows or columns that contain NaN values. Another possibility is training a generative model like a Restricted Boltzmann Machine and using it to fill in the missing values:
import numpy as np
import sklearn.neural_network
import sklearn.preprocessing  # Imputer; newer scikit-learn calls this sklearn.impute.SimpleImputer

rbm = sklearn.neural_network.BernoulliRBM().fit(data_with_no_nans)  # train on the complete rows only
mean_imputed_data = sklearn.preprocessing.Imputer().fit_transform(all_data)  # mean-impute as a starting point
rbm_imputation = rbm.gibbs(mean_imputed_data)  # one Gibbs step refines the initial guesses
nan_mask = np.isnan(all_data)
all_data[nan_mask] = rbm_imputation[nan_mask]  # overwrite only the missing entries
Finally, you might consider imputing using nearest neighbors: for a given column, train a nearest-neighbors model on all the other variables using only the complete rows; then, for a row missing that column, find its k nearest neighbors and use their average value. (This gets very costly, especially if you have rows with more than one missing value, since you have to train a model for every combination of missing columns.)
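Recent scikit-learn versions package this idea up as KNNImputer, which fills each missing entry with the mean of that feature over the k nearest rows. A minimal sketch on the question's data dataframe:
from sklearn.impute import KNNImputer
# each NaN is replaced by the average of that feature over the 5 nearest rows
data_knn = KNNImputer(n_neighbors=5).fit_transform(data)
After that, the GaussianRandomProjection call from the question should run without the NaN error.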

Pandas, compute many means with bootstrap confidence intervals for plotting

I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:
   ATG12 Norm  ATG5 Norm  ATG7 Norm Cancer Stage
         5.55       4.99       8.99          IIA
         4.87       5.77       8.88          IIA
         5.98       7.88       8.34          IIC
The subsets I'm interested in are every combination of Norm column and cancer stage. I've managed to produce a table of means using:
df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']].mean()
But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
It boils down to:
import scipy
import scikits.bootstrap as bootstrap
CI = bootstrap.ci(data=Series, statfunction=scipy.mean)
# CI[0] and CI[1] are your low and high confidence intervals
I tried to apply this method to each subset of the data with a nested-loop script:
for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]: # PROBLEM!!
        Series = i[p]
        print p
        print Series.mean()
        ci = bootstrap.ci(data=Series, statfunction=scipy.mean)
which produced the error message
AttributeError: 'tuple' object has no attribute 'columns'
Not knowing what tuples are, I have some reading to do, but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to plot from easily. I'm new to pandas, so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.
The way you iterate over the groupby object is wrong! When you use groupby(), your data frame is sliced along the values in your groupby column(s), and each slice is returned together with those values as the group name, forming a so-called tuple: (name, dataforgroup). The correct recipe for iterating over groupby objects is
for name, group in data.groupby('Cancer Stage'):
    print name
    for p in group.columns[0:3]:
        ...
Please read more about the groupby functionality of pandas here and go through the Python reference in order to understand what tuples are!
Grouping data frames and applying a function is essentially done in one statement, using the apply functionality of pandas:
cols = data.columns[0:2]
for col in cols:
    print data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x, statfunction=scipy.mean))
This does everything you need in one statement and produces a (nicely plottable) series for you.
EDIT:
I toyed around with a data frame object I created myself:
df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})
for col in ['A', 'C']:
    print df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values))
yields two series that look like this:
B
a [6.58333333333, 14.3333333333]
b [8.5, 16.25]
B
a [21.5833333333, 29.3333333333]
b [23.4166666667, 31.25]
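To get from series like these to the bar chart described in the question, here is one self-contained sketch. It uses a plain percentile bootstrap in place of scikits.bootstrap and the column names from the question; everything else (function name, number of resamples) is illustrative:
import numpy as np
import matplotlib.pyplot as plt

def boot_ci(x, n_boot=10000, alpha=0.05):
    # percentile-bootstrap confidence interval for the mean of a 1-d array
    means = np.mean(np.random.choice(x, size=(n_boot, len(x)), replace=True), axis=1)
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

col = 'ATG12 Norm'                                        # repeat for the other Norm columns
grouped = df.groupby('Cancer Stage')[col]
means = grouped.mean()
cis = np.vstack([boot_ci(g.values) for _, g in grouped])  # one (low, high) row per stage
yerr = np.abs(cis.T - means.values)                       # distance from each mean to its CI ends

plt.bar(means.index, means.values, yerr=yerr, capsize=4)
plt.ylabel(col)
plt.show()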

Adding statsmodels 'predict' results to a Pandas dataframe

It is common to want to append the results of predictions to the dataset used to make the predictions, but the statsmodels predict function returns (non-indexed) results of a potentially different length than the dataset on which predictions are based.
For example, if the test dataset, test, contains any null entries, then
mod_fit = sm.Logit.from_formula('Y ~ A + B + C', train).fit()
preds = mod_fit.predict(test)
will produce an array that is shorter than the length of test, and cannot be usefully appended with
test['preds'] = preds
And since the result of predict is not indexed, there is no way to recover the rows to which the results should be attached.
What is the idiom for associating predict results to the rows from which they were generated? Is there, perhaps, a way to get predict to return a dataframe that preserves the indices of its argument?
Predict shouldn't drop any rows. Can you post a minimal working example where this happens? Preserving the pandas index is on my radar and should be fixed in master soon.
https://github.com/statsmodels/statsmodels/issues/1501
Edit: Never mind, this is a known issue: https://github.com/statsmodels/statsmodels/issues/1352
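In the meantime, one workaround sketch is to predict only on the rows that are complete in the model's variables and write the results back by index (A, B and C are the column names from the question):
import pandas as pd
# rows with all model inputs present, so the prediction length matches the index
complete = test.dropna(subset=['A', 'B', 'C'])
# aligning on complete.index leaves NaN in 'preds' for the dropped rows
test['preds'] = pd.Series(mod_fit.predict(complete), index=complete.index)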