NaN in data frame: when first observation of time series is NaN, frontfill with first available, otherwise carry over last / previous observation - pandas

I am performing an ADF test from statsmodels. The value series can have missing observations. In fact, I drop the analysis if the fraction of NaNs is larger than c. However, if the series makes it through that filter, I run into the problem that adfuller cannot deal with missing data. Since this is training data with a minimum frame size, I would like to do:
1) if x(t=0) = NaN, then find the next non-NaN value (t>0)
2) otherwise if x(t) = NaN, then x(t) = x(t-1)
So I am compromising my first value here, but making sure the input data always has the same dimension. Alternatively, if the first value is missing, I could fill it with 0 using the limit option of fillna.
From the documentation the different options are not 100% clear to me:
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
    Method to use for filling holes in reindexed Series.
    pad / ffill: propagate last valid observation forward to next valid.
    backfill / bfill: use NEXT valid observation to fill gap.
pad / ffill: does that mean I carry over the previous value?
backfill / bfill: does that mean the value is taken from a valid observation in the future?
df.fillna(method='bfill', limit=1, inplace=True)
df.fillna(method='ffill', inplace=True)
Would that work with limit? The documentation's example uses limit=1, but there the value to be filled is already predetermined.

1) if x(t=0) = NaN, then find the next non-NaN value (t>0) 2) otherwise if x(t) = NaN, then x(t) = x(t-1)
To front-fill all observations except for (possibly) the first ones, which should be backfilled, you can chain two calls to fillna, the first with method='ffill' and the second with method='bfill':
>>> df = pd.DataFrame({'a': [None, None, 1, None, 2, None]})
>>> df.fillna(method='ffill').fillna(method='bfill')
     a
0  1.0
1  1.0
2  1.0
3  1.0
4  2.0
5  2.0
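For completeness, the same result can be obtained in current pandas with the ffill()/bfill() shortcut methods, and the filled series can then be handed straight to adfuller. A minimal sketch under assumptions: the toy random-walk series, the threshold c and its value are made up for illustration, not taken from the question.

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# toy series: a random walk with a missing first value and a few gaps
rng = np.random.default_rng(0)
values = np.cumsum(rng.normal(size=30))
values[[0, 5, 6, 17]] = np.nan
s = pd.Series(values)

c = 0.5  # hypothetical threshold on the fraction of NaNs
if s.isna().mean() <= c:
    # forward-fill gaps, then back-fill a possibly missing first value
    filled = s.ffill().bfill()
    adf_stat, p_value, *_ = adfuller(filled)
    print(adf_stat, p_value)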

Related

Pandas wrong rounding of decimals

I am calculating the duration of data acquisition from some sensors. Although the data is collected faster, I would like to sample it at 10 Hz. Anyway, I created a dataframe with a column called 'Time_diff', which I expect to go [0.0, 0.1, 0.2, 0.3, ...]. However, it somehow goes [0.0, 0.1, 0.2, 0.30000004, ...]. I am rounding the data frame, but I still get these weird decimals. Are there any suggestions on how to fix it?
The code:
for i in range(self.n_of_trials):
    start = np.zeros(0)
    stop = np.zeros(0)
    for df in self.trials[i].df_list:
        start = np.append(stop, df['Time'].iloc[0])
        stop = np.append(start, df['Time'].iloc[-1])
    t_start = start.min()
    t_stop = stop.max()
    self.trials[i].duration = t_stop - t_start
    t = np.arange(0, self.trials[i].duration + self.trials[i].dt, self.trials[i].dt)
    self.trials[i].df_merged['Time_diff'] = t
    self.trials[i].df_merged.round(1)
when I print the data it looks like this:
0       0.0
1       0.1
2       0.2
3       0.3
4       0.4
       ...
732    73.2
733    73.3
734    73.4
735    73.5
736    73.6
Name: Time_diff, Length: 737, dtype: float64
However, when I open it as a CSV file, the long floating point values show up (screenshot omitted).
Addition: I think the problem is not the CSV conversion but how the float data is converted/rounded. Here is the next part of the code, where I merge more dataframes on the 10 Hz time stamps:
for j in range(len(self.trials[i].df_list)):
    df = self.trials[i].df_list[j]
    df.insert(0, 'Time_diff', round(df['Time'] - t_start, 1))
    df.round({'Time_diff': 1})
    df.drop_duplicates(subset=['Time_diff'], keep='first', inplace=True)
    self.trials[i].df_merged = pd.merge(self.trials[i].df_merged, df, how="outer", on="Time_diff",
                                        suffixes=(None, '_' + self.trials[i].df_list_names[j]))

# Test csv
self.trials[2].df_merged.to_csv(path_or_buf='merged.csv')
And since the inserted dataframes have the exactly rounded decimals, the time stamps do not match during the merge, so extra rows with a new index are created.
This is not a rounding problem; it is behavior intrinsic to how floating point numbers work. Actually, 0.30000000000000004 is the result of 0.1 + 0.1 + 0.1 (try it yourself at a Python prompt).
In practice, not every decimal number is exactly representable as a floating point number, so what you get is instead the closest possible value.
You have some options, depending on whether you just want to improve the visualization or you need to work on exact values. If, for example, you want to use that column for a merge, you can use an approximate comparison instead of an exact one, as in the sketch below.
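One way to do such an approximate join in pandas is pd.merge_asof, which matches on the nearest key within a tolerance. The frames, column names and the 0.05 s tolerance below are illustrative, not taken from the question; both frames must be sorted on the key.

import pandas as pd

left = pd.DataFrame({'Time_diff': [0.0, 0.1, 0.2, 0.30000000000000004]})
right = pd.DataFrame({'Time_diff': [0.0, 0.1, 0.2, 0.3], 'sensor': [10, 11, 12, 13]})

# match each left time stamp to the nearest right time stamp within 50 ms
merged = pd.merge_asof(left, right, on='Time_diff',
                       direction='nearest', tolerance=0.05)
print(merged)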
Another option is to use the decimal module (https://docs.python.org/3/library/decimal.html), which works with exact arithmetic but can be slower.
In your case you said the column represents time stamps at a 10 Hz rate, so I think changing the representation so that you directly use 10, 20, 30, ... will allow you to use integers instead of floats.
If you want to see the "true" value of a floating point number in Python, you can use format(0.1 * 6, '.30f'), which prints the number with 30 digits (still an approximation, but much better than the default).
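A small sketch of both points: the float artefact is visible directly, and deriving the time stamps from an integer counter avoids accumulating it. The column name 'tick' and the sample count are illustrative assumptions.

import numpy as np
import pandas as pd

# the float artefact itself
print(0.1 + 0.1 + 0.1)            # 0.30000000000000004
print(format(0.1 * 6, '.30f'))    # the value actually stored for 0.6

# keep the merge key as an integer count of tenths of a second and
# derive the seconds column from it only for display
n_samples = 737                                            # illustrative length
df_merged = pd.DataFrame({'tick': np.arange(n_samples)})   # 0, 1, 2, ... (int)
df_merged['Time_diff'] = df_merged['tick'] / 10.0          # 0.0, 0.1, ..., 73.6
print(df_merged.tail())

Merging the per-sensor frames on the integer tick column sidesteps float equality entirely.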

Not able to understand the plotting of 2-Dimensional graph in python matplotlib

The data consists of three rows: the first row contains a student's marks in Exam 1, the second row the marks in Exam 2, and the third row a 0 or 1 indicating whether the student enters a particular university.
Here is the code given for plotting the graph which I am not able to understand.
# Find Indices of Positive and Negative Examples
pos = y == 1
neg = y == 0
# Plot Examples
pyplot.plot(X[pos, 0], X[pos, 1], 'k*', lw=2, ms=10)
pyplot.plot(X[neg, 0], X[neg, 1], 'ko', mfc='y', ms=8, mec='k', mew=1)
The output is the resulting plot (image not included here).
Any help in explaining the code is appreciated.
This code puts two different sets of data together into one plot. Both calls use matplotlib's plot function, as described in the matplotlib documentation.
The first plot shows only the positive examples, drawn as stars.
X[pos, 0] is the x-axis data (the first column, positive examples only) and X[pos, 1] is the y-axis data (the second column, positive examples only).
The rest of the arguments: 'k*' means black star markers ('k' is the color black, '*' the star shape), lw stands for "linewidth" and ms for "markersize", i.e. how big each star is.
The second plot is the same, only now for the negative examples, drawn as circles. The first two arguments are the same, only with the negative examples. 'ko' means black circle markers (hence the 'o'); mfc (marker face color), mec (marker edge color) and mew (marker edge width) control how each marker is filled and outlined.
Let's understand it by example.
Here y must be an array that stores 0 and 1 values, for example:
[0. 0. 0. 1.]
So when you write the code below
pos = y == 1
neg = y == 0
an element-wise comparison happens, so:
- wherever the entry has value 1, it is marked as True
- wherever the entry has value 0, it is marked as False
Hence you will get arrays like these:
pos = [False False False  True]
neg = [ True  True  True False]
Hence, in the plotting code,
X[pos, 0] gives the first-column value of the 4th row of X, because only the 4th entry of pos is True.
X[neg, 0] gives three values, because the first three entries of neg are True.
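A self-contained sketch of the same idea with made-up exam data (the values in X and y below are illustrative, not the course's actual dataset):

import numpy as np
from matplotlib import pyplot

# toy data: columns are Exam 1 and Exam 2 scores, y is the admission label
X = np.array([[34.6, 78.0],
              [30.3, 43.9],
              [35.8, 72.9],
              [60.2, 86.3],
              [79.0, 75.3]])
y = np.array([0, 0, 0, 1, 1])

pos = y == 1   # boolean mask: True where the label is 1
neg = y == 0   # boolean mask: True where the label is 0

# admitted students as black stars, the rest as yellow-filled circles
pyplot.plot(X[pos, 0], X[pos, 1], 'k*', lw=2, ms=10)
pyplot.plot(X[neg, 0], X[neg, 1], 'ko', mfc='y', ms=8, mec='k', mew=1)
pyplot.xlabel('Exam 1 score')
pyplot.ylabel('Exam 2 score')
pyplot.show()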

Understanding pandas.DataFrame.corrwith method for spearman rank correlation calculation column-wise and row-wise

I have two dataframes like so:
preds_df = pd.DataFrame.from_records([[0.8224], [0.7982]])
tgts_df = pd.DataFrame.from_records([[0.8889], [1.0000]])
and want to compute spearman rank correlation values both across columns and across rows:
col_wise = preds_df.corrwith(tgts_df,method='spearman',axis=0).values.tolist()
row_wise = preds_df.corrwith(tgts_df,method='spearman',axis=1).values.tolist()
Printing those values gives:
print(col_wise)
[-0.9999999999999999]
print(row_wise)
[nan, nan]
Question 1: col_wise produced some result, but how come row_wise produces nan for each row, given that each row contains exactly one value and the value obtained for col_wise is not nan?
If I further extend these datasets (keep the original column but add two more columns) such that
preds_df = pd.DataFrame.from_records ([[0.8224, 0.5371, 0.1009], [0.7982, 0.5890, 0.0962]])
tgts_df = pd.DataFrame.from_records ([[0.8889, 0.5556, 0.0000], [1.0000, 0.7778, 0.0000]])
the values obtained are:
col_wise = preds_df.corrwith(tgts_df,method='spearman',axis=0).values.tolist()
print(col_wise)
[-0.9999999999999999, 0.9999999999999999, nan]
row_wise = preds_df.corrwith(tgts_df,method='spearman',axis=1).values.tolist()
print(row_wise)
[1.0, 1.0]
Question 2: Why doesn't row_wise contain nan even though one of the columns making up each row (the third one) produced nan in col_wise?
Question 3: In general, why are nan values obtained? My input dataframes all have real numbers in them.
Question 1:
Note that when you calculate the Spearman correlation coefficient row-wise, you get two one-element samples from the frames: (0.8224, 0.8889) corresponding to the first element in the list of coefficients and (0.7982, 1.0000) corresponding to the other. Now look at the formula for the coefficient, rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)). Because you have only one observation in each sample (n = 1), the denominator equals zero, and that is why you get a NaN value.
Question 2 and 3:
The above issue does not apply to your second example, but the last column of tgts_df is constant (both observations equal 0.0), which results in so-called tied ranks and zero rank variance. There are generally three situations in which you will get NaN values:
1. The samples contain only one element each.
2. A sample is constant, i.e. all of its observations tie, so the rank variance is zero (here, the last column of tgts_df).
3. The two dataframe objects do not have the same shape.
If you have any further issues/questions, feel free to ask a question on CrossValidated.
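A quick way to see both effects is to compute the coefficient for a single pair and for the constant column directly; a small sketch using the frames from the question:

import pandas as pd

preds_df = pd.DataFrame.from_records([[0.8224, 0.5371, 0.1009], [0.7982, 0.5890, 0.0962]])
tgts_df = pd.DataFrame.from_records([[0.8889, 0.5556, 0.0000], [1.0000, 0.7778, 0.0000]])

# one-element samples: the denominator of the formula is zero -> nan
print(pd.Series([0.8224]).corr(pd.Series([0.8889]), method='spearman'))   # nan

# constant column: all ranks tie, the rank variance is zero -> nan
print(tgts_df[2].corr(preds_df[2], method='spearman'))                    # nan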

How to perform matching between two sequences?

I have two mini-batches of sequences:
a = C.sequence.input_variable((10))
b = C.sequence.input_variable((10))
Both a and b have variable-length sequences.
I want to do matching between them, where matching is defined as: match (e.g. via a dot product) the token at each time step of a with the token at every time step of b.
How can I do this?
I have mostly answered this on GitHub, but to be consistent with SO rules I am including a response here. In the case of something simple like a dot product, you can take advantage of the fact that it factorizes nicely, so the following code works:
axisa = C.Axis.new_unique_dynamic_axis('a')
axisb = C.Axis.new_unique_dynamic_axis('b')
a = C.sequence.input_variable(1, sequence_axis=axisa)
b = C.sequence.input_variable(1, sequence_axis=axisb)
c = C.sequence.broadcast_as(C.sequence.reduce_sum(a), b) * b
c.eval({a: [[1, 2, 3],[4, 5]], b: [[6, 7], [8]]})
[array([[ 36.],
[ 42.]], dtype=float32), array([[ 72.]], dtype=float32)]
In the general case you need the following steps
static_b, mask = C.sequence.unpack(b, neutral_value).outputs
scores = your_score(a, static_b)
The first line converts the b sequence into a static tensor with one more axis than b. Because of packing, some elements of this tensor will be invalid, and those are indicated by the mask. The neutral_value is placed as a dummy value in the static_b tensor wherever data was missing.

Depending on your score, you might be able to arrange for the neutral_value not to affect the final score (e.g. if your score is a dot product, 0 would be a good choice; if it involves a softmax, -infinity or something close to it would be a good choice).

The second line can now access each element of a and all the elements of b along the first axis of static_b. For a dot product, static_b is a matrix and one element of a is a vector, so a matrix-vector multiplication results in a sequence whose elements are all inner products between the corresponding element of a and all elements of b.
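As a framework-agnostic illustration of what this all-pairs matching computes, here is a small NumPy sketch (not CNTK code): for one pair of sequences it builds the full matrix of dot-product scores between every time step of a and every time step of b. The sequence lengths and token dimension are made up.

import numpy as np

# one sequence pair: a has 3 time steps, b has 2, both with 10-dimensional tokens
rng = np.random.default_rng(0)
a = rng.normal(size=(3, 10))
b = rng.normal(size=(2, 10))

# scores[i, j] = dot product of time step i of a with time step j of b
scores = a @ b.T
print(scores.shape)   # (3, 2)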

efficiently setting values on a subset of rows

I am wondering about the best way to change values in a subset of rows in a dataframe.
Let's say I want to double the values in column value in the rows where selected is True.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'value': [1, 2, 3, 4], 'selected': [False, False, True, True]})
In [3]: df
Out[3]:
   selected  value
0     False      1
1     False      2
2      True      3
3      True      4
There are several ways to do this:
# 1. Subsetting with .loc on left and right hand side:
df.loc[df['selected'], 'value'] = df.loc[df['selected'], 'value'] * 2
# 2. Subsetting with .loc on left hand side:
df.loc[df['selected'], 'value'] = df['value'] * 2
# 3. Using where()
df['value'] = (df['value'] * 2).where(df['selected'], df['value'])
If I only subset on the left hand side (option 2), would Pandas actually make the calculation for all rows and then discard the result for all but the selected rows?
In terms of evaluation, is there any difference between using loc and where?
Your #2 option is the most standard and recommended way to do this. Your #1 option is fine also, but the extra code is unnecessary because ix/loc/iloc are designed to pass the boolean selection through and do the necessary alignment to make sure it applies only to your desired subset.
# 2. Subsetting with .loc on left hand side:
df.loc[df['selected'], 'value'] = df['value'] * 2
If you don't use ix/loc/iloc on the left hand side, problems can arise that we don't want to get into in a simple answer. Hence, using ix/loc/iloc is generally the safest and most recommended way to go. There is nothing wrong with your option #3, but it is the least readable of the three.
One faster and acceptable alternative you should know about is numpy's where() function:
df['value'] = np.where(df['selected'], df['value'] * 2, df['value'])
The first argument is the selection or mask, the second is the value to assign where it is True, and the third is the value to assign where it is False. It's especially useful if you also want to create or change the value when the selection is False.
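A short end-to-end check of the two recommended forms on the example frame (a minimal sketch; the copies are only there so both variants start from the same data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4], 'selected': [False, False, True, True]})

# option 2: assign through .loc; only the selected rows are written
df_loc = df.copy()
df_loc.loc[df_loc['selected'], 'value'] = df_loc['value'] * 2

# np.where: build the whole column in one vectorized pass
df_np = df.copy()
df_np['value'] = np.where(df_np['selected'], df_np['value'] * 2, df_np['value'])

print(df_loc['value'].tolist())   # [1, 2, 6, 8]
print(df_np['value'].tolist())    # [1, 2, 6, 8]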