Prevent pandas.interpolate() from extrapolating

I'm having difficulty preventing pd.DataFrame.interpolate(method='index') from extrapolating.
Specifically:
>>> df = pd.DataFrame({1: range(1, 5), 2: range(2, 6), 3 : range(3, 7)}, index = [1, 2, 3, 4])
>>> df = df.reindex(range(6)).reindex(range(5), axis=1)
>>> df.iloc[3, 2] = np.nan
>>> df
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN 1.0 2.0 3.0 NaN
2 NaN 2.0 3.0 4.0 NaN
3 NaN 3.0 NaN 5.0 NaN
4 NaN 4.0 5.0 6.0 NaN
5 NaN NaN NaN NaN NaN
So df is just a block of data surrounded by NaN, with an interior missing point at iloc[3, 2]. Now when I apply .interpolate() (along either the horizontal or vertical axis), my goal is to have ONLY that interior point filled, leaving the surrounding NaNs untouched. But somehow I'm not able to get it to work.
I tried:
>>> df.interpolate(method='index', axis=0, limit_area='inside')
0 1 2 3 4
0 NaN NaN NaN NaN NaN
1 NaN 1.0 2.0 3.0 NaN
2 NaN 2.0 3.0 4.0 NaN
3 NaN 3.0 4.0 5.0 NaN
4 NaN 4.0 5.0 6.0 NaN
5 NaN 4.0 5.0 6.0 NaN
Note the last row got filled, which is undesirable. (btw, I'd think the fill value should be linear extrapolation based on index, but it is just padding the last value, which is highly undesirable.)
I also tried combinations of limit and limit_direction, to no avail.
What would be the correct argument setting to get the desired result? Hopefully without some contorted masking (but that would work too). Thx.

OK, it turns out I was running this on pandas 0.21, where the limit_area argument fails silently. It looks like this is fixed from 0.24 onward. Case closed.
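For readers who want to check this on their own installation, here is a minimal sketch using the frame from the question. It assumes a pandas release in which `limit_area` is supported. The `ffill`/`bfill` mask is a fallback that was not part of the original post, and the names `filled`, `inside`, and `masked` are only illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as in the question: a block of data surrounded by NaN,
# with one interior gap at iloc[3, 2].
df = pd.DataFrame({1: range(1, 5), 2: range(2, 6), 3: range(3, 7)}, index=[1, 2, 3, 4])
df = df.reindex(range(6)).reindex(range(5), axis=1)
df.iloc[3, 2] = np.nan

# limit_area='inside' only fills NaNs that have valid values on both sides.
filled = df.interpolate(method="index", axis=0, limit_area="inside")

# Fallback (assumption, for versions without limit_area): interpolate freely,
# then keep only cells that have a valid value both above and below.
inside = df.ffill().notna() & df.bfill().notna()
masked = df.interpolate(method="index", axis=0).where(inside)

print(filled)
```

Both approaches should fill only the interior gap and leave the surrounding rows and columns as NaN.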

Related

How do I append an uneven column to an existing one?

I am having trouble appending later values from column C to column A within the same df using pandas. I have tried .append and .concat with ignore_index=True, still not working.
import pandas as pd
d = {'a':[1,2,3,None, None], 'b':[7,8,9, None, None], 'c':[None, None, None, 5, 6]}
df = pd.DataFrame(d)
df['a'] = df['a'].append(df['c'], ignore_index=True)
print(df)
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 NaN NaN 5.0
4 NaN NaN 6.0
Desired:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
Thank you for updating that, this is what I would do:
df['a'] = df['a'].fillna(df['c'])
print(df)
Output:
a b c
0 1.0 7.0 NaN
1 2.0 8.0 NaN
2 3.0 9.0 NaN
3 5.0 NaN 5.0
4 6.0 NaN 6.0
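The reason the append attempt changes nothing is index alignment: the appended Series has ten rows labelled 0 to 9, and assigning it back to df['a'] keeps only labels 0 to 4, which are the original values of column a. (Series.append was also removed in pandas 2.0.) Below is a small sketch of the fillna approach, with combine_first shown as an equivalent alternative for this case; the names `a_filled` and `a_alt` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, None, None],
    "b": [7, 8, 9, None, None],
    "c": [None, None, None, 5, 6],
})

# fillna aligns on the index: each NaN in "a" takes the value of "c" in the same row.
a_filled = df["a"].fillna(df["c"])

# combine_first gives the same result here: values from "a" where present, else from "c".
a_alt = df["a"].combine_first(df["c"])

df["a"] = a_filled
print(df)
```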

How to perform a rolling window on a pandas DataFrame where each row contains NaN values that should not be replaced?

I have the following dataframe:
df = pd.DataFrame([[0, 1, 2, 4, np.nan, np.nan, np.nan,1],
[0, 1, 2 ,np.nan, np.nan, np.nan,np.nan,1],
[0, 2, 2 ,np.nan, 2, np.nan,1,1]])
With output:
0 1 2 3 4 5 6 7
0 0 1 2 4 NaN NaN NaN 1
1 0 1 2 NaN NaN NaN NaN 1
2 0 2 2 NaN 2 NaN 1 1
with dtypes:
df.dtypes
0 int64
1 int64
2 int64
3 float64
4 float64
5 float64
6 float64
7 int64
Then the following rolling sum is applied:
df.rolling(window = 7, min_periods =1, axis = 'columns').sum()
And the output is as follows:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 4.0 4.0 4.0 4.0 4.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 2.0 2.0 3.0 5.0
I notice that the rolling window stops and starts again whenever the dtype of the next column is different.
However, in my real DataFrame all columns have the same object dtype:
df = df.astype('object')
for which the same rolling sum gives:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 7.0 7.0 7.0 8.0
1 0.0 1.0 3.0 3.0 3.0 3.0 3.0 4.0
2 0.0 2.0 4.0 4.0 6.0 6.0 7.0 8.0
My desired output however, stops and starts again after a nan value appears. This would look like:
0 1 2 3 4 5 6 7
0 0.0 1.0 3.0 7.0 NaN NaN NaN 8.0
1 0.0 1.0 3.0 NaN NaN NaN NaN 4.0
2 0.0 2.0 4.0 NaN 6.0 NaN 7.0 8.0
I figured there must be a way that NaN values are not considered but also not filled in with values obtained from the rolling window.
Anything would help!
A workaround is:
First, record where the NaN values are located:
nan = df.isnull()
Then apply the rolling window:
df = df.rolling(window = 7, min_periods =1, axis = 'columns').sum()
Finally, show only the values at positions that were not NaN originally:
df[~nan]
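The workaround above can be sketched end to end as follows. Two things here are assumptions rather than part of the original answer: the frame is built with a float dtype so that all columns are handled together, and the rolling sum is taken on the transposed frame instead of passing axis='columns', because the axis argument of rolling is deprecated in recent pandas releases. The names `nan_mask`, `rolled`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 2, 4, np.nan, np.nan, np.nan, 1],
     [0, 1, 2, np.nan, np.nan, np.nan, np.nan, 1],
     [0, 2, 2, np.nan, 2, np.nan, 1, 1]],
    dtype=float,
)

# Remember where the input is missing before anything is summed.
nan_mask = df.isna()

# Rolling sum along each row: transpose, roll down the rows, transpose back.
rolled = df.T.rolling(window=7, min_periods=1).sum().T

# Put NaN back wherever the input was NaN; other cells keep the rolling sum.
result = rolled.mask(nan_mask)
print(result)
```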

For every row in pandas, do until sample ID change

How can I iterate over rows in a dataframe until the sample ID changes?
my_df:
ID loc_start
sample1 10
sample1 15
sample2 10
sample2 20
sample3 5
Something like:
samples = ["sample1", "sample2", "sample3"]
out = pd.DataFrame()
for sample in samples:
    if my_df["ID"] == sample:
        my_list = []
        for index, row in my_df.iterrows():
            other_list = [row.loc_start]
            my_list.append(other_list)
        my_list = pd.DataFrame(my_list)
        out = pd.merge(out, my_list)
Expected output:
sample1 sample2 sample3
10 10 5
15 20
I realize, of course, that this would be easier if my_df already looked like this. However, what I'm after is the principle of iterating over rows until a certain column value changes.
Based on the input & output provided, this should work.
You need to provide more info if you are looking for something else.
df.pivot(columns='ID', values = 'loc_start').rename_axis(None, axis=1).apply(lambda x: pd.Series(x.dropna().values))
output
sample1 sample2 sample3
0 10.0 10.0 5.0
1 15.0 20.0 NaN
Ben.T is correct that a pivot works here. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 5, (10, 2)), columns=list("AB"))
# what does the df look like? Here, I consider column A to be analogous to your "ID" column
In [5]: df
Out[5]:
A B
0 3 1
1 2 1
2 4 2
3 4 1
4 0 4
5 4 2
6 4 1
7 3 1
8 1 1
9 4 0
# now do a pivot and see what it looks like
df2 = df.pivot(columns="A", values="B")
In [8]: df2
Out[8]:
A 0 1 2 3 4
0 NaN NaN NaN 1.0 NaN
1 NaN NaN 1.0 NaN NaN
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 4.0 NaN NaN NaN NaN
5 NaN NaN NaN NaN 2.0
6 NaN NaN NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN
8 NaN 1.0 NaN NaN NaN
9 NaN NaN NaN NaN 0.0
Not quite what you wanted. With a little help from Jezreal's answer
df3 = df2.apply(lambda x: pd.Series(x.dropna().values))
In [20]: df3
Out[20]:
A 0 1 2 3 4
0 4.0 1.0 1.0 1.0 2.0
1 NaN NaN NaN 1.0 1.0
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 NaN NaN NaN NaN 0.0
The empty spots in the dataframe have to be filled with something, and NaN is used by default. Is this what you wanted?
If, on the other hand, you wanted to perform an operation on your data you would use the groupby instead.
df2 = df.groupby(by="A", as_index=False).mean()
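As for the underlying principle of "process rows until the ID changes", groupby already does that bookkeeping. The sketch below uses the sample data from the question; the column name `row` and the variables `per_sample`, `run_id`, and `wide` are hypothetical names introduced only for illustration.

```python
import pandas as pd

my_df = pd.DataFrame({
    "ID": ["sample1", "sample1", "sample2", "sample2", "sample3"],
    "loc_start": [10, 15, 10, 20, 5],
})

# Iterate one sample at a time; each group holds all rows for one ID.
per_sample = {
    sample: group["loc_start"].tolist()
    for sample, group in my_df.groupby("ID", sort=False)
}

# If an ID can reappear later and each consecutive run should be kept apart,
# label the runs instead: the counter increases whenever the ID changes.
run_id = my_df["ID"].ne(my_df["ID"].shift()).cumsum()

# Number the rows within each sample, then pivot on that counter
# to get one column per sample without the apply/dropna step.
wide = (
    my_df.assign(row=my_df.groupby("ID").cumcount())
         .pivot(index="row", columns="ID", values="loc_start")
)
print(wide)
```

Shorter samples are padded with NaN, as in the pivot answers above.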

Compare 2 columns and replace to None if found equal

The following command replaces every value in the matching rows with NaN.
ndf.iloc[np.where(ndf.path3=='sys_bck_20190101.tar.gz')] = np.nan
What I really need to do is to replace the value of a single column called path4 if it matches with column path3. This does not work:
ndf.iloc[np.where(ndf.path3==ndf.path4), ndf.path3] = np.nan
Update:
There is a pandas method "fillna" that can be used with axis = 'columns'.
Is there a similar method to write "NA" values to the duplicate columns?
I can do this, but it does not look Pythonic.
ndf.loc[ndf.path1==ndf.path2, 'path1'] = np.nan
ndf.loc[ndf.path2==ndf.path3, 'path2'] = np.nan
ndf.loc[ndf.path3==ndf.path4, 'path3'] = np.nan
ndf.loc[ndf.path4==ndf.filename, 'path4'] = np.nan
Update 2
Let me explain the issue:
Assuming this dataframe:
ndf = pd.DataFrame({
'path1':[4,5,4,5,5,4],
'path2':[4,5,4,5,5,4],
'path3':list('abcdef'),
'path4':list('aaabef'),
'col':list('aaabef')
})
The expected results :
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
As you can see, this is the reverse of fillna, and I guess there is no easy way to do this in pandas. I have already mentioned the commands I can use. I would like to know if there is a better way to achieve this.
Use:
for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
    ndf.loc[ndf[c1]==ndf[c2], c1] = np.nan
print (ndf)
path1 path2 path3 path4 col
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
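The same loop can be wrapped in a small helper that leaves the input untouched. This is only a sketch: `blank_if_equals_next` is a hypothetical name, and it uses Series.mask instead of assigning NaN through .loc, on the assumption that this avoids dtype warnings when an integer column has to hold NaN in newer pandas releases.

```python
import pandas as pd

def blank_if_equals_next(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where a cell is NaN if it equals the cell in the next column."""
    out = frame.copy()
    for c1, c2 in zip(frame.columns, frame.columns[1:]):
        # Compare against the ORIGINAL next column, so earlier blanking
        # never influences later comparisons.
        out[c1] = frame[c1].mask(frame[c1] == frame[c2])
    return out

ndf = pd.DataFrame({
    "path1": [4, 5, 4, 5, 5, 4],
    "path2": [4, 5, 4, 5, 5, 4],
    "path3": list("abcdef"),
    "path4": list("aaabef"),
    "col": list("aaabef"),
})

result = blank_if_equals_next(ndf)
print(result)
```

The last column has no right-hand neighbour, so it is never blanked.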

Using pandas reindex with floats: interpolation

Can you explain this bizarre behaviour?
df=pd.DataFrame({'year':[1986,1987,1988],'bomb':arange(3)}).set_index('year')
In [9]: df.reindex(arange(1986,1988.125,.125))
Out[9]:
bomb
1986.000 0
1986.125 NaN
1986.250 NaN
1986.375 NaN
1986.500 NaN
1986.625 NaN
1986.750 NaN
1986.875 NaN
1987.000 1
1987.125 NaN
1987.250 NaN
1987.375 NaN
1987.500 NaN
1987.625 NaN
1987.750 NaN
1987.875 NaN
1988.000 2
In [10]: df.reindex(arange(1986,1988.1,.1))
Out[10]:
bomb
1986.0 0
1986.1 NaN
1986.2 NaN
1986.3 NaN
1986.4 NaN
1986.5 NaN
1986.6 NaN
1986.7 NaN
1986.8 NaN
1986.9 NaN
1987.0 NaN
1987.1 NaN
1987.2 NaN
1987.3 NaN
1987.4 NaN
1987.5 NaN
1987.6 NaN
1987.7 NaN
1987.8 NaN
1987.9 NaN
1988.0 NaN
When the increment is anything other than .125, I find that the new index values do not "find" the old rows that have matching values. ie there is a precision problem that is not being overcome. This is true even if I force the index to be a float before I try to interpolate. What is going on and/or what is the right way to do this?
I've been able to get it to work with increment of 0.1 by using
reindex( np.round(arange(1985,2010+dt,dt)*10)/10.0 )
By the way, I'm doing this as the first step in linearly interpolating a number of columns (e.g. "bomb" is one of them). If there's a nicer way to do that, I'd happily be set straight.
You are getting what you asked for. The reindex method only conforms the data to the new index that you provide; it does not interpolate. As mentioned in the comments, you are probably looking for dates in the index. I guess you were expecting the reindex method to do this as well (interpolation):
df2 =df.reindex(arange(1986,1988.125,.125))
pd.Series.interpolate(df2['bomb'])
1986.000 0.000
1986.125 0.125
1986.250 0.250
1986.375 0.375
1986.500 0.500
1986.625 0.625
1986.750 0.750
1986.875 0.875
1987.000 1.000
1987.125 1.125
1987.250 1.250
1987.375 1.375
1987.500 1.500
1987.625 1.625
1987.750 1.750
1987.875 1.875
1988.000 2.000
Name: bomb
The inconsistency in your second example is probably due to floating-point precision. A step of 0.125 equals 1/8, which can be represented exactly in binary. A step of 0.1 cannot, so the generated value for 1987 is probably off by a tiny fraction:
1987.0 == 1987.0000000001
False
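To make the precision point concrete, the sketch below rounds the generated grid to one decimal place so that the whole years become exact floats again, then reindexes and interpolates on the index values. The rounding step and the names `naive`, `grid`, and `interp` are assumptions for illustration, not part of the original answer, and the float index on df is chosen here to keep the label comparison simple.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"bomb": [0, 1, 2]},
    index=pd.Index([1986.0, 1987.0, 1988.0], name="year"),
)

# A 0.1 step accumulates rounding error, so some whole years may not be hit exactly.
naive = np.arange(1986, 1988.1, 0.1)
print("exact 1987.0 present in naive grid:", bool((naive == 1987.0).any()))

# Rounding to one decimal snaps every grid point to the nearest float for that
# decimal value, so 1987.0 and 1988.0 match the existing labels again.
grid = np.round(naive, 1)

# Reindex onto the grid, then interpolate linearly in terms of the index values.
interp = df.reindex(grid).interpolate(method="index")
print(interp)
```

Building the grid from integers, for example 1986 + np.arange(21) / 10, is another way to avoid the accumulated error.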
I think you are better off doing something like this by using PeriodIndex
In [39]: df=pd.DataFrame({'bomb':np.arange(3)})
In [40]: df
Out[40]:
bomb
0 0
1 1
2 2
In [41]: df.index = pd.period_range('1986','1988',freq='Y').asfreq('M')
In [42]: df
Out[42]:
bomb
1986-12 0
1987-12 1
1988-12 2
In [43]: df = df.reindex(pd.period_range('1986','1988',freq='M'))
In [44]: df
Out[44]:
bomb
1986-01 NaN
1986-02 NaN
1986-03 NaN
1986-04 NaN
1986-05 NaN
1986-06 NaN
1986-07 NaN
1986-08 NaN
1986-09 NaN
1986-10 NaN
1986-11 NaN
1986-12 0
1987-01 NaN
1987-02 NaN
1987-03 NaN
1987-04 NaN
1987-05 NaN
1987-06 NaN
1987-07 NaN
1987-08 NaN
1987-09 NaN
1987-10 NaN
1987-11 NaN
1987-12 1
1988-01 NaN
In [45]: df.iloc[0,0] = -1
In [46]: df['interp'] = df['bomb'].interpolate()
In [47]: df
Out[47]:
bomb interp
1986-01 -1 -1.000000
1986-02 NaN -0.909091
1986-03 NaN -0.818182
1986-04 NaN -0.727273
1986-05 NaN -0.636364
1986-06 NaN -0.545455
1986-07 NaN -0.454545
1986-08 NaN -0.363636
1986-09 NaN -0.272727
1986-10 NaN -0.181818
1986-11 NaN -0.090909
1986-12 0 0.000000
1987-01 NaN 0.083333
1987-02 NaN 0.166667
1987-03 NaN 0.250000
1987-04 NaN 0.333333
1987-05 NaN 0.416667
1987-06 NaN 0.500000
1987-07 NaN 0.583333
1987-08 NaN 0.666667
1987-09 NaN 0.750000
1987-10 NaN 0.833333
1987-11 NaN 0.916667
1987-12 1 1.000000
1988-01 NaN 1.000000