Imputing NaN values by pandas forward fill method with set pattern - pandas

Suppose I am working on a dataset that has a column named "F_N" containing numeric values in a sequence like 10, 20, 30, nan, 50, nan, 70. I want these null places to be filled with 40 and 60 respectively, with pandas' help. I am aware of fillna(method='ffill'), but it fills in the exact previous values (30 and 50); it does not continue the pattern.
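To show what I mean, a minimal example of the ffill behaviour:
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, np.nan, 50, np.nan, 70])
print(s.ffill().tolist())
# [10.0, 20.0, 30.0, 30.0, 50.0, 50.0, 70.0] -- 30 and 50 just repeat; no 40 or 60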

Use linear interpolation with interpolate:
df['F_N'] = df['F_N'].interpolate()
>>> df
F_N
0 10.0
1 20.0
2 30.0
3 40.0
4 50.0
5 60.0
6 70.0
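One caveat worth noting: by default interpolate() only fills NaNs that lie between two valid values, so a leading NaN stays NaN. If I recall the API correctly, passing limit_direction='both' also fills the edges; a small sketch:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 20, np.nan, 40])
print(s.interpolate().tolist())                        # [nan, 20.0, 30.0, 40.0] -- leading NaN survives
print(s.interpolate(limit_direction='both').tolist())  # [20.0, 20.0, 30.0, 40.0]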

You describe a sequence with missing values. fillna() can take a Series, so the simplest approach is to fill with the expected values. The code below demonstrates this:
import pandas as pd
import numpy as np
df = pd.DataFrame({"F_N": range(0, 101, 10)})
df.loc[np.random.choice(df.index, 5)] = np.nan  # knock out up to 5 random rows
df["fill"] = df["F_N"].fillna(pd.Series(range(0, 101, 10)))
Output:
      F_N  fill
0     NaN     0
1    10.0    10
2    20.0    20
3    30.0    30
4     NaN    40
5     NaN    50
6    60.0    60
7    70.0    70
8     NaN    80
9    90.0    90
10  100.0   100
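This works because fillna(other) aligns on the index: wherever F_N is NaN, the value with the same index label is taken from other, and everything else is ignored. A minimal sketch of the alignment:
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan])
expected = pd.Series([999, 20, 30])  # index labels 0, 1, 2
print(s.fillna(expected).tolist())   # [10.0, 20.0, 30.0] -- 999 is ignored, position 0 is not NaN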

Related

Filling values based on column name

I have this simple data frame:
import numpy as np
import pandas as pd
data = {'Name':['Karan','Rohit','Sahil','Aryan'],'Age':[23,22,21,23]}
df = pd.DataFrame(data)
I would like to create new columns based on the values of column Age, and insert 1 where the column name matches the value in column Age, like this:
Name Age 21 22 23
0 Karan 23 None None 1
1 Rohit 22 None 1 None
2 Sahil 21 1 None None
3 Aryan 23 None None 1
I have tried:
def data_categorical_check(df, column_cat):
    unique_val = np.unique(np.array(df.iloc[:, [column_cat]]))
    x = None
    for i in range(len(unique_val)):
        x = str(unique_val[i])
        df[x] = None
        df[x] = [int(i == unique_val[i]) for i in df["age"]]
    return df
This creates the columns OK, but I am not able to correctly insert the values. I am looking for a general solution, where the column to check is given by the argument column_cat.
Simple: encode the values using get_dummies, then mask the zeros and join back with the original dataframe:
s = pd.get_dummies(df['Age'])
df.join(s[s != 0])
Name Age 21 22 23
0 Karan 23 NaN NaN 1.0
1 Rohit 22 NaN 1.0 NaN
2 Sahil 21 1.0 NaN NaN
3 Aryan 23 NaN NaN 1.0
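Note: in recent pandas versions (2.0+) get_dummies returns boolean columns by default, so the mask would leave True/NaN rather than 1.0/NaN. To reproduce the output above, pass dtype=int (a sketch, using the same df):
s = pd.get_dummies(df['Age'], dtype=int)  # force 0/1 instead of True/False
df.join(s[s != 0])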
Use pd.crosstab:
>>> pd.concat([df, pd.crosstab(df.index, df.Age)], axis=1)
Name Age 21 22 23
0 Karan 23 0 0 1
1 Rohit 22 0 1 0
2 Sahil 21 1 0 0
3 Aryan 23 0 0 1
# OR
>>> pd.concat([df, pd.crosstab(df.index, df.Age).mask(lambda x: x==0)], axis=1)
Name Age 21 22 23
0 Karan 23 NaN NaN 1.0
1 Rohit 22 NaN 1.0 NaN
2 Sahil 21 1.0 NaN NaN
3 Aryan 23 NaN NaN 1.0
You can do it by creating a function that returns the row with the new column created:
def data_categorical_check(row):
    row[str(row["Age"])] = 1
    return row
And apply it using the "apply" method:
df.apply(lambda x: data_categorical_check(x), axis=1)
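Note that rows which never set a given column end up with NaN in it, and the lambda wrapper is not needed: df.apply(data_categorical_check, axis=1) does the same thing.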

How to round a value to make it divisible by a specific number in Pandas?

Months_of_inventory = 2
Case_Qty = 20
df['Order'] = np.where(
    (df['Avg_Sales'] * Months_of_inventory) - df['On_Hand_Inventory'] <= 0,
    0,
    (df['Avg_Sales'] * Months_of_inventory) - df['On_Hand_Inventory'],
)
The result of (df['Avg_Sales'] * Months_of_inventory) - df['On_Hand_Inventory'] has to be rounded up to a multiple of Case_Qty, meaning the result should be divisible by 20.
I was trying to nest np.where calls to check whether the result is divisible by Case_Qty and, if not, make it divisible by rounding it up (like the CEILING function in Excel), but I could not find a way of doing it in Pandas. Thank you!
You can use numpy.ceil:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Qty': [0, 1, 19, 20, 21]})
Case_Qty = 20
df['Num_Case'] = np.ceil(df['Qty'].div(Case_Qty)).astype(int)
df['Total_space'] = np.ceil(df['Qty'].div(Case_Qty)).mul(Case_Qty)
df['Space_left'] = np.ceil(df['Qty'].div(Case_Qty)).mul(Case_Qty).sub(df['Qty'])
print(df)
Output:
Qty Num_Case Total_space Space_left
0 0 0 0.0 0.0
1 1 1 20.0 19.0
2 19 1 20.0 1.0
3 20 1 20.0 0.0
4 21 2 40.0 19.0
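To plug this back into the question's original Order computation, something like the following should work (a sketch; the sample numbers are made up, only the column names come from the question):
import numpy as np
import pandas as pd

Months_of_inventory = 2
Case_Qty = 20
df = pd.DataFrame({'Avg_Sales': [15, 40, 5],
                   'On_Hand_Inventory': [10, 30, 60]})  # hypothetical values

need = df['Avg_Sales'] * Months_of_inventory - df['On_Hand_Inventory']
# Clip negative demand to zero, then round up to the next multiple of Case_Qty.
df['Order'] = (np.ceil(need.clip(lower=0) / Case_Qty) * Case_Qty).astype(int)
print(df)
#    Avg_Sales  On_Hand_Inventory  Order
# 0         15                 10     20
# 1         40                 30     60
# 2          5                 60      0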

pandas aggregating frames by largest common column denominator and filling missing values

I have been struggling with this issue for a bit, and even though I assume there are some workarounds, I would love to know if there is an elegant way to achieve this result:
import pandas as pd
import numpy as np
data = np.array([
[1,10],
[2,12],
[4,13],
[5,14],
[8,15]])
df1 = pd.DataFrame(data=data, index=range(0,5), columns=['x','a'])
data = np.array([
[2,100,101],
[3,120,122],
[4,130,132],
[7,140,142],
[9,150,151],
[12,160,152]])
df2 = pd.DataFrame(data=data, index=range(0,6), columns=['x','b','c'])
Now I would like a data frame that concatenates those 2 and fills the missing values with the previous value, or with the first value otherwise. Both data frames can have different sizes; what we are interested in here is the common column x.
This would be my desired output frame df_result, where x is the union of the unique x values from the 2 frames:
x a b c
0 1 10 100 101
1 2 12 100 101
2 3 12 120 122
3 4 13 130 132
4 5 14 130 132
5 7 14 140 142
6 8 15 140 142
7 9 15 150 151
8 12 15 160 152
Any help or hint would be much appreciated, thank you very much
You can simply use a merge operation on the 2 dataframes; after that, apply a sort, a forward fill, and a backward fill for the null values:
df1.merge(df2,on='x',how='outer').sort_values('x').ffill().bfill()
Out:
x a b c
0 1 10.0 100.0 101.0
1 2 12.0 100.0 101.0
5 3 12.0 120.0 122.0
2 4 13.0 130.0 132.0
3 5 14.0 130.0 132.0
6 7 14.0 140.0 142.0
4 8 15.0 140.0 142.0
7 9 15.0 150.0 151.0
8 12 15.0 160.0 152.0
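One small follow-up: the NaNs introduced by the outer merge force a float upcast, so if integer dtypes matter you can cast back after filling (a sketch using the frames from the question):
out = (df1.merge(df2, on='x', how='outer')
          .sort_values('x')
          .ffill().bfill()
          .astype(int)              # safe here: every NaN has been filled
          .reset_index(drop=True))
print(out)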

Interpolate proportionally with duplicate index

I have a table like
df = pd.DataFrame([1, np.nan, 3, 1, np.nan, 3, 50, np.nan, 52], index=[7, 8, 9, 7, 12, 27, 7, 8, 9]):
index values
7 1
8 NaN
9 3
7 1
12 NaN
27 3
7 50
8 NaN
9 52
Rows are correctly sorted. However, the index here is not ordered, and it has duplicates by design.
How can I interpolate values here proportionally to the index (method="index")?
If I try to interpolate using the index, the resulting Series is messed up because of the duplicate index values:
df.interpolate(method='index'):
index values desired actual
7 1 1 1
8 NaN 2 2
9 3 3 3
7 1 1 1
12 NaN 1.5 52 <-- wat
27 3 3 3
7 50 50 50
8 NaN 51 1.1 <-- wat
9 52 52 52
If not reproducible: Pandas 0.23.3, Numpy: 1.14.5, Python: 3.6.5
Try adding a grouping of the dataframe based on the index:
df.groupby(df.index.to_series().diff().lt(0).cumsum())\
.apply(lambda x: x.interpolate(method='index'))
Output:
0
7 1.0
8 2.0
9 3.0
7 1.0
12 1.5
27 3.0
7 50.0
8 51.0
9 52.0
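To see how the grouping works: every time the index value drops below its predecessor, diff() goes negative and cumsum() starts a new group id, so each monotonic run of the index is interpolated on its own:
idx = df.index.to_series()
print(idx.diff().lt(0).cumsum().tolist())
# [0, 0, 0, 1, 1, 1, 2, 2, 2] -- one group per monotonic run of the index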
A more complicated way, if you have a situation like the one mentioned above in Scott's comment:
np.where(
    df['values'].isnull(),
    df['values'].shift()
    + (df['values'].shift(-1) - df['values'].shift())
    * (df['index'] - df['index'].shift())
    / (df['index'].shift(-1) - df['index'].shift()),
    df['values'],
)
Out[219]: array([ 1. , 2. , 3. , 1. , 1.5, 3. , 50. , 51. , 52. ])
This computes the distance of each null value from the two surrounding valid values, and fills it proportionally to the index difference.
Tolerance: it only handles a single missing value between two valid values.

Pandas rolling function with specific numeric span?

As of Pandas 0.18.0, it is possible to have a variable rolling window size for time-series by specifying a time span. For example, the code for summation over a 2-second window in dataframe dft looks like this:
dft.rolling('2s').sum()
Is it possible to do the same with non-datetime spans?
For example, given a dataframe that looks like this:
A B
0 1 1
1 2 2
2 3 3
3 5 5
4 6 6
5 7 7
6 10 10
Is it possible to specify a window span of say 3 on column 'A' and have the sum of column 'B' calculated, so that the output looks something like:
A B
0 1 NaN
1 2 NaN
2 3 5
3 5 10
4 6 14
5 7 18
6 10 17
Not with rolling(). See the documentation for the window argument:
[A variable-sized window] is only valid for datetimelike indexes.
Full text:
window : int, or offset
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
If it's an offset then this will be the time period of each window. Each window will be a variable size based on the observations included in the time-period. This is only valid for datetimelike indexes.
Here's a workaround if you're interested.
df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10, 20)},
                  index=[1, 2, 3, 5, 8, 9, 11, 14, 19, 20])
def var_window(df, size, min_periods=None):
    """Operates on the index."""
    result = []
    df = df.sort_index()
    for i in df.index:
        start = i - size + 1
        # .loc is label-based and inclusive, so this is the trailing span
        res = df.loc[start:i].sum().tolist()
        result.append(res)
    result = pd.DataFrame(result, index=df.index)
    if min_periods:
        result.loc[:min_periods - 1] = np.nan
    return result
print(var_window(df, size=3, min_periods=3))
0 1
1 NaN NaN
2 NaN NaN
3 3.0 33.0
5 5.0 25.0
8 4.0 14.0
9 9.0 29.0
11 11.0 31.0
14 7.0 17.0
19 8.0 18.0
20 17.0 37.0
Explanation: loop through the index. At each value, truncate the DataFrame to the trailing window size. Here 'size' is not a count, but rather a range as you have defined it.
In the above, at the index value of 8, you're summing the values of A for which the index is 6, 7, or 8 (i.e. >= 8 - 3 + 1). The only index value that falls within that range is 8, so the sum is simply the value from the original frame. Comparatively, for the index value of 11, the sum will include the values at indexes 9 and 11 (5 + 6 = 11, the resulting sum for A).
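The trailing span works because .loc slicing is label-based and inclusive on both ends; a quick check with the same df as above:
# With the index [1, 2, 3, 5, 8, 9, 11, 14, 19, 20], this slice keeps every
# row whose label falls in [6, 8] -- here only the row labelled 8.
print(df.loc[6:8])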
Compare this with standard rolling ops:
print(df.rolling(window=3).sum())
A B
1 NaN NaN
2 NaN NaN
3 3.0 33.0
5 6.0 36.0
8 9.0 39.0
9 12.0 42.0
11 15.0 45.0
14 18.0 48.0
19 21.0 51.0
20 24.0 54.0
If I'm misinterpreting your question, let me know how. It's admittedly significantly slower:
%timeit df.rolling(window=3).sum()
1000 loops, best of 3: 627 µs per loop
%timeit var_window(df, size=3, min_periods=3)
100 loops, best of 3: 3.59 ms per loop
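For what it's worth, newer pandas versions (1.0+) added pd.api.indexers.BaseIndexer, which lets rolling() itself use a variable window over a numeric index. A hedged sketch of the idea, reproducing the trailing-span behaviour of var_window above (not necessarily the exact desired output, which looks internally inconsistent):
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer

class SpanIndexer(BaseIndexer):
    """Trailing window covering index labels in [label - window_size + 1, label]."""
    def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
        idx = self.index_array  # stored by BaseIndexer from the constructor kwargs
        end = np.arange(1, num_values + 1, dtype=np.int64)
        # first position whose label is >= label - window_size + 1 (index must be sorted)
        start = np.searchsorted(idx, idx - self.window_size + 1).astype(np.int64)
        return start, end

df = pd.DataFrame({'A': np.arange(10),
                   'B': np.arange(10, 20)},
                  index=[1, 2, 3, 5, 8, 9, 11, 14, 19, 20])
indexer = SpanIndexer(index_array=df.index.to_numpy(), window_size=3)
print(df.rolling(indexer, min_periods=1).sum())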