Dataframe forward-fill till column-specific last valid index - pandas

How do I go from:
[In]: df = pd.DataFrame({
    'col1': [100, np.nan, np.nan, 100, np.nan, np.nan],
    'col2': [np.nan, 100, np.nan, np.nan, 100, np.nan]
})
df
[Out]:
   col1  col2
0   100   NaN
1   NaN   100
2   NaN   NaN
3   100   NaN
4   NaN   100
5   NaN   NaN
To:
[Out]:
   col1  col2
0   100   NaN
1   100   100
2   100   100
3   100   100
4   NaN   100
5   NaN   NaN
My current approach is to apply a custom method that works on one column at a time:
[In]:
def ffill_last_valid(s):
    last_valid = s.last_valid_index()
    s = s.ffill()
    s[s.index > last_valid] = np.nan
    return s

df.apply(ffill_last_valid)
But it seems like overkill to me. Is there a one-liner that works on the dataframe directly?
Note on accepted answer:
See the accepted answer from mozway below.
I know it's a tiny dataframe but:

You can ffill, then keep only the values before the last stretch of NaN with a combination of where and notna/reversed-cummax:
out = df.ffill().where(df[::-1].notna().cummax())
variant:
out = df.ffill().mask(df[::-1].isna().cummin())
Output:
    col1   col2
0  100.0    NaN
1  100.0  100.0
2  100.0  100.0
3  100.0  100.0
4    NaN  100.0
5    NaN    NaN
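To see why this works, here is a small breakdown of the boolean mask (a sketch using the df from the question):

# Reverse the frame, flag non-NaN cells, and take the cumulative max:
# scanning from the bottom, a column turns True at its last valid value
# and stays True for every earlier row.
keep = df[::-1].notna().cummax()
print(keep)
#     col1   col2
# 5  False  False
# 4  False   True
# 3   True   True
# 2   True   True
# 1   True   True
# 0   True   True

# where() aligns on the index, so rows after each column's last valid
# value stay False and the forward-filled values there revert to NaN:
out = df.ffill().where(keep)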
interpolate:
In theory, df.interpolate(method='ffill', limit_area='inside') should work, but while both options work as expected separately, for some reason it doesn't when combined (pandas 1.5.2). This works with df.interpolate(method='zero', limit_area='inside'), though.
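For reference, the variant that does work, as a one-liner (method='zero' requires scipy to be installed):

out = df.interpolate(method='zero', limit_area='inside')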

Related

In pandas, replace table column with Series while joining indexes

I have a table with preexisting columns, and I want to entirely replace some of those columns with values from a series. The tricky part is that each series will have different indexes and I need to add these varying indexes to the table as necessary, like doing a join/merge operation.
For example, this code generates a table and 5 series where each series only has a subset of the indexes.
import random

cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
table = pd.DataFrame(columns=cols)
series = []
for i in range(5):
    series.append(
        pd.Series(
            np.random.randint(0, 3, 2) * 10,
            index=pd.Index(random.sample(range(3), 2))
        )
    )
series
Output:
[1    10
 2     0
 dtype: int32,
 2     0
 0    20
 dtype: int32,
 2    20
 1     0
 dtype: int32,
 2     0
 0    10
 dtype: int32,
 1    20
 2    10
 dtype: int32]
But when I try to replace columns of the table with the series, a simple assignment doesn't work
for i in range(5):
    col = cols[i]
    table[col] = series[i]
table
Output:
    a    b   c    d   e    f    g
1  10  NaN   0  NaN  20  NaN  NaN
2   0    0  20    0  10  NaN  NaN
because the assignment won't add any more indexes after the first series is assigned.
Other things I've tried:
combine or combine_first gives the same result as above. (table[col] = table[col].combine(series[i], lambda a, b: b) and table[col] = series[i].combine_first(table[col]))
pd.concat doesn't work either because of duplicate labels (table[col] = pd.concat([table[col], series[i]]) gives ValueError: cannot reindex on an axis with duplicate labels) and I can't just drop the duplicates because other columns may already have values in those indexes
DataFrame.update won't work since it only takes indexes from the table (join='left'). I need to add indexes from the series to the table as necessary.
Of course, I can always do something like this:
table = table.join(series[i].rename('new'), how='outer')
table[col] = table.pop('new')
which gives the correct result:
      a     b     c     d     e   f   g
0   NaN  20.0   NaN  10.0   NaN NaN NaN
1  10.0   NaN   0.0   NaN  20.0 NaN NaN
2   0.0   0.0  20.0   0.0  10.0 NaN NaN
But that's quite a roundabout way of doing it, and it still isn't robust to column-name collisions, so you'd have to add a handful more lines to guard against those. That produces quite verbose and ugly code for what is conceptually a very simple operation, so I believe there must be a better way of doing it.
pd.concat should work along the column axis:
out = pd.concat(series, axis=1)
print(out)

# Output
      0     1    2    3     4
0  10.0   0.0  0.0  NaN  10.0
1   NaN  10.0  NaN  0.0  20.0
2   0.0   NaN  0.0  0.0   NaN
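If you want the original column names instead of 0-4, pd.concat also accepts a keys argument; a small sketch (assuming the cols list from the question):

out = pd.concat(series, axis=1, keys=cols[:len(series)])
# add the remaining, all-NaN columns only if they are needed:
out = out.reindex(columns=cols)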
You could try constructing the dataframe using a dict comprehension like this:
series:
[0    10
 1     0
 dtype: int64,
 0     0
 1     0
 dtype: int64,
 2    20
 0     0
 dtype: int64,
 0    20
 2     0
 dtype: int64,
 0     0
 1     0
 dtype: int64]
code:
table = pd.DataFrame({
    col: series[i]
    for i, col in enumerate(cols)
    if i < len(series)
})
table
output:
      a    b     c     d    e
0  10.0  0.0   0.0  20.0  0.0
1   0.0  0.0   NaN   NaN  0.0
2   NaN  NaN  20.0   0.0  NaN
If you really need the all-NaN columns at the end, you could do:
table = pd.DataFrame({
    col: series[i] if i < len(series) else np.nan
    for i, col in enumerate(cols)
})
Output:
      a    b     c     d    e   f   g
0  10.0  0.0   0.0  20.0  0.0 NaN NaN
1   0.0  0.0   NaN   NaN  0.0 NaN NaN
2   NaN  NaN  20.0   0.0  NaN NaN NaN
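A variant of the same idea, as a sketch: build the dict with zip (which stops at the shorter of the two lists) and add the trailing all-NaN columns with reindex:

table = pd.DataFrame(dict(zip(cols, series))).reindex(columns=cols)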

Drop a part of rows based on condition & insert them into a new rows of pandas data frame

For the rows where the "Match1" column contains the value "No", I want to drop the values in the first 3 columns and insert them as new rows of the same data frame.
df2 = pd.DataFrame({
    'Name': ['John', 'Tom', 'Tom', 'Ole', 'Ole', 'Tom'],
    'SomeQty': [100, 200, 300, 500, 600, 400],
    'Match': ['Yes', 'No', 'Yes', 'No', 'No', 'No'],
    'SomeValue': [100, 200, 200, 500, 600, 200],
    'Match1': ['Yes', 'Yes', 'Yes', 'No', 'No', 'Yes'],
})
My expected result is:
The way I followed to do this is:
# Define an intermediary dataframe
df4 = pd.DataFrame(columns=['Name', 'SomeQty', 'Match', 'Match1', 'SomeValue'])

# Copy the relevant data in order to drop and reassign it
df4 = df4.append(df2.loc[df2['Name'] == 'Ole', ['Name', 'SomeQty', 'Match', 'Match1']].copy())

# Drop the data from the main table
df2.iloc[:, 0:3] = df2.iloc[:, 0:3].drop(df2[df2['Name'] == 'Ole'].index)

# Append the relevant data from the intermediary table
df2 = df2.append([df4], ignore_index=True, sort=False)
del df4
I'd like to know a better way to achieve this. TIA.
A simpler version using a boolean mask would be:
cols = ['Name', 'SomeQty', 'Match']
mask = df2['Match1'].eq('No')

out = pd.concat(
    [df2.mask(mask, df2.drop(cols, axis=1)),
     df2.loc[mask, cols]],
    ignore_index=True)
Output:
   Name  SomeQty Match  SomeValue Match1
0  John    100.0   Yes      100.0    Yes
1   Tom    200.0    No      200.0    Yes
2   Tom    300.0   Yes      200.0    Yes
3   NaN      NaN   NaN      500.0     No
4   NaN      NaN   NaN      600.0     No
5   Tom    400.0    No      200.0    Yes
6   Ole    500.0    No        NaN    NaN
7   Ole    600.0    No        NaN    NaN
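For readability, the same result can also be written as a more explicit, step-by-step sketch (reusing df2 from the question and the cols/mask defined above):

moved = df2.loc[mask, cols]            # the cell values that become new rows
emptied = df2.copy()
emptied[cols] = df2[cols].mask(mask)   # blank those cells in the original rows
out = pd.concat([emptied, moved], ignore_index=True)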

Grouping by and applying lambda with condition for the first row - Pandas

I have a data frame with IDs and the choices that have been made by those IDs.
The alternatives (choices) set is a list of integers: [10, 20, 30, 40].
Note: it's important to use this list. Let's call it 'choice_list'.
This is the data frame:
ID  Choice
1   10
1   30
1   10
2   40
2   40
2   40
3   20
3   40
3   10
I want to create a variable for each alternative: '10_Var', '20_Var', '30_Var', '40_Var'.
On the first row of each ID: if, for example, the first choice was '10', then the variable '10_Var' will get the value 0.6 (some parameter), and each of the other variables ('20_Var', '30_Var', '40_Var') will get the value (1 - 0.6) / 4.
The number 4 stands for the number of alternatives.
Expected result:
ID  Choice  10_Var  20_Var  30_Var  40_Var
1   10      0.6     0.1     0.1     0.1
1   30
1   10
2   40      0.1     0.1     0.1     0.6
2   40
2   40
3   20      0.1     0.6     0.1     0.1
3   40
3   10
You can use np.where to do this. It is more efficient than df.where.
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40],
                   ['2', 40], ['3', 20], ['3', 40], ['3', 10]],
                  columns=('ID', 'Choice'))
choices = np.unique(df.Choice)
for choice in choices:
    df[f"var_{choice}"] = np.where(df.Choice == choice, 0.6, (1 - 0.6) / 4)
df
Result
  ID  Choice  var_10  var_20  var_30  var_40
0  1      10     0.6     0.1     0.1     0.1
1  1      30     0.1     0.1     0.6     0.1
2  1      10     0.6     0.1     0.1     0.1
3  2      40     0.1     0.1     0.1     0.6
4  2      40     0.1     0.1     0.1     0.6
5  2      40     0.1     0.1     0.1     0.6
6  3      20     0.1     0.6     0.1     0.1
7  3      40     0.1     0.1     0.1     0.6
8  3      10     0.6     0.1     0.1     0.1
Edit
To set values on the 1st row of each group only:
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40],
                   ['2', 40], ['3', 20], ['3', 40], ['3', 10]],
                  columns=('ID', 'Choice'))
df = df.set_index("ID")

## create unique index for each row if not already
df = df.reset_index()
choices = np.unique(df.Choice)

## get unique id of 1st row of each group
grouped = df.loc[df.reset_index().groupby("ID")["index"].first()]

## set value for each new variable
for choice in choices:
    grouped[f"var_{choice}"] = np.where(grouped.Choice == choice, 0.6, (1 - 0.6) / 4)

pd.concat([df, grouped.iloc[:, -len(choices):]], axis=1)
We can use insert to create the new columns based on the unique Choice values obtained through Series.unique. We can also create a mask to fill only the first row of each ID using np.where.
At the beginning, sort_values is used to sort the values based on the ID. You can skip this step if your data frame is already well sorted (like the one shown in the example):
df = df.sort_values('ID')
n = df['Choice'].nunique()
mask = df['ID'].ne(df['ID'].shift())

for choice in df['Choice'].sort_values(ascending=False).unique():
    df.insert(2, column=f'{choice}_Var', value=np.nan)
    df.loc[mask, f'{choice}_Var'] = np.where(df.loc[mask, 'Choice'].eq(choice), 0.6, 0.4/n)

print(df)
  ID  Choice  10_Var  20_Var  30_Var  40_Var
0  1      10     0.6     0.1     0.1     0.1
1  1      30     NaN     NaN     NaN     NaN
2  1      10     NaN     NaN     NaN     NaN
3  2      40     0.1     0.1     0.1     0.6
4  2      40     NaN     NaN     NaN     NaN
5  2      40     NaN     NaN     NaN     NaN
6  3      20     0.1     0.6     0.1     0.1
7  3      40     NaN     NaN     NaN     NaN
8  3      10     NaN     NaN     NaN     NaN
A mixed numpy and pandas solution:
rows = np.unique(df.ID.values, return_index=1)[1]
df1 = df.loc[rows].assign(val=0.6)
df2 = (pd.crosstab([df1.index, df1.ID, df1.Choice], df1.Choice, df1.val, aggfunc='first')
         .reindex(choice_list, axis=1)
         .fillna((1 - 0.6) / len(choice_list))
         .reset_index(level=[1, 2], drop=True))

pd.concat([df, df2], axis=1)
Out[217]:
  ID  Choice   10   20   30   40
0  1      10  0.6  0.1  0.1  0.1
1  1      30  NaN  NaN  NaN  NaN
2  1      10  NaN  NaN  NaN  NaN
3  2      40  0.1  0.1  0.1  0.6
4  2      40  NaN  NaN  NaN  NaN
5  2      40  NaN  NaN  NaN  NaN
6  3      20  0.1  0.6  0.1  0.1
7  3      40  NaN  NaN  NaN  NaN
8  3      10  NaN  NaN  NaN  NaN
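For completeness, a sketch of yet another way to build the first-row values without a per-choice loop, based on get_dummies (assuming df is the question's frame with ID and Choice columns, choice_list = [10, 20, 30, 40], and the 0.6 parameter):

p = 0.6
first = df.groupby('ID').head(1)                      # first row of each ID
dummies = (pd.get_dummies(first['Choice'])
             .reindex(columns=choice_list, fill_value=0)
             .astype(float))
vars_ = dummies * p + (1 - dummies) * (1 - p) / len(choice_list)
vars_.columns = [f'{c}_Var' for c in choice_list]
out = df.join(vars_)                                  # non-first rows stay NaN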

Pandas dataframe creating multiple rows at once via .loc

I can create a new row in a dataframe using .loc:
>>> df = pd.DataFrame({'a': [10, 20], 'b': [100, 200]}, index='1 2'.split())
>>> df
    a    b
1  10  100
2  20  200
>>> df.loc[3, 'a'] = 30
>>> df
      a      b
1  10.0  100.0
2  20.0  200.0
3  30.0    NaN
But how can I create more than one row using the same method?
>>> df.loc[[4, 5], 'a'] = [40, 50]
...
KeyError: '[4 5] not in index'
I'm familiar with .append() but am looking for a way that does NOT require constructing the new row as a Series before appending it to df.
Desired input:
>>> df.loc[[4, 5], 'a'] = [40, 50]
Desired output
      a      b
1  10.0  100.0
2  20.0  200.0
3  30.0    NaN
4  40.0    NaN
5  50.0    NaN
where the last 2 rows are newly added.
Admittedly, this is a very late answer, but I have had to deal with a similar problem and think my solution might be helpful to others as well.
After recreating your data, it is basically a two-step approach:
Recreate data:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index='1 2'.split())
df.loc[3, 'a'] = 30
Extend the df.index using .reindex:
idx = list(df.index)
new_rows = list(map(str, range(4, 6)))  # more easily extensible than new_rows = ["4", "5"]
idx.extend(new_rows)
df = df.reindex(index=idx)
Set the values using .loc:
df.loc[new_rows, "a"] = [40, 50]
giving you
>>> df
      a      b
1  10.0  100.0
2  20.0  200.0
3  30.0    NaN
4  40.0    NaN
5  50.0    NaN
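As a side note, building the idx list by hand can be avoided with Index.union (a sketch on the same df; sort=False keeps the insertion order instead of trying to sort the mixed-type index):

df = df.reindex(df.index.union(new_rows, sort=False))
df.loc[new_rows, "a"] = [40, 50]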
Example data
>>> data = pd.DataFrame({
...     'a': [10, 6, -3, -2, 4, 12, 3, 3],
...     'b': [6, -3, 6, 12, 8, 11, -5, -5],
...     'id': [1, 1, 1, 1, 6, 2, 2, 4]})
Case 1: Note that the range can be altered to whatever you desire.
>>> for i in range(10):
... data.loc[i, 'a'] = 30
...
>>> data
      a     b   id
0  30.0   6.0  1.0
1  30.0  -3.0  1.0
2  30.0   6.0  1.0
3  30.0  12.0  1.0
4  30.0   8.0  6.0
5  30.0  11.0  2.0
6  30.0  -5.0  2.0
7  30.0  -5.0  4.0
8  30.0   NaN  NaN
9  30.0   NaN  NaN
Case 2: Here we are adding a new column to a data frame that had 8 rows to begin with. As we extend our new column c to be of length 10, the other columns are extended with NaN.
>>> for i in range(10):
... data.loc[i, 'c'] = 30
...
>>> data
      a     b   id     c
0  10.0   6.0  1.0  30.0
1   6.0  -3.0  1.0  30.0
2  -3.0   6.0  1.0  30.0
3  -2.0  12.0  1.0  30.0
4   4.0   8.0  6.0  30.0
5  12.0  11.0  2.0  30.0
6   3.0  -5.0  2.0  30.0
7   3.0  -5.0  4.0  30.0
8   NaN   NaN  NaN  30.0
9   NaN   NaN  NaN  30.0
Also somewhat late, but my solution was similar to the accepted one:
import pandas as pd
df = pd.DataFrame({'a':[10, 20], 'b':[100,200]}, index=[1,2])
# single index assignment always works
df.loc[3, 'a'] = 30
# multiple indices
new_rows = [4,5]
# there should be a nicer way to add more than one index/row at once,
# but at least this is just one extra line:
df = df.reindex(index=df.index.append(pd.Index(new_rows))) # note: Index.append() doesn't accept non-Index iterables?
# multiple new rows now works:
df.loc[new_rows, "a"] = [40, 50]
print(df)
... which yields:
      a      b
1  10.0  100.0
2  20.0  200.0
3  30.0    NaN
4  40.0    NaN
5  50.0    NaN
This also works now (useful when performance matters when aggregating dataframes):
# inserting whole rows:
df.loc[new_rows] = [[41, 51], [61,71]]
print(df)
      a      b
1  10.0  100.0
2  20.0  200.0
3  30.0    NaN
4  41.0   51.0
5  61.0   71.0

Counting null as percentage

Is there a fast way to automatically generate the null percentage for each column, and output it as a table?
e.g., if a column has 40 rows with 10 null values, it will be 10/40.
I used the following code, but it doesn't work (no values shown):
You could use df.count()
In [56]: df
Out[56]:
     a    b
0  1.0  NaN
1  2.0  1.0
2  NaN  NaN
3  NaN  NaN
4  5.0  NaN

In [57]: 1 - df.count()/len(df.index)
Out[57]:
a    0.4
b    0.8
dtype: float64
Timings: count is decently faster than isnull().sum()
In [68]: df.shape
Out[68]: (50000, 2)
In [69]: %timeit 1 - df.count()/len(df.index)
1000 loops, best of 3: 542 µs per loop
In [70]: %timeit df.isnull().sum()/df.shape[0]
100 loops, best of 3: 2.87 ms per loop
IIUC then you can use isnull with sum and then divide by the number of rows:
In [12]:
df = pd.DataFrame({'a': [1, 2, np.NaN, np.NaN, 5], 'b': [np.NaN, 1, np.NaN, np.NaN, np.NaN]})
df

Out[12]:
     a    b
0  1.0  NaN
1  2.0  1.0
2  NaN  NaN
3  NaN  NaN
4  5.0  NaN
In [14]:
df.isnull().sum()/df.shape[0]

Out[14]:
a    0.4
b    0.8
dtype: float64
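One more note: since the mean of a boolean mask is the fraction of True values, isna().mean() gives the same ratios in a single step; a small sketch (the pct_null column name is arbitrary):

In [15]: df.isna().mean()
Out[15]:
a    0.4
b    0.8
dtype: float64

In [16]: df.isna().mean().mul(100).to_frame('pct_null')
Out[16]:
   pct_null
a      40.0
b      80.0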