Pandas: Replace every value in every column matching a pattern with a value from another column in that row - pandas

I have a dataframe with 1000 columns. I want to replace every -9 value in every column with that row's df['a'] value.
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, -9, 8, np.nan, -9], 'c': [-9, 19, -9, -9, -9]})
What I want is
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 2, 8, np.nan, 5], 'c': [1, 19, 3, 4, 5]})
I have tried
df.replace(-9, df['a'], inplace = True)
And
df.replace(-9, np.nan, inplace = True)
df.fillna(df.a, inplace = True)
But they don't change the df.
My solution right now is to use a for loop:
df.replace(-9, np.nan, inplace = True)
col_list = list(df)
for i in col_list:
df[i].fillna(df['a'], inplace = True)
This solution works, but it also replaces any np.nan values. Any ideas as to how I can replace just the -9 values without first converting it into np.nan? Thanks.

I think need mask:
df = df.mask(df == -9, df['a'], axis=0)
print (df)
a b c
0 1 6.0 1
1 2 2.0 19
2 3 8.0 3
3 4 NaN 4
4 5 5.0 5
Or:
df = pd.DataFrame(np.where(df == -9, df['a'].values[:, None], df), columns=df.columns)
print (df)
a b c
0 1.0 6.0 1.0
1 2.0 2.0 19.0
2 3.0 8.0 3.0
3 4.0 NaN 4.0
4 5.0 5.0 5.0

you can also do something like this
import numpy as np
import pandas as pd
df_tar = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 2, 8, np.nan, 5], 'c': [1, 19, 3, 4, 5]})
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, -9, 8, np.nan, -9], 'c': [-9, 19, -9, -9, -9]})
df.loc[df['b']==-9,'b']=df.loc[df['b']==-9,'a']
df.loc[df['c']==-9,'c']=df.loc[df['c']==-9,'a']

Related

Pandas rolling mean only for non-NaNs

If have a DataFrame:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]
'A1': [1, 1, 2, 2, 2]
'A2': [1, 2, 3, 3, 3]})
I want to create a grouped-by on columns "A1" and "A2" and then apply a rolling-mean on "B" with window 3. If less values are available, that is fine, the mean should still be computed. But I do not want any values if there is no original entry.
Result should be:
pd.DataFrame({'B': [0, 1, 2, np.nan, 3]})
Applying df.rolling(3, min_periods=1).mean() yields:
pd.DataFrame({'B': [0, 1, 2, 2, 3]})
Any ideas?
Reason is for mean with widows=3 is ouput some scalars, not NaNs, possible solution is set NaN manually after rolling:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
'A': [1, 1, 2, 2, 2]})
df['C'] = df['B'].rolling(3, min_periods=1).mean().mask(df['B'].isna())
df['D'] = df.groupby('A')['B'].rolling(3, min_periods=1).mean().droplevel(0).mask(df['B'].isna())
print (df)
B A C D
0 0.0 1 0.0 0.0
1 1.0 1 0.5 0.5
2 2.0 2 1.0 2.0
3 NaN 2 NaN NaN
4 4.0 2 3.0 3.0
EDIT: For multiple grouping columns remove levels in Series.droplevel:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
'A1': [1, 1, 2, 2, 2],
'A2': [1, 2, 3, 3, 3]})
df['D'] = df.groupby(['A1','A2'])['B'].rolling(3, min_periods=1).mean().droplevel(['A1','A2']).mask(df['B'].isna())
print (df)
B A1 A2 D
0 0.0 1 1 0.0
1 1.0 1 2 1.0
2 2.0 2 3 2.0
3 NaN 2 3 NaN
4 4.0 2 3 3.0

Convert tabular pandas DataFrame into nested pandas DataFrame

Supposing that i have a simple pd.DataFrame like so:
d = {'col1': [1, 20], 'col2': [3, 40], 'col3': [5, 50]}
df = pd.DataFrame(data=d)
df
col1 col2 col4
0 1 3 5
1 20 40 60
is there a way to convert this to nasted pandas Dataframe (df_new) , so as when i call df_new.values[0] taking as ouptut:
array(
[0 1
1 3
2 5
Length: 3, dtype: int], dtype=object)
I still don't think I understand the exact requirement, but here is something:
One way of getting the desired output is this:
>>> pd.Series(df.T[0].values)
0 1
1 3
2 5
dtype: int64
If you want to have these as 2d arrays:
>>> np.array(pd.DataFrame(df.T[0].values).reset_index())
array([[0, 1],
[1, 3],
[2, 5]])
>>> np.array(pd.DataFrame(df.T[1].values).reset_index())
array([[ 0, 20],
[ 1, 40],
[ 2, 50]])

pandas: most elegant way to pivot table on pattern in name of columns

Given the following DataFrame:
pd.DataFrame({
'x': [0, 1],
'y': [0, 1],
'a_idx': [0, 1],
'a_val': [2, 3],
'b_idx': [4, 5],
'b_val': [6, 7],
})
What is the cleanest way to pivot the DataFrame based on the prefix of the idx and val columns if you have an indeterminate amount of unique prefixes (a, b, ... n), so as to obtain the following DataFrame?
pd.DataFrame({
'x': [0, 1, 0, 1],
'y': [0, 1, 0, 1],
'key': ['a','a','b','b'],
'idx': [0, 1, 4, 5],
'val': [2, 3, 6, 7]
})
I am not very knowledgeable in pandas, so my easiest solution was to go earlier in the data generation process and generate a subset of the result DataFrame for each prefix in SQL, and then concat the result sets into a final DataFrame. I'm curious however if there is a simple way to do this using the API of pandas.DataFrame. Is there such a thing?
Let's try wide_to_long with extras:
(pd.wide_to_long(df,stubnames=['a','b'],
i=['x','y'],
j='key',
sep='_',
suffix='\\w+'
)
.unstack('key').stack(level=0).reset_index()
)
Or manually with melt:
out = df.melt(['x', 'y'])
out = (out.join(out['variable'].str.split('_', expand=True))
.rename(columns={0: 'key'})
.pivot_table(index=['x', 'y', 'key'], columns=[1], values='value')
.reset_index()
)
Output:
key x y level_2 idx val
0 0 0 a 0 2
1 0 0 b 4 6
2 1 1 a 1 3
3 1 1 b 5 7

How to return a list into a dataframe based on matching index of other column

I have a two data frames, one made up with a column of numpy array list, and other with two columns. I am trying to match the elements in the 1st dataframe (df) to get two columns, o1 and o2 from the df2, by matching based on index. I was wondering i can get some inputs.. please note the string 'A1' in column in 'o1' is repeated twice in df2 and as you may see in my desired output dataframe the duplicates are removed in column o1.
import numpy as np
import pandas as pd
array_1 = np.array([[0, 2, 3], [3, 4, 6], [1,2,3,6]])
#dataframe 1
df = pd.DataFrame({ 'A': array_1})
#dataframe 2
df2 = pd.DataFrame({ 'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'], 'o2': [15, 17, 18, 19, 20, 7, 8]})
#desired output
df_output = pd.DataFrame({ 'A': array_1, 'o1': [['A1', 'C1'], ['C1', 'D1', 'F1'], ['B1','A1','C1','F1']],
'o2': [[15, 18, 19], [19, 20, 8], [17,18,19,8]] })
# please note in the output, the 'index 0 of df1 has 0&2 which have same element i.e. 'A1', the output only shows one 'A1' by removing duplicated one.
I believe you can explode df and use that to extract information from df2, then finally join back to df
s = df['A'].explode()
df_output= df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(set(x))))
Output:
A o1 o2
0 [0, 2, 3] [C1, A1] [18, 19, 15]
1 [3, 4, 6] [F1, D1, C1] [8, 19, 20]
2 [1, 2, 3, 6] [F1, B1, C1, A1] [8, 17, 18, 19]

Sort a dictionary in a column in pandas

I have a dataframe as shown below.
user_id Recommended_modules Remaining_modules
1 {A:[5,11], B:[4]} {A:2, B:1}
2 {A:[8,4,2], B:[5], C:[6,8]} {A:7, B:1, C:2}
3 {A:[2,3,9], B:[8]} {A:5, B:1}
4 {A:[8,4,2], B:[5,1,2], C:[6]} {A:3, B:4, C:1}
Brief about the dataframe:
In the column Recommended_modules A, B and C are courses and the numbers inside the list are modules.
Key(Remaining_modules) = Course name
value(Remaining_modules) = Number of modules remaining in that course
From the above I would like to reorder the recommended_modules column based on the values in the Remaining_modules as shown below.
Expected Output:
user_id Ordered_Recommended_modules Ordered_Remaining_modules
1 {B:[4], A:[5,11]} {B:1, A:2}
2 {B:[5], C:[6,8], A:[8,4,2]} {B:1, C:2, A:7}
3 {B:[8], A:[2,3,9]} {B:1, A:5}
4 {C:[6], A:[8,4,2], B:[5,1,2]} {C:1, A:3, B:4}
Explanation:
For user_id = 2, Remaining_modules = {A:7, B:1, C:2}, sort like this {B:1, C:2, A:7}
similarly arrange Recommended_modules also in the same order as shown below
{B:[5], C:[6,8], A:[8,4,2]}.
It is possible, only need python 3.6+:
def f(x):
#https://stackoverflow.com/a/613218/2901002
d1 = {k: v for k, v in sorted(x['Remaining_modules'].items(), key=lambda item: item[1])}
L = d1.keys()
#https://stackoverflow.com/a/21773891/2901002
d2 = {key:x['Recommended_modules'][key] for key in L if key in x['Recommended_modules']}
x['Remaining_modules'] = d1
x['Recommended_modules'] = d2
return x
df = df.apply(f, axis=1)
print (df)
user_id Recommended_modules \
0 1 {'B': [4], 'A': [5, 11]}
1 2 {'B': [5], 'C': [6, 8], 'A': [8, 4, 2]}
2 3 {'B': [8], 'A': [2, 3, 9]}
3 4 {'C': [6], 'A': [8, 4, 2], 'B': [5, 1, 2]}
Remaining_modules
0 {'B': 1, 'A': 2}
1 {'B': 1, 'C': 2, 'A': 7}
2 {'B': 1, 'A': 5}
3 {'C': 1, 'A': 3, 'B': 4}