Pandas: set order of newly generated columns

I am working on a dataset and wrote some code that splits the data in column (COL) on commas (,) and writes the split pieces into new columns. Now I want the code to generate the new columns in a given order (desired output). The code is attached below. Thank you in advance.
Input
X1 COL Y1
----------------
A X,Y,Z 146#12
B Z 223#13
C Y,X 725#14
Current output:
X1 Y1 COL-0 COL-1 COL-2
-----------------------------
A 146#12 X Y Z
B 223#13 Z NaN NaN
C 725#14 Y X NaN
Desired output:
X1 COL-1 COL-2 COL-3 Y1
------------------------------
A X Y Z 146#12
B Z - - 223#13
C Y X - 725#14
Script
import pandas as pd
import numpy as np
df = pd.read_csv(r"<PATH TO YOUR CSV>")
for row, item in enumerate(df["COL"]):
    l = item.split(",")
    for idx, elem in enumerate(l):
        col = "COL-%s" % idx
        if col not in df.columns:
            df[col] = np.nan
        # chained assignment; df.loc[row, col] = elem avoids SettingWithCopyWarning
        df[col][row] = elem
df = df.drop(columns=["COL"])
print(df)

Use DataFrame.pop — it removes the column and returns it, so reassigning it moves Y1 to the end:
df['Y1'] = df.pop('Y1')
The splitting itself can also be simplified with Series.str.split:
df = (df.join(df.pop('COL').str.split(',', expand=True)
.fillna('-')
.rename(columns = lambda x: f'COL-{x+1}')))
df['Y1'] = df.pop('Y1')
print (df)
X1 COL-1 COL-2 COL-3 Y1
0 A X Y Z 146#12
1 B Z - - 223#13
2 C Y X - 725#14

If you wish to replace the NaN values with dashes you can use fillna(), and to keep the columns in the order you specified you can simply select them in that order.
df_output = df[['X1','COL-1','COL-2','COL-3','Y1']].fillna(value='-')
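Note that the original script names the new columns COL-0, COL-1 and COL-2, so if you are starting from that output you may want to shift the suffixes first. A small sketch, assuming exactly three split columns as in the sample data:
df = df.rename(columns={'COL-0': 'COL-1', 'COL-1': 'COL-2', 'COL-2': 'COL-3'})
df_output = df[['X1', 'COL-1', 'COL-2', 'COL-3', 'Y1']].fillna(value='-')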

Not the most elegant of methods, but this should handle your real data and intended result:
import re
cols = df.filter(like='COL').columns.tolist()
pat = r'(\w+)'
new_cols = [(f'{re.match(pat,col).group(0)} {i}') for i,col in enumerate(cols,1)]
df.rename(columns=dict(zip(cols,new_cols)),inplace=True)
df['Y1'] = df.pop('Y1')
out:
X1 COL 1 COL 2 COL 3 Y1
0 A X Y Z 146#12
1 B Z NaN NaN 223#13
2 C Y X NaN 725#14
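To also get the dashes from the desired output, filling the remaining NaN cells afterwards should be enough (a small addition to the snippet above):
df = df.fillna('-')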

Related

df.apply(myfunc, axis=1) results in error but df.groupby(df.index).apply(myfunc) does not

I have a dataframe that looks like this:
a b c
0 x x x
1 y y y
2 z z z
I would like to apply a function to each row of dataframe. That function then creates a new dataframe with multiple rows from each input row and returns it. Here is my_func:
def my_func(df):
    dup_num = int(df.c - df.a)
    if isinstance(df, pd.Series):
        df_expanded = pd.concat([pd.DataFrame(df).transpose()] * dup_num,
                                ignore_index=True)
    else:
        df_expanded = pd.concat([pd.DataFrame(df)] * dup_num,
                                ignore_index=True)
    return df_expanded
The final dataframe will look like something like this:
a b c
0 x x x
1 x x x
2 y y y
3 y y y
4 y y y
5 z z z
6 z z z
So I did:
df_expanded = df.apply(my_func, axis=1)
I inserted breakpoints inside the function and for each row, the created dataframe from my_func is correct. However, at the end, when the last row returns, I get an error stating that:
ValueError: cannot copy sequence with size XX to array axis with dimension YY
It is as if apply were trying to return a Series rather than the group of DataFrames that the function created.
So instead of df.apply I did:
df_expanded = df.groupby(df.index).apply(my_func)
This just creates groups of single rows and applies the same function, and, unlike df.apply, it works.
Why?
Perhaps we can take advantage of how pd.Series.explode and pd.Series.apply(pd.Series) work to simplify this process.
Given:
a b c
0 1 1 4
1 2 2 4
2 3 3 4
Doing:
new_df = (df.apply(lambda x: [x.tolist()]*(x.c-x.a), axis=1)
.explode(ignore_index=True)
.apply(pd.Series))
new_df.columns = df.columns
print(new_df)
Output:
a b c
0 1 1 4
1 1 1 4
2 1 1 4
3 2 2 4
4 2 2 4
5 3 3 4
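As a follow-up to the answer above, .apply(pd.Series) rebuilds a Series for every exploded row, which can get slow on larger frames. A sketch of the same idea that builds the result frame in one call (assumes the same toy df as above):
repeated = df.apply(lambda x: [x.tolist()] * (x.c - x.a), axis=1).explode(ignore_index=True)
# repeated is a Series of row-lists; hand the list of rows to the DataFrame constructor directly
new_df = pd.DataFrame(repeated.tolist(), columns=df.columns)
print(new_df)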

Pandas DataFrame transformation - understanding which functions to use and what logic to follow

I've got a hard problem with transforming a dataframe into another one.
I don't know what functions I should use to do what I want. I had some ideas that didn't work at all.
For example, I do not understand how I should use append (or if I should use it or something else) to do what I want.
Here is my original dataframe:
df1 = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2'],
    'X0': ['a', 'b', 'a'],
    'Y0': ['c', 'd', 'c'],
    'X1': ['e', 'f', 'f'],
    'Y1': ['g', 'h', 'h']
})
Key X0 Y0 X1 Y1
0 K0 a c e g
1 K1 b d f h
2 K2 a c f h
This dataframe represents all the links attached to an ID in the Key column. Links follow each other: X0->Y0 is the parent of X1->Y1.
It's easy to read, and the real dataframe I'm working with has 6500 rows by 21 columns and represents a tree of links. So this dataframe follows an end-to-end-links logic.
I want to transform it into another one that follows a unitary-link-and-ID logic (because it's a tree of links, some unitary links may be part of multiple end-to-end links).
So I want to get each individual link into X->Y, and I need the list of the Keys attached to each unitary link in Key.
And here is what I want :
df3 = pd.DataFrame({
    'Key': [['K0', 'K2'], 'K1', 'K0', ['K1', 'K2']],
    'X': ['a', 'b', 'e', 'f'],
    'Y': ['c', 'd', 'g', 'h']
})
Key X Y
0 [K0, K2] a c
1 K1 b d
2 K0 e g
3 [K1, K2] f h
To do this, I first have to combine X0 and X1 into a single X column, and likewise Y0 and Y1 into a single Y column. At the same time I need to keep the keys attached to the links. This first transformation leads to a new dataframe containing all the original information, with duplicates that I will deal with afterwards to obtain df3.
Here is the transition dataframe :
df2 = pd.DataFrame({
    'Key': ['K0', 'K1', 'K2', 'K0', 'K1', 'K2'],
    'X': ['a', 'b', 'a', 'e', 'f', 'f'],
    'Y': ['c', 'd', 'c', 'g', 'h', 'h']
})
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 K0 e g
4 K1 f h
5 K2 f h
Transition from df1 to df2
For now, I did this to put X0, X1 and Y0, Y1 into X and Y:
Key = pd.Series(dtype=str)
X = pd.Series(dtype=str)
Y = pd.Series(dtype=str)
for i in df1.columns:
    if 'K' in i:
        Key = Key.append(df1[i], ignore_index=True)
    elif 'X' in i:
        X = X.append(df1[i], ignore_index=True)
    elif 'Y' in i:
        Y = Y.append(df1[i], ignore_index=True)
0 K0
1 K1
2 K2
dtype: object
0 a
1 b
2 a
3 e
4 f
5 f
dtype: object
0 c
1 d
2 c
3 g
4 h
5 h
dtype: object
But I do not know how to get the keys to keep them in front of the right links.
Also, I do this to construct df2, but it feels clumsy and I do not understand how I should do it:
df2 = pd.DataFrame({
    'Key': Key,
    'X': X,
    'Y': Y
})
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 NaN e g
4 NaN f h
5 NaN f h
I tried to use append to combine the X0, X1 and Y0, Y1 columns directly into df2, but it turned out to be a complete mess, not filling the df2 columns with the df1 columns' content. I also tried to use append to put the Series Key, X and Y directly into df2, but it gave me X and Y as rows instead of columns.
In short, I'm quite lost. I know there may be a lot to program to take df1, turn it into df2 and then into df3. I'm not asking you to solve the problem for me, but I really need help with the functions I should use or the logic I should put in place to achieve my goal.
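A side note on the attempt above: Series.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same column gathering needs pd.concat. A minimal sketch of the equivalent:
# gather the Key, X and Y columns with pd.concat instead of Series.append
Key = pd.concat([df1[c] for c in df1.columns if 'K' in c], ignore_index=True)
X = pd.concat([df1[c] for c in df1.columns if 'X' in c], ignore_index=True)
Y = pd.concat([df1[c] for c in df1.columns if 'Y' in c], ignore_index=True)
df2 can then be assembled from Key, X and Y exactly as in the question.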
To convert df1 to df2 you want to look into pandas.wide_to_long.
https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
>>> df2 = pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
>>> df2
X Y
Key num
K0 0 a c
K1 0 b d
K2 0 a c
K0 1 e g
K1 1 f h
K2 1 f h
You can drop the unwanted level "num" from the index using droplevel and turn the index level "Key" into a column using reset_index. Chaining everything:
>>> df2 = (
pd.wide_to_long(df1, stubnames=['X','Y'], i='Key', j='num')
.droplevel(level='num')
.reset_index()
)
>>> df2
Key X Y
0 K0 a c
1 K1 b d
2 K2 a c
3 K0 e g
4 K1 f h
5 K2 f h
Finally, to get df3 you just need to group df2 by "X" and "Y", and aggregate the "Key" groups into lists.
>>> df3 = df2.groupby(['X','Y'], as_index=False).agg(list)
>>> df3
X Y Key
0 a c [K0, K2]
1 b d [K1]
2 e g [K0]
3 f h [K1, K2]
If you don't want single keys to be lists you can do this instead
>>> df3 = (
df2.groupby(['X','Y'], as_index=False)
.agg(lambda g: g.iloc[0] if len(g) == 1 else list(g))
)
>>> df3
X Y Key
0 a c [K0, K2]
1 b d K1
2 e g K0
3 f h [K1, K2]

Pandas replacing multiple words

df
col
a,b
b
c
b,c
Goal
a→x, b→y, c→z
col
x,y
y
z
y,z
Try
df['col']=df['col'].replace({'a':'x','b':'y','c':'z'})
It only works for cells with a single value; cells with multiple values like x,y fail.
Or you could try using this piece of code:
>>> df['col'].str.split(',', expand=True).fillna('').replace({'a':'x','b':'y','c':'z'}).apply(','.join, axis=1).str.rstrip(',')
0 x,y
1 y
2 z
3 y,z
dtype: object
By default, replace with a dict only matches whole cell values; add parameter regex=True for substring replacement:
df['col']=df['col'].replace({'a':'x','b':'y','c':'z'}, regex=True)
print (df)
col
0 x,y
1 y
2 z
3 y,z
Another idea: use dict.get to replace each split value, falling back to the original value if there is no match, and then join back with ,:
d = {'a':'x','b':'y','c':'z'}
df['col']=df['col'].apply(lambda x: ','.join(d.get(y, y) for y in x.split(',')))
print (df)
col
0 x,y
1 y
2 z
3 y,z
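One caveat worth noting about the replacement approaches above: regex=True does plain substring replacement, so if your real values can contain one another (say both a and ab appear as map keys), you may want to anchor the patterns to whole comma-separated tokens. A hedged sketch of that idea:
import re

d = {'a': 'x', 'b': 'y', 'c': 'z'}
# match keys only when they form a whole comma-separated token
pat = r'(?<![^,])(' + '|'.join(map(re.escape, d)) + r')(?![^,])'
df['col'] = df['col'].str.replace(pat, lambda m: d[m.group(1)], regex=True)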

In a pandas dataframe with a MultiIndex, how to conditionally fill missing values with group means?

Setup:
# create a MultiIndex
dfx = pd.MultiIndex.from_product([
    list('ab'),
    list('cd'),
    list('xyz'),
], names=['idx1', 'idx2', 'idx3'])
# create a dataframe that fits the index
df = pd.DataFrame([None, .9, -.08, -2.11, 1.09, .38, None, None, -.37, -.86, 1.51, -.49], columns=['random_data'])
df.set_index(dfx, inplace=True)
Output:
random_data
idx1 idx2 idx3
a c x NaN
y 0.90
z -0.08
d x -2.11
y 1.09
z 0.38
b c x NaN
y NaN
z -0.37
d x -0.86
y 1.51
z -0.49
Within this index hierarchy, I am trying to accomplish the following:
When a value is missing within [idx1, idx2, idx3], fill NaN with the group mean of [idx1, idx2]
When multiple values are missing within [idx1, idx2, idx3], fill NaN with the group mean of [idx1]
I have tried df.apply(lambda col: col.fillna(col.groupby(by='idx1').mean())) as a way to solve #2, but I haven't been able to get it to work.
UPDATE
OK, so I have this solved in parts, but still at a loss about how to apply these conditionally:
For case #1:
df.unstack().apply(lambda col: col.fillna(col.mean()), axis=1).stack().
I verified that the correct value was filled by looking at this:
df.groupby(by=['idx1', 'idx2']).mean(),
but it also replaces the missing values that I am trying to handle differently in case #2.
Similarly for #2:
df.unstack().unstack().apply(lambda col: col.fillna(col.mean()), axis=1).stack().stack()
verified the values replaced were correct by looking at
df.groupby(by=['idx1']).mean()
but it also applies to case #1, which I don't want.
I'm sure there is a more elegant way of doing this, but the following should achieve your desired result:
def get_null_count(df, group_levels, column):
    result = (
        df.loc[:, column]
        .groupby(group_levels)
        .transform(lambda x: x.isnull().sum())
    ).astype("int")
    return result

def fill_groups(
    df,
    count_group_levels,
    column,
    missing_count_idx_map
):
    null_counts = get_null_count(
        df, count_group_levels, column
    )
    condition_masks = {
        count: ((null_counts == count) & df[column].isnull()).to_numpy()
        for count in missing_count_idx_map.keys()
    }
    condition_values = {
        count: df.loc[:, column]
        .groupby(indices)
        .transform("mean")
        .to_numpy()
        for count, indices in missing_count_idx_map.items()
    }
    # Defaults
    condition_masks[0] = (~df[column].isnull()).to_numpy()
    condition_values[0] = df[column].to_numpy()
    sorted_keys = sorted(missing_count_idx_map.keys()) + [0]
    conditions = [
        condition_masks[count]
        for count in sorted_keys
    ]
    values = [
        condition_values[count]
        for count in sorted_keys
    ]
    result = np.select(conditions, values)
    return result

col = "random_data"
missing_count_idx_map = {
    1: ['idx1', 'idx2'],
    2: ['idx1']
}
df["filled"] = fill_groups(
    df, ['idx1', 'idx2'], col, missing_count_idx_map
)
df then looks like:
random_data filled
idx1 idx2 idx3
a c x NaN -0.20
y 1.16 1.16
z -1.56 -1.56
d x 0.47 0.47
y -0.54 -0.54
z -0.30 -0.30
b c x NaN -0.40
y NaN -0.40
z 0.29 0.29
d x 0.98 0.98
y -0.41 -0.41
z -2.46 -2.46
IIUC, you may try this. Get the mean of level idx1 and the mean of levels [idx1, idx2]. Fill NaN with the mean of [idx1, idx2]. Next, use mask to assign the rows of groups having more than 1 NaN the mean of idx1.
Sample `df`:
random_data
idx1 idx2 idx3
a c x NaN
y -0.09
z -0.01
d x -1.30
y -0.11
z 1.33
b c x NaN
y NaN
z 0.74
d x -1.44
y 0.50
z -0.61
df1_m = df.mean(level='idx1')
df12_m = df.mean(level=['idx1', 'idx2'])
m = df.isna().groupby(level=['idx1', 'idx2']).transform('sum').gt(1)
df_filled = df.fillna(df12_m).mask(m & df.isna(), df1_m)
Out[110]:
random_data
idx1 idx2 idx3
a c x -0.0500
y -0.0900
z -0.0100
d x -1.3000
y -0.1100
z 1.3300
b c x -0.2025
y -0.2025
z 0.7400
d x -1.4400
y 0.5000
z -0.6100
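A note for newer pandas: the level= argument of DataFrame.mean was deprecated and removed in pandas 2.0, so on recent versions the two group means above would be computed with groupby instead (same values, different spelling):
# groupby equivalents of df.mean(level=...) for pandas 2.x
df1_m = df.groupby(level='idx1').mean()
df12_m = df.groupby(level=['idx1', 'idx2']).mean()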
OK, solved it.
First, I made a dataframe containing counts by group of non-missing values:
truth_table = df.apply(lambda row: row.count(), axis = 1).groupby(by=['idx1', 'idx2']).sum()
>> truth_table
idx1 idx2
a c 2
d 3
b c 1
d 3
dtype: int64
Then set up a dataframe (one for each case I'm trying to resolve) containing the group means:
means_ab = df.groupby(by=['idx1']).mean()
>> means_ab
idx1
a 0.0360
b -0.0525
means_abcd = df.groupby(by=['idx1', 'idx2']).mean()
>> means_abcd
idx1 idx2
a c 0.410000
d -0.213333
b c -0.370000
d 0.053333
Given the structure of my data, I know:
Case #1 corresponds to a given [idx1, idx2] index grouping having exactly one missing value according to truth_table (i.e., these are the NaN values I want to replace with values from means_abcd).
Case #2 corresponds to a given [idx1, idx2] index grouping having more than one missing value according to truth_table (i.e., these are the NaN values I want to replace with values from means_ab).
fix_case_2 = df.combine_first(df[truth_table > 1].fillna(means_ab, axis=1))
>> fix_case_2
idx1 idx2 idx3
a c x NaN
y 0.9000
z -0.0800
d x -2.1100
y 1.0900
z 0.3800
b c x -0.0525 *
y -0.0525 *
z -0.3700
d x -0.8600
y 1.5100
z -0.4900
df = fix_case_2.combine_first(df[truth_table == 1].fillna(means_abcd, axis=1))
>> df
idx1 idx2 idx3
a c x 0.4100 *
y 0.9000
z -0.0800
d x -2.1100
y 1.0900
z 0.3800
b c x -0.0525 *
y -0.0525 *
z -0.3700
d x -0.8600
y 1.5100
z -0.4900

Replace cell values in df based on complex condition

Hello friends,
I would like to iterate through all the numeric columns in the df (in a generic way).
For each unique df["Type"] group in each numeric column:
Replace all values that are greater than the column mean + 2 standard deviations with NaN.
df = pd.DataFrame(data=d)
df['Test1']=[7,1,2,5,1,90]
df['Test2']=[99,10,13,12,11,87]
df['Type']=['Y','X','X','Y','Y','X']
Sample df:
PRODUCT Test1 Test2 Type
A 7 99 Y
B 1 10 X
C 2 13 X
A 5 12 Y
B 1 11 Y
C 90 87 X
Expected output:
PRODUCT Test1 Test2 Type
A 7 nan Y
B 1 10 X
C 2 13 X
A 5 12 Y
B 1 11 Y
C nan nan X
Logically, it can go like this:
test_cols = ['Test1', 'Test2']
# calculate mean and std with groupby
groups = df.groupby('Type')
test_mean = groups[test_cols].transform('mean')
test_std = groups[test_cols].transform('std')
# threshold
thresh = test_mean + 2 * test_std
# thresholding
df[test_cols] = np.where(df[test_cols]>thresh, np.nan, df[test_cols])
However, from your sample data set, thresh is:
Test1 Test2
0 10.443434 141.707912
1 133.195890 123.898159
2 133.195890 123.898159
3 10.443434 141.707912
4 10.443434 141.707912
5 133.195890 123.898159
So, it wouldn't change anything.
You can get this through a groupby and transform:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['Product'] = ['A', 'B', 'C', 'A', 'B', 'C']
df['Test1']=[7,1,2,5,1,90]
df['Test2']=[99,10,13,12,11,87]
df['Type']=['Y','X','X','Y','Y','X']
df = df.set_index('Product')
def nan_out_values(type_df):
    type_df[type_df > type_df.mean() + 2*type_df.std()] = np.nan
    return type_df
df[['Test1', 'Test2']] = df.groupby('Type').transform(nan_out_values)
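If you prefer not to define a named function, the same thresholding can be written with a lambda and Series.where (a sketch, same grouping and 2-standard-deviation cut-off as above):
# keep values at or below mean + 2*std within each Type group, NaN out the rest
df[['Test1', 'Test2']] = df.groupby('Type')[['Test1', 'Test2']].transform(
    lambda s: s.where(s <= s.mean() + 2 * s.std())
)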