Create columns in python data frame based on existing column-name and column-values

I have a dataframe in pandas:
import pandas as pd
# assign data of lists.
data = {'Gender': ['M', 'F', 'M', 'F','M', 'F','M', 'F','M', 'F','M', 'F'],
'Employment': ['R','U', 'E','R','U', 'E','R','U', 'E','R','U', 'E'],
'Age': ['Y','M', 'O','Y','M', 'O','Y','M', 'O','Y','M', 'O']
}
# Create DataFrame
df = pd.DataFrame(data)
df
For each category of each existing column, I want to create a new column named as follows:
Gender_M -> for when the gender equals M
Gender_F -> for when the gender equals F
Employment_R -> for when employment equals R
Employment_U -> for when employment equals U
and so on...
So far, I have created the below code:
for i in range(len(df.columns)):
    curent_column=list(df.columns)[i]
    col_df_array = df[curent_column].unique()
    for j in range(col_df_array.size):
        new_col_name = str(list(df.columns)[i])+"_"+col_df_array[j]
        for index,row in df.iterrows():
            if(row[curent_column] == col_df_array[j]):
                df[new_col_name] = row[curent_column]
The problem is that even though I have successfully created the column names, I am not able to get the correct column values.
For example the column Gender should be as below:
data2 = {'Gender': ['M', 'F', 'M', 'F','M', 'F','M', 'F','M', 'F','M', 'F'],
'Gender_M': ['M', 'na', 'M', 'na','M', 'na','M', 'na','M', 'na','M', 'na'],
'Gender_F': ['na', 'F', 'na', 'F','na', 'F','na', 'F','na', 'F','na', 'F']
}
df2 = pd.DataFrame(data2)
The na placeholder can be anything, such as blanks, dots, or NaN.

You're looking for pd.get_dummies.
>>> pd.get_dummies(df)
    Gender_F  Gender_M  Employment_E  Employment_R  Employment_U  Age_M  Age_O  Age_Y
0          0         1             0             1             0      0      0      1
1          1         0             0             0             1      1      0      0
2          0         1             1             0             0      0      1      0
3          1         0             0             1             0      0      0      1
4          0         1             0             0             1      1      0      0
5          1         0             1             0             0      0      1      0
6          0         1             0             1             0      0      0      1
7          1         0             0             0             1      1      0      0
8          0         1             1             0             0      0      1      0
9          1         0             0             1             0      0      0      1
10         0         1             0             0             1      1      0      0
11         1         0             1             0             0      0      1      0
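
As a minimal sketch of the same idea on a shortened, made-up frame: recent pandas versions return boolean indicator columns by default, so passing dtype=int is one way to get the 0/1 values shown above. Joining the result back onto df keeps the original columns next to the new ones. The four-row frame here is only an illustration, not the asker's full data.

```python
import pandas as pd

# Shortened, illustrative version of the question's data.
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Employment": ["R", "U", "E", "R"],
})

# dtype=int forces 0/1 integers; recent pandas versions default to booleans.
dummies = pd.get_dummies(df, dtype=int)

# Keep the original columns alongside the indicator columns.
result = df.join(dummies)
print(result)
```

The new columns are named <column>_<category>, which matches the naming scheme asked for in the question.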

If you are trying to get the data in a format like your df2 example, I believe this is what you are looking for.
ndf = pd.get_dummies(df)
df.join(ndf.mul(ndf.columns.str.split('_').str[-1]))
Output:
Old Answer
import numpy as np
df[['Gender']].join(pd.get_dummies(df[['Gender']]).mul(df['Gender'],axis=0).replace('',np.nan))
Output:
    Gender  Gender_F  Gender_M
0        M       NaN         M
1        F         F       NaN
2        M       NaN         M
3        F         F       NaN
4        M       NaN         M
5        F         F       NaN
6        M       NaN         M
7        F         F       NaN
8        M       NaN         M
9        F         F       NaN
10       M       NaN         M
11       F         F       NaN

If you are okay with 0s and 1s in your new columns, then using get_dummies (as suggested by @richardec) should be the most straightforward.
However, if you want a specific letter in each of your new columns, then another method is to loop through the current columns and the specific categories within each column, and create a new column from this information using apply.
for col in data.keys():
    categories = list(df[col].unique())
    for category in categories:
        df[f"{col}_{category}"] = df[col].apply(lambda x: category if x==category else float("nan"))
Result:
>>> df
    Gender  Employment  Age  Gender_M  Gender_F  Employment_R  Employment_U  Employment_E  Age_Y  Age_M  Age_O
0        M           R    Y         M       NaN             R           NaN           NaN      Y    NaN    NaN
1        F           U    M       NaN         F           NaN             U           NaN    NaN      M    NaN
2        M           E    O         M       NaN           NaN           NaN             E    NaN    NaN      O
3        F           R    Y       NaN         F             R           NaN           NaN      Y    NaN    NaN
4        M           U    M         M       NaN           NaN             U           NaN    NaN      M    NaN
5        F           E    O       NaN         F           NaN           NaN             E    NaN    NaN      O
6        M           R    Y         M       NaN             R           NaN           NaN      Y    NaN    NaN
7        F           U    M       NaN         F           NaN             U           NaN    NaN      M    NaN
8        M           E    O         M       NaN           NaN           NaN             E    NaN    NaN      O
9        F           R    Y       NaN         F             R           NaN           NaN      Y    NaN    NaN
10       M           U    M         M       NaN           NaN             U           NaN    NaN      M    NaN
11       F           E    O       NaN         F           NaN           NaN             E    NaN    NaN      O
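
A possible variation on the loop above, sketched on a shortened, made-up frame: Series.where keeps a value where the condition holds and puts NaN everywhere else, so the per-element lambda is not needed. Taking a snapshot of the column names first is an assumption made here so the loop does not depend on the original data dict.

```python
import pandas as pd

# Shortened, illustrative version of the question's data.
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Age": ["Y", "M", "O", "Y"],
})

# Snapshot the original column names so the loop ignores columns it adds.
for col in list(df.columns):
    for category in df[col].unique():
        # where() keeps values where the condition holds and fills NaN elsewhere.
        df[f"{col}_{category}"] = df[col].where(df[col] == category)

print(df)
```

The result has the same shape as the apply-based version: each new column holds the category letter in matching rows and NaN in the others.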

Related

Group/merge rows when defined columns match and sum up values

How do I group/merge rows where multiple defined columns have the same value, and display the sums of other columns that are not relevant for grouping/merging?
In the example below, if rows have the same values in columns "orgA" to "orgF" (text; this refers to an org. structure with departments and sub-departments), the rows should be grouped/merged and the numbers in columns "numA" and "numB" added up.
import pandas as pd
import numpy as np
data = {'orgA': ['A','C','A','C','A','C','A','A','A','L'],
'orgB': ['B',np.nan,'E',np.nan,'B',np.nan,'E','E','E','C'],
'orgC': ['C',np.nan,'D',np.nan,'C',np.nan,'H','D','H','B'],
'orgD': ['D',np.nan,np.nan,np.nan,'D',np.nan,'F',np.nan,'F','S'],
'orgE': ['E',np.nan,np.nan,np.nan,'E',np.nan,np.nan,np.nan,np.nan,'F'],
'orgF': ['F',np.nan,np.nan,np.nan,'F',np.nan,np.nan,np.nan,np.nan,np.nan],
'numA': [1,1,1,1,1,1,1,1,1,1],
'numB': [2,2,2,2,2,2,2,2,2,2]}
df = pd.DataFrame(data)
print(df)
   orgA  orgB  orgC  orgD  orgE  orgF  numA  numB
0     A     B     C     D     E     F     1     2
1     C   NaN   NaN   NaN   NaN   NaN     1     2
2     A     E     D   NaN   NaN   NaN     1     2
3     C   NaN   NaN   NaN   NaN   NaN     1     2
4     A     B     C     D     E     F     1     2
5     C   NaN   NaN   NaN   NaN   NaN     1     2
6     A     E     H     F   NaN   NaN     1     2
7     A     E     D   NaN   NaN   NaN     1     2
8     A     E     H     F   NaN   NaN     1     2
9     L     C     B     S     F   NaN     1     2
The output is supposed to look as follows:
   orgA  orgB  orgC  orgD  orgE  orgF  numA  numB
0     A     B     C     D     E     F     2     4
1     C   NaN   NaN   NaN   NaN   NaN     3     6
2     A     E     D   NaN   NaN   NaN     2     4
3     A     E     H     F   NaN   NaN     2     4
4     L     C     B     S     F   NaN     1     2
Many thanks for your ideas in advance!

You can pass a list of column names to groupby and set dropna=False so that rows whose keys contain NaN are not dropped. You can also specify sort=False if the group keys do not need to be sorted. Applying this to your example:
df.groupby(
['orgA', 'orgB', 'orgC', 'orgD', 'orgE', 'orgF'],
dropna=False,
sort=False
).sum()
we get
                               numA  numB
orgA orgB orgC orgD orgE orgF
A    B    C    D    E    F        2     4
C    NaN  NaN  NaN  NaN  NaN      3     6
A    E    D    NaN  NaN  NaN      2     4
          H    F    NaN  NaN      2     4
L    C    B    S    F    NaN      1     2
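
If the grouping keys should come back as ordinary columns, as in the layout the question asks for, calling reset_index on the result is one option. The sketch below uses a shortened, made-up frame with only two key columns; key_cols is a name introduced here for illustration. Note that the dropna argument of groupby requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative data with two key columns instead of six.
df = pd.DataFrame({
    "orgA": ["A", "C", "A", "C"],
    "orgB": ["B", np.nan, "B", np.nan],
    "numA": [1, 1, 1, 1],
    "numB": [2, 2, 2, 2],
})

key_cols = ["orgA", "orgB"]  # hypothetical name for the grouping columns

# dropna=False keeps groups whose keys contain NaN (pandas >= 1.1);
# reset_index() turns the group keys back into regular columns.
out = df.groupby(key_cols, dropna=False, sort=False).sum().reset_index()
print(out)
```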

How do I make the pandas index of a pivot table part of the column names?

I'm trying to pivot two columns by a flag column without ending up with a MultiIndex. I would like the indicator value to become part of the column names. Take for example:
import pandas as pd
df_dict = {'fire_indicator':[0,0,1,0,1],
'cost':[200, 300, 354, 456, 444],
'value':[1,1,2,1,1],
'id':['a','b','c','d','e']}
df = pd.DataFrame(df_dict)
If I do the following:
df.pivot_table(index = 'id', columns = 'fire_indicator', values = ['cost','value'])
I get the following:
                 cost        value
fire_indicator      0      1     0    1
id
a               200.0    NaN   1.0  NaN
b               300.0    NaN   1.0  NaN
c                 NaN  354.0   NaN  2.0
d               456.0    NaN   1.0  NaN
e                 NaN  444.0   NaN  1.0
What I'm trying to do is the following:
id  fire_indicator_0_cost  fire_indicator_1_cost  fire_indicator_0_value  fire_indicator_1_value
a                     200                      0                       1                       0
b                     300                      0                       1                       0
c                       0                    354                       0                       2
d                     456                      0                       1                       0
e                       0                    444                       0                       1
I know there is a way in SAS. Is there a way in python pandas?

Just rename the columns and call reset_index:
out = df.pivot_table(index = 'id', columns = 'fire_indicator', values = ['cost','value'])
out.columns = [f'fire_indicator_{y}_{x}' for x,y in out.columns]
# not necessary if you want `id` to be the index
out = out.reset_index()
Output:
id fire_indicator_0_cost fire_indicator_1_cost fire_indicator_0_value fire_indicator_1_value
-- ---- ----------------------- ----------------------- ------------------------ ------------------------
0 a 200 nan 1 nan
1 b 300 nan 1 nan
2 c nan 354 nan 2
3 d 456 nan 1 nan
4 e nan 444 nan 1
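
The question asks for 0 rather than NaN in the empty cells. One way to get that, sketched here on the question's data, is the fill_value argument of pivot_table. The loop variable names name and flag are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "fire_indicator": [0, 0, 1, 0, 1],
    "cost": [200, 300, 354, 456, 444],
    "value": [1, 1, 2, 1, 1],
    "id": ["a", "b", "c", "d", "e"],
})

# fill_value=0 replaces the NaN cells produced by missing combinations.
out = df.pivot_table(index="id", columns="fire_indicator",
                     values=["cost", "value"], fill_value=0)

# Each column label is a (value_name, indicator) tuple; flatten it to a string.
out.columns = [f"fire_indicator_{flag}_{name}" for name, flag in out.columns]
out = out.reset_index()
print(out)
```

pivot_table aggregates with the mean by default, which makes no difference here because each id appears only once.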

In pandas replace consecutive 0s with NaN

I want to clean some data by replacing only CONSECUTIVE 0s in a data frame with NaN.
Given:
import pandas as pd
import numpy as np
d = [[1,np.nan,3,4],[2,0,0,np.nan],[3,np.nan,0,0],[4,np.nan,0,0]]
df = pd.DataFrame(d, columns=['a', 'b', 'c', 'd'])
df
a b c d
0 1 NaN 3 4.0
1 2 0.0 0 NaN
2 3 NaN 0 0.0
3 4 NaN 0 0.0
The desired result should be:
a b c d
0 1 NaN 3 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
where columns c and d are affected but column b is NOT affected, as it only has a single zero (and not consecutive 0s).
I have experimented with this answer:
Replacing more than n consecutive values in Pandas DataFrame column
which is along the right lines but the solution keeps the first 0 in a given column which is not desired in my case.

Use shift with mask:
df=df.mask((df.shift().eq(df)|df.eq(df.shift(-1)))&(df==0))
Out[469]:
a b c d
0 1 NaN 3.0 4.0
1 2 0.0 NaN NaN
2 3 NaN NaN NaN
3 4 NaN NaN NaN
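
The one-liner above can be unpacked into named steps. This is a sketch of the same shift-and-mask idea, not a different algorithm; the function name mask_consecutive_zeros is made up here for illustration.

```python
import numpy as np
import pandas as pd

def mask_consecutive_zeros(frame):
    """Replace zeros that have a zero directly above or below them with NaN."""
    is_zero = frame.eq(0)
    # A zero belongs to a run if the previous or the next row is also zero.
    neighbour_zero = (
        is_zero.shift(1, fill_value=False) | is_zero.shift(-1, fill_value=False)
    )
    return frame.mask(is_zero & neighbour_zero)

df = pd.DataFrame(
    [[1, np.nan, 3, 4], [2, 0, 0, np.nan], [3, np.nan, 0, 0], [4, np.nan, 0, 0]],
    columns=["a", "b", "c", "d"],
)
result = mask_consecutive_zeros(df)
print(result)
```

The lone zero in column b survives because neither of its neighbours is zero, while the runs in columns c and d are replaced entirely, including their first element.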

Construct DataFrame from list of dicts

I'm trying to construct a pandas DataFrame from a list of dicts.
List of dicts:
a = [{'1': 'A'},
{'2': 'B'},
{'3': 'C'}]
Pass list of dicts into pd.DataFrame():
df = pd.DataFrame(a)
Actual results:
1 2 3
0 A NaN NaN
1 NaN B NaN
2 NaN NaN C
pd.DataFrame(a, columns=['Key', 'Value'])
Actual results:
Key Value
0 NaN NaN
1 NaN NaN
2 NaN NaN
Expected results:
Key Value
0 1 A
1 2 B
2 3 C

Try this:
from collections import ChainMap
data = dict(ChainMap(*a))
pd.DataFrame(data.items(), columns= ['Key','Value'])
Output:
Key Value
0 1 A
1 2 B
2 3 C

Something like this with a list comprehension:
pd.DataFrame(([(x, y) for i in a for x, y in i.items()]),columns=['Key','Value'])
Key Value
0 1 A
1 2 B
2 3 C
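
Another way to write the same flattening, as a small sketch: itertools.chain.from_iterable strings together the items of every dict in input order. Unlike merging the dicts into a single mapping first, this keeps a row for each pair even when the same key appears in more than one dict.

```python
from itertools import chain

import pandas as pd

a = [{"1": "A"}, {"2": "B"}, {"3": "C"}]

# Flatten every (key, value) pair in input order; repeated keys stay as separate rows.
pairs = list(chain.from_iterable(d.items() for d in a))
df = pd.DataFrame(pairs, columns=["Key", "Value"])
print(df)
```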

Using scalar values in series as variables in user defined function

I want to define a function that is applied element wise for each row in a dataframe, comparing each element to a scalar value in a separate series. I started with the function below.
def greater_than(array, value):
    g = array[array >= value].count(axis=1)
    return g
But it is applying the mask along axis 0 and I need it to apply it along axis 1. What can I do?
e.g.
In [3]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [4]: df
Out[4]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
In [26]: s
Out[26]: array([ 1, 1000, 1000, 1000])
In [25]: greater_than(df,s)
Out[25]:
0 0
1 1
2 1
3 1
dtype: int64
In [27]: g = df[df >= s]
In [28]: g
Out[28]:
0 1 2 3
0 NaN NaN NaN NaN
1 4.0 NaN NaN NaN
2 8.0 NaN NaN NaN
3 12.0 NaN NaN NaN
The result should look like:
In [29]: greater_than(df,s)
Out[29]:
0 3
1 0
2 0
3 0
dtype: int64
as 1, 2, and 3 are all >= 1 and none of the remaining values are greater than or equal to 1000.

Your best bet may be to do some transposes (no copies are made, if that's a concern)
In [164]: df = pd.DataFrame(np.arange(16).reshape(4,4))
In [165]: s = np.array([ 1, 1000, 1000, 1000])
In [171]: df.T[(df.T>=s)].T
Out[171]:
0 1 2 3
0 NaN 1.0 2.0 3.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
In [172]: df.T[(df.T>=s)].T.count(axis=1)
Out[172]:
0 3
1 0
2 0
3 0
dtype: int64
You can also just sum the mask directly, if the count is all you're after.
In [173]: (df.T>=s).sum(axis=0)
Out[173]:
0 3
1 0
2 0
3 0
dtype: int64
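
An alternative that avoids the transposes, offered as a sketch: the comparison method DataFrame.ge takes an axis argument, and with axis=0 a Series is matched against the row index, so row i is compared with threshold i. Wrapping the array in a Series that shares the frame's index is a choice made here to keep the alignment explicit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4))
s = np.array([1, 1000, 1000, 1000])

def greater_than(frame, thresholds):
    """Count, per row, the elements that are >= that row's threshold."""
    # Wrap the thresholds in a Series that shares the frame's index so that
    # axis=0 compares row i against thresholds[i].
    per_row = pd.Series(thresholds, index=frame.index)
    return frame.ge(per_row, axis=0).sum(axis=1)

counts = greater_than(df, s)
print(counts)
```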
