adding a new column to data frame - pandas

I'm trying to do something that should be really simple in pandas, but it seems to be anything but. I have two large dataframes.
df1 has 243 columns, which include:

   ID2   K   C type
1  123  1.  2.    T
2  132  3.  1.    N
3  111  2.  1.    U
df2 has 121 columns, which include:

   ID3   A   B
1  123  0.  3.
2  111  2.  3.
3  132  1.  2.
df2 contains different information about the same IDs (ID2 = ID3), but in a different order.
I want to create a new column in df2 named type, matched against the type column in df1: wherever an ID3 in df2 equals an ID2 in df1, it should copy that row's type (T, N or U) from df1. In other words, I need it to look like the following data frame, but with all 121 columns from df2 plus type:
ID3   A   B type
123  0.  3.    T
111  2.  3.    U
132  1.  2.    N
I tried pd.merge and pd.join. I also tried

df2['type'] = df1['ID2'].map(df2.set_index('ID3')['type'])

but none of them worked; the last one raises KeyError: 'ID3'.
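For reference, a minimal reproduction of the two frames (an assumption: only the columns shown above, whereas the real frames have 243 and 121 columns):

import pandas as pd

df1 = pd.DataFrame({'ID2': [123, 132, 111],
                    'K': [1., 3., 2.],
                    'C': [2., 1., 1.],
                    'type': ['T', 'N', 'U']},
                   index=[1, 2, 3])
df2 = pd.DataFrame({'ID3': [123, 111, 132],
                    'A': [0., 2., 1.],
                    'B': [3., 3., 2.]},
                   index=[1, 2, 3])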

As far as I can see, your last command is almost correct; you just have the two frames swapped. The Series you map over must come from df1 (indexed by ID2), and it is df2['ID3'] that gets mapped through it. Try this:
df2['type'] = df2['ID3'].map(df1.set_index('ID2')['type'])
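Run against the reproduction above, this should give:

   ID3    A    B type
1  123  0.0  3.0    T
2  111  2.0  3.0    U
3  132  1.0  2.0    N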

join

df2.join(df1.set_index('ID2')['type'], on='ID3')

   ID3    A    B type
1  123  0.0  3.0    T
2  111  2.0  3.0    U
3  132  1.0  2.0    N

merge (take 1)

df2.merge(df1[['ID2', 'type']].rename(columns={'ID2': 'ID3'}))

   ID3    A    B type
0  123  0.0  3.0    T
1  111  2.0  3.0    U
2  132  1.0  2.0    N

merge (take 2)

df2.merge(df1[['ID2', 'type']], left_on='ID3', right_on='ID2').drop(columns='ID2')

   ID3    A    B type
0  123  0.0  3.0    T
1  111  2.0  3.0    U
2  132  1.0  2.0    N

map and assign

df2.assign(type=df2.ID3.map(dict(zip(df1.ID2, df1['type']))))

   ID3    A    B type
0  123  0.0  3.0    T
1  111  2.0  3.0    U
2  132  1.0  2.0    N
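Note that merge defaults to an inner join, while join (with on=) and map keep every row of df2, filling NaN where there is no match. If some of df2's IDs might be missing from df1, pass how='left' to merge:

df2.merge(df1[['ID2', 'type']], left_on='ID3', right_on='ID2', how='left').drop(columns='ID2')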

Related

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values (more than 5 standard deviations from the mean).
I want to replace, per column, each such value with the maximum of the column's other values.
For example,
df =
     A    B
0    1    2
1    1    6
2    2    8
3    1  115
4  191    1

Will become:

df =
   A  B
0  1  2
1  1  6
2  2  8
3  1  8
4  2  1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where the condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with the max per column and re-sort the frame

     A    B
0  1.0  2.0
1  1.0  6.0
2  2.0  8.0
3  1.0  8.0
4  2.0  1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do

q = 100
df.loc[df['A'] > q, 'A'] = max(df.loc[df['A'] < q, 'A'])
df
This fixes column A:

   A    B
0  1    2
1  1    6
2  2    8
3  1  115
4  2    1
Do the same for B (or avoid the repetition entirely, as in the sketch below).
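If you'd rather not repeat that per column, a minimal vectorized sketch (an assumption: the same fixed threshold q applies to every column):

q = 100
masked = df.mask(df > q)          # NaN out every value above the threshold
df = masked.fillna(masked.max())  # refill each NaN with its column's max remaining value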
If you deem a value an outlier when it lies more than a given number of standard deviations from the column mean, calculate a column-wise z-score and then build a boolean mask of the values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that, it's up to you to fill the values marked by the boolean mask:
df[outlier_mask] = something
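For instance, to mimic the question's "max other value" behaviour (a sketch; it assumes the per-column max of the non-outliers is the fill you want):

clean = df.mask(outlier_mask)   # NaN out the outliers
df = clean.fillna(clean.max())  # fill each with its column's max non-outlier value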

Multi-level pivoting of a dataframe

I have this dataframe:
Group   Feature 1  Feature 2  Class
First           5          4      1
Second          5          5      0
First           1          2      0
I want to do a multi level pivot in pandas to have something like this:
Group | Feature 1 (Class 1) | Feature 1 (Class 2) | Feature 2 (Class 1) | Feature 2 (Class 2)
What if I want to select only one feature to work with?
Like this?
out = (df.assign(Class=df["Class"] + 1)
         .pivot(index="Group", columns="Class"))
print(out)
       Feature 1      Feature 2
Class          1    2         1    2
Group
First        1.0  5.0       2.0  4.0
Second       5.0  NaN       5.0  NaN
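As for working with only one feature: either select it from the top level of the pivoted columns, or pass values= to pivot directly (a sketch based on the out frame above):

out["Feature 1"]
# or pivot just that feature:
df.assign(Class=df["Class"] + 1).pivot(index="Group", columns="Class", values="Feature 1")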

Groupby with conditions

import numpy as np
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'B', 'B', 'C', 'C'],
                   'Subcategory': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
                   'Values': [1, 2, 3, 4, 5, 6]})
which I summarize with groupby:
df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})

          size  mean  median
Category
A            1   1.0     1.0
B            3   3.0     3.0
C            2   5.5     5.5
Objective: in addition to the above, show a second set of the same statistics restricted to Subcategory 'X', to create the output below:

         ALL Subcategory      Only Subcategory 'X'
         size  mean median    size  mean median
Category
A           1   1.0    1.0       1     1      1
B           3   3.0    3.0       1     2      2
C           2   5.5    5.5       0     0      0
My current solution is to create two groupby results, call to_frame() on them, and then pd.merge them. Is there a better way? Thanks!
df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})
df[df['Subcategory']=='X'].groupby('Category')['Values'].agg({np.size, np.mean, np.median})
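One possible alternative (a sketch, not from the original thread): run both aggregations with a list of names (which, unlike the set literal above, fixes the column order) and glue them side by side with pd.concat, whose dict keys become the top-level labels:

all_sub = df.groupby('Category')['Values'].agg(['size', 'mean', 'median'])
only_x = (df[df['Subcategory'] == 'X']
          .groupby('Category')['Values']
          .agg(['size', 'mean', 'median'])
          .reindex(all_sub.index)  # keep categories that have no 'X' rows
          .fillna(0))
out = pd.concat({'ALL Subcategory': all_sub, "Only Subcategory 'X'": only_x}, axis=1)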

How to merge similar rows and split column into rows by values?

I have this data set for example:
   Name  Number Is true
0  Dani       2     yes
1  Dani       2      no
2  Jack       5      no
3  Jack       5   maybe
4  Dani       2   maybe
I want to create a new data set that combines similar rows and spreads one column's differing values across numbered columns. This is the output I'm trying to get:
   Name  Number Is true1 Is true2 Is true3
0  Dani       2      yes       no    maybe
1  Jack       5       no    maybe
I couldn't get it working from example 10 in How to pivot a dataframe.
Would you be able to provide a specific example for this use case, please?
Thanks.
Edit, in response to the answer; the follow-up shape I'm after is:

   Name  yes   no maybe
0  Dani    2    2     2
1  Jack  NaN    5     5
With a combination of pivot_table(...) and apply(...):

(df.pivot_table(index=["Name", "Number"], values="Is true", aggfunc=list)
   .apply(lambda x: pd.Series({f"Is true{i + 1}": el for i, el in enumerate(x.iloc[0])}), axis=1)
   .reset_index())
Output:
   Name  Number Is true1 Is true2 Is true3
0  Dani       2      yes       no    maybe
1  Jack       5       no    maybe      NaN
Edit
For your follow-up, this might be something along the lines of what you're looking for:

(df.pivot_table(index=["Name"], columns="Is true", values="Number", aggfunc=list)
   .fillna('')
   .apply(lambda x: pd.Series({f"{col}{i + 1}": el for col in x.keys() for i, el in enumerate(x[col])}), axis=1)
   .reset_index())
Output:
   Name  maybe1  no1  yes1
0  Dani     2.0  2.0   2.0
1  Jack     5.0  5.0   NaN
You can try this (using the data's actual column names, Name, Number and Is true):

df2 = df.drop_duplicates(subset=['Name', 'Number']).reset_index(drop=True)
df2 = df2.assign(**{'Is true': df.groupby('Number')['Is true'].agg(list).reset_index(drop=True)})
temp = df2['Is true'].apply(pd.Series).T  # one column per list element
temp.index = temp.index + 1               # number the new columns from 1
temp = temp.T
df2 = df2.assign(**temp.add_prefix('Is true')).drop(columns='Is true').fillna('')

output:

   Name  Number Is true1 Is true2 Is true3
0  Dani       2      yes       no    maybe
1  Jack       5       no    maybe

Binary operation broadcasting across multiindex

Can anyone explain why broadcasting across a multiindexed Series doesn't work? Might it be a bug in pandas (0.12.0)?
x = pd.DataFrame({'year': [1, 1, 1, 1, 2, 2, 2, 2],
                  'country': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
                  'prod': [1, 2, 1, 2, 1, 2, 1, 2],
                  'val': [10, 20, 15, 25, 20, 30, 25, 35]})
x = x.set_index(['year', 'country', 'prod']).squeeze()

y = pd.DataFrame({'year': [1, 1, 2, 2], 'prod': [1, 2, 1, 2],
                  'mul': [10, 0.1, 20, 0.2]})
y = y.set_index(['year', 'prod']).squeeze()
From the description of matching/broadcasting behavior in the pandas docs, I would expect to be able to multiply x and y and have the values of y broadcast across each country, giving:
>>> x.mul(y, level=['year', 'prod'])
year  country  prod
1     A        1       100.0
               2         2.0
      B        1       150.0
               2         2.5
2     A        1       400.0
               2         6.0
      B        1       500.0
               2         7.0
But instead, I get:
Exception: Join on level between two MultiIndex objects is ambiguous
(Note that this is a variation on the theme of this question.)
As discussed by me and @jreback in the issue opened to deal with this, a nice workaround to the problem involves doing the following:

1. Move the non-matching index level(s) to columns using unstack.
2. Perform the multiplication/division.
3. Put the non-matching index level(s) back using stack.
4. Make sure the index levels are in the same order as they were before.
Here's how it works:
In [112]: x.unstack('country').mul(y, axis=0).stack('country').reorder_levels(x.index.names)
Out[112]:
year  country  prod
1     A        1       100.0
      B        1       150.0
      A        2         2.0
      B        2         2.5
2     A        1       400.0
      B        1       500.0
      A        2         6.0
      B        2         7.0
dtype: float64
I think that's rather good, and should be pretty efficient.
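If you also want the rows back in the sorted order shown in the expected output, append .sort_index() (a small addition on top of the answer above):

x.unstack('country').mul(y, axis=0).stack('country').reorder_levels(x.index.names).sort_index()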