adding a new column to data frame - pandas

I'm trying to do something that should be really simple in pandas, but it seems to be anything but. I have two large dataframes.
df1 has 243 columns, which include:

   ID2   K   C type
1  123  1.  2.    T
2  132  3.  1.    N
3  111  2.  1.    U
df2 has 121 columns, which include:

   ID3   A   B
1  123  0.  3.
2  111  2.  3.
3  132  1.  2.
df2 contains different information about the same IDs (ID2 = ID3), but in a different order.
I want to create a new column in df2 named type, matched against the type column in df1: wherever an ID3 in df2 equals an ID2 in df1, it should copy that row's type (T, N or U) from df1. In other words, I need it to look like the following data frame, but with all 121 columns from df2 plus type:
ID3   A   B type
123  0.  3.    T
111  2.  3.    U
132  1.  2.    N
I tried pd.merge and pd.join. I also tried

df2['type'] = df1['ID2'].map(df2.set_index('ID3')['type'])

but none of them worked; the last one raises KeyError: 'ID3'.
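For reference, a minimal reproduction of the two frames (an assumption: only the columns shown above, whereas the real frames have 243 and 121 columns):

import pandas as pd

df1 = pd.DataFrame({'ID2': [123, 132, 111],
                    'K': [1., 3., 2.],
                    'C': [2., 1., 1.],
                    'type': ['T', 'N', 'U']},
                   index=[1, 2, 3])
df2 = pd.DataFrame({'ID3': [123, 111, 132],
                    'A': [0., 2., 1.],
                    'B': [3., 3., 2.]},
                   index=[1, 2, 3])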

As far as I can see, your last command is almost correct; you just have the two frames swapped. The Series you map over must come from df1 (indexed by ID2), and it is df2['ID3'] that gets mapped through it. Try this:
df2['type'] = df2['ID3'].map(df1.set_index('ID2')['type'])
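Run against the reproduction above, this should give:

   ID3    A    B type
1  123  0.0  3.0    T
2  111  2.0  3.0    U
3  132  1.0  2.0    N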

join

df2.join(df1.set_index('ID2')['type'], on='ID3')

   ID3    A    B type
1  123  0.0  3.0    T
2  111  2.0  3.0    U
3  132  1.0  2.0    N

merge (take 1)

df2.merge(df1[['ID2', 'type']].rename(columns={'ID2': 'ID3'}))

   ID3    A    B type
0  123  0.0  3.0    T
1  111  2.0  3.0    U
2  132  1.0  2.0    N

merge (take 2)

df2.merge(df1[['ID2', 'type']], left_on='ID3', right_on='ID2').drop(columns='ID2')

   ID3    A    B type
0  123  0.0  3.0    T
1  111  2.0  3.0    U
2  132  1.0  2.0    N

map and assign

df2.assign(type=df2.ID3.map(dict(zip(df1.ID2, df1['type']))))

   ID3    A    B type
0  123  0.0  3.0    T
1  111  2.0  3.0    U
2  132  1.0  2.0    N
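Note that merge defaults to an inner join, while join (with on=) and map keep every row of df2, filling NaN where there is no match. If some of df2's IDs might be missing from df1, pass how='left' to merge:

df2.merge(df1[['ID2', 'type']], left_on='ID3', right_on='ID2', how='left').drop(columns='ID2')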

Related

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values (more than 5 standard deviations from the mean).
I want to replace, per column, each such value with the maximum of the column's other values.
For example,
df =
     A    B
0    1    2
1    1    6
2    2    8
3    1  115
4  191    1

Will become:

df =
   A  B
0  1  2
1  1  6
2  2  8
3  1  8
4  2  1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.apply(lambda x: x.std())).gt(5))  # mask where the condition applies
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with the max per column and re-sort the frame

     A    B
0  1.0  2.0
1  1.0  6.0
2  2.0  8.0
3  1.0  8.0
4  2.0  1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do

q = 100
df.loc[df['A'] > q, 'A'] = max(df.loc[df['A'] < q, 'A'])
df
This fixes column A:

   A    B
0  1    2
1  1    6
2  2    8
3  1  115
4  2    1
Do the same for B (or avoid the repetition entirely, as in the sketch below).
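If you'd rather not repeat that per column, a minimal vectorized sketch (an assumption: the same fixed threshold q applies to every column):

q = 100
masked = df.mask(df > q)          # NaN out every value above the threshold
df = masked.fillna(masked.max())  # refill each NaN with its column's max remaining value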
If you deem a value an outlier when it lies more than a given number of standard deviations from the column mean, calculate a column-wise z-score and then build a boolean mask of the values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that, it's up to you to fill the values marked by the boolean mask:
df[outlier_mask] = something
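For instance, to mimic the question's "max other value" behaviour (a sketch; it assumes the per-column max of the non-outliers is the fill you want):

clean = df.mask(outlier_mask)   # NaN out the outliers
df = clean.fillna(clean.max())  # fill each with its column's max non-outlier value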

Multi-level pivoting of a dataframe

I have this dataframe:
Group   Feature 1  Feature 2  Class
First           5          4      1
Second          5          5      0
First           1          2      0
I want to do a multi level pivot in pandas to have something like this:
Group | Feature 1 (Class 1) | Feature 1 (Class 2) | Feature 2 (Class 1) | Feature 2 (Class 2)
What if I want to select only one feature to work with?
Like this?
out = (df.assign(Class=df["Class"] + 1)
         .pivot(index="Group", columns="Class"))
print(out)
       Feature 1      Feature 2
Class          1    2         1    2
Group
First        1.0  5.0       2.0  4.0
Second       5.0  NaN       5.0  NaN
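As for working with only one feature: either select it from the top level of the pivoted columns, or pass values= to pivot directly (a sketch based on the out frame above):

out["Feature 1"]
# or pivot just that feature:
df.assign(Class=df["Class"] + 1).pivot(index="Group", columns="Class", values="Feature 1")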

Groupby with conditions

import numpy as np
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'B', 'B', 'C', 'C'],
                   'Subcategory': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
                   'Values': [1, 2, 3, 4, 5, 6]})
which I summarize with groupby:
df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})

          size  mean  median
Category
A            1   1.0     1.0
B            3   3.0     3.0
C            2   5.5     5.5
Objective: in addition to the above, show a second set of the same statistics restricted to Subcategory 'X', to create the output below:

         ALL Subcategory      Only Subcategory 'X'
         size  mean median    size  mean median
Category
A           1   1.0    1.0       1     1      1
B           3   3.0    3.0       1     2      2
C           2   5.5    5.5       0     0      0
My current solution is to create two groupby results, call to_frame() on them, and then pd.merge them. Is there a better way? Thanks!
df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})
df[df['Subcategory']=='X'].groupby('Category')['Values'].agg({np.size, np.mean, np.median})
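One possible alternative (a sketch, not from the original thread): run both aggregations with a list of names (which, unlike the set literal above, fixes the column order) and glue them side by side with pd.concat, whose dict keys become the top-level labels:

all_sub = df.groupby('Category')['Values'].agg(['size', 'mean', 'median'])
only_x = (df[df['Subcategory'] == 'X']
          .groupby('Category')['Values']
          .agg(['size', 'mean', 'median'])
          .reindex(all_sub.index)  # keep categories that have no 'X' rows
          .fillna(0))
out = pd.concat({'ALL Subcategory': all_sub, "Only Subcategory 'X'": only_x}, axis=1)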

How to merge similar rows and split column into rows by values?

I have this data set for example:
   Name  Number Is true
0  Dani       2     yes
1  Dani       2      no
2  Jack       5      no
3  Jack       5   maybe
4  Dani       2   maybe
I want to create a new data set that combines similar rows and spreads one column's differing values across numbered columns. This is the output I'm trying to get:
   Name  Number Is true1 Is true2 Is true3
0  Dani       2      yes       no    maybe
1  Jack       5       no    maybe
I couldn't get it working from example 10 in How to pivot a dataframe.
Would you be able to provide a specific example for this use case, please?
Thanks.
Edit, in response to the answer; the follow-up shape I'm after is:

   Name  yes   no maybe
0  Dani    2    2     2
1  Jack  NaN    5     5
With a combination of pivot_table(...) and apply(...):

(df.pivot_table(index=["Name", "Number"], values="Is true", aggfunc=list)
   .apply(lambda x: pd.Series({f"Is true{i + 1}": el for i, el in enumerate(x.iloc[0])}), axis=1)
   .reset_index())
Output:
   Name  Number Is true1 Is true2 Is true3
0  Dani       2      yes       no    maybe
1  Jack       5       no    maybe      NaN
Edit
For your follow-up, this might be something along the lines of what you're looking for:

(df.pivot_table(index=["Name"], columns="Is true", values="Number", aggfunc=list)
   .fillna('')
   .apply(lambda x: pd.Series({f"{col}{i + 1}": el for col in x.keys() for i, el in enumerate(x[col])}), axis=1)
   .reset_index())
Output:
   Name  maybe1  no1  yes1
0  Dani     2.0  2.0   2.0
1  Jack     5.0  5.0   NaN
You can try this (using the data's actual column names, Name, Number and Is true):

df2 = df.drop_duplicates(subset=['Name', 'Number']).reset_index(drop=True)
df2 = df2.assign(**{'Is true': df.groupby('Number')['Is true'].agg(list).reset_index(drop=True)})
temp = df2['Is true'].apply(pd.Series).T  # one column per list element
temp.index = temp.index + 1               # number the new columns from 1
temp = temp.T
df2 = df2.assign(**temp.add_prefix('Is true')).drop(columns='Is true').fillna('')

output:

   Name  Number Is true1 Is true2 Is true3
0  Dani       2      yes       no    maybe
1  Jack       5       no    maybe

Binary operation broadcasting across multiindex

Can anyone explain why broadcasting across a multiindexed Series doesn't work? Might it be a bug in pandas (0.12.0)?
x = pd.DataFrame({'year': [1, 1, 1, 1, 2, 2, 2, 2],
                  'country': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
                  'prod': [1, 2, 1, 2, 1, 2, 1, 2],
                  'val': [10, 20, 15, 25, 20, 30, 25, 35]})
x = x.set_index(['year', 'country', 'prod']).squeeze()

y = pd.DataFrame({'year': [1, 1, 2, 2], 'prod': [1, 2, 1, 2],
                  'mul': [10, 0.1, 20, 0.2]})
y = y.set_index(['year', 'prod']).squeeze()
From the description of matching/broadcasting behavior in the pandas docs, I would expect to be able to multiply x and y and have the values of y broadcast across each country, giving:
>>> x.mul(y, level=['year', 'prod'])
year  country  prod
1     A        1       100.0
               2         2.0
      B        1       150.0
               2         2.5
2     A        1       400.0
               2         6.0
      B        1       500.0
               2         7.0
But instead, I get:
Exception: Join on level between two MultiIndex objects is ambiguous
(Note that this is a variation on the theme of this question.)
As discussed by me and @jreback in the issue opened to deal with this, a nice workaround to the problem involves doing the following:

1. Move the non-matching index level(s) to columns using unstack.
2. Perform the multiplication/division.
3. Put the non-matching index level(s) back using stack.
4. Make sure the index levels are in the same order as they were before.
Here's how it works:
In [112]: x.unstack('country').mul(y, axis=0).stack('country').reorder_levels(x.index.names)
Out[112]:
year  country  prod
1     A        1       100.0
      B        1       150.0
      A        2         2.0
      B        2         2.5
2     A        1       400.0
      B        1       500.0
      A        2         6.0
      B        2         7.0
dtype: float64
I think that's rather good, and should be pretty efficient.
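If you also want the rows back in the sorted order shown in the expected output, append .sort_index() (a small addition on top of the answer above):

x.unstack('country').mul(y, axis=0).stack('country').reorder_levels(x.index.names).sort_index()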