Applying Transformation to Table (Pandas)

I have the following data frame
Index ID Wt Wt.1
0 4999 3.2 1.2
1 5012 1.1 3.4
2 5027 4.4 5.6
and I'm trying to apply a transformation in order to get a dataframe that looks like the following
Index ID Wt
0 4999 3.2
0 4999 1.2
1 5012 1.1
1 5012 3.4
2 5027 4.4
2 5027 5.6
Is there a simple way to do this? I have tried using melt, groupby, and pivot_table but with no luck. This seems like such a simple task so perhaps I am overthinking it.

One way would be to assign the 'ID' and 'Wt.1' columns to an empty dataframe as the target 'ID' and 'Wt' columns; this has a minor advantage in that you don't get a messy append at the end, where you'd have NaN values and both 'Wt' and 'Wt.1' columns.
In [28]:
temp = pd.DataFrame()
temp[['ID','Wt']] = df[['ID','Wt.1']]
df1 = df[['ID','Wt']].append(temp)
df1
Out[28]:
ID Wt
Index
0 4999 3.2
1 5012 1.1
2 5027 4.4
0 4999 1.2
1 5012 3.4
2 5027 5.6
[6 rows x 2 columns]
You can call df1.reset_index(inplace=True) to correct the index afterwards.
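Note that DataFrame.append was removed in pandas 2.0. A sketch of the same idea with pd.concat (ignore_index=True also spares you the reset_index step):
import pandas as pd

df = pd.DataFrame({'ID': [4999, 5012, 5027],
                   'Wt': [3.2, 1.1, 4.4],
                   'Wt.1': [1.2, 3.4, 5.6]})

# rename 'Wt.1' to 'Wt' so the two pieces line up, then stack them
temp = df[['ID', 'Wt.1']].rename(columns={'Wt.1': 'Wt'})
df1 = pd.concat([df[['ID', 'Wt']], temp], ignore_index=True)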

You can probably do it in a few lines, but I will show it step by step:
In [86]:
df2 = df.set_index(['Index', 'ID'])
df3 = df2.stack().reset_index()
df3 = df3.loc[:, ['Index', 'ID', 0]]
df3.columns = ['Index', 'ID', 'Wt']
print(df3)
Index ID Wt
0 0 4999 3.2
1 0 4999 1.2
2 1 5012 1.1
3 1 5012 3.4
4 2 5027 4.4
5 2 5027 5.6
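Since the question mentions melt, that route works too. A sketch (melt won't accept a value_name that already exists in the frame, hence the rename afterwards):
import pandas as pd

df = pd.DataFrame({'ID': [4999, 5012, 5027],
                   'Wt': [3.2, 1.1, 4.4],
                   'Wt.1': [1.2, 3.4, 5.6]})

# melt both weight columns into one long column, then tidy up
out = (df.melt(id_vars='ID', value_vars=['Wt', 'Wt.1'])
         .drop(columns='variable')
         .rename(columns={'value': 'Wt'})
         .sort_values('ID', kind='stable')   # keeps Wt before Wt.1 per ID
         .reset_index(drop=True))
print(out)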

How to add Multilevel Columns and create new column?

I am trying to create a "total" column in my dataframe
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
My dataframe
Room 1 Room 2 Room 3
on off on off on off
0 1 4 3 6 5 15
1 3 2 1 5 1 7
For each room, I want to create a 'total' column and then an 'on %' column.
I have tried the following, however, it does not work.
df.loc[:, slice(None), "total" ] = df.xs('on', axis=1,level=1) + df.xs('off', axis=1,level=1)
Let us try something fancy ~
df.stack(0).eval('total=on + off \n on_pct=on / total').stack().unstack([1, 2])
Room 1 Room 2 Room 3
off on total on_pct off on total on_pct off on total on_pct
0 4.0 1.0 5.0 0.2 6.0 3.0 9.0 0.333333 15.0 5.0 20.0 0.250
1 2.0 3.0 5.0 0.6 5.0 1.0 6.0 0.166667 7.0 1.0 8.0 0.125
Oof, this was a roughie, but you can do it like this if you want to avoid loops. Worth noting it redefines your df twice, because I need the total columns first. Sorry about that, but it's the best I could do. If you have any questions, just comment.
import numpy as np

df = pd.concat([y.assign(**{'Total {0}'.format(x + 1): y.iloc[:, 0] + y.iloc[:, 1]})
                for x, y in df.groupby(np.arange(df.shape[1]) // 2, axis=1)], axis=1)
df = pd.concat([y.assign(**{'Percentage_Total{0}'.format(x + 1): (y.iloc[:, 0] / y.iloc[:, 2]) * 100})
                for x, y in df.groupby(np.arange(df.shape[1]) // 3, axis=1)], axis=1)
print(df)
This groups by the columns' first level (the rooms) and then loops through each group to add the total and the percent on. The final step is to reindex using the unique rooms:
import pandas as pd
idx = pd.MultiIndex.from_product([['Room 1','Room 2', 'Room 3'],['on','off']])
df = pd.DataFrame([[1,4,3,6,5,15], [3,2,1,5,1,7]], columns=idx)
for room, group in df.groupby(level=0, axis=1):
    df[(room, 'total')] = group.sum(axis=1)
    df[(room, 'pct_on')] = group[(room, 'on')] / df[(room, 'total')]
result = df.reindex(columns=df.columns.get_level_values(0).unique(), level=0)
Output:
Room 1 Room 2 Room 3
on off total pct_on on off total pct_on on off total pct_on
0 1 4 5 0.2 3 6 9 0.333333 5 15 20 0.250
1 3 2 5 0.6 1 5 6 0.166667 1 7 8 0.125
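On newer pandas, where groupby(..., axis=1) is deprecated, the same result can be reached through a transpose. A loop-free sketch (the columns come out alphabetically ordered within each room):
import pandas as pd

idx = pd.MultiIndex.from_product([['Room 1', 'Room 2', 'Room 3'], ['on', 'off']])
df = pd.DataFrame([[1, 4, 3, 6, 5, 15], [3, 2, 1, 5, 1, 7]], columns=idx)

# per-room totals via the transpose instead of an axis=1 groupby
totals = df.T.groupby(level=0).sum().T
# share of 'on' per room; the division aligns on the room names
pct_on = df.xs('on', axis=1, level=1) / totals

totals.columns = pd.MultiIndex.from_product([totals.columns, ['total']])
pct_on.columns = pd.MultiIndex.from_product([pct_on.columns, ['pct_on']])

result = pd.concat([df, totals, pct_on], axis=1).sort_index(axis=1)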

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme values (more than 5 standard deviations from the mean).
I want to replace, per column, each such value with the maximum of the column's remaining values.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s = df.mask((df - df.std()).gt(5))  # mask values that exceed the column's std by more than 5
s = s.assign(A=s.A.fillna(s.A.max()), B=s.B.fillna(s.B.max())).sort_index(axis=0)  # fill with the max per column and resort the frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments, you need to decide what your threshold is. Say it is q = 100; then you can do
q = 100
df.loc[df['A'] > q, 'A'] = df.loc[df['A'] < q, 'A'].max()
df
This fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
Do the same for B, or see the sketch below for all columns at once.
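A sketch along the same threshold idea, handling every column in one shot (q assumed to be 100 as above):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 191],
                   'B': [2, 6, 8, 115, 1]})

q = 100
# NaN out values above the threshold, fill each column with the
# max of its remaining values, and restore the original dtypes
out = df.mask(df > q)
out = out.fillna(out.max()).astype(df.dtypes)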
Calculate a column-wise z-score (if you deem something an outlier when it lies more than a given number of standard deviations from the column mean), then build a boolean mask of the values outside your desired range:
def calc_zscore(col):
    return (col - col.mean()) / col.std()

zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something
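For instance, a sketch that fills with the max of each column's surviving values, as the question asked. Note that on the tiny sample above nothing actually exceeds 5 standard deviations, because the outlier itself inflates the std, so a lower cutoff is needed to reproduce the example output:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 191],
                   'B': [2, 6, 8, 115, 1]})

zscores = (df - df.mean()) / df.std()
outlier_mask = zscores > 5

# NaN out the flagged values, then fill each column with the
# max of what survived
cleaned = df.mask(outlier_mask)
cleaned = cleaned.fillna(cleaned.max())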

Groupby with conditions

df = pd.DataFrame({'Category': ['A','B','B','B','C','C'],
                   'Subcategory': ['X','X','Y','Y','Z','Z'],
                   'Values': [1,2,3,4,5,6]})
which I summarize with groupby:
df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})
size mean median
Category
A 1 1.0 1.0
B 3 3.0 3.0
C 2 5.5 5.5
Objective: In addition to the above, show additional groupby by subcategory 'X' to create below output:
ALL Subcategory Only Subcategory 'X'
size mean median size mean median
Category
A 1 1.0 1.0 1 1 1
B 3 3.0 3.0 1 2 2
C 2 5.5 5.5 0 0 0
My current solution is to create two groupbys, to_frame() them, then pd.merge the results. Is there a better way? Thanks!
df.groupby('Category')['Values'].agg({np.size, np.mean, np.median})
df[df['Subcategory']=='X'].groupby('Category')['Values'].agg({np.size, np.mean, np.median})
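One alternative, a sketch using pd.concat with keys to build the two-level header in a single step (categories with no 'X' rows come back as NaN, hence the fillna):
import pandas as pd

df = pd.DataFrame({'Category': ['A', 'B', 'B', 'B', 'C', 'C'],
                   'Subcategory': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
                   'Values': [1, 2, 3, 4, 5, 6]})

funcs = ['size', 'mean', 'median']
all_sub = df.groupby('Category')['Values'].agg(funcs)
only_x = df[df['Subcategory'] == 'X'].groupby('Category')['Values'].agg(funcs)

result = pd.concat({'ALL Subcategory': all_sub,
                    "Only Subcategory 'X'": only_x}, axis=1).fillna(0)
print(result)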

adding a new column to data frame

I'm trying to do something that should be really simple in pandas, but it seems to be anything but. I have two large dataframes.
df1 has 243 columns which include:
ID2 K. C type
1 123 1. 2. T
2 132 3. 1. N
3 111 2. 1. U
df2 has 121 columns which include:
ID3 A B
1 123 0. 3.
2 111 2. 3.
3 132 1. 2.
df2 contains different information about the same IDs (ID2 = ID3), but in a different order.
I want to create a new column in df2 named 'type' that matches the 'type' column in df1: whenever an ID matches one in df1, it should copy that row's type (T, N or U). In other words, I need it to look like the following dataframe, but with all 121 columns from df2 plus 'type':
ID3 A B type
123 0. 3. T
111 2. 3. U
132 1. 2. N
I tried
pd.merge and pd.join.
I also tried
df2['type'] = df1['ID2'].map(df2.set_index('ID3')['type'])
but none of them works; it shows KeyError: 'ID3'.
As far as I can see, your last command is almost correct, just inverted: call map on df2's own key column, using a Series from df1 indexed by 'ID2' (the Series you map with must be indexed by the values you are mapping from). Try this:
df2['type'] = df2['ID3'].map(df1.set_index('ID2')['type'])
join
df2.join(df1.set_index('ID2')['type'], on='ID3')
ID3 A B type
1 123 0.0 3.0 T
2 111 2.0 3.0 U
3 132 1.0 2.0 N
merge (take 1)
df2.merge(df1[['ID2', 'type']].rename(columns={'ID2': 'ID3'}))
ID3 A B type
0 123 0.0 3.0 T
1 111 2.0 3.0 U
2 132 1.0 2.0 N
merge (take 2)
df2.merge(df1[['ID2', 'type']], left_on='ID3', right_on='ID2').drop(columns='ID2')
ID3 A B type
0 123 0.0 3.0 T
1 111 2.0 3.0 U
2 132 1.0 2.0 N
map and assign
df2.assign(type=df2.ID3.map(dict(zip(df1.ID2, df1['type']))))
ID3 A B type
0 123 0.0 3.0 T
1 111 2.0 3.0 U
2 132 1.0 2.0 N
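Whichever variant you pick, merge's validate argument is a cheap guard against duplicate IDs silently multiplying rows; a sketch assuming each ID occurs exactly once on both sides:
df2.merge(df1[['ID2', 'type']], left_on='ID3', right_on='ID2',
          validate='one_to_one').drop(columns='ID2')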

pandas dataframe transformation partial sums

I have a pandas dataframe
index A
1 3.4
2 4.5
3 5.3
4 2.1
5 4.0
6 5.3
...
95 3.4
96 1.2
97 8.9
98 3.4
99 2.7
100 7.6
from this I would like to create a dataframe B
1-5 sum(1-5)
6-10 sum(6-10)
...
96-100 sum(96-100)
Any ideas how to do this elegantly rather than brute-force?
Cheers, Mike
This will give you a series with the partial sums:
df['bin'] = (df.index - 1) // 5  # integer division; the index starts at 1
bin_sums = df.groupby('bin')['A'].sum()
Then, if you want to rename the index:
bin_sums.index = ['%d-%d' % (5 * i + 1, 5 * (i + 1)) for i in bin_sums.index]
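If the index can't be trusted to run 1..100, a position-based variant of the same idea (a sketch with random stand-in data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.rand(100)}, index=range(1, 101))

# group by row position instead of index value
bin_sums = df.groupby(np.arange(len(df)) // 5)['A'].sum()
bin_sums.index = ['%d-%d' % (5 * i + 1, 5 * (i + 1)) for i in bin_sums.index]
print(bin_sums)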