How to find difference between rows in a pandas multiIndex, by level 1 - pandas

Suppose we have a DataFrame like this, only with many, many more index A values:
df = pd.DataFrame([[1,2,1,2],
[1,1,2,2],
[2,2,1,0],
[1,2,1,2],
[2,1,1,2] ], columns=['A','B','c1','c2'])
df.groupby(['A','B']).sum()
## result
c1 c2
A B
1 1 2 2
2 2 4
2 1 1 2
2 1 0
How can I get a data frame that consists of the difference between rows, by the second level of the index, level B?
The output here would be
A c1 c2
1 0 -2
2 0 2
Note In my particular use case, I have a lot of column A values, so I can write out the value for A explicitly.

Check diff and dropna
g = df.groupby(['A','B'])[['c1','c2']].sum()
g = g.groupby(level=0).diff().dropna()
g
Out[25]:
c1 c2
A B
1 2 0.0 2.0
2 2 0.0 -2.0

Assigning the first grouping to result variable:
result = df.groupby(['A','B']).sum()
You could use a pipe operation with nth:
result.groupby('A').pipe(lambda df: df.nth(0) - df.nth(-1))
c1 c2
A
1 0 -2
2 0 2
A simpler option, in my opinion, would be to use agg combined with numpy's ufunc reduce, as this covers scenarios where you have more than two rows:
result.groupby('A').agg(np.subtract.reduce)
c1 c2
A
1 0 -2
2 0 2

Related

How to merge two datasets on incomplete columns?

I want to merge two datasets on 'key1' and 'key2' columns so that in case of missing value, for example, in the 'key2' column, it would take all combinations of the second key that belong to the first key. Here is an example:
def merge_nan_as_any(mask, data, on, how)
...
mask = pd.DataFrame({'key1': [1,1,2,2],
'key2': [None,3,1,2],
'value2': [1,2,3,4]})
data = pd.DataFrame({'key1': [1,1,1,2,2,2],
'key2': [1,2,3,1,2,3],
'value1': [1,2,3,4,5,6]})
result = merge_nan_as_any(mask, data, on=['key1', 'key2'], how='left')
result = pd.DataFrame({'key1': [1,1,1,1,2,2],
'key2': [1,2,3,3,1,2],
'value2': [1,1,1,2,3,4],
'value1': [1,2,3,3,4,5]})
There is a missed value of the second key, so it takes all rows from the second dataset that satisfy the condition: key1 must equal to 1, key2 is any the second key value from the second dataset. How to do that?
The first obvious solution that came to my mind is to iterate over the first dataset and filter out combinations that satisfy the condition and the second one is to split the first dataset into several ones so that they have NaNs in the same columns and merge each of them on columns that have values.
But I don't like these solutions and guess there is more elegant way to do what I want.
I will appreciate for any help!
Simple approach, merge on key1/key2 for the non-NaN values, merge on key1 only for the NaN values and concat:
m = mask['key2'].notna()
result = pd.concat([data.merge(mask[~m].drop(columns='key2'), on='key1'),
data.merge(mask[m], on=['key1', 'key2']),
], ignore_index=True)
Output:
key1 key2 value1 value2
0 1 1 1 1
1 1 2 2 1
2 1 3 3 1
3 1 3 3 2
4 2 1 4 3
5 2 2 5 4
I would begin by filling the null values with a list of all unique values from the other dataframe. Then, explode it to get all possible combinations and transform back to numeric. Finally, merge them both achieving the expected output:
mask['key2'] = mask['key2'].fillna(' '.join([str(x) for x in data['key2'].unique()])).astype(str).str.split(' ')
mask = mask.explode('key2')
mask['key2'] = pd.to_numeric(mask['key2'])
pd.merge(mask,data,on=['key1','key2'],how='left')
Outputting:
key1 key2 value2 value1
0 1 1 1 1
1 1 2 1 2
2 1 3 1 3
3 1 3 2 3
4 2 1 3 4
5 2 2 4 5
use pandasql it will be easy:
mask.sql("""
select data.*,self.value2
from self left join data
on self.key1=data.key1 and (self.key2=data.key2 or self.key2 is null)
""",**globals())
out:
key1 key2 value1 value2
0 1 1 1 1
1 1 2 2 1
2 1 3 3 1
3 1 3 3 2
4 2 1 4 3
5 2 2 5 4

incompatible index of inserted column with frame index with group by and count

I have data that looks like this:
CHROM POS REF ALT ... is_sever_int is_sever_str is_sever_f encoding_str
0 chr1 14907 A G ... 1 1 one one
1 chr1 14930 A G ... 1 1 one one
These are the columns that I'm interested to perform calculations on (example) :
is_severe snp _id encoding
1 1 one
1 1 two
0 1 one
1 2 two
0 2 two
0 2 one
what I want to do is to count for each snp_id and severe_id how many ones and twos are in the encoding column :
snp_id is_svere encoding_one encoding_two
1 1 1 1
1 0 1 0
2 1 0 1
2 0 1 1
I tried this :
df.groupby(["snp_id","is_sever_f","encoding_str"])["encoding_str"].count()
but it gave the error :
incompatible index of inserted column with frame index
then i tried this:
df["count"]=df.groupby(["snp_id","is_sever_f","encoding_str"],as_index=False)["encoding_str"].count()
and it returned:
Expected a 1D array, got an array with shape (2532831, 3)
how can i fix this? thank you:)
Let's try groupby with whole columns and get size of each group then unstack the encoding index.
out = (df.groupby(['is_severe', 'snp_id', 'encoding']).size()
.unstack(fill_value=0)
.add_prefix('encoding_')
.reset_index())
print(out)
encoding is_severe snp_id encoding_one encoding_two
0 0 1 1 0
1 0 2 1 1
2 1 1 1 1
3 1 2 0 1
Try as follows:
Use pd.get_dummies to convert categorical data in column encoding into indicator variables.
Chain df.groupby and get sum to turn double rows per group into one row (i.e. [0,1] and [1,0] will become [1,1] where df.snp_id == 2 and df.is_severe == 0).
res = pd.get_dummies(data=df, columns=['encoding'])\
.groupby(['snp_id','is_severe'], as_index=False, sort=False).sum()
print(res)
snp_id is_severe encoding_one encoding_two
0 1 1 1 1
1 1 0 1 0
2 2 1 0 1
3 2 0 1 1
If your actual df has more columns, limit the assigment to the data parameter inside get_dummies. I.e. use:
res = pd.get_dummies(data=df[['is_severe', 'snp_id', 'encoding']],
columns=['encoding']).groupby(['snp_id','is_severe'],
as_index=False, sort=False)\
.sum()

pandas dataframe how to replace extreme outliers for all columns

I have a pandas dataframe with some very extreme value - more than 5 std.
I want to replace, per column, each value that is more than 5 std with the max other value.
For example,
df = A B
1 2
1 6
2 8
1 115
191 1
Will become:
df = A B
1 2
1 6
2 8
1 8
2 1
What is the best way to do it without a for loop over the columns?
s=df.mask((df-df.apply(lambda x: x.std() )).gt(5))#mask where condition applies
s=s.assign(A=s.A.fillna(s.A.max()),B=s.B.fillna(s.B.max())).sort_index(axis = 0)#fill with max per column and resort frame
A B
0 1.0 2.0
1 1.0 6.0
2 2.0 8.0
3 1.0 8.0
4 2.0 1.0
Per the discussion in the comments you need to decide what your threshold is. say it is q=100, then you can do
q = 100
df.loc[df['A'] > q,'A'] = max(df.loc[df['A'] < q,'A'] )
df
this fixes column A:
A B
0 1 2
1 1 6
2 2 8
3 1 115
4 2 1
do the same for B
Calculate a column-wise z-score (if you deem something an outlier if it lies outside a given number of standard deviations of the column) and then calculate a boolean mask of values outside your desired range
def calc_zscore(col):
return (col - col.mean()) / col.std()
zscores = df.apply(calc_zscore, axis=0)
outlier_mask = zscores > 5
After that it's up to you to fill the values marked with the boolean mask.
df[outlier_mask] = something

Pandas merge conflict rows by counts?

A conflict row is that two rows have same feature but with different label, like this:
feature label
a 1
a 0
Now, I want to merge these conflict rows to only one label getting from their counts. If I have more a 1, then a will be labeled as 1. Otherwise, a should be labeled as 0.
I can find these conflicts by df1=df.groupy('feature', as_index=Fasle).nunique(),df1 = df1[df1['label]==2]' , and their value counts by df2 = df.groupby("feature")["label"].value_counts().reset_index(name="counts").
But how to find these conflic rows and their counts in one Dataframe (df_conflict = ?), and then merge them by counts, (df_merged = merge(df))?
Lets take df = pd.DataFrame({"feature":['a','a','b','b','a','c','c','d'],'label':[1,0,0,1,1,0,0,1]}) as example.
feature label
0 a 1
1 a 0
2 b 0
3 b 1
4 a 1
5 c 0
6 c 0
7 d 1
df_conflict should be :
feature label counts
a 1 2
a 0 1
b 0 1
b 1 1
And df_merged will be:
feature label
a 1
b 0
c 0
d 1
I think you need first filter groups with count of unique values by DataFrameGroupBy.nunique with GroupBy.transform before SeriesGroupBy.value_counts:
df1 = df[df.groupby('feature')['label'].transform('nunique').gt(1)]
df_conflict = df1.groupby('feature')['label'].value_counts().reset_index(name='count')
print (df_conflict)
feature label count
0 a 1 2
1 a 0 1
2 b 0 1
3 b 1 1
For second get feature with labels by maximum occurencies:
df_merged = df.groupby('feature')['label'].agg(lambda x: x.value_counts().index[0]).reset_index()
print (df_merged)
feature label
0 a 1
1 b 0
2 c 0
3 d 1

Extract rows with maximum values in pandas dataframe

We can use .idxmax to get the maximum value of a dataframe­(df). My problem is that I have a df with several columns (more than 10), one of a column has identifiers of same value. I need to extract the identifiers with the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupy(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Groupby the id, then use the value column; return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
Op wants to provide these locations back to the frame, just create a transform and assign.
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2