Pandas pivot table or groupby absolute maximum of column - pandas

I have a dataframe df as:
Col1 Col2
A -5
A 3
B -2
B 15
I need to get the following:
Col1 Col2
A -5
B 15
where, for each group in Col1, the row with the largest absolute value in Col2 is kept. I am not sure how to proceed with this.

Use DataFrameGroupBy.idxmax on the absolute values to get the indices of the rows to keep, then select them with DataFrame.loc:
df = df.loc[df['Col2'].abs().groupby(df['Col1']).idxmax()]

# alternative: reassign the column with its absolute values first
df = df.loc[df.assign(Col2=df['Col2'].abs()).groupby('Col1')['Col2'].idxmax()]
print (df)
Col1 Col2
0 A -5
3 B 15
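
For completeness, a self-contained sketch of an alternative that sorts by absolute value and keeps the last row per group; it assumes pandas 1.1+, where sort_values accepts a key argument:

import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'B', 'B'], 'Col2': [-5, 3, -2, 15]})

# sort by |Col2| so the largest absolute value ends up last within each group,
# then keep only that last row per Col1 and restore the original row order
out = (df.sort_values('Col2', key=lambda s: s.abs())
         .drop_duplicates('Col1', keep='last')
         .sort_index())
print(out)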

Related

Fix the length of some columns using Pandas

I am trying to add some columns to a pandas DataFrame, but I cannot set the character length of the columns.
I want to add the new fields as strings with a value of null and a field length of two characters.
Any idea is welcome.
import pandas as pd
df[["Assess", "Operator","x", "y","z", "g"]]=None
If you need to fix the number of columns in the new DataFrame, use:
from itertools import product
import string

# column names of length one character
letters = string.ascii_letters
# print(len(letters))  # 52

# if names of two characters are needed:
# letters = [''.join(x) for x in product(letters, letters)]
# print(len(letters))  # 2704

df = pd.DataFrame({'col1': [4, 5], 'col': [8, 2]})

# threshold
N = 5
# new column names taken from the difference with the original number of columns;
# max is used so a possible negative number after subtraction becomes 0
cols = list(letters[:max(0, N - len(df.columns))])
# add the new columns filled with None and
# filter by the threshold (in case the original has more columns than N)
df = df.assign(**dict.fromkeys(cols, None)).iloc[:, :N]
print (df)
col1 col a b c
0 4 8 None None None
1 5 2 None None None
Test with more columns than the N threshold:
df = pd.DataFrame({'col1': [4, 5], 'col2': [8, 2], 'col3': [4, 5],
                   'col4': [8, 2], 'col5': [7, 3], 'col6': [9, 0], 'col7': [5, 1]})
print (df)
col1 col2 col3 col4 col5 col6 col7
0 4 8 4 8 7 9 5
1 5 2 5 2 3 0 1
N = 5
cols = list(letters[:max(0, N - len(df.columns))])
df = df.assign(**dict.fromkeys(cols, None)).iloc[:, :N]
print (df)
col1 col2 col3 col4 col5
0 4 8 4 8 7
1 5 2 5 2 3
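
If this padding is needed repeatedly, the same logic can be wrapped in a small helper; pad_columns below is only an illustrative name, not an existing pandas function:

import string
import pandas as pd

def pad_columns(df, n, fill=None):
    # illustrative helper (not a pandas built-in):
    # add letter-named columns until there are n, then truncate to at most n
    extra = list(string.ascii_letters[:max(0, n - len(df.columns))])
    return df.assign(**dict.fromkeys(extra, fill)).iloc[:, :n]

df = pd.DataFrame({'col1': [4, 5], 'col': [8, 2]})
print(pad_columns(df, 5))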

How to join two dataframes and replace the null values in one dataframe's columns with values from the other in PySpark?

Suppose I have a df with 5 columns and a second df with 6 columns. I want to join df1 with df2 such that the null rows of a column in df1 get replaced by the not-null values in df2. How do I do this in Python?
I don't want to specify the column names or hard-code them. I want robust logic that still works if in the future we need to replace rows for 7 columns instead of 6.
Sample Data:
df1=
col1 col2 col3 col5
1 null null 5
2 null 5 9
4 4 8 6
null 0 9 1
df2=
col1 col2 col3 col4
1 2 -3 5
null null 7 5
4 4 8 1
1 null 9 3
Final df=
col1 col2 col3 col5 col4
1 2 -3 5 5
2 null 5 9 5
4 4 8 6 1
1 0 9 1 3
Conditions:
The null rows of a column in df1 get replaced by the not-null value in df2.
If both data frames have different not-null values at the same index, take either the first or the second one; it doesn't matter.
If both of them are null, the final df will have a null value at that very same index.
I don't want to specify the column names; I just want a robust script that works for other data with different column names as well.
I want to join df1 with df2 such that the null rows of a column in df1 get replaced by the not-null values in df2. How do I do this in Python?
Just join, and you can use coalesce to get the first non-null value.
I don't want to specify the column names or hard-code them.
You can access column names via df.columns, and column datatypes via df.dtypes.
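
A minimal PySpark sketch of the join-plus-coalesce idea; it assumes both frames share a key column (named id here purely for illustration, since the sample data has no explicit join key):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# 'id' is a hypothetical join key; the sample data has no explicit key column
df1 = spark.createDataFrame([(1, None, 5), (2, 4, None)], ['id', 'col2', 'col3'])
df2 = spark.createDataFrame([(1, 2, -3), (2, None, 7)], ['id', 'col2', 'col4'])

joined = df1.alias('a').join(df2.alias('b'), on='id', how='full')

# split the columns without hard-coding any names
common = [c for c in df1.columns if c in df2.columns and c != 'id']
only_df1 = [c for c in df1.columns if c not in df2.columns]
only_df2 = [c for c in df2.columns if c not in df1.columns]

result = joined.select(
    'id',
    # coalesce returns the first non-null value between the two frames
    *[F.coalesce(F.col(f'a.{c}'), F.col(f'b.{c}')).alias(c) for c in common],
    *[F.col(f'a.{c}') for c in only_df1],
    *[F.col(f'b.{c}') for c in only_df2],
)
result.show()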

Filter dataframe based on the quantile per group of values

Let's suppose that I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'col1':['A','A', 'A', 'B','B'], 'col2':[2, 4, 6, 3, 4]})
I want to keep only the rows whose col2 values are less than the x-th quantile of col2, computed separately for each group of values in col1.
For example, for the 60th percentile the dataframe should look like this:
col1 col2
0 A 2
1 A 4
3 B 3
How can I do this efficiently in pandas?
You can use transform with quantile:
df[df.col2.lt(df.groupby('col1').col2.transform(lambda x: x.quantile(0.6)))]
col1 col2
0 A 2
1 A 4
3 B 3
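
For reference, a self-contained version of the same idea; passing the quantile as an extra argument to transform avoids the lambda (this assumes a reasonably recent pandas version, where string aggregations in transform accept extra arguments):

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'A', 'B', 'B'], 'col2': [2, 4, 6, 3, 4]})

# per-group 60th percentile, broadcast back to the original rows
threshold = df.groupby('col1')['col2'].transform('quantile', 0.6)

# keep only the rows strictly below their group's quantile
print(df[df['col2'].lt(threshold)])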

Split a string for a range of columns in Pandas

How can I split the string into a list for each column of the following Pandas dataframe with many columns?
col1 col2
0/1:9,12:21:99 0/1:9,12:22:99
0/1:9,12:23:99 0/1:9,15:24:99
Desired output:
col1 col2
[0/1,[9,12],21,99] [0/1,[9,12],22,99]
[0/1,[9,12],23,99] [0/1,[9,15],24,99]
I could do:
df['col1'].str.split(":", n = -1, expand = True)
df['col2'].str.split(":", n = -1, expand = True)
but I have many columns, so I was wondering if I could do it in a more automated way.
I would then like to calculate the mean of the element at index 2 of each list for every row; that is, for the first row get the mean of 21 and 22, and for the second row the mean of 23 and 24.
If the data is like your sample, you can make use of stack:
new_df = (df.iloc[:, 0:2]
            .stack()
            .str.split(':', expand=True)
         )
Then new_df is double-indexed:
          0     1   2   3
0 col1  0/1  9,12  21  99
  col2  0/1  9,12  22  99
1 col1  0/1  9,12  23  99
  col2  0/1  9,15  24  99
And say you want the mean of the numbers at index 2 (21 and 22, then 23 and 24):
new_df[2].unstack(level=-1).astype(float).mean(axis=1)
gives:
0 21.5
1 23.5
dtype: float64
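
If the full list form from the question is needed for every column at once, a minimal sketch is to apply the split column-wise and then pull out the element at index 2 for the row means (the column names just mirror the sample):

import pandas as pd

df = pd.DataFrame({'col1': ['0/1:9,12:21:99', '0/1:9,12:23:99'],
                   'col2': ['0/1:9,12:22:99', '0/1:9,15:24:99']})

# split every column into lists in one pass
lists = df.apply(lambda s: s.str.split(':'))

# take the element at index 2 from each list, convert to float, average per row
means = lists.apply(lambda s: s.str[2].astype(float)).mean(axis=1)
print(means)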

Drop all of a group's rows when a condition is met?

I have a pandas data frame with a two-level grouping based on 'col0' and 'col1'. All I want to do is drop all of a group's rows if a specified value in another column is repeated within the group or does not exist in it (i.e. keep only the groups in which the specified value exists exactly once). For example:
The original data frame:
df = pd.DataFrame({'col0': ['A','A','A','A','A','B','B','B','B','B','B','B','c'],
                   'col1': [1,1,2,2,2,1,1,1,1,2,2,2,1],
                   'col2': [1,2,1,2,3,1,2,1,2,2,2,2,1]})
I need to keep the rows of, for example, the groups (['A',1], ['A',2], ['c',1]) in this original DF.
The desired dataframe:
    col0  col1  col2
0      A     1     1
1      A     1     2
2      A     2     1
3      A     2     2
4      A     2     3
12     c     1     1
I tried this step:
df.groupby(['col0','col1']).apply(lambda x: (x['col2']==1).sum()==1)
where the result is
col0  col1
A     1        True
      2        True
B     1       False
      2       False
c     1        True
dtype: bool
How can I create the desired DataFrame based on this boolean result?
You can do it as below (this needs numpy imported as np):
import numpy as np

m = (df.groupby(['col0','col1'])['col2']
       .transform(lambda x: np.where(x.eq(1).sum()==1, x, np.nan))
       .dropna().index)
df.loc[m]
Or:
df[df.groupby(['col0','col1'])['col2'].transform(lambda x: x.eq(1).sum()==1)]
col0 col1 col2
0 A 1 1
1 A 1 2
2 A 2 1
3 A 2 2
4 A 2 3
12 c 1 1
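
A shorter alternative sketch uses GroupBy.filter, which keeps a whole group whenever the predicate is True (same result as the masks above):

# keep only the groups in which the value 1 occurs exactly once in col2
out = df.groupby(['col0', 'col1']).filter(lambda g: g['col2'].eq(1).sum() == 1)
print(out)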