Pandas variable rounding of column - pandas

>>> print(df)
item value1
0 a 1.121
1 a 1.510
2 a 0.110
3 b 3.322
4 b 4.811
5 c 5.841
This is my dummy pandas df.
Below is how I truncate/round my column value1.
decimals = 2
df['value1'] = df['value1'].apply(lambda x: round(x, decimals))
>>> print(df)
item value1
0 a 1.12
1 a 1.51
2 a 0.11
3 b 3.32
4 b 4.81
5 c 5.84
This rounds the whole column to two decimal places. Is it possible to round with a variable number of decimals taken from a dictionary? In the dictionary below, 'a' means two decimal places, 'b' means three, and any item not covered falls back to 'default' (2). My expected df is below. I'm not sure this is possible (more of a thought experiment).
dec_dict = {'a' : 2, 'b': 3, 'l':3, 'default': 2}
>>> print(df)
item value1
0 a 1.12
1 a 1.51
2 a 0.11
3 b 3.322
4 b 4.811
5 c 5.84

Given that trailing zeros are not significant, the best approach should be:
dec_dict = {'a' : 2, 'b': 3, 'l':3, 'default': 2}
# group_keys=False keeps the original row index so the result aligns with df
df['value1'] = (df.groupby('item', group_keys=False)['value1']
.apply(lambda g: g.round(dec_dict.get(g.name, dec_dict['default'])))
)
output:
item value1
0 a 1.120
1 a 1.510
2 a 0.110
3 b 3.322
4 b 4.811
5 c 5.840
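The values above are rounded correctly, but a float column is printed with one shared precision, which is why 1.12 shows up as 1.120. If the goal is a display with a different number of decimals per row, one option is to format the values as strings. This is a minimal sketch, assuming the sample data from the question; the names places and formatted are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['a', 'a', 'a', 'b', 'b', 'c'],
    'value1': [1.121, 1.510, 0.110, 3.322, 4.811, 5.841],
})
dec_dict = {'a': 2, 'b': 3, 'l': 3, 'default': 2}

# Number of decimals per row; items missing from dec_dict fall back to 'default'
places = df['item'].map(dec_dict).fillna(dec_dict['default']).astype(int)

# Format each value with its own precision (the result holds strings, not floats)
formatted = [f"{v:.{p}f}" for v, p in zip(df['value1'], places)]
print(formatted)
```

The formatted values are strings, so this is only suitable for presentation, not for further arithmetic.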

Map each item to its number of decimals, then round row by row (the resulting column holds strings):
df.assign(value1=df.assign(col1=df.item.map(dec_dict).fillna(dec_dict['default']).astype(int))
.apply(lambda ss: str(round(ss.value1, ss.col1)), axis=1))
item value1
0 a 1.12
1 a 1.51
2 a 0.11
3 b 3.322
4 b 4.811
5 c 5.84

You can set item as the index, transpose, and round with a dict keyed by column. Before that, the dict needs an entry for every item it is missing:
update_dict = {**dec_dict,**dict.fromkeys(df.item[~df.item.isin(dec_dict.keys())],2)}
update_dict
{'a': 2, 'b': 3, 'l': 3, 'default': 2, 'c': 2}
out = df.set_index('item').T.round(update_dict).astype(object).T.reset_index()
out
item value1
0 a 1.12
1 a 1.51
2 a 0.11
3 b 3.322
4 b 4.811
5 c 5.84
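For comparison, the same result can be reached without transposing by looking up the number of decimals per row and rounding element by element. This is a rough sketch, assuming the sample data from the question; places is a hypothetical helper name, and the column stays numeric.

```python
import pandas as pd

df = pd.DataFrame({
    'item': ['a', 'a', 'a', 'b', 'b', 'c'],
    'value1': [1.121, 1.510, 0.110, 3.322, 4.811, 5.841],
})
dec_dict = {'a': 2, 'b': 3, 'l': 3, 'default': 2}

# Decimals for each row; unknown items use the 'default' entry
places = df['item'].map(dec_dict).fillna(dec_dict['default']).astype(int)

# Round each value with its own number of decimals using the built-in round()
df['value1'] = [round(v, p) for v, p in zip(df['value1'], places)]
print(df)
```

Printing the frame may still show a common number of decimals, because pandas pads float columns for display.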

Related

Subtract values from different groups

I have the following DataFrame:
A X
Time
1 a 10
2 b 17
3 b 20
4 c 21
5 c 36
6 d 40
given by pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6], 'A': ['a', 'b', 'b', 'c', 'c', 'd'], 'X': [10, 17, 20, 21, 36, 40]}).set_index('Time')
The desired output is:
Time Difference
0 2 7
1 4 1
2 6 4
The first difference 1 is a result of subtracting 21 from 20: (first "c" value - last "b" value).
I'm open to numPy transformations as well.
Aggregate with GroupBy.agg using GroupBy.first and GroupBy.last, subtract the shifted last column from the first column, and then drop the first row by position:
df = df.reset_index()
df1 = df.groupby('A',as_index=False, sort=False).agg(first=('X', 'first'),
last=('X','last'),
Time=('Time','first'))
df1['Difference'] = df1['first'].sub(df1['last'].shift(fill_value=0))
df1 = df1[['Time','Difference']].iloc[1:].reset_index(drop=True)
print (df1)
Time Difference
0 2 7
1 4 1
2 6 4
IIUC, you can pivot, ffill the columns, and compute the difference:
g = df.reset_index().groupby('A')
(df.assign(col=g.cumcount().values)
.pivot(index='A', columns='col', values='X')  # keyword arguments are required in pandas 2.0+
.ffill(axis=1)
.assign(Time=g['Time'].first(),
Difference=lambda d: d[0]-d[1].shift())
[['Time', 'Difference']].iloc[1:]
.rename_axis(index=None, columns=None)
)
output:
Time Difference
b 2 7.0
c 4 1.0
d 6 4.0
Intermediate, pivoted/ffilled dataframe:
col 0 1 Time Difference
A
a 10.0 10.0 1 NaN
b 17.0 20.0 2 7.0
c 21.0 36.0 4 1.0
d 40.0 40.0 6 4.0
Another possible solution:
(df.assign(Y = df['X'].shift())
.iloc[df.index % 2 == 0]
.assign(Difference = lambda z: z['X'] - z['Y'])
.reset_index()
.loc[:, ['Time', 'Difference']]
)
Output:
Time Difference
0 2 7.0
1 4 1.0
2 6 4.0
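The last solution relies on each group starting at an even Time value, which happens to hold for this data. A sketch that only assumes the rows of each group are contiguous could look like the following; starts and step are illustrative names, not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 2, 3, 4, 5, 6],
    'A': ['a', 'b', 'b', 'c', 'c', 'd'],
    'X': [10, 17, 20, 21, 36, 40],
}).set_index('Time')

# True on the first row of every run of equal 'A' values
starts = df['A'].ne(df['A'].shift())

# X minus the X of the previous row (NaN for the very first row)
step = df['X'].diff()

# Keep only group starts, skip the first group, and tidy up the result
out = (step[starts]
       .iloc[1:]
       .astype(int)
       .rename('Difference')
       .reset_index())
print(out)
```

Because the previous row of a group start is the last row of the preceding group, the difference at each start is exactly "first value of this group minus last value of the previous group".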

Drop a column based on the existence of another column

I'm actually trying to figure out how to drop a column based on the existence of another column. Here is my problem :
I start with this DataFrame. Each "X" column is associated with a "Y" column using a number. (X_1,Y_1 / X_2,Y_2 ...)
Index X_1 X_2 Y_1 Y_2
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
I drop the columns that contain only NaN using df.dropna(axis=1). The result I get is this DataFrame:
Index X_1 X_2 Y_1
1 4 0 A
2 7 0 A
3 6 0 B
4 2 0 B
5 8 0 A
The problem is that I want to delete the "X" column associated with the "Y" column that just got dropped. I would like to use a condition that basically says:
"If Y_2 is not in the DataFrame, drop the X_2 column"
I used a for loop combined with an if, but it doesn't seem to work. Any ideas?
Thanks and have a good day.
Setup
>>> df
CHA_COEXPM1_COR CHA_COEXPM2_COR CHA_COFMAT1_COR CHA_COFMAT2_COR
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
Solution
1. Identify the columns having NaN values in any row
2. Group the identified columns using the numeric identifier and transform using any
3. Filter the columns using the boolean mask created in the previous step
m = df.isna().any()
# expand=False returns an Index, so the group keys line up positionally with m
m = m.groupby(m.index.str.extract(r'(\d+)_', expand=False)).transform('any')
Result
>>> df.loc[:, ~m]
CHA_COEXPM1_COR CHA_COFMAT1_COR
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
Slightly modified example to be closer to actual DataFrame:
from numpy import nan
import pandas as pd
df = pd.DataFrame({
'Index': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'X_V1_C': {0: 4, 1: 7, 2: 6, 3: 2, 4: 8},
'X_V2_C': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'Y_V1_C': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'A'},
'Y_V2_C': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
})
Index X_V1_C X_V2_C Y_V1_C Y_V2_C
0 1 4 0 A NaN
1 2 7 0 A NaN
2 3 6 0 B NaN
3 4 2 0 B NaN
4 5 8 0 A NaN
1. set_index on any columns to be "saved"
2. Extract the numbers from the columns and create a MultiIndex
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
0 1 2 1 2 # Numbers Extracted From Columns
X_V1_C X_V2_C Y_V1_C Y_V2_C
Index
1 4 0 A NaN
2 7 0 A NaN
3 6 0 B NaN
4 2 0 B NaN
5 8 0 A NaN
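3. Check which groups contain an all-NaN column: DataFrame.isna with all on axis=0 (per column), then any within each level=0 group (the number that was extracted). Grouping with groupby(level=0) replaces any(level=0), which was removed in pandas 2.0.
col_mask = ~df.isna().all(axis=0).groupby(level=0).any()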
0
1 True # Keep 1 Group
2 False # Don't Keep 2 Group
dtype: bool
4. Filter the DataFrame with the mask using loc, then droplevel on the added number level
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
All Together
df = df.set_index('Index')
df.columns = pd.MultiIndex.from_arrays([df.columns.str.extract(r'(\d+)')[0],
df.columns])
col_mask = ~df.isna().all(axis=0).groupby(level=0).any()
df = df.loc[:, col_mask.index[col_mask]].droplevel(axis=1, level=0)
df:
X_V1_C Y_V1_C
Index
1 4 A
2 7 A
3 6 B
4 2 B
5 8 A
drop nas
df.dropna(axis=1, inplace=True)
compute suffixes and columns with both suffixes
suffixes = [i[2:] for i in df.columns]
cols = [c for c in df.columns if suffixes.count(c[2:]) == 2]
filter columns
df[cols]
full code:
df = df.set_index('Index').dropna(axis=1)
suffixes = [i[2:] for i in df.columns]
df[[c for c in df.columns if suffixes.count(c[2:]) == 2]]
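The condition stated in the question ("if Y_2 is not in the DataFrame, drop the X_2 column") can also be written almost literally. This is a minimal sketch, assuming the X_n / Y_n naming from the question; the variable orphans is a hypothetical name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'X_1': [4, 7, 6, 2, 8],
        'X_2': [0, 0, 0, 0, 0],
        'Y_1': ['A', 'A', 'B', 'B', 'A'],
        'Y_2': [np.nan] * 5,
    },
    index=pd.Index([1, 2, 3, 4, 5], name='Index'),
)

# Drop the columns that contain only NaN (here: Y_2)
df = df.dropna(axis=1, how='all')

# An X column is orphaned when its matching Y column is no longer present
orphans = [c for c in df.columns
           if c.startswith('X_') and 'Y_' + c[2:] not in df.columns]

df = df.drop(columns=orphans)
print(df)
```

With other naming schemes the prefix test and the slice c[2:] would need to be adapted.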

Pandas - groupby one column and get mean of all other columns

I have a dataframe, with columns:
cols = ['A', 'B', 'C']
If I groupby one column, say, 'A', like so:
df.groupby('A')['B'].mean()
It works.
But I need to groupby one column and then get the mean of all other columns. I've tried:
df[cols].groupby('A').mean()
But I get the error:
KeyError: 'A'
What am I missing?
Please try:
df.groupby('A').agg('mean')
sample data
B C A
0 1 4 K
1 2 6 S
2 4 7 K
3 6 3 K
4 2 1 S
5 7 3 K
6 8 9 K
7 9 3 K
print(df.groupby('A').agg('mean'))
B C
A
K 5.833333 4.833333
S 2.000000 3.500000
You can use df.groupby('col').mean(). For example, to calculate the mean of columns 'B' and 'C' grouped by 'A':
A B C D
0 1 NaN 1 1
1 1 2.0 2 1
2 2 3.0 1 1
3 1 4.0 1 1
4 2 5.0 2 1
df[['A', 'B', 'C']].groupby('A').mean()
or
df.groupby('A')[['B', 'C']].mean()
Output:
B C
A
1 3.0 1.333333
2 4.0 1.500000
If you need mean for all columns:
df.groupby('A').mean()
Output:
B C D
A
1 3.0 1.333333 1.0
2 4.0 1.500000 1.0
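Perhaps the missing column holds strings rather than numbers? Older pandas versions silently drop non-numeric columns from the mean (pandas 2.0 and later raise a TypeError instead unless numeric_only=True is passed):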
df = pd.DataFrame({
'A': ['big', 'small','small', 'small'],
'B': [1,0,0,0],
'C': [1,1,1,0],
'D': ['1','0','0','0']
})
df.groupby(['A']).mean()
Output:
         B                   C
A
big    1.0                 1.0
small  0.0  0.6666666666666666
Here, converting the column to a numeric type such as int or float produces the desired result:
df.D = df.D.astype(int)
df.groupby(['A']).mean()
Output:
         B                   C    D
A
big    1.0                 1.0  1.0
small  0.0  0.6666666666666666  0.0
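To make the dtype issue concrete, here is a small sketch using the same sample frame. It assumes that numeric_only=True is an acceptable way to skip string columns and that pd.to_numeric is an acceptable way to convert them; numeric_means, converted and all_means are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['big', 'small', 'small', 'small'],
    'B': [1, 0, 0, 0],
    'C': [1, 1, 1, 0],
    'D': ['1', '0', '0', '0'],   # digits stored as strings
})

# Explicitly ignore non-numeric columns: 'D' is left out of the result
numeric_means = df.groupby('A').mean(numeric_only=True)

# Convert 'D' to numbers first, then every remaining column can be averaged
converted = df.assign(D=pd.to_numeric(df['D']))
all_means = converted.groupby('A').mean()

print(numeric_means)
print(all_means)
```

Checking df.dtypes before grouping is a quick way to spot columns that look numeric but are stored as object.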

Pandas: Combine two dataframe columns in a sorted column

Suppose that I have this dataframe:
import numpy as np
import pandas as pd
def creatingDataFrame():
    raw_data = {'Region1': ['A', 'A', 'C', 'B' , 'A', 'B'],
                'Region2': ['B', 'C', 'A', 'A' , 'B', 'A'],
                'var-1': [20, 30, 40 , 50, 10, 20],
                'var-2': [3, 4 , 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns = ['Region1', 'Region2','var-1', 'var-2'])
    return df
I want to generate this column:
df['segment']=['A-B','A-C','A-C','A-B','A-B','A-B']
Note that it is using columns 'Region1' and 'Region2' but in a sorted order. I have no clue how to do that using pandas. The only solution that I have in mind is to use a list as intermediary step:
Regions=df[['Region1','Region2']].values.tolist()
segments=[]
for i in range(np.shape(Regions)[0]):
    auxRegions=sorted(Regions[i][:])
    segments.append(auxRegions[0]+'-'+auxRegions[1])
df['segments']=segments
To get:
>>> df['segments']
0 A-B
1 A-C
2 A-C
3 A-B
4 A-B
5 A-B
You need:
df['segments'] = ['-'.join(sorted(tup)) for tup in zip(df['Region1'], df['Region2'])]
Output:
Region1 Region2 var-1 var-2 segments
0 A B 20 3 A-B
1 A C 30 4 A-C
2 C A 40 5 A-C
3 B A 50 1 A-B
4 A B 10 2 A-B
5 B A 20 3 A-B
np.sort
v = np.sort(df.iloc[:, :2], axis=1).T
df['segments'] = [f'{i}-{j}' for i, j in zip(v[0], v[1])] # '{}-{}'.format(i, j)
df
Region1 Region2 var-1 var-2 segments
0 A B 20 3 A-B
1 A C 30 4 A-C
2 C A 40 5 A-C
3 B A 50 1 A-B
4 A B 10 2 A-B
5 B A 20 3 A-B
DataFrame.agg + str.join
df['segments'] = pd.DataFrame(
np.sort(df.iloc[:, :2], axis=1)).agg('-'.join, axis=1)
df
Region1 Region2 var-1 var-2 segments
0 A B 20 3 A-B
1 A C 30 4 A-C
2 C A 40 5 A-C
3 B A 50 1 A-B
4 A B 10 2 A-B
5 B A 20 3 A-B
(The np.sort version with the list comprehension, shown just before this one, is faster.)
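The list-comprehension idea extends to any number of region columns. The sketch below wraps it in a small helper; sorted_label is a hypothetical function written for illustration and is not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'Region1': ['A', 'A', 'C', 'B', 'A', 'B'],
    'Region2': ['B', 'C', 'A', 'A', 'B', 'A'],
    'var-1': [20, 30, 40, 50, 10, 20],
    'var-2': [3, 4, 5, 1, 2, 3],
})

def sorted_label(frame, cols, sep='-'):
    """Join the values of `cols` in sorted order, one label per row."""
    return [sep.join(sorted(row)) for row in frame[cols].to_numpy()]

df['segments'] = sorted_label(df, ['Region1', 'Region2'])
print(df)
```

The helper assumes the selected columns contain strings; other dtypes would need converting before the join.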

Converting a dictionary with multiple values for each key to a dataframe

I am new to Python and pandas. I am trying to convert a key with multiple values to dataframe. Below is the example data.
Out[]: {a: [1, 2, 3], b: [11, 22, 33],
c: [111, 222, 333], d: [1111, 2222, 3333, 4444]}
I have tried following pieces of code:
df = pd.DataFrame.from_dict(dict_name, orient = "index")
df1
Or
df = pd.DataFrame(dict_name)
But I am not getting the output I want. I think I need to loop through the values or something, please help. The output I expect:
col_name1 col_name2
0 a 1
1 a 2
2 a 3
3 b 11
4 b 22
5 b 33
6 c 111
and so on...
Thanks for any kind of help.
You can use stack for reshaping, with sort_index and a conversion to int if necessary, and finally create the columns from the MultiIndex with two calls to reset_index:
df = (pd.DataFrame.from_dict(dict_name, orient = "index")
.sort_index()
.stack()
.astype(int)
.reset_index(level=1, drop=True)
.reset_index())
df.columns = ['col_name1','col_name2']
print (df)
col_name1 col_name2
0 a 1
1 a 2
2 a 3
3 b 11
4 b 22
5 b 33
6 c 111
7 c 222
8 c 333
9 d 1111
10 d 2222
11 d 3333
12 d 4444
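For lists of unequal length, a plain list of (key, value) pairs may be easier to follow than reshaping. This is a minimal sketch, assuming the dictionary keys are the strings 'a' to 'd' (the question shows them unquoted).

```python
import pandas as pd

dict_name = {
    'a': [1, 2, 3],
    'b': [11, 22, 33],
    'c': [111, 222, 333],
    'd': [1111, 2222, 3333, 4444],
}

# One (key, value) row per list element, in dictionary order
rows = [(key, value) for key, values in dict_name.items() for value in values]

df = pd.DataFrame(rows, columns=['col_name1', 'col_name2'])
print(df)
```

Since no padding with NaN takes place, the values keep their integer dtype without an explicit astype(int).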