formatting numbers in pandas

for a pandas.DataFrame: df
min max mean
a 0.0 2.300000e+04 6.450098e+02
b 0.0 1.370000e+05 1.651754e+03
c 218.0 1.221550e+10 3.975262e+07
d 1.0 5.060000e+03 2.727708e+02
e 0.0 6.400000e+05 6.560047e+03
I would like to format the display so that numbers show in the
":,.2f" format (that is, ##,###.##) and remove the exponents.
I tried df.style.format("{:,.2f}"), which gives <pandas.io.formats.style.Styler object at 0x108b86f60>, and I have no idea what to do with that.
Any leads, please?

Try this, young Pandas apprentice:
pd.options.display.float_format = '{:,.2f}'.format
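For example, with a small illustrative frame (not the asker's data):
import pandas as pd
# make every float that pandas prints use the ##,###.## format
pd.options.display.float_format = '{:,.2f}'.format
df = pd.DataFrame({'min': [0.0, 218.0], 'max': [2.3e4, 1.22155e10]}, index=['a', 'c'])
print(df)  # floats now render as e.g. 23,000.00 instead of 2.300000e+04
The Styler returned by df.style.format("{:,.2f}") isn't useless either: it renders as a formatted table when it is the last expression in a Jupyter notebook cell; it just has no readable text repr in a plain console.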

Related

How to perform calculations on pandas groupby dataframes

I have a dataframe which for the sake of showing a minimal example, I simplified to this:
df = pd.DataFrame({'pnl':[+10,+23,+15,-5,+20],'style':['obs','obs','obs','bf','bf']})
I would like to do the following:
Group the dataframe by style
Count the positive entries of pnl and divide by the total of entries of that same style.
For example, style 'bf' has 2 entries, one positive and one negative, so 1/2 (total) = 0.5.
This should yield the following result:
style win_rate
bf 0.5
obs 1
dtype: float64
I thought of having a list of the groups, iterate over them and build a new df... But it seems to me like an antipattern. I am pretty sure there is an easier / more pythonic solution.
Thanks.
You can group the boolean series df['pnl'].gt(0) by df['style'] and take the mean:
In [14]: df['pnl'].gt(0).groupby(df['style']).mean()
Out[14]:
style
bf 0.5
obs 1.0
Name: pnl, dtype: float64
You can also try pd.crosstab, which uses groupby in the background, to get the percentage of both positive and non-positive values:
pd.crosstab(df['style'], df['pnl'].gt(0), normalize='index')
Output:
pnl False True
style
bf 0.5 0.5
obs 0.0 1.0
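Another option that keeps a plain groupby on 'style' (a sketch, not taken from the answers above):
import pandas as pd
df = pd.DataFrame({'pnl': [10, 23, 15, -5, 20], 'style': ['obs', 'obs', 'obs', 'bf', 'bf']})
# win rate = share of strictly positive pnl rows within each style
win_rate = df.groupby('style')['pnl'].apply(lambda s: (s > 0).mean())
# bf 0.5, obs 1.0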

categorical variables to binary variables

I have a DataFrame that looks like this:
initial dataframe (image)
I have different tags in the 'Concepts_clean' column and I want to automatically fill the other ones like so: resulting dataframe (image)
For example: in the fourth row, column 'Concepts_clean' I have ['Accueil Amabilité', 'Tarifs'], so I want to fill the columns 'Accueil Amabilité' and 'Tarifs' with ones and all the others with zeros.
What is the most effective way to do it?
Thank you
It's more of an n-hot encoding problem:
>>> def change_df(x):
...     # parse the bracketed tag string and flag the matching column for each tag
...     for i in x['Concepts_clean'].replace('[', '').replace(']', '').split(','):
...         if i.strip():           # skip empty rows such as '[]'
...             x[i.strip()] = 1
...     return x
...
>>> df.apply(change_df, axis=1)
Example Output
Concepts_clean Ecoute Informations Tarifs
[Tarifs] 0.0 0.0 1.0
[] 0.0 0.0 0.0
[Ecoute] 1.0 0.0 0.0
[Tarifs, Informations] 0.0 1.0 1.0
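If 'Concepts_clean' actually holds the tags as plain strings such as '[Tarifs, Informations]' (an assumption based on the output shown), the same n-hot encoding can be done without a row-wise apply via str.get_dummies:
# strip the surrounding brackets, then build one 0/1 column per tag
dummies = df['Concepts_clean'].str.strip('[]').str.get_dummies(sep=', ')
result = df[['Concepts_clean']].join(dummies)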

Mathematical operations with dataframe column names

In general terms, the problem I'm having is that I have numerical column names for a dataframe and am struggling to use them.
I have a dataframe (df1) like this:
3.2 5.4 1.1
1 1.6 2.8 4.0
2 3.5 4.2 3.2
I want to create another (df2) where each value is:
(the corresponding value in df1 minus the value to the left) /
(the column number in df1 minus the column number to the left)
This means that the first column of df2 is nan and, for instance, the second row, second column is: (4.2-3.5)/(5.4-3.2)
I think maybe this is problematic because the column names aren't of the appropriate type: I've searched elsewhere but haven't found anything on how to use the column names in the way required.
Any and all help appreciated, even if it involves a workaround!
v = np.diff(df1.values, axis=1) / np.diff(df1.columns.values.astype(float))
df2 = pd.DataFrame(v, df1.index, df1.columns[1:]).reindex_like(df1)
df2
3.2 5.4 1.1
1 NaN 0.545455 -0.279070
2 NaN 0.318182 0.232558
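A self-contained version of the same idea; the DataFrame construction here is an assumption based on the small table in the question:
import numpy as np
import pandas as pd
# rebuild the example frame with numeric (float) column labels
df1 = pd.DataFrame([[1.6, 2.8, 4.0], [3.5, 4.2, 3.2]], index=[1, 2], columns=[3.2, 5.4, 1.1])
# adjacent differences of the values, divided by adjacent differences of the column labels
v = np.diff(df1.values, axis=1) / np.diff(np.asarray(df1.columns, dtype=float))
df2 = pd.DataFrame(v, index=df1.index, columns=df1.columns[1:]).reindex_like(df1)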
You can also transpose the DataFrame, take the row-wise diff (adding the numeric column labels as an extra column so they get differenced too), divide each column by that diff, and transpose back:
df2 = df1.T.assign(c=lambda x: x.index.astype(float)).diff()
df2.apply(lambda x: x.div(df2.c)).drop('c', axis=1).T
Out[367]:
3.2 5.4 1.1
1 NaN 0.545455 -0.279070
2 NaN 0.318182 0.232558

sum vs np.nansum weirdness while summing columns with same name on a pandas dataframe

Taking inspiration from this discussion here on SO (Merge Columns within a DataFrame that have the Same Name), I tried the method suggested. While it works when using the function sum(), it doesn't when I use np.nansum:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100,4), columns=['a', 'a','b','b'], index=pd.date_range('2011-1-1', periods=100))
print(df.head(3))
sum() case:
print(df.groupby(df.columns, axis=1).apply(sum, axis=1).head(3))
a b
2011-01-01 1.328933 1.678469
2011-01-02 1.878389 1.343327
2011-01-03 0.964278 1.302857
np.nansum() case:
print(df.groupby(df.columns, axis=1).apply(np.nansum, axis=1).head(3))
a [1.32893299939, 1.87838886222, 0.964278430632,...
b [1.67846885234, 1.34332662587, 1.30285727348, ...
dtype: object
Any idea why?
The issue is that np.nansum converts its input to a numpy array, so it effectively loses the column information (sum doesn't do this). As a result, the groupby doesn't get back any column information when constructing the output, so the output is just a Series of numpy arrays.
Specifically, the source code for np.nansum calls the _replace_nan function. In turn, the source code for _replace_nan checks if the input is an array, and converts it to one if it's not.
All hope isn't lost though. You can easily replicate np.nansum with Pandas functions. Specifically use sum followed by fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
The sum should ignore NaN's and just sum the non-null values. The only case you'll get back a NaN is if all the values attempting to be summed are NaN, which is why fillna is required. Note that you could also do the fillna before the groupby, i.e. df.fillna(0).groupby....
If you really want to use np.nansum, you can recast the result as a pd.Series. This will likely impact performance, as constructing a Series can be relatively expensive, and you'll be doing it multiple times:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
Example Computations
For some example computations, I'll be using the following simple DataFrame, which includes NaN values (your example data doesn't):
df = pd.DataFrame([[1,2,2,np.nan,4],[np.nan,np.nan,np.nan,3,3],[np.nan,np.nan,-1,2,np.nan]], columns=list('aaabb'))
a a a b b
0 1.0 2.0 2.0 NaN 4.0
1 NaN NaN NaN 3.0 3.0
2 NaN NaN -1.0 2.0 NaN
Using sum without fillna:
df.groupby(df.columns, axis=1).sum()
a b
0 5.0 4.0
1 NaN 6.0
2 -1.0 2.0
Using sum and fillna:
df.groupby(df.columns, axis=1).sum().fillna(0)
a b
0 5.0 4.0
1 0.0 6.0
2 -1.0 2.0
Comparing to the fixed np.nansum method:
df.groupby(df.columns, axis=1).apply(lambda x: pd.Series(np.nansum(x, axis=1), x.index))
a b
0 5.0 4.0
1 0.0 6.0
2 -1.0 2.0
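As a side note, recent pandas versions deprecate groupby(..., axis=1). A transpose-based equivalent (a sketch, not part of the original answer):
# transpose so the duplicated labels become the row index, group them, transpose back
df.T.groupby(level=0).sum().T
With the default min_count=0, an all-NaN group already sums to 0.0 here, so the separate fillna(0) step is not needed.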

How to count number of index or Null values in Pandas dataframe group

It's always the things that seem easy that bug me. I am trying to get a count of the number of non-null values of some variables in a DataFrame, grouped by month and year. I can do this, which works fine:
counts_by_month = df[[variable1, variable2]].groupby([lambda x: x.year, lambda x: x.month]).count()
But what I REALLY want to know is how many of those values in each group are NaNs. So I want to count the NaNs in each variable too, so that I can calculate the percentage of data missing in each group. I cannot find a function to do this.
or
maybe I could get to the same end by counting the total items in the group. Then the NaNs would be Total - 'Non-Null values'
I have been trying to find out if I can somehow count the index values but I haven't been able to do so. Any assistance on this greatly appreciated.
Best wishes
Jason
df.isnull().sum()
Faster, and doesn't need a custom function :)
In [279]: df
Out[279]:
A B C D E
a foo NaN 1.115320 -0.528363 -0.046242
b bar 0.991114 -1.978048 -1.204268 0.676268
c bar 0.293008 -0.708600 NaN -0.388203
d foo 0.408837 -0.012573 1.019361 1.774965
e foo 0.127372 NaN NaN NaN
In [280]: def count_missing(frame):
   .....:     return (frame.shape[0] * frame.shape[1]) - frame.count().sum()
   .....:
In [281]: df.groupby('A').apply(count_missing)
Out[281]:
A
bar 1
foo 4
dtype: int64
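Coming back to the asker's month/year grouping, the boolean frame from isnull() can also be grouped directly. A sketch, where the datetime index and column names are assumptions:
import numpy as np
import pandas as pd
# assumed example data: a datetime index and two variables with gaps
df = pd.DataFrame({'variable1': [1.0, np.nan, 3.0, np.nan], 'variable2': [np.nan, 2.0, 2.5, 4.0]}, index=pd.to_datetime(['2011-01-05', '2011-01-20', '2011-02-03', '2011-02-18']))
grouper = [df.index.year, df.index.month]
n_missing = df.isnull().groupby(grouper).sum()                    # NaNs per variable per (year, month)
pct_missing = n_missing.div(df.groupby(grouper).size(), axis=0)   # fraction missing per group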