How to control the display precision of a NumPy float64 scalar? - numpy

I'm writing a teaching document that uses lots of examples of Python code and includes the resulting numeric output. I'm working from inside IPython and a lot of the examples use NumPy.
I want to avoid print statements, explicit formatting or type conversions. They clutter the examples and detract from the principles I'm trying to explain.
What I know:
From IPython I can use %precision to control the displayed precision of any float results.
I can use np.set_printoptions() to control the displayed precision of elements within a NumPy array.
What I'm looking for is a way to control the displayed precision of a NumPy float64 scalar which doesn't respond to either of the above. These get returned by a lot of NumPy functions.
>>> x = some_function()
Out[2]: 0.123456789
>>> type(x)
Out[3]: numpy.float64
>>> %precision 2
Out[4]: '%.2f'
>>> x
Out[5]: 0.123456789
>>> float(x) # that precision works for regular floats
Out[6]: 0.12
>>> np.set_printoptions(precision=2)
>>> x # but doesn't work for the float64
Out[8]: 0.123456789
>>> np.r_[x] # does work if it's in an array
Out[9]: array([0.12])
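For anyone who wants to reproduce the asymmetry outside IPython, here is a minimal sketch. It assumes a NumPy version that provides the np.printoptions context manager, and the variable names are only illustrative.

```python
import numpy as np

x = np.float64(0.123456789)

# The print options apply to array elements only; the scalar's own
# string conversion ignores the requested precision.
with np.printoptions(precision=2):
    array_text = str(np.array([x]))   # '[0.12]'
    scalar_text = str(x)              # '0.123456789'

print(array_text)
print(scalar_text)
```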
What I want is
>>> # some formatting command
>>> x = some_function() # that returns a float64 = 0.123456789
Out[2]: 0.12
but I'd settle for:
a way of telling NumPy to give me float scalars by default, rather than float64.
a way of telling IPython how to handle a float64, kind of like what I can do with a _repr_pretty_ method for my own classes.

IPython has formatters (core/formatters.py) which contain a dict that maps a type to a format method. There seems to be some knowledge of NumPy in the formatters but not for the np.float64 type.
There are a bunch of formatters, for HTML, LaTeX etc. but text/plain is the one for consoles.
We first get the IPython formatter for console text output
plain = get_ipython().display_formatter.formatters['text/plain']
and then set a formatter for the float64 type. We reuse the formatter that already exists for float, since it already knows about %precision
plain.for_type(np.float64, plain.lookup_by_type(float))
Now
In [26]: a = float(1.23456789)
In [28]: b = np.float64(1.23456789)
In [29]: %precision 3
Out[29]: '%.3f'
In [30]: a
Out[30]: 1.235
In [31]: b
Out[31]: 1.235
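The two steps above could be wrapped in a small helper for a startup file. This is only a sketch: the function name register_float64_precision is hypothetical, and the guard assumes the code may also be imported outside an IPython session, where get_ipython() returns None.

```python
import numpy as np


def register_float64_precision():
    """Reuse IPython's float printer for np.float64; return True on success."""
    try:
        from IPython import get_ipython
    except ImportError:
        return False          # IPython is not installed

    shell = get_ipython()
    if shell is None:
        return False          # not running inside an IPython session

    plain = shell.display_formatter.formatters['text/plain']
    # Same printer that float uses, so %precision applies to float64 too.
    plain.for_type(np.float64, plain.lookup_by_type(float))
    return True


registered = register_float64_precision()
```

Returning a flag rather than raising keeps the helper harmless when the teaching examples are run as a plain script.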
In the implementation I also found that %precision calls np.set_printoptions() with a suitable format string. I didn't know it did this, and it is potentially problematic if the user has already set their own print options. Following the example above
In [32]: c = np.r_[a, a, a]
In [33]: c
Out[33]: array([1.235, 1.235, 1.235])
we see it is doing the right thing for array elements.
I can do this formatter initialisation explicitly in my own code, but a better fix might be to modify IPython core/formatters.py line 677
#default('type_printers')
def _type_printers_default(self):
    d = pretty._type_pprinters.copy()
    d[float] = lambda obj,p,cycle: p.text(self.float_format%obj)
    # suggested "fix"
    if 'numpy' in sys.modules:
        # assumption: numpy is not imported at module level here,
        # so bind the already-loaded module before using it
        import numpy
        d[numpy.float64] = lambda obj,p,cycle: p.text(self.float_format%obj)
    # end suggested fix
    return d
to also handle np.float64 here if NumPy has been imported. Happy for feedback on this; if I feel brave I might submit a PR.

Related

When does matplotlib (or which matplotlib APIs) automatically convert Pandas timestamps to matplotlib dates?

Looking for confirmation or correction. It appears to me, that as long as I do this:
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
then I can pass a Pandas DatetimeIndex (which contains Pandas dates of type timestamps.Timestamp) directly in as the x-coordinate for Axes.plot like this:
In [4]: df
Out[4]:
               Open     High      Low    Close      Volume  AdjClose
Date
2015-12-24  2063.52  2067.36  2058.73  2060.99  1411860000   2060.99
2015-12-23  2042.20  2064.73  2042.20  2064.29  3484090000   2064.29
2015-12-22  2023.15  2042.74  2020.49  2038.97  3520860000   2038.97
2015-12-21  2010.27  2022.90  2005.93  2021.15  3760280000   2021.15
2015-12-18  2040.81  2040.81  2005.33  2005.55  6683070000   2005.55
In [5]: type(df.index)
Out[5]: pandas.core.indexes.datetimes.DatetimeIndex
In [6]: type(df.index[0])
Out[6]: pandas._libs.tslibs.timestamps.Timestamp
...
In [11]: ax.plot( df.index, df['Close'] )
and the x-axis dates work fine. But when building a plot directly using
ax.add_line(lines)
where lines[] is a list of
Line2D(xdata,ydata)
items, (for example, as can be seen here: https://github.com/matplotlib/mpl_finance/blob/master/mpl_finance.py#L133-L154 ) then the xdata must already be converted to matplotlib dates (floats as number of days since 01/01/01) doing something like this:
xdata = mdates.date2num(df.index.to_pydatetime())
Is it correct that ax.plot() automatically converts Pandas dates, but the lower-level APIs do not? Or am I missing something?
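To make the second case concrete, here is a minimal sketch of the manual conversion for a hand-built Line2D. The four-day index and the non-interactive Agg backend are assumptions chosen so the snippet runs anywhere; the closing prices are taken from the frame shown above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.lines import Line2D

# Four consecutive days with the closing prices from the table above.
index = pd.date_range("2015-12-21", periods=4, freq="D")
close = [2021.15, 2038.97, 2064.29, 2060.99]

# Convert the timestamps to matplotlib date numbers by hand,
# as described in the question for the add_line() route.
xdata = mdates.date2num(index.to_pydatetime())

fig, ax = plt.subplots()
ax.add_line(Line2D(xdata, close))
ax.xaxis_date()        # tell the axis to interpret the floats as dates
ax.autoscale_view()    # add_line() does not rescale the view by itself
```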
Also, to add something to this (based on the first couple comments) ...
If I don't register the converters, I get this warning:
In [7]: ax.plot( df.index, df['Close'])
/anaconda3/lib/python3.6/site-packages/pandas/plotting/_converter.py:129: FutureWarning: Using an implicitly registered datetime converter for a matplotlib plotting method. The converter was registered by pandas on import. Future versions of pandas will require you to explicitly register matplotlib converters.
To register the converters:
>>> from pandas.plotting import register_matplotlib_converters
>>> register_matplotlib_converters()
warnings.warn(msg, FutureWarning)
Out[7]: [<matplotlib.lines.Line2D at 0x12a437f98>]
I'm a little confused by the last line (Out[7]) ... does that mean that the code was inside Line2D when this warning was printed?

Set non-scientific float representation as default floating point format for pandas [duplicate]

How can one modify the format for the output from a groupby operation in pandas that produces scientific notation for very large numbers?
I know how to do string formatting in python but I'm at a loss when it comes to applying it here.
df1.groupby('dept')['data1'].sum()
dept
value1 1.192433e+08
value2 1.293066e+08
value3 1.077142e+08
This suppresses the scientific notation if I convert to string but now I'm just wondering how to string format and add decimals.
sum_sales_dept.astype(str)
Granted, the answer I linked in the comments is not very helpful. You can specify your own string converter like so.
In [25]: pd.set_option('display.float_format', lambda x: '%.3f' % x)
In [28]: Series(np.random.randn(3))*1000000000
Out[28]:
0 -757322420.605
1 -1436160588.997
2 -1235116117.064
dtype: float64
I'm not sure if that's the preferred way to do this, but it works.
Converting numbers to strings purely for aesthetic purposes seems like a bad idea, but if you have a good reason, this is one way:
In [6]: Series(np.random.randn(3)).apply(lambda x: '%.3f' % x)
Out[6]:
0 0.026
1 -0.482
2 -0.694
dtype: object
Here is another way of doing it, similar to Dan Allan's answer but without the lambda function:
>>> pd.options.display.float_format = '{:.2f}'.format
>>> Series(np.random.randn(3))
0 0.41
1 0.99
2 0.10
or
>>> pd.set_option('display.float_format', '{:.2f}'.format)
You can use the round function to suppress scientific notation for a specific dataframe:
df1.round(4)
or you can suppress it globally with:
pd.options.display.float_format = '{:.4f}'.format
If you want to style the output of a data frame in a jupyter notebook cell, you can set the display style on a per-dataframe basis:
df = pd.DataFrame({'A': np.random.randn(4)*1e7})
df.style.format("{:.1f}")
See the pandas documentation on DataFrame.style for details.
Setting a fixed number of decimal places globally is often a bad idea since it is unlikely that it will be an appropriate number of decimal places for all of your various data that you will display regardless of magnitude. Instead, try this which will give you scientific notation only for large and very small values (and adds a thousands separator unless you omit the ","):
pd.set_option('display.float_format', '{:,g}'.format)
Or to almost completely suppress scientific notation without losing precision, try this:
pd.set_option('display.float_format', str)
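To see how these candidate formatters behave before installing one as display.float_format, they can be compared in plain Python. This is just an illustrative sketch; the sample values are made up to span several magnitudes.

```python
# Sample values spanning several magnitudes (illustrative only).
samples = [119243300.0, 1234.5, 0.00007]

formatters = {
    "fixed":   "{:.3f}".format,   # always three decimals
    "general": "{:,g}".format,    # six significant digits, thousands separator
    "str":     str,               # shortest round-trip representation
}

results = {name: [fmt(v) for v in samples] for name, fmt in formatters.items()}

for name, texts in results.items():
    print(f"{name:8s}", texts)
# fixed    ['119243300.000', '1234.500', '0.000']
# general  ['1.19243e+08', '1,234.5', '7e-05']
# str      ['119243300.0', '1234.5', '7e-05']
```

Note that the fixed format silently turns 0.00007 into 0.000, which is the kind of loss the answer above warns about.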
I had multiple dataframes with different numbers of decimal places, so, thanks to Allan's idea, I made the length dynamic.
pd.set_option('display.float_format', lambda x: f'%.{len(str(x%1))-2}f' % x)
The downside is that a trailing 0 in the float gets cut off, so you will see 0.00007 rather than 0.000070.
Expanding on this useful comment, here is a solution setting the formatting options only to display the results without changing options permanently:
with pd.option_context('display.float_format', lambda x: f'{x:,.3f}'):
    display(sum_sales_dept)
dept
value1 119,243,300.0
value2 129,306,600.0
value3 107,714,200.0
If you would like to use the values elsewhere, say when writing a CSV file with csv.writer, the numbers can be formatted before creating a list:
df['label'].apply(lambda x: '%.17f' % x).values.tolist()

Using scientific notation in pandas

I found plenty of answers on how to suppress scientific notation in pandas, but how do I enable it? I found the option pd.set_option('precision', 2) but it does not turn large numbers into scientific notation.
For example, I would like the number 123066.14 to be formatted as 1.23E+5. I am using a pandas.DataFrame, and it would be useful to set the formatting for an entire column when exporting/printing.
OK, I figured this out, you can use set_option and pass a format string to option 'display.float_format':
In [76]:
pd.set_option('display.float_format', '{:.2g}'.format)
In [78]:
pd.Series(data=[0.00000001])
Out[78]:
0 1e-08
dtype: float64
EDIT
to match your desired output:
In [79]:
pd.set_option('display.float_format', '{:.2E}'.format)
pd.Series(data=[0.00000001])
Out[79]:
0 1.00E-08
dtype: float64
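Since the question also mentions exporting, here is a small sketch that formats a single column without touching the global display option. The frame and its column name "value" are hypothetical. Python always writes at least two exponent digits, so the result is 1.23E+05 rather than 1.23E+5.

```python
import pandas as pd

# Hypothetical frame; the first value is the one from the question.
df = pd.DataFrame({"value": [123066.14, 0.00000001]})

# Per-column formatting for printing: yields strings, data is unchanged.
formatted = df["value"].map("{:.2E}".format)

# Formatting on export: to_csv accepts a printf-style float_format.
csv_text = df.to_csv(index=False, float_format="%.2E")

print(formatted.tolist())
print(csv_text)
```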

Inconsistent interface of Pandas Series; yielding access to underlying data

While the new Categorical Series support since pandas 0.15.0 is fantastic, I'm a bit annoyed with how they decided to make the underlying data inaccessible except through underscored variables. Consider the following code:
import numpy as np
import pandas as pd
x = np.empty(3, dtype=np.int64)
s = pd.DatetimeIndex(x, tz='UTC')
x
Out[17]: array([140556737562568, 55872352, 32])
s[0]
Out[18]: Timestamp('1970-01-02 15:02:36.737562568+0000', tz='UTC')
x[0] = 0
s[0]
Out[20]: Timestamp('1970-01-01 00:00:00+0000', tz='UTC')
y = s.values
y[0] = 5
x[0]
Out[23]: 5
s[0]
Out[24]: Timestamp('1970-01-01 00:00:00.000000005+0000', tz='UTC')
We can see that both in construction and when asked for underlying values, no deep copies are being made in this DatetimeIndex with regards to its underlying data. Not only is this potentially useful in terms of efficiency, but it's great if you are using a DataFrame as a buffer. You can easily get the numpy primitive containing the underlying data, from there get a pointer to the raw data, which some low level C routine can use to do a copy into from some block of memory.
Now let's look at the behavior of the new Categorical Series. The underlying data of course is not the levels, but the codes.
x2 = np.zeros(3, dtype=np.int64)
s2 = pd.Categorical.from_codes(x2, ["hello", "bye"])
s2
Out[27]:
[hello, hello, hello]
Categories (2, object): [hello, bye]
x2[0] = 1
s2[0]
Out[29]: 'hello'
y2 = s2.codes
y2[0] = 1
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-0366d645c98d> in <module>()
----> 1 y2[0] = 1
ValueError: assignment destination is read-only
y2 = s2._codes
y2[0] = 1
s2[0]
Out[34]: 'bye'
The net effect of this behavior is that as a developer, efficient manipulation of the underlying data for Categoricals is not part of the interface. Also as a user, the from_codes constructor is slow as it deep copies the codes, which may often be unnecessary. There should at least be an option for this.
But the fact that codes is a read only variable and _codes needs to be used strikes me as worse. Why wouldn't .codes give the same behavior as .values? Is there some justification for this beyond the concept that the codes are "private"? I'm hoping some of the pandas gurus on stackoverflow can shed some light on this.
The Categorical type is different from almost all other types in that it is a compound type that has a certain guarantee among its data. Namely that the codes provide a factorization of the levels.
So the argument against mutability is that it would be easy to break the codes-categories mapping, and it could be non-performant. Of course these could possibly be mitigated with checking on the setitem instead (but with some added code complexity).
The vast majority of users are not going to manipulate the codes/categories directly (and only use exposed methods), so this is really a protection against accidentally breaking these guarantees.
If you need to efficiently manipulate the underlying data, the best and easiest approach is simply to pull out the codes/categories, mutate them, then create a new Categorical (which is cheap if codes/categories are already provided).
e.g.
In [3]: s2 = pd.Categorical.from_codes(x2, ["hello", "bye"])
In [4]: s2
Out[4]:
[hello, hello, hello]
Categories (2, object): [hello, bye]
In [5]: s2.codes
Out[5]: array([0, 0, 0], dtype=int8)
In [6]: pd.Categorical(s2.codes+1,s2.categories,fastpath=True)
Out[6]:
[bye, bye, bye]
Categories (2, object): [hello, bye]
Of course this is quite dangerous: if you added 2 instead, the expression would blow up. Manipulating the codes directly is simply buyer beware.
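A more defensive variant of the same idea, sketched with the public from_codes constructor instead of the fastpath argument (an assumption made here because from_codes validates the codes). It copies the read-only codes, mutates the copy, and builds a new Categorical.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical.from_codes(np.zeros(3, dtype=np.int64), ["hello", "bye"])

# .codes is read-only, so work on a writable copy.
new_codes = cat.codes.copy()
new_codes[0] = 1

# Rebuild from the mutated codes, reusing the existing categories.
new_cat = pd.Categorical.from_codes(new_codes, categories=cat.categories)
print(list(new_cat))   # ['bye', 'hello', 'hello']

# Out-of-range codes are rejected instead of silently corrupting the mapping.
try:
    pd.Categorical.from_codes([2, 0, 0], categories=["hello", "bye"])
    rejected = False
except ValueError:
    rejected = True
```

The validation costs a pass over the codes, which is the trade-off against the unchecked route shown above.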

Why the difference between octave's prctile and numpy's percentile?

I've been rewriting a matlab/octave program into numpy and ran across a difference in some resultant values.
This occurs with both the percentile/prctile and the standard-deviation functions.
In Numpy:
import matplotlib.mlab as ml
import numpy
>>> t = numpy.linspace(0,100, 100)
>>> numpy.percentile(t,95)
95.0
>>> numpy.std(t)
29.157646512850626
>>> ml.prctile(t,95)
95.000000000000014
In Octave:
octave:1> t = linspace(0,100,100)';
octave:2> prctile(t,95)
ans = 95.454545
octave:3> std(t)
ans = 29.304537
Although the array values of 't' are the same, the results differ more than I would expect.
In the numpy help(numpy.std) they specifically mention that the algorithm is:
std = sqrt(mean(abs(x - x.mean())**2))
So I implemented that in octave and got the exact answer numpy gives. So it seems the std-deviation function differs.
But why/how? And which is correct? (if there is such a thing)
And even prctile/percentile?
Just in case since I'm in Linux aptosid...
GNU Octave, version 3.6.2
numpy.version '1.6.2rc1'
Numpy simply uses a different algorithm when the percentile lies between two data points. Octave, Matlab and R always center it exactly between two points when needed (I believe), numpy does a bit more than that... if you check http://en.wikipedia.org/wiki/Percentile you will see there are a couple of ways to calculate percentiles.
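As a rough check of this explanation, the Octave figure can be reproduced in NumPy with a different interpolation rule. The sketch assumes that Octave's default places the k-th sorted sample at the 100*(k-0.5)/n percentile and interpolates linearly in between. That rule is an assumption, not something stated in this thread, but it does recover the value from the question.

```python
import numpy as np

t = np.linspace(0, 100, 100)
n = t.size

# NumPy's default: linear interpolation over positions 0 .. n-1.
numpy_default = np.percentile(t, 95)

# Assumed Octave-style rule: sample k (1-based) sits at 100*(k-0.5)/n percent.
positions = 100.0 * (np.arange(1, n + 1) - 0.5) / n
octave_like = np.interp(95, positions, np.sort(t))

print(numpy_default)   # approximately 95.0
print(octave_like)     # approximately 95.4545, matching the Octave output
```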
It seems like Octave assumes ddof=1, at least by default, and numpy uses 0 by default:
>>> numpy.std(t, ddof=0)
29.157646512850633
>>> numpy.std(t, ddof=1)
29.304537349375785
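The two results differ only in the divisor, n versus n-1, so one can be obtained from the other by rescaling. A short sketch using the same array as in the question:

```python
import numpy as np

t = np.linspace(0, 100, 100)
n = t.size

population_std = np.std(t, ddof=0)   # divides by n (NumPy default)
sample_std = np.std(t, ddof=1)       # divides by n - 1 (matches Octave here)

# Converting between the two is a pure rescaling by sqrt(n / (n - 1)).
rescaled = population_std * np.sqrt(n / (n - 1))

print(population_std, sample_std, rescaled)
```

Neither is "wrong": ddof=0 describes the data as a complete population, while ddof=1 gives the usual estimate from a sample.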