Using scientific notation in pandas - numpy

I found plenty of answers on how to suppress scientific notation in pandas, but how do I enable it? I found the option pd.set_option('precision', 2) but it does not turn large numbers into scientific notation.
For example, I would like the number 123066.14 to be formatted as 1.23E+5. I am using a pandas.DataFrame, and it would be useful to set the formatting for an entire column when exporting/printing.

OK, I figured this out: you can use set_option and pass a format string to the option 'display.float_format':
In [76]:
pd.set_option('display.float_format', '{:.2g}'.format)
In [78]:
pd.Series(data=[0.00000001])
Out[78]:
0 1e-08
dtype: float64
EDIT
to match your desired output:
In [79]:
pd.set_option('display.float_format', '{:.2E}'.format)
pd.Series(data=[0.00000001])
Out[79]:
0 1.00E-08
dtype: float64
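For the column-wide export case mentioned in the question, both to_string and to_csv accept a float_format. A small sketch with a made-up DataFrame (note Python prints the exponent with two digits, so you get 1.23E+05 rather than 1.23E+5):

```python
import pandas as pd

# made-up data illustrating the question's 123066.14 example
df = pd.DataFrame({'price': [123066.14, 0.00000001]})

# to_string accepts a callable per float value...
print(df.to_string(float_format='{:.2E}'.format))

# ...while to_csv accepts a printf-style format string
print(df.to_csv(float_format='%.2E'))
```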

Related

Scipy reporting different spearman correlation than pandas

print(n1)
print(n2)
print(type(n1), type(n2))
print(scipy.stats.spearmanr(n1, n2))
print(n1.corr(n2, method="spearman"))
0 2317.0
1 2293.0
2 1190.0
3 972.0
4 1391.0
Name: r6000, dtype: float64
0.0 2317.0
1.0 2293.0
3.0 1190.0
4.0 972.0
5.0 1391.0
Name: 6000, dtype: float64
<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>
SpearmanrResult(correlation=0.9999999999999999, pvalue=1.4042654220543672e-24)
0.7999999999999999
The problem is that scipy was reporting a different correlation value than pandas.
Edit to add:
The issue is the indexes are off. Pandas does automatic intrinsic data alignment, but scipy doesn't. I've answered it below.
Pandas doesn't have a function that calculates p-values, so it is better to use SciPy to calculate the correlation, since it gives you both the p-value and the correlation coefficient. The other alternative is to calculate the p-value yourself... using SciPy. Note one thing: if you are calculating the correlation of a sample of your data using pandas, the risk that the correlations change when you change your sample is high. This is why you need the p-value.
I made a copy and called reset_index() on the series before correlating them. That fixed it.
The issue is pandas' intrinsic automatic data alignment based on the indexes.
The scipy library doesn't do automatic data alignment; it most likely just converts each series to a numpy array.
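The misalignment can be reproduced with a small sketch. The values below are the ones from the question's output; the indexes are made up to mimic the shifted index:

```python
import pandas as pd
from scipy import stats

# same values, but n2's index is shifted relative to n1's
n1 = pd.Series([2317.0, 2293.0, 1190.0, 972.0, 1391.0], index=[0, 1, 2, 3, 4])
n2 = pd.Series([2317.0, 2293.0, 1190.0, 972.0, 1391.0], index=[0, 1, 3, 4, 5])

# scipy ignores the indexes and pairs values positionally -> perfect correlation
scipy_r = stats.spearmanr(n1, n2).correlation

# pandas aligns on the index first: only labels 0, 1, 3, 4 overlap,
# so different values get paired and the ranks disagree -> 0.8
pandas_r = n1.corr(n2, method="spearman")

# resetting the indexes makes pandas agree with scipy
fixed_r = n1.reset_index(drop=True).corr(n2.reset_index(drop=True),
                                         method="spearman")
print(scipy_r, pandas_r, fixed_r)
```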

Convert type object column to float

I have a table with a column named "price". This column is of type object, so it contains numbers as strings and also NaN and '?' characters. I want to find the mean of this column, but first I have to remove the NaN and '?' values and convert the rest to float.
I am using the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('Automobile_data.csv', sep = ',')
df = df.dropna('price', inplace=True)
df['price'] = df['price'].astype('int')
df['price'].mean()
But, this doesn't work. The error says:
ValueError: No axis named price for object type DataFrame
How can I solve this problem?
edit: in pandas version 1.3 and earlier, you need subset=[col] wrapped in a list/array. In version 1.4 and greater you can pass a single column as a string.
You've got a few problems:
df.dropna()'s arguments are the axis and then the subset: axis picks rows/columns, and subset says which labels to look at. So you want this to be (I think) df.dropna(axis='rows', subset='price')
Using inplace=True makes the whole call return None, so you have set df = None. You don't want to do that. If you are using inplace=True, don't assign the result; the whole line would just be df.dropna(..., inplace=True).
Better still, don't use inplace=True at all; just do the assignment, i.e. df = df.dropna(axis='rows', subset='price')
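A working version of the whole pipeline might look like this. This is a sketch with a hypothetical stand-in for Automobile_data.csv; note that '?' is a plain string, not NaN, so dropna alone won't remove it:

```python
import pandas as pd
import numpy as np

# hypothetical stand-in for the 'price' column of Automobile_data.csv
df = pd.DataFrame({'price': ['13495', '16500', '?', np.nan, '17450']})

# errors='coerce' turns anything non-numeric (including '?') into NaN
# in one step, and the result is a float column
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df = df.dropna(subset=['price'])

print(df['price'].mean())   # 15815.0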

How to control the display precision of a NumPy float64 scalar?

I'm writing a teaching document that uses lots of examples of Python code and includes the resulting numeric output. I'm working from inside IPython and a lot of the examples use NumPy.
I want to avoid print statements, explicit formatting or type conversions. They clutter the examples and detract from the principles I'm trying to explain.
What I know:
From IPython I can use %precision to control the displayed precision of any float results.
I can use np.set_printoptions() to control the displayed precision of elements within a NumPy array.
What I'm looking for is a way to control the displayed precision of a NumPy float64 scalar which doesn't respond to either of the above. These get returned by a lot of NumPy functions.
>>> x = some_function()
Out[2]: 0.123456789
>>> type(x)
Out[3]: numpy.float64
>>> %precision 2
Out[4]: '%.2f'
>>> x
Out[5]: 0.123456789
>>> float(x) # that precision works for regular floats
Out[6]: 0.12
>>> np.set_printoptions(precision=2)
>>> x # but doesn't work for the float64
Out[8]: 0.123456789
>>> np.r_[x] # does work if it's in an array
Out[9]: array([0.12])
What I want is
>>> # some formatting command
>>> x = some_function() # that returns a float64 = 0.123456789
Out[2]: 0.12
but I'd settle for:
a way of telling NumPy to give me float scalars by default, rather than float64.
a way of telling IPython how to handle a float64, kind of like what I can do with _repr_pretty_ for my own classes.
IPython has formatters (core/formatters.py) which contain a dict that maps a type to a format method. There seems to be some knowledge of NumPy in the formatters but not for the np.float64 type.
There are a bunch of formatters, for HTML, LaTeX etc. but text/plain is the one for consoles.
We first get the IPython formatter for console text output
plain = get_ipython().display_formatter.formatters['text/plain']
and then set a formatter for the float64 type. We use the same formatter that already exists for float, since it already knows about %precision:
plain.for_type(np.float64, plain.lookup_by_type(float))
Now
In [26]: a = float(1.23456789)
In [28]: b = np.float64(1.23456789)
In [29]: %precision 3
Out[29]: '%.3f'
In [30]: a
Out[30]: 1.235
In [31]: b
Out[31]: 1.235
In the implementation I also found that %precision calls np.set_printoptions() with a suitable format string. I didn't know it did this, and it is potentially problematic if the user has already set the print options themselves. Following the example above
In [32]: c = np.r_[a, a, a]
In [33]: c
Out[33]: array([1.235, 1.235, 1.235])
we see it is doing the right thing for array elements.
I can do this formatter initialisation explicitly in my own code, but a better fix might be to modify IPython's core/formatters.py around line 677
@default('type_printers')
def _type_printers_default(self):
    d = pretty._type_pprinters.copy()
    d[float] = lambda obj, p, cycle: p.text(self.float_format % obj)
    # suggested "fix"
    if 'numpy' in sys.modules:
        d[numpy.float64] = lambda obj, p, cycle: p.text(self.float_format % obj)
    # end suggested fix
    return d
to also handle np.float64 here when NumPy has been imported. Happy for feedback on this; if I feel brave I might submit a PR.
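The two-line registration above can also be exercised outside a live IPython session by instantiating the plain-text formatter directly. This is a sketch; float_precision is the trait that %precision sets under the hood:

```python
import numpy as np
from IPython.core.formatters import PlainTextFormatter

plain = PlainTextFormatter()
plain.float_precision = '%.3f'   # equivalent to running %precision 3

# reuse the existing float printer, which already respects float_format
plain.for_type(np.float64, plain.lookup_by_type(float))

print(plain(np.float64(1.23456789)))   # 1.235
```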

which float precision are numpy arrays by default?

I wonder which format floats are in numpy array by default.
(or do they even get converted when declaring a np.array? if so how about python lists?)
e.g. float16,float32 or float64?
float64. You can check it like this:
>>> np.array([1, 2]).dtype
dtype('int64')
>>> np.array([1., 2]).dtype
dtype('float64')
If you don't specify the data type when you create the array, then numpy will infer the type. From the docs:
dtype : data-type, optional. The desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence.
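A quick sketch of the inference rules, plus the explicit override:

```python
import numpy as np

# inference: a single float literal promotes the whole array to float64
print(np.array([1., 2]).dtype)                     # float64

# an explicit dtype overrides inference
print(np.array([1, 2], dtype=np.float32).dtype)    # float32

# note: integer inference is platform dependent (int64 on most 64-bit
# Linux/macOS builds, int32 on Windows), so don't rely on the exact width
print(np.array([1, 2]).dtype)
```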

Set non-scientific float representation as default floating point format for pandas [duplicate]

How can one modify the format for the output from a groupby operation in pandas that produces scientific notation for very large numbers?
I know how to do string formatting in python but I'm at a loss when it comes to applying it here.
df1.groupby('dept')['data1'].sum()
dept
value1 1.192433e+08
value2 1.293066e+08
value3 1.077142e+08
This suppresses the scientific notation if I convert to string but now I'm just wondering how to string format and add decimals.
sum_sales_dept.astype(str)
Granted, the answer I linked in the comments is not very helpful. You can specify your own string converter like so.
In [25]: pd.set_option('display.float_format', lambda x: '%.3f' % x)
In [28]: Series(np.random.randn(3))*1000000000
Out[28]:
0 -757322420.605
1 -1436160588.997
2 -1235116117.064
dtype: float64
I'm not sure if that's the preferred way to do this, but it works.
Converting numbers to strings purely for aesthetic purposes seems like a bad idea, but if you have a good reason, this is one way:
In [6]: Series(np.random.randn(3)).apply(lambda x: '%.3f' % x)
Out[6]:
0 0.026
1 -0.482
2 -0.694
dtype: object
Here is another way of doing it, similar to Dan Allan's answer but without the lambda function:
>>> pd.options.display.float_format = '{:.2f}'.format
>>> Series(np.random.randn(3))
0 0.41
1 0.99
2 0.10
or
>>> pd.set_option('display.float_format', '{:.2f}'.format)
You can use round function just to suppress scientific notation for specific dataframe:
df1.round(4)
or you can suppress it globally by:
pd.options.display.float_format = '{:.4f}'.format
If you want to style the output of a data frame in a jupyter notebook cell, you can set the display style on a per-dataframe basis:
df = pd.DataFrame({'A': np.random.randn(4)*1e7})
df.style.format("{:.1f}")
See the documentation here.
Setting a fixed number of decimal places globally is often a bad idea, since it is unlikely to be an appropriate number of decimal places for all of the various data you will display, regardless of magnitude. Instead, try this, which gives you scientific notation only for very large and very small values (and adds a thousands separator unless you omit the ','):
pd.set_option('display.float_format', lambda x: '{:,g}'.format(x))
Or to almost completely suppress scientific notation without losing precision, try this:
pd.set_option('display.float_format', str)
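For comparison, here is a sketch of both suggestions side by side on made-up values. (Note the ',' thousands flag is only valid in str.format-style specs, not in printf-style '%' formatting, so the general-format lambda below uses '{:,g}'.)

```python
import pandas as pd

s = pd.Series([1.192433e8, 123456.0, 0.00001234])

# '%g'-style: scientific notation only when the magnitude calls for it,
# with a thousands separator for plain large numbers
with pd.option_context('display.float_format', lambda x: '{:,g}'.format(x)):
    print(s)    # 1.19243e+08, 123,456, 1.234e-05

# str: keeps full precision and rarely switches to scientific notation
with pd.option_context('display.float_format', str):
    print(s)
```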
I had multiple dataframes with different floating-point precision, so, building on Dan Allan's idea, I made the length dynamic:
pd.set_option('display.float_format', lambda x: f'%.{len(str(x % 1)) - 2}f' % x)
The downside is that a trailing 0 in the float gets cut, so you get 0.00007 rather than 0.000070.
Expanding on this useful comment, here is a solution setting the formatting options only to display the results without changing options permanently:
with pd.option_context('display.float_format', lambda x: f'{x:,.3f}'):
    display(sum_sales_dept)
dept
value1 119,243,300.0
value2 129,306,600.0
value3 107,714,200.0
If you would like to use the values, say as part of csvfile csv.writer, the numbers can be formatted before creating a list:
df['label'].apply(lambda x: '%.17f' % x).values.tolist()