Pandas: Group by rounded floating number

I have a dataframe with a column of floating point numbers. For example:
df = pd.DataFrame({'A' : np.random.randn(100), 'B': np.random.randn(100)})
What I want to do is group by column A after rounding it to 2 decimal places.
The way I do it is highly inefficient:
df.groupby(df.A.map(lambda x: "%.2f" % x))
I particularly don't want to convert everything to a string, as speed becomes a huge problem. But I don't feel it is safe to do the following:
df.groupby(np.around(df.A, 2))
I am not sure, but I feel there might be cases where two float64 numbers have the same string representation after rounding to 2 decimal places, but slightly different representations after applying np.around to 2 decimal places. For example, is it possible that a value whose string representation is 1.52 is sometimes represented by np.around(., 2) as 1.52000001 and other times as 1.51999999?
My question is: what is a better and more efficient way to do this?

I think you do not need to convert the float to a string.
import pandas as pd
from random import random

df = pd.DataFrame({'A': [random() for _ in range(100000)],
                   'B': [random() for _ in range(100000)]})
df.groupby(df['A'].apply(lambda x: round(x, 1))).count()
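If speed is the main concern, the rounding itself can also be fully vectorized with Series.round, so there is no Python-level apply and no string conversion. A minimal sketch of that approach for the 2-decimal grouping from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randn(100), 'B': np.random.randn(100)})

# Vectorized rounding: the grouping key stays float64, but is computed without a Python loop.
grouped = df.groupby(df['A'].round(2)).count()
print(grouped.head())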

Related

Pandas wrong round decimation

I am calculating the duration of data acquisition from some sensors. Although the data is collected faster, I would like to sample it at 10 Hz. Anyway, I created a dataframe with a column called 'Time_diff', which I expect to go [0.0, 0.1, 0.2, 0.3, ...]. However, it somehow goes like [0.0, 0.1, 0.2, 0.30000004, ...]. I am rounding the dataframe, but I still get these weird decimals. Are there any suggestions on how to fix it?
The code:
for i in range(self.n_of_trials):
    start = np.zeros(0)
    stop = np.zeros(0)
    for df in self.trials[i].df_list:
        start = np.append(stop, df['Time'].iloc[0])
        stop = np.append(start, df['Time'].iloc[-1])
    t_start = start.min()
    t_stop = stop.max()
    self.trials[i].duration = t_stop - t_start
    t = np.arange(0, self.trials[i].duration + self.trials[i].dt, self.trials[i].dt)
    self.trials[i].df_merged['Time_diff'] = t
    self.trials[i].df_merged.round(1)
when I print the data it looks like this:
0 0.0
1 0.1
2 0.2
3 0.3
4 0.4
...
732 73.2
733 73.3
734 73.4
735 73.5
736 73.6
Name: Time_diff, Length: 737, dtype: float64
However, when I open it as a CSV file, it looks like this:
Addition
I think the problem is not the CSV conversion but how the float data is converted/rounded. Here is the next part of the code, where I merge more dataframes on the 10 Hz time stamps:
for j in range(len(self.trials[i].df_list)):
    df = self.trials[i].df_list[j]
    df.insert(0, 'Time_diff', round(df['Time'] - t_start, 1))
    df.round({'Time_diff': 1})
    df.drop_duplicates(subset=['Time_diff'], keep='first', inplace=True)
    self.trials[i].df_merged = pd.merge(self.trials[i].df_merged, df, how="outer", on="Time_diff",
                                        suffixes=(None, '_' + self.trials[i].df_list_names[j]))

# Test csv
self.trials[2].df_merged.to_csv(path_or_buf='merged.csv')
And since the inserted dataframes have the exact correct decimals, the rows are not merged properly and a new entry with a new index is created instead.
This is not a rounding problem; it is behavior intrinsic to how floating point numbers work. In fact, 0.30000000000000004 is the result of 0.1 + 0.1 + 0.1 (try it yourself in a Python prompt).
In practice, not every decimal number is exactly representable as a floating point number, so what you get is the closest representable value instead.
You have a few options, depending on whether you just want to improve the visualization or you need to work with exact values. If, for example, you want to use that column for a merge, you can use an approximate comparison instead of an exact one.
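As an illustration of the approximate-comparison idea, a tolerance-based join can be sketched with pd.merge_asof; the column names and data below are made up, not taken from the question:

import pandas as pd

left = pd.DataFrame({'Time_diff': [0.0, 0.1, 0.2, 0.30000000000000004], 'x': [1, 2, 3, 4]})
right = pd.DataFrame({'Time_diff': [0.0, 0.1, 0.2, 0.3], 'y': [10, 20, 30, 40]})

# Both frames must be sorted on the key; match the nearest key within half a sample period.
merged = pd.merge_asof(left.sort_values('Time_diff'),
                       right.sort_values('Time_diff'),
                       on='Time_diff', direction='nearest', tolerance=0.05)
print(merged)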
Another option is to use the decimal module: https://docs.python.org/3/library/decimal.html which works with exact arithmetic but can be slower.
In your case you said the column should represent time stamps at 10 Hz steps, so I think changing the representation so that you directly use 10, 20, 30, ... will allow you to use integers instead of floats.
If you want to see the "true" value of a floating point number in Python, you can use format(0.1*6, '.30f'), which prints the number with 30 digits (still an approximation, but much better than the default).
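A minimal sketch of the integer-based idea (assuming a fixed 10 Hz rate; the names are illustrative): keep an integer sample index as the merge key and only derive the time in seconds from it for display, so the key never accumulates floating point error.

import numpy as np
import pandas as pd

rate_hz = 10
n_samples = 737

df_merged = pd.DataFrame({'sample': np.arange(n_samples)})  # exact integer merge key
df_merged['Time_diff'] = df_merged['sample'] / rate_hz      # 0.0, 0.1, 0.2, ... for display only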

Convert a float column with nan to int pandas

I am trying to convert a float pandas column with nans to int format, using apply.
I would like to use something like this:
df.col = df.col.apply(to_integer)
where the function to_integer is given by
def to_integer(x):
    if np.isnan(x):
        return np.NaN
    else:
        return int(x)
However, when I attempt to apply it, the column remains the same.
How could I achieve this without having to use the standard technique of dtypes?
You can't have NaN in an int column; NaN is a float (unless you use an object dtype, which is not a good idea since you'll lose many vectorized operations).
You can, however, use the new nullable integer type (which uses NA for missing values).
Conversion can be done with convert_dtypes:
df = pd.DataFrame({'col': [1, 2, None]})
df = df.convert_dtypes()
# type(df.at[0, 'col'])
# numpy.int64
# type(df.at[2, 'col'])
# pandas._libs.missing.NAType
output:
col
0 1
1 2
2 <NA>
Not sure how you would achieve this without using dtypes. Sometimes when loading in data, you may have a column that contains mixed dtypes. Loading in a column with one dtype and attempting to turn it into mixed dtypes is not possible, though (at least, not that I know of).
So I will echo what @mozway said and suggest you use a nullable integer data type, e.g.:
df['col'] = df['col'].astype('Int64')
(note the capital I)
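A minimal sketch of that conversion, assuming the column starts out as float64 because of the NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, 2.0, np.nan]})   # float64 because of the NaN
df['col'] = df['col'].astype('Int64')            # nullable integer dtype

print(df['col'].dtype)   # Int64
print(df)                # the NaN is shown as <NA>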

Statistics and Pandas: What normalization means in value_counts() in Pandas

The question is not about coding but about understanding what normalize means in terms of statistics and correlation of the data.
This is an example of what I am doing.
Without normalization:
plt.subplot(111)
plt.plot(df['alcoholism'].value_counts(), marker='o')
plt.plot(df.query('no_show =="Yes"')['alcoholism'].value_counts(), color='black')
plt.show();
With normalization:
plt.subplot(111)
plt.plot(df['alcoholism'].value_counts(normalize=True), marker='o')
plt.plot(df.query('no_show =="Yes"')['alcoholism'].value_counts(normalize=True), color='black')
plt.show();
Which one better correlates the values, with or without normalization? Or is it the wrong idea altogether?
I am new to data and pandas, so excuse my bad code, chaining, commenting, style :)
As you can see, when you normalize (second plot), the sum of the points is equal to 1 for each line that is plotted. Normalizing gives you the rate of occurrences of each value instead of the raw number of occurrences.
Here's what the doc says:
normalize : bool, default False
    Return proportions rather than frequencies.
value_counts() probably returns something like:
0 110000
1 1000
dtype: int64
and value_counts(normalize=True) probably returns something like:
0 0.990991
1 0.009009
dtype: float64
In other words, the relation between the normalized and non-normalized can be checked as:
>>> counts = df['alcoholism'].value_counts()
>>> rate = df['alcoholism'].value_counts(normalize=True)
>>> np.allclose(rate, counts / counts.sum())
True
where np.allclose allows properly comparing two series of floating point numbers.
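A tiny self-contained illustration of the same relation (the data here is made up, not the appointments dataset from the question):

import pandas as pd

s = pd.Series([0, 0, 0, 1, 0])            # e.g. an alcoholism flag
print(s.value_counts())                    # 0 -> 4, 1 -> 1
print(s.value_counts(normalize=True))      # 0 -> 0.8, 1 -> 0.2, sums to 1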

How can I turn a number into the value of a new dataframe?

I am trying to turn a scalar into a dataframe structure assigning a column name and an index name.
I have the following scalar (in concrete a numpy.float64) : -0.090058
and I would like to turn it into a df:
decimal
ratio -0.090058
I thought it was going to be straightforward. This is what I have tried, unsuccessfully:
df=pd.DataFrame(value,index='ratio',columns='decimal')
you were almost there:
In [222]: pd.DataFrame(value,index=['ratio'],columns=['decimal'])
Out[222]:
decimal
ratio -0.090058
you can also do it this way:
In [223]: pd.DataFrame(index=['ratio']).assign(decimal=value)
Out[223]:
decimal
ratio -0.090058
Solution with passing dict:
df = pd.DataFrame({'decimal':value},index=['ratio'])
print (df)
decimal
ratio -0.090058

How to generate pandas DataFrame column of Categorical from string column?

I can convert a pandas string column to Categorical, but when I try to insert it as a new DataFrame column it seems to get converted right back to Series of str:
train['LocationNFactor'] = pd.Categorical.from_array(train['LocationNormalized'])
>>> type(pd.Categorical.from_array(train['LocationNormalized']))
<class 'pandas.core.categorical.Categorical'>
# however it got converted back to...
>>> type(train['LocationNFactor'][2])
<type 'str'>
>>> train['LocationNFactor'][2]
'Hampshire'
Guessing this is because Categorical doesn't map to any numpy dtype; so do I have to convert it to some int type, and thus lose the factor labels<->levels association?
What's the most elegant workaround to store the levels<->labels association and retain the ability to convert back? (just store as a dict like here, and manually convert when needed?)
I think Categorical is still not a first-class datatype for DataFrame, unlike R.
(Using pandas 0.10.1, numpy 1.6.2, python 2.7.3 - the latest macports versions of everything).
The only workaround I found for pandas pre-0.15 is as follows:
The column must be converted to a Categorical for the classifier, but numpy will immediately coerce the levels back to int, losing the factor information.
So store the factor in a global variable outside the dataframe:
train_LocationNFactor = pd.Categorical.from_array(train['LocationNormalized']) # default order: alphabetical
train['LocationNFactor'] = train_LocationNFactor.labels # insert in dataframe
[UPDATE: pandas 0.15+ added decent support for Categorical]
The labels<->levels mapping is stored in the index object.
To convert an integer array to a string array: index[integer_array]
To convert a string array to an integer array: index.get_indexer(string_array)
Here is an example:
In [56]:
c = pd.Categorical.from_array(['a', 'b', 'c', 'd', 'e'])
idx = c.levels
In [57]:
idx[[1,2,1,2,3]]
Out[57]:
Index([b, c, b, c, d], dtype=object)
In [58]:
idx.get_indexer(["a","c","d","e","a"])
Out[58]:
array([0, 2, 3, 4, 0])
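For reference, on pandas 0.15+ (per the update note above) the column can be stored directly as a first-class categorical dtype. A minimal sketch, assuming a string column like LocationNormalized:

import pandas as pd

train = pd.DataFrame({'LocationNormalized': ['Hampshire', 'London', 'Hampshire']})

# The column keeps its string labels but is stored with integer codes under the hood.
train['LocationNFactor'] = train['LocationNormalized'].astype('category')

print(train['LocationNFactor'].cat.categories)      # Index(['Hampshire', 'London'], dtype='object')
print(train['LocationNFactor'].cat.codes.tolist())  # [0, 1, 0]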