Pandas: Dangerous to align on float columns for join?

Consider the following:
>>> a = pd.read_csv('x', keep_default_na=False)
>>> a
id val type
0 0 5.812
1 1 5.232
2 2 5.342
3 3 5.443
>>> b = pd.read_csv('y', keep_default_na=False)
>>> b
id val type
0 0 5.812 a
1 1 5.232 b
2 2 5.342 c
3 3 5.443 d
>>> a.set_index(['id','val']).drop('type',axis=1).join(b.set_index(['id', 'val'])).reset_index()
id val type
0 0 5.812 a
1 1 5.232 b
2 2 5.342 NaN <------ Not c!
3 3 5.443 d
>>> a.dtypes
id int64
val float64
type object
dtype: object
>>> b.dtypes
id int64
val float64
type object
dtype: object
It seems dangerous to use float32/float64 columns as keys in join/merge operations because of rounding errors (significant digits). In the example above, file X had a value of 5.342 while file Y had 5.3420. How should I deal with this?
I tried calling set_option('precision', 4) before read_csv(), but this option seems to affect display only.
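One possible workaround, sketched here under the assumption that the values are only meaningful to three decimal places, is to scale the float column to an integer key and join on that instead of on the raw floats. The frames below, the `add_key` helper, and the `val_key` column are hypothetical names for illustration; the tiny offset in `b` only simulates a rounding difference between the two files.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; the offset in b mimics a
# float that differs from its counterpart in the last few bits.
a = pd.DataFrame({"id": [0, 1, 2, 3], "val": [5.812, 5.232, 5.342, 5.443]})
b = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "val": [5.812, 5.232, 5.342 + 1e-12, 5.443],
    "type": ["a", "b", "c", "d"],
})

# Joining on the raw floats requires bit-for-bit equality, so row 2 is lost.
exact = a.merge(b, on=["id", "val"], how="left")

def add_key(df, decimals=3):
    """Return a copy with an integer key derived from the float column."""
    out = df.copy()
    out["val_key"] = (out["val"] * 10 ** decimals).round().astype("int64")
    return out

# Join on the integer key, then drop the helper column again.
joined = (
    add_key(a)
    .merge(add_key(b).drop(columns="val"), on=["id", "val_key"], how="left")
    .drop(columns="val_key")
)
print(joined)
```

The number of decimals is a judgment call that depends on the data; values that legitimately differ beyond that precision would be treated as equal.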

Related

Panda astype not converting column to int even when using errors=ignore

I have the following DF
ID
0 1.0
1 555555.0
2 NaN
3 200.0
When I try to convert the ID column to Int64 I got the following error:
Cannot convert non-finite values (NA or inf) to integer
I've used the following code to solve this problem:
df["ID"] = df["ID"].astype('int64', errors='ignore')
However, with the code above my ID column still has the float64 type.
Any tip to solve this problem?
Use pd.Int64Dtype (the nullable integer dtype) instead of np.int64:
df['ID'] = df['ID'].fillna(pd.NA).astype(pd.Int64Dtype())
Output:
>>> df
ID
0 1
1 555555
2 <NA>
3 200
>>> df['ID'].dtype
Int64Dtype()
>>> df['ID'] + 10
0 11
1 555565
2 <NA>
3 210
Name: ID, dtype: Int64
>>> print(df.to_csv(index=False))
ID
1
555555
""
200
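As a minimal sketch of the same idea, the string alias "Int64" (capital I) selects the nullable integer dtype as well, and the cast itself turns NaN into <NA>, so the fillna step may not be needed. The frame below is rebuilt by hand from the question's data.

```python
import numpy as np
import pandas as pd

# Rebuild the question's column: floats with one missing value.
df = pd.DataFrame({"ID": [1.0, 555555.0, np.nan, 200.0]})

# "Int64" is the nullable extension dtype; plain "int64" is NumPy's and
# cannot represent missing values.
as_nullable = df["ID"].astype("Int64")
print(as_nullable)
```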

Pandas rolling window with less than or equal to

I have a dataframe which is classified based on three dimensions:
>>> df
a b c d
0 a b c 1
1 a e x 2
2 a f e 3
When I compute a rolling mean of column d with the following command, I get:
>>> df.d.rolling(window = 3).mean()
0 NaN
1 NaN
2 2.0
Name: d, dtype: float64
What I actually want is a rolling window of at most a given size: the first entry should be the number itself, the second should be the mean of the first two entries, and from the third entry onward the result should be the running average over the 3 most recent entries.
So the result I am expecting is:
for the dataframe:
>>> df
a b c d
0 a b c 1
1 a e x 2
2 a f e 3
>>> df.d.rolling(window = 3).mean()
0 1 #Since this is the first one and so average of the first number is equal to number itself.
1 1.5 # Average of 1 and 2 as rolling criteria is <= 3
2 2.0 # Since here we have 3 elements so from here on it follows the general trend.
Name: d, dtype: float64
Is it possible to roll this way?
I was able to roll using the following command:
>>> df.d.rolling(min_periods = 1, window = 3).mean()
0 1.0
1 1.5
2 2.0
Name: d, dtype: float64
With min_periods you can specify the minimum number of observations a window needs in order to produce a value.
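A slightly longer series (made up for illustration) shows that the window still caps at 3 once enough rows are available:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Windows grow from 1 to 3 rows, then slide: [1], [1,2], [1,2,3], [2,3,4], [3,4,5]
rolled = s.rolling(window=3, min_periods=1).mean()
print(rolled.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```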

Calculate the mean of one row according to its label

I want to calculate the mean of the values in one row according to their labels:
A = [1,2,3,4,5,6,7,8,9,10]
B = [0,0,0,0,0,1,1,1,1, 1]
Result = pd.DataFrame(data=[A, B])
I want the output to be: 0 -> 3; 1 -> 8
pandas has the groupby function, but I don't know how to implement this. Thanks
This is a simple groupby problem:
Result=Result.T
Result.groupby(Result[1])[0].mean()
Out[372]:
1
0 3
1 8
Name: 0, dtype: int64
Firstly, it sounds like you want to label the index:
In [11]: Result = pd.DataFrame(data=[A, B], index=['A', 'B'])
In [12]: Result
Out[12]:
0 1 2 3 4 5 6 7 8 9
A 1 2 3 4 5 6 7 8 9 10
B 0 0 0 0 0 1 1 1 1 1
If the index was unique you wouldn't have to do any groupby, just take the mean of each row (that's the axis=1):
In [13]: Result.mean(axis=1)
Out[13]:
A 5.5
B 0.5
dtype: float64
However, if you had multiple rows with the same label, then you'd need to groupby:
In [21]: Result2 = pd.DataFrame(data=[A, A, B], index=['A', 'A', 'B'])
In [22]: Result2.mean(axis=1)
Out[22]:
A 5.5
A 5.5
B 0.5
dtype: float64
Note the duplicate rows (which happen to have the same mean because I lazily reused the same row contents). In general we'd want to take the mean of those means:
In [23]: Result2.mean(axis=1).groupby(level=0).mean()
Out[23]:
A 5.5
B 0.5
dtype: float64
Note: .groupby(level=0) groups the rows which have the same index label.
You're making it difficult on yourself by constructing the dataframe in such a way as to put the things you want to take the mean of and the things you want to be your labels as different rows.
Option 1
groupby
This deals with the data presented in the dataframe Result
Result.loc[0].groupby(Result.loc[1]).mean()
1
0 3
1 8
Name: 0, dtype: int64
Option 2
This is overkill: it uses np.bincount, and it works because your grouping values are 0 and 1. I'd have a solution even if they weren't, but this makes it simpler.
I wanted to use the raw lists A and B
pd.Series(np.bincount(B, A) / np.bincount(B))
0 3.0
1 8.0
dtype: float64
Option 3
Construct a series instead of a dataframe.
Again using raw lists A and B
pd.Series(A, B).groupby(level=0).mean()
0 3.0
1 8.0
dtype: float64
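The point about layout can be sketched as follows: putting values and labels in columns rather than rows makes the groupby direct. The column names `value` and `label` are made up for illustration.

```python
import pandas as pd

A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# One row per observation: the value and its label sit side by side.
long_df = pd.DataFrame({"value": A, "label": B})

# Group the values by label and average each group.
means = long_df.groupby("label")["value"].mean()
print(means)
```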

apply() function to generate new value in a new column

I am new to python 3 and pandas. I tried to add a new column into a dataframe where the value is the difference between two existing columns.
My current code is:
import pandas as pd
import io
from io import StringIO
x="""a,b,c
1,2,3
4,5,6
7,8,9"""
with StringIO(x) as df:
    new=pd.read_csv(df)
print (new)
y=new.copy()
y.loc[:,"d"]=0
# My lambda function is completely wrong, but I don't know how to make it right.
y["d"]=y["d"].apply(lambda x:y["a"]-y["b"], axis=1)
Desired output is
a b c d
1 2 3 -1
4 5 6 -1
7 8 9 -1
Does anyone have any idea how I can make my code work?
Thanks for your help.
Call apply on the DataFrame y itself, not on the column: DataFrame.apply with axis=1 processes the frame row by row:
y["d"]= y.apply(lambda x:x["a"]-x["b"], axis=1)
For easier debugging, you can write a custom function:
def f(x):
    print (x)
    a = x["a"]-x["b"]
    return a
y["d"]= y.apply(f, axis=1)
a 1
b 2
c 3
Name: 0, dtype: int64
a 4
b 5
c 6
Name: 1, dtype: int64
a 7
b 8
c 9
Name: 2, dtype: int64
A better solution, if you only need to subtract columns:
y["d"] = y["a"] - y["b"]
print (y)
a b c d
0 1 2 3 -1
1 4 5 6 -1
2 7 8 9 -1
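For comparison, here is a compact sketch that builds the same frame and checks that the row-wise apply and the vectorized subtraction agree. The names `csv_text` and `row_wise` are made up for illustration, and `assign` is just one way to attach the column.

```python
from io import StringIO

import pandas as pd

csv_text = "a,b,c\n1,2,3\n4,5,6\n7,8,9"
y = pd.read_csv(StringIO(csv_text))

# Row-wise: the lambda receives each row as a Series.
row_wise = y.apply(lambda row: row["a"] - row["b"], axis=1)

# Vectorized: subtract whole columns at once and attach the result as "d".
y = y.assign(d=y["a"] - y["b"])

print(y)
print(row_wise.equals(y["d"]))
```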

Extract rows with maximum values in pandas dataframe

We can use .idxmax to locate the maximum value of a dataframe (df). My problem is that I have a df with several columns (more than 10), and one of the columns contains identifiers that repeat. For each identifier, I need to extract the row with the maximum value:
>df
id value
a 0
b 1
b 1
c 0
c 2
c 1
Now, this is what I'd want:
>df
id value
a 0
b 1
c 2
I am trying to get it by using df.groupby(['id']), but it is a bit tricky:
df.groupby(["id"]).ix[df['value'].idxmax()]
Of course, that doesn't work. I fear that I am not on the right path, so I thought I'd ask you guys! Thanks!
Close! Groupby the id, then use the value column; return the max for each group.
In [14]: df.groupby('id')['value'].max()
Out[14]:
id
a 0
b 1
c 2
Name: value, dtype: int64
If the OP wants to put these maxima back on the frame, just create a transform and assign it:
In [17]: df['max'] = df.groupby('id')['value'].transform(lambda x: x.max())
In [18]: df
Out[18]:
id value max
0 a 0 0
1 b 1 1
2 b 1 1
3 c 0 2
4 c 2 2
5 c 1 2
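If the goal is to keep the full rows (all other columns included) rather than only the maxima, one option is to combine groupby with idxmax and .loc. This is a sketch on the question's two-column sample; when a group has ties, idxmax picks the first occurrence.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "b", "c", "c", "c"],
    "value": [0, 1, 1, 0, 2, 1],
})

# idxmax gives, per id, the index label of the row holding the maximum value;
# .loc then pulls those complete rows out of the original frame.
max_rows = df.loc[df.groupby("id")["value"].idxmax()]
print(max_rows)
```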