Pandas astype not converting column to int even when using errors=ignore - pandas

I have the following DF
ID
0 1.0
1 555555.0
2 NaN
3 200.0
When I try to convert the ID column to int64 I get the following error:
Cannot convert non-finite values (NA or inf) to integer
I've used the following code to solve this problem:
df["ID"] = df["ID"].astype('int64', errors='ignore')
However, when I use the above code, the ID column keeps its float64 dtype.
Any tip to solve this problem?

Use the nullable pd.Int64Dtype instead of np.int64:
df['ID'] = df['ID'].fillna(pd.NA).astype(pd.Int64Dtype())
Output:
>>> df
ID
0 1
1 555555
2 <NA>
3 200
>>> df['ID'].dtype
Int64Dtype()
>>> df['ID'] + 10
0 11
1 555565
2 <NA>
3 210
Name: ID, dtype: Int64
>>> print(df.to_csv(index=False))
ID
1
555555
""
200
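A side note, not part of the original answer: astype(..., errors='ignore') silently returns the column unchanged when the cast fails, which is why the dtype stayed float64 in the question. In recent pandas versions the string alias 'Int64' maps to the same nullable dtype, and the fillna step is usually unnecessary; a minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [1.0, 555555.0, np.nan, 200.0]})

# 'Int64' (capital I) is the nullable pandas integer dtype;
# the NaN row simply becomes <NA> instead of raising an error.
df["ID"] = df["ID"].astype("Int64")
print(df["ID"].dtype)  # Int64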

Related

Different outcome using pandas nunique() and unique()

I have a big DataFrame with 10 million rows and I need to find the number of unique values for each column. I wrote the function below (it needs to return a Series):
def count_unique_values(df):
    return pd.Series(df.nunique())
and I get this output:
Area 210
Item 436
Element 4
Year 53
Unit 2
Value 313640
dtype: int64
The expected result for Value should be 313641.
When I just do
df['Value'].unique()
I do get that number. I can't figure out why nunique() gives one less there.
DataFrame.nunique omits missing values because its default parameter is dropna=True; the Series.unique function does not.
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'D': [np.nan, 3, 5, 5, 3, 5],
})
print (df)
A D
0 a NaN
1 b 3.0
2 c 5.0
3 d 5.0
4 e 3.0
5 f 5.0
def count_unique_values(df):
    return df.nunique()
print (count_unique_values(df))
A 6
D 2
dtype: int64
print (df['D'].unique())
[nan 3. 5.]
print (df['D'].nunique())
2
The solution is to add the parameter dropna=False:
print (df['D'].nunique(dropna=False))
3
print (len(df['D'].unique()))
3
So in your function:
def count_unique_values(df):
    return df.nunique(dropna=False)
print (count_unique_values(df))
A 6
D 3
dtype: int64
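To make the difference concrete, here is a small self-contained sketch (not from the original answer) comparing the two counting approaches on a column with one missing value:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3, 5, 5, 3, 5])

# nunique() drops NaN by default, unique() keeps it,
# so the counts differ by one whenever a NaN is present.
print(s.nunique())               # 2
print(s.nunique(dropna=False))   # 3
print(len(s.unique()))           # 3 -> [nan  3.  5.]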

apply() function to generate a new value in a new column

I am new to Python 3 and pandas. I am trying to add a new column to a DataFrame where the value is the difference between two existing columns.
My current code is:
import pandas as pd
import io
from io import StringIO

x = """a,b,c
1,2,3
4,5,6
7,8,9"""

with StringIO(x) as df:
    new = pd.read_csv(df)

print (new)
y = new.copy()
y.loc[:, "d"] = 0
# My lambda function is completely wrong, but I don't know how to make it right.
y["d"] = y["d"].apply(lambda x: y["a"] - y["b"], axis=1)
Desired output is
a b c d
1 2 3 -1
4 5 6 -1
7 8 9 -1
Does anyone have any idea how I can make my code work?
Thanks for your help.
You need to call DataFrame.apply on y (the DataFrame) with axis=1 to process by rows:
y["d"]= y.apply(lambda x:x["a"]-x["b"], axis=1)
For easier debugging, it is possible to create a custom function:
def f(x):
    print (x)
    a = x["a"] - x["b"]
    return a

y["d"] = y.apply(f, axis=1)
a 1
b 2
c 3
Name: 0, dtype: int64
a 4
b 5
c 6
Name: 1, dtype: int64
a 7
b 8
c 9
Name: 2, dtype: int64
A better solution, if you only need to subtract the columns, is direct vectorized subtraction:
y["d"] = y["a"] - y["b"]
print (y)
a b c d
0 1 2 3 -1
1 4 5 6 -1
2 7 8 9 -1
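As an aside that is not in the original answer, the same column can also be added with DataFrame.assign, which returns a new DataFrame instead of modifying y in place; the arithmetic stays fully vectorized and avoids the slower row-by-row apply:
# assign returns a copy of y with the new column "d" added
y2 = y.assign(d=y["a"] - y["b"])
print(y2)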

Why is the column name missing in the pandas output of the groupby result?

Update
If I use to_frame(), the column name does not seem to be on the same row:
重量
型号
HG-R2075 2040
HG220 680
This is my code. It groups by "型号" (which means type), gets the sum of "重量" (weight), and excludes rows where the "是否已经发送" (whether already sent) column has a value in it.
import pandas as pd
import numpy as np
import sys
import os

script_dir = os.path.dirname(os.path.abspath(__file__))
os.chdir(script_dir)  # change to the path that you already know

try:
    ClientName = sys.argv[1]
except:
    print(u'没有输入或者错误的客户名称!')  # "No client name was entered, or the client name is wrong!"

df = pd.read_excel("Summary.xlsm")
df = df[df['客户'].str.contains(ClientName)][pd.isnull(df[u"是否已经发送"])].groupby([u'型号'])[u'重量'].sum()
print('[CQ:face,id=21] ' + '*' * 10 + u'以下是' + ClientName + u'未发送的重量' + '*' * 10 + '[CQ:face,id=21]')  # "The following is the unsent weight for <ClientName>"
print(str(df))
Output is this:
[CQ:face,id=21] **********以下是KATUN未发送的重量**********[CQ:face,id=21]
型号 (****the column name is missing here*****)
HG-R2075 2040
HG220 680
Name: 重量, dtype: int64
I don't know why the column name is missing.
The output I want is this; how can I get it?
型号 重量
HG-R2075 2040
HG220 680
Name: 重量, dtype: int64
The result df of your groupby operation is actually a Series, not a DataFrame. That's why it is printed with a different format.
print(df.to_frame()) should do the trick.
EDIT: Actually, in such a DataFrame the index name and the column name will not be printed on the same row. To get a cleaner output, use reset_index to get 2 proper columns:
print(df.reset_index().to_string(index=False))
First use boolean indexing, chaining the conditions with &.
If you need a 2-column DataFrame, add as_index=False or use Series.reset_index:
mask = df['客户'].str.contains(ClientName) & df[u"是否已经发送"].isnull()
df = df[mask].groupby([ u'型号'], as_index=False)[u'重量'].sum()
Or:
df = df[mask].groupby([ u'型号'])[u'重量'].sum().reset_index()
For a one-column DataFrame use Series.to_frame (the first column is the index):
df = df[mask].groupby([ u'型号'])[u'重量'].sum().to_frame()
Sample:
np.random.seed(345)
N = 10
df = pd.DataFrame({'客户': np.random.choice(list('abc'), size=N),
                   u"是否已经发送": np.random.choice([np.nan, 0], size=N),
                   u'型号': np.random.randint(2, size=N),
                   u'重量': np.random.randint(10, size=N)})
print (df)
型号 客户 是否已经发送 重量
0 0 a 0.0 4
1 0 a 0.0 0
2 1 b NaN 8
3 1 b NaN 5
4 1 c 0.0 6
5 1 a NaN 3
6 1 a NaN 3
7 1 b 0.0 4
8 0 a NaN 2
9 1 c NaN 8
ClientName = 'a'
mask = df['客户'].str.contains(ClientName) & df[u"是否已经发送"].isnull()
df1 = df[mask].groupby([ u'型号'], as_index=False)[u'重量'].sum()
print(df1)
型号 重量
0 0 2
1 1 6
df1 = df[mask].groupby([ u'型号'])[u'重量'].sum().reset_index()
print(df1)
型号 重量
0 0 2
1 1 6
df2 = df[mask].groupby([ u'型号'])[u'重量'].sum().to_frame()
print (df2)
重量
型号
0 2
1 6
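One more display-only variation, not from the original answers: if you keep the to_frame() result, the header is split over two rows only because the index carries the name 型号. Clearing the index name makes pandas print the header on a single row, while the data stays exactly the same:
# drop the index name so the header prints on one row
df2.index.name = None
print (df2)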

Error using astype when NaN exists in a DataFrame

df
A B
0 a=10 b=20.10
1 a=20 NaN
2 NaN b=30.10
3 a=40 b=40.10
I tried :
df['A'] = df['A'].str.extract('(\d+)').astype(int)
df['B'] = df['B'].str.extract('(\d+)').astype(float)
But I get the following error:
ValueError: cannot convert float NaN to integer
And:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
How do I fix this ?
If some values in a column are missing (NaN), then after converting to numeric the dtype is always float. You cannot convert the values to int, only to float, because the type of NaN is float.
print (type(np.nan))
<class 'float'>
See the docs for how values are converted when at least one NaN is present:
integer -> cast to float64
If you need int values, you need to replace NaN with some int, e.g. 0 via fillna, and then it works perfectly:
df['A'] = df['A'].str.extract('(\d+)', expand=False)
df['B'] = df['B'].str.extract('(\d+)', expand=False)
print (df)
A B
0 10 20
1 20 NaN
2 NaN 30
3 40 40
df1 = df.fillna(0).astype(int)
print (df1)
A B
0 10 20
1 20 0
2 0 30
3 40 40
print (df1.dtypes)
A int32
B int32
dtype: object
From pandas >= 0.24 there is a built-in nullable pandas integer type.
This allows integer NAs, so you don't need to fill them.
Notice the capital 'I' in 'Int64' in the code below.
This is the pandas integer dtype, not the numpy one.
You need to use: .astype('Int64')
So, do this:
df['A'] = df['A'].str.extract('(\d+)', expand=False).astype('float').astype('Int64')
df['B'] = df['B'].str.extract('(\d+)', expand=False).astype('float').astype('Int64')
More info on pandas integer na values:
https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions
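A related variation that is not part of the original answers: pd.to_numeric with errors='coerce' also turns missing or non-parsable values into NaN, which can then be cast to the nullable 'Int64' dtype in one chain. This sketch assumes recent pandas and starts again from the original df with the 'a=10' / 'b=20.10' strings:
# extract digits, coerce to numeric (missing -> NaN), then cast to nullable Int64
df['A'] = pd.to_numeric(df['A'].str.extract(r'(\d+)', expand=False), errors='coerce').astype('Int64')
df['B'] = pd.to_numeric(df['B'].str.extract(r'(\d+)', expand=False), errors='coerce').astype('Int64')
print(df.dtypes)  # A    Int64
                  # B    Int64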

Pandas: Dangerous to align on float columns for join?

Consider the following:
>>> a = pd.read_csv('x', keep_default_na=False)
>>> a
id val type
0 0 5.812
1 1 5.232
2 2 5.342
3 3 5.443
>>> b = pd.read_csv('y', keep_default_na=False)
>>> b
id val type
0 0 5.812 a
1 1 5.232 b
2 2 5.342 c
3 3 5.443 d
>>> a.set_index(['id','val']).drop('type',axis=1).join(b.set_index(['id', 'val'])).reset_index()
id val type
0 0 5.812 a
1 1 5.232 b
2 2 5.342 NaN <------ Not c!
3 3 5.443 d
>>> a.dtypes
id int64
val float64
type object
dtype: object
>>> b.dtypes
id int64
val float64
type object
dtype: object
It seems like it is dangerous to use float32/64 column types for alignment in join/merge operations due to rounding errors (significant digits). In the example above, file X had a value of 5.342 while file Y had 5.3420. How should I deal with this?
I tried doing set_option('precision', 4) before read_csv(), but it seems this option only affects display.
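No accepted answer is included here, but one common mitigation, offered only as a sketch under the assumption that the mismatch comes from digits beyond the intended precision, is to avoid joining on raw floats and instead round (or stringify) the key on both sides before the join:
# round the float key to the intended precision on both sides before merging
a['val'] = a['val'].round(3)
b['val'] = b['val'].round(3)
joined = a.drop('type', axis=1).merge(b, on=['id', 'val'], how='left')
print(joined)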