error using astype when NaN exists in a dataframe - pandas

df

      A        B
0  a=10  b=20.10
1  a=20      NaN
2   NaN  b=30.10
3  a=40  b=40.10
I tried :
df['A'] = df['A'].str.extract('(\d+)').astype(int)
df['B'] = df['B'].str.extract('(\d+)').astype(float)
But I get the following error:
ValueError: cannot convert float NaN to integer
And:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
How do I fix this?

If some values in a column are missing (NaN) and the column is converted to numeric, its dtype is always float. You cannot convert the values to int, only to float, because NaN itself is a float:
print (type(np.nan))
<class 'float'>
See the docs for how values are converted when at least one NaN is present:
integer -> cast to float64
If you need int values, first replace NaN with some int, e.g. 0 via fillna, and then it works perfectly:
df['A'] = df['A'].str.extract(r'(\d+)', expand=False)
df['B'] = df['B'].str.extract(r'(\d+)', expand=False)
print (df)
     A    B
0   10   20
1   20  NaN
2  NaN   30
3   40   40

df1 = df.fillna(0).astype(int)
print (df1)
    A   B
0  10  20
1  20   0
2   0  30
3  40  40

print (df1.dtypes)
A    int32
B    int32
dtype: object

Since pandas >= 0.24 there is a built-in nullable pandas integer type.
It allows integer NaNs, so you don't need to fill them.
Notice the capital 'I' in 'Int64' in the code below.
This is the pandas integer, not the numpy integer.
You need to use: .astype('Int64')
So, do this:
df['A'] = df['A'].str.extract(r'(\d+)', expand=False).astype('float').astype('Int64')
df['B'] = df['B'].str.extract(r'(\d+)', expand=False).astype('float').astype('Int64')
More info on pandas integer na values:
https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions
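For reference, with the example data above the result should look roughly like this (exact display may vary by pandas version; note that the regex only captures the digits before the decimal point in B):
print (df)
      A     B
0    10    20
1    20  <NA>
2  <NA>    30
3    40    40
print (df.dtypes)
A    Int64
B    Int64
dtype: object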

Related

Pandas astype not converting column to int even when using errors=ignore

I have the following DF
         ID
0       1.0
1  555555.0
2       NaN
3     200.0
When I try to convert the ID column to Int64 I got the following error:
Cannot convert non-finite values (NA or inf) to integer
I've used the following code to solve this problem:
df["ID"] = df["ID"].astype('int64', errors='ignore')
However, when I use the above code, my ID column remains float64.
Any tip to solve this problem?
Use pd.Int64Dtype (the nullable integer dtype) instead of np.int64:
df['ID'] = df['ID'].fillna(pd.NA).astype(pd.Int64Dtype())
Output:
>>> df
       ID
0       1
1  555555
2    <NA>
3     200

>>> df['ID'].dtype
Int64Dtype()

>>> df['ID'] + 10
0        11
1    555565
2      <NA>
3       210
Name: ID, dtype: Int64

>>> print(df.to_csv(index=False))
ID
1
555555
""
200
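
If you prefer the string alias, the same conversion can be written more compactly (a sketch, assuming a reasonably recent pandas version where a float column containing NaN casts directly to the nullable dtype):
df['ID'] = df['ID'].astype('Int64')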

Convert and replace a string value in a pandas df with its float type

I have a value in pandas df which is accidentally put as a string as follows:
df.iloc[5329]['values']
'72,5'
I want to convert this value to float and replace it in the df. I have tried the following ways:
df.iloc[5329]['values'] = float(72.5)
also,
df.iloc[5329]['values'] = 72.5
and,
df.iloc[5329]['values'] = df.iloc[5329]['values'].replace(',', '.')
It runs successfully with a warning, but when I check the df, it's still stored as '72,5'.
The entire df at that index is as follows:
df.iloc[5329]
value                    36.25
values                    72,5
values1                   72.5
currency                   MYR
Receipt      Kuching, Malaysia
Delivery        Male, Maldives
How can I solve that?
iloc needs specific row and column positions.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': np.random.choice(100, 3),
        'B': [15.2, '72,5', 3.7]
    })
print(df)
df.info()
Output:
    A     B
0  84  15.2
1  92  72,5
2  56   3.7
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      object
Update the value:
df.iloc[1,1] = 72.5
print(df)
Output:
    A     B
0  84  15.2
1  92  72.5
2  56   3.7
Make sure you don't use chained indexing (i.e. [][]) when assigning, since df.iloc[5329] can make a copy of the data and the assignment then happens on the copy, not the original df. Instead, assign with a single indexer:
df.iloc[5329, df.columns.get_loc('values')] = 72.5
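
If more than one entry uses a comma as the decimal separator, a vectorized fix for the whole column may be more convenient (a sketch, assuming the 'values' column mixes strings like '72,5' with numbers):
df['values'] = pd.to_numeric(df['values'].astype(str).str.replace(',', '.', regex=False))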

Filtering with column of nullable integer type in pandas

What is the correct way to filter a column with a dtype of Int64Dtype?
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'A': [1, 2, None], 'B': [3, 4, np.nan]}, dtype=pd.Int64Dtype())
df
      A     B
0     1     3
1     2     4
2  <NA>  <NA>
df[df['A'] == 1]
ValueError: cannot mask with array containing NA / NaN values
Following the suggestions to update pandas did the trick. Thanks.
In Anaconda:
conda update --all
pd.__version__
'1.0.5'
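
If updating pandas is not an option, an alternative that avoids masking with missing values is to build the boolean mask explicitly and fill the missing comparisons with False (a sketch):
mask = (df['A'] == 1).fillna(False)   # <NA> comparisons are treated as "no match"
df[mask]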

map one column in a df to another df where all words are present

I am trying to map a column onto my dataframe from another dataframe, matching rows where all of the words from the other dataframe's column are present.
Multiple matches are fine, as I can filter them out afterwards.
Thanks in advance!
df1

ColA
this is a sentence
with some words
in a column
and another
for fun

df2

ColB       ColC
this a      123
in column   456
fun times   789
Some attempts
dfResult = df1.apply(lambda x: np.all([word in x.df1['ColA'].split(' ') for word in x.df2['ColB'].split(' ')]),axis = 1)
dfResult = df1.ColA.apply(lambda sentence: all(word in sentence for word in df2.ColB))
desired output
dfResult
ColA                ColC
this is a sentence   123
with some words      NaN
in a column          456
and another          NaN
for fun              NaN
Convert each sentence to a set of words and look for subsets with Numpy broadcasting.
Disclaimer: No assurances that this will be fast.
import numpy as np
import pandas as pd

A = df1.ColA.str.split().apply(set).to_numpy()  # If pandas version is < 0.24 use `.values`
B = df2.ColB.str.split().apply(set).to_numpy()  # instead of `.to_numpy()`
C = df2.ColC.to_numpy()

# When `dtype` is `object`, Numpy falls back on performing
# the operation on each pair of values. Since these are `set` objects,
# `<=` tests for subset.
i, j = np.where(B <= A[:, None])

out = pd.array([np.nan] * len(A), pd.Int64Dtype())  # Empty nullable integers
# Use `out = np.empty(len(A), dtype=object)` if pandas version is < 0.24
out[i] = C[j]

df1.assign(ColC=out)
                 ColA  ColC
0  this is a sentence   123
1     with some words   NaN
2         in a column   456
3         and another   NaN
4             for fun   NaN
By using a loop and set.issubset:
pd.DataFrame([[y if set(z.split()).issubset(set(x.split())) else np.nan for z,y in zip(df2.ColB,df2.ColC)] for x in df1.ColA ]).max(1)
Out[34]:
0 123.0
1 NaN
2 456.0
3 NaN
4 NaN
dtype: float64
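
To get the desired dfResult shape with ColA alongside ColC, the same loop can be assigned back to df1 (a sketch reusing the expression above, assuming df1 and df2 as defined in the question):
import numpy as np
import pandas as pd

matched = pd.DataFrame(
    [[y if set(z.split()).issubset(set(x.split())) else np.nan
      for z, y in zip(df2.ColB, df2.ColC)]
     for x in df1.ColA]
).max(1)
dfResult = df1.assign(ColC=matched)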

Map a pandas column with column names

I have two data frames:
import pandas as pd

# Column contains column name
df1 = pd.DataFrame({"Column": pd.Series(['a', 'b', 'b', 'c']),
                    "Item":   pd.Series(['x', 'y', 'z', 'x']),
                    "Result": pd.Series([3, 4, 5, 6])})

df2 = pd.DataFrame({"a": pd.Series(['x', 'n', 'n']),
                    "b": pd.Series(['x', 'y', 'n']),
                    "c": pd.Series(['x', 'z', 'n'])})
How can I add "Result" to df2 based on the "Item" in the "Column"?
Expected dataframe df2 is:
a b c Result
- - - ------
x x x 3
n y z 4
n n n null
This is a lot more complicated than it looks at first glance. df1 is in long form: it has two entries for 'b'. So first it needs to be stacked/unstacked/pivoted into a 3x3 table of 'Result', where 'Column' becomes the index and the values from 'Item' = 'x'/'y'/'z' are expanded to a full 3x3 matrix with NaN for missing values:
>>> df1_full = df1.pivot(index='Column', columns='Item', values='Result')
>>> df1_full
Item      x    y    z
Column
a       3.0  NaN  NaN
b       NaN  4.0  5.0
c       6.0  NaN  NaN
(Note the unwanted type-conversion to float, this is because numpy doesn't have NaN for integers, see Issue 17013 in pre-pandas-0.22.0 versions. No problem, we'll just cast back to int at the end.)
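
If you want to avoid the float upcast entirely, the nullable integer dtype from the answers above applies here too (a sketch, assuming pandas >= 0.24):
df1_full = df1.pivot(index='Column', columns='Item', values='Result').astype('Int64')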
Now we want to do df1_full.merge(df2, left_index=True, right_on=??)
But first we need another trick/intermediate column to find the leftmost valid value in df2 which corresponds to a valid column-name from df1; the value n is invalid, maybe we replace it with NaN to make life easier:
>>> df2_nan = df2.replace('n', np.NaN)
>>> df2_nan
     a    b    c
0    x    x    x
1  NaN    y    z
2  NaN  NaN  NaN
>>> df2_nan.columns = [0, 1, 2]
>>> df2_nan
     0    1    2
0    x    x    x
1  NaN    y    z
2  NaN  NaN  NaN
And we want to successively test df2's columns from L-to-R as to whether their value is in df1_full.columns, similar to Computing the first non-missing value from each column in a DataFrame
, except testing successive columns (axis=1). Then store that intermediate column-name into a new column, 'join_col' :
>>> df2['join_col'] = df2.replace('n', np.NaN).apply(pd.Series.first_valid_index, axis=1)
a b c join_col
0 x x x a
1 n y z b
2 n n n None
Actually we want to index into the column-names of df1, but it blows up on the NaN:
>>> df1.columns[ df2_nan.apply(pd.Series.first_valid_index, axis=1) ]
(Well that's not exactly working, but you get the idea.)
Finally we do the merge df1_full.merge(df2, left_index=True, right_on='join_col'). And maybe take the desired column slice ['a','b','c','Result']. And cast Result back to int, or map NaN -> 'null'.
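
Putting the pieces together, here is one way to finish the job without the merge, by looking the values up directly from the pivoted table (a sketch under the same assumptions; the names df1_full, df2_nan, join_col and lookup_result are just illustrative, not part of the original answer):
import numpy as np
import pandas as pd

df1_full = df1.pivot(index='Column', columns='Item', values='Result')

df2_nan = df2.replace('n', np.nan)
df2['join_col'] = df2_nan.apply(pd.Series.first_valid_index, axis=1)

def lookup_result(row):
    # No valid column name was found for this row
    if pd.isna(row['join_col']):
        return np.nan
    # The column name picks the Item; look up Result in the pivoted table
    return df1_full.loc[row['join_col'], row[row['join_col']]]

df2['Result'] = df2.apply(lookup_result, axis=1)
# Optionally cast back to a nullable integer to undo the float upcast
df2['Result'] = df2['Result'].astype('Int64')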