Throw an exception and move on in pandas

I have created a pandas dataframe called df with the following code:
import numpy as np
import pandas as pd
ds = {'col1' : ["1","2","3","A"], "col2": [45,6,7,87], "col3" : ["23","4","5","6"]}
df = pd.DataFrame(ds)
The dataframe looks like this:
print(df)
  col1  col2 col3
0    1    45   23
1    2     6    4
2    3     7    5
3    A    87    6
Now, col1 and col3 are objects:
print(df.info())
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col1    4 non-null      object
 1   col2    4 non-null      int64
 2   col3    4 non-null      object
I want to transform, where possible, the object columns into floats.
For example, I can convert col3 into a float like this:
df['col3'] = df['col3'].astype(float)
But I cannot convert col1 into a float:
df['col1'] = df['col1'].astype(float)
ValueError: could not convert string to float: 'A'
Is it possible to write code that converts the object columns to float where possible, and skips the cases where it is not (without raising an error that stops the process)? I guess it has to do with exceptions?

I think you can test whether a column's content is a string (object dtype) or not, and skip the conversion in that case. Did you try this?
for y in df.columns:
    if df[y].dtype == object:
        continue
    else:
        # your treatment here
or, since pandas 0.20.2, there is a function that performs the test: pandas.api.types.is_string_dtype(df['col1'])
This works when all the values of a column are of the same type; if the values are mixed, iterate over df.values instead.
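For reference, is_string_dtype lives in pandas.api.types. A quick sketch of the per-column test, using the question's own df:

```python
import pandas as pd
from pandas.api.types import is_string_dtype

df = pd.DataFrame({"col1": ["1", "2", "3", "A"],
                   "col2": [45, 6, 7, 87],
                   "col3": ["23", "4", "5", "6"]})

# True for the string/object columns col1 and col3, False for int64 col2
flags = {col: is_string_dtype(df[col]) for col in df.columns}
print(flags)  # {'col1': True, 'col2': False, 'col3': True}
```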

I have sorted it.
def convert_float(x):
    try:
        return x.astype(float)
    except (ValueError, TypeError):
        # leave non-convertible columns untouched
        return x

for col in df.columns:
    df[col] = convert_float(df[col])
print(df)
print(df.info())
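An alternative sketch that leans on pandas' own parser: try pd.to_numeric per column and keep the original on failure. (to_numeric's errors='ignore' option is deprecated in recent pandas, so the try/except stays explicit here.)

```python
import pandas as pd

df = pd.DataFrame({"col1": ["1", "2", "3", "A"],
                   "col2": [45, 6, 7, 87],
                   "col3": ["23", "4", "5", "6"]})

for col in df.columns:
    try:
        # to_numeric raises ValueError on strings like 'A'
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass  # leave the column unchanged

print(df.dtypes)  # col1 stays object; col2 and col3 become numeric
```

Note that to_numeric picks the narrowest fitting dtype, so col3 comes out as an integer column here rather than float; chain .astype(float) inside the try block if float is strictly required.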

Related

How can I create a new column in an existing dataframe, fill it with na and specify its dtype to be int64?

I have a big dataframe, and I would like to create an extra column, fill it with NA, and also specify its dtype to be int64. How would I do this?
e.g.
dataframe:
col1 col2
5 's'
7 'g'
6 'f'
Let's say I want to add a new column called new_col and populate it with NA and specify the dtype to be int64.
I tried something like:
df['new_col'].fillna().dtype('int64')
But this doesn't seem to work.
The desired output is:
col1 col2 new_col
5 's' na
7 'g' na
6 'f' na
I can't show the desired dtype of new_col but I would like it to be int64.
import numpy as np
df['new_col'] = np.nan
df['new_col'] = pd.to_numeric(df['new_col'], errors='coerce').astype('int', errors='ignore')
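The astype(..., errors='ignore') path is deprecated in recent pandas. A sketch using the nullable Int64 extension dtype, which can hold missing values directly, may be cleaner (df here is the question's small example frame):

```python
import pandas as pd

df = pd.DataFrame({"col1": [5, 7, 6], "col2": ["s", "g", "f"]})

# Nullable integer dtype: pd.NA is a valid value, unlike numpy's int64
df["new_col"] = pd.Series(pd.NA, index=df.index, dtype="Int64")

print(df["new_col"].dtype)         # Int64
print(df["new_col"].isna().all())  # True
```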

Pandas dataframe reverse increment dictionary in new column

I saved a dictionary with different keys in each row and all values equal to 0 in a new Dataframe column.
Starting from the end, and depending on the value of another column (same row), I would like to increment this dictionary, and store the view of this dictionary.
Incrementing is fine, but storing the view doesn't work. In the end I have the same dictionary in the whole column.
Before
col1 col_dict
1 {1:0, 2:0, 3:0}
2 {1:0, 2:0, 3:0}
3 {1:0, 2:0, 3:0}
What I want:
col1 col_dict
1 {1:1, 2:1, 3:1}
2 {1:0, 2:1, 3:1}
3 {1:0, 2:0, 3:1}
What I have:
col1 col_dict
1 {1:1, 2:1, 3:1}
2 {1:1, 2:1, 3:1}
3 {1:1, 2:1, 3:1}
For example:
def function():
    for x in reversed(range(20)):
        # taking the value in the other column, and incrementing the value in the dictionary
        dataset["dict_column"][x][str(dataset.value[x])][0] += 1
I have tried converting to list format; same problem. I think it's due to how pandas handles the objects.
Thank you in advance. Open to any solution that does the job.
You can assign a copy of the dictionary to col_dict after incrementing it. Reversing the dataframe ensures the increment runs from the end.
import pandas as pd
import copy
df = pd.DataFrame()
df["col1"] = [1, 2, 3]
col_dict = {i:0 for i in df["col1"]}
def get_dict(col):
    col_dict[col] = 1
    return copy.copy(col_dict)
df = df.iloc[::-1]
df["col_dict"] = df["col1"].apply(get_dict)
df = df.iloc[::-1]
print(df)
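The symptom in the question is the classic shared-reference pitfall: if every row holds the same dict object, mutating it through one row changes what every row displays. A minimal sketch of the effect, outside pandas (names here are illustrative, not the question's actual dataset):

```python
import copy

shared = {1: 0, 2: 0, 3: 0}
rows = [shared, shared, shared]  # three references to ONE dict
rows[0][1] += 1                  # mutate through the first "row"
print(rows[2][1])                # 1 -- every row sees the change

# Taking a copy at each step snapshots the current state instead
d = {1: 0, 2: 0, 3: 0}
snapshots = []
for _ in range(3):
    d[1] += 1
    snapshots.append(copy.copy(d))  # independent snapshot per step
print([s[1] for s in snapshots])    # [1, 2, 3]
```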
Thanks to John Doe, I found a solution to my problem. It may not be the best option, but I don't need performance.
Here is my code, less efficient than his I guess, but for a beginner like me it might be clearer to understand (no offense).
import pandas as pd
import copy
def function():
    # to skip the copy for the first (last) row
    flag = 0
    for x in reversed(range(20)):
        if flag == 0:
            dataset["dict_column"][x][str(dataset.value[x])][0] += 1
            flag = 1
        else:
            dataset["dict_column"][x] = copy.deepcopy(dataset["dict_column"][x+1])
            dataset["dict_column"][x][str(dataset.value[x])][0] += 1

Instead of appending value as a new column on the same row, pandas adds a new column AND new row

What I have below is an example of the type of concatenation that I am trying to do.
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df, pd.Series(1, name = 'label')])
The result I am hoping for is:
col1 col2 col3 label
a 1.0 2.0 3.0 1
but what I get is:
col1 col2 col3 0
a 1.0 2.0 3.0 NaN
0 NaN NaN NaN 1.0
I know that I'm joining these wrong, but I cannot seem to figure out how it's done. Any advice?
This is because the series you are adding has an incompatible index. The original dataframe has ['a'] as the specified index and there is no index specified in the series. If you want to add a new column without specifying an index, the following will give you what you want:
df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)), columns = ['col1', 'col2', 'col3'], index = ['a'])
df2 = pd.DataFrame() # already exists elsewhere in code
df2 = df2.append([df]) # append the desired dataframe
df2['label'] = 1 # add a new column with the value 1 across all rows
print(df2.to_string())
col1 col2 col3 label
a 1 2 3 1
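Worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so a concat-based version of the same fix may age better. A sketch, using the question's frames:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([1, 2, 3]).reshape((1, 3)),
                  columns=['col1', 'col2', 'col3'], index=['a'])
df2 = pd.DataFrame()  # already exists elsewhere in code

df2 = pd.concat([df2, df])  # replaces df2.append([df])
df2['label'] = 1            # broadcast 1 across all rows

print(df2)
#    col1  col2  col3  label
# a     1     2     3      1
```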

cannot convert nan to int (but there are no nans)

I have a dataframe with a column of floats that I want to convert to int:
> df['VEHICLE_ID'].head()
0    8659366.0
1    8659368.0
2    8652175.0
3    8652174.0
4    8651488.0
In theory I should just be able to use:
> df['VEHICLE_ID'] = df['VEHICLE_ID'].astype(int)
But I get:
Output: ValueError: Cannot convert NA to integer
But I am pretty sure that there are no NaNs in this series:
> df['VEHICLE_ID'].fillna(999,inplace=True)
> df[df['VEHICLE_ID'] == 999]
> Output: Empty DataFrame
Columns: [VEHICLE_ID]
Index: []
What's going on?
Basically the error is telling you that you have NaN values, and I will show why your attempts didn't reveal this:
In [7]:
# setup some data
df = pd.DataFrame({'a':[1.0, np.NaN, 3.0, 4.0]})
df
Out[7]:
a
0 1.0
1 NaN
2 3.0
3 4.0
now try to cast:
df['a'].astype(int)
this raises:
ValueError: Cannot convert NA to integer
but then you tried something like this:
In [5]:
for index, row in df['a'].iteritems():
    if row == np.NaN:
        print('index:', index, 'isnull')
this printed nothing, but NaN cannot be evaluated like this using equality, in fact it has a special property that it will return False when comparing against itself:
In [6]:
for index, row in df['a'].iteritems():
    if row != row:
        print('index:', index, 'isnull')
index: 1 isnull
now it prints the row, you should use isnull for readability:
In [9]:
for index, row in df['a'].iteritems():
    if pd.isnull(row):
        print('index:', index, 'isnull')
index: 1 isnull
So what to do? We can drop the rows: df.dropna(subset=['a']), or we can replace using fillna:
In [8]:
df['a'].fillna(0).astype(int)
Out[8]:
0 1
1 0
2 3
3 4
Name: a, dtype: int32
When your series contains floats and NaNs and you want to convert to integers, the cast to a numpy integer fails because of the NA values.
DON'T DO:
df['VEHICLE_ID'] = df['VEHICLE_ID'].astype(int)
From pandas >= 0.24 there is a built-in nullable pandas integer dtype, which does allow integer NaNs. Notice the capital in 'Int64': this is the pandas integer, not the numpy integer.
SO, DO THIS:
df['VEHICLE_ID'] = df['VEHICLE_ID'].astype('Int64')
More info on pandas integer na values:
https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions
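To see the nullable dtype in action on data like the answer's earlier example (a quick sketch; the values are made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 4.0])

out = s.astype('Int64')  # capital I: pandas nullable integer
print(out)
# 0       1
# 1    <NA>
# 2       3
# 3       4
# dtype: Int64
```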

How to avoid SettingWithCopyWarning in pandas?

I want to convert the type of a column to int using pandas. Here's the source code:
# CustomerID is missing on several rows. Drop these rows and encode customer IDs as Integers.
cleaned_data = retail_data.loc[pd.isnull(retail_data.CustomerID) == False]
cleaned_data['CustomerID'] = cleaned_data.CustomerID.astype(int)
This raises the warning below:
SettingWithCopyWarning: A value is trying to be set on a copy of a
slice from a DataFrame
How can I avoid this warning? Is there a better way to convert the type of CustomerID to int? I'm on python 3.5.
Do the whole thing in a single loc assignment:
retail_data.loc[~retail_data.CustomerID.isnull(),'CustomerID'] = retail_data.loc[~retail_data.CustomerID.isnull(),'CustomerID'].astype(int)
Example:
import pandas as pd
import numpy as np
retail_data = pd.DataFrame(np.random.rand(4,1)*10, columns=['CustomerID'])
retail_data.iloc[2,0] = np.nan
print(retail_data)
CustomerID
0 9.872067
1 5.645863
2 NaN
3 9.008643
retail_data.loc[~retail_data.CustomerID.isnull(),'CustomerID'] = retail_data.loc[~retail_data.CustomerID.isnull(),'CustomerID'].astype(int)
CustomerID
0 9.0
1 5.0
2 NaN
3 9.0
You'll notice that the dtype of the column is still float, because the np.nan cannot be encoded in an int column.
If you really want to drop these rows without changing the underlying retail_data, make an actual copy():
cleaned_data = retail_data.loc[~retail_data.CustomerID.isnull()].copy()
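A compact sketch of that copy()-based route, which keeps the asker's two-step style while avoiding the warning (the CustomerID values here are invented for the demo):

```python
import numpy as np
import pandas as pd

retail_data = pd.DataFrame({'CustomerID': [9.8, 5.6, np.nan, 9.0]})

# .copy() makes cleaned_data an independent frame, so the later
# assignment is not a write into a view of retail_data
cleaned_data = retail_data.loc[~retail_data.CustomerID.isnull()].copy()
cleaned_data['CustomerID'] = cleaned_data.CustomerID.astype(int)

print(cleaned_data.CustomerID.tolist())  # [9, 5, 9]
```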