Converting multiple types of values to integer values - pandas

I have a Pandas DataFrame in which I want to convert column values to integers. The information should be stored in meters but can be stored in kilometers as well, resulting in the following possible values:
23145 (correct)
23145.0 (.0 should be removed)
101.1 (should be multiplied *1000)
47,587 (should be multiplied *1000)
'No value known'
I tried different options for converting data types, but I always seem to break the existing integers and cannot check for them correctly because the column's dtype is 'object'. Sometimes faulty values or strings block the conversion as well.
Any ideas how to check whether a value is already an integer and leave it alone, remove the .0 where applicable, and multiply where applicable?
I also have some other columns with integers (e.g. 22321323) where a .0 is randomly appended (e.g. 22321323.0). How can I convert these values correctly so they do not include the .0?

If you use .apply on the column, you should be able to convert these values quite easily while branching on their type. For example:
import pandas as pd

def convert(x):
    if isinstance(x, int):
        return x
    elif isinstance(x, float):
        return int(x)
    else:
        # Defaults to 0 when not convertible
        print(x)
        return 0

df = pd.DataFrame({'col': [23145, 23145.0, 'No value known']})
df['col'] = df['col'].apply(convert)
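To also cover the kilometre and string cases from the question, here is a minimal sketch (not from the original answer); it assumes a comma is used as the decimal separator, that values below a hypothetical threshold of 1000 are kilometres, and that unparseable values default to 0:

import pandas as pd

KM_THRESHOLD = 1000  # assumption: values below this are kilometres

def convert_distance(x):
    # Integers are already metres: keep them as-is.
    if isinstance(x, int):
        return x
    # Strings: try to parse, treating a comma as a decimal separator.
    if isinstance(x, str):
        try:
            x = float(x.replace(',', '.'))
        except ValueError:
            return 0  # e.g. 'No value known'
    # Floats (including parsed strings): small values are kilometres -> metres.
    if isinstance(x, float):
        if x < KM_THRESHOLD:
            x *= 1000
        return int(round(x))
    return 0

df = pd.DataFrame({'col': [23145, 23145.0, 101.1, '47,587', 'No value known']})
df['col'] = df['col'].apply(convert_distance)
print(df['col'].tolist())  # [23145, 23145, 101100, 47587, 0]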

Related

Convert string (object datatype) to categorical data

I have had this issue in the past and hope you can help. I am trying to convert columns of 'object' datatype into categorical values.
So:
print(df_train['aus_heiz_befeuerung'].unique())
['Gas' 'Unbekannt' 'Alternativ' 'Öl' 'Elektro' 'Kohle']
These values from the above column should be converted to numbers, e.g. 1, 2, 4, 5, 3.
Unfortunately I cannot figure out how.
I have tried different astype versions and the following code block:
# string label to categorical values
from sklearn.preprocessing import LabelEncoder

for i in range(df_train.shape[1]):
    if df_train.iloc[:, i].dtypes == object:
        lbl = LabelEncoder()
        lbl.fit(list(df_train.iloc[:, i].values) + list(df_test.iloc[:, i].values))
        df_train.iloc[:, i] = lbl.transform(list(df_train.iloc[:, i].values))
        df_test.iloc[:, i] = lbl.transform(list(df_test.iloc[:, i].values))

print(df_train['aus_heiz_befeuerung'].unique())
It leads to :
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
So happy for all ideas.
You can use the pandas.Categorical() function to convert the values in a column to categorical values. For example, to convert the values in the aus_heiz_befeuerung column to categorical values, you can use the following code:
df_train['aus_heiz_befeuerung'] = pd.Categorical(df_train['aus_heiz_befeuerung'])
This assigns an integer code to each unique category in the column (the column itself still displays the category labels; the codes are exposed via .cat.codes, shown below). You can specify the order in which categories receive their codes by passing a list of category names to the categories parameter of the pandas.Categorical() function. For example, to assign the categories in the order given in your question ('Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle'), you can use the following code:
df_train['aus_heiz_befeuerung'] = pd.Categorical(df_train['aus_heiz_befeuerung'], categories=['Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle'])
After you have converted the values in the column to categorical values, you can use the cat.codes property to access the integer values that have been assigned to each category. For example:
df_train['aus_heiz_befeuerung'].cat.codes
This will return a pandas.Series object containing the integer values that have been assigned to each category in the aus_heiz_befeuerung column.
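Putting this together, a small runnable sketch (the sample rows and the heiz_code column name are just for illustration):

import pandas as pd

# Hypothetical sample using the categories from the question
df_train = pd.DataFrame({'aus_heiz_befeuerung': ['Gas', 'Unbekannt', 'Öl', 'Gas', 'Kohle']})

df_train['aus_heiz_befeuerung'] = pd.Categorical(
    df_train['aus_heiz_befeuerung'],
    categories=['Gas', 'Unbekannt', 'Alternativ', 'Öl', 'Elektro', 'Kohle'],
)

# Codes follow the order of the categories list: Gas=0, Unbekannt=1, Alternativ=2, ...
df_train['heiz_code'] = df_train['aus_heiz_befeuerung'].cat.codes
print(df_train)
#   aus_heiz_befeuerung  heiz_code
# 0                 Gas          0
# 1           Unbekannt          1
# 2                  Öl          3
# 3                 Gas          0
# 4               Kohle          5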

Round values of a varying quantity of columns on Databricks Scala

I am using Scala on Databricks and:
I have a dataframe that has N columns.
All but the first Y columns are of the type "float" and have numbers that I want to round to 0 decimals.
I don't want to write to each column that needs to be rounded one specific line of code, because there may be a lot of columns that will need to be rounded and they vary.
In order to do that, I tried to create a function with Map (not sure if it is the best option):
def roundValues(precision: Int)(df: DataFrame): DataFrame = {
  val roundedCols = df.columns.map(c => round(col(c), precision).as(c))
  df.select(roundedCols: _*)
}

df.transform(roundValues(0))
But I always get an error because the first Y columns are strings, dates, or other types.
My questions:
How can I round the values on all of the necessary columns?
The number of Y columns in the beginning may vary, as well as the number of N-Y columns that I need to round. Is there a way for me not to have to manually insert the name of the columns that will need to be rounded? (ex.: round only the columns of the type float, ignore all other)
In the end, should I convert from float to other type? I am going to use the final dataframe to do some plots or some simple calculations. I won't need decimals anymore for these things.
You can get datatype information from dataframe schema:
import org.apache.spark.sql.types.FloatType

val floatColumns = df.schema.fields.filter(_.dataType == FloatType).map(_.name)

val selectExpr = df.columns.map(c =>
  if (floatColumns.contains(c)) round(col(c), 0).as(c)
  else col(c)
)

val df1 = df.select(selectExpr: _*)

Convert a float column with nan to int pandas

I am trying to convert a float pandas column with nans to int format, using apply.
I would like to use something like this:
df.col = df.col.apply(to_integer)
where the function to_integer is given by
def to_integer(x):
    if np.isnan(x):
        return np.NaN
    else:
        return int(x)
However, when I attempt to apply it, the column remains the same.
How could I achieve this without having to use the standard technique of dtypes?
You can't have NaN in an int column; NaN is a float (unless you use an object dtype, which is not a good idea since you'll lose many vectorized operations).
You can however use the new nullable integer type (NA).
Conversion can be done with convert_dtypes:
df = pd.DataFrame({'col': [1, 2, None]})
df = df.convert_dtypes()
# type(df.at[0, 'col'])
# numpy.int64
# type(df.at[2, 'col'])
# pandas._libs.missing.NAType
output:

   col
0    1
1    2
2  <NA>
Not sure how you would achieve this without using dtypes. Sometimes when loading in data, you may have a column that contains mixed dtypes. Loading in a column with one dtype and attempting to turn it into mixed dtypes is not possible though (at least, not that I know of).
So I will echo what @mozway said and suggest you use nullable integer data types, e.g.:
df['col'] = df['col'].astype('Int64')
(note the capital I)
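For illustration, a minimal sketch of that conversion (assuming a reasonably recent pandas version, where a float column containing NaN can be cast directly to the nullable dtype):

import pandas as pd

df = pd.DataFrame({'col': [1.0, 2.0, None]})

# 'Int64' (capital I) is pandas' nullable integer dtype; NaN becomes <NA>
df['col'] = df['col'].astype('Int64')
print(df['col'])
# 0       1
# 1       2
# 2    <NA>
# Name: col, dtype: Int64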

How to define a round function like pandas round that executes in one line of code

Goal
Only one line to execute.
I am referring to the round function from this post, but I want to use it like df.round(2): it should change the affected columns and keep the order of the data, without requiring me to select float or int columns first.
df.applymap(myfunction) raises TypeError: must be real number, not str, which means I have to select the column types first.
Try
I looked at the round source code but could not understand how to adapt my function.
Firstly get the columns where values are float:
cols=df.select_dtypes('float').columns
Finally:
df[cols]=df[cols].agg(round,ndigits=2)
If you want to make changes in the function then add if/else condition:
from numpy import ceil, floor

def float_round(num, places=2, direction=ceil):
    if isinstance(num, float):
        return direction(num * (10 ** places)) / float(10 ** places)
    else:
        return num

out = df.applymap(float_round)
With the error message you mention, it's likely the column is already a string, and needs to be converted to some numeric type.
Assuming now that the column is numeric, there are a few ways you could implement a custom rounding function that don't require reimplementing the .round() method of a dataframe object.
With the requirements you laid above, we want a way to round a data frame that:
fits on one line
doesn't require selecting numeric type
There are two ways we could do this that are functionally equivalent. One is to treat the dataframe as an argument to a function that is safe for numpy arrays.
Another is to use the apply method (explanation here) which applies a function to a row or a column.
import pandas as pd
import numpy as np
from numpy import ceil
# generate a 100x10 dataframe with a null value
data = np.random.random(1000) * 10
data = data.reshape(100,10)
data[0, 0] = np.nan
df = pd.DataFrame(data)
# changing data type of the second column
df[1] = df[1].astype(int)
# verify dtypes are different
print(df.dtypes)
# taken from other stack post
def float_round(num, places=2, direction=ceil):
    return direction(num * (10 ** places)) / float(10 ** places)
# method 1 - use the dataframe as an argument
result1 = float_round(df)
print(result1.head())
# method 2 - apply
result2 = df.apply(float_round)
print(result2)
Because apply is applied row or column-wise, you can specify logic in your round function to ignore non-numeric columns. For instance:
# taken from other stack post
def float_round(num, places=2, direction=ceil):
    # check the dtype of a specific column; pass object columns through unchanged
    if num.dtype == 'O':
        return num
    return direction(num * (10 ** places)) / float(10 ** places)
# this will work, method 1 will fail
result2 = df.apply(float_round)
print(result2)
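As a one-line alternative (not from the original answers, just a sketch that leans on pandas' own Series.round), you could also round only the float columns and pass everything else through unchanged:

# Round float columns with Series.round(), leave other dtypes untouched
result3 = df.apply(lambda col: col.round(2) if col.dtype.kind == 'f' else col)
print(result3.head())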

Finding mismatches in a column with decimal values using Pandas

I have 2 csv files with the same headers. I merged them on their primary keys. Now, from the merged file, I need to create another file with the data that has all matching values and any mismatch at the 7th decimal place for col1 and col2, which are float columns. What is the best way to do that?
Generate some data that matches the shape you describe.
A simple approach: test equality of the rounded numbers, then write the result with to_csv().
A sample of 5 rows is included.
import numpy as np
import pandas as pd
from pathlib import Path

b = np.random.randint(1, 100, 100)
df1 = pd.DataFrame(b + np.random.uniform(10**-8, 10**-7, 100), columns=["col1"])
df2 = pd.DataFrame(b + np.random.uniform(10**-8, 10**-7, 100), columns=["col2"])

fn = Path.cwd().joinpath("SO_same.csv")
df1.join(df2).assign(eq7dp=lambda df: df.col1.round(7).eq(df.col2.round(7))).head(5).to_csv(fn)

with open(fn) as f:
    contents = f.read()
print(contents)
output
,col1,col2,eq7dp
0,37.00000005733964,37.00000002893621,False
1,46.00000001386966,46.00000008236663,False
2,99.00000007870301,99.00000007452154,True
3,42.00000001906606,42.00000001278533,True
4,79.00000007529009,79.00000007372863,True
supplement
In the comments you note that you want to use an np.where() expression to select col1 where equal, else False. You need to ensure that the 2nd and 3rd parameters to np.where() are compatible. NB: False becomes zero when converted to an int/float.
df1.join(df2).assign(
    eq7dp=lambda df: df.col1.round(7).eq(df.col2.round(7)),
    col3=lambda df: np.where(df.col1.round(7).eq(df.col2.round(7)), df.col1, np.full(len(df), False)),
)
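If the goal is also to write the mismatching rows to their own file, a short sketch along the same lines (the SO_mismatch.csv filename is just an example):

# Keep only the rows where col1 and col2 disagree at the 7th decimal place
merged = df1.join(df2)
mismatch = merged[~merged.col1.round(7).eq(merged.col2.round(7))]
mismatch.to_csv(Path.cwd().joinpath("SO_mismatch.csv"))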