Julia InexactError: Int64 - dataframe

I am new to Julia and got this InexactError. I tried converting the column to float beforehand and it did not work, so maybe I am doing something wrong.
column = df[:, i]
max = maximum(column)
min = minimum(column)
scaled_column = (column .- min)/max # This is the error, I think
df[:, i] = scaled_column
julia> VERSION
v"1.4.2"

Hard to give a sure answer without a minimal working example of the problem, but in general an InexactError happens when you try to convert a value to an exact type (like integer types, but unlike floating-point types) in which the original value cannot be exactly represented. For example:
julia> convert(Int, 1.0)
1
julia> convert(Int, 1.5)
ERROR: InexactError: Int64(1.5)
Many other programming languages arbitrarily choose some way of rounding here (often truncation but sometimes rounding to nearest). Julia doesn't guess and requires you to be explicit. If you want to round, truncate, take a ceiling, etc. you can:
julia> floor(Int, 1.5)
1
julia> round(Int, 1.5)
2
julia> ceil(Int, 1.5)
2
Back to your problem: you're not calling convert anywhere, so why are you getting a conversion error? There are various situations where Julia will automatically call convert for you, typically when you try to assign a value to a typed location. For example, if you have an array of Ints and you assign a floating point value into it, it will be converted automatically:
julia> v = [1, 2, 3]
3-element Array{Int64,1}:
1
2
3
julia> v[2] = 4.0
4.0
julia> v
3-element Array{Int64,1}:
1
4
3
julia> v[2] = 4.5
ERROR: InexactError: Int64(4.5)
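So that's likely what's happening to you: you get non-integer values by doing (column .- min)/max, then you try to assign them into an integer location and you get the error. In DataFrames.jl, df[:, i] = scaled_column writes into the existing column in place, so the values are converted to that column's element type; df[!, i] = scaled_column replaces the column instead and does not require the conversion.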

As a side note you can use transform! to achieve what you want like this:
transform!(df, i => (x -> (x .- minimum(x)) ./ maximum(x)) => i)
and this operation will replace the column.

Related

Why is the precision of numpy output different each time?

Run the following program:
import numpy as np
for i in range(10):
    a = np.random.uniform(0, 1)
    print(a)
We have the result:
0.4418517709510906
0.05536715253773261
0.44633855235431785
0.3143041997189251
0.16175184090609163
0.8822875281567105
0.11367473012241913
0.9951703577237277
0.009103257465210124
0.5185580156093157
Why is the precision of each output different? Sometimes accurate to 16 decimal places, but sometimes accurate to 18 decimal places. Why does this happen?
Also, if I want to control the precision of the output, i.e., only 15 decimal places are output each time, how can I do this?
Edit: I tried np.set_printoptions(precision=15):
np.set_printoptions(precision=15)
for i in range(10):
    a = np.random.uniform(0, 1)
    print(a)
But the output is:
0.3908531691561824
0.6363290508517755
0.3484260990246082
0.23792451272035053
0.5776808805593472
0.3631616619602701
0.878754651138258
0.6266540814279749
0.8309347174000745
0.5763464514883537
This still doesn't get the result I want. The result I want is something like below:
0.390853169156182
0.636329050851775
0.348426099024608
0.237924512720350
0.577680880559347
0.363161661960270
0.878754651138258
0.626654081427974
0.830934717400074
0.576346451488353
print(a) prints the shortest numeric string that yields the same float64 value as a.
Example:
a = 0.392820481778549002
b = 0.392820481778549
a_bits = np.asarray(a).view(np.int64).item()
b_bits = np.asarray(b).view(np.int64).item()
print(f"{a:.18f}", hex(a_bits))
print(f"{b:.18f}", hex(b_bits))
print(a == b)
Result:
0.392820481778549002 0x3fd923f8849c0570
0.392820481778549002 0x3fd923f8849c0570
True
You can use the f"{a:.18f}" syntax to get fixed-width output. The equivalent for numpy arrays is np.set_printoptions(precision=18, floatmode="fixed").
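A minimal sketch of the difference, in plain Python with no NumPy required. The helper name fixed is made up for this example. Note that a format spec rounds rather than truncates, so the last digit can differ from the hand-truncated values listed in the question.

```python
def fixed(x, digits=15):
    """Format x with exactly `digits` decimal places (rounded, not truncated)."""
    return f"{x:.{digits}f}"

a = 0.392820481778549002
b = 0.392820481778549

# Both literals map to the same float64, so their shortest repr is identical.
assert a == b
print(repr(a))     # shortest string that round-trips to the same float
print(fixed(a))    # always 15 decimal places
print(fixed(0.5))  # short values are padded with trailing zeros
```

For a whole array, the floatmode="fixed" print option mentioned above plays the same role as the format spec does for a single value.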

Vectorize a function with a condition

I would like to vectorize a function with a condition, meaning to calculate its values with array arithmetic. np.vectorize handles vectorization, but it does not work with array arithmetic, so it is not a complete solution.
An answer was given as the solution in the question "How to vectorize a function which contains an if statement?" but did not prevent errors here; see the MWE below.
import numpy as np
def myfx(x):
    return np.where(x < 1.1, 1, np.arcsin(1 / x))
x = np.linspace(0, 2, 5)  # assumed sample input; x is not defined in the original post
y = myfx(x)
This runs but raises the following warnings:
<stdin>:2: RuntimeWarning: divide by zero encountered in true_divide
<stdin>:2: RuntimeWarning: invalid value encountered in arcsin
What is the problem, or is there a better way to do this?
I think this could be done by:
1. Getting the indices ks of x for which x[k] > 1.1 for each k in ks.
2. Applying np.arcsin(1 / x[ks]) to the slice x[ks], and using 1 for the rest of the elements.
3. Recombining the arrays.
I am not sure about the efficiency, though.
The statement np.where(x < 1.1, 1, np.arcsin(1 / x)) is equivalent to
mask = x < 1.1
a = 1
b = np.arcsin(1 / x)
np.where(mask, a, b)
Notice that you're calling np.arcsin on all the elements of x, regardless of whether 1 / x <= 1 or not. Your basic plan is correct. You can do the operations in-place on an output array using the where keyword of np.arcsin and np.reciprocal, without having to recombine anything:
def myfx(x):
    mask = (x >= 1.1)
    out = np.ones(x.shape)
    np.reciprocal(x, where=mask, out=out)  # >= 1.1 implies != 0
    return np.arcsin(out, where=mask, out=out)
Using np.ones ensures that the unmasked elements of out are initialized correctly. An equivalent method would be
out = np.empty(x.shape)
out[~mask] = 1
You can always find an arithmetic expression that prevents the "divide by zero".
Example:
def myfx(x):
    return np.where(x < 1.1, 1, np.arcsin(1 / np.maximum(x, 1.1)))
The values computed in the right-hand branch where x < 1.1 are never used, so evaluating np.arcsin(1/1.1) there is harmless.
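The reason both approaches are needed can be seen in a scalar, pure-Python analogue. This is only an illustration: select is a made-up stand-in for np.where, and math.asin raises an exception where NumPy would emit a warning. Function arguments are evaluated before the call, so the unused branch still runs unless its input is made safe first.

```python
import math

def select(cond, a, b):
    """Hypothetical scalar stand-in for np.where: a and b are already computed."""
    return a if cond else b

def myfx_unsafe(x):
    # 1 / x is evaluated even when the condition picks the constant branch.
    return select(x < 1.1, 1.0, math.asin(1 / x))

def myfx_clamped(x):
    # max(x, 1.1) keeps the argument of asin within (0, 1/1.1], so nothing fails.
    return select(x < 1.1, 1.0, math.asin(1 / max(x, 1.1)))

try:
    myfx_unsafe(0.0)
except ZeroDivisionError as err:
    print("unsafe version failed:", err)

print(myfx_clamped(0.0))  # constant branch selected
print(myfx_clamped(2.0))  # asin(1/2)
```

Clamping costs one extra elementwise operation but needs no mask bookkeeping, which is the trade-off against the where/out version.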

How to handle missing in boolean context in Julia?

I'm trying to create a categorical variable based on ranges of values from another (numerical) column. However, the code doesn't work when I have missing values in the numerical column.
Here is a replicable example:
using RDatasets;
using DataFrames;
using Pipe;
using FreqTables;
df = dataset("datasets","iris")
#lowercase columns just for convenience
@pipe df |> rename!(_, [lowercase(k) for k in names(df)]);
#without this line, the code works fine
@pipe df |> allowmissing!(_, :sepallength) |> replace!(_.sepallength, 4.9 => missing);
df[:size] = @. ifelse(df[:sepallength]<=4.7, "small", missing)
df[:size] = @. ifelse((df[:sepallength]>4.7) & (df[:sepallength]<=4.9), "avg", df[:size])
df[:size] = @. ifelse((df[:sepallength]>4.9) & (df[:sepallength]<=5), "large", df[:size])
df[:size] = @. ifelse(df[:sepallength]>5, "huge", df[:size])
println(@pipe df |> freqtable(_, :size))
Output:
TypeError: non-boolean (Missing) used in boolean context
I would like to ignore the missing cases in the numerical variable, but I cannot just drop the missings because that would drop other important information in my dataset. Moreover, if I drop just the missings in sepallength, the column df[:size] would have a different length than the original dataframe.
Use the coalesce function like this:
julia> x = [1,2,3,missing,5,6,7]
7-element Array{Union{Missing, Int64},1}:
1
2
3
missing
5
6
7
julia> @. ifelse(coalesce(x < 4.7, false), "small", missing)
7-element Array{Union{Missing, String},1}:
"small"
"small"
"small"
missing
missing
missing
missing
As a side note do not write df[:size] (this syntax has been deprecated for over 2 years now and soon it will error) but rather df.size or df."size" to access the column of the data frame (the df."size" is for cases when your column names contain characters like spaces etc., e.g. df."my fancy column!").
I think Bogumil's approach is correct and probably best for most situations, but one other option that I like to use is to define my own comparison operators that can deal with missings by returning false if a missing is encountered. Using the unicode capabilities of Julia makes this quite pleasant in my opinion:
julia> ==ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x == y;
julia> >=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x >= y;
julia> <=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x <= y;
julia> <ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x < y;
julia> >ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x > y;
julia> x = rand([missing; 1:10], 50)
julia> x .> 10
50-element Array{Union{Missing, Bool},1}
...
julia> x .>ₘ 10
50-element BitArray{1}
...
There are of course downsides to defining such an elementary operator in your own code, particularly using Unicode as well, in terms of your code being harder for other people to read (and potentially even to display correctly!), so I probably wouldn't advocate for this as the standard approach, or something to be used in library code. I do think though that for explorative work it makes life easier.

How to specify the format for printing an array of Floats in julia?

I have an array or matrix that I want to print, but only to three digits of precision. How do I do that? I tried the following.
> @printf("%.3f", rand())
0.742
> @printf("%.3f", rand(3))
LoadError: TypeError: non-boolean (Array{Bool,1}) used in boolean context
while loading In[13], in expression starting on line 1
Update: Ideally, I just want to call a function like printx("{.3f}", rand(m, n)) without having to further process my array or matrix.
The OP said:
Update: Ideally, I just want to call a function like printx("{.3f}", rand(m, n)) without having to further process my array or matrix.
An answer to a similar question suggests something like this:
julia> VERSION
v"1.0.0"
julia> using Printf
julia> m = 3; n = 5;
julia> A = rand(m, n)
3×5 Array{Float64,2}:
0.596055 0.0574471 0.122782 0.829356 0.226897
0.606948 0.0312382 0.244186 0.356534 0.786589
0.147872 0.61846 0.494186 0.970206 0.701587
# For this session of the REPL, redefine show function. Next REPL will be back to normal.
# Note the %1.3f spec in the printf format string to get 3 digits to the right of the decimal point.
julia> Base.show(io::IO, f::Float64) = @printf(io, "%1.3f", f)
# Now we have the 3 digits to the right spec working in the REPL.
julia> A
3×5 Array{Float64,2}:
0.596 0.057 0.123 0.829 0.227
0.607 0.031 0.244 0.357 0.787
0.148 0.618 0.494 0.970 0.702
# The print function prints with 3 decimals as well, but note the semicolons for rows.
# This may not be what was wanted either, but could have a use.
julia> print(A)
[0.596 0.057 0.123 0.829 0.227; 0.607 0.031 0.244 0.357 0.787; 0.148 0.618 0.494 0.970 0.702]
How about this?
julia> print(round.(rand(3); digits=3))
[0.188,0.202,0.237]
I would do it this way:
julia> using Printf
julia> map(x -> @sprintf("%.3f",x), rand(3))
3-element Array{String,1}:
"0.471"
"0.252"
"0.090"
I don't think @printf accepts a list of arguments as you might be expecting.
One solution you could try is to use @sprintf to create formatted strings, but collect them up in a list comprehension. You might then use join to concatenate them together like so:
join([@sprintf "%3.2f" x for x in rand(3)], ", ")

parse input to rational in Julia

I want to read the user input and store it as a Rational, whatever the type: integer, float or rational. For instance:
5 --> store it as 5//1
2.3 --> store it as 23//10
4//7 --> store it as 4//7
At the moment I wrote the following:
a = convert(Rational,parse(Float64,readline(STDIN)))
which is fine if I input an integer, like 5.
But if I input 2.3, a stores 2589569785738035//1125899906842624.
And if I input a fraction (whether in the form 4/7 or the form 4//7) I get an ArgumentError: invalid number format for Float64.
How can I handle the float and rational cases?
One way is to parse the raw input to an Expr (symbols), eval the expression, convert it to a Float64 and use rationalize to simplify the rational generated:
julia> rationalize(convert(Float64, eval(parse("5"))))
5//1
julia> rationalize(convert(Float64, eval(parse("2.3"))))
23//10
julia> rationalize(convert(Float64, eval(parse("4/7"))))
4//7
julia> rationalize(convert(Float64, eval(parse("4//7"))))
4//7
rationalize works with approximate floating-point numbers, and you can specify the allowed error with the tol parameter.
Tested with Julia version 0.4.3.
Update: In Julia >= 1.0, the one-argument parse(str) used above for expressions is no longer available. There are two methods that should be used instead: Base.parse (just for numbers, and it requires a Type argument) and Meta.parse (for expressions):
julia> rationalize(convert(Float64, eval(parse(Int64, "5"))))
5//1
julia> rationalize(convert(Float64, eval(parse(Float64, "2.3"))))
23//10
Multiplication (then division) by a highly composite integer works pretty well.
julia> N = 2*2 * 3*3 * 5*5 * 7 * 11 * 13
900900
julia> a = round(Int, N * parse(Float64, "2.3")) // N
23//10
julia> a = round(Int, N * parse(Float64, "5")) // N
5//1
julia> a = round(Int, N * parse(Float64, "9.1111111111")) // N
82//9
You could implement your own parse:
# Written for Julia 0.4; on Julia >= 1.0 use AbstractString and keepempty=false instead.
function Base.parse(::Type{Rational{Int}}, x::ASCIIString)
    ms, ns = split(x, '/', keep=false)
    m = parse(Int, ms)
    n = parse(Int, ns)
    return m//n
end
Base.parse(::Type{Rational}, x::ASCIIString) = parse(Rational{Int}, x)