Manipulating data in DataFrame: how to calculate the square of a column

I would like to calculate the square of column A (1, 2, 3, 4), combine it with some further calculations, and store the result in column C:
using CSV, DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
df.C = ((((df.A./2).^2).*3.14)./1000)
Is there an easier way to write it?

I am not sure how much shorter you would want the formula to be, but you can write:
df.C = @. (df.A / 2) ^ 2 * 3.14 / 1000
to avoid having to write . everywhere.
Or you can use transform!, but it is not shorter (its benefit is that you can use it in a processing pipeline, e.g. using Pipe.jl):
transform!(df, :A => ByRow(a -> (a / 2) ^ 2 * 3.14 / 1000) => :C)
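To illustrate what the @. macro does, here is a minimal sketch on a plain range instead of a DataFrame column (the variable names are illustrative only). It checks that the hand-dotted expression from the question and the @. form give the same values:

```julia
A = 1:4

# Every operator dotted by hand, as in the question
explicit = ((A ./ 2) .^ 2 .* 3.14) ./ 1000

# @. adds the dots to every call and operator in the expression
with_macro = @. (A / 2)^2 * 3.14 / 1000

println(explicit ≈ with_macro)
```

The same right-hand side can be assigned to df.C when A is a DataFrame column.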

Try this:
df.D = 0.25 * df.A .^ 2 * 0.00314
Explanation:
(A / 2)^2 equals 0.25 * A^2, so fewer parentheses are needed (note that a coefficient of 0.5 in front of df.A .^ 2 would give A^2 / 2, which is twice the intended value)
multiplying a vector by a scalar is here as good as full vectorization for short vectors (up to something like 100 elements)
A simple benchmark using BenchmarkTools:
julia> @btime $df.E = 0.25 * $df.A .^ 2 * 0.00314;
592.085 ns (9 allocations: 496 bytes)
julia> @btime $df.F = @. ($df.A / 2) ^ 2 * 0.00314;
875.490 ns (11 allocations: 448 bytes)
The fastest, however, is a longer version where you provide the type information, @. (df.A::Vector{Int} / 2) ^ 2 * 0.00314 (again, this matters mostly for short DataFrames; note that the Z column must already exist, so we create it first):
julia> @btime begin $df.Z = Vector{Float64}(undef, nrow(df)); @. $df.Z = ($df.A::Vector{Int} / 2.0) ^ 2.0 * 0.00314; end;
162.564 ns (3 allocations: 208 bytes)

Related

How to change array elements that are greater than 5 to 5, in one line?

I would like to take an array x and change all numbers greater than 5 to 5. What is the standard way to do this in one line?
Below is some code that does this in several lines. This question on logical indexing is related but appears to concern selection rather than assignment.
Thanks
x = [1 2 6 7]
for i in 1:length(x)
    if x[i] >= 5
        x[i] = 5
    end
end
Desired output:
x = [1 2 5 5]
The broadcast operator . works with any function, including relational operators, and it also works with assignment. Hence an intuitive one-liner is:
x[x .> 5] .= 5
This part x .> 5 broadcasts > 5 over x, resulting in a vector of booleans indicating elements greater than 5. This part .= 5 broadcasts the assignment of 5 across all elements indicated by x[x .> 5].
However, inspired by the significant speed-up in Benoit's very cool answer below (please do check it out) I decided to also add an optimized variant with a speed test. The above approach, while very intuitive looking, is not optimal because it allocates a temporary array of booleans for the indices. A (more) optimal approach that avoids temporary allocation, and as a bonus will work for any predicate (conditional) function is:
function f_cond!(x::Vector{Int}, f::Function, val::Int)
    @inbounds for n in eachindex(x)
        f(x[n]) && (x[n] = val)
    end
    return x
end
So using this function we would write f_cond!(x, a->a>5, 5) which assigns 5 to any element for which the conditional (anonymous) function a->a>5 evaluates to true. Obviously this solution is not a neat one-liner, but check out the following speed tests:
julia> using BenchmarkTools
julia> x1 = rand(1:10, 100);
julia> x2 = copy(x1);
julia> @btime $x1[$x1 .> 5] .= 5;
327.862 ns (8 allocations: 336 bytes)
julia> @btime f_cond!($x2, a->a>5, 5);
15.067 ns (0 allocations: 0 bytes)
This is just ludicrously faster. Also, you can just replace Int with T<:Any. Given the speed-up, one might wonder if there is a function in Base that already does this. A one-liner is:
map!(a->a>5 ? 5 : a, x, x)
and while this significantly speeds up over the first approach, it falls well short of the second.
Incidentally, I felt certain this must be a duplicate to another StackOverflow question, but 5 minutes searching didn't reveal anything.
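The remark above about replacing Int with a type parameter could look roughly like the following sketch. The name f_cond_generic! is hypothetical (chosen here to avoid clashing with f_cond!), and the signature is one possible choice rather than the only one:

```julia
# Generic variant: works for any element type T, as long as val is also a T
function f_cond_generic!(x::AbstractVector{T}, f::Function, val::T) where {T}
    @inbounds for n in eachindex(x)
        f(x[n]) && (x[n] = val)   # overwrite only where the predicate holds
    end
    return x
end

ints = [1, 7, 3, 9]
floats = [1.5, 6.5, 5.0]
f_cond_generic!(ints, a -> a > 5, 5)
f_cond_generic!(floats, a -> a > 5, 5.0)
println(ints, "  ", floats)
```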
You can broadcast min as well:
x .= min.(x, 5)
Note that this is (slightly) more efficient than using x[x .> 5] .= 5 because it does not allocate the temporary array of Booleans, x .> 5, and it can be automatically vectorized, with a single pass over the memory (as per Oscar's comment below):
julia> using BenchmarkTools
julia> x = [1 2 6 7] ; @btime $x .= min.($x, 5) ; # fast, no allocations
19.144 ns (0 allocations: 0 bytes)
julia> x = [1 2 6 7] ; @btime $x[$x .> 5] .= 5 ; # slower, allocates
148.678 ns (5 allocations: 304 bytes)
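As a quick sanity check, the one-liners from the answers above can be run side by side on copies of the same input. This is only a correctness sketch with illustrative variable names, not a benchmark:

```julia
x = [1, 2, 6, 7]

a = copy(x)
a[a .> 5] .= 5                       # logical indexing plus broadcast assignment

b = copy(x)
map!(v -> v > 5 ? 5 : v, b, b)       # in-place map with a ternary

c = copy(x)
c .= min.(c, 5)                      # broadcast min, written back in place

println(a == b == c)
```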

How to handle missing in boolean context in Julia?

I'm trying to create a categorical variable based on ranges of values from another (numerical) column. However, the code doesn't work when I have missing values in the numerical column.
Here is a replicable example:
using RDatasets;
using DataFrames;
using Pipe;
using FreqTables;
df = dataset("datasets","iris")
#lowercase columns just for convenience
@pipe df |> rename!(_, [lowercase(k) for k in names(df)]);
#without this line, the code works fine
@pipe df |> allowmissing!(_, :sepallength) |> replace!(_.sepallength, 4.9 => missing);
df[:size] = @. ifelse(df[:sepallength]<=4.7, "small", missing)
df[:size] = @. ifelse((df[:sepallength]>4.7) & (df[:sepallength]<=4.9), "avg", df[:size])
df[:size] = @. ifelse((df[:sepallength]>4.9) & (df[:sepallength]<=5), "large", df[:size])
df[:size] = @. ifelse(df[:sepallength]>5, "huge", df[:size])
println(@pipe df |> freqtable(_, :size))
Output:
TypeError: non-boolean (Missing) used in boolean context
I would like to ignore the missing cases in the numerical variable, but I cannot just drop the missings because this would drop other important information in my dataset. Moreover, if I drop just the missings in sepallength, the column df[:size] would have a different length than the original dataframe.
Use the coalesce function like this:
julia> x = [1,2,3,missing,5,6,7]
7-element Array{Union{Missing, Int64},1}:
1
2
3
missing
5
6
7
julia> @. ifelse(coalesce(x < 4.7, false), "small", missing)
7-element Array{Union{Missing, String},1}:
"small"
"small"
"small"
missing
missing
missing
missing
As a side note do not write df[:size] (this syntax has been deprecated for over 2 years now and soon it will error) but rather df.size or df."size" to access the column of the data frame (the df."size" is for cases when your column names contain characters like spaces etc., e.g. df."my fancy column!").
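Applied to the four ranges from the question, the coalesce approach might look like the sketch below. It uses a plain vector s as a stand-in for the sepallength column (an assumption made to keep the example free of package dependencies); rows with a missing value simply keep a missing label:

```julia
# Stand-in for df.sepallength, with one missing value
s = [4.6, missing, 4.8, 5.0, 5.4]

# coalesce(cond, false) turns a missing comparison result into false,
# so ifelse always receives a Bool
labels = @. ifelse(coalesce(s <= 4.7, false), "small", missing)
labels = @. ifelse(coalesce((s > 4.7) & (s <= 4.9), false), "avg", labels)
labels = @. ifelse(coalesce((s > 4.9) & (s <= 5), false), "large", labels)
labels = @. ifelse(coalesce(s > 5, false), "huge", labels)

println(labels)
```

With a data frame, the same right-hand sides could be assigned to df.size, with df.sepallength in place of s.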
I think Bogumil's approach is correct and probably best for most situations, but one other option that I like to use is to define my own comparison operators that can deal with missings by returning false if a missing is encountered. Using the unicode capabilities of Julia makes this quite pleasant in my opinion:
julia> ==ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x == y;
julia> >=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x >= y;
julia> <=ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x <= y;
julia> <ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x < y;
julia> >ₘ(x, y) = ismissing(x) | ismissing(y) ? false : x > y;
julia> x = rand([missing; 1:10], 50)
julia> x .> 10
50-element Array{Union{Missing, Bool},1}
...
julia> x .>ₘ 10
50-element BitArray{1}
...
There are of course downsides to defining such an elementary operator in your own code, particularly using Unicode as well, in terms of your code being harder for other people to read (and potentially even to display correctly!), so I probably wouldn't advocate for this as the standard approach, or something to be used in library code. I do think though that for explorative work it makes life easier.

Julia InexactError: Int64

I am new to Julia and got this InexactError. I should mention that I tried converting to float beforehand and it did not work; maybe I am doing something wrong.
column = df[:, i]
max = maximum(column)
min = minimum(column)
scaled_column = (column .- min)/max # This is the error, I think
df[:, i] = scaled_column
julia> VERSION
v"1.4.2"
Hard to give a sure answer without a minimal working example of the problem, but in general an InexactError happens when you try to convert a value to an exact type (like integer types, but unlike floating-point types) in which the original value cannot be exactly represented. For example:
julia> convert(Int, 1.0)
1
julia> convert(Int, 1.5)
ERROR: InexactError: Int64(1.5)
Other programming languages arbitrarily chose some way of rounding here (often truncation but sometimes rounding to nearest). Julia doesn't guess and requires you to be explicit. If you want to round, truncate, take a ceiling, etc. you can:
julia> floor(Int, 1.5)
1
julia> round(Int, 1.5)
2
julia> ceil(Int, 1.5)
2
Back to your problem: you're not calling convert anywhere, so why are you getting a conversion error? There are various situations where Julia will automatically call convert for you, typically when you try to assign a value to a typed location. For example, if you have an array of Ints and you assign a floating point value into it, it will be converted automatically:
julia> v = [1, 2, 3]
3-element Array{Int64,1}:
1
2
3
julia> v[2] = 4.0
4.0
julia> v
3-element Array{Int64,1}:
1
4
3
julia> v[2] = 4.5
ERROR: InexactError: Int64(4.5)
So that's likely what's happening to you: you get non-integer values by doing (column .- min)/max and then you try to assign it into an integer location and you get the error.
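The situation described above can be reproduced with a plain integer vector standing in for the data frame column (the names below are illustrative, and this is a guess at what the original code ran into, since no minimal example was given). Writing the scaled values back into integer storage fails, while converting the storage to Float64 first works:

```julia
ints = [2, 4, 6, 10]                                  # Vector{Int64}
scaled = (ints .- minimum(ints)) ./ maximum(ints)     # Vector{Float64}

# Writing non-integer values into Int64 storage triggers the implicit convert
target = copy(ints)
err = try
    target .= scaled
    nothing
catch e
    e
end
println(typeof(err))

# Converting the storage to Float64 first avoids the problem
floats = Float64.(ints)
floats .= (floats .- minimum(floats)) ./ maximum(floats)
println(floats)
```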
As a side note you can use transform! to achieve what you want like this:
transform!(df, i => (x -> (x .- minimum(x)) ./ maximum(x)) => i)
and this operation will replace the column.

How to specify the format for printing an array of Floats in julia?

I have an array or matrix that I want to print, but only to three digits of precision. How do I do that? I tried the following.
> @printf("%.3f", rand())
0.742
> @printf("%.3f", rand(3))
LoadError: TypeError: non-boolean (Array{Bool,1}) used in boolean context
while loading In[13], in expression starting on line 1
Update: Ideally, I just want to call a function like printx("{.3f}", rand(m, n)) without having to further process my array or matrix.
The OP said:
Update: Ideally, I just want to call a function like printx("{.3f}", rand(m, n)) without having to further process my array or matrix.
This answer to a similar question suggests something like this:
julia> VERSION
v"1.0.0"
julia> using Printf
julia> m = 3; n = 5;
julia> A = rand(m, n)
3×5 Array{Float64,2}:
0.596055 0.0574471 0.122782 0.829356 0.226897
0.606948 0.0312382 0.244186 0.356534 0.786589
0.147872 0.61846 0.494186 0.970206 0.701587
# For this session of the REPL, redefine show function. Next REPL will be back to normal.
# Note the %1.3f spec in the printf format string to get 3 digits to the right of the decimal.
julia> Base.show(io::IO, f::Float64) = @printf(io, "%1.3f", f)
# Now we have the 3 digits to the right spec working in the REPL.
julia> A
3×5 Array{Float64,2}:
0.596 0.057 0.123 0.829 0.227
0.607 0.031 0.244 0.357 0.787
0.148 0.618 0.494 0.970 0.702
# The print function prints with 3 decimals as well, but note the semicolons for rows.
# This may not be what was wanted either, but could have a use.
julia> print(A)
[0.596 0.057 0.123 0.829 0.227; 0.607 0.031 0.244 0.357 0.787; 0.148 0.618 0.494 0.970 0.702]
How about this?
julia> print(round.(rand(3); digits=3))
[0.188,0.202,0.237]
I would do it this way:
julia> using Printf
julia> map(x -> @sprintf("%.3f",x), rand(3))
3-element Array{String,1}:
"0.471"
"0.252"
"0.090"
I don't think @printf accepts a list of arguments as you might be expecting.
One solution you could try is to use @sprintf to create formatted strings, but collect them up in a list comprehension. You might then use join to concatenate them together like so:
join([@sprintf "%3.2f" x for x in rand(3)], ", ")
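Combining the @sprintf and join ideas above, a small helper for matrices could be sketched as follows. The name print3 is hypothetical, the format is fixed at three decimals (the @sprintf macro needs a literal format string), and eachrow assumes Julia 1.1 or later:

```julia
using Printf

# Hypothetical helper: print each row of a matrix with three digits
# after the decimal point, without touching Base.show
function print3(io::IO, A::AbstractMatrix)
    for row in eachrow(A)
        println(io, join((@sprintf("%.3f", v) for v in row), "  "))
    end
end
print3(A::AbstractMatrix) = print3(stdout, A)

print3([1.0 2.5; 0.12345 10.0])
```

Unlike redefining Base.show for Float64, this leaves the display of every other float in the session unchanged.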

parse input to rational in Julia

I want to read the user input and store it as a Rational, whatever the type: integer, float or rational. For instance:
5 --> store it as 5//1
2.3 --> store it as 23//10
4//7 --> store it as 4//7
At the moment I wrote the following:
a = convert(Rational,parse(Float64,readline(STDIN)))
which is fine if I input an integer, like 5.
But if I input 2.3, a stores 2589569785738035//1125899906842624.
And if I input a fraction (whether in the form 4/7 or the form 4//7) I get an ArgumentError: invalid number format for Float64.
How to solve the Float&Rational problems?
One way is to parse the raw input to an Expr (symbols), eval the expression, convert it to a Float64 and use rationalize to simplify the rational generated:
julia> rationalize(convert(Float64, eval(parse("5"))))
5//1
julia> rationalize(convert(Float64, eval(parse("2.3"))))
23//10
julia> rationalize(convert(Float64, eval(parse("4/7"))))
4//7
julia> rationalize(convert(Float64, eval(parse("4//7"))))
4//7
rationalize works with approximate floating point number and you could specify the error in the parameter tol.
tested with Julia Version 0.4.3
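As a small illustration of the tol parameter mentioned above (the input value is arbitrary and chosen only for demonstration), a looser tolerance lets rationalize return a simpler fraction, as long as it stays within tol of the input:

```julia
x = 3.14159

tight = rationalize(x)             # default tolerance is eps(x)
loose = rationalize(x; tol=0.01)   # any rational within 0.01 of x is acceptable

println(tight, "  ", loose)
println(abs(loose - x) <= 0.01)
```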
Update: The parse method was deprecated in Julia version >= 1.0. There are two methods that should be used: Base.parse (just for numbers, and it requires a Type argument) and Meta.parse (for expressions):
julia> rationalize(convert(Float64, eval(parse(Int64, "5"))))
5//1
julia> rationalize(convert(Float64, eval(parse(Float64, "2.3"))))
23//10
Multiplication (then division) by a highly composite integer works pretty well.
julia> N = 2*2 * 3*3 * 5*5 * 7 * 11 * 13
900900
julia> a = round(Int, N * parse(Float64, "2.3")) // N
23//10
julia> a = round(Int, N * parse(Float64, "5")) // N
5//1
julia> a = round(Int, N * parse(Float64, "9.1111111111")) // N
82//9
You could implement your own parse:
function Base.parse(::Type{Rational{Int}}, x::ASCIIString)
    ms, ns = split(x, '/', keep=false)
    m = parse(Int, ms)
    n = parse(Int, ns)
    return m//n
end
Base.parse(::Type{Rational}, x::ASCIIString) = parse(Rational{Int}, x)
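For Julia 1.x, the ideas from the answers above can be combined into one function that avoids eval and does not add methods to Base.parse. This is a minimal sketch: parse_rational is a hypothetical name, it assumes the input is an integer, a decimal number, or a fraction written as 4/7 or 4//7, and it does no further validation:

```julia
# Hypothetical helper, not part of Base
function parse_rational(s::AbstractString)
    str = strip(s)
    if occursin("/", str)
        # "4/7" and "4//7" both split into ["4", "7"] when empty pieces are dropped
        num, den = split(str, '/'; keepempty=false)
        return parse(Int, num) // parse(Int, den)
    elseif occursin(".", str)
        # rationalize finds a simple fraction close to the parsed Float64
        return rationalize(Int, parse(Float64, str))
    else
        return parse(Int, str) // 1
    end
end

println(parse_rational("5"), "  ", parse_rational("2.3"), "  ", parse_rational("4//7"))
```

With user input, this could be called as parse_rational(readline()).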