I would like to concatenate anonymous calls in Julia - dataframe

So, I'm learning more about Julia and I would like to do the following:
I have a 2-row by 3-column matrix, which is fixed:
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.705942 0.553562 0.731246
0.205833 0.106978 0.131893
Then, I would like to have an anonymous function, which does the following:
a = ones(1,3);
a[2] = rand();
Finally, I would like to broadcast
broadcast(+, ones(1,3) => a[2]=rand(), A)
The goal is to add two different random numbers to the middle column of A, i.e., A[:,2], and to add ones to the remaining columns.
EDIT:
If I add a, as it is:
julia> a = ones(1,3)
1×3 Matrix{Float64}:
1.0 1.0 1.0
julia> a[2] = rand()
0.664824196431979
julia> a
1×3 Matrix{Float64}:
1.0 0.664824 1.0
I would like a to be dynamic, i.e., a function.
So that:
broadcast(+, a, A)
Would give:
julia> broadcast(+, a, A)
2×3 Matrix{Float64}:
1.70594 0.553562 + rand() (correct) 1.73125
1.20583 0.106978 + rand() (different rand()) 1.13189
Instead of:
julia> broadcast(+, a, A)
2×3 Matrix{Float64}:
1.70594 1.21839 (0.553562 + -> 0.664824) 1.73125
1.20583 0.771802 (0.106978 + -> 0.664824) 1.13189
So, I thought of this pseudo-code:
broadcast(+, a=ones(1,3) => a[2]=rand(), A)
Formalizing:
broadcast(+, <anonymous-function>, A)
Second EDIT:
Rules/Constraints:
Rule 1: the call must be data-transparent. That is, A must not change state, just like when we call f.(A).
Rule 2: no auxiliary variable may be created (a must not exist). The only array that must exist, before and after the call, is A.
Rule 3: f.(A) must be anonymous; that is, you can't define f as function f(A) ... end

With the caveat that I don't know how much you really learn by setting artificial rules like this, some tidier ways are:
julia> A = [ 0.705942 0.553562 0.731246
0.205833 0.106978 0.131893 ]; # as given
julia> r = 0.664824196431979; # the one random number
julia> (A' .+ (1, r, 1))' # no extra vector
2×3 adjoint(::Matrix{Float64}) with eltype Float64:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
julia> mapslices(row -> row .+ (1, r, 1), A; dims=2) # one line, but slow
2×3 Matrix{Float64}:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
julia> B = A .+ 1; @views B[:, 2] .+= (-1 + r); B # fast, no extra allocations
2×3 Matrix{Float64}:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
I can't tell from your question whether you want one random number or two different ones. If you want two, then you can do this:
julia> using Random
julia> Random.seed!(1); mapslices(row -> row .+ (1, rand(), 1), A; dims=2)
2×3 Matrix{Float64}:
1.70594 0.675436 1.73125
1.20583 0.771383 1.13189
julia> Random.seed!(1); B = A .+ 1; @views B[:, 2] .+= (-1 .+ rand.()); B
2×3 Matrix{Float64}:
1.70594 0.675436 1.73125
1.20583 0.771383 1.13189
Note that (-1 .+ rand.()) isn't making a new array on the right; it's fused by .+= into one loop over a column of B. Note also that B[:,2] .= stuff just writes into B, but B[:, 2] .+= stuff means B[:, 2] .= B[:, 2] .+ stuff and so, without @views, the slice B[:, 2] on the right would allocate a copy.
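To make the fused version easier to check, here is a small sketch using only Base and the Random standard library. The names A_before and offsets are introduced purely for illustration; the seed is there only so the run is repeatable.

```julia
using Random

Random.seed!(1)                       # only to make the run repeatable
A = rand(2, 3)
A_before = copy(A)                    # illustration only: lets us confirm A is untouched

B = A .+ 1                            # a new matrix; A itself is not modified
@views B[:, 2] .+= (-1 .+ rand.())    # fused loop: rand() is called once per element

# What was effectively added to the middle column, row by row.
offsets = B[:, 2] .- A[:, 2]
```

The outer columns of B equal A .+ 1, while each entry of offsets should be a different number in [0, 1), up to floating-point rounding.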

First, I'd like to say that the approach taken in the other answers is the most performant one. It seems that you want the entire matrix at the end; in that case, for the best performance, it is generally good to get data (like randomness) in big batches and not to "hide" data from the compiler (especially type information). A lot of interesting things can be achieved with higher-level abstractions, but since you say performance is important, let's establish a baseline:
function approach1(A)
a = ones(2,3)
@. a[:, 2] = rand()
broadcast(+, a, A)
end
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.199619 0.273481 0.99254
0.0927839 0.179071 0.188591
julia> @btime approach1($A)
65.420 ns (2 allocations: 256 bytes)
2×3 Matrix{Float64}:
1.19962 0.968391 1.99254
1.09278 1.14451 1.18859
With that out of the way let's try some other solutions.
If a single row with lazy elements doesn't count as an auxiliary variable this seems like a good starting point:
function approach2(A)
a = Matrix{Function}(undef, 1, 3)
fill!(a, ()->1.0)
a[2] = rand
broadcast((a,b)->a() + b, a, A)
end
We get a row a = [()->1.0 rand ()->1.0] and evaluate each function when the broadcast gets that element.
julia> @btime approach2($A)
1.264 μs (24 allocations: 960 bytes)
The performance is 20 times worse. Why? We've hidden type information from the compiler: it can't tell that a() is a Float64. Just asserting this (changing the last line to broadcast((a,b)->a()::Float64 + b, a, A)) increases the performance almost tenfold:
julia> @btime approach2($A)
164.108 ns (14 allocations: 432 bytes)
If this is acceptable we can make it cleaner: introduce a LazyNumber type that keeps track of the return type, and has promote rules/operators so we can get back to broadcast(+, ...). However, we are still 2-3 times slower, can we do better?
An approach that could allow us to squeeze out some more would be to represent the whole array lazily: something like a Fill type, or a LazySetItem that applies on top of a matrix. Once again, actually creating the array will be cheaper unless you can avoid getting parts of the array.
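For comparison, here is a different, non-lazy route that also avoids an array of functions: map an anonymous function over the indices of A. This is only a sketch and was not benchmarked here; the name A_before is introduced solely to show that A is left unchanged.

```julia
A = rand(2, 3)
A_before = copy(A)   # illustration only: lets us confirm A is untouched

# Anonymous function over the indices of A: the middle column gets a fresh
# rand() per element, every other column gets 1.0. No auxiliary array `a` is built.
B = map(ci -> A[ci] + (ci[2] == 2 ? rand() : 1.0), CartesianIndices(A))
```

Because both branches of the ternary return a Float64, the compiler can infer the element type of B, which is the information approach2 hid.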

I agree that it is not very clear what you are trying to achieve, or even whether you want to learn how to achieve something or how something works in theory.
If all you want is just to add a random vector to a matrix column (and 1 elsewhere), it is as simple as... add a random vector to the desired matrix column and 1 elsewhere:
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.94194 0.691855 0.583107
0.198166 0.740017 0.914133
julia> A[:,[1,3]] .+= 1
2×2 view(::Matrix{Float64}, :, [1, 3]) with eltype Float64:
1.94194 1.58311
1.19817 1.91413
julia> A[:,2] += rand(size(A,1))
2-element Vector{Float64}:
1.0306116987831297
0.8757712661515558
julia> A
2×3 Matrix{Float64}:
1.94194 1.03061 1.58311
1.19817 0.875771 1.91413

Why not just have a be the same size as A, and then you don't even need broadcasting or any weird tricks:
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.564824 0.765611 0.841353
0.566965 0.322331 0.109889
julia> a = ones(2,3);
julia> a[:, 2] .= [rand() for _ in 1:size(a, 1)] #in every row, make the second column's value a different rand() result
julia> a
2×3 Matrix{Float64}:
1.0 0.519228 1.0
1.0 0.0804104 1.0
julia> A + a
2×3 Matrix{Float64}:
1.56482 1.28484 1.84135
1.56696 0.402741 1.10989

Related

Transform dataframe columns using column selector (Cols) fails

I am wondering why I cannot use the Cols column selector in transform to change a dataframe column. For instance:
df = DataFrame(x = 1:5, y = 6:10)
transform(df, [:x, :y] .=> v -> v .+ 100) # OK
df[!, Cols(1:2)] .= df[!, Cols(1:2)] .+ 100 # OK
transform(df, Cols(1:2) .=> v -> v .+ 100) # MethodError: no method matching length(::Cols{Tuple{UnitRange{Int64}}})
I've read in the DataFrames documentation that column selectors such as Cols, Between, Not, and All can be used in transform, among others, and yet I get this error.
Thanks for any pointers.
These selectors can be used when they are passed to transform directly. Here you are using broadcasting with .=> (note the dot), so you are not passing them directly to transform, but instead try to pass the following:
julia> Cols(1:2) .=> v -> v .+ 100
ERROR: MethodError: no method matching length(::Cols{Tuple{UnitRange{Int64}}})
The error you observe is not emitted by DataFrames.jl but by Julia base.
What you need to do is to use names to make things work:
julia> names(df, Cols(1:2)) .=> v -> v .+ 100
2-element Vector{Pair{String, var"#7#8"}}:
"x" => var"#7#8"()
"y" => var"#7#8"()
and in consequence the following works:
transform(df, names(df, Cols(1:2)) .=> v -> v .+ 100)
In the future the functionality you request might be added but it requires changes in DataAPI.jl, see here.
EDIT
As signaled in the original answer, in DataFrames.jl 1.3 the functionality has been added and now you can do transform(df, Cols(1:2) .=> v -> v .+ 100) without error. See https://bkamins.github.io/julialang/2021/12/17/selectors.html for an explanation of how it works now.
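The broadcasting step itself can be tried in plain Julia, without DataFrames.jl. In this sketch, column_names is a hand-written stand-in for the Vector{String} that names(df, Cols(1:2)) returns, and add_100 and pairs_for_transform are names made up for illustration.

```julia
# Stand-in for `names(df, Cols(1:2))`, which returns a Vector{String}.
column_names = ["x", "y"]

add_100 = v -> v .+ 100

# `.=>` broadcasts over the names; the function is treated as a scalar,
# so the result is one name => function pair per column.
pairs_for_transform = column_names .=> add_100
```

A vector has a length, so Base can broadcast over it; that is the operation that failed for a bare Cols selector before DataFrames.jl 1.3.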

DataFrames : no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)

When I try to use the dot operator (element wise operation) in a DataFrame where a function returning a tuple is applied, I get the following error.
Here is a toy example,
df = DataFrame()
df[:, :x] = rand(5)
df[:, :y] = rand(5)
#Function that returns two values in the form of a tuple
add_minus_two(x,y) = (x-y,x+y)
df[:,"x+y"] = add_minus_two.(df[:,:x], df[:,:y])[2]
#Out > ERROR: MethodError: no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)
#However removing the dot operator works fine
df[:,"x+y"] = add_minus_two(df[:,:x], df[:,:y])[2]
#Out > 5 x 3 DataFrame
#Furthermore if its just one argument either dot or not, works fine as well
add_two(x,y) = x+y
df[:, "x+y"] = add_two(df[:,:x], df[:,:y])
df[:, "x+y"] = add_two.(df[:,:x], df[:,:y])
#out > 5 x 3 DataFrame
Is there any reason why this is so? I thought that for elementwise operations you need to use the "dot" operator.
Also for my actual problem (when a function return 2 values in a tuple), when NOT using the dot operator gives,
ERROR: MethodError: no method matching compute_T(::Vector{Float64}, ::Vector{Float64})
and using the dot operator gives,
ERROR: MethodError: no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)
and returning a single argument, similar to the toy example works fine as well.
Any clue what I am doing incorrectly here ?
This is not a DataFrames.jl issue, but how Julia Base works.
I concentrate only on RHS, as LHS is irrelevant (and RHS is unrelated to DataFrames.jl).
First, how to write what you want. Initialization:
julia> using DataFrames
julia> df = DataFrame()
0×0 DataFrame
julia> df[:, :x] = rand(5)
5-element Vector{Float64}:
0.6146045473316457
0.6319531776216596
0.599267794937812
0.40864382019544965
0.3738682778395166
julia> df[:, :y] = rand(5)
5-element Vector{Float64}:
0.07891853567296825
0.2143545316544586
0.5943274462916335
0.2182702556068421
0.5810132720450707
julia> add_minus_two(x,y) = (x-y,x+y)
add_minus_two (generic function with 1 method)
And now you get:
julia> add_minus_two(df[:,:x], df[:,:y])
([0.5356860116586775, 0.417598645967201, 0.004940348646178538, 0.19037356458860755, -0.2071449942055541], [0.693523083004614, 0.8463077092761182, 1.1935952412294455, 0.6269140758022917, 0.9548815498845873])
julia> add_minus_two.(df[:,:x], df[:,:y])
5-element Vector{Tuple{Float64, Float64}}:
(0.5356860116586775, 0.693523083004614)
(0.417598645967201, 0.8463077092761182)
(0.004940348646178538, 1.1935952412294455)
(0.19037356458860755, 0.6269140758022917)
(-0.2071449942055541, 0.9548815498845873)
julia> add_minus_two(df[:,:x], df[:,:y])[2]
5-element Vector{Float64}:
0.693523083004614
0.8463077092761182
1.1935952412294455
0.6269140758022917
0.9548815498845873
julia> add_minus_two.(df[:,:x], df[:,:y])[2]
(0.417598645967201, 0.8463077092761182)
julia> getindex.(add_minus_two.(df[:,:x], df[:,:y]), 2) # this is probably what you want
5-element Vector{Float64}:
0.693523083004614
0.8463077092761182
1.1935952412294455
0.6269140758022917
0.9548815498845873
Now the point is that when you write:
df[:,"x+y"] = whatever_you_pass
The whatever_you_pass part must be an AbstractVector with an appropriate number of elements (one per row). This means that what will work is:
add_minus_two.(df[:,:x], df[:,:y])
add_minus_two(df[:,:x], df[:,:y])[2]
getindex.(add_minus_two.(df[:,:x], df[:,:y]), 2)
and what will fail is (as in these cases a Tuple, not an AbstractVector, is produced):
add_minus_two(df[:,:x], df[:,:y])
add_minus_two.(df[:,:x], df[:,:y])[2]
Out of the working syntaxes just pick the one you want.
The general recommendation is that when you do an assignment, always inspect the RHS on its own and check whether it has the proper structure.
Also, notably, this will work:
julia> transform(df, [:x, :y] => ByRow(add_minus_two) => ["x-y", "x+y"])
5×4 DataFrame
Row │ x y x-y x+y
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────
1 │ 0.614605 0.0789185 0.535686 0.693523
2 │ 0.631953 0.214355 0.417599 0.846308
3 │ 0.599268 0.594327 0.00494035 1.1936
4 │ 0.408644 0.21827 0.190374 0.626914
5 │ 0.373868 0.581013 -0.207145 0.954882
(you have not asked about it but maybe this is what you actually are looking for - and as opposed to setindex! this syntax is DataFrames.jl specific)
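The difference between indexing the broadcast result and broadcasting the indexing can be reproduced with small plain vectors. This is a minimal sketch using only Base; the inputs x and y and the variable names are made up for illustration.

```julia
add_minus_two(x, y) = (x - y, x + y)

x = [3.0, 5.0, 7.0]
y = [1.0, 2.0, 3.0]

tuples = add_minus_two.(x, y)    # Vector of 3 tuples, one per row
second_row = tuples[2]           # a single Tuple: (3.0, 7.0)
sums = getindex.(tuples, 2)      # Vector{Float64}: [4.0, 7.0, 10.0]
sums_alt = last.(tuples)         # same result, since each tuple has two entries
```

Only sums (or sums_alt) has the vector shape that a column assignment expects; second_row is the tuple that triggers the setindex! error.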

How to multiply a dataframe's column by a log using julia?

I have a dataframe. I want to multiply column "b" by a "log" and then replace NaN by 0s.
How can I do that in Julia?
I am checking this: DataFrames.jl
But I do not understand.
df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]),
b = repeat([2, 1], outer=[4]),
c = randn(8))
I want to multiply column "b" by a "log"
Assuming you mean you want to apply the (natural) log to each element in column :b, you can do the following:
log.(df.b)
log(x) applies the (natural) log to an individual element x. By putting a dot after the log, you are broadcasting the log function across each element.
If you wanted to replace column b, do the following:
df.b = log.(df.b)
and then replace NaN by 0s
I'm assuming you want to handle the case where you have a DomainError (i.e., taking the log of a negative number). Your best bet is to handle the error before it arises:
map( x -> x <= 0 ? 0.0 : log(x), df.b)
This maps the anonymous function x -> x <= 0 ? 0.0 : log(x) across each element of column b in your DataFrame. This function tests whether x is less than or equal to zero: if yes, it returns 0.0; otherwise it returns log(x). This "one-line if statement" is called a ternary operator.
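As a quick check of that guard on a plain vector, here is a sketch using only Base. The helper name safe_log and the sample data are hypothetical; the answer itself uses the anonymous form. Note that in Julia log(0) returns -Inf and log of a negative real throws a DomainError, so the guard covers both cases.

```julia
# Hypothetical helper name; equivalent to the anonymous function above.
safe_log(x) = x <= 0 ? 0.0 : log(x)

b = [2, 1, 0, -1]            # sample data, including zero and a negative value
result = map(safe_log, b)    # [log(2), 0.0, 0.0, 0.0]
```

With a data frame, the same call would be applied to df.b instead of b.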
Use a generator:
( v <= 0. ? 0. : log(v) for v in df.c )
If you want to add a new column:
df[!, :d] .= ( v <= 0. ? 0. : log(v) for v in df.c)
This is faster than using map (these tests assume that df.d already exists):
julia> using BenchmarkTools
julia> @btime $df[!, :d] .= ( v <= 0.0 ? 0.0 : log(v) for v in $df.c)
1.440 μs (14 allocations: 720 bytes)
julia> @btime $df[!, :d] .= map( x -> x <= 0.0 ? 0.0 : log(x), $df.c);
1.570 μs (14 allocations: 720 bytes)

Loop through columns in Julia

I want to add a number to all columns in a DataFrame. I am trying to use,
for i in names(df)
df.i = df.i .+ 1
end
But this gives the error ArgumentError: column name :i not found in the data frame
Any help is appreciated. Thanks in advance.
Current advice for DataFrames.jl 1.0 or newer
Just write:
df .+= 1
to get what you want.
If you want to loop through columns it is also supported. Here are some examples:
for n in names(df)
df[!, n] .+= 1
end
for col in eachcol(df)
col .+= 1
end
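The loop in the question fails because df.i asks for a column literally named i, not for the column whose name is stored in the variable i. The same distinction can be seen without DataFrames.jl; in this sketch a NamedTuple of vectors (cols, an assumption made purely for illustration) stands in for the data frame.

```julia
# Stand-in for a data frame: a NamedTuple of column vectors (illustration only).
cols = (x = [1, 2, 3], y = [4, 5, 6])

for n in keys(cols)
    col = getproperty(cols, n)   # looks up the column whose name is stored in n
    col .+= 1                    # in-place update of that column
end
```

Writing cols.n inside the loop would instead look for a field literally called n, which mirrors the ArgumentError in the question.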
Old advice for DataFrames.jl before 1.0 release
Currently you can use:
for i in axes(df, 2)
df[i] .+= 1
end
or
for n in names(df)
df[n] .+= 1
end
However, in the future you might need to write (there is a discussion if we should change the meaning of single argument indexing):
for col in eachcol(df, false)
col .+= 1
end
or
foreach(x -> x .+= 1, eachcol(df, false))

DataFrames.jl Number of rows

I'd like to get the number of rows of a dataframe.
I can achieve that with size(myDataFrame)[1].
Is there a cleaner way ?
If you are using DataFrames specifically, then you can use nrow():
julia> df = DataFrame(a = 1:10, b = 1:10);
julia> nrow(df)
10
Alternatively, you can specify the dimension argument for size:
julia> size(df, 1)
10
This also works for arrays, so it's a bit more general:
julia> my_array = rand(4, 3)
4×3 Array{Float64,2}:
0.980798 0.873643 0.819478
0.341972 0.34974 0.160342
0.262292 0.387406 0.00741398
0.512669 0.81579 0.329353
julia> size(my_array, 1)
4