Transform dataframe columns using column selector (Cols) fails - dataframe

I am wondering why I cannot use the Cols column selector in transform to change a dataframe column. For instance:
df = DataFrame(x = 1:5, y = 6:10)
transform(df, [:x, :y] .=> v -> v .+ 100) # OK
df[!, Cols(1:2)] .= df[!, Cols(1:2)] .+ 100 # OK
transform(df, Cols(1:2) .=> v -> v .+ 100) # MethodError: no method matching length(::Cols{Tuple{UnitRange{Int64}}})
The DataFrames documentation says that column selectors such as Cols, Between, Not, and All can be used in transform, among others, yet I get this error.
Thanks for any pointers.

These selectors can be used when they are passed to transform directly. Here you are broadcasting with .=> (note the dot), so you are not passing them directly to transform; instead, you are trying to pass the following:
julia> Cols(1:2) .=> v -> v .+ 100
ERROR: MethodError: no method matching length(::Cols{Tuple{UnitRange{Int64}}})
The error you observe is not emitted by DataFrames.jl but by Julia base.
What you need to do is to use names to make things work:
julia> names(df, Cols(1:2)) .=> v -> v .+ 100
2-element Vector{Pair{String, var"#7#8"}}:
"x" => var"#7#8"()
"y" => var"#7#8"()
and in consequence the following works:
transform(df, names(df, Cols(1:2)) .=> v -> v .+ 100)
In the future the functionality you request might be added but it requires changes in DataAPI.jl, see here.
EDIT
As signaled in the original answer, the functionality was added in DataFrames.jl 1.3, and now you can do transform(df, Cols(1:2) .=> v -> v .+ 100) without error. See https://bkamins.github.io/julialang/2021/12/17/selectors.html for an explanation of how it works now.
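A minimal sketch of the two variants described above, run on the question's data. The helper name add100 and the renamecols = false keyword (used here so the original columns are overwritten instead of getting a _function suffix) are illustrative choices, not part of the original answer. The direct Cols(1:2) .=> form assumes DataFrames.jl 1.3 or later.

```julia
using DataFrames

df = DataFrame(x = 1:5, y = 6:10)
add100 = v -> v .+ 100   # hypothetical helper name, for illustration only

# Resolve the selector to column names first, then broadcast the pairs.
by_names = transform(df, names(df, Cols(1:2)) .=> add100; renamecols = false)

# Assumes DataFrames.jl >= 1.3, where selectors support broadcasting with .=>
by_selector = transform(df, Cols(1:2) .=> add100; renamecols = false)
```

Both calls return a new data frame and leave df itself untouched.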

Related

I would like to concatenate anonymous calls in Julia

So, I'm learning more about Julia and I would like to do the following:
I have a 2-row by 3-column matrix, which is fixed,
A = rand(2,3)
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.705942 0.553562 0.731246
0.205833 0.106978 0.131893
Then, I would like to have an anonymous function, which does the following:
a = ones(1,3);
a[2] = rand();
Finally, I would like to broadcast
broadcast(+, ones(1,3) => a[2]=rand(), A)
So I have the middle column of A, i.e., A[:,2], added by two different random numbers, and in the rest of the columns, we add ones.
EDIT:
If I add a, as it is:
julia> a = ones(1,3)
1×3 Matrix{Float64}:
1.0 1.0 1.0
julia> a[2] = rand()
0.664824196431979
julia> a
1×3 Matrix{Float64}:
1.0 0.664824 1.0
I would like that this a were dynamic, and a function.
So that:
broadcast(+, a, A)
Would give:
julia> broadcast(+, a, A)
2×3 Matrix{Float64}:
1.70594 0.553562 + rand() (correct) 1.73125
1.20583 0.106970 + rand() (different rand()) 1.13189
Instead of:
julia> broadcast(+, a, A)
2×3 Matrix{Float64}:
1.70594 1.21839 (0.553562 + -> 0.664824) 1.73125
1.20583 0.771802 (0.106978 + -> 0.664824) 1.13189
So, I thought of this pseudo-code:
broadcast(+, a=ones(1,3) => a[2]=rand(), A)
Formalizing:
broadcast(+, <anonymous-function>, A)
Second EDIT:
Rules/Constraints:
Rule 1: the call must be data-transparent. That is, A must not change state, just like when we call f.(A).
Rule 2: no auxiliary variable is created (a must not exist). The only array that must exist, before and after the call, is A.
Rule 3: f.(A) must be anonymous; that is, you can't define f as function f(A) ... end
With the caveat that I don't know how much you really learn by setting artificial rules like this, some tidier ways are:
julia> A = [ 0.705942 0.553562 0.731246
0.205833 0.106978 0.131893 ]; # as given
julia> r = 0.664824196431979; # the one random number
julia> (A' .+ (1, r, 1))' # no extra vector
2×3 adjoint(::Matrix{Float64}) with eltype Float64:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
julia> mapslices(row -> row .+ (1, r, 1), A; dims=2) # one line, but slow
2×3 Matrix{Float64}:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
julia> B = A .+ 1; @views B[:, 2] .+= (-1 + r); B # fast, no extra allocations
2×3 Matrix{Float64}:
1.70594 1.21839 1.73125
1.20583 0.771802 1.13189
I can't tell from your question whether you want one random number or two different ones. If you want two, then you can do this:
julia> using Random
julia> Random.seed!(1); mapslices(row -> row .+ (1, rand(), 1), A; dims=2)
2×3 Matrix{Float64}:
1.70594 0.675436 1.73125
1.20583 0.771383 1.13189
julia> Random.seed!(1); B = A .+ 1; @views B[:, 2] .+= (-1 .+ rand.()); B
2×3 Matrix{Float64}:
1.70594 0.675436 1.73125
1.20583 0.771383 1.13189
Note that (-1 .+ rand.()) isn't making a new array on the right, it's fused by .+= into one loop over a column of B. Note also that B[:,2] .= stuff just writes into B, but B[:, 2] .+= stuff means B[:, 2] .= B[:, 2] .+ stuff and so, without @views, the slice B[:, 2] on the right would allocate a copy.
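As a self-contained sketch of the fused update described above (using the question's matrix, with the seed fixed only for repeatability), the column-2 update can be written as follows. The variable name B is carried over from the answer; nothing else is assumed beyond Base and the Random standard library.

```julia
using Random

A = [0.705942 0.553562 0.731246
     0.205833 0.106978 0.131893]

Random.seed!(1)
B = A .+ 1                           # new matrix; A itself is not modified
# rand.() sits inside the fused broadcast, so it is called once per element
# of the column; @views makes B[:, 2] on the right-hand side a view, not a copy.
@views B[:, 2] .+= (-1 .+ rand.())
```

After this runs, columns 1 and 3 of B hold A plus one, and each entry of column 2 holds the matching entry of A plus its own random number.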
Firstly I'd like to say that the approach taken in the other answers is the most performant one. It seems like you want the entire matrix at the end, in that case for the best performance it is generally good to get data (like randomness) in big batches and to not "hide" data from the compiler (especially type information). A lot of interesting things can be achieved with higher level abstractions but since you say performance is important, let's establish a baseline:
function approach1(A)
a = ones(2,3)
@. a[:, 2] = rand()
broadcast(+, a, A)
end
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.199619 0.273481 0.99254
0.0927839 0.179071 0.188591
julia> @btime approach1($A)
65.420 ns (2 allocations: 256 bytes)
2×3 Matrix{Float64}:
1.19962 0.968391 1.99254
1.09278 1.14451 1.18859
With that out of the way let's try some other solutions.
If a single row with lazy elements doesn't count as an auxiliary variable this seems like a good starting point:
function approach2(A)
a = Matrix{Function}(undef, 1, 3)
fill!(a, ()->1.0)
a[2] = rand
broadcast((a,b)->a() + b, a, A)
end
We get a row a = [()->1.0 rand ()->1.0] and evaluate each function when the broadcast gets that element.
julia> @btime approach2($A)
1.264 μs (24 allocations: 960 bytes)
The performance is 20 times worse, why? We've hidden type information from the compiler: it can't tell that a() is a Float64. Just asserting this (changing the last line to broadcast((a,b)->a()::Float64 + b, a, A)) increases the performance almost tenfold:
julia> @btime approach2($A)
164.108 ns (14 allocations: 432 bytes)
If this is acceptable we can make it cleaner: introduce a LazyNumber type that keeps track of the return type, and has promote rules/operators so we can get back to broadcast(+, ...). However, we are still 2-3 times slower, can we do better?
An approach that could allow us to squeeze out some more would be to represent the whole array lazily: something like a Fill type, or a LazySetItem that applies on top of a matrix. Once again, actually creating the array will be cheaper unless you can avoid materializing parts of the array.
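For comparison, here is a sketch of one way to keep the call anonymous without storing functions in an array: broadcast A against a lazy row of column indices and let the anonymous function decide what to add. This is an illustration, not one of the approaches benchmarked above, and no timing is claimed for it; both branches of the ternary return a Float64, so the element type should be inferable.

```julia
A = [0.705942 0.553562 0.731246
     0.205833 0.106978 0.131893]

# transpose(1:3) is a lazy 1×3 row of column indices, not a materialized array.
# The anonymous function draws a fresh rand() for every element of column 2
# and adds 1.0 to every element of the other columns.
C = broadcast((x, j) -> x + (j == 2 ? rand() : 1.0), A, transpose(1:size(A, 2)))
```

A is left unchanged and C is an ordinary 2×3 Matrix{Float64}.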
I agree that it is not very clear what you are trying to achieve, or even whether you want to learn how to achieve something in practice or how something works in theory.
If all you want is just to add a random vector to a matrix column (and 1 elsewhere), it is as simple as... add a random vector to the desired matrix column and 1 elsewhere:
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.94194 0.691855 0.583107
0.198166 0.740017 0.914133
julia> A[:,[1,3]] .+= 1
2×2 view(::Matrix{Float64}, :, [1, 3]) with eltype Float64:
1.94194 1.58311
1.19817 1.91413
julia> A[:,2] += rand(size(A,1))
2-element Vector{Float64}:
1.0306116987831297
0.8757712661515558
julia> A
2×3 Matrix{Float64}:
1.94194 1.03061 1.58311
1.19817 0.875771 1.91413
Why not just have a be the same size as A, and then you don't even need broadcasting or any weird tricks:
julia> A = rand(2,3)
2×3 Matrix{Float64}:
0.564824 0.765611 0.841353
0.566965 0.322331 0.109889
julia> a = ones(2,3);
julia> a[:, 2] .= [rand() for _ in 1:size(a, 1)] #in every row, make the second column's value a different rand() result
julia> a
2×3 Matrix{Float64}:
1.0 0.519228 1.0
1.0 0.0804104 1.0
julia> A + a
2×3 Matrix{Float64}:
1.56482 1.28484 1.84135
1.56696 0.402741 1.10989

Using operators inside a plot function in Julia

I have following dataframe
I would like to plot Evar / (T^2 * L)
using Plots, DataFrames, CSV
@df data plot(:T, :Evar / (:T * :T * :L) , group=:L, legend=nothing)
MethodError: no method matching *(::Vector{Float64}, ::Vector{Float64})
Unfortunately I am not sure how to use operators inside the plot function.
For the "/" operator it seems to work, but if I want to multiply using "*" I get the error above.
Here is an example of what I mean by "/" working:
You need to vectorize the multiplication and division so this will be:
@df data plot(:T, :Evar ./ (:T .* :T .* :L) , group=:L, legend=nothing)
Simpler example:
julia> a = [1,3,4];
julia> b = [4,5,6];
julia> a * b
ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64})
julia> a .* b
3-element Vector{Int64}:
4
15
24
Note that / works because / is defined for vectors, but the result is perhaps not exactly what you would have wanted:
julia> c = a / b
3×3 Matrix{Float64}:
0.0519481 0.0649351 0.0779221
0.155844 0.194805 0.233766
0.207792 0.25974 0.311688
It just returned a matrix such that c*b == a (up to floating-point rounding), where * is matrix multiplication.
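To tie this back to the quantity in the question, Evar / (T^2 * L), here is a small sketch on plain vectors. The names evar, t and l and their values are made-up stand-ins for the :Evar, :T and :L columns, chosen only so the arithmetic is easy to check.

```julia
# Hypothetical stand-ins for the :Evar, :T and :L columns.
evar = [2.0, 8.0, 18.0]
t    = [1.0, 2.0, 3.0]
l    = [2.0, 2.0, 2.0]

# Every operator is dotted, so the whole expression is applied element by element.
y = evar ./ (t .^ 2 .* l)

# Undotted / between two vectors is a least-squares style "division"
# and returns a 3×3 matrix instead of a vector.
m = evar / l
```

Here t .^ 2 is the element-wise square, equivalent to t .* t.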

DataFrames : no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)

When I try to use the dot operator (element wise operation) in a DataFrame where a function returning a tuple is applied, I get the following error.
Here is a toy example,
df = DataFrame()
df[:, :x] = rand(5)
df[:, :y] = rand(5)
#Function that returns two values in the form of a tuple
add_minus_two(x,y) = (x-y,x+y)
df[:,"x+y"] = add_minus_two.(df[:,:x], df[:,:y])[2]
#Out > ERROR: MethodError: no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)
#However removing the dot operator works fine
df[:,"x+y"] = add_minus_two(df[:,:x], df[:,:y])[2]
#Out > 5 x 3 DataFrame
#Furthermore if its just one argument either dot or not, works fine as well
add_two(x,y) = x+y
df[:, "x+y"] = add_two(df[:,:x], df[:,:y])
df[:, "x+y"] = add_two.(df[:,:x], df[:,:y])
#out > 5 x 3 DataFrame
Any reason why this is? I thought that for element-wise operations you need to use the "dot" operator.
Also for my actual problem (when a function return 2 values in a tuple), when NOT using the dot operator gives,
ERROR: MethodError: no method matching compute_T(::Vector{Float64}, ::Vector{Float64})
and using the dot operator gives,
ERROR: MethodError: no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)
and returning a single argument, similar to the toy example works fine as well.
Any clue what I am doing incorrectly here ?
This is not a DataFrames.jl issue, but how Julia Base works.
I concentrate only on RHS, as LHS is irrelevant (and RHS is unrelated to DataFrames.jl).
First, how to write what you want. Initialization:
julia> using DataFrames
julia> df = DataFrame()
0×0 DataFrame
julia> df[:, :x] = rand(5)
5-element Vector{Float64}:
0.6146045473316457
0.6319531776216596
0.599267794937812
0.40864382019544965
0.3738682778395166
julia> df[:, :y] = rand(5)
5-element Vector{Float64}:
0.07891853567296825
0.2143545316544586
0.5943274462916335
0.2182702556068421
0.5810132720450707
julia> add_minus_two(x,y) = (x-y,x+y)
add_minus_two (generic function with 1 method)
And now you get:
julia> add_minus_two(df[:,:x], df[:,:y])
([0.5356860116586775, 0.417598645967201, 0.004940348646178538, 0.19037356458860755, -0.2071449942055541], [0.693523083004614, 0.8463077092761182, 1.1935952412294455, 0.6269140758022917, 0.9548815498845873])
julia> add_minus_two.(df[:,:x], df[:,:y])
5-element Vector{Tuple{Float64, Float64}}:
(0.5356860116586775, 0.693523083004614)
(0.417598645967201, 0.8463077092761182)
(0.004940348646178538, 1.1935952412294455)
(0.19037356458860755, 0.6269140758022917)
(-0.2071449942055541, 0.9548815498845873)
julia> add_minus_two(df[:,:x], df[:,:y])[2]
5-element Vector{Float64}:
0.693523083004614
0.8463077092761182
1.1935952412294455
0.6269140758022917
0.9548815498845873
julia> add_minus_two.(df[:,:x], df[:,:y])[2]
(0.417598645967201, 0.8463077092761182)
julia> getindex.(add_minus_two.(df[:,:x], df[:,:y]), 2) # this is probably what you want
5-element Vector{Float64}:
0.693523083004614
0.8463077092761182
1.1935952412294455
0.6269140758022917
0.9548815498845873
Now the point is that when you write:
df[:,"x+y"] = whatever_you_pass
The whatever_you_pass part must be an AbstractVector with an appropriate number of elements (one per row of df). This means that what will work is:
add_minus_two.(df[:,:x], df[:,:y])
add_minus_two(df[:,:x], df[:,:y])[2]
getindex.(add_minus_two.(df[:,:x], df[:,:y]), 2)
and what will fail is (as in these cases a Tuple not AbstractVector is produced)
add_minus_two(df[:,:x], df[:,:y])
add_minus_two.(df[:,:x], df[:,:y])[2]
Out of the working syntaxes just pick the one you want.
The general recommendation is that when you do an assignment, you should always inspect the RHS on its own and check that it has the proper structure.
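A short sketch of that workflow, using small fixed values instead of rand(5) so the result is easy to check. The variable name rhs and the choice of first. and last. to pull the tuple components apart are illustrative; getindex.(rhs, 1) and getindex.(rhs, 2) do the same thing.

```julia
using DataFrames

df = DataFrame(x = [1.0, 2.0, 3.0], y = [0.5, 0.25, 0.125])
add_minus_two(x, y) = (x - y, x + y)

# Inspect the right-hand side on its own first: broadcasting yields one tuple per row.
rhs = add_minus_two.(df.x, df.y)   # Vector{Tuple{Float64, Float64}} of length 3

# Extract each tuple component as its own vector before assigning it as a column.
df[:, "x-y"] = first.(rhs)
df[:, "x+y"] = last.(rhs)
```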
Also, notably, this will work:
julia> transform(df, [:x, :y] => ByRow(add_minus_two) => ["x-y", "x+y"])
5×4 DataFrame
Row │ x y x-y x+y
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────
1 │ 0.614605 0.0789185 0.535686 0.693523
2 │ 0.631953 0.214355 0.417599 0.846308
3 │ 0.599268 0.594327 0.00494035 1.1936
4 │ 0.408644 0.21827 0.190374 0.626914
5 │ 0.373868 0.581013 -0.207145 0.954882
(you have not asked about it but maybe this is what you actually are looking for - and as opposed to setindex! this syntax is DataFrames.jl specific)

How to apply Cuda.jl on DataFrames in julia

I am using dataframes.jl package in the below mentioned code to perform certain operations.
I would like to know, how may I apply CUDA.jl on this code, if possible while keeping the dataframe aspect?
Secondly, is it possible to allow the code to automatically choose between CPU and GPU based on the availability?
Code
using DataFrames
df = DataFrame(i = Int64[], a = Float64[], b =Float64[])
for i in 1:10
push!(df.i, i)
a = i + sin(i)*cos(i)/sec(i)^100
push!(df.a, a)
b = i + tan(i)*csc(i)/sin(i)
push!(df.b, b)
end
transform!(df, [:a, :b] .=> (x -> [missing; diff(x)]) .=> [:da, :db])
Please suggest a solution to make this code compatible with CUDA.jl.
Thanks in advance!!

Taking an expression as an argument in Julia function

I'm trying to implement OLS regression in Julia as a learning exercise. A feature I would like to have is accepting a formula as an argument (e.g. 'formula = Y ~ x1 + x2', where Y, x1, and x2 are columns in a DataFrame). Here is an existing example.
How do I "map" the formula/expression to the correct DataFrame columns?
Formulas in the Julia statistics packages are implemented as a macro. A macro is defined for the ~ symbol, which means that the expressions are parsed by the Julia compiler. Once parsed by the compiler, they are stored as the rhs and lhs fields of a composite type called Formula.
The details of the implementation, which is relatively simple, can be seen in the DataFrames.jl source code here: https://github.com/JuliaStats/DataFrames.jl/blob/725a22602b8b3f6413e35ebdd707b69c4ed7b659/src/statsmodels/formula.jl
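For a learning exercise, the mapping from formula to columns can also be sketched by hand on a quoted expression. The function below is a toy illustration and not how the statistics packages implement formulas: the name ols is hypothetical, only a plain sum of column names is supported on the right-hand side, an intercept is always added, and data can be anything that supports data.colname access (a NamedTuple of vectors here; a DataFrame should work the same way).

```julia
# Hypothetical helper: fit OLS from a quoted formula such as :(Y ~ x1 + x2).
function ols(formula::Expr, data)
    if !(formula.head == :call && formula.args[1] == :~)
        throw(ArgumentError("expected an expression of the form :(y ~ x1 + x2)"))
    end
    lhs, rhs = formula.args[2], formula.args[3]
    # A single predictor parses as a Symbol; a sum parses as Expr(:call, :+, ...).
    if rhs isa Symbol
        predictors = [rhs]
    elseif rhs isa Expr && rhs.head == :call && rhs.args[1] == :+
        predictors = rhs.args[2:end]
    else
        throw(ArgumentError("only sums of column names are supported in this sketch"))
    end
    y = getproperty(data, lhs)
    # Design matrix: an intercept column followed by one column per predictor.
    X = hcat(ones(length(y)), (getproperty(data, p) for p in predictors)...)
    return X \ y   # least-squares solution
end

# Made-up data generated from Y = 2 + 3*x1 - x2.
data = (Y = [4.0, 8.0, 10.0, 14.0], x1 = [1.0, 2.0, 3.0, 4.0], x2 = [1.0, 0.0, 1.0, 0.0])
w = ols(:(Y ~ x1 + x2), data)
```

The returned vector holds the intercept first, then one weight per predictor in the order they appear in the formula.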
Use an anonymous function as an input.
julia> using DataFrames
julia> f = (x,y) -> x[:A] .* y[:B] # Anonymous function
julia> x = DataFrame(A = 6)
julia> y = DataFrame(B = 7)
julia> function OSL(x::DataFrame,y::DataFrame,f::Function);return f(x,y);end
julia> OSL(x,y,f)
1-element DataArrays.DataArray{Int64,1}:
42
Here's a minimal example using the Boston dataset from ISLR, regressing medv on lstat. (Check pg. 111 of ISLR if you want to verify that the weight vector is correct.)
julia> using DataFrames, RDatasets
julia> df = dataset("MASS", "Boston")
julia> fm = @formula(MedV ~ LStat)
julia> mf = ModelFrame(fm, df)
julia> X = ModelMatrix(mf).m
julia> y = Array(df[:MedV])
julia> w = X \ y
2-element Array{Float64,1}:
34.5538
-0.950049
For more information: http://dataframesjl.readthedocs.io/en/latest/formulas.html