When I try to use the dot operator (element wise operation) in a DataFrame where a function returning a tuple is applied, I get the following error.
Here is a toy example,
df = DataFrame()
df[:, :x] = rand(5)
df[:, :y] = rand(5)
#Function that returns two values in the form of a tuple
add_minus_two(x,y) = (x-y,x+y)
df[:,"x+y"] = add_minus_two.(df[:,:x], df[:,:y])[2]
#Out > ERROR: MethodError: no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)
#However removing the dot operator works fine
df[:,"x+y"] = add_minus_two(df[:,:x], df[:,:y])[2]
#Out > 5 x 3 DataFrame
#Furthermore if its just one argument either dot or not, works fine as well
add_two(x,y) = x+y
df[:, "x+y"] = add_two(df[:,:x], df[:,:y])
df[:, "x+y"] = add_two.(df[:,:x], df[:,:y])
#out > 5 x 3 DataFrame
Any reason why this is. I thought for elementwise operation you need to use "dot" operator.
Also for my actual problem (when a function return 2 values in a tuple), when NOT using the dot operator gives,
ERROR: MethodError: no method matching compute_T(::Vector{Float64}, ::Vector{Float64})
and using the dot operator gives,
ERROR: MethodError: no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)
and returning a single argument, similar to the toy example works fine as well.
Any clue what I am doing incorrectly here ?
This is not a DataFrames.jl issue, but how Julia Base works.
I concentrate only on RHS, as LHS is irrelevant (and RHS is unrelated to DataFrames.jl).
First, how to write what you want. Initialization:
julia> using DataFrames
julia> df = DataFrame()
0×0 DataFrame
julia> df[:, :x] = rand(5)
5-element Vector{Float64}:
0.6146045473316457
0.6319531776216596
0.599267794937812
0.40864382019544965
0.3738682778395166
julia> df[:, :y] = rand(5)
5-element Vector{Float64}:
0.07891853567296825
0.2143545316544586
0.5943274462916335
0.2182702556068421
0.5810132720450707
julia> add_minus_two(x,y) = (x-y,x+y)
add_minus_two (generic function with 1 method)
And now you get:
julia> add_minus_two(df[:,:x], df[:,:y])
([0.5356860116586775, 0.417598645967201, 0.004940348646178538, 0.19037356458860755, -0.2071449942055541], [0.693523083004614, 0.8463077092761182, 1.1935952412294455, 0.6269140758022917, 0.9548815498845873])
julia> add_minus_two.(df[:,:x], df[:,:y])
5-element Vector{Tuple{Float64, Float64}}:
(0.5356860116586775, 0.693523083004614)
(0.417598645967201, 0.8463077092761182)
(0.004940348646178538, 1.1935952412294455)
(0.19037356458860755, 0.6269140758022917)
(-0.2071449942055541, 0.9548815498845873)
julia> add_minus_two(df[:,:x], df[:,:y])[2]
5-element Vector{Float64}:
0.693523083004614
0.8463077092761182
1.1935952412294455
0.6269140758022917
0.9548815498845873
julia> add_minus_two.(df[:,:x], df[:,:y])[2]
(0.417598645967201, 0.8463077092761182)
julia> getindex.(add_minus_two.(df[:,:x], df[:,:y]), 2) # this is probably what you want
5-element Vector{Float64}:
0.693523083004614
0.8463077092761182
1.1935952412294455
0.6269140758022917
0.9548815498845873
Now the point is that when you write:
df[:,"x+y"] = whatever_you_pass
The whatever_you_pass part must be an AbstractVector with an appropriate number of columns. This means that what will work is:
add_minus_two.(df[:,:x], df[:,:y])
add_minus_two(df[:,:x], df[:,:y])[2]
getindex.(add_minus_two.(df[:,:x], df[:,:y]), 2)
and what will fail is (as in these cases a Tuple not AbstractVector is produced)
add_minus_two(df[:,:x], df[:,:y])
add_minus_two.(df[:,:x], df[:,:y])[2]
Out of the working syntaxes just pick the one you want.
The general recommendation is that when you do assignment always inspect the RHS stand alone and analyze if it has a proper structure.
Also, notably, this will work:
julia> transform(df, [:x, :y] => ByRow(add_minus_two) => ["x-y", "x+y"])
5×4 DataFrame
Row │ x y x-y x+y
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────
1 │ 0.614605 0.0789185 0.535686 0.693523
2 │ 0.631953 0.214355 0.417599 0.846308
3 │ 0.599268 0.594327 0.00494035 1.1936
4 │ 0.408644 0.21827 0.190374 0.626914
5 │ 0.373868 0.581013 -0.207145 0.954882
(you have not asked about it but maybe this is what you actually are looking for - and as opposed to setindex! this syntax is DataFrames.jl specific)
Related
I am using Julia and I got a dataframe with 42 values, of which 2 are missing.
This values are prices that go from 0.23 to 0.3
I am trying to get a new column that tells if its cheap or expensive by a ifelse statement.
the ifelse should go:
df.x_category=ifelse.(df.x .< mean(df.x),"cheap", "expensive")
but i get the following error:
ERROR: TypeError: non-boolean (Missing) used in boolean context
Is there a way to skip those missing values?
I tried with:
df.x_category=ifelse.(skipmissing(df.x) .< mean(skipmissing(df.x)),"cheap", "expensive")
but get this error:
ERROR: ArgumentError: New columns must have the same length as old columns
I can't just delete missing observations.
How can i make this?
Thanks in advance!
ifelse can handle only 2 values and you need handle 3.
Assuming that you have
df = DataFrame(x=rand([0.23,0.3,missing], 10))
than mean(df.x) yields a missing since some of values are missings. You need to do instead mean(skipmissing(df.x))).
Hence the code could be:
julia> map(x -> ismissing(x) ? missing : ifelse(x,"cheap", "expensive"), df.x .< mean(skipmissing(df.x)))
10-element Vector{Union{Missing, String}}:
missing
missing
"cheap"
missing
"expensive"
missing
missing
missing
"cheap"
"cheap"
Here I have combined ifelse with map for handling the missing value there are other ways but each one will require nesting some conditional function.
i would do it with a function that returns cheap, expensive or missing:
using Statistics
data = ifelse.(rand(Bool,100),missing,100*rand(100)) #generator for the data
meandata = mean(skipmissing(data)) #mean of the data
function category_select(x)
ismissing(x) && return missing #short-circuit operator
return ifelse(x<meandata,"cheap","expensive") #parentheses are optional
end
category_select2(x) = ismissing(x) ? missing : (x < meandata ? "cheap" : "expensive)
#broadcast values
x_category = category_selector.(data)
x_category = category_selector2.(data)
now, what is happening? there are two things with the ifelse function:
It evaluates both branches at the same time, so if one branch can error, it will error. take this example:
maybelog(x) = ifelse(x<0,zero(x),log(x)) #ifelse
maybelog2(x) = begin if x<0; zero(x);else;log(x);end #full if expression
maybelog3(x) = x<0 ? zero(x) : log(x) #ternary operator
maybelog fails with x = -1, whereas maybelog2 and maybelog3 does not.
The first argument is always a bool. In your initial expression,the result of df.x .< mean(df.x) can be true, false or missing, so ifelse also fails there.
in your modified expression, the length of skipmissing(df.x) is different than the length of x as the first one doesnt count the missing values present in x, resulting in a smaller vector than the size of your dataframe.
If you are using DataFrames.jl (which it looks like you do) then I recommend you to learn the metapackages that simplify syntax for such scenarios. Here is how you can write your query using DataFrameMacros.jl:
#transform!(df,
#subset(!ismissing(:x)),
:x_category = #c ifelse.(:x .< mean(:x), "cheap", "expensive"))
This is the simplest approach I think.
You can try something like this. Using toy data.
First get your string values from ifelse into a vector.
Then prepare the string vector by converting it to a Union of strings and missing to hold missing values.
Finally put the missing values into the vector.
julia> using DataFrames, Random
julia> vec = ifelse.(df.d[ismissing.(df.d) .== false] .> 0.5,"higher","lower")
40-element Vector{String}:
"higher"
"lower"
"lower"
etc...
julia> vec = convert(Vector{Union{Missing,String}}, vec)
40-element Vector{Union{Missing, String}}
julia> for i in findall(ismissing.(df.d)) insert!(vec, i, missing) end
julia> df.x = vec
julia> df
42×2 DataFrame
Row │ d x
│ Float64? String?
─────┼──────────────────────────
1 │ 0.533183 higher
2 │ 0.454029 lower
3 │ 0.0176868 lower
4 │ 0.172933 lower
5 │ 0.958926 higher
6 │ 0.973566 higher
7 │ 0.30387 lower
8 │ 0.176909 lower
9 │ 0.956916 higher
10 │ 0.584284 higher
11 │ 0.937466 higher
12 │ missing missing
13 │ 0.422956 lower
etc...
Data
julia> Random.seed!(42)
MersenneTwister(42)
julia> data = Random.rand(42)
42-element Vector{Float64}:
0.5331830160438613
0.4540291355871424
etc...
julia> data = convert(Vector{Union{Missing,Float64}}, data)
42-element Vector{Union{Missing, Float64}}
julia> data[[12,34]] .= missing
2-element view(::Vector{Union{Missing, Float64}}, [12, 34]) with eltype Union{Missing, Float64}:
missing
missing
julia> df = DataFrame(d=data)
I am wondering why I cannot use the Cols column selector in transform to change a dataframe column. For instance:
df = DataFrame(x = 1:5, y = 6:10)
transform(df, [:x, :y] .=> v -> v .+ 100) # OK
df[!, Cols(1:2)] .= df[!, Cols(1:2)] .+ 100 # OK
transform(df, Cols(1:2) .=> v -> v .+ 100) # MethodError: no method matching length(::Cols{Tuple{UnitRange{Int64}}})
I've read in the DataFrames documentation that says column selectors such as Cols, Between, Not, and All can be used in transform, among others, but yet I get this error.
Thanks for any pointers.
These selectors can be used when they are passed to transform directly. Here you are using broadcasting with .=> (note the dot), so you are not passing them directly to transform, but instead try pass the following:
julia> Cols(1:2) .=> v -> v .+ 100
ERROR: MethodError: no method matching length(::Cols{Tuple{UnitRange{Int64}}})
The error you observe is not emitted by DataFrames.jl but by Julia base.
What you need to do is to use names to make things work:
julia> names(df, Cols(1:2)) .=> v -> v .+ 100
2-element Vector{Pair{String, var"#7#8"}}:
"x" => var"#7#8"()
"y" => var"#7#8"()
and in consequence the following works:
transform(df, names(df, Cols(1:2)) .=> v -> v .+ 100)
In the future the functionality you request might be added but it requires changes in DataAPI.jl, see here.
EDIT
As signaled in the original answer in DataFrames.jl 1.3 the functionality has been added and now you can do transform(df, Cols(1:2) .=> v -> v .+ 100) without error. See https://bkamins.github.io/julialang/2021/12/17/selectors.html for an explanation how it works now.
Given the following data frame from DataFrames.jl:
julia> using DataFrames
julia> df = DataFrame(x1=[1, 2, 3], x2=Union{Int,Missing}[1, 2, 3], x3=[1, 2, missing])
3×3 DataFrame
Row │ x1 x2 x3
│ Int64 Int64? Int64?
─────┼────────────────────────
1 │ 1 1 1
2 │ 2 2 2
3 │ 3 3 missing
I would like to find columns that contain missing value in them.
I have tried:
julia> names(df, Missing)
String[]
but this is incorrect as the names function, when passed a type, looks for subtypes of the passed type.
If you want to find columns that actually contain missing value use:
julia> names(df, any.(ismissing, eachcol(df)))
1-element Vector{String}:
"x3"
In this approach we iterate each column of the df data frame and check if it contains at least one missing value.
If you want to find columns that potentially can contain missing value you need to check their element type:
julia> names(df, [eltype(col) >: Missing for col in eachcol(df)]) # using a comprehension
2-element Vector{String}:
"x2"
"x3"
julia> names(df, .>:(eltype.(eachcol(df)), Missing)) # using broadcasting
2-element Vector{String}:
"x2"
"x3"
I'm trying to implement OLS regression in Julia as a learning exercise. A feature I would like to have is excepting a formula as an argument (e.g. 'formula = Y ~ x1 + x2', where Y, x1, and x2 are columns in a DataFrame). Here is an existing example.
How do I "map" the formula/expression to the correct DataFrame columns?
Formulas in the Julia statistics packages are implemented as a macro. A macro is defined for the ~ symbol, which means that the expressions are parsed by the Julia compiler. Once parsed by the compiler, they are stored as the rhs and lhs fields of a composite type called Formula.
The details of the implementation, which is relatively simple, can be seen in the DataFrames.jl source code here: https://github.com/JuliaStats/DataFrames.jl/blob/725a22602b8b3f6413e35ebdd707b69c4ed7b659/src/statsmodels/formula.jl
Use an anonymous function as an input.
julia > using DataFrames
julia > f = (x,y) -> x[:A] .* y[:B] # Anonymous function
julia > x = DataFrame(A = 6)
julia > y = DataFrame(B = 7)
julia > function OSL(x::DataFrame,y::DataFrame,f::Function);return f(x,y);end
julia > OSL(x,y,f)
1-element DataArrays.DataArray{Int64,1}:
42
Here's a minimal example using the boston dataset from ISLR, regressing medv on lstat. (Check pg. 111 of ISLR if you want verify that the weight vector is correct)
julia> using DataFrames, RDatasets
julia> df = dataset("MASS", "Boston")
julia> fm = #formula(MedV ~ LStat)
julia> mf = ModelFrame(fm, df)
julia> X = ModelMatrix(mf).m
julia> y = Array(df[:MedV])
julia> w = X \ y
2-element Array{Float64,1}:
34.5538
-0.950049
For more information: http://dataframesjl.readthedocs.io/en/latest/formulas.html
I'd like to get the number of rows of a dataframe.
I can achieve that with size(myDataFrame)[1].
Is there a cleaner way ?
If you are using DataFrames specifically, then you can use nrow():
julia> df = DataFrame(Any[1:10, 1:10]);
julia> nrow(df)
10
Alternatively, you can specify the dimension argument for size:
julia> size(df, 1)
10
This also work for arrays as well so it's a bit more general:
julia> my_array = rand(4, 3)
4×3 Array{Float64,2}:
0.980798 0.873643 0.819478
0.341972 0.34974 0.160342
0.262292 0.387406 0.00741398
0.512669 0.81579 0.329353
julia> size(my_array, 1)
4