Using operators inside a plot function in Julia - dataframe

I have following dataframe
I would like to plot Evar / (T^2 * L)
using Plots, DataFrames, CSV
#df data plot(:T, :Evar / (:T * T * :L) , group=:L, legend=nothing)
MethodError: no method matching *(::Vector{Float64}, ::Vector{Float64})
Unfortunately I am not sure how to use operators inside the plot function.
For the "/" operator it seems to work, but if I want to multiply using "*" I get the error above.
Here is an example of what I mean by "/" working:

You need to vectorize the multiplication and division so this will be:
#df data plot(:T, :Evar ./ (:T .* :T .* :L) , group=:L, legend=nothing)
Simpler example:
julia> a = [1,3,4];
julia> b = [4,5,6];
julia> a * b
ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64})
julia> a .* b
3-element Vector{Int64}:
4
15
24
Not that / works because / is defined for vectors but the results is perhaps not exactly what you would have wanted:
julia> c = a / b
3×3 Matrix{Float64}:
0.0519481 0.0649351 0.0779221
0.155844 0.194805 0.233766
0.207792 0.25974 0.311688
It just returned matrix such as c*b == a where * is a matrix multiplication.

Related

Transform dataframe columns using column selector (Cols) fails

I am wondering why I cannot use the Cols column selector in transform to change a dataframe column. For instance:
df = DataFrame(x = 1:5, y = 6:10)
transform(df, [:x, :y] .=> v -> v .+ 100) # OK
df[!, Cols(1:2)] .= df[!, Cols(1:2)] .+ 100 # OK
transform(df, Cols(1:2) .=> v -> v .+ 100) # MethodError: no method matching length(::Cols{Tuple{UnitRange{Int64}}})
I've read in the DataFrames documentation that says column selectors such as Cols, Between, Not, and All can be used in transform, among others, but yet I get this error.
Thanks for any pointers.
These selectors can be used when they are passed to transform directly. Here you are using broadcasting with .=> (note the dot), so you are not passing them directly to transform, but instead try pass the following:
julia> Cols(1:2) .=> v -> v .+ 100
ERROR: MethodError: no method matching length(::Cols{Tuple{UnitRange{Int64}}})
The error you observe is not emitted by DataFrames.jl but by Julia base.
What you need to do is to use names to make things work:
julia> names(df, Cols(1:2)) .=> v -> v .+ 100
2-element Vector{Pair{String, var"#7#8"}}:
"x" => var"#7#8"()
"y" => var"#7#8"()
and in consequence the following works:
transform(df, names(df, Cols(1:2)) .=> v -> v .+ 100)
In the future the functionality you request might be added but it requires changes in DataAPI.jl, see here.
EDIT
As signaled in the original answer in DataFrames.jl 1.3 the functionality has been added and now you can do transform(df, Cols(1:2) .=> v -> v .+ 100) without error. See https://bkamins.github.io/julialang/2021/12/17/selectors.html for an explanation how it works now.

DataFrames : no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)

When I try to use the dot operator (element wise operation) in a DataFrame where a function returning a tuple is applied, I get the following error.
Here is a toy example,
df = DataFrame()
df[:, :x] = rand(5)
df[:, :y] = rand(5)
#Function that returns two values in the form of a tuple
add_minus_two(x,y) = (x-y,x+y)
df[:,"x+y"] = add_minus_two.(df[:,:x], df[:,:y])[2]
#Out > ERROR: MethodError: no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)
#However removing the dot operator works fine
df[:,"x+y"] = add_minus_two(df[:,:x], df[:,:y])[2]
#Out > 5 x 3 DataFrame
#Furthermore if its just one argument either dot or not, works fine as well
add_two(x,y) = x+y
df[:, "x+y"] = add_two(df[:,:x], df[:,:y])
df[:, "x+y"] = add_two.(df[:,:x], df[:,:y])
#out > 5 x 3 DataFrame
Any reason why this is. I thought for elementwise operation you need to use "dot" operator.
Also for my actual problem (when a function return 2 values in a tuple), when NOT using the dot operator gives,
ERROR: MethodError: no method matching compute_T(::Vector{Float64}, ::Vector{Float64})
and using the dot operator gives,
ERROR: MethodError: no method matching setindex!(::DataFrame, ::Tuple{Float64, Float64}, ::Colon, ::String)
and returning a single argument, similar to the toy example works fine as well.
Any clue what I am doing incorrectly here ?
This is not a DataFrames.jl issue, but how Julia Base works.
I concentrate only on RHS, as LHS is irrelevant (and RHS is unrelated to DataFrames.jl).
First, how to write what you want. Initialization:
julia> using DataFrames
julia> df = DataFrame()
0×0 DataFrame
julia> df[:, :x] = rand(5)
5-element Vector{Float64}:
0.6146045473316457
0.6319531776216596
0.599267794937812
0.40864382019544965
0.3738682778395166
julia> df[:, :y] = rand(5)
5-element Vector{Float64}:
0.07891853567296825
0.2143545316544586
0.5943274462916335
0.2182702556068421
0.5810132720450707
julia> add_minus_two(x,y) = (x-y,x+y)
add_minus_two (generic function with 1 method)
And now you get:
julia> add_minus_two(df[:,:x], df[:,:y])
([0.5356860116586775, 0.417598645967201, 0.004940348646178538, 0.19037356458860755, -0.2071449942055541], [0.693523083004614, 0.8463077092761182, 1.1935952412294455, 0.6269140758022917, 0.9548815498845873])
julia> add_minus_two.(df[:,:x], df[:,:y])
5-element Vector{Tuple{Float64, Float64}}:
(0.5356860116586775, 0.693523083004614)
(0.417598645967201, 0.8463077092761182)
(0.004940348646178538, 1.1935952412294455)
(0.19037356458860755, 0.6269140758022917)
(-0.2071449942055541, 0.9548815498845873)
julia> add_minus_two(df[:,:x], df[:,:y])[2]
5-element Vector{Float64}:
0.693523083004614
0.8463077092761182
1.1935952412294455
0.6269140758022917
0.9548815498845873
julia> add_minus_two.(df[:,:x], df[:,:y])[2]
(0.417598645967201, 0.8463077092761182)
julia> getindex.(add_minus_two.(df[:,:x], df[:,:y]), 2) # this is probably what you want
5-element Vector{Float64}:
0.693523083004614
0.8463077092761182
1.1935952412294455
0.6269140758022917
0.9548815498845873
Now the point is that when you write:
df[:,"x+y"] = whatever_you_pass
The whatever_you_pass part must be an AbstractVector with an appropriate number of columns. This means that what will work is:
add_minus_two.(df[:,:x], df[:,:y])
add_minus_two(df[:,:x], df[:,:y])[2]
getindex.(add_minus_two.(df[:,:x], df[:,:y]), 2)
and what will fail is (as in these cases a Tuple not AbstractVector is produced)
add_minus_two(df[:,:x], df[:,:y])
add_minus_two.(df[:,:x], df[:,:y])[2]
Out of the working syntaxes just pick the one you want.
The general recommendation is that when you do assignment always inspect the RHS stand alone and analyze if it has a proper structure.
Also, notably, this will work:
julia> transform(df, [:x, :y] => ByRow(add_minus_two) => ["x-y", "x+y"])
5×4 DataFrame
Row │ x y x-y x+y
│ Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────
1 │ 0.614605 0.0789185 0.535686 0.693523
2 │ 0.631953 0.214355 0.417599 0.846308
3 │ 0.599268 0.594327 0.00494035 1.1936
4 │ 0.408644 0.21827 0.190374 0.626914
5 │ 0.373868 0.581013 -0.207145 0.954882
(you have not asked about it but maybe this is what you actually are looking for - and as opposed to setindex! this syntax is DataFrames.jl specific)

How to make a scatter plot based on the values of a column in the data set?

I am given a data set that looks something like this
and I am trying to graph all the points with a 1 on the first column separate from the points with a 0, but I want to put them in the same chart.
I know the final result should be something similar to this
But I can't find a way to filter the points in Julia. I'm using LinearAlgebra, CSV, Plots, DataFrames for my project, and so far I haven't found a way to make DataFrames storage types work nicely with Plots functions. I keep running into errors like Cannot convert Float64 to series data for plotting when I try plotting the points individually with a for loop as a filter as shown in the code below
filter = select(data, :1)
newData = select(data, 2:3)
#graph one initial point to create the plot
plot(newData[1,1], newData[1,2], seriestype = :scatter, title = "My Scatter Plot")
#add the additional points with the 1 in front
for i in 2:size(newData)
if filter[i] == 1
plot!(newData[i, 1], newData[i, 2], seriestype = :scatter, title = "My Scatter Plot")
end
end
Other approaches have given me other errors, but I haven't recorded those.
I'm using Julia 1.4.0 and the latest versions of all of the packages mentioned.
Quick Edit:
It might help to know that I am trying to replicate the Nonlinear dimensionality reduction section of this article https://sebastianraschka.com/Articles/2014_kernel_pca.html#principal-component-analysis
With Plots.jl you can do the following (I am passing a fully reproducible code):
julia> df = DataFrame(c=rand(Bool, 100), x = 2 .* rand(100) .- 1);
julia> df.y = ifelse.(df.c, 1, -1) .* df.x .^ 2;
julia> plot(df.x, df.y, color=ifelse.(df.c, "blue", "red"), seriestype=:scatter, legend=nothing)
However, in this case I would additionally use StatsPlots.jl as then you can just write:
julia> using StatsPlots
julia> #df df plot(:x, :y, group=:c, seriestype=:scatter, legend=nothing)
If you want to do it manually by groups it is easiest to use the groupby function:
julia> gdf = groupby(df, :c);
julia> summary(gdf) # check that we have 2 groups in data
"GroupedDataFrame with 2 groups based on key: c"
julia> plot(gdf[1].x, gdf[1].y, seriestype=:scatter, legend=nothing)
julia> plot!(gdf[2].x, gdf[2].y, seriestype=:scatter)
Note that gdf variable is bound to a GroupedDataFrame object from which you can get groups defined by the grouping column (:c) in this case.

Julia DataFrames equivalent of pandas pct_change()

Currently, I have written the below function for percent change calculation:
function pct_change(input::AbstractVector{<:Number})::AbstractVector{Number}
result = [NaN]
for i in 2:length(input)
push!(result, (input[i] - input[i-1])/abs(input[i-1]))
end
return result
end
This works as expected. But wanted to know whether there is a built-in function for Julia DataFrames similar to pandas pct_change which I can use directly? Or any other better way or improvements that I can make to my function above?
This is a very specific function and is not provided in DataFrames.jl, but rather TimeSeries.jl. Here is an example:
julia> using TimeSeries, Dates
julia> ta = TimeArray(Date(2018, 1, 1):Day(1):Date(2018, 12, 31), 1:365);
julia> percentchange(ta);
(there are some more options to what should be calculated)
The drawback is that it accepts only TimeArray objects and that it drops periods for which percent change cannot be calculated (as they are retained in Python).
If you want your custom definition consider denoting the first value as missing rather than NaN, as missing. Also your function will not produce the most accurate representation of the numbers (e.g. if you wanted to use BigFloat or exact calculations using Rational type they will be converted to Float64). Here are example alternative function implementations that avoid these problems:
function pct_change(input::AbstractVector{<:Number})
res = #view(input[2:end]) ./ #view(input[1:end-1]) .- 1
[missing; res]
end
or
function pct_change(input::AbstractVector{<:Number})
[i == 1 ? missing : (input[i]-input[i-1])/input[i-1] for i in eachindex(input)]
end
And now you have in both cases:
julia> pct_change(1:10)
10-element Array{Union{Missing, Float64},1}:
missing
1.0
0.5
0.33333333333333326
0.25
0.19999999999999996
0.16666666666666674
0.1428571428571428
0.125
0.11111111111111116
julia> pct_change(big(1):10)
10-element Array{Union{Missing, BigFloat},1}:
missing
1.0
0.50
0.3333333333333333333333333333333333333333333333333333333333333333333333333333391
0.25
0.2000000000000000000000000000000000000000000000000000000000000000000000000000069
0.1666666666666666666666666666666666666666666666666666666666666666666666666666609
0.1428571428571428571428571428571428571428571428571428571428571428571428571428547
0.125
0.111111111111111111111111111111111111111111111111111111111111111111111111111113
julia> pct_change(1//1:10)
10-element Array{Union{Missing, Rational{Int64}},1}:
missing
1//1
1//2
1//3
1//4
1//5
1//6
1//7
1//8
1//9
with proper values returned.

Taking an expression as an argument in Julia function

I'm trying to implement OLS regression in Julia as a learning exercise. A feature I would like to have is excepting a formula as an argument (e.g. 'formula = Y ~ x1 + x2', where Y, x1, and x2 are columns in a DataFrame). Here is an existing example.
How do I "map" the formula/expression to the correct DataFrame columns?
Formulas in the Julia statistics packages are implemented as a macro. A macro is defined for the ~ symbol, which means that the expressions are parsed by the Julia compiler. Once parsed by the compiler, they are stored as the rhs and lhs fields of a composite type called Formula.
The details of the implementation, which is relatively simple, can be seen in the DataFrames.jl source code here: https://github.com/JuliaStats/DataFrames.jl/blob/725a22602b8b3f6413e35ebdd707b69c4ed7b659/src/statsmodels/formula.jl
Use an anonymous function as an input.
julia > using DataFrames
julia > f = (x,y) -> x[:A] .* y[:B] # Anonymous function
julia > x = DataFrame(A = 6)
julia > y = DataFrame(B = 7)
julia > function OSL(x::DataFrame,y::DataFrame,f::Function);return f(x,y);end
julia > OSL(x,y,f)
1-element DataArrays.DataArray{Int64,1}:
42
Here's a minimal example using the boston dataset from ISLR, regressing medv on lstat. (Check pg. 111 of ISLR if you want verify that the weight vector is correct)
julia> using DataFrames, RDatasets
julia> df = dataset("MASS", "Boston")
julia> fm = #formula(MedV ~ LStat)
julia> mf = ModelFrame(fm, df)
julia> X = ModelMatrix(mf).m
julia> y = Array(df[:MedV])
julia> w = X \ y
2-element Array{Float64,1}:
34.5538
-0.950049
For more information: http://dataframesjl.readthedocs.io/en/latest/formulas.html