I'm trying to implement OLS regression in Julia as a learning exercise. One feature I would like to have is accepting a formula as an argument (e.g. 'formula = Y ~ x1 + x2', where Y, x1, and x2 are columns in a DataFrame). Here is an existing example.
How do I "map" the formula/expression to the correct DataFrame columns?
Formulas in the Julia statistics packages are implemented as a macro. A macro is defined for the ~ symbol, which means that the expressions are parsed by the Julia compiler. Once parsed by the compiler, they are stored as the rhs and lhs fields of a composite type called Formula.
The details of the implementation, which is relatively simple, can be seen in the DataFrames.jl source code here: https://github.com/JuliaStats/DataFrames.jl/blob/725a22602b8b3f6413e35ebdd707b69c4ed7b659/src/statsmodels/formula.jl
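The mapping itself does not depend on the language. As a rough illustration only (this is a Python sketch, not how DataFrames.jl implements its macro), the formula can be split into a left-hand and right-hand side, and the names on each side used to look up columns. The helper names parse_formula and ols, and the made-up data, are assumptions introduced for this example.

```python
import numpy as np

def parse_formula(formula):
    """Split 'Y ~ x1 + x2' into ('Y', ['x1', 'x2'])."""
    lhs, rhs = formula.split("~")
    return lhs.strip(), [term.strip() for term in rhs.split("+")]

def ols(formula, data):
    """Fit OLS with an intercept; `data` maps column names to sequences."""
    response, predictors = parse_formula(formula)
    y = np.asarray(data[response], dtype=float)
    # Design matrix: a column of ones (intercept) followed by the named columns.
    columns = [np.ones(len(y))] + [np.asarray(data[p], dtype=float) for p in predictors]
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["(Intercept)"] + predictors, coef))

# Made-up data in which Y = 1 + 2*x1 exactly and x2 plays no role.
data = {"x1": [0, 1, 2, 3], "x2": [1, 0, 1, 0], "Y": [1, 3, 5, 7]}
fit = ols("Y ~ x1 + x2", data)
```

A string parser like this only handles plain sums of column names; the macro approach described above gets the parsing from the compiler instead.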
Use an anonymous function as an input.
julia> using DataFrames
julia> f = (x,y) -> x[:A] .* y[:B] # Anonymous function
julia> x = DataFrame(A = 6)
julia> y = DataFrame(B = 7)
julia> function OLS(x::DataFrame,y::DataFrame,f::Function);return f(x,y);end
julia> OLS(x,y,f)
1-element DataArrays.DataArray{Int64,1}:
42
Here's a minimal example using the Boston dataset from ISLR, regressing medv on lstat. (Check pg. 111 of ISLR if you want to verify that the weight vector is correct.)
julia> using DataFrames, RDatasets
julia> df = dataset("MASS", "Boston")
julia> fm = @formula(MedV ~ LStat)
julia> mf = ModelFrame(fm, df)
julia> X = ModelMatrix(mf).m
julia> y = Array(df[:MedV])
julia> w = X \ y
2-element Array{Float64,1}:
34.5538
-0.950049
For more information: http://dataframesjl.readthedocs.io/en/latest/formulas.html
I have the following dataframe
I would like to plot Evar / (T^2 * L)
using Plots, DataFrames, CSV
@df data plot(:T, :Evar / (:T * :T * :L), group=:L, legend=nothing)
MethodError: no method matching *(::Vector{Float64}, ::Vector{Float64})
Unfortunately I am not sure how to use operators inside the plot function.
For the "/" operator it seems to work, but if I want to multiply using "*" I get the error above.
Here is an example of what I mean by "/" working:
You need to vectorize the multiplication and division so this will be:
@df data plot(:T, :Evar ./ (:T .* :T .* :L), group=:L, legend=nothing)
Simpler example:
julia> a = [1,3,4];
julia> b = [4,5,6];
julia> a * b
ERROR: MethodError: no method matching *(::Vector{Int64}, ::Vector{Int64})
julia> a .* b
3-element Vector{Int64}:
4
15
24
Note that / works because / is defined for vectors, but the result is perhaps not exactly what you wanted:
julia> c = a / b
3×3 Matrix{Float64}:
0.0519481 0.0649351 0.0779221
0.155844 0.194805 0.233766
0.207792 0.25974 0.311688
It returned a matrix c such that c*b ≈ a, where * is matrix multiplication.
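For anyone who wants to check those numbers outside Julia, here is a small NumPy sketch. It assumes that the matrix printed above is the outer product of a and b divided by dot(b, b); that assumption matches the printed values (for example 4/77 ≈ 0.0519481), and the sketch also shows the elementwise product for comparison.

```python
import numpy as np

a = np.array([1.0, 3.0, 4.0])
b = np.array([4.0, 5.0, 6.0])

# Elementwise product, the counterpart of Julia's a .* b
elementwise = a * b

# Candidate for Julia's a / b: outer(a, b) scaled by 1 / dot(b, b)
c = np.outer(a, b) / np.dot(b, b)

# Multiplying c by b recovers a (up to floating-point rounding)
recovered = c @ b
```

So the unbroadcast / gives a 3×3 matrix rather than three ratios, which is why the plot call needs ./ as well as .* to work row by row.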
I am using dataframes.jl package in the below mentioned code to perform certain operations.
I would like to know, how may I apply CUDA.jl on this code, if possible while keeping the dataframe aspect?
Secondly, is it possible to allow the code to automatically choose between CPU and GPU based on the availability?
Code
using DataFrames
df = DataFrame(i = Int64[], a = Float64[], b =Float64[])
for i in 1:10
push!(df.i, i)
a = i + sin(i)*cos(i)/sec(i)^100
push!(df.a, a)
b = i + tan(i)*csc(i)/sin(i)
push!(df.b, b)
end
transform!(df, [:a, :b] .=> (x -> [missing; diff(x)]) .=> [:da, :db])
Please suggest a solution to make this code compatible with CUDA.jl.
Thanks in advance!!
Is there an alternative to numpy.atleast_2d() in Julia?
The python function can be found on this link: https://www.geeksforgeeks.org/numpy-atleast_2d-in-python/
Looking at the Python NumPy docs, this can be defined as:
atleast_2d(a) = fill(a,1,1)
atleast_2d(a::AbstractArray) = ndims(a) == 1 ? reshape(a, :, 1) : a
Testing:
julia> atleast_2d(3)
1×1 Matrix{Int64}:
3
julia> atleast_2d([4,5])
2×1 Matrix{Int64}:
4
5
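For comparison, this is what NumPy itself returns, shown as a short check. Note one difference: NumPy places a 1-D input along the second axis, giving a row of shape (1, N), whereas the Julia definition above gives an N×1 column. If the row layout matters, reshape(a, 1, :) would be the closer Julia equivalent.

```python
import numpy as np

scalar = np.atleast_2d(3)                    # scalar becomes a 1x1 array
vector = np.atleast_2d([4, 5])               # 1-D input becomes a single row
cube = np.atleast_2d(np.zeros((2, 3, 4)))    # inputs with 2 or more dims are left as they are

shapes = (scalar.shape, vector.shape, cube.shape)
```

Inputs that already have two or more dimensions pass through unchanged in both versions.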
I am given a data set that looks something like this
and I am trying to graph all the points with a 1 on the first column separate from the points with a 0, but I want to put them in the same chart.
I know the final result should be something similar to this
But I can't find a way to filter the points in Julia. I'm using LinearAlgebra, CSV, Plots, DataFrames for my project, and so far I haven't found a way to make DataFrames storage types work nicely with Plots functions. I keep running into errors like Cannot convert Float64 to series data for plotting when I try plotting the points individually with a for loop as a filter as shown in the code below
filter = select(data, :1)
newData = select(data, 2:3)
#graph one initial point to create the plot
plot(newData[1,1], newData[1,2], seriestype = :scatter, title = "My Scatter Plot")
#add the additional points with the 1 in front
for i in 2:size(newData)
if filter[i] == 1
plot!(newData[i, 1], newData[i, 2], seriestype = :scatter, title = "My Scatter Plot")
end
end
Other approaches have given me other errors, but I haven't recorded those.
I'm using Julia 1.4.0 and the latest versions of all of the packages mentioned.
Quick Edit:
It might help to know that I am trying to replicate the Nonlinear dimensionality reduction section of this article https://sebastianraschka.com/Articles/2014_kernel_pca.html#principal-component-analysis
With Plots.jl you can do the following (fully reproducible code):
julia> using DataFrames, Plots
julia> df = DataFrame(c=rand(Bool, 100), x = 2 .* rand(100) .- 1);
julia> df.y = ifelse.(df.c, 1, -1) .* df.x .^ 2;
julia> plot(df.x, df.y, color=ifelse.(df.c, "blue", "red"), seriestype=:scatter, legend=nothing)
However, in this case I would additionally use StatsPlots.jl as then you can just write:
julia> using StatsPlots
julia> @df df plot(:x, :y, group=:c, seriestype=:scatter, legend=nothing)
If you want to do it manually by groups it is easiest to use the groupby function:
julia> gdf = groupby(df, :c);
julia> summary(gdf) # check that we have 2 groups in data
"GroupedDataFrame with 2 groups based on key: c"
julia> plot(gdf[1].x, gdf[1].y, seriestype=:scatter, legend=nothing)
julia> plot!(gdf[2].x, gdf[2].y, seriestype=:scatter)
Note that the gdf variable is bound to a GroupedDataFrame object, from which you can get the groups defined by the grouping column (:c in this case).
My dataframe is [11 x 300]: the column headers are the 'x' values ([0.75,1,1.25,1.5,1.75,2,2.25,2.5,2.75,3,3.25]) and each row holds the corresponding 'y' values. Each row can be described by a power function of the following form: a * x^k + b.
The goal is to add three additional columns, describing a, k and b for that specific row. Just like: Python curve fitting on pandas dataframe then add coef to new columns
Instead of a polynomial function, my data needs to be described in the following format: a * x**k + b.
As I cannot find any solution to derive the coefficients by using np.polyfit, I split my dataframe into different lists.
import numpy as np
from scipy.optimize import curve_fit
x = np.array([0.75,1,1.25,1.5,1.75,2,2.25,2.5,2.75,3,3.25])
y1 = np.array([288.79,238.32,199.42,181.22,165.50,154.74,152.25,152.26,144.81,144.81,144.81])
y2 = np.array([309.92,255.75,214.02,194.48,177.61,166.06,163.40,163.40,155.41,155.41,155.41])
...
y300 = np.array([352.18,290.63,243.20,221.00,201.83,188.71,185.68,185.68,176.60,176.60,176.60])
def func(x,a,k,b):
    return a * (x**k) + b
popt1, pcov = curve_fit(func,x,y1, p0 = (300,-0.5,0))
...
popt300, pcov = curve_fit(func,x,y300, p0 = (300,-0.5,0))
output:
popt1
[107.73727907 -1.545475 123.48621504]
...
popt300
[131.38411712 -1.5454452 150.59522147]
This works, when I split all dataframe rows into lists and define popt for every list/row.
To avoid splitting all 300 rows, I would prefer to apply the same methodology as in "Python curve fitting on pandas dataframe then add coef to new columns":
my_coep_array = pd.DataFrame(np.polyfit(x, df.values,1)).T
But how do I fit a * x**k + b this way, given that np.polyfit only handles polynomials?
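One possible direction, sketched under assumptions: np.polyfit has no way to express a * x**k + b, but curve_fit can be called once per row in a loop, so the 300 separate variables are not needed. The helper name fit_rows is made up for this sketch, and the two-row array Y stands in for df.values (only y1 and y2 from the question are used).

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, k, b):
    return a * x**k + b

def fit_rows(x, Y, p0=(300, -0.5, 0)):
    """Fit a * x**k + b to each row of Y; returns an (n_rows, 3) array of [a, k, b]."""
    return np.array([curve_fit(func, x, row, p0=p0)[0] for row in Y])

x = np.array([0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, 3, 3.25])

# Stand-in for df.values: one row of y values per dataframe row.
Y = np.array([
    [288.79, 238.32, 199.42, 181.22, 165.50, 154.74, 152.25, 152.26, 144.81, 144.81, 144.81],
    [309.92, 255.75, 214.02, 194.48, 177.61, 166.06, 163.40, 163.40, 155.41, 155.41, 155.41],
])

coefs = fit_rows(x, Y)  # column 0 is a, column 1 is k, column 2 is b
```

With the real dataframe, the coefficients would be computed from df.values before any new columns are added, and then attached one at a time, e.g. df["a"] = coefs[:, 0], and likewise for k and b. The loop still runs one fit per row; it only removes the manual splitting.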