How to apply CUDA.jl to DataFrames in Julia

I am using the DataFrames.jl package in the code below to perform certain operations.
How can I apply CUDA.jl to this code, if possible while keeping the data frame aspect?
Secondly, is it possible to have the code automatically choose between CPU and GPU based on availability?
Code
using DataFrames
df = DataFrame(i = Int64[], a = Float64[], b =Float64[])
for i in 1:10
push!(df.i, i)
a = i + sin(i)*cos(i)/sec(i)^100
push!(df.a, a)
b = i + tan(i)*csc(i)/sin(i)
push!(df.b, b)
end
transform!(df, [:a, :b] .=> (x -> [missing; diff(x)]) .=> [:da, :db])
Please suggest a solution to make this code compatible with CUDA.jl.
Thanks in advance!!

Related

Transform dataframe columns using column selector (Cols) fails

I am wondering why I cannot use the Cols column selector in transform to change a dataframe column. For instance:
df = DataFrame(x = 1:5, y = 6:10)
transform(df, [:x, :y] .=> v -> v .+ 100) # OK
df[!, Cols(1:2)] .= df[!, Cols(1:2)] .+ 100 # OK
transform(df, Cols(1:2) .=> v -> v .+ 100) # MethodError: no method matching length(::Cols{Tuple{UnitRange{Int64}}})
I've read in the DataFrames documentation that column selectors such as Cols, Between, Not, and All can be used in transform, among others, yet I get this error.
Thanks for any pointers.
These selectors can be used when they are passed to transform directly. Here you are using broadcasting with .=> (note the dot), so you are not passing them directly to transform; instead you are trying to pass the following:
julia> Cols(1:2) .=> v -> v .+ 100
ERROR: MethodError: no method matching length(::Cols{Tuple{UnitRange{Int64}}})
The error you observe is not emitted by DataFrames.jl but by Julia base.
What you need to do is to use names to make things work:
julia> names(df, Cols(1:2)) .=> v -> v .+ 100
2-element Vector{Pair{String, var"#7#8"}}:
"x" => var"#7#8"()
"y" => var"#7#8"()
and in consequence the following works:
transform(df, names(df, Cols(1:2)) .=> v -> v .+ 100)
In the future the functionality you request might be added but it requires changes in DataAPI.jl, see here.
EDIT
As signaled in the original answer, in DataFrames.jl 1.3 the functionality has been added, and now you can do transform(df, Cols(1:2) .=> v -> v .+ 100) without error. See https://bkamins.github.io/julialang/2021/12/17/selectors.html for an explanation of how it works now.

Fill Forward cupy / cudf

Is it possible to execute a forward fill with cupy/cudf? The idea is to implement a Schmitt trigger function, something like:
# pandas version
df = some_random_vector
on_off = (df>.3)*1 + (df<.3)*-1
on_off[on_off==0] = np.nan
on_off = on_off.fillna(method='ffill').fillna(0)
I was trying this one, but cupy doesn't have an accumulate ufunc:
def schmitt_trigger(x, th_lo, th_hi, initial = False):
on_off = ((x >= th_hi)*1 + (x <= th_lo)*-1).astype(cp.int8)
mask = (on_off==0)
idx = cp.where(~mask, cp.arange(start=0, stop=mask.shape[0], step=1), 0)
cp.maximum.accumulate(idx,axis=1, out=idx)
out = on_off[cp.arange(idx.shape[0])[:,None], idx]
return out
any idea?
thanks!
Sadly, RAPIDS currently doesn't have that feature in cudf and may not for 0.16 either. There is a feature request on GitHub for it: https://github.com/rapidsai/cudf/issues/1361
Would love for you to chime in on the request so that the devs know it's highly desired.
As for the Schmitt Trigger, I'll look into it and your code and edit this post if I get any progress.
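In the meantime, here is a minimal CPU-only sketch of the index-based forward fill that the question's code is attempting, written with the standard library only. The function name and thresholds mirror the question; the sample signal is made up for illustration. It makes no claim about which of these steps are available in CuPy or cudf, it only shows the intended semantics so a GPU version can be checked against it.

```python
from itertools import accumulate

def schmitt_trigger(x, th_lo, th_hi):
    """Return +1/-1 states with hysteresis; 0 until the first threshold hit."""
    # +1 at or above the high threshold, -1 at or below the low one, 0 in between
    on_off = [1 if v >= th_hi else -1 if v <= th_lo else 0 for v in x]
    # Keep the index of every decided sample, use 0 for undecided ones
    idx = [i if s != 0 else 0 for i, s in enumerate(on_off)]
    # Running maximum = index of the most recent decided sample (the forward fill)
    idx = list(accumulate(idx, max))
    # Look up the state at that index; leading undecided samples stay 0
    return [on_off[i] for i in idx]

signal = [0.1, 0.5, 0.8, 0.5, 0.2, 0.5, 0.9]  # hypothetical input
states = schmitt_trigger(signal, th_lo=0.3, th_hi=0.7)
print(states)  # → [-1, -1, 1, 1, -1, -1, 1]
```

The running maximum over indices plays the role of maximum.accumulate in the question: each position ends up pointing at the last sample that crossed a threshold, so values between the thresholds inherit the previous state.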

How to make a scatter plot based on the values of a column in the data set?

I am given a data set that looks something like this
and I am trying to graph all the points with a 1 on the first column separate from the points with a 0, but I want to put them in the same chart.
I know the final result should be something similar to this
But I can't find a way to filter the points in Julia. I'm using LinearAlgebra, CSV, Plots, DataFrames for my project, and so far I haven't found a way to make DataFrames storage types work nicely with Plots functions. I keep running into errors like Cannot convert Float64 to series data for plotting when I try plotting the points individually with a for loop as a filter as shown in the code below
filter = select(data, :1)
newData = select(data, 2:3)
#graph one initial point to create the plot
plot(newData[1,1], newData[1,2], seriestype = :scatter, title = "My Scatter Plot")
#add the additional points with the 1 in front
for i in 2:size(newData)
if filter[i] == 1
plot!(newData[i, 1], newData[i, 2], seriestype = :scatter, title = "My Scatter Plot")
end
end
Other approaches have given me other errors, but I haven't recorded those.
I'm using Julia 1.4.0 and the latest versions of all of the packages mentioned.
Quick Edit:
It might help to know that I am trying to replicate the Nonlinear dimensionality reduction section of this article https://sebastianraschka.com/Articles/2014_kernel_pca.html#principal-component-analysis
With Plots.jl you can do the following (here is fully reproducible code):
julia> df = DataFrame(c=rand(Bool, 100), x = 2 .* rand(100) .- 1);
julia> df.y = ifelse.(df.c, 1, -1) .* df.x .^ 2;
julia> plot(df.x, df.y, color=ifelse.(df.c, "blue", "red"), seriestype=:scatter, legend=nothing)
However, in this case I would additionally use StatsPlots.jl as then you can just write:
julia> using StatsPlots
julia> @df df plot(:x, :y, group=:c, seriestype=:scatter, legend=nothing)
If you want to do it manually by groups it is easiest to use the groupby function:
julia> gdf = groupby(df, :c);
julia> summary(gdf) # check that we have 2 groups in data
"GroupedDataFrame with 2 groups based on key: c"
julia> plot(gdf[1].x, gdf[1].y, seriestype=:scatter, legend=nothing)
julia> plot!(gdf[2].x, gdf[2].y, seriestype=:scatter)
Note that the gdf variable is bound to a GroupedDataFrame object, from which you can get the groups defined by the grouping column (:c in this case).

I have a dataframe and I want to find the standard deviation for some specific cells

I'm trying to use pandas to find the standard deviation for the entries in some specific cells.
I have tried using NumPy's std like so:
numpy.std(df[columnName][j:i])
I have also tried using this:
df.std(axis=0)[columnName][j:i]
Just pseudocode because my actual code is more confusing than necessary for this question:
df = loadIris()
for feat in df.columns:
i = 0
j = 0
flower = df['flower'][i]
while i < df.index.max():
if df['flower'][i] == flower:
i+=1
else:
j = i
stand = df.std(axis=0)[feat][j:i]
flower = df['flower'][i]
I ended up just appending all of the values to a list and then calculating the standard deviation using statistics.stdev, which is available after importing the statistics module.
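A minimal sketch of that approach, assuming the rows are sorted so that each flower forms one consecutive run, as the loop in the question implies. The sample rows and the helper name stdev_per_run are made up for illustration and are not from the original code.

```python
from itertools import groupby
from statistics import stdev

# Hypothetical (flower, measurement) rows, already ordered by flower
rows = [
    ("setosa", 5.1), ("setosa", 4.9), ("setosa", 4.7),
    ("versicolor", 7.0), ("versicolor", 6.4), ("versicolor", 6.9),
]

def stdev_per_run(rows):
    """Sample standard deviation of the values in each consecutive run of labels."""
    result = {}
    # groupby only merges adjacent rows with the same label
    for label, run in groupby(rows, key=lambda r: r[0]):
        values = [value for _, value in run]  # collect the run into a list
        result[label] = stdev(values)         # needs at least two values
    return result

per_flower = stdev_per_run(rows)
print(per_flower)
```

statistics.stdev computes the sample standard deviation (dividing by n - 1), which is also the default for pandas' std, whereas numpy.std divides by n unless ddof=1 is passed. If the rows are not sorted by label, a later run with the same label overwrites the earlier one in this sketch.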

Taking an expression as an argument in Julia function

I'm trying to implement OLS regression in Julia as a learning exercise. A feature I would like to have is accepting a formula as an argument (e.g. 'formula = Y ~ x1 + x2', where Y, x1, and x2 are columns in a DataFrame). Here is an existing example.
How do I "map" the formula/expression to the correct DataFrame columns?
Formulas in the Julia statistics packages are implemented as a macro. A macro is defined for the ~ symbol, which means that the expressions are parsed by the Julia compiler. Once parsed by the compiler, they are stored as the rhs and lhs fields of a composite type called Formula.
The details of the implementation, which is relatively simple, can be seen in the DataFrames.jl source code here: https://github.com/JuliaStats/DataFrames.jl/blob/725a22602b8b3f6413e35ebdd707b69c4ed7b659/src/statsmodels/formula.jl
Use an anonymous function as an input.
julia> using DataFrames
julia> f = (x,y) -> x[:A] .* y[:B] # Anonymous function
julia> x = DataFrame(A = 6)
julia> y = DataFrame(B = 7)
julia> function OSL(x::DataFrame,y::DataFrame,f::Function);return f(x,y);end
julia> OSL(x,y,f)
1-element DataArrays.DataArray{Int64,1}:
42
Here's a minimal example using the Boston dataset from ISLR, regressing medv on lstat. (Check pg. 111 of ISLR if you want to verify that the weight vector is correct.)
julia> using DataFrames, RDatasets
julia> df = dataset("MASS", "Boston")
julia> fm = @formula(MedV ~ LStat)
julia> mf = ModelFrame(fm, df)
julia> X = ModelMatrix(mf).m
julia> y = Array(df[:MedV])
julia> w = X \ y
2-element Array{Float64,1}:
34.5538
-0.950049
For more information: http://dataframesjl.readthedocs.io/en/latest/formulas.html