What is the right way to transform multiple columns of a DataFrame in place in Julia?

How should I apply a function to every element of some columns in place?
julia> using DataFrames
julia> df = DataFrame(Time = [3, 4, 5], TopTemp = [70, 73, 100], BottomTemp = [50, 55, 80])
3×3 DataFrame
Row │ Time TopTemp BottomTemp
│ Int64 Int64 Int64
─────┼────────────────────────────
1 │ 3 70 50
2 │ 4 73 55
3 │ 5 100 80
julia> fahrenheit_to_celsius(x) = Int(round((x - 32) * 5 / 9))
fahrenheit_to_celsius (generic function with 1 method)
This works for one column, but I'm not sure it is the best method.
julia> transform!(df, "TopTemp" => ByRow(fahrenheit_to_celsius), renamecols = false)
3×3 DataFrame
Row │ Time TopTemp BottomTemp
│ Int64 Int64 Int64
─────┼────────────────────────────
1 │ 3 21 50
2 │ 4 23 55
3 │ 5 38 80
The same method does not work to convert both columns using a regular expression with broadcasting.
julia> transform!(df, r"Temp" .=> ByRow.(fahrenheit_to_celsius), renamecols = false)
ERROR: LoadError: MethodError: no method matching fahrenheit_to_celsius(::Int64, ::Int64)

Use:
transform!(df, names(df, r"Temp") .=> ByRow(fahrenheit_to_celsius), renamecols = false)
or
df[!, r"Temp"] .= fahrenheit_to_celsius.(df[!, r"Temp"])
Note that neither of these is in place: in both cases above the columns are replaced with newly allocated vectors, which is probably what you want in general.
An in-place operation would be:
df[:, r"Temp"] .= fahrenheit_to_celsius.(df[!, r"Temp"])
but it would fail if fahrenheit_to_celsius changed the eltype of the columns (for example, if it returned non-integer values that cannot be stored in an Int64 column).
The fastest should be:
foreach(col -> col .= fahrenheit_to_celsius.(col), eachcol(df[!, r"Temp"]))
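To make the difference between replacing columns and updating them in place concrete, here is a minimal sketch using the data from the question. The helper make_df and the variable names are only for illustration; the check relies on === testing whether two vectors are the same object.

```julia
using DataFrames

fahrenheit_to_celsius(x) = round(Int, (x - 32) * 5 / 9)

# Hypothetical helper that rebuilds the example data from the question.
make_df() = DataFrame(Time = [3, 4, 5], TopTemp = [70, 73, 100], BottomTemp = [50, 55, 80])

# `!` row selector: the matching columns are replaced by new vectors.
df_replace = make_df()
top_before = df_replace.TopTemp               # the vector currently stored in the frame
df_replace[!, r"Temp"] .= fahrenheit_to_celsius.(df_replace[!, r"Temp"])
replaced = df_replace.TopTemp !== top_before  # a different vector is stored now

# `:` row selector: the existing vectors are overwritten element by element.
df_inplace = make_df()
top_inplace_before = df_inplace.TopTemp
df_inplace[:, r"Temp"] .= fahrenheit_to_celsius.(df_inplace[!, r"Temp"])
kept = df_inplace.TopTemp === top_inplace_before  # same vector, new contents
```

In the first case top_before still holds the Fahrenheit values, because the data frame simply stopped pointing at it. In the second case any other reference to the column sees the converted values.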

Related

Multiple conditionals in Julia DataFrame

I have a DataFrame with 3 columns, named :x, :y and :z, all of type Float64. :x and :y are iid uniform on (0,1) and :z is the sum of :x and :y.
I want to do a simple task: if x and y are both greater than 0.5, I want to print z and replace its value with 1.0.
For some reason the following code runs but does not do what I want:
if df.x .> 0.5 && df.y .> 0.5
println(df.z)
replace!(df, :z) .= 1.0
end
Would appreciate any help on this
The following ifelse call is 60x faster than a loop on a DataFrame with 500k rows.
using DataFrames
x = rand(500_000)
y = rand(500_000)
z = x + y
df = DataFrame(x = x, y = y, z = z)
df.z .= ifelse.((df.x .> 0.5) .&& (df.y .> 0.5), 1.0, df.z)
Your code is working on whole columns, and you want the code to work on rows. The simplest way to do it is (there are faster ways to do it, but the one I show you is simplest):
julia> using DataFrames
julia> df = DataFrame(rand(10, 2), [:x, :y]);
julia> df.z = df.x + df.y;
julia> df
10×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼────────────────────────────────
1 │ 0.00461518 0.767149 0.771764
2 │ 0.670752 0.891172 1.56192
3 │ 0.531777 0.78527 1.31705
4 │ 0.0666402 0.265558 0.332198
5 │ 0.700547 0.25959 0.960137
6 │ 0.764978 0.84093 1.60591
7 │ 0.720063 0.795599 1.51566
8 │ 0.524065 0.260897 0.784962
9 │ 0.577509 0.62598 1.20349
10 │ 0.363896 0.266637 0.630533
julia> for row in eachrow(df)
           if row.x > 0.5 && row.y > 0.5
               println(row.z)
               row.z = 1.0
           end
       end
1.5619237447442418
1.3170464579861205
1.6059082278386194
1.515661749106264
1.2034891678047939
julia> df
10×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼────────────────────────────────
1 │ 0.00461518 0.767149 0.771764
2 │ 0.670752 0.891172 1.0
3 │ 0.531777 0.78527 1.0
4 │ 0.0666402 0.265558 0.332198
5 │ 0.700547 0.25959 0.960137
6 │ 0.764978 0.84093 1.0
7 │ 0.720063 0.795599 1.0
8 │ 0.524065 0.260897 0.784962
9 │ 0.577509 0.62598 1.0
10 │ 0.363896 0.266637 0.630533
Edit
Assuming you do not need to print here is a benchmark of several options:
julia> df = DataFrame(rand(10^7, 2), [:x, :y]);
julia> df.z = df.x + df.y;
julia> @time for row in eachrow(df) # slowest
           if row.x > 0.5 && row.y > 0.5
               row.z = 1.0
           end
       end
3.469350 seconds (90.00 M allocations: 2.533 GiB, 10.07% gc time)
julia> @time df.z[df.x .> 0.5 .&& df.y .> 0.5] .= 1.0; # fast and simple
0.026041 seconds (15 allocations: 20.270 MiB)
julia> function update_condition!(x, y, z)
           @inbounds for i in eachindex(x, y, z)
               if x[i] > 0.5 && y[i] > 0.5
                   z[i] = 1.0
               end
           end
           return nothing
       end
update_condition! (generic function with 1 method)
julia> update_condition!(df.x, df.y, df.z); # compilation
julia> @time update_condition!(df.x, df.y, df.z); # faster but more complex
0.011243 seconds (3 allocations: 96 bytes)
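For readers who want to check the mask-based update on data they can verify by hand, here is a small sketch. The four rows are made up for illustration, and .& with explicit parentheses is used instead of .&& on the assumption that the code may also need to run on Julia versions older than 1.7.

```julia
using DataFrames

# Hypothetical toy data: rows 2 and 4 satisfy both conditions.
df = DataFrame(x = [0.1, 0.6, 0.9, 0.7], y = [0.8, 0.7, 0.2, 0.95])
df.z = df.x + df.y

# Element-wise comparisons give one Bool per row; .& combines them row by row.
mask = (df.x .> 0.5) .& (df.y .> 0.5)

# Keep the values that are about to be overwritten (this is what the loop printed).
old_z = df.z[mask]

# Indexing on the left-hand side of .= writes into the stored column.
df.z[mask] .= 1.0
```

The mask plays the role of the if condition inside the row loop, evaluated for all rows at once.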

How to produce grouped summary statistics without explicitly naming the variables

Given a Julia dataframe with many variables and a final class column:
julia> df
5×3 DataFrame
Row │ v1 v2 cl
│ Int64? Int64 Int64
─────┼───────────────────────
1 │ 10 1 2
2 │ 20 2 2
3 │ 300 10 1
4 │ 400 20 1
5 │ missing 30 1
I want to obtain a grouped df with summary statistics by class, with the classes by column and variables by rows, like:
julia> dfByCl
11×3 DataFrame
Row │ var cl_1 cl_2
│ String Float64? Float64?
─────┼───────────────────────────────
1 │ nrow 3.0 2.0
2 │ v1_mean 350.0 15.0
3 │ v1_std 70.7107 7.07107
4 │ v1_lb 252.002 5.20018
5 │ v1_ub 447.998 24.7998
6 │ v1_nm 2.0 2.0
7 │ v2_mean 20.0 1.5
8 │ v2_std 10.0 0.707107
9 │ v2_lb 8.68414 0.520018
10 │ v2_ub 31.3159 2.47998
11 │ v2_nm 3.0 2.0
and I don't want to explicitly name all the variables.
Is there something simpler or more elegant than the code below?
using Statistics, DataFrames, Distributions
meansk(data) = mean(skipmissing(data))
stdsk(data) = std(skipmissing(data))
nm(data) = sum(.! ismissing.(data))
ci(data::AbstractVector,α=0.05) = meansk(data) - quantile(Normal(),1-α/2)*stdsk(data)/sqrt(nm(data)), meansk(data) + quantile(Normal(),1-α/2)*stdsk(data)/sqrt(nm(data))
cilb(data) = ci(data)[1]
ciub(data) = ci(data)[2]
df = DataFrame(v1=[10,20,300,400,missing],v2=[1,2,10,20,30],cl=[2,2,1,1,1])
dfByCl_w = combine(groupby(df,["cl"]),
nrow,
names(df) .=> meansk .=> col -> col * "_mean",
names(df) .=> stdsk .=> col -> col * "_std",
names(df) .=> cilb .=> col -> col * "_lb",
names(df) .=> ciub .=> col -> col * "_ub",
names(df) .=> nm .=> col -> col * "_nm",
)
orderedNames = vcat("cl","nrow",[ ["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_w = dfByCl_w[:, orderedNames]
toStack = vcat("nrow",[ ["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_l = stack(dfByCl_w,toStack)
dfByCl = unstack(dfByCl_l,"cl","value")
rename!(dfByCl,vcat("var",["cl_$(c)" for c in unique(dfByCl_w.cl)]))
Here is what I would normally do in such a case:
julia> cilb(data,α=0.05) = mean(data) - quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
cilb (generic function with 2 methods)
julia> ciub(data,α=0.05) = mean(data) + quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
ciub (generic function with 2 methods)
julia> combine(groupby(df, :cl),
               nrow,
               sdf -> describe(sdf, :mean, :std, cilb => :lb, ciub => :ub, :nmissing, cols=r"v"))
4×8 DataFrame
Row │ cl nrow variable mean std lb ub nmissing
│ Int64 Int64 Symbol Float64 Float64 Float64 Float64 Int64
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ 1 3 v1 350.0 70.7107 252.002 447.998 1
2 │ 1 3 v2 20.0 10.0 8.68414 31.3159 0
3 │ 2 2 v1 15.0 7.07107 5.20018 24.7998 0
4 │ 2 2 v2 1.5 0.707107 0.520018 2.47998 0
Later you can reshape it as you like, but maybe this layout is something you would actually want?
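As a sanity check on the lb and ub columns, the interval for v1 in class 1 can be recomputed by hand. This is only a sketch: the function normal_ci and the constant Z_975 are names introduced here, and the constant hard-codes the value of quantile(Normal(), 0.975) so that the snippet does not need Distributions.

```julia
using Statistics

# Assumed constant: the 97.5% quantile of the standard normal distribution.
const Z_975 = 1.959963984540054

# Normal-approximation confidence interval for the mean, ignoring missing values.
function normal_ci(data; z = Z_975)
    vals = collect(skipmissing(data))
    m = mean(vals)
    half_width = z * std(vals) / sqrt(length(vals))
    return (lb = m - half_width, ub = m + half_width)
end

# v1 values of class 1 from the example data.
v1_class1 = [300, 400, missing]
ci = normal_ci(v1_class1)
```

The bounds agree with the 252.002 and 447.998 shown for v1 in class 1, with the missing value excluded from the count.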

Renaming multiple column in Julia

I am trying to rename the dataframe columns using the below code-
function _process_col(df)
    for col in names(df)
        print(col)
        rename!(df, :col => _clean_col_name(col))
    end
    return df
end
But it throws an error saying that col is not present in the data frame: rename!(df, :col => _clean_col_name(col)) treats col as a literal column name, not as a variable.
Note: _clean_col_name(col) is a custom function that processes the column name.
Is there an alternative way to do this?
If you want to apply _clean_col_name to all columns then use the following form:
julia> using DataFrames
julia> df = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────
1 │ 0.0856504 0.677317 0.8402 0.630016 0.815347
2 │ 0.584487 0.997837 0.252574 0.659241 0.0699587
3 │ 0.196169 0.488646 0.689678 0.554855 0.321897
julia> _clean_col_name(x) = uppercase(x)
_clean_col_name (generic function with 1 method)
julia> rename!(_clean_col_name, df)
3×5 DataFrame
Row │ X1 X2 X3 X4 X5
│ Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────
1 │ 0.0856504 0.677317 0.8402 0.630016 0.815347
2 │ 0.584487 0.997837 0.252574 0.659241 0.0699587
3 │ 0.196169 0.488646 0.689678 0.554855 0.321897
If you want to stick with your function, just remove the : in front of col, as @BatWannaBe suggested:
julia> function _process_col(df)
           for col in names(df)
               print(col)
               rename!(df, col => _clean_col_name(col))
           end
           return df
       end
_process_col (generic function with 1 method)
julia> df = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────
1 │ 0.445679 0.0197894 0.605917 0.668544 0.979025
2 │ 0.631891 0.185474 0.136334 0.218718 0.365156
3 │ 0.115752 0.308683 0.273192 0.638987 0.195281
julia> _process_col(df)
x1x2x3x4x53×5 DataFrame
Row │ X1 X2 X3 X4 X5
│ Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────
1 │ 0.445679 0.0197894 0.605917 0.668544 0.979025
2 │ 0.631891 0.185474 0.136334 0.218718 0.365156
3 │ 0.115752 0.308683 0.273192 0.638987 0.195281
Please check the docstring of rename! to see other available options (as there are several more), just to give one example:
julia> df = DataFrame(rand(3, 5), :auto)
3×5 DataFrame
Row │ x1 x2 x3 x4 x5
│ Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────
1 │ 0.242173 0.0401673 0.674665 0.27598 0.338189
2 │ 0.0497058 0.958139 0.707002 0.258894 0.623699
3 │ 0.477812 0.5068 0.584878 0.198547 0.713736
julia> rename!(df, (names(df) .=> _clean_col_name.(names(df)))...)
3×5 DataFrame
Row │ X1 X2 X3 X4 X5
│ Float64 Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────
1 │ 0.242173 0.0401673 0.674665 0.27598 0.338189
2 │ 0.0497058 0.958139 0.707002 0.258894 0.623699
3 │ 0.477812 0.5068 0.584878 0.198547 0.713736
function _process_col(df)
    array = []
    for col in names(df)
        push!(array, _clean_col_name(col))
    end
    rename!(df, Symbol.(array))
    return df
end
Here array is the list of new names for your columns; passing it to rename! works in your case.
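The root of the original error is the difference between the literal Symbol :col and the value held by the variable col. A short sketch, where clean_name is a hypothetical stand-in for _clean_col_name and the column names are made up:

```julia
using DataFrames

# Hypothetical cleaning function: drop underscores and uppercase the name.
clean_name(name) = uppercase(replace(name, "_" => ""))

col = "first_col"
literal = :col               # the Symbol named "col", whatever the variable holds
from_variable = Symbol(col)  # the Symbol :first_col, built from the variable's value

df = DataFrame(first_col = [1, 2], second_col = [3, 4])

# Using the variable itself renames the column it refers to.
rename!(df, col => clean_name(col))
after_one = names(df)

# The function-first form applies the function to every column name.
rename!(clean_name, df)
after_all = names(df)
```

Since the data frame has no column called col, using :col in the pair cannot work, while col (a String) or Symbol(col) both point at the intended column.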

Julia, efficient way to create a DataFrame by applying a function to tuple elements

I need to use some data stored in a named tuple to create a dataframe with the same column number and names of the tuple element, by applying a function to them. For example:
a = (A = [1, 2], B = 1:6)
f(a) = begin
    df = DataFrame()
    for k in keys(a) df[k] = sample(a[k], 10) end # There could be any other function in place of sample()
    df
end
but if I run @code_warntype I get Union types, which I understand means the compiler cannot predict the type before run time, and this impacts performance:
julia> @code_warntype f(a)
Variables
#self#::Core.Const(f)
a::NamedTuple{(:A, :B), Tuple{Vector{Int64}, UnitRange{Int64}}}
@_3::Union{Nothing, Tuple{Symbol, Int64}}
df::DataFrame
k::Symbol
Body::DataFrame
1 ─ (df = Main.DataFrame())
│ %2 = Main.keys(a)::Core.Const((:A, :B))
│ (@_3 = Base.iterate(%2))
│ %4 = (@_3::Core.Const((:A, 2)) === nothing)::Core.Const(false)
│ %5 = Base.not_int(%4)::Core.Const(true)
└── goto #4 if not %5
2 ┄ %7 = @_3::Tuple{Symbol, Int64}::Tuple{Symbol, Int64}
│ (k = Core.getfield(%7, 1))
│ %9 = Core.getfield(%7, 2)::Int64
│ %10 = Base.getindex(a, k)::Union{UnitRange{Int64}, Vector{Int64}}
│ %11 = Main.sample(%10, 10)::Vector{Int64}
│ Base.setindex!(df, %11, k)
│ (@_3 = Base.iterate(%2, %9))
│ %14 = (@_3 === nothing)::Bool
│ %15 = Base.not_int(%14)::Bool
└── goto #4 if not %15
3 ─ goto #2
4 ┄ return df
The question is: what is the most efficient way to write f(a)?
In my specific case all the columns of the dataframe will have the same type, could this information help the compiler?
You can generate the data first and convert it to a DataFrame only as the last step. There are several ways to do this; one of them is map, which is type stable for tuples (including named tuples):
function g(a)
    map(x -> sample(x, 10), a) |> DataFrame
end
julia> @code_warntype(g(a))
MethodInstance for g(::NamedTuple{(:A, :B), Tuple{Vector{Int64}, UnitRange{Int64}}})
from g(a) in Main at REPL[103]:1
Arguments
#self#::Core.Const(g)
a::NamedTuple{(:A, :B), Tuple{Vector{Int64}, UnitRange{Int64}}}
Locals
#47::var"#47#48"
Body::DataFrame
1 ─ (#47 = %new(Main.:(var"#47#48")))
│ %2 = #47::Core.Const(var"#47#48"())
│ %3 = Main.map(%2, a)::NamedTuple{(:A, :B), Tuple{Vector{Int64}, Vector{Int64}}}
│ %4 = (%3 |> Main.DataFrame)::DataFrame
└── return %4
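The reason this works is that map over a NamedTuple returns another NamedTuple with the same keys, so the column names carry through to the DataFrame constructor. Below is a minimal sketch of that behaviour; draw is a name introduced here, and it uses rand from Base as a stand-in for sample so that no extra package is required.

```julia
using DataFrames

a = (A = [1, 2], B = 1:6)

# rand(collection, n) draws n elements with replacement; it stands in for
# sample() from the question.
draw(x) = rand(x, 10)

cols = map(draw, a)    # still a NamedTuple with keys :A and :B
df = DataFrame(cols)   # one column per key, in the same order
```

Any other function that returns a vector of the right length can replace draw.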

How to create a custom iterator in Julia 1.0?

I have this structure in Julia 1.0:
mutable struct Metadata
    id::Int64
    res_id::Int64
end
So that I can create an array of these, where the id is always incremented by one, but the res_id is only sometimes incremented, like so:
data = [
Metadata(1, 1),
Metadata(2, 1),
Metadata(3, 1),
Metadata(4, 2),
Metadata(5, 2),
Metadata(6, 2),
...]
What I want to do is be able to iterate over this Array, but get blocks based on the res_id (all the data with res_id 1, then 2, etc). The desired behavior would be something like this:
for res in iter_res(data)
    println(res)
end
julia>
[Metadata(1, 1), Metadata(2, 1), Metadata(3, 1)]
[Metadata(4, 2), Metadata(5, 2), Metadata(6, 2)]
How do I do this in Julia 1.0, considering that I also need to normally iterate over the array to get element by element?
In Julia 1+, this should be done by implementing Base.iterate(::YourType) to get the first iteration and Base.iterate(::YourType, state) for the following iterations, based on some state. Both methods should return nothing when iteration is done and a (result, state) tuple otherwise.
Iterating on YourType with
for i in x
    # stuff
end
is then a shorthand for writing
it = iterate(x)
while it !== nothing
    i, state = it
    # stuff
    it = iterate(x, state)
end
See the manual for details.
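Applied to the question, the protocol could be sketched as follows. ResidueBlocks is a hypothetical wrapper type introduced here, not something from Base, and the sketch assumes that elements with the same res_id are stored next to each other, as in the example data. The state is simply the index where the next block starts.

```julia
mutable struct Metadata
    id::Int64
    res_id::Int64
end

# Hypothetical wrapper: iterating it yields blocks of consecutive elements
# that share a res_id. Iterating the plain vector still yields single elements.
struct ResidueBlocks
    data::Vector{Metadata}
end

function Base.iterate(it::ResidueBlocks, start::Int = 1)
    start > length(it.data) && return nothing        # no elements left
    rid = it.data[start].res_id
    stop = start
    while stop < length(it.data) && it.data[stop + 1].res_id == rid
        stop += 1                                     # extend the current block
    end
    return (it.data[start:stop], stop + 1)            # (block, next state)
end

# The number of blocks is not known up front, so collect must grow its result.
Base.IteratorSize(::Type{ResidueBlocks}) = Base.SizeUnknown()
Base.eltype(::Type{ResidueBlocks}) = Vector{Metadata}

data = [Metadata(1, 1), Metadata(2, 1), Metadata(3, 1),
        Metadata(4, 2), Metadata(5, 2), Metadata(6, 2)]

blocks = collect(ResidueBlocks(data))
block_ids = [[m.id for m in block] for block in blocks]
```

With this in place, for res in ResidueBlocks(data) visits one block per res_id, while for m in data keeps working element by element.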
How I eventually handled the problem:
function iter(data::Vector{Metadata}; property::Symbol = :res_id)
    #GET UNIQUE VALUES FOR THIS PROPERTY
    up = Vector{Any}()
    for s in data
        getproperty(s, property) in up ? nothing : push!(up, getproperty(s, property))
    end
    #GROUP ELEMENTS BASED ON THE UNIQUE VALUES FOR THIS PROPERTY
    f = Vector{Vector{Metadata}}()
    idx::Int64 = 1
    cmp::Any = up[idx]
    push!(f, Vector{Metadata}())
    for s in data
        if getproperty(s, property) == cmp
            push!(f[idx], s)
        else
            push!(f, Vector{Metadata}())
            idx += 1
            cmp = up[idx]
            push!(f[idx], s)
        end
    end
    return f
end
This allows me to accommodate "skipped" res_id's (like jumping from 1 to 3, etc) and even group the Metadata objects by other future characteristics other than res_id, such as Strings, or types other than Int64's. Works, although it probably isn't very efficient.
You can then iterate over the Vector{Metadata} this way:
for res in iter(rs)
    println(res)
end
You can iterate over a Generator of filters like this:
julia> mutable struct Metadata
id::Int64
res_id::Int64
end
julia> data = [
Metadata(1, 1),
Metadata(2, 1),
Metadata(3, 1),
Metadata(4, 2),
Metadata(5, 2),
Metadata(6, 2),
];
julia> for res in (filter(x -> x.res_id == i, data) for i in 1:2)
           println(res)
       end
Metadata[Metadata(1, 1), Metadata(2, 1), Metadata(3, 1)]
Metadata[Metadata(4, 2), Metadata(5, 2), Metadata(6, 2)]
From the names of your variables it seems you are collecting the data from some computational process. Normally you would use a DataFrame for that purpose.
using DataFrames
data = DataFrame(id=[1,2,3,4,5,6],res_id=[1,1,1,2,2,2])
for group in groupby(data,:res_id)
    println(group)
end
This yields:
3×2 SubDataFrame{Array{Int64,1}}
│ Row │ id │ res_id │
│ │ Int64 │ Int64 │
├─────┼───────┼────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 1 │
│ 3 │ 3 │ 1 │
3×2 SubDataFrame{Array{Int64,1}}
│ Row │ id │ res_id │
│ │ Int64 │ Int64 │
├─────┼───────┼────────┤
│ 1 │ 4 │ 2 │
│ 2 │ 5 │ 2 │
│ 3 │ 6 │ 2 │
This is also more convenient for further processing of results.
Sounds like you need a groupBy function. Here is an implementation for reference, in Haskell:
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy _ [] = []
groupBy eq (x:xs) = (x:ys) : groupBy eq zs
  where (ys,zs) = span (eq x) xs
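For comparison, a rough Julia counterpart of that Haskell definition might look like the sketch below. The name group_by is introduced here for illustration. Like the Haskell version, it compares each element with the first element of the current group and only merges neighbouring elements.

```julia
# Split xs into runs of neighbouring elements. An element joins the current
# run when eq(first_of_run, element) is true; otherwise it starts a new run.
function group_by(eq, xs::AbstractVector)
    groups = Vector{eltype(xs)}[]
    for x in xs
        if !isempty(groups) && eq(first(last(groups)), x)
            push!(last(groups), x)
        else
            push!(groups, eltype(xs)[x])
        end
    end
    return groups
end

runs = group_by(==, [1, 1, 2, 3, 3, 3, 1])

# Grouping (id, res_id) pairs by their second component.
pairs_by_res = group_by((a, b) -> a[2] == b[2], [(1, 1), (2, 1), (3, 2), (4, 2)])
```

Note that the trailing 1 in the first example forms its own group, because only adjacent elements are merged.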