How to create a custom iterator in Julia 1.0?

I have this structure in Julia 1.0:
mutable struct Metadata
    id::Int64
    res_id::Int64
end
So that I can create an array of these, where the id is always incremented by one, but the res_id is only sometimes incremented, like so:
data = [
    Metadata(1, 1),
    Metadata(2, 1),
    Metadata(3, 1),
    Metadata(4, 2),
    Metadata(5, 2),
    Metadata(6, 2),
    ...]
What I want to do is be able to iterate over this Array, but get blocks based on the res_id (all the data with res_id 1, then 2, etc). The desired behavior would be something like this:
for res in iter_res(data)
    println(res)
end

[Metadata(1, 1), Metadata(2, 1), Metadata(3, 1)]
[Metadata(4, 2), Metadata(5, 2), Metadata(6, 2)]
How do I do this in Julia 1.0, considering that I also need to normally iterate over the array to get element by element?

In Julia 1.0 and later, this should be done by implementing Base.iterate(::YourType) for the first iteration and Base.iterate(::YourType, state) for subsequent iterations based on some state. These methods should return nothing when iteration is done, and a (result, state) tuple otherwise.
Iterating on YourType with
for i in x
    # stuff
end
is then a shorthand for writing
it = iterate(x)
while it !== nothing
    i, state = it
    # stuff
    it = iterate(x, state)
end
See the manual for details.
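As a concrete sketch of that protocol applied to the question — the wrapper type ResBlocks below is an illustrative name, not a library type, and it assumes the data is already sorted by res_id:

```julia
# Illustrative sketch: a wrapper whose iteration yields one block per res_id run.
struct ResBlocks
    data::Vector{Metadata}
end

function Base.iterate(b::ResBlocks, i::Int = 1)
    i > length(b.data) && return nothing
    j = i
    # Extend j to the end of the current run of equal res_id.
    while j < length(b.data) && b.data[j + 1].res_id == b.data[i].res_id
        j += 1
    end
    return (b.data[i:j], j + 1)   # (result, state)
end

Base.eltype(::Type{ResBlocks}) = Vector{Metadata}
Base.IteratorSize(::Type{ResBlocks}) = Base.SizeUnknown()

for res in ResBlocks(data)
    println(res)   # prints each res_id block as a Vector{Metadata}
end
```

The plain array still iterates element by element as before; only the wrapped ResBlocks(data) yields blocks.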

How I eventually handled the problem:
function iter(data::Vector{Metadata}; property::Symbol = :res_id)
    # GET UNIQUE VALUES FOR THIS PROPERTY
    up = Vector{Any}()
    for s in data
        getproperty(s, property) in up || push!(up, getproperty(s, property))
    end
    # GROUP ELEMENTS BASED ON THE UNIQUE VALUES FOR THIS PROPERTY
    f = Vector{Vector{Metadata}}()
    idx::Int64 = 1
    cmp::Any = up[idx]
    push!(f, Vector{Metadata}())
    for s in data
        if getproperty(s, property) == cmp
            push!(f[idx], s)
        else
            push!(f, Vector{Metadata}())
            idx += 1
            cmp = up[idx]
            push!(f[idx], s)
        end
    end
    return f
end
This allows me to accommodate "skipped" res_id's (like jumping from 1 to 3) and even group the Metadata objects by future characteristics other than res_id, such as Strings or types other than Int64. It works, although it probably isn't very efficient.
You can then iterate over the Vector{Metadata} this way:
for res in iter(data)
    println(res)
end

You can iterate over a Generator of filters like this:
julia> mutable struct Metadata
           id::Int64
           res_id::Int64
       end

julia> data = [
           Metadata(1, 1),
           Metadata(2, 1),
           Metadata(3, 1),
           Metadata(4, 2),
           Metadata(5, 2),
           Metadata(6, 2),
       ];

julia> for res in (filter(x -> x.res_id == i, data) for i in 1:2)
           println(res)
       end
Metadata[Metadata(1, 1), Metadata(2, 1), Metadata(3, 1)]
Metadata[Metadata(4, 2), Metadata(5, 2), Metadata(6, 2)]

From the names of your variables it seems you are collecting data from some computational process. For that purpose you would normally use a DataFrame.
using DataFrames
data = DataFrame(id = [1, 2, 3, 4, 5, 6], res_id = [1, 1, 1, 2, 2, 2])
for group in groupby(data, :res_id)
    println(group)
end
This yields:
3×2 SubDataFrame{Array{Int64,1}}
│ Row │ id │ res_id │
│ │ Int64 │ Int64 │
├─────┼───────┼────────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 1 │
│ 3 │ 3 │ 1 │
3×2 SubDataFrame{Array{Int64,1}}
│ Row │ id │ res_id │
│ │ Int64 │ Int64 │
├─────┼───────┼────────┤
│ 1 │ 4 │ 2 │
│ 2 │ 5 │ 2 │
│ 3 │ 6 │ 2 │
This is also more convenient for further processing of results.
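The grouped object also composes with combine, so per-block summaries need no explicit loop; for example (the column choices here are illustrative):

```julia
using DataFrames

data = DataFrame(id = [1, 2, 3, 4, 5, 6], res_id = [1, 1, 1, 2, 2, 2])

# One row per res_id: the number of rows and the largest id in each block.
combine(groupby(data, :res_id), nrow, :id => maximum => :max_id)
```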

Sounds like you need a groupBy function. Here is an implementation for reference, in Haskell:
groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy _  []     = []
groupBy eq (x:xs) = (x:ys) : groupBy eq zs
    where (ys, zs) = span (eq x) xs
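A rough Julia translation of that Haskell groupBy (the name groupby_runs is mine; like the Haskell version, it compares each element against the head of the current run):

```julia
function groupby_runs(eq, xs::AbstractVector)
    groups = Vector{Vector{eltype(xs)}}()
    for x in xs
        if !isempty(groups) && eq(first(groups[end]), x)
            push!(groups[end], x)   # x continues the current run
        else
            push!(groups, [x])      # x starts a new run
        end
    end
    return groups
end

# e.g. groupby_runs((a, b) -> a.res_id == b.res_id, data)
```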

Related

How to merge two DataFrames with different columns / sizes

Looking for a way to combine two DataFrames.
df1:
shape: (2, 2)
┌────────┬──────────────────────┐
│ Fruit ┆ Phosphorus (mg/100g) │
│ --- ┆ --- │
│ str ┆ i32 │
╞════════╪══════════════════════╡
│ Apple ┆ 11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Banana ┆ 22 │
└────────┴──────────────────────┘
df2:
shape: (1, 3)
┌──────┬─────────────────────┬──────────────────────┐
│ Name ┆ Potassium (mg/100g) ┆ Phosphorus (mg/100g) │
│ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 │
╞══════╪═════════════════════╪══════════════════════╡
│ Pear ┆ 115 ┆ 12 │
└──────┴─────────────────────┴──────────────────────┘
Result should be:
shape: (3, 3)
+--------+----------------------+---------------------+
| Fruit | Phosphorus (mg/100g) | Potassium (mg/100g) |
| --- | --- | --- |
| str | i32 | i32 |
+========+======================+=====================+
| Apple | 11 | null |
+--------+----------------------+---------------------+
| Banana | 22 | null |
+--------+----------------------+---------------------+
| Pear | 12 | 115 |
+--------+----------------------+---------------------+
Here is the code snippet I am trying to make work:
use polars::prelude::*;

fn main() {
    let df1: DataFrame = df!("Fruit" => &["Apple", "Banana"],
                             "Phosphorus (mg/100g)" => &[11, 22])
        .unwrap();
    let df2: DataFrame = df!("Name" => &["Pear"],
                             "Potassium (mg/100g)" => &[115],
                             "Phosphorus (mg/100g)" => &[12])
        .unwrap();
    let df3: DataFrame = df1
        .join(&df2, ["Fruit"], ["Name"], JoinType::Left, None)
        .unwrap();
    assert_eq!(df3.shape(), (3, 3));
    println!("{}", df3);
}
It's a FULL OUTER JOIN I am looking for ...
The ERROR I get:
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `(2, 4)`,
 right: `(3, 3)`', src\main.rs:18:5
You need to explicitly specify the columns you are going to merge, and use JoinType::Outer for the outer join:
use polars::prelude::*;

fn main() {
    let df1: DataFrame = df!("Fruit" => &["Apple", "Banana"],
                             "Phosphorus (mg/100g)" => &[11, 22])
        .unwrap();
    let df2: DataFrame = df!("Name" => &["Pear"],
                             "Potassium (mg/100g)" => &[115],
                             "Phosphorus (mg/100g)" => &[12])
        .unwrap();
    let df3: DataFrame = df1
        .join(
            &df2,
            ["Fruit", "Phosphorus (mg/100g)"],
            ["Name", "Phosphorus (mg/100g)"],
            JoinType::Outer,
            None,
        )
        .unwrap();
    assert_eq!(df3.shape(), (3, 3));
    println!("{}", df3);
}
Thanks to @Ayaz :) I was able to make a generic version, one where I do not need to specify the shared column names each time.
Here is my version of the FULL OUTER JOIN of two DataFrames:
use polars::prelude::*;
use array_tool::vec::Intersect;

fn concat_df(df1: &DataFrame, df2: &DataFrame) -> Result<DataFrame, PolarsError> {
    if df1.is_empty() {
        return Ok(df2.clone());
    }
    let df1_column_names = df1.get_column_names();
    let df2_column_names = df2.get_column_names();
    let common_column_names = &df1_column_names.intersect(df2_column_names)[..];
    df1.join(
        df2,
        common_column_names,
        common_column_names,
        JoinType::Outer,
        None,
    )
}

How to produce grouped summary statistics without explicitly naming the variables

Given a Julia dataframe with many variables and a final class column:
julia> df
5×3 DataFrame
Row │ v1 v2 cl
│ Int64? Int64 Int64
─────┼───────────────────────
1 │ 10 1 2
2 │ 20 2 2
3 │ 300 10 1
4 │ 400 20 1
5 │ missing 30 1
I want to obtain a grouped df with summary statistics by class, with the classes by column and variables by rows, like:
julia> dfByCl
11×3 DataFrame
Row │ var cl_1 cl_2
│ String Float64? Float64?
─────┼───────────────────────────────
1 │ nrow 3.0 2.0
2 │ v1_mean 350.0 15.0
3 │ v1_std 70.7107 7.07107
4 │ v1_lb 252.002 5.20018
5 │ v1_ub 447.998 24.7998
6 │ v1_nm 2.0 2.0
7 │ v2_mean 20.0 1.5
8 │ v2_std 10.0 0.707107
9 │ v2_lb 8.68414 0.520018
10 │ v2_ub 31.3159 2.47998
11 │ v2_nm 3.0 2.0
and I don't want to explicitly name all the variables.
Is there something simpler/more elegant than the following code?
using Statistics, DataFrames, Distributions

meansk(data) = mean(skipmissing(data))
stdsk(data) = std(skipmissing(data))
nm(data) = sum(.!ismissing.(data))
ci(data::AbstractVector, α = 0.05) = (
    meansk(data) - quantile(Normal(), 1 - α/2) * stdsk(data) / sqrt(nm(data)),
    meansk(data) + quantile(Normal(), 1 - α/2) * stdsk(data) / sqrt(nm(data)))
cilb(data) = ci(data)[1]
ciub(data) = ci(data)[2]

df = DataFrame(v1 = [10, 20, 300, 400, missing], v2 = [1, 2, 10, 20, 30], cl = [2, 2, 1, 1, 1])

dfByCl_w = combine(groupby(df, ["cl"]),
    nrow,
    names(df) .=> meansk .=> col -> col * "_mean",
    names(df) .=> stdsk .=> col -> col * "_std",
    names(df) .=> cilb .=> col -> col * "_lb",
    names(df) .=> ciub .=> col -> col * "_ub",
    names(df) .=> nm .=> col -> col * "_nm",
)
orderedNames = vcat("cl", "nrow", [["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_w = dfByCl_w[:, orderedNames]
toStack = vcat("nrow", [["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_l = stack(dfByCl_w, toStack)
dfByCl = unstack(dfByCl_l, "cl", "value")
rename!(dfByCl, vcat("var", ["cl_$(c)" for c in unique(dfByCl_w.cl)]))
Here is what I would normally do in such a case:
julia> cilb(data,α=0.05) = mean(data) - quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
cilb (generic function with 2 methods)
julia> ciub(data,α=0.05) = mean(data) + quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
ciub (generic function with 2 methods)
julia> combine(groupby(df, :cl),
           nrow,
           sdf -> describe(sdf, :mean, :std, cilb => :lb, ciub => :ub, :nmissing, cols=r"v"))
4×8 DataFrame
Row │ cl nrow variable mean std lb ub nmissing
│ Int64 Int64 Symbol Float64 Float64 Float64 Float64 Int64
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ 1 3 v1 350.0 70.7107 252.002 447.998 1
2 │ 1 3 v2 20.0 10.0 8.68414 31.3159 0
3 │ 2 2 v1 15.0 7.07107 5.20018 24.7998 0
4 │ 2 2 v2 1.5 0.707107 0.520018 2.47998 0
Later you can reshape it as you like, but maybe this layout is something you would actually want?

Julia, efficient way to create a DataFrame by applying a function to tuple elements

I need to use some data stored in a named tuple to create a dataframe with the same number of columns and the same names as the tuple elements, by applying a function to them. For example:
using DataFrames, StatsBase  # sample comes from StatsBase

a = (A = [1, 2], B = 1:6)
f(a) = begin
    df = DataFrame()
    for k in keys(a)
        df[!, k] = sample(a[k], 10)  # there could be any other function in place of sample()
    end
    df
end
but if I run @code_warntype I get Union types, which I understand means the compiler cannot predict the type before run time, and this impacts performance:
julia> @code_warntype f(a)
Variables
#self#::Core.Const(f)
a::NamedTuple{(:A, :B), Tuple{Vector{Int64}, UnitRange{Int64}}}
#_3::Union{Nothing, Tuple{Symbol, Int64}}
df::DataFrame
k::Symbol
Body::DataFrame
1 ─ (df = Main.DataFrame())
│ %2 = Main.keys(a)::Core.Const((:A, :B))
│ (#_3 = Base.iterate(%2))
│ %4 = (#_3::Core.Const((:A, 2)) === nothing)::Core.Const(false)
│ %5 = Base.not_int(%4)::Core.Const(true)
└── goto #4 if not %5
2 ┄ %7 = #_3::Tuple{Symbol, Int64}::Tuple{Symbol, Int64}
│ (k = Core.getfield(%7, 1))
│ %9 = Core.getfield(%7, 2)::Int64
│ %10 = Base.getindex(a, k)::Union{UnitRange{Int64}, Vector{Int64}}
│ %11 = Main.sample(%10, 10)::Vector{Int64}
│ Base.setindex!(df, %11, k)
│ (#_3 = Base.iterate(%2, %9))
│ %14 = (#_3 === nothing)::Bool
│ %15 = Base.not_int(%14)::Bool
└── goto #4 if not %15
3 ─ goto #2
4 ┄ return df
The question is: what is the most efficient way to write f(a)?
In my specific case all the columns of the dataframe will have the same type, could this information help the compiler?
You can generate the data and only at the last step convert it to a DataFrame. There are multiple ways to do this; one of them is map, which is type-stable for named tuples:
function g(a)
    map(x -> sample(x, 10), a) |> DataFrame
end
julia> @code_warntype g(a)
MethodInstance for g(::NamedTuple{(:A, :B), Tuple{Vector{Int64}, UnitRange{Int64}}})
from g(a) in Main at REPL[103]:1
Arguments
#self#::Core.Const(g)
a::NamedTuple{(:A, :B), Tuple{Vector{Int64}, UnitRange{Int64}}}
Locals
#47::var"#47#48"
Body::DataFrame
1 ─ (#47 = %new(Main.:(var"#47#48")))
│ %2 = #47::Core.Const(var"#47#48"())
│ %3 = Main.map(%2, a)::NamedTuple{(:A, :B), Tuple{Vector{Int64}, Vector{Int64}}}
│ %4 = (%3 |> Main.DataFrame)::DataFrame
└── return %4

What is the right way to transform multiple columns of a DataFrame in place in Julia?

How should I apply a function to every element of some columns in place?
julia> using DataFrames
julia> df = DataFrame(Time = [3, 4, 5], TopTemp = [70, 73, 100], BottomTemp = [50, 55, 80])
3×3 DataFrame
Row │ Time TopTemp BottomTemp
│ Int64 Int64 Int64
─────┼────────────────────────────
1 │ 3 70 50
2 │ 4 73 55
3 │ 5 100 80
julia> fahrenheit_to_celsius(x) = Int(round((x - 32) * 5 / 9))
fahrenheit_to_celsius (generic function with 1 method)
This works for one column, but I'm not sure it is the best method.
julia> transform!(df, "TopTemp" => ByRow(fahrenheit_to_celsius), renamecols = false)
3×3 DataFrame
Row │ Time TopTemp BottomTemp
│ Int64 Int64 Int64
─────┼────────────────────────────
1 │ 3 21 50
2 │ 4 23 55
3 │ 5 38 80
The same method does not work to convert both columns using a regular expression with broadcasting.
julia> transform!(df, r"Temp" .=> ByRow.(fahrenheit_to_celsius), renamecols = false)
ERROR: LoadError: MethodError: no method matching fahrenheit_to_celsius(::Int64, ::Int64)
Use:
transform!(df, names(df, r"Temp") .=> ByRow(fahrenheit_to_celsius), renamecols = false)
or
df[!, r"Temp"] .= fahrenheit_to_celsius.(df[!, r"Temp"])
Also note that neither of these is strictly in-place: the columns are replaced in both cases above, but this is probably what you want in general.
An in-place operation would be:
df[:, r"Temp"] .= fahrenheit_to_celsius.(df[!, r"Temp"])
but it would fail if fahrenheit_to_celsius would change eltype of columns.
The fastest should be:
foreach(col -> col .= fahrenheit_to_celsius.(col), eachcol(df[!, r"Temp"]))

Groupby with sum on Julia Dataframe

I am trying to make a groupby + sum on a Julia Dataframe with Int and String values
For instance, df :
│ Row │ A │ B │ C │ D │
│ │ String │ String │ Int64 │ String │
├─────┼────────┼────────┼───────┼────────┤
│ 1 │ x1 │ a │ 12 │ green │
│ 2 │ x2 │ a │ 7 │ blue │
│ 3 │ x1 │ b │ 5 │ red │
│ 4 │ x2 │ a │ 4 │ blue │
│ 5 │ x1 │ b │ 9 │ yellow │
To do this in Python, the command could be:
df_group = df.groupby(['A', 'B']).sum().reset_index()
I will obtain the following output result with the initial column labels :
A B C
0 x1 a 12
1 x1 b 14
2 x2 a 11
I would like to do the same thing in Julia. I tried this way, unsuccessfully :
df_group = aggregate(df, ["A", "B"], sum)
MethodError: no method matching +(::String, ::String)
Have you any idea of a way to do this in Julia ?
Try the following (rather than excluding string columns, you probably want to select the columns that are numeric):
numcols = names(df, findall(x -> eltype(x) <: Number, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum .=> numcols)
and if you want to allow missing values (and skip them when doing a summation) then:
numcols = names(df, findall(x -> eltype(x) <: Union{Missing,Number}, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum∘skipmissing .=> numcols)
Julia DataFrames support split-apply-combine logic, similar to pandas, so aggregation looks like
using DataFrames
df = DataFrame(:A => ["x1", "x2", "x1", "x2", "x1"],
               :B => ["a", "a", "b", "a", "b"],
               :C => [12, 7, 5, 4, 9],
               :D => ["green", "blue", "red", "blue", "yellow"])
gdf = groupby(df, [:A, :B])
combine(gdf, :C => sum)
with the result
julia> combine(gdf, :C => sum)
3×3 DataFrame
│ Row │ A │ B │ C_sum │
│ │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1 │ x1 │ a │ 12 │
│ 2 │ x2 │ a │ 11 │
│ 3 │ x1 │ b │ 14 │
You can skip the creation of gdf with the help of Pipe.jl or Underscores.jl
using Underscores
@_ groupby(df, [:A, :B]) |> combine(__, :C => sum)
You can give a name to the new column with the following syntax:
julia> @_ groupby(df, [:A, :B]) |> combine(__, :C => sum => :C)
3×3 DataFrame
│ Row │ A │ B │ C │
│ │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1 │ x1 │ a │ 12 │
│ 2 │ x2 │ a │ 11 │
│ 3 │ x1 │ b │ 14 │