Groupby with sum on Julia DataFrame

I am trying to do a groupby + sum on a Julia DataFrame that contains both Int and String columns.
For instance, given df:
│ Row │ A │ B │ C │ D │
│ │ String │ String │ Int64 │ String │
├─────┼────────┼────────┼───────┼────────┤
│ 1 │ x1 │ a │ 12 │ green │
│ 2 │ x2 │ a │ 7 │ blue │
│ 3 │ x1 │ b │ 5 │ red │
│ 4 │ x2 │ a │ 4 │ blue │
│ 5 │ x1 │ b │ 9 │ yellow │
In Python (pandas), the command could be:
df_group = df.groupby(['A', 'B']).sum().reset_index()
This gives the following result, with the original column labels:
A B C
0 x1 a 12
1 x1 b 14
2 x2 a 11
I would like to do the same thing in Julia. I tried the following, without success:
df_group = aggregate(df, ["A", "B"], sum)
MethodError: no method matching +(::String, ::String)
Do you have any idea how to do this in Julia?

Try the following (rather than selecting the non-string columns, you probably want to select the columns that are numeric):
numcols = names(df, findall(x -> eltype(x) <: Number, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum .=> numcols)
and if you want to allow missing values (and skip them when doing a summation) then:
numcols = names(df, findall(x -> eltype(x) <: Union{Missing,Number}, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum∘skipmissing .=> numcols)
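To make the idea behind both snippets concrete outside of DataFrames.jl, here is a minimal sketch in plain Python (standard library only). The helper name group_sum and the list-of-dicts layout are assumptions made for illustration, not part of any library. It sums only the numeric columns within each group and skips missing values (None), which is roughly what sum∘skipmissing does above.

```python
from collections import defaultdict
from numbers import Number

# Rows from the question, as a list of dicts (an assumed layout for illustration).
rows = [
    {"A": "x1", "B": "a", "C": 12, "D": "green"},
    {"A": "x2", "B": "a", "C": 7, "D": "blue"},
    {"A": "x1", "B": "b", "C": 5, "D": "red"},
    {"A": "x2", "B": "a", "C": 4, "D": "blue"},
    {"A": "x1", "B": "b", "C": 9, "D": "yellow"},
]


def group_sum(rows, keys):
    """Sum every numeric, non-key column within each group; None is skipped."""
    # A column counts as numeric if all of its non-missing values are numbers.
    columns = [c for c in rows[0] if c not in keys]
    numeric = [
        c for c in columns
        if all(isinstance(r[c], Number) for r in rows if r[c] is not None)
    ]
    totals = defaultdict(lambda: dict.fromkeys(numeric, 0))
    for r in rows:
        group = tuple(r[k] for k in keys)
        for c in numeric:
            if r[c] is not None:
                totals[group][c] += r[c]
    # Flatten back to rows, sorted by group key like the pandas output.
    return [dict(zip(keys, g), **sums) for g, sums in sorted(totals.items())]


result = group_sum(rows, ["A", "B"])
for row in result:
    print(row)
```

The string column D is dropped because it is not numeric, so no attempt is made to add strings, which is what triggered the MethodError in the question.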

DataFrames.jl supports split-apply-combine logic, similar to pandas, so the aggregation looks like this:
using DataFrames
df = DataFrame(:A => ["x1", "x2", "x1", "x2", "x1"],
:B => ["a", "a", "b", "a", "b"],
:C => [12, 7, 5, 4, 9],
:D => ["green", "blue", "red", "blue", "yellow"])
gdf = groupby(df, [:A, :B])
combine(gdf, :C => sum)
with the result
julia> combine(gdf, :C => sum)
3×3 DataFrame
│ Row │ A │ B │ C_sum │
│ │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1 │ x1 │ a │ 12 │
│ 2 │ x2 │ a │ 11 │
│ 3 │ x1 │ b │ 14 │
You can skip the creation of gdf with the help of Pipe.jl or Underscores.jl
using Underscores
@_ groupby(df, [:A, :B]) |> combine(__, :C => sum)
You can give a name to the new column with the following syntax
julia> @_ groupby(df, [:A, :B]) |> combine(__, :C => sum => :C)
3×3 DataFrame
│ Row │ A │ B │ C │
│ │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1 │ x1 │ a │ 12 │
│ 2 │ x2 │ a │ 11 │
│ 3 │ x1 │ b │ 14 │

Related

Add column/series to dataframe in polars rust

I had a hard time finding the answer to such a simple question. I was stuck trying to use "append", "extend" and other methods.
Eventually I realized that the with_column method is the way to go in polars.
I am posting my solution here for others who get stuck on the same problem.
use polars::prelude::*;
fn main() {
let a = Series::new("A", vec![1, 2, 3]);
let b = Series::new("B", vec!["a", "b", "c"]);
let mut df = DataFrame::new(vec![a, b]).unwrap();
println!("{:?}", df);
let c = Series::new("C", vec![true, false, false]);
df.with_column(c).unwrap();
println!("{:?}", df);
let d = Series::new("D", vec![1.0, 2.0, 3.0]);
let e = Series::new("E", vec![false, true, true]);
// Also works with lazy and multiple series at once
let df_lazy = df
.lazy()
.with_columns([d.lit(), e.lit()])
.collect()
.unwrap();
println!("{:?}", df_lazy);
}
Output:
shape: (3, 2)
┌─────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ i32 ┆ str │
╞═════╪═════╡
│ 1 ┆ a │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ c │
└─────┴─────┘
shape: (3, 3)
┌─────┬─────┬───────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ bool │
╞═════╪═════╪═══════╡
│ 1 ┆ a ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ b ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ c ┆ false │
└─────┴─────┴───────┘
shape: (3, 5)
┌─────┬─────┬───────┬─────┬───────┐
│ A ┆ B ┆ C ┆ D ┆ E │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ bool ┆ f64 ┆ bool │
╞═════╪═════╪═══════╪═════╪═══════╡
│ 1 ┆ a ┆ true ┆ 1.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ b ┆ false ┆ 2.0 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ c ┆ false ┆ 3.0 ┆ true │
└─────┴─────┴───────┴─────┴───────┘
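For readers who do not use Rust, the same idea can be sketched in plain Python. This is only a toy model and not the polars API: a frame is assumed to be a dict of equal-length lists, and the hypothetical helpers with_column and with_columns add columns while checking the length. A length mismatch is one assumed example of why the Rust calls above return a Result that has to be unwrapped.

```python
def with_column(frame, name, values):
    """Return a new frame with column `name` added or replaced."""
    height = len(next(iter(frame.values()))) if frame else len(values)
    if len(values) != height:
        raise ValueError(f"column {name!r} has {len(values)} rows, expected {height}")
    return {**frame, name: list(values)}


def with_columns(frame, columns):
    """Add several (name, values) pairs in one call."""
    for name, values in columns:
        frame = with_column(frame, name, values)
    return frame


df = {"A": [1, 2, 3], "B": ["a", "b", "c"]}
df = with_column(df, "C", [True, False, False])
df = with_columns(df, [("D", [1.0, 2.0, 3.0]), ("E", [False, True, True])])
print(list(df))  # column names in insertion order
```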

Compare two Pola-rs dataframes by position

Suppose I have two dataframes like:
let df_1 = df! {
"1" => [1, 2, 2, 3, 4, 3],
"2" => [1, 4, 2, 3, 4, 3],
"3" => [1, 2, 6, 3, 4, 3],
}
.unwrap();
let mut df_2 = df_1.clone();
for idx in 0..df_2.width() {
df_2.apply_at_idx(idx, |s| {
s.cummax(false)
.shift(1)
.fill_null(FillNullStrategy::Zero)
.unwrap()
})
.unwrap();
}
println!("{:#?}", df_1);
println!("{:#?}", df_2);
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 │
└─────┴─────┴─────┘
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 0 ┆ 0 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 6 │
└─────┴─────┴─────┘
and I want to compare them such that I end up with a boolean dataframe I can use as a predicate for a selection and aggregation:
shape: (6, 3)
┌───────┬───────┬───────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═══════╪═══════╪═══════╡
│ true ┆ true ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ true ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ false ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ true ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false ┆ false │
└───────┴───────┴───────┘
In Python pandas I might do df_1.where(df_1.ge(df_2)).sum().sum(). What's the idiomatic way to do that with Rust Pola-rs?
<edit>
If you actually have a single dataframe you can do:
let mask =
when(all().gt_eq(
all().cummax(false).shift(1).fill_null(0)))
.then(all())
.otherwise(lit(NULL));
let out =
df_1.lazy().select(&[mask])
//.sum()
.collect();
</edit>
From https://stackoverflow.com/a/72899438:
"Masking out values by columns in another DataFrame is a potential for errors caused by different lengths. For this reason polars does not encourage such operations"
It appears the recommended way is to add a suffix to the columns of one of the dataframes, "concat" them, and use when/then/otherwise.
Since that answer was written, .with_context() has been added, which can be used to access both dataframes.
In Python:
df1.lazy().with_context(
df2.lazy().select(pl.all().suffix("_right"))
).select([
pl.when(pl.col(name) >= pl.col(f"{name}_right"))
.then(pl.col(name))
for name in df1.columns
]).collect()
I've not used Rust, but here is my attempt at a translation:
let mask =
df_1.get_column_names().iter().map(|name|
when(col(name).gt_eq(col(&format!("{name}_right"))))
.then(col(name))
.otherwise(lit(NULL))
).collect::<Vec<Expr>>();
let out =
df_1.lazy()
.with_context(&[
df_2.lazy().select(&[all().suffix("_right")])
])
.select(&mask)
//.sum()
.collect();
println!("{:#?}", out);
Output:
Ok(shape: (6, 3)
┌──────┬──────┬──────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞══════╪══════╪══════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ null ┆ 6 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ 4 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────┴──────┴──────┘)
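As a cross-check of the logic (not of the polars API), here is a minimal plain-Python sketch of the same position-wise masking. The dict-of-lists layout and the helper names shifted_cummax and mask_where_ge are assumptions for illustration. It rebuilds df_2 from df_1, keeps a value where df_1 >= df_2, puts None elsewhere, and then computes the grand total that the pandas .sum().sum() would give.

```python
def shifted_cummax(values, fill=0):
    """Running maximum shifted down by one row; the first row gets `fill`."""
    out = [fill]
    running = None
    for v in values[:-1]:
        running = v if running is None else max(running, v)
        out.append(running)
    return out


def mask_where_ge(left, right):
    """Keep left[name][i] where it is >= right[name][i], else None."""
    masked = {}
    for name, column in left.items():
        other = right[name]
        if len(column) != len(other):
            raise ValueError(f"length mismatch in column {name!r}")
        masked[name] = [a if a >= b else None for a, b in zip(column, other)]
    return masked


df_1 = {
    "1": [1, 2, 2, 3, 4, 3],
    "2": [1, 4, 2, 3, 4, 3],
    "3": [1, 2, 6, 3, 4, 3],
}
df_2 = {name: shifted_cummax(col) for name, col in df_1.items()}

masked = mask_where_ge(df_1, df_2)
# Sum over all columns, skipping the masked-out (None) cells.
total = sum(v for col in masked.values() for v in col if v is not None)
print(masked)
print(total)
```

The explicit length check is there because the quoted warning names different lengths as the main source of errors in this kind of operation.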
It took me the longest time to figure out how to even do elementwise addition in polars. I guess that's just not the "normal" way to use these things, since in principle the columns can have different data types.
You can't call zip and map on the dataframe directly; that doesn't work.
However, df has a method iter() that gives you an iterator over all the columns. The columns are Series, and for those all sorts of elementwise operations are implemented.
Long story short:
let df = df!("A" => &[1, 2, 3], "B" => &[4, 5, 6]).unwrap();
let df2 = df!("A" => &[6, 5, 4], "B" => &[3, 2, 1]).unwrap();
let df3 = DataFrame::new(
df.iter()
.zip(df2.iter())
.map(|(series1, series2)| series1.gt(series2).unwrap())
.collect());
That gives you your boolean array. From here, I assume it should be possible to figure out how to use that for indexing. Probably another use of df.iter().zip(df3) or whatever.
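The remaining step, using the boolean result for selection, can be sketched in plain Python as follows. This is an illustration of the zip-over-columns idea only, with frames assumed to be dicts of lists; it does not show how polars itself would index with the mask.

```python
df = {"A": [1, 2, 3], "B": [4, 5, 6]}
df2 = {"A": [6, 5, 4], "B": [3, 2, 1]}

# Column-by-column, element-by-element comparison (df > df2).
mask = {
    name: [x > y for x, y in zip(df[name], df2[name])]
    for name in df
}

# Use the mask to select: keep a value where the mask is True, else None.
selected = {
    name: [x if keep else None for x, keep in zip(df[name], mask[name])]
    for name in df
}
print(mask)
print(selected)
```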

How do I take the median of several columns in polars-rs?

Say I have a DataFrame consisting of the following four Series:
use polars::prelude::*;
use chrono::prelude::*;
use chrono::Duration;
fn main() {
let series_one = Series::new(
"a",
(0..4).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
);
let series_two = Series::new(
"a",
(4..8).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
);
let series_three = Series::new(
"a",
(8..12).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
);
let series_dates = Series::new(
"date",
(0..4)
.into_iter()
.map(|v| NaiveDate::default() + Duration::days(v))
.collect::<Vec<_>>(),
);
and I join them as such:
let df_one = DataFrame::new(vec![series_one, series_dates.clone()]).unwrap();
let df_two = DataFrame::new(vec![series_two, series_dates.clone()]).unwrap();
let df_three = DataFrame::new(vec![series_three, series_dates.clone()]).unwrap();
let df = df_one
.join(
&df_two,
["date"],
["date"],
JoinType::Outer,
Some("1".into()),
)
.unwrap()
.join(
&df_three,
["date"],
["date"],
JoinType::Outer,
Some("2".into()),
)
.unwrap();
which produces the following DataFrame:
shape: (4, 4)
┌─────┬────────────┬─────┬──────┐
│ a ┆ date ┆ a1 ┆ a2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ date ┆ f64 ┆ f64 │
╞═════╪════════════╪═════╪══════╡
│ 0.0 ┆ 1970-01-01 ┆ 4.0 ┆ 8.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1.0 ┆ 1970-01-02 ┆ 5.0 ┆ 9.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ 1970-01-03 ┆ 6.0 ┆ 10.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0 ┆ 1970-01-04 ┆ 7.0 ┆ 11.0 │
└─────┴────────────┴─────┴──────┘
How can I make a new DataFrame which contains a date column and an a_median column, like so?
┌────────────┬────────────┐
│ a_median ┆ date │
│ --- ┆ --- │
│ f64 ┆ date │
╞════════════╪════════════╡
│ 4.0 ┆ 1970-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5.0 ┆ 1970-01-02 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.0 ┆ 1970-01-03 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7.0 ┆ 1970-01-04 │
└────────────┴────────────┘
I think this is best accomplished via LazyFrames but I'm not sure how to get this exact result.
To get the results you're looking for, you can union the three DataFrames using the vstack method:
let mut unioned = df_one.vstack(&df_two).unwrap();
unioned = unioned.vstack(&df_three).unwrap();
Once you have a single DataFrame with all the records, you can group and aggregate them:
let aggregated = unioned.lazy()
.groupby(["date"])
.agg([
col("a").median().alias("a_median")
])
.sort(
"date",
SortOptions {
descending: false,
nulls_last: true
}
)
.collect()
.unwrap();
Which gives the expected results:
┌────────────┬──────────┐
│ date ┆ a_median │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪══════════╡
│ 1970-01-01 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-02 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-03 ┆ 6.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-04 ┆ 7.0 │
└────────────┴──────────┘
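The stack-then-group approach does not depend on polars. A minimal plain-Python sketch of the same two steps is shown below, using the data from the question; the list-of-(date, value) layout and the variable names are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import median

dates = [date(1970, 1, 1) + timedelta(days=d) for d in range(4)]

# Three "frames" as lists of (date, a) records, matching the question's data.
frame_one = list(zip(dates, [0.0, 1.0, 2.0, 3.0]))
frame_two = list(zip(dates, [4.0, 5.0, 6.0, 7.0]))
frame_three = list(zip(dates, [8.0, 9.0, 10.0, 11.0]))

# "vstack": concatenate the rows of all frames.
unioned = frame_one + frame_two + frame_three

# "groupby + agg": collect values per date, then take the median.
groups = defaultdict(list)
for day, value in unioned:
    groups[day].append(value)

a_median = {day: median(values) for day, values in sorted(groups.items())}
for day, value in a_median.items():
    print(day, value)
```

Stacking first means the median is taken over one long column per date, so no join and no per-row logic across several differently named columns is needed.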

How to produce grouped summary statistics without explicitly naming the variables

Given a Julia dataframe with many variables and a final class column:
julia> df
5×3 DataFrame
Row │ v1 v2 cl
│ Int64? Int64 Int64
─────┼───────────────────────
1 │ 10 1 2
2 │ 20 2 2
3 │ 300 10 1
4 │ 400 20 1
5 │ missing 30 1
I want to obtain a grouped df with summary statistics by class, with the classes by column and variables by rows, like:
julia> dfByCl
11×3 DataFrame
Row │ var cl_1 cl_2
│ String Float64? Float64?
─────┼───────────────────────────────
1 │ nrow 3.0 2.0
2 │ v1_mean 350.0 15.0
3 │ v1_std 70.7107 7.07107
4 │ v1_lb 252.002 5.20018
5 │ v1_ub 447.998 24.7998
6 │ v1_nm 2.0 2.0
7 │ v2_mean 20.0 1.5
8 │ v2_std 10.0 0.707107
9 │ v2_lb 8.68414 0.520018
10 │ v2_ub 31.3159 2.47998
11 │ v2_nm 3.0 2.0
and I don't want to explicitly name all the variables.
Is there something simpler/more elegant than the code below?
using Statistics, DataFrames, Distributions
meansk(data) = mean(skipmissing(data))
stdsk(data) = std(skipmissing(data))
nm(data) = sum(.! ismissing.(data))
ci(data::AbstractVector,α=0.05) = meansk(data) - quantile(Normal(),1-α/2)*stdsk(data)/sqrt(nm(data)), meansk(data) + quantile(Normal(),1-α/2)*stdsk(data)/sqrt(nm(data))
cilb(data) = ci(data)[1]
ciub(data) = ci(data)[2]
df = DataFrame(v1=[10,20,300,400,missing],v2=[1,2,10,20,30],cl=[2,2,1,1,1])
dfByCl_w = combine(groupby(df,["cl"]),
nrow,
names(df) .=> meansk .=> col -> col * "_mean",
names(df) .=> stdsk .=> col -> col * "_std",
names(df) .=> cilb .=> col -> col * "_lb",
names(df) .=> ciub .=> col -> col * "_ub",
names(df) .=> nm .=> col -> col * "_nm",
)
orderedNames = vcat("cl","nrow",[ ["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_w = dfByCl_w[:, orderedNames]
toStack = vcat("nrow",[ ["$(n)_mean", "$(n)_std", "$(n)_lb", "$(n)_ub", "$(n)_nm"] for n in names(df)[1:end-1]]...)
dfByCl_l = stack(dfByCl_w,toStack)
dfByCl = unstack(dfByCl_l,"cl","value")
rename!(dfByCl,vcat("var",["cl_$(c)" for c in unique(dfByCl_w.cl)]))
Here is what I would normally do in such a case:
julia> cilb(data,α=0.05) = mean(data) - quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
cilb (generic function with 2 methods)
julia> ciub(data,α=0.05) = mean(data) + quantile(Normal(),1-α/2)*std(data)/sqrt(count(x -> true, data))
ciub (generic function with 2 methods)
julia> combine(groupby(df, :cl),
nrow,
sdf -> describe(sdf, :mean, :std, cilb => :lb, ciub => :ub, :nmissing, cols=r"v"))
4×8 DataFrame
Row │ cl nrow variable mean std lb ub nmissing
│ Int64 Int64 Symbol Float64 Float64 Float64 Float64 Int64
─────┼─────────────────────────────────────────────────────────────────────────────
1 │ 1 3 v1 350.0 70.7107 252.002 447.998 1
2 │ 1 3 v2 20.0 10.0 8.68414 31.3159 0
3 │ 2 2 v1 15.0 7.07107 5.20018 24.7998 0
4 │ 2 2 v2 1.5 0.707107 0.520018 2.47998 0
Later you can reshape it as you like, but maybe this layout is something you would actually want?
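To make the statistics themselves easy to verify, here is a minimal plain-Python sketch of the per-class summary. The helper names summarize and summarize_by and the dict-of-lists layout are assumptions for illustration. It uses the same normal-approximation confidence interval as the question's ci function and, like the question's nm, reports the number of non-missing values (the describe call above reports nmissing, the number of missing values, instead).

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist, mean, stdev

columns = {
    "v1": [10, 20, 300, 400, None],
    "v2": [1, 2, 10, 20, 30],
    "cl": [2, 2, 1, 1, 1],
}


def summarize(values, alpha=0.05):
    """Mean, sample std, normal-approximation CI bounds and non-missing count."""
    present = [v for v in values if v is not None]
    n = len(present)
    m, s = mean(present), stdev(present)
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * s / sqrt(n)
    return {"mean": m, "std": s, "lb": m - half_width, "ub": m + half_width, "nm": n}


def summarize_by(columns, by):
    """Summaries for every column except `by`, keyed by (class, variable)."""
    variables = [name for name in columns if name != by]
    grouped = defaultdict(lambda: defaultdict(list))
    for i, cls in enumerate(columns[by]):
        for name in variables:
            grouped[cls][name].append(columns[name][i])
    return {
        (cls, name): summarize(values)
        for cls, per_var in sorted(grouped.items())
        for name, values in per_var.items()
    }


stats = summarize_by(columns, "cl")
for (cls, name), row in stats.items():
    print(cls, name, {k: round(v, 4) for k, v in row.items()})
```

No variable is named explicitly: every column other than the grouping column is summarized, which is the property the question asks for.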

How to select columns by data type in Polars?

In pandas we have the pandas.DataFrame.select_dtypes method that selects certain columns depending on the dtype. Is there a similar way to do such a thing in Polars?
One can pass a data type to pl.col:
import polars as pl
df = pl.DataFrame(
{
"id": [1, 2, 3],
"name": ["John", "Jane", "Jake"],
"else": [10.0, 20.0, 30.0],
}
)
print(df.select([pl.col(pl.Utf8), pl.col(pl.Int64)]))
Output:
shape: (3, 2)
┌──────┬─────┐
│ name ┆ id │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════╪═════╡
│ John ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Jane ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Jake ┆ 3 │
└──────┴─────┘
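The selection rule can be illustrated without polars. Below is a minimal plain-Python sketch, assuming a frame stored as a dict of lists; select_dtypes here is a hypothetical helper written for this example and is neither the pandas method nor a polars function. As in the output above, columns come back in the order in which the types were requested.

```python
def select_dtypes(frame, *dtypes):
    """Return the columns whose values all have exactly one of `dtypes`.

    Columns come back grouped in the order the dtypes were requested.
    """
    selected = {}
    for dtype in dtypes:
        for name, values in frame.items():
            # `type(v) is dtype` keeps bool from being treated as int.
            if values and all(type(v) is dtype for v in values):
                selected[name] = values
    return selected


df = {
    "id": [1, 2, 3],
    "name": ["John", "Jane", "Jake"],
    "else": [10.0, 20.0, 30.0],
}
print(select_dtypes(df, str, int))
```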