How to merge two DataFrames with different columns / sizes - dataframe

Looking for a way to combine two DataFrames.
df1:
shape: (2, 2)
┌────────┬──────────────────────┐
│ Fruit ┆ Phosphorus (mg/100g) │
│ --- ┆ --- │
│ str ┆ i32 │
╞════════╪══════════════════════╡
│ Apple ┆ 11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Banana ┆ 22 │
└────────┴──────────────────────┘
df2:
shape: (1, 3)
┌──────┬─────────────────────┬──────────────────────┐
│ Name ┆ Potassium (mg/100g) ┆ Phosphorus (mg/100g) │
│ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 │
╞══════╪═════════════════════╪══════════════════════╡
│ Pear ┆ 115 ┆ 12 │
└──────┴─────────────────────┴──────────────────────┘
Result should be:
shape: (3, 3)
+--------+----------------------+---------------------+
| Fruit | Phosphorus (mg/100g) | Potassium (mg/100g) |
| --- | --- | --- |
| str | i32 | i32 |
+========+======================+=====================+
| Apple | 11 | null |
+--------+----------------------+---------------------+
| Banana | 22 | null |
+--------+----------------------+---------------------+
| Pear | 12 | 115 |
+--------+----------------------+---------------------+
Here is the code snippet I am trying to make work:
use polars::prelude::*;
fn main() {
let df1: DataFrame = df!("Fruit" => &["Apple", "Banana"],
"Phosphorus (mg/100g)" => &[11, 22])
.unwrap();
let df2: DataFrame = df!("Name" => &["Pear"],
"Potassium (mg/100g)" => &[115],
"Phosphorus (mg/100g)" => &[12]
)
.unwrap();
let df3: DataFrame = df1
.join(&df2, ["Fruit"], ["Name"], JoinType::Left, None)
.unwrap();
assert_eq!(df3.shape(), (3, 3));
println!("{}", df3);
}
It's a FULL OUTER JOIN I am looking for ...
The ERROR I get:
thread 'main' panicked at 'assertion failed: (left == right)
left: (2, 4),
right: (3, 3)', src\main.rs:18:5

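The left join keeps only the two rows of df1, and because "Phosphorus (mg/100g)" is not among the join keys, df2's copy of it is added as an extra, suffixed column. That is where the shape (2, 4) comes from. You need to explicitly specify all the columns you are going to merge on, and use JoinType::Outer for the outer join: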
use polars::prelude::*;
fn main() {
let df1: DataFrame = df!("Fruit" => &["Apple", "Banana"],
"Phosphorus (mg/100g)" => &[11, 22])
.unwrap();
let df2: DataFrame = df!("Name" => &["Pear"],
"Potassium (mg/100g)" => &[115],
"Phosphorus (mg/100g)" => &[12]
)
.unwrap();
let df3: DataFrame = df1
.join(
&df2,
["Fruit", "Phosphorus (mg/100g)"],
["Name", "Phosphorus (mg/100g)"],
JoinType::Outer,
None).unwrap();
assert_eq!(df3.shape(), (3, 3));
println!("{}", df3);
}
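To make the effect of the composite key concrete, here is a minimal plain-Rust sketch (standard library only, no polars) of a full outer join on the pair (name, phosphorus). The function name outer_join, the Row alias, and the use of Option in place of null are assumptions made for illustration. The BTreeMap also happens to sort the rows by key, which is not something the polars join promises.

```rust
use std::collections::BTreeMap;

/// One output row: (fruit, phosphorus, potassium); `None` stands in for null.
type Row = (String, i32, Option<i32>);

/// Full outer join of two toy tables on the composite key (name, phosphorus).
/// `left` has no potassium column, so its rows keep `None` unless `right`
/// contains a row with the same key.
fn outer_join(left: &[(&str, i32)], right: &[(&str, i32, i32)]) -> Vec<Row> {
    let mut merged: BTreeMap<(String, i32), Option<i32>> = BTreeMap::new();
    // Every left row appears in the result, initially without potassium.
    for (name, phosphorus) in left {
        merged.entry((name.to_string(), *phosphorus)).or_insert(None);
    }
    // Every right row appears too; a matching key is filled in, a new key is added.
    for (name, potassium, phosphorus) in right {
        merged.insert((name.to_string(), *phosphorus), Some(*potassium));
    }
    merged
        .into_iter()
        .map(|((name, phosphorus), potassium)| (name, phosphorus, potassium))
        .collect()
}

fn main() {
    // Same data as df1 and df2; right rows are (name, potassium, phosphorus).
    let rows = outer_join(&[("Apple", 11), ("Banana", 22)], &[("Pear", 115, 12)]);
    for row in &rows {
        println!("{:?}", row);
    }
}
```

With the data from the question this yields three rows of three fields each, matching the expected (3, 3) shape: no key is shared between the two tables, so each input row survives as its own output row.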

Thanks to @Ayaz :) I was able to make a generic version, one where I do not need to specify the shared column names each time.
Here is my version of the FULL OUTER JOIN of two DataFrames:
use polars::prelude::*;
use array_tool::vec::{Intersect};
fn concat_df(df1: &DataFrame, df2: &DataFrame) -> Result<DataFrame, PolarsError> {
if df1.is_empty() {
return Ok(df2.clone());
}
let df1_column_names = df1.get_column_names();
let df2_column_names = df2.get_column_names();
let common_column_names = &df1_column_names.intersect(df2_column_names)[..];
df1.join(
df2,
common_column_names,
common_column_names,
JoinType::Outer,
None,
)
}
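If you would rather not pull in array_tool just for the intersection, the same step can be sketched with the standard library. This is only an illustration of the "find the shared column names" part, on plain string slices rather than on a DataFrame; the helper name common_columns is made up here.

```rust
use std::collections::HashSet;

/// Returns the names that occur in both slices, in the order they have in `left`.
fn common_columns<'a>(left: &[&'a str], right: &[&str]) -> Vec<&'a str> {
    let right_set: HashSet<&str> = right.iter().copied().collect();
    left.iter()
        .copied()
        .filter(|name| right_set.contains(name))
        .collect()
}

fn main() {
    let df1_columns = ["Fruit", "Phosphorus (mg/100g)"];
    let df2_columns = ["Name", "Potassium (mg/100g)", "Phosphorus (mg/100g)"];
    let common = common_columns(&df1_columns, &df2_columns);
    println!("{:?}", common);
}
```

Note that for the two frames in the question the only shared name is "Phosphorus (mg/100g)": "Fruit" and "Name" differ, so a join built purely from the intersection would not treat them as the same key. The generic version fits best when the key columns already carry the same name in both frames.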

Related

Add column/series to dataframe in polars rust

I had a hard time finding the answer to such a simple question. I was stuck trying to use "append", "extend" and other methods.
Finally I realized that the with_column method is the way to go in polars.
I figured I should post my solution here for others who get stuck on the same problem.
use polars::prelude::*;
fn main() {
let a = Series::new("A", vec![1, 2, 3]);
let b = Series::new("B", vec!["a", "b", "c"]);
let mut df = DataFrame::new(vec![a, b]).unwrap();
println!("{:?}", df);
let c = Series::new("C", vec![true, false, false]);
df.with_column(c).unwrap();
println!("{:?}", df);
let d = Series::new("D", vec![1.0, 2.0, 3.0]);
let e = Series::new("E", vec![false, true, true]);
// Also works with lazy and multiple series at once
let df_lazy = df
.lazy()
.with_columns([d.lit(), e.lit()])
.collect()
.unwrap();
println!("{:?}", df_lazy);
}
Output:
shape: (3, 2)
┌─────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ i32 ┆ str │
╞═════╪═════╡
│ 1 ┆ a │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ c │
└─────┴─────┘
shape: (3, 3)
┌─────┬─────┬───────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ bool │
╞═════╪═════╪═══════╡
│ 1 ┆ a ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ b ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ c ┆ false │
└─────┴─────┴───────┘
shape: (3, 5)
┌─────┬─────┬───────┬─────┬───────┐
│ A ┆ B ┆ C ┆ D ┆ E │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ bool ┆ f64 ┆ bool │
╞═════╪═════╪═══════╪═════╪═══════╡
│ 1 ┆ a ┆ true ┆ 1.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ b ┆ false ┆ 2.0 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ c ┆ false ┆ 3.0 ┆ true │
└─────┴─────┴───────┴─────┴───────┘

Compare two Pola-rs dataframes by position

Suppose I have two dataframes like:
let df_1 = df! {
"1" => [1, 2, 2, 3, 4, 3],
"2" => [1, 4, 2, 3, 4, 3],
"3" => [1, 2, 6, 3, 4, 3],
}
.unwrap();
let mut df_2 = df_1.clone();
for idx in 0..df_2.width() {
df_2.apply_at_idx(idx, |s| {
s.cummax(false)
.shift(1)
.fill_null(FillNullStrategy::Zero)
.unwrap()
})
.unwrap();
}
println!("{:#?}", df_1);
println!("{:#?}", df_2);
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 │
└─────┴─────┴─────┘
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 0 ┆ 0 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 6 │
└─────┴─────┴─────┘
and I want to compare them such that I end up with a boolean dataframe I can use as a predicate for a selection and aggregation:
shape: (6, 3)
┌───────┬───────┬───────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═══════╪═══════╪═══════╡
│ true ┆ true ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ true ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ false ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ true ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false ┆ false │
└───────┴───────┴───────┘
In Python Pandas I might do df.where(df_1.ge(df_2)).sum().sum(). What's the idiomatic way to do that with Rust Pola-rs?
<edit>
If you actually have a single dataframe you can do:
let mask =
when(all().gt_eq(
all().cummax(false).shift(1).fill_null(0)))
.then(all())
.otherwise(lit(NULL));
let out =
df_1.lazy().select(&[mask])
//.sum()
.collect();
</edit>
From https://stackoverflow.com/a/72899438
Masking out values by columns in another DataFrame is a potential for errors
caused by different lengths. For this reason polars does not encourage such
operations
It appears the recommended way is to add a suffix to one of the dataframes, "concat" them and use when/then/otherwise.
.with_context() has been added since that answer which can be used to access both dataframes.
In Python:
df1.lazy().with_context(
df2.lazy().select(pl.all().suffix("_right"))
).select([
pl.when(pl.col(name) >= pl.col(f"{name}_right"))
.then(pl.col(name))
for name in df1.columns
]).collect()
I have not used Rust, but here is my attempt at a translation:
let mask =
df_1.get_column_names().iter().map(|name|
when(col(name).gt_eq(col(&format!("{name}_right"))))
.then(col(name))
.otherwise(lit(NULL))
).collect::<Vec<Expr>>();
let out =
df_1.lazy()
.with_context(&[
df_2.lazy().select(&[all().suffix("_right")])
])
.select(&mask)
//.sum()
.collect();
println!("{:#?}", out);
Output:
Ok(shape: (6, 3)
┌──────┬──────┬──────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞══════╪══════╪══════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ null ┆ 6 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ 4 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────┴──────┴──────┘)
It took me the longest time to figure out how to even do elementwise addition in polars. I guess that is just not the "normal" way to use these things, since in principle the columns can have different data types.
You can't call zip and map on the DataFrame directly; that doesn't work.
However, df has a method iter() that gives you an iterator over all the columns. The columns are Series, and for those you have all sorts of elementwise operations implemented.
Long story short:
let df = df!("A" => &[1, 2, 3], "B" => &[4, 5, 6]).unwrap();
let df2 = df!("A" => &[6, 5, 4], "B" => &[3, 2, 1]).unwrap();
let df3 = DataFrame::new(
df.iter()
.zip(df2.iter())
.map(|(series1, series2)| series1.gt(series2).unwrap())
.collect());
That gives you your boolean array. From here, I assume it should be possible to figure out how to use that for indexing. Probably another use of df.iter().zip(df3) or whatever.
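For reference, here is what the whole computation amounts to on plain vectors, without polars. It is only a sketch of the logic (running max, shift by one, fill with 0, keep values that are >= the reference, then sum), and the helper names shifted_cummax and mask_ge are invented for this example. Each column is a Vec<i32>, and None plays the role of null.

```rust
/// Running maximum shifted down by one position; the first slot is filled with 0.
fn shifted_cummax(column: &[i32]) -> Vec<i32> {
    let mut out = Vec::with_capacity(column.len());
    let mut running_max: Option<i32> = None;
    for &value in column {
        // Emit the maximum of everything *before* this row (0 for the first row).
        out.push(running_max.unwrap_or(0));
        running_max = Some(running_max.map_or(value, |m| m.max(value)));
    }
    out
}

/// Keeps `a[i]` where `a[i] >= b[i]`, otherwise `None`.
fn mask_ge(a: &[i32], b: &[i32]) -> Vec<Option<i32>> {
    assert_eq!(a.len(), b.len(), "columns must have equal length");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| if x >= y { Some(x) } else { None })
        .collect()
}

fn main() {
    // The three columns of df_1 from the question.
    let df_1 = [
        vec![1, 2, 2, 3, 4, 3],
        vec![1, 4, 2, 3, 4, 3],
        vec![1, 2, 6, 3, 4, 3],
    ];
    let mut total = 0;
    for column in &df_1 {
        let reference = shifted_cummax(column); // corresponds to a column of df_2
        let masked = mask_ge(column, &reference);
        total += masked.iter().flatten().sum::<i32>(); // skips the None entries
        println!("{:?}", masked);
    }
    println!("total = {}", total);
}
```

The masked columns agree with the null pattern in the with_context output above, and the grand total over all three columns comes out as 30.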

How do I take the median of several columns in polars-rs?

Say I have a DataFrame consisting of the following four Series:
use polars::prelude::*;
use chrono::prelude::*;
use chrono::Duration;
fn main() {
let series_one = Series::new(
"a",
(0..4).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
);
let series_two = Series::new(
"a",
(4..8).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
);
let series_three = Series::new(
"a",
(8..12).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
);
let series_dates = Series::new(
"date",
(0..4)
.into_iter()
.map(|v| NaiveDate::default() + Duration::days(v))
.collect::<Vec<_>>(),
);
and I join them as such:
let df_one = DataFrame::new(vec![series_one, series_dates.clone()]).unwrap();
let df_two = DataFrame::new(vec![series_two, series_dates.clone()]).unwrap();
let df_three = DataFrame::new(vec![series_three, series_dates.clone()]).unwrap();
let df = df_one
.join(
&df_two,
["date"],
["date"],
JoinType::Outer,
Some("1".into()),
)
.unwrap()
.join(
&df_three,
["date"],
["date"],
JoinType::Outer,
Some("2".into()),
)
.unwrap();
which produces the following DataFrame:
shape: (4, 4)
┌─────┬────────────┬─────┬──────┐
│ a ┆ date ┆ a1 ┆ a2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ date ┆ f64 ┆ f64 │
╞═════╪════════════╪═════╪══════╡
│ 0.0 ┆ 1970-01-01 ┆ 4.0 ┆ 8.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1.0 ┆ 1970-01-02 ┆ 5.0 ┆ 9.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ 1970-01-03 ┆ 6.0 ┆ 10.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0 ┆ 1970-01-04 ┆ 7.0 ┆ 11.0 │
└─────┴────────────┴─────┴──────┘
How can I make a new DataFrame which contains a date column and an a_median column, like so:
┌────────────┬────────────┐
│ a_median ┆ date ┆
│ --- ┆ --- ┆
│ f64 ┆ date ┆
╞════════════╪════════════╡
│ 4.0 ┆ 1970-01-01 ┆
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5.0 ┆ 1970-01-02 ┆
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.0 ┆ 1970-01-03 ┆
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7.0 ┆ 1970-01-04 ┆
└────────────┴────────────┘
I think this is best accomplished via LazyFrames but I'm not sure how to get this exact result.
To get the results you're looking for, you can union the three DataFrames using the vstack method:
let mut unioned = df_one.vstack(&df_two).unwrap();
unioned = unioned.vstack(&df_three).unwrap();
Once you have a single DataFrame with all the records, you can group and aggregate them:
let aggregated = unioned.lazy()
.groupby(["date"])
.agg([
col("a").median().alias("a_median")
])
.sort(
"date",
SortOptions {
descending: false,
nulls_last: true
}
)
.collect()
.unwrap();
Which gives the expected results:
┌────────────┬──────────┐
│ date ┆ a_median │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪══════════╡
│ 1970-01-01 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-02 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-03 ┆ 6.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-04 ┆ 7.0 │
└────────────┴──────────┘
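If it helps to see what the stack-then-group step computes, here is a rough standard-library sketch of the same idea: all (date, value) rows go into one list, are grouped by date, and each group is reduced to its median. Dates are plain strings here, and the function names median and median_by_date are made up for the example; this is not how polars implements it.

```rust
use std::collections::BTreeMap;

/// Median of a slice; sorts it in place. Returns `None` for an empty slice.
fn median(values: &mut [f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_by(|a, b| a.total_cmp(b));
    let mid = values.len() / 2;
    if values.len() % 2 == 1 {
        Some(values[mid])
    } else {
        // Even count: average the two middle values.
        Some((values[mid - 1] + values[mid]) / 2.0)
    }
}

/// Groups (date, value) rows by date and returns one median per date,
/// ordered by date because a BTreeMap iterates in key order.
fn median_by_date(rows: &[(&str, f64)]) -> Vec<(String, f64)> {
    let mut groups: BTreeMap<String, Vec<f64>> = BTreeMap::new();
    for (date, value) in rows {
        groups.entry(date.to_string()).or_default().push(*value);
    }
    groups
        .into_iter()
        .filter_map(|(date, mut values)| median(&mut values).map(|m| (date, m)))
        .collect()
}

fn main() {
    let dates = ["1970-01-01", "1970-01-02", "1970-01-03", "1970-01-04"];
    // Stacked rows mirroring the three frames: a = 0..4, 4..8 and 8..12.
    let mut rows = Vec::new();
    for offset in [0.0, 4.0, 8.0] {
        for (i, date) in dates.iter().enumerate() {
            rows.push((*date, offset + i as f64));
        }
    }
    for (date, m) in median_by_date(&rows) {
        println!("{date} {m}");
    }
}
```

Each date ends up with three values (one per original frame), so the median is simply the middle one, which is why the result is 4.0 through 7.0.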

How to select columns by data type in Polars?

In pandas we have the pandas.DataFrame.select_dtypes method that selects certain columns depending on the dtype. Is there a similar way to do such a thing in Polars?
One can pass a data type to pl.col:
import polars as pl
df = pl.DataFrame(
{
"id": [1, 2, 3],
"name": ["John", "Jane", "Jake"],
"else": [10.0, 20.0, 30.0],
}
)
print(df.select([pl.col(pl.Utf8), pl.col(pl.Int64)]))
Output:
shape: (3, 2)
┌──────┬─────┐
│ name ┆ id │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════╪═════╡
│ John ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Jane ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Jake ┆ 3 │
└──────┴─────┘

Groupby with sum on Julia Dataframe

I am trying to do a groupby + sum on a Julia DataFrame with Int and String values.
For instance, df:
│ Row │ A │ B │ C │ D │
│ │ String │ String │ Int64 │ String │
├─────┼────────┼────────┼───────┼────────┤
│ 1 │ x1 │ a │ 12 │ green │
│ 2 │ x2 │ a │ 7 │ blue │
│ 3 │ x1 │ b │ 5 │ red │
│ 4 │ x2 │ a │ 4 │ blue │
│ 5 │ x1 │ b │ 9 │ yellow │
To do this in Python, the command could be:
df_group = df.groupby(['A', 'B']).sum().reset_index()
This gives the following output, with the initial column labels:
A B C
0 x1 a 12
1 x1 b 14
2 x2 a 11
I would like to do the same thing in Julia. I tried the following, unsuccessfully:
df_group = aggregate(df, ["A", "B"], sum)
MethodError: no method matching +(::String, ::String)
Do you have any idea how to do this in Julia?
Try the following, which selects the numeric columns (that is probably what you want, rather than all non-String columns):
numcols = names(df, findall(x -> eltype(x) <: Number, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum .=> numcols)
and if you want to allow missing values (and skip them when doing a summation) then:
numcols = names(df, findall(x -> eltype(x) <: Union{Missing,Number}, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum∘skipmissing .=> numcols)
Julia DataFrames support split-apply-combine logic, similar to pandas, so aggregation looks like
using DataFrames
df = DataFrame(:A => ["x1", "x2", "x1", "x2", "x1"],
:B => ["a", "a", "b", "a", "b"],
:C => [12, 7, 5, 4, 9],
:D => ["green", "blue", "red", "blue", "yellow"])
gdf = groupby(df, [:A, :B])
combine(gdf, :C => sum)
with the result
julia> combine(gdf, :C => sum)
3×3 DataFrame
│ Row │ A │ B │ C_sum │
│ │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1 │ x1 │ a │ 12 │
│ 2 │ x2 │ a │ 11 │
│ 3 │ x1 │ b │ 14 │
You can skip the creation of gdf with the help of Pipe.jl or Underscores.jl:
using Underscores
@_ groupby(df, [:A, :B]) |> combine(__, :C => sum)
You can name the new column with the following syntax:
julia> @_ groupby(df, [:A, :B]) |> combine(__, :C => sum => :C)
3×3 DataFrame
│ Row │ A │ B │ C │
│ │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1 │ x1 │ a │ 12 │
│ 2 │ x2 │ a │ 11 │
│ 3 │ x1 │ b │ 14 │