Polars column from conditioned lookup of dictionary values - dataframe

I want to map a key in one Polars DataFrame to another Polars DataFrame based on the relationships between columns. This is just a sample; the full DF1 and DF2 are much larger (2.5 million and 1.5 million rows, respectively).
DF1 = pl.DataFrame({
    'chr': ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2"],
    'start': [14516, 17380, 17381, 20177, 22254, 24357],
    'end': [14534, 17399, 17399, 20195, 22274, 24377]
})
DF2 = pl.DataFrame({
    'key': [1, 2, 3, 4, 5, 6],
    'chrom': ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2"],
    'start': [14516, 15377, 17376, 20177, 22254, 24357],
    'end': [14534, 15403, 17399, 20195, 22274, 24377]
})
What I want is:
DF1 = pl.DataFrame({
    'chr': ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2"],
    'start': [14516, 17380, 17381, 20177, 22254, 24357],
    'end': [14534, 17399, 17399, 20195, 22274, 24377],
    'key': [1, 3, 3, 4, 5, 6]
})
I'd like to assign the key from DF2 to DF1 when chrom matches chr and the start and end in DF1 are contained within the start and end in DF2.
I first attempted to iterate through the rows of DF1, looking up the matching entry in DF2:
sz = len(DF1[:, 0])
for i in range(sz):
    DF1[i, "key"] = DF2.filter(
        (pl.col("chrom") == DF1[i, "chr"])
        & (pl.col("start") <= DF1[i, "start"])
        & (pl.col("end") >= DF1[i, "end"])
    ).select("key")[0, 0]
Row iteration through a DataFrame is incredibly slow; this takes about 10 hours. I also tried writing into a np.array instead of directly into the DataFrame. That's a little faster, but still very slow.
I'm looking for a way to accomplish this using native Polars data structures. I don't have a key to join on, so the join and join_asof strategies don't seem to work.

join and filter should give you what you need:
(
    DF1.join(DF2, left_on="chr", right_on="chrom")
    .filter(
        (pl.col("start") >= pl.col("start_right"))
        & (pl.col("end") <= pl.col("end_right"))
    )
    .drop(["start_right", "end_right"])
)
shape: (6, 4)
┌────────────┬───────┬───────┬─────┐
│ chr ┆ start ┆ end ┆ key │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═══════╪═══════╪═════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1 │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3 │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3 │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4 │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5 │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6 │
└────────────┴───────┴───────┴─────┘
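At the stated scale (2.5 million and 1.5 million rows sharing few distinct chromosomes), be aware that the equi-join on the chromosome alone materializes every candidate pair before the filter prunes them. A minimal sketch of the same query through the lazy API (same frames and columns as above), with an added dedup step in case a DF1 interval is contained in more than one DF2 interval; this assumes any matching key is acceptable:
import polars as pl

# Same join + filter as above, built lazily so the pipeline runs as one query.
out = (
    DF1.lazy()
    .join(DF2.lazy(), left_on="chr", right_on="chrom")
    .filter(
        (pl.col("start") >= pl.col("start_right"))
        & (pl.col("end") <= pl.col("end_right"))
    )
    .drop(["start_right", "end_right"])
    # Keep one key per DF1 interval if several DF2 intervals contain it.
    .unique(subset=["chr", "start", "end"], keep="first")
    .collect()
)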

Using a join_asof might provide a performant solution:
(
    DF1
    .sort('start')
    .join_asof(
        DF2.sort('start'),
        by_left="chr",
        by_right="chrom",
        on="start",
        strategy="backward",
    )
    .filter(
        pl.col('end') <= pl.col('end_right')
    )
)
shape: (6, 5)
┌────────────┬───────┬───────┬─────┬───────────┐
│ chr ┆ start ┆ end ┆ key ┆ end_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═══════╪═══════╪═════╪═══════════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1 ┆ 14534 │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3 ┆ 17399 │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3 ┆ 17399 │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4 ┆ 20195 │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5 ┆ 22274 │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6 ┆ 24377 │
└────────────┴───────┴───────┴─────┴───────────┘
Note: this assumes that your start-end intervals in DF2 do not overlap.
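That assumption is cheap to verify; a minimal sketch (using DF2's column names from the question) that flags any interval starting at or before the end of the previous interval on the same chromosome:
import polars as pl

# Sort by chromosome and start, then compare each start with the previous
# row's end on the same chromosome; an empty result means no overlaps.
overlaps = DF2.sort(["chrom", "start"]).filter(
    (pl.col("chrom") == pl.col("chrom").shift(1))
    & (pl.col("start") <= pl.col("end").shift(1))
)
print(overlaps)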

Related

Add column/series to dataframe in polars rust

I had a hard time finding the answer to such a simple question. I was stuck trying to use "append", "extend", and other methods.
Finally I found/realized that the with_column method is the way to go in polars.
I figure that I should put out my solution here for others who get stuck on the same problem.
use polars::prelude::*;

fn main() {
    let a = Series::new("A", vec![1, 2, 3]);
    let b = Series::new("B", vec!["a", "b", "c"]);
    let mut df = DataFrame::new(vec![a, b]).unwrap();
    println!("{:?}", df);

    let c = Series::new("C", vec![true, false, false]);
    df.with_column(c).unwrap();
    println!("{:?}", df);

    let d = Series::new("D", vec![1.0, 2.0, 3.0]);
    let e = Series::new("E", vec![false, true, true]);
    // Also works with lazy and multiple series at once
    let df_lazy = df
        .lazy()
        .with_columns([d.lit(), e.lit()])
        .collect()
        .unwrap();
    println!("{:?}", df_lazy);
}
Output:
shape: (3, 2)
┌─────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ i32 ┆ str │
╞═════╪═════╡
│ 1 ┆ a │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ c │
└─────┴─────┘
shape: (3, 3)
┌─────┬─────┬───────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ bool │
╞═════╪═════╪═══════╡
│ 1 ┆ a ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ b ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ c ┆ false │
└─────┴─────┴───────┘
shape: (3, 5)
┌─────┬─────┬───────┬─────┬───────┐
│ A ┆ B ┆ C ┆ D ┆ E │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ bool ┆ f64 ┆ bool │
╞═════╪═════╪═══════╪═════╪═══════╡
│ 1 ┆ a ┆ true ┆ 1.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ b ┆ false ┆ 2.0 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ c ┆ false ┆ 3.0 ┆ true │
└─────┴─────┴───────┴─────┴───────┘
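Note the design difference: the eager with_column takes &mut self and mutates the DataFrame in place (which is why df shows column C in the second printout), while the lazy with_columns builds a new frame that only materializes on collect().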

Writing long rows using polars DataFrame throws runtime error

I have the following async block of code that runs as part of a larger program. It runs successfully when the DataFrame has 10 or 30 rows, but when I raise it to a larger number like 300, the attempt to write the DataFrame as Parquet throws a runtime error in each async thread.
Here is an example of the df it tries, and fails, to write:
df: Ok(shape: (300, 4)
┌───────────────┬─────────┬────────────────────────────────┬───────────────────────────────────────┐
│ timestamp ┆ ticker ┆ bids ┆ asks │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ list[list[f64]] ┆ list[list[f64]] │
╞═══════════════╪═════════╪════════════════════════════════╪═══════════════════════════════════════╡
│ 1674962575119 ┆ ETHUSDT ┆ [[1589.51, 4.731], [1589.31, ┆ [[1590.93, 39.234], [1592.1, 51.... │
│ ┆ ┆ 93.... ┆ │
│ 1674962575220 ┆ ETHUSDT ┆ [[1589.51, 22.094], [1589.31, ┆ [[1590.93, 39.234], [1592.1, 51.... │
│ ┆ ┆ 24... ┆ │
│ 1674962575319 ┆ ETHUSDT ┆ [[1589.51, 12.324], [1589.31, ┆ [[1590.93, 39.309], [1592.1, 52.... │
│ ┆ ┆ 24... ┆ │
│ 1674962575421 ┆ ETHUSDT ┆ [[1589.51, 0.0], [1589.31, ┆ [[1590.93, 26.735], [1592.1, 52.... │
│ ┆ ┆ 24.26... ┆ │
│ ... ┆ ... ┆ ... ┆ ... │
│ 1674962604998 ┆ ETHUSDT ┆ [[1440.0, 5138.446], [1558.38, ┆ [[1617.28, 40.969], [1593.72, 3.... │
│ ┆ ┆ 0... ┆ │
│ 1674962605101 ┆ ETHUSDT ┆ [[1440.0, 5138.446], [1558.38, ┆ [[1617.28, 40.969], [1593.72, 3.... │
│ ┆ ┆ 0... ┆ │
│ 1674962605201 ┆ ETHUSDT ┆ [[1440.0, 5138.446], [1558.38, ┆ [[1617.28, 40.969], [1593.72, 3.... │
│ ┆ ┆ 0... ┆ │
│ 1674962605301 ┆ ETHUSDT ┆ [[1440.0, 5138.446], [1558.38, ┆ [[1617.28, 40.969], [1593.72, 3.... │
│ ┆ ┆ 0... ┆ │
└───────────────┴─────────┴────────────────────────────────┴───────────────────────────────────────┘)
Here is the error.
thread '<unnamed>' panicked at 'range end index 131373 out of range for slice of length 301', C:\Users\username\.cargo\git\checkouts\arrow2-945af624853845da\baa2618\src\io\parquet\write\mod.rs:171:37
thread '<unnamed>' panicked at 'range end index 131373 out of range for slice of length 301', C:\Users\username\.cargo\git\checkouts\arrow2-945af624853845da\baa2618\src\io\parquet\write\mod.rs:171:37
Here is the code.
async {
    if main_vec.length() >= ROWS {
        let df = main_vec.to_df();
        let time = std::time::SystemTime::now()
            .duration_since(std::time::UNIX_EPOCH)
            .unwrap()
            .as_secs();
        let file = std::fs::File::create(&(time.to_string().trim().to_owned() + ".parquet")).unwrap();
        println!("Wrote parquet file: {}", &(time.to_string().trim().to_owned() + ".parquet"));
        // ERROR OCCURS HERE
        ParquetWriter::new(file)
            .with_compression(ParquetCompression::Snappy)
            .with_statistics(true)
            .finish(&mut df.collect().unwrap())
            .expect("Failed to write parquet file");
        main_vec.clear();
    }
}.await;

Compare two Pola-rs dataframes by position

Suppose I have two dataframes like:
let df_1 = df! {
    "1" => [1, 2, 2, 3, 4, 3],
    "2" => [1, 4, 2, 3, 4, 3],
    "3" => [1, 2, 6, 3, 4, 3],
}
.unwrap();

let mut df_2 = df_1.clone();
for idx in 0..df_2.width() {
    df_2.apply_at_idx(idx, |s| {
        s.cummax(false)
            .shift(1)
            .fill_null(FillNullStrategy::Zero)
            .unwrap()
    })
    .unwrap();
}

println!("{:#?}", df_1);
println!("{:#?}", df_2);
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 │
└─────┴─────┴─────┘
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 0 ┆ 0 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 6 │
└─────┴─────┴─────┘
and I want to compare them such that I end up with a boolean dataframe I can use as a predicate for a selection and aggregation:
shape: (6, 3)
┌───────┬───────┬───────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═══════╪═══════╪═══════╡
│ true ┆ true ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ true ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ false ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ true ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false ┆ false │
└───────┴───────┴───────┘
In pandas I might do df_1.where(df_1.ge(df_2)).sum().sum(). What's the idiomatic way to do that with Rust Pola-rs?
Edit: if you actually have a single dataframe you can do:
let mask =
    when(all().gt_eq(
        all().cummax(false).shift(1).fill_null(0)))
    .then(all())
    .otherwise(lit(NULL));

let out =
    df_1.lazy().select(&[mask])
        //.sum()
        .collect();
From https://stackoverflow.com/a/72899438:
Masking out values by columns in another DataFrame is a potential for errors caused by different lengths. For this reason polars does not encourage such operations.
It appears the recommended way is to add a suffix to one of the dataframes, "concat" them and use when/then/otherwise.
.with_context() has been added since that answer, and it can be used to access both dataframes.
In Python:
df1.lazy().with_context(
    df2.lazy().select(pl.all().suffix("_right"))
).select([
    pl.when(pl.col(name) >= pl.col(f"{name}_right"))
      .then(pl.col(name))
    for name in df1.columns
]).collect()
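Since no .otherwise() branch is given, rows failing the condition become null, matching the nulls in the Rust output below.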
I've not used Rust - but my attempt at a translation:
let mask =
    df_1.get_column_names().iter().map(|name|
        when(col(name).gt_eq(col(&format!("{name}_right"))))
        .then(col(name))
        .otherwise(lit(NULL))
    ).collect::<Vec<Expr>>();

let out =
    df_1.lazy()
        .with_context(&[
            df_2.lazy().select(&[all().suffix("_right")])
        ])
        .select(&mask)
        //.sum()
        .collect();

println!("{:#?}", out);
Output:
Ok(shape: (6, 3)
┌──────┬──────┬──────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞══════╪══════╪══════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ null ┆ 6 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ 4 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────┴──────┴──────┘)
It took me the longest time to figure out how to even do elementwise addition in polars. I guess that's just not the "normal" way to use these things, as in principle the columns can have different data types.
You can't call zip and map on the dataframe directly; that doesn't work. But df has a method iter() that gives you an iterator over all the columns. The columns are Series, and for those all sorts of elementwise operations are implemented.
Long story short:
let df = df!("A" => &[1, 2, 3], "B" => &[4, 5, 6]).unwrap();
let df2 = df!("A" => &[6, 5, 4], "B" => &[3, 2, 1]).unwrap();

// gt() yields a BooleanChunked, so convert each comparison result back into
// a Series before rebuilding a DataFrame from the collected columns.
let df3 = DataFrame::new(
    df.iter()
        .zip(df2.iter())
        .map(|(series1, series2)| series1.gt(series2).unwrap().into_series())
        .collect(),
)
.unwrap();
That gives you your boolean array. From here, I assume it should be possible to figure out how to use that for indexing. Probably another use of df.iter().zip(df3) or whatever.
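(For the indexing step, Series also has a zip_with method that takes a boolean mask and another Series and selects elementwise between the two; that may be the cleaner primitive here.)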

How do I take the median of several columns in polars-rs?

Say I have a DataFrame consisting of the following four Series:
use polars::prelude::*;
use chrono::prelude::*;
use chrono::Duration;

fn main() {
    let series_one = Series::new(
        "a",
        (0..4).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_two = Series::new(
        "a",
        (4..8).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_three = Series::new(
        "a",
        (8..12).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_dates = Series::new(
        "date",
        (0..4)
            .into_iter()
            .map(|v| NaiveDate::default() + Duration::days(v))
            .collect::<Vec<_>>(),
    );
and I join them as such:
let df_one = DataFrame::new(vec![series_one, series_dates.clone()]).unwrap();
let df_two = DataFrame::new(vec![series_two, series_dates.clone()]).unwrap();
let df_three = DataFrame::new(vec![series_three, series_dates.clone()]).unwrap();

let df = df_one
    .join(
        &df_two,
        ["date"],
        ["date"],
        JoinType::Outer,
        Some("1".into()),
    )
    .unwrap()
    .join(
        &df_three,
        ["date"],
        ["date"],
        JoinType::Outer,
        Some("2".into()),
    )
    .unwrap();
which produces the following DataFrame:
shape: (4, 4)
┌─────┬────────────┬─────┬──────┐
│ a ┆ date ┆ a1 ┆ a2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ date ┆ f64 ┆ f64 │
╞═════╪════════════╪═════╪══════╡
│ 0.0 ┆ 1970-01-01 ┆ 4.0 ┆ 8.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1.0 ┆ 1970-01-02 ┆ 5.0 ┆ 9.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ 1970-01-03 ┆ 6.0 ┆ 10.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0 ┆ 1970-01-04 ┆ 7.0 ┆ 11.0 │
└─────┴────────────┴─────┴──────┘
How can I make a new DataFrame which contains a date column and an a_median column, like so:
shape: (4, 2)
┌────────────┬────────────┐
│ a_median   ┆ date       │
│ ---        ┆ ---        │
│ f64        ┆ date       │
╞════════════╪════════════╡
│ 4.0        ┆ 1970-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5.0        ┆ 1970-01-02 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.0        ┆ 1970-01-03 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7.0        ┆ 1970-01-04 │
└────────────┴────────────┘
I think this is best accomplished via LazyFrames but I'm not sure how to get this exact result.
To get the results you're looking for, you can union the three DataFrames using the vstack method:
let mut unioned = df_one.vstack(&df_two).unwrap();
unioned = unioned.vstack(&df_three).unwrap();
Once you have a single DataFrame with all the records, you can group and aggregate them:
let aggregated = unioned
    .lazy()
    .groupby(["date"])
    .agg([col("a").median().alias("a_median")])
    .sort(
        "date",
        SortOptions {
            descending: false,
            nulls_last: true,
        },
    )
    .collect()
    .unwrap();
Which gives the expected results:
shape: (4, 2)
┌────────────┬──────────┐
│ date ┆ a_median │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪══════════╡
│ 1970-01-01 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-02 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-03 ┆ 6.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-04 ┆ 7.0 │
└────────────┴──────────┘
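One thing to note about this design: vstack requires the frames to share column names and dtypes, which holds here because all three value Series were created with the same name "a". The pre-joined df from the question (with columns a, a1, a2) could not be stacked this way without renaming.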

How to select columns by data type in Polars?

In pandas we have the pandas.DataFrame.select_dtypes method that selects certain columns depending on the dtype. Is there a similar way to do such a thing in Polars?
One can pass a data type to pl.col:
import polars as pl

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["John", "Jane", "Jake"],
        "else": [10.0, 20.0, 30.0],
    }
)
print(df.select([pl.col(pl.Utf8), pl.col(pl.Int64)]))
Output:
shape: (3, 2)
┌──────┬─────┐
│ name ┆ id │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════╪═════╡
│ John ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Jane ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Jake ┆ 3 │
└──────┴─────┘
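As a side note, newer Polars releases (0.18.1 and later, which may postdate this answer) also ship a selectors module for selecting whole dtype groups, which reads closer to pandas' select_dtypes; a minimal sketch on the same df:
import polars.selectors as cs

# Select string columns and all numeric columns by dtype group
print(df.select(cs.string(), cs.numeric()))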