How to set node properties as incrementing numbers, but resetting the increment when the value of a different property changes?

From the answer to How to set node properties as incrementing numbers?, I can set node properties as increasing numbers:
MATCH (n) WHERE n.gid = "A"
WITH collect(n) AS nodes
WITH apoc.coll.zip(nodes, range(0, size(nodes))) AS pairs
UNWIND pairs AS pair
SET (pair[0]).id = pair[1]
RETURN pair[0].gid, pair[0].id
╒═════════════╤════════════╕
│"pair[0].gid"│"pair[0].id"│
╞═════════════╪════════════╡
│"A"          │0           │
├─────────────┼────────────┤
│"A"          │1           │
├─────────────┼────────────┤
│"A"          │2           │
├─────────────┼────────────┤
│"A"          │3           │
├─────────────┼────────────┤
│"A"          │4           │
└─────────────┴────────────┘
But I have a list of gid values: ["A", "B", "C", "D", ...], and I want to run through all the nodes, resetting the incrementing numbers each time the gid value changes. So the result would be:
╒═════════════╤════════════╕
│"pair[0].gid"│"pair[0].id"│
╞═════════════╪════════════╡
│"A"          │0           │
├─────────────┼────────────┤
│"A"          │1           │
├─────────────┼────────────┤
│"A"          │2           │
├─────────────┼────────────┤
│...          │...         │
├─────────────┼────────────┤
│"A"          │15          │
├─────────────┼────────────┤
│"B"          │0           │
├─────────────┼────────────┤
│"B"          │1           │
└─────────────┴────────────┘
I use
MATCH (p) WITH collect(DISTINCT p.gid) AS gids
UNWIND gids AS gid
MATCH (n) WHERE n.gid = gid
WITH collect(n) AS nodes
WITH apoc.coll.zip(nodes, range(0, size(nodes))) AS pairs
UNWIND pairs AS pair
SET (pair[0]).id = pair[1]
RETURN pair[0].gid, pair[0].id
but the numbering doesn't reset:
╒═════════════╤════════════╕
│"pair[0].gid"│"pair[0].id"│
╞═════════════╪════════════╡
│"A"          │0           │
├─────────────┼────────────┤
│"A"          │1           │
├─────────────┼────────────┤
│"A"          │2           │
├─────────────┼────────────┤
│...          │...         │
├─────────────┼────────────┤
│"A"          │15          │
├─────────────┼────────────┤
│"B"          │16          │
├─────────────┼────────────┤
│"B"          │17          │
└─────────────┴────────────┘
Why is that?

The answer to the question "Why is that?" is that your Cypher only produces a single list: WITH collect(n) AS nodes has no grouping key, so it collects the nodes of every gid into one list, and the range keeps counting across all of them.
I think that if you split the lists by carrying n.gid through the WITH as a grouping key
MATCH (p) WITH collect(DISTINCT p.gid) AS gids
UNWIND gids AS gid
MATCH (n) WHERE n.gid = gid
WITH n.gid AS gid,       // <<< do a "group by"
     collect(n) AS nodes
WITH apoc.coll.zip(nodes, range(0, size(nodes))) AS pairs
UNWIND pairs AS pair
SET (pair[0]).id = pair[1]
RETURN pair[0].gid, pair[0].id
it could work. (With the grouping in place, the first two lines that collect and UNWIND the distinct gids become redundant; a plain MATCH (n) would do.)

Related

Polars column from conditioned look up of dictionary values

I want to map a key in one Polars DataFrame to another Polars DataFrame based on the relationships between columns. This is just a sample; the full DF1 and DF2 are much larger (2.5 million and 1.5 million rows, respectively).
DF1 = pl.DataFrame({
    'chr': ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2"],
    'start': [14516, 17380, 17381, 20177, 22254, 24357],
    'end': [14534, 17399, 17399, 20195, 22274, 24377]
})
DF2 = pl.DataFrame({
    'key': [1, 2, 3, 4, 5, 6],
    'chrom': ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2"],
    'start': [14516, 15377, 17376, 20177, 22254, 24357],
    'end': [14534, 15403, 17399, 20195, 22274, 24377]
})
What I want is:
DF1 = pl.DataFrame({
    'chr': ["GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2", "GL000008.2"],
    'start': [14516, 17380, 17381, 20177, 22254, 24357],
    'end': [14534, 17399, 17399, 20195, 22274, 24377],
    'key': [1, 3, 3, 4, 5, 6]
})
I'd like to assign the key from DF2 to DF1 when chrom matches chr and the start and end in DF1 are contained within the start and end in DF2.
I first attempted to iterate through the rows of DF1, looking up the matching entry in DF2:
sz = len(DF1[:, 0])
for i in range(sz):
    DF1[i, "key"] = DF2.filter(
        (pl.col("chrom") == DF1[i, "chr"])
        & (pl.col("start") <= DF1[i, "start"])
        & (pl.col("end") >= DF1[i, "end"])
    ).select('key')[0, 0]
Row iteration through a DataFrame is incredibly slow; this takes about 10 hours.
I also tried writing into a np.array instead of directly into the DataFrame. That's a little faster, but still very slow.
I'm looking for a way to accomplish this using native Polars data structures. I don't have a key to join on, so the join and join_asof strategies don't seem to work.
join and filter should give you what you need:
(
    df1.join(df2, left_on="chr", right_on="chrom")
    .filter(
        (pl.col("start") >= pl.col("start_right"))
        & (pl.col("end") <= pl.col("end_right"))
    )
    .drop(["start_right", "end_right"])
)
shape: (6, 4)
┌────────────┬───────┬───────┬─────┐
│ chr        ┆ start ┆ end   ┆ key │
│ ---        ┆ ---   ┆ ---   ┆ --- │
│ str        ┆ i64   ┆ i64   ┆ i64 │
╞════════════╪═══════╪═══════╪═════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1   │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3   │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3   │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4   │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5   │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6   │
└────────────┴───────┴───────┴─────┘
Using a join_asof might provide a performant solution:
(
    DF1
    .sort('start')
    .join_asof(
        DF2.sort('start'),
        by_left="chr",
        by_right="chrom",
        on="start",
        strategy="backward")
    .filter(
        pl.col('end') <= pl.col('end_right')
    )
)
shape: (6, 5)
┌────────────┬───────┬───────┬─────┬───────────┐
│ chr        ┆ start ┆ end   ┆ key ┆ end_right │
│ ---        ┆ ---   ┆ ---   ┆ --- ┆ ---       │
│ str        ┆ i64   ┆ i64   ┆ i64 ┆ i64       │
╞════════════╪═══════╪═══════╪═════╪═══════════╡
│ GL000008.2 ┆ 14516 ┆ 14534 ┆ 1   ┆ 14534     │
│ GL000008.2 ┆ 17380 ┆ 17399 ┆ 3   ┆ 17399     │
│ GL000008.2 ┆ 17381 ┆ 17399 ┆ 3   ┆ 17399     │
│ GL000008.2 ┆ 20177 ┆ 20195 ┆ 4   ┆ 20195     │
│ GL000008.2 ┆ 22254 ┆ 22274 ┆ 5   ┆ 22274     │
│ GL000008.2 ┆ 24357 ┆ 24377 ┆ 6   ┆ 24377     │
└────────────┴───────┴───────┴─────┘
Note: this assumes that your start-end intervals in DF2 do not overlap.
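A quick way to sanity-check that assumption before trusting the result (a sketch, not part of the original answer; it reuses the DF2 defined in the question and the older Polars API spellings used in this thread):
import polars as pl

# Within each chrom, flag any interval whose start falls before the previous end
overlaps = (
    DF2.sort(["chrom", "start"])
    .with_columns([pl.col("end").shift(1).over("chrom").alias("prev_end")])
    .filter(pl.col("start") <= pl.col("prev_end"))
)
assert overlaps.height == 0, "DF2 contains overlapping intervals"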

Compare two Pola-rs dataframes by position

Suppose I have two dataframes like:
let df_1 = df! {
    "1" => [1, 2, 2, 3, 4, 3],
    "2" => [1, 4, 2, 3, 4, 3],
    "3" => [1, 2, 6, 3, 4, 3],
}
.unwrap();
let mut df_2 = df_1.clone();
for idx in 0..df_2.width() {
    df_2.apply_at_idx(idx, |s| {
        s.cummax(false)
            .shift(1)
            .fill_null(FillNullStrategy::Zero)
            .unwrap()
    })
    .unwrap();
}
println!("{:#?}", df_1);
println!("{:#?}", df_2);
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1   ┆ 2   ┆ 3   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 2   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 3   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 4   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 3   ┆ 3   │
└─────┴─────┴─────┘
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1   ┆ 2   ┆ 3   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 4   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 4   ┆ 6   │
└─────┴─────┴─────┘
and I want to compare them such that I end up with a boolean dataframe I can use as a predicate for a selection and aggregation:
shape: (6, 3)
┌───────┬───────┬───────┐
│ 1     ┆ 2     ┆ 3     │
│ ---   ┆ ---   ┆ ---   │
│ bool  ┆ bool  ┆ bool  │
╞═══════╪═══════╪═══════╡
│ true  ┆ true  ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ true  ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ false ┆ true  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true  ┆ true  ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false ┆ false │
└───────┴───────┴───────┘
In Python pandas I might do df_1.where(df_1.ge(df_2)).sum().sum(). What's the idiomatic way to do that with Rust Pola-rs?
Edit: if you actually have a single dataframe you can do:
let mask =
    when(all().gt_eq(
        all().cummax(false).shift(1).fill_null(lit(0))))
    .then(all())
    .otherwise(lit(NULL));
let out =
    df_1.lazy().select(&[mask])
        //.sum()
        .collect();
From https://stackoverflow.com/a/72899438:
"Masking out values by columns in another DataFrame is a potential for errors caused by different lengths. For this reason polars does not encourage such operations."
It appears the recommended way is to add a suffix to one of the dataframes, "concat" them, and use when/then/otherwise.
.with_context() has been added since that answer, which can be used to access both dataframes.
In Python:
df1.lazy().with_context(
    df2.lazy().select(pl.all().suffix("_right"))
).select([
    pl.when(pl.col(name) >= pl.col(f"{name}_right"))
    .then(pl.col(name))
    for name in df1.columns
]).collect()
I've not used Rust, but my attempt at a translation:
let mask = df_1
    .get_column_names()
    .into_iter()
    .map(|name| {
        when(col(name).gt_eq(col(&format!("{name}_right"))))
            .then(col(name))
            .otherwise(lit(NULL))
    })
    .collect::<Vec<Expr>>();
let out = df_1
    .lazy()
    .with_context(&[df_2.lazy().select(&[all().suffix("_right")])])
    .select(&mask)
    //.sum()
    .collect();
println!("{:#?}", out);
Output:
Ok(shape: (6, 3)
┌──────┬──────┬──────┐
│ 1    ┆ 2    ┆ 3    │
│ ---  ┆ ---  ┆ ---  │
│ i32  ┆ i32  ┆ i32  │
╞══════╪══════╪══════╡
│ 1    ┆ 1    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 4    ┆ 2    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ null ┆ 6    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4    ┆ 4    ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────┴──────┴──────┘)
It took me the longest time to figure out how to even do elementwise addition in Polars. I guess that's just not the "normal" way to use these things, as in principle the columns can have different data types.
You can't call zip and map on the dataframe directly. That doesn't work.
But df has a method iter() that gives you an iterator over all the columns. The columns are Series, and for those all sorts of elementwise operations are implemented.
Long story short:
let df = df!("A" => &[1, 2, 3], "B" => &[4, 5, 6]).unwrap();
let df2 = df!("A" => &[6, 5, 4], "B" => &[3, 2, 1]).unwrap();
let df3 = DataFrame::new(
    df.iter()
        .zip(df2.iter())
        .map(|(series1, series2)| series1.gt(series2).unwrap().into_series())
        .collect::<Vec<Series>>(),
)
.unwrap();
That gives you your boolean array. From here, I assume it should be possible to figure out how to use that for indexing, probably with another use of df.iter().zip(df3.iter()) or similar.
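To finish the pandas comparison from the question, here is a sketch in Python Polars of df_1.where(df_1.ge(df_2)).sum().sum(), built on the with_context approach above (it assumes the older API spellings used in this thread: cummax, suffix, with_context):
import polars as pl

df1 = pl.DataFrame({
    "1": [1, 2, 2, 3, 4, 3],
    "2": [1, 4, 2, 3, 4, 3],
    "3": [1, 2, 6, 3, 4, 3],
})
# df2 is the shifted running max of df1, as in the question
df2 = df1.select(pl.all().cummax().shift(1).fill_null(0))

masked = df1.lazy().with_context(
    df2.lazy().select(pl.all().suffix("_right"))
).select([
    pl.when(pl.col(name) >= pl.col(f"{name}_right")).then(pl.col(name))
    for name in df1.columns
]).collect()

# Equivalent of pandas' .sum().sum(): sum each column, then add the totals
total = sum(v for v in masked.sum().row(0) if v is not None)
print(total)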

How do I take the median of several columns in polars-rs?

Say I have a DataFrame consisting of the following four Series:
use polars::prelude::*;
use chrono::prelude::*;
use chrono::Duration;

fn main() {
    let series_one = Series::new(
        "a",
        (0..4).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_two = Series::new(
        "a",
        (4..8).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_three = Series::new(
        "a",
        (8..12).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_dates = Series::new(
        "date",
        (0..4)
            .into_iter()
            .map(|v| NaiveDate::default() + Duration::days(v))
            .collect::<Vec<_>>(),
    );
and I join them as such:
let df_one = DataFrame::new(vec![series_one, series_dates.clone()]).unwrap();
let df_two = DataFrame::new(vec![series_two, series_dates.clone()]).unwrap();
let df_three = DataFrame::new(vec![series_three, series_dates.clone()]).unwrap();
let df = df_one
    .join(
        &df_two,
        ["date"],
        ["date"],
        JoinType::Outer,
        Some("1".into()),
    )
    .unwrap()
    .join(
        &df_three,
        ["date"],
        ["date"],
        JoinType::Outer,
        Some("2".into()),
    )
    .unwrap();
which produces the following DataFrame:
shape: (4, 4)
┌─────┬────────────┬─────┬──────┐
│ a   ┆ date       ┆ a1  ┆ a2   │
│ --- ┆ ---        ┆ --- ┆ ---  │
│ f64 ┆ date       ┆ f64 ┆ f64  │
╞═════╪════════════╪═════╪══════╡
│ 0.0 ┆ 1970-01-01 ┆ 4.0 ┆ 8.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1.0 ┆ 1970-01-02 ┆ 5.0 ┆ 9.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ 1970-01-03 ┆ 6.0 ┆ 10.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0 ┆ 1970-01-04 ┆ 7.0 ┆ 11.0 │
└─────┴────────────┴─────┴──────┘
How can I make a new DataFrame which contains a date column and an a_median column, like so?
┌──────────┬────────────┐
│ a_median ┆ date       │
│ ---      ┆ ---        │
│ f64      ┆ date       │
╞══════════╪════════════╡
│ 4.0      ┆ 1970-01-01 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5.0      ┆ 1970-01-02 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.0      ┆ 1970-01-03 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7.0      ┆ 1970-01-04 │
└──────────┴────────────┘
I think this is best accomplished via LazyFrames but I'm not sure how to get this exact result.
To get the results you're looking for, you can union the three DataFrames using the vstack method:
let mut unioned = df_one.vstack(&df_two).unwrap();
unioned = unioned.vstack(&df_three).unwrap();
Once you have a single DataFrame with all the records, you can group and aggregate them:
let aggregated = unioned
    .lazy()
    .groupby(["date"])
    .agg([col("a").median().alias("a_median")])
    .sort(
        "date",
        SortOptions {
            descending: false,
            nulls_last: true,
        },
    )
    .collect()
    .unwrap();
Which gives the expected results:
┌────────────┬──────────┐
│ date       ┆ a_median │
│ ---        ┆ ---      │
│ date       ┆ f64      │
╞════════════╪══════════╡
│ 1970-01-01 ┆ 4.0      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-02 ┆ 5.0      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-03 ┆ 6.0      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-04 ┆ 7.0      │
└────────────┴──────────┘
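For readers coming from the Python API, the same stack-then-aggregate idea looks like this (a sketch; the groupby/agg spellings follow the older Polars API used elsewhere in this thread, and the data mirrors the question's):
import polars as pl
from datetime import date, timedelta

dates = [date(1970, 1, 1) + timedelta(days=v) for v in range(4)]
df_one = pl.DataFrame({"a": [float(v) for v in range(0, 4)], "date": dates})
df_two = pl.DataFrame({"a": [float(v) for v in range(4, 8)], "date": dates})
df_three = pl.DataFrame({"a": [float(v) for v in range(8, 12)], "date": dates})

aggregated = (
    pl.concat([df_one, df_two, df_three])  # vertical concat, like vstack
    .groupby("date")
    .agg([pl.col("a").median().alias("a_median")])
    .sort("date")
)
print(aggregated)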

How to select columns by data type in Polars?

In pandas we have the pandas.DataFrame.select_dtypes method that selects certain columns depending on the dtype. Is there a similar way to do such a thing in Polars?
One can pass a data type to pl.col:
import polars as pl

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["John", "Jane", "Jake"],
        "else": [10.0, 20.0, 30.0],
    }
)
print(df.select([pl.col(pl.Utf8), pl.col(pl.Int64)]))
Output:
shape: (3, 2)
┌──────┬─────┐
│ name ┆ id  │
│ ---  ┆ --- │
│ str  ┆ i64 │
╞══════╪═════╡
│ John ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Jane ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ Jake ┆ 3   │
└──────┴─────┘
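Newer Polars versions also ship a selectors module, which reads even closer to pandas' select_dtypes (a sketch; assumes a Polars version that includes polars.selectors):
import polars as pl
import polars.selectors as cs

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["John", "Jane", "Jake"],
        "else": [10.0, 20.0, 30.0],
    }
)
# All string columns, then everything numeric (integer and float dtypes)
print(df.select(cs.string()))
print(df.select(cs.numeric()))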

Groupby with sum on Julia Dataframe

I am trying to do a groupby + sum on a Julia DataFrame with Int and String values.
For instance, df:
│ Row │ A      │ B      │ C     │ D      │
│     │ String │ String │ Int64 │ String │
├─────┼────────┼────────┼───────┼────────┤
│ 1   │ x1     │ a      │ 12    │ green  │
│ 2   │ x2     │ a      │ 7     │ blue   │
│ 3   │ x1     │ b      │ 5     │ red    │
│ 4   │ x2     │ a      │ 4     │ blue   │
│ 5   │ x1     │ b      │ 9     │ yellow │
To do this in Python, the command could be:
df_group = df.groupby(['A', 'B']).sum().reset_index()
I obtain the following output with the initial column labels:
    A  B   C
0  x1  a  12
1  x1  b  14
2  x2  a  11
I would like to do the same thing in Julia. I tried this way, unsuccessfully:
df_group = aggregate(df, ["A", "B"], sum)
MethodError: no method matching +(::String, ::String)
Do you have any idea of a way to do this in Julia?
Try the following (instead of the non-string columns, you probably want the columns that are numeric):
numcols = names(df, findall(x -> eltype(x) <: Number, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum .=> numcols)
and if you want to allow missing values (and skip them when doing a summation) then:
numcols = names(df, findall(x -> eltype(x) <: Union{Missing,Number}, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum∘skipmissing .=> numcols)
Julia DataFrames support split-apply-combine logic, similar to pandas, so aggregation looks like:
using DataFrames

df = DataFrame(:A => ["x1", "x2", "x1", "x2", "x1"],
               :B => ["a", "a", "b", "a", "b"],
               :C => [12, 7, 5, 4, 9],
               :D => ["green", "blue", "red", "blue", "yellow"])
gdf = groupby(df, [:A, :B])
combine(gdf, :C => sum)
with the result:
julia> combine(gdf, :C => sum)
3×3 DataFrame
│ Row │ A      │ B      │ C_sum │
│     │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ x1     │ a      │ 12    │
│ 2   │ x2     │ a      │ 11    │
│ 3   │ x1     │ b      │ 14    │
You can skip the creation of gdf with the help of Pipe.jl or Underscores.jl:
using Underscores
@_ groupby(df, [:A, :B]) |> combine(__, :C => sum)
You can give a name to the new column with the following syntax:
julia> @_ groupby(df, [:A, :B]) |> combine(__, :C => sum => :C)
3×3 DataFrame
│ Row │ A      │ B      │ C     │
│     │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ x1     │ a      │ 12    │
│ 2   │ x2     │ a      │ 11    │
│ 3   │ x1     │ b      │ 14    │
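For cross-checking against the pandas output shown in the question, here is the original one-liner as a runnable sketch (in newer pandas, numeric_only=True is needed to make the drop of the string column D explicit):
import pandas as pd

df = pd.DataFrame({
    "A": ["x1", "x2", "x1", "x2", "x1"],
    "B": ["a", "a", "b", "a", "b"],
    "C": [12, 7, 5, 4, 9],
    "D": ["green", "blue", "red", "blue", "yellow"],
})
# Group by A and B, summing the numeric column C; D is excluded from the sum
print(df.groupby(["A", "B"]).sum(numeric_only=True).reset_index())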