I have a graph database in AWS Neptune with ~60M nodes. A simple query that counts all nodes takes ~6-7 minutes.
The query is:
MATCH (n)
RETURN count(n) as count
Is 6-7 minutes to count 60 million nodes normal? What can I do to make it faster?
Explain (debug) result for this query:
Query:
// all nodes count
MATCH (n)
RETURN count(n) as cnt
ID Out #1 Out #2 Name Arguments Mode Units In Units Out Ratio Time (ms) Chunks In Chunks Out Units In (per chunk) Units Out (per chunk) Invocations OutWaitMs Out Blocked Rate [M/s] GCElapsedMs blocksIncGC blocksDecGC progressCount init [ms] done [ms] finalize [ms]
0 1 - SolutionInjection solutions=[{}] - 0 1 0.00 0 0 0 0.00 0.00 0 0 0 NaN
1 2 - DFESubquery subQuery=subQuery1
partitionId=0
details=
====> DFE execution time
  toPASTModel [micros]=213
  accepted [micros]=60
  ready [micros]=214
  running [micros]=68035522
  finished [micros]=0
===> DFE execution time (measured in DFENode)
  -> setupTime [ms]=0
  -> executionTime [ms]=68038
  -> resultReadTime [ms]=0
====> Original AST:
  DFEJoinGroupNode[](
    children=[
      DFEProjectionNode[NONE](
        projectedVars=[?cnt],
        child=DFEAggregationNode[NONE](
          groupByVars=[],
          aggregateExpressions=[
            DFEAggregateExpression(aggregate=DFEBindNode(countWithoutNulls(?n) AS ?cnt), isDistinct=false)],
          child=DFEJoinGroupNode[](
            children=[
              DFEPatternNode((?n, TermId(782U)[http://www.w3.org/1999/02/22-rdf-syntax-ns#type], ?n_label1, TermId(526U)[http://aws.amazon.com/neptune/vocab/v01/DefaultNamedGraph]) . project DISTINCT[?n] {rangeCountEstimate=78682901})],
            opInfo=none),
          opInfo=none),
        opInfo=none)],
    opInfo=none)
====> Preprocessed AST:
  DFEProjectionNode[NONE](
    projectedVars=[?cnt],
    child=DFEAggregationNode[NONE](
      groupByVars=[],
      aggregateExpressions=[
        DFEAggregateExpression(aggregate=DFEBindNode(countWithoutNulls(?n) AS ?cnt), isDistinct=false)],
      child=DFEPatternNode((?n, TermId(782U)[http://www.w3.org/1999/02/22-rdf-syntax-ns#type], ?n_label1, TermId(526U)[http://aws.amazon.com/neptune/vocab/v01/DefaultNamedGraph]) . project DISTINCT[?n] {rangeCountEstimate=78682901}),
      opInfo=(type=NoneOperatorStub, cost=(exp=(empty),wc=(empty)))),
    opInfo=(type=SubQuery, cost=(exp=(empty),wc=(empty))))
===> DFE configuration (given)
  solutionChunkSize=100000
  outputQueueSize=20
  numComputeCores=3
  maxParallelIO=5
  numInitialPermits=0
  frontiersAsInFilters=true
  partitionId=0
  isExplainRequested=true
  languageSpecifier=Open_Cypher
  planVariant=all/BLOCKING
  readForUpdate=false
====> DFE configuration (reported)
  numComputeCores=3
  numIOThreads=1
  numInitialPermits=1
  permitsSecured=1732
===> Top level Statistics & operator histogram
==> Statistics
  -> 68032185 / 68026188 micros total elapsed (incl. wait / excl. wait)
  -> 68032 / 68026 millis total elapsed (incl. wait / excl. wait)
  -> 68 / 68 secs total elapsed (incl. wait / excl. wait)
==> GC Summary
  -> 40.68ms spent in GC (0.06% of total time)
==> Operator histogram (all times are excluding wait)
  -> Total Operator #instances: 4
Operator     │ Time(ms) │ Time(%) │ rowsIn   │ rowsOut  │ chunksIn │ chunksOut │ instances │ invocation │ in(M/s) │ out(M/s) │ time/chunkIn(ms) │ time/chunkOut(ms) │ time/invoc(ms) │ GC(ms) │ GC(%)
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
pipelineScan │ 67518    │ 99.25%  │ 0        │ 60341570 │ 0        │ 1751      │ 1         │ 1          │ 0.00    │ 0.89     │ -                │ 38.56             │ 67518          │ 17.67  │ 0.03
merge        │ 415      │ 0.61%   │ 60341570 │ 60341570 │ 1751     │ 1         │ 1         │ 1          │ 145.53  │ 145.53   │ 0.24             │ 415               │ 415            │ 6.05   │ 1.46
daslReduce   │ 93.86    │ 0.14%   │ 60341570 │ 1        │ 1        │ 1         │ 1         │ 1          │ 642.91  │ 0.00     │ 93.86            │ 93.86             │ 93.86          │ 16.94  │ 18.05
drain        │ 0.06     │ 0.00%   │ 1        │ 0        │ 1        │ 0         │ 1         │ 1          │ 0.02    │ 0.00     │ 0.06             │ -                 │ 0.06           │ 0.02   │ 25.00
- 0 1 0.00 68048 0 1 0.00 1.00 2 0 0 0.00
2 - - TermResolution vars=[?cnt] id2value_opencypher 1 1 1.00 1.00 1 1 1.00 1.00 1 0 0 0.00
subQuery1
ID Out #1 Out #2 Name Arguments Mode Units In Units Out Ratio Time (ms) Chunks In Chunks Out Units In (per chunk) Units Out (per chunk) Invocations OutWaitMs Out Blocked Rate [M/s] GCElapsedMs blocksIncGC blocksDecGC progressCount init [ms] done [ms] finalize [ms]
0 1 - DFEPipelineScan pattern=distinct ?n (?n,rdf:type,?n_label1,neptune:DefaultNamedGraph)
outSchema=[?n]
patternEstimate=78682901 - 0 60341570 0.00 67518 0 1751 0.00 34461.21 1 5.97 0 0.89
1 2 - DFEMergeChunks inSchema=[?n]
outSchema=[?n] - 60341570 60341570 1.00 415 1751 1 34461.21 60341570.00 1 0.01 0 145.53
2 3 - DFEReduce functor=countWithoutNulls(?n)
inSchema=[?n]
outSchema=[?cnt] - 60341570 1 0.00 93.86 1 1 60341570.00 1.00 1 0.01 0 0.00
3 - - DFEDrain inSchema=[?cnt]
outSchema=[?cnt] - 1 0 0.00 0.06 1 0 1.00 0.00 1 0 0 0.00
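For context on where the time goes: the explain output shows pipelineScan accounting for 99.25% of the ~68 s, streaming all 60,341,570 node ids into the aggregation, so the runtime is dominated by a full scan that no rewrite of MATCH (n) can avoid. If a precomputed figure is acceptable, recent Neptune engine versions expose a graph summary API that reports node and edge counts without scanning. A minimal sketch, assuming your engine version supports the summary endpoint (hostname and port are placeholders):

curl https://your-neptune-endpoint:8182/propertygraph/summary

The response should include counts such as numNodes and numEdges; treat the exact field names as an assumption to verify against the AWS documentation for your engine version.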
Suppose I have two dataframes like:
let df_1 = df! {
    "1" => [1, 2, 2, 3, 4, 3],
    "2" => [1, 4, 2, 3, 4, 3],
    "3" => [1, 2, 6, 3, 4, 3],
}
.unwrap();

let mut df_2 = df_1.clone();
for idx in 0..df_2.width() {
    df_2.apply_at_idx(idx, |s| {
        s.cummax(false)
            .shift(1)
            .fill_null(FillNullStrategy::Zero)
            .unwrap()
    })
    .unwrap();
}

println!("{:#?}", df_1);
println!("{:#?}", df_2);
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 │
└─────┴─────┴─────┘
shape: (6, 3)
┌─────┬─────┬─────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 0 ┆ 0 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 4 ┆ 6 │
└─────┴─────┴─────┘
and I want to compare them such that I end up with a boolean dataframe I can use as a predicate for a selection and aggregation:
shape: (6, 3)
┌───────┬───────┬───────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═══════╪═══════╪═══════╡
│ true ┆ true ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ true ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ false ┆ true │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ false ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ true ┆ true ┆ false │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ false ┆ false ┆ false │
└───────┴───────┴───────┘
In Python Pandas I might do df_1.where(df_1.ge(df_2)).sum().sum(). What's the idiomatic way to do that with Rust Pola-rs?
<edit>
If you actually have a single dataframe you can do:
let mask = when(all().gt_eq(all().cummax(false).shift(1).fill_null(0)))
    .then(all())
    .otherwise(lit(NULL));

let out = df_1
    .lazy()
    .select(&[mask])
    //.sum()
    .collect();
</edit>
From https://stackoverflow.com/a/72899438
Masking out values by columns in another DataFrame is a potential for errors
caused by different lengths. For this reason polars does not encourage such
operations
It appears the recommended way is to add a suffix to one of the dataframes, "concat" them, and use when/then/otherwise.
Since that answer was written, .with_context() has been added, which can be used to access both dataframes.
In Python:
df1.lazy().with_context(
    df2.lazy().select(pl.all().suffix("_right"))
).select([
    pl.when(pl.col(name) >= pl.col(f"{name}_right")).then(pl.col(name))
    for name in df1.columns
]).collect()
I've not used Rust, but here is my attempt at a translation:
let mask = df_1
    .get_column_names()
    .iter()
    .map(|name| {
        when(col(name).gt_eq(col(&format!("{name}_right"))))
            .then(col(name))
            .otherwise(lit(NULL))
    })
    .collect::<Vec<Expr>>();

let out = df_1
    .lazy()
    .with_context(&[df_2.lazy().select(&[all().suffix("_right")])])
    .select(&mask)
    //.sum()
    .collect();

println!("{:#?}", out);
Output:
Ok(shape: (6, 3)
┌──────┬──────┬──────┐
│ 1 ┆ 2 ┆ 3 │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞══════╪══════╪══════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ null ┆ 6 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ 4 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────┴──────┴──────┘)
It took me the longest time to figure out how to even do elementwise addition in polars. I guess that's just not the "normal" way to use these things, since in principle the columns can have different data types.
You can't call zip and map on the dataframe directly; that doesn't work.
But df has a method iter() that gives you an iterator over all the columns. The columns are Series, and those have all sorts of elementwise operations implemented.
Long story short
let df = df!("A" => &[1, 2, 3], "B" => &[4, 5, 6]).unwrap();
let df2 = df!("A" => &[6, 5, 4], "B" => &[3, 2, 1]).unwrap();
// Compare column by column; gt yields a BooleanChunked, which has to be
// turned back into a Series before a DataFrame can be built from it.
let df3 = DataFrame::new(
    df.iter()
        .zip(df2.iter())
        .map(|(series1, series2)| series1.gt(series2).unwrap().into_series())
        .collect::<Vec<Series>>(),
)
.unwrap();
That gives you your boolean array. From here, I assume it should be possible to figure out how to use that for indexing. Probably another use of df.iter().zip(df3) or whatever.
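Building on that, here is a minimal sketch of the pandas df.where(mask).sum().sum() equivalent using the same iter/zip idea. The helper name masked_sum is my own, and it assumes a polars version where Series comparisons return a Result and Series::sum returns an Option (consistent with the unwrap on gt above, but worth verifying against your version):

use polars::prelude::*;

fn masked_sum(df_1: &DataFrame, df_2: &DataFrame) -> PolarsResult<i64> {
    let mut total = 0i64;
    for (s1, s2) in df_1.iter().zip(df_2.iter()) {
        let mask = s1.gt_eq(s2)?;     // elementwise >= between the two columns
        let kept = s1.filter(&mask)?; // keep only the values where the mask is true
        total += kept.sum::<i64>().unwrap_or(0); // sum the surviving values
    }
    Ok(total)
}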
Say I have a DataFrame consisting of the following four Series:
use polars::prelude::*;
use chrono::prelude::*;
use chrono::Duration;
fn main() {
    let series_one = Series::new(
        "a",
        (0..4).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_two = Series::new(
        "a",
        (4..8).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_three = Series::new(
        "a",
        (8..12).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
    );
    let series_dates = Series::new(
        "date",
        (0..4)
            .into_iter()
            .map(|v| NaiveDate::default() + Duration::days(v))
            .collect::<Vec<_>>(),
    );
and I join them as such:
    let df_one = DataFrame::new(vec![series_one, series_dates.clone()]).unwrap();
    let df_two = DataFrame::new(vec![series_two, series_dates.clone()]).unwrap();
    let df_three = DataFrame::new(vec![series_three, series_dates.clone()]).unwrap();

    let df = df_one
        .join(&df_two, ["date"], ["date"], JoinType::Outer, Some("1".into()))
        .unwrap()
        .join(&df_three, ["date"], ["date"], JoinType::Outer, Some("2".into()))
        .unwrap();
which produces the following DataFrame:
shape: (4, 4)
┌─────┬────────────┬─────┬──────┐
│ a ┆ date ┆ a1 ┆ a2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ date ┆ f64 ┆ f64 │
╞═════╪════════════╪═════╪══════╡
│ 0.0 ┆ 1970-01-01 ┆ 4.0 ┆ 8.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1.0 ┆ 1970-01-02 ┆ 5.0 ┆ 9.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ 1970-01-03 ┆ 6.0 ┆ 10.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0 ┆ 1970-01-04 ┆ 7.0 ┆ 11.0 │
└─────┴────────────┴─────┴──────┘
How can I make a new DataFrame which contains a date column and an a_median column, like so?
┌────────────┬────────────┐
│ a_median   ┆ date       │
│ ---        ┆ ---        │
│ f64        ┆ date       │
╞════════════╪════════════╡
│ 4.0        ┆ 1970-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5.0        ┆ 1970-01-02 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6.0        ┆ 1970-01-03 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7.0        ┆ 1970-01-04 │
└────────────┴────────────┘
I think this is best accomplished via LazyFrames but I'm not sure how to get this exact result.
To get the results you're looking for, you can union the three DataFrames using the vstack method:
let mut unioned = df_one.vstack(&df_two).unwrap();
unioned = unioned.vstack(&df_three).unwrap();
Once you have a single DataFrame with all the records, you can group and aggregate them:
let aggregated = unioned
    .lazy()
    .groupby(["date"])
    .agg([col("a").median().alias("a_median")])
    .sort(
        "date",
        SortOptions {
            descending: false,
            nulls_last: true,
        },
    )
    .collect()
    .unwrap();
Which gives the expected results:
┌────────────┬──────────┐
│ date ┆ a_median │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪══════════╡
│ 1970-01-01 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-02 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-03 ┆ 6.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1970-01-04 ┆ 7.0 │
└────────────┴──────────┘
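If you want the exact column order from the question (a_median before date), one more select reorders the columns. A minimal sketch of my own, assuming the same lazy API as above:

let reordered = aggregated
    .lazy()
    .select([col("a_median"), col("date")])
    .collect()
    .unwrap();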
From the answer to How to set node properties as incrementing numbers?, I can set node properties as incrementing numbers:
MATCH (n) where n.gid="A"
WITH collect(n) as nodes
WITH apoc.coll.zip(nodes, range(0, size(nodes))) as pairs
UNWIND pairs as pair
SET (pair[0]).id = pair[1]
return pair[0].gid, pair[0].id
╒═════════════╤════════════╕
│"pair[0].gid"│"pair[0].id"│
╞═════════════╪════════════╡
│"A" │0 │
├─────────────┼────────────┤
│"A" │1 │
├─────────────┼────────────┤
│"A" │2 │
├─────────────┼────────────┤
│"A" │3 │
├─────────────┼────────────┤
│"A" │4 │
├─────────────┼────────────┤
But I have a list of gids: ["A", "B", "C", "D", ...], and I want to run through all the nodes so that each time the gid value changes, the incrementing numbers reset. So the result would be:
╒═════════════╤════════════╕
│"pair[0].gid"│"pair[0].id"│
╞═════════════╪════════════╡
│"A" │0 │
├─────────────┼────────────┤
│"A" │1 │
├─────────────┼────────────┤
│"A" │2 │
├─────────────┼────────────┤
│... │... │
├─────────────┼────────────┤
│"A" │15 │
├─────────────┼────────────┤
│"B" │1 │
├─────────────┼────────────┤
│"B" │2 │
I use
MATCH (p) with collect(DISTINCT p.gid) as gids
UNWIND gids as gid
MATCH (n) where n.gid=gid
WITH collect(n) as nodes
WITH apoc.coll.zip(nodes, range(0, size(nodes))) as pairs
UNWIND pairs as pair
SET (pair[0]).id = pair[1]
return pair[0].name, pair[0].id
and it doesn't reset the number, i.e.
╒═════════════╤════════════╕
│"pair[0].gid"│"pair[0].id"│
╞═════════════╪════════════╡
│"A" │0 │
├─────────────┼────────────┤
│"A" │1 │
├─────────────┼────────────┤
│"A" │2 │
├─────────────┼────────────┤
│... │... │
├─────────────┼────────────┤
│"A" │15 │
├─────────────┼────────────┤
│"B" │16 │
├─────────────┼────────────┤
│"B" │17 │
Why is that?
The answer to the question "Why is that?" is that your Cypher only produces a single list: the WITH collect(n) as nodes has no grouping key, so it aggregates the nodes of every gid into one collection. If you split the lists by carrying n.gid through the WITH (which acts as a "group by"):
MATCH (p) with collect(DISTINCT p.gid) as gids
UNWIND gids as gid
MATCH (n) where n.gid=gid
WITH n.gid AS gid,      // <<< do a "group by"
     collect(n) as nodes
WITH apoc.coll.zip(nodes, range(0, size(nodes))) as pairs
UNWIND pairs as pair
SET (pair[0]).id = pair[1]
return pair[0].name, pair[0].id
it should work.
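For what it's worth, the first two lines look redundant once the grouping WITH is in place, since collect(n) keyed by n.gid already partitions the nodes per gid. A shorter variant (my own untested sketch) would be:
MATCH (n)
WITH n.gid AS gid, collect(n) AS nodes
WITH apoc.coll.zip(nodes, range(0, size(nodes))) AS pairs
UNWIND pairs AS pair
SET (pair[0]).id = pair[1]
RETURN pair[0].gid, pair[0].id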
I am trying to dispose of an RxJava Disposable in a ViewModel. I'm pretty sure it is good practice to clear disposables in the onCleared method of the ViewModel so that the ViewModel itself doesn't leak. But when there's an UnknownHostException right before the dispose happens, the ViewModel leaks. I opened LeakCanary and it says that onError-related references caused the leak.
I figured out that UnknownHostException is actually caught by RxJava's global error handler, so why does this exception keep referencing the onError callback and cause a leak, even after I dispose the disposable?
UseCase.kt
abstract class UseCase {
    protected var lastDisposable: Disposable? = null
    private val compositeDisposable: CompositeDisposable = CompositeDisposable()

    fun disposeLast() {
        lastDisposable?.let {
            if (!it.isDisposed) {
                it.dispose()
            }
        }
    }

    fun dispose() {
        compositeDisposable.clear()
    }

    fun Disposable.addDisposable() {
        compositeDisposable.add(this)
    }

    class None()
}
FlowableUseCase.kt
abstract class FlowableUseCase<in Params, Model> @Inject constructor(
    private val flowableRxTransformer: FlowableRxTransformer<Model>
) : UseCase() {

    internal abstract fun buildUseCaseFlowable(params: Params): Flowable<Model>

    operator fun invoke(
        params: Params,
        onLoading: () -> Unit = {},
        onError: ((error: Throwable) -> Unit) = {},
        onSuccess: ((entity: Model) -> Unit) = {},
        onFinished: () -> Unit = {}
    ) {
        onLoading()
        disposeLast()
        lastDisposable = buildUseCaseFlowable(params)
            .compose(flowableRxTransformer)
            .doAfterTerminate(onFinished)
            .subscribe(onSuccess, onError) // leaks happened here
        lastDisposable?.addDisposable()
    }
}
If I remove the onError callback, there is no leak anymore, so I'm pretty sure it is what causes the memory leak. But removing it can potentially lead to an OnErrorNotImplementedException, so I don't want to do that.
Any kind of advice is appreciated. Thanks in advance.
LeakCanary Heap Dump
┬───
│ GC Root: System class
│
├─ android.provider.FontsContract class
│ Leaking: NO (App↓ is not leaking and a class is never leaking)
│ ↓ static FontsContract.sContext
├─ com.potatocandie.cleanprayertime.App instance
│ Leaking: NO (Application is a singleton)
│ mBase instance of android.app.ContextImpl, not wrapping known Android
│ context
│ ↓ App.componentManager
│ ~~~~~~~~~~~~~~~~
├─ dagger.hilt.android.internal.managers.ApplicationComponentManager instance
│ Leaking: UNKNOWN
│ Retaining 40 bytes in 3 objects
│ ↓ ApplicationComponentManager.component
│ ~~~~~~~~~
├─ com.potatocandie.cleanprayertime.DaggerApp_HiltComponents_SingletonC instance
│ Leaking: UNKNOWN
│ Retaining 7731 bytes in 279 objects
│ ↓ DaggerApp_HiltComponents_SingletonC.okHttpClientBuilder
│ ~~~~~~~~~~~~~~~~~~~
├─ okhttp3.OkHttpClient$Builder instance
│ Leaking: UNKNOWN
│ Retaining 203 bytes in 4 objects
│ ↓ OkHttpClient$Builder.dispatcher
│ ~~~~~~~~~~
├─ okhttp3.Dispatcher instance
│ Leaking: UNKNOWN
│ Retaining 288 bytes in 7 objects
│ ↓ Dispatcher.runningAsyncCalls
│ ~~~~~~~~~~~~~~~~~
├─ java.util.ArrayDeque instance
│ Leaking: UNKNOWN
│ Retaining 84 bytes in 2 objects
│ ↓ ArrayDeque.elements
│ ~~~~~~~~
├─ java.lang.Object[] array
│ Leaking: UNKNOWN
│ Retaining 64 bytes in 1 objects
│ ↓ Object[].[0]
│ ~~~
├─ okhttp3.internal.connection.RealCall$AsyncCall instance
│ Leaking: UNKNOWN
│ Retaining 3815 bytes in 123 objects
│ ↓ RealCall$AsyncCall.responseCallback
│ ~~~~~~~~~~~~~~~~
├─ retrofit2.OkHttpCall$1 instance
│ Leaking: UNKNOWN
│ Retaining 3795 bytes in 122 objects
│ Anonymous class implementing okhttp3.Callback
│ ↓ OkHttpCall$1.val$callback
│ ~~~~~~~~~~~~
├─ retrofit2.adapter.rxjava3.CallEnqueueObservable$CallCallback instance
│ Leaking: UNKNOWN
│ Retaining 3705 bytes in 119 objects
│ ↓ CallEnqueueObservable$CallCallback.observer
│ ~~~~~~~~
├─ retrofit2.adapter.rxjava3.BodyObservable$BodyObserver instance
│ Leaking: UNKNOWN
│ Retaining 3687 bytes in 118 objects
│ ↓ BodyObservable$BodyObserver.observer
│ ~~~~~~~~
├─ io.reactivex.rxjava3.internal.operators.flowable.
│ FlowableFromObservable$SubscriberObserver instance
│ Leaking: UNKNOWN
│ Retaining 3674 bytes in 117 objects
│ ↓ FlowableFromObservable$SubscriberObserver.downstream
│ ~~~~~~~~~~
├─ io.reactivex.rxjava3.internal.operators.flowable.
│ FlowableOnBackpressureLatest$BackpressureLatestSubscriber instance
│ Leaking: UNKNOWN
│ Retaining 3658 bytes in 116 objects
│ ↓ FlowableOnBackpressureLatest$BackpressureLatestSubscriber.downstream
│ ~~~~~~~~~~
├─ io.reactivex.rxjava3.internal.operators.flowable.FlowableMap$MapSubscriber
│ instance
│ Leaking: UNKNOWN
│ Retaining 3596 bytes in 113 objects
│ ↓ FlowableMap$MapSubscriber.downstream
│ ~~~~~~~~~~
├─ io.reactivex.rxjava3.internal.operators.flowable.
│ FlowableDoOnEach$DoOnEachSubscriber instance
│ Leaking: UNKNOWN
│ Retaining 3564 bytes in 112 objects
│ ↓ FlowableDoOnEach$DoOnEachSubscriber.downstream
│ ~~~~~~~~~~
├─ io.reactivex.rxjava3.internal.operators.flowable.
│ FlowableRetryWhen$RetryWhenSubscriber instance
│ Leaking: UNKNOWN
│ Retaining 3400 bytes in 105 objects
│ ↓ FlowableRetryWhen$RetryWhenSubscriber.downstream
│ ~~~~~~~~~~
├─ io.reactivex.rxjava3.subscribers.SerializedSubscriber instance
│ Leaking: UNKNOWN
│ Retaining 2860 bytes in 79 objects
│ ↓ SerializedSubscriber.downstream
│ ~~~~~~~~~~
├─ io.reactivex.rxjava3.internal.operators.flowable.
│ FlowableDoOnEach$DoOnEachSubscriber instance
│ Leaking: UNKNOWN
│ Retaining 2837 bytes in 78 objects
│ ↓ FlowableDoOnEach$DoOnEachSubscriber.downstream
│ ~~~~~~~~~~
├─ io.reactivex.rxjava3.internal.operators.flowable.
│ FlowableFlatMap$InnerSubscriber instance
│ Leaking: UNKNOWN
│ Retaining 2781 bytes in 76 objects
│ ↓ FlowableFlatMap$InnerSubscriber.parent
│ ~~~~~~
├─ io.reactivex.rxjava3.internal.operators.flowable.
│ FlowableFlatMap$MergeSubscriber instance
│ Leaking: UNKNOWN
│ Retaining 2732 bytes in 75 objects
│ ↓ FlowableFlatMap$MergeSubscriber.downstream
│ ~~~~~~~~~~
├─ io.reactivex.rxjava3.internal.operators.flowable.FlowableMap$MapSubscriber
│ instance
│ Leaking: UNKNOWN
│ Retaining 1837 bytes in 58 objects
│ ↓ FlowableMap$MapSubscriber.downstream
│ ~~~~~~~~~~
├─ io.reactivex.rxjava3.internal.operators.flowable.
│ FlowableDoOnEach$DoOnEachSubscriber instance
│ Leaking: UNKNOWN
│ Retaining 1793 bytes in 56 objects
│ ↓ FlowableDoOnEach$DoOnEachSubscriber.onNext
│ ~~~~~~
├─ com.potatocandie.domain.base.FlowableUseCase$invoke$5 instance
│ Leaking: UNKNOWN
│ Retaining 12 bytes in 1 objects
│ Anonymous class implementing io.reactivex.rxjava3.functions.Consumer
│ ↓ FlowableUseCase$invoke$5.this$0
│ ~~~~~~
├─ com.potatocandie.domain.usecases.GetCurrentMonthsPrayerTimesUseCase instance
│ Leaking: UNKNOWN
│ Retaining 45 bytes in 3 objects
│ ↓ GetCurrentMonthsPrayerTimesUseCase.lastDisposable
│ ~~~~~~~~~~~~~~
├─ io.reactivex.rxjava3.internal.subscribers.LambdaSubscriber instance
│ Leaking: UNKNOWN
│ Retaining 800 bytes in 34 objects
│ ↓ LambdaSubscriber.onError
│ ~~~~~~~
├─ com.potatocandie.domain.base.
│ FlowableUseCase$sam$io_reactivex_rxjava3_functions_Consumer$0 instance
│ Leaking: UNKNOWN
│ Retaining 760 bytes in 32 objects
│ Anonymous class implementing io.reactivex.rxjava3.functions.Consumer
│ ↓ FlowableUseCase$sam$io_reactivex_rxjava3_functions_Consumer$0.function
│ ~~~~~~~~
├─ com.potatocandie.cleanprayertime.features.today.
│ PrayerTimesViewModel$getCurrentMonthsPrayerTimes$2 instance
│ Leaking: UNKNOWN
│ Retaining 748 bytes in 31 objects
│ Anonymous subclass of kotlin.jvm.internal.Lambda
│ ↓ PrayerTimesViewModel$getCurrentMonthsPrayerTimes$2.this$0
│ ~~~~~~
╰→ com.potatocandie.cleanprayertime.features.today.PrayerTimesViewModel instance
Leaking: YES (ObjectWatcher was watching this because com.potatocandie.
cleanprayertime.features.today.PrayerTimesViewModel received
ViewModel#onCleared() callback)
Retaining 732 bytes in 30 objects
key = 1a238931-2884-4446-a5c2-66d46390134f
watchDurationMillis = 7824
retainedDurationMillis = 2822
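One detail worth noting from the trace: the path runs through GetCurrentMonthsPrayerTimesUseCase.lastDisposable, and disposing a Disposable does not clear the field that references it, so UseCase keeps a strong reference to the whole subscriber chain (and, via onError, to the ViewModel) for as long as the still-running OkHttp call retains its downstream. A minimal sketch of dropping those references explicitly; this is my own suggestion based on the code shown, not a confirmed fix:

fun disposeLast() {
    lastDisposable?.takeIf { !it.isDisposed }?.dispose()
    lastDisposable = null // drop the strong reference to the subscriber chain
}

fun dispose() {
    compositeDisposable.clear()
    lastDisposable = null // clear() disposes, but the field would still pin the chain
}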
I am trying to do a groupby + sum on a Julia DataFrame with Int and String values.
For instance, df:
│ Row │ A │ B │ C │ D │
│ │ String │ String │ Int64 │ String │
├─────┼────────┼────────┼───────┼────────┤
│ 1 │ x1 │ a │ 12 │ green │
│ 2 │ x2 │ a │ 7 │ blue │
│ 3 │ x1 │ b │ 5 │ red │
│ 4 │ x2 │ a │ 4 │ blue │
│ 5 │ x1 │ b │ 9 │ yellow │
To do this in Python, the command could be:
df_group = df.groupby(['A', 'B']).sum().reset_index()
I will obtain the following output with the initial column labels:
A B C
0 x1 a 12
1 x1 b 14
2 x2 a 11
I would like to do the same thing in Julia. I tried this way, unsuccessfully:
df_group = aggregate(df, ["A", "B"], sum)
MethodError: no method matching +(::String, ::String)
Do you have any idea of a way to do this in Julia?
Try the following (instead of all non-string columns, you probably want the columns that are numeric):
numcols = names(df, findall(x -> eltype(x) <: Number, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum .=> numcols)
and if you want to allow missing values (and skip them when doing a summation) then:
numcols = names(df, findall(x -> eltype(x) <: Union{Missing,Number}, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum∘skipmissing .=> numcols)
Julia DataFrames support split-apply-combine logic, similar to pandas, so aggregation looks like
using DataFrames
df = DataFrame(:A => ["x1", "x2", "x1", "x2", "x1"],
:B => ["a", "a", "b", "a", "b"],
:C => [12, 7, 5, 4, 9],
:D => ["green", "blue", "red", "blue", "yellow"])
gdf = groupby(df, [:A, :B])
combine(gdf, :C => sum)
with the result
julia> combine(gdf, :C => sum)
3×3 DataFrame
│ Row │ A │ B │ C_sum │
│ │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1 │ x1 │ a │ 12 │
│ 2 │ x2 │ a │ 11 │
│ 3 │ x1 │ b │ 14 │
You can skip the creation of gdf with the help of Pipe.jl or Underscores.jl:
using Underscores
@_ groupby(df, [:A, :B]) |> combine(__, :C => sum)
You can give a name to the new column with the following syntax:
julia> @_ groupby(df, [:A, :B]) |> combine(__, :C => sum => :C)
3×3 DataFrame
│ Row │ A │ B │ C │
│ │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1 │ x1 │ a │ 12 │
│ 2 │ x2 │ a │ 11 │
│ 3 │ x1 │ b │ 14 │
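For completeness, the Pipe.jl version mentioned above uses _ as the placeholder instead of __. A minimal sketch, assuming Pipe.jl is installed:

using Pipe

# same aggregation, without naming the intermediate GroupedDataFrame
@pipe groupby(df, [:A, :B]) |> combine(_, :C => sum => :C)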