Normalize a node property in Cypher: linear and log normalization

Just when I thought I had understood the basics of Cypher...
I would like to create two new properties on a node and set a normalized value and a log-normalized value based on an existing property that contains an integer (call it count). Setting the properties is easy; I am having problems calculating them.
So I tried (first with a linear normalization)
match (n:MYLABEL) where n.count > 0
with n
set n.count_n = n.count/max(n.count)
When I run
match (n:MYLABEL) where n.count > 0 return max(n.count)
I get the largest value of count. But if I run
match (n:MYLABEL) where n.count > 0 return n.count, max(n.count)
I get the same values for n.count and max(n.count), so I realized max() needs to operate over all the counts. I then tried
match (n:MYLABEL) where n.count > 0
with n, collect(n.count) as cl
return n.count, max(cl)
and I STILL get a count and [count] as output (the same value). I think I am missing something fundamental here. Can anyone assist with what the Cypher would look like for linear and log normalization? Grateful for your help.

Thanks to Andrew Bowman #neo4j
The short answer is
MATCH (n:MYLABEL)
WHERE n.count > 0
WITH max(n.count) as maxCount
MATCH (n:MYLABEL)
WHERE n.count > 0
SET n.count_n = n.count/maxCount
A different approach is
MATCH (n:MYLABEL)
WHERE n.count > 0
WITH max(n.count) as maxCount, collect(n) as nodes
UNWIND nodes as n
SET n.count_n = n.count/maxCount
The reason for my misunderstanding:
The grouping key provides context for what you're aggregating over. n.count was the grouping key, so per that value you asked for the max of that value, which is itself.
You need to remove n.count as the grouping key so that max() is calculated with respect to all results. Either remove it completely, or collect() the values (collect() is an aggregation, so the values are no longer part of the grouping key) and UNWIND the collection back to rows afterwards.
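For the log-normalized property the question also asks about, a sketch along the same lines might look like this (the property names count_n and count_log are illustrative; toFloat() avoids integer division, and the 1 + ... keeps log() away from zero):
MATCH (n:MYLABEL)
WHERE n.count > 0
WITH max(n.count) as maxCount
MATCH (n:MYLABEL)
WHERE n.count > 0
SET n.count_n = toFloat(n.count)/maxCount,
    n.count_log = log(1 + n.count)/log(1 + maxCount)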

Redis Secondary Indexes and Performance Question

I know that Redis doesn't really have the concept of secondary indexes, but that you can use the Z* commands to simulate one. I have a question about the best way to handle the following scenario.
We are using Redis to keep track of orders. But we also want to be able to find those orders by phone number or email ID. So here is our data:
> set 123 7245551212:dlw#email.com
> set 456 7245551212:dlw#email.com
> set 789 7245559999:kdw#email.com
> zadd phone-index 0 7245551212:123:dlw#email.com
> zadd phone-index 0 7245551212:456:dlw#email.com
> zadd phone-index 0 7245559999:789:kdw#email.com
I can see all the orders for a phone number via the following (is there a better way to get the range other than adding a 'Z' to the end?):
> zrangebylex phone-index [7245551212 (7245551212Z
1) "7245551212:123:dlw#dcsg.com"
2) "7245551212:456:dlw#dcsg.com"
My question is, is this going to perform well? Or should we just create a list that is keyed by phone number, and add an order ID to that list instead?
> rpush phone:7245551212 123
> rpush phone:7245551212 456
> rpush phone:7245559999 789
> lrange phone:7245551212 0 -1
1) "123"
2) "456"
Which would be the preferred method, especially related to performance?
RE: is there a better way to get the range other than adding a 'Z' to the end?
Yes, use the next immediate character instead of adding Z:
zrangebylex phone-index [7245551212 (7245551213
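For example, with the data above this should return the same two members:
> zrangebylex phone-index [7245551212 (7245551213
1) "7245551212:123:dlw#email.com"
2) "7245551212:456:dlw#email.com"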
But certainly the second approach offers better performance.
Using a sorted set for lexicographical indexing, you need to consider that:
The addition of elements, ZADD, is O(log(N))
The query, ZRANGEBYLEX, is O(log(N)+M) with N being the number of elements in the sorted set and M the number of elements being returned
In contrast, using lists:
The addition, RPUSH, is O(1)
The query, LRANGE, is O(S+N); since you start at index zero, it is effectively O(N), where N is the number of elements returned.
You can also use sets (SADD and SMEMBERS); the difference is that lists allow duplicates and preserve insertion order, while sets ensure uniqueness but do not preserve insertion order.
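For example, a minimal sketch of the set-based variant (using the same illustrative key name):
> sadd phone:7245551212 123 456
> smembers phone:7245551212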
A sorted set is backed by a skip list plus a dict for member lookups. When every element is added with the same score, ordering falls back to lexicographic comparison of the members, so a lexicographic range search is still O(log(N)+M).
So if you don't need range queries over phone numbers, use a list of order IDs keyed by the phone number for exact lookups. The same works for email (you could use a hash to combine the two). For point lookups this will perform much better than a sorted set.
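As a sketch, indexing by both phone number and email with plain lists (the key names are illustrative) might look like:
> rpush phone:7245551212 123
> rpush email:dlw#email.com 123
> rpush phone:7245551212 456
> rpush email:dlw#email.com 456
> lrange email:dlw#email.com 0 -1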

This is the query I am trying to run in Neo4j but it takes too long to run

I am trying to run this query using Neo4j but it takes too long (more than 30 min, for almost 2500 nodes and 1.8 million relationships) to run:
Match (a:Art)-[r1]->(b:Art)
with collect({start:a.url,end:b.url,score:r1.ed_sc}) as row1
MATCH (a:Art)-[r1]->(b:Art)-[r2]->(c:Art)
Where a.url<>c.url
with row1 + collect({start:a.url,end:c.url,score:r1.ed_sc*r2.ed_sc}) as row2
Match (a:Art)-[r1]->(b:Art)-[r2]->(c:Art)-[r3]->(d:Art)
WHERE a.url<>c.url and b.url<>d.url and a.url<>d.url
with row2+collect({start:a.url,end:d.url,score:r1.ed_sc*r2.ed_sc*r3.ed_sc}) as allRows
unwind allRows as row
RETURN row.start as start, row.end as end, sum(row.score) as final_score limit 10;
Here :Art is the label under which there are 2500 nodes, and there are bidirectional relationships between these nodes, each with a property called ed_sc. So basically I am trying to find the score between two nodes by traversing one-, two- and three-degree paths, and then summing these scores.
Is there a more optimized way to do this?
For one, I'd discourage the use of bidirectional relationships. If your graph is densely connected, this kind of modeling will play havoc with most queries like this.
Assuming url is unique for each :Art node, it would be better to compare the nodes themselves rather than their properties.
We should also be able to use variable-length relationships in place of your current approach:
MATCH p = (start:Art)-[*..3]->(end:Art)
WHERE all(node in nodes(p) WHERE single(t in nodes(p) where node = t))
WITH start, end, reduce(score = 1, rel in relationships(p) | score * rel.ed_sc) as score
WITH start, end, sum(score) as final_score
LIMIT 10
RETURN start.url as start, end.url as end, final_score
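Note that LIMIT 10 without an ORDER BY returns an arbitrary 10 pairs. If the intent is the 10 highest-scoring pairs, a sketch of the same query with explicit ordering might look like:
MATCH p = (start:Art)-[*..3]->(end:Art)
WHERE all(node in nodes(p) WHERE single(t in nodes(p) where node = t))
WITH start, end, reduce(score = 1.0, rel in relationships(p) | score * rel.ed_sc) as score
WITH start, end, sum(score) as final_score
ORDER BY final_score DESC
LIMIT 10
RETURN start.url as start, end.url as end, final_score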

What is the use case that makes EAVT index preferable to EATV?

From what I understand, EATV (which Datomic does not have) would be a great fit for as-of queries. On the other hand, I see no use case for EAVT.
This is analogous to row/primary key access. From the docs: "The EAVT index provides efficient access to everything about a given entity. Conceptually this is very similar to row access style in a SQL database, except that entities can possess arbitrary attributes rather than being limited to a predefined set of columns."
The immutable time/history side of Datomic is a motivating use case for it, but in general, it's still optimized around typical database operations, e.g. looking up an entity's attributes and their values.
Update:
Datomic stores datoms (in segments) in the index tree. So you navigate to a particular E's segment using the tree and then retrieve the datoms about that E in the segment, which are EAVT datoms. From your comment, I believe you're thinking of this as the navigation of more B-tree-like structures at each step, which is incorrect. Once you've navigated to the E, you are accessing a leaf segment of (sorted) datoms.
You are not looking for a single value at a specific point in time; you are looking for the set of values up to a specific point in time T. History is kept on a per-value basis (not a per-attribute basis).
For example, assert X, retract X then assert X again. These are 3 distinct facts over 3 distinct transactions. You need to compute that X was added, then removed and then possibly added again at some point.
You can do this with SQL:
create table Datoms (
  E bigint not null,
  A bigint not null,
  V varbinary(1536) not null,
  T bigint not null,
  Op bit not null -- assert/retract
)
select E, A, V
from Datoms
where E = 1 and T <= 42
group by E, A, V
having 0 < sum(case Op when 1 then +1 else -1 end)
The fifth component Op of the datom tells you whether the value is asserted (1) or retracted (0). By summing over this value (as +1/-1) we arrive at either 1 or 0.
Asserting the same value twice does nothing, and you always retract the old value before you assert a new value. The latter is a prerequisite for the algorithm to work out so nicely.
With an EAVT index, this is a very efficient query, and it's quite elegant. You can build a basic Datomic-like system in just 150 lines of SQL like this. The same pattern repeats for any permutation of the EAVT index that you want.
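As a concrete illustration of the assert/retract/assert example above (entity 1, attribute 10, an arbitrary value 0x58, over transactions 40-42; the numbers are made up):
insert into Datoms (E, A, V, T, Op) values (1, 10, 0x58, 40, 1); -- assert
insert into Datoms (E, A, V, T, Op) values (1, 10, 0x58, 41, 0); -- retract
insert into Datoms (E, A, V, T, Op) values (1, 10, 0x58, 42, 1); -- assert again
-- as of T <= 41 the sum is 0, so no row is returned;
-- as of T <= 42 the sum is 1, so (1, 10, 0x58) comes back
select E, A, V
from Datoms
where E = 1 and T <= 42
group by E, A, V
having 0 < sum(case Op when 1 then +1 else -1 end)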

How to use multiple conditions (with AND) in IIF expressions in SSRS

I want to hide rows in an SSRS report that have zero quantity.
There are several quantity columns, such as Opening Stock, Gross Dispatched, Transfer Out, Qty Sold, Stock Adjustment and Closing Stock.
I am doing this by using the following expression:
=IIF(Fields!OpeningStock.Value=0 AND Fields!GrossDispatched.Value=0 AND
Fields!TransferOutToMW.Value=0 AND Fields!TransferOutToDW.Value=0 AND
Fields!TransferOutToOW.Value=0 AND Fields!NetDispatched.Value=0 AND Fields!QtySold.Value=0
AND Fields!StockAdjustment.Value=0 AND Fields!ClosingStock.Value=0,True,False)
But when I use this expression for row visibility, the report hides all the rows except the Totals row, even though it should show the rows that have quantities in the above-mentioned columns.
The total values are shown correctly.
Note: I set this row visibility expression on the detail row.
Without the expression, the result is as follows: for the first 2 rows all the quantities are 0 (zero), and I want to hide these 2 rows.
How can I fix this problem, or which expression should I use to get the required results?
Could you try this out?
=IIF((Fields!OpeningStock.Value=0) AND (Fields!GrossDispatched.Value=0) AND
(Fields!TransferOutToMW.Value=0) AND (Fields!TransferOutToDW.Value=0) AND
(Fields!TransferOutToOW.Value=0) AND (Fields!NetDispatched.Value=0) AND (Fields!QtySold.Value=0)
AND (Fields!StockAdjustment.Value=0) AND (Fields!ClosingStock.Value=0),True,False)
Note: Setting Hidden to False will make the row visible
You don't need an IIF() at all here. The comparisons return true or false anyway.
Also, since this row visibility is on a group row, make sure you use the same aggregate function on the fields as is used for the fields displayed in the row. So if your group row shows sums, then you'd put this in the Hidden property:
=Sum(Fields!OpeningStock.Value) = 0 And
Sum(Fields!GrossDispatched.Value) = 0 And
Sum(Fields!TransferOutToMW.Value) = 0 And
Sum(Fields!TransferOutToDW.Value) = 0 And
Sum(Fields!TransferOutToOW.Value) = 0 And
Sum(Fields!NetDispatched.Value) = 0 And
Sum(Fields!QtySold.Value) = 0 And
Sum(Fields!StockAdjustment.Value) = 0 And
Sum(Fields!ClosingStock.Value) = 0
But with the above version, if one record has the value 1 and another has -1 and all the others are zero, then the sum is also zero and the row could be hidden. If that's not what you want, you could write a more complex expression:
=Sum(
IIF(
Fields!OpeningStock.Value=0 AND
Fields!GrossDispatched.Value=0 AND
Fields!TransferOutToMW.Value=0 AND
Fields!TransferOutToDW.Value=0 AND
Fields!TransferOutToOW.Value=0 AND
Fields!NetDispatched.Value=0 AND
Fields!QtySold.Value=0 AND
Fields!StockAdjustment.Value=0 AND
Fields!ClosingStock.Value=0,
0,
1
)
) = 0
This is essentially a fancy way of counting the number of rows in which any field is not zero. If every field is zero for every row in the group then the expression returns true and the row is hidden.
Here is an example that should give you some idea:
=IIF(First(Fields!Gender.Value,"vw_BrgyClearanceNew")="Female" and
(First(Fields!CivilStatus.Value,"vw_BrgyClearanceNew")="Married"),false,true)
I think you have to identify the data source or dataset name that your data is coming from.

Pig FILTER returns empty bag that I can't COUNT

I'm trying to count how many values in a data set match a filter condition, but I'm running into issues when the filter matches no entries.
There are a lot of columns in my data structure, but only three are of use for this example: key - the data key for the set (not unique), value - the float value as recorded, and nominal_value - a float representing the nominal value.
Our use case right now is to find the number of values that are 10% or more below the nominal value.
I'm doing something like this:
filtered_data = FILTER data BY value <= (0.9 * nominal_value);
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE COUNT(filtered_data.value);
DUMP filtered_count;
In most cases, there are no values that fall outside of the nominal range, so filtered_data is empty (or null; I'm not sure how to tell which). This results in filtered_count also being empty/null, which is not desirable.
How can I construct a statement that will return a value of 0 when filtered_data is empty/null? I've tried a couple of options that I've found online:
-- Extra parens in COUNT required to avoid syntax error
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE COUNT((filtered_data.value is null ? {} : filtered_data.value));
which results in:
Two inputs of BinCond must have compatible schemas. left hand side: #1259:bag{} right hand side: #1261:bag{#1260:tuple(cf#1038:float)}
And:
filtered_count = FOREACH (GROUP filtered_data BY key) GENERATE (filtered_data.value is null ? 0 : COUNT(filtered_data.value));
which results in an empty/null result.
The way you have it set up right now, you will lose information about any keys for which the count of bad values is 0. Instead, I'd recommend preserving all keys, so that you can see positive confirmation that the count was 0, instead of inferring it by absence. To do that, just use an indicator and then SUM that:
data2 =
    FOREACH data
    GENERATE
        key,
        ((value <= 0.9*nominal_value) ? 1 : 0) AS bad;
bad_count = FOREACH (GROUP data2 BY key) GENERATE group, SUM(data2.bad);
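Optionally, the grouped output can be aliased and inspected with DUMP (the alias names here are just illustrative):
bad_count = FOREACH (GROUP data2 BY key) GENERATE group AS key, SUM(data2.bad) AS num_bad;
DUMP bad_count;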