Get list of distinct keys and count of nodes that have them - cypher

Using a cypher query, I can get a list of keys for a type of node:
MATCH (n:Category) RETURN keys(n);
The result:
╒═══════════════════════════════════════════════════════════════╕
│"keys(n)" │
╞═══════════════════════════════════════════════════════════════╡
│["CreatedDate","Category","_lastModified","_x","_y","_created"]│
├───────────────────────────────────────────────────────────────┤
│["CreatedDate","_lastModified","_x","_y","_created","Category"]│
└───────────────────────────────────────────────────────────────┘
In the above example, there are only two nodes, and they have the same keys. Sometimes there are many nodes, and they don't all have the same keys.
How can I return an aggregate of the keys and the count of nodes that have that key?
In this case the result would be:
2 CreatedDate
2 Category
2 _lastModified
2 _x
2 _y
2 _created

Given some basic sample data:
MERGE (c1: Category { name: 'A', OnlyA: 'ValueOnlyA', Both: 'ValueBoth' })
MERGE (c2: Category { name: 'B', OnlyB: 'ValueOnlyB', Both: 'AnotherValueBoth' } )
You can get the set of property keys seen on any of the nodes in the set, and the number of times they were seen:
MATCH (c: Category)
UNWIND keys(c) as k
RETURN k, count(k)
╒═══════╤══════════╕
│"k"    │"count(k)"│
╞═══════╪══════════╡
│"Both" │2         │
├───────┼──────────┤
│"name" │2         │
├───────┼──────────┤
│"OnlyA"│1         │
├───────┼──────────┤
│"OnlyB"│1         │
└───────┴──────────┘
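If you need the same aggregation from application code, here is a minimal sketch using the official neo4j Python driver (the bolt URI and credentials are placeholder assumptions):

from neo4j import GraphDatabase

# Placeholder connection details -- adjust for your instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run(
        "MATCH (c:Category) UNWIND keys(c) AS k RETURN k, count(k) AS cnt"
    )
    for record in result:
        print(record["cnt"], record["k"])

driver.close()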

Related

Pandas - Count row with a specific value when grouped

I have a certain use case and I cannot do it well in pandas.
order_id  asset_id
1         A
1         B
1         C
2         A
2         C
3         A
4         B
4         C
In how many orders does asset A appear alone? In this case: 1 time (order 3).
In how many orders does asset A appear together with other assets? In this case: 2 times (orders 1 and 2).
It would be great to have some help with this. I can't figure out how to do it.
If you need to count unique-value membership per order_id group, first aggregate the values into sets and then compare them against the set {'A'}:
s = df.groupby('order_id')['asset_id'].agg(set)
print (s)
order_id
1    {A, B, C}
2       {A, C}
3          {A}
4       {B, C}
Name: asset_id, dtype: object
alone = (s == {'A'}).sum()
print (alone)
1
with_others = (s > {'A'}).sum()
print (with_others)
2
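For a self-contained run, the sample frame can be rebuilt first (a sketch mirroring the question's data):

import pandas as pd

# Rebuild the question's sample data.
df = pd.DataFrame({
    'order_id': [1, 1, 1, 2, 2, 3, 4, 4],
    'asset_id': ['A', 'B', 'C', 'A', 'C', 'A', 'B', 'C'],
})

s = df.groupby('order_id')['asset_id'].agg(set)
print((s == {'A'}).sum())  # alone: 1
print((s > {'A'}).sum())   # with others: 2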
Use groupby.agg with set operations:
(df.groupby('order_id')['asset_id']
   .agg(alone=lambda x: set(x) == {'A'},
        others=lambda x: set(x) > {'A'})
   .sum()
)
Output:
alone     1
others    2
dtype: int64

SQL Query to return which columns have different values given two rows

I have one table like this:
id  status       time  days  ...
1   optimal      60    21
2   optimal      50    21
3   no solution  60    30
4   optimal      21    31
5   no solution  34    12
...
There are many more rows and columns.
I need to make a query that will return which columns have different information, given two IDs.
Rephrasing it, I'll provide two IDs, for example 1 and 5 and I need to know if these two rows have any columns with different values. In this case, the result should be something like:
id  status       time  days
1   optimal      60    21
5   no solution  34    12
If I provide IDs 1 and 2, for example, the result should be:
id  time
1   60
2   50
The output format doesn't need to be like this, it only needs to show clearly which columns are different and their values
I can tell you off the bat that processing this data in a programming language will greatly help you in terms of simplicity and readability for this type of problem, but here is a thread on how it can be done in SQL:
Compare two rows and identify columns whose values are different
If you are looking for a solution in R, here is mine:
# Load the table (exported to a local .csv for convenience).
df <- read.csv(file = "sf.csv", header = TRUE)

# Return the two rows, keeping only the columns whose values differ.
# Note: rows are selected by position, so this assumes id matches the row number.
diff.eval <- function(first.id, second.id, eval.df) {
  res <- eval.df[c(first.id, second.id), ]
  cols <- colnames(eval.df)
  for (col in cols) {
    if (res[1, col] == res[2, col]) {
      res[, col] <- NULL  # drop columns identical in both rows
    }
  }
  return(res)
}

print(diff.eval(1, 5, df))
print(diff.eval(1, 2, df))
You just need to create a data frame from the table. I created a .csv locally for convenience and imported it into a data frame.
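For comparison, here is a minimal pandas sketch of the same idea (the frame mirrors the question's table; diff_cols is a hypothetical helper name):

import pandas as pd

# Mirror the question's table.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'status': ['optimal', 'optimal', 'no solution', 'optimal', 'no solution'],
    'time': [60, 50, 60, 21, 34],
    'days': [21, 21, 30, 31, 12],
}).set_index('id')

def diff_cols(frame, id1, id2):
    # Keep only the columns on which the two rows disagree.
    pair = frame.loc[[id1, id2]]
    return pair.loc[:, pair.nunique() > 1]

print(diff_cols(df, 1, 5))  # status, time, days
print(diff_cols(df, 1, 2))  # time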

How to get same rank for same scores in Redis' ZRANK?

If I have 5 members with scores as follows
a - 1
b - 2
c - 3
d - 3
e - 5
ZRANK of c returns 2, ZRANK of d returns 3
Is there a way to get same rank for same scores?
Example: ZRANK c = 2, d = 2, e = 3
If yes, then how to implement that in spring-data-redis?
Any real solution needs to fit the requirements, which are somewhat missing from the original question. My first answer assumed a small dataset, but that approach does not scale, as dense ranking is done (e.g. via Lua) in at least O(N).
So, assuming that there are a lot of users with scores, the direction that for_stack suggested is better, in which multiple data structures are combined. I believe this is the gist of his last remark.
To store users' scores you can use a Hash. While conceptually you can use a single key to store a Hash of all users scores, in practice you'd want to hash the Hash so it will scale. To keep this example simple, I'll ignore Hash scaling.
This is how you'd add (update) a user's score in Lua:
local hscores_key = KEYS[1]
local user = ARGV[1]
local increment = ARGV[2]
local new_score = redis.call('HINCRBY', hscores_key, user, increment)
Next, we want to track the current count of users per discrete score value so we keep another hash for that:
local old_score = new_score - increment
local hcounts_key = KEYS[2]
local old_count = redis.call('HINCRBY', hcounts_key, old_score, -1)
local new_count = redis.call('HINCRBY', hcounts_key, new_score, 1)
Now, the last thing we need to maintain is the per score rank, with a sorted set. Every new score is added as a member in the zset, and scores that have no more users are removed:
local zdranks_key = KEYS[3]
if new_count == 1 then
  redis.call('ZADD', zdranks_key, new_score, new_score)
end
if old_count == 0 then
  redis.call('ZREM', zdranks_key, old_score)
end
This 3-piece-script's complexity is O(logN) due to the use of the Sorted Set, but note that N is the number of discrete score values, not the users in the system. Getting a user's dense ranking is done via another, shorter and simpler script:
local hscores_key = KEYS[1]
local zdranks_key = KEYS[2]
local user = ARGV[1]
local score = redis.call('HGET', hscores_key, user)
return redis.call('ZRANK', zdranks_key, score)
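To call these scripts from an application, here is a minimal sketch using redis-py (the key names and the idea of keeping the fragments in .lua files are illustrative assumptions):

import redis

r = redis.Redis()  # assumes a local Redis instance

# The three update fragments above, concatenated into one script file,
# plus the rank script (hypothetical file names).
update = r.register_script(open('update_score.lua').read())
dense_rank = r.register_script(open('dense_rank.lua').read())

# KEYS: scores hash, per-score counts hash, distinct-scores zset.
update(keys=['hscores', 'hcounts', 'zdranks'], args=['alice', 10])
print(dense_rank(keys=['hscores', 'zdranks'], args=['alice']))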
You can achieve the goal with two Sorted Sets: one for member-to-score mapping, and one for score-to-rank mapping.
Add
Add items to member to score mapping: ZADD mem_2_score 1 a 2 b 3 c 3 d 5 e
Add the scores to score to rank mapping: ZADD score_2_rank 1 1 2 2 3 3 5 5
Search
Get score first: ZSCORE mem_2_score c, this should return the score, i.e. 3.
Get the rank for the score: ZRANK score_2_rank 3, this should return the dense ranking, i.e. 2.
In order to run it atomically, wrap the Add and Search operations in two Lua scripts.
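As a rough illustration of the two-Sorted-Set approach in redis-py (without the Lua wrapping, so the steps below are not atomic):

import redis

r = redis.Redis()

# Member -> score mapping.
r.zadd('mem_2_score', {'a': 1, 'b': 2, 'c': 3, 'd': 3, 'e': 5})
# Score -> rank mapping: each distinct score stored as its own member.
r.zadd('score_2_rank', {'1': 1, '2': 2, '3': 3, '5': 5})

score = r.zscore('mem_2_score', 'c')             # 3.0
print(r.zrank('score_2_rank', str(int(score))))  # 2, the dense rank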
Then there's this Pull Request - https://github.com/antirez/redis/pull/2011 - which is dead, but appears to make dense rankings on the fly. The original issue/feature request (https://github.com/antirez/redis/issues/943) got some interest so perhaps it is worth reviving it /cc #antirez :)
The rank is unique in a sorted set, and elements with the same score are ordered (ranked) lexically.
There is no Redis command that does this "dense ranking"
You could, however, use a Lua script that fetches a range from a sorted set and reduces it to your requested form. This could work on small data sets, but you'd have to devise something more complex to scale.
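A minimal client-side sketch of that reduction, assuming redis-py and a zset small enough to read in one go (the key name is illustrative):

import redis

r = redis.Redis()

# Read the whole zset and assign dense ranks client-side -- O(N).
dense = {}
rank, prev = -1, None
for member, score in r.zrange('myzset', 0, -1, withscores=True):
    if score != prev:  # new distinct score -> next dense rank
        rank += 1
        prev = score
    dense[member] = rank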
unsigned long zslGetRank(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *x;
    unsigned long rank = 0;
    int i;

    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        while (x->level[i].forward &&
               (x->level[i].forward->score < score ||
                (x->level[i].forward->score == score &&
                 sdscmp(x->level[i].forward->ele,ele) <= 0))) {
            rank += x->level[i].span;
            x = x->level[i].forward;
        }

        /* x might be equal to zsl->header, so test if obj is non-NULL */
        if (x->ele && x->score == score && sdscmp(x->ele,ele) == 0) {
            return rank;
        }
    }
    return 0;
}
https://github.com/redis/redis/blob/b375f5919ea7458ecf453cbe58f05a6085a954f0/src/t_zset.c#L475
This is the piece of code Redis uses to compute the rank in sorted sets. Right now, it just gives the rank based on the position in the skiplist (which is sorted by score).
What does the skiplistNode variable "span" mean in redis.h? (what is span?)

How to filter after group by and aggregate in Spark dataframe?

I have a spark dataframe df with schema as such:
[id:string, label:string, tags:string]
id | label | tag
---|-------|-----
1  | h     | null
1  | w     | x
1  | v     | null
1  | v     | x
2  | h     | x
3  | h     | x
3  | w     | x
3  | v     | null
3  | v     | null
4  | h     | null
4  | w     | x
5  | w     | x
(h,w,v are labels. x can be any non-empty values)
For each id, there is at most one label "h" or "w", but there might be multiple "v". I would like to select all the ids that satisfy the following conditions:
Each id has:
1. one label "h" and its tag = null,
2. one label "w" and its tag != null,
3. at least one label "v" for each id.
I am thinking that I need to create three columns checking each above conditions. And then I need to do a group by "id".
val hCheck = (label: String, tag: String) => { if (label == "h" && tag == null) 1 else 0 }
val udfHCheck = udf(hCheck)
val wCheck = (label: String, tag: String) => { if (label == "w" && tag != null) 1 else 0 }
val udfWCheck = udf(wCheck)
val vCheck = (label: String) => { if (label == null) 1 else 0 }
val udfVCheck = udf(vCheck)

val dfx = df.withColumn("hCheck", udfHCheck(col("label"), col("tag")))
  .withColumn("wCheck", udfWCheck(col("label"), col("tag")))
  .withColumn("vCheck", udfVCheck(col("label")))
  .select("id", "hCheck", "wCheck", "vCheck")
  .groupBy("id")
Somehow I need to group the three columns {"hCheck","wCheck","vCheck"} into vectors like [x,0,0], [0,x,0], [0,0,x], and check that these vectors include all three of [1,0,0], [0,1,0], [0,0,1].
I have not been able to solve this problem yet, and there might be a better approach. I hope someone can give me suggestions. Thanks!
To convert the three checks to a vector, you can do the following:
val df1 = df.withColumn("hCheck", udfHCheck(col("label"), col("tag")))
  .withColumn("wCheck", udfWCheck(col("label"), col("tag")))
  .withColumn("vCheck", udfVCheck(col("label")))
  .select($"id", array($"hCheck", $"wCheck", $"vCheck").as("vec"))
Next the groupby returns a grouped object on which you need to perform aggregations. Specifically to get all the vectors you should do something like:
.groupBy("id").agg(collect_list($"vec"))
Also, you do not need udfs for the various checks; you can do it with column semantics. For example, udfHCheck can be written as:
when($"label" === "h" && $"tag".isNull, 1).otherwise(0)
BTW, you said you wanted a label 'v' for each id, but in vCheck you just check whether the label is null.
Update: Alternative solution
Upon looking on this question again, I would do something like this:
val grouped = df.groupBy("id", "label").agg(count($"label").as("cnt"), first($"tag").as("tag"))
val filtered1 = grouped.filter($"label" === "v" || $"cnt" === 1)
val filtered2 = filtered1.filter($"label" === "v" || ($"label" === "h" && $"tag".isNull) || ($"label" === "w" && $"tag".isNotNull))
val ids = filtered2.groupBy("id").count.filter($"count" === 3)
The idea is that first we group by BOTH id and label, so we have information on each combination. The information we collect is how many rows it has (cnt) and the first tag (it doesn't matter which).
Now we do two filtering steps:
1. we need exactly one h and one w and any number of v so the first filter gets us these cases.
2. we make sure all the rules are met for each of the cases.
Now we have only combinations of id and label which match the rules so in order for the id to be legal we need to have exactly three instances of label. This leads to the second groupby which simply counts the number of labels which matched the rules. We need exactly three to be legal (i.e. matched all the rules).
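For reference, here is a sketch of the same alternative in PySpark (assuming df is the question's dataframe):

from pyspark.sql import functions as F

grouped = df.groupBy("id", "label").agg(F.count("label").alias("cnt"),
                                        F.first("tag").alias("tag"))
filtered1 = grouped.filter((F.col("label") == "v") | (F.col("cnt") == 1))
filtered2 = filtered1.filter((F.col("label") == "v")
                             | ((F.col("label") == "h") & F.col("tag").isNull())
                             | ((F.col("label") == "w") & F.col("tag").isNotNull()))
ids = filtered2.groupBy("id").count().filter(F.col("count") == 3)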

What is the structure of a node for this B-Tree specification?

I am trying to create a B-tree with the following properties:
Every node x contains following attributes:
x.n is the number of keys present in node x
x.key_1, x.key_2, ..., x.key_{x.n} are the keys present in the node
x.c_1, x.c_2, ..., x.c_{x.n}, x.c_{x.n+1} are the pointers to the child nodes
x.leaf is a boolean variable that shows whether the node is a leaf node or not
Based on this specification, how would I implement the structure for a node:
struct Node{
...?
}
The notional structure when drawn is something like this.
    a     b     c     d
  /    |     |     |    \
 la   bab   bbc   bcd   gd

la  = less than a
bab = between a and b
bbc = between b and c
bcd = between c and d
gd  = greater than d
Note there is one more child pointer than there are keys.
A B-tree of order N has at most N children per node, so we use BTREE_ORDER for this value, ensuring BTREE_ORDER is greater than 1.
The structure is most efficiently done as
#include <stdbool.h>
#include <stddef.h>

struct Node {
    size_t numKeys;                     /* x.n: number of keys stored */
    bool isLeaf;                        /* x.leaf */
    KEY_TYPE Key[BTREE_ORDER - 1];      /* x.key_1 .. x.key_{x.n} */
    struct Node *Children[BTREE_ORDER]; /* x.c_1 .. x.c_{x.n+1} */
};
So it has space for BTREE_ORDER - 1 keys and BTREE_ORDER child pointers. The arrangement is up to the code, and is
Children[0] Key[0] Children[1] Key[1] ... Key[numKeys - 1] Children[numKeys]
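To illustrate how that interleaving is used, here is a minimal search sketch (Python pseudocode mirroring the struct above, with snake_case field names; not part of the original answer):

def btree_search(node, key):
    i = 0
    # Skip keys smaller than the target; children[i] sits left of keys[i].
    while i < node.num_keys and key > node.keys[i]:
        i += 1
    if i < node.num_keys and node.keys[i] == key:
        return node  # key found in this node
    if node.is_leaf:
        return None  # leaf reached; key absent
    # Descend into the child between keys[i-1] and keys[i].
    return btree_search(node.children[i], key)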