OrientDB group by query using graph - sql

I need to perform an grouped aggregate on a property of vertices of a certain class, the group by field is however a vertex two steps away from my current node and I can't make it work.
My case:
The vertex A contains the property I want to aggregate on and have n number of vertices with label references. The vertex I want to group on is any of those vertices (B, C or D) if that vertex has a defined by edge to vertex F.
A ----references--> B --defined by--> E
\---references--> C --defined by--> F
\--references--> D --defined by--> G
...
The query I thought would work is:
select sum(property), groupOn from (
select property, out('references')[out('definedBy').#rid = F] as groupOn from AClass
) group by groupOn
But it doesn't work, the inner statement gives me a strange response which isn't correct (returns no vertices) and I suspect that out() isn't supported for bracket conditions (the reason for the .#rid is that the docs I found stated that only "=" are supported.
out('references')[out('definedBy') contains F] doesn't work either, that returns the out('definedBy') for the $current vertex).
Anyone with an idea how to achieve this? In my example, the result I would like is the property in one column and the #rid of the C vertex in another. Then I can happily perform my group by aggregates.

Solved it! In OrientDB 2.1 (I'm trying rc4) there's a new clause called UNWIND (see SELECT in docs).
Using UNWIND I can do:
SELECT sum(property), groupOn from (
SELECT property, out('references') as groupOn
FROM AClass
UNWIND groupOn
) WHERE groupOn.out('definedBy')=F
GROUP BY groupOn
It could potentially be a slow function depending on the number of vertices of AClass and its references, I'll report back if I find any performance issues.

Related

Can I rewrite this Cypher query to be compatible with Redis Graph?

My use-case is that I have some agents in organisation structure. I want select for some agent (can by me) to see sum (amount of money) of all contracts that that agents subordinates (and subordinates of their subordinates and so on...) created with clients grouped by contract category.
Problem is that Redis Graph do not currently support all predicate. But I need to filter relations between agents because we have multiple "modules" with different organisation structures and I need report just from one module at the time.
My current Cypher query is:
MATCH path = (:agent {id: 482})<-[:supervised *]-(b:agent)
WHERE all(rel IN relationships(path) WHERE
rel.module_id = 1
AND rel.valid_from < '2020-05-29'
AND '2020-05-29' < rel.valid_to)
WITH b as mediators
MATCH (mediators)-[:mediated]->(c:contract)
RETURN
c.category as category,
count(c) as contract_count,
sum(c.sum) as sum
ORDER BY sum DESC, category
This query works in Neo4j.
I don't event know if this query is correctly written for the type of result that I want.
My boss would really like to use Redis Graph instead Neo4j because of performance reasons but I can't find any way to rewrite this query to be functional in the Redis graph. Is it even possible?
Edit 1: I was told that we will be using graph just for currently valid data and just for one module so I no longer need functional all predicate but I am still interested in answer.
The ALL function isn't supported at the moment, we do intend to add it in the near future, an awkward way of achieving the same effect as the ALL function would be a combination of UNWIND and count
MATCH path = (:agent {id: 482})<-[:supervised *]-(b:agent)
WITH b AS b, relationships(path) AS edges, size(relationships(path)) AS edge_count
UNWIND edges AS r
WITH b AS b, edge_count AS edge_count, r AS r
WHERE r.module_id = 1 AND r.valid_from < '2020-05-29' AND '2020-05-29' < r.valid_to
WITH b AS b, edge_count AS edge_count, count(r) AS filter_edge_count
WHERE edge_count = filter_edge_count
....

How to find all rows matching one of multiple conditions?

I have a table, called V (as seen in the screenshot below). How would I find all the rows with a given value in either the IN or OUT columns?
For example, finding all rows with "#10:0" in IN or OUT below.
My best attempt is
SELECT FROM V WHERE ???(OUT OR IN) = '#10:0'
but I don't know what should be in place of the ???.
in your case you have a defined edge #rid so it would be better to start your query from E and not V.
By this way you'll have to use the inV(), outV() and bothV() functions.
Examples:
1) Getting the IN vertex (#12:1)
select expand(inV()) from #10:0
2) Getting the OUT vertex (#12:0)
select expand(outV()) from #10:0
2) Getting both vertices connected to #10:0 (#12:0 and #12:1)
select expand(bothV()) from #10:0
Hope it helps
based on your query i guess:
SELECT FROM V WHERE OUT = '#10:0' OR IN = '#10:0'
but the screenshot really hurt my eyes

Sum two counts in a new column without repeating the code

I have one maybe stupid question.
Look at the query :
select count(a) as A, count(b) as b, count(a)+count(b) as C
From X
How can I sum up the two columns without repeating the code:
Something like:
select count(a) as A, count(b) as b, A+B as C
From X
For the sake of completeness, using a CTE:
WITH V AS (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
)
SELECT A, B, A + B as C
FROM V
This can easily be handled by making the engine perform only two aggregate functions and a scalar computation. Try this.
SELECT A, B, A + B as C
FROM (
SELECT COUNT(a) as A, COUNT(b) as B
FROM X
) T
You may get the two individual counts of a same table and then get the summation of those counts, like bellow
SELECT
(SELECT COUNT(a) FROM X )+
(SELECT COUNT(b) FROM X )
AS C
Let's agree on one point: SQL is not an Object-Oriented language. In fact, when we think of computer languages, we are thinking of procedural languages (you use the language to describe step by step how you want the data to be manipulated). SQL is declarative (you describe the desired result and the system works out how to get it).
When you program in a procedural languages your main concerns are: 1) is this the best algorithm to arrive at the correct result? and 2) do these steps correctly implement the algorithm?
When you program in a declarative language your main concern is: is this the best description of the desired result?
In SQL, most of your effort will be going into correctly forming the filtering criteria (the where clause) and the join criteria (any on clauses). Once that is done correctly, you're pretty much just down to aggregating and formating (if applicable).
The first query you show is perfectly formed. You want the number of all the non-null values in A, the number of all the non-null values in B, and the total of both of those amounts. In some systems, you can even use the second form you show, which does nothing more than abstract away the count(x) text. This is convenient in that if you should have to change a count(x) to sum(x), you only have to make a change in one place rather than two, but it doesn't change the description of the data -- and that is important.
Using a CTE or nested query may allow you to mimic the abstraction not available in some systems, but be careful making cosmetic changes -- changes that do not alter the description of the data. If you look at the execution plan of the two queries as you show them, the CTE and the subquery, in most systems they will probably all be identical. In other words, you've painted your car a different color, but it's still the same car.
But since it now takes you two distinct steps in 4 or 5 lines to explain what it originally took only one step in one line to express, it's rather difficult to defend the notion that you have made an improvement. In fact, I'll bet you can come up with a lot more bullet points explaining why it would be better if you had started with the CTE or subquery and should change them to your original query than the other way around.
I'm not saying that what you are doing is wrong. But in the real world, we are generally short of the spare time to spend on strictly cosmetic changes.

Limit dimension values displayed in QlikView Pivot Table Chart

I have a pivot table chart in QlikView that has a dimension and an expression. The dimension is a column with 5 possible values: 'a','b','c','d','e'.
Is there a way to restrict the values to 'a','b' and 'c' only?
I would prefer to enforce this from the chart properties with a condition, instead of choosing the values from a listbox if possible.
Thank you very much, I_saw_drones! There is an problem I have though. I have different expressions defined depending on the category, like this:
IF( ([Category]) = 'A' , COUNT( {<[field1] = {'x','y'} >} [field2]), IF ([Category]) = 'B' , SUM( {<[field3] = {'z'} >} [field4]), IF (Category='C', ..., 0)))
In this case, where would I add $<Category={'A','B','C'} ? My expression so far doesn't help because although I tell QV to use a different formula/calculation for each category, the category overall (all 5 values) represents the dimension.
One possible method to do this is to use QlikView's Set Analysis to create an expression which sums only your desired values.
For this example, I have a very simple load script:
LOAD * INLINE [
Category, Value
A, 1
B, 2
C, 3
D, 4
E, 5
];
I then have the following Pivot Table Chart set up with a single expression which just sums the values:
What we need to do is to modify the expression, so that it only sums A, B and C from the Category field.
If I then use QlikView's Set Analysis to modify the expression to the following:
=sum({$<Category={A,B,C}>} Value)
I then achieve my desired result:
This then restricts my Pivot Table Chart to displaying only these three values for Category without me having to make a selection in a Listbox. The form of this expression also allows other dimensions to be filtered at the same time (i.e. the selections "add up"), so I could say, filter on a Country dimension, and my restriction for Category would still be applied.
How this works
Let's pick apart the expression:
=sum({$<Category={A,B,C}>} Value)
Here you can recognise the original form we had before (sum(Value)), but with a modification. The part {$<Category={A,B,C}>} is the Set Analysis part and has this format: {set_identifier<set_modifier>}. Coming back to our original expression:
{: Set Analysis expressions always start with a {.
$: Set Identifier: This symbol represents the current selections in the QlikView document. This means that any subsequent restrictions are applied on top of the existing selections. 1 can also be used, this represents the full set of data in your document irrespective of selections.
<: Start of the set modifiers.
Category={A,B,C}: The dimension that we wish to place a restriction on. The values required are contained within the curly braces and in this case they are ORed together.
>: End of the set modifiers.
}: End of the set analysis expression.
Set Analysis can be quite complex and I've only scratched the surface here, I would definitely recommend checking the QlikView topic "Set Analysis" in both the installed helpfile and the reference manual (PDF).
Finally, Set Analysis in QlikView is quite powerful, however it should be used sparingly as it can lead to some performance problems. In this case, as this is a fairly simple expression the performance should be reasonable.
Woa! a year later, but what you are loking for is osmething near this:
Go to the dimension sheet, then select the Category Dimension, and click on the Edit Dimesnion button
there you can use something like this:
= If(Match(Category, 'a', 'b', 'c'), Category, Null())
This will make the object display only a b and c Categories, and a line for the Null value.
What leasts is that you check the "Suppress value when null" option on the Dimension sheet.
c ya around
Just thought another solution to this which may still be useful to people looking for this.
How about creating a bookmark with the categories that you want and then setting the expressions to be evaluated in the context of that bookmark only?
(Will expand on this later, but take a look at how set analysis can be affected by a bookmark)

What is the difference between `::` and `.` in pig?

What is the difference between :: and . in pig?
When do I use one vs the other?
E.g., I know that :: is need in join when a field exists in both aliases:
A = foreach (join B by (x), C by (y)) generate B::y as b_y, C::y as c_y;
and I need . when accessing group fields:
A = foreach (group B by (x,y)) generate group.x as x, group.y as y, SUM(B?z) as z;
However, do I pass B::z or B.z to SUM above instead of B?z?
In Pig, :: is used as a disambiguation tool after operations which could possibly create naming collisions. Notably, this happens with JOIN, CROSS, and FLATTEN. Consider two relations, A:{(id:int, name:chararray)} and B:{(id:int, location:chararray)}. If you want to associate names with locations, naturally you would do:
C = JOIN A BY id, B BY id;
Without the disambiguation operator, your schema would be
C:{(id:int, name:chararray, id:int, location:chararray)}
Now you can't tell which field id refers to. To avoid this, Pig will instead do
C:{(A::id:int, A::name:chararray, B::id:int, B::location:chararray)}
Likewise, you could FLATTEN two bags whose tuples have fields with the same name, and they would also collide. So the same operator is used in this case as well. When there is no such conflict, you do not need to use the full name: name is unambiguous here. To simplify C, then, you can do this:
D = FOREACH C GENERATE A::id, name, location;
The . operator, by contrast, projects fields from bags and tuples. If you have a bag b with schema {(x:int, y:int, z:int)}, the projection b.y yields a bag with just the specified field: {(y:int)}. You can project multiple fields at once with parentheses: b.(y,z) yields {(y:int, z:int)}.
When used with tuples, the result is a tuple with just the specified fields. If the tuple t has schema (x:int, y:int, z:int), then t.x is the tuple (x:int) and t.(y,z) is the tuple (y:int, z:int).
To your specific question about SUM, note that SUM along with the other summary statistic UDFs, takes a bag as its argument. Therefore, you need to create a bag with just the one field per tuple that you want to sum. Using the projection operator, .: B.z.
IIRC you get :: as a side effect after some statements. You cannot bother about it, unless (as you mentioned) a name exists inside two different prefixes.
The . is different in that you are going inside the structure.
group.x as x, group.y as y is equivalent to FLATTEN(group)
SUM(B?z) - here you should do SUM(B.z), to specify that you need a particular field to SUM.