What is the correct way to count number of nodes using Cypher? - cypher

For both of these queries, I get the same result.
Query 1:
MATCH (e:Episode)
RETURN COUNT(e);
Query 2:
MATCH (e:Episode)
WITH COUNT(e) AS count
RETURN count;
What would be the correct way to count the number of nodes?

There's no functional difference for such a simple query. Go with the first option; it is shorter and expresses very clearly what you want.
If you run both of your queries with EXPLAIN or PROFILE, you will see that the execution plans are identical.

Related

Filter on Count Aggregation

I have been looking for a solution on the internet for quite a while, and I'm still not sure whether it is possible in Kibana or not.
Suppose I apply a filter on a term and it gives me the count of the respective terms, but I want the results to show only those terms where the count equals a specific value.
Being more specific,
I want to find out which tills are the busiest (have the most transactions). Currently, when I apply a filter on term and count, it shows me all the tills with their respective transaction counts. What I want is to show only those tills where the count equals, say, 10.
In other words, a functionality similar to the HAVING clause in a relational DBMS.
I have found a lot of workarounds for the same use case, but I'm looking for a proper solution.
I hope I understand what you're asking. I think you can search the field in question with the proper parameters. For example, to match the field 'field_name' with a count of 10 or more, try the following Lucene query:
field_name:(*) AND count:[10 TO *]
For the exact result of field_name with count = 10, query:
field_name:(*) AND count:10
Let me know if this was what you were looking for!
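The Lucene queries above assume the documents themselves carry a numeric count field. If the goal is a HAVING-like filter on the bucket counts of an aggregation, the Elasticsearch query DSL offers the bucket_selector pipeline aggregation. Here is a minimal sketch in Python that builds such a request body; the field name "till_id" and the target count are illustrative placeholders, not names from the question:

```python
import json

# HAVING-like filter in the Elasticsearch query DSL: keep only those
# terms buckets whose document count equals the given value. The field
# name passed in ("till_id" below) is a hypothetical example.
def having_count_query(field, count):
    return {
        "size": 0,  # we only care about the aggregation, not the hits
        "aggs": {
            "per_term": {
                "terms": {"field": field},
                "aggs": {
                    "only_count_eq": {
                        "bucket_selector": {
                            "buckets_path": {"c": "_count"},
                            "script": f"params.c == {count}",
                        }
                    }
                },
            }
        },
    }

body = having_count_query("till_id", 10)
print(json.dumps(body, indent=2))
```

The resulting body would be sent to the search endpoint of the relevant index; buckets whose doc_count differs from 10 are dropped server-side, which is the closest analogue to SQL's HAVING.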

Passing reduced ES query results to SQL

This is a follow-up question to How to pass ElasticSearch query to hadoop.
Basically, I want to do a full-text-search in ElasticSearch and then pass the result set to SQL to run an aggregation query. Here's an example:
Let's say we search "Terminator" in a financials database that has 10B records. It has the following matches:
"Terminator" (1M results)
"Terminator 2" (10M results)
"XJ4-227" (1 result ==> Here "Terminator" is in the synopsis of the title)
Instead of passing back the 10+M ids, we'd pass back the following 'reduced query' --
...WHERE name in ('Terminator', 'Terminator 2', 'XJ4-227')
How could we write such an algorithm to reduce the ES result set to the smallest possible filter query that we could send back to SQL? Does ES have any sort of match metadata that would help us with this?
If you know which "not analyzed" (keyword in 5.x) field would be suitable for your use case, you could get its distinct values and their match counts with a terms aggregation. sum_other_doc_count even tells you whether your search produced too many distinct values, as only the top N are returned.
Naturally, you could run a terms aggregation on multiple fields and use the one with the fewest distinct values in your SQL. It could actually be more efficient to run a cardinality aggregation first to decide which field to run the terms aggregation on.
If your search is a pure filter, its result should be cached, but please benchmark both solutions, as your ES cluster holds quite a lot of data.
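The selection step described above (run terms aggregations on candidate fields, discard any field whose buckets were truncated, take the one with the fewest distinct values, and emit a WHERE ... IN filter) can be sketched in Python. The aggregation responses below are mocked; in practice they would come from the Elasticsearch client, and a non-zero sum_other_doc_count signals that the top-N buckets did not cover all distinct values:

```python
# Given terms-aggregation results keyed by field name, pick the field
# with the fewest distinct matching values whose buckets are complete,
# and build the reduced SQL filter clause. Field names and bucket data
# are made-up examples, not real ES output.
def reduced_sql_filter(agg_results):
    usable = {
        field: [b["key"] for b in agg["buckets"]]
        for field, agg in agg_results.items()
        if agg["sum_other_doc_count"] == 0  # all distinct values returned
    }
    if not usable:
        return None  # every candidate field had too many distinct values
    field, values = min(usable.items(), key=lambda kv: len(kv[1]))
    quoted = ", ".join("'{}'".format(v.replace("'", "''")) for v in values)
    return "WHERE {} IN ({})".format(field, quoted)

mock = {
    "name": {
        "sum_other_doc_count": 0,
        "buckets": [{"key": "Terminator"}, {"key": "Terminator 2"},
                    {"key": "XJ4-227"}],
    },
    "synopsis_id": {
        "sum_other_doc_count": 42,  # truncated: top N missed some values
        "buckets": [{"key": "a"}, {"key": "b"}],
    },
}
print(reduced_sql_filter(mock))
# WHERE name IN ('Terminator', 'Terminator 2', 'XJ4-227')
```

For a production version you would parameterize the SQL rather than interpolate strings, but the reduction logic is the same.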

When does Google BigQuery's TOP function return approximate results?

I have a table and I want to return the most frequent value of a certain column. Usually, one would do that using the classic GROUP BY ... ORDER BY ... LIMIT. I stumbled upon BigQuery's TOP function and got interested in it, since the documentation states that it is generally faster. However, the documentation also says that it "may only return approximate results". When does this happen, and is the TOP function generally worth using when one needs accurate results?
Full description from the documentation:
TOP is a function that is an alternative to the GROUP BY clause. It is used as simplified syntax for GROUP BY ... ORDER BY ... LIMIT .... Generally, the TOP function performs faster than the full ... GROUP BY ... ORDER BY ... LIMIT ... query, but may only return approximate results.
The following might fit better as a comment, but it's too lengthy, so I'm putting it into an answer.
So far, in my experience, it is just a nice simplified alternative to GROUP BY, and it is applicable only in simple scenarios: a query that uses the TOP() function can return only two fields, the TOP field and the COUNT(*) value.
That said, I don't see any discrepancy in counts, while I do see that it runs faster.
In a comparison I ran against a table with 2.5B rows, the counts were exactly the same and the TOP version's run time was 15% faster.
At the same time, if you run similar queries and check the Query Plan Explanation, you will see totally different execution patterns that might lead to different results, but I was not able to catch such a use case.

Filtered and excluded results in same query

I am working on a use case where I need to filter a queryset, perform a function on the results, and then perform the same function on the rest of the queryset.
# Example (since everybody is having quizzes these days)
query_set = Quiz.objects.filter()
todays_quizzes = query_set.filter(created__startswith=today)
not_todays_quizzes = query_set.exclude(created__startswith=today)
perform_function(todays_quizzes)
perform_function(not_todays_quizzes)
What I would like to know is: is there a better way to get not_todays_quizzes than performing almost the same query again with the opposite condition? That is, can I get both result sets in one query? Is it even possible at the SQL level?
Thanks for reading!
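One option, if perform_function can accept plain lists rather than querysets, is to fetch the queryset once and partition it in a single pass in Python, so the database is hit only once. A minimal sketch, with plain dicts standing in for Quiz instances and an assumed "YYYY-MM-DD ..." string format for the created field:

```python
# Partition one result set into "today" and "not today" in a single
# pass, instead of issuing two opposite-filtered queries. The dicts
# below stand in for Quiz model instances; with Django you would
# iterate the evaluated queryset once in the same way.
def partition_by_day(quizzes, today):
    todays, others = [], []
    for quiz in quizzes:
        bucket = todays if quiz["created"].startswith(today) else others
        bucket.append(quiz)
    return todays, others

quizzes = [
    {"title": "capitals", "created": "2017-03-14 09:00"},
    {"title": "flags", "created": "2017-03-13 17:30"},
]
todays, others = partition_by_day(quizzes, "2017-03-14")
print([q["title"] for q in todays])   # ['capitals']
print([q["title"] for q in others])  # ['flags']
```

The trade-off: this loads the whole result set into memory, which is fine for modest tables but not if each filtered half would otherwise be processed lazily at the database level.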

Which ActiveRecord query is faster?

These three queries return the same result. How can I find out which one is faster?
IssueStatus.where 'issue_statuses.name != ?', IssueStatus::CLOSED
IssueStatus.where({name: ["Open", "Ice Box", "Submitted for merge", "In Progress"]})
IssueStatus.where.not(name: "Closed")
There is no single answer: it depends on whether you have indexes on the field and on the number of records. You can append .explain to the end of the query to get the query plan for it.
puts IssueStatus.where.not(name: "Closed").explain
That will help you understand, at the database level, which one is faster. From a database point of view, the first and the third query are actually the same.
The third chains one more method call, so it involves some additional object allocation at the Ruby level (not to mention that "Closed" causes the creation of a new string, whereas using IssueStatus::CLOSED does not).
At first glance, I would probably suggest using the first version. But as I said, the query plan will give you more details about the query execution.