Can I see exactly what a Cypher query will do in Memgraph when it is executed? - cypher

I'm learning Cypher query language. I have the following query:
MATCH path = (:Disease {name: 'influenza'})-[:PRESENTS_DpS]->(:Symptom)<-[:PRESENTS_DpS]-(:Disease {name: 'asthma'})
RETURN path
I want a better understanding of what this query does, e.g. in which order the commands are executed, and whether there is an order of precedence like in mathematics (priority of operators, etc.).
I use Memgraph Lab for querying.

If you are asking whether there is a way for Memgraph Lab to explain to you in plain "human spoken language" what is going on, the answer is no.
What you can do is use the EXPLAIN Cypher clause. Simply prefix your query with it so that it looks like this:
EXPLAIN MATCH path = (:Disease {name: 'influenza'})-[:PRESENTS_DpS]->(:Symptom)<-[:PRESENTS_DpS]-(:Disease {name: 'asthma'})
RETURN path
Before a Cypher query is executed, it is converted into an internal form suitable for execution, known as a plan. A plan is a tree-like data structure describing a pipeline of operations which will be performed on the database in order to yield the results for a given query. Every node within a plan is known as a logical operator and describes a particular operation.
Because a plan represents a pipeline, the logical operators are iteratively executed as data passes from one logical operator to the other. Every logical operator pulls data from the logical operator(s) preceding it, processes it and passes it onto the logical operator next in the pipeline for further processing.
Using the EXPLAIN operator, it is possible for the user to inspect the produced plan and gain insight into the execution of a query.
The output of the EXPLAIN query is a representation of the produced plan. Every logical operator within the plan starts with an asterisk character (*) and is followed by its name (and sometimes additional information). The execution of the query proceeds iteratively (generating one entry of the result set at a time), with data flowing from the bottom-most logical operator(s) (the start of the pipeline) to the top-most logical operator(s) (the end of the pipeline).
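For illustration, the plan for a trivial query looks roughly like this (the plan for your path query will contain more operators, such as Expand, Filter and ConstructNamedPath, but it reads the same way):
EXPLAIN MATCH (n) RETURN n;
 * Produce {n}
 * ScanAll (n)
 * Once
Reading from the bottom up: Once starts the pipeline, ScanAll iterates over every node, and Produce emits one result row at a time.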
For more details take a look at the Inspecting queries documentation.

Related

In which order does SQL Server apply filters - top down or bottom up (first to last or last to first)?

When I write code I like to make sure I'm optimizing performance. I would assume that this includes ordering the filters to have the heavy reducers (filter out lots of rows) at the top and the lighter reducers (filter out a few rows) at the bottom.
But when I have errors in my filters I have noticed that SQL Server first catches the errors in the filters at the bottom and then catches the errors in the filters at the top. Does this mean that SQL Server processes filters from the bottom up?
For example (for clarity I'm putting the filters - with intentional typos - in the WHERE clause rather than the JOIN clause):
select
l.Loan_Number
,l.Owner_First_Name
,l.Owner_Last_Name
,l.Street
,l.City
,l.State
,p.Balance
,p.Delinquency_Bucket
,p.Next_Due_Date
from
Location l
join Payments p on l.Account_Number = p.Account_Number
where
l.OOOOOwner_Last_Name = 'Kostoryz' -- I assume this would reduce the most, so I put it first
and p.DDDDelinquency = '90+' -- I assume this would reduce second most, so I put it second
and l.SSSState <> 'WY' -- I assume this would reduce the least, so I put it last
Yet the first error SQL Server would return would be ERROR - THERE IS NO COLUMN SSSState IN Location TABLE
The next error it would return would be ERROR - THERE IS NO COLUMN DDDDelinquency IN Payments TABLE
Does this mean that the State filter would be applied before the Delinquency filter and the Delinquency filter would be applied before the Last_Name filter?
There are roughly three stages that happen between the moment a query is received in text form by the DBMS and the moment you get its result:
1. The text is transformed into some internal format that the DBMS can work with more easily.
2. From the internal format the DBMS tries to compute an optimal way of actual execution; you can think of it as a little program that is developed there.
3. That program is actually executed and the result is written somewhere (in memory) you can fetch it from.
(These stages can possibly be divided into even smaller substages, but that level of detail isn't needed here, I guess.)
Now with that in mind, note that the errors you mention are emitted in stage 1, when the DBMS tries to bind actual objects in the database and cannot find them. The query is far from execution at that point, and the order in which binding is done has nothing to do with the order in which the filters are actually applied later. After that comes stage 2: in order to find an optimal way of execution, the DBMS can and will reorder things (not necessarily only filters). So it usually doesn't matter how you ordered the filters, or how the binding went. The DBMS will look at them and decide which one is better applied earlier and which one may wait until later.
Keep in mind that SQL is a declarative language. Rather than telling the machine what to do -- as we typically would when writing programs in imperative languages -- we describe what result we want and let the machine figure out how to calculate it, in the best possible way or at least a good one.
(Of course, that optimization may not always work 100%. Sometimes there are tricks in queries that help the DBMS find a better solution. But with a query of the kind you posted, any DBMS should cope pretty well with finding a good order in which to apply the filters, no matter how you ordered them.)
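As a quick illustration (a hypothetical sketch reusing the tables from the question, with the typos corrected), these two queries describe the same result set, and the optimizer is free to evaluate the predicates in whichever order its cost estimates favor:
-- Predicates listed "heaviest reducer first"...
select l.Loan_Number
from Location l
join Payments p on l.Account_Number = p.Account_Number
where l.Owner_Last_Name = 'Kostoryz'
  and p.Delinquency_Bucket = '90+'
  and l.State <> 'WY'

-- ...and listed in the reverse order; a modern optimizer should
-- produce the same execution plan for both.
select l.Loan_Number
from Location l
join Payments p on l.Account_Number = p.Account_Number
where l.State <> 'WY'
  and p.Delinquency_Bucket = '90+'
  and l.Owner_Last_Name = 'Kostoryz'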
Before SQL Server attempts to run the query, it creates a Query Execution Plan (QEP). The errors you are seeing are happening while the QEP is being built. You cannot infer any information about the sequence of "filters" based on the order you get these errors.
Once you have provided a valid query, SQL Server will build a QEP and that will govern the operations it uses to satisfy the query. The QEP will be based on many factors including what indexes and statistics are available on the table - though not usually the order that you specify conditions in the WHERE clause. There are ways to do this, but it is usually not recommended.
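One example of such an override - generally discouraged - is a query hint. OPTION (FORCE ORDER), for instance, pins the join order to the order in which the tables appear in the query text:
-- Generally discouraged: FORCE ORDER makes the optimizer keep the
-- join order exactly as written instead of choosing its own.
select l.Loan_Number, p.Balance
from Location l
join Payments p on l.Account_Number = p.Account_Number
option (force order)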
In short, no. The order of the filters doesn't matter.
At a high level, the query goes through multiple stages before execution. The stages are:
Parsing & Normalization (where the syntax is checked and tables are validated)
Compilation & Optimization (Where the code is compiled and optimized for execution)
In the Optimization stage, the table and index statistics are checked to arrive at the optimal execution plan for the query. The filters are evaluated in an order chosen from those statistics, so the order of the filters in the query DOESN'T MATTER; the column statistics DO MATTER.
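If you want to see the statistics that feed these decisions, SQL Server can display them directly; a sketch, assuming a hypothetical index name on the Location table from the question:
-- Shows the density and histogram information the optimizer uses
-- to estimate how selective each predicate is.
DBCC SHOW_STATISTICS ('dbo.Location', 'IX_Location_Owner_Last_Name')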
Read more on Stages of query execution

Does Oracle choose a default execution plan when parsing a prepared statement?

According to this Oracle documentation, I can assume that the Optimizer postpones the hard parse and it doesn't generate an execution plan until the first time a prepared statement is executed:
"The answer is a phenomenon called bind peeking. Earlier, when you ran that query with the bind variable value set to 'NY', the optimizer had to do a hard parse for the first time and while doing so it peeked at the bind variable to see what value had been assigned to it."
But when executing an EXPLAIN PLAN for a prepared statement with bind parameters, we get an executed plan. On his site, Markus Winand says that:
"When using bind parameters, the optimizer has no concrete values available to determine their frequency. It then just assumes an equal distribution and always gets the same row count estimates and cost values. In the end, it will always select the same execution plan."
Which one is true? Is an execution plan generated when the statement is prepared, using an even-distribution value model, or is the hard parse postponed until the first execution?
This discussion misses a very important point about bind variables, parsing and bind peeking, and that is histograms! Bind variables only become an issue when the column in question has histograms. Without histograms there is no need to peek at the value: Oracle then has no information about the distribution of the data, and will only use pure math (distinct values, number of null values, number of rows, etc.) to find the selectivity of the filter in question.
Binds and histograms are logical opposites. You use bind variables to get one execution plan for all your queries; you use histograms to get different execution plans for different search values. Bind peeking tried to overcome this conflict, but it does not do a very good job of it - many people have actually characterized the bind peeking feature as "a bug". Adaptive Cursor Sharing, which came around in Oracle 11g, does a better job of solving this.
Actually, I see too many histograms around. I usually disable histograms (method_opt => 'FOR ALL COLUMNS SIZE 1') and only create them when I truly need them.
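For reference, gathering statistics without histograms looks roughly like this (the schema and table names are placeholders):
-- SIZE 1 means one bucket per column, i.e. no histogram is built.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'APP_OWNER',
    tabname    => 'ORDERS',
    method_opt => 'FOR ALL COLUMNS SIZE 1');
END;
/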
And then to the original question: "Does Oracle choose a default execution plan when parsing a prepared statement?"
Parsing is not one activity. Parsing involves syntax checking, semantic analysis (do the tables and columns exist, do you have access to the tables), query rewrite (Oracle might rewrite the query in a better way - for instance, if we use the filters a=b and b=c, then Oracle can add the filter a=c), and of course finding an execution plan. We actually distinguish between different types of parsing - soft parse and hard parse. Hard parsing is where Oracle also has to create the execution plan for the query. This is a very costly activity.
Back to the question. The parsing doesn't really care whether you are using bind variables or not. The difference is that if you use binds, you probably only have to do a soft parse. Using bind variables, your query will look the same every time you run it (therefore getting the same hash_value). When you run a query, Oracle will check in the library cache to see if there already is an execution plan for your query. This is not a default plan, but a plan that already exists because someone else has executed the same query (and made Oracle do a hard parse, generating an execution plan for it) and the plan hasn't aged out of the cache yet. It's just the plan the optimizer at parse time considered the best choice for your query.
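You can look at those cached cursors yourself; a small sketch (the LIKE pattern is just an example):
-- Each cached cursor carries the plan chosen at hard-parse time;
-- re-running the identical text reuses it via a soft parse.
SELECT sql_id, child_number, plan_hash_value, executions
FROM   v$sql
WHERE  sql_text LIKE 'select object_name%'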
When you come to Oracle 12c it actually gets even more complicated. In 12c Oracle has Adaptive Execution Plans - this means that the execution plan has an alternative. It can start out with a nested loop, and if it realizes that it got the cardinality estimates wrong, it can switch to a hash join in the middle of the execution of the query. It also has something called adaptive statistics and SQL plan directives, all to make the optimizer and Oracle make better choices when running your SQL :-)
The first bind peek actually happens at the first execution. The plan optimization is deferred; it doesn't happen at the prepare phase. Later on, another bind peek might happen: typically for VARCHAR2, when you bind two radically different values (e.g. a first value 1 byte long and a later one 10 bytes long), the optimizer peeks again and might produce a new plan. In Oracle 12 this is extended even further with adaptive join methods: the optimizer may suggest NESTED LOOPs, but when it is actually executed and many more rows arrive than estimated, it switches to a HASH join immediately. This is unlike adaptive cursor sharing, where you first need to make a mistake in order to produce a new execution plan.
Also, one very important thing about prepared statements: since they just re-execute the same cursor created at the first execution, they will always run the same plan; there cannot be any adaptation. For adaptation and alternative execution plans, at least a SOFT parse must occur - for example, when the plan has aged out of the shared pool or been invalidated for any reason.
EXPLAIN PLAN is not a cursor, so it will never respect bind variables. Only DBMS_XPLAN.DISPLAY_CURSOR shows you the plan of an actual cursor, where you can see bind variable information.
You can find actual information about captured bind values in V$SQL_BIND_CAPTURE.
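For example (fill in the SQL_ID of your statement):
-- Lists the bind values Oracle captured for a given statement.
SELECT name, position, datatype_string, value_string
FROM   v$sql_bind_capture
WHERE  sql_id = '&sql_id'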
According to Tom Kyte bind peeking takes place at the hard-parse stage, which chimes with the first quote in your post. In 11g the optimizer is even able to come up with different plans for different bind ranges, which directly contradicts the second quote (although to be fair he is talking about bind variables and not peeking specifically).
The query in the application uses bind values that drive it to one plan or the other consistently. It is only when the plan flip-flops between two radically different execution paths, and for some segment of users, that you have a really bad plan. In such cases, Oracle Database 11g might be the right answer for you, because it accommodates multiple plans.
In general, Oracle behavior starting from 11g is best described by adaptive cursor sharing (see http://docs.oracle.com/database/121/TGSQL/tgsql_cursor.htm#BGBJGDJE)
For JDBC (Thin Driver) specifically: When using PreparedStatements, no plan is generated before the execution step.
See the following example:
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import oracle.jdbc.OracleConnection;

// Tag the session so its statements are easy to find in V$SQL.
String[] metrics = new String[OracleConnection.END_TO_END_STATE_INDEX_MAX];
metrics[OracleConnection.END_TO_END_MODULE_INDEX] = "adaptiveCSTest";
((OracleConnection) conn).setEndToEndMetrics(metrics, (short) 0);

String getObjectNames = "select object_name from key.objects where object_type=?";
PreparedStatement objectNamesStmt = conn.prepareStatement(getObjectNames);
// Module set, but the statement is not parsed yet.
objectNamesStmt.setString(1, "CLUSTER");
// Still not parsed: binding a value does not trigger a parse.
ResultSet rset1 = objectNamesStmt.executeQuery();
// Only now is the statement parsed (and executed) on the server.
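A way to verify this from a second session (a sketch; the module name is the one set via setEndToEndMetrics above): the statement only shows up in V$SQL after executeQuery() has run, which confirms that no plan was generated at prepare time.
-- Run in a separate session; empty before the first executeQuery().
SELECT sql_text, module, executions
FROM   v$sql
WHERE  module = 'adaptiveCSTest'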

neo4j query response times

Testing query response times returns interesting results:
1. When executing the same query several times in a row, at first the response times improve up to a certain point; after that, each execution gets a little slower or jumps around inconsistently.
2. Running the same query sometimes with USING INDEX and sometimes without it yields almost the same response-time range (as described in point 1), although the profile gets better (fewer db hits when USING INDEX is used).
3. Dropping the index and re-running the query yields the same profile as executing the query while the index exists but without the USING INDEX hint.
Is there an explanation for the above results?
What is the best way to know whether the query has improved when the db hits are getting better but the response times aren't?
The best way to understand how a query executes is probably to use the PROFILE command, which will actually explain how the database goes about executing the query. This should give you feedback on what Cypher does with USING INDEX hints. You can also compare different formulations of the same query to see which results in fewer db hits.
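For example (the labels, relationship type and property here are hypothetical), prefixing a query with PROFILE runs it and annotates every operator in the plan with the rows it produced and the db hits it incurred:
PROFILE MATCH (p:Person)-[:KNOWS]->(f:Person {name: 'Alice'})
RETURN p.name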
There probably is no comprehensive answer to why the query takes a variable amount of time in various situations. You haven't provided your model, your data, or your queries. It's dependent on a whole host of factors outside of just your query, for example your data model, your cache settings, whether or not the JVM decides to garbage collect at certain points, how full your heap is, what kind of indexes you have (whether or not you use USING INDEX hints) -- and those are only the factors at the neo4j/java level. At the OS level there are many other possibilities/contingencies that make precise performance measurement difficult.
In general, when I'm concerned about these things, I find it's good to gather a large data sample (run the query 10,000 times) and then take an average. All of the factors that are outside of your control tend to average out in a sample like that; but if you're looking for a concrete prediction of exactly how long the next query will take, down to the millisecond, that may not be realistically possible.

Optimize query to fetch tuples from index directly

I want to optimize a large SQL query that is around 500 lines long and a little slow; it takes 1 to 5 seconds to execute in an interactive system.
I saw this munin graph (of scans), which is not the same as this second graph (of tuple access).
What I understand from the first graph (showing scans) is that the indexes are being used in WHERE or ORDER BY clauses to search for tuples matching some condition (a boolean expression).
As for the second graph, I'm not really sure what it means by "tuple access".
Question 1: What is the meaning of "tuple access"?
So I'm thinking I could take a step forward in optimization if I rewrote some parts of this big query to fetch more tuples via the indexes and fewer sequentially, using the information in the second graph.
Question 2: Am I correct? Would it be better if the second graph showed more tuples fetched by index and fewer read sequentially?
Question 3: If so, could you provide a SQL example in which the tuples are index-fetched, as opposed to one in which they are read sequentially?
Note: in these questions I'm only referring to the second graph.
In general, trying to optimize graphs like this is a mistake unless you have a specific performance problem. It is not in fact always better to retrieve tuples from the indexes. These are very complex decisions which depend on the specifics of the table, the access pattern, the sort of data you are retrieving, and more.
The fact is that a query plan that works for one quantity of data may not work as well for another.
If you have a lot of small tables, for example, sequential scans will always beat index scans.
So what you want to do is start by finding the slow queries, run them under EXPLAIN ANALYZE, and look for opportunities to add appropriate indexes. You can't do this without looking at the query plan and the actual query, which is why you always want to look at both.
In other words, your graph just gives you a sense of access patterns. It does not give you enough information to do any sort of real performance optimizations.
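To answer Question 3 with a rough sketch (the table and index names are made up for illustration; this assumes PostgreSQL, which munin graphs of this kind usually come from):
-- Without a suitable index, every tuple is read sequentially:
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- plan shows: Seq Scan on orders ...

CREATE INDEX orders_customer_id_idx ON orders (customer_id);

-- With the index (and a selective enough predicate), matching
-- tuples are fetched through the index instead:
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- plan shows: Index Scan using orders_customer_id_idx on orders ...
Whether the planner actually picks the index scan depends on table size and statistics, which is exactly why you need to check the plan rather than assume.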

Indexing does not work with the LIKE and NOT operators in SQL Server - is it a myth?

I have seen many articles on SQL Server stating that when writing a query we should avoid using the NOT and LIKE operators, because indexes are not applied to them.
eg.
SELECT [CityId]
,[CityName]
,[StateId]
FROM [LIMS].[dbo].[City]
WHERE CityId NOT IN(1, 2)
I executed the above query and found that an index was being used to filter the records.
The execution plan clearly shows a Clustered Index Seek. This contradicts what I used to think and read.
Was my previous understanding incorrect?
That indexing does not work with LIKE and NOT operators is just a rule of thumb. SQL Server (or any competent RDBMS, for that matter) will use the best algorithm it can in most cases. So, if you could manually seek an index, so could SQL Server.
In the particular example you provided, it is unclear whether a seek is any more efficient than a scan because most of the records are going to be returned anyhow. So, I wouldn't read much into that particular execution plan.
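As a rough illustration of when LIKE can and cannot use an index (assuming a hypothetical nonclustered index on CityName):
-- A prefix pattern is sargable: all matches fall in one contiguous
-- key range, so the optimizer can seek the index.
SELECT CityId, CityName FROM dbo.City WHERE CityName LIKE 'San%'

-- A leading wildcard defeats the seek: every key must be examined,
-- so you typically get an index or table scan instead.
SELECT CityId, CityName FROM dbo.City WHERE CityName LIKE '%ville'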
Bottom line: Learn and understand how your database system internally organizes data and indices so you won't have to rely on rules of thumb.