Understanding indices in Neo4j - why is resampling necessary?

Understanding indices in Neo4j - why is resampling necessary? - indexing

At https://neo4j.com/docs/operations-manual/3.3/performance/statistics-execution-plans/, we see the following table:
Controls the percentage of the index that has to have been updated
before a new sampling run is triggered.
What does "sampling" mean? Why do updates invalidate the index? I know indices from relational databases and in this case there is no need to maintain indices (adding/deleting row corresponds to adding/deleting node in some BTree).
Could someone why resampling of indices in Neo4j is necessary?

The index is always valid.
Periodic sampling generates statistics used by the Cypher execution planner, so that is can generate plans that are more optimal for the current state of the DB.
To quote from the operations manual (a little earlier than the table in your question):
When a Cypher query is issued, it gets compiled to an execution plan
that can run and answer the query. The Cypher query engine uses
available information about the database, such as schema information
about which indexes and constraints exist in the database. Neo4j also
uses statistical information about the database to optimize the
execution plan.

Related

Indexes in SQL vs ORDER BY clause

What is the difference between using Indexes in SQL vs Using the ORDER BY clause?
Since from what I understand , the Indexes arrange the specified column(s) in an ordered manner that helps the query engine in looking through the tables quickly (and hence prevents table scan).
My question - why can't the query engine simply use the ORDER BY for improving performance?
Thanks !

You put the tag as sql-server-2008 but the question has nothing to do with SQL server. This question will apply to all databases.
From wikipedia:
Indexing is a technique some storage engines use for improving
database performance. The many types of indexes share the common
property that they reduce the need to examine every entry when running
a query. In large databases, this can reduce query time/cost by orders
of magnitude. The simplest form of index is a sorted list of values
that can be searched using a binary search with an adjacent reference to the location of the entry, analogous to the index in the back of a book. The same data can have multiple indexes (an employee database could be indexed by last name and hire date).
From a related thread in StackExchange
In the SQL world, order is not an inherent property of a set of data.
Thus, you get no guarantees from your RDBMS that your data will come
back in a certain order -- or even in a consistent order -- unless you
query your data with an ORDER BY clause.
To answer why the indexes are necessary?
Note the bolded text about indexing regarding the reduction in the need to examine every entry. In the absence of an index when an ORDER BY is issued in SQL, every entry need to be examined which increases the number of entries.
ORDER BY is applied only when reading. A single column may be used in indexes in which case there could be several different kinds of ordering in sql query requests. It is not possible to define the indexes unless we understand how the query requests are made.
A lot of times indexes are added once new patterns of querying emerge so as to keep the query performant which mean index creation is driven by how you defined your ORDER BY in SQL.
Query engine which processes your SQL with/without ORDER BY, defines your execution plan and does not understand Storage of data. The Data retrieved from a query engine may be partly from memory if the data was in cache and partly/fully from disk. When reading from disk in the storage engine will uses the indexes to figure the quickly read data.
ORDER BY effects the performance of a query when reading. Index effects the performance of a query when doing all the Create, Read, Update and Delete operations.
A query engine may choose to use an index or totally ignore the index based on the data characteristics.

SQL Server missing index in execution plan

We have created a view which contain 50 joins and some correlated subqueries.
When I am trying to look at the execution plan, it is not recommended missing index.
Could you please let me know why SQL Server is not showing any missing index statements for the running statement?

Here is my understanding. SQL is declarative language.
It means you only need to specify what data you want and from where you want it.
The rest is the task of server to do.
SQL server is using CBO (cost based optimizer) engine to determine which access method should be used. If you use select * from tablename without where it will be not used any index.
Index while increasing performance on certain case, it will be hindrance.
SQL uses the statistics to determine which access method will be used. Index seek, index scan, cluster index seek, cluster index scan etc.
So to answer your question is probably because of :
1. the statistics is not updated
2. you use select without where
3. your database is fit in memory so no index will be used

give the db2 a hint which index to use?

moin-moin,
I have a join over some tables, and I want to give the db2-database a hint, which index i want her to use. I know, this may result in a slow query, but I have a production and a test database, and I want the same behaviour in both databases (even if in one db the amount of data is significantly different or what state the (index-)cache has.
Is this possible (and how)? I did not find anything in the online manual, which could mean, I had the wron searching criteria.
Thanks a million.

This is not something that is commonly done with DB2. However, you can use selectivity. It should still be around in present versions. Adding selectivity clauses to queries will affect the decisions made by the query optimizer.
Also what Gilbert Le Blanc noted above will work. You can UPDATE the syscat.tables colums and fool the DB2 to optimize the queries for non-existent amounts of data in the rows. Also the rest of your DB / DBM CFG should match (ie. the calculated disk and cpu speeds, memory usage related settings etc) because in some situations they might also matter to some degree.

You can influence the optimizer via a Profile:
http://www.ibm.com/developerworks/data/library/techarticle/dm-1202storedprocedure/index.html
It was recently asked here: Is it possible to replace NL join with HS join in sql
However, I haven't heard about the selectivity clause, and I think you should try first this option, before create a profile. But you should do this just after having tried other options. Follow the steps as indicated in the DeveloperWorks tutorial before influence the optimizer:
Experiment with different SQL optimization classes. The default optimization class is controlled by the DFT_QUERYOPT parameter in the database configuration file.
Attempt to resolve any performance problems by ensuring that proper database statistics have been collected. The more detailed the statistics, the better the optimizer can perform. (See RUNSTATS in the DB2 Command Reference).
If the poor access plan is the result of rapidly changing characteristics of the table (i.e. grows very quickly such that statistics get out of date quickly), try marking the table as VOLATILE using the ALTER TABLE command.
Try explaining the query using literal values instead of parameter markers in your predicates. If you are getting different access plans when using parameter markers, it will help you understand the nature of the performance problem better. You may find that using literals in your application will yield a better plan (and therefore better performance) at the cost of SQL compilation overhead.
Try using DB2’s index advisor (db2advis) to see if there are any useful indexes which you may have overlooked.

Are SQL Execution Plans based on Schema or Data or both?

I hope this question is not too obvious...I have already found lots of good information on interpreting execution plans but there is one question I haven't found the answer to.
Is the plan (and more specifically the relative CPU cost) based on the schema only, or also the actual data currently in the database?
I am try to do some analysis of where indexes are needed in my product's database, but am working with my own test system which does not have close to the amount of data a product in the field would have. I am seeing some odd things like the estimated CPU cost actually going slightly UP after adding an index, and am wondering if this is because my data set is so small.
I am using SQL Server 2005 and Management Studio to do the plans

It will be based on both Schema and Data. The Schema tells it what indexes are available, the Data tells it which is better.
The answer can change in small degrees depending on the DBMS you are using (you have not stated), but they all maintain statistics against indexes to know whether an index will help. If an index breaks 1000 rows into 900 distinct values, it is a good index to use. If an index only results in 3 different values for 1000 rows, it is not really selective so it is not very useful.

SQL Server is 100% cost-based optimizer. Other RDBMS optimizers are usually a mix of cost-based and rules-based, but SQL Server, for better or worse, is entirely cost driven. A rules based optimizer would be one that can say, for example, the order of the tables in the FROM clause determines the driving table in a join. There are no such rules in SQL Server. See SQL Statement Processing:
The SQL Server query optimizer is a
cost-based optimizer. Each possible
execution plan has an associated cost
in terms of the amount of computing
resources used. The query optimizer
must analyze the possible plans and
choose the one with the lowest
estimated cost. Some complex SELECT
statements have thousands of possible
execution plans. In these cases, the
query optimizer does not analyze all
possible combinations. Instead, it
uses complex algorithms to find an
execution plan that has a cost
reasonably close to the minimum
possible cost.
The SQL Server query optimizer does
not choose only the execution plan
with the lowest resource cost; it
chooses the plan that returns results
to the user with a reasonable cost in
resources and that returns the results
the fastest. For example, processing a
query in parallel typically uses more
resources than processing it serially,
but completes the query faster. The
SQL Server optimizer will use a
parallel execution plan to return
results if the load on the server will
not be adversely affected.
The query optimizer relies on
distribution statistics when it
estimates the resource costs of
different methods for extracting
information from a table or index.
Distribution statistics are kept for
columns and indexes. They indicate the
selectivity of the values in a
particular index or column. For
example, in a table representing cars,
many cars have the same manufacturer,
but each car has a unique vehicle
identification number (VIN). An index
on the VIN is more selective than an
index on the manufacturer. If the
index statistics are not current, the
query optimizer may not make the best
choice for the current state of the
table. For more information about
keeping index statistics current, see
Using Statistics to Improve Query
Performance.

Both schema and data.
It takes the statistics into account when building a query plan, using them to approximate the number of rows returned by each step in the query (as this can have an effect on the performance of different types of joins, etc).
A good example of this is the fact that it doesn't bother to use indexes on very small tables, as performing a table scan is faster in this situation.

I can't speak for all RDBMS systems, but Postgres specifically uses estimated table sizes as part of its efforts to construct query plans. As an example, if a table has two rows, it may choose a sequential table scan for the portion of the JOIN that uses that table, whereas if it has 10000+ rows, it may choose to use an index or hash scan (if either of those are available.) Incidentally, it used to be possible to trigger poor query plans in Postgres by joining VIEWs instead of actual tables, since there were no estimated sizes for VIEWs.
Part of how Postgres constructs its query plans depend on tunable parameters in its configuration file. More information on how Postgres constructs its query plans can be found on the Postgres website.

For SQL Server, there are many factors that contribute to the final execution plan. On a basic level, Statistics play a very large role but they are based on the data but not always all of the data. Statistics are also not always up to date. When creating or rebuilding an Index, the statistics should be based on a FULL / 100% sample of the data. However, the sample rate for automatic statistics refreshing is much lower than 100% so it is possible to sample a range that is in fact not representative of much of the data. Estimated number of rows for the operation also plays a role which can be based on the number of rows in the table or the statistics on a filtered operation. So out-of-date (or incomplete) Statistics can lead the optimizer to choose a less-than-optimal plan just as a few rows in a table can cause it to ignore indexes entirely (which can be more efficient).
As mentioned in another answer, the more unique (i.e. Selective) the data is the more useful the index will be. But keep in mind that the only guaranteed column to have statistics is the leading (or "left-most" or "first") column of the Index. SQL Server can, and does, collect statistics for other columns, even some not in any Indexes, but only if AutoCreateStatistics DB option is set (and it is by default).
Also, the existence of Foreign Keys can help the optimizer when those fields are in a query.
But one area not considered in the question is that of the Query itself. A query, slightly changed but still returning the same results, can have a radically different Execution Plan. It is also possible to invalidate the use of an Index by using:
LIKE '%' + field
or wrapping the field in a function, such as:
WHERE DATEADD(DAY, -1, field) < GETDATE()
Now, keep in mind that read operations are (ideally) faster with Indexes but DML operations (INSERT, UPDATE, and DELETE) are slower (taking more CPU and Disk I/O) as the Indexes need to be maintained.
Lastly, the "estimated" CPU, etc. values for cost are not always to be relied upon. A better test is to do:
SET STATISTICS IO ON
run query
SET STATISTICS IO OFF
and focus on "logical reads". If you reduce Logical Reads then you should be improving performance.
You will, in the end, need a set of data that comes somewhat close to what you have in Production in order to performance tune with regards to both Indexes and the Queries themselves.

Oracle specifics:
The stated cost is actually an estimated execution time, but it is given in a somewhat arcane unit of measure that has to do with estimated time for block reads. It's important to realize that the calculated cost doesn't say much about the runtime anyway, unless each and every estimate made by the optimizer was 100% perfect (which is never the case).
The optimizer uses the schema for a lot of things when deciding what transformations/heuristics can be applied to the query. Some examples of schema things that matter a lot when evaluating xplans:
Foreign key constraints (can be used for table elimiation)
Partitioning (exclude entire ranges of data)
Unique constraints (index unique vs range scans for example)
Not null constraints (anti-joins are not available with not in() on nullable columns
Data types (type conversions, specialized date arithmetics)
Materialized views (for rewriting a query against an aggregate)
Dimension Hierarchies (to determine functional dependencies)
Check constraints (the constraint is injected if it lowers cost)
Index types (b-tree(?), bitmap, joined, function based)
Column order in index (a = 1 on {a,b} = range scan, {b,a} = skip scan or FFS)
The core of the estimates comes from using the statistics gathered on actual data (or cooked). Statistics are gathered for tables, columns, indexes, partitions and probably something else too.
The following information is gathered:
Nr of rows in table/partition
Average row/col length (important for costing full scans, hash joins, sorts, temp tables)
Number of nulls in col (is_president = 'Y' is pretty much unique)
Distinct values in col (last_name is not very unique)
Min/max value in col (helps unbounded range conditions like date > x)
...to help estimate the nr of expected rows/bytes returned when filtering data. This information is used to determine what access paths and join mechanisms are available and suitable given the actual values from the SQL query compared to the statistics.
On top of all that, there is also the physical row order which affects how "good" or attractive an index become vs a full table scan. For indexes this is called "clustering factor" and is a measure of how much the row order matches the order of the index entries.

What is the Big-O for SQL select?

What is the Big-O for SQL select, for a table with n rows and for which I want to return m result?
And What is the Big-O for an Update, or delete, or Create operation?
I am talking about mysql and sqlite in general.

As you don't control the algorithm selected, there is no way to know directly. However, without indexes a SELECT should be O(n) (a table scan has to inspect every record which means it will scale with the size of the table).
With an index a SELECT is probably O(log(n)) (although it would depend on the algorithm used for indexing and the properties of the data itself if that holds true for any real table). To determine your results for any table or query you have to resort to profiling real world data to be sure.
INSERT without indexes should be very quick (close to O(1)) while UPDATE needs to find the records first and so will be slower (slightly) than the SELECT that gets you there.
INSERT with indexes will probably again be in the ballpark of O(log(n)^2) when the index tree needs to be rebalanced, closer to O(log(n)) otherwise. The same slowdown will occur with an UPDATE if it affects indexed rows, on top of the SELECT costs.
Edit: O(log(n^2)) = O(2log(n)) = O(log(n)) did you mean O(log(n)^2)?
All bets are off once you are talking about JOIN in the mix: you will have to profile and use your databases query estimation tools to get a read on it. Also note that if this query is performance critical you should reprofile from time to time as the algorithms used by your query optimizer will change as the data load changes.
Another thing to keep in mind... big-O doesn't tell you about fixed costs for each transaction. For smaller tables these are probably higher than the actual work costs. As an example: the setup, tear down and communication costs of a cross network query for a single row will surely be more than the lookup of an indexed record in a small table.
Because of this I found that being able to bundle a group of related queries in one batch can have vastly more impact on performance than any optimization I did to the database proper.

I think the real answer can only be determined on a case by case basis (database engine, table design, indices, etc.).
However, if you are a MS SQL Server user, you can familiarize yourself with the Estimated Execution Plan in Query Analyzer (2000) or Management Studio (2005+). That gives you a lot of information you can use for analysis.

All depends on how (well) you write your SQL and how well your database is designed for the operation you are performing. Try to use the explain plan function to see how things will be executed by the db. The. You can calculate the big-O

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas