Infinispan cache statistics: cache_size not working for clustered caches?

We noticed that while for local caches we can get the actual cache_size (number of entries in the cache) statistic, for clustered caches (we tried replicated-cache and invalidated-cache) the value is always -1. Is this a known limitation, or is there a way to get these statistics to work?
Note that statistics are of course enabled for all these caches.
We can accept that the data wouldn't be accurate, but even approximate values (that aren't way off) would still be useful.

Related

Sql Azure stopped using index

Weirdest few hours using SQL Azure. We dropped the database from 50DTU to 20DTU and our CPU went through the roof. Turns out that one of our main indexes was simply no longer being used.
The index still existed. It had 27% fragmentation, which I understand is not super-terrible and anyway shouldn't stop SQL from using it. So, these are the things I tried (in order):
Reorganized index - nothing.
Rebuilt index - nothing.
Dropped and recreated - nothing.
Cleared proc cache (using ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE) - nothing.
Dropped and recreated with a different name - worked.
I didn't take a screenshot of the execution plan when it was failing, but it was basically exactly the same as the execution plan of the final working index, except that it did not include the usage of the index, i.e. it was just doing the clustered index scan.
The final working query was:
CREATE NONCLUSTERED INDEX [IX_SyncDetail_SyncStatusID_EntityTypeID_ApiConnectionID] ON [client_2].[SyncDetail]
(
[SyncStatusID] ASC,
[EntityTypeID] ASC,
[ApiConnectionID] ASC
) WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
GO
This is exactly the same as the original index (which was being ignored), except that a) the name is different; and b) the order of the columns used to be EntityTypeID, SyncStatusID, ApiConnectionID.
I want to stress again that the index was working fine before our database downgrade.
So - any ideas what happened?
You have a parameterized query. You are getting different plans based on the values being sniffed during compilation. The various actions you are doing are causing (re-)compilations and you are getting the same plan sometimes and a different plan other times. When you are getting a selective value sniffed, it is picking the seek plan. When you get a non-selective value sniffed, it picks the scan plan. So, downgrading the SLO will force a failover and thus flush the plan cache --> new compilation, new opportunity to sniff a value during compilation. When you free the procedure cache --> new compilation. When you drop or add an index, it forces a recompile for plans touching that table on next use.
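As an illustration of the mechanism (a sketch; the variable and values are made up, though the table is the one from the question): whichever parameter value happens to be sniffed at compile time decides whether the cached plan is a seek or a scan, and OPTION (RECOMPILE) is one standard way to get a fresh plan per execution, at some CPU cost:
-- Hypothetical parameterized query: the plan compiled for the first
-- sniffed value of @SyncStatusID is cached and reused for later values.
DECLARE @SyncStatusID INT = 1;
SELECT *
FROM [client_2].[SyncDetail]
WHERE [SyncStatusID] = @SyncStatusID;

-- One workaround: recompile on every execution so each run gets a plan
-- suited to its own parameter value (trades CPU for plan quality).
SELECT *
FROM [client_2].[SyncDetail]
WHERE [SyncStatusID] = @SyncStatusID
OPTION (RECOMPILE);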
You can read more about this behavior here:
Blog post explaining parameter sniffing
You can also use the query store to see the history for query plans (it is on-by-default in SQL Azure). Here is a link explaining how you can use the query store. Note that with the query store you can easily force a specific query plan if you want only one plan to run through the SSMS UI.
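The same forcing can also be done in T-SQL rather than the SSMS UI. A sketch (the LIKE filter and the ids are placeholders to substitute from your own results):
-- Locate the regressed query and its plans in the Query Store views.
SELECT qt.query_sql_text, q.query_id, p.plan_id
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
WHERE qt.query_sql_text LIKE '%SyncDetail%';

-- Force the plan you want (substitute the ids returned above).
EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;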
The parameter sniffing space is one where we would like to do more work in the future to make this behavior more transparent and less surprising for customers, but we have not yet been able to do so. So, sorry this has been frustrating for you, and I hope the workaround I've posted will get you unblocked. Rest assured we are very aware of the problem space and hope to do more here to make SQL better for workloads like this in the future.

By how much do SSDs narrow the performance gap between clustered and non clustered indices?

Most SQL relational databases support the concept of a clustered index in a table. A clustered index, usually implemented as a B-tree, represents the actual records in a given table, physically ordered by that index on disk/storage. One advantage of this special clustered index is that after traversing the B-tree in search for a record or set of records, the actual data can be found immediately at the leaf nodes.
This stands in contrast to a non clustered index. A non clustered index exists outside the clustered index, and also orders the underlying data using one or more columns. But, the leaf nodes may not have data for all the columns needed in the query. In this case, the database has to do a disk seek to the original data to get this information.
In most database resources I have seen on Stack Overflow and elsewhere, this additional disk seek is viewed as a substantial performance penalty. My question is how would this analysis change assuming that all database files were stored on a solid state drive (SSD)?
From the Wikipedia page for SSDs, the random access time for SSDs is less than 0.1 ms, while random access times for mechanical hard disks are typically 10-100 times slower.
Do SSDs narrow the gap between clustered and non clustered indices, such that the former become less important for overall performance?
First of all, a clustered index does not guarantee that the rows are physically stored in index order. InnoDB for example can store the clustered index in a non-sequential way. That is, two database pages containing consecutive rows of the table might be stored physically close to each other, or far apart in the tablespace, and in either order. The B-tree data structure for the clustered index has pointers to the leaf pages, but they don't have to be stored in any order.
SSD is helpful for speeding up IO-based operations, particularly those involving disk seeks. It's way faster than a spinning magnetic disk. But RAM is still a couple of orders of magnitude faster than the best SSD.
The 2018 numbers:
Disk seek: 3,000,000 ns
SSD random read: 16,000 ns
Main memory reference: 100 ns
RAM still trumps durable storage by a wide margin: by these numbers, an SSD random read is nearly 200 times faster than a disk seek, but a main memory reference is another 160 times faster than the SSD. If your dataset (or at least the active subset of your dataset) fits in RAM, you won't need to worry about the difference between magnetic disk storage and SSD storage.
Re your comment:
The clustered index helps because when a primary key lookup searches through the B-tree and finds a leaf node, right there are all the other fields of the row associated with that primary key value.
Compare with MyISAM, where a primary key index is separate from the rows of the table. A query searches the B-tree of the primary key index, and at the leaf node finds a pointer to the location in the data file where the corresponding row is stored. So it has to do a second seek into the data file.
This does not necessarily mean that the clustered index in InnoDB is stored consecutively. It might need to skip around a bit to read all the pages of the tablespace. This is why it's so helpful to have the pages in RAM in the buffer pool.
First, the additional disk seek is not really a "killer". This can be a big issue in high transaction environments where microseconds and milliseconds count. However, for longer running queries, it will make little difference.
This is especially true if the database intelligently does "look ahead" disk seeks. Databases are often not waiting for data because another thread is predicting what pages will be needed and working on bringing those back. This is usually done by just taking the "next" pages on a sequential scan.
SSDs are going to speed up pretty much all operations. They do change the optimization parameters. In particular, I think they are comparably fast on throughput (although I don't keep up with the technology specifically). Their big win is in latency -- the time from issuing the request for a disk block and the time when it is retrieved.
In my experience (which is a few years old), the performance using SSD was comparable to an in-memory database for most operations.
Whether this makes cluster indexes redundant is another matter. A key place where they are used is when you want to separate a related small amount of rows (say "undeleted") from a larger amount. By putting them in the same data pages, the clustered index reduces the overall number of rows being read -- it doesn't just make the reads faster.
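A sketch of that pattern (the schema is hypothetical): clustering on the flag column first packs the small "live" subset into a contiguous range of pages, so queries over undeleted rows read far fewer pages regardless of how fast the storage is.
-- Hypothetical: keep undeleted rows physically together.
CREATE CLUSTERED INDEX CIX_Orders_IsDeleted_OrderID
ON dbo.Orders (IsDeleted, OrderID);

-- Touches only the pages that hold undeleted rows.
SELECT OrderID, CustomerID
FROM dbo.Orders
WHERE IsDeleted = 0;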
Just some suggestions (too broad for a simple comment):
Everything depends on the distribution of the keys in the non-clustered index and in its nodes, which is essentially random and can only be assessed in average terms. The fact remains that every access benefits from the performance of the SSD. The gain is not linear, but it is nonetheless substantial: on average it will not be a full factor of 100, precisely because of that randomness of distribution, but in every circumstance where a random access occurs, that access is up to 100 times faster, and the more random the access pattern, the greater the benefit.
The underlying fact, however, is that every disk operation becomes much more efficient, so in general a non-clustered index gets to operate in a near-optimal context.
Taking this into account, the gap should be radically reduced, thanks to the faster storage underlying the whole database: from access to the logical files that compose it down to the physical sectors where the data are actually stored.

What happens when a SQL query runs out of memory?

I want to set up a Postgres server on AWS, the biggest table will be 10GB - do I have to select 10GB of memory for this instance?
What happens when my query result is larger than 10GB?
Nothing will happen; the entire result set is not loaded into memory. The maximum available memory will be used and re-used as needed while the result is prepared, and the work will spill over to disk as needed.
See PostgreSQL resource documentation for more info.
Specifically, look at work_mem:
work_mem (integer)
Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files.
As long as you don't run out of working memory on a single operation or set of parallel operations you are fine.
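A quick way to see this behavior in PostgreSQL (a sketch; big_table and its unindexed column are hypothetical): with a small work_mem, EXPLAIN ANALYZE reports the sort spilling to temporary files rather than the query failing.
-- Deliberately small sort memory for this session.
SET work_mem = '4MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM big_table              -- hypothetical large table
ORDER BY unindexed_column;  -- forces a sort

-- In the output, check the Sort node's method:
--   "Sort Method: external merge  Disk: ..."  -> spilled to temp files
--   "Sort Method: quicksort  Memory: ..."     -> fit within work_mem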
Edit: The above was an answer to the question What happens when you query a 10GB table without 10GB of memory on the server/instance?
Here is an updated answer to the updated question:
Only server side resources are used to produce the result set
Assuming JDBC drivers are used, by default, the entire result set is sent to your local computer which could cause out of memory errors
This behavior can be changed by altering the fetch size through the use of a cursor.
Reference to this behavior here
Getting results based on a cursor
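Conceptually, a reduced fetch size achieves the same thing as a server-side cursor fetched in batches. A minimal sketch in plain SQL (the table name is hypothetical):
-- Cursors live inside a transaction.
BEGIN;

DECLARE big_rows CURSOR FOR
SELECT * FROM sync_log;  -- hypothetical large table

FETCH FORWARD 1000 FROM big_rows;  -- repeat until no rows come back
FETCH FORWARD 1000 FROM big_rows;

CLOSE big_rows;
COMMIT;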
On the server side, with a simple query like yours it just keeps a "cursor" which points to where it's at, as it's spooling the results to you, and uses very little memory. Now if there were some "sorts" in there or what not, that didn't have indexes it could use, that might use up lots of memory, not sure there. On the client side the postgres JDBC client by default loads the "entire results" into memory before passing them back to you (overcomeable by specifying a fetch count).
With more complex queries (for example give me all 100M rows, but order them by "X" where X is not indexed) I don't know, but probably internally it creates a temp table (so it won't run out of RAM) which, treated as a normal table, uses disk backing. If there's a matching index then it can just traverse that, using a pointer, still uses little RAM.

Page fullness in SQL server: Is higher always better?

So I've got a very frequently-run query on my SQL server instance that's generating a lot of wait time. On examining the plan, I was pointed in the direction of a clustered index seek that accounts for 93% of the cost of the whole operation.
Examining the clustered index, I discovered that while it has 0% fragmentation listed, it has a page fullness value of only 23%. From all the research I've done, I can't really find any indication of why you would want page fullness to be low, and am anticipating that I'll want to do an index reorganize operation to set the value to something more like 90%. (This table is very frequently read from and written to, and I am led to believe too high a page fullness value creates slowdown during write operations, hence why I don't set it to something like 99% or 100%.)
My question is this: Is there any reason I shouldn't reorganize the index and set the page fill factor to 90% or so? Any downside to doing that? I need this query, and the SQL text itself has already been optimized through Lync.
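A sketch of the change being considered (object names hypothetical). One caveat worth labeling: ALTER INDEX ... REORGANIZE cannot apply a new fill factor, so changing it requires a REBUILD:
-- REORGANIZE only compacts pages to the existing fill factor;
-- a REBUILD is required to apply a new one.
ALTER INDEX [PK_MyTable] ON [dbo].[MyTable]
REBUILD WITH (FILLFACTOR = 90);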

When and why do I reindex a MSDE database

I understand that indexes should get updated automatically, but when that does not happen we need to reindex.
My questions are: (1) Why does this automatic update fail, or why does an index become bad?
(2) How do I programmatically know which table/index needs re-indexing at a given point in time?
Indexes' statistics may be updated automatically. I do not believe that the indexes themselves would be rebuilt automatically when needed (although there may be some administrative feature that allows such a thing to take place).
Indexes associated with tables which receive a lot of changes (new rows, updated rows, and deleted rows) may become fragmented, and less efficient. Rebuilding the index then "repacks" it into a contiguous section of storage space, a bit akin to the way defragmentation of the file system makes file access faster...
Furthermore, indexes (on several DBMSes) have a FILL_FACTOR parameter, which determines how much extra space should be left in each node for growth. For example, if you expect a given table to grow by 20% next year, declaring the fill factor at around 80% should keep fragmentation of the index minimal during the first year (there may be some if that 20% of growth is not evenly distributed...)
In SQL Server, it is possible to query properties of the index that indicate its level of fragmentation, and hence its possible need for maintenance. This can be done by way of the interactive management console. It is also possible to do this programmatically, by way of sys.dm_db_index_physical_stats in MSSQL 2005 and above (on SQL Server 2000 / MSDE, the older DBCC SHOWCONTIG command provides similar information).
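A sketch of such a check on SQL Server 2005 and above (the thresholds are common rules of thumb, not hard limits):
-- List fragmented indexes in the current database.
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30  -- common rebuild threshold
  AND ips.page_count > 100                   -- ignore tiny indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;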