Our organisation has move on from Cassandra to ScyllaDB recently and since there's so little info about ScyllaDB, and as the title suggests, how often should we repair ScyllaDB nodes to maintain equal count of rows in each node as Cassandra's repair frequency is recommended as 5 Days?
Scylla Manager automates the repair process and allows you to configure how and when repair occurs. When you create a cluster a repair task is automatically scheduled. This task is set to occur each week by default, but you can change it to another time, change its parameters or add additional repair tasks if needed.
Source: https://manager.docs.scylladb.com/stable/repair/index.html
[Edi: Pointing to latest docs]
Related
We recently upgraded Particular.ServiceControl.Audit to v4.26 and recreated our Audit instance, since from v4.26 and up new audit instances will bump up RavenDB from 3.5 to 5.4.
We did this hoping to remedy a problem in 4.25.x where compacting the audit database would not free up any disk space.
As it is, the database is still growing really large (3-400GB). To test if the data contained should actually use up all this disk space, we tried to tamper with the ServiceControl.Audit/AuditRetentionPeriod config parameter, setting it to a small value like 1 day (before value was 30 days). Naively, maybe, waiting for db to shrink at some point - expecting that retention would somehow influence disk space used. The db file remained the same size (and growing).
The docs for compacting the database in ServiceControl only mention the esent based Raven 3.5, but Raven5+ uses Raven's own Voron storage engine. There does not seem to be Particular docs describing how to compact the Voron db.
So we tried to follow the RavenDB docs for compacting the db through Raven Studio after putting the service in maintenance mode. As the screenshot shows, this operation stopped or stalled very shortly after initializing (disregard time elapsed, as we actually left it running for several hours).
We tried this several times with no luck. We can see that the disk containing the specific db has zero reads or writes, so it is definitely stalled in some way. Pressing the "Abort" button also got stuck every time, after which we resolved to simply restarting the entire service (which then seemed to bring the db back to normal operation).
So the question is: how do we keep the audit db from growing indefinitely? We can't at any point see it not growing or staying the same size, causing SSD disk costs to grow in expenses.
I want to restart my Scylla db cluster. But I don't want to lose any data.
Do I lose any data if I restart one after other node?
No, you will not loose data if you are doing a rolling restart.
Scylla keeps the data replicated across multiple nodes (usually 3 or more)
Depending on your Replication Factor (RF) and Consistency Level (CL) you might see read or write operations failed during the restart. See interactive calc here https://docs.scylladb.com/getting-started/consistency/#consistency-level-calculator
If "restarting a node" just involves restarting Scylla or rebooting the kernel on which it runs, then you're safe: Scylla is a distributed database, and is designed to support durability and availability even when nodes temporarily disappear from the network. When a node is temporarily down, all its data is still available for reads (from two other replicas), and also writes continue to work normally and will be eventually replicated to the down node when it finally comes up (using the "hinted handoff" and/or "repair" mechanisms).
However, if by "restarting a node" you mean something more destructive - replacing it with a brand-new node with empty storage, as in some cloud setups where nodes have transient storage. In that case you have to be more careful: If the node's data is lost, we still have two more replicas and the database continues to be available, but you should tell the cluster to "stream" the data which the node lost back to the node - before continuing to do this destructive restart to additional nodes. If you have RF=3 and destroy three nodes at the same time, you will surely lose data.
I have a SQL Server (distributor and publisher) 2008 which is replicating using both snapshot and transactional replication to replicate to a couple of subscribers. There is plenty of information here https://learn.microsoft.com/en-us/sql/relational-databases/replication/disable-publishing-and-distribution on how to permanently disable replication.
I don't want to permanently disable replication, just temporarily for a network outage that is scheduled for later this week.
I have learned my lesson that when things go amuck it's a complete disable, remove, and re-setup to get everything working again, and there are too many publications to make this an option just to temporarily disable this.
It depends on whether there's going to be a network split between publisher and distributor or distributor and subscriber. Both of the below scenarios deal with transactional replication.
publisher and distributor - the log reader agent will not be able to mark records as delivered to the distribution database and so will stay in the transaction log of the publisher longer than normal. This may cause log growth (depending on how much free space is in your log file currently).
distributor and subscriber - assuming that the network outage is shorter than the minimum retention period for the distribution database, you should be able to just suspend the distribution jobs and everything should pick back up once the network is back online. Depending on the size of the backlog, it may be easier to re-initialize some (or all!) of your articles.
For snapshot replication, you shouldn't need to do much since the only time there's activity is when a snapshot is being created and delivered to the subscriber. You can just disable those jobs for the duration of your event and re-enable them when you're done.
This is the error I get from the Log while trying to process a SQL Server 2012 MOLAP Cube.
"Time-out occurred while waiting for buffer latch type 3 for page (1:2044928) database ID 2.; 42000." Source="Microsoft SQL Server 2012 Analysis Services" HelpFile="Error ErrorCode="3240034318" Description="Errors in the OLAP storage engine: An error occurred while processing the 'Measurement' partition of the measure group for the 'PE cube' cube from the Cube database."
I have scripted the processing task in XMLA and execute the processing via a SSAS Command in an Agent Job.
The first step is to Process Update all dimensions and this succeeds, but when I want to Process Data of the cube the load fails and this error pops up.
I first tried processing with an SSIS package, but this caused the whole server to crash instead of just the job failing. This leads me to believe this a performance issue, but the machine running the job is an Azure VM with 16 processors and 112 GB RAM so I don't know where to look. I also tried running the job without any other activities on the server, but that did not help.
The disk containing the SSAS Instance still has 500GB Free.
The measure group is querying a table containing 180 million records.
While processing the cube on a Dev server with way less data there are no issues. I once succeeded to Process Full the whole cube while processing the SSAS cube directly within SSAS, but via DTEXEC, SSISDB or using SSDT the processing results in a server crash.
Earlier I got different time-out errors, but after adjusting the SSAS ExternalCommandTimeOut, ExternalConnectionTimeOut and ForceCommitTimeout properties to 0 this did not occur anymore.
I have tried multiple processing settings, but because I think it is a performance issue I tried to make the processing as low as possible on performance.
Processing Settings:
Object: Cube; Option: Process Data;
Processing Order: Sequential with Seperate Transactions.
Writeback Table Option: Use Existing;
Do not process affected objects.
Update:
I have processed the measure which triggered the error on its own, this did not finish and in the Activity Monitor I saw a lot of Wait_Type IO_Completion and CXPacket. And when querying the sys.dm_exe_requests I see a Select with wait_type IO_Completion which is already running for a long time and a lot of reads.
Last night I tried to process all measurements excluding the measuregroup which triggered the error earlier, but unfortunately the whole server crashed again...
Update2:
We have looked into upgrading to premium storage, but this means we have to switch from A11 to a DS or GS serie. Meaning we need to resize the whole VM which contains live solutions resulting in down-time and effort to restore the VHDS and replacing the current OS disk which contains parts of live solutions.
Another option we identified is applying partitioning or improving the underlying queries from the measures. Unfortunately way more effort than anticipated, a quick work-around for now would help a lot in selling a long-term solution improvement.
Update3:
We have had contact with Microsoft and they advice to migrate from an A11 VM to a D14 V2 and upgrade to premium storage disks. This will be our next step and will be executed upcoming friday. After the migration I will update or close this post.
If you miss information, please let me know. Any suggestions that would help me pin-point the situation would be much appreciated!
The upgrade to a VM better suitable for the situation (DS14 V2) and upgrade to P30 premium storage disks have resolved the occuring issues. The issue was not in the way the cube was being processed or configured, but in the hardware used.
I am adding a monitoring script to check the size of my DB files so I can deliver a weekly report which shows each files size and how much it grew over the last week. In order to get the growth, I was simply going to log a record into a table each week with each DB's size, then compare to the previous week's results. The only trick is where to keep that table. What are the trade-offs in using the master DB instead of just creating a new DB to hold these logs? (I'm assuming there will be other monitors we will add in the future)
The main reason is that master is not calibrated for additional load: it is not installed on IO system with proper capacity planning, is hard to move around to new IO location, it's maintenance plan takes backups and log backups are as frequent as needed for a very low volume of activity, its initial size and growth rate are planned as if no changes are expected. Another reason against it is that many troubleshooting scenarios you would want a copy of the database to inspect, but you'd have to attach a new master to your instance. These are the main reasons why adding objects to master is discouraged. Also many admins understandably prefer an application to use it's own database so it can be properly accounted for, and ultimately easily uninstalled.
Similar problems exist for msdb, but if push comes to shove it would be better to store app data in msdb rather than master since the former is an ordinary database (despite widespread believe that is system, is actually not).
The Master DB is a system database that belongs to SQL Server. It should not be used for any other purposes. Create your own DB to hold your logs.
I would refrain from putting anything in master, it could be overwritten/recreated on an upgrade.
I have put a DBA only ServerInfo database on each server for uses like this, as well as any application specific environmental things (things that differ between prod and test and dev).
You should add a separat database for the logging. It is not garanteed that the master database is not breaking the next patch of sql server if you leave your objects in there.
And microsoft itself does advise you to not do it.
http://msdn.microsoft.com/en-us/library/ms187837.aspx