Dirty Reads in SQL Server AlwaysOn

I have a pair of SQL Server 2014 databases set up as a synchronous AlwaysOn availability group.
Both servers are set to the Synchronous commit availability mode, with a session timeout of 50 seconds. The secondary is set to be a Read-intent only readable secondary.
If I write to the primary and then immediately read from the secondary (via ApplicationIntent=ReadOnly), I consistently read dirty data (i.e. the state before the write). If I wait for around a second between writing and reading, I get the correct data.
Is this expected behaviour? If so, is there something I can do to ensure that reads from the secondary are up-to-date?
I'd like to use the secondary as a read-only version of the primary (as well as a fail-over), to reduce the load on the primary.

There is no way you can get dirty reads unless you are using the NOLOCK hint.
When you enable read-only secondaries in AlwaysOn, SQL Server internally uses row versioning to keep the previous version of each row, so reads on the secondary are never blocked by (or exposed to) in-flight changes.
Further, you are using synchronous-commit mode, which ensures that log records are hardened on the secondary before the commit is acknowledged on the primary.
What you are seeing is data latency.
This whitepaper deals with this scenario; below is the relevant part, which helps in understanding it.
The reporting workload running on the secondary replica will incur some data latency, typically a few seconds to minutes depending upon the primary workload and the network latency.
The data latency exists even if you have configured the secondary replica to synchronous mode. While it is true that a synchronous replica helps guarantee no data loss in ideal conditions (that is, RPO = 0) by hardening the transaction log records of a committed transaction before sending an ACK to the primary, it does not guarantee that the REDO thread on secondary replica has indeed applied the associated log records to database pages.
So there is some data latency. You may wonder if this data latency is more likely when you have configured the secondary replica in asynchronous mode. This is a more difficult question to answer. If the network between the primary replica and the secondary replica is not able to keep up with the transaction log traffic (that is, if there is not enough bandwidth), the asynchronous replica can fall further behind, leading to higher data latency.
In the case of a synchronous replica, insufficient network bandwidth does not cause higher data latency on the secondary, but it can slow down the transaction response time and throughput for the primary workload.
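To see this behaviour from client code, here is a minimal Java/JDBC sketch; the listener name, database, table, and credentials are all hypothetical. With the Microsoft JDBC driver, applicationIntent=ReadOnly is what routes the second connection to the readable secondary, and an immediate read there can legitimately return the pre-write value until REDO catches up.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReadIntentLatencyDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical AG listener, database, and credentials.
        String writeUrl = "jdbc:sqlserver://ag-listener:1433;databaseName=AppDb";
        String readUrl  = writeUrl + ";applicationIntent=ReadOnly"; // routed to the secondary

        try (Connection write = DriverManager.getConnection(writeUrl, "appUser", "appPass");
             Connection read  = DriverManager.getConnection(readUrl,  "appUser", "appPass")) {

            try (Statement s = write.createStatement()) {
                s.executeUpdate("UPDATE dbo.Settings SET Value = 42 WHERE Id = 1");
            }

            // The commit is already hardened in the secondary's log (synchronous
            // commit), but the REDO thread may not yet have applied it to the
            // data pages, so this read can still see the old value.
            try (Statement s = read.createStatement();
                 ResultSet rs = s.executeQuery("SELECT Value FROM dbo.Settings WHERE Id = 1")) {
                if (rs.next()) {
                    System.out.println("Secondary sees: " + rs.getInt(1));
                }
            }
        }
    }
}
```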

Related

IBM MQ Multi-Instance Queues

My company uses IBM MQ's multi-instance queues right now. We would like to replicate those queues to a different data center over the WAN for disaster recovery purposes. I'm skeptical it will work, simply because of the volume of message traffic; even a slight delay will cause the queues to fail.
What is the technical reason why this will not work?
Are you talking about storage replication? If so, are you planning to use synchronous or asynchronous replication?
Async will not cause any delay on the replicating end, but there will be some amount of delay before the receiving end receives the data, depending on network distance. Your storage team should be able to tell you how many seconds the async replication delay could be.
With sync, the data is sent over the network by the replicating end's storage array, and a confirmation comes back over the network before the storage array reports to the OS that the write was successful. To be usable, the two arrays have to be within 6 ms of each other. This type of replication adds a delay to each write equal to the network round trip.
An MQ application can batch messages into single units of work to improve performance when sync replication is in place, but sync replication will still slow down persistent message performance.
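To illustrate the batching point, here is a rough sketch using the IBM MQ classes for Java (the queue manager and queue names are made up). Putting messages under syncpoint defers the log force to the single commit, so synchronous storage replication pays its per-write network penalty once per batch instead of once per message.

```java
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQPutMessageOptions;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;

public class BatchedPut {
    public static void main(String[] args) throws Exception {
        MQQueueManager qMgr = new MQQueueManager("QM1");        // hypothetical qmgr
        MQQueue queue = qMgr.accessQueue("APP.REQUEST.QUEUE",   // hypothetical queue
                                         CMQC.MQOO_OUTPUT);

        MQPutMessageOptions pmo = new MQPutMessageOptions();
        pmo.options = CMQC.MQPMO_SYNCPOINT;   // put inside a unit of work

        for (int i = 0; i < 100; i++) {
            MQMessage msg = new MQMessage();
            msg.writeString("payload " + i);
            queue.put(msg, pmo);              // no log force per message
        }
        qMgr.commit();                        // one forced (and replicated) log write

        queue.close();
        qMgr.disconnect();
    }
}
```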
Define "Slight delay" in your statement?
Async replication will cause a delay and RPO will not be zero. Your storage team can advise on RPO value. If that is not acceptable, asynch replication is not an option for you.
Although it's pragmatic choice from cost and distance standpoint but could cause duplicate or missing transactions.
For synch replication, the distance in data-centers is limited. (Apart from hit on performance on Primary DC). Check with your storage team on the distance limit.

Zookeeper vs In-memory-data-grid vs Redis

I've found different ZooKeeper definitions across multiple resources. Maybe some of them are taken out of context, but please take a look at them:
A canonical example of Zookeeper usage is distributed-memory computation...
ZooKeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster.
Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
I've worked with Redis and Hazelcast, so it would be easier for me to understand ZooKeeper by comparing it with them.
Could you please compare Zookeeper with in-memory-data-grids and Redis?
If it's distributed-memory computation, how does ZooKeeper differ from in-memory data grids?
If it's synchronization across a cluster, then how does it differ from all the other in-memory storages? The same in-memory data grids also provide cluster-wide locks. Redis also has some kind of transactions.
If it's only about in-memory consistent data, then there are other alternatives. IMDGs allow you to achieve the same, don't they?
https://zookeeper.apache.org/doc/current/zookeeperOver.html
By default, ZooKeeper replicates all your data to every node and lets clients watch the data for changes. Changes are sent very quickly (within a bounded amount of time) to clients. You can also create "ephemeral nodes", which are deleted within a specified time if a client disconnects. ZooKeeper is highly optimized for reads, while writes are very slow (since they must be replicated to every node, and watching clients notified, as soon as the write takes place). Finally, the maximum size of a "file" (znode) in ZooKeeper is 1 MB, but typically they'll be single strings.
Taken together, this means that ZooKeeper is not meant to store much data, and definitely not a cache. Instead, it's for managing heartbeats/knowing what servers are online, storing/updating configuration, and possibly message passing (though if you have large numbers of messages or high throughput demands, something like RabbitMQ will be much better for that task).
Basically, ZooKeeper (and Curator, which is built on it) helps in handling the mechanics of clustering -- heartbeats, distributing updates/configuration, distributed locks, etc.
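As a small illustration of the configuration-distribution use case, here is a sketch with the plain ZooKeeper Java client (the connection string and znode path are invented). The client reads a config znode and registers a watch, so ZooKeeper pushes a notification the moment the value changes; note that watches are one-shot and must be re-registered.

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ConfigWatcher {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, null);

        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                // Fired once when /app/config changes; re-read and re-watch here.
                System.out.println("Config changed: " + event.getPath());
            }
        };

        // Read the current value and register for the next change ("push", not "pull").
        byte[] data = zk.getData("/app/config", watcher, null);
        System.out.println("Current config: " + new String(data));

        Thread.sleep(Long.MAX_VALUE);   // keep the session alive to receive the watch
    }
}
```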
It's not really comparable to Redis, but for the specific questions...
It doesn't support any computation, and for most data sets it won't be able to store the data with acceptable performance.
It's replicated to all nodes in the cluster (there's nothing like Redis clustering where the data can be distributed). All messages are processed atomically in full and are sequenced, so there are no real transactions. It can be USED to implement cluster-wide locks for your services (it's very good at that, in fact), and there are a lot of locking primitives on the znodes themselves to control which nodes access them.
Sure, but ZooKeeper fills a niche. It's a tool for making distributed applications play nice with multiple instances, not for storing/sharing large amounts of data. Compared to using an IMDG for this purpose, ZooKeeper will be faster, manages heartbeats and synchronization in a predictable way (with a lot of APIs for making this part easy), and has a "push" paradigm instead of "pull", so nodes are notified very quickly of changes.
The quotation from the linked question...
A canonical example of Zookeeper usage is distributed-memory computation
... is, IMO, a bit misleading. You would use it to orchestrate the computation, not to provide the data. For example, let's say you had to process rows 1-100 of a table. You might put up 10 ZK znodes, with names like "1-10", "11-20", "21-30", etc. Client applications would be notified of this change automatically by ZK, and the first one would grab "1-10" and set an ephemeral node clients/192.168.77.66/processing/rows_1_10.
The next application would see this and go for the next group to process. The actual data to compute would be stored elsewhere (i.e. Redis, a SQL database, etc.). If the node failed partway through the computation, another node could see this (after 30-60 seconds) and pick up the job again.
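A sketch of that claim step with the plain ZooKeeper API (reusing the hypothetical paths from the example above, and assuming the parent znodes already exist, since ZooKeeper does not create parents implicitly). The EPHEMERAL flag is what releases a crashed worker's claim automatically when its session expires:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkClaim {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, null);
        try {
            // Ephemeral: ZooKeeper deletes this znode when our session dies,
            // so another worker can pick the batch up after a crash.
            zk.create("/clients/192.168.77.66/processing/rows_1_10",
                      new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);
            System.out.println("Claimed rows 1-10; now fetch them from Redis/SQL.");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Another worker already claimed this batch.");
        }
    }
}
```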
I'd say the canonical example of ZooKeeper is leader election, though. Let's say you have 3 nodes -- one is master and the other 2 are slaves. If the master goes down, a slave node must become the new leader. This type of thing is perfect for ZK.
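With Curator, that election is only a few lines. A minimal sketch (connection string and latch path are invented): every candidate creates a LeaderLatch on the same path, ZooKeeper grants leadership to exactly one of them, and a new leader is elected if the current one's session dies.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElection {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        try (LeaderLatch latch = new LeaderLatch(client, "/app/leader")) {
            latch.start();
            latch.await();   // blocks until this instance becomes the leader
            System.out.println("Elected leader; doing master-only work...");
        }                    // closing the latch relinquishes leadership

        client.close();
    }
}
```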
Consistency Guarantees
ZooKeeper is a high performance, scalable service. Both reads and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
Sequential Consistency
Updates from a client will be applied in the order that they were sent.
Atomicity
Updates either succeed or fail -- there are no partial results.
Single System Image
A client will see the same view of the service regardless of the server that it connects to.
Reliability
Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc.) the client will not know if the update has applied or not. We take steps to minimize the failures, but the guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
Timeliness
The client's view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds). Either system changes will be seen by a client within this bound, or the client will detect a service outage.

Aspstate SQL Server database mirroring high IO

We are currently having issues with ASPState database mirroring. We have around 10,000 active users online 9-5 day to day, and the ASPState database is so write-heavy, and passes so much to the mirror, that the mirror's drive is very high on I/O and keeps causing both servers to be inaccessible due to the latency of writing the data on the mirror. We're using SQL Server 2012 Standard, so we cannot use asynchronous mode.
We're running SQL Server on Amazon EC2 instances with EBS-backed volumes at 1,000 IOPS; in your view, should this be enough? We have had very smooth periods with over 15,000 users online, and then other times, with only 10,000 users online, we have issues with disk queue lengths on the mirror (the backup server, not the principal server).
The principal can be writing to the aspstate.mdf files at a constant 10-20 MB/s when the disk queue length goes up.
We're going to increase the IOPS to 2,000 in the meantime, as currently we've had to disable mirroring. Would you expect this, and has anyone handled this sort of volume before?
The bottleneck with a high-transaction workload like ASPState is not the data file but the transaction log. With synchronous mirroring, additional latency is introduced for both the network round trip and the synchronous commit at the mirror, and that latency will not be tolerable with a large number of ASPState requests. Keep in mind that unless specified otherwise, with session state enabled each ASP.NET page request requires 2 updates to a session state row. So if you have 10,000 active users clicking once every 15 seconds, that is roughly 667 requests per second, which at 2 updates per request comes to about 1,300 I/Os per second for transaction log writes alone on each database.
If you must have HA for session state, I suggest failover clustering to eliminate the network latency. You might also consider tuning session state by specifying the read-only or none directive for pages that don't need session state. Consider using an in-memory session state solution instead of the out-of-the-box ASPState database if you need to support a large number of users. Also remember that session state data is temporary, so you can forgo durability.

Cassandra Commit and Recovery on a Single Node

I am a newbie to Cassandra. I have been searching for information related to commits and crash recovery in Cassandra on a single node, and I'm hoping someone can clarify the details.
I am testing Cassandra, so I set it up on a single node. I am using the DataStax stress tool to insert millions of rows. What happens if there is an electrical failure or system shutdown? Will all the data that was in Cassandra's memory get written to disk upon Cassandra restart (I guess the commit log acts as an intermediary)? How long does this process take?
Thanks!
Cassandra's commit log gives Cassandra durable writes. When you write to Cassandra, the write is appended to the commit log before the write is acknowledged to the client. This means every write that the client receives a successful response for is guaranteed to be written to the commit log. The write is also made to the current memtable, which will eventually be written to disk as an SSTable when large enough. This could be a long time after the write is made.
However, the commit log is not immediately synced to disk for performance reasons. The default is periodic mode (set by the commitlog_sync param in cassandra.yaml) with a period of 10 seconds (set by commitlog_sync_period_in_ms in cassandra.yaml). This means the commit log is synced to disk every 10 seconds. With this behaviour you could lose up to 10 seconds of writes if the server loses power. If you had multiple nodes in your cluster and used a replication factor of greater than one you would need to lose power to multiple nodes within 10 seconds to lose any data.
If this risk window isn't acceptable, you can use batch mode for the commit log. This mode won't acknowledge writes to the client until the commit log has been synced to disk. The time window is set by commitlog_sync_batch_window_in_ms, default is 50 ms. This will significantly increase your write latency and probably decrease the throughput as well so only use this if the cost of losing a few acknowledged writes is high. It is especially important to store your commit log on a separate drive when using this mode.
In the event that your server loses power, on startup Cassandra replays the commit log to rebuild its memtable. This process will take seconds (possibly minutes) on very write heavy servers.
If you want to ensure that the data in the memtables is written to disk, you can run 'nodetool flush' (this operates per node). This will create a new SSTable and delete the commit log segments that referred to the flushed memtables.
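For reference, this is roughly how those settings look in cassandra.yaml (the parameter names and defaults are the ones described above; uncomment the batch lines to trade latency for durability):

```yaml
# cassandra.yaml -- commit log durability settings discussed above.

# Default: fsync the commit log every 10 seconds. A single node can lose
# up to ~10 s of acknowledged writes on power failure.
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Alternative: don't acknowledge writes until the commit log is synced.
# Safer, but with noticeably higher write latency.
# commitlog_sync: batch
# commitlog_sync_batch_window_in_ms: 50
```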
You are asking something like:
What happens if there is a network failure while data is being loaded into Oracle using SQL*Loader?
Or, what happens if Sqoop stops processing due to some condition while transferring data?
Simply put, whatever data was transferred before the electrical failure or system shutdown remains as it was.
Coming to the second question: whenever the memtable runs out of space, i.e. when the number of keys exceeds a certain limit (128 is the default) or when the time duration (cluster clock) is reached, it is flushed to an SSTable, which is immutable.

Is it possible to get sub-1-second latency with transactional replication?

Our database architecture consists of two SQL Server 2005 servers, each with an instance of the same database structure: one for all reads and one for all writes. We use transactional replication to keep the read database up-to-date.
The two servers are very high-spec indeed (the write server has 32GB of RAM), and are connected via a fibre network.
When deciding upon this architecture we were led to believe that the latency for data to be replicated to the read server would be on the order of a few milliseconds (depending on load, obviously). In practice we are seeing latency of around 2-5 seconds even in the simplest of cases, which is unsatisfactory. By the simplest case, I mean updating a single value in a single row in a single table on the write DB and seeing how long it takes to observe the new value in the read database.
What factors should we be looking at to achieve latency below 1 second? Is this even achievable?
Alternatively, is there a different mode of replication we should consider? What is the best practice for the locations of the data and log files?
Edit
Thanks to all for the advice and insight - I believe that the latency periods we are experiencing are normal; we were misled by our db hosting company as to what latency times to expect!
We're using the technique described near the bottom of this MSDN article (under the heading "scaling databases"), and we'd failed to deal properly with this warning:
The consequence of creating such specialized databases is latency: a write is now going to take time to be distributed to the reader databases. But if you can deal with the latency, the scaling potential is huge.
We're now looking at implementing a change to our caching mechanism that enforces reads from the write database when an item of data is considered to be "volatile".
No. It's highly unlikely you could achieve sub-1s latency times with SQL Server transactional replication even with fast hardware.
If you can get 1 - 5 seconds latency then you are doing well.
From here:
Using transactional replication, it is possible for a Subscriber to be a few seconds behind the Publisher. With a latency of only a few seconds, the Subscriber can easily be used as a reporting server, offloading expensive user queries and reporting from the Publisher to the Subscriber.
In the following scenario (using the Customer table shown later in this section) the Subscriber was only four seconds behind the Publisher. Even more impressive, 60 percent of the time it had a latency of two seconds or less. The time is measured from when the record was inserted or updated at the Publisher until it was actually written to the subscribing database.
I would say it's definitely possible.
I would look at:
Your network
Run ping commands between the two servers and see if there are any issues
If the servers are next to each other you should have < 1 ms.
Bottlenecks on the server
This could be network traffic (volume)
Like network cards not being configured for 1 Gb/s
Anti-virus or other things
Do some analysis on some queries and see if you can identify indexes or locking which might be a problem
See if any of the selects on the read database might be blocking the writes.
Add WITH (NOLOCK), and see if this makes a difference on one or two queries you're analyzing.
Essentially you have a complicated system with a problem; you need to determine which component is the problem and fix it.
Transactional replication is probably best if the reports/selects you need to run must be up to date. If they don't, you could look at log shipping, although that would add some downtime with each import.
For data/log files, make sure they're on separate drives so that performance is maximized.
Something to remember about transaction replication is that a single update now requires several operations to happen for that change to occur.
First, you update the source table.
Next, the log reader sees the change and writes it to the distribution database.
Then, the distribution agent sees the new entry in the distribution database, reads that change, and runs the correct stored procedure on the subscriber to update the row.
If you monitor the statement run times on the two servers you'll probably see that they are running in just a few milliseconds. However it is the lag time while waiting for the log reader and distribution agent to see that they need to do something which is going to kill you.
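If you want to measure that end-to-end lag yourself, a simple probe is to stamp a row on the publisher and poll the subscriber until the stamp shows up. A rough Java/JDBC sketch (the server names, credentials, and the dbo.LatencyProbe table are all hypothetical):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ReplicationLatencyProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical publisher/subscriber; dbo.LatencyProbe(Id INT PK, Stamp BIGINT).
        String pub = "jdbc:sqlserver://write-server:1433;databaseName=AppDb";
        String sub = "jdbc:sqlserver://read-server:1433;databaseName=AppDb";

        try (Connection w = DriverManager.getConnection(pub, "appUser", "appPass");
             Connection r = DriverManager.getConnection(sub, "appUser", "appPass")) {

            long stamp = System.nanoTime();
            try (PreparedStatement ps =
                     w.prepareStatement("UPDATE dbo.LatencyProbe SET Stamp = ? WHERE Id = 1")) {
                ps.setLong(1, stamp);
                ps.executeUpdate();
            }

            long start = System.nanoTime();
            while (true) {   // poll the subscriber until the new stamp is visible
                try (PreparedStatement ps =
                         r.prepareStatement("SELECT Stamp FROM dbo.LatencyProbe WHERE Id = 1");
                     ResultSet rs = ps.executeQuery()) {
                    if (rs.next() && rs.getLong(1) == stamp) {
                        break;
                    }
                }
                Thread.sleep(10);
            }
            System.out.printf("Replication latency: %.0f ms%n",
                              (System.nanoTime() - start) / 1e6);
        }
    }
}
```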
If you truly need sub-second processing time, then you will want to look into writing your own processing engine to handle moving data from one server to another. I would recommend using SQL Service Broker for this, as that way everything is native to SQL Server and no third-party code has to be written.