We've been using SQL Server merge replication for a few years to synchronise data between our data centres, but we are now suffering with a big performance issue. This may be because the amount of data we are synchronising has increased a lot this year.
Our publisher is an always-on data centre in the UK. Our subscriber is a mobile data centre that travels around the world and is on for periods of up to a week at a time, approx. 25 times a year. However, it also spends the same amount of time (if not more) switched off whilst on its travels - it is a well travelled data centre!
We have 5 databases that we synchronise on these servers. However, one of them has a high volume of data changes between periods of subscriber downtime, and our issue is that it takes days to catch up when the server is powered up - the other databases are fine.
Downloads from publisher to subscriber run at about 1.5 rows a second (which is annoying when we have hundreds of thousands of rows to move), but strangely uploads from subscriber to publisher run about ten times faster.
Things I have checked / tried:
• all tables have non-clustered primary keys on guid columns that have the rowguid property set
• changing the generation levelling threshold doesn't help
• setting the agent profile to high volume doesn't help
• running a trace at the publisher and subscriber shows the queries are all running very fast (less than 20 ms generally, but there are gaps of 200 ms or so between some batches of queries)
• analysis on our WAN link shows we have huge amounts of bandwidth spare
• analysis on our servers show we have huge amounts of Ram and CPU spare
Some of the places the subscriber visits do suffer from high latency, but this doesn't seem to have an impact - at 300 ms or 100 ms we still get the same poor performance.
One thing I did wonder about: does the replication confirm back to the publisher every time it has successfully processed a row at the subscriber? If we have thousands of rows and there is latency on the line, will confirming each item compound the issue? If so, is there a way to batch up messages between publisher and subscriber?
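For reference, my understanding is that the merge agent already works in batches of generations/changes, and that the batch sizes come from the agent profile. The sketch below is how I believe those can be raised on a custom profile - the profile_id of 5 is just a placeholder for whatever sp_help_agent_profile returns for your custom profile:

    -- run at the Distributor (agent_type 4 = Merge Agent)
    EXEC sp_help_agent_profile @agent_type = 4;

    -- raise the download batch sizes on a custom profile
    -- (@profile_id = 5 is a placeholder - use your own custom profile's id)
    EXEC sp_change_agent_parameter
         @profile_id      = 5,
         @parameter_name  = N'DownloadGenerationsPerBatch',
         @parameter_value = N'500';

    EXEC sp_change_agent_parameter
         @profile_id      = 5,
         @parameter_name  = N'DownloadReadChangesPerBatch',
         @parameter_value = N'1000';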
Any help that you can offer will be gladly received!
Thanks
Mark
We got to the bottom of this in the end.
It was our use of nvarchar(max) columns that was stopping the replication from using batches. What used to take 3 hours now takes 50 seconds just by changing the data type.
Here is the lesson learnt: "nvarchar(max) is a replication killer"
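For anyone hitting the same thing, the fix was literally just a data type change on the offending columns. The table and column names below are made up, and you need a length that actually fits your data; depending on your publication settings the schema change itself should then replicate out:

    -- made-up table/column names; pick a length that fits the real data
    ALTER TABLE dbo.JobNotes
        ALTER COLUMN NoteText nvarchar(4000) NULL;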
Thanks
This problem has been keeping me up at night. I seem to have intermittent issues causing Data IO spikes in one of my databases. It's a high-write, high-read scenario: we are reading in live market data every 5 minutes for every single ticker in the US market, so think along the lines of >8K record inserts every 5 minutes.
According to Query Performance Insight I've been getting 90%+ Data IO spikes (it seems to occur in the morning as soon as our job starts running and populating our DB), yet none of the queries running in that time frame seem to be causing the spikes.
My question is: what can be done to get to the root cause of these spikes? I would like to decrease the data tier of the DB to save some money, but the spikes cause lock-ups on lower tiers for all of my reads, so I need to find out what's causing them first in order to eliminate them.
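For context, this is the sort of thing I've been poking at so far without a clear answer - a sketch only, using the standard Azure SQL DMVs and Query Store views (the TOP 20 and the ordering are just my guess at what's relevant):

    -- resource usage over roughly the last hour, sampled every 15 seconds
    SELECT end_time, avg_data_io_percent, avg_log_write_percent, avg_cpu_percent
    FROM   sys.dm_db_resource_stats
    ORDER BY end_time DESC;

    -- statements doing the most physical reads according to Query Store
    SELECT TOP (20)
           qt.query_sql_text,
           SUM(rs.avg_physical_io_reads * rs.count_executions) AS total_physical_reads
    FROM   sys.query_store_runtime_stats AS rs
    JOIN   sys.query_store_plan          AS p  ON p.plan_id        = rs.plan_id
    JOIN   sys.query_store_query         AS q  ON q.query_id       = p.query_id
    JOIN   sys.query_store_query_text    AS qt ON qt.query_text_id = q.query_text_id
    GROUP BY qt.query_sql_text
    ORDER BY total_physical_reads DESC;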
In our company we have "office Monday": every office/shop/department (circa 2000+ distinct users) generates their reports, especially the shops (SSRS reports against a Tabular model at the 1500 compatibility level). We see very high resource usage for 3+ hours (100% CPU across multiple cores) and the session queue keeps growing and never flushes. A report that takes 2 minutes off-peak can take more than an hour because of the overload. This is an on-premises machine. For the rest of the week the problem doesn't occur (the workload is roughly 10 times lower and peak CPU usage is under 30%).
Unfortunately, from a business point of view, we cannot spread the load over the remaining days of the week. We also have no influence on how many users will run the reports at a given time (load distribution throughout the day).
What we have tried already:
rewriting report queries from the old MDX to DAX (always checking single-query performance with Server Timings in DAX Studio)
rewriting measures to be less expensive
tuning the model (for example, switching to less memory-hungry data types, removing unused columns)
We can't migrate this model to Azure.
We can't make any hardware changes on this machine.
Maybe we can change some server properties? Model properties? Connection properties?
Can we influence which reports/queries Tabular keeps cached when it runs out of resources? For example, for a group of shop reports that we know will generate many similar queries (e.g. only the shop number changes).
Any advice?
If you are reporting on the previous week, could you automate SSRS to output the reports on Sunday night?
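One rough way to do that, assuming standard SSRS subscriptions and the default ReportServer catalog database (the folder path and subscription GUID below are placeholders), is to fire the existing subscriptions from your own SQL Agent schedule:

    -- find the subscriptions you want to pre-run (placeholder folder path)
    SELECT s.SubscriptionID, c.[Path], s.Description
    FROM   ReportServer.dbo.Subscriptions AS s
    JOIN   ReportServer.dbo.[Catalog]     AS c ON c.ItemID = s.Report_OID
    WHERE  c.[Path] LIKE '/ShopReports/%';

    -- queue one of them to run now, e.g. from an Agent job on Sunday night
    EXEC ReportServer.dbo.AddEvent
         @EventType = 'TimedSubscription',
         @EventData = '00000000-0000-0000-0000-000000000000';  -- SubscriptionID from above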
I am thinking of using Google BigQuery to store real-time call records: around 3 million rows per day, inserted and never updated.
I have signed up for a trial account and run some tests.
I have a few concerns before I can go ahead with development:
When streaming data via PHP, it sometimes takes around 10-20 minutes for rows to become visible in my tables, and this is a show-stopper for us because network support engineers need this data in real time to troubleshoot quality issues.
Partitions: we can store data in daily partitions, but a single day's partition is around 2.5 GB, which pushes my query costs into the thousands per month. Is there any other way to bring the cost down? We would like to partition per hour, but there is no such support available.
If not BigQuery, what other solutions are out there in the market that can deliver similar performance and solve these problems?
You have the "streaming insert" option, which makes the records searchable within a few seconds (it has its price).
See: streaming-data-into-bigquery
Check table-decorators for limiting query scan.
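For example, a range decorator in legacy SQL can restrict the scan to recent data only. The table and column names below are made up; @-3600000- means "from one hour ago until now", expressed in milliseconds:

    -- legacy SQL only; scans roughly the last hour of the table instead of all of it
    SELECT caller, callee, quality_score
    FROM   [mydataset.calls@-3600000-]
    WHERE  quality_score < 3;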
I work for a fleet tracking company and this question is specifically about how I plan to do reports. Let me explain our environment. We have 1x Database, 1x Load Distributing process, and 3x Report Processing servers (let's assume these are equal in every way). When a customer requests a report, all the parameters of that report go in the database. I'm currently working on a load distributing app that will take pending reports from the database and delegate them to the 3 report processing servers that build and email the reports. When a server finishes a report (or an error arises), it notifies the load distributing app. Reports can come in all sizes, from one day's worth of GPS data for one vehicle to three months of GPS data for hundreds of vehicles.
I can think of a few ways to do the load balancing but I'm not quite happy with them. I could have each server only do 5 reports at most, but 1 server might get 5 small reports while another gets 5 large reports. I could do a "Round Robin" approach and just hand out the reports sequentially across the servers, but this still doesn't protect against overloading any of the servers.
The best idea I think I have right now is to keep a count of how much GPS data is needed by each report (an easy task to do) and as I assign reports to each server I keep a running total for each server. When a server finishes a report (and notifies the load balancer), subtract that report's amount of GPS data from the running total for that server. This way, I could assign the next report to the server with the smallest amount of GPS data to work with. I could also set a max so that a server cannot get over worked (the problem that is causing us to refactor our whole reports process to begin with). If there are more reports when all servers hit their max, it can just queue them up and attempt them later when the servers finish a few of their reports.
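In rough T-SQL terms, the assignment step I have in mind looks something like the sketch below (the table and the numbers are made up purely for illustration):

    -- hypothetical schema:
    --   dbo.ReportServers(ServerId, CurrentLoad, MaxLoad)   -- load measured in GPS rows
    DECLARE @GpsRowCount bigint = 250000;   -- size of the report being assigned (made up)

    -- pick the least-loaded server that still has room, and reserve the load on it
    WITH candidate AS (
        SELECT TOP (1) ServerId, CurrentLoad
        FROM   dbo.ReportServers WITH (UPDLOCK, READPAST)
        WHERE  CurrentLoad + @GpsRowCount <= MaxLoad
        ORDER BY CurrentLoad ASC
    )
    UPDATE candidate
    SET    CurrentLoad = CurrentLoad + @GpsRowCount
    OUTPUT inserted.ServerId;               -- no row returned = all servers at max, so queue it

    -- when the server reports the report finished (or failed), release the load:
    -- UPDATE dbo.ReportServers SET CurrentLoad = CurrentLoad - @GpsRowCount WHERE ServerId = @ServerId;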
I'm not convinced it's the best approach for finishing reports as quickly as possible; these are just the best ideas I have come up with so far.
How can I optimize my approach to load balancing reports of different sizes across multiple servers?
Assuming that you have only one major table which you select data from, I would configure one server to do all the big reports first and leave the other two to work from smallest to largest. Otherwise big reports might never get done.
For the smaller reports, you want to try, in the absence of anything better, to have them run 'similar' reports, meaning those that cluster around similar values in the index mainly used. For example, if a server has just completed a report for June 2011, then the next best report to run is for the same period, not a jump to November 2012. This depends on the actual table, but I am presuming you have lots of date-ordered data comprising the bulk of the selection. All you are really trying to do is group reports that are likely to reuse cached indexes etc., as this should give the best throughput.
I have a similar scheduling problem, and any queries that are directed to major tables go to one server (a slow queue) and anything else goes to another (a fast queue), with some exceptions for special cases.
Our database architecture consists of two SQL Server 2005 servers, each with an instance of the same database structure: one for all reads, and one for all writes. We use transactional replication to keep the read database up to date.
The two servers are very high-spec indeed (the write server has 32GB of RAM), and are connected via a fibre network.
When deciding upon this architecture we were led to believe that the latency for data to be replicated to the read server would be in the order of a few milliseconds (depending on load, obviously). In practice we are seeing latency of around 2-5 seconds in even the simplest of cases, which is unsatisfactory. By the simplest case, I mean updating a single value in a single row in a single table on the write DB and seeing how long it takes to observe the new value in the read database.
What factors should we be looking at to achieve latency below 1 second? Is this even achievable?
Alternatively, is there a different mode of replication we should consider? What is the best practice for the locations of the data and log files?
Edit
Thanks to all for the advice and insight - I believe that the latency periods we are experiencing are normal; we were misled by our db hosting company as to what latency times to expect!
We're using the technique described near the bottom of this MSDN article (under the heading "scaling databases"), and we'd failed to deal properly with this warning:
The consequence of creating such specialized databases is latency: a write is now going to take time to be distributed to the reader databases. But if you can deal with the latency, the scaling potential is huge.
We're now looking at implementing a change to our caching mechanism that enforces reads from the write database when an item of data is considered to be "volatile".
No. It's highly unlikely you could achieve sub-1s latency times with SQL Server transactional replication even with fast hardware.
If you can get 1 - 5 seconds latency then you are doing well.
From here:
Using transactional replication, it is possible for a Subscriber to be a few seconds behind the Publisher. With a latency of only a few seconds, the Subscriber can easily be used as a reporting server, offloading expensive user queries and reporting from the Publisher to the Subscriber.
In the following scenario (using the Customer table shown later in this section) the Subscriber was only four seconds behind the Publisher. Even more impressive, 60 percent of the time it had a latency of two seconds or less. The time is measured from when the record was inserted or updated at the Publisher until it was actually written to the subscribing database.
I would say it's definitely possible.
I would look at:
• Your network: run ping commands between the two servers and see if there are any issues. If the servers are next to each other you should have < 1 ms.
• Bottlenecks on the server: this could be network traffic volume (for example, network cards not being configured for 1 Gbps), anti-virus, or other things.
• The queries themselves: do some analysis on a few queries and see if you can identify indexes or locking which might be a problem. See if any of the selects on the read database might be blocking the writes. Add WITH (NOLOCK) and see if this makes a difference on one or two queries you're analyzing (a small sketch follows below).
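For example (a minimal sketch; the table is made up, and NOLOCK means the read runs at READ UNCOMMITTED, so you may see in-flight data):

    -- made-up table; dirty reads are possible with NOLOCK
    SELECT OrderId, Status, UpdatedAt
    FROM   dbo.Orders WITH (NOLOCK)
    WHERE  Status = 'Pending';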
Essentially you have a complicated system with a problem; you need to determine which component is at fault and fix it.
Transactional replication is probably best if the reports/selects you need to run must be up to date. If they don't, you could look at log shipping, although that would add some downtime with each import.
For the data/log files, make sure they're on separate drives so that performance is maximized.
Something to remember about transactional replication is that a single update now requires several operations before the change appears on the subscriber.
First you update the source table.
Next the Log Reader Agent sees the change and writes it to the distribution database.
Then the Distribution Agent sees the new entry in the distribution database, reads that change, and runs the corresponding stored procedure on the subscriber to update the row.
If you monitor the statement run times on the two servers you'll probably see that they are running in just a few milliseconds. However, it is the lag while waiting for the Log Reader and Distribution Agents to notice that they have work to do that is going to kill you.
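You can see which hop the time is going to with tracer tokens - a sketch, run at the Publisher in the published database (the publication name is a placeholder):

    -- post a tracer token and note its id
    DECLARE @token int;
    EXEC sys.sp_posttracertoken
         @publication     = N'MyPublication',
         @tracer_token_id = @token OUTPUT;

    -- a little later: latency to the distributor and on to each subscriber
    EXEC sys.sp_helptracertokenhistory
         @publication = N'MyPublication',
         @tracer_id   = @token;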
If you truly need sub-second processing time then you will want to look into writing your own processing engine to handle moving data from one server to another. I would recommend using SQL Service Broker for this, as that way everything is native to SQL Server and no third-party code has to be written.
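A bare-bones sketch of the Service Broker plumbing such an engine would sit on (all object names are placeholders, and the real work lives in the activation procedure on the receiving side that applies each change):

    -- placeholder names; create the matching objects (plus routes) on both servers
    CREATE MESSAGE TYPE RowChangeMessage VALIDATION = WELL_FORMED_XML;
    CREATE CONTRACT RowChangeContract (RowChangeMessage SENT BY INITIATOR);
    CREATE QUEUE RowChangeQueue;
    CREATE SERVICE RowChangeService ON QUEUE RowChangeQueue (RowChangeContract);

    -- on the write server: push a change as soon as it commits
    DECLARE @h uniqueidentifier;
    BEGIN DIALOG CONVERSATION @h
        FROM SERVICE RowChangeService
        TO SERVICE 'RowChangeService'
        ON CONTRACT RowChangeContract
        WITH ENCRYPTION = OFF;
    SEND ON CONVERSATION @h
        MESSAGE TYPE RowChangeMessage
        (N'<row table="Customer" id="42" op="U"/>');   -- made-up payload format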