Load balancing weighted reports? - optimization

I work for a fleet tracking company and this question is specifically about how I plan to do reports. Let me explain our environment. We have 1x Database, 1x Load Distributing process, and 3x Report Processing servers (let's assume these are equal in every way). When a customer requests a report, all the parameters of that report go in the database. I'm currently working on a load distributing app that will take pending reports from the database and delegate them to the 3 report processing servers that build and email the reports. When a server finishes a report (or an error arises), it notifies the load distributing app. Reports can come in all sizes, from 1 days worth of GPS data for 1 vehicles to 3 months of GPS data for hundreds of vehicles.
I can think of a few ways to do the load balancing but I'm not quite happy with them. I could have each server only do 5 reports at most, but 1 server might get 5 small reports while another gets 5 large reports. I could do a "Round Robin" approach and just hand out the reports sequentially across the servers, but this still doesn't protect against overloading any of the servers.
The best idea I think I have right now is to keep a count of how much GPS data is needed by each report (an easy task to do) and as I assign reports to each server I keep a running total for each server. When a server finishes a report (and notifies the load balancer), subtract that report's amount of GPS data from the running total for that server. This way, I could assign the next report to the server with the smallest amount of GPS data to work with. I could also set a max so that a server cannot get over worked (the problem that is causing us to refactor our whole reports process to begin with). If there are more reports when all servers hit their max, it can just queue them up and attempt them later when the servers finish a few of their reports.
I'm not convinced it's the best approach for finishing reports as quickly as possible. These are just the best I have come up with so far.
How can I optimize my approach to load balancing reports of different sizes across multiple servers?

Assuming that you have only one major table which you select data from, then I would configure one server to do all the big reports first and leave the other two to do smallest to largest. Otherwise big reports might never get done.
For the smaller reports, you want to try, in the absence of anything better, to have them try and run 'similar' reports, meaning those that cluster around similar values in the index mainly used. For example if a server has just completed a report for June 2011, then the next best report to run is same period, not jumping to November 2012. This is dependent on the actual table though, but I am presuming you have lots of date ordered data comprising the bulk of the selection. All you are really trying to do is group reports that are likely to reuse cached indexes/etc as this should give best throughput.
I have a similar scheduling problem, and any queries that are directed to major tables go one server (slow queue) and anything else goes to another ( fast queue), with some exceptions for special cases.

Related

Synchronize SQL Server databases

I have a new idea and question about that I would like to ask you.
We have a CRM application on-premise / in house. We use that application kind of 24X7. We also do billing and payroll on the same CRM database which is OLTP and also same thing with SSRS reports.
It looks like whenever we do operation in front end which does inserts and updates to couple of entities at the same time, our application gets frozen until that process finishes. e.g. extracting payroll for 500 employees for their activities during last 2 weeks. Basically it summarize total working hours pulls that numbers from database and writes/updates that record where it says extract has been accomplished. so for 500 employees we are looking at around 40K-50K rows for Insert/Select/Update statements together.
Nobody can do anything while this process runs! We are considering the following options to take care of this issue.
Running this process in off-hours
OR make a copy of DB of Dyna. CRM and do this operations(extracting thousands of records and running multiple reports) on copy.
My questions are:
how to create first of all copy and where to create it (best practices)?
How to make it synchronize in real-time.
if we do select statement operation in copy DB than it's OK, but if we do any insert/update on copy how to reflect that on actual live db? , in short how to make sure both original and copy DB are synchronize to each other in real time.
I know I asked too many questions, but being SQL person, stepping into CRM team and providing suggestion, you know what I am trying to say.
Thanks folks for your any suggestion in advance.
Well to answer your question in regards to the live "copy" of a database a good solution is an alwayson availability group.
https://blogs.technet.microsoft.com/canitpro/2013/08/19/step-by-step-creating-a-sql-server-2012-alwayson-availability-group/
Though I dont think that is what you are going to want in this situation. Alwayson availability groups are typically for database instances that require very low failure time frames. For example: If the primary DB server goes down in the cluster it fails over to a secondary in a second or two at the most and the end users only notice a slight hiccup for a second.
What I think you would find better is to look at those insert statements that are hitting your database server and seeing why they are preventing you from pulling data. If they are truly locking the table maybe changing a large amount of your reads to "nolock" reads might help remedy your situation.
It would also be helpful to know what kind of resources you have allocated and also if you have proper indexing on the core tables for your DB. If you dont have proper indexing then a lot of the queries can take longer then normal causing the locking your seeing.
Finally I would recommend table partitioning if the tables you are pulling against are to large. This can help with a lot of disk speed issues potentially and also help optimize your querys if you partition by time segment (i.e. make a new partition every X months so when a query pulls from one time segment they only pull from that one data file).
https://msdn.microsoft.com/en-us/library/ms190787.aspx
I would say you need to focus on efficiency more then a "copy database" as your volumes arent very high to be needing anything like that from the sounds of it. I currently have a sql server transaction database running with 10 million+ inserts on it a day and I still have live reports hit against it. You just need the resources and proper indexing to accommodate.

Refactoring my database schema

I'm refactoring my current schema and it's too abstract for me.
I monitor my servers with a homemade monitoring software. This software sends HTTP requests to a Rails web server with about ten different fields worth of information so I can get a quick overview of everything.
My current implementation:
server [id, name, created_date, edited_date, ..., etc ]
status_update [id, server_id, field1, field2, field3, created_date, edited_date, ..., etc]
I treat the servers as Users and status updates as Tweets. I delete any status_update on a server_id older than the tenth one just to keep from growing to infinity.
Though I'm starting to run into a few complications. I need to display information from the most recent status_update on the index page, I need to sort the servers based on status_update info, I need to store info from certain status_updates that may be way older than 10 status_updates old. It also seems like I'm going to start needing to store information from status_updates in both the server and status_update, which would cause hitting the DB multiple times on an insert. Thus, I am looking to refactor.
My requirements:
I only need to display information from the most recent update.
Having the next 9 status_updates helps debug if the system goes offline.
I need to be able to sort based on some info from most recent status_update.
I need the database to remain small (Heroku free).
Ideal performance, IE not hitting database more than once unless necessary.
Non-Complicated DB structure so I can pass it along.
Edit: Additional Info => I am looking to ultimately monitor about 150-200 servers (a lot for a hobby dev, but I'm cheap). Each monitoring service posts every five minutes or so unless something goes wrong. So, worst case scenario has me reaching max capacity every four hours.
I was thinking it would be nice to track when the last time X event happened, and what the result was. Thus, tracking that information would have to be moved to the server model itself since I'm wiping out old records and would lose the information after an hour or so. Though in retrospect, I could just save that info in memory using the monitoring service and send it up every five minutes or only once each time it changes. I could also simply edit that information only when it changes, so as to process less information on each request. Hm!
Efficiency
All ORMs, including ActiveRecord, are designed and built around certain tradeoffs. It's commonplace for ORMs to use several simple SELECT statements to do what a SQL developer would do with a single SELECT statement. You're probably not overwhelming Heroku with your queries.
There's no reasonable structural solution to this problem.
Size
Your "status_update" table should be able to hold an enormous number of rows. Heroku's hobby-dev plan allows 10,000 rows. How many servers do you seriously expect to monitor on a free plan? If I were you, I would delete old rows from it no more than once a day, or when I got a permission error. (On Heroku, certain permission errors mean you're over the row limit.)
It also seems like I'm going to start needing to store information
from status_updates in both the server and status_update, which would
cause hitting the DB multiple times on an insert.
This really makes little sense. Tweets don't require updates to the user account; status updates don't require updates about the server. This might suggest refactoring is in order, but I'd want to see either your models or your CREATE TABLE statements to be sure. (You can paste those into your question, and leave a comment here.)
Alternatives
I'd seriously consider running this Rails app on a local machine, writing data to a database on the local machine, especially if you intend to target 200 web servers. This would eliminate all Heroku row limits, and you don't really need to run it 24 hours a day if this is just a hobby. If you're doing this professionally, your income from it should easily cover the cost of a hobby-basic plan on Heroku. (Currently $9.00/month.) But even then I'd think hard about hosting this locally.

How should data be provided to a web server using a data warehouse?

We have data stored in a data warehouse as follows:
Price
Date
Product Name (varchar(25))
We currently only have four products. That changes very infrequently (on average once every 10 years). Once every business day, four new data points are added representing the day's price for each product.
On the website, a user can request this information by entering a date range and selecting one or more products names. Analytics shows that the feature is not heavily used (about 10 users requests per week).
It was suggested that the data warehouse should daily push (SFTP) a CSV file containing all data (currently 6718 rows of this data and growing by four each day) to the web server. Then, the web server would read data from the file and display that data whenever a user made a request.
Usually, the push would only be once a day, but more than one push could be possible to communicate (infrequent) price corrections. Even in the price correction scenario, all data would be delivered in the file. What are problems with this approach?
Would it be better to have the web server make a request to the data warehouse per user request? Or does this have issues such as a greater chance for network errors or performance issues?
Would it be better to have the web server make a request to the data warehouse per user request?
Yes it would. You have very little data, so there is no need to try and 'cache' this in some way. (Apart from the fact that CSV might not be the best way to do this).
There is nothing stopping you from doing these requests from the webserver to the database server. With as little information as this you will not find performance an issue, but even if it would be when everything grows, there is a lot to be gained on the database-side (indexes etc) that will help you survive the next 100 years in this fashion.
The amount of requests from your users (also extremely small) does not need any special treatment, so again, direct query would be the best.
Or does this have issues such as a greater chance for network errors or performance issues?
Well, it might, but that would not justify your CSV method. Examples and why you need not worry, could be
the connection with the databaseserver is down.
This is an issue for both methods, but with only one connection per day the change of a 1-in-10000 failures might seem to be better for once-a-day methods. But these issues should not come up very often, and if they do, you should be able to handle them. (retry request, give a message to user). This is what enourmous amounts of websites do, so trust me if I say that this will not be an issue. Also, think of what it would mean if your daily update failed? That would present a bigger problem!
Performance issues
as said, this is due to the amount of data and requests, not a problem. And even if it becomes one, this is a problem you should be able to catch at a different level. Use a caching system (non CSV) on the database server. Use a caching system on the webserver. Fix your indexes to stop performance from being a problem.
BUT:
It is far from strange to want your data-warehouse separated from your web system. If this is a requirement, and it surely could be, the best thing you can do is re-create your warehouse-database (the one I just defended as being good enough to query directly) on another machine. You might get good results by doing a master-slave system
your datawarehouse is a master-database: it sends all changes to the slave but is inexcessible otherwise
your 2nd database (on your webserver even) gets all updates from the master, and is read-only. you can only query it for data
your webserver cannot connect to the datawarehouse, but can connect to your slave to read information. Even if there was an injection hack, it doesn't matter, as it is read-only.
Now you don't have a single moment where you update the queried database (the master-slave replication will keep it updated always), but no chance that the queries from the webserver put your warehouse in danger. profit!
I don't really see how SQL injection could be a real concern. I assume you have some calendar type field that the user fills in to get data out. If this is the only form just ensure that the only field that is in it is a date then something like DROP TABLE isn't possible. As for getting access to the database, that is another issue. However, a separate file with just the connection function should do fine in most cases so that a user can't, say open your webpage in an HTML viewer and see your database connection string.
As for the CSV, I would have to say querying a database per user, especially if it's only used ~10 times weekly would be much more efficient than the CSV. I just equate the CSV as overkill because again you only have ~10 users attempting to get some information, to export an updated CSV every day would be too much for such little pay off.
EDIT:
Also if an attack is a big concern, which that really depends on the nature of the business, the data being stored, and the visitors you receive, you could always create a backup as another option. I don't really see a reason for this as your question is currently stated, but it is a possibility that even with the best security an attack could happen. That mainly just depends on if the attackers want the information you have.

What are the factors that affect the time taken to run a SQL on a database?

I have a query that runs on a data warehouse. I ran the report last month. It gave me some results in say x minutes. The same report when run on the same database without any modifications to the database returns the same results but in y minutes now.
y>x. The difference between the time is so large.
The amount of data and the indexes are also the same. There is no difference in them.
Now clients ask for me for a reason for this. What are the possible reasons for this?
You leave a lot of questions open
is the database running on a dedicated server.
do you run the reports from clients or directly on the server.
have there been changes to the phyisical network, have some settings been changed.
did they (by accident) change the protocol to communicate with the server (tcp, named-pipes, ...)
have you tried defragmenting
have you rebooted the server
do you have an execution plan before and after
Most likely the query plan has changed. Some minor difference in data has pushed the query optimisers calculations onto a new, less optimal plan.
Here are a few:
The amount of data in the warehouse has changed.
Indexes might have been modified.
Your warehouse is split across different servers and there is connectivity lag between them...
Your database server is processing something else as well due to which it has lesser memory and cpu for ur reports to run.

Is it possible to get sub-1-second latency with transactional replication?

Our database architecture consists of two Sql Server 2005 servers each with an instance of the same database structure: one for all reads, and one for all writes. We use transactional replication to keep the read database up-to-date.
The two servers are very high-spec indeed (the write server has 32GB of RAM), and are connected via a fibre network.
When deciding upon this architecture we were led to believe that the latency for data to be replicated to the read server would be in the order of a few milliseconds (depending on load, obviously). In practice we are seeing latency of around 2-5 seconds in even the simplest of cases, which is unsatisfactory. By a simplest case, I mean updating a single value in a single row in a single table on the write db and seeing how long it takes to observe the new value in the read database.
What factors should we be looking at to achieve latency below 1 second? Is this even achievable?
Alternatively, is there a different mode of replication we should consider? What is the best practice for the locations of the data and log files?
Edit
Thanks to all for the advice and insight - I believe that the latency periods we are experiencing are normal; we were mis-led by our db hosting company as to what latency times to expect!
We're using the technique described near the bottom of this MSDN article (under the heading "scaling databases"), and we'd failed to deal properly with this warning:
The consequence of creating such specialized databases is latency: a write is now going to take time to be distributed to the reader databases. But if you can deal with the latency, the scaling potential is huge.
We're now looking at implementing a change to our caching mechanism that enforces reads from the write database when an item of data is considered to be "volatile".
No. It's highly unlikely you could achieve sub-1s latency times with SQL Server transactional replication even with fast hardware.
If you can get 1 - 5 seconds latency then you are doing well.
From here:
Using transactional replication, it is
possible for a Subscriber to be a few
seconds behind the Publisher. With a
latency of only a few seconds, the
Subscriber can easily be used as a
reporting server, offloading expensive
user queries and reporting from the
Publisher to the Subscriber.
In the following scenario (using the
Customer table shown later in this
section) the Subscriber was only four
seconds behind the Publisher. Even
more impressive, 60 percent of the
time it had a latency of two seconds
or less. The time is measured from
when the record was inserted or
updated at the Publisher until it was
actually written to the subscribing
database.
I would say it's definately possible.
I would look at:
Your network
Run ping commands between the two servers and see if there are any issues
If the servers are next to each other you should have < 1 ms.
Bottlenecks on the server
This could be network traffic (volume)
Like network cards not being configured for 1GB/sec
Anti-virus or other things
Do some analysis on some queries and see if you can identify indexes or locking which might be a problem
See if any of the selects on the read database might be blocking the writes.
Add with (nolock), and see if this makes a difference on one or two queries you're analyzing.
Essentially you have a complicated system which you have a problem with, you need to determine which component is the problem and fix it.
Transactional replication is probably best if the reports / selects you need to run need to be up to date. If they don't you could look at log shipping, although that would add some down time with each import.
For data/log files, make sure they're on seperate drives so the performance is maximized.
Something to remember about transaction replication is that a single update now requires several operations to happen for that change to occur.
First you update the source table.
Next the log readers sees the change and writes the change to the distribution database.
Next the distribution agent sees the new entry in the distribution database and reads that change, then runs the correct stored procedure on the subscriber to update the row.
If you monitor the statement run times on the two servers you'll probably see that they are running in just a few milliseconds. However it is the lag time while waiting for the log reader and distribution agent to see that they need to do something which is going to kill you.
If you truly need sub second processing time then you will want to look into writing your own processing engine to handle data moving from one server to another. I would recommend using SQL Service Broker to handle this as this way everything is native to SQL Server and no third party code has to be written.