web application receiving millions of requests and leads to generating millions of row inserts per 30 seconds in SQL Server 2008 - sql

I am currently addressing a situation where our web application receives at least a Million requests per 30 seconds. So these requests will lead to generating 3-5 Million row inserts between 5 tables. This is pretty heavy load to handle. Currently we are using multi threading to handle this situation (which is a bit faster but unable to get a better CPU throughput). However the load will definitely increase in future and we will have to account for that too. After 6 months from now we are looking at double the load size we are currently receiving and I am currently looking at a possible new solution that is scalable and should be easy enough to accommodate any further increase to this load.
Currently with multi threading we are making the whole debugging scenario quite complicated and sometimes we are having problem with tracing issues.
FYI we are already utilizing the SQL Builk Insert/Copy that is mentioned in this previous post
Sql server 2008 - performance tuning features for insert large amount of data
However I am looking for a more capable solution (which I think there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
Also the solution should have a better utilization of the threads and processes. And I do not want my threads/processes to even wait to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert...however most of them will lead to some sql operation. The appliciation performs different types of transactions and these will lead to a lot of bulk sql operations. I am more concerned towards inserts and updates.
and these operations need not be real time there can be a bit lag...however processing them real time will be much helpful.

I think your problem looks more towards getting a better CPU throughput which will lead to a better performance. So I would probably look at something like an Asynchronous Processing where in a thread will never sit idle and you will probably have to maintain a queue in the form of a linked list or any other data structure that will suit your programming model.
The way this would work is your threads will try to perform a given job immediately and if there is anything that would stop them from doing it then they will push that job into the queue and these pushed items will be processed based on how it stores the items in the container/queue.
In your case since you are already using bulk sql operations you should be good to go with this strategy.
lemme know if this helps you.

Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partion to the data by client or geography or some other factor?
Since you are using SQL server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book muight help as well. Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high performance design including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.

Since you do not need to have your inserts/updates real time you might consider having two databases; one for reads and one for writes. Similar to having a OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the Insert/Update db. You would then create a publication process that would move data over to the read database at certain time intervals. When I have seen this in the past the data is usually moved over on a nightly bases when few people will be using the site. There are a number of options for moving the data over, but I would start by looking at SSIS.
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.

Related

Rapidly changing large data processing advise

My team has the following dilemma that we need some architectural/resources advise:
Note: Our data is semi-structured
Over-all Task:
We have a semi-large data that we process during the day
each day this "process" get executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during this update ALL other rows may change, as we aggregate these rows for UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use SQL db to store all the data and we retrieve/update as process requires
Unsolved problems and desired goals:
since this processes are user triggered we never know when to scale up/down, which causes high spikes and Azure doesnt make it easy to do autoscale based on demand without data warehouse which we are wanting to stay away from because of lack of aggregates and other various "buggy" issues
because of constant IO to the db we hit 100% of DTU when 1 process begins (we are using Azure P1 DB) which of course will force us to grow even larger if multiple processes start at the same time (which is very likely)
yet we understand the cost comes with high compute tasks, we think there is better way to go about this (SQL is about 99% optimized, so much left to do there)
We are looking for some tool that can:
Process large amount of transactions QUICKLY
Can handle constant updates of this large amount of data
supports all major aggregations
is "reasonably" priced (i know this is an arguable keyword, just take it lightly..)
Considered:
Apache Spark
we don't have ton of experience with HDP so any pros/cons here will certainly be useful (does the use case fit the tool??)
ArangoDB
seems promising.. Seems fast and has all aggregations we need..
Azure Data Warehouse
too many various issues we ran into, just didn't work for us.
Any GPU-accelerated compute or some other high-end ideas are also welcome.
Its hard to try them all and compare which one fits the best, as we have a fully functional system and are required to make adjustments to whichever way we go.
Hence, any before hand opinions are welcome, before we pull the trigger.

Is a Data-filled SQL table queryable while setting up a new index?

Given a live table in SQL with some non-trivial number of columns/entries, with one or more applications actively querying it, what would be the effect of introducing a new index on some column of this table? What takes priority? Serving the query, or constructing the index? Put another way, would setting up the index be experienced by the querying applications as a delay in getting their responses?
It is possible to use the database while indexing is taking place, but it's effects on performance is nearly impossible for us to say. A great deal about the optimizer is magic to anyone who hasn't worked on it themselves, and the answer could change greatly depending on which RDMS you're using. On top of that, your own hardware will play a huge part in the answer.
That being said, if you're primarily reading from the table, there's a good chance you won't see a major performance hit, if your system has the IO/CPU capabilities of handling both tasks at the same time. Inserting however, will be slowed down considerably.
Whether this impact is problematic will depend on your current system load, size of your tables, and what exactly it is you're indexing. Generally speaking, if you have a decent server, a lowish load, and a table with only a few million rows or less, I wouldn't expect to see a performance hit at all.

What Database for extensive logfile analysis?

The task is to filter and analyze a huge amount of logfiles (around 8TB) from a finished research project. The idea is to fill a database with the data to be able to run different analysis tasks later.
The values are stored comma separated. In principle the values are tuples of up to 5 values:
id, timestamp, type, v1, v2, v3, v4, v5
In a first try using MySQL I used one table with one log entry per row. So there is no direct relation between the log values. The downside here is slow querying of subsets.
Because there is no relation I looked into alternatives like NoSQL databases, and column based tables like hbase or cassandra seemed to be a perfect fit for this kind of data. But these systems are made for huge distributed systems, which we not have. In our case the analysis will run on a single machine or perhaps some VMs.
Which kind of database would fit this task? Is it worth to setup a single machine instance with hadoop+hbase... or is this all a bit over-sized?
What database would you choose to do high-performance logfile analysis?
EDIT: Maybe out of my question it is not clear that we cannot spend money for cloud services or new hardware. The Question is if there are benefits in using noSQL approaches instead of mySQL (especially for this data). If there are none, or if they are so small that the effort of setting up a noSQL system is not worth the benefit we can use our ESXi infrastructure and MySQL.
EDIT2: I'm still having the Problem here. I did further experiments with MySQL and just inserted a quarter of all available data. The insert is now running for over 2 days and is not yet finished. Currently there are 2,147,483,647 rows in my single table db. With indeces this takes 211,2 GiB of disk space. And this is just a quarter of all logging data...
A query of the form
SELECT * FROM `table` WHERE `timestamp`>=1342105200000 AND `timestamp`<=1342126800000 AND `logid`=123456 AND `unit`="UNIT40";
takes 761 seconds to complete, in this case returning one row.
There is a combined index on timestamp, logid, unit.
So I think this is not the way to go, because later in analysis I will have to get all entries in a time range and compare the datapoints.
I read bout MongoDB and Redis, but the problem with them is, that they are in Memory databases.
In the later analyzing process there will a very small amount of concurrent database access. In fact the analyzing will be run from one single machine.
I do not need redundancy. I would be able to regenerate the database in case of a failure.
When the database is once completely written, there would also be no need to update or add further row.
What do you think about alternatives like Redis, MongoDB and so on. When I get this right, i would need RAM in the dimension of my data...
Is this task even somehow possible with a single node system or with maybe two nodes?
well i personally would prefer the faster solution, as you said you need a high-perfomance analysis. the problem is, if you have to setup a whole new system to do so and the performance-improvement would be minor in relation to the additional effort you'd need, then stay with SQL.
in our company, we have a quite small Database containing not even half a GB of Data on the VM. the problem now is, as soon as you use a VM, you will have major performance issues, when opening the Database on VM you can go for a coffee in the meantime ;)
But if the time until the Database is loaded to cache is not so important it doesn't matter. It all depends on how much faster you think the new System will be, and how much effort you will have to put in it, but as i said i'd prefer the faster solution if you have to go for "high-performance analysis"

Web Caching Servers for SQL Server OLTP Env. Recommendations

I inhereted a high volume OLTP DB which I have free reign to improve as much as I find reasonably possible. The improvements already were very helpful but I want to take it to the next level. The data access patterns I found made it a good candidate IMO to cache the data on other servers and I would love to hear anyone's experience or recommendations with this type of setup.
We have a DB that gets about 3GB of data added to a table every day and reporting on it used to be very slow. The data does not change once it's put in, and no data gets inserted that is over a week old. Rows inputted within the last 3 days tend to see thousands of inserts between tens of millions of rows.
I was thinking of having data over 2 weeks old get pushed out to MongoDB. I could then have the 2 week sliding window data that is not pushed out to Mongo be, be cached by some kind of caching software so those get queried and displayed instead of the data being read out of the DB the whole time. I figure this way we still get full A.C.I.D compliance by having the DB engine validate all the data, have high read performance as it is not hitting the DB, then Mongo can take it when it is no longer a 'transaction'.
Anyone have any recommended solutions? I was looking at MemCached, but not quite sure if that's a good or even plausible solution. Thanks!
Another thing you could consider is using the new In-Memory OLTP feature in SQL Server 2014. That feature improves efficiency and scaling for OLTP workloads. You will potentially be able to get a lot more out of your existing server, without the need to consider specific caching mechanisms.
I don't have specific experience with SQL Server, but what you are describing does seem like a valid use case for MongoDB.
Note that while MongoDB can't directly handle transactions, it is capable of handling certain operations in an atomic fashion (see findAndModify, for instance). Additionally, with journaling enabled, you shouldn't have any reason to worry about durability. MongoDB is a reliable data store and will not lose or corrupt your data.
MongoDB itself can also act as a performant cache if you run a second deployment with journaling disabled. In this instance, writes will take place in memory and only be persisted to the disk every 60 seconds (unless otherwise configured). This will provide comparable performance to memcache which is solely in-memory while allowing you to keep your stack a bit simpler.
Hope this helps!

Design a database with a lot of new data

Im new to database design and need some guidance.
A lot of new data is inserted to my database throughout the day. (100k rows per day)
The data is never modified or deleted once it has been inserted.
How can I optimize this database for retrieval speed?
My ideas
Create two databases (and possible on different hard drives) and merge the two at night when traffic is low
Create some special indexes...
Your recommendation is highly appreciated.
UPDATE:
My database only has a single table.
100k/day is actually fairly low. 3M/month, 40M/year. You can store 10 years archive and not reach 1B rows.
The most important thing to choose in your design will be the clustered key(s). You need to make sure that they are narrow and can serve all the queries your application will normally use. Any query that will end up in table scan will completely trash your memory by fetching in the entire table. So, no surprises there, your driving factor in your design is the actual load you'll have: exactly what queries will you be running.
A common problem (more often neglected than not) with any high insert rate is that eventually every row inserted will have to be deleted. Not acknowledging this is a pipe dream. The proper strategy depends on many factors, but probably the best bet is on a sliding window partitioning scheme. See How to Implement an Automatic Sliding Window in a Partitioned Table. This cannot be some afterthought, the choice for how to remove data will permeate every aspect of your design and you better start making a strategy now.
The best tip I can give which all big sites use to speed up there website is:
CACHE CACHE CACHE
use redis/memcached to cache your data! Because memory is (blazingly)fast and disc I/O is expensive.
Queue writes
Also for extra performance you could queue up the writes in memory for a little while before flushing them to disc -> writting them to SQL database. Off course then you have the risk off losing data if you keep it in memory and your computer crashes or has power failure or something
Context missing
Also I don't think you gave us much context!
What I think is missing is:
architecture.
What kind of server are you having VPS/shared hosting.
What kind of Operating system does it have linux/windows/macosx
computer specifics like how much memory available, cpu etc.
a find your definition of data a bit vague. Could you not attach a diagram or something which explains your domain a little bit. For example something like
this using http://yuml.me/
Your requirements are way to general. For MS SQL server 100k (more or less "normal") records per days should not be a problem, if you have decent hardware. Obviously you want to write fast to the database, but you ask for optimization for retrieval performance. That does not match very well! ;-) Tuning a database is a special skill on its own. So you will never get the general answer you would like to have.