What is database throughput?

There is not much to ask beyond the title. What does it mean to say that an OLTP DB must have high throughput?
According to Wikipedia:
"In communication networks, such as Ethernet or packet radio, throughput or network throughput is the average rate of successful message delivery over a communication channel. This data may be delivered over a physical or logical link, or pass through a certain network node. The throughput is usually measured in bits per second (bit/s or bps), and sometimes in data packets per second or data packets per time slot."
So does this mean that OLTP databases need to have a high/quick insertion rate (i.e. avoiding deadlocks etc.)?
I was always under the impression that if we take a database for, say, the airline industry, it must have quick insertion, but at the same time a quick response time, since it is critical to its operation. And in many ways, shouldn't this be limited to the protocol involved in delivering the message/data to the database?
I am not trying to single out the "only" characteristic of OLTP systems. In general, I would like to understand what characteristics are inherent to an OLTP system.

In general, when you're talking about the "throughput" of an OLTP database, you're talking about the number of transactions per second: how many orders can the system take per second, how many web page requests can it service, how many customer inquiries can it handle? That tends to go hand-in-hand with discussions about how the OLTP system scales. If you double the number of customers hitting your site every month because the business is taking off, for example, will the OLTP systems be able to handle the increased throughput?
That is in contrast to OLAP/DSS systems, which are designed to run a relatively small number of transactions over much larger data volumes. There, you're worried far less about the number of transactions you can do than about how those transactions slow down as you add more data. If you're that wildly successful company, you probably want the same number and frequency of product-sales-by-region reports out of your OLAP system as you generate exponentially more sales. But you now have exponentially more data to crunch, which requires that you tune the database just to keep report performance constant.

Throughput doesn't have a single, fixed meaning in this context. Loosely, it means the number of transactions per second, but "write" transactions are different than "read" transactions, and sustained rates are different than peak rates. (And, of course, a 10-byte row is different than a 1000-byte row.)
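To make the read/write distinction concrete, here is a rough sketch that measures transactions per second separately for single-row writes and single-row reads. It uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the `orders` table, the row size, and the transaction count are assumptions, and the numbers it prints say nothing about any particular production system.

```python
import sqlite3
import time


def measure_tps(conn, statement, params_list):
    """Run each statement in its own transaction; return transactions per second."""
    start = time.perf_counter()
    for params in params_list:
        with conn:  # commits on success, rolls back on error
            conn.execute(statement, params).fetchall()
    elapsed = time.perf_counter() - start
    return len(params_list) / elapsed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")

n = 2000  # hypothetical number of transactions per measurement
write_tps = measure_tps(
    conn,
    "INSERT INTO orders (id, payload) VALUES (?, ?)",
    [(i, "x" * 100) for i in range(n)],  # 100-byte payload; row size affects the rate
)
read_tps = measure_tps(
    conn,
    "SELECT payload FROM orders WHERE id = ?",
    [(i,) for i in range(n)],
)

print(f"write: {write_tps:.0f} tx/s, read: {read_tps:.0f} tx/s")
```

A sustained-rate measurement would run much longer than this and against a disk-backed database; the sketch only shows why a single "transactions per second" figure needs qualifying.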
I stumbled on "Performance Metrics & Benchmarks: Berkeley DB" the other day when I was looking for something else. It's not a bad introduction to the different ways of measuring "how fast". Also, this article on database benchmarks is an entertaining read.


How to maintain high performance in a medical production database with millions of rows

I have an application that is used to chart patient data during an ICU stay (Electronic record).
Patients are usually connected to several devices (monitors, ventilator, dialysis etc.)
that send data in a one minute interval. An average of 1800 rows are inserted per hour per patient.
Until now, the integration engine has received the data and stored it in files on a dedicated drive.
The application reads it from there and plots it in graphs and data grids.
As there is a requirement for analysis, we're thinking about writing the incoming signals immediately into the DB.
But there are a lot of concerns with respect to performance. People in this work environment are especially sensitive when it comes to performance.
Are there any techniques besides proper indexing to mitigate a possible performance impact?
I'm thinking of a job that moves the data into a dedicated table, or maybe even into another database, e.g. one month after the record was closed.
Any experience with keeping the production DB small and lightweight?

I have no idea how many patients you have in your ICU, but unless you have thousands of patients you should not have any problems, as long as you stick to inserts, use bind variables, and have as many freelists as necessary. An insert will only create locks on the free list, so you can run as many parallel inserts as there are freelists available to determine a free block to write the data to. You may want to look at the discussion over at TKyte's site.
Generally speaking, 1,800 records per hour (or 10-20 times that) is not a lot for any decently sized Oracle database. If you want to be really fancy, you could partition on patient_id. This would be especially useful if you:
access the data for only one patient at a time, because you can just skip all other partitions;
want to remove the data for a patient en bloc once they leave the ICU: instead of DELETEing, you could just drop the patient's partitions.
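To put "thousands of patients" in perspective, the per-patient rate from the question can be converted into rows per second and rows per month. This is a rough sketch only; the patient counts below are hypothetical and chosen for illustration.

```python
ROWS_PER_HOUR_PER_PATIENT = 1800  # figure given in the question


def rows_per_second(patients):
    """Average insert rate for a given number of concurrently monitored patients."""
    return patients * ROWS_PER_HOUR_PER_PATIENT / 3600


def rows_per_30_days(patients):
    """Total rows accumulated over 30 days of continuous monitoring."""
    return patients * ROWS_PER_HOUR_PER_PATIENT * 24 * 30


for patients in (20, 200, 2000):  # hypothetical ICU sizes
    print(patients, rows_per_second(patients), rows_per_30_days(patients))
# 20 10.0 25920000
# 200 100.0 259200000
# 2000 1000.0 2592000000
```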

Define "immediately". One of the best things you can do to improve INSERT performance is to batch the commands instead of running them one-at-a-time.
Every SQL statement has overhead - sending the statement to the database, parsing it (remember to use bind variables so that you don't have to hard parse each statement), returning a message, etc. In many applications, that overhead takes longer than the actual INSERT.
You only have to do a small amount of batching to significantly reduce that overhead. Running an INSERT ALL with two rows instead of two separate statements reduces the overhead by 1/2, running with three rows reduces overhead by 2/3, etc. Waiting a minute, or even a few seconds, can make a big difference.
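The arithmetic above, and the batching idea itself, can be sketched as follows. This is an illustration under stated assumptions: it uses Python's built-in sqlite3 module and `executemany` rather than Oracle's INSERT ALL, the `signals` table and its columns are hypothetical, and the sample values are made up. The overhead model assumes one fixed cost per statement, which is a simplification.

```python
import sqlite3
from fractions import Fraction


def overhead_saved(rows_per_statement):
    """Fraction of per-statement overhead saved by batching, assuming a fixed cost per statement."""
    return 1 - Fraction(1, rows_per_statement)


def insert_batch(conn, rows):
    """Insert all buffered rows with one parameterized statement and one commit."""
    with conn:  # single transaction for the whole batch
        conn.executemany(
            "INSERT INTO signals (patient_id, recorded_at, value) VALUES (?, ?, ?)",
            rows,
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signals (patient_id INTEGER, recorded_at TEXT, value REAL)")

# Hypothetical buffer: 30 one-minute readings for one patient.
buffered = [(1, f"2024-01-01T00:{m:02d}:00", 36.5 + m / 100) for m in range(30)]
insert_batch(conn, buffered)

print(overhead_saved(2), overhead_saved(3))  # 1/2 2/3
```

The bound parameters (`?`) play the role of the bind variables mentioned above: the statement text stays constant, so it does not need to be re-parsed for every row.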
As long as you avoid the common row-by-row blunders, an Oracle database with "millions" of rows is nothing to worry about. You don't need to think about cryptic table settings or replication yet.

Rapidly changing large data processing advice

My team has the following dilemma, and we need some architecture/resource advice:
Note: Our data is semi-structured
Over-all Task:
We have a semi-large data set that we process during the day
this "process" gets executed 1-5 times a day
each "process" takes anywhere from 30 minutes to 5 hours
semi-large data = ~1 million rows
each row gets updated anywhere from 1-10 times during the process
during this update ALL other rows may change, as we aggregate these rows for the UI
What we are doing currently:
our current system is functional, yet expensive and inconsistent
we use SQL db to store all the data and we retrieve/update as process requires
Unsolved problems and desired goals:
since these processes are user-triggered, we never know when to scale up/down, which causes high spikes, and Azure doesn't make it easy to autoscale on demand without Data Warehouse, which we want to stay away from because of its lack of aggregates and various other "buggy" issues
because of constant IO to the db we hit 100% of DTU when 1 process begins (we are using an Azure P1 DB), which of course will force us to grow even larger if multiple processes start at the same time (which is very likely)
we understand that cost comes with compute-heavy tasks, yet we think there is a better way to go about this (our SQL is about 99% optimized, so there is not much left to do there)
We are looking for some tool that can:
Process large amount of transactions QUICKLY
Can handle constant updates of this large amount of data
supports all major aggregations
is "reasonably" priced (I know this is an arguable keyword, just take it lightly)
Apache Spark
we don't have a ton of experience with HDP, so any pros/cons here will certainly be useful (does the use case fit the tool?)
seems promising: it seems fast and has all the aggregations we need
Azure Data Warehouse
too many various issues we ran into, just didn't work for us.
Any GPU-accelerated compute or some other high-end ideas are also welcome.
It's hard to try them all and compare which one fits best, as we have a fully functional system and would have to adapt it to whichever way we go.
Hence, any opinions beforehand are welcome, before we pull the trigger.

What is the best way to store highly parametrized entities?

OK, let me try to explain this in more detail.
I am developing a diagnostic system for airplanes. Let's imagine that each airplane has 6 to 8 on-board computers. Each computer has more than 200 different parameters. The diagnostic system receives all these parameters in a binary-formatted package; I then convert the data according to formulas (to km, km/h, rpm, min, sec, pascals and so on) and must store it somehow in a database. New data must be handled every 10-20 seconds and persisted again.
We store the data for further analytic processing.
Requirements of storage:
support sharding and replication
fast read: support btree-indexing
fast write
So, I calculated the average disk or RAM usage per plane per day. It is about 10-20 MB of data. With an estimated load of 100 airplanes per day, that is up to 2 GB of data per day.
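The estimate can be checked with a few lines of arithmetic. This is only a sanity check using the figures stated above (decimal units, 1 GB = 1000 MB); the yearly projection assumes the same load every day, which is an assumption and not something stated here.

```python
MB_PER_PLANE_PER_DAY = (10, 20)  # low and high estimate per plane per day
PLANES_PER_DAY = 100

# Daily volume in GB for the low and high per-plane estimates.
low_gb, high_gb = (mb * PLANES_PER_DAY / 1000 for mb in MB_PER_PLANE_PER_DAY)

# Upper bound per year in TB, assuming the same load on all 365 days.
yearly_tb = high_gb * 365 / 1000

print(f"{low_gb} to {high_gb} GB per day")  # 1.0 to 2.0 GB per day
print(f"up to {yearly_tb:.2f} TB per year")  # up to 0.73 TB per year
```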
It seems that storing all the data in RAM (memcached-like stores: Redis, Membase) is not suitable (too expensive). However, I am now looking at MongoDB. Since it can use both RAM and disk, it supports all of the requirements above.
Please share your experience and advice.

There is a helpful article on NoSQL DBMS comparison.
You may also find information about their ranking and popularity, by category.
Given your requirements, Apache Cassandra would seem to be a candidate due to its linear scalability, column indexes, map/reduce, materialized views and powerful built-in caching.

1 or many sql tables for persisting "families" of properties about one object?

Our application (using a SQL Server 2008 R2 back-end) stores data about remote hardware devices reporting back to our servers over the Internet. There are a few "families" of information we have about each device, each stored by a different server application into a shared database:
static configuration information stored by users using our web app. e.g. Physical Location, Friendly Name, etc.
logged information about device behavior, e.g. last reporting time, date the device first came online, whether device is healthy, etc.
expensive information re-computed by scheduled jobs, e.g. average signal strength, average length of transmission, historical failure rates, etc.
These properties are all scalar values reflecting the most current data we have about a device. We have a separate way to store historical information.
The largest number of device instances we have to worry about will be around 100,000, so this is not a "big data" problem. In most cases a database will have 10,000 devices or less to worry about.
Writes to the data about an individual device happen infrequently, typically every few hours. It's theoretically possible for a scheduled task, user-inputted configuration changes, and dynamic data to all make updates for the same device at the same time, but this seems very rare. Reads are more frequent: probably 10 reads per minute against at least one device in a database, and several times per hour for a full scan of some properties of all devices described in a database.
Deletes are relatively rare; in fact, in many cases we only "soft delete" devices so we can use them for historical reporting. New device inserts are more common, perhaps a few every day.
There are (at least) two obvious ways to store this data in our SQL database:
The current design of our application stores each of these families of information in separate tables, each with a clustered index on a Device ID primary key. One server application writes to one table each.
An alternate implementation that's been proposed is to use one large table, and create covering indexes as needed to accelerate queries for groups of properties (e.g. all static info, all reliability info, etc.) that are frequently queried together.
My question: is there a clearly superior option? If the answer is "it depends" then what are the circumstances which would make "one large table" or "multiple tables" better?
Answers should consider: performance, maintainability of the DB itself, maintainability of the code that reads/writes rows, and reliability in the face of unexpected behavior. Maintainability and reliability are probably a higher priority for us than performance, if we have to trade off.

I don't know of a clearly superior option, and I don't know the SQL Server architecture. But I would go for the first option, with separate tables for different families of data. Some advantages could be:
granting access to specific sets of data (may be desirable for future applications)
archiving different families of data at different rates
partial functionality of the application in the case of maintenance on a part (some tables available while another is restored)
indexing and partitioning/sharding can be performed on different attributes (static information could be partitioned on device id, logging information on date)
different families can be assigned to different cache areas (so the static data can remain in a more "static" cache, and more rapidly changing logging type data can be in another "rolling" cache area)
smaller rows pack more rows into a block which means fewer block pulls to scan a table for a specific attribute
less chance of row chaining if altering a table to add a column, and easier maintenance if you do
easier to understand the data when separated into logical units (families)
I wouldn't consider table joining as a disadvantage when properly indexed. But more tables will mean more moving parts and the need for greater awareness/documentation on what is going on.

The first option is the recognized "standard" way to store such data in a relational database, although a good design would probably result in even more tables. Relational database software such as SQL Server was designed to store and retrieve data in multiple tables quickly and efficiently.
In addition, such designs allow for great flexibility, both in changing the database to store extra data and in allowing unexpected/unusual queries against the data stored.
The single-table option sounds beguilingly simple to practitioners unfamiliar with relational databases. In practice, such designs perform very badly, are difficult to manage, and lead to a high number of deadlocks and timeouts.
They also lead to development paralysis: you cannot add a requested feature because it cannot be done without a total redesign of the "simple" database schema.
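A minimal sketch of the separate-tables option, one table per family and all keyed on the device ID, might look like the following. It uses SQLite through Python's sqlite3 module purely so the example is self-contained; the question's system runs on SQL Server 2008 R2, and all table names, column names, and sample values here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per "family" of properties, each keyed on the device ID.
CREATE TABLE device_config (
    device_id INTEGER PRIMARY KEY,
    friendly_name TEXT,
    physical_location TEXT
);
CREATE TABLE device_status (
    device_id INTEGER PRIMARY KEY REFERENCES device_config(device_id),
    last_report TEXT,
    is_healthy INTEGER
);
CREATE TABLE device_stats (
    device_id INTEGER PRIMARY KEY REFERENCES device_config(device_id),
    avg_signal_strength REAL
);
""")

# Each server application writes only to its own table.
conn.execute("INSERT INTO device_config VALUES (1, 'Pump A', 'Building 3')")
conn.execute("INSERT INTO device_status VALUES (1, '2024-01-01T12:00:00', 1)")
conn.execute("INSERT INTO device_stats VALUES (1, -61.5)")

# Readers that need several families join on the shared key.
row = conn.execute("""
    SELECT c.friendly_name, s.is_healthy, t.avg_signal_strength
    FROM device_config AS c
    LEFT JOIN device_status AS s ON s.device_id = c.device_id
    LEFT JOIN device_stats AS t ON t.device_id = c.device_id
    WHERE c.device_id = ?
""", (1,)).fetchone()

print(row)  # ('Pump A', 1, -61.5)
```

The LEFT JOINs mean a device still shows up if one writer has not yet stored its family of properties, which is one way to tolerate the writers running independently.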

In terms of today's technology, are these meaningful concerns about data size?

We're adding extra login information to an existing database record on the order of 3.85KB per login.
There are two concerns about this:
1) Is this too much on-the-wire data added per login?
2) Is this too much extra data we're storing in the database per login?
Given today's technology, are these valid concerns?
We don't have concrete usage figures, but we average about 5,000 logins per month. We hope to scale to larger customers; however, that would still be in the tens of thousands per month, not thousands per second.
In the US (our market) broadband has 60% market adoption.

Assuming you have ~80,000 logins per month, you would be adding roughly 3.7 GB per YEAR to your database table (80,000 × 3.85 KB × 12 months).
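The figure can be reproduced with a quick calculation. This sketch uses decimal units (1 GB = 1,000,000 KB) and the 3.85 KB per login from the question; the 80,000 logins per month is this answer's assumption, while 5,000 is the question's current average.

```python
KB_PER_LOGIN = 3.85  # extra data per login, from the question


def yearly_gb(logins_per_month, kb_per_login=KB_PER_LOGIN):
    """Extra data added per year, in GB (decimal units)."""
    return logins_per_month * kb_per_login * 12 / 1_000_000


print(f"{yearly_gb(5_000):.2f} GB/year")   # 0.23 GB/year
print(f"{yearly_gb(80_000):.2f} GB/year")  # 3.70 GB/year
```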
If you are using a decent RDBMS like MySQL, PostgreSQL, SQL Server, Oracle, etc., this is a laughable amount of data and traffic. After several years, you might want to start looking at archiving some of it. But by then, who knows what the application will look like?
It's always important to consider how you are going to be querying this data, so that you don't run into performance bottlenecks. Without those details, I cannot comment very usefully on that aspect.
But to answer your concern, do not be concerned. Just always keep thinking ahead.

How many users do you have? How often do they have to log in? Are they likely to be on fast connections, or damp pieces of string? Do you mean you're really adding 3.85K per time someone logs in, or per user account? How long do you have to store the data? What benefit does it give you? How does it compare with the amount of data you're already storing? (i.e. is most of your data going to be due to this new part, or will it be a drop in the ocean?)
In short - this is a very context-sensitive question :)

Given that storage and hardware are so cheap these days (relatively speaking, of course), this should not be a concern. Obviously, if you need the data then you need the data! You can use replication to several locations so that the added data doesn't need to move as far over the wire (such as a server on the west coast and one on the east coast). You can manage your data by separating it by state to minimize the size of your tables (similar to what banks do: choose the state as part of the login process so that they look to the right data store). You can use horizontal partitioning to minimize the number of records per table and keep your queries speedy. There are lots of ways to keep large data optimized. Also look into Lucene if you plan to do lots of reads of this data.

In terms of today's average server technology it's not a problem. In terms of your server technology it could be a problem. You need to provide more info.
In terms of storage, this is peanuts, although you want to eventually archive or throw out old data.
In terms of network traffic, this is not much on the server end, but it will affect the speed at which your website appears to load and function for a good portion of customers. Although many have broadband, someone somewhere will try it on EDGE, on a modem, or while using BitTorrent heavily; your site will then appear slow or malfunction altogether, and you'll get loud complaints all over the web. Does it matter? If your users really need your service, they can surely wait. If you are developing the next Twitter, the increase in page load time is hardly acceptable.