Design a database with a lot of new data - sql

Im new to database design and need some guidance.
A lot of new data is inserted to my database throughout the day. (100k rows per day)
The data is never modified or deleted once it has been inserted.
How can I optimize this database for retrieval speed?
My ideas
Create two databases (and possible on different hard drives) and merge the two at night when traffic is low
Create some special indexes...
Your recommendation is highly appreciated.
UPDATE:
My database only has a single table.

100k/day is actually fairly low. 3M/month, 40M/year. You can store 10 years archive and not reach 1B rows.
The most important thing to choose in your design will be the clustered key(s). You need to make sure that they are narrow and can serve all the queries your application will normally use. Any query that will end up in table scan will completely trash your memory by fetching in the entire table. So, no surprises there, your driving factor in your design is the actual load you'll have: exactly what queries will you be running.
A common problem (more often neglected than not) with any high insert rate is that eventually every row inserted will have to be deleted. Not acknowledging this is a pipe dream. The proper strategy depends on many factors, but probably the best bet is on a sliding window partitioning scheme. See How to Implement an Automatic Sliding Window in a Partitioned Table. This cannot be some afterthought, the choice for how to remove data will permeate every aspect of your design and you better start making a strategy now.

The best tip I can give which all big sites use to speed up there website is:
CACHE CACHE CACHE
use redis/memcached to cache your data! Because memory is (blazingly)fast and disc I/O is expensive.
Queue writes
Also for extra performance you could queue up the writes in memory for a little while before flushing them to disc -> writting them to SQL database. Off course then you have the risk off losing data if you keep it in memory and your computer crashes or has power failure or something
Context missing
Also I don't think you gave us much context!
What I think is missing is:
architecture.
What kind of server are you having VPS/shared hosting.
What kind of Operating system does it have linux/windows/macosx
computer specifics like how much memory available, cpu etc.
a find your definition of data a bit vague. Could you not attach a diagram or something which explains your domain a little bit. For example something like
this using http://yuml.me/

Your requirements are way to general. For MS SQL server 100k (more or less "normal") records per days should not be a problem, if you have decent hardware. Obviously you want to write fast to the database, but you ask for optimization for retrieval performance. That does not match very well! ;-) Tuning a database is a special skill on its own. So you will never get the general answer you would like to have.

Related

Is a Data-filled SQL table queryable while setting up a new index?

Given a live table in SQL with some non-trivial number of columns/entries, with one or more applications actively querying it, what would be the effect of introducing a new index on some column of this table? What takes priority? Serving the query, or constructing the index? Put another way, would setting up the index be experienced by the querying applications as a delay in getting their responses?
It is possible to use the database while indexing is taking place, but it's effects on performance is nearly impossible for us to say. A great deal about the optimizer is magic to anyone who hasn't worked on it themselves, and the answer could change greatly depending on which RDMS you're using. On top of that, your own hardware will play a huge part in the answer.
That being said, if you're primarily reading from the table, there's a good chance you won't see a major performance hit, if your system has the IO/CPU capabilities of handling both tasks at the same time. Inserting however, will be slowed down considerably.
Whether this impact is problematic will depend on your current system load, size of your tables, and what exactly it is you're indexing. Generally speaking, if you have a decent server, a lowish load, and a table with only a few million rows or less, I wouldn't expect to see a performance hit at all.

What Database for extensive logfile analysis?

The task is to filter and analyze a huge amount of logfiles (around 8TB) from a finished research project. The idea is to fill a database with the data to be able to run different analysis tasks later.
The values are stored comma separated. In principle the values are tuples of up to 5 values:
id, timestamp, type, v1, v2, v3, v4, v5
In a first try using MySQL I used one table with one log entry per row. So there is no direct relation between the log values. The downside here is slow querying of subsets.
Because there is no relation I looked into alternatives like NoSQL databases, and column based tables like hbase or cassandra seemed to be a perfect fit for this kind of data. But these systems are made for huge distributed systems, which we not have. In our case the analysis will run on a single machine or perhaps some VMs.
Which kind of database would fit this task? Is it worth to setup a single machine instance with hadoop+hbase... or is this all a bit over-sized?
What database would you choose to do high-performance logfile analysis?
EDIT: Maybe out of my question it is not clear that we cannot spend money for cloud services or new hardware. The Question is if there are benefits in using noSQL approaches instead of mySQL (especially for this data). If there are none, or if they are so small that the effort of setting up a noSQL system is not worth the benefit we can use our ESXi infrastructure and MySQL.
EDIT2: I'm still having the Problem here. I did further experiments with MySQL and just inserted a quarter of all available data. The insert is now running for over 2 days and is not yet finished. Currently there are 2,147,483,647 rows in my single table db. With indeces this takes 211,2 GiB of disk space. And this is just a quarter of all logging data...
A query of the form
SELECT * FROM `table` WHERE `timestamp`>=1342105200000 AND `timestamp`<=1342126800000 AND `logid`=123456 AND `unit`="UNIT40";
takes 761 seconds to complete, in this case returning one row.
There is a combined index on timestamp, logid, unit.
So I think this is not the way to go, because later in analysis I will have to get all entries in a time range and compare the datapoints.
I read bout MongoDB and Redis, but the problem with them is, that they are in Memory databases.
In the later analyzing process there will a very small amount of concurrent database access. In fact the analyzing will be run from one single machine.
I do not need redundancy. I would be able to regenerate the database in case of a failure.
When the database is once completely written, there would also be no need to update or add further row.
What do you think about alternatives like Redis, MongoDB and so on. When I get this right, i would need RAM in the dimension of my data...
Is this task even somehow possible with a single node system or with maybe two nodes?
well i personally would prefer the faster solution, as you said you need a high-perfomance analysis. the problem is, if you have to setup a whole new system to do so and the performance-improvement would be minor in relation to the additional effort you'd need, then stay with SQL.
in our company, we have a quite small Database containing not even half a GB of Data on the VM. the problem now is, as soon as you use a VM, you will have major performance issues, when opening the Database on VM you can go for a coffee in the meantime ;)
But if the time until the Database is loaded to cache is not so important it doesn't matter. It all depends on how much faster you think the new System will be, and how much effort you will have to put in it, but as i said i'd prefer the faster solution if you have to go for "high-performance analysis"

web application receiving millions of requests and leads to generating millions of row inserts per 30 seconds in SQL Server 2008

I am currently addressing a situation where our web application receives at least a Million requests per 30 seconds. So these requests will lead to generating 3-5 Million row inserts between 5 tables. This is pretty heavy load to handle. Currently we are using multi threading to handle this situation (which is a bit faster but unable to get a better CPU throughput). However the load will definitely increase in future and we will have to account for that too. After 6 months from now we are looking at double the load size we are currently receiving and I am currently looking at a possible new solution that is scalable and should be easy enough to accommodate any further increase to this load.
Currently with multi threading we are making the whole debugging scenario quite complicated and sometimes we are having problem with tracing issues.
FYI we are already utilizing the SQL Builk Insert/Copy that is mentioned in this previous post
Sql server 2008 - performance tuning features for insert large amount of data
However I am looking for a more capable solution (which I think there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
Also the solution should have a better utilization of the threads and processes. And I do not want my threads/processes to even wait to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert...however most of them will lead to some sql operation. The appliciation performs different types of transactions and these will lead to a lot of bulk sql operations. I am more concerned towards inserts and updates.
and these operations need not be real time there can be a bit lag...however processing them real time will be much helpful.
I think your problem looks more towards getting a better CPU throughput which will lead to a better performance. So I would probably look at something like an Asynchronous Processing where in a thread will never sit idle and you will probably have to maintain a queue in the form of a linked list or any other data structure that will suit your programming model.
The way this would work is your threads will try to perform a given job immediately and if there is anything that would stop them from doing it then they will push that job into the queue and these pushed items will be processed based on how it stores the items in the container/queue.
In your case since you are already using bulk sql operations you should be good to go with this strategy.
lemme know if this helps you.
Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partion to the data by client or geography or some other factor?
Since you are using SQL server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book muight help as well. Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high performance design including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.
Since you do not need to have your inserts/updates real time you might consider having two databases; one for reads and one for writes. Similar to having a OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the Insert/Update db. You would then create a publication process that would move data over to the read database at certain time intervals. When I have seen this in the past the data is usually moved over on a nightly bases when few people will be using the site. There are a number of options for moving the data over, but I would start by looking at SSIS.
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.

gDatabase Optimization: Need a really big database to test some of the features of sql server

I have done database optimization for dbs upto 3GB size. Need a really large database to test optimization.
Simply generating a lot of data and throwing it into a table proves nothing about the DBMS, the database itself, the queries being issued against it, or the applications interacting with them, all of which factor into the performance of a database-dependent system.
The phrase "I have done database optimization for [databases] up to 3 GB" is highly suspect. What databases? On what platform? Using what hardware? For what purposes? For what scale? What was the model? What were you optimizing? What was your budget?
These same questions apply to any database, regardless of size. I can tell you first-hand that "optimizing" a 250 GB database is not the same as optimizing a 25 GB database, which is certainly not the same as optimizing a 3 GB database. But that is not merely on account of the database size, it is because databases that contain 250 GB of data invariably deal with requirements that are vastly different from those addressed by a 3 GB database.
There is no magic size barrier at which you need to change your optimization strategy; every optimization requires in-depth knowledge of the specific data model and its usage requirements. Maybe you just need to add a few indexes. Maybe you need to remove a few indexes. Maybe you need to normalize, denormalize, rewrite a couple of bad queries, change locking semantics, create a data warehouse, implement caching at the application layer, or look into the various kinds of vertical scaling available for your particular database platform.
I submit that you are wasting your time attempting to create a "really big" database for the purposes of trying to "optimize" it with no specific requirements in mind. Various data-generation tools are available for when you need to generate data fitting specific patterns for testing against a specific set of scenarios, but until you have that information on hand, you won't accomplish very much with a database full of unorganized test data.
The best way to do this is to create your schema and write a script to populate it with lots of random(ish) dummy data. Random, meaning that your text-fields don't necessarily have to make sense. 'ish', meaning that the data distribution and patterns should generally reflect your real-world DB usage.
Edit: a quick Google search reveals a number of commercial tools that will do this for you if you don't want to write your own populate scripts: DB Data Generator, DTM Data Generator. Disclaimer: I've never used either of these and can't really speak to their quality or usefulness.
Here is a free procedure I wrote to generate Random person names. Quick and dirty, but it works and might help.
http://www.joebooth-consulting.com/products/genRandNames.sql
I use Red-Gate's Data Generator regularly to test out problems as well as loads on real systems and it works quite well. That said, I would agree with Aaronnaught's sentiment in that the overall size of the database isn't nearly as important as the usage patterns and the business model. For example, generating 10 GB of data on a table that will eventually get no traffic will not provide any insight into optimization. The goal is to replicate the expected transaction and storage loads you anticipate to occur in order to identify bottlenecks before they occur.

Performance Tuning PostgreSQL

Keep in mind that I am a rookie in the world of sql/databases.
I am inserting/updating thousands of objects every second. Those objects are actively being queried for at multiple second intervals.
What are some basic things I should do to performance tune my (postgres) database?
It's a broad topic, so here's lots of stuff for you to read up on.
EXPLAIN and EXPLAIN ANALYZE is extremely useful for understanding what's going on in your db-engine
Make sure relevant columns are indexed
Make sure irrelevant columns are not indexed (insert/update-performance can go down the drain if too many indexes must be updated)
Make sure your postgres.conf is tuned properly
Know what work_mem is, and how it affects your queries (mostly useful for larger queries)
Make sure your database is properly normalized
VACUUM for clearing out old data
ANALYZE for updating statistics (statistics target for amount of statistics)
Persistent connections (you could use a connection manager like pgpool or pgbouncer)
Understand how queries are constructed (joins, sub-selects, cursors)
Caching of data (i.e. memcached) is an option
And when you've exhausted those options: add more memory, faster disk-subsystem etc. Hardware matters, especially on larger datasets.
And of course, read all the other threads on postgres/databases. :)
First and foremost, read the official manual's Performance Tips.
Running EXPLAIN on all your queries and understanding its output will let you know if your queries are as fast as they could be, and if you should be adding indexes.
Once you've done that, I'd suggest reading over the Server Configuration part of the manual. There are many options which can be fine-tuned to further enhance performance. Make sure to understand the options you're setting though, since they could just as easily hinder performance if they're set incorrectly.
Remember that every time you change a query or an option, test and benchmark so that you know the effects of each change.
Actually there are some simple rules which will get you in most cases enough performance:
Indices are the first part. Primary keys are automatically indexed. I recommend to put indices on all foreign keys. Further put indices on all columns which are frequently queried, if there are heavily used queries on a table where more than one column is queried, put an index on those columns together.
Memory settings in your postgresql installation. Set following parameters higher:
.
shared_buffers, work_mem, maintenance_work_mem, temp_buffers
If it is a dedicated database machine you can easily set the first 3 of these to half the ram (just be carefull under linux with shared buffers, maybe you have to adjust the shmmax parameter), in any other cases it depends on how much ram you would like to give to postgresql.
http://www.postgresql.org/docs/8.3/interactive/runtime-config-resource.html
http://wiki.postgresql.org/wiki/Performance_Optimization
The absolute minimum I'll recommend is the EXPLAIN ANALYZE command. It will show a breakdown of subqueries, joins, et al., all the time showing the actual amount of time consumed in the operation. It will also alert you to sequential scans and other nasty trouble.
It is the best way to start.
Put fsync = off in your posgresql.conf, if you trust your filesystem, otherwise each postgresql operation will be imediately written to the disk (with fsync system call).
We have this option turned off on many production servers since quite 10 years, and we never had data corruptions.