What is a good way to manage large ever growing tables in a database? - sql

I am building a web application for medical record keeping. A requirement for this application is logging all changes (view, create, update, delete) to a patients data and pretty much any other useful info in the system (login, cron run, data export, etc).
I am storing the data into a database table currently which is working fine. However it is likely this table will grow unruly very quickly and bloat the database. I am not allowed to delete log entries.
My current plan is to choose an arbitrary size (such as 1 million entries, large but still manageable). When the table hits 1 million entries I move 100,000 oldest entries into a file and store it onto our file server.
Does anyone have any experience with this issue that has other/better ideas on how to handle it?
Additional info:
My primary concern is nothing will ever be deleted from this data. However the data does not necessarily need to be accessed after several months. Since this data could logically hit 1 Billion entries in a matter of a couple years (and I have 300 copies of this db that all include this table) what is a good way to manage the size and performance. This table needs to be on a pager which is obviously going to be an issue when it breaks 1 Million let alone 1 Billion.

Cases like this are tailor-made for partitioning. Using a partitioning strategy, you span your data across multiple tables. This helps to balance I/O, speed up access times for partition-specific queries, etc. This is a discipline in and of itself, and the choice of partitioning key is crucial. In many cases such as log data like this, people often partition on a datetime value.
Partitioned Tables and Indexes (SQL Server)

Related

maintenance of application log files sql

I want to create a log table to keep track of users and their actions on website. For ex, when a user log in page a record will be created into log table. when user creates information, a record will be created into log table. similarly for every action, a record will be created into log table. In this way, the log table data will grow very faster. What is the better way to maintain such bigger tables apart from creating trigger and scheduling scripts to clean data frequently?
From my experience typically excessive logging doesnt really gain you much. A lot of people lose the usefulness of logging with the sheer volume of it...just a little warning before hand.
As for maintaining a table that size i recommend potentially partitioning the table and writing a specific set of stored procedures that effectively use a few indexes that you place on the table. Any ad-hoc work on the table should be done minimally and if it is done make sure the ad-hoc hits up against any index you setup on the table. Also with (nolock) will be your friend for SELECT statements if a large amount of inserts going on.
This is the basic general idea I do for the transaction tables I handle and they typically get around 1-2 million rows a day.

Synchronize SQL Server databases

I have a new idea and question about that I would like to ask you.
We have a CRM application on-premise / in house. We use that application kind of 24X7. We also do billing and payroll on the same CRM database which is OLTP and also same thing with SSRS reports.
It looks like whenever we do operation in front end which does inserts and updates to couple of entities at the same time, our application gets frozen until that process finishes. e.g. extracting payroll for 500 employees for their activities during last 2 weeks. Basically it summarize total working hours pulls that numbers from database and writes/updates that record where it says extract has been accomplished. so for 500 employees we are looking at around 40K-50K rows for Insert/Select/Update statements together.
Nobody can do anything while this process runs! We are considering the following options to take care of this issue.
Running this process in off-hours
OR make a copy of DB of Dyna. CRM and do this operations(extracting thousands of records and running multiple reports) on copy.
My questions are:
how to create first of all copy and where to create it (best practices)?
How to make it synchronize in real-time.
if we do select statement operation in copy DB than it's OK, but if we do any insert/update on copy how to reflect that on actual live db? , in short how to make sure both original and copy DB are synchronize to each other in real time.
I know I asked too many questions, but being SQL person, stepping into CRM team and providing suggestion, you know what I am trying to say.
Thanks folks for your any suggestion in advance.
Well to answer your question in regards to the live "copy" of a database a good solution is an alwayson availability group.
https://blogs.technet.microsoft.com/canitpro/2013/08/19/step-by-step-creating-a-sql-server-2012-alwayson-availability-group/
Though I dont think that is what you are going to want in this situation. Alwayson availability groups are typically for database instances that require very low failure time frames. For example: If the primary DB server goes down in the cluster it fails over to a secondary in a second or two at the most and the end users only notice a slight hiccup for a second.
What I think you would find better is to look at those insert statements that are hitting your database server and seeing why they are preventing you from pulling data. If they are truly locking the table maybe changing a large amount of your reads to "nolock" reads might help remedy your situation.
It would also be helpful to know what kind of resources you have allocated and also if you have proper indexing on the core tables for your DB. If you dont have proper indexing then a lot of the queries can take longer then normal causing the locking your seeing.
Finally I would recommend table partitioning if the tables you are pulling against are to large. This can help with a lot of disk speed issues potentially and also help optimize your querys if you partition by time segment (i.e. make a new partition every X months so when a query pulls from one time segment they only pull from that one data file).
https://msdn.microsoft.com/en-us/library/ms190787.aspx
I would say you need to focus on efficiency more then a "copy database" as your volumes arent very high to be needing anything like that from the sounds of it. I currently have a sql server transaction database running with 10 million+ inserts on it a day and I still have live reports hit against it. You just need the resources and proper indexing to accommodate.

SQL: Joins vs Denormalization (lots of data)

I know, variations of this question had been asked before. But my case may be a little different :-)
So, I am building a site that tracks events. Each event has id and value. It is also performed by a user, which has id, age, gender, city, country and rank. (these attributes are all integers, if it matters)
I need to be able to quickly get answers to two queries:
get number of events from users with certain profile (for example, males with age 18-25 from Moscow, Russia)
get sum(maybe avg also) of values of events from users with certain profile -
Also, data is generated by multiple customers, which, in turn, can have multiple source_ids.
Access pattern: data will be mostly written by collector processes, but when queried (infrequently, by web ui) it has to respond quickly.
I expect LOTS of data, certainly more than one table or single server can handle.
I am thinking about grouping events in separate tables per day (that is, 'events_20111011'). Also I want to prefix table name with customer id and source id, so that data is isolated and can be trivially discarded (purge old data) and relatively easily moved around (distribute load to other machines).
This way, every such table will have limited amount of rows, let's say, 10M tops.
So, the question is: what to do with user's attributes?
Option 1, normalized: store them in separate table and reference from event tables.
(pro) No repetition of data.
(con) joins, which are expensive (or so
I heard).
(con) this requires user table and event tables to be on
the same server
Option 2, redundant: store user attributes in event tables and index them.
(pro) easier load balancing (self-contained tables can be moved around)
(pro) simpler (faster?) queries
(con) lots of disk space and memory used for repeating user attributes and corresponding indexes
Your design should be normalized, you physical schema may end up denormalized for performance reasons.
Is it possible to do both? There is a reason why SQL Server ships with Analysis Server. Even if you are not in the Microsoft realm, it is a common design to have a transactional system for the data entry and day to day processing while a reporting system is available for the kinds of queries that would cause heavy loads upon the transactional system.
Doing this means you get the best of both worlds: a normalized system for daily operations and a denormalized system for rollup queries.
In most cases nightly updates are fine for reporting systems, but it depends on your hours of operation and other factors what works best. I find most 8-5 businesses have more than enough time in the evening to update a reporting system.
Use an OLAP/Data Warehousing approach. That is, store your data in the standard normalized way, but also store aggregated versions of the data that will be queried frequently in separate fact tables. The user queries won't be on real-time data, but it is usually worth it for the performance trade off.
Also, if you are using SQL Server enterprise I wouldn't roll your own horizontal partitioning scheme (breaking the data into days). There are tools built into SQL server to automatically do that for you.
Please Normalize
use partitions and indexing to balance load

How to replicate database A to B, then truncate data on database A, leaving B alone?

I am having a problem with my SQL Server 2005 database. The database must handle 1000 inserts a sec constantly. This is proving to be very difficult when the database must also handle reporting of the data, thus indexing. It seems to slow down after a couple of days only achieving 300 inserts per sec. By 10 days it is almost non functional.
The requirement is to store 14 days worth of data. So far I can only manage 3 or 4 before everything falls apart. Is there a simple solution to this problem?
I was thinking that I could replicate the primary database allowing the new database to be the reporting database storing the 14 days worth of database, then truncate the primary database daily. Would this work?
It is unlikely you will want reporting running against a database capturing 1000 records per second. I'd suggest two databases, one handling the constant stream of inserts and a second reporting database that only loads records at an interval, either by querying the first for a finite set since the last load or by caching the incoming data and loading it separately.
However, reporting in near real time against a database capturing 86 million rows per day and carrying approximately 1.2 billion rows will require significant planning and hardware demands. Further, on the backend as you reach day 14 and start to remove old data you will put more load on the database. If you can run without logging that will help the primary system, but the reporting system with indexing demands and such will require some pretty significant performance considerations.
If the server has multiple harddrives I would try to split the database (or even the tables) in partitions.
Yeah, you dont need to copy a database over and then truncate/delete the live database on the fly. My guess is that the slowness is because your transaction logs are growing like crazy?
I think you are trying to say that you want to "shrink" the database periodically. If you have a FULL backup scheme, I think that if you backup the transaction logs once in a while that will shrink things down to normal again.

Low MySQL Table Cache Hit Rate

I've been working on optimizing my site and databases, and I have been using mysqltuner.pl to help with this. I've gotten just about everything correct, except for the table cache hit rate, no matter how high I raise it in my.cnf, I am still hitting about 0% (284 open / 79k opened).
My problem is that I don't really understand exactly what affects this so I don't really know what to look for in my queries/database structure to fix this.
The table cache defines the number of simultaneous file descriptors that MySQL has open. So table cache hit rate will be affected by how many tables you have relative to your limit as well as how frequently you re-reference tables (keeping in mind that it is counting not just a single connection, but simultaneous connections)
For instance, if your limit is 100 and you have 101 tables and you query each one in order, you'll never get any table cache hits. On the other hand, if you only have 1 table, you should generally get close to 100% hit rate unless you run FLUSH TABLES a lot ( as long as your table_cache is set higher than the number of typically simultaneous connections).
So for tuning, you need to look at how many distinct tables you might reference by one process/client and then look at how many simultaneous connections you might typically have.
Without more details, I can't guess whether your case is due to too many simultaneous connections or too many frequently referenced tables.
A cache is supposed to maintain copies of hot data. Hot data is data that is used a lot. If you cannot retrieve data out of a certain cache it means the DB has to go to disk to retrieve it.
--edit--
sorry if the definition seemed a bit obnoxious. a specific cache often covers a lot of entities, and these are database specific, you need to find out what is cached by the table cache firstly.
--edit: some investigation --
Ok, it seems (from the reply to this post), that Mysql uses the table cache for the data structures used to represent a table. the data structures also (via encapsulation or by having duplicate table entries for each table) represent a set of file descriptors open for the data files on the file system. The MyIsam engine uses one for a table and one for each index, additionally each active query element requires its own descriptors.
A file descriptor is a kernel entity used for file IO, it represents the low-level context of a particular file read or write.
I think you are either interpreting the value's incorrectly or they need to be interpreted differently in this context. 284 is the amount of active tables at the instance you took the snapshot and the second value represents the amount of times a table was acquired since you started Mysql.
I would hazard a guess that you need to take multiple snapshots of this reading and see if the first value (active fd's at that instance) ever exceed your cache size capacity.
p.s., the kernel generally has a upper limit on the amount of file descriptors it will allow each process to open -- so you might need to tune this if it is too low.