I have a system that needs to snapshot specific tables at certain points in time.
At present the process that takes the snapshot queries the data in the table, and puts the output into a Temp table (like a stage table not an in memory table).
Many of these processes can be running at the same time in parallel (100+ per hour). And the tables to be copied can run into GBs worth of data.
I am considering the use of database snapshots, so each process can take its own snapshot and work with it.
What are the pros and cons of this approach?
Is there a better way to approach this?
I would certainly consider using snapshots, but there are a couple of things you need to consider before you decide one way or the other:
Snapshots use sparse files to store data, effectively only storing the changes to each table from the time of its creation. The more rows in your table that change, the smaller the disk saving over a simple replication.
Snapshots happen to the entire database, rather than just one table. Therefore, if your DB is huge, and need 100+ copies of your DB per hour, that's going to cost you a lot of disk space and performance. On the other hand, if you're happy with one snapshot per hour for all your processes and the data doesn't change too much, snapshots may be exactly what you need.
Snapshots will have a performance overhead on your original DB, as the snapshots need to be kept up to date with what data has changed. If you're already tight on performance, this may be an issue.
There are certainly some other things to consider, but that's not a bad starting point. Books OnLine has a fairly detailed article regarding the use of Snapshots, which I would read before deciding. http://msdn.microsoft.com/en-us/library/ms175158%28v=SQL.90%29.aspx There's also a section in there on limitations: http://msdn.microsoft.com/en-us/library/ms189940%28v=SQL.90%29.aspx
Hope this helps.
If you need to take a snapshot of the data for backup/recovery purposes, I suggest you utilize backup features of SQL Server (i.e., full, transaction log backups).
If your goal is to capture data from some table for historical purposes, you could implement a bunch of sql jobs running something like:
select *
into archive_table
from normal_table
Make sure you turn off transaction log (i.e., set database to "simple") right before you do this as it will drastically speed up the process.
Related
I'm not a trained DBA, but perform some SQL tasks and have this question:
In SQL databases I've noticed the use archive tables that mimic another table with the exact same fields and which are used to accept rows from the original table when that data is deemed for archiving. Since I've seen examples where those tables reside in the same database and on the same drive, my assumption is that this was done to increase performance. Such tables didn't have more than a about 10 million rows in them...
Why would this be done instead of using a column to designate the status of the row, such as a boolean for an in/active flag?
At what point would this improve performance ?
What would be the best pattern to structure this correctly, given that the data may still need to be queried (or unioned with current data) ?
What else is there to say about this ?
The notion of archiving is a physical, not logical, one. Logically the archive table contains the exact same entity and ought to be the same table.
Physical concerns tend to be pragmatic. The overarching notion is that the "database is getting too (big/slow"). Archiving records makes it easier to do things like:
Optimize the index structure differently. Archive tables can have more indexes without affecting insert/update performance on the working table. In addition, the indexes can be rebuilt with full pages, while the working table will generally want to have pages that are 50% full and balanced.
Optimize storage media differently. You can put the archive table on slower/less expensive disk drives that maybe have more capacity.
Optimize backup strategies differently. Working tables may require hot backups or log shipping while archive tables can use snapshots.
Optimize replication differently, if you are using it. If an archive table is only updated once per day via nightly batch, you can use snapshot as opposed to transactional replication.
Different levels of access. Perhaps you want different security access levels for the archive table.
Lock contention. If you working table is very hot you'd rather have your MIS developers access the archive table where they are less likely to halt your operations when they run something and forget to specify dirty read semantics.
The best practice would not to use archive tables but to move the data from the OLTP database to an MIS database, data warehouse, or data marts with denormalized data. But some organizations will have trouble justifying the cost of an additional DB system (which aren't cheap). There are far fewer hurdles to adding an additional table to an existing DB.
I say this frequently, but...
Multiple tables of identical structure almost never makes sense.
A status flag is a much better idea. There are proper ways to increase performance (partitioning/indexing) without denormalizing data or otherwise creating redundancies. 10 million records is pretty small in the world of modern rdbms, so what you're seeing is the product of poor planning or misunderstanding of databases.
The task is to filter and analyze a huge amount of logfiles (around 8TB) from a finished research project. The idea is to fill a database with the data to be able to run different analysis tasks later.
The values are stored comma separated. In principle the values are tuples of up to 5 values:
id, timestamp, type, v1, v2, v3, v4, v5
In a first try using MySQL I used one table with one log entry per row. So there is no direct relation between the log values. The downside here is slow querying of subsets.
Because there is no relation I looked into alternatives like NoSQL databases, and column based tables like hbase or cassandra seemed to be a perfect fit for this kind of data. But these systems are made for huge distributed systems, which we not have. In our case the analysis will run on a single machine or perhaps some VMs.
Which kind of database would fit this task? Is it worth to setup a single machine instance with hadoop+hbase... or is this all a bit over-sized?
What database would you choose to do high-performance logfile analysis?
EDIT: Maybe out of my question it is not clear that we cannot spend money for cloud services or new hardware. The Question is if there are benefits in using noSQL approaches instead of mySQL (especially for this data). If there are none, or if they are so small that the effort of setting up a noSQL system is not worth the benefit we can use our ESXi infrastructure and MySQL.
EDIT2: I'm still having the Problem here. I did further experiments with MySQL and just inserted a quarter of all available data. The insert is now running for over 2 days and is not yet finished. Currently there are 2,147,483,647 rows in my single table db. With indeces this takes 211,2 GiB of disk space. And this is just a quarter of all logging data...
A query of the form
SELECT * FROM `table` WHERE `timestamp`>=1342105200000 AND `timestamp`<=1342126800000 AND `logid`=123456 AND `unit`="UNIT40";
takes 761 seconds to complete, in this case returning one row.
There is a combined index on timestamp, logid, unit.
So I think this is not the way to go, because later in analysis I will have to get all entries in a time range and compare the datapoints.
I read bout MongoDB and Redis, but the problem with them is, that they are in Memory databases.
In the later analyzing process there will a very small amount of concurrent database access. In fact the analyzing will be run from one single machine.
I do not need redundancy. I would be able to regenerate the database in case of a failure.
When the database is once completely written, there would also be no need to update or add further row.
What do you think about alternatives like Redis, MongoDB and so on. When I get this right, i would need RAM in the dimension of my data...
Is this task even somehow possible with a single node system or with maybe two nodes?
well i personally would prefer the faster solution, as you said you need a high-perfomance analysis. the problem is, if you have to setup a whole new system to do so and the performance-improvement would be minor in relation to the additional effort you'd need, then stay with SQL.
in our company, we have a quite small Database containing not even half a GB of Data on the VM. the problem now is, as soon as you use a VM, you will have major performance issues, when opening the Database on VM you can go for a coffee in the meantime ;)
But if the time until the Database is loaded to cache is not so important it doesn't matter. It all depends on how much faster you think the new System will be, and how much effort you will have to put in it, but as i said i'd prefer the faster solution if you have to go for "high-performance analysis"
I inhereted a high volume OLTP DB which I have free reign to improve as much as I find reasonably possible. The improvements already were very helpful but I want to take it to the next level. The data access patterns I found made it a good candidate IMO to cache the data on other servers and I would love to hear anyone's experience or recommendations with this type of setup.
We have a DB that gets about 3GB of data added to a table every day and reporting on it used to be very slow. The data does not change once it's put in, and no data gets inserted that is over a week old. Rows inputted within the last 3 days tend to see thousands of inserts between tens of millions of rows.
I was thinking of having data over 2 weeks old get pushed out to MongoDB. I could then have the 2 week sliding window data that is not pushed out to Mongo be, be cached by some kind of caching software so those get queried and displayed instead of the data being read out of the DB the whole time. I figure this way we still get full A.C.I.D compliance by having the DB engine validate all the data, have high read performance as it is not hitting the DB, then Mongo can take it when it is no longer a 'transaction'.
Anyone have any recommended solutions? I was looking at MemCached, but not quite sure if that's a good or even plausible solution. Thanks!
Another thing you could consider is using the new In-Memory OLTP feature in SQL Server 2014. That feature improves efficiency and scaling for OLTP workloads. You will potentially be able to get a lot more out of your existing server, without the need to consider specific caching mechanisms.
I don't have specific experience with SQL Server, but what you are describing does seem like a valid use case for MongoDB.
Note that while MongoDB can't directly handle transactions, it is capable of handling certain operations in an atomic fashion (see findAndModify, for instance). Additionally, with journaling enabled, you shouldn't have any reason to worry about durability. MongoDB is a reliable data store and will not lose or corrupt your data.
MongoDB itself can also act as a performant cache if you run a second deployment with journaling disabled. In this instance, writes will take place in memory and only be persisted to the disk every 60 seconds (unless otherwise configured). This will provide comparable performance to memcache which is solely in-memory while allowing you to keep your stack a bit simpler.
Hope this helps!
I'm working on designing an application with a SQL backend (Postgresql) and I've got a some design questions. In short, the DB will serve to store network events as they occur on the fly, so insertion speed and performance is critical due 'real-time' actions depending on these events. The data is dumped into a speedy default format across a few tables, and I am currently using postgresql triggers to put this data into some other tables used for reporting.
On a typical event, data is inserted into two different tables each share the same primary key (an event ID). I then need to move and rearrange the data into some different tables that are used by a web-based reporting interface. My primary goal/concern is to keep the load off the initial insertion tables, so they can do their thing. Reporting is secondary, but it would still be nice for this to occur on the fly via triggers as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance wise..does this make sense or am I totally off?
Once the data is in the appropriate reporting tables, I won't need to hang on to the data in the insertion tables too long, so I'll keep those regularly pruned for insertion performance. In thinking about this scenario, which I'm sure is semi-common, I've come up with three options:
Use triggers to trigger on the initial row insert and populate the reporting tables. This was my original plan.
Use triggers to copy the insertion data to a temporary table (same format), and then trigger or cron to populate the reporting tables. This was just a thought, but I figure that a simple copy operation to a temp table will offload any of the query-ing of the triggers in the solution above.
Modify my initial output program to dump all the data to a single table (vs across two) and then trigger on that insert to populate the reporting tables. So where solution 1 is a multi-table to multi-table trigger situation, this would be a single-table source to multi-table trigger.
Am I over thinking this? I want to get this right. Any input is much appreciated!
You may experience have a slight increase in performance since there are more "things" to do (although they should not affect operations in any way). But using Triggers/other PL is a good way to reduce it to minimum subce they are executed faster than code that gets sent from your application to the DB-Server.
I would go with your first idea 1) since it seems to me the cleanest and most efficient way.
2) is the most performance hungry solution since cron will do more queries than the other solutions that use server-side functions. 3) would be possible but will resulst in an "uglier" database layout.
This is an old one but adding my answer here.
Reporting is secondary, but it would still be nice for this to occur on the fly via triggers as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance wise..does this make sense or am I totally off?
That may be way off, I'm afraid, but under a few cases it may not be. It depends on the effects of caching on the reports. Keep in mind that disk I/O and memory are your commodities, and that writers and readers rarely block eachother on PostgreSQL (unless they explicitly elevate locks--- a SELECT ... FOR UPDATE will block writers for example). Basically if your tables fit comfortably in RAM you are better off reporting from them since you are keeping disk I/O free for the WAL segment commit of your event entry. If they don't fit in RAM then you may have cache miss issues induced by reporting. Here materializing your views (i.e. making trigger-maintained tables) may cut down on these but they have a significant complexity cost. This, btw, if your option 1. So I would chalk this one up provisionally as premature optimization. Also keep in mind you may induce cache misses and lock contention on materializing the views this way, so you might induce performance problems regarding inserts this way.
Keep in mind if you can operate from RAM with the exception of WAL commits, you will have no performance problems.
For #2. If you mean temporary tables as CREATE TEMPORARY TABLE, that's asking for a mess including performance issues and reports not showing what you want them to show. Don't do it. If you do this, you might:
Force PostgreSQL to replan your trigger on every insert (or at least once per session). Ouch.
Add overhead creating/dropping tables
Possibilities of OID wraparound
etc.....
In short I think you are overthinking it. You can get very far by bumping RAM up on your Pg box and making sure you have enough cores to handle the appropriate number of inserting sessions plus the reporting one. If you plan your hardware right, none of this should be a problem.
We're thinking about putting up a data warehouse system to load with web access logs that our web servers generate. The idea is to load the data in real-time.
To the user we want to present a line graph of the data and enable the user to drill down using the dimensions.
The question is how to balance and design the system so that ;
(1) the data can be fetched and presented to the user in real-time (<2 seconds),
(2) data can be aggregated on per-hour and per-day basis, and
(2) as large amount of data can still be stored in the warehouse, and
Our current data-rate is roughly ~10 accesses per second which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema shows that my quires starts to take longer than 2 seconds when we have more than 8 million rows.
Is it possible it get real-time query performance from a "simple" data warehouse like this,
and still have it store a lot of data (it would be nice to be able to never throw away any data)
Are there ways to aggregate the data into higher resolution tables?
I got a feeling that this isn't really a new question (i've googled quite a lot though). Could maybe someone give points to data warehouse solutions like this? One that comes to mind is Splunk.
Maybe I'm grasping for too much.
UPDATE
My schema looks like this;
dimensions:
client (ip-address)
server
url
facts;
timestamp (in seconds)
bytes transmitted
Seth's answer above is a very reasonable answer and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.
Mozilla does a lot of web service analytics. We keep track of details on an hourly basis and we use a commercial DB product, Vertica. It would work very well for this approach but since it is a proprietary commercial product, it has a different set of associated costs.
Another technology that you might want to investigate would be MongoDB. It is a document store database that has a few features that make it potentially a great fit for this use case.
Namely, the capped collections (do a search for mongodb capped collections for more info)
And the fast increment operation for things like keeping track of page views, hits, etc.
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
Doesn't sound like it would be a problem. MySQL is very fast.
For storing logging data, use MyISAM tables -- they're much faster and well suited for web server logs. (I think InnoDB is the default for new installations these days - foreign keys and all the other features of InnoDB aren't necessary for the log tables). You might also consider using merge tables - you can keep individual tables to a manageable size while still being able to access them all as one big table.
If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.
Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.
You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep around. Even then, it probably won't be for more than 3-4 years).
Also, look into partitioning, especially if your queries mostly access latest data; you could -- for example -- set-up weekly partitions of ~5.5M rows.
If aggregating per-day and per hour, consider having date and time dimensions -- you did not list them so I assume you do not use them. The idea is not to have any functions in a query, like HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as fact tables.
With this in place, the query optimizer can use partition pruning, so the total size of tables does not influence the query response as before.
This has gotten to be a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with 0.1 second response time (from database), over a second from web server. This isn't even on a huge server.
Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.
But you will want to mostly hit aggregate data rather than detail data - especially for time-series graphing over days, months, etc. Aggregate data can be either periodically created on your database through an asynchronous process, or in cases like this is typically works best if your ETL process that transforms your data creates the aggregate data. Note that the aggregate is typically just a group-by of your fact table.
As others have said - partitioning is a good idea when accessing detail data. But this is less critical for the aggregate data. Also, reliance on pre-created dimensional values is much better than on functions or stored procs. Both of these are typical data warehousing strategies.
Regarding the database - if it were me I'd try Postgresql rather than MySQL. The reason is primarily optimizer maturity: postgresql can better handle the kinds of queries you're likely to run. MySQL is more likely to get confused on five-way joins, go bottom up when you run a subselect, etc. And if this application is worth a lot, then I'd consider a commercial database like db2, oracle, sql server. Then you'd get additional features like query parallelism, automatic query rewrite against aggregate tables, additional optimizer sophistication, etc.