I have worked at several companies, and in each of them the audit tables stored full snapshots of records for every change.
To my understanding, it's enough to store only the changed columns to recreate a record at any given point in time. That would obviously reduce storage space. Moreover, I suppose it would improve performance, since we would be writing a much smaller amount of data.
Since I've seen this across different databases and frameworks, I'm not putting any specific tag here.
I'd like to understand the reasoning behind this approach.
Here are some important reasons.
First, storage is becoming cheaper and cheaper. So there is little financial benefit in reducing the number of records or their size.
Second, the "context" around a change can be very helpful. Reconstructing records as they look when the change occurs can be tricky.
Third, the logic to detect changes is trickier than it seems, particularly when NULL values are involved. If there is a bug in that code, you lose the archive. Entire records are less error-prone.
Fourth, remember that (2) and (3) need to be implemented for every table being archived, further introducing the possibility of error.
I might summarize this as saying that storing the entire record uses fewer lines of code. Fewer lines of code are easier to maintain and less error-prone. And those savings outweigh the benefits of reducing the size of the archive.
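As a rough illustration of how little code the full-snapshot approach needs, here is a minimal sketch of a generic audit trigger, assuming PostgreSQL 11+ and illustrative names (audit_log, accounts). Note that there is no per-column diff logic at all, so there are no NULL-comparison bugs to introduce:

    CREATE TABLE audit_log (
        id         bigserial PRIMARY KEY,
        table_name text        NOT NULL,
        operation  text        NOT NULL,   -- 'INSERT' / 'UPDATE' / 'DELETE'
        changed_at timestamptz NOT NULL DEFAULT now(),
        row_data   jsonb       NOT NULL    -- the entire record as of this change
    );

    CREATE FUNCTION audit_row() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            INSERT INTO audit_log (table_name, operation, row_data)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(OLD));
        ELSE
            INSERT INTO audit_log (table_name, operation, row_data)
            VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
        END IF;
        RETURN NULL;  -- AFTER trigger, so the return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    -- The same function is attached to every audited table:
    CREATE TRIGGER accounts_audit
    AFTER INSERT OR UPDATE OR DELETE ON accounts
    FOR EACH ROW EXECUTE FUNCTION audit_row();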
I was reading the book "API Design Patterns" by JJ Geewax, and there is a section about returning the count of items in a List response; he says it's not a good idea, especially in distributed storage systems.
From page 102:
Next, there is often the temptation to include a count of the items along with the listing. While this might be nice for user-interface consumers to show a total number of matching results, it often adds far more headache as time goes on and the number of items in the list grows beyond what was originally projected. This is particularly complicated for distributed storage systems that are not designed to provide quick access to counts matching specific queries. In short, it's generally a bad idea to include item counts in the responses to a standard List method.
Does anyone have a clue why that is, or at least keywords I could search for?
In a typical database (e.g., a MySQL db with a few gigs of data in there), counting the number of rows is pretty easy. If that's all you'll ever deal with, then providing a count of matching results isn't a huge deal -- the concern comes up when things get bigger.
As the amount of data starts growing (e.g., say... 10T?), dynamically computing an accurate count of matching rows can start to get pretty expensive (you have to scan and keep a running count of all the matching data). Even with a distributed storage system, this can be fast, but it is still expensive. This means your API will be spending a lot of computing resources calculating the total number of results when it could be doing other important things. In my opinion, this is wasteful (a large expense for a "nice-to-have" on the API). If counts are critical to the API, then that changes the calculation.
Further, as changes to the data become more frequent (more creates, updates, and deletes), a count becomes less and less accurate as it might change drastically from one second to the next. In that case, not only is there more work being done to come up with a number, but that number isn't even all that accurate (and presumably, not super useful at that point).
So overall... result counts on larger data sets tend to be:
Expensive
More nice-to-have than business critical
Inaccurate
And since APIs tend to live much longer than we ever predict and can grow to a size far larger than we imagine, I discourage including result counts in API responses.
Every API is different though, so maybe it makes sense to have counts in your API, though I'd still suggest using rough estimates rather than exact counts to future-proof the API.
Some good reasons to include a count:
Your data size will stay reasonably small (i.e., able to be served by a single MySQL database).
Result counts are critical to your API (not just "nice-to-have").
Whatever numbers you come up with are accurate enough for your use cases (i.e., exact numbers for small data sets or "good estimates", not useless estimates).
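To illustrate the "rough estimates" idea mentioned above, here is one common trick, assuming PostgreSQL and an illustrative table name (items): the planner's statistics provide an approximate row count without scanning anything.

    -- Exact count: has to visit every matching row (or index entry),
    -- so its cost grows with the size of the table.
    SELECT count(*) FROM items;

    -- Cheap estimate from the planner's statistics (PostgreSQL-specific);
    -- maintained by autovacuum/ANALYZE, so it can lag behind recent writes.
    SELECT reltuples::bigint AS estimated_rows
    FROM pg_class
    WHERE relname = 'items';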
I am learning Cassandra. Now I am thinking about the problems with SQL that NoSQL addresses, and I have a question about very big data.
Regarding SQL and very big data, many pages say that tables end up on different servers and queries become slow because of joins across servers; this is a problem of SQL that NoSQL addresses. But even with NoSQL, if partitions get too big, don't I need to change my data model, make smaller partitions, and run multiple queries against them to get the same result? And isn't that slow? Or do you never run out of space in a partition, because 2 billion cells is big enough?
I think your question is mixing several different issues.
First of all, the problem with big data and SQL is usually not that queries become slow, but that the solution cannot scale as the data grows bigger and bigger. If you choose to manually split your tables to several servers, as you suggested, what do you do when you need even more servers - redesign your data model? Also, how do you ensure consistency when an update requires modifying several tables but they are on different hosts?
Second, you mentioned joins, and this is something which NoSQL solutions like Cassandra do not support. You need to manually denormalize the data yourself (i.e., put the already joined data in a table). For some things, Cassandra's new "Materialized Views" feature can come in handy.
Third, and perhaps most importantly, you asked about huge partitions. Indeed, Cassandra is not designed to handle huge partitions, and the best practice is to stay far below the 2-billion-cell hard limit you mentioned: Datastax (the commercial company behind Cassandra's development) suggests in https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html that a good rule of thumb is to "keep the maximum number of rows below 100,000 items and the disk size under 100 MB".
There are several reasons why huge partitions are ill-advised in Cassandra. One is that the disk format (sstables and their so-called "promoted index") makes it inefficient to jump to the middle of a huge partition, which you need to do when you want to read a specific row or iterate through the rows. Some operations, such as compaction and repair, work on entire partitions and can become very slow (and in the worst case, also use a lot of memory). E.g., consider a billion-row partition that differs on two nodes by just one row: partition-based repair needs to send the entire partition over the network.
Scylla (https://en.wikipedia.org/wiki/Scylla_(database)), a Cassandra clone which is generally more efficient than Apache Cassandra, also has similar issues with huge partitions (as in Cassandra, moderately large partitions are fine), but these issues are actively being worked on, including re-designing the file format, so eventually Scylla should support arbitrary-sized partitions. However, we're still not there yet, and today the recommendation of not letting partitions grow too huge still applies to Scylla as well.
Finally, if you want to get around the problem of too many rows in a single partition, then, yes, you need to tweak your data model to avoid these huge partitions. Sometimes, you just need to fix design mistakes in your model - e.g., I have seen people sticking a lot of unrelated data into the same partition, when it could have easily (and more efficiently!) been put in separate partitions. Sometimes, you need to artificially split your partitions. This is common in so-called "time-series data" modeling in Cassandra, where we (for example) get a new value of some measurement every second and add it as a row to a partition. Here, instead of having one huge partition for all data ever, the accepted practice is to create a separate partition per time window (e.g., a new partition every day, or week, or whatever). Since most queries involve just one time window anyway, they don't even become slower.
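Here is a sketch of that time-window bucketing in CQL, assuming one measurement per second per sensor (all names are illustrative):

    -- One partition per (sensor, day) instead of one ever-growing
    -- partition per sensor. At one row per second, a partition caps out
    -- at 86,400 rows, well under the ~100,000-row rule of thumb above.
    CREATE TABLE measurements (
        sensor_id text,
        day       date,        -- the artificial "bucket" column
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

    -- Most queries hit one time window, and therefore one partition:
    SELECT ts, value
    FROM measurements
    WHERE sensor_id = 'pump-7' AND day = '2023-05-01';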
I'm currently designing an application where users can create/join groups, and then post content within a group. I'm trying to figure out how best to store this content in a RDBMS.
Option 1: Create a single table for all user content. One of the columns in this table will be the groupID, designating which group the content was posted in. Create an index using the groupID, to enable fast searching of content within a specific group. All content reads/writes will hit this single table.
Option 2: Whenever a user creates a new group, we dynamically create a new table. Something like group_content_{groupName}. All content reads/writes will be routed to the group-specific dynamically created table.
Pros for Option 1:
It's easier to search for content across multiple groups, using a single simple query, operating on a single table.
It's easier to construct simple cross-table queries, since the content table is static and well-defined.
It's easier to implement schema changes and changes to indexing/triggers etc, since there's only one table to maintain.
Pros for Option 2:
All reads and writes will be distributed across numerous tables, thus avoiding any bottlenecks that can result from a lot of traffic hitting a single table (though admittedly, all these tables are still in a single DB)
Each table will be much smaller in size, allowing for faster lookups, faster schema-changes, faster indexing, etc
If we want to shard the DB in future, the transition would be easier if all the data is already "sharded" across different tables.
What are the general recommendations between the above 2 options, from performance/development/maintenance perspectives?
One of the cardinal sins in computing is optimizing too early. It is the opinion of this DBA of 20+ years that you're overestimating the I/O that's going to hit these groups. RDBMSs are very good at querying and writing this type of info within a standard set of tables. Worst case, you can partition them later. You'll have a lot more search capability and ease of management with one set of tables instead of a set per user.
What if the schema needs to change? Do you really want to update hundreds or thousands of tables, or write some long script, to fix a mundane issue? Stick with a single set of tables and ignore sharding. Instead, think "maybe we'll partition the tables someday, if necessary".
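To make Option 1 concrete, here is a minimal sketch in PostgreSQL syntax (all names are illustrative); a single index does the work the per-group tables were supposed to do:

    CREATE TABLE group_content (
        content_id bigserial PRIMARY KEY,
        group_id   bigint      NOT NULL,
        author_id  bigint      NOT NULL,
        body       text        NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- One index gives fast "content within a group" lookups;
    -- cross-group searches remain ordinary queries on one table.
    CREATE INDEX idx_group_content_group
        ON group_content (group_id, created_at);

    SELECT body, created_at
    FROM group_content
    WHERE group_id = 42
    ORDER BY created_at DESC
    LIMIT 50;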
It is a no-brainer. (1) is the way to go.
You list these as optimizations for the second method. All of these are misconceptions. See comments below:
All reads and writes will be distributed across numerous tables, thus avoiding any bottlenecks that can result from a lot of traffic hitting a single table (though admittedly, all these tables are still in a single DB)
Reads and writes can just as easily be distributed within a table. The only issue would be write conflicts within a page. That is probably a pretty minor consideration, unless you are dealing with more than dozens of transactions per second.
Because of the next item (partially filled pages), you are actually much better off with a single table and pages that are mostly filled.
Each table will be much smaller in size, allowing for faster lookups, faster schema-changes, faster indexing, etc
Smaller tables can be a performance disaster. Tables are stored on data pages, and each small table ends with its own partially filled page. What you end up with is:
A lot of wasted space on disk.
A lot of wasted space in your page cache -- space that could be used to store records.
A lot of wasted I/O reading in partially filled pages.
If we want to shard the DB in future, the transition would be easier if all the data is already "sharded" across different tables.
Postgres supports table partitioning, so you can store different parts of a table in different places. That should be sufficient for your purpose of spreading the I/O load.
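For illustration, a sketch of that, assuming PostgreSQL 11+ hash partitioning (names are illustrative); the application keeps talking to one logical table:

    CREATE TABLE group_content (
        content_id bigserial,
        group_id   bigint NOT NULL,
        body       text,
        PRIMARY KEY (group_id, content_id)  -- must include the partition key
    ) PARTITION BY HASH (group_id);

    CREATE TABLE group_content_p0 PARTITION OF group_content
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE group_content_p1 PARTITION OF group_content
        FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    CREATE TABLE group_content_p2 PARTITION OF group_content
        FOR VALUES WITH (MODULUS 4, REMAINDER 2);
    CREATE TABLE group_content_p3 PARTITION OF group_content
        FOR VALUES WITH (MODULUS 4, REMAINDER 3);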
Option 1: Performance=Normal Development=Easy Maintenance=Easy
Option 2: Performance=Fast Development=Complex Maintenance=Hard
I suggest choosing Option 1. For the big table, you can manage performance with better indexes or cached indexes (in some databases). Nothing makes Option 2 worthwhile, because its development and maintenance time is the fatal factor.
I found a few questions in the same vein as this, but they did not include much detail on the nature of the data being stored, how it is queried, etc... so I thought this would be worthwhile to post.
My data is very simple, three fields:
- a "datetimestamp" value (date/time)
- two strings, "A" and "B", both < 20 chars
My application is very write-heavy (hundreds per second). All writes are new records; once inserted, the data is never modified.
Regular reads happen every few seconds, and are used to populate some near-real-time dashboards. I query against the date/time value and one of the string values. e.g. get all records where the datetimestamp is within a certain range and field "B" equals a specific search value. These queries typically return a few thousand records each.
Lastly, my database does not need to grow without limit; I would be looking at purging records that are 10+ days old either by manually deleting them or using a cache-expiry technique if the DB supported one.
I initially implemented this in MongoDB, without being aware of the way it handles locking (writes block reads). As I scale, my queries are taking longer and longer (30+ seconds now, even with proper indexing). Now with what I've learned, I believe that the large number of writes are starving out my reads.
I've read the kkovacs.eu post comparing various NoSQL options, and while I learned a lot I don't know if there is a clear winner for my use case. I would greatly appreciate a recommendation from someone familiar with the options.
Thanks in advance!
I have faced a problem like this before, in a system recording process-control measurements. That was done with 5 MHz IBM PCs, so it is definitely possible. The use cases were more varied (summarization by minute, hour, eight-hour shift, day, week, month, or year), so the system recorded all the raw data but also aggregated it on the fly for the most common queries, which were five-minute averages. In the case of your dashboard, it seems like five-minute aggregation is also a major goal.
Maybe this could be solved by writing a pair of text files for each input stream: one with all the raw data, another with the multi-minute aggregation. The dashboard would ignore the raw data. A database could be used, of course, to do the same thing, but simplifying the application could mean no RDBMS is needed: simpler to engineer and maintain, easier to fit on a microcontroller or embedded system, and a friendlier neighbor on a shared host.
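If a database is used for the same split, the dashboard side is only a couple of statements. Here is a sketch assuming PostgreSQL and the three fields described in the question (names are illustrative):

    CREATE TABLE raw_events (
        datetimestamp timestamptz NOT NULL,
        a             varchar(20),
        b             varchar(20)
    );
    CREATE INDEX idx_raw_events_b_ts ON raw_events (b, datetimestamp);

    -- Dashboard query: per-five-minute counts for one value of "b".
    -- date_bin() needs PostgreSQL 14+; older versions can bucket with
    -- floor(extract(epoch from datetimestamp) / 300) instead.
    SELECT date_bin('5 minutes', datetimestamp, TIMESTAMPTZ '2000-01-01') AS bucket,
           count(*) AS events
    FROM raw_events
    WHERE b = 'some-value'
      AND datetimestamp >= now() - interval '1 hour'
    GROUP BY bucket
    ORDER BY bucket;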
Deciding on the right NoSQL product is not an easy task. I would suggest learning more about NoSQL before making your choice, if you really want to make sure you don't end up simply trusting someone else's suggestions or favorites.
There is a good book that gives really good background on NoSQL; anyone who is starting out with NoSQL should read it.
http://www.amazon.com/Professional-NoSQL-Wrox-Programmer/dp/047094224X
I hope reading some of the chapters in the book will really help you. There are comparisons and explanations about what is good for which job, and a lot more.
Good luck.
For example, I have a table of bank users (user id, user name), and a table for transactions (user id, account id, amount).
Accounts have the same properties across different users but hold different amounts (e.g., Alex -> Grocery is specific to Alex, but all other users also have a Grocery account).
The question is: would it be better to create a separate table of accounts (account id, user id, amount left), or to get this value by selecting all transactions with the needed user id and account id and just summing the 'amount' values? It seems that the first approach would be faster, but more prone to error and database corruption - I would need to update accounts every time the transaction happens. The second approach seems cleaner, but would it lead to a significant speed reduction?
What would you recommend?
good question!
In my opinion you should always avoid duplicated data, so I would go with the "sum every time" option.
"It seems that the first approach would be faster, but more prone to error and database corruption - I would need to update accounts every time the transaction happens"
That says it all: you are open to errors, and you'll have to build a mechanism to keep the data up to date.
Don't forget that the first approach would only be faster for selects; inserts, updates, and deletes would be slower because you would have to update your second table.
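A sketch of the "sum every time" read, with the index that keeps it fast (illustrative names):

    -- With this index, the SUM reads only the matching index entries
    -- rather than scanning the whole transactions table.
    CREATE INDEX idx_tx_user_account
        ON transactions (user_id, account_id);

    SELECT COALESCE(SUM(amount), 0) AS balance
    FROM transactions
    WHERE user_id = 1 AND account_id = 42;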
This is an example of Denormalization.
In general, denormalization is discouraged, but there are certain exceptions - bank account balances are typically one such exception.
So if this is your exact situation, I would suggest going with the separate table of accounts - but if you have far fewer records than a bank typically would, then I recommend the derived approach instead.
To some extent, it depends.
With "small" data volumes, performance will more than likely be OK.
But as data volumes grow, having to SUM all transactions may become costlier to the point at which you start noticing a performance problem.
Also consider data access/usage patterns. In a read-heavy system, where you "write once, read many", the SUM approach hits performance on every read - in this scenario, it may make sense to take a performance hit once on write, to improve subsequent read performance.
If you anticipate "large" data volumes, I'd definitely go with the extra table holding the high-level totals. You need to ensure, though, that it is updated when a (monetary) transaction is made, within a (SQL Server) transaction, to make it an atomic operation.
With smaller data volumes, you could get away without it...personally, I'd probably still go down that path, to simplify the read scenario.
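A minimal sketch of that write path (generic SQL, illustrative names); wrapping both statements in one transaction is what keeps the totals table trustworthy:

    BEGIN;

    INSERT INTO transactions (user_id, account_id, amount)
    VALUES (1, 42, -25.00);

    UPDATE accounts
    SET amount_left = amount_left - 25.00
    WHERE user_id = 1 AND account_id = 42;

    COMMIT;  -- either both changes land, or neither does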
It makes sense to go with the denormalized approach (the first solution) only if you face significant performance issues. Since you are doing just a simple SUM (or a group-by and then a sum) with proper indexes, your normalized solution will work really well and will be a lot easier to maintain (as you noted).
But depending on your queries, it can make sense to go with the denormalized solution. For example, if your database is read-only (you periodically load data from some other data source and don't make inserts/updates at all, or make them really rarely), then you can just load the data in whatever shape makes queries easiest - and in that case, the denormalized solution might prove to be better.