I am wondering how to properly store object update information (i.e. auditing). My constraints are:
I have one type of object with 10-25 properties that will be updated; these properties are well defined.
Each property is expected to be updated a hundred to a hundred thousand times over the life of the object (a hard limit can be set if needed).
The number of objects cannot be limited.
Most accesses will only read the last 100 updates.
My purist instinct wants me to put all updates in a table (one per property). I can then query and merge the results to get my history, keeping things in normal form as much as possible.
On the other hand, I'd like to keep updates close to their parent object (I will never mix updates from two objects). This would make it easier to shard (keeping updates and objects on the same machine).
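For illustration, a minimal sketch of the second layout in SQL (a single history table clustered by object), assuming a relational store; the table and column names are hypothetical:

CREATE TABLE object_update (
  object_id   BIGINT       NOT NULL,  -- parent object
  property    VARCHAR(64)  NOT NULL,  -- which of the 10-25 properties changed
  updated_at  TIMESTAMP    NOT NULL,
  old_value   VARCHAR(255),
  new_value   VARCHAR(255),
  PRIMARY KEY (object_id, updated_at, property)
);

-- The common read path: the last 100 updates of one object.
SELECT property, updated_at, old_value, new_value
FROM object_update
WHERE object_id = ?
ORDER BY updated_at DESC
LIMIT 100;

With the primary key leading on object_id, an object's history stays together (in engines that cluster rows by primary key), which also keeps sharding by object straightforward; the per-property variant would split this into one such table per property.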
So, I have read that using internal tables increases the performance of the program and that we should perform as few operations on DB tables as possible. But I have started working on a project that does not use internal tables at all.
Some details:
It is a scanner that adds or removes products in/from a store. First the primary key is checked (to see if that type of product exists) and then the product is added or removed. We use INSERT INTO and DELETE FROM to add/remove the products directly from the DB table.
I have not asked why they do not use internal tables because I do not have a better solution so far.
Here's what I have so far: insert all products into an internal table, and place the deleted products in another internal table.
FORM update.
  MODIFY zop_db_table FROM TABLE gt_table.  " to add all new products
  LOOP AT gt_deleted INTO gs_deleted.
    DELETE FROM zop_db_table WHERE index_nr = gs_deleted-index_nr.
  ENDLOOP.                                  " to delete products
ENDFORM.
But when can I perform this update?
I could add a 'Save' button to perform the update, but then there is the risk that the user forgets to save large amounts of data, or drops the scanner and shuts it down, or similar situations. So this is clearly not a good solution.
My final question is: Is there a (good) way to implement internal tables in a project like this?
Internal tables should be used for data processing, like lists or arrays in other languages (C#, Java...). From a performance and system-load perspective it is preferable to first load all the data you need into an internal table, then process that internal table instead of loading individual records from the database.
But that is mostly true for reporting, which is probably the most common type of custom ABAP program. You often see developers use SELECT ... ENDSELECT statements, which in effect loop over a database table, transferring row after row to the report, one at a time. That is extremely slow compared to reading all records at once into an itab and then looping over the itab. More than once I've cut the execution time of a report down to a fraction just by eliminating round trips to the database.
If you have a good reason to read from the database or update records immediately, you should do so. If you can safely delay updates and deletes to a point in time where you can process all of them together, without risking inconsistencies, I'd consider that an improvement. But if there is a good reason (like consistency or data loss) to update immediately, do it.
Update: as @vwegert mentioned regarding the SELECT ... ENDSELECT statement, the statement doesn't actually create individual database queries for each row. The database interface of the application server optimizes the query, transferring rows in bulk to the application server. From there the records are transported to the ABAP report one by one (because in the report there is only the work area to store a single row), which has a significant performance impact, especially for queries with large result sets. A SELECT into an internal table can transport all rows directly to the ABAP report (as long as there is enough memory to hold them), as now there is the internal table to hold those records in the report.
I have some data that I arrange into a collection of custom class objects.
Each object has a couple of properties aside from its unique name, which I will refer to as batch and exists.
There are many objects in my collection, but only a few possible values of batch (although the number of possibilities is not pre-defined).
What is the easiest way to count occurrences of each possible value of batch?
Ultimately I want to create a userform something like this (values are arbitrary, for illustration):
Batch A 25 parts (2 missing)
Batch B 17 parts
Batch C 16 parts (1 missing)
One of my ideas was to make a custom "batch" class, which would have properties .count and .existcount and create a collection of those objects.
I want to know if there is a simpler, more straightforward way to count these values. Should I scrap the idea of a secondary collection and just create some loops and counter variables when I generate my userform?
You described well the two possibilities that you have:
Loop over your collection every time you need the count
Precompute the statistics, and access it when needed
This is a common choice one has to make, and I think here it is a trade-off between performance and complexity.
Option 1 with a naive loop implementation will take O(n) time, where n is the size of your collection, and unless your collection is static, you will have to recompute it every time you need your statistics. On the bright side, the naive loop is fairly trivial to write. Performance on frequent queries and/or large collections could suffer.
Option 2 is fast for retrieval, basically O(1). But every time your collection changes, you need to recompute your statistics. However, this is incremental recomputation, i.e. you do not have to go through the whole collection, just over the changed items. But that means you need to deal with all the possibilities of updates (new item, deleted item, updated item), so it's a bit more complex than the naive loop. And if your collections are entirely new all the time and you query them only once, you have little to gain here.
So it's up to you to decide where to trade off according to the parameters of your problem.
We have 15 databases of 75 tables with an average of a million rows, all with the same schema but different data. We have now been given the requirement by the client to bring all 15 into one database, with each set of data filtered by the user's login.
The changes to the application have been completed to do the filtering. We are now left with the task of merging all databases into one.
The issue is conflicting PKs and FKs: the PKs and FKs are of type int, so we will have 15 PK IDs of 1.
One idea is to use .NET and the DBML to insert the records as new records into the new database, letting LINQ deal with the PKs and FKs and using code to deal with duplicate data.
What other ways are there to do this?
It's never a trivial job to integrate databases when the records don't have unique primary keys in all databases. A few weeks ago I built a similar integration script for which I decided to use Entity Framework.
First the good news. With EF's DbContext API it's ridiculously easy to insert a complete object graph and have EF take care of all newly generated primary keys and foreign keys. The reason this is so easy is that when an object's state is changed to Added, all of its adhering objects become Added as well, and EF figures out the right order of inserts. This is truly great! It let me build the core of the copy routine in a few hours, which would have taken many days if I had had to do it in T-SQL, for example. The latter is also much more error-prone.
Of course life isn't that easy. Now the bad news:
1. This takes tons of machine resources. Of course I used a new context instance for each copy step, but still I had to execute the program on a machine with a decent processor and a fair amount of internal memory. The exact specifications don't matter; the message is: test with the largest databases and see what kind of beast you need. If the memory consumption can't be managed by any machine at your disposal, you have to split the routine up into smaller chunks, but that will take more programming.
2. The object graph that's changed to Added must be divergent. By this I mean that there should only be 1-n associations starting from the root. The reason is that EF will really mark all objects as Added. So if somewhere in the graph a few branches refer back to the same object (because there is an n-1 association), these "new" objects will be multiplied, because EF doesn't know their identity. An example of this could be Company -< Customer -< Order >- OrderType: when there are only 2 order types, inserting one root company with 10 customers with 10 orders each will create 100 order type records instead of 2.
3. So the hard part is to find paths in your class structure that are as divergent as possible. This won't always be possible. If not, you'll have to add the leaves of the converging paths first. In the example: first insert the order types. When a new company is inserted you first load the existing order types into the context and then add the company. Now link the new orders to the existing order types. This can only be done if you can match objects by natural keys (in this example: the order type names), but usually this is possible.
4. You must take care not to insert multiple copies of master data. Suppose the order types in the previous example are the same in all databases (although their primary keys may differ!). The order types from the source database should not be reinserted in the target database. Moreover, you must fix the references in the source data to the correct records in the target database (again by matching by natural key).
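For comparison, here is the same natural-key matching expressed as a rough T-SQL sketch (the Source/Target database names and the table and column names are hypothetical, not taken from the question):

-- Copy only the order types that don't exist yet in the target (match by natural key).
INSERT INTO Target.dbo.OrderType (Name)
SELECT s.Name
FROM Source.dbo.OrderType AS s
WHERE NOT EXISTS (SELECT 1 FROM Target.dbo.OrderType AS t WHERE t.Name = s.Name);

-- When copying orders, remap the foreign key by joining source and target
-- order types on the natural key instead of reusing the source id.
-- (CustomerId would need the same remapping treatment; it is copied as-is
-- here only to keep the sketch short.)
INSERT INTO Target.dbo.Orders (CustomerId, OrderTypeId, OrderValue)
SELECT o.CustomerId, t.OrderTypeId, o.OrderValue
FROM Source.dbo.Orders AS o
JOIN Source.dbo.OrderType AS s ON s.OrderTypeId = o.OrderTypeId
JOIN Target.dbo.OrderType AS t ON t.Name = s.Name;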
So although it wasn't trivial, it was doable and the job was done in a relatively short time. I'm sure that other alternatives (T-SQL, Integration Services, BIDS, if doable at all) would have taken more time or would have been more buggy. And the problem with bugs in this area is that they may only become apparent much later.
I later found out that the issues I describe under 2) are related to fetching the source objects with AsNoTracking. See this interesting post: Entity Framework 6 - use my getHashCode(). I used AsNoTracking because it performs better and it reduces memory consumption.
We have:
users, each of which has events, each of which has several properties (time, type etc.). Our basic use case is to fetch all events of a given user in a given time-span.
We've been considering the following alternatives in Cassandra for the Events column-family. All alternatives share: key=user_id (UUID), column_name = event_time
1. column_value = serialized object of event properties. Will need to read/write all the properties every time (not a problem), but might also be difficult to debug (can't use the Cassandra command-line client easily).
2. The column is actually a super column; the sub-columns are separate properties. Means reading all events(?) every time (possible, though sub-optimal). Any other cons?
3. column_value is a row key into another CF, where the event properties are stored. Means maintaining two tables -> complicates calls + reads/writes are slower(?).
Anything we're missing? Any standard best-practice here?
Alternative 1: Why go to Cassandra if you are going to store serialized objects? MongoDB or a similar product would perform better at this task, if I get it right (I have never actually tried a document-based NoSQL store, so correct me if I'm wrong on this one). Anyway, I tried this alternative once in MySQL 6 years ago and it is still painful to maintain today.
Alternative 2: Sorry, I haven't had to play with super columns yet. I would use this only if I had to frequently show a lot of information about many users (i.e. much more than just their username and a few qualifiers) and their respective events in one query. It could also make a query over a given time span a bit tricky if there are conditions on the user itself too, since a user's row is likely to have event columns that fit in the span and other columns that don't.
Alternative 3: This would definitely be my choice in most cases. You are not likely to write events and create a user in the same transaction, so no worry about consistency. Use the username itself as a standard event column (don't forget to index it) so your calls will be pretty fast. More on this type of data model at http://www.datastax.com/docs/0.8/ddl/index.
Yes, it's a two-call read, but it is two different families of data anyway.
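For what it's worth, a rough sketch of alternative 3 in modern CQL terms (the discussion above uses the older Thrift/super-column model, so the table definitions, names and types here are hypothetical):

-- "Index" table: one row per user, columns ordered by event time,
-- each value pointing at an event row.
CREATE TABLE user_events (
  user_id    uuid,
  event_time timestamp,
  event_id   uuid,
  PRIMARY KEY (user_id, event_time)
);

-- "Detail" table: one row per event, with the properties as real columns.
CREATE TABLE event_details (
  event_id   uuid PRIMARY KEY,
  user_id    uuid,
  event_type text,
  payload    text
);

-- Call 1: the event references for one user in a time span.
SELECT event_id FROM user_events
WHERE user_id = ? AND event_time >= ? AND event_time < ?;

-- Call 2: the properties of those events.
SELECT * FROM event_details WHERE event_id IN ?;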
As for best practices, the field is still quite new; I'm not sure there are any widely accepted ones yet.
I have a normalized database and frequently need to produce web-based reports that involve joins across multiple tables. These queries are taking too long, so I'd like to keep the results computed so that I can load pages quickly. There are frequent updates to the tables I am summarising, and I need the summary to reflect all updates so far.
All tables have auto-increment integer primary keys, I almost always add new rows, and I can arrange to clear the computed results if they change.
I approached a similar problem where I needed a summary of a single table by arranging to iterate over each row in the table and keep track of the iterator state and the highest primary key (i.e. the "high-water mark") seen. That's fine for a single table, but for multiple tables I'd end up keeping one high-water value per table, and that feels complicated. Alternatively I could denormalise down to one table (with fairly extensive application changes), which feels like a step backwards and would probably change my database size from about 5GB to about 20GB.
(I'm using sqlite3 at the moment, but MySQL is also an option).
I see two approaches:
You move the data into a separate, denormalized database, adding some precalculation, to optimize it for quick access and reporting (it sounds like a small data warehouse); a rough sketch follows after the second option. This implies that you have some jobs (scripts, a separate application, etc.) that copy and transform the data from the source to the destination. Depending on the way you want the copying to be done (full/incremental), the frequency of copying and the complexity of the data model (both source and destination), it might take a while to implement and then to optimize the process. It has the advantage that it leaves your source database untouched.
You keep the current database, but you denormalize it. As you said, this might imply changes in the logic of the application (but you might find a way to minimize the impact on the logic using the database; you know the situation better than me :) ).
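For the first option, here is a minimal sketch of a periodic job that rebuilds a denormalized reporting table in SQL (the table and column names are hypothetical):

-- Rebuild the reporting table from the normalized source (a full copy; an
-- incremental variant would only fold in rows added since the last run).
DROP TABLE IF EXISTS report_customer_orders;
CREATE TABLE report_customer_orders AS
SELECT c.name              AS customer_name,
       COUNT(o.id)         AS order_count,
       SUM(o.order_value)  AS total_order_value
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.name;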
Can the reports be refreshed incrementally, or does a refresh mean a full recalculation of the report? If it has to be a full recalculation then you basically just want to cache the result set until the next refresh is required. You can create some tables to contain the report output (and a metadata table to define which report output versions are available), but most of the time this is overkill and you are better off just saving the query results to a file or another cache store.
If it is an incremental refresh then you need the PK ranges to work with anyhow, so you would want something like your high-water mark data (except that you may want to store min/max pairs).
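A minimal sketch of the incremental, high-water-mark approach in SQL, assuming a single-row summary table and hypothetical names:

-- refresh_state remembers the last orders.id already folded into the summary.
CREATE TABLE refresh_state (
  table_name TEXT PRIMARY KEY,
  last_id    INTEGER NOT NULL
);

BEGIN;
-- Fold in only the rows added since the last refresh...
UPDATE summary
SET total_order_value = total_order_value +
    (SELECT COALESCE(SUM(o.order_value), 0)
     FROM orders AS o
     WHERE o.id > (SELECT last_id FROM refresh_state WHERE table_name = 'orders'));
-- ...then advance the high-water mark.
UPDATE refresh_state
SET last_id = COALESCE((SELECT MAX(id) FROM orders), last_id)
WHERE table_name = 'orders';
COMMIT;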
You can create triggers.
As soon as one of the values underlying the calculation changes, you can do one of the following:
Update the calculated field (Preferred)
Recalculate your summary table
Store a flag that a recalculation is necessary. The next time you need the calculated values check this flag first and do the recalculation if necessary
Example:
CREATE TRIGGER update_summary_table AFTER UPDATE OF order_value ON orders
BEGIN
  UPDATE summary
  SET total_order_value = total_order_value
                          - old.order_value
                          + new.order_value;
  -- OR: do a complete recalculation
  -- OR: store a flag
END;
More Information on SQLite triggers: http://www.sqlite.org/lang_createtrigger.html
In the end I arranged for a single program instance to make all database updates, and maintain the summaries in its heap, i.e. not in the database at all. This works very nicely in this case but would be inappropriate if I had multiple programs doing database updates.
You haven't said anything about your indexing strategy. I would look at that first - making sure that your indexes are covering.
Then I think the trigger option discussed is also a very good strategy.
Another possibility is the regular population of a data warehouse with a model suitable for high performance reporting (for instance, the Kimball model).