I have an application where I want to take a snapshot of all entities, creating cloned tables that represent a particular point in time. I then want to be able to compare the differences between these snapshots to see how the data evolves over time.
How would you accomplish this in NHibernate? It seems like NH isn't designed for this type of data manipulation, and I'm unsure if I'm abusing my database, NH, or both.
(P.S. Due to database engine restrictions I am unable to use views or stored procs.)
Do you really need to save the entirety of each entity in this snapshot? If so, maybe a collection of tables with names like type_snapshot would help. You could save your entities to this table (only inserting, never updating). You could store the original item's identifier, and generate a new identifier for the snapshot itself. And you could save the timestamp with each snapshot. Your item_snapshot table would look something like:
id | snapshot_date | item_id | item_prop1 | item_prop2 ...
123 | 7/16/10 | 15 | "item desc" | "item name" ...
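For illustration only, a minimal DDL sketch of such a snapshot table (SQL Server-flavoured; the table and column names echo the row above and would need to match your real schema):

-- Hypothetical insert-only snapshot table: one row per entity per snapshot.
CREATE TABLE item_snapshot (
    id            INT IDENTITY(1,1) PRIMARY KEY, -- identifier of the snapshot row itself
    snapshot_date DATETIME      NOT NULL,        -- when the snapshot was taken
    item_id       INT           NOT NULL,        -- identifier of the original item
    item_prop1    NVARCHAR(255) NULL,            -- copies of the item's properties
    item_prop2    NVARCHAR(255) NULL
);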
Within your domain, maybe you could work with Snapshot instances (snapshot containing the id and the snapshot date, along with an instance of T)
It may not be ideal, as it'll introduce a second set of mappings, but it is a way to get where you're going. It seems like you might be better off doing something closer to the database engine, but without knowing what you have in mind for these snapshots (from an application perspective) it's hard to say.
I wound up augmenting my entities with a snapshot id column and copying the entries in place in the table. Combined with a filter, I can select from any given snapshot. Had to make some patches to legacy code, but it basically works.
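Roughly, the SQL side of that approach looks like the sketch below (names and the literal snapshot id are made up; in practice the copy and the filtering were driven through NHibernate):

-- Add a snapshot marker to the existing entity table (NULL = current data).
ALTER TABLE item ADD snapshot_id INT NULL;

-- Taking snapshot 42: copy the current rows in place, tagged with the snapshot id.
INSERT INTO item (snapshot_id, item_prop1, item_prop2)
SELECT 42, item_prop1, item_prop2
FROM item
WHERE snapshot_id IS NULL;

-- The NHibernate filter then reduces to a predicate like:
-- WHERE snapshot_id = 42 (or snapshot_id IS NULL for the live data)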
We wound up creating duplicate tables with an extra column of type timestamp for snapshots. This kept the indexes on the main table smaller; we had 10 million+ rows, so keeping versions in the same table would have created many more records. We also put the version tables in a different tablespace (DB file on MSSQL).
I'm working on a data warehouse project using BigQuery. We're loading daily files exported from various mainframe systems. Most tables have unique keys which we can use to create the type 2 history, but some tables, e.g. a ledger/positions table, can have duplicate rows. These files contain the full data extract from the source system every day.
We're currently able to maintain a type 2 history for most tables without knowing the primary keys, as long as all rows in a load are unique, but we have a challenge with tables where this is not the case.
One person on the project has suggested that the way to handle it is to "compare duplicates", meaning that if the DWH table has 5 identical rows and the staging table has 6 identical rows, then we just insert one more, and if it is the other way around, we just close one of the records in the DWH table (by setting the end date to now). This could be implemented by adding an extra "sub row" key to the dataset like this:
ROW_NUMBER() OVER (PARTITION BY <all data columns> ORDER BY SystemTime) AS data_row_nr
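Spelled out a little more (BigQuery standard SQL; the table and column names below are placeholders for the full set of data columns in the ledger extract), the suggestion amounts to:

-- Number identical rows within the load so duplicates become distinguishable.
SELECT
  col_a,
  col_b,
  col_c,
  ROW_NUMBER() OVER (
    PARTITION BY col_a, col_b, col_c   -- i.e. all data columns
    ORDER BY SystemTime
  ) AS data_row_nr
FROM staging.ledger_daily_load;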
I've tried to find out if this is good practice or not, but without any luck. Something about it just seems wrong to me, and I can't see what unforeseen consequences can arise from doing it like this.
Can anybody tell me what the best way to go is when dealing with full loads of ledger data on a daily basis, for which we want to maintain some kind of history in the DWH?
No, I do not think it would be a good idea to introduce an artificial primary key based on all columns plus the index of the duplicated row.
You will solve the technical problem, but I doubt there will be much business value.
First of all, you should distinguish: the tables you get with a primary key are dimensions, and you can recognise changes and build history.
But the tables without a PK are most probably fact tables (i.e. transaction records), which are typically not fully loaded but loaded based on some DELTA criterion.
Anyway, you will never be able to recognise an update in those records; the only possible change is an insert (deletes are typically not relevant, as the data warehouse keeps a longer history than the source system).
So my to-do list:
Check if the duplicates are intended or illegal
Try to find a delta criterion to load the fact tables
If everything else fails, make the primary key out of all columns plus a single attribute holding the number of duplicates, and build the history (see the sketch below)
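A hedged sketch of that last fallback (placeholder names, BigQuery standard SQL): collapse identical rows and carry the duplicate count as one extra attribute, so that all columns plus the count can be compared between the staging load and the DWH table.

-- Collapse identical ledger rows; all data columns plus dup_count act as the key.
SELECT
  col_a,
  col_b,
  col_c,
  COUNT(*) AS dup_count
FROM staging.ledger_daily_load
GROUP BY col_a, col_b, col_c;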
I have a table with lots of fields which holds "totals" like so:
UserID | total_classA | total_classB | total_classC // and so on
I could, however, have a second table with:
ClassType | Total | UserID
But I don't really see how a second table would be beneficial here for a many-to-one relationship: firstly, I would have to store more rows of data, AND I would have to use a join for selecting data.
But a lot of things I read suggest that having two tables is better than one table with lots of fields... why is this? I do not see the advantage in the above situation =/
Store your data cleanly, as you propose with your 'second table'.
You can always get the summarized column total display with a PIVOT (depending on your platform) or a specialized query if and when you need it.
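For example, assuming the normalized table is called user_class_totals with columns (UserID, ClassType, Total), a conditional-aggregation query (which works even where PIVOT is not available) reproduces the wide layout on demand:

-- Rebuild the "one column per class" display from the normalized rows.
SELECT
    UserID,
    SUM(CASE WHEN ClassType = 'A' THEN Total ELSE 0 END) AS total_classA,
    SUM(CASE WHEN ClassType = 'B' THEN Total ELSE 0 END) AS total_classB,
    SUM(CASE WHEN ClassType = 'C' THEN Total ELSE 0 END) AS total_classC
FROM user_class_totals
GROUP BY UserID;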
The biggest benefit of doing so will be the elimination of having to change your table structure with every additional class type you decide to introduce. You will be able to extend your data tracking capabilities simply by adding rows (DML rather than DDL).
Take a look at second normal form for more of a technical explanation for going this route.
I apologize if this may seem like somewhat of a novice question (which it probably is), but I'm just introducing myself to the idea of relational databases and I'm struggling with this concept.
I have a database table with roughly 75 fields which represent different characteristics of a 'user'. One of those fields represents the locations that user has been to, and I'm wondering what the best way is to store the data so that it is easily retrievable and can be used later on (i.e. tracking a route on Google Maps, identifying if two users shared the same location, etc.)
The problem is that some users may have 5 locations in total while others may be well over 100.
Is it best to store these locations in a text file named using the unique id of each user(one location on each line, or in a csv)?
Or to create a separate table for each individual user connected to their unique id (that seems like overkill to me)?
Or, is there a way to store all of the locations directly in the single field in the original table?
I'm hoping that I'm missing a concept, or there is a link to a tutorial that will help my understanding.
If it helps, you can assume that the locations will be stored in order and will not be changed once stored. Also, these locations are static (I won't need to add more locations later, and existing ones can't be updated).
Thank you for time in helping me. I appreciate it!
Store the location data for the user in a separate table. The location table would link back to the user table by a common user_id.
Keeping multiple locations for a particular user in the user table itself is not a good idea - you'll end up with denormalized data.
You may want to read up on:
Referential Integrity
Relational denormalization
The most common way would be to have a separate table, something like
USER_LOCATION
+---------+-------------+
| USER_ID | LOCATION_ID |
+---------+-------------+
|         |             |
If user 3 has 5 locations, there will be five rows containing user_id 3.
However, if you say the order of locations matters, then an additional field specifying the ordinal position of the location within a user can be used.
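A minimal sketch of that table, with a hypothetical seq column to preserve the order of visits:

-- One row per visit; (user_id, seq) keeps the visits ordered per user.
CREATE TABLE user_location (
    user_id     INT NOT NULL,  -- FK to the users table
    seq         INT NOT NULL,  -- ordinal position of this location for the user
    location_id INT NOT NULL,  -- FK to a locations table (or store the coordinates here)
    PRIMARY KEY (user_id, seq)
);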
The separate table approach is what we call normalized.
If you store a location list as a comma-separated string of location ids, for example, it is trivial to maintain the order, but you lose the ability for the database to quickly answer the question "which users have been at location x?". Your data would be what we call denormalized.
You do have options, of course, but relational databases are pretty good with joining tables, and they are not overkill. They do look a little funny when you have ordering requirements, like the one you mention. But people use them all the time.
In a relational database you would use a mapping table. So you would have user, location and userlocation tables (user is a reserved word so you may wish to use a different name). This allows you to have a many-to-many relationship, i.e. many users can visit many locations. If you want to model a route as an ordered collection of locations then you will need to do more work. This site gives an example
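Independently of that example, a rough sketch of the three tables might look like this (all names are placeholders; app_user sidesteps the reserved word, and visit_order is only needed if you model routes):

-- Many-to-many: many users can visit many locations.
CREATE TABLE app_user (
    user_id INT PRIMARY KEY,
    name    NVARCHAR(100) NOT NULL
);

CREATE TABLE location (
    location_id INT PRIMARY KEY,
    latitude    DECIMAL(9,6) NOT NULL,
    longitude   DECIMAL(9,6) NOT NULL
);

CREATE TABLE userlocation (
    user_id     INT NOT NULL REFERENCES app_user(user_id),
    location_id INT NOT NULL REFERENCES location(location_id),
    visit_order INT NOT NULL,  -- position of the location within the user's route
    PRIMARY KEY (user_id, location_id, visit_order)
);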
We have a web application that is built on top of a SQL database. Several different types of objects can have comments added to them, and some of these objects need field-level tracking, similar to how field changes are tracked on most issue-tracking systems (such as status, assignment, priority). We'd like to show who the change is by, what the previous value was, and what the new value is.
At a pure design level, it would be most straightforward to track each change from any object in a generic table, with columns for the object type, object primary key, primary key of the user that made the change, the field name, and the old and new values. In our case, these would also optionally have a comment ID if the user entered a comment when making the changes.
However, with how quickly this data can grow, is this the best architecture? What are some methods commonly employed to add this type of functionality to an already large-scale application?
[Edit] I'm starting a bounty on this question mainly because I'd like to find out in particular what is the best architecture in terms of handling scale very well. Tom H.'s answer is informative, but the recommended solution seems to be fairly size-inefficient (a new row for every new state of an object, even if many columns did not change) and not possible given the requirement that we must be able to track changes to user-created fields as well. In particular, I'm likely to accept an answer that can explain how a common issue-tracking system (JIRA or similar) has implemented this.
There are several options available to you for this. You could have audit tables which basically mirror the base tables but also include a change date/time, change type and user. These can be updated through a trigger. This solution is typically better for behind the scenes auditing (IMO) though, rather than to solve an application-specific requirement.
The second option is as you've described. You can have a generic table that holds each individual change with a type code to show which attribute was changed. I personally don't like this solution as it prevents the use of check constraints on the columns and can also prevent foreign key constraints.
The third option (which would be my initial choice with the information given) would be to have a separate historical change table which is updated through the application and includes the PK for each table as well as the column(s) which you would be tracking. It's slightly different from the first option in that the application would be responsible for updating the table as needed. I prefer this over the first option in your case because you really have a business requirement that you're trying to solve, not a back-end technical requirement like auditing. By putting the logic in the application you have a bit more flexibility. Maybe some changes you don't want to track because they're maintenance updates, etc.
With the third option you can either have the "current" data in the base table or you can have each column that is kept historically in the historical table only. You would then need to look at the latest row to get the current state for the object. I prefer that because it avoids the problem of duplicate data in your database or having to look at multiple tables for the same data.
So, you might have:
Problem_Ticket (ticket_id, ticket_name)
Problem_Ticket_History (ticket_id, change_datetime, description, comment, username)
Alternatively, you could use:
Problem_Ticket (ticket_id, ticket_name)
Problem_Ticket_Comments (ticket_id, change_datetime, comment, username)
Problem_Ticket_Statuses (ticket_id, change_datetime, status_id, username)
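With the first layout, reading the current state of a ticket means taking its latest history row, along the lines of (SQL Server syntax; the ticket id is an arbitrary example):

-- Current state of ticket 123 = its most recent history row.
SELECT TOP (1) ticket_id, change_datetime, description, comment, username
FROM Problem_Ticket_History
WHERE ticket_id = 123
ORDER BY change_datetime DESC;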
I'm not sure about the "issue tracking" specific approach, but I wouldn't say there is one ultimate way to do this. There are a number of options to accomplish it, each have their benefits and negatives as illustrated here.
I personally would just create one table that has some metadata columns about the change and a column that stores XML of the serialized version of the old object, or whatever you care about. That way, if you want to show the history of the object, you just get all the old versions, re-hydrate them, and you're done. One table to rule them all.
One often overlooked solution would be to use Change Data Capture. This might give you more space savings/performance if you really are concerned.
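If that route interests you, enabling CDC on SQL Server (2008 and later) is roughly a two-step affair; the table name below is just the hypothetical one from the earlier answer, and SQL Server Agent must be running for the capture job:

-- Enable Change Data Capture for the database, then for each table of interest.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Problem_Ticket',  -- hypothetical table name
    @role_name     = NULL;               -- NULL = no gating role required to read changes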
Good luck.
Here is the solution I would recommend to attain your objective.
Design your auditing model as shown below.
AuditEventType 1 ---- *    AuditEvent
AuditEvent     1 ---- 0..1 AuditEventComment
AuditEvent     1 ---- 1..* AuditDataTable
AuditDataTable 1 ---- 1..* AuditDataRow
AuditDataRow   1 ---- 1..* AuditDataColumn
AuditEventType
Contains the list of all possible event types in the system and a generic description for each.

AuditEvent
Contains information about the particular event that triggered this action.

AuditEventComment
Contains an optional custom user comment about the audit event. Comments can be really huge, so it is better to store them in a CLOB.

AuditDataTable
Contains the list of one or more tables that were impacted by the respective AuditEvent.

AuditDataRow
Contains the list of one or more identifying rows in the respective AuditDataTable that were impacted by the respective AuditEvent.

AuditDataColumn
Contains the list of zero or more columns of the respective AuditDataRow whose values were changed, with their previous and current values.
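A minimal DDL sketch of that model (SQL Server-flavoured; the exact columns, especially how a row is identified and how values are stored, are assumptions rather than a prescribed schema):

CREATE TABLE AuditEventType (
    audit_event_type_id INT IDENTITY PRIMARY KEY,
    name                NVARCHAR(100) NOT NULL,
    description         NVARCHAR(400) NULL
);

CREATE TABLE AuditEvent (
    audit_event_id      INT IDENTITY PRIMARY KEY,
    audit_event_type_id INT NOT NULL REFERENCES AuditEventType(audit_event_type_id),
    username            NVARCHAR(100) NOT NULL,
    event_datetime      DATETIME NOT NULL
);

CREATE TABLE AuditEventComment (
    audit_event_id INT PRIMARY KEY REFERENCES AuditEvent(audit_event_id), -- 0..1 per event
    comment        NVARCHAR(MAX) NOT NULL                                 -- CLOB-sized comments
);

CREATE TABLE AuditDataTable (
    audit_data_table_id INT IDENTITY PRIMARY KEY,
    audit_event_id      INT NOT NULL REFERENCES AuditEvent(audit_event_id),
    table_name          NVARCHAR(128) NOT NULL
);

CREATE TABLE AuditDataRow (
    audit_data_row_id   INT IDENTITY PRIMARY KEY,
    audit_data_table_id INT NOT NULL REFERENCES AuditDataTable(audit_data_table_id),
    row_key             NVARCHAR(200) NOT NULL  -- primary key value(s) of the affected row
);

CREATE TABLE AuditDataColumn (
    audit_data_column_id INT IDENTITY PRIMARY KEY,
    audit_data_row_id    INT NOT NULL REFERENCES AuditDataRow(audit_data_row_id),
    column_name          NVARCHAR(128) NOT NULL,
    old_value            NVARCHAR(MAX) NULL,
    new_value            NVARCHAR(MAX) NULL
);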
AuditBuilder
Implement an AuditBuilder (Builder pattern). Instantiate it at the beginning of the event and make it available in the request context, or pass it along with your other DTOs. Each time you make changes to your data anywhere in your code, invoke the appropriate call on the AuditBuilder to notify it about the change. At the end, invoke build() on the AuditBuilder to form the above structure and then persist it to the database.
Make sure all your activity for the event is in a single DB transaction along with persistence of audit data.
It depends on your exact requirements, and this might not be for you, but for general auditing in the database with triggers (so front-end and even the SP interface layer don't matter), we use AutoAudit, and it works very well.
I don't understand the actual usage scenarios for the audited data, though... do you need to just keep track of the changes? Will you need to "rollback" some of the changes? How frequent (and flexible) do you want the audit log report/lookup to be?
Personally, I'd investigate something like this:
Create AuditTable. This has an ID, a version number, a user id and a clob field.
When Object #768795 is created, add a row in AuditTable, with values:
Id=#768795
Version:0
User: (Id of the user who created the new object)
clob: an xml representation of the whole object. (if space is a problem, and access to this table is not frequent, you could use a blob and "zip" the xml representation on the fly).
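As a sketch (SQL Server types assumed; the clob is shown as NVARCHAR(MAX), and the user id is an arbitrary example):

CREATE TABLE AuditTable (
    object_id  INT           NOT NULL,  -- id of the audited object, e.g. 768795
    version_nr INT           NOT NULL,  -- 0 at creation, incremented on every change
    user_id    INT           NOT NULL,  -- who made the change
    object_xml NVARCHAR(MAX) NOT NULL,  -- XML representation of the whole object
    PRIMARY KEY (object_id, version_nr)
);

-- Version 0, written when object #768795 is created.
INSERT INTO AuditTable (object_id, version_nr, user_id, object_xml)
VALUES (768795, 0, 42, N'<object>...</object>');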
Every time you change something, create a new version, and store the whole object "serialized" as XML.
In case you need to create an audit log you have all you need, and can use simple "text compare" tools or libraries to see what changed in time (a bit like Wikipedia does).
If you want to track only a subset of fields, either because the rest is immutable or insignificant, or because you are desperate for speed/space, just serialize the subset you care about.
I know this question is very old, but another possibility, which is built into SQL Server, is
Change Data Capture (CDC).
You can find more information at this link:
Introduction to Change Data Capture (CDC) in SQL Server 2008
http://www.simple-talk.com/sql/learn-sql-server/introduction-to-change-data-capture-(cdc)-in-sql-server-2008/
I think Observer is an ideal pattern in this scenario.
Say I'm mapping a simple object to a table that contains duplicate records and I want to allow duplicates in my code. I don't need to update/insert/delete on this table, only display the records.
Is there a way that I can put a fake (generated) ID column in my mapping file to trick NHibernate into thinking the rows are unique? Creating a composite key won't work because there could be duplicates across all of the columns.
If this isn't possible, what is the best way to get around this issue?
Thanks!
Edit: Query seemed to be the way to go
The NHibernate mapping makes the assumption that you're going to want to save changes, hence the requirement for an ID of some kind.
If you're allowed to modify the table, you could add an identity column (SQL Server naming - your database may differ) to autogenerate unique Ids - existing code should be unaffected.
If you're allowed to add to the database, but not to the table, you could try defining a view that includes a RowNumber synthetic (calculated) column, and using that as the data source to load from. Depending on your database vendor (and the product's handling of views and indexes) this may face some performance issues.
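A sketch of that option, assuming SQL Server and a hypothetical dbo.MyTable (ROW_NUMBER requires an ORDER BY, even an arbitrary one, so the generated ids are only stable within a single query, which is fine for read-only display):

-- Synthetic unique id for a table that has no natural key.
CREATE VIEW dbo.MyTableWithRowNumber
AS
SELECT
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNumber, -- arbitrary, unique per result set
    t.*
FROM dbo.MyTable AS t;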
The other alternative, which I've not tried, would be to map your class to a SQL query instead of a table. IIRC, NHibernate supports having named SQL queries in the mapping file, and you can use those as the "data source" instead of a table or view.
If your data is read-only, one simple way we found was to wrap the query in a view, build the entity off the view, and add a NEWID() column; the result is something like
SELECT NEWID() AS ID, * FROM TABLE
ID then becomes your unique primary key. As stated above, this is only useful for read-only scenarios, as the ID has no relevance outside the query.