Using Git for Data History - sql

I am currently tracking the history of changes across multiple tables in an Oracle database. We store the entire changed row in a separate table with the same columns as the data table, except that we also have columns for:
date/time
user
version number (auto-incremented and shared by all tables)
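Roughly, for a single table, the scheme looks like this (ITEM is borrowed from the example structure further down; the other names are just illustrative):
-- Data table (from the example structure below)
CREATE TABLE item (
  item_id     NUMBER PRIMARY KEY,
  description VARCHAR2(200)
);

-- One sequence shared by every history table for the global version number
CREATE SEQUENCE history_version_seq;

-- Same columns as ITEM, plus the audit columns
CREATE TABLE item_history (
  item_id        NUMBER,
  description    VARCHAR2(200),
  changed_at     TIMESTAMP    DEFAULT SYSTIMESTAMP,
  changed_by     VARCHAR2(30) DEFAULT USER,
  version_number NUMBER       -- populated from history_version_seq on each write
);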
I currently have no way of displaying this data to regular users who have no access to the database, and we have been reconsidering the whole history system for the following reasons:
Completely duplicate data structure
Need to store history info in every row that changes
Difficult to track discrete change-sets across tables
Difficult to view the state of the data at a particular point in time
No efficient conflict detection
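To illustrate the point-in-time problem with the hypothetical ITEM_HISTORY table above: reconstructing the state of even one table at, say, version 500 needs a correlated "latest version not greater than N" lookup, and that has to be repeated for every table involved:
-- State of ITEM as of global version 500 (illustrative)
SELECT h.*
FROM   item_history h
WHERE  h.version_number = (SELECT MAX(h2.version_number)
                           FROM   item_history h2
                           WHERE  h2.item_id = h.item_id
                           AND    h2.version_number <= 500);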
Therefore, I have been considering using Git to track the history. I envision the following components:
File system representation of tables (see example below)
One official repository that writes to the database on fast-forward update
This repository will be the only user that can modify the data tables.
Abort any update that causes database errors so that we can still have database-level data enforcement
File structure will be mapped exactly to the tables so that database modification can be automated (see the sketch after the example structure below)
Local repository pools for each application
Attempt push
If the push is rejected because it is not a fast-forward, rebase
If the rebase fails due to conflicts, abort
Write out the entire repository nightly to enforce the Git history as the single source of truth
Has anyone attempted anything similar? If so, are there any details that I should know about?
Additionally, are there any libraries or tools available that are geared to this sort of thing?
Are there better ways of doing something like this with pure Oracle SQL that can solve all my issues?
Example file structure:
(D) indicates Directory; (F) indicates File; Quoted string is file contents
Schema_USER (D)
Table_ITEM (D)
Row_ITEM_ID-24435- (D)
Column_ITEM_ID (F) : "24435"
Column_DESCRIPTION (F) : "Big Ball"
Row_ITEM_ID-24436- (D)
Column_ITEM_ID (F) : "24436"
Column_DESCRIPTION (F) : "Small Ball"
Table_CATEGORY (D)
Row_CATEGORY_ID-35- (D)
Column_CATEGORY_ID (F) : "35"
Column_NAME (F) : "Balls"
Table_ITEM_CATEGORY (D)
Row_CATEGORY_ID-35-ITEM_ID-24436 (D)
Column_CATEGORY_ID (F) : "35"
Column_ITEM_ID (F) : "24436"
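As a sketch of how the official repository's writer could apply a fast-forwarded commit (the mapping script itself is assumed, not something that exists yet): each Row_* directory becomes a MERGE against the corresponding table, keyed on the values encoded in the directory name. For example:
-- Apply Schema_USER/Table_ITEM/Row_ITEM_ID-24435- from the working tree
MERGE INTO item t
USING (SELECT 24435      AS item_id,      -- contents of Column_ITEM_ID
              'Big Ball' AS description   -- contents of Column_DESCRIPTION
       FROM dual) s
ON (t.item_id = s.item_id)
WHEN MATCHED THEN
  UPDATE SET t.description = s.description
WHEN NOT MATCHED THEN
  INSERT (item_id, description)
  VALUES (s.item_id, s.description);
Row directories removed by the commit would map to DELETEs, and any ORA- error would abort the whole update, which is what gives us the database-level enforcement mentioned above.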

Related

Verifying copy of a tree structure in unit test - via one or more methods?

Part of our application copies data from an SQL database to another piece of software, where the data is stored in a tree structure. The data in the SQL database is stored in 4 tables; each record in a table contains foreign keys to the record in the "parent" table (you can imagine something similar to "City" --> "Country" --> "Continent" tables). I have written a unit test which proves that the data from the SQL database is correctly transferred to the other software. But I have done it all in one method, because while verifying the transfer at a certain level, I also obtain the ids of the records in the database and the ids of the nodes in the other software for the next level to be verified. Should it stay like this, or should I have one unit test method for each level that I verify?

Is there a way to generate changelogs per table on an existing database?

I have an existing database to which I want to apply Liquibase, generating a separate changelog for every table in order to capture the current database schema.
As far as I know, it is only possible to generate a single big changelog.xml for the entire database, which I did from the command line:
liquibase --driver=oracle.jdbc.OracleDriver \
--classpath=\path\to\classes:jdbcdriver.jar \
--changeLogFile=com/example/db.changelog.xml \
--url="jdbc:oracle:thin:#localhost:1521:XE" \
--username=scott \
--password=tiger \
generateChangeLog
However, I would like to generate a separate changelog.xml for every table. Let's say the database has three tables: butterfly, flower and bee; then changelog_butterfly.xml, changelog_flower.xml and changelog_bee.xml (or something similar) should be generated.
Any ideas much appreciated.
Judging by the documentation, there is no straightforward way to do this.
Also, from a logical point of view: if the goal is to have one changelog per table, where would you place the changesets for foreign keys? Since they belong to both tables involved, there is a contradiction here :)
One way to resolve the contradiction would probably be to place the foreign keys in a separate changelog. However, that does not look very convenient, especially if you take into account migrating the DB after the initial setup.
Either way, here are some recommendations about changelog organization.

Oracle Audit Trail to get the list of columns which got updated in last transaction

Consider a table (Student) under a schema, say Candidates (NOT a DBA schema):
Student { RollNumber : VARCHAR2(10), Name : VARCHAR2(100), Class : VARCHAR2(5), ......... }
Let us assume that the table already contains some valid data.
I executed an update query to modify the name and class of the Student table
UPDATE STUDENT SET Name = 'ASHWIN' , CLASS = 'XYZ'
WHERE ROLLNUMBER = 'AQ1212'
Followed by another update query in which I am updating some other fields
UPDATE STUDENT SET MATH_MARKS = 100, PHY_MARKS = 95, CLASS = 'XYZ'
WHERE ROLLNUMBER = 'AQ1212'
Since I modified different columns in two different queries, I need to fetch the list of columns that got updated in the last transaction. I am pretty sure that Oracle must be maintaining this in some table logs which could be accessed by a DBA, but I don't have DBA access.
All I need is the list of columns that got updated in the last transaction under the Candidates schema. I DO NOT have DBA rights.
Please suggest some ways to do this.
NOTE: Above I described a simple table, but in reality I have got 8-10 tables for which I need to do this auditing, where a key field, let's say ROLLNUMBER, acts as a foreign key for all the other tables. Writing triggers would be complex for all tables. So please help me out if there exists some other way to fetch the same.
"I am pretty sure that oracle must be maintaining this in some table logs which could be accessed by DBA."
Actually, no, not by default. An audit trail is a pretty expensive thing to maintain, so Oracle does nothing out of the box. It leaves us to decide what we want to audit (actions, objects, granularity) and then to switch on auditing for those things.
Oracle requires DBA access to enable the built-in functionality, so that may rule it out for you anyway.
Auditing is a very broad topic, with lots of things to consider and configure. The Oracle documentation devotes a big chunk of the Security manual to Auditing. Find the Introduction To Auditing here. For monitoring updates to specific columns, what you're talking about is Fine-Grained Audit. Find out more.
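For example, a fine-grained audit policy covering the two columns from your first UPDATE might look like the block below. Bear in mind this is only a sketch: it needs EXECUTE on DBMS_FGA (typically granted by a DBA), and the audit records end up in DBA_FGA_AUDIT_TRAIL, which also needs privileges to read.
BEGIN
  DBMS_FGA.ADD_POLICY(
    object_schema   => 'CANDIDATES',
    object_name     => 'STUDENT',
    policy_name     => 'STUDENT_UPD_AUDIT',
    audit_column    => 'NAME,CLASS',        -- record only updates touching these columns
    statement_types => 'UPDATE');
END;
/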
"I have got 8-10 tables ... Writing triggers would be a complex for all tables."
Not necessarily. The triggers will all resemble each other, so you could build a code generator using the data dictionary view USER_TAB_COLUMNS to customise some generic boilerplate text.
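A rough sketch of that generator idea, assuming a single AUDIT_LOG table to collect the results (AUDIT_LOG and the trigger names are made up for the example, and it relies on ROLLNUMBER being present in every audited table, as you describe):
-- Collects one row per updated column (illustrative)
CREATE TABLE audit_log (
  table_name  VARCHAR2(30),
  column_name VARCHAR2(30),
  roll_number VARCHAR2(10),
  changed_by  VARCHAR2(30) DEFAULT USER,
  changed_at  TIMESTAMP    DEFAULT SYSTIMESTAMP
);

-- Generate one AFTER UPDATE trigger per table from the data dictionary
DECLARE
  v_sql VARCHAR2(32767);
BEGIN
  FOR t IN (SELECT table_name FROM user_tables
            WHERE  table_name IN ('STUDENT')) LOOP   -- list your 8-10 tables here
    v_sql := 'CREATE OR REPLACE TRIGGER aud_' || t.table_name ||
             ' AFTER UPDATE ON ' || t.table_name || ' FOR EACH ROW BEGIN ';
    FOR c IN (SELECT column_name FROM user_tab_columns
              WHERE  table_name = t.table_name) LOOP
      v_sql := v_sql ||
        'IF UPDATING(''' || c.column_name || ''') THEN ' ||
        'INSERT INTO audit_log (table_name, column_name, roll_number) ' ||
        'VALUES (''' || t.table_name || ''', ''' || c.column_name ||
        ''', :NEW.ROLLNUMBER); END IF; ';
    END LOOP;
    v_sql := v_sql || 'END;';
    EXECUTE IMMEDIATE v_sql;
  END LOOP;
END;
/
Each generated trigger simply records which columns an UPDATE touched and for which ROLLNUMBER; the per-table code stays identical, so adding a table means adding its name to the list.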

Accepted methodology when using multiple Sqlite databases

Question
What is the accepted way of using multiple databases that record information about the same object that will ultimately end up living in one central database?
Example
There is one main SQL database about trees.
This database holds information about unique trees from all over the UK.
To collect the information a blank Sqlite database is created (with the same schema) and taken to the tree on a phone.
The collected information is then stored in the Sqlite database until it is brought back and transferred into the main database.
Now this works fine as long as there is only one Sqlite database out for any one tree at a time.
However, if two people wanted to collect different information for the same tree at the same time, when they both came back and attempted to transfer their data in to the main database, there would be collisions on their primary key constraints.
ID Schemes (with example data)
There is a Tree table which has a unique identifier called TreeID
TreeID - TreeName - Location
1001 - Teddington Field - Plymouth
Branch table
BranchID - BranchName - TreeID
1001-10001 - 1st Branch - 1001
1001-10002 - 2nd Branch - 1001
Leaf table
LeafID - LeafName - BranchId
1001-10001-1 - Bedroom - 1001-10001
1001-10002-2 - Bathroom - 1001-10001
Possible ideas
Assign each database 1000 unique IDs; then, when they come back in, since the IDs have already been assigned, the IDs from each database won't collide.
Downfall
This isn't very dynamic and could fail if one database overruns its preassigned IDs.
Is there another way to achieve the same flexibility but without the downfall mentioned above?
So, as an answer:
on the master db, store an extra id field identifying the source/collection database that the dataset was collected on, as well as the tree id.
(src01, 1001), (src02, 1001)
This also allows you to link back easily to the collection source of the information, which is likely going to be a future requirement. Now, you may or may not want to autogenerate another sequence id key value on the master db's table (I wouldn't, but that's because I am not fond of surrogate keys), but I would definitely keep track of the source/tree id the data was originally collected with in the field, separately from any master db unique key considerations.
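A minimal sketch of that shape for the master Tree table (SQLite-style types; the column names are made up):
-- Master table keyed on (collection source, field-generated id),
-- so (src01, 1001) and (src02, 1001) can coexist
CREATE TABLE tree (
  source_db TEXT    NOT NULL,   -- e.g. 'src01', 'src02'
  tree_id   INTEGER NOT NULL,   -- id generated on the collection device
  tree_name TEXT,
  location  TEXT,
  PRIMARY KEY (source_db, tree_id)
);
The Branch and Leaf tables would carry the same source_db column in their keys, so the field-generated BranchID and LeafID values stop colliding as well.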
Apparently you are talking about auto-generated IDs for related objects, not the IDs for the trees themselves. Two different people collecting information about the same tree, starting from the same starting set, end up generating the same IDs independently. The two sets of generated IDs cannot coexist in the same DB.
Since you want to keep all the new data, one possible solution is to avoid using the field-generated IDs in the central database at all. When each set of data comes in, take the data that was added in the field and programmatically add it to the central DB in a way equivalent to how it was added in the field, letting the central DB autogenerate its own IDs.
This requires a mechanism to distinguish newly-collected data from old, but that might be as simple as a timestamp.

Merging data from 2 databases

We currently have a contracts system that pulls in job data from our finance system. Each job has an id, and the contracts hang off of that. We now have to bring in job data from another finance system. The jobs from the new system will also contain a job id, and contracts will have to hang from this. I expect there will be some id conflicts when the data is merged. What's the best way to deal with this? Should I create another table that pulls in the job data from both and assigns a new id for the contracts to hang from? Obviously I will need to update the current contracts to match the newly generated ids. Does this sound like a good idea, or is there a better way?
Given your additional comments, I would suggest that you use a mapping table to map any of the conflicting IDs in the old system to new IDs. Normally, when importing data into an existing system, you would want to keep the IDs of the current system intact; but since that system is going to be gone in a year (or however long it takes) and is about to become read-only, I would think that you would want to try to preserve the IDs of the new system.
Once you create the mapping table, you would then use that to update any foreign key references, etc. and then import the new data, which should now have no conflicts.
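A rough sketch of that approach (the table and column names here are assumptions about your schema):
-- Map each conflicting job id from the old finance system to a freshly issued id
CREATE TABLE job_id_map (
  old_job_id NUMBER PRIMARY KEY,
  new_job_id NUMBER NOT NULL UNIQUE
);

-- Repoint existing contracts (and the remapped job rows themselves, done the same
-- way) at the new ids, then import the new system's jobs without conflicts
UPDATE contracts c
SET    c.job_id = (SELECT m.new_job_id
                   FROM   job_id_map m
                   WHERE  m.old_job_id = c.job_id)
WHERE  c.job_id IN (SELECT old_job_id FROM job_id_map);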