Reverse Engineering of a DB without foreign keys - sql

I'm looking for a solution for reverse engineering a DB without foreign keys (really, a 20-year-old DB...). The intention is to do this completely without additional application or persistence logic, just by analyzing the data.
I know this will be somewhat difficult, but it should be possible if the data itself, especially the PKs, is analyzed as well.

I don't think there is a universal solution to your problem. Hopefully there is some sort of naming convention for the tables/columns that can guide you. You can query the system tables to try to figure out what's going on (Oracle: user_tab_columns, SQL Server: INFORMATION_SCHEMA.COLUMNS, etc.). Good luck!
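As a starting point, a naming-convention scan could look something like the query below. It uses the standard INFORMATION_SCHEMA, and the "_id" suffix is only an assumed convention; adjust it to whatever your schema actually uses.

-- List columns whose names suggest they reference another table's key,
-- based purely on an assumed "_id" naming convention.
SELECT c.TABLE_NAME, c.COLUMN_NAME, c.DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS c
WHERE c.COLUMN_NAME LIKE '%\_id' ESCAPE '\'
ORDER BY c.TABLE_NAME, c.COLUMN_NAME;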

I also don't think you'll find a universal solution to this... but I'd like to suggest an approach, especially if you don't have any source code that could be read to guide you on the mapping:
First, scan all tables in your database. By scan, I mean store the table names and columns.
You can infer column types by trying to convert the data to specific formats (start by trying to convert to dates, numbers, booleans and so on)... You can also try to discover data types by analysing the contents (whether a column holds only whole numbers, numbers with slashes, long or short texts, a single digit, etc.).
Once you have mapped all tables, start by comparing the contents of all columns that have a numeric type. (Why? If the database was designed by a human, then they probably used numbers as primary/foreign keys.)
Every time you find more than X successful correspondences between the contents of 2 columns from 2 different tables, log this connection. (The X factor depends on the number of records you have...)
This analysis must run for each table against every other table, column by column, so it will take some time...
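As a rough illustration, the comparison for one pair of columns could be a query along these lines (the tables and columns orders.cust_ref and customers.id are purely hypothetical):

-- Count how many distinct values of one column also occur in the other.
SELECT COUNT(*) AS matches
FROM (SELECT DISTINCT cust_ref AS v FROM orders) o
JOIN (SELECT DISTINCT id AS v FROM customers) c ON o.v = c.v;
-- If "matches" exceeds your X threshold, log orders.cust_ref -> customers.id as a candidate link.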
Of course, this is just an overview of what needs to be done, but it is not complex code to write...
Good luck and let me know if you find any sort of tool to do this! :-)

No offense but you can't have been in databases very long if this surprises you.
I am going to assume that by "reverse engineering" you are just looking to fill in the foreign keys, not moving to NoSQL or something. It could be an interesting project. Here is how I would go about it:
Look at all the SELECT statements and see how joins are made to a table. 20 years ago this would be in a WHERE clause, but it gets more complicated than that, of course: correlated subqueries, UPDATE statements with FROM clauses, and anything else that implies a join of some sort. You have to be able to figure all that out. If you want to do it formally (you can probably suss out all this stuff intuitively), you could count the number of times column combinations are used in joins between tables. List them by pairs of tables, not by the set of all the tables in the join. Those are the candidate foreign keys if one side is a primary key; the other side gets the foreign key. There are multi-column PKs, but you can figure that out (so if the columns matching the primary key are spread across two tables, that's not a foreign key). If one column ends up pointing to two different tables' PKs, that's not a proper foreign key either, but it might be appropriate to pick one table and use it as the target.
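Once you have a candidate pair, it is easy to verify it against the data before creating the constraint. A sketch, using hypothetical tables order_line and orders:

-- Check for orphan rows: candidate FK values with no matching PK.
SELECT COUNT(*) AS orphans
FROM order_line ol
LEFT JOIN orders o ON o.id = ol.order_id
WHERE ol.order_id IS NOT NULL
  AND o.id IS NULL;

-- If orphans = 0, the constraint can be added.
ALTER TABLE order_line
  ADD CONSTRAINT fk_order_line_orders FOREIGN KEY (order_id) REFERENCES orders (id);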
If you don't already have primary keys, you should establish those first. Indexes, perhaps even clustered indexes (in Sybase/MSSQL), aren't always the correct primary keys. In any case you may have to change the primary keys accordingly.
Collecting all the statements might be challenging in itself. You could use perl/awk to parse them out of their C/Java/PHP/Basic/COBOL programs, or you could reap them by monitoring input to the server. You would want to look for WHERE/JOIN/APPLY etc. rather than SELECT. There are lots of other ways.

Related

PostgreSQL - What should I put inside a JSON column

The data I want to store has these characteristics:
There are a finite number of fields (I don't expect to add new fields);
There are some columns that are common to all sets of data (a category field, for instance);
There are some columns that are specific to individual sets of data (each category needs its own fields);
Here's how it would look in a regular table:
I'm having trouble figuring out which would be the better way to store this data in a database for this situation.
Below are the ideas I already had:
Keep everything in one regular table, exactly as above (I would have many NULL values);
Divide the categories into tables (I would use joins when needed);
Use JSON type for storing the values (no NULL values and having it all in same table).
So my questions are:
Is one of these solutions (or one that I have not thought about) better for this case?
Are there other factors, other than the ones presented here, that I should consider to make this decision?
Unless you have very many columns (~ 100), it is usually better to use normal columns. NULL values don't take any storage space in PostgreSQL.
On the other hand, if you have queries that can use any of these columns in the WHERE condition, and you compare with =, a single GIN index on a jsonb column might be better than having many B-tree indexes, because maintaining that many B-tree indexes would be costly.
The definitive answer depends on the SQL statements that you plan to run on that table.
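For illustration, a minimal sketch of the jsonb variant (the table and column names are invented, and it assumes a reasonably recent PostgreSQL version):

CREATE TABLE item (
    id       bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    category text NOT NULL,   -- column common to all sets of data
    attrs    jsonb NOT NULL   -- category-specific fields, no NULL padding
);

-- One GIN index covers key/value searches on any key inside attrs.
CREATE INDEX item_attrs_gin ON item USING gin (attrs);

SELECT * FROM item WHERE attrs @> '{"color": "red"}';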
You have laid out the three options pretty well. Things to consider are:
Performance
Data size
Ease of maintenance
Flexibility
Security
Note that you don't even allude to security considerations. But security at the table level is usually a tad simpler than at the column level and might be important for regulated data such as PII (personally identifiable information).
The primary strength of the JSON solution is flexibility. It is easy to add new columns. But you don't need that. JSON has a cost in data size and data type flexibility (notably JSON doesn't support date/times explicitly).
A multiple table solution requires duplicating the primary key but may result in much less storage overall if the columns really are sparse. The "may" may also depend on the data type. A NULL string for instance occupies less space than a NULL float in a table record.
The joins on multiple tables will be 1-1 on primary keys. These should be pretty fast.
What would I do? Unless the answer is obvious, I would dump the data into a single table with a bunch of columns. If that table starts to get unwieldy, then I would think about splitting it into separate tables -- but still have one table for the common columns. The details of one or multiple tables can be hidden behind a view.
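If you do end up splitting, the view idea could look roughly like this (the product/book names are made up for the example):

CREATE TABLE product_common (
    id       bigint PRIMARY KEY,
    category text NOT NULL          -- columns shared by every category
);

CREATE TABLE product_book (
    id    bigint PRIMARY KEY REFERENCES product_common (id),
    isbn  text,
    pages integer                   -- columns specific to one category
);

-- The 1-1 join on the primary key is hidden behind a view.
CREATE VIEW product_book_v AS
SELECT c.id, c.category, b.isbn, b.pages
FROM product_common c
JOIN product_book b ON b.id = c.id;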
It depends on how much data you want to store, but as long as it is finite it shouldn't make a big difference whether it contains a lot of NULLs or not.

Structuring Relational databases - combining similar and related tables

I am used to seeing relational databases where distinct entities are stored in different tables (simple example: Country, State, City). Recently I have been seeing more cases where distinct but similar entities are bundled into the same table combined with different views. I suppose this can economize on tables and data access programs (maybe at the expense of clarity and flexibility). Re-reading the definition of normalized databases, I don't think this breaks any rules, but it seems less intuitive and a throwback to old mainframe "Miscellaneous" tables where you put anything that was forgotten at the design stage. See the 2 examples below: multi-table solution vs. single-table solution. Is this phenomenon part of a data or programming design pattern, and does it have a name?
If you have small dedicated tables, then the database can easily cache the ones it needs in memory.
If you take what would otherwise be small tables and cram them together into one, the database doesn't know which entries are important to cache and which aren't.
More importantly, there is more opportunity for errors because you can inadvertently type in the wrong type code and end up joining to something irrelevant, with no RI or typechecking to warn you. If you use small dedicated tables then you can specify RI constraints.
Thinking back to a place where I saw the single monster-lookup-table pattern done, I think the attraction was that developers can add more kinds of entries without needing DBA intervention to create more tables. There were a lot of developers and only a few DBAs and this was how the DBAs avoided getting sucked into having to create dedicated lookup tables every time a new type of lookup entry was introduced. (Apparently granting create table rights in dev was not acceptable for the DBAs there.)
This seems like a workaround for environments where database schema changes are hard to come by. But another consideration is it may be easier to internationalize if all your entries are in one table.
And the pattern has an established name, it's called the One True Lookup Table. The linked article calls it out as an antipattern, and lists more downfalls of this technique. Here is the bulleted list from the article:
It makes the SQL look ugly.
Many statements will require multiple joins to the lookup table. The extra join columns make the statements look bigger and scarier. There will be the same number of joins when using separate lookup tables, but those joins will be simpler.
Multiple references to the same table can make it hard to determine what is happening in the execution plan, as you will see those repeated references there, and have to refer to the predicates to understand the context of table reference. If you were using separate lookup tables, it would be clear which table you were referring to at any point of the execution plan.
You can't foreign key to this type of table. Technically you can if you are willing to put both columns (lookup_type_code and lookup_key) in the table, but you won't because it is ugly. This means there is a good chance your data integrity will be compromised over time. It's really easy to foreign key to individual lookup tables, and therefore protect your data.
It's hard to control the contents of the table. It's a shared resource, so check constraints and triggers are problematic. If you need users to have different privileges, depending on which lookup they are dealing with, things are going to get messy. That would be really easy with separate lookup tables.
If you need to make a change for one reference type, like extending the size of the key or value, it affects all reference data. Using separate lookup tables isolates the change.
Over time, many reference tables take on additional data. To model that you would need to either split out that reference data from this shared lookup table, or start adding optional columns to cope with the "one-off" issues. A change like this is really simple for separate lookup tables.
Data types matter. You should always use the correct data type, as it will reduce the number of data type conversions needed. Implicit data type conversions are bugs waiting to happen!
Performance can be a problem with the OTLT approach, as it's hard for the optimizer to make sound judgements about the data. The optimizer cares about cardinality, but that is hard to judge when you are dealing with a large number of rows, most of which are irrelevant in any one specific context. The optimizer also cares about high/low values, but these are not relevant to any one lookup; they are shared across all of them. We've also mentioned that you probably won't foreign key to this data, which reduces the amount of information the optimizer has when making its decision. You may have artificially made columns optional that are actually mandatory: a key must have a value, but which column? I think you get the message.
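To make the foreign-key point concrete, here is a small sketch contrasting the two shapes (all names are illustrative only):

-- One True Lookup Table: referencing rows cannot carry a clean foreign key
-- without also storing the lookup_type_code everywhere.
CREATE TABLE lookup (
    lookup_type_code varchar(30)  NOT NULL,
    lookup_key       varchar(30)  NOT NULL,
    lookup_value     varchar(200) NOT NULL,
    PRIMARY KEY (lookup_type_code, lookup_key)
);

-- Dedicated lookup table: the foreign key protects the data.
CREATE TABLE order_status (
    status_code varchar(10)  PRIMARY KEY,
    description varchar(100) NOT NULL
);

CREATE TABLE orders (
    order_id    integer     PRIMARY KEY,
    status_code varchar(10) NOT NULL REFERENCES order_status (status_code)
);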
I think if you only need a name dictionary (for spellchecking or something like that), the second approach is good enough. Otherwise, if objects have some additional specific fields, the second approach is very bad.

How to design a table that only needs a column?

I am creating a database table that'll have a list of all Tags available in my application (just like SO's tags).
Currently, I don't have anything associated with each tag (and I'll probably never have), so my idea was to have something of the form
Tags (Tag(pk) : string)
Should this be the way to do it? Or should I instead do something like
Tags (tag_id(pk) : int, tag : string)
I guess lookups on the table in the 2nd case would be faster than in the first one, but that it also takes up more space?
Thanks
I'd go for the second option with the surrogate key.
It will mean the table takes up more space, but it will likely reduce space overall, assuming that you have the tag information as a foreign key in other tables (e.g. a posts/tags table).
Using an int rather than a string will make the lookups required to enforce the foreign key more efficient, and it means that updates of tag titles don't need to affect multiple tables.
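As a sketch of that option (the post_tags table is assumed for the example, not taken from the question):

CREATE TABLE tags (
    tag_id integer     PRIMARY KEY,
    tag    varchar(50) NOT NULL UNIQUE   -- tag names stay unique
);

CREATE TABLE post_tags (
    post_id integer NOT NULL,
    tag_id  integer NOT NULL REFERENCES tags (tag_id),
    PRIMARY KEY (post_id, tag_id)
);

-- Renaming a tag touches only tags.tag; post_tags keeps the small integer key.
UPDATE tags SET tag = 'postgresql' WHERE tag_id = 42;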
Indexes work better with integers than with CHAR/VARCHAR, so go with a dedicated integer primary key column. If you need tag names to be unique you can add a constraint, but it's probably not worth the hassle.
You should go for the second option. Firstly, you never know what the future holds. Secondly, you may later want multiple language support or other things that make the string-as-the-primary-key have a strange feeling around it. Thirdly, I like the idea of using a standard procedure for a table definition, i.e. that there is always a column 'id' or 'pk'. It separates business from technology.
Quite possibly you'll have a faster lookup with the index being an integer. Further, consider making your index clustered for even further speedup.
I wouldn't put too much emphasis on the performance issue, though. As soon as a program starts talking to a database over the internet, you have a much bigger delay than 99% of all the queries of your database (of course with the exception of reporting queries!).
Those two options achieve quite different things. In the first case you have unique tags and in the second you don't. You haven't said what use TAG_ID is in this model. Unless you put in TAG_ID for a good reason, I'd stick with the first design. It's smaller, appears to meet your requirements precisely, and Tag seems like a more obvious choice for a key (on grounds of familiarity and simplicity).

Modeling database : many small tables or not?

I have a database with some information that is repeated in several tables.
I want to know whether it's worthwhile to create a table with this information and put only its id in the other tables.
It's appealing because with this method I have no redundancy. But I will have to do many joins between my tables in my queries, and I'm afraid my queries will be slower.
(I work with Symfony, if that changes anything.)
It sounds like the 'information' in question is data that makes up key values. If so, it sounds like the database designer likes to use natural keys and that you prefer to use surrogate keys.
First, these are both merely a question of style. If the natural key values are composite (i.e. involve more than one column) and are included in other tables for data integrity purposes, then they are not redundant.
Second, as you have observed, when it comes to the performance of surrogate keys you have to weigh the advantage of the more efficient data type (e.g. a single integer column) against the degrading performance of needing to write more JOINs. Note that using surrogates tends to make constraints more troublesome to write, e.g. when the required values for a rule are in another table and your SQL product doesn't support subqueries in CHECK constraints, you will need to use a trigger, which degrades performance in a high-activity environment.
Further consider that performance is not the only consideration e.g. using natural key values will tend to make the data more readable and therefore make the schema easier to maintain because the physical model will reflect the logical model more closely (surrogate keys do not appear in the logical model at all).
You're talking about Normalisation. As with so many design aspects it's a trade-off.
Having duplication within the database leads to many problems - for example how to keep those duplicates in step when updating data. So Inserts and Updates may well go more slowly because of the duplication. Hence we tend to normalise the database to avoid such duplication. That does lead to more complex queries and possibly some retrieval overhead.
Modern database products tend to do such queries really well if you take a bit of care to have the right indexes in place.
Hence my starting position would be to normalise your data and avoid duplication. Then, in special cases, perhaps denormalise just the pieces where it really becomes essential. For example, suppose some part of your database is large and mostly queried rather than updated (e.g. historic order information); then perhaps denormalise that bit.
It is not a question of style.
The answer is, as the seeker has already identified, removal of duplication; Normalisation. Pull them all into one table, and place a Foreign Key wherever they are used.
Now an Integer FK may be "tidy", but any good, short, fixed length key will do. Variable length keys are very bad for performance, as the key needs to be unpacked every time the index is searched.
The nature of a Normalised database is more, smaller tables, which is much faster than an Unnormalised data heap, with fewer, larger tables. Get used to it.
As long as you are Joining on keys, Joins do not cost anything in themselves; ten joins to construct a row do not cost more than five. The cost is in the table sizes; the indices used; the distribution; the datatypes of the index columns; etc. Relational dbms are heavily engineered for Normalised databases.
If you need to do lookups of lookups, then that is the way it is. Just ensure that the tables are Normalised.
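A minimal sketch of that normalisation (the city/customer names are invented for the example):

CREATE TABLE city (
    city_id integer      PRIMARY KEY,
    name    varchar(100) NOT NULL
);

CREATE TABLE customer (
    customer_id integer      PRIMARY KEY,
    name        varchar(100) NOT NULL,
    city_id     integer      NOT NULL REFERENCES city (city_id)  -- a key instead of a repeated string
);

-- Reading the value back is a join on the key, which the optimizer handles cheaply.
SELECT cu.name, ci.name AS city
FROM customer cu
JOIN city ci ON ci.city_id = cu.city_id;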
If you don't normalise:
How are you going to store values that could potentially be used?
How are you going to separate "Lookup value" from "Look up value" from "LookUpValue", etc.?
You'll be slower because you are storing the variable-length string "Lookup value" across many rows, rather than a nice tidy integer key.
These are the more practical points to add to the other 2 answers...

Database design: why use an autoincremental field as primary key?

Here is my question: why should I use auto-incremental fields as primary keys on my tables instead of something like UUID values?
What are the main advantages of one over the other? What are their strengths and weaknesses?
Simple numbers consume less space. UUID values consume 128 bits each. Working with numbers is also simpler. For most practical purposes a 32-bit or 64-bit integer can serve well as the primary key. 2^64 is a very large number.
Consuming less space doesn't just save hard disk space. It means faster backups, better performance in joins, and more real data cached in the database server's memory.
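For comparison, a sketch of both options in PostgreSQL syntax (purely illustrative; gen_random_uuid() is built in from PostgreSQL 13, earlier versions need the pgcrypto extension):

CREATE TABLE customer_serial (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- 8 bytes per key
    name text
);

CREATE TABLE customer_uuid (
    id   uuid DEFAULT gen_random_uuid() PRIMARY KEY,       -- 16 bytes per key
    name text
);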
You don't have to use auto-incrementing primary keys, but I do. Here's why.
First, if you're using ints, they're smaller than UUIDs.
Second, it's much easier to query using ints than UUIDs, especially if your primary keys turn up as foreign keys in other tables.
Also, consider the code you'll write in any data access layer. A lot of my constructors take a single id as an int. It's clean, and in a type-safe language like C# - any problems are caught at compile time.
Drawbacks of autoincrementers? Potentially running out of space. I have a table which is at 200M on its id field at the moment. It'll bust the 2 billion limit in a year if I leave it as is.
You could also argue that an autoincrementing id has no intrinsic meaning, but then the same is true of a UUID.
I guess by UUID you mean something like a GUID? GUIDs are better when you will later have to merge tables. For example, if you have local databases spread around the world, they can each generate unique GUIDs for row identifiers. Later the data can be combined into a single database and the IDs shouldn't conflict. With an autoincrement in this case you would have to have a composite key where the other half of the key identifies the originating location, or you would have to modify the IDs as you imported data into the master database.
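The composite-key alternative mentioned above would look roughly like this (names invented for the example):

CREATE TABLE orders (
    site_id  integer NOT NULL,  -- identifies the originating location/database
    order_id integer NOT NULL,  -- locally auto-incremented value
    PRIMARY KEY (site_id, order_id)
);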