For our project it has become increasingly complicated to reproduce certain error conditions that show up in production. Extracting and recreating such a condition sometimes takes hours of re-entering data, mainly because the required "graph" can be huge and there are many referential constraints that must be fulfilled. Recreating these in the correct order in an analysis DB (production cannot be used for this, for obvious reasons) is often extremely complicated and tedious.
What would ease such analyses enormously would be some tool that - given a specific table and row-id as starting point - would traverse the entire graph as defined by a table's references (foreign keys) and emit all referenced entries recursively.
Ideally it would emit all these rows (table name, column names and their values) as SQL INSERT statements, so that one could run them as an insert script to load the relevant subset into another DB for analysis.
Does such a tool exist? I could imagine that this is not such a rare or exotic wish or requirement. Or is this wishful thinking, and am I in for a longer programming exercise?
The DB we are using is Oracle (v12) - in case that matters.
Hope I could make myself clear and convey the intention.
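To make the idea concrete, here is a minimal sketch of such a traversal, not a finished tool. It uses Python's built-in sqlite3 module and SQLite's PRAGMA foreign_key_list as a stand-in for Oracle, where the same metadata would have to come from the data dictionary instead. The function names, the demo tables, single-column foreign keys with explicitly named target columns, and the simple literal quoting are all assumptions made for illustration.

```python
import sqlite3


def quote(value):
    """Render a value as a SQL literal (only NULL, numbers and text are covered)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def export_row_graph(conn, table, key_column, key_value, seen=None, out=None):
    """Follow foreign keys from one row to every row it references.

    Referenced (parent) rows are emitted before the rows pointing at them,
    so the INSERT statements can be replayed in the order returned.
    Assumes single-column foreign keys that name their target column.
    """
    seen = set() if seen is None else seen
    out = [] if out is None else out
    if (table, key_column, key_value) in seen:
        return out  # already exported; this also stops on reference cycles
    seen.add((table, key_column, key_value))

    cur = conn.execute(f'SELECT * FROM "{table}" WHERE "{key_column}" = ?', (key_value,))
    columns = [d[0] for d in cur.description]
    rows = cur.fetchall()
    # Each PRAGMA row is (id, seq, referenced_table, from_column, to_column, ...)
    fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
    for row in rows:
        record = dict(zip(columns, row))
        for _id, _seq, ref_table, from_col, to_col, *_ in fks:
            if record[from_col] is not None:
                export_row_graph(conn, ref_table, to_col, record[from_col], seen, out)
        out.append(
            f"INSERT INTO {table} ({', '.join(columns)}) "
            f"VALUES ({', '.join(quote(v) for v in row)});"
        )
    return out


# Hypothetical demo schema: an order referencing a customer and a product.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE orders   (id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(id),
                           product_id  INTEGER REFERENCES product(id));
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO product  VALUES (10, 'Widget');
    INSERT INTO orders   VALUES (100, 1, 10);
""")
statements = export_row_graph(conn, "orders", "id", 100)
for statement in statements:
    print(statement)
```

Only the rows reachable from order 100 are emitted; customer 'Bob' is not referenced and is left out. Composite keys, BLOB and date columns, and rows that reference the starting row (children rather than parents) would all need extra handling.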
Years ago I did this and it was very easy because I had object-relational mapping to java objects. I could just pick up the "parent" object from source db [which would recursively traverse all the relationships and pick up all the children], open a connection to the target DB and save the fully instantiated parent object tree. I don't know of any other tools. Folks in my company try to keep a pre-prd DB periodically refreshed from prod.
Taking hours to manually reproduce the problem conditions in the data seems like a long time, but so would hand building a custom solution.
If you are having data problems, it is likely a bug in the code, so if you can get developers to "write the test that fails" based on the conditions, you'll be better off.
Related
Background
I have a software component that writes data to a postgres database (into several tables) and I want to write an automatic functional test for this component. I already have a host of unit tests in place that check the subcomponents, but I'd like a test that checks the whole system end-to-end.
For each test run, I use a clean database (actually a completely new, this-test-run-only database). The software component is stable in the sense that given the same input, it will always write the same user data to the database.
The database design is relational, such that most tables contain foreign keys. Obviously, I don't want to check the value of these keys, because I don't want to rely on these keys being generated in a predictable manner by postgres.
Assume that there are no issues regarding user rights on the database, connection issues etc. Also disregard development/production disparities.
I currently use a number of select statements to produce a textual "dump" of the database and compare it to a reference dump (ignoring whitespace and so on), but this seems rather clumsy. Also, this doesn't take into account the relationships between the tables. Extending the current approach to deal with this doesn't strike me as maintainable at all, should the database layout ever change.
My software as well as the testing framework is written in C++, the testing scripts are simple bash scripts. I'm open to use any language to achieve this.
Question
How can I automatically verify the database contents in "the database way"?
Even better would be an approach that doesn't rely on postgres as the backend.
pgTap is a testing framework for PostgreSQL. You can use it to test both the structure and the content of a PostgreSQL database. I've used it on projects that had to meet certain contractual standards for seeded data (data for "lookup" tables like state codes and abbreviations, delivery carriers, user roles, etc.). It has worked well for that purpose.
But I don't yet see a compelling reason to abandon your current method, which is already written and working. Text dumps of single tables are supported by all current SQL dbms, as far as I know. If you move to a different dbms, you'll have to change the name of the dump program and the arguments to it. I can't imagine why you'd need to change the reference file, but I suppose that could happen.
The "database way" is really just to select the data you expect to be in the database, and see if it's really there. That's pretty much what you're doing now, and what pgTap does with perhaps greater flexibility.
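As a rough illustration of selecting the expected data without depending on generated keys, the sketch below joins across the foreign key and compares only user data. It uses Python's built-in sqlite3 module rather than postgres so that it runs anywhere; the schema, the make_db helper and the query are invented for the example.

```python
import sqlite3

SCHEMA = """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book   (id INTEGER PRIMARY KEY, title TEXT,
                         author_id INTEGER REFERENCES author(id));
"""

# The join replaces each foreign key by the user data it points at,
# so generated id values never appear in the snapshot.
SNAPSHOT_QUERY = """
    SELECT author.name, book.title
    FROM book JOIN author ON author.id = book.author_id
    ORDER BY author.name, book.title
"""


def make_db(author_id):
    """Build a database whose generated key differs from run to run."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO author VALUES (?, 'Knuth')", (author_id,))
    conn.execute("INSERT INTO book (title, author_id) VALUES ('TAOCP', ?)", (author_id,))
    return conn


def snapshot(conn):
    return conn.execute(SNAPSHOT_QUERY).fetchall()


reference = snapshot(make_db(1))   # keys as in the reference run
actual = snapshot(make_db(42))     # different keys, same user data
print(reference == actual)
```

The relationship is still verified, because a book attached to the wrong author would change the joined row. Rows with no counterpart on the other side of the join (an author without books) would need an outer join or a separate per-table query.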
To increase maintainability (to reduce duplication), you could generate the INSERT statements from the reference data, or you could generate the reference data from the INSERT statements. I can imagine development environments where that would be a wise thing to do, but I don't know whether yours is one of them.
The project I work on has undergone a transformation at the database level. For the better, about 40% of the SQL layout has been changed. Some columns were eliminated, others moved. I am now tasked with developing a data migration strategy.
What migration methods or tools are available so that I don't have to figure out each and every individual dependency and manually script a key change when IDs (for instance) change?
I realize this question is a bit obtuse and open ended, but I assume others have had to do this before and I would appreciate any advice.
I'm on MS SQL Server 2008
@OMG Ponies: "Not obtuse but vague."
Great point. I guess this helps me reconsider what I am asking, at least make it more specific. How do you insert from multiple tables to multiple tables keeping the relationships established by the foreign keys intact? I now realize I could drop the ID key constraint during the insert and re-enable it after, but I guess I have to figure out what depends on what myself and make sure it goes smoothly.
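One common way to keep the relationships intact without disabling constraints is to insert the parents first, remember which new ID each old ID received, and translate the children through that map. The following is a minimal sketch of that idea in Python with the built-in sqlite3 module standing in for SQL Server; all table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the new constraints while loading
conn.executescript("""
    -- old layout (hypothetical)
    CREATE TABLE old_customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE old_order    (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
    INSERT INTO old_customer VALUES (7, 'Ann'), (9, 'Bob');
    INSERT INTO old_order    VALUES (1, 9, 'book'), (2, 7, 'pen'), (3, 9, 'ink');

    -- new layout with freshly generated keys
    CREATE TABLE customer (id INTEGER PRIMARY KEY AUTOINCREMENT, full_name TEXT);
    CREATE TABLE purchase (id INTEGER PRIMARY KEY AUTOINCREMENT,
                           customer_id INTEGER NOT NULL REFERENCES customer(id),
                           item TEXT);
""")

# Step 1: load the parent table and record old id -> new id.
id_map = {}
old_customers = conn.execute("SELECT id, name FROM old_customer ORDER BY id").fetchall()
for old_id, name in old_customers:
    cur = conn.execute("INSERT INTO customer (full_name) VALUES (?)", (name,))
    id_map[old_id] = cur.lastrowid

# Step 2: load the child table, translating each foreign key through the map.
old_orders = conn.execute("SELECT customer_id, item FROM old_order ORDER BY id").fetchall()
for old_customer_id, item in old_orders:
    conn.execute(
        "INSERT INTO purchase (customer_id, item) VALUES (?, ?)",
        (id_map[old_customer_id], item),
    )
conn.commit()

migrated = conn.execute("""
    SELECT c.full_name, p.item
    FROM purchase p JOIN customer c ON c.id = p.customer_id
    ORDER BY p.id
""").fetchall()
print(migrated)
```

For large tables the map would normally live in a mapping table inside the database and the child load would be a single INSERT ... SELECT joined against it; the Python dictionary is only there to keep the sketch short.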
I'll start there, but will leave this open in case anyone else has other recommendations.
You should create an upgrade script that morphs the current schema into the v. next schema, applying appropriate operations (alter table, select into, update, delete etc). While this may seem tedious, it is the only process that will be testable: start from a backup of the current db, apply the upgrade script, test the resulting db for conformance with the desired schema. You can test and debug your upgrade script until it is hammered into correctness. You can test it on a real data size so that you get a correct estimate of downtime due to size-of-data operations.
While there are tools out there that can copy data or transform schemas (like SQL Compare), I believe approaching this as a development project, with a script deliverable that can be tested repeatedly and validated, is a much saner approach.
In future you can account for this upgrade step in your development and start with it, rather than try to squeeze it in at the end.
There are tons of commercial tools around that claim to solve this. I wouldn't buy that...
I think your best bet is to model domain classes that represent your data and write adapters that read in/serialize to the old/new schemas.
If you haven't got a model of your domain, you should build one now.
IDs will change, so ideally they should not carry any meaning to users of your database.
I've shown up at a new job and discovered a database which is in dire need of some help. There are many, many things wrong with it, including:
No foreign keys...anywhere. They're faked by using ints and managing the relationship in code.
Practically every field can be NULL, which isn't really true
Naming conventions for tables and columns are practically non-existent
Varchars which are storing concatenated strings of relational information
Folks can argue, "It works", which it does. But moving forward, it's a total pain to manage all of this with code and opens us up to bugs IMO. Basically, the DB is being used as a flat file since it's not doing a whole lot of work.
I want to fix this. The issues I see now are:
We have a lot of data (migration, possibly tricky)
All of the DB logic is in code (with migration comes big code changes)
I'm also tempted to do something "radical" like moving to a schema-free DB.
What are some good strategies when faced with an existing DB built upon a poorly designed schema?
Enforce Foreign Keys: If a relationship exists in the domain, then it should have a Foreign Key.
Renaming existing tables/columns is fraught with danger, especially if there are many systems accessing the Database directly. Gotchas include tasks that run only periodically; these are often missed.
Of Interest: Scott Ambler's article: Introduction To Database Refactoring
and Catalog of Database Refactorings
Views are commonly used to transition between changing data models because of the encapsulation they provide. A view looks like a table, but does not exist as a stored object holding data of its own - you can change what column is being returned for a given column alias as desired. This allows you to set up your codebase to use a view, so you can move from the old table structure to the new one without the application needing to be updated. But it means the view has to return the data in the existing format. For example - your current data model has:
SELECT t.column --a list of concatenated strings, assuming comma separated
FROM TABLE t
...so the first version of the view would be the query above, but once you created the new table that uses 3NF, the query for the view would use:
SELECT GROUP_CONCAT(t.column SEPARATOR ',')
FROM NEW_TABLE t
...and the application code would never know that anything changed.
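A small runnable sketch of the same trick, using Python's built-in sqlite3 module instead of MySQL: SQLite spells the aggregate group_concat(x, ','), and the table, view and column names here are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- new normalized table: one tag per row
    CREATE TABLE item_tag (item_id INTEGER, tag TEXT);
    INSERT INTO item_tag VALUES (1, 'red'), (1, 'large'), (2, 'blue');

    -- view that presents the old shape: one comma separated string per item
    CREATE VIEW item AS
        SELECT item_id AS id, group_concat(tag, ',') AS tags
        FROM item_tag
        GROUP BY item_id;
""")

# Legacy code keeps querying "item" exactly as it did before the change.
legacy_rows = conn.execute("SELECT id, tags FROM item ORDER BY id").fetchall()
print(legacy_rows)
```

SQLite does not promise the order in which group_concat joins the values, so code that depends on a particular order inside the string would need more care than this sketch takes.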
The problem with MySQL is that the view support is limited - you can't use variables within a view and, at least in older versions, a view's SELECT can't contain a subquery in the FROM clause.
The reality of the changes you wish to make is that you are effectively rewriting the application from the ground up. Moving logic from the codebase into the data model will drastically change how the application gets the data. Model-View-Controller (MVC) is ideal to implement alongside changes like these, to minimize the cost of similar changes in future.
I'd say leave it alone until you really understand it. Then make sure you don't start with one of the Things You Should Never Do.
Read Scott Ambler's book on Refactoring Databases. It covers a good many techniques for how to go about improving a database - including the transitional measures needed to allow both old and new programs to work with the changing design.
Create a completely new schema and make sure that it is fully normalized and contains any unique, check and not null constraints etc that are required and that appropriate data types are used.
Prepopulate each table that fills the parent role in a foreign key relationship with a single 'Unknown' record.
Create an ETL (Extract Transform Load) process (I can recommend SSIS (SQL Server Integration Services) but there are plenty of others) that you can use to refill the new schema from the existing one on a regular basis. Use the 'Unknown' record as the parent of any orphaned records - there will be plenty ;). You will need to put some thought into how you will consolidate duplicate records - this will probably need to be on a case by case basis.
Use as many iterations as are necessary to refine your new schema (ensure that the ETL Process is maintained and run regularly).
Create views over the new schema that match the existing schema as closely as possible.
Incrementally modify any clients to use the new schema making temporary use of the views where necessary. You should be able to gradually turn off parts of the ETL process and eventually disable it completely.
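The 'Unknown' parent step can be sketched in a few lines. This is only an illustration of the load step, written in Python with the built-in sqlite3 module rather than SSIS; the legacy and new table names, and the choice of id 0 for the 'Unknown' row, are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- legacy tables: the relationship is only implied by an int column
    CREATE TABLE legacy_dept (id INTEGER, name TEXT);
    CREATE TABLE legacy_emp  (id INTEGER, name TEXT, dept INTEGER);
    INSERT INTO legacy_dept VALUES (1, 'Sales');
    INSERT INTO legacy_emp  VALUES (1, 'Ann', 1), (2, 'Bob', 99), (3, 'Cy', NULL);

    -- new schema with real constraints
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE emp  (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                       dept_id INTEGER NOT NULL REFERENCES dept(id));

    -- the single 'Unknown' parent record (id 0 is an arbitrary choice)
    INSERT INTO dept VALUES (0, 'Unknown');
    INSERT INTO dept SELECT id, name FROM legacy_dept;

    -- orphans (dangling or NULL references) fall back to the 'Unknown' row
    INSERT INTO emp
    SELECT e.id, e.name, COALESCE(d.id, 0)
    FROM legacy_emp e LEFT JOIN dept d ON d.id = e.dept;
""")

loaded = conn.execute("SELECT name, dept_id FROM emp ORDER BY id").fetchall()
print(loaded)
```

Because the foreign key and NOT NULL constraints are active during the load, any row the fallback does not cover fails loudly instead of slipping into the new schema.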
First see how tightly the code is coupled to the DB. If it is all mixed together with no DAO layer, you shouldn't think about a rewrite; but if there is a DAO layer, then it would be time to rewrite that layer and the DB along with it. If possible, base the migration tool on the two DAOs.
But my guess is there is no DAO, so you need to find which areas of the code you are going to be changing and which parts of the DB they relate to. Hopefully you can cut it up into smaller parts that can be updated as you maintain. The biggest deal is to get FKs in there and start checking for proper indexes; there is a good chance they aren't being done correctly.
I wouldn't worry too much about naming until the rest of the DB is under control. As for the NULLs: if the program chokes on a value being NULL, don't let it be NULL, but if the program can handle it, I wouldn't worry about it at this point. In the future, if the code is supplying a default value, move that to the DB, but that is way down the line from the sound of things.
Do something about the Varchars sooner rather than later. If anything, make that the first pure background fix to the program.
The other thing to do is estimate the effort of each area's change and then add that price to the cost of new development on that section of code. That way you can fix the parts as you add new features.
I'd like to start a discussion about the implementation of a database system.
I'm working for a company having a database system grown over ca. the last 10 years.
Let me try to describe what it's doing and how it's implemented:
The system is divided into 3 main parts handled by 3 different teams.
Entry:
The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.
Dataflow:
The purpose of the Dataflow Team is to do data calculations and prepare the data for the reporting team. This is done via multiple levels. Let me try to explain the process in a little more detail: The Dataflow Team uses the data from the Entry database, copied to another server and another database via Transactional Replication (this data contains information from all clients). Then once per hour a self-written application checks for changed rows in the input tables (using a ChangedDate column) and calls a stored procedure for each output table, calculating new data using 1-N of the input tables. After that the data is copied to another database on another server, again using Transactional Replication. Here another stored procedure is called to calculate additional new output tables. This stored procedure is started using a SQL job. From there the data is split into different databases, each database being client specific. This copying is done by another self-written application using the .NET bulk copy command (filtering on the client). These client specific databases are copied to different client specific reporting databases on other servers via another self-written application, which compares the reporting database with the client specific database to calculate the data difference. Just the data differences are copied (because the reporting databases formerly ran on the client servers).
This whole process is orchestrated by another self-written application, which controls e.g. whether the Transactional Replications are finished before starting the job that calls the stored procedure etc... Furthermore, the synchronisation between the different clients is also orchestrated here. The process can be graphically displayed by a self-written monitoring tool which looks pretty complex, as you can imagine...
The status of all these components is logged and can be viewed by another self-written application.
If new columns or tables are added, all these components have to be changed manually.
For deployment installation instructions are written using MS Word. (ca. 10 people working in this team)
Reporting:
The Reporting Team created its own platform written in .NET to allow the client to create custom reports via a GUI. The reports are accessible via the Web.
The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.
Well, what I want to discuss is how other people implement this scenario. I can't imagine that every company writes its own custom applications.
What are actually the possibilities to allow fast calculations on databases (next to using T-SQL). I'm somehow missing the link here to the object oriented programming I'm used to from my old company, but we never dealt with so much data and maybe for fast calculations this is the way to do it...Or is it possible using e.g. LINQ or BizTalk Server to create the algorithms and calculations, maybe even in a graphical way? The question is just how to convert the existing meter-long Stored procedures into the new format...
In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.
Any comments are appreciated.
Thanks
Daniel
Why on earth would you want to convert existing working complex stored procs (which can be performance tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like t-sql? Not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process can be made better using SSIS, but as complex as SSIS is and the amount of time a rewrite of the process would take, I'm not sure you really would gain anything by doing so.
"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT object-oriented and cannot perform well if you try to treat them as if they were. Learn to think in terms of sets, not objects, when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset needed to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past, when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for its use.
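To show what "sets, not objects" means in practice, here is a small sketch that computes the same column twice: once with one statement per row, and once with a single statement over the whole set. It runs on Python's built-in sqlite3 module, not SQL Server, and the table and the scaling factor are invented, but the contrast carries over.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE reading (id INTEGER PRIMARY KEY, raw REAL, scaled REAL)")
    conn.executemany(
        "INSERT INTO reading (raw) VALUES (?)",
        [(float(i),) for i in range(1000)],
    )
    return conn


def row_by_row(conn):
    # One statement per row: the "one object at a time" habit.
    for row_id, raw in conn.execute("SELECT id, raw FROM reading").fetchall():
        conn.execute("UPDATE reading SET scaled = ? WHERE id = ?", (raw * 2.5, row_id))


def set_based(conn):
    # One statement for the whole set; the engine does the looping.
    conn.execute("UPDATE reading SET scaled = raw * 2.5")


a, b = make_db(), make_db()
row_by_row(a)
set_based(b)
query = "SELECT id, scaled FROM reading ORDER BY id"
same = a.execute(query).fetchall() == b.execute(query).fetchall()
print(same)
```

Both versions produce identical data. The difference is that the first issues a thousand statements and moves every value through the client, while the second leaves the work inside the engine where it can be planned and tuned.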
The best advice, having been here before, is to start by learning first how SQL works, and doing it in the context of the existing architecture sounds like a good way to start (since nothing you've described sounds irrational on the face of it.)
Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.
Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.
It will take a while, but it's all worth knowing, both professionally and personally.
I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).
The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained)
In the new process, the DW is loaded and then our process begins. Our processes are completely coded in SQL, and a million row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself because of extremely poor scalar UDF performance, the entire system is implemented as a series of T-SQL SPs.
We also have a process monitoring system similar to what you are discussing as well as having the dependencies in a table which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on the MSAGL to graphically display and interact with the process (previously I was using graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information as well as good information about process performance so effort can be concentrated on the slowest performing bottlenecks.
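The dependency-table idea can be sketched without any of our tooling. In the example below the (process, prerequisite) pairs stand for rows read from such a table, and Python's standard graphlib module (Python 3.9+) works out an order in which every process runs only after its prerequisites; the process names are invented.

```python
from graphlib import TopologicalSorter

# (process, prerequisite) pairs, as they might be read from a dependency table
dependencies = [
    ("calc_totals", "load_sales"),
    ("calc_totals", "load_customers"),
    ("build_report", "calc_totals"),
]


def run_order(pairs):
    """Return the processes in an order that respects every prerequisite."""
    graph = {}
    for process, prerequisite in pairs:
        graph.setdefault(process, set()).add(prerequisite)
        graph.setdefault(prerequisite, set())  # prerequisites are processes too
    # static_order() raises graphlib.CycleError if the table contains a cycle
    return list(TopologicalSorter(graph).static_order())


order = run_order(dependencies)
print(order)
```

A real scheduler would also record when each process finished and whether it succeeded, so that a failed prerequisite blocks everything downstream; the ordering is only the first half of that.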
I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.
From the sounds of what you are saying, you have a three step process.
Input data
Analyze data
Report data
Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand; otherwise, they would be working directly on SQL Server, and would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking: you are building a customized system to report on the data at hand. The only item that might be worth considering on this side is standardization between the teams on common libraries and the technologies used.
Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get around it. Without knowing more about your organization and the exact nature of operations, it is hard to say more.
By "fast calculations" you must mean "fast retrieval". Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually rather slow when it comes to math.
You'd be hard pressed to defeat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure I can see why you'd want to go to .NET. But you'd probably increase performance by figuring out how to rewrite them all nice and SET based. BCP is not going to be able to be beaten. When I used SQL Server 2000 BCP was often faster than DTS. And SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance no doubt people are going to be coming to you. Still if you are doing a ton of row by row complex calculations, optimizing that into a CLR stored procedure or even a .NET application that is called from SQL Server to do the processing will probably result in a speed up. Of course if you were row processing and you manage to rewrite the queries to do set processing you'd probably get a bigger speed up. But depending upon how complex the calculations are .NET may help.
Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.
The only thing I might add is that maybe there is a lot of duplicate SQL that looks exactly the same except for a table name and/or the column names. If so, you can probably use .NET combined with SQL-SMO (or DMO if using SQL Server 2000) to code generate it.
Here's an example that I often see to load a datawarehouse
Assuming some row tables are loaded with the data from the source
select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row
I often see one of those queries per table and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then make a .NET program to create the INSERT/SELECT/ETC. In the worst case you may just have to store some type of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around UPDATE and one query. Any changes to the way things are done can be done once and applied to all the queries.
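As a rough sketch of that kind of generation, the function below reads a table's column list from the database's own metadata and builds the "which staged rows are new or changed" query for it. It is written in Python against SQLite (PRAGMA table_info) rather than .NET with SMO against SQL Server; the stage_ prefix, the function name and the demo tables are assumptions.

```python
import sqlite3


def changed_rows_query(conn, table, key, staging_prefix="stage_"):
    """Build a query listing staged rows that are new or differ from `table`.

    Assumes the staging table has the same columns as the target and that
    there is at least one non-key column to compare.
    """
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    compared = [c for c in columns if c != key]
    # IS NOT is SQLite's NULL-safe "differs from" comparison
    differs = " OR ".join(f"s.{c} IS NOT t.{c}" for c in compared)
    return (
        f"SELECT s.* FROM {staging_prefix}{table} s "
        f"LEFT JOIN {table} t ON t.{key} = s.{key} "
        f"WHERE t.{key} IS NULL OR {differs}"
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer       (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE stage_customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO customer       VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', NULL);
    INSERT INTO stage_customer VALUES (1, 'Ann', 'Oslo'), (2, 'Bob', 'Rome'), (3, 'Cy', NULL);
""")
sql = changed_rows_query(conn, "customer", "id")
changed = conn.execute(sql + " ORDER BY s.id").fetchall()
print(sql)
print(changed)
```

The same generator then serves every table that follows the pattern, so a change to the comparison logic is made once rather than in each hand-written query.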
In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....
So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLite files are entire databases which our users would like to query across. There are two ways (that we see) of implementing this: one, to create a single database and a backend in Python that appends tables from each database to the master database, and two, querying across separate databases' tables and unifying the results in Python.
Both methods run into trouble when the black box alters its table structures, for example by renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
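A translation table of that kind could look roughly like the sketch below: each canonical column name lists the spellings seen so far, and the SELECT is built from whichever spelling the file at hand actually contains. The column names and the two table layouts are hypothetical, and this only covers renamed columns, not split tables.

```python
import sqlite3

# canonical name -> spellings seen across black-box versions (hypothetical)
COLUMN_ALIASES = {
    "sample": ["sample", "sample_name"],
    "value": ["value", "reading"],
}


def canonical_select(conn, table):
    """Build a SELECT that returns canonical column names for this file's layout."""
    actual = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    parts = []
    for canonical, candidates in COLUMN_ALIASES.items():
        match = next((c for c in candidates if c in actual), None)
        if match is None:
            raise LookupError(f"no known column for {canonical!r} in {table}")
        parts.append(f'"{match}" AS "{canonical}"')
    return f'SELECT {", ".join(parts)} FROM "{table}"'


def load(create_sql, insert_sql):
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    conn.execute(insert_sql)
    return conn


old = load("CREATE TABLE measurement (sample TEXT, value REAL)",
           "INSERT INTO measurement VALUES ('a', 1.0)")
new = load("CREATE TABLE measurement (sample_name TEXT, reading REAL)",
           "INSERT INTO measurement VALUES ('a', 1.0)")

rows_old = old.execute(canonical_select(old, "measurement")).fetchall()
rows_new = new.execute(canonical_select(new, "measurement")).fetchall()
print(rows_old == rows_new)
```

An unknown layout raises an error instead of silently returning partial data, which is usually what you want when the supplier changes the format without notice.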
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables, in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, it has to be determined whether a new column is a fact or a dimensional attribute, and then, if it is a dimensional attribute, which dimension table it is best assigned to. This could be somewhat automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating something like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually easier to anticipate, as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick and dirty solution not a 'generic' solution because generic solutions like the entity attribute value model often have a bad performance. Don't do client side joining (unifying the results) inside your Python code because that is very slow. Use SQL for joining, it is designed for that purpose. Users can also make their own reports with all kind of reporting tools that generate sql statements. You don't have to do everything in your app, just start with solving 80% of the problems, not 100%.
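For SQLite files specifically, joining in SQL across separate files is possible with ATTACH DATABASE, so the unifying does not have to happen in Python. A minimal sketch using the built-in sqlite3 module follows; the file names, the measurement table and its columns are invented for the example.

```python
import os
import sqlite3
import tempfile


def make_run(path, rows):
    """Create one hypothetical black-box output file."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE measurement (sample TEXT, value REAL)")
    conn.executemany("INSERT INTO measurement VALUES (?, ?)", rows)
    conn.commit()
    conn.close()


with tempfile.TemporaryDirectory() as folder:
    first = os.path.join(folder, "run1.sqlite")
    second = os.path.join(folder, "run2.sqlite")
    make_run(first, [("a", 1.0), ("b", 2.0)])
    make_run(second, [("a", 1.5), ("c", 9.0)])

    conn = sqlite3.connect(first)
    # The second file becomes visible under the schema name "run2".
    conn.execute("ATTACH DATABASE ? AS run2", (second,))
    joined = conn.execute("""
        SELECT r1.sample, r1.value, r2.value
        FROM main.measurement AS r1
        JOIN run2.measurement AS r2 ON r2.sample = r1.sample
    """).fetchall()
    conn.close()

print(joined)
```

SQLite limits how many databases can be attached to one connection at a time, so with many files this works best for queries that touch a handful of them, with consolidation as the fallback.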
If something breaks because something inside the black box changes you can define views for backward compatibility that keeps your app functioning.
Maybe the scientific software will add a lot of new features and maybe it will change its data model because of those new features? That is possible, but then you will have to change your application anyway to take advantage of those new features.
It sounds to me as if your problem isn't really about MySQL or SQLite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal with it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it was up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe, some design like star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.