I have issues with schema changes in my testing infrastructure. We use a unit test framework, and truncate our database between each unit test. I've noticed that running a series of CREATE statements is very slow as well, specifically duplicating a database for running parallel tests. I'm finding it's taking upwards of 10 minutes to duplicate a database 10 times (I'm using SHOW CREATE ALL TABLES and running those statements).
The problem with CockroachDB schema changes getting slow in tests is known, but there are some workarounds. See this guide with some recommended settings for unit tests, which should help address some of the problems: https://www.cockroachlabs.com/docs/stable/local-testing.html#use-a-local-single-node-cluster-with-in-memory-storage
Future releases (that is, v22.2 and later) will continue improving the performance of schema changes when repeatedly dropping and creating tables.
Related
I'm integrating SQL database (PostgreSQL) into my first ever Java backend system.
My dilemma now is should the application create the database and the tables by itself, or should it just assume that the required database initialized with the tables will be available?
If the application should not manage the database by itself, then what's the best approach to reproducing the database on developers' machines (I want to be able to run integration tests with real database locally)?
I ended up using FlyWay as #Max suggested in his comment.
The setup was quite easy and intuitive. I used the default settings for time being.
Setup of integration testing on development machine was a breeze.
The only thing that I don't particularly like is that cleaning the database between tests using FlyWay is very slow because it doesn't just clean the tables, but erases the entire scheme.
This, however, is not a big issue for me right now (currently have ~100 tests), and, probably, will not be difficult to optimize in the future if needed.
I know several small companies do not do testing on ETL process, but that seems to be suboptimal from the perspective of software engineering.
How do people usually do testing/unit test/functional test on ETL process?
We recently worked on a project where the governance board demanded 'You must have Unit Tests' and so we tried our best.
What worked for us was have each ETL solution start and end with a QA/Test package.
Anything unexpected discovered by these packages was logged into an audit table and a Fail Package event was then raised to stop the entire Job - We figured it was better to run with yesterdays good data than risk reporting against possible bad 'today' data.
The starting package would do db schema and data sanity checks. Data Sanity involved checking for duplicate or missing data caused by a lack of Referential Integrity in the source systems. Schema checks ensured that any schema changes that did not get applied during Continuous integration were detected.
The end package would check the results of any transformations. These included:
Comparing record counts between source|destination
Checking specific transforms (eg: all date values changed to appropriate SK value, all string values RTrimed)
Ensuring all SK fields were populated (-1 instead of nulls)
Most of these tests were SQL statements the used the built in schema objects of our database, so they were not to onerous to create.
In addition, as part of our development process we would create views that had the end result of any transformations we were doing. We would make use of these views to validate our package transformations.
Each of these checks created a record in our special audit table. That way we could provide a comprehensive list of all the tests and checks we had done each running of the process to satisfy the governance peoples.
(We also had a separate set of packages that would unit test each QA test by means of creating dummy tables, populating them, running the test then confirming the appropriate audit record was written. As Nick stated, this was a lot of work and of little real value)
testing of an ETL is usually a problem. More precisely, testing isnt problem, problem is how to get reasonable test data. ETL is typically tested on production data. Aside of the security issue, the problem with production data is that does not cover functionality of ETL sufficiently (typically about 40% of business rules isnt covered by production data sample) and it takes too much of time to process.
Recently we have developed a test data generator (for more detail, please look for GTL QAceGen: Business Logic Driven Data Generator on Informatica Market Place) which generate test data based into source tables/files on business rule specification. Tool takes into consideration any foreign key applied and it works for any major ETL and/databases.
This tool helps to speed up testing cycle by at least 50% (compared to manual testing) an covers 100% of all business rules. It also generates quite detailed reports and more importantly, these tests can be repeated at any time (ie regression tests).
You can unit test ETLs.
End-to-end tests are good, but slow, expensive and difficult to construct and keep stable.
Unit testing ETLs is highly desired to be able to test all data permutations but generally put into the too-hard basket. However it is possible to write true unit tests for ETLs that can run quickly and reliably.
We have found that the key is to decompose the ETL into two separate sections. Since an ETL is an Extract-Transform-Load the key is to separate the T from the E&L. Make a pure Transform function that transforms an input dataset to an output dataset, then call this function from the Extract and Load module.
The Extract and Load module isn't suitable for unit testing because it will generally involve external data sources and sinks, access tokens and user permissions, etc.
But all of the testable logic should be in the Transform component. Test this function from any unit testing framework - you will be able to pass in predefined datasets and test the transformed output against expected results. With some thinking we have even managed to create unit tests that test multi-stage updates of datasets onto each other.
Our particular implementation was done on Databricks in Scala, but the concept should work on any platform.
We've set up a system where for each ETL procedure we have defined an input dataset and an expected result dataset. Then we have created a system which, utilizing Robot Framework, runs three-part tests for each ETL procedure where the first part inserts the input dataset into the source data tables, the second part runs the ETL, and the third part compares the actual results with our expected results.
This works pretty well for us, but there are a couple of downsides: first of all, we create the test datasets manually for each ETL procedure which takes some work, and secondly, this means that testing for "unexpected" inputs is not done.
For the automated unit testing we have a separate environment in which we can install builds of our entire DW automatically.
The testing in ETL process fits in the following stages:
Identify Business requirements
Validate Data sources
Prepare test cases
Extract Data from different sources
Apply transformation logic to validate data
Load data into the destination
Reporting analysis
We can also categorize the ETL testing process as follows:
Product validation
Source to target data testing
Metadata testing
Performance testing
Integration and quality testing
Report testing
Please forgive my long question. I have an idea for a design that I could use some comments on. Is it a good idea to do this? And what are the pit falls I should be aware of? Are there other similar implementations that are better?
My situation is as follows:
I am working on a rewrite of a windows forms application that connects to a SQL 2008 (earlier it was SQL 2005) server. The application is an "expert-system" for an engineering company where we store structured data about constructions. We have control of all installations of the client software, we have no external customers or users, they are all internal to the company, and they are all be trusted not to do anything malicious to the software or database.
The current design doesn't have too many tables (about 10 - 20) but some of them have millions of records that belong to several hundred constructions. The systems performance has been ok so far, but it is starting to degrade as we are pushing the limits of the design.
As part of the rewrite I am considering splitting the database into one master database and several "child" databases where each describes one construction. Each child database should be of identical design. This should eliminate the performance problems we are seeing today since the data stored in each database would be less than one percent of the total data amount.
My concern is that instead of maintaining one database we will now get hundreds of databases that must be kept up to date. The system is constantly evolving as the companys requirements change (you know how it is), and while we try to look forward to reduce the number of changes the changes will come. So we will need a system where we keep track of all database changes done to the system so they can be applied to the child databases. Updating the client application won't be a problem, we have good control of that aspect.
I am thinking of a change tracing system where we store database scripts for all changes in a table in the master database. We can then give each change a version number and we can store a current version number in each child database. When the client program connects to a child database we can then check the version number of the database against the current version number of the master database and if there are patches with version numbers greater than the version number of the child database we run these and update the child database to the latest version.
As I see it this should work well. Any changes to the system will first be tested and validated before committed as a new version of the database. The change will then be applied to the database the first time a user opens it. I suppose we would open the database in exclusive mode while applying the changes, but as long as the changes aren't too frequent this should not be a problem.
So what do you think? Will this work? Have any of you done something similar? Should we scrap the solution and go for the monolithic system instead?
Have you considered partitioning your large tables by 'construction'? This could alleviate some of the growing pains by splitting the storage for the tables across files/physical devices without needing to change your application.
Adding spindles (more drives) and performing a few hours of DBA work can often be cheaper than modifying/adapting software.
Otherwise, I'd agree with #heikogerlach and these similar posts:
How do I version my ms sql database
Mechanisms for tracking DB schema changes
How do you manage databases in development, test and production?
I have a similar situation here, though I use MySQL. Every database has a versions table that contains the version (simply an integer) and a short comment of what has changed in this version. I use a script to update the databases. Every database change can be in one function or sometimes one change is made by multiple functions. Functions contain the version number in the function name. The script looks up the highest version number in a database and applies only the functions that have a higher version number in order.
This makes it easy to update databases (just add new change functions) and allows me to quickly upgrade a recovered database if necessary (just run the script again).
Even when testing the changes before this allows for defensive changes. If you make some heavy changes on a table and you want to play it safe:
def change103(...):
"Create new table."
def change104(...):
"""Transfer data from old table to new table and make
complicated changes in the process.
"""
def change105(...):
"Drop old table"
def change106(...):
"Rename new table to old table"
if in change104() is something going wrong (and throws an exception) you can simply delete the already converted data from the new table, fix your change function and run the script again.
But I don't think that changing a database dynamically when a client connects is a good idea. Sometimes changes can take some time. And the software that accesses a database should match the schema of the database. You have somehow to keep them in sync. Maybe you could distribute a new software version and then you want to upgrade the database when a client is actually starting to use this new software. But I haven't tried that.
Better don't create additional databases. At first glance you may think that you'll get some performance gain, but actually you get support nightmare. Remember - what can break, does break sooner or later.
It is way simpler to perform and optimize queries in single database. It is much easier manage user permissions in single database. It is much easier to make consistent backups for single database.
Like KenG said, if you need break your large tables - consider partitioning them. And add some drives :)
But at first run SQL profiler on your database and optimize indexes and queries. Several million rows is usually not a big problem to handle (unless your customer needs live totaling over half of these, in which case no partitioning can help).
I know that this is a crazy answer but here it goes...
I currently have a similar scenario where I need to keep control of database versions in multiple locations for a system using MS SQL Server.
What I am doing now is using Ruby on Rails ActiveRecord Migrations to keep control of database versions. Yes I know that we are talking about Windows systems but this works fine for me. (By the way, my system is programmed in VB and .NET)
I have installed Rails on each server, when I need to update the database schema I copy the migration files to the server and run rake db:migrate which updates the database to the latest version or rollbacks it to a desired version.
As a side effect you will have a set of migration files for your database schema in an database independent language (in this case ruby) that you can apply to other database engines and that you can put under source control too.
I know that this is a strange solution in which a totally different technology is used but it does not hurt to learn new approaches. You can find additional information here.
I have become a better .Net programmer since I learned Ruby on Rails. I asked here before a question about this approach.
I'm considering writing some unit tests for my Tsql stored procedures, I have two concerns:
I will have to write a lot of SQL to create test fixtures (test data prepared in _setup procedures)
I will have to "re-write" my query in the test procedure to obtain the results to compare against the results from the stored procedure I'm testing.
Considering that my DB has hundreds of tables and really complex stored procedures... I don't see how this will save me time?? any thoughts? am I missing something? is there any other way to go?
Automated unit-testing often gets left by the wayside as managers push for quick releases rather than increasing project scope and budget to emphasis stability. The fact is, unit-test takes time. In my experience, the benefits far outweigh any drawbacks. In cases where stored procedures are being called by external systems unit-testing has been invaluable in eliminating unforeseen problems and guaranteeing stability prior to integration testing.
Regarding your concerns:
If you place any data required to unit test your stored procedure(s) in XML files which can be read prior to running the unit test(s), you can read the data using the standard API routines for reading XML data and potentially re-use the data for multiple tests. Run each test in the context of a transaction which is rolled back at the end of the test to allow the overall environment to be configured once at the beginning of a test run rather than having to perform lots of steps for each individual test. Unit-tests can be bundled with automated nightly build processes to further bullet-proof your code.
There will be some overhead initially, but this will decrease over time as you and your team become more familiar with the unit-test concepts and how to leverage reusability.
You shouldn't need to re-write your query to compare the results. A standard scenario might be something like the following:
load test data and prepare environment
begin transaction
run stored procedure using test data
compare actual output to expected output using Assert statements
if actual and expected output don't match, test fails
if actual and expected output match, test passes
rollback transaction
/...
repeat steps 2 thru 7 for any additional tests
.../
Cleanup test environment
Keep in mind, you are testing a specific set of conditions looking for pass/fail so it's Ok to hard code the expected values within your test routines.
Hope this helps,
Bill
In theory, Unit Testing (in general) means more time up front writing tests, but should make things easier for you later on. For example, the time invested pays dividends later on when you have the ability to spot regression bugs very easily. The wikipedia entry on unit testing has a good overview of the general benefits.
Whether it will be good for you in practice is a hard question to answer - depends on the project.
As for 'having to re-write the query to test the query results', obviously that isn't going to prove anything. I suppose what you need to do is set up test data that will return a predictable result when the query (or whatever) is run, and then test for that specific result. That way you are testing the query against your mental model of it, rather than testing the query against a copy of itself.
But yeah, sounds like that will take a lot of setting up time - I can imagine that preparing a SQL stored procedure test will involve doing a lot more setting-up than your average .Net object test.
The thing I wonder about is, WHY are you considering writing unit tests? Do you have operational issues with the database? Is it hard to implement changes? Is management making your raise dependent on unit tests?
If there's no clear reason, I wouldn't start with unit tests "for fun". When there's a well-oiled change system in place, unit tests add overhead but no value.
There are also serious risks with unit tests:
People start seeing unit tests as a "quality guarantee". Just keep hacking till the unit tests give the green light, and then it's good enough for production.
Small changes that used to be a "quick fix", will grow bigger because they require (changes to) the unit tests. This way unit tests make you less flexible.
Unit tests often check many things that don't matter to anyone using the production system. So unit tests force you to spill resources on stuff only the unit tests care about.
Sorry for the the rant (I've had bad experience with unit tests.)
I am trying to start a new MVC project with tests and I thought the best way to go would have 2 databases. 1 for testing against and 1 for when I run the app and use it (also test really as it's not production yet).
For the test database I was thinking of putting create table scripts and fill data scripts within the test setup method and then deleting all this in the tear down method.
I am going to be using Linq to SQL though and I don't think that will allow me to do this?
Will I have to just go the ADO route if I want to do it this way? Or should I just use a mock object and store data as an array or something?.
Any tips on best practices?
How did Jeff go about doing this for StackOveflow?
What I do is define an interface for a DataContext wrapper and use an implementation of the wrapper for the DataContext. This allows me to use an alternate, fake DataContext implementation in my tests (or mock it, if easier). This abstracts the database out of my unit tests completely. I found some starter code at http://andrewtokeley.net/archive/2008/07/06/mocking-linq-to-sql-datacontext.aspx, although I've extended it so that it handles the validation implementations on my entity classes.
I should also mention that I have a separate staging server for QA, so there is live testing of the entire system. I just don't use an actual database in my unit testing.
I checked out the link from tvanfosson and RikMigrations and after playing about with them I prefer the mocking datacontext method best. I realised I don't need to create tables and drop them all the time.
After a little more research I found Stephen Walther's article http://stephenwalther.com/blog/archive/2008/08/17/asp-net-mvc-tip-33-unit-test-linq-to-sql.aspx which to me seems easier and more reliable.
So I am going with this implementation.
Thanks for the help.
You may want to find some other way around actually hitting the database for your unit tests because it takes a lot more time. That being said, have you considered using Migrations for creating / deleting your tables instead of using sql scripts? RikMigrations is what I have been using to create my database so I can easily revision all of my code in one place. Justin Etheredge has a great article on using RikMigrations.
Consider these methods on DataContext:
http://msdn.microsoft.com/en-us/library/system.data.linq.datacontext.createdatabase.aspx
http://msdn.microsoft.com/en-us/library/system.data.linq.datacontext.executecommand(v=VS.100).aspx
I agree with much of the above, relating to unit testing. However, I think it's important to raise the point that using Mock Repositories and unit tests doesn't give you the same level of tests as a DB Integration Test would.
For example, our databases often have cascading deletes built right in to the schema. In this case, deleting a primary entity in an aggregate will automatically delete all child entities. However, this would not automatically apply in a mocked repository that was not backed up by a physical database with these business rules (unless you built all of those rules in to the Mock). This is important because if somebody comes along and changes the design of my schema, I need it to break my tests so I can adjust the code/schema accordingly. I appreciate that this is Integration Testing and not Unit Testing but thought it was worth mentioning.
My preferred option is to create a Master Design Database that contains sample data (the same sort of data you would create in your Mocks). During the start of each test run, I have an automated script that creates a backup of the MasterDB and restores it to "TestDB" (which all my tests use). That way, I maintain a repository of clean test data in Master than recreates itself upon each test run. My tests can play around with the data and test out all the scenarios needed.
When I debug the application, I have another script that backs up and restores the Master DB to a DEV database. I can play around with data here too without worrying about losing my sample data. I don't typically run this particular script every session because of the delay waiting for the DB to be recreated. I may run it once a day and then play around/debug the app throughout the day. If for example, I delete all the records from a table as part of my debugging, I would run the script to recreate the DevDB when I'm done.
These steps sound like they would add a huge amount of time to the process, but actually - they don't. Our application currently has in the region of 3500 tests, with about 3000 of them accessing the DB at some point. The database backup and restore typically takes around 10-12 seconds at the start of each test run. And since the whole test suite is only executed upon TFS checkin, we don't mind if we have to wait a while longer anyway. On an average day, our entire test suite takes about 15-20 minutes to run.
I appreciate and accept that integration testing is much slower than unit testing (because of the inherent need to use a real DB) but it more closely represents the 'real world' app. For example, Mock Repositories don't return DB error codes, the don't time-out, they don't lock up, they don't run out of disk space, etc.
Unit tests are ok for simple calculations, basic business rules, etc. and certainly they are absolutely the best choice for most operations that don't involve DB (or other resource) access. But I don't think they are as valuable as integration tests - people talk a lot about unit tests, but little is said about integration tests.
I expect those passionate about unit tests will be sending flames my way for this. That's fine - I'm just trying to bring some balance and to remind people that projects that are full of passed unit tests can still fail badly the moment you implement them in the field.
This article gives example of mocking linq to sql with typemock.
http://blog.benhall.me.uk/2007/11/how-to-unit-test-linq-to-sql-and.html