This question is fairly general rather than about one specific problem.
We have a Java project that uses an Oracle database. We currently use the SoapUI tool for our QA tests. Each test needs certain data to exist in the database before it runs. Our current way of running the tests is as follows:
1. Before each test, run a .sql file (unique to the test) to load the data it needs into the DB.
2. Run the SoapUI test.
3. Run a general .sql file to erase the test data inserted in step 1.
4. Go back to step 1 and run the next test.
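To make the pattern concrete, here is a minimal sketch of such a script pair. The table names and the reserved ID range are hypothetical, not from the question:

```sql
-- seed_create_order_test.sql: data this one test expects (hypothetical tables)
INSERT INTO customers (customer_id, name) VALUES (9001, 'Test Customer');
INSERT INTO orders (order_id, customer_id, status) VALUES (9101, 9001, 'NEW');
COMMIT;

-- cleanup.sql: general teardown shared by all tests; assumes test rows are
-- confined to a reserved ID range (children deleted before parents)
DELETE FROM orders    WHERE customer_id BETWEEN 9000 AND 9999;
DELETE FROM customers WHERE customer_id BETWEEN 9000 AND 9999;
COMMIT;
```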
The advantage of this method for us is that each test runs on a "clean sheet" with its own data and is unrelated to the other tests.
The disadvantage is that whenever something changes in the DB during development (for example, a column is added to a table), we need to change every SQL script that inserts into that table rather than making the change in one place. This makes the tests very hard to maintain.
I wanted to know some of the industry-standard ways of doing this kind of thing, or to hear other approaches to the problem.
Any advice would be great.
You could integrate a SQL data generator into your testing loop. A suitable data generator takes the schema and additional constraints as input and produces data that is consistent with the current schema.
This way, every time the schema changes, the changes are accommodated by the generator. You can turn your test-specific SQL scripts into input constraints for the data generator. The link is to another question on SO where relevant tools are listed.
You can include Databene Generator in your toolchain. It can generate SQL files or talk directly to the database. You just have to create an XML file with the data-generation scheme.
I know several small companies do not test their ETL processes, but from a software-engineering perspective that seems suboptimal.
How do people usually do testing/unit tests/functional tests on an ETL process?
We recently worked on a project where the governance board demanded 'You must have Unit Tests' and so we tried our best.
What worked for us was to have each ETL solution start and end with a QA/Test package.
Anything unexpected discovered by these packages was logged into an audit table, and a Fail Package event was then raised to stop the entire job. We figured it was better to run with yesterday's good data than to risk reporting against possibly bad 'today' data.
The starting package would do DB schema and data sanity checks. The data sanity checks looked for duplicate or missing data caused by a lack of referential integrity in the source systems. The schema checks ensured that any schema changes that did not get applied during continuous integration were detected.
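As a hedged sketch of what such data-sanity queries can look like (the source tables here are hypothetical; the actual checks were specific to their systems):

```sql
-- Duplicate keys in a source lacking referential integrity
SELECT CustomerID, COUNT(*) AS copies
FROM src_Customer
GROUP BY CustomerID
HAVING COUNT(*) > 1;

-- Orders pointing at customers that don't exist in the source
SELECT o.OrderID
FROM src_Order o
LEFT JOIN src_Customer c ON c.CustomerID = o.CustomerID
WHERE c.CustomerID IS NULL;
```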
The end package would check the results of any transformations. These included:
Comparing record counts between source|destination
Checking specific transforms (e.g. all date values changed to the appropriate SK value, all string values RTrimmed)
Ensuring all SK fields were populated (-1 instead of nulls)
Most of these tests were SQL statements that used the built-in schema objects of our database, so they were not too onerous to create.
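For example, the record-count and SK checks from the list above could be as simple as the following, with hypothetical staging and warehouse table names:

```sql
-- Record counts should match between the source extract and the destination
SELECT (SELECT COUNT(*) FROM stg_Sales)    AS source_rows,
       (SELECT COUNT(*) FROM dw_FactSales) AS destination_rows;

-- Every surrogate key should be populated (-1 is used instead of NULL),
-- so any row counted here gets logged to the audit table and fails the package
SELECT COUNT(*) AS unmapped_rows
FROM dw_FactSales
WHERE customer_sk IS NULL;
```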
In addition, as part of our development process we would create views that had the end result of any transformations we were doing. We would make use of these views to validate our package transformations.
Each of these checks created a record in our special audit table. That way we could provide a comprehensive list of all the tests and checks performed on each run of the process, to satisfy the governance people.
(We also had a separate set of packages that would unit test each QA test by means of creating dummy tables, populating them, running the test then confirming the appropriate audit record was written. As Nick stated, this was a lot of work and of little real value)
Testing an ETL is usually a problem. More precisely, testing isn't the problem; the problem is how to get reasonable test data. ETL is typically tested on production data. Aside from the security issue, the problem with production data is that it does not cover the functionality of the ETL sufficiently (typically about 40% of business rules aren't covered by a production data sample), and it takes too much time to process.
Recently we developed a test data generator (for more detail, look for GTL QAceGen: Business Logic Driven Data Generator on the Informatica Marketplace) which generates test data into source tables/files based on a business-rule specification. The tool takes any applied foreign keys into consideration, and it works with any major ETL tool and/or database.
This tool helps speed up the testing cycle by at least 50% (compared to manual testing) and covers 100% of all business rules. It also generates quite detailed reports, and more importantly, the tests can be repeated at any time (i.e. regression tests).
You can unit test ETLs.
End-to-end tests are good, but slow, expensive and difficult to construct and keep stable.
Unit testing ETLs is highly desirable in order to test all data permutations, but it is generally put into the too-hard basket. However, it is possible to write true unit tests for ETLs that run quickly and reliably.
We have found that the key is to decompose the ETL into two separate sections. Since an ETL is an Extract-Transform-Load, the key is to separate the T from the E and L. Make a pure Transform function that transforms an input dataset to an output dataset, then call this function from the Extract and Load module.
The Extract and Load module isn't suitable for unit testing because it will generally involve external data sources and sinks, access tokens and user permissions, etc.
But all of the testable logic should be in the Transform component. Test this function from any unit testing framework - you will be able to pass in predefined datasets and test the transformed output against expected results. With some thinking we have even managed to create unit tests that test multi-stage updates of datasets onto each other.
Our particular implementation was done on Databricks in Scala, but the concept should work on any platform.
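Their implementation was Scala, but the same separation can be sketched in plain SQL by making the transform a view over a staging table, i.e. a pure function of its input. The view, tables, and expected values below are hypothetical, purely to illustrate the shape of such a test:

```sql
-- The T: a view that depends only on the contents of the staging table
CREATE VIEW transform_sales AS
SELECT order_id,
       UPPER(TRIM(customer_code))         AS customer_code,  -- string cleanup
       COALESCE(quantity, 0) * unit_price AS line_total      -- derived value
FROM staging_sales;

-- A "unit test" then becomes: load a known input, compare the output
DELETE FROM staging_sales;
INSERT INTO staging_sales (order_id, customer_code, quantity, unit_price)
VALUES (1, ' ab ', NULL, 10.0);

SELECT * FROM transform_sales;  -- expect exactly one row: (1, 'AB', 0.0)
```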
We've set up a system where for each ETL procedure we have defined an input dataset and an expected result dataset. Then we have created a system which, utilizing Robot Framework, runs three-part tests for each ETL procedure where the first part inserts the input dataset into the source data tables, the second part runs the ETL, and the third part compares the actual results with our expected results.
This works pretty well for us, but there are a couple of downsides: first of all, we create the test datasets manually for each ETL procedure which takes some work, and secondly, this means that testing for "unexpected" inputs is not done.
For the automated unit testing we have a separate environment in which we can install builds of our entire DW automatically.
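The comparison in the third part of such a test can be scripted symmetrically in SQL. A common sketch, with hypothetical table names (EXCEPT as in SQL Server/PostgreSQL; Oracle uses MINUS):

```sql
-- Rows the ETL produced that we did not expect
SELECT * FROM actual_results
EXCEPT
SELECT * FROM expected_results;

-- Rows we expected that the ETL did not produce
SELECT * FROM expected_results
EXCEPT
SELECT * FROM actual_results;

-- The test passes only if both queries return zero rows
```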
Testing in an ETL process fits into the following stages:
Identify business requirements
Validate data sources
Prepare test cases
Extract data from different sources
Apply transformation logic to validate data
Load data into the destination
Reporting analysis
We can also categorize the ETL testing process as follows:
Product validation
Source to target data testing
Metadata testing
Performance testing
Integration and quality testing
Report testing
I'm trying to figure out how to implement version control in an environment where we have two DBs: one Testing and one Production.
In Testing, there are an arbitrary number of tasks being tested. These have no constraints on the number of objects manipulated or on complexity, meaning we can have a 3-day task that changes 2 package bodies and one trigger, and a 3-month task that changes 100 different objects, including C source files and binary objects.
My main concern is the text-based objects of the DB. We need to version the Test and Production code, but any task can go from Testing to Production in no defined order whatsoever.
This means that right now we have to manually track the changes in the files, selecting inside each file which lines of code go from Testing to Production. We use a very rudimentary solution: writing a sequence of comments in the header with a file-based version number, and adding tags with that sequence in the code to delimit each change.
I'm struggling to implement SVN because I wanted to create Testing as a branch of Production, with branches inside Testing to isolate each task, but I find that this can lead to many Testing tasks being ported to Production during merges.
This said, my questions are:
Is there a way to resolve this automatically?
Are there any database-specific version control solutions?
How can I "link" both environments if the code base is so different?
I used SVN for source control on DB scripts.
I don't have a technological solution to your problem, but I can explain the methodology we used.
We had two sets of scripts - one for incremental changes and another for the complete declaration of database objects and procedures.
During development we recorded only the incremental changes, in a script that was eventually used during deployment. During test rounds we kept updating that script.
Finally, after running the script on production, we updated the second set of scripts containing the full declarations. The full scripts were used as a reference and to create a DB from scratch.
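To make the two sets concrete, a hedged example with a hypothetical CUSTOMER table (Oracle syntax, since that was the context of the question):

```sql
-- Set 1: incremental change script, applied to existing environments
ALTER TABLE customer ADD email VARCHAR2(200);
UPDATE customer SET email = 'unknown@example.com' WHERE email IS NULL;

-- Set 2: full declaration, updated afterward to match; used as the
-- reference and to build a database from scratch
CREATE TABLE customer (
    customer_id NUMBER        PRIMARY KEY,
    name        VARCHAR2(100) NOT NULL,
    email       VARCHAR2(200)
);
```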
Currently we have several environments, but mainly a live environment and a test environment, as most shops do.
We are always having issues with getting good data into our test environment, largely because our test SQL environment is maintained by a bunch of loose scripts that are not maintained very well. Obviously, one course of action is to maintain these more rigorously, but we have found that is not going to happen.
So, I am looking for a tool that would help with this. Has anyone found or used such a tool?
Any advice or direction would be greatly appreciated. This is really a thorn in our developer's sides.
Update: Our main issue is that test structures often differ from live, and we need an automated solution that handles this, e.g. when a table has more columns in test than in live. I'm updating the question to reflect this; thank you for your answer.
As far as transfer of data is concerned, use the SSMS Import and Export Data wizard. If the source and destination table names and schemas are the same, data transfer is a breeze. I use it quite often and have never faced any problem (use the identity insert option if there is an identity column).
The script generation feature of SSMS is not bad either. Normally I have to edit the 'Create Database' and 'Create Users' parts of the script manually.
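For reference, the identity insert option mentioned above corresponds to this T-SQL, shown here with a hypothetical table:

```sql
-- Allow explicit values in an IDENTITY column while copying data
SET IDENTITY_INSERT dbo.Customers ON;

INSERT INTO dbo.Customers (CustomerID, Name)
VALUES (1, 'Alice');

SET IDENTITY_INSERT dbo.Customers OFF;
```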
For most database-backed projects I've worked on, there is a need to get "startup" or test data into the database before deploying the project. Examples of startup data: a table that lists all the countries in the world or a table that lists a bunch of colors that will be used to populate a color palette.
I've been using a system where I store all my startup data in an Excel spreadsheet (with one table per worksheet), then I have a utility script in SQL that (1) creates the database, (2) creates the schemas, (3) creates the tables (including primary and foreign keys), (4) connects to the spreadsheet as a linked server, and (5) inserts all the data into the tables.
I mostly like this system. I find it very easy to lay out columns in Excel, verify foreign key relationships using simple lookup functions, perform concatenation operations, copy in data from web tables or other spreadsheets, etc. One major disadvantage of this system is the need to sync up the columns in my worksheets any time I change a table definition.
I've been going through some tutorials to learn new .NET technologies or design patterns, and I've noticed that these typically involve using Visual Studio to create the database and add tables (rather than scripts), and the data is typically entered using the built-in designer. This has me wondering if maybe the way I'm doing it is not the most efficient or maintainable.
Questions
In general, do you find it preferable to build your whole database via scripts or a GUI designer, such as SSMSE or Visual Studio?
What method do you recommend for populating your database with startup or test data and why?
Clarification
Judging by the answers so far, I think I should clarify something. Assume that I have a significant amount of data (hundreds or thousands of rows) that needs to find its way into the database. This data could be sourced from various places, such as text files, spreadsheets, web tables, etc. I've received several suggestions to script this process using INSERT statements, but is this really viable when you're talking about a lot of data?
Which leads me to...
New questions
How would you write a SQL script to take the country data on this page and insert it into the database?
With Excel, I could just copy/paste the table into a worksheet and run my utility script, and I'd basically be done.
What if you later realized you needed a new column, CapitalCity?
With Excel, I could take that information from this page, paste it into Excel, and with a quick text-to-column manipulation, I'd have the data in the format I need.
I honestly didn't write this question to defend Excel as the best way or even a good way to get data into a database, but the answers so far don't seem to be addressing my main concern--how to get all this data into your database. Writing a script with hundreds of INSERT statements by hand would be extremely time consuming and error prone. Somehow, this script needs to be machine generated, but how?
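One common way to machine-generate such a script, sketched in T-SQL under the assumption that the raw data has already been bulk-loaded into a staging table (table and column names hypothetical):

```sql
-- Emit one INSERT statement per staging row; run this query and save
-- its output as the seed script
SELECT 'INSERT INTO Country (CountryCode, CountryName) VALUES ('''
       + CountryCode + ''', '''
       + REPLACE(CountryName, '''', '''''') + ''');'  -- escape embedded quotes
FROM StagingCountry;
```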
I think your current process is fine for seeding the database with initial data. It's simple, easy to maintain, and works for you. If you've got a good database design with adequate constraints then it doesn't really matter how you seed the initial data. You could use an intermediate tool to generate scripts but why bother?
SSIS has a steep learning curve, doesn't work well with source control (impossible to tell what changed between versions), and is very finicky about type conversions from Excel. There's also an issue with how many rows it reads ahead to determine the data type -- you're in deep trouble if your first x rows contain numbers stored as text.
1) I prefer to use scripts for several reasons.
• Scripts are easy to modify, and when I get ready to deploy my application to a production environment, I already have the scripts written, so I'm all set.
• If I need to deploy my database to a different platform (like Oracle or MySQL) then it's easy to make minor modifications to the scripts to work on the target database.
• With scripts, I'm not dependent on a tool like Visual Studio to build and maintain the database.
2) I like good old-fashioned INSERT statements in a script. Again, at deployment time scripts are your best friend. At our shop, when we deploy our applications we have to have scripts ready for the DBAs to run, as that's what they expect.
I just find that scripts are simple, easy to maintain, and the "least common denominator" when it comes to creating a database and loading data into it. By least common denominator, I mean that the majority of people (i.e. DBAs, and other people in your shop who might not have Visual Studio) will be able to use them without any trouble.
The other thing that's important with scripts is that they force you to learn SQL and, more specifically, DDL (data definition language). While the hand-holding GUI tools are nice, there's no substitute for taking the time to learn SQL and DDL inside out. I've found those skills to be invaluable in almost any shop.
Frankly, I find the concept of using Excel here a bit scary. It obviously works, but it's creating a dependency on an ad-hoc data source that won't be resolved until much later. Last thing you want is to be in a mad rush to deploy a database and find out that the Excel file is mangled, or worse, missing entirely. I suppose the severity of this would vary from company to company as a function of risk tolerance, but I would be actively seeking to remove Excel from the equation, or at least remove it as a permanent fixture.
I always use scripts to create databases, because scripts are portable and repeatable - you can use (almost) the same script to create a development database, a QA database, a UAT database, and a production database. For this reason it's equally important to use scripts to modify existing databases.
I also always use a script to create bootstrap data (AKA startup data), and there's a very important reason for this: there's usually more scripting to be done afterward. Or at least there should be. Bootstrap data is almost invariably read-only, and as such, you should be placing it on a read-only filegroup to improve performance and prevent accidental changes. So you'll generally need to script the data first, then make the filegroup read-only.
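A minimal sketch of that sequence in T-SQL, assuming SQL Server and hypothetical database, file, and table names:

```sql
-- Put bootstrap data on its own filegroup...
ALTER DATABASE MyAppDb ADD FILEGROUP BootstrapData;
ALTER DATABASE MyAppDb ADD FILE
    (NAME = 'BootstrapData1', FILENAME = 'C:\Data\MyAppDb_bootstrap.ndf')
    TO FILEGROUP BootstrapData;

CREATE TABLE dbo.Country (
    CountryCode CHAR(2)       PRIMARY KEY,
    CountryName NVARCHAR(100) NOT NULL
) ON BootstrapData;

-- ...script the data in...
INSERT INTO dbo.Country (CountryCode, CountryName) VALUES ('FR', 'France');
-- ...remaining bootstrap inserts...

-- ...then lock the filegroup down
ALTER DATABASE MyAppDb MODIFY FILEGROUP BootstrapData READ_ONLY;
```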
On a more philosophical level, though, if this startup data is required for the database to work properly - and most of the time, it is - then you really ought to consider it part of the data definition itself, the metadata. For that reason, I don't think it's appropriate to have the data defined anywhere but in the same script or set of scripts that you use to create the database itself.
Test data is a little different, but in my experience you're usually trying to auto-generate that data in some fashion, which makes it even more important to use a script. You don't want to have to manually maintain an ad-hoc database of millions of rows for testing purposes.
If your problem is that the test or startup data comes from an external source - a web page, a CSV file, etc. - then I would handle this with an actual "configuration database." This way you don't have to validate references with VLOOKUPs as in Excel; you can actually enforce them.
Use SQL Server Integration Services (formerly DTS) to pull your external data from CSV, Excel, or wherever, into your configuration database - if you need to periodically refresh the data, you can save the SSIS package so it ends up being just a couple of clicks.
If you need to use Excel as an intermediary, i.e. to format or restructure some data from a web page, that's fine, but the important thing IMO is to get it out of Excel as soon as possible, and SSIS with a config database is an excellent repeatable method of doing that.
When you are ready to migrate the data from your configuration database into your application database, you can use SQL Server Management Studio to generate a script for the data (in case you don't already know - when you right click on the database, go to Tasks, Generate Scripts, and turn on "Script Data" in the Script Options). If you're really hardcore, you can actually script the scripting process, but I find that this usually takes less than a minute anyway.
It may sound like a lot of overhead, but in practice the effort is minimal. You set up your configuration database once, create an SSIS package once, and refresh the config data maybe once every few months or maybe never (this is the part you're already doing, and this part will become less work). Once that "setup" is out of the way, it's really just a few minutes to generate the script, which you can then use on all copies of the main database.
Since I use an object-relational mapper (Hibernate; there is also a .NET version), I prefer to generate such data in my programming language. The ORM then takes care of writing things into the database. I don't have to worry about changing column names in the data because I need to fix the mapping anyway. If refactoring is involved, it usually takes care of the startup/test data as well.
Excel is an unnecessary component of this process.
Script the current version of the database components that you want to reuse, and add the script to your source control system. When you need to make changes in the future, either modify the entities in the database and regenerate the script, or modify the script and regenerate the database.
Avoid mixing Visual Studio's db designer and Excel as they only add complexity. Scripts and SQL Management Studio are your friends.
I am trying to start a new MVC project with tests, and I thought the best way to go would be to have two databases: one for testing against, and one for when I run the app and use it (really also test, as it's not production yet).
For the test database I was thinking of putting CREATE TABLE scripts and data-fill scripts within the test setup method, and then deleting all of this in the teardown method.
I am going to be using LINQ to SQL, though, and I don't think that will allow me to do this?
Will I have to go the ADO route if I want to do it this way? Or should I just use a mock object and store data as an array or something?
Any tips on best practices?
How did Jeff go about doing this for Stack Overflow?
What I do is define an interface for a DataContext wrapper and use an implementation of the wrapper for the DataContext. This allows me to use an alternate, fake DataContext implementation in my tests (or mock it, if easier). This abstracts the database out of my unit tests completely. I found some starter code at http://andrewtokeley.net/archive/2008/07/06/mocking-linq-to-sql-datacontext.aspx, although I've extended it so that it handles the validation implementations on my entity classes.
I should also mention that I have a separate staging server for QA, so there is live testing of the entire system. I just don't use an actual database in my unit testing.
I checked out the link from tvanfosson, as well as RikMigrations, and after playing about with them I like the mocking DataContext method best. I realised I don't need to create tables and drop them all the time.
After a little more research I found Stephen Walther's article http://stephenwalther.com/blog/archive/2008/08/17/asp-net-mvc-tip-33-unit-test-linq-to-sql.aspx which to me seems easier and more reliable.
So I am going with this implementation.
Thanks for the help.
You may want to find some other way around actually hitting the database for your unit tests because it takes a lot more time. That being said, have you considered using Migrations for creating / deleting your tables instead of using sql scripts? RikMigrations is what I have been using to create my database so I can easily revision all of my code in one place. Justin Etheredge has a great article on using RikMigrations.
Consider these methods on DataContext:
http://msdn.microsoft.com/en-us/library/system.data.linq.datacontext.createdatabase.aspx
http://msdn.microsoft.com/en-us/library/system.data.linq.datacontext.executecommand(v=VS.100).aspx
I agree with much of the above, relating to unit testing. However, I think it's important to raise the point that using Mock Repositories and unit tests doesn't give you the same level of tests as a DB Integration Test would.
For example, our databases often have cascading deletes built right into the schema. In this case, deleting a primary entity in an aggregate will automatically delete all child entities. However, this would not automatically apply in a mocked repository that was not backed by a physical database with these rules (unless you built all of those rules into the mock). This is important because if somebody changes the design of my schema, I need it to break my tests so I can adjust the code/schema accordingly. I appreciate that this is integration testing and not unit testing, but it is worth mentioning.
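For instance, a schema rule like the following is enforced by the real database but silently absent from a hand-rolled mock (tables here are hypothetical):

```sql
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY
);

-- Deleting an order automatically deletes its lines; a mocked repository
-- won't do this unless the rule is re-implemented inside the mock
CREATE TABLE OrderLine (
    OrderLineID INT PRIMARY KEY,
    OrderID     INT NOT NULL
        REFERENCES Orders (OrderID) ON DELETE CASCADE
);
```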
My preferred option is to create a Master Design Database that contains sample data (the same sort of data you would create in your mocks). At the start of each test run, an automated script creates a backup of the MasterDB and restores it to "TestDB" (which all my tests use). That way, I maintain a repository of clean test data in Master that recreates itself on each test run. My tests can play around with the data and test all the scenarios needed.
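As a sketch, assuming SQL Server and hypothetical paths and logical file names, the per-run refresh amounts to:

```sql
-- Refresh TestDB from the Master Design Database before each test run
BACKUP DATABASE MasterDesignDB
TO DISK = N'C:\Backups\MasterDesignDB.bak'
WITH INIT;

RESTORE DATABASE TestDB
FROM DISK = N'C:\Backups\MasterDesignDB.bak'
WITH REPLACE,
     MOVE 'MasterDesignDB'     TO N'C:\Data\TestDB.mdf',
     MOVE 'MasterDesignDB_log' TO N'C:\Data\TestDB_log.ldf';
```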
When I debug the application, I have another script that backs up and restores the Master DB to a DEV database. I can play around with data here too without worrying about losing my sample data. I don't typically run this particular script every session because of the delay waiting for the DB to be recreated. I may run it once a day and then play around/debug the app throughout the day. If for example, I delete all the records from a table as part of my debugging, I would run the script to recreate the DevDB when I'm done.
These steps sound like they would add a huge amount of time to the process, but actually - they don't. Our application currently has in the region of 3500 tests, with about 3000 of them accessing the DB at some point. The database backup and restore typically takes around 10-12 seconds at the start of each test run. And since the whole test suite is only executed upon TFS checkin, we don't mind if we have to wait a while longer anyway. On an average day, our entire test suite takes about 15-20 minutes to run.
I appreciate and accept that integration testing is much slower than unit testing (because of the inherent need to use a real DB), but it more closely represents the 'real world' app. For example, mock repositories don't return DB error codes, they don't time out, they don't lock up, and they don't run out of disk space.
Unit tests are ok for simple calculations, basic business rules, etc. and certainly they are absolutely the best choice for most operations that don't involve DB (or other resource) access. But I don't think they are as valuable as integration tests - people talk a lot about unit tests, but little is said about integration tests.
I expect those passionate about unit tests will be sending flames my way for this. That's fine - I'm just trying to bring some balance and to remind people that projects that are full of passed unit tests can still fail badly the moment you implement them in the field.
This article gives an example of mocking LINQ to SQL with Typemock.
http://blog.benhall.me.uk/2007/11/how-to-unit-test-linq-to-sql-and.html