Integration tests vs unit tests on 2 microservices

Our team designed 2 microservices in 2 distinct GitHub repositories.
The first one is responsible for transforming file inputs into data inserted into a (graph/Neo4j) database.
The second one is a REST API responsible for read-only queries to that database, returning JSON.
Since these are very tightly coupled services (if the algorithm that populates/updates the data in the database is wrong, the data will be corrupted), my approach was to create unit tests on both repos with curated inputs and expected outputs,
i.e. prepared input files mapped to expected nodes, and prepared input nodes mapped to expected JSON.
Does that approach make sense? What about integration tests, which I suspect overlap in definition here? Are they needed? What would they cover that unit tests won't in such a scenario?

I wouldn't say these were tightly coupled. They're both coupled to the database, but that provides a layer of separation from each other. Anything which crosses components is a kind of integration test anyway.
In the case of the API, the "real" unit testing approach would be to mock the database calls, so the database is not part of the situation. How straightforward or robust that is depends on how involved the data access is. If that's not practical, restore a test database to a known state before you begin.
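For the API, a rough sketch of what "mock the database calls" could look like (Python; the handler, the db client and run_query are invented stand-ins, not the actual services):

    # A minimal sketch (Python + pytest), assuming a hypothetical read-only handler
    # that takes its database client as a dependency; all names are illustrative.
    from unittest.mock import Mock

    def get_person(db, person_id):
        # Hypothetical handler: query the graph, shape the JSON response.
        record = db.run_query(
            "MATCH (p:Person {id: $id}) RETURN p.name AS name", id=person_id
        )
        return {"id": person_id, "name": record["name"]}

    def test_get_person_returns_expected_json():
        fake_db = Mock()
        fake_db.run_query.return_value = {"name": "Ada"}

        assert get_person(fake_db, 42) == {"id": 42, "name": "Ada"}
        fake_db.run_query.assert_called_once()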
In the case of file loading the same basic idea applies, although mocking db calls for a data load process is likely to be brittle. If you use a real database, you'd need to start it from a known state, then validate its state afterwards.
Question to ask yourself: is it really one operation, or are "parse file" and "write content to db" two units?
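If you treat them as two units, the parsing step becomes trivially unit-testable without a database, and only the thin write step needs an integration test against a real or seeded database. A sketch, again with made-up names:

    # Sketch only: separate "parse file" (pure, unit-testable) from "write to db".
    def parse_nodes(lines):
        """Turn raw input lines into node dicts; no database involved."""
        return [{"id": int(i), "label": label}
                for i, label in (line.strip().split(",") for line in lines if line.strip())]

    def load_nodes(db, nodes):
        """Thin write step; cover this with an integration test against a seeded DB."""
        for node in nodes:
            db.run_query("MERGE (n:Node {id: $id}) SET n.label = $label", **node)

    def test_parse_nodes():
        assert parse_nodes(["1,alpha", "2,beta"]) == [
            {"id": 1, "label": "alpha"},
            {"id": 2, "label": "beta"},
        ]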


Best practice around GraphQL nesting depth

Is there an optimum maximum depth to nesting?
We are often presented with the option of representing complex hierarchical data models with the nesting they demonstrate in real life. In my work this is genetics, modelling protein / transcript / homology relationships, where it is possible to have very deep nesting of up to maybe 7 or 8 levels. We use dataloader to make nested batching more efficient, and resolver-level caching with directives. Is it good practice to model a schema on a real-life data model, or should you focus on making your resolvers reasonable to query and keep nesting to a maximum ideal depth of, say, 4 levels?
When designing a schema, is it better to create a different parent resolver for each type, or to use arguments that direct a conditional response?
If I have two sets of, for example, 'cars', say cars produced by Volvo and cars produced by Tesla, the underlying data, while having similarities, is originally pulled from different APIs with different characteristics. Is it best practice to have a tesla_cars and a volvo_cars resolver, or one cars resolver which uses, for example, a manufacturer argument to act differently on the data it returns and homogenise the response, especially where there may then be a sub-resolver that expects certain fields which may not be similar in the original data?
Or is it better to say that these two things are both cars, but the shape of the data we have for them is significantly different, so it's better to create separate resolvers with totally or notably different fields?
Should my resolvers and GraphQL APIs try to model the data they describe, or should I allow duplication in order to create efficient, application-focused queries and responses?
We often find ourselves wondering: should we have a separate API for applications X and Y that may use the underlying data (and possibly even multiple sources, such as different databases or API calls inside resolvers) very differently, or should we try to make a resolver work with any application, even if that means using type-like arguments to allow custom filtering and conditional behaviour?
Is there an optimum maximum depth to nesting?
In general I'd say: don't restrict your schema. Your resolvers / data fetchers will only get called when the client requests the corresponding fields.
Look at it from this point of view: if your client needs the data from 8 levels of the hierarchy to work, then it will ask for that data no matter what. With a restricted schema the client will have to execute multiple requests; with an unrestricted schema it can get everything it needs in a single request. The amount of processing on your server side and the amount of data stay the same either way, just split across multiple network requests in the restricted case.
The unrestricted schema has several benefits:
The client can decide whether it wants all the data at once or to use multiple requests
The server may be able to optimize the data fetching process (e.g. not fetching duplicate data) when it knows everything the client wants to receive
The restricted schema on the other hand has only downsides.
When designing a schema, is it better to create a different parent resolver for each type, or to use arguments that direct a conditional response?
That's a matter of taste and of what you want to achieve. But if you expect your application to grow and incorporate more car manufacturers, your API may become messy if there are lots of abc_cars and xyz_cars queries.
Another thing to keep in mind: even if the shape of the data is different, all cars have something in common: they are some kind of type Car. And all of them have, for example, a construction year. If you want to be able to query "all cars sorted by construction year", you will need a single query endpoint.
You can have a single cars query endpoint in your API and then use interfaces to query different kinds of cars, just like GraphQL Relay's node endpoint works: a single endpoint that can query all types that implement the Node interface.
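As a rough illustration of the interface approach (sketched here in Python with the graphene library purely for illustration; the car fields and the data-fetching helper are invented):

    import graphene

    # A shared Car interface with the fields every manufacturer has in common.
    class Car(graphene.Interface):
        id = graphene.ID(required=True)
        manufacturer = graphene.String()
        construction_year = graphene.Int()

    # Concrete types can still expose manufacturer-specific fields.
    class TeslaCar(graphene.ObjectType):
        class Meta:
            interfaces = (Car,)
        battery_range_km = graphene.Int()

    class VolvoCar(graphene.ObjectType):
        class Meta:
            interfaces = (Car,)
        engine_type = graphene.String()

    def fetch_and_homogenise_cars():
        # Stand-in for pulling from the two upstream APIs and normalising the shape.
        return [
            TeslaCar(id="1", manufacturer="Tesla", construction_year=2021, battery_range_km=500),
            VolvoCar(id="2", manufacturer="Volvo", construction_year=2019, engine_type="hybrid"),
        ]

    class Query(graphene.ObjectType):
        # One cars endpoint; an optional argument narrows the result instead of
        # separate tesla_cars / volvo_cars queries.
        cars = graphene.List(Car, manufacturer=graphene.String())

        def resolve_cars(root, info, manufacturer=None):
            all_cars = fetch_and_homogenise_cars()
            if manufacturer:
                return [c for c in all_cars if c.manufacturer == manufacturer]
            return all_cars

    schema = graphene.Schema(query=Query, types=[TeslaCar, VolvoCar])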
On the other hand, if you've got a very specialized application, where your type is not extensible (like for example white and black chess pieces), then I think it's totally valid to have a white_pieces and black_pieces endpoint in your API.
Another thing to keep in mind: With a single endpoint some queries become extremely hard (or even impossible), like "sort white_pieces by value ascending, and black_pieces by value descending". This is much easier if there are separate endpoints for each color.
But even this is solvable if you have a single endpoint for all pieces, and simply call it twice.
Should my resolvers and GraphQL APIs try to model the data they describe, or should I allow duplication in order to create efficient, application-focused queries and responses?
That's a question of use case and scalability. If you have exactly two types of clients that use the API in different ways, just build two separate APIs. But if you expect your application to grow and gain more, different clients, then of course it will become an unmaintainable mess to have 20 APIs.
In this case have a look at schema directives. You can for example decorate your types and fields to make them behave differently for each client or even show/hide parts of your API depending on the client.
Summary:
Build your API with your clients in mind.
Keep things object oriented, make use of interfaces for similar types.
Don't provide endpoints your clients don't need; you can still extend your schema later if necessary.
Think of your data as a huge graph ;) that's what GraphQL is all about.

Is it okay to have more than one repository for an aggregate in DDD?

I've read this question about something similar but it didn't quite solve my problem.
I have an application where I'm required to use data from an API. Problem is there are performance and technical limitations to doing this. The performance limitations are obvious. The technical limitations lie in the fact that the API does not support some of the more granular queries I need to make.
I decided to use MySQL as a queryable cache.
Since the data I needed to retrieve from the API did not change very often, I settled on refreshing the cache once a day, so I didn't need any complicated mapper that checked whether we had the data in the cache and, if not, fell back to the API. That was my first design, but I realized it wasn't very practical when the API couldn't support most of the queries I needed to make anyway.
Now I have a set of two mappers for every aggregate. One for MySQL and one for the API.
My problem now is how to hide the complexities of persistence from the domain, and the fact that it seems I need multiple repositories.
Ideally I would have an interface that both mappers adhered to, but as previously disclosed that's not possible.
Is it okay to have multiple repositories, one for each mapper?
Is it okay to have more than one repository for an aggregate in DDD?
Short answer: yes.
Longer answer: you won't find any suggestion of multiple repositories in the original book by Evans. As he described things, the domain model would have one representation of the aggregate, and the repository abstraction provided consumers with the illusion that the aggregate was stored in an in-memory collection.
Largely, this makes sense -- you are trying to ensure that writes to data within the aggregate boundary are consistent, so you need a single authority for change.
But... there's no particular reason that reads need to travel through the same code path as writes. Welcome to the world of CQRS. What that gives you immediately is the idea that the in-memory representation used for reads might need to be optimized differently from the in-memory representation used for writes.
In its more general form, you get the idea that the concept that you are modeling might have different representations for each use case.
For your case, where it is sometimes appropriate to read from the RDBMS, sometimes from the API, sometimes both, this isn't quite an exact match -- the repository interface hides the implementation details from the consumer, but you still have to bother with the implementation.
One thing you might look at is your requirements; how fresh does the data need to be in each use case? A constraint that is often relaxed in the CQRS pattern is the idea that the effects of writes are immediately available for reading. The important question to ask would be, if the data hasn't been cached yet, can you simply report "data not available" without hitting the API?
If so, then use cases that access the cached data need only a single repository implementation.
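For instance, a read-side sketch under those relaxed freshness rules might look like this (Python, with invented table and class names; the point is only that "not cached yet" becomes an answer rather than a trigger to call the API):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ProductView:
        sku: str
        name: str
        price_cents: int

    class CachedProductQueries:
        """Read-side only: serves whatever the daily refresh put in MySQL."""

        def __init__(self, connection):
            self._conn = connection  # any DB-API style connection

        def find_by_sku(self, sku: str) -> Optional[ProductView]:
            cur = self._conn.cursor()
            cur.execute(
                "SELECT sku, name, price_cents FROM product_cache WHERE sku = %s", (sku,)
            )
            row = cur.fetchone()
            # If the nightly refresh hasn't cached it, report "not available"
            # instead of falling back to the slow/limited API.
            return ProductView(*row) if row else None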
If you are using an external API to read and modify data, you can cache it locally to make reads faster, but I would avoid having a domain repository for it.
From the domain perspective it seems that you need a service to query for some data (or just a Query in a CQRS implementation). You can do that with a service that internally calls some remote API or reads from a local cache (MySQL, whatever).
When you read your local cache you can write a repository to decouple your logic from the db implementation, but this is a different concept from a domain repository; it is just a detail of your technical implementation, and has nothing to do with your domain.
If the remote service starts offering the query you need, you will change how your query is executed, calling the remote API instead of the db, but your domain model should not change.
A domain repository is used to load and persist your aggregates, whereas if you are working with external aggregates (in a different context or subdomain) you need to interact with them through services.

How to test (unit test) on ETL process?

I know several small companies do not do any testing of their ETL process, but that seems suboptimal from a software engineering perspective.
How do people usually do testing/unit testing/functional testing of an ETL process?
We recently worked on a project where the governance board demanded 'You must have Unit Tests' and so we tried our best.
What worked for us was to have each ETL solution start and end with a QA/Test package.
Anything unexpected discovered by these packages was logged into an audit table, and a Fail Package event was then raised to stop the entire job. We figured it was better to run with yesterday's good data than to risk reporting against possibly bad 'today' data.
The starting package would do db schema and data sanity checks. Data Sanity involved checking for duplicate or missing data caused by a lack of Referential Integrity in the source systems. Schema checks ensured that any schema changes that did not get applied during Continuous integration were detected.
The end package would check the results of any transformations. These included:
Comparing record counts between source|destination
Checking specific transforms (e.g. all date values changed to the appropriate SK value, all string values right-trimmed)
Ensuring all SK fields were populated (-1 instead of nulls)
Most of these tests were SQL statements that used the built-in schema objects of our database, so they were not too onerous to create.
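As a rough illustration of one such check (shown here in Python for readability; the table names are invented, and in practice these were plain SQL statements executed by the package):

    # Sketch: a record-count reconciliation check that writes its outcome to an audit table.
    def check_record_counts(conn, run_id):
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM staging.sales")        # source side
        source_count = cur.fetchone()[0]
        cur.execute("SELECT COUNT(*) FROM warehouse.fact_sales")  # destination side
        dest_count = cur.fetchone()[0]

        passed = source_count == dest_count
        cur.execute(
            "INSERT INTO audit.etl_checks (run_id, check_name, passed, detail) "
            "VALUES (%s, %s, %s, %s)",
            (run_id, "record_count_reconciliation", passed,
             f"source={source_count}, destination={dest_count}"),
        )
        conn.commit()
        return passed  # caller raises the Fail Package event if False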
In addition, as part of our development process we would create views that had the end result of any transformations we were doing. We would make use of these views to validate our package transformations.
Each of these checks created a record in our special audit table. That way we could provide a comprehensive list of all the tests and checks we had done on each run of the process to satisfy the governance people.
(We also had a separate set of packages that would unit test each QA test by creating dummy tables, populating them, running the test, then confirming that the appropriate audit record was written. As Nick stated, this was a lot of work and of little real value.)
Testing an ETL is usually a problem. More precisely, testing isn't the problem; the problem is how to get reasonable test data. ETL is typically tested on production data. Aside from the security issue, the problem with production data is that it does not cover the functionality of the ETL sufficiently (typically about 40% of business rules aren't covered by a production data sample) and it takes too much time to process.
Recently we developed a test data generator (for more detail, please look for GTL QAceGen: Business Logic Driven Data Generator on the Informatica Marketplace) which generates test data into source tables/files based on business rule specifications. The tool takes into consideration any foreign keys applied, and it works with any major ETL tool and/or database.
This tool helps speed up the testing cycle by at least 50% (compared to manual testing) and covers 100% of all business rules. It also generates quite detailed reports and, more importantly, these tests can be repeated at any time (i.e. regression tests).
You can unit test ETLs.
End-to-end tests are good, but slow, expensive and difficult to construct and keep stable.
Unit testing ETLs is highly desirable so you can test all data permutations, but it is generally put into the too-hard basket. However, it is possible to write true unit tests for ETLs that run quickly and reliably.
We have found that the key is to decompose the ETL into two separate sections. Since an ETL is Extract-Transform-Load, the key is to separate the T from the E&L: make a pure Transform function that transforms an input dataset into an output dataset, then call this function from the Extract and Load module.
The Extract and Load module isn't suitable for unit testing because it will generally involve external data sources and sinks, access tokens and user permissions, etc.
But all of the testable logic should be in the Transform component. Test this function from any unit testing framework - you will be able to pass in predefined datasets and test the transformed output against expected results. With some thinking we have even managed to create unit tests that test multi-stage updates of datasets onto each other.
Our particular implementation was done on Databricks in Scala, but the concept should work on any platform.
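A minimal sketch of that separation (Python/pandas here rather than the Scala/Databricks implementation mentioned above; the column names and rules are invented):

    import pandas as pd

    # Pure Transform step: dataframe in, dataframe out. No connections, no side effects.
    def transform_orders(raw: pd.DataFrame) -> pd.DataFrame:
        out = raw.copy()
        out["customer_name"] = out["customer_name"].str.strip()
        out["order_date"] = pd.to_datetime(out["order_date"])
        out["total_cents"] = (out["unit_price"] * out["quantity"] * 100).round().astype(int)
        return out[["order_id", "customer_name", "order_date", "total_cents"]]

    # Unit test: predefined input dataset, expected output dataset.
    def test_transform_orders():
        raw = pd.DataFrame({
            "order_id": [1],
            "customer_name": ["  Ada  "],
            "order_date": ["2024-01-31"],
            "unit_price": [2.5],
            "quantity": [4],
        })
        result = transform_orders(raw)
        assert result.loc[0, "customer_name"] == "Ada"
        assert result.loc[0, "total_cents"] == 1000
        assert result.loc[0, "order_date"] == pd.Timestamp("2024-01-31")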
We've set up a system where, for each ETL procedure, we define an input dataset and an expected result dataset. Using Robot Framework, we then run a three-part test for each ETL procedure: the first part inserts the input dataset into the source data tables, the second part runs the ETL, and the third part compares the actual results with our expected results.
This works pretty well for us, but there are a couple of downsides: first of all, we create the test datasets manually for each ETL procedure which takes some work, and secondly, this means that testing for "unexpected" inputs is not done.
For the automated unit testing we have a separate environment in which we can install builds of our entire DW automatically.
Testing an ETL process fits into the following stages:
Identify Business requirements
Validate Data sources
Prepare test cases
Extract Data from different sources
Apply transformation logic to validate data
Load data into the destination
Reporting analysis
We can also categorize the ETL testing process as follows:
Product validation
Source to target data testing
Metadata testing
Performance testing
Integration and quality testing
Report testing

Should your test data be in the same form as the live data?

When testing systems (any system, really, e.g. a database), is it important that the test data is in the same form (format) as the live data?
To what degree do you allow differences in the two types of data?
Thanks
Put it this way: the more different your test data is from your live data, the less valuable the testing actually is. So yes, your test data should be as close as possible to your live data.
Barring specific reasons to use fake data, I think it's important to get as close as you can to the live data when testing. Otherwise you will definitely miss issues.
Specific reasons you might use fake data:
live data has privacy or sensitivity concerns; you might use fake credit card numbers (but in the proper format), or obfuscate names and phone numbers (see the sketch after this list)
live data volume is too high for speedy testing; in this case you should select a representative sample
using live data might cause external impacts; for example, you might not want to use real email addresses if emails could go to real users during tests. However, this last one is better solved by mocking your email system.
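A small sketch of that first kind of masking (Python; the field names and the fixed test card number are illustrative, not taken from any real system):

    import hashlib

    def mask_customer(record: dict) -> dict:
        """Keep the shape and format of live data, but strip anything sensitive."""
        masked = dict(record)
        # Deterministic pseudonym so joins/duplicates still behave like live data.
        masked["name"] = "Customer-" + hashlib.sha1(record["name"].encode()).hexdigest()[:8]
        # Keep a valid format (16 digits, passes a Luhn check) without a real number.
        masked["card_number"] = "4111111111111111"
        # Preserve the area code so region-based logic still gets exercised.
        masked["phone"] = record["phone"][:3] + "-555-0100"
        return masked

    print(mask_customer({"name": "Ada Lovelace",
                         "card_number": "4929123456789012",
                         "phone": "020-7946-0000"}))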
I try to use both: test data that hits specific cases I have designed (often modified from live data), and a significant volume of live data whenever it is available, which hits a large number of scenarios that could definitely impact customers and may include scenarios I haven't thought of.
Keep in mind precisely what you are testing at any given moment. If you are just testing that the data acceptance service grabs files (it should grab any file and reject bad formats later), then you don't care so much about what is inside the file, and you will need at least some test files in other formats. In that case, maybe just changing the extension on a notepad file would be enough for the functionality testing, with some large generated files to test file size, etc.
Using non-accurate test data could be especially useful if the format is still being worked out while the devs start work on the other parts of the system. However, you will want to run live or similar-to-live data through every part of your system for integration and end-to-end testing at some point.
I disagree with MusiGenesis, unless you are testing your ability to read from the data source.
If you are just testing how the system performs with certain data, then you can just use mocking to remove all connectivity to external data sources. However, if you need to test things like handling failures in connections and dropping connections, then you will probably want to try to connect to the same type of data source.
I think it's more complex than some people have made out, and I would generally have the following test environments:
Unit Test - Partial Copy of production data
System Test - Stale but full copy of production data with interfaces from other system test environments
Production Acceptance - Same as system test but fed from other PA systems and may have more data if you use massive data sets
Production maintenance - Copy of production refreshed frequently (e.g. weekly) with no interfaces but the ability to implement them quickly. This is used for fixing big production mistakes.

ASP.NET MVC TDD with LINQ and SQL database

I am trying to start a new MVC project with tests, and I thought the best way to go would be to have 2 databases: 1 for testing against, and 1 for when I run the app and use it (which is also really testing, as it's not production yet).
For the test database I was thinking of putting create table scripts and fill data scripts within the test setup method and then deleting all this in the tear down method.
I am going to be using LINQ to SQL, though, and I don't think that will allow me to do this?
Will I have to go the ADO route if I want to do it this way? Or should I just use a mock object and store data as an array or something?
Any tips on best practices?
How did Jeff go about doing this for StackOverflow?
What I do is define an interface for a DataContext wrapper and use an implementation of the wrapper for the DataContext. This allows me to use an alternate, fake DataContext implementation in my tests (or mock it, if easier). This abstracts the database out of my unit tests completely. I found some starter code at http://andrewtokeley.net/archive/2008/07/06/mocking-linq-to-sql-datacontext.aspx, although I've extended it so that it handles the validation implementations on my entity classes.
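The original is C# with LINQ to SQL, but the wrapper idea is language-agnostic; a rough Python rendering of the same pattern (all names invented) could look like:

    from typing import Protocol, List

    class OrderDataContext(Protocol):
        """The narrow interface the application code depends on."""
        def orders_for_customer(self, customer_id: int) -> List[dict]: ...

    class FakeDataContext:
        """Test double: in-memory data instead of a real database."""
        def __init__(self, orders: List[dict]):
            self._orders = orders

        def orders_for_customer(self, customer_id: int) -> List[dict]:
            return [o for o in self._orders if o["customer_id"] == customer_id]

    def open_order_count(db: OrderDataContext, customer_id: int) -> int:
        # Application logic under test; it only sees the interface.
        return sum(1 for o in db.orders_for_customer(customer_id) if o["status"] == "open")

    def test_open_order_count():
        db = FakeDataContext([
            {"customer_id": 1, "status": "open"},
            {"customer_id": 1, "status": "shipped"},
            {"customer_id": 2, "status": "open"},
        ])
        assert open_order_count(db, 1) == 1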
I should also mention that I have a separate staging server for QA, so there is live testing of the entire system. I just don't use an actual database in my unit testing.
I checked out the link from tvanfosson and RikMigrations, and after playing about with them I prefer the mocked-DataContext method. I realised I don't need to create and drop tables all the time.
After a little more research I found Stephen Walther's article http://stephenwalther.com/blog/archive/2008/08/17/asp-net-mvc-tip-33-unit-test-linq-to-sql.aspx which to me seems easier and more reliable.
So I am going with this implementation.
Thanks for the help.
You may want to find some other way around actually hitting the database for your unit tests because it takes a lot more time. That being said, have you considered using Migrations for creating / deleting your tables instead of using sql scripts? RikMigrations is what I have been using to create my database so I can easily revision all of my code in one place. Justin Etheredge has a great article on using RikMigrations.
Consider these methods on DataContext:
http://msdn.microsoft.com/en-us/library/system.data.linq.datacontext.createdatabase.aspx
http://msdn.microsoft.com/en-us/library/system.data.linq.datacontext.executecommand(v=VS.100).aspx
I agree with much of the above, relating to unit testing. However, I think it's important to raise the point that using Mock Repositories and unit tests doesn't give you the same level of tests as a DB Integration Test would.
For example, our databases often have cascading deletes built right into the schema. In this case, deleting a primary entity in an aggregate will automatically delete all child entities. However, this would not automatically apply in a mocked repository that was not backed by a physical database with these business rules (unless you built all of those rules into the mock). This is important because if somebody comes along and changes the design of my schema, I need that change to break my tests so I can adjust the code/schema accordingly. I appreciate that this is integration testing and not unit testing, but it was worth mentioning.
My preferred option is to create a Master Design Database that contains sample data (the same sort of data you would create in your mocks). At the start of each test run, an automated script creates a backup of the MasterDB and restores it to "TestDB" (which all my tests use). That way, I maintain a repository of clean test data in Master that recreates itself on each test run. My tests can play around with the data and test all the scenarios needed.
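A sketch of what that per-run restore can look like (Python/pytest calling a hypothetical restore script; the script name and flags are placeholders for whatever backup/restore tooling you actually use):

    import subprocess
    import pytest

    @pytest.fixture(scope="session", autouse=True)
    def fresh_test_database():
        """Restore TestDB from the MasterDB backup once, before the whole test run."""
        # restore_testdb.sh is a stand-in for your backup/restore tooling
        # (e.g. RESTORE DATABASE on SQL Server, or a container snapshot).
        subprocess.run(["./restore_testdb.sh", "--from", "MasterDB", "--to", "TestDB"],
                       check=True)
        yield
        # Nothing to tear down: the next run restores a clean copy again.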
When I debug the application, I have another script that backs up and restores the Master DB to a DEV database. I can play around with data here too without worrying about losing my sample data. I don't typically run this particular script every session because of the delay waiting for the DB to be recreated. I may run it once a day and then play around/debug the app throughout the day. If for example, I delete all the records from a table as part of my debugging, I would run the script to recreate the DevDB when I'm done.
These steps sound like they would add a huge amount of time to the process, but actually - they don't. Our application currently has in the region of 3500 tests, with about 3000 of them accessing the DB at some point. The database backup and restore typically takes around 10-12 seconds at the start of each test run. And since the whole test suite is only executed upon TFS checkin, we don't mind if we have to wait a while longer anyway. On an average day, our entire test suite takes about 15-20 minutes to run.
I appreciate and accept that integration testing is much slower than unit testing (because of the inherent need to use a real DB), but it more closely represents the 'real world' app. For example, mock repositories don't return DB error codes, they don't time out, they don't lock up, and they don't run out of disk space, etc.
Unit tests are ok for simple calculations, basic business rules, etc. and certainly they are absolutely the best choice for most operations that don't involve DB (or other resource) access. But I don't think they are as valuable as integration tests - people talk a lot about unit tests, but little is said about integration tests.
I expect those passionate about unit tests will be sending flames my way for this. That's fine - I'm just trying to bring some balance and to remind people that projects that are full of passed unit tests can still fail badly the moment you implement them in the field.
This article gives example of mocking linq to sql with typemock.
http://blog.benhall.me.uk/2007/11/how-to-unit-test-linq-to-sql-and.html