Can BDD work for Big Data ETL testing?

Can BDD work for Big Data ETL testing? - testing

I was wandering if anyone uses BDD for testing a Big Data ETL application?
I can see how BDD can be used for testing applications having a client interact with them, but in case of Big Data ETL application there is no client interaction so its hard to see what 'When' I might use.
For example:
Give 100 event of type A occur
And 50 event of type B occur after 5 minute
Then database rows should be:
|Type|Count|Bucket|
|A|100|1|
|B|50|2|
But that seems wrong.
Any one with an insight?

Can you give me an example of what you'd expect to see in an ETL output?
There are a couple of responses you could give to this. One might be the different kinds of database rows you'd expect, and the fact that some of them will probably be repeated, but not others. That was something that struck me as weird, but if you're used to working with star schemas then you'll probably notice other differences instead.
Normally I'd steer people away from talking about the database, but if you're working with star schemas, I think it's OK to mention the facts and dimensions (I haven't worked with ETL a lot, but I do remember talking through specific examples of these and what I would expect to see).
The alternative is to use the client.
I saw that you said there was no client; however, there's always a client, even if it's one that might exist in the future. There are implications for ETL which run across security, performance and access, amongst others. It's worth having a client, even if it's a string-based or SQL-based toy, to explore the things which might trip you up.
Why are you doing this? What's new about the thing the business or users or customers will be able to do when this is in place, that they can't do already? And can you get hold of an example of that?
"We'll be able to understand how X is performing against Y standard."
Great. Can you give me an example of some X, some Y, and some standard? How will you measure the performance? What data will you be looking for? Should everyone be able to see that data? Can you think of any scenario where someone shouldn't be able to access that?
Those examples become the ETL equivalent of scenarios; the conversations retain the same pattern. You just end up automating them at a different level, since your API is machine-oriented rather than human-oriented, and some of your conversations will be about monitoring instead of testing. Your conversations should still be with the people.
Your "when" will be the query or report that you run, within the data, permission and security context in which you run it.

BDD always works for application logic inside Big data space. Remember the testing triangle principle. Have your unit tests. Practice BDD and build your integration and acceptance tests with BDD and within your sprint. Its not recommended to have your test data externally maintained and thus validating E2E flow with all moving pieces needs to be light weight. Practice TDD model if permits.

Related

Fastest way to understand business logic in a new project

I would like to know the fastest/best way to learn the business logic in a new project.
Most projects have been running for years, some of them are poorly documented, but you still need to know how to work with them. What is the best way to do this? (Use Case Diagram / support from colleagues / code analysis etc.)

The problem with verbose logging is that you may be overloaded with details that do not help, and even if you have the right details, you may misunderstand the big picture.
Moreover, if projects run for years, and are poorly documented, chances are the little available documentation is already obsolete. And chances are, the team did not invest heavily in logging either.
Reverse engineering the code is another approach, but where to start if there are millions of lines of code in the legacy system? Some things can be easily read in code, but many more complex, emergent behavior comes from the interactions between many classes, and this kind of knowledge is the most difficult to extract.
So here is the way to go:
Talk to colleagues. The best approach to move knowledge from one brain to another is direct conversation. It works much better than any formal diagram or any documentation. Unfortunately this is not always possible (e.g. team left)
If 1 is not possible, understand the business user's point of view. May be there is a user manual? Maybe some colleagues of the user-support? if none of those are possible, the ultimate way is to spend some time in the day of the user's life. You will not understand how the system works, but at least you'll get a quick intro in what the system is supposed to do, what matters to the users, and maybe some business rules.
Check for automated test cases. In fact, such test cases are a hidden and up-to date documentation resource.
Check for non-automated test cases, in particular use acceptance tests, and integration tests. If these are not automated, there are chances that they are already obsolete. But it's better than nothing.
Reverse engineer the code. Identify the main classes and how they interact. And yes, some simplified class diagrams will help you to understand how classes are related (no need to document properties and methods: these can be found back in the code). And some sequence diagram will help you to get a picture of the more complex interactions.

Run server, log verbose everywhere. Follow code flow and dig in.

Automatisation&Piping of diverse tasks

I am looking for recommendations for a very generic automation/task execution tool. The scope is somewhat between a script, a build system like make and orchestration tools like Ansible or Puppet. The best I can do is describe my rather vague 'requirements' and hope for clues how others have solved these problems. Sorry for the long description, I guess I don't really know what exactly I want he solution to do. I profit from programming answers on SO all the time but I am not entirely sure if my open ended question is acceptable here.
--
We work as data analysts/system validators in a corporate setting. We perform a range of diverse tasks and interact with lots of ever changing systems. Each little step we do is arguably mundane/easy, but the bigger picture only forms if lots of iterations with slightly different inputs or combinations are repeated. It is a bit like looking for a needle in a hay stack, but the concrete problem is slightly different every time. This makes it hard to use a normal script or automation tool, which require more structure to work. But doing things semi-manual without a big team does not allow us to cover all the analysis/cases we want/need.
To give an applied example: a typical tasks could involve setting up a big calculation in a vendor system, extracting their ASCII output from a web server and parsing it. Then we would suck raw input data from a set of configuration files and data bases. This is piped into some of our home grown replication tools/models living in C++. Then both the system's results and our replication is scanned for interesting outliers (e.g. regression tested) and only this subset is uploaded for human analysts to investigate, nicely presented in an Excel sheet.
We can do all these things easily by hand for a once-off or maybe using ad-hoc tools/scripts. We just can't do it repeatedly for ever so slightly different settings. We seem to need a library for 'common tasks' that are just specialized by some few inputs (e.g. task it to download a time series and scan for outliers - parameters would be db access/login and maybe parameters defining what an outlier is in that context). And then I need to chain these tasks together to make complex tasks repeatable and simple to build up from atomic steps.
I have not found anything really do something like this. There seems to be specialist scripting or tools for each niche available, but not something combining all the different tasks I need to perform.
I have been so far toying on and off with a minimalist sqlite database which controls a set of python 'scripts'/wrappers. These scripts take input parameters from the data base, and they are chained/piped based on the database. The scripts write their results back to the database, mostly as plain text and floats/ints. This kind of db interface is very error prone and complicated for humans; the idea is to have (template) scripts writing (concrete/parametrised) scripts to the db for execution, like rolling itself out before executing. Not sure if this is a smart idea, but the db is driving the scripts, without much interacting among these building block script; rather than having the conventional bunch of scripts calling each other and dumping some data into db as an after thought. So far we have lots of separate wrappers (scripts) to talk to all the systems and do the work, what is really missing is something tying it all together an controlling it.
I am interested (obviously) more in data/flow transparency, repeatability and chaining mini-programs together to bigger units, rather than speed or scaling to larger data sets. All the heavier lifting is either done in the systems we interact with, or it is delegated to C++ called from these python scripts. This is not a production system with more stability and fixed goals but rather a flexible analysis/investigation helper.
I really hope someone here has previously run into exactly that problem severely limiting our productivity, and we can just piggy back off your solution or ideas.

I would suggest that you consider staf (Software Test Automation Framework). It's open source, distributed, and cross-platform. It will run just about any task on just about any platform. It has a variety of plugin "Services" available for specific purposes, or you can create your own custom Service. You can also extend the functionality through scripting (jython) It's also well documented and reasonably well supported through user forums by IBM.

What are various methods for discovering test cases

All,
I am a developer but like to know more about testing process and methods. I believe this helps me write more solid code as it improves the cases I can test using my unit tests before delivering product to the test team. I have recently started looking at Test Driven Development and Exploratory testing approach to software projects.
Now it's easier for me to find test cases for the code that I have written. But I am curios to know how to discover test cases when I am not the developer for the functionality under test.
Say for e.g. let's have a basic user registration form that we see on various websites. Assuming the person testing it is not the developer of the form, how should one go about testing the input fields on the form, what would be your strategy? How would you discover test cases? I believe this kind of testing benefits from exploratory testing approach, i may be wrong here though.
I would appreciate your views on this.
Thanks,
Byte

Bugs! One of my favorite starting places on a project for adding new test cases is to take a look at the bug tracking system. The existing bugs are test cases in their own right, but they also can steer you towards new test cases. If a particular module is buggy, it can lead you to develop more test cases in that area. If a particular developer seems to introduce a certain class of bugs, it can guide testing of future projects by that developer.
Another useful consideration is to look more at testing techniques, than test cases. In your example of a registration form, how would you attack it from a business requirements perspective? Security? Concurrency? Valid/invalid input?

Testing Computer Software is a good book on how to do all kinds of different types of testing; black box, white box, test case design, planning, managing a testing project, and probably a lot more I missed.
For the example you give, I would do something like this:
For each field, I would think about the possible values you can enter, both valid and invalid. I would look for boundary cases; if a field is numeric, what happens if I enter a value one less than the lower bound? What happens if I enter the lower bound as a value? Etc.
I would then use a tool like Microsoft's Pairwise Independent Combinatorial Testing (PICT) Tool to generate as few test scenarios as I could across the cases for all input fields.
I would also write an automated test to pound away on the form using random input, capture the results and see if the responses made sense (virtual monkeys at a keyboard).

Ask questions. Keep a list of question words and force yourself to come up with questions about the product or a feature. Lists like this can help you get out of the proverbial box or rut. Don't spend too much time on a question word if nothing comes to you.
Who
Whose
What
Where
When
Why
How
How much
Then, when you answer them, ask "else" questions. This forces you to distrust, for a moment at least, your initial conclusions.
Who else
Whose else
etc..
Then, ask the "not" questions--negate or refute your assumptions, and challenge them.
Who not (eg, Who might not need access to this secure feature, and why?)
What not (what data will the user not care about? What will the user not put in this text box? Are you sure?)
etc...
Other modifiers to the qustions could be:
W else
W not
W risks
W different
Combine two question words, eg, Who and when.

In the case of the form, I'd look at what I can enter into it and test various boundary conditions there,e.g. what happens if no username is supplied? I'm reminded of there being a few different forms of testing:
Black box testing - This is where you test without looking inside what is being tested. The challenge here is not being able to see inside can cause issues with limiting what are useful tests and how many different tests are worthwhile. This is of course what some default testing can look like though.
White box testing - This is where you can look at the code and have metrics like code coverage to ensure that you are covering a percentage of the code base. This is generally better as in this case you know more about what is being done.
There are also performance tests compared to logic tests that are also worth noting somewhere,e.g. how fast does the form validate me rather than just does the form do this.

Identify your assumptions from different perspectives:
How can users possibly misunderstand this?
Why do I think it acts or should act this way?
What biases might I have about how this software should work?
How do I know the requirements/design/implementation is what's needed?
What other perspectives (users, administrators, managers, developers, legal) might exist on priority, importance, goals, etc, of this software?
Is the right software being built?
Do I really know what a valid name/phone number/ID number/address/etc looks like?
What am I missing?
How might I be mistaken about (insert noun here)?
Also, use any of the mnemonics and testing lists noted here:
http://www.qualityperspectives.ca/resources_mnemonics.html

Discussing test ideas with others. When you explain your ideas to someone else, you tend to see ways to refine or expand on them.

Group brainstorming sessions. (or informally in pairs when necessary)
see these brainstorming techniques

Make data tables with major features listed across the top and side, and consider possible interactions between each pair. Doing this in three dimensions can get unwieldy.

Keep test catalogs with common questions and problem types for different kinds of tasks such as integer validation and workflow steps etc.

Make use of Exploratory Testing Dynamics and Satisfice Heuristic Test Strategy Model by James Bach. Both offer general ways to start thinking more broadly or differently about the product, which can help you switch between boxes and heuristics in testing.

In functional testing, should I compare all tabular data rendered in the browser with the one coming from the DB?

I'm working on a test plan for a website where some tests are taking the following path:
Hit the requested URI and get the data rendered inside some table(20 rows per page).
Make a database query to get the data that is supposed to be rendered in that table.
Compare the 2 data row by row, they should match.
Is that a correct way of doing functional testing? If that request was an Ajax request, what will be the answer also? Would the answer differ for integration testing?
I have some reason that makes me believe that this is wrong somehow.... still need your opinions guys!

Yes, this could be a productive test. Either you have a fixed data set or you don't.
If you have a fixed data set, this is much easier to test, because all you're doing is comparing against a fixed output.
If you don't have a fixed data set, then you need to duplicate the business logic, effectively duplicating the work already done by the developer. Then you have two sets of logic to maintain.
The second is the best approach because you get two ways of doing the same thing, effectively a peer review of the specification and code. It's also very expensive in terms of time and resources, which is why most people choose to have a fixed data set.
To answer your question, if your business logic in the query is simple, then you can get a test very easily. However, the value that the test brings isn't great, because you aren't testing very much.
If the business logic is complex, you are getting more value from the test, but it's going to be harder to maintain in the long term.
For me, what your test does bring is a simple integration test that proves that the system reads correctly from the database, and displays the data correctly. This is a good test, even better if it is automated.

This seems fine for functional testing. Integration testing in my mind has to do with the testing of different technologies or components that are supposed to work together which is generally broader than functional testing. But of course this sort of testing could also be considered integration testing, depending on how your application is put together and where the testing is happening in the lifecycle of your development. For example it may be that in order for this site to work you have to put together a few components that were developed independently; this might be one of the tests to validate that the integration works.
Don't see how this being Ajax or not has anything to do with making the answer different.

I will likely be a dissenting opinion here, but I don't consider this to be a productive test. What you are doing is simply duplicating the code which produces the page. And any time you introduce duplicated code (even across departments) you'll be looking at defects cropping up long-term.
It is far better to load the DB with known data (either through the app, or directly) and then check that the output matches what you'd expect. This also ensures that your DB layer, or DB itself, hasn't modified the data in a way you do not expect.
That is:
Load known data (preferably through the app itself)
Load the requested URI
Check that displayed data matches your known data

This kind of test could be good for testing a large set of data with relatively little tester effort if there is not much developer logic between the database and the display to the end user. Our team has done this on a number of occasions, and it is especially useful for running large quantities of real production data through our tests to be sure that actual scenarios are handled as expected. Do make sure you do at least a little fixed input testing for rare scenarios that might be especially likely to be handled differently in the DB and on the web page - null values, special characters, and other oddities.
Personally, I would call this "integration testing", since you are testing the integration of the DB and the web site, and not "functional testing". For "functional testing", I'd probably want to make a mock of the datasource (e.g., the database) that will provide pre-written sets of data in the format you expect.
Having said that, if I had high confidence in the validity of the DB data and if the logic between the DB query and the web page display was very small and low-risk, I would probably not bother with the mock and would let the integration test cover the functionality as well. I don't know that testing the functionality and integration separately would be a big quality win in this case, and there are likely better things you could do with the available testing time. If there is a lot of logic around this data, you should probably test the integration separately from the functionality. Additional integration testing would probably include things like, "What if the database can't be reached?" and "What if the database is slow?".
While this technique will work with Ajax, make sure your testing tools will work with Ajax. Specifically, think about how you will capture the database query results and how you will gather the results displayed on the web page.
I'm assuming that the validity of the data in the query is being tested elsewhere, since you mentioned that this was just one type of test in the test plan. I'm also just discussing integration with the database and this report and not other features or components, and not other aspects of testing (performance, security. etc.), since that was the scope of your question.

How much a tester should know about internal details of code?

How useful, if at all, is for the testers on a product team to know about the internal code details of a product. This does not mean they need to know every line of code but a good idea of how the code is structured, what is the object model, how the various modules are inter-linked, what are the inter-dependencies between various features etc.? This can argubaly help them in finding related issues or defects once they hit one. On the other side, this can potentially 'bias' their "user-centric" approach towards evaluating and certifying the product and can effect the testing results in the end.
I have not heard of any specific model for such interaction. (Lets assume a product that users, potentially non-technical consume, and not a framework or API that the testers are testing - in the latter case the testers may need to understand the code to test that because the user is another programmer).

That entirely depends upon the type of testing being done.
For functional system testing, the testers can and probably should be oblivious to the details of the implementation -- if they know the details they may inadvertently account for that in their test strategy and not properly test the product.
For performance and scalability testing it's often helpful for the testers to have some high-level knowledge of the structure of the codebase, as it's beneficial in identifying potential performance hotspots, and therefore writing targetted test cases. The reason this is important is that generally performance testing is a broad open-ended process, so anything that can be done to focus the testing to get results is beneficial to everybody.

This sounds similiar to this previous question: Should QA test from a strictly black-box perspective?

I've never seen a circumstance where a tester who knew a lot about the internals of system was disadvantaged.
I would assert that there are self justifying myths that an informed tester is as adequate or even better than a deeply technical one because:
It allows project managers to use 'random or low quality resources' for testing. The 'as uninformed as the user myth'. If you want this type of testing - get some 'real' users to test your stuff.
Testers are still often seen as cheaper and less valuable than developers. The 'anybody can do blackbox testing myth'.
Development can defer proper testing to the test team. Two myths in one 'we don't need to train testers' and 'only the test team does testing' myths.

What you are looking at here is the difference between black box (no knowledge of the internals), white box (all knowledge) and grey box (some select knowledge).
The answer really depends on the purpose of the code. For integration heavy projects then where and how they communicate, even if it is entirely behind the scenes, allows testers to produce appropriate non-functional test cases.
These test cases are determining whether or not a component will gracefully handle the lack of availability of a dependency. It can also be used to identify performance related issues.
For example: As a tester if I know that the Web UI component defers a request to a orchestration service that does the real work then I can construct a scenario where the orchestration takes a long time (high load). If the user then performs another request (simulating user impatience) and the web service will receive a second request while the first is still going. If we continually repeat this the web service will eventually die from stress. Without knowing the underlying model it would not be easy to find the problem
In most cases for functionality testing then black box is preferred, as soon as you move towards non-functional or system integration then understanding the interactions can assist in ensuring appropriate test coverage.
Not all testers are skilled or comfortable working/understanding the component interactions or internals so it is on a per tester/per system basis on whether it is appropriate.
In almost all cases we start with black box and head towards white as the need sees.

A tester does not need to know internal details.
The application should be tested without any knowledge of interal structure, development problems, externals depenedncies.
If you encumber the tester with those additional info you push him into a certain testing scheme and the tester should never be pushed in a direction he should just test from a non coder view.

There are multiple testing methodologies that require code reviewing, and also those that don't.
The advantages to white-box testing (i.e. reading the code) is that you can tailor your testing to only test areas that you know (from reading the code) will fail.
Disadvantages include time wasted from actual testing to understand the code.
Black-box testing (i.e. not reading the code) can be just as good (or better?) at finding bugs than white-box.
Normally both types of testing can happen on one project, developers white-box unit testing, and testers black-box integration testing.

I prefer Black Box testing for final test regimes
In an ideal world...
Testers should know nothing about the internals of the code
They should know everything the customer will - i.e. have the documents/help required to use the system/application.(this definetly includes the API description/documents if it's some sort of code deliverable)
If the testers can't manage to find the defects with these limitations, you haven't documented your API/application enough.

If they are dedicated testers (Only thing they do) then I think they should know as little about the code as possible that they are attempting to test.
Too often they try to determine why its failing, that is the responsibility of the developer not the tester.
That said I think developers make great testers, because we tend to know the edge cases for certain types of functionality.

Here's an example of a bug which you can't find if you don't know the code internals, because you simply can't test all inputs:
long long int increment(long long int l) {
if (l == 475636294934LL) return 3;
return l + 1;
}
However, in this case it would be found if the tester had 100% code coverage as a target, and looked at only enough of the internals to write tests to achieve that.
Here's an example of a bug which you quite likely won't find if you do know the code internals, because false confidence is contagious. In particular, it is usually not possible for the author of the code to write a test which catches this bug:
int MyConnect(socket *sock) {
/* socket must have been bound already, but that's OK */
return RealConnect(sock);
}
If the documentation of MyConnect fails to mention that the socket must be bound, then something unexpected will happen some day (someone will call it unbound, and presumably the socket implementation will select an arbitrary local address). But a tester who can see the code often doesn't have the mindset of "testing" the documentation. Unless they're really on form, they won't notice that there's an assumption in the code not mentioned in the docs, and will just accept the assumption. In contrast, a tester writing from the docs could easily spot the bug, because they'll think "what possible states can a socket be in? I'll do a test for each". Since no constraints are mentioned, there's no reason they won't try the case that fails.
Answer: do both. One way to do this is to write a test suite before you see/write the code, and then add more tests to cover any special cases you introduce in your implementation. This applies whether or not the tester is the same person as the programmer, although obviously if the programmer writes the second kind of test, then only one person in the organisation has to understand the code. It's arguable whether it's a good long-term strategy to have code only one person has ever understood, but it's widespread, because it certainly saves time getting something out the door.
[Edit: I decline to say how these bugs came about. Maybe the programmer of the first one was clinically insane, and for the second one there are some restrictions on the port used, in order to workaround some weird network setup known to occur, and the socket is supposed to have been created via some de-weirdifying API whose existence is mentioned in the general sockets docs, but they neglect to require its use. Clearly in both these cases the programmer has been very careless. But that doesn't affect the point: the examples don't need to be realistic, since if you don't catch bugs that only a very careless programmer would make, then you won't catch all the actual bugs in your code unless you never have a bad day, make a crazy typo, etc.]

I guess it depends how good of testing you want. If you just want to sanity check the common scenarios, then by all means, just give the testers / pizza-eaters the application and tell them to go crazy.
However, if you'd like to have a chance at finding edge cases, performance or load issues, or a whole lot of other issues that hide in the depths of your code, you'd probably be better off hiring testers who know how and when to use white box techniques.
Your call.

IMHO, I think the industry view of testers is completely wrong.
Think about it ... you have two plumbers, one is extremely experienced, knows all the rules, the building codes, and can quickly look at something and know if the work is done right or not. The other plumber is good, and get the job done reliably.
Which one would you want to do the final inspection to make sure you don't come home to a flooded house? In fact, in what other industry do they allow someone who knows hardly anything about the system they are inspecting to actually do the inspection?
I have seen the bar for QA go up over the years, and that makes me happy. In time, QA may become something that devs aspire to be.
In short, not only should they be familiar with the code being tested, but they should have an understanding that rivals the architects of the product, as well as be able to effectively interface with the product owner(s) / customers to ensure that what is being created is actually what they want. But now I am going into a whole seperate conversation ...
Will it happen? Probably sooner than you think. I have been able to reduce the number of people needed to do QA, increase the overall effectiveness of the team, and increase the quality of the product simply by hiring very skilled people with dev / architect backgrounds with a strong aptitude for QA. I have lower operating costs, and since the software going out is higher quality, I end up with lower support costs. FWIW ... I have found that while I can backfill the QA guys effectively into a dev role when needed, the opposite is almost always not true.

If there is time, a tester should definitely go through a developers code. This way, you can improve your tests to get better coverage.
So, maybe if you write your black box tests looking at the spec and think you have the time to execute all of those and will still be left with time, going through code cannot be a bad idea.
Basically it all depends on how much time you have.. Another thing you can do to improve coverage is look at the developers design documents. Those should give you a good idea of what the code is going to look like...
Testers have the advantage of being familiar with both the dev code and the test code!

I would say they don't need to know the internal code details at all. However they do need to know the required functionality and system rules in full detail - like an analyst. Otherwise they won't test all the functionality, or won't realise when the system misbehaves.

For user acceptance testing the tester does not need to know the internal code details of the app. They only need to know the expected functionality, the business rules. When a bug is reported
Whoever is fixing the bug should know the inter-dependencies between various features.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas