Functional testing of output files when the output is non-deterministic (or only partly under your control)

A long time ago, I had to test a program that generated a PostScript image file. One quick way to check whether the program was producing the correct, expected output was to take an md5 of the result and compare it against the md5 of a "known good" output I had verified beforehand.
Unfortunately, PostScript embeds the current time in the file. That time is, of course, different depending on when the test runs, so the md5 of the result changes even when the expected output is obtained. As a fix, I simply stripped out the date with sed.
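A minimal sketch of that normalize-then-hash idea in Python rather than sed; the %%CreationDate DSC comment is the usual place PostScript generators put the timestamp, and the file names are placeholders:

```python
import hashlib
import re

def normalized_md5(path):
    """MD5 of a PostScript file with the CreationDate comment stripped,
    so the hash reflects only the content we actually care about."""
    with open(path, "rb") as f:
        data = f.read()
    # Drop the DSC comment that embeds the current time, e.g.
    # "%%CreationDate: Tue Mar 12 10:14:03 2024"
    data = re.sub(rb"^%%CreationDate:.*$", b"", data, flags=re.MULTILINE)
    return hashlib.md5(data).hexdigest()

# Placeholder file names; compare the freshly generated output to a vetted one.
assert normalized_md5("output.ps") == normalized_md5("known_good.ps")
```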
This is a nice and simple scenario. We are not always so lucky. For example, I am now writing a program that produces a big fat RDF file containing a bunch of anonymous nodes and UUIDs. It is basically impossible to check the functionality of the whole program with a simple md5; the only way would be to read the file back with a reader and validate the output through that reader. As you probably realize, this opens a new can of worms: first, you have to write a reader (which can be time consuming); second, you are assuming the reader is functionally correct and at the same time in sync with the writer. If both the reader and the writer are in sync but built on incorrect assumptions, the reader will say "no problem" even though the file format is actually wrong.
This is a general issue when you have to perform functional testing of a file format and the output file is not completely reproducible from the input you provide. How do you deal with this case?

In the past I have used a third-party application to validate such output (preferably converting it into some other format which can be mechanically verified). Using a third party ensures that my assumptions are at least shared by others, if not strictly correct. At the very least this approach can be used to verify syntax. Semantic correctness will probably require writing a consumer for the test data, which will likely always be prone to the "incorrect assumptions" pitfall you mention.
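For the RDF case in the question, one such third-party tool is the rdflib library, which can do both the syntactic parse and a blank-node-insensitive comparison. A hedged sketch (file names are placeholders, and it only shows that two graphs are isomorphic, so it does not escape the shared-assumptions problem either):

```python
from rdflib import Graph
from rdflib.compare import isomorphic

def graphs_match(produced_path, expected_path):
    """Parse both RDF/XML files and compare them as graphs.

    isomorphic() canonicalizes blank nodes, so anonymous nodes don't break
    the comparison the way a byte-level md5 would. UUIDs that appear as
    literal values still differ between runs and would need masking."""
    produced = Graph().parse(produced_path, format="xml")
    expected = Graph().parse(expected_path, format="xml")
    return isomorphic(produced, expected)

print(graphs_match("output.rdf", "known_good.rdf"))
```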

Is the randomness always in the same places? That is, is most of the file fixed, with only certain parts that always change? If so, you might be able to take several outputs and use a programmatic diff to determine the non-deterministic parts. Once those are known, you could use that information to derive a mask and then do a comparison (md5 or just a straight compare). In other words, pre-process the file to remove (or overwrite with deterministic data) the parts that are non-deterministic, as in the sketch below.
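A sketch of the masking idea; the UUID and timestamp patterns and the file names below are placeholders, and the real patterns would come from diffing several outputs as described above:

```python
import hashlib
import re

# Placeholder patterns for the spans identified as non-deterministic.
MASKS = [
    re.compile(rb"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"),  # UUIDs
    re.compile(rb"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"),                           # timestamps
]

def masked_digest(path):
    """Overwrite known non-deterministic spans with a fixed token,
    then hash what is left."""
    with open(path, "rb") as f:
        data = f.read()
    for mask in MASKS:
        data = mask.sub(b"<masked>", data)
    return hashlib.md5(data).hexdigest()

assert masked_digest("output.rdf") == masked_digest("known_good.rdf")
```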
If the whole file is non-deterministic, you'll have to come up with a different solution. I did testing of MPEG-2 decoders, which are non-deterministic. In that case we were able to compute a PSNR against a reference and fail if it fell below some threshold. That may or may not work for your data, but something similar might be possible.
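A rough sketch of that threshold check, assuming the decoded output is available as raw bytes; the 40 dB cut-off and the file names are placeholders to be tuned from known-good runs:

```python
import math

def psnr(reference, decoded, max_value=255):
    """Peak signal-to-noise ratio between two equal-length byte sequences
    (e.g. raw decoded frames). Higher means more similar; identical
    inputs give infinity."""
    assert len(reference) == len(decoded)
    mse = sum((a - b) ** 2 for a, b in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)

# Fail the test when the decoded output drifts too far from the reference.
THRESHOLD_DB = 40.0  # placeholder threshold
assert psnr(open("ref.yuv", "rb").read(), open("out.yuv", "rb").read()) >= THRESHOLD_DB
```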

Related

Making a lex/yacc script multithreaded

I have lex/yacc code which captures some data after parsing a file. That file is in a specific format. Consider this format:
File format:
ABC
Something something
ABC
Something something
....
....
The lex/yacc code is sequential right now. Is it possible to make the code multithreaded for a single file by dividing it into chunks separated by ABC?
Where to start?
I shall be happy to share more details, if needed.
The problem is that you don't know where the delimiters are until you lexically analyse the file. Unless the delimiters are contextually unambiguous (which won't be the case if your syntax includes things like comments or quoted strings), you can't start the analysis at a random point in the file without knowing the context of that point. So in common cases, finding the delimiters requires a linear scan.
You could hand a chunk of the file to a separate chunk parser thread every time you identify the chunk's endpoints. That will require scanning the chunk twice (once for the initial delimiter scan and a second time for the precise parse), but it might still be a win if parsing is relatively slow. You could avoid the second scan by retaining the token stream in memory and passing the parser thread the list of tokens, but it may well turn out that the memory management and other bookkeeping for the token lists cost as much as, or more than, the lexical scan. That will depend on a lot of factors, so experimentation and benchmarking in your particular use case will be necessary.
If the delimiters are completely unambiguous and the chunks are relatively short, you could divide the file into pages of some convenient size and give each thread its own pages to analyse. When a thread is assigned a page, it will have to skip initial data until it finds the first delimiter; when it reaches the end of the page, it will have to continue scanning the rest of the current chunk into the next page. So there will be a bit of duplicate scanning, but if the chunks are short relative to the page size, it won't be too bad. Again, benchmarking your particular use case will be useful. And, again, you can only do this if you can unambiguously recognise a delimiter without knowing anything about the lexical context.
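As a sketch of the first approach above (one linear scan to find chunk boundaries, with chunks handed off to workers), here is the general shape in Python rather than lex/yacc C code; parse_chunk is a stand-in for whatever your generated parser actually does, a process pool stands in for worker threads, and the input file name is a placeholder:

```python
from concurrent.futures import ProcessPoolExecutor

DELIMITER = "ABC"

def parse_chunk(chunk_lines):
    """Stand-in for the real per-chunk parser: here it just counts lines."""
    return len(chunk_lines)

def split_chunks(path):
    """Single linear pass that finds chunk boundaries. This is the part
    that cannot be parallelised unless the delimiter is unambiguous."""
    chunk = []
    with open(path) as f:
        for line in f:
            if line.strip() == DELIMITER and chunk:
                yield chunk
                chunk = []
            chunk.append(line)
    if chunk:
        yield chunk

def parse_file(path):
    # Each chunk is parsed independently once its endpoints are known.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(parse_chunk, split_chunks(path)))

if __name__ == "__main__":
    print(parse_file("input.txt"))
```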

Comparing Blob fields

My example is that I am using a fingerprint scanner; the fingerprint data is stored in a blob field. I want to make sure that the same fingerprint does not get inserted twice, so what's the best way to compare these fields?
This does not seem to be about Delphi or blob fields at all, since "the same fingerprint" will rarely (if ever) occur. Even the same person will produce slightly different images every time they put a finger on the scanner. Therefore the real problem is not checking for equality but checking for close matches, which is a nontrivial problem in and of itself. You should consult the specialized literature.

How to create your own package for interacting with Word, PDF, etc.

I know that there are a lot of packages around which allow you to create or read e.g. PDF, Word and other files.
What I'm interested in (and never learned at the university) is how you create such a package? Are you always relying on source code being given by the original company (such as Adobe or Microsoft), or is there another clever way of working around it? Should I analyze the individual bytes I see in e.g. PDF files?
It varies.
Some companies provide an SDK ("Software Development Kit") for their own data format; others provide only a specification (e.g., Adobe for PDF, Microsoft for Word), and it's up to the software developer to write a correct implementation.
Since that can be a lot of work – the PDF specification, for example, runs to over 700 pages and doesn't go deep into practically required material such as LZW, JPEG/JPEG2000, color theory, and math transformations – and you need a huge set of data to test against, it's way easier to use the work that others have done on it.
If you are interested in writing a support library for a certain file format which
is not legally protected,
has no, or only sparse (official) documentation,
and is not already being deconstructed elsewhere,[a]
then yes: you need to
gather as many different files as possible;
from as many sources as possible;
(ideally, you should have at least one program that can both read and create the files)
inspect them on the byte level;
create a 'reader' which works on all of the test files;
if possible, interesting, and/or required, create a 'writer' that can create a new file in that format from scratch or can convert data in another format to this one.
There is 'cleverness' involved, mainly in #3, as you need to be very well versed in how data representation works in general. You should be able to tell code from data, string data from floating-point data, and UTF-8 encoded strings from MacRoman-encoded strings (and so on).
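For the byte-level inspection step, a tiny helper like the following is often enough to get started (nothing here is format-specific, and the sample file name is a placeholder):

```python
def hexdump(path, length=256):
    """Print offset, hex bytes, and printable ASCII side by side, which makes
    magic numbers, length fields, and embedded strings stand out when you
    compare several sample files."""
    with open(path, "rb") as f:
        data = f.read(length)
    for offset in range(0, len(data), 16):
        row = data[offset:offset + 16]
        hexes = " ".join(f"{b:02x}" for b in row)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        print(f"{offset:08x}  {hexes:<47}  {text}")

hexdump("sample.dat")  # placeholder file name
```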
I've done this a couple of times, primarily to inspect the data of various games, mainly because it's huge fun! (Fair warning: it can also be incredibly frustrating.) See Reverse Engineering's Reverse engineering file containing sprites for an example approach; notably, at the bottom of my answer in there I admit defeat and start using the phrases "possibly" and "may" and "probably", which is an indication I did not get any further on that.
[a] Not necessarily, of course. You can cooperate with others whose expertise lies elsewhere, or even do "grunt work" for existing projects – finding out and codifying fairly trivial subcases.
There are also advantages to working independently of existing projects. For example, with the experience of my own PDF reader (written from scratch), I was able to point out a bug in PDFBox.

Writing wav files of unknown length

The various headers of a WAV file contain file-length information. Consider the case where I generate a WAV file without knowing how long it is going to be, and possibly without the ability to alter the header after I have finished (e.g. when writing to a pipe): what should I write into these fields?
Either way, this isn't an ideal situation. But if there's absolutely no way to edit the file, I'd recommend writing 0xFFFFFFFF, that is, the maximum possible value that can be assigned to the Subchunk2Size field of a standard WAV header (albeit somewhat of a hack). Doing so will allow the whole file to be read/played by practically all players.
Some players rely solely on this field to calculate the audio's length (so they know when to loop, how far to allow seeking, etc.), so claiming the file is longer than it actually is will "trick" the player into processing the entire file (although, depending on the player, an error may occur once it reaches the end of the audio).
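A minimal sketch of that hack in Python, assuming 16-bit PCM; note that the RIFF ChunkSize field is given the same 0xFFFFFFFF treatment here, which is an assumption of the sketch rather than something stated above:

```python
import struct
import sys

def write_unbounded_wav_header(out, channels=2, sample_rate=44100, bits=16):
    """Write a canonical 44-byte PCM WAV header with the size fields set to
    0xFFFFFFFF, for streaming to a pipe when the final length is unknown
    and the header cannot be patched afterwards."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    out.write(b"RIFF")
    out.write(struct.pack("<I", 0xFFFFFFFF))        # ChunkSize: unknown
    out.write(b"WAVE")
    out.write(b"fmt ")
    out.write(struct.pack("<IHHIIHH", 16, 1, channels,
                          sample_rate, byte_rate, block_align, bits))
    out.write(b"data")
    out.write(struct.pack("<I", 0xFFFFFFFF))        # Subchunk2Size: unknown
    # ...PCM samples follow...

write_unbounded_wav_header(sys.stdout.buffer)
```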

VB.net quarantine techniques

I was thinking of an efficient way to add quarantining abilities to my antivirus application:
copy the file into a specified directory and change its extension to none (*.).
save the file's binary code in an XML database.
Which way is better?
However, I have no idea how I will recompile the binary code once the user wants to restore the file.
One way to do this is to encrypt the binary file with an encryption engine and move it into a quarantine folder. You could generate a random password, encrypt the file with that password, and store the password somewhere (the password could itself be encrypted with a master key). That is probably the easiest way of quarantining. To unquarantine, just do the reverse of the quarantining code: enumerate the quarantined files into a list, filter it, and when the user clicks an item and presses unquarantine, call the unquarantine function with the file path as the argument.
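A minimal sketch of that encrypt-and-move pattern; the original question is about VB.NET, so this Python version with the third-party cryptography package's Fernet recipe is only meant to illustrate the flow, and the quarantine folder name is a placeholder:

```python
import os
from pathlib import Path
from cryptography.fernet import Fernet  # third-party "cryptography" package

QUARANTINE_DIR = Path("quarantine")     # placeholder location

def quarantine(path, key):
    """Encrypt the suspect file into the quarantine folder and remove the
    original. In a real application the key would itself be protected by
    a master key."""
    QUARANTINE_DIR.mkdir(exist_ok=True)
    token = Fernet(key).encrypt(Path(path).read_bytes())
    (QUARANTINE_DIR / (Path(path).name + ".quarantined")).write_bytes(token)
    os.remove(path)

def unquarantine(quarantined_path, original_path, key):
    """The exact reverse: decrypt and restore the original bytes."""
    token = Path(quarantined_path).read_bytes()
    Path(original_path).write_bytes(Fernet(key).decrypt(token))
    os.remove(quarantined_path)

key = Fernet.generate_key()  # store this, e.g. encrypted with a master key
```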
If I had to do this (and again, I wouldn't want to be in this situation in the first place, per my comment), I would use an in-process database engine with native support for encryption and large binary data. I think SQL Server Compact and SQLite both fit the bill.
I would not use XML, because it's plain text and the binary data could easily be extracted, and I would not just change the extension, because the file could still easily be executed. Neither is much of a quarantine.
Note that the renaming option is probably the most "efficient" of what I've seen discussed so far, but when dealing with security software correctness should always be your first concern over efficiency. There are times when you can compromise correctness for performance (3D game rendering software does this all the time, to great effect), but security software is not in this category.
What you can do is optimize later. For example, anti-virus engines use heuristics (rules of thumb that only hold most of the time) to make their software faster; they do this in a way that favors false positives, which must then be checked more closely, rather than risking a missed threat. This only works because the code that checks each item more closely was written and battle-tested first.