Comparing Blob fields

My example is that I am using a fingerprint scanner and the fingerprint data is stored in a blob field. I want to make sure that the same fingerprint does not get inserted twice, so what's the best way to compare these fields?

This does not seem to be about Delphi or blob fields at all, since "the same fingerprint" will rarely (if ever) happen. Even the same person will produce slightly different images every time (s)he puts a finger on the scanner. Therefore the real problem is not checking for equality but checking for close matches, which is a nontrivial problem in and of itself. You should consult the specialized literature.
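If the goal is only to reject exact byte-for-byte duplicates (for example, the same enrollment record inserted twice), the usual shortcut is to store a hash of the blob next to it and compare hashes instead of the blobs themselves. A minimal sketch in Python with SQLite, using made-up table and column names; it deliberately does not attempt fuzzy fingerprint matching:

```python
import hashlib
import sqlite3

conn = sqlite3.connect('fingerprints.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS fingerprints (
        id     INTEGER PRIMARY KEY,
        fp_md5 TEXT UNIQUE,   -- hash of the blob, indexed for fast lookup
        fp     BLOB
    )
""")

def insert_if_new(blob: bytes) -> bool:
    """Insert the scan unless an identical blob is already stored."""
    digest = hashlib.md5(blob).hexdigest()
    exists = conn.execute(
        "SELECT 1 FROM fingerprints WHERE fp_md5 = ?", (digest,)
    ).fetchone()
    if exists:
        return False
    conn.execute("INSERT INTO fingerprints (fp_md5, fp) VALUES (?, ?)",
                 (digest, blob))
    conn.commit()
    return True
```

As noted above, two scans of the same finger will almost never be byte-identical, so this only guards against literal duplicate inserts; real duplicate detection needs a dedicated fingerprint-matching library.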

Related

Writing wav files of unknown length

The various headers of a WAV file contain file-length information. Consider the case where I generate a WAV file without knowing how long it is going to be, and possibly without the ability to alter the header after I have finished (e.g. when writing to a pipe). What should I write into these fields?
Either way, this isn't an ideal situation. But if there's absolutely no way to edit the file, I'd recommend writing 0xFFFFFFFF, that is, the maximum possible value that can be assigned to the Subchunk2Size field of a standard WAV header (albeit somewhat of a hack). Doing so will allow the whole file to be read/played by practically all players.
Some players rely solely on this field to calculate the audio's length (so they know when to loop, how far to allow seeking, etc.), so saying the file is longer than it actually is will "trick" the player into processing the entire file (although, depending on the player, an error may occur once it reaches the end of the audio).
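To make the trick concrete, here is a rough Python sketch that writes the canonical 44-byte PCM WAV header with 0xFFFFFFFF in both size fields before streaming samples to a pipe. The layout follows the commonly documented RIFF/WAVE header; the parameter defaults are assumptions:

```python
import struct

def write_wav_header(out, channels=1, sample_rate=44100, bits_per_sample=16):
    """Write a canonical 44-byte PCM WAV header with 0xFFFFFFFF in both
    size fields, for when the final length is unknown (e.g. a pipe)."""
    byte_rate   = sample_rate * channels * bits_per_sample // 8
    block_align = channels * bits_per_sample // 8

    out.write(b'RIFF')
    out.write(struct.pack('<I', 0xFFFFFFFF))   # ChunkSize: unknown, use the maximum
    out.write(b'WAVE')

    out.write(b'fmt ')
    out.write(struct.pack('<IHHIIHH',
                          16,                  # Subchunk1Size (16 for PCM)
                          1,                   # AudioFormat (1 = PCM)
                          channels,
                          sample_rate,
                          byte_rate,
                          block_align,
                          bits_per_sample))

    out.write(b'data')
    out.write(struct.pack('<I', 0xFFFFFFFF))   # Subchunk2Size: unknown, use the maximum

# Usage: write_wav_header(sys.stdout.buffer), then stream raw PCM frames after it.
```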

How should I (if I should at all) implement Generic DB Tables without falling into the Inner-platform effect?

I have a db model like this:
tb_Computer (N - N) tb_Computer_Peripheral (N - 1) tb_Peripheral
Each computer has N peripherals, but each peripheral is different in nature and will have different fields: a keyboard will have model, language, etc., while a network card has specifications about speed and such.
But I don't think it's viable to create as many tables as there are peripherals, because one day someone will come up with a very specific peripheral, and I don't want him to be unable to add it just because it is neither a keyboard nor a network card.
Is it bad practice to create a data field inside tb_Peripheral which contains JSON data about the specific peripheral?
I could even create a tb_PeripheralType with specific information about which data a specific type of peripheral has.
I read about this in many places and found everywhere that this is a bad practice, but I can't think of any other way to implement this the way I want, completely dynamic.
What is the best way to achieve what I want? Is the current model wrong? What would you do?
It's not a question of "good practices" or "bad practices". Making things completely dynamic has an upside and a downside. You have outlined the upside fairly well.
The downside of a completely dynamic design is that the process of turning the data into useful information is not nearly as routine as it is with a database that pins down the semantics of the data within the scope of the design.
Can you build a report and a report generating process that will adapt itself to the new structure of the data when you begin to add data about a new kind of peripheral? If you end up stuck with doing maintenance on the application when requirements change, what have you gained by making the database design completely dynamic?
PS: If the changes to the database design consist only of adding new tables, the "ripple effect" on your existing applications will be negligible.
I can think of four options.
The first is to create a table peripherals that would have all the information you could want about peripherals. This would have NULLs in the columns where the field is not appropriate to the type. When a new peripheral is added, you would have to add the descriptive columns.
The second is to create a separate table for each peripheral.
The third is to encode the information in something like JSON.
The fourth is to store the data as key-value pairs, so each peripheral would have many different rows.
There are also hybrids of these approaches. For instance, you could store common fields in a single table (as in (1)) and then have key-value pairs for other values.
The question is how this information is going to be used. I do most of my work directly in SQL, so the worst option for me is (3). I don't want to parse strange information formats to get something potentially useful to a SQL query.
Option (4) is the most flexible, but it also requires more work to get a complete picture of all the possible attributes.
If I were starting from scratch, and I had a pretty good idea of what fields I wanted, then I would start with (1), a single table for peripherals. If I had requirements where peripherals and attributes would be changing fairly regularly, then I would seriously consider (4). If the tables are only being used by applications, then I might consider (3), but I would probably reject it anyway.
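To make the trade-off in option (4) concrete, here is a small Python/SQLite sketch of the key-value layout; the table and column names are invented for illustration:

```python
import sqlite3

# Hypothetical schema for option (4): one row per peripheral, plus a
# key/value table holding its type-specific attributes.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE tb_Peripheral (
        id    INTEGER PRIMARY KEY,
        type  TEXT NOT NULL            -- 'keyboard', 'network_card', ...
    );
    CREATE TABLE tb_Peripheral_Attribute (
        peripheral_id INTEGER NOT NULL REFERENCES tb_Peripheral(id),
        name          TEXT NOT NULL,   -- e.g. 'model', 'language', 'speed'
        value         TEXT NOT NULL,
        PRIMARY KEY (peripheral_id, name)
    );
""")

conn.execute("INSERT INTO tb_Peripheral (id, type) VALUES (1, 'keyboard')")
conn.executemany(
    "INSERT INTO tb_Peripheral_Attribute VALUES (1, ?, ?)",
    [('model', 'K120'), ('language', 'en-US')],
)

# Getting a "complete picture" of one peripheral already needs a join (or pivot):
rows = conn.execute("""
    SELECT p.id, a.name, a.value
    FROM tb_Peripheral p
    JOIN tb_Peripheral_Attribute a ON a.peripheral_id = p.id
    WHERE p.type = 'keyboard'
""").fetchall()
print(rows)
```

Adding a new kind of peripheral needs no schema change, but even the simple question "show me everything about this keyboard" already requires a join or a pivot, which is the extra work mentioned above.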
There is only one question to answer when you do this sort of design; whether it's JSON, a serialised object, XML, or (heaven forbid) a CSV doesn't really matter.
Do you want to consume them outside of the API that knows the structure?
For instance, do you want to use SQL to get all peripherals of type keyboard with a number-of-keys property >= 102?
If you do, it gets messy, much messier than extra tables.
It's no different from, say, having a table of PDFs or Word docs and trying to find all the ones which have more than 10 pages.
It gets even funnier if you want to version the content as your application evolves.
Have a look at a NoSQL back end; it's designed for stuff like this, and a relational database is not.
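For comparison, the keyboard-with-at-least-102-keys query from above is straightforward in a document store, where nested properties are queryable directly. A sketch using MongoDB via pymongo, with made-up collection and field names, assuming a locally running server:

```python
from pymongo import MongoClient

client = MongoClient()              # assumes a local MongoDB instance
peripherals = client.inventory.peripherals

peripherals.insert_many([
    {"type": "keyboard",     "props": {"model": "K120", "keys": 104}},
    {"type": "keyboard",     "props": {"model": "Mini", "keys": 84}},
    {"type": "network_card", "props": {"speed_mbps": 1000}},
])

# All keyboards with a number-of-keys property >= 102:
for doc in peripherals.find({"type": "keyboard", "props.keys": {"$gte": 102}}):
    print(doc["props"]["model"])
```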

Writing a Kama Sutra cipher

I took up cryptography recently, and one of my tasks was to create a Kama Sutra cipher. Up to the point of generating the keys, I have no problems. However, due to the nature of the Kama Sutra cipher, I believe that the keys are not supposed to be hard-coded into the program, but rather generated for each plaintext it takes in.
What I understand is that the ciphertext's length should be the same as the length of the plaintext. The thing is, where do I place the key, so that as long as the ciphertext was generated by my program, the program would be able to decipher it even after being closed and reopened? Given that this is an algorithm, I am sure that I should not be looking at storing the key in a separate flat file or database.
There is not much information online regarding this cipher. What I saw are implementations that let you randomise a key set and generate a ciphertext based on that key set; when decrypting, you also need to provide the same key set. Is this the correct way to implement it?
For those who have knowledge about this, please guide me along.
If you want to be able to decrypt the cyphertext, then you need to be able to recover the key whenever you need it. For a classical cypher, this was usually done by using the same key for multiple messages; see the Caesar cypher for an example. Caesar used a constant key, a -3/+3 shift, while Augustus used a +1/-1 shift.
You may want to consult your instructor as to whether a fixed key or a varying key is required.
It will be simpler to develop a fixed-key version first and then add varying-key functionality on top; that way you can get the rest of the program working correctly.
You may also want to look at classical techniques for using a keyphrase to mix an alphabet.
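To make the fixed-key suggestion concrete: the Kama Sutra cipher's key is just a pairing of the letters of the alphabet, and encrypting and decrypting are the same substitution. A rough Python sketch; the seed-based key generation is an assumption for illustration, not part of the classical cipher:

```python
import random
import string

def make_key(seed=None):
    """Pair up the 26 letters; the pairing IS the key.
    A fixed seed reproduces the same (fixed) key on every run."""
    rng = random.Random(seed)
    letters = list(string.ascii_uppercase)
    rng.shuffle(letters)
    pairs = {}
    # Take the shuffled letters two at a time and map each to its partner.
    for a, b in zip(letters[0::2], letters[1::2]):
        pairs[a] = b
        pairs[b] = a
    return pairs

def transform(text, key):
    """Encryption and decryption are the same operation: substituting a
    letter's partner twice gives the original letter back."""
    return ''.join(key.get(c, c) for c in text.upper())

key = make_key(seed=42)          # fixed key: same pairing every run
cipher = transform('HELLO WORLD', key)
plain  = transform(cipher, key)  # round-trips back to 'HELLO WORLD'
print(cipher, plain)
```

With a fixed seed the same pairing is regenerated on every run, so nothing needs to be stored alongside the ciphertext. If you later switch to a per-message key, that key (or whatever determines it) has to travel with the message, exactly as with other classical cyphers.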

Isn't it difficult to recognize a successful decryption?

When I hear about methods for breaking encryption algorithms, I notice the focus is often on how to decrypt very rapidly and how to reduce the search space. However, I always wonder how you can recognize a successful decryption, and why this doesn't form a bottleneck. Or is it often assumed that an encrypted/decrypted pair is known?
From Cryptonomicon:
There is a compromise between the two extremes of, on the one hand, not knowing any of the plaintext at all, and, on the other, knowing all of it. In the Cryptonomicon that falls under the heading of cribs. A crib is an educated guess as to what words or phrases might be present in the message. For example if you were decrypting German messages from World War II, you might guess that the plaintext included the phrase "HEIL HITLER" or "SIEG HEIL." You might pick out a sequence of ten characters at random and say, "Let's assume that this represented HEIL HITLER. If that is the case, then what would it imply about the remainder of the message?"

...

Sitting down in his office with the fresh Arethusa intercepts, he went to work, using FUNERAL as a crib: if this group of seven letters decrypts to FUNERAL, then what does the rest of the message look like? Gibberish? Okay, how about this group of seven letters?
Generally, you have some idea of the format of the file you expect to result from the decryption, and most formats provide an easy way to identify them. For example, nearly all binary formats such as images, documents, zip files, etc., have easily identifiable headers, while text files will contain only ASCII, or only valid UTF-8 sequences.
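A quick sketch of that kind of plausibility check in Python; the magic-number table is a small illustrative sample, not an exhaustive list:

```python
# Does a candidate decryption look like a known binary format (magic bytes)
# or like valid text? A few common signatures, for illustration only.
MAGIC_NUMBERS = {
    b'\x89PNG\r\n\x1a\n': 'PNG image',
    b'%PDF-':             'PDF document',
    b'PK\x03\x04':        'ZIP archive',
    b'GIF89a':            'GIF image',
}

def looks_like_plaintext(candidate: bytes) -> bool:
    for magic in MAGIC_NUMBERS:
        if candidate.startswith(magic):
            return True
    try:
        candidate.decode('utf-8')   # valid UTF-8 (which includes plain ASCII)
        return True
    except UnicodeDecodeError:
        return False
```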
In asymmetric cryptography you usually have access to the public key. Therefore, any decryption of an encrypted ciphertext can be re-encrypted using the public key and compared to the original ciphertext, thus revealing whether the decryption was successful.
The same is true for symmetric encryption. If you think you have decrypted a cipher, you must also think that you have found the key. Therefore, you can use that key to encrypt your, presumably correct, decrypted text and see if the encrypted result is identical to the original ciphertext.
For symmetric encryption where the key length is shorter than the ciphertext length, you're guaranteed not to be able to produce every possible plaintext. You can probably guess what form your plaintext will take, to some degree -- you probably know whether it's an image, or XML, or if you don't even know that much then you can assume you'll be able to run file on it and not get 'data'. You have to hope that there are only a few keys which would give you even a vaguely sensible decryption and only one which matches the form you are looking for.
If you have a sample plaintext (or partial plaintext) then this gets a lot easier.

Functional testing of output files, when output is non-deterministic (or with low control)

A long time ago, I had to test a program generating a PostScript file image. One quick way to figure out whether the program was producing the correct, expected output was to take an md5 of the result and compare it against the md5 of a "known good" output I had verified beforehand.
Unfortunately, PostScript contains the current time within the file. This time is, of course, different depending on when the test runs, therefore changing the md5 of the result even if the expected output is obtained. As a fix, I just stripped off the date with sed.
This is a nice and simple scenario, but we are not always so lucky. For example, I am now programming a writer which creates a big fat RDF file containing a bunch of anonymous nodes and UUIDs. It is basically impossible to check the functionality of the whole program with a simple md5; the only way would be to read the file back with a reader and then validate the output through that reader. As you probably realize, this opens a new can of worms: first, you have to write a reader (which can be time-consuming); second, you are assuming the reader is functionally correct and at the same time in sync with the writer. If both the reader and the writer are in sync, but on incorrect assumptions, the reader will say "no problem", but the file format is actually wrong.
This is a general issue when you have to perform functional testing of a file format, and the file format is not completely reproducible through the input you provide. How do you deal with this case?
In the past I have used a third-party application to validate such output (preferably converting it into some other format which can be mechanically verified). The use of a third party ensures that my assumptions are at least shared by others, if not strictly correct. At the very least this approach can be used to verify syntax; semantic correctness will probably require the creation of a consumer for the test data, which will likely always be prone to the "incorrect assumptions" pitfall you mention.
Is the randomness always in the same places? I.e. is most of the file fixed but there are some parts that always change? If so, you might be able to take several outputs and use a programmatic diff to determine the nondeterministic parts. Once those are known, you could use the information to derive a mask and then do a comparison (md5 or just a straight compare). Think about pre-processing the file to remove (or overwrite with deterministic data) the parts that are non-deterministic.
If the whole file is non-deterministic then you'll have to come up with a different solution. I did testing of MPEG-2 decoders, which are non-deterministic. In that case we were able to compute a PSNR against a reference and fail if it fell below some threshold. That may or may not work depending on your data, but something similar might be possible.
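For the "mask the unstable parts, then hash" approach suggested above, here is a small Python sketch; the regular expressions for timestamps and UUIDs are assumptions about what actually varies in the output:

```python
import hashlib
import re

# Overwrite the parts known to change between runs (timestamps, UUIDs,
# anonymous node ids) with fixed placeholders before hashing.
NONDETERMINISTIC = [
    (re.compile(rb'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}'), b'<TIMESTAMP>'),
    (re.compile(rb'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-'
                rb'[0-9a-f]{4}-[0-9a-f]{12}'), b'<UUID>'),
]

def masked_md5(path):
    with open(path, 'rb') as f:
        data = f.read()
    for pattern, placeholder in NONDETERMINISTIC:
        data = pattern.sub(placeholder, data)
    return hashlib.md5(data).hexdigest()

# The test then reduces to comparing against a known-good masked digest:
# assert masked_md5('output.rdf') == masked_md5('expected.rdf')
```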