What is the difference between equivalence class testing and input domain partitioning?

I'm learning software testing now and I'm wondering what the difference is between equivalence class testing and input domain partitioning; it seems like both of them are about partitioning the input domain.

Frankly speaking, during my career as a software testing engineer I haven't come across many mentions of input domain partitioning.
Nevertheless, the term exists, so let's take a look at whether there is a difference between equivalence class testing and input domain partitioning.
The equivalence class technique divides the possible test data for, say, an application module into partitions of equivalent data. They're "equivalent" because any member of a partition represents every other member of that partition, so in theory you need only one test using one of the partition's members for testing of that partition to be sufficient. Moreover, the partitions should not overlap.
Yes, I know, that's a little bit cumbersome, so let's look at an example: you have an input field on a web page which accepts all kinds of characters, but no more than 256 of them. That gives you the following equivalence partitions (simplified):
Char types:
only letters
only numbers
only special chars
mixed chars (letters + numbers + spec. chars)
Char quantity:
0
>0
<256
256
Each of those equivalence partitions has sub-partitions, e.g. "letters":
Uppercase letters
Lowercase letters
Mixed-case letters
That means that in order to sufficiently test the "letters" partition you have to design a test case which covers at least one of those sub-partitions. Let's say it will be "letters -> uppercase letters": "TEST INPUT STRING". Note that here we've also combined our test string with the "Char quantity > 0" equivalence partition.
So basically, by combining the sub-partitions of the "Char types" and "Char quantity" partitions, you'll be able to design a minimal test set for the input data of that field.
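To make that combination step a little more concrete, here is a minimal sketch (my own illustration, using the partition labels from this example; a real test design would drop or collapse contradictory combinations such as a char type paired with a quantity of 0):
// Equivalence partitions from the example above.
let charTypePartitions = ["only letters", "only numbers", "only special chars", "mixed chars"]
let charQuantityPartitions = ["0", ">0", "<256", "256"]

// One representative test case per combination of partitions;
// any member of a partition is taken to represent the whole partition.
struct PartitionTestCase {
    let charType: String
    let charQuantity: String
    let sampleInput: String   // e.g. "TEST INPUT STRING" for uppercase letters, quantity > 0
}

var testCases: [PartitionTestCase] = []
for type in charTypePartitions {
    for quantity in charQuantityPartitions {
        testCases.append(PartitionTestCase(charType: type,
                                           charQuantity: quantity,
                                           sampleInput: "TEST INPUT STRING"))
    }
}
print(testCases.count)   // 16 raw combinations in this simplified example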
On the other hand, the input domain for a program contains all the possible inputs to that program, which is roughly the same thing as the equivalence classes of possible inputs to the application module, taken together.
Sometimes those who speak about the input domain of a program also speak about regions, which are the same thing as the sub-partitions of equivalence partitions. Moreover, those input domain partitions (and accordingly the regions) must not overlap (just as they must not in equivalence partition testing).
With all that said, I would consider these two terms to describe the same thing in different words.

Related

Storing DNA in Swift

I'm going to write an application for dealing with raw DNA data samples, like the files you get from MyHeritage, Ancestry, FamilyTreeDNA, 23andMe etc. Each of these files is basically a CSV file with some quirks, and I asked about decoding them in another question I posted earlier.
Now for the next part. When I have parsed/decoded those files, I want to put the DNA data in a database, so that I can compare one person's DNA to that of another person. It's a lot of data, but not more than most computers can handle.
In memory, I can have the full DNA for both persons, compare them, and then create ArraySlices for the segments of DNA data that overlap, but ArraySlices aren't suitable for storage. An ArraySlice can't exist by itself in memory; it's just a reference into the full array, so if I were to flatten the ArraySlice I would still get the whole array, even the segments that don't match.
Each person shall have their full DNA in backing store, from which it can be read into memory, but how would you store the matching segments?
I'm thinking of something like:
// Bucket is the term FamilyTreeDNA uses for indicating whether the DNA match is on the maternal or paternal chromosome.
enum Bucket {
    case maternal
    case paternal
}
// The first 22 chromosome pairs are just numbered from 1 to 22, but the last two are X and Y, so I use a String for storing the chromosome "number".
struct SharedSegment {
    let onChromosome: String
    let bucket: Bucket
    let range: Range<UInt>
}
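For instance, a matching segment could then be represented like this (the chromosome and position values are made up for illustration):
let segment = SharedSegment(onChromosome: "7",
                            bucket: .maternal,
                            range: 15_000_000..<25_000_000)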
I don't care if it takes more disk space, but I want lightning fast comparisons of DNA, so that I can compare all the DNA for all the individuals in a database without it taking months to do so. I also need storage space for the full DNA in order to make the comparisons.
In the first stage I'm just building an app for storing the DNA kits I administer, but I already have plans for a service of the same type as Gedmatch and DNAPainter, if you have tried them. That means a service where people can upload their DNA to be compared to other people's DNA. Let's say a million people upload their DNA to this service, and each of them should have their DNA compared to the other 999,999 people. The number of comparisons will be huge, so my primary focus is on performance. Each file with raw DNA data will contain about 400-950 thousand lines of DNA data.
Each line will contain the chromosome number, the RSID, the position within the chromosome and a genotype. The latter is two letters: "AA", "AC", "CT" etc. There are four different letters: A, C, G and T. The reason there are two letters for each position is that you have chromosome pairs, where one chromosome is inherited from the father and one from the mother, and there is one letter from each of those two chromosomes. Of course I could store them as just a string of characters, but there are chances of errors, so I would like to represent them in code as
enum Aminoacid: UInt8 {
    case noCall = 0
    case A
    case C
    case G
    case T
}
When sequencing DNA there are sometimes problems, and the sequencing equipment can't determine which amino acid is present in a certain position. This is called a "no call", hence the case noCall in the enum. In the raw DNA file this is represented by a dash, so the results can say "-A", which means that one of the parents had an A in that position, and the other could not be determined.
Is there any possibility to squeeze them together into 4 bits (a nybble), so that I can store two of these letters per byte? It's even possible to squeeze them into 3 bits, but I can't get three letters into a byte anyway; it would be two letters at 3 bits each with two bits wasted in every byte, so I could just as well use 4 bits per amino acid. There are UInt64, UInt32, UInt16 and UInt8 in Swift, but no UInt4, which would be ideal for this case. I'm also thinking about whether to store the two letters from the maternal and paternal chromosomes together, or whether I should split them into separate arrays (one array for maternal DNA and one for paternal). There is a problem with that approach: it's impossible to tell whether the first letter on each row is maternal or paternal until you have the DNA from at least one of the parents to compare with. In the absence of their DNA, I would have to have a third array to store both letters in, until I can determine which one is maternal and which is paternal. I'm trying to come up with the most effective way of storing this, to make the comparisons super fast.
In one way I don't like using enums, and that is because I will have to convert them to their rawValue, so I can do something like
var genotype = Aminoacid.A.rawValue << 4 + Aminoacid.G.rawValue
As far as I can see that's the best way to squeeze two of these into one byte, since there's no UInt4.
I'm not so fond of having lots of .rawValue all over my code. I would like to have only Aminoacid.A << 4 + Aminoacid.G, but unfortunately I don't think this is possible. Maybe there is a better way to store these sequences of amino acids in the database, like enums with associated values or something. I don't know how efficient associated values will be when working with such large data sets.
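For what it's worth, here is a minimal sketch of one way to hide the .rawValue noise behind a small wrapper type (the PackedGenotype name and its API are my own invention for illustration, assuming Aminoacid has a UInt8 raw type as above):
// Packs two Aminoacid values into one byte:
// the first letter in the high nybble, the second in the low nybble.
struct PackedGenotype {
    let byte: UInt8

    init(_ first: Aminoacid, _ second: Aminoacid) {
        byte = first.rawValue << 4 | second.rawValue
    }

    // Safe to force-unwrap because byte is only ever built from valid raw values.
    var first: Aminoacid { Aminoacid(rawValue: byte >> 4)! }
    var second: Aminoacid { Aminoacid(rawValue: byte & 0x0F)! }
}

let pair = PackedGenotype(.A, .G)   // pair.byte == 0b0001_0011
An [UInt8] (or Data) of such packed pairs could then be written to the database as a blob, which keeps the storage at one byte per position.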
If there is anyone out there who wants to collaborate on this project: so far it is just a hobby project, but I have plans for making a business out of it eventually. This means I can't employ anyone to do this, but if you're working on similar projects then let me know. We can make better things together. Just be aware that I'm writing in Swift, and I'm going to deploy on macOS, but Swift is also available for other platforms, so coders for Linux and Windows are equally welcome to work on a joint project.
This became a little off-topic. My question was about storage of raw DNA and shared segments in a way that is optimal for fast search and comparison of huge amounts of DNA. I probably won't use CoreData for storage, since I would like to keep the option of porting to platforms other than Apple's. At the moment I'm using CoreData to experiment a little with storing DNA in different ways.

Non-cryptographic algorithms to protect the data

I was able to find a few, but I was wondering: are there more algorithms that are based on encoding/modifying the data instead of completely encrypting it? Examples that I found:
Steganography. The method is based on hiding a message within a message;
Tokenization. Data is mapped in the tokenization server to a random token that represents the real data outside of the server;
Data perturbation. As far as I know it works mostly with databases. It adds noise to the sensitive records yet still allows reading general and public fields, like the sum of the records on a specific day.
Are there any other methods like this?
If your purpose is to publish this data, there are other methods similar to data perturbation; collectively they are called data anonymization [source]:
Data masking—hiding data with altered values. You can create a mirror version of a database and apply modification techniques such as character shuffling, encryption, and word or character substitution. For example, you can replace a value character with a symbol such as “*” or “x”. Data masking makes reverse engineering or detection impossible.
Pseudonymization—a data management and de-identification method that replaces private identifiers with fake identifiers or pseudonyms, for example replacing the identifier “John Smith” with “Mark Spencer”. Pseudonymization preserves statistical accuracy and data integrity, allowing the modified data to be used for training, development, testing, and analytics while protecting data privacy.
Generalization—deliberately removes some of the data to make it less identifiable. Data can be modified into a set of ranges or a broad area with appropriate boundaries. You can remove the house number in an address, but make sure you don’t remove the road name. The purpose is to eliminate some of the identifiers while retaining a measure of data accuracy.
Data swapping—also known as shuffling and permutation, a technique used to rearrange the dataset attribute values so they don’t correspond with the original records. Swapping attributes (columns) that contain identifiers values such as date of birth, for example, may have more impact on anonymization than membership type values.
Data perturbation—modifies the original dataset slightly by applying techniques that round numbers and add random noise. The range of values needs to be in proportion to the perturbation. A small base may lead to weak anonymization while a large base can reduce the utility of the dataset. For example, you can use a base of 5 for rounding values like age or house number because it’s proportional to the original value. You can multiply a house number by 15 and the value may retain its credence. However, using higher bases like 15 can make the age values seem fake.
Synthetic data—algorithmically manufactured information that has no connection to real events. Synthetic data is used to create artificial datasets instead of altering the original dataset or using it as is and risking privacy and security. The process involves creating statistical models based on patterns found in the original dataset. You can use standard deviations, medians, linear regression or other statistical techniques to generate the synthetic data.
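To make the data perturbation item above a little more concrete, here is a minimal sketch (my own illustration, not from the quoted source) that rounds a numeric field to a base and adds bounded random noise:
// Perturb a numeric value by rounding it to a base (e.g. 5 for ages)
// and adding uniform random noise within +/- noiseRange.
func perturb(_ value: Double, base: Double, noiseRange: Double) -> Double {
    let rounded = (value / base).rounded() * base
    let noise = Double.random(in: -noiseRange...noiseRange)
    return rounded + noise
}

// Example: ages rounded to a base of 5 with a little noise added.
let ages = [23.0, 37.0, 41.0, 58.0]
let perturbedAges = ages.map { perturb($0, base: 5, noiseRange: 2) }
print(perturbedAges)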
Is this what you are looking for?
EDIT: added link to the source and quotation.

Machine Learning text comparison model

I am creating a machine learning model that essentially returns the correctness of one text relative to another.
For example: “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is relative to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:
The emphasis on words occurring in many documents (say, 90% of your sentences/documents contain the conjunction word 'and') is much smaller, essentially giving more weight to the more document-specific phrasing (this is the IDF part).
Ordering in Term Frequency (TF) does not matter, as opposed to methods using sliding windows etc.
It is very lightweight when compared to representation oriented methods like the one mentioned above.
Big drawback: your data, depending on the size of the corpus, may have too many dimensions (as many dimensions as there are unique words); you could use stemming/lemmatization to mitigate this problem to some degree.
You may calculate the similarity between two TF-IDF vectors using cosine similarity, for example.
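As a rough illustration of TF-IDF plus cosine similarity (a minimal sketch of the standard formulas, not production code):
import Foundation

// Term frequency: how often each word occurs within one document.
func termFrequencies(_ document: [String]) -> [String: Double] {
    var counts: [String: Double] = [:]
    for word in document { counts[word, default: 0] += 1 }
    return counts.mapValues { $0 / Double(document.count) }
}

// Smoothed inverse document frequency: words that appear in many
// documents (like "a" and "the") get a low weight.
func inverseDocumentFrequencies(_ corpus: [[String]]) -> [String: Double] {
    var documentCounts: [String: Double] = [:]
    for document in corpus {
        for word in Set(document) { documentCounts[word, default: 0] += 1 }
    }
    return documentCounts.mapValues { log(Double(corpus.count) / $0) + 1 }
}

// Cosine similarity between two sparse TF-IDF vectors (word order is irrelevant).
func cosineSimilarity(_ a: [String: Double], _ b: [String: Double]) -> Double {
    let dot = a.reduce(0.0) { $0 + $1.value * (b[$1.key] ?? 0) }
    let normA = (a.values.reduce(0.0) { $0 + $1 * $1 }).squareRoot()
    let normB = (b.values.reduce(0.0) { $0 + $1 * $1 }).squareRoot()
    return dot / (normA * normB)
}

let sentenceA = ["the", "cat", "and", "a", "dog"]
let sentenceB = ["a", "dog", "and", "the", "cat"]
let idf = inverseDocumentFrequencies([sentenceA, sentenceB])

func tfidf(_ document: [String]) -> [String: Double] {
    termFrequencies(document).reduce(into: [String: Double]()) {
        $0[$1.key] = $1.value * (idf[$1.key] ?? 0)
    }
}

print(cosineSimilarity(tfidf(sentenceA), tfidf(sentenceB)))   // 1.0: same words, different order
For the hand-picked "significant" words, you could additionally multiply their TF-IDF weights by an extra factor before taking the cosine.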
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

Oracle String Conversion - Alpha String to Numeric Score, Fuzzy Match

I'm working with a lot of name data where the following events are happening:
In one stream the data is submitted as "Sung" and in the other stream as "Snug". My initial thought was to convert Sung and Snug so that each character equals a number; then the sums would be the same, so even if two characters are transposed, I'd be able to bucket these appropriately.
The other case is where it comes in as "Lillly" in one stream as opposed to "Lilly" in the other stream. I'd like to figure out how to fuzzy-match these so that I can identify them. I'm not sure if this is possible in Oracle.
I'm working with many millions of data points and trying to figure out how to write these classification buckets so that I can stop having so much noise in my primary task of finding which records are truly different people as opposed to clerical errors.
Any thoughts would be very appreciated.
A common measure for such distance is called the Levenshtein distance (Wikipedia here). This measures the "edit" distance between two strings -- the number of edit operations needed to convert one into the other.
That's the good news. More good news is that Oracle even has an implementation in the UTL_MATCH library.
The bad news is that it is really, really expensive on millions of data points. Unfortunately, I cannot help you there so much. One idea is to determine which names are "close enough" because they already share a certain minimum number of characters.
Another method is to convert the strings to what they sound like. That is called soundex. You may be able to use the two together -- assuming your names are predominantly English (the soundex algorithm was invented by the US Census Bureau, so it would work best on names in America).
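For reference, here is a minimal sketch of the Levenshtein distance itself (a plain dynamic-programming version, just to illustrate what an edit-distance function computes; for millions of rows you would rely on the database's built-in implementation rather than anything like this):
// Levenshtein distance: the number of single-character insertions,
// deletions and substitutions needed to turn one string into the other.
func levenshtein(_ lhs: String, _ rhs: String) -> Int {
    let a = Array(lhs), b = Array(rhs)
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var previous = Array(0...b.count)
    var current = [Int](repeating: 0, count: b.count + 1)
    for i in 1...a.count {
        current[0] = i
        for j in 1...b.count {
            let substitutionCost = a[i - 1] == b[j - 1] ? 0 : 1
            current[j] = min(previous[j] + 1,                     // deletion
                             current[j - 1] + 1,                  // insertion
                             previous[j - 1] + substitutionCost)  // substitution
        }
        previous = current
    }
    return previous[b.count]
}

print(levenshtein("Sung", "Snug"))      // 2: a transposition costs two single-character edits
print(levenshtein("Lillly", "Lilly"))   // 1: one extra "l"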

dimensional and unit analysis in SQL database

Problem:
A relational database (Postgres) storing timeseries data of various measurement values. Each measurement value can have a specific "measurement type" (e.g. temperature, dissolved oxygen, etc) and can have specific "measurement units" (e.g. Fahrenheit/Celsius/Kelvin, percent/milligrams per liter, etc).
Question:
Has anyone built a similar database such that dimensional integrity is conserved? Have any suggestions?
I'm considering building a measurement_type and a measurement_unit table; both of these would have two columns, ID and text. Then I would create foreign keys to these tables in the measured_value table. The text worries me somewhat because there's the possibility of duplicates that aren't literally identical (e.g. 'ug/l' vs 'µg/l' for micrograms per liter).
The purpose of this would be so that I can both convert and verify units on queries, or via programming externally. Ideally, I would have the ability later to include strict dimensional analysis (e.g. linking µg/l to the value 'M/V' (mass divided by volume)).
Is there a more elegant way to accomplish this?
I produced a database sub-schema for handling units an aeon ago (okay, I exaggerate slightly; it was about 20 years ago, though). Fortunately, it only had to deal with simple mass, length, time dimensions - not temperature, or electric current, or luminosity, etc. Rather less simple was the currency side of the game - there were a myriad different ways of converting between one currency and another depending on date, currency, and period over which conversion rate was valid. That was handled separately from the physical units.
Fundamentally, I created a table 'measures' with an 'id' column, a name for the unit, an abbreviation, and a set of dimension exponents - one each for mass, length, time. This gets populated with names such as 'volume' (length = 3, mass = 0, time = 0), 'density' (length = -3, mass = 1, time = 0) - and the like.
There was a second table of units, which identified a measure and then the actual units used by a particular measurement. For example, there were barrels, and cubic metres, and all sorts of other units of relevance.
There was a third table that defined conversion factors between specific units. This consisted of two units and the multiplicative conversion factor that converted unit 1 to unit 2. The biggest problem here was the dynamic range of the conversion factors. If the conversion from U1 to U2 is 1.234E+10, then the inverse is a rather small number (8.103727714749e-11).
The comment from S.Lott about temperatures is interesting - we didn't have to deal with those. A stored procedure would have addressed that - though integrating one stored procedure into the system might have been tricky.
The scheme I described allowed most conversions to be described once (including hypothetical units such as furlongs per fortnight, or less hypothetical but equally obscure ones - outside the USA - like acre-feet), and the conversions could be validated (for example, both units in the conversion factor table had to have the same measure). It could be extended to handle most of the other units - though the dimensionless units such as angles (or solid angles) present some interesting problems. There was supporting code that would handle arbitrary conversions - or generate an error when the conversion could not be supported. One reason for this system was that the various international affiliate companies would report their data in their locally convenient units, but the HQ system had to accept the original data and yet present the resulting aggregated data in units that suited the managers - where different managers each had their own idea (based on their national background and length of duty in the HQ) about the best units for their reports.
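A rough sketch of the core idea of that scheme in code (the type and property names here are my own illustration of the answer, not the original schema): a unit refers to a measure, and a conversion factor is only valid when both of its units share the same dimension exponents.
struct Measure: Equatable {
    let name: String
    let mass: Int     // dimension exponents
    let length: Int
    let time: Int
}

struct MeasurementUnit {
    let name: String
    let abbreviation: String
    let measure: Measure
}

struct ConversionFactor {
    let from: MeasurementUnit
    let to: MeasurementUnit
    let factor: Double

    // The validation described above: both units must have the same measure.
    var isValid: Bool { from.measure == to.measure }
}

let volume = Measure(name: "volume", mass: 0, length: 3, time: 0)
let barrel = MeasurementUnit(name: "barrel", abbreviation: "bbl", measure: volume)
let cubicMetre = MeasurementUnit(name: "cubic metre", abbreviation: "m3", measure: volume)
let barrelsToCubicMetres = ConversionFactor(from: barrel, to: cubicMetre, factor: 0.158987)
print(barrelsToCubicMetres.isValid)   // true: both units are volumes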
"Text worries me somewhat because there's the possibility for non-unique duplicates"
Right. So don't use text as a key. Use the ID as a key.
"Is there a more elegant way to accomplish this?"
Not really. It's hard. Temperature is its own problem because temperature is itself an average and doesn't sum like distance does; plus the F-to-C conversion is not a simple multiplication (as it is with most other unit conversions).
A note about conversions: a lot of units are linearly related, and can be converted using a formula like "y = A + Bx", where A and B are constants which could be stored in the database for each pair of units that you need to convert between. For example, for Celsius to Fahrenheit the constants are A = 32, B = 1.8.
However, there are also rare exceptions: converting between logarithmic and non-logarithmic units, for example, or converting between mass-per-volume and moles-per-volume (in which case you would need to know the molar mass of the compound being measured).
Of course, if you are sure that all the conversions required by the system are linear, then there's no need for over-engineering; just store the two constants. You can then extract standardized results from the database using straight SQL joins with calculated fields.
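A minimal sketch of that linear approach (illustrative names only): each pair of units stores the constants A and B, and converting is just y = A + Bx.
struct LinearConversion {
    let a: Double   // offset (A)
    let b: Double   // scale factor (B)

    func convert(_ x: Double) -> Double {
        a + b * x   // y = A + Bx
    }
}

// Celsius to Fahrenheit: A = 32, B = 1.8
let celsiusToFahrenheit = LinearConversion(a: 32, b: 1.8)
print(celsiusToFahrenheit.convert(100))   // 212.0

// Micrograms per litre to milligrams per litre: A = 0, B = 0.001
let ugPerLitreToMgPerLitre = LinearConversion(a: 0, b: 0.001)
print(ugPerLitreToMgPerLitre.convert(250))   // 0.25
In the database this would just be two numeric columns on the unit-pair (conversion) table, which is what makes the straight-SQL joins with calculated fields possible.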