O(c) Time complexity to read a few gigs text file - file-io

I've been tasked to solve a puzzle as follows:
Program an input parser (in C, Python, Java, or Go) that reads a file from standard input and:
- reads the data into a byte array (8-bit bytes);
- checks whether each line has a unique set of byte values, and if so keeps track of the line number;
- finally, prints out the line numbers of the lines that have a unique set of byte values.
- The program should run efficiently in time - it should not reach O(n^2) complexity or worse. Try to see if you can do it in O(n) time.
- The file should be read into a byte array (8-bit values) without exceeding memory.
I'm reading the sample file I'm using (a 50 MB file) line by line and storing each line in a byte array, then calling a method checkDuplicate(byte[] arr) and passing it the byte array. The method creates a HashSet, loops through the elements of the array, adds them to the HashSet, and returns the size of the HashSet. Since HashSets don't allow duplicates, back in main I check whether the returned size is equal to the array size to determine if the line is unique, and if so I save the line number.
private int checkDuplicate(byte[] arr) {
    // Collect the distinct (non-zero) byte values; a HashSet ignores duplicates.
    HashSet<Byte> byteSet = new HashSet<Byte>();
    for (byte e : arr) {
        if (e != 0) {
            byteSet.add(e);
        }
    }
    return byteSet.size();
}
Can O(c) or O(n) be achieved?
I'm getting O(n^2) so far and will handle memory exceptions later on when I reach O(n).
Also, would solving the problem in Python reduce the time/space complexity?


How can I reduce the size of this file to 83 bytes or less?

Task description:
Create the divisibility function, which expects an array of integers and its size as a parameter. The function returns how many numbers divisible by two are in the array!
We want to save on the available storage space, so our source code can be a maximum of 83 bytes!
My current code:
int oszthatosag(int a[], int s){int r=0,i;for (i = 0; i < s; ++i)if(a[i] % 2 == 0)++r;return r;}
has a size of 96 bytes. I deleted all the unnecessary whitespace and reduced the lengths of the variable names to a minimum, but it still doesn't seem to be enough.
You can remove additional spaces, use a branch-less strategy to remove the if, move the initialization/increment of i outside the loop, and decrease r starting from s so as to remove the ==0. The resulting code should not only be shorter, but also faster. This assumes s>=0 (otherwise r is smaller than expected).
Here is the final result:
int oszthatosag(int a[],int s){int r=s,i=0;while(i<s)r-=a[i++]&1;return r;}
As pointed out by #Brendan, here is an even shorter version (still assuming s>=0):
int oszthatosag(int a[],int s){int r=s;while(s)r-=a[--s]&1;return r;}
Note that in C89 the default return type is int, so you can omit it (it is required in C++ and in newer C standards). This is generally not a good idea (and causes compiler warnings), but it is shorter:
oszthatosag(int a[],int s){int r=s;while(s)r-=a[--s]&1;return r;}
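Here is a minimal test harness (an illustrative addition, not part of the original task) showing how the golfed function can be called; it repeats the "even shorter" definition with the explicit int return type so the snippet compiles cleanly on its own:

#include <stdio.h>

int oszthatosag(int a[],int s){int r=s;while(s)r-=a[--s]&1;return r;}

int main(void)
{
    int a[] = {1, 2, 3, 4, 10};           /* three numbers divisible by two */
    printf("%d\n", oszthatosag(a, 5));    /* prints 3 */
    return 0;
}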

How does libgcrypt increment the counter for CTR mode?

I have a file encrypted with AES-256 using libgcrypt's CTR mode implementation.
I want to be able to decrypt the file in parts (e.g. decrypting blocks 5-10 out of 20 blocks without decrypting the whole file).
I know that by using CTR mode, I should be able to do it. All I need is to know the correct counter.
The problem lies in the fact that all I have is the initial counter for block 0. If I want to decrypt block 5 for example, I need another counter, one that is achieved by doing some action on the initial counter for each block from 0 to 5.
I can't seem to find an API that libgcrypt exposes in order to calculate counter for later blocks given the initial counter.
How can I calculate the counter of later blocks (e.g. block #5) given the counter of block #0?
When in doubt, go to the source. Here's the code in gcrypt's generic CTR mode implementation (_gcry_cipher_ctr_encrypt() in cipher-ctr.c) that increments the counter:
for (i = blocksize; i > 0; i--)
{
    c->u_ctr.ctr[i-1]++;
    if (c->u_ctr.ctr[i-1] != 0)
        break;
}
There are other, more optimized implementations of counter incrementing found in other places in the libgcrypt source, e.g. in the various cipher-specific fast bulk CTR encryption implementations, but this generic one happens to be nice and readable. (Of course, all those alternative implementations need to produce the same sequence of counter values anyway, so that gcrypt stays compatible with itself.)
OK, so what does it actually do?
Well, looking at the context (or, more specifically, cipher-internal.h), it's clear that c->u_ctr.ctr is an array of blocksize unsigned bytes (where blocksize equals 16 bytes for AES). The code above increments its last byte by one, and checks if the result wrapped around to zero. If it didn't, it stops; if it did wrap, the code then moves to the second-to-last byte, increments it, checks to see if it wrapped, and keeps looping until it either finds a byte that doesn't wrap around when incremented, or it has incremented all of the blocksize bytes.
So, for example, if your original counter value was {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}, then after incrementing it would become {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1}. If incremented again, it would become {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2}, then {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3}, and so on up to {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,255}, after which the next counter value would be {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0} (and after that {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1}, {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2}, {0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,3}, etc.).
Of course, what this is really doing is just arithmetically incrementing a single (blocksize × 8)-bit integer, stored in memory in big-endian byte order.
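So, to decrypt starting at some later block #N, you can derive that block's counter yourself: treat the initial counter as a big-endian integer and add N to it. Below is a minimal C sketch of that derivation (ctr_for_block is my own illustrative helper, not a libgcrypt function); the resulting counter can then be installed on the cipher handle with gcry_cipher_setctr() before decrypting from block #N.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Illustrative helper (not part of the libgcrypt API): compute the CTR
 * counter for a given block index by adding the index to the initial
 * counter as one big-endian integer -- the same sequence the generic
 * increment loop above produces.  blocksize is 16 for AES. */
static void ctr_for_block(unsigned char *ctr, const unsigned char *ctr0,
                          size_t blocksize, uint64_t block_index)
{
    size_t i;
    memcpy(ctr, ctr0, blocksize);
    for (i = blocksize; i > 0 && block_index != 0; i--) {
        uint64_t sum = (uint64_t)ctr[i - 1] + (block_index & 0xff);
        ctr[i - 1] = (unsigned char)(sum & 0xff);
        block_index = (block_index >> 8) + (sum >> 8);  /* carry into the next byte */
    }
}

int main(void)
{
    unsigned char ctr0[16] = {0};   /* the stored initial counter for block 0 */
    unsigned char ctr5[16];
    size_t i;

    ctr_for_block(ctr5, ctr0, sizeof ctr5, 5);
    for (i = 0; i < sizeof ctr5; i++)
        printf("%02x", ctr5[i]);    /* prints 00000000000000000000000000000005 */
    printf("\n");
    return 0;
}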

Optimal Solution: Get a random sample of items from a data set

So I recently had this as an interview question and I was wondering what the optimal solution would be. The code is in Objective-C.
Say we have a very large data set, and we want to get a random sample
of items from it for testing a new tool. Rather than worry about the
specifics of accessing things, let's assume the system provides these
things:
// Return a random number from the set 0, 1, 2, ..., n-2, n-1.
int Rand(int n);
// Interface to implementations other people write.
@interface Dataset : NSObject
// YES when there is no more data.
- (BOOL)endOfData;
// Get the next element and move forward.
- (NSString*)getNext;
@end
// This function reads elements from |input| until the end, and
// returns an array of |k| randomly-selected elements.
- (NSArray*)getSamples:(unsigned)k from:(Dataset*)input
{
// Describe how this works.
}
Edit: So you are supposed to randomly select items from a given array. So if k = 5, then I would want to randomly select 5 elements from the dataset and return an array of those items. Each element in the dataset has to have an equal chance of getting selected.
This seems like a good time to use Reservoir Sampling. The following is an Objective-C adaptation for this use case:
NSMutableArray* result = [[NSMutableArray alloc] initWithCapacity:k];
int i, j;
for (i = 0; i < k; i++) {
    [result setObject:[input getNext] atIndexedSubscript:i];
}
for (i = k; ![input endOfData]; i++) {
    // The new element is the (i+1)-th seen, so pick j uniformly from 0..i.
    j = Rand(i + 1);
    NSString* next = [input getNext];
    if (j < k) {
        [result setObject:next atIndexedSubscript:j];
    }
}
return result;
The code above is not the most efficient reservoir sampling algorithm, because it generates a random number for every entry of the data set past the first k entries. Slightly more complex algorithms exist under the general category "reservoir sampling". This is an interesting read on an algorithm named "Algorithm Z". I would be curious if people find newer literature on reservoir sampling, too, because this article was published in 1985.
Interesting question, but as there is no count or similar method in Dataset and you are not allowed to iterate more than once, I can only come up with this solution to get good random samples (no handling of k > dataset size):
- (NSArray *)getSamples:(unsigned)k from:(Dataset*)input {
    NSMutableArray *source = [[NSMutableArray alloc] init];
    while (![input endOfData]) {
        [source addObject:[input getNext]];
    }
    NSMutableArray *ret = [[NSMutableArray alloc] initWithCapacity:k];
    int count = [source count];
    while ([ret count] < k) {
        int index = Rand(count);
        [ret addObject:[source objectAtIndex:index]];
        [source removeObjectAtIndex:index];
        count--;
    }
    return ret;
}
This is not the answer I gave in the interview, but here is what I wish I had done:
Store a pointer to the first element in the dataset
Loop over the dataset to get the count
Reset the dataset to point at the first element
Create an NSMutableDictionary for storing random indexes
Do a for loop from i=0 to i=k-1. Each iteration, generate a random value and check whether it exists in the dictionary. If it does, keep generating random values until you get a fresh one.
Loop over the dataset. If the current index is in the dictionary, add the element to the array of random subset values.
Return the array of random subsets.
There are multiple ways to do this. The first way:
1. Use the input parameter k to dynamically allocate an array of numbers:
unsigned *numsArray = (unsigned *)malloc(sizeof(unsigned) * k);
2. Run a loop that gets k random numbers and stores them in numsArray (being careful to check each new random number against the ones already stored, and drawing another one if it is a duplicate).
3. Sort numsArray.
4. Run a loop from the beginning of the dataset with your own incrementing counter dataCount and another counter numsCount, both starting at 0. Whenever dataCount equals numsArray[numsCount], grab the current data object, add it to your newly created random list, and increment numsCount.
5. The loop in step 4 can end either when numsCount reaches k or when dataCount reaches the end of the dataset.
6. The only other step that may need to be added is, before any of this, to walk the dataset once with getNext to count how large it is, so you can bound your random numbers and check that k is less than or equal to that count. (A sketch of this first method follows this list.)
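Here is a minimal C sketch of that first method, under stated assumptions: a plain in-memory array stands in for the Dataset, rand() % n stands in for Rand(n), and k is assumed to be no larger than the dataset size.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_unsigned(const void *a, const void *b)
{
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    /* Illustrative stand-ins: a plain array plays the role of the Dataset,
       and rand() % n plays the role of Rand(n). */
    const char *data[] = {"a", "b", "c", "d", "e", "f", "g", "h"};
    unsigned n = sizeof data / sizeof data[0];
    unsigned k = 3;                                 /* assumes k <= n */

    srand((unsigned)time(NULL));

    /* Steps 1-2: draw k distinct random indexes in [0, n). */
    unsigned *numsArray = malloc(sizeof(unsigned) * k);
    unsigned have = 0;
    while (have < k) {
        unsigned candidate = (unsigned)(rand() % n);
        unsigned dup = 0;
        for (unsigned i = 0; i < have; i++)
            if (numsArray[i] == candidate) { dup = 1; break; }
        if (!dup)
            numsArray[have++] = candidate;
    }

    /* Step 3: sort the indexes so the data can be walked once, in order. */
    qsort(numsArray, k, sizeof(unsigned), cmp_unsigned);

    /* Steps 4-5: one pass over the data, grabbing the pre-chosen indexes. */
    unsigned numsCount = 0;
    for (unsigned dataCount = 0; dataCount < n && numsCount < k; dataCount++) {
        if (dataCount == numsArray[numsCount]) {
            printf("picked index %u: %s\n", dataCount, data[dataCount]);
            numsCount++;
        }
    }

    free(numsArray);
    return 0;
}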
The 2nd way to do this would be to run through the actual list MULTIPLE times.
// one must assume that once we get to the end, we can start over within the set again
1. run a while loop that checks for endOfData
a. count up a count variable that is initialized to 0
2. run a loop from 0 through k-1
a. generate a random number that you constrain to the list size
b. run a loop that moves through the dataset until it hits the random element
c. compare that element with all other elements in your new list to make sure it isn't already in your new list
d. store the element into your new list
there may be ways to speed up the 2nd method by storing a current list location; that way, if you generate a random index that is past the current pointer, you don't have to move through the whole list again to get back to element 0 and then to the element you wish to retrieve.
A potential 3rd way to do this might be to:
1. run a loop from 0 through k-1
a. generate a random
b. use the generated random as a skip count, move skip count objects through the list
c. store the current item from the list into your new list
The problem with this 3rd method is that, without knowing how big the list is, you don't know how to constrain the random skip count. Further, even if you did, chances are that it wouldn't truly look like a randomly grabbed subset: it would become statistically unlikely that you would ever reach the end element (i.e. not every element is given an equal shot at being selected).
Arguably the FASTEST way to do this is method 1, where you generate the random numbers first, then traverse the list only once (well, actually twice: once to get the size of the dataset list and then again to grab the random elements).
We need a little probability theory. Like others, I will ignore the case n < k. The probability that the n'th item will be selected into the set of size k is just C(n-1, k-1) / C(n, k), where C is the binomial coefficient. A bit of math shows that this is just k/n. For the rest, note that the selection of the n'th item is independent of all other selections. In other words, "the past doesn't matter."
So an algorithm is:
S = set of up to k elements
n = 0
while not end of input
    v = next value
    n = n + 1
    if |S| < k, add v to S
    else if random(0,1) < k/n, replace a randomly chosen element of S with v
I will let the coders code this one! It's pretty trivial. All you need is an array of size k and one pass over the data.
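For the record, here is a minimal C sketch of that pseudocode, with whitespace-separated integers on standard input standing in for the data stream and rand() standing in for a proper random source (both are illustrative assumptions, not part of the original interface):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define K 5   /* sample size; stands in for the parameter k */

int main(void)
{
    int S[K];      /* the reservoir: up to K elements */
    int n = 0;     /* number of values seen so far */
    int v;

    srand((unsigned)time(NULL));
    while (scanf("%d", &v) == 1) {       /* "while not end of input" */
        n++;
        if (n <= K) {
            S[n - 1] = v;                /* |S| < k: just add v */
        } else if (rand() % n < K) {     /* keep v with probability K/n */
            S[rand() % K] = v;           /* replace a randomly chosen element */
        }
    }

    for (int i = 0; i < K && i < n; i++)
        printf("%d ", S[i]);
    printf("\n");
    return 0;
}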
If you care about efficiency (as your tags suggest) and the number of items in the population is known, don't use reservoir sampling. That would require you to loop through the entire data set and generate a random number for each item.
Instead, pick five random values in the range 0 to n-1. In the unlikely case that there is a duplicate among the five indexes, replace the duplicate with another random value. Then use the five indexes to do a random-access lookup of each selected element in the population.
This is simple. It uses a minimum number of calls to the random number generator. And it accesses memory only for the relevant selections.
If you don't know the number of data elements in advance, you can loop over the data once to get the population size and proceed as above.
If you aren't allowed to iterate over the data more than once, use a chunked form of reservoir sampling: 1) Choose the first five elements as the initial sample, each having a probability of 1/5th. 2) Read in a large chunk of data and choose five new samples from the new set (using only five calls to Rand). 3) Pairwise, decide whether to keep the new sample item or the old sample item (with odds proportional to the probabilities for each of the two sample groups). 4) Repeat until all the data has been read.
For example, assume there are 1000 data elements (but we don't know this in advance).
Choose the first five as the initial sample: current_sample = read(5); population=5.
Read a chunk of n datapoints (perhaps n=200 in this example):
subpop = read(200);
m = len(subpop);
new_sample = choose(5, subpop);
loop-over the two samples pairwise:
for (a, b) pairwise in (current_sample, new_sample): if random(0 to population + m) < population, then keep a; otherwise keep b
population += m
repeat

Does the "C" code algorithm in RFC1071 work well on big-endian machine?

As described in RFC 1071, an extra 0-byte should be appended to the last byte when calculating the checksum over an odd count of bytes:
But in the "C" code algorithm, only the last byte is added:
The above code does work on a little-endian machine, where [Z,0] equals Z, but I think there's a problem on a big-endian one, where [Z,0] equals Z*256.
So I wonder whether the example "C" code in RFC 1071 only works on little-endian machines?
Edit:
There's one more example of "breaking the sum into two groups" described in RFC1071:
We can just take the data here (addr[]={0x00, 0x01, 0xf2}) for example:
Here, "standard" represents the situation described in the formula [2], while "C-code" representing the C code algorithm situation.
As we can see, in "standard" situation, the final sum is f201 regardless of endian-difference since there's no endian-issue with the abstract form of [Z,0] after "Swap". But it matters in "C-code" situation because f2 is always the low-byte whether in big-endian or in little-endian.
Thus, the checksum is variable with the same data(addr&count) on different endian.
I think you're right. The code in the RFC adds the last byte in as low-order, regardless of whether it is on a little-endian or big-endian machine.
In these examples of code on the web we see they have taken special care with the last byte:
https://github.com/sjaeckel/wireshark/blob/master/epan/in_cksum.c
and in
http://www.opensource.apple.com/source/tcpdump/tcpdump-23/tcpdump/print-ip.c
it does this:
if (nleft == 1)
    sum += htons(*(u_char *)w<<8);
Which means that this text in the RFC is incorrect:
Therefore, the sum may be calculated in exactly the same way
regardless of the byte order ("big-endian" or "little-endian")
of the underlaying hardware. For example, assume a "little-
endian" machine summing data that is stored in memory in network
("big-endian") order. Fetching each 16-bit word will swap
bytes, resulting in the sum; however, storing the result
back into memory will swap the sum back into network byte order.
The following code in place of the original odd byte handling is portable (i.e. will work on both big- and little-endian machines), and doesn't depend on an external function:
if (count > 0)
{
    /* Read the final odd byte as the 16-bit word [Z,0] from memory,
       the same way the even bytes are read. */
    char buf2[2] = {*addr, 0};
    sum += *(unsigned short *)buf2;
}
(Assumes addr is char * or const char *).
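For context, here is a sketch of a complete RFC 1071-style checksum routine with that portable odd-byte handling in place. The function name and the surrounding loop are my own reconstruction (using memcpy to sidestep alignment and aliasing issues), not the RFC's verbatim listing:

#include <stdint.h>
#include <string.h>

/* Sketch of an RFC 1071-style Internet checksum with the portable odd-byte
 * handling shown above. */
uint16_t in_cksum_portable(const char *addr, size_t count)
{
    uint32_t sum = 0;

    /* Sum 16-bit words exactly as they sit in memory (native byte order). */
    while (count > 1) {
        uint16_t word;
        memcpy(&word, addr, 2);     /* memcpy avoids alignment/aliasing traps */
        sum += word;
        addr += 2;
        count -= 2;
    }

    /* Odd trailing byte Z: read it as the 16-bit word [Z,0] from memory,
     * the same way the even bytes were read. */
    if (count > 0) {
        char buf2[2] = { *addr, 0 };
        uint16_t word;
        memcpy(&word, buf2, 2);
        sum += word;
    }

    /* Fold the carries and take the one's complement. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}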

Fast memmove for x86 and +1 shift (for Move-to-front transform)

For a fast MTF ( http://en.wikipedia.org/wiki/Move-to-front_transform ) I need a faster version of moving a char from inside the array to the front of it:
char mtfSymbol[256], front;
char i;
for (;;) { // a very big loop
    ...
    i = get_i(); // i is in 0..256 but more likely to be smaller.
    front = mtfSymbol[i];
    memmove(mtfSymbol + 1, mtfSymbol, i);
    mtfSymbol[0] = front;
}
Cachegrind shows that for memmove there are a lot of branch mispredictions here.
For the other version of the code (not the memmove from the first example, but this one):
do
{
    mtfSymbol[i] = mtfSymbol[i-1];
} while (--i);
there are a lot of byte reads/writes, conditional branches, and branch mispredictions.
i is not very big, as the MTF is used on "good" input - a text file after a BWT (Burrows-Wheeler transform).
The compiler is GCC.
If you pre-allocate your buffer bigger than you're going to need it, and put your initial array somewhere in the middle (or at the end, if you're never going to have to extend it that way), then you can add items at the front (up to a limit) by changing the address of the start of the array rather than by moving all the elements.
You'll obviously need to keep track of how far back you've moved, so you can re-allocate if you do fall off the start of your existing allocation, but this should still be quicker than moving all of your array entries around.
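Here is a minimal, self-contained sketch of that layout; the buffer sizes, the push_front name, and the fallback policy are illustrative assumptions rather than part of the question's code, and for a strict MTF the moved symbol would still have to be removed from its old position:

#include <stdio.h>

#define HEADROOM 4096   /* spare slots reserved in front of the working array */
#define N 256           /* initial size of the working array */

/* The working array lives at the end of a larger buffer, so a new front
 * element can be added in O(1) by stepping the start pointer back,
 * instead of shifting every element with memmove. */
static char buffer[HEADROOM + N];
static char *start = buffer + HEADROOM;   /* current front of the array */
static size_t length = N;                 /* current logical length */

/* Add a value at the front by stepping the start pointer back.
 * Returns 0 on success, or -1 when the headroom is exhausted and the
 * caller needs to re-allocate (or recompact) as described above. */
static int push_front(char value)
{
    if (start == buffer)
        return -1;
    *--start = value;
    length++;
    return 0;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        start[i] = (char)i;                /* initial contents */

    if (push_front('x') == 0)              /* O(1); no per-element shifting */
        printf("front=%d next=%d length=%zu\n", start[0], start[1], length);
    return 0;
}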
You can also use a dedicated data structure rather than an array to speed up the forward transform.
A fast implementation can be built with a list of linked lists to avoid the array element moves altogether.
See http://code.google.com/p/kanzi/source/browse/java/src/kanzi/transform/MTFT.java
For inverse transform, it turns out that arrays are as fast as linked lists.