Non-repeating random numbers in Objective-C

I'm using
for (int i = 1; i < 100; i++) {
    int value = arc4random() % [array count];
}
but I'm getting repeats every time. How can I remove each chosen int value from the range, so that when the program loops I will not get any duplicates?

It sounds like you want shuffling of a set rather than "true" randomness. Simply create an array where all the positions match the numbers and initialize a counter:
num[ 0] = 0
num[ 1] = 1
: :
num[99] = 99
numNums = 100
Then, whenever you want a random number, use the following method:
idx = rnd (numNums); // return value 0 through numNums-1
val = num[idx]; // get the number at that position.
num[idx] = num[numNums-1]; // remove it from the pool by overwriting with the highest
numNums--; // and remove the highest position from the pool.
return val; // give it back to caller.
This will return a random value from an ever-decreasing pool, guaranteeing no repeats. You will have to beware of the pool running down to zero size of course, and intelligently re-initialize the pool.
This is a more deterministic solution than keeping a list of used numbers and continuing to loop until you find one not in that list. The performance of that sort of algorithm will degrade as the pool gets smaller.
A C function using static values, something like the one below, should do the trick. Call it with
int i = myRandom (200);
to set the pool up (with any number zero or greater specifying the size) or
int i = myRandom (-1);
to get the next number from the pool (any negative number will suffice). If the function can't allocate enough memory, it will return -2 (ERR_NO_MEM). If there are no numbers left in the pool, it will return -1 (ERR_NO_NUM), at which point you could re-initialize the pool if you wish. Here's the function with a unit-testing main for you to try out:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ERR_NO_NUM -1
#define ERR_NO_MEM -2
int myRandom (int size) {
int i, n;
static int numNums = 0;
static int *numArr = NULL;
// Initialize with a specific size.
if (size >= 0) {
if (numArr != NULL)
free (numArr);
if ((numArr = malloc (sizeof(int) * size)) == NULL)
return ERR_NO_MEM;
for (i = 0; i < size; i++)
numArr[i] = i;
numNums = size;
}
// Error if no numbers left in pool.
if (numNums == 0)
return ERR_NO_NUM;
// Get random number from pool and remove it (rand() % numNums
// gives a number between 0 and numNums-1 inclusive).
n = rand() % numNums;
i = numArr[n];
numArr[n] = numArr[numNums-1];
numNums--;
if (numNums == 0) {
free (numArr);
numArr = NULL;
}
return i;
}
int main (void) {
int i;
srand (time (NULL));
i = myRandom (20);
while (i >= 0) {
printf ("Number = %3d\n", i);
i = myRandom (-1);
}
printf ("Final = %3d\n", i);
return 0;
}
And here's the output from one run:
Number = 19
Number = 10
Number = 2
Number = 15
Number = 0
Number = 6
Number = 1
Number = 3
Number = 17
Number = 14
Number = 12
Number = 18
Number = 4
Number = 9
Number = 7
Number = 8
Number = 16
Number = 5
Number = 11
Number = 13
Final = -1
Keep in mind that, because it uses statics, it's not safe for calling from two different places if they want to maintain their own separate pools. If that were the case, the statics would be replaced with a buffer (holding count and pool) that would "belong" to the caller (a double-pointer could be passed in for this purpose).
And, if you're looking for the "multiple pool" version, I include it here for completeness.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ERR_NO_NUM -1
#define ERR_NO_MEM -2
int myRandom (int size, int *ppPool[]) {
int i, n;
// Initialize with a specific size.
if (size >= 0) {
if (*ppPool != NULL)
free (*ppPool);
if ((*ppPool = malloc (sizeof(int) * (size + 1))) == NULL)
return ERR_NO_MEM;
(*ppPool)[0] = size;
for (i = 0; i < size; i++) {
(*ppPool)[i+1] = i;
}
}
// Error if no numbers left in pool.
if (*ppPool == NULL || (*ppPool)[0] == 0)
return ERR_NO_NUM;
// Get random number from pool and remove it (the pool size is
// stored in the first element of the pool array).
n = rand() % (*ppPool)[0];
i = (*ppPool)[n+1];
(*ppPool)[n+1] = (*ppPool)[(*ppPool)[0]];
(*ppPool)[0]--;
if ((*ppPool)[0] == 0) {
free (*ppPool);
*ppPool = NULL;
}
return i;
}
int main (void) {
int i;
int *pPool;
srand (time (NULL));
pPool = NULL;
i = myRandom (20, &pPool);
while (i >= 0) {
printf ("Number = %3d\n", i);
i = myRandom (-1, &pPool);
}
printf ("Final = %3d\n", i);
return 0;
}
As you can see from the modified main(), you first initialise an int pointer to NULL and then pass its address to the myRandom() function. This allows each client (each location in the code) to have its own pool, which is automatically allocated and freed, although you could still share pools if you wish.

You could use Format-Preserving Encryption to encrypt a counter. Your counter just goes from 0 upwards, and the encryption uses a key of your choice to turn it into a seemingly random value of whatever radix and width you want.
Block ciphers normally have a fixed block size of e.g. 64 or 128 bits. But Format-Preserving Encryption allows you to take a standard cipher like AES and make a smaller-width cipher, of whatever radix and width you want (e.g. radix 2, width 16), with an algorithm which is still cryptographically robust.
It is guaranteed to never have collisions (because cryptographic algorithms create a 1:1 mapping). It is also reversible (a 2-way mapping), so you can take the resulting number and get back to the counter value you started with.
AES-FFX is one proposed standard method to achieve this. I've experimented with some basic Python code based on the AES-FFX idea, although it is not fully conformant (see the Python code here). It can, for example, encrypt a counter to a random-looking 7-digit decimal number, or a 16-bit number.
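To make the idea concrete, here is a minimal sketch in C, under stated assumptions: it uses a toy Feistel round function in place of AES (so it is NOT cryptographically strong, unlike real AES-FFX), and cycle-walking to confine the permutation to [0, n). All names and constants are illustrative.
#include <stdint.h>
#include <stdio.h>
/* Toy round function standing in for AES: NOT cryptographically strong. */
static uint32_t round_fn(uint32_t half, uint32_t key, int round) {
    uint32_t x = half ^ (key + 0x9E3779B9u * (uint32_t)(round + 1));
    x *= 0x85EBCA6Bu;
    x ^= x >> 13;
    return x;
}
/* Four-round Feistel network over a (2 * halfbits)-bit block: a permutation. */
static uint32_t feistel(uint32_t v, int halfbits, uint32_t key) {
    uint32_t mask = (1u << halfbits) - 1;
    uint32_t left = v >> halfbits, right = v & mask;
    for (int r = 0; r < 4; r++) {
        uint32_t tmp = left ^ (round_fn(right, key, r) & mask);
        left = right;
        right = tmp;
    }
    return (left << halfbits) | right;
}
/* Map a counter in [0, n) to a unique value in [0, n) by cycle-walking:
   re-encrypt until the result lands back inside the target range. */
uint32_t encrypt_counter(uint32_t counter, uint32_t n, uint32_t key) {
    int halfbits = 1;
    while ((1ull << (2 * halfbits)) < n) /* smallest block covering [0, n) */
        halfbits++;
    uint32_t v = counter;
    do {
        v = feistel(v, halfbits, key);
    } while (v >= n); /* expected to terminate within a few iterations */
    return v;
}
int main (void) {
    for (uint32_t i = 0; i < 10; i++) /* counters 0..9, pool of size 10 */
        printf ("%u -> %u\n", (unsigned)i, (unsigned)encrypt_counter(i, 10, 0xC0FFEEu));
    return 0;
}
Because the underlying Feistel block is a bijection, distinct counters always map to distinct outputs, which is exactly the no-collision property described above.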

You need to keep track of the numbers you have already used (for instance, in an array). Get a random number, and discard it if it has already been used.
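A minimal C sketch of that approach (arc4random_uniform is available on macOS and iOS; substitute rand() % range elsewhere; the function name and signature are illustrative):
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
/* Draw a not-yet-used value in [0, range).
   'used' must point to a zero-initialized array of 'range' flags;
   returns -1 once every value has been handed out. */
int nextUnused(bool used[], int range, int *remaining) {
    if (*remaining <= 0)
        return -1;
    int r;
    do {
        r = (int)arc4random_uniform((uint32_t)range); /* candidate value */
    } while (used[r]); /* discard and retry if already used */
    used[r] = true;
    (*remaining)--;
    return r;
}
As the accepted answer notes, the retries pile up as the pool empties, so this is best suited to small ranges.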

Without relying on external stochastic processes, like radioactive decay or user input, computers will always generate pseudorandom numbers: that is, numbers which have many of the statistical properties of random numbers, but which eventually repeat in sequences.
This explains the suggestions to randomise the computer's output by shuffling.
Discarding previously used numbers may lengthen the sequence artificially, but at a cost to the statistics which give the impression of randomness.

The best way to do this is to create an array for numbers already used. After a random number has been generated, add it to the array. Then, when you generate another random number, ensure that it is not in the array of used numbers.

In addition to using a secondary array to store already-generated random numbers, seeding the random number generator (e.g. with srand) once at the start of each run helps produce a different sequence of random numbers on every run.

Related

What's the term for saving values of calculations instead of recalculating multiple times?

When you have code like this (written in Java, but applicable to any similar language):
public static void main(String[] args) {
int total = 0;
for (int i = 0; i < 50; i++)
total += i * doStuff(i % 2); // multiplies i times doStuff(remainder of i / 2)
}
public static int doStuff(int i) {
// Lots of complicated calculations
}
You can see that there's room for improvement. doStuff(i % 2) only returns two different values - one for doStuff(0) on even numbers and one for doStuff(1) on odd numbers. Therefore you're wasting a lot of computation time/power on recalculating those values each time by saying doStuff(i % 2). You can improve like this:
public static void main(String[] args) {
int total = 0;
boolean[] alreadyCalculated = new boolean[2];
int[] results = new int[2];
for (int i = 0; i < 50; i++) {
if (!alreadyCalculated[i % 2]) {
results[i % 2] = doStuff(i % 2);
alreadyCalculated[i % 2] = true;
}
total += i * results[i % 2];
}
}
Now it accesses a stored value instead of recalculating each time. It might seem silly to keep arrays like that, but for cases like looping from i = 0 while i < 500 and checking i % 32 each time, an array is an elegant approach.
Is there a term for this kind of code optimization? I'd like to read up more on the different forms and the conventions of it but I'm lacking a concise description.
Is there a term for this kind of code optimization?
Yes, there is:
In computing, memoization is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again.
https://en.wikipedia.org/wiki/Memoization
Common-subexpression-elimination (CSE) is related to this. This case is a combination of that and hoisting a loop-invariant calculation out of a loop.
I'd agree with CBroe that you could call this specific form of caching memoization, especially the way you're implementing it with the clunky alreadyCalculated array. You can optimize that away, since you know which calls will produce new values and which will be repeats. Normally you'd implement memoization with a static array inside the called function, for the benefit of all callers. Ideally there's a sentinel value you can use to mark entries whose result hasn't been computed yet, instead of maintaining a separate array for that purpose. Or, for a sparse set of input values, just use a hash map (instead of, e.g., an array with 2^32 entries).
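As a minimal C sketch of that static-array-plus-sentinel style (the doStuff body is the toy one from the Java example below; INT_MIN is assumed never to be a legitimate result):
#include <limits.h>
/* Memoized inside the called function, for the benefit of all callers.
   INT_MIN is the sentinel meaning "not computed yet". */
int doStuff(int i) { /* i is 0 or 1, as in the question */
    static int cache[2] = { INT_MIN, INT_MIN };
    if (cache[i] == INT_MIN)
        cache[i] = (i + 5) << 1; /* stand-in for the expensive calculation */
    return cache[i];
}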
You can also avoid the if in the main loop.
public class Optim
{
public static int doStuff(int i) { return (i+5) << 1; }
public static void main(String[] args)
{
int total = 0;
int results[] = new int[2];
// more interesting if we pretend the loop count isn't known to be > 1, so avoiding calling doStuff(1) for n=1 is useful.
// otherwise you'd just do int[] results = { doStuff(0), doStuff(1) };
int n = 50;
for (int i = 0 ; i < Math.min(n, 2) ; i++) {
results[i] = doStuff(i);
total += i * results[i];
}
for (int i = 2; i < n; i++) { // runs zero times if n < 2
total += i * results[i % 2];
}
System.out.print(total);
}
}
Of course, in this case we can optimize a lot further. sum(0..n) = n * (n+1) / 2, so we can use that to get a closed-form (non-looping) solution in terms of doStuff(0) (sum of the even terms) and doStuff(1) (sum of the odd terms). So we only need the two doStuff() results once each, avoiding any need to memoize.
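For instance, a C sketch of that closed form (doStuff is again the toy stand-in; unlike the loop version, this always evaluates doStuff(1), even when n < 2):
int doStuff(int i) { return (i + 5) << 1; } /* toy stand-in */
/* total = sum over i in [0, n) of i * doStuff(i % 2).
   The even indices sum to evens*(evens-1); the odd indices sum to
   odds*odds; both follow from sum(0..m) = m*(m+1)/2. */
int closedFormTotal(int n) {
    int evens = (n + 1) / 2; /* count of even i in [0, n) */
    int odds = n / 2; /* count of odd i in [0, n) */
    int sumEvenIdx = evens * (evens - 1); /* 0 + 2 + ... + 2*(evens-1) */
    int sumOddIdx = odds * odds; /* 1 + 3 + ... + (2*odds-1) */
    return sumEvenIdx * doStuff(0) + sumOddIdx * doStuff(1);
}
For n = 50 this gives 600 * doStuff(0) + 625 * doStuff(1), matching the loop version.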

Using memcpy and malloc resulting in corrupted data stream

The code below attempts to save a data stream to a file using fwrite. The first example, using malloc, works, but with the second example the data stream is 70% corrupted. Can someone explain to me why the second example is corrupted and how I can remedy it?
short int fwBuffer[1000000];
// short int *fwBuffer[1000000];
unsigned long fwSize[1000000];
// Not Working *********
if (dataFlow) {
size = sizeof(short int)*length*inchannels;
short int tmpbuffer[length*inchannels];
int count = 0;
for (count = 0; count < length*inchannels; count++)
{
tmpbuffer[count] = (short int) (inbuffer[count]);
}
memcpy(&fwBuffer[saveBufferCount], tmpbuffer, sizeof(tmpbuffer));
fwSize[saveBufferCount] = size;
saveBufferCount++;
totalSize += size;
}
// Working ***********
if (dataFlow) {
size = sizeof(short int)*length*inchannels;
short int *tmpbuffer = (short int*)malloc(size);
int count = 0;
for (count = 0; count < length*inchannels; count++)
{
tmpbuffer[count] = (short int) (inbuffer[count]);
}
fwBuffer[saveBufferCount] = tmpbuffer;
fwSize[saveBufferCount] = size;
saveBufferCount++;
totalSize += size;
}
// Write to file ***********
for (int i = 0; i < saveBufferCount; i++) {
if (isRecording && outFile != NULL) {
// fwrite(fwBuffer[i], 1, fwSize[i],outFile);
fwrite(&fwBuffer[i], 1, fwSize[i],outFile);
if (fwBuffer[i] != NULL) {
// free(fwBuffer[i]);
}
fwBuffer[i] = NULL;
}
}
You initialize your size as
size = sizeof(short int) * length * inchannels;
then you declare an array of size
short int tmpbuffer[size];
This is already highly suspect. Why did you include sizeof(short int) into the size and then declare an array of short int elements with that size? The byte size of your array in this case is
sizeof(short int) * sizeof(short int) * length * inchannels
i.e. the sizeof(short int) is factored in twice.
Later you initialize only length * inchannels elements of the array, which is not the entire array, for the reasons described above. But the memcpy that follows still copies the entire array:
memcpy(&fwBuffer[saveBufferCount], &tmpbuffer, sizeof (tmpbuffer));
(The tail portion of the copied data is garbage.) I'd suspect that you are copying sizeof(short int) times more data than was intended. The recipient memory overflows and gets corrupted.
The version based on malloc does not suffer from this problem since malloc-ed memory size is specified in bytes, not in short int-s.
If you want to simulate the malloc behavior in the upper version of the code, you need to declare your tmpbuffer as an array of char elements, not of short int elements.
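A hypothetical fragment following that advice (the element type of inbuffer is an assumption here, and the per-element memcpy sidesteps alignment questions; this addresses only the element-size mismatch described above):
#include <string.h>
void buildChunk(const int *inbuffer, int length, int inchannels, char *dest)
{
    size_t size = sizeof(short int) * (size_t)length * (size_t)inchannels;
    char tmpbuffer[size]; /* C99 VLA: 'size' BYTES, matching the malloc version */
    for (int count = 0; count < length * inchannels; count++) {
        short int sample = (short int)inbuffer[count];
        memcpy(tmpbuffer + count * sizeof sample, &sample, sizeof sample);
    }
    memcpy(dest, tmpbuffer, sizeof tmpbuffer); /* sizeof tmpbuffer is now exactly 'size' */
}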
This has a very good chance of crashing:
short int tmpbuffer[(short int)(size)];
First, size could be too big; but truncating it and using whatever size results from that is probably not what you want either.
Edit: Try to write the whole code without a single cast. Only then does the compiler have a chance to tell you if there is something wrong.

Getting a value's most significant digit in Objective-C

I currently have code in Objective-C that can pull out an integer's most significant digit value. My only question is whether there is a better way to do it than what I have provided below. It gets the job done, but it feels like a cheap hack.
What the code does is take the number passed in and loop until that number has been divided down to a single leading digit. The reason I am doing this is for an educational app that splits a number up by its place values and shows the values added together to produce the final output (1234 = 1000 + 200 + 30 + 4).
int test = 1;
int result = 0;
int value = 0;
do {
value = input / test;
result = test;
test = [[NSString stringWithFormat:@"%d0", test] intValue];
} while (value >= 10);
Any advice is always greatly appreciated.
Will this do the trick?
int sigDigit(int input)
{
int digits = (int) log10(input);
return input / pow(10, digits);
}
Basically it does the following:
Finds the number of digits in input (log10(input)) and stores it in digits.
Divides input by 10^digits.
The return value is the most significant digit.
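Note that log10 and pow need #include <math.h>, and (int)log10(...) can come out one short at exact powers of ten if the library result rounds just below the true value. A purely integral sketch avoids that:
/* Pure-integer version, assuming input > 0: strip digits from the
   right until only the most significant digit remains. */
int sigDigit(int input)
{
    while (input >= 10)
        input /= 10;
    return input;
}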
EDIT: in case you need a function that gets the digit at a specific index, check this function out:
int digitAtIndex(int input, int index)
{
int trimmedLower = input / (pow(10, index)); // drop the 'index' lowest digits
int trimmedUpper = trimmedLower % 10; // keep only the lowest remaining digit
return trimmedUpper;
}
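For example, digitAtIndex(1234, 2) first truncates 1234 / 100 to 12, and 12 % 10 then yields 2, the hundreds digit (index 0 being the ones digit).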

Functions to compress and uncompress array of integers

I was recently asked to complete a task for a C++ role; however, as my application was not progressed any further, I thought I would post here for some feedback / advice / improvements / a reminder of concepts I've forgotten.
The task was:
The following data is a time series of integer values
int timeseries[32] = {67497, 67376, 67173, 67235, 67057, 67031, 66951,
66974, 67042, 67025, 66897, 67077, 67082, 67033, 67019, 67149, 67044,
67012, 67220, 67239, 66893, 66984, 66866, 66693, 66770, 66722, 66620,
66579, 66596, 66713, 66852, 66715};
The series might be, for example, the closing price of a stock each day
over a 32 day period.
As stored above, the data will occupy 32 x sizeof(int) bytes = 128 bytes
assuming 4 byte ints.
Using delta encoding , write a function to compress, and a function to
uncompress data like the above.
OK, so before this point I had never looked at compression, so my solution is far from perfect. I approached the problem by compressing the array of integers into an array of bytes. When representing an integer as bytes, I calculate its most significant byte (MSB) and keep everything up to that point, throwing the rest away. For negative values I increment the MSB count by 1 so that we can differentiate between positive and negative values when decoding, by keeping the leading 1-bits.
When decoding I parse this jagged byte array and simply reverse the actions performed when compressing. As mentioned, I had never looked at compression prior to this task, so I came up with my own method to compress the data. I had been looking at C++/CLI recently and, having not really used it before, decided to write it in that language for no particular reason. Below is the class, with a unit test at the very bottom. Any advice / improvements / enhancements will be much appreciated.
Thanks.
array<array<Byte>^>^ CDeltaEncoding::CompressArray(array<int>^ data)
{
int temp = 0;
int original;
int size = 0;
array<int>^ tempData = gcnew array<int>(data->Length);
data->CopyTo(tempData, 0);
array<array<Byte>^>^ byteArray = gcnew array<array<Byte>^>(tempData->Length);
for (int i = 0; i < tempData->Length; ++i)
{
original = tempData[i];
tempData[i] -= temp;
temp = original;
int msb = GetMostSignificantByte(tempData[i]);
byteArray[i] = gcnew array<Byte>(msb);
System::Buffer::BlockCopy(BitConverter::GetBytes(tempData[i]), 0, byteArray[i], 0, msb );
size += byteArray[i]->Length;
}
return byteArray;
}
array<int>^ CDeltaEncoding::DecompressArray(array<array<Byte>^>^ buffer)
{
System::Collections::Generic::List<int>^ decodedArray = gcnew System::Collections::Generic::List<int>();
int temp = 0;
for (int i = 0; i < buffer->Length; ++i)
{
int retrievedVal = GetValueAsInteger(buffer[i]);
decodedArray->Add(retrievedVal);
decodedArray[i] += temp;
temp = decodedArray[i];
}
return decodedArray->ToArray();
}
int CDeltaEncoding::GetMostSignificantByte(int value)
{
array<Byte>^ tempBuf = BitConverter::GetBytes(Math::Abs(value));
int msb = tempBuf->Length;
for (int i = tempBuf->Length -1; i >= 0; --i)
{
if (tempBuf[i] != 0)
{
msb = i + 1;
break;
}
}
if (!IsPositiveInteger(value))
{
//We need an extra byte to differentiate the negative integers
msb++;
}
return msb;
}
bool CDeltaEncoding::IsPositiveInteger(int value)
{
return value / Math::Abs(value) == 1;
}
int CDeltaEncoding::GetValueAsInteger(array<Byte>^ buffer)
{
array<Byte>^ tempBuf;
if(buffer->Length % 2 == 0)
{
//With even integers there is no need to allocate a new byte array
tempBuf = buffer;
}
else
{
tempBuf = gcnew array<Byte>(4);
System::Buffer::BlockCopy(buffer, 0, tempBuf, 0, buffer->Length );
unsigned int val = buffer[buffer->Length-1] &= 0xFF;
if ( val == 0xFF )
{
//We have negative integer compressed into 3 bytes
//Copy over the this last byte as well so we keep the negative pattern
System::Buffer::BlockCopy(buffer, buffer->Length-1, tempBuf, buffer->Length, 1 );
}
}
switch(tempBuf->Length)
{
case sizeof(short):
return BitConverter::ToInt16(tempBuf,0);
case sizeof(int):
default:
return BitConverter::ToInt32(tempBuf,0);
}
}
And then in a test class I had:
void CTestDeltaEncoding::TestCompression()
{
array<array<Byte>^>^ byteArray = CDeltaEncoding::CompressArray(m_testdata);
array<int>^ decompressedArray = CDeltaEncoding::DecompressArray(byteArray);
int totalBytes = 0;
for (int i = 0; i<byteArray->Length; i++)
{
totalBytes += byteArray[i]->Length;
}
Assert::IsTrue(m_testdata->Length * sizeof(m_testdata) > totalBytes, "Expected the total bytes to be less than the original array!!");
//Expected totalBytes = 53
}
This smells a lot like homework to me. The crucial phrase is: "Using delta encoding."
Delta encoding means you encode the delta (difference) between each number and the next:
67497, 67376, 67173, 67235, 67057, 67031, 66951, 66974, 67042, 67025, 66897, 67077, 67082, 67033, 67019, 67149, 67044, 67012, 67220, 67239, 66893, 66984, 66866, 66693, 66770, 66722, 66620, 66579, 66596, 66713, 66852, 66715
would turn into:
[Base: 67497]: -121, -203, +62
and so on. Assuming 8-bit bytes, the original numbers require 3 bytes apiece (and given the number of compilers with 3-byte integer types, you're normally going to end up with 4 bytes apiece). From the looks of things, the differences will fit quite easily in 2 bytes apiece, and if you can ignore one (or possibly two) of the least significant bits, you can fit them in one byte apiece.
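As a rough illustration of that lossless 2-bytes-per-delta layout (a C sketch, not the poster's C++/CLI scheme; it works here because every adjacent difference in the series fits comfortably in a signed 16-bit value):
/* Store the first value in full, then each difference as a 16-bit delta:
   4 + 31*2 = 66 bytes for the sample series instead of 128. */
void delta_compress(const int *in, int n, int *base, short *deltas) {
    *base = in[0];
    for (int i = 1; i < n; i++)
        deltas[i - 1] = (short)(in[i] - in[i - 1]);
}
void delta_uncompress(int base, const short *deltas, int n, int *out) {
    out[0] = base;
    for (int i = 1; i < n; i++)
        out[i] = out[i - 1] + deltas[i - 1];
}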
Delta encoding is most often used for things like sound encoding where you can "fudge" the accuracy at times without major problems. For example, if you have a change from one sample to the next that's larger than you've left space to encode, you can encode a maximum change in the current difference, and add the difference to the next delta (and if you don't mind some back-tracking, you can distribute some to the previous delta as well). This will act as a low-pass filter, limiting the gradient between samples.
For example, in the series you gave, a simple delta encoding requires ten bits to represent all the differences. By dropping the LSB, however, nearly all the samples (all but one, in fact) can be encoded in 8 bits. That one has a difference (right shifted one bit) of -173, so if we represent it as -128, we have 45 left. We can distribute that error evenly between the preceding and following sample. In that case, the output won't be an exact match for the input, but if we're talking about something like sound, the difference probably won't be particularly obvious.
I did mention that it was an exercise I had to complete, and the solution I submitted was deemed not good enough, so I wanted some constructive feedback, seeing as actual companies never tell you what you did wrong.
When the array is compressed I store the differences, not the original values (except for the first), as this was my understanding of delta encoding. If you look at my code, you'll see I provided a full solution; my question was how bad it was.

Generate combinations ordered by an attribute

I'm looking for a way to generate combinations of objects ordered by a single attribute. I don't think lexicographical order is what I'm looking for... I'll try to give an example. Let's say I have a list of objects A,B,C,D with the attribute values I want to order by being 3,3,2,1. This gives A3, B3, C2, D1 objects. Now I want to generate combinations of 2 objects, but they need to be ordered in a descending way:
A3 B3
A3 C2
B3 C2
A3 D1
B3 D1
C2 D1
Generating all combinations and sorting them is not acceptable because the real-world scenario involves large sets and millions of combinations (a set of 40, order of 8), and I need only the combinations above a certain threshold.
Actually, I need the count of combinations above a threshold, grouped by the sum of the given attribute, but I think that is far more difficult to do, so I'd settle for generating all combinations above a threshold and counting them, if that's possible at all.
EDIT - My original question wasn't very precise... I don't actually need these combinations ordered; I just thought it would help to isolate the combinations above a threshold. To be more precise: in the above example, given a threshold of 5, I'm looking for the information that the given set produces 1 combination with a sum of 6 (A3 B3) and 2 with a sum of 5 (A3 C2, B3 C2). I don't actually need the combinations themselves.
I was looking into the subset-sum problem, but if I understood the given dynamic-programming solution correctly, it only tells you whether a given sum exists, not the count of subsets achieving it.
Thanks
Actually, I think you do want lexicographic order, but descending rather than ascending. In addition:
It's not clear to me from your description that A, B, ... D play any role in your answer (except possibly as the container for the values).
I think your question example is simply "For each integer at least 5, up to the maximum possible total of two values, how many distinct pairs from the set {3, 3, 2, 1} have sums of that integer?"
The interesting part is the early bailout, once no possible solution can be reached (remaining achievable sums are too small).
I'll post sample code later.
Here's the sample code I promised, with a few remarks following:
public class Combos {
/* permanent state for instance */
private int values[];
private int length;
/* transient state during single "count" computation */
private int n;
private int limit;
private Tally<Integer> tally;
private int best[][]; // used for early-bail-out
private void initializeForCount(int n, int limit) {
this.n = n;
this.limit = limit;
best = new int[n+1][length+1];
for (int i = 1; i <= n; ++i) {
for (int j = 0; j <= length - i; ++j) {
best[i][j] = values[j] + best[i-1][j+1];
}
}
}
private void countAt(int left, int start, int sum) {
if (left == 0) {
tally.inc(sum);
} else {
for (
int i = start;
i <= length - left
&& limit <= sum + best[left][i]; // bail-out-check
++i
) {
countAt(left - 1, i + 1, sum + values[i]);
}
}
}
public Tally<Integer> count(int n, int limit) {
tally = new Tally<Integer>();
if (n <= length) {
initializeForCount(n, limit);
countAt(n, 0, 0);
}
return tally;
}
public Combos(int[] values) {
this.values = values;
this.length = values.length;
}
}
Preface remarks:
This uses a little helper class called Tally, that just isolates the tabulation (including initialization for never-before-seen keys). I'll put it at the end.
To keep this concise, I've taken some shortcuts that aren't good practice for "real" code:
This doesn't check for a null value array, etc.
I assume that the value array is already sorted into descending order, required for the early-bail-out technique. (Good production code would include the sorting.)
I put transient data into instance variables instead of passing them as arguments among the private methods that support count. That makes this class non-thread-safe.
Explanation:
An instance of Combos is created with the (descending ordered) array of integers to combine. The value array is set up once per instance, but multiple calls to count can be made with varying population sizes and limits.
The count method triggers a (mostly) standard recursive traversal of unique combinations of n integers from values. The limit argument gives the lower bound on sums of interest.
The countAt method examines combinations of integers from values. The left argument is how many integers remain to make up n integers in a sum, start is the position in values from which to search, and sum is the partial sum.
The early-bail-out mechanism is based on computing best, a two-dimensional array that specifies the "best" sum reachable from a given state. The value in best[n][p] is the largest sum of n values beginning in position p of the original values.
The recursion of countAt bottoms out when the correct population has been accumulated; this adds the current sum (of n values) to the tally. If countAt has not bottomed out, it sweeps the values from the start position onward to increase the current partial sum, as long as:
enough positions remain in values to achieve the specified population, and
the best (largest) subtotal remaining is big enough to make the limit.
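For the question's values {3, 3, 2, 1}: best[1] is just {3, 3, 2, 1} itself, while best[2][0] = 3 + 3 = 6, best[2][1] = 3 + 2 = 5, and best[2][2] = 2 + 1 = 3. So with limit 5, the top-level sweep tries positions 0 and 1 and bails out at position 2, where even the best remaining pair only sums to 3.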
A sample run with your question's data:
int[] values = {3, 3, 2, 1};
Combos mine = new Combos(values);
Tally<Integer> tally = mine.count(2, 5);
for (int i = 5; i < 9; ++i) {
int n = tally.get(i);
if (0 < n) {
System.out.println("found " + tally.get(i) + " sums of " + i);
}
}
produces the results you specified:
found 2 sums of 5
found 1 sums of 6
Here's the Tally code:
public static class Tally<T> {
private Map<T,Integer> tally = new HashMap<T,Integer>();
public Tally() {/* nothing */}
public void inc(T key) {
Integer value = tally.get(key);
if (value == null) {
value = Integer.valueOf(0);
}
tally.put(key, (value + 1));
}
public int get(T key) {
Integer result = tally.get(key);
return result == null ? 0 : result;
}
public Collection<T> keys() {
return tally.keySet();
}
}
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes.
Uses Mark Dominus method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing The Binomial Coefficient.
Check out this question in stackoverflow: Algorithm to return all combinations
I also just used the Java code below to generate all permutations, but it could easily be used to generate unique combinations given an index.
public static <E> E[] permutation(E[] s, int num) {//s is the input elements array and num is the number which represents the permutation
int factorial = 1;
for(int i = 2; i < s.length; i++)
factorial *= i;//calculates the factorial of (s.length - 1)
if (num/s.length >= factorial)// Optional. if the number is not in the range of [0, s.length! - 1]
return null;
for(int i = 0; i < s.length - 1; i++){//go over the array
int tempi = (num / factorial) % (s.length - i);//calculates the next cell from the cells left (the cells in the range [i, s.length - 1])
E temp = s[i + tempi];//Temporarily saves the value of the cell needed to add to the permutation this time
for(int j = i + tempi; j > i; j--)//shift all elements to "cover" the "missing" cell
s[j] = s[j-1];
s[i] = temp;//put the chosen cell in the correct spot
factorial /= (s.length - (i + 1));//updates the factorial
}
return s;
}
I am extremely sorry (after all those clarifications in the comments) to say that I could not find an efficient solution to this problem. I tried for the past hour with no results.
The reason (I think) is that this problem is very similar to problems like the travelling salesman problem. Unless you try all the combinations, there is no way to know which attributes will add up to the threshold.
There seems to be no clever trick that can solve this class of problems.
Still there are many optimizations that you can do to the actual code.
Try sorting the data according to the attributes. You may be able to avoid processing some values from the list when you find that a higher value cannot satisfy the threshold (so all lower values can be eliminated).
If you're using C# there is a fairly good generics library here. Note though that the generation of some permutations is not in lexicographic order
Here's a recursive approach to count the number of these subsets: We define a function count(minIndex,numElements,minSum) that returns the number of subsets of size numElements whose sum is at least minSum, containing elements with indices minIndex or greater.
As in the problem statement, we sort our elements in descending order, e.g. [3,3,2,1], and call the first index zero, and the total number of elements N. We assume all elements are nonnegative. To find all 2-subsets whose sum is at least 5, we call count(0,2,5).
Sample Code (Java):
int count(int minIndex, int numElements, int minSum)
{
int total = 0;
if (numElements == 1)
{
// just count number of elements >= minSum
for (int i = minIndex; i <= N-1; i++)
if (a[i] >= minSum) total++; else break;
}
else
{
if (minSum <= 0)
{
// any subset will do (n-choose-k of them)
if (numElements <= (N-minIndex))
total = nchoosek(N-minIndex, numElements);
}
else
{
// add element a[i] to the set, and then consider the count
// for all elements to its right
for (int i = minIndex; i <= (N-numElements); i++)
total += count(i+1, numElements-1, minSum-a[i]);
}
}
return total;
}
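The nchoosek helper used above isn't shown; here is a minimal C sketch of a standard overflow-resistant formulation (each intermediate result is itself a binomial coefficient, so every division is exact and intermediates stay as small as possible):
/* Overflow-resistant n-choose-k: multiply and divide in small steps. */
long long nchoosek(int n, int k) {
    if (k < 0 || k > n) return 0;
    if (k > n - k) k = n - k; /* use symmetry to shorten the loop */
    long long result = 1;
    for (int i = 1; i <= k; i++)
        result = result * (n - k + i) / i; /* divides evenly at each step */
    return result;
}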
Btw, I've run the above with an array of 40 elements and size-8 subsets, and consistently got results back in less than a second.