Array 1 | Array 2
=================
1 | 2
2 | 3
3 | 4
5 | 5
| 6
What is a good algorithm to 'sync' or combine Array 2 into Array 1? The following needs to happen:
Integers in Array 2 but not in Array 1 should be added to Array 1.
Integers in both Arrays can be left alone.
Integers in Array 1 but not in Array 2 should be removed from Array 1.
I'll eventually be coding this in Obj-C, but I'm really just looking for a pseudo-code representation of an efficient algorithm to solve this problem so feel free to suggest an answer in whatever form you'd like.
EDIT:
The end result I need is bit hard to explain without giving the backstory. I have a Cocoa application that has a Core Data entity whose data needs to be updated with data from a web service. I cannot simply overwrite the contents of Array 1 (the core data entity) with the content of Array 2 (the data parsed from the web into an array) because the Array 1 has relationships with other core data entities in my application. So basically it is important that integers contained in both Arrays are not overwritten in array one.
Array1 = Array2.Clone() or some equivalent might be the simplest solution, unless the order of elements is important.
I'm kind of guessing since your example leaves some things up in the air, but typically in situations like this I would use a set. Here's some example code in Obj-C.
NSMutableSet *set = [NSMutableSet set];
[set addObjectsFromArray:array1];
[set addObjectsFromArray:array2];
NSArray *finalArray = [[set allObjects] sortedArrayUsingSelector:#selector(compare:)];
(Assuming this is not a simple Array1 = Array2 question,) if the arrays are sorted, you're looking at a single O(n+m) pass over both arrays. Point to the beginning of both arrays, then advance the pointer containing the smaller element. Keep comparing the elements as you go and add/delete elements accordingly. The efficiency of this might be worth the cost of sorting the arrays, if they aren't already such.
In my approach, You will need Set data structure. I hope you can find some implementations in Obj-C.
Add all elements of Array1 to Set1
and do the same for Array2 to Set2.
Loop through elements of Array1.
Check if it is contained in Set2
(using provided method.) If it is
not, removed the element from Set1.
Loop through elements of Array2. If it
is not existed in Set1 yet, add it
to Set1.
All elements of Set1 is now your "synced" data.
The algorithm complexity of "contains","delete", and "add" operation of "Set" on some good implementation, such as HashSet, would give you the efficiency you want.
EDIT: Here is a simple implementation of Set assumed that the integer are in limited range of 0 - 100 with every elements initialized to 0, just to give more clear idea about Set.
You first need to define array bucket of size 101. And then for ..
contains(n) - check if bucket[n] is 1 or not.
add(n) - set bucket[n] to 1.
delete(n) - set bucket[n] to 0.
You say:
What is a good algorithm to 'sync' or combine Array 2 into Array 1? The following needs to happen:
Integers in Array 2 but not in Array 1 should be added to Array 1.
Integers in both Arrays can be left alone.
Integers in Array 1 but not in Array 2 should be removed from Array 1.
Here's some literal algorithmic to help you (python):
def sync(a, b):
# a is array 1
# b is array 2
c = a + b
for el in c:
if el in b and el not in a:
a.append(el) # add to array 1
elif el in a and el not in b:
a.remove(el) # remove from array 1
# el is in both arrays, skip
return a # done
Instead of "here's what needs to happen", try describing the requirements in terms of
"here's the required final condition". From that perspective it appears that the desired end-state is for array1 to contain exactly the same values as array2.
If that's the case, then why not the equivalent of this pseudocode (unless your environment has a clone or copy method)?
array1 = new int[array2.length]
for (int i in 0 .. array2.length - 1) {
array1[i] = array2[i]
}
If order, retention of duplicates, etc. are issues, then please update the question and we can try again.
Well, if the order doesn't matter, you already have your algorithm:
Integers in Array 2 but not in Array 1 should be added to Array 1.
Integers in both Arrays can be left alone.
Integers in Array 1 but not in Array 2 should be removed from Array 1.
If the order of the integers matter, what you want is a variation of the algorithms that determine the "difference" between two strings. The Levenshtein algorithm should suit your well.
However, I suspect you actually want the first part. In that case, what exactly is the question? How to find the integers in an array? Or ... what?
Your question says nothing about order, and if you examine your three requirements, you'll see that the postcondition says that Array2 is unchanged and Array1 now contains exactly the same set of integers that is in Array2. Unless there's some requirement on the order that you're not telling us about, you may as well just make a copy of Array2.
Related
Is there an equivalent of TTree::AddFriend() with uproot ?
I have 2 parallel trees in 2 different files which I'd need to read with uproot.iterate and using interpretations (setting the 'branches' option of uproot.iterate).
Maybe I can do that by manually obtaining several iterators from iterate() calls on the files, and then calling next() on each iterators... but maybe there's a simpler way akin to AddFriend ?
Thanks for any hint !
edit: I'm not sure I've been clear, so here's a bit more details. My question is not about usage of arrays, but about how to read them from different files. Here's a mockup of what I'm doing :
# I will fill this array and give it as input to my DNN
# it's very big so I will fill it in place
bigarray = ndarray( (2,numentries),...)
# get a handle on a tree, just to be able to build interpretations :
t0 = .. first tree in input_files
interpretations = dict(
a=t0['a'].interpretation.toarray(bigarray[0]),
b=t0['b'].interpretation.toarray(bigarray[1]),
)
# iterate with :
uproot.iterate( input_files, treename,
branches = interpretations )
So what if a and b belong to 2 trees in 2 different files ?
In array-based programming, friends are implicit: you can JOIN any two columns after the fact—you don't have to declare them as friends ahead of time.
In the simplest case, if your arrays a and b have the same length and the same order, you can just use them together, like a + b. It doesn't matter whether a and b came from the same file or not. Even if I've if these is jagged (like jets.phi) and the other is not (like met.phi), you're still fine because the non-jagged array will be broadcasted to match the jagged one.
Note that awkward.Table and awkward.JaggedArray.zip can combine arrays into a single Table or jagged Table for bookkeeping.
If the two arrays are not in the same order, possibly because each writer was individually parallelized, then you'll need some column to act as the key associating rows of one array with different rows of the other. This is a classic database-style JOIN and although Uproot and Awkward don't provide routines for it, Pandas does. (Look up "merging, joining, and concatenating" in the Pandas documenting—there's a lot!) You can maintain an array's jaggedness in Pandas by preparing the column with the awkward.topandas function.
The following issue talks about a lot of these things, though the users in the issue below had to join sets of files, rather than just a single tree. (In principle, a process would have to look ahead to all the files to see which contain which keys: a distributed database problem.) Even if that's not your case, you might find more hints there to see how to get started.
https://github.com/scikit-hep/uproot/issues/314
This is how I have "friended" (befriended?) two TTree's in different files with uproot/awkward.
import awkward
import uproot
iterate1 = uproot.iterate(["file_with_a.root"]) # has branch "a"
iterate2 = uproot.iterate(["file_with_b.root"]) # has branch "b"
for array1, array2 in zip(iterate1, iterate2):
# join arrays
for field in array2.fields:
array1 = awkward.with_field(array1, getattr(array2, field), where=field)
# array1 now has branch "a" and "b"
print(array1.a)
print(array1.b)
Alternatively, if it is acceptable to "name" the trees,
import awkward
import uproot
iterate1 = uproot.iterate(["file_with_a.root"]) # has branch "a"
iterate2 = uproot.iterate(["file_with_b.root"]) # has branch "b"
for array1, array2 in zip(iterate1, iterate2):
# join arrays
zippedArray = awkward.zip({"tree1": array1, "tree2": array2})
# zippedArray. now has branch "tree1.a" and "tree2.b"
print(zippedArray.tree1.a)
print(zippedArray.tree2.b)
Of course you can use array1 and array2 together without merging them like this. But if you have already written code that expects only 1 Array this can be useful.
The goal of this post is to find a more efficient way to create this method. Right now, as I start adding more and more values, I'm going to have a very messy and confusing app. Any help is appreciated!
I am making a workout app and assign an integer value to each workout. For example:
Where the number is exersiceInt:
01 is High Knees
02 is Jumping Jacks
03 is Jog in Place
etc.
I am making it so there is a feature to randomize the workout. To do this I am using this code:
-(IBAction) setWorkoutIntervals {
exerciseInt01 = 1 + (rand() %3);
exerciseInt02 = 1 + (rand() %3);
exerciseInt03 = 1 + (rand() %3);
}
So basically the workout intervals will first be a random workout (between high knees, jumping jacks, and jog in place). What I want to do is make a universal that defines the following so I don't have to continuously hard code everything.
Right now I have:
-(void) setLabelText {
if (exerciseInt01 == 1) {
exercise01Label.text = [NSString stringWithFormat:#"High Knees"];
}
if (exerciseInt01 == 2) {
exercise01Label.text = [NSString stringWithFormat:#"Jumping Jacks"];
}
if (exerciseInt01 == 3) {
exercise01Label.text = [NSString stringWithFormat:#"Jog in Place"];
}
}
I can already tell this about to get really messy once I start specifying images for each workout and start adding workouts. Additionally, my plan was to put the same code for exercise02Label, exercise03Label, etc. which would become extremely redundant and probably unnecessary.
What I'm thinking would be perfect if there would be someway to say
exercise01Label.text = exercise01Int; (I want to to say that the Label's text equals Jumping Jacks based on the current integer value)
How can I make it so I only have to state everything once and make the code less messy and less lengthy?
Three things for you to explore to make your code easier:
1. Count from zero
A number of things can be easier if you count from zero. A simple example is if your first exercise was numbered 0 then your random calculation would just be rand() % 3 (BTW look up uniform random number, there are much better ways to get a random number).
2. Learn about enumerations
An enumeration is a type with a set of named literal values. In (Objective-)C you can also think of them as just a collection of named integer values. For example you might declare:
typedef enum
{
HighKnees,
JumpingJacks,
JogInPlace,
ExerciseKindCount
} ExerciseCount;
Which declares ExerciseCount as a new type with 4 values. Each of these is equivalent to an integer, here HighKnees is equivalent to 0 and ExerciseKindCount to 3 - this should make you think of the first thing, count from zero...
3. Discover arrays
An array is an ordered collection of items where each item has an index - which is usually an integer or enumeration value. In (Objective-)C there are two basic kinds of arrays: C-style and object-style represented by NSArray and NSMutableArray. For example here is a simple C-style array:
NSString *gExerciseLabels[ExerciseKindCount] =
{ #"High Knees",
#"Jumping Jacks",
#"Jog in Place"
}
You've probably guessed by now, the first item of the above array has index 0, back to counting from zero...
Exploring these three things should quickly show you ways to simplify your code. Later you may wish to explore structures and objects.
HTH
A simple way to start is by putting the exercise names in an array. Then you can access the names by index. eg - exerciseNames[exerciseNumber]. You can also make the list of exercises in an array (of integers). So you would get; exerciseNames[exerciseTable[i]]; for example. Eventually you will want an object to define an exercise so that you can include images, videos, counts, durations etc.
So I recently had this as an interview question and I was wondering what the optimal solution would be. Code is in Objective-c.
Say we have a very large data set, and we want to get a random sample
of items from it for testing a new tool. Rather than worry about the
specifics of accessing things, let's assume the system provides these
things:
// Return a random number from the set 0, 1, 2, ..., n-2, n-1.
int Rand(int n);
// Interface to implementations other people write.
#interface Dataset : NSObject
// YES when there is no more data.
- (BOOL)endOfData;
// Get the next element and move forward.
- (NSString*)getNext;
#end
// This function reads elements from |input| until the end, and
// returns an array of |k| randomly-selected elements.
- (NSArray*)getSamples:(unsigned)k from:(Dataset*)input
{
// Describe how this works.
}
Edit: So you are supposed to randomly select items from a given array. So if k = 5, then I would want to randomly select 5 elements from the dataset and return an array of those items. Each element in the dataset has to have an equal chance of getting selected.
This seems like a good time to use Reservoir Sampling. The following is an Objective-C adaptation for this use case:
NSMutableArray* result = [[NSMutableArray alloc] initWithCapacity:k];
int i,j;
for (i = 0; i < k; i++) {
[result setObject:[input getNext] atIndexedSubscript:i];
}
for (i = k; ![input endOfData]; i++) {
j = Rand(i);
NSString* next = [input getNext];
if (j < k) {
[result setObject:next atIndexedSubscript:j];
}
}
return result;
The code above is not the most efficient reservoir sampling algorithm because it generates a random number for every entry of the reservoir past the entry at index k. Slightly more complex algorithms exist under the general category "reservoir sampling". This is an interesting read on an algorithm named "Algorithm Z". I would be curious if people find newer literature on reservoir sampling, too, because this article was published in 1985.
Interessting question, but as there is no count or similar method in DataSet and you are not allowed to iterate more than once, i can only come up with this solution to get good random samples (no k > Datasize handling):
- (NSArray *)getSamples:(unsigned)k from:(Dataset*)input {
NSMutableArray *source = [[NSMutableArray alloc] init];
while(![input endOfData]) {
[source addObject:[input getNext]];
}
NSMutableArray *ret = [[NSMutableArray alloc] initWithCapacity:k];
int count = [source count];
while ([ret count] < k) {
int index = Rand(count);
[ret addObject:[source objectAtIndex:index]];
[source removeObjectAtIndex:index];
count--;
}
return ret;
}
This is not the answer I did in the interview but here is what I wish I had done:
Store pointer to first element in dataset
Loop over dataset to get count
Reset dataset to point at first element
Create NSMutableDictionary for storing random indexes
Do for loop from i=0 to i=k. Each iteration, generate a random value, check if value exists in dictionary. If it does, keep generating a random value until you get a fresh value.
Loop over dataset. If the current index is within the dictionary, add it to a the array of random subset values.
Return array of random subsets.
There are multiple ways to do this, the first way:
1. use input parameter k to dynamically allocate an array of numbers
unsigned * numsArray = (unsigned *)malloc(sizeof(unsigned) * k);
2. run a loop that gets k random numbers and stores them into the numsArray (must be careful here to check each new random to see if we have gotten it before, and if we have, get another random, etc...)
3. sort numsArray
4. run a loop beginning at the beginning of DataSet with your own incrementing counter dataCount and another counter numsCount both beginning at 0. whenever dataCount is equal to numsArray[numsCount], grab the current data object and add it to your newly created random list then increment numsCount.
5. The loop in step 4 can end when either numsCount > k or when dataCount reaches the end of the dataset.
6. The only other step that may need to be added here is before any of this to use the next command of the object type to count how large the dataset is to be able to bound your random numbers and check to make sure k is less than or equal to that.
The 2nd way to do this would be to run through the actual list MULTIPLE times.
// one must assume that once we get to the end, we can start over within the set again
1. run a while loop that checks for endOfData
a. count up a count variable that is initialized to 0
2. run a loop from 0 through k-1
a. generate a random number that you constrain to the list size
b. run a loop that moves through the dataset until it hits the rand element
c. compare that element with all other elements in your new list to make sure it isnt already in your new list
d. store the element into your new list
there may be ways to speed up the 2nd method by storing a current list location, that way if you generate a random that is past the current pointer you dont have to move through the whole list again to get back to element 0, then to the element you wish to retreive.
A potential 3rd way to do this might be to:
1. run a loop from 0 through k-1
a. generate a random
b. use the generated random as a skip count, move skip count objects through the list
c. store the current item from the list into your new list
Problem with this 3rd method is without knowing how big the list is, you dont know how to constrain the random skip count. Further, even if you did, chances are that it wouldnt truly look like a randomly grabbed subset that could easily reach the last element in the list as it would become statistically unlikely that you would ever reach the end element (i.e. not every element is given an equal shot of being select.)
Arguably the FASTEST way to do this is method 1, where you generate the random numerics first, then traverse the list only once (yes its actually twice, once to get the size of the dataset list then again to grab the random elements)
We need a little probability theory. As others, I will ignore the case n < k. The probability that the n'th item will be selected into the set of size k is just C(n-1, k-1) / C(n, k) where C is the binomial coefficient. A bit of math says shows that this is just k/n. For the rest, note that the selection of the n'th item is independent of all other selections. In other words, "the past doesn't matter."
So an algorithm is:
S = set of up to k elements
n = 0
while not end of input
v = next value
n = n + 1
if |S| < k add v to S
else if random(0,1) >= k/n replace a randomly chosen element of S with v
I will let the coders code this one! It's pretty trivial. All you need is an array of size k and one pass over the data.
If you care about efficiency (as your tags suggest) and the number of items in the population is known, don't use reservior sampling. That would require you to loop through the entire data set and generate a random number for each.
Instead, pick five values ranges from 0 to n-1. In the unlikely case, there is a duplicate among the five indexes, replace the duplicate with another random value. Then use the five indexes to do a random-access lookup to the i-th element in the population.
This is simple. It uses a minimum number of calls the random number generator. And it accesses memory only for the relevant selections.
If you don't know the number of data elements in advance, you can loop-over the data once to get the population size and proceed as above.
If you aren't allow to iterate over the data more than once, use a chunked form of reservior sampling: 1) Choose the first five elements as the initial sample, each having a probability of 1/5th. 2) Read in a large chunk of data and choose five new samples from the new set (using only five calls to Rand). 3) Pairwise, decide whether to keep the new sample item or old sample element (with odds proportional the the probablities for each of the two sample groups). 4) Repeat until all the data has been read.
For example, assume there are 1000 data elements (but we don't know this in advance).
Choose the first five as the initial sample: current_sample = read(5); population=5.
Read a chunk of n datapoints (perhaps n=200 in this example):
subpop = read(200);
m = len(subpop);
new_sample = choose(5, subpop);
loop-over the two samples pairwise:
for (a, b) in (current_sample and new_sample): if random(0 to population + m) < population, then keep a, otherwise keep *b)
population += m
repeat
I have an assignment which says I have to read from file the distances between 20 citys each. I wonder how to handle that data in the application. I thought of multidimensional array, something like Distances(0, 1, 0)=500 which means that the distance between city0 and city1 is 500 miles. Also I think this is a waste of memory because Distances(0, 1, 0) and Distances(1, 0, 0) is the same thing. My mentor told me to use triangular matrix to save the data into the app. Can you show me an example of a similar data handling or some other idea of how to handle the data? I just can't imagine it. Thank you!
What I think he means is something like this:
http://www.arenalogisticsinc.com/images/chart4.jpg
Basically a 2-D array - And if you want to save space, just remove the top half of the array since it will have repetitions.
Hope this helps.
You want an array of arrays. A multidimensional array is useful if you have a consistent array size for each of the inner arrays, but you want your first array to have length=0, the second to have length=1, etc... So, actually, you don't even need the first array - as it is just empty.
Dim triangle As Array(19)
For i = 0 To 18
Dim innerArray(i+1) As Integer
triangle(i) = innerArray
Next
I am not sure how this is possible but I have a for loop not incrementing properly through an array.
Basically what I have is:
for (AISMessage *report in disarray){
NSLog(#"Position of array = %ld\n", [aisArray indexOfObject:report]);
}
There is more code in the loop but is nothing strange just formatting some of the data in the object and outputting it to a file.
The output of these lines would look something like:
Position of array = 0
...
Posiiton of array = 78176
Posiiton of array = 78177
Posiiton of array = 78178
Posiiton of array = 78178
Posiiton of array = 78180
Posiiton of array = 78181
...
Posiiton of array = 490187
For some reason the report at index 78178 gets read in twice and the report at 78179 gets skipped completely.
Any ideas on what may cause this?
I am totally confused.
Thanks in advance,
Jason
The object occurs twice in the array, so the indexOfObject finds the element at index 78179 at index 78178.
In other words, you have this case:
...
[78177] = x
[78178] = y
[78179] = y
[78180] = z
...
Also, you're not searching the same array you're iterating over, that might have something to do with it as well.
Since the positions reported are so high, I would try to find a better data structure than a simple array. For it to report a position of 78178, it will have to have compared the object against the preceeding 78177 elements, and this will only take more and more time as you get further into the array.
From the posted code you are iterating over AISMessage objects in the disarray array, but you are reporting on the position of the object in the aisArray array.
I have no idea if these are meant to be the same array or not. But if they are different arrays, then do you expect the objects to be in the same order?