Why are these hashcodes equals? - .net-4.0

This test is failing :
var hashCode = new
{
CustomerId = 3354,
ServiceId = 3,
CmsThematicId = (int?)605,
StartDate = (DateTime?)new DateTime(2013, 1, 5),
EndDate = (DateTime?)new DateTime(2013, 1, 6)
}.GetHashCode();
var hashCode2 = new
{
CustomerId = 1210,
ServiceId = 3,
CmsThematicId = (int?)591,
StartDate = (DateTime?)new DateTime(2013, 3, 31),
EndDate = (DateTime?)new DateTime(2013, 4, 1)
}.GetHashCode();
Assert.AreNotEqual(hashCode, hashCode2);
Can you tell me why ?

It's kinda amazing you found this coincidence.
Anonymous classes have a generated GetHashCode() method that generates a hash code by combining the hash codes of all properties.
The calculation is basically this:
public override int GetHashCode()
{
return -1521134295 *
( -1521134295 *
( -1521134295 *
( -1521134295 *
( -1521134295 *
1170354300 +
CustomerId.GetHashCode()) +
ServiceId.GetHashCode()) +
CmsThematicId.GetHashCode()) +
StartDate.GetHashCode()) +
EndDate.GetHashCode();
}
If you change any of the values of any of the fields, the hash code does change. The fact that you found two different sets of values that happen to get the same hash codes is a coincidence.
Note that hash codes are not necessarily unique. It's impossible to say hash codes would always be unique since there can be more objects than hash codes (although that is a lot of objects). Good hash codes provide a random distribution of values.
NOTE: The above is from .NET 4. Different versions of .NET may different and Mono differs.
If you want to actually compare two objects for equality then use .Equals(). For anonymous objects it compares each field. An even better option is to use an NUnit constraint that compares each field and reports back which field differs. I posted a constraint here:
https://stackoverflow.com/a/2046566/118703

Your test is not valid.
Because hash codes are not guaranteed to be unique (see other answers for a good explanation), you should not test for uniqueness of hash codes.
When writing your own GetHashCode() method, it is a good idea to test for even distribution of random input, just not for uniqueness. Just make sure that you use enough random input to get a good test.
The MSDN spec on GetHashCode specifically states:
For the best performance, a hash function must generate a random
distribution for all input.
This is all relative, of course. A GetHashCode() method that is being used to put 100 objects in a dictionary doesn't need to be nearly as random as a GetHashCode() that puts 10,000,000 objects in a dictionary.

Did you run into this when processing a fairly large amount of data?
Welcome to the wonderful world of hash codes. A hash code is not a "unique identifier." It can't be. There is an essentially infinite number of possible different instances of that anonymous type, but only 2^32 possible hash codes. So it's guaranteed that if you create enough of those objects, you're going to see some duplicates. In fact, if you generate 70,000 of those objects randomly, the odds are better than 50% that two of them will have the same hash code.
See Birthdays, Random Numbers, and Hash Codes, and the linked Wikipedia article for more info.
As for why some people didn't see a duplicate and others did, it's likely that they ran the program on different versions of .NET. The algorithm for generating hash codes is not guaranteed to remain the same across versions or platforms:
The GetHashCode method for an object must consistently return the same
hash code as long as there is no modification to the object state that
determines the return value of the object's Equals method. Note that
this is true only for the current execution of an application, and
that a different hash code can be returned if the application is run
again.

Jim suggested me (in the chat room) to store my parameters so when i display my parameters, select the not used ones, then when someone registers I flag it as used. But it's a big PITA to generate all the parameters.
So my solution is to build a int64 hashcode like this
const long i = -1521134295;
return -i * (-i * (-i * (-i * -117147284 + customerId.GetHashCode()) + serviceId.GetHashCode()) + cmsThematicId.GetHashCode()) + startDate.GetHashCode();
I removed the end date because Its value was depending on serviceId and startDate so I shouldn't have add this to the hashcode in the firstplace.
I copy/pasted it from a decompilation of the generated class. I got not colision if I do a test with 300 000 differents combinations.

Related

What is the most efficient way to join one list to another in kotlin?

I start with a list of integers from 1 to 1000 in listOfRandoms.
I would like to left join on random from the createDatabase list.
I am currently using a find{} statement within a loop to do this but feel like this is too heavy. Is there not a better (quicker) way to achieve same result?
Psuedo Code
data class DatabaseRow(
val refKey: Int,
val random: Int
)
fun main() {
val createDatabase = (1..1000).map { i -> DatabaseRow(i, Random()) }
val listOfRandoms = (1..1000).map { j ->
val lookup = createDatabase.find { it.refKey == j }
lookup.random
}
}
As mentioned in comments, the question seems to be mixing up database and programming ideas, which isn't helping.
And it's not entirely clear which parts of the code are needed, and which can be replaced. I'm assuming that you already have the createDatabase list, but that listOfRandoms is open to improvement.
The ‘pseudo’ code compiles fine except that:
You don't give an import for Random(), but none of the likely ones return an Int. I'm going to assume that should be kotlin.random.Random.nextInt().
And because lookup is nullable, you can't simply call lookup.random; a quick fix is lookup!!.random, but it would be safer to handle the null case properly with e.g. lookup?.random ?: -1. (That's irrelevant, though, given the assumption above.)
I think the general solution is to create a Map. This can be done very easily from createDatabase, by calling associate():
val map = createDatabase.associate{ it.refKey to it.random }
That should take time roughly proportional to the size of the list. Looking up values in the map is then very efficient (approx. constant time):
map[someKey]
In this case, that takes rather more memory than needed, because both keys and values are integers and will be boxed (stored as separate objects on the heap). Also, most maps use a hash table, which takes some memory.
Since the key is (according to comments) “an ascending list starting from a random number, like 18123..19123”, in this particular case it can instead be stored in an IntArray without any boxing. As you say, array indexes start from 0, so using the key directly would need a huge array and use only the last few cells — but if you know the start key, you could simply subtract that from the array index each time.
Creating such an array would be a bit more complex, for example:
val minKey = createDatabase.minOf{ it.refKey }
val maxKey = createDatabase.maxOf{ it.refKey }
val array = IntArray(maxKey - minKey + 1)
for (row in createDatabase)
array[row.refKey - minKey] = row.random
You'd then access values with:
array[someKey - minKey]
…which is also constant-time.
Some caveats with this approach:
If createDatabase is empty, then minOf() will throw a NoSuchElementException.
If it has ‘holes’, omitting some keys inside that range, then the array will hold its default value of 0 — you can change that by using the alternative IntArray constructor which also takes a lambda giving the initial value.)
Trying to look up a value outside that range will give an ArrayIndexOutOfBoundsException.
Whether it's worth the extra complexity to save a bit of memory will depend on things like the size of the ‘database’, and how long it's in memory for; I wouldn't add that complexity unless you have good reason to think memory usage will be an issue.

Kotlin: Why is Sequence more performant in this example?

Currently, I am looking into Kotlin and have a question about Sequences vs. Collections.
I read a blog post about this topic and there you can find this code snippets:
List implementation:
val list = generateSequence(1) { it + 1 }
.take(50_000_000)
.toList()
measure {
list
.filter { it % 3 == 0 }
.average()
}
// 8644 ms
Sequence implementation:
val sequence = generateSequence(1) { it + 1 }
.take(50_000_000)
measure {
sequence
.filter { it % 3 == 0 }
.average()
}
// 822 ms
The point here is that the Sequence implementation is about 10x faster.
However, I do not really understand WHY that is. I know that with a Sequence, you do "lazy evaluation", but I cannot find any reason why that helps reducing the processing in this example.
However, here I know why a Sequence is generally faster:
val result = sequenceOf("a", "b", "c")
.map {
println("map: $it")
it.toUpperCase()
}
.any {
println("any: $it")
it.startsWith("B")
}
Because with a Sequence you process the data "vertically", when the first element starts with "B", you don't have to map for the rest of the elements. It makes sense here.
So, why is it also faster in the first example?
Let's look at what those two implementations are actually doing:
The List implementation first creates a List in memory with 50 million elements.  This will take a bare minimum of 200MB, since an integer takes 4 bytes.
(In fact, it's probably far more than that.  As Alexey Romanov pointed out, since it's a generic List implementation and not an IntList, it won't be storing the integers directly, but will be ‘boxing’ them — storing references to Int objects.  On the JVM, each reference could be 8 or 16 bytes, and each Int could take 16, giving 1–2GB.  Also, depending how the List gets created, it might start with a small array and keep creating larger and larger ones as the list grows, copying all the values across each time, using more memory still.)
Then it has to read all the values back from the list, filter them, and create another list in memory.
Finally, it has to read all those values back in again, to calculate the average.
The Sequence implementation, on the other hand, doesn't have to store anything!  It simply generates the values in order, and as it does each one it checks whether it's divisible by 3 and if so includes it in the average.
(That's pretty much how you'd do it if you were implementing it ‘by hand’.)
You can see that in addition to the divisibility checking and average calculation, the List implementation is doing a massive amount of memory access, which will take a lot of time.  That's the main reason it's far slower than the Sequence version, which doesn't!
Seeing this, you might ask why we don't use Sequences everywhere…  But this is a fairly extreme example.  Setting up and then iterating the Sequence has some overhead of its own, and for smallish lists that can outweigh the memory overhead.  So Sequences only have a clear advantage in cases when the lists are very large, are processed strictly in order, there are several intermediate steps, and/or many items are filtered out along the way (especially if the Sequence is infinite!).
In my experience, those conditions don't occur very often.  But this question shows how important it is to recognise them when they do!
Leveraging lazy-evaluation allows avoiding the creation of intermediate objects that are irrelevant from the point of the end goal.
Also, the benchmarking method used in the mentioned article is not super accurate. Try to repeat the experiment with JMH.
Initial code produces a list containing 50_000_000 objects:
val list = generateSequence(1) { it + 1 }
.take(50_000_000)
.toList()
then iterates through it and creates another list containing a subset of its elements:
.filter { it % 3 == 0 }
... and then proceeds with calculating the average:
.average()
Using sequences allows you to avoid doing all those intermediate steps. The below code doesn't produce 50_000_000 elements, it's just a representation of that 1...50_000_000 sequence:
val sequence = generateSequence(1) { it + 1 }
.take(50_000_000)
adding a filtering to it doesn't trigger the calculation itself as well but derives a new sequence from the existing one (3, 6, 9...):
.filter { it % 3 == 0 }
and eventually, a terminal operation is called that triggers the evaluation of the sequence and the actual calculation:
.average()
Some relevant reading:
Kotlin: Beware of Java Stream API Habits
Kotlin Collections API Performance Antipatterns

how common is it to use both a guid and an int as unique identifiers for a table?

I think a Guid is generally the preferred unique table row identifier from a dba perspective. But I'm working on a project where the developers and managers appear to want a way to reference things by an int value. I can understand their perspective b/c they want a simple and easy way to reference different entities.
I was thinking about using a pattern for my tables where each table would have an int Id column representing the PK column but then it would also include a Guid column as a globally unique identifier. How common is it to use this type of pattern?
In the vast majority of cases you'll want to either use an INT or BIGINT for you primekey/foreign key. For the most part you are looking to make sure that table can be joined to and have a way to easily select a single unique row. In theory using GUIDs all over the place gets you there too, if you were a robot and could quickly ask a colleague, "Hey can you check out ROW_ID FD229C39-2074-4B04-8A50-456402705C02" vs "Hey can you check out ROW_ID 523". But we are human. I don't think there is a really good reason to include another column that is simply a GUID in addition to your PK (which should be an INT or BIGINT)
It can also be nice to have your PK in an order, that seems to come in handy. GUIDs won't be in a order. However, a case for using a GUID would be if you have to expose this value to a customer. You may not want them to know they are customer #6. But being customer #B8D44820-DF75-44C9-8527-F6AC7D1D259B isn't too great if they have to call in and identify themselves, but might be fine for writing code against (say a webservice or some kind of API). SQL is a lot of art with the science!
In addition do you really need a global unique id for a row? Probably not. If you are designing a system that could use up more than what INT can handle (say total number of tweets in all time) then use BIGINT. If you can use up all the BIGINTs, wow. I'd be interested in hearing how and would like to subscribe to your newsletter.
A question I ask myself when writing stuff, "If I'm wrong how hard will it be to do the other way?". If you really need a GUID later, add it. If you put it in now and just 1 person uses it you can never take it out and it will have to be maintained... job security? nah, don't think that way :) Don't over engineer it.
I would not say GUID is generally preferred from a DBA perspective. It is larger (16 bytes rather than 4 for int or 8 for bigint) and the random variety introduces fragmentation and causes much more IO with large tables due to lower page life expectancy. This is especially a problem with spinning media and limited RAM.
When a GUID is actually needed, some of these issues can be avoided using a sequential version for the GUID value rather than introducing another surrogate key. The value can be assigned in by SQL Server with a NEWSEQUENTIALID() default constraint on a column or generated in application code with the bytes ordered properly for SQL Server. Below is a Windows C# example of the latter technique.
using System;
using System.Runtime.InteropServices;
public class Example
{
[DllImport("rpcrt4.dll", CharSet = CharSet.Auto)]
public static extern int UuidCreateSequential(ref Guid guid);
/// sequential guid for SQL Server
public static Guid NewSequentialGuid()
{
const int S_OK = 0;
const int RPC_S_UUID_LOCAL_ONLY = 1824;
Guid oldGuid = Guid.Empty;
int result = UuidCreateSequential(ref oldGuid);
if (result != S_OK && result != RPC_S_UUID_LOCAL_ONLY)
{
throw new ExternalException("UuidCreateSequential call failed", result);
}
byte[] oldGuidBytes = oldGuid.ToByteArray();
byte[] newGuidBytes = new byte[16];
oldGuidBytes.CopyTo(newGuidBytes, 0);
// swap low timestamp bytes (0-3)
newGuidBytes[0] = oldGuidBytes[3];
newGuidBytes[1] = oldGuidBytes[2];
newGuidBytes[2] = oldGuidBytes[1];
newGuidBytes[3] = oldGuidBytes[0];
// swap middle timestamp bytes (4-5)
newGuidBytes[4] = oldGuidBytes[5];
newGuidBytes[5] = oldGuidBytes[4];
// swap high timestamp bytes (6-7)
newGuidBytes[6] = oldGuidBytes[7];
newGuidBytes[7] = oldGuidBytes[6];
//remaining 8 bytes are unchanged (8-15)
return new Guid(newGuidBytes);
}
}

Linking Text to an Integer Objective C

The goal of this post is to find a more efficient way to create this method. Right now, as I start adding more and more values, I'm going to have a very messy and confusing app. Any help is appreciated!
I am making a workout app and assign an integer value to each workout. For example:
Where the number is exersiceInt:
01 is High Knees
02 is Jumping Jacks
03 is Jog in Place
etc.
I am making it so there is a feature to randomize the workout. To do this I am using this code:
-(IBAction) setWorkoutIntervals {
exerciseInt01 = 1 + (rand() %3);
exerciseInt02 = 1 + (rand() %3);
exerciseInt03 = 1 + (rand() %3);
}
So basically the workout intervals will first be a random workout (between high knees, jumping jacks, and jog in place). What I want to do is make a universal that defines the following so I don't have to continuously hard code everything.
Right now I have:
-(void) setLabelText {
if (exerciseInt01 == 1) {
exercise01Label.text = [NSString stringWithFormat:#"High Knees"];
}
if (exerciseInt01 == 2) {
exercise01Label.text = [NSString stringWithFormat:#"Jumping Jacks"];
}
if (exerciseInt01 == 3) {
exercise01Label.text = [NSString stringWithFormat:#"Jog in Place"];
}
}
I can already tell this about to get really messy once I start specifying images for each workout and start adding workouts. Additionally, my plan was to put the same code for exercise02Label, exercise03Label, etc. which would become extremely redundant and probably unnecessary.
What I'm thinking would be perfect if there would be someway to say
exercise01Label.text = exercise01Int; (I want to to say that the Label's text equals Jumping Jacks based on the current integer value)
How can I make it so I only have to state everything once and make the code less messy and less lengthy?
Three things for you to explore to make your code easier:
1. Count from zero
A number of things can be easier if you count from zero. A simple example is if your first exercise was numbered 0 then your random calculation would just be rand() % 3 (BTW look up uniform random number, there are much better ways to get a random number).
2. Learn about enumerations
An enumeration is a type with a set of named literal values. In (Objective-)C you can also think of them as just a collection of named integer values. For example you might declare:
typedef enum
{
HighKnees,
JumpingJacks,
JogInPlace,
ExerciseKindCount
} ExerciseCount;
Which declares ExerciseCount as a new type with 4 values. Each of these is equivalent to an integer, here HighKnees is equivalent to 0 and ExerciseKindCount to 3 - this should make you think of the first thing, count from zero...
3. Discover arrays
An array is an ordered collection of items where each item has an index - which is usually an integer or enumeration value. In (Objective-)C there are two basic kinds of arrays: C-style and object-style represented by NSArray and NSMutableArray. For example here is a simple C-style array:
NSString *gExerciseLabels[ExerciseKindCount] =
{ #"High Knees",
#"Jumping Jacks",
#"Jog in Place"
}
You've probably guessed by now, the first item of the above array has index 0, back to counting from zero...
Exploring these three things should quickly show you ways to simplify your code. Later you may wish to explore structures and objects.
HTH
A simple way to start is by putting the exercise names in an array. Then you can access the names by index. eg - exerciseNames[exerciseNumber]. You can also make the list of exercises in an array (of integers). So you would get; exerciseNames[exerciseTable[i]]; for example. Eventually you will want an object to define an exercise so that you can include images, videos, counts, durations etc.

How to make rand() more likely to select certain numbers?

Is it possible to use rand() or any other pseudo-random generator to pick out random numbers, but have it be more likely that it will pick certain numbers that the user feeds it? In other words, is there a way, with rand() or something else, to pick a pseudo random number, but be able to adjust the odds of getting certain outcomes, and how do you do that if it is possible.
BTW, I'm just asking how to change the numbers that rand() outputs, not how to get the user input.
Well, your question is a bit vague... but if you wanted to pick a number from 0-100 but with a bias for (say) 43 and 27, you could pick a number in the range [0, 102] and map 101 to 43 and 102 to 27. It will really depend on how much bias you want to put in, what your range is etc.
You want a mapping function between uniform density of rand() and the probability density that you desire. The mapping function can be done lots of different ways.
You can certainly use any random number generator to skew the results. Example in C#, since I don't know objective-c syntax. I assume that rand() return a number tween 0 and 1, 0 inclusive and 1 exclusive. It should be quite easy to understand the idear and convert the code to any other language.
/// <summary>
/// Dice roll with a double chance of rolling a 6.
/// </summary>
int SkewedDiceRoll()
{
// Set diceRool to a value from 1 to 7.
int diceRool = Math.Floor(7 * rand()) + 1;
// Treat a value of 7 as a 6.
if (diceRoll == 7)
{
diceRoll = 6;
}
return diceRoll;
}
This is not too difficult..
simply create an array of all possible numbers, then pad the array with extra numbers of which you want to result more often.
ie:
array('1',1','1','1','2','3','4','4');
Obviously when you query that array, it will call "1" the most, followed by "4"
In other words, is there a way, with rand() or something else, to pick a pseudo random number, but be able to adjust the odds of getting certain outcomes, and how do you do that if it is possible.
For simplicity sake, let's use the drand48() which returns "values uniformly distributed over the interval [0.0,1.0)".
To make the values close to one more likely to appear, apply skew function log2():
log2( drand48() + 1.0 ); // +1 since log2() in is [0.0, 1.0) for values in [1.0, 2.0)
To make the values close to zero more likely to appear, use the e.g. exp():
(exp(drand48()) - 1.0) * (1/(M_E-1.0)); // exp(0)=1, exp(1)=e
Generally you need to crate a function which would map the uniformly distributed values from the random function into values which are distributed differently, non-uniformly.
You can use the follwing trick
This example has a 50 percent chance of producing one of your 'favourite' numbers
int[] highlyProbable = new int[]{...};
public int biasedRand() {
double rand = rand();
if (rand < 0.5) {
return highlyProbable[(int)(highlyProbable.length * rand())];
} else {
return (int)YOUR_RANGE * rand();
}
}
In addition to what Kevin suggested, you could have your regular group of numbers (the wide range) chopped into a number of smaller ranges, and have the RNG pick from the ranges you find favorable. You could access these ranges in a particular order, or, you can access them in some random order (but I can assume this wouldn't be what you want.) Since you're using manually specified ranges to be accessed within the wide range of elements, you're likely to see the numbers you want pop up more than others. Of course, this is just how I'd approach it, and it may not seem all that rational.
Good luck.
By definition the output of a random number generator is random, which means that each number is equally likely to occur next (1/10 chance) and you should not be able to affect the outcome.
Of-course, a pseudo-random generator creates an output that will always follow the same pattern for a given input seed. So if you know the seed, then you may have some idea of the output sequence. You can, of-course, use the modulus operator to play around with the set of numbers being output from the generator (eg. %5 + 2 to generate numbers from 2 to 7).