Problem 98 - Project Euler - anagram

The problem is as follows:
By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 36². What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 96². We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter.
Using words.txt (right click and 'Save Link/Target As...'), a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself).
What is the largest square number formed by any member of such a pair?
NOTE: All anagrams formed must be contained in the given text file.
I don't understand the mapping of CARE to 1296. How does that work? Or are all permutation mappings meant to be tried, i.e. all letters to 1-9?

All assignments of digits to letters are allowed. So C=1, A=2, R=3, E=4 would be a possible assignment ... except that 1234 is not a square, so that would be no good.
Maybe another example would help make it clear? If we assign A=6, E=5, T=2, then TEA = 256 = 16² and ATE = 625 = 25². So (TEA=256, ATE=625) is a square anagram word pair.
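To make the substitution concrete, here is a minimal Python sketch of applying one assignment of digits to letters and testing the result. The helper names apply_mapping and is_square are made up for illustration; the CARE/RACE assignment is the one from the problem statement.

```python
from math import isqrt

def apply_mapping(word, mapping):
    """Replace each letter by its assigned digit and read the result as a number."""
    return int("".join(str(mapping[ch]) for ch in word))

def is_square(n):
    """True if n is a perfect square."""
    r = isqrt(n)
    return r * r == n

# The assignment from the problem statement: C=1, A=2, R=9, E=6.
care = {"C": 1, "A": 2, "R": 9, "E": 6}
print(apply_mapping("CARE", care), is_square(apply_mapping("CARE", care)))  # 1296 True
print(apply_mapping("RACE", care), is_square(apply_mapping("RACE", care)))  # 9216 True

# The assignment C=1, A=2, R=3, E=4 gives 1234, which is not a square.
print(is_square(apply_mapping("CARE", {"C": 1, "A": 2, "R": 3, "E": 4})))   # False
```

The same mapping is applied to both words of the pair, which is what makes the pair interesting.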
(Just because all assignments of digits to letters are allowed, does not mean that actually trying out all such assignments is the best way to solve the problem. There may be some other, cleverer, way to do it.)

In short: yes, all permutations need to be tried.

If you test all letter-for-digit substitutions, then you are looking for pairs of squares with these properties:
they have the same length
they have the same digits, with the same number of occurrences as the letters in the input strings.
It is faster to find all these pairs of squares first. There are 68 squares of length 4, 217 squares of length 5, ... Filtering all squares of the same length by the properties above will generate a 'small' number of pairs, which are the solutions you are looking for.
This data is 'static' and doesn't depend on the input strings. It can be calculated once and used for all input strings.
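The precomputation described above could look roughly like the following Python sketch. The function names are hypothetical, and grouping by the sorted digit string is just one way to find squares that share the same digits.

```python
from collections import defaultdict
from itertools import combinations
from math import isqrt

def squares_with_length(k):
    """All perfect squares that have exactly k decimal digits."""
    lo, hi = 10 ** (k - 1), 10 ** k - 1
    start = isqrt(lo - 1) + 1          # smallest r with r*r >= lo
    return [r * r for r in range(start, isqrt(hi) + 1)]

def square_pairs(k):
    """Pairs of k-digit squares that use the same multiset of digits."""
    groups = defaultdict(list)
    for sq in squares_with_length(k):
        groups["".join(sorted(str(sq)))].append(sq)
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(members, 2))
    return pairs

print(len(squares_with_length(4)))         # 68
print((1296, 9216) in square_pairs(4))     # True
```

Each candidate pair still has to be checked against a word pair, since the positions of the digits must follow the positions of the letters.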

Hmm. How to put this. The people who put together Project Euler promise that every problem has a solution that runs in under one minute, and there is only one problem that I think might fail this promise, but this is not it.
Yes, you could permute the digits, and try all permutations against all squares, but that would be a very large search space, not at all likely to be the Right Thing (TM). In general, when you see that your "look" at the problem is going to generate a search that will take too long, you need to search something else.
Like, suppose you were asked to determine what numbers would be the result of multiplying two primes between 1 and a zillion. You could factor every number between 1 and a zillion, but it might be faster to take all combinations of two primes and multiply them. Since you are looking at combinations, you can start with two and go until your results are too large, then do the same with three, etc. By comparison, this should be much faster - and you don't have to multiply all the numbers out, you could take logs of all the primes and then just add them and find the limit for every prime, giving you a list of numbers you could add up.
There are a bunch of innovative solutions, but the first one you think of - especially the one you think of when Project Euler describes the problem, is likely to be wrong.
So, how can you approach this problem? There are probably too many permutations to look at, but maybe you can figure out something with mappings and comparing mappings?
(Trying to avoid giving it all away.)


Could you explain how to convert from lz77 to huffman?

Could you explain how to convert from lz77 to huffman on the example in the below picture?
In the first step your output is essentially 3 numbers:
(1) prev index
(2) number of characters to repeat
(3) next character (be it ASCII or Unicode)
The algorithm demands that you specify a sliding window up front. That means you know how big (1) and (2) can be at most.
In other words, you know how many bits (1) and (2) will take up.
Since (3) is essentially also a character from a fixed length alphabet, you also know the bit-length of (3)
That means it's safe to simply concatenate them.
So, the output of the first algorithm can be thought of as outputting a bit-sequence, where every item in the sequence has a fixed length.
That's ideal for applying huffman.
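As a rough sketch of that idea in Python: pack each triple into one fixed-width symbol, then build a Huffman code over the symbols. The field widths (4, 4 and 8 bits), the sample triples and the function names are all assumptions for illustration, and the tie-breaking and 0/1 branch choices are arbitrary, so the exact bit patterns will not match any particular textbook figure.

```python
import heapq
from collections import Counter

# Assumed field widths: 4-bit index, 4-bit repeat count, 8-bit character.
OFFSET_BITS, LENGTH_BITS, CHAR_BITS = 4, 4, 8

def pack(offset, length, char):
    """Concatenate the three fixed-width fields into one integer symbol."""
    return (offset << (LENGTH_BITS + CHAR_BITS)) | (length << CHAR_BITS) | ord(char)

def huffman_codes(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries are (count, tie_breaker, {symbol: code_so_far}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical LZ77 output as (prev index, repeat count, next character) triples.
triples = [(0, 0, "a"), (0, 0, "b"), (2, 2, "c"), (0, 0, "a"), (0, 0, "a")]
symbols = [pack(*t) for t in triples]
codes = huffman_codes(symbols)
encoded = "".join(codes[s] for s in symbols)
print(len(encoded))   # 7 bits instead of 5 * 16 = 80
```

A real encoder would also have to transmit the code table (or the frequencies) so that the decoder can rebuild it.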
Of course the specifics are not mentioned, and you can choose from a lot of options.
normalized huffman table
1 on left-branch vs 0 on left-branch
priorities when merging items of similar count
So I can not readily explain the exact output values you are showing.
But I hope I can at least explain how to get from A to B.
You can't. The coding shown is, well, figurative. Not literal. The symbols A, B, and C are all coded to the single bit 0. Obviously that's not going to be very helpful on the decoding end.

How to do postal addresses fuzzy matching?

I would like to know how to match postal addresses when their formats differ or when one of them is misspelled.
So far I've found different solutions, but I think they are quite old and not very efficient. I'm sure better methods exist, so if you have references for me to read, please share them; I'm sure the subject will interest other people too.
The solutions I found (examples are in R) :
Levenshtein distance, which equals the number of characters you have to insert, delete or change to transform one word into another.
agrep("acusait", c("accusait", "abusait"), max = 2, value = TRUE)
## [1] "accusait" "abusait"
The comparison of phonemes
## [1] "A223" "A223" "A123"
The use of a spelling corrector (possibly a Bayesian one like Peter Norvig's), but I guess it is not very efficient on addresses.
I thought about using the suggestions of Google Suggest, but likewise, it is not very efficient on personal postal addresses.
You can imagine using a supervised machine learning approach, but you need to have stored the misspelled requests of users to do so, which is not an option for me.
I look at this as a spelling-correction problem, where you need to find the nearest-matching word in some sort of dictionary.
What I mean by "near" is Levenshtein distance, except the smallest number of single-character insertions, deletions, and replacements is too restrictive.
Other kinds of "spelling mistakes" are also possible, for example transposing two characters.
I've done this a few times, but not lately.
The most recent case had to do with concomitant medications for clinical trials.
You would be amazed how many ways there are to misspell "acetylsalicylic".
Here is an outline in C++ of how it is done.
Briefly, the dictionary is stored as a trie, and you are presented with a possibly misspelled word, which you try to look up in the trie.
As you search, you try the word as it is, and you try all possible alterations of the word at each point.
As you go, you have an integer budget of how many alterations you can tolerate, which you decrement every time you put in an alteration.
If you exhaust the budget, you allow no further alterations.
Now there is a top-level loop, where you call the search.
On the first iteration, you call it with a budget of 0.
When the budget is 0, it will allow no alterations, so it is simply a direct lookup.
If it fails to find the word with a budget of 0, you call it again with a budget of 1, so it will allow a single alteration.
If that fails, try a budget of 2, and so on.
Something I have not tried is a fractional budget.
For example, suppose a typical alteration reduces the budget by 2, not 1, and the budget goes 0, 2, 4, etc.
Then some alterations could be considered "cheaper". For example, a vowel replacement might only decrement the budget by 1, so for the cost of one consonant replacement you could do two vowel replacements.
If the word is not misspelled, the time it takes is proportional to the number of letters in the word.
In general, the time it takes is exponential in the number of alterations in the word.
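The search described above can be sketched in Python as follows. This is not the author's C++ outline, only an illustration of the same idea; the names build_trie, search and closest, the nested-dict trie and the max_budget cutoff are all assumptions. Every alteration (delete, insert, replace, transpose) costs 1 here.

```python
END = None  # sentinel key marking "a word ends here"; never equal to a character

def build_trie(words):
    """Store the dictionary as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def search(node, word, i, budget, found):
    """Collect dictionary words that match word[i:] from node using at most budget alterations."""
    if i == len(word) and END in node:
        found.add(node[END])
    if i < len(word) and word[i] in node:                  # take the letter as it is
        search(node[word[i]], word, i + 1, budget, found)
    if budget == 0:                                        # no alterations left
        return
    if i < len(word):
        search(node, word, i + 1, budget - 1, found)       # delete word[i]
    for ch, child in node.items():
        if ch is END:
            continue
        search(child, word, i, budget - 1, found)          # insert ch
        if i < len(word) and ch != word[i]:
            search(child, word, i + 1, budget - 1, found)  # replace word[i] by ch
    if i + 1 < len(word):                                  # transpose two characters
        nxt = node.get(word[i + 1])
        if nxt is not None and word[i] in nxt:
            search(nxt[word[i]], word, i + 2, budget - 1, found)

def closest(trie, word, max_budget=3):
    """Top-level loop: try budget 0, then 1, then 2, ... until something matches."""
    for budget in range(max_budget + 1):
        found = set()
        search(trie, word, 0, budget, found)
        if found:
            return budget, sorted(found)
    return None, []

trie = build_trie(["accusait", "abusait", "aspirin"])
print(closest(trie, "aspirin"))   # (0, ['aspirin'])
print(closest(trie, "acusait"))   # (1, ['abusait', 'accusait'])
print(closest(trie, "apsirin"))   # (1, ['aspirin'])
```

Giving each kind of alteration its own cost, as in the fractional-budget idea, would only mean replacing the `budget - 1` terms with per-alteration costs.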
If you are working in R (as I was in the example above), I would have it call out to the C++ program, because you need the speed of a compiled language for this.
Extending what Mike had to say, and using the string matching library stringdist in R to match a vector of addresses that errored out in ARCGIS's geocoding function:
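library(stringdist)
#assumption: unmatched and index are data frames with an address column
rows<-nrow(unmatched)
#vectors to put our matched addresses and their scores in
matched_add<-rep(NA, rows)
score<-rep(NA, rows)
#for instructional purposes only, you should use sapply to apply functions to vectors
for (u in seq_len(rows)){
#gives you the position of the closest match in an address vector
pos<-amatch(unmatched$address[u],index$address, maxDist = Inf)
matched_add[u]<-index$address[pos]
#stringsim here will give you the score to go back and adjust your matches
score[u]<-stringsim(unmatched$address[u],index$address[pos])
}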
Stringdist has several methods you can use to find approximate matches, including Levenshtein (method="lv"). You'll probably want to tinker with these to fit your dataset as well as you can.

Big O Notation - input size

I am reading a blog about big O notation on TopCoder.
I have come across the paragraph below:
Formal notes on the input size
What exactly is this "input size" we started to talk about? In the
formal definitions this is the size of the input written in some
fixed finite alphabet (with at least 2 "letters"). For our needs, we
may consider this alphabet to be the numbers 0..255. Then the "input
size" turns out to be exactly the size of the input file in bytes.
Can anyone please explain what this statement means?
it is the size of the input written in some
fixed finite alphabet (with at least 2 "letters"). For our needs, we
may consider this alphabet to be the numbers 0..255.
The statement is about the fundamental representation of information using symbols. The more symbols you use (the bigger the alphabet is), the more information you can represent with fewer characters, although you can represent everything with just two "letters", i.e. one bit of information per character. Using the numbers 0..255 is equivalent to using 8 bits, i.e. one byte (2^8=256).
In computer programming, you normally use bytes but in theoretical computer science bits are used as they have the same capabilities (you just need more of them) and it makes proofs easier to write.
This statement means the following. You have to represent the input in order for the algorithm to process it, i.e. you have to "write it down". You can write the input down with letters (=symbols). The number of symbols has to be finite (or else neither you nor the algorithm could understand it), i.e. they come from a fixed finite alphabet (=set of possible symbols). The size of the input is the number of letters you used to write it down.
The example in the text says that the alphabet contains the numbers between 0 and 255. This means that each letter fits in one byte, just like a character in an 8-bit (extended ASCII) encoding. So you can write down your input with such characters. Each character is stored in one byte, i.e. the size of the input (=number of characters) is the number of bytes.
Let me explain by example.
Let's take, say, the factorization (sub)problem: given a number n (not prime), find any of its divisors different from 1 and n. Clearly, we need to check at most sqrt(n) numbers to find one. Thus it seems to be a sublinear problem. Why is it considered a hard nut to crack, then? That's because we usually need only log(n) digits to write down n, and we naturally want to solve the problems which are "easy to write down". Although sqrt(n) may seem small when compared to n, it is far too much when compared to log(n): it is exponential in the number of digits.
That is why we need to say a word about the "input alphabet" before talking about a problem's complexity.
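A small Python sketch makes the gap visible. The function name smallest_divisor and the choice of n (the product of the primes 10007 and 10009) are just for illustration.

```python
from math import isqrt

def smallest_divisor(n):
    """Trial division: return (divisor, number of candidates tried)."""
    tried = 0
    for d in range(2, isqrt(n) + 1):
        tried += 1
        if n % d == 0:
            return d, tried
    return None, tried              # no divisor found: n is prime (or n < 4)

n = 10007 * 10009                   # product of two primes, chosen for illustration
divisor, tried = smallest_divisor(n)
print(n.bit_length())               # 27: the input size in bits
print(divisor, tried)               # 10007 10006: candidates tried, about sqrt(n)
```

Doubling the number of bits in n roughly squares the number of candidates, which is why the work is measured against the input size and not against n itself.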

Link numbers with an equation/algorithm

I am making an anagram solver in Visual Basic that gives you every possible combination when you enter a string. I need to work out how many combinations there are depending on the amount of characters in the string and how many different characters there are.
Sample string:
Total characters: 3, Different Characters: 3
Possible combinations: 6
abc, acb, bac, bca, cab, cba
I need an equation (using the number of characters and different characters) to link this to a string that contains a different amount of characters.
I've been using trial and error to try and figure it out, but I can't quite get my head around it. So far I have:
((letters - 1) ^ (different letters - 1)) + (letters - 1)
which works for a few different letter counts but not for all.
Help please???
I'll lead you to the answer, but I'll try to explain along the way. Let's say you had 10 different letters. You'd have 10 choices for the first, 9 for the second, 8 for the third, etc. Ultimately, there would be 10*9*8*7*6...*2*1 = 10! possibilities. However, sometimes you'll have multiple instances of the same letter. For example, using that for the string "aaabcd" would overcount possibilities, because it counts each of the a's as distinct letters, even though they're not. To correct for that, you would have to divide by the factorial of the number of repeated letters. A good way to calculate the total number of possibilities would be (total number of letters factorial)/ (product of the factorials of the number of repeated instances of each letter).
For example:
There are 6!/(3!) ways to arrange the letters in "aaabcd"
There are 6! ways to arrange the letters in "abcdef"
There are 6!/(3!*2!) ways to arrange the letters in "aaabbc"
There are 10!/(5!*3!*2!) ways to arrange the letters in "aaaaabbbcc"
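The formula is short enough to check directly. Here is a minimal Python sketch (the function name arrangements is made up for illustration) that counts the occurrences of each letter and divides the total factorial by each count's factorial.

```python
from collections import Counter
from math import factorial

def arrangements(s):
    """Number of distinct orderings of the letters of s: n! / (c1! * c2! * ...)."""
    total = factorial(len(s))
    for count in Counter(s).values():
        total //= factorial(count)   # exact: every partial quotient is an integer
    return total

for word in ["abc", "aaabcd", "abcdef", "aaabbc", "aaaaabbbcc"]:
    print(word, arrangements(word))
# abc 6
# aaabcd 120
# abcdef 720
# aaabbc 60
# aaaaabbbcc 2520
```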
I hope this helps.
The count you are looking for is exactly a multinomial coefficient.
A simple explanation: when no characters repeat,
it is simply the number of permutations, n!
(This is easy to see if you draw a tree diagram: the first character has n choices, the second has n-1 choices, and so on.)
However, when characters repeat, you will count many arrangements more than once.
Let's look at a simple example: for aaa, how many arrangements are there if we count them even when the outcome is the same?
The answer is 3! (aaa, aaa, aaa, aaa, aaa, aaa).
This tells us that when a character appears m times, we count m! arrangements instead of 1.
So the count is just n! (all possible arrangements, including identical outcomes) / m! (for a character that appears m times).
The same goes for more repeating characters: n!/(a!b!c!...) (the first character appears a times, another appears b times, ...).
Once you understand the concept, you will see that the non-repeating characters are handled the same way: each just divides by 1!. For example, for the character (multi)set {a,a,a,b,b,c}, #a = 3, #b = 2, #c = 1, so the answer (without repeated counting) is (3+2+1)!/(3!2!1!), and a fraction of this form is called a multinomial coefficient, as stated above.
From a programming point of view, you can pre-compute all the factorials with a simple for loop (only for fairly small n, though: 21! already overflows a 64-bit integer):
declare frac = array(n + 1);   // frac[i] will hold i!
frac[0] = 1;
FOR i = 1; i <= n; i++
    frac[i] = i * frac[i-1];
For a larger n, you can instead do the division in floating point (double/float) on the fly inside the loop to avoid the overflow, though you may then face precision problems.
If you also need to output the different strings, you can use DFS with backtracking to enumerate all the possible outcomes. Or, if you can use another language like C++, you can use a built-in function like next_permutation() after sorting the characters.
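A minimal Python sketch of the DFS/backtracking approach follows. The function name distinct_permutations is made up for illustration. Working from letter counts instead of positions means each distinct string is produced exactly once.

```python
from collections import Counter

def distinct_permutations(s):
    """Return every distinct arrangement of the letters of s, in sorted order."""
    counts = Counter(s)
    letters = sorted(counts)
    results = []

    def backtrack(prefix):
        if len(prefix) == len(s):
            results.append("".join(prefix))
            return
        for ch in letters:
            if counts[ch] > 0:        # this letter is still available
                counts[ch] -= 1
                prefix.append(ch)
                backtrack(prefix)
                prefix.pop()          # undo the choice and try the next letter
                counts[ch] += 1

    backtrack([])
    return results

print(distinct_permutations("abc"))   # ['abc', 'acb', 'bac', 'bca', 'cab', 'cba']
print(distinct_permutations("aab"))   # ['aab', 'aba', 'baa']
```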

How does rand() work? Does it have certain tendencies? Is there something better to use?

I have read that it has something to do with time, also you get from including time.h, so I assumed that much, but how does it work exactly? Also, does it have any tendencies towards odd or even numbers or something like that? And finally is there something with better distribution in the C standard library or the Foundation framework?
You use time.h to get a seed, which is an initial random number. C then does a bunch of operations on this number to get the next random number, then operations on that one to get the next, then... you get the picture.
rand() is able to touch on every possible integer in its range. It will not prefer even or odd numbers regardless of the input seed, happily. Still, it has limits: it repeats itself relatively quickly, and in some implementations it only gives numbers up to 32767 (the smallest RAND_MAX the C standard allows).
C does not have another built-in random number generator. If you need a real tough one, there are many packages available online, but the Mersenne Twister algorithm is probably the most popular pick.
Now, if you are interested in the reasons why the above is true, here are the gory details on how rand() works:
rand() is what's called a "linear congruential generator." This means that it employs an equation of the form:
x_{n+1} = (a * x_n + b) mod m
where x_n is the nth random number, and a and b are some predetermined integers. The arithmetic is performed modulo m, with m usually 2^32 depending on the machine, so that only the lowest 32 bits are kept in the calculation of x_{n+1}.
In English, then, the idea is this: To get the next random number, multiply the last random number by something, add a number to it, and then take the last few digits.
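A toy linear congruential generator in Python shows how little is involved. This is only a sketch: the class name LCG is hypothetical, the multiplier and increment are the values this answer quotes for the GNU C library, and the modulus 2^32 is an assumption. Real rand() implementations differ in which bits of the state they return.

```python
class LCG:
    """Toy linear congruential generator: state -> (a * state + b) mod m."""

    def __init__(self, seed, a=1103515245, b=12345, m=2**32):
        self.a, self.b, self.m = a, b, m
        self.state = seed % m

    def next(self):
        # Multiply, add, and keep only what survives the modulo.
        self.state = (self.a * self.state + self.b) % self.m
        return self.state

gen1, gen2 = LCG(seed=1), LCG(seed=1)
first = [gen1.next() for _ in range(3)]
second = [gen2.next() for _ in range(3)]
print(first[0])          # 1103527590
print(first == second)   # True: same seed, same sequence
```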
A few limitations are quickly apparent:
First, you need a starting random number. This is the "seed" of your random number generator, and this is where you've heard of time.h being used. Since we want a really random number, it is common practice to ask the system what time it is (in integer form) and use this as the first "random number." Also, this explains why using the same seed twice will always give exactly the same sequence of random numbers. This sounds bad, but is actually useful, since debugging is a lot easier when you control the inputs to your program.
Second, a and b have to be chosen very, very carefully or you'll get some disastrous results. Fortunately, the equation for a linear congruential generator is simple enough that the math has been worked out in some detail. It turns out that choosing an a which satisfies a mod 8 = 5 together with b = 1 will ensure that all m integers are equally likely, independent of choice of seed. You also want a value of a that is really big, so that every time you multiply it by x_n you trigger the modulo and chop off a lot of digits, or else many numbers in a row will just be multiples of each other. As a result, two common values of a (for example) are 1566083941 and 1812433253 according to Knuth. The GNU C library happens to use a=1103515245 and b=12345. A list of values for lots of implementations is available at the Wikipedia page for LCGs.
Third, the linear congruential generator will actually repeat itself because of that modulo. This gets to be some pretty heady math, but the result of it all is happily very simple: The sequence will repeat itself after m numbers have been generated. In most cases, this means that your random number generator will repeat every 2^32 cycles. That sounds like a lot, but it really isn't for many applications. If you are doing serious numerical work with Monte Carlo simulations, this number is hopelessly inadequate.
A fourth much less obvious problem is that the numbers are actually not really random. They have a funny sort of correlation. If you take three consecutive integers, (x, y, z), from an LCG with some value of a and m, those three points will always fall on the lattice of points generated by all linear combinations of the three points (1, a, a^2), (0, m, 0), (0, 0, m). This is known as Marsaglia's Theorem, and if you don't understand it, that's okay. All it means is this: Triplets of random numbers from an LCG will show correlations at some deep, deep level. Usually it's too deep for you or me to notice, but it's there. It's possible to even reconstruct the first number in a "random" sequence of three numbers if you are given the second and third! This is not good for cryptography at all.
The good part is that LCGs like rand() are very, very low footprint. It typically requires only 32 bits to retain state, which is really nice. It's also very fast, requiring very few operations. These make it good for noncritical embedded systems, video games, casual applications, stuff like that.
PRNGs are a fascinating topic. Wikipedia is always a good place to go if you are hungry to learn more on the history or the various implementations that are around today.
rand returns numbers generated by a pseudo-random number generator (PRNG). The sequence of numbers it returns is deterministic, based on the value with which the PRNG was initialized (by calling srand).
The numbers should be distributed such that they appear somewhat random, so, for example, odd and even numbers should be returned at roughly the same frequency. The actual implementation of the random number generator is left unspecified, so the actual behavior is specific to the implementation.
The important thing to remember is that rand does not return random numbers; it returns pseudo-random numbers, and the values it returns are determined by the seed value and the number of times rand has been called. This behavior is fine for many use cases, but is not appropriate for others (for example, rand would not be appropriate for use in many cryptographic applications).
How does rand() work?
I have read that it has something to
do with time, also you get from
including time.h
rand() has nothing at all to do with the time. However, it's very common to use time() to obtain the "seed" for the PRNG so that you get different "random" numbers each time your program is run.
Also, does it have any tendencies
towards odd or even numbers or
something like that?
Depends on the exact method used. There's one popular implementation of rand() that alternates between odd and even numbers. So avoid writing code like rand() % 2 that depends on the lowest bit being random.
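The alternation is easy to reproduce with a toy generator. The Python sketch below assumes a linear congruential generator with an odd multiplier, an odd increment and a power-of-two modulus that returns its whole state; the function name lcg_stream and the constants are illustrative, not a description of any specific rand().

```python
def lcg_stream(seed, count, a=1103515245, b=12345, m=2**32):
    """Return count successive states of a toy linear congruential generator."""
    x = seed % m
    values = []
    for _ in range(count):
        x = (a * x + b) % m
        values.append(x)
    return values

values = lcg_stream(seed=42, count=8)
low_bits = [v % 2 for v in values]      # what rand() % 2 would look at
high_bits = [v >> 31 for v in values]   # top bit of the 32-bit state
print(low_bits)                         # [1, 0, 1, 0, 1, 0, 1, 0]
print(high_bits)
```

With such a generator the lowest bit simply flips on every call, so taking a coin flip from the high bits (or dividing the range instead of using the remainder) is the safer habit.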