Could you explain how to convert from lz77 to huffman? - gzip

Could you explain how to convert from lz77 to huffman on the example in the below picture?

Easy:
In the first step your output is essentially 3 numbers:
prev index
number of characters to repeat
next character (be it ascii or unicode)
The algorithm demands that you specify a sliding window up front. That means you know how big (1) and (2) can be at most.
In other words, you know how many bits (1) and (2) will take up.
Since (3) is essentially also a character from a fixed length alphabet, you also know the bit-length of (3)
That means it's safe to simply concatenate them.
So, the output of the first algorithm can be thought of as outputting a bit-sequence, where every item in the sequence has a fixed length.
That's ideal for applying huffman.
Of course the specifics are not mentioned, and you can choose from a lot of options.
normalized huffman table
1 on left-branch vs 0 on left-branch
priorities when merging items of similar count
etc
So I can not readily explain the exact output values you are showing.
But I hope I can at least explain how to get from A to B.

You can't. The coding shown is, well, figurative. Not literal. The symbols A, B, and C are all coded to the single bit 0. Obviously that's not going to be very helpful on the decoding end.

Related

SQL Query - hexadecimal vs. decimal values

I expected to find the answer to this question fairly quickly, but surprisingly, don't seem to see it anywhere.
I'm guessing that a comparison to a binary constant in an SQL query would be faster than a comparison to a decimal number, as the binary constant is probably a direct lookup while decimal numbers need to be converted, but is the performance difference measurable?
In other words, is the first query better than the second one? If so, how much better?
select *
from Cats
where Cats_Id = 0x0000000000000086
select *
from Cats
where Cats_Id = 134
There absolutely no difference: 0x0000000000000086 is an integer with a decimal value of 134. It's just written in base 16 (hexadecimal).
The two queries are exactly identical and will get exactly the same execution plan.
The one different will be if the column you are comparing to is binary(n) or varbinary(n). There the hex constant is representing a sequence of octets.
Your premise is based on a misunderstanding:
the binary constant is probably a direct lookup while decimal numbers need to be converted
The SQL you enter consists of characters in some text encoding; for simplicity, let's assume ASCII.
In the computer's memory, it's composed of a long string of binary states we normally write as 0 and 1, but could equally write as _ and |.
The binary data for the string 134 in ASCII looks something like __||___|__||__||__||_|__. The binary data for the string 0x0086 looks something like __||_____||||_____||______||______|||_____||_||_
When actually working with the data, e.g. comparing numbers, the computer will use a different representation altogether. The number "one hundred and thirty-four" will look something more like |____||_.
So whichever representation you use, there is a conversion going on.
Nonetheless, one conversion might be more efficient, by some incidental detail of its implementation, but by such a tiny margin that it would be almost impossible to measure amongst the noise of the system you're testing.
The answer is yes. In some cases, the query with the hexadecimal value is substantially better.
I hired a consultant DBA to help us with our system and after reviewing one of our queries, which was running a bit slow, he showed me that changing the value to hexadecimal improved it substantially (by about 95%).
He then showed me that, although I had an index on the field I was searching (binary foreign key), the index wasn't used when executing the query with the decimal value.
If anyone can provide a more detailed answer about the different cases in which these queries are identical or not, performance wise, I would appreciate that.

Karp-Rabin algorithm

The below image is from : 6.006-Introduction to algorithms,
While doing the course 6.006-Introduction to algorithms, provided by MIT OCW, I came across the Rabin-Karp algorithm.
Can anyone help me understand as to why the first rs()==rt() required? If it’s used, then shouldn’t we also check first by brute force whether the strings are equal and then move ahead? Why is it that we are not considering equality of strings when hashing is done from t[0] and then trying to find other string matches?
In the image, rs() is for hash value, and rs.skip[arg] is to remove the first character of that string assuming it is ‘arg’
Can anyone help me understand as to why the first rs()==rt() required?
I assume you mean the one right before the range-loop. If the strings have the same length, then the range-loop will not run (empty range). The check is necessary to cover that case.
If it’s used, then shouldn’t we also check first by brute force whether the strings are equal and then move ahead?
Not sure what you mean here. The posted code leaves blank (with ...) after matching hashes are found. Let's not forget that at that point, we must compare strings to confirm we really found a match. And, it's up to the (not shown) implementation to continue searching until the end or not.
Why is it that we are not considering equality of strings when hashing is done from t[0] and then trying to find other string matches?
I really don't get this part. Note the first two loops are to populate the rolling hashes for the input strings. Then comes a check if we have a match at this point, and then the loop updating the rolling hashes pair wise, and then comparing them. The entire t is checked, from start to end.

Big O Notation - input size

I am reading a blog abt big O notation on topcoder.
https://www.topcoder.com/community/data-science/data-science-tutorials/computational-complexity-section-1/
I have come across the below paragraph
Formal notes on the input size
What exactly is this "input size" we started to talk about? In the
formal definitions this is the size of the input written in some
fixed finite alphabet (with at least 2 "letters"). For our needs, we
may consider this alphabet to be the numbers 0..255. Then the "input
size" turns out to be exactly the size of the input file in bytes.
can anyone please explain what does this statement say?
it is the size of the input written in some
fixed finite alphabet (with at least 2 "letters"). For our needs, we
may consider this alphabet to be the numbers 0..255.
The statement is about the fundamental representation of information using symbols. The more symbols you use (the bigger the alphabet is), the more information you can represent with less characters although you can represent everything with just two "letters", i.e. one bit of information per character. Using the numbers 0..255 is equivalent to using 8 bit, i.e. one byte (2^8=256).
In computer programming, you normally use bytes but in theoretical computer science bits are used as they have the same capabilities (you just need more of them) and it makes proofs easier to write.
This statement means the following. You have to represent the input to process it by the algorithm, i.e. you have to "write it down". You can write the input down with letters (=symbols). The number of symbols have to be finite (or else you or the algorithm can not understand it), i.e. they comes from a fixed finite alphabet (=set of possible symbols). The size of input is that how many letters did you used to write down the input.
In the example mentioned in the text there is written that the alphabet contains the numbers between 0 and 255. This means that each letter can be written with an ASCII character. So, you can write down your input with ASCII characters. Each ASCII character can be stored in one byte, i.e. the size of input (=number of ASCII characters) is the number of bytes.
Let me explain by example.
Let's take, say, the factorization (sub)problem: given number n (not prime), find any of its divisors different from 1 and n. Clearly, we need to check at most sqrt(n) numbers to find one. Thus it seems to be a subpolynomial problem. Why it is considered a hard nut to crack then? That's because we usually need only log(n) digits to write down n, and we naturally want to resolve the problems which are "easy to write down". But although sqrt(n) may seem a little when compared to n, it's too much for us when compared to log(n).
That is the point why we need to say a word about "input alphabet" before talking of problem's complexity.

Problem 98 - Project Euler

The problem is as follows:
By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 36^(2). What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 96^(2). We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter.
Using words.txt (right click and 'Save Link/Target As...'), a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself).
What is the largest square number formed by any member of such a pair?
NOTE: All anagrams formed must be contained in the given text file.
I don't understand the mapping of CARE to 1296? How does that work? or are all permutation mappings meant to be tried i.e. all letters to 1-9?
All assignments of digits to letters are allowed. So C=1, A=2, R=3, E=4 would be a possible assignment ... except that 1234 is not a square, so that would be no good.
Maybe another example would help make it clear? If we assign A=6, E=5, T=2, then TEA = 256 = 16² and EAT = 625 = 25². So (TEA=256, EAT=625) is a square anagram word pair.
(Just because all assignments of digits to letters are allowed, does not mean that actually trying out all such assignments is the best way to solve the problem. There may be some other, cleverer, way to do it.)
In short: yes, all permutations need to be tried.
If you test all substitutions letter for digit, than you are looking for pairs of squares with properties:
have same length
have same digits with number of occurrences as in input string.
It is faster to find all these pairs of squares. There are 68 squares with length 4, 216 squares with length 5, ... Filtering all squares of same length by upper properties will generate 'small' number of pairs, which are solutions you are looking for.
These data is 'static', and doesn't depend on input strings. It can be calculated once and used for all input strings.
Hmm. How to put this. The people who put together Project Euler promise that there is a solution that is under one minute for every problem, and there is only one problem that I think might fail this promise, but this is not it.
Yes, you could permute the digits, and try all permutations against all squares, but that would be a very large search space, not at all likely to be the (TM) right thing. In general, when you see that your "look" at the problem is going to generate a search that will take too long, you need to search something else.
Like, suppose you were asked to determine what numbers would be the result of multiplying two primes between 1 and a zillion. You could factor every number between 1 and a zillion, but it might be faster to take all combinations of two primes and multiply them. Since you are looking at combinations, you can start with two and go until your results are too large, then do the same with three, etc. By comparison, this should be much faster - and you don't have to multiply all the numbers out, you could take logs of all the primes and then just add them and find the limit for every prime, giving you a list of numbers you could add up.
There are a bunch of innovative solutions, but the first one you think of - especially the one you think of when Project Euler describes the problem, is likely to be wrong.
So, how can you approach this problem? There are probably too many permutations to look at, but maybe you can figure out something with mappings and comparing mappings?
(Trying to avoid giving it all away.)

Storage algorithm question - verify sequential data with little memory

I found this on an "interview questions" site and have been pondering it for a couple of days. I will keep churning, but am interested what you guys think
"10 Gbytes of 32-bit numbers on a magnetic tape, all there from 0 to 10G in random order. You have 64 32 bit words of memory available: design an algorithm to check that each number from 0 to 10G occurs once and only once on the tape, with minimum passes of the tape by a read head connected to your algorithm."
32-bit numbers can take 4G = 2^32 different values. There are 2.5*2^32 numbers on tape total. So after 2^32 count one of numbers will repeat 100%. If there were <= 2^32 numbers on tape then it was possible that there are two different cases – when all numbers are different or when at least one repeats.
It's a trick question, as Michael Anderson and I have figured out. You can't store 10G 32b numbers on a 10G tape. The interviewer (a) is messing with you and (b) is trying to find out how much you think about a problem before you start solving it.
The utterly naive algorithm, which takes as many passes as there are numbers to check, would be to walk through and verify that the lowest number is there. Then do it again checking that the next lowest is there. And so on.
This requires one word of storage to keep track of where you are - you could cut down the number of passes by a factor of 64 by using all 64 words to keep track of where you're up to in several different locations in the search space - checking all of your current ones on each pass. Still O(n) passes, of course.
You could probably cut it down even more by using portions of the words - given that your search space for each segment is smaller, you won't need to keep track of the full 32-bit range.
Perform an in-place mergesort or quicksort, using tape for storage? Then iterate through the numbers in sequence, tracking to see that each number = previous+1.
Requires cleverly implemented sort, and is fairly slow, but achieves the goal I believe.
Edit: oh bugger, it's never specified you can write.
Here's a second approach: scan through trying to build up to 30-ish ranges of contiginous numbers. IE 1,2,3,4,5 would be one range, 8,9,10,11,12 would be another, etc. If ranges overlap with existing, then they are merged. I think you only need to make a limited number of passes to either get the complete range or prove there are gaps... much less than just scanning through in blocks of a couple thousand to see if all digits are present.
It'll take me a bit to prove or disprove the limits for this though.
Do 2 reduces on the numbers, a sum and a bitwise XOR.
The sum should be (10G + 1) * 10G / 2
The XOR should be ... something
It looks like there is a catch in the question that no one has talked about so far; the interviewer has only asked the interviewee to write a program that CHECKS
(i) if each number that makes up the 10G is present once and only once--- what should the interviewee do if the numbers in the given list are present multple times? should he assume that he should stop execting the programme and throw exception or should he assume that he should correct the mistake by removing the repeating number and replace it with another (this may actually be a costly excercise as this involves complete reshuffle of the number set)? correcting this is required to perform the second step in the question, i.e. to verify that the data is stored in the best possible way that it requires least possible passes.
(ii) When the interviewee was asked to only check if the 10G weight data set of numbers are stored in such a way that they require least paases to access any of those numbers;
what should the interviewee do? should he stop and throw exception the moment he finds an issue in the algorithm they were stored in, or correct the mistake and continue till all the elements are sorted in the order of least possible passes?
If the intension of the interviewer is to ask the interviewee to write an algorithm that finds the best combinaton of numbers that can be stored in 10GB, given 64 32 Bit registers; and also to write an algorithm to save these chosen set of numbers in the best possible way that require least number of passes to access each; he should have asked this directly, woudn't he?
I suppose the intension of the interviewer may be to only see how the interviewee is approaching the problem rather than to actually extract a working solution from the interviewee; wold any buy this notion?
Regards,
Samba