Big O Notation - input size

I am reading a blog about Big O notation on TopCoder.
https://www.topcoder.com/community/data-science/data-science-tutorials/computational-complexity-section-1/
I have come across the paragraph below:
Formal notes on the input size
What exactly is this "input size" we started to talk about? In the
formal definitions this is the size of the input written in some
fixed finite alphabet (with at least 2 "letters"). For our needs, we
may consider this alphabet to be the numbers 0..255. Then the "input
size" turns out to be exactly the size of the input file in bytes.
Can anyone please explain what this statement says?
it is the size of the input written in some
fixed finite alphabet (with at least 2 "letters"). For our needs, we
may consider this alphabet to be the numbers 0..255.

The statement is about the fundamental representation of information using symbols. The more symbols you use (the bigger the alphabet is), the more information you can represent with fewer characters, although you can represent everything with just two "letters", i.e. one bit of information per character. Using the numbers 0..255 is equivalent to using 8 bits, i.e. one byte (2^8 = 256).
In computer programming you normally use bytes, but in theoretical computer science bits are used, since they have the same capabilities (you just need more of them) and they make proofs easier to write.
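To see the trade-off in concrete terms, here is a toy comparison (plain Python, names of my own choosing) of how many "letters" the same number needs when the alphabet has 2 symbols versus 256; the alphabet size only changes the length by a constant factor, which Big O ignores anyway.

    # How many "letters" does the same value need in a 2-letter alphabet
    # (bits) versus a 256-letter alphabet (bytes)? Only a constant factor
    # (8) separates them, which is irrelevant for Big O purposes.
    def symbols_needed(value, alphabet_size):
        count = 1
        while value >= alphabet_size:
            value //= alphabet_size
            count += 1
        return count

    n = 1_000_000
    print(symbols_needed(n, 2))    # 20 letters (bits)
    print(symbols_needed(n, 256))  # 3 letters (bytes)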

This statement means the following. You have to represent the input in order for the algorithm to process it, i.e. you have to "write it down". You can write the input down with letters (= symbols). The number of symbols has to be finite (otherwise neither you nor the algorithm can understand it), i.e. they come from a fixed finite alphabet (= the set of possible symbols). The size of the input is how many letters you used to write it down.
In the example mentioned in the text, the alphabet contains the numbers between 0 and 255. This means that each letter can be written as a single byte (think of it as one extended-ASCII character). So you can write down your input as a sequence of bytes, and the size of the input (= the number of letters) is exactly the number of bytes.
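Under that convention, measuring the input size really is just counting bytes; a minimal sketch (the file name is made up):

    # With the 0..255 alphabet, the "input size" n used in complexity
    # statements is simply the length of the input in bytes.
    with open("input.dat", "rb") as f:   # hypothetical input file
        n = len(f.read())
    print("input size:", n, "bytes")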

Let me explain by example.
Let's take, say, the factorization (sub)problem: given a number n (not prime), find any of its divisors other than 1 and n. Clearly, we need to check at most sqrt(n) candidates to find one, so at first glance the problem seems cheap. Why is it considered a hard nut to crack, then? Because we only need about log(n) digits to write n down, and we naturally want to measure the cost against what is "easy to write down". Although sqrt(n) may seem small compared to n, it is far too much compared to log(n).
That is why we need to say a word about the "input alphabet" before talking about a problem's complexity.
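To make the factorization example concrete, here is a plain trial-division sketch (not a serious factoring algorithm). The loop runs up to about sqrt(n) times, but n itself is written with only about log(n) digits, so the running time is exponential in the input size.

    import math

    def find_divisor(n):
        """Return a divisor of n other than 1 and n (n assumed composite)."""
        # Checks at most about sqrt(n) candidates; since n is written with
        # roughly log10(n) digits, this is exponential in the input size.
        for d in range(2, math.isqrt(n) + 1):
            if n % d == 0:
                return d
        return None  # n was prime after all

    print(find_divisor(1296))  # 2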

Related

Could you explain how to convert from LZ77 to Huffman?

Could you explain how to convert from LZ77 to Huffman on the example in the picture below?
Easy:
In the first step your output is essentially 3 numbers:
prev index
number of characters to repeat
next character (be it ASCII or Unicode)
The algorithm demands that you specify a sliding window up front. That means you know how big (1) and (2) can be at most.
In other words, you know how many bits (1) and (2) will take up.
Since (3) is essentially also a character from a fixed-size alphabet, you also know the bit-length of (3).
That means it's safe to simply concatenate them.
So, the output of the first algorithm can be thought of as outputting a bit-sequence, where every item in the sequence has a fixed length.
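For instance (the window size and field widths here are made up for illustration), packing one such token into fixed-width bit fields could look like this:

    # Pack one LZ77 token into fixed-width bit fields. With a 4096-entry
    # window, (1) and (2) each fit in 12 bits; (3) is an 8-bit character.
    def pack_token(prev_index, length, next_char, index_bits=12, length_bits=12):
        assert prev_index < (1 << index_bits) and length < (1 << length_bits)
        return (f"{prev_index:0{index_bits}b}"
                f"{length:0{length_bits}b}"
                f"{ord(next_char):08b}")

    print(pack_token(5, 3, 'a'))  # a 32-bit string for this one token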
That's ideal for applying Huffman coding.
Of course the specifics are not mentioned, and you can choose from a lot of options.
normalized Huffman table
1 on the left branch vs 0 on the left branch
priorities when merging items of similar count
etc.
So I cannot readily explain the exact output values you are showing.
But I hope I can at least explain how to get from A to B.
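Since the picture isn't reproduced here, I can only sketch the general idea in code: treat each fixed-length LZ77 token as one symbol, count how often each occurs, and build a Huffman code over those symbols (the token values below are made up, not the ones from the picture).

    import heapq
    from collections import Counter

    # Hypothetical LZ77 output: (prev index, repeat length, next character).
    tokens = [(0, 0, 'a'), (0, 0, 'b'), (1, 1, 'a'), (2, 2, 'b'), (0, 0, 'a')]

    def huffman_codes(symbols):
        """Build a Huffman code (symbol -> bit string) from a symbol list."""
        freq = Counter(symbols)
        # Heap entries: (count, unique tie-breaker, {symbol: code so far}).
        heap = [(c, i, {s: ""}) for i, (s, c) in enumerate(freq.items())]
        heapq.heapify(heap)
        if len(heap) == 1:
            return {next(iter(heap[0][2])): "0"}
        while len(heap) > 1:
            c1, _, m1 = heapq.heappop(heap)
            c2, i2, m2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in m1.items()}
            merged.update({s: "1" + code for s, code in m2.items()})
            heapq.heappush(heap, (c1 + c2, i2, merged))
        return heap[0][2]

    codes = huffman_codes(tokens)
    print(codes)
    print("".join(codes[t] for t in tokens))  # the Huffman-coded stream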
You can't. The coding shown is, well, figurative. Not literal. The symbols A, B, and C are all coded to the single bit 0. Obviously that's not going to be very helpful on the decoding end.

Simulating regular expressions with deterministic finite automata and the size of the alphabet

I'm currently working my way through the "Dragon Book" (Compilers: Principles, Techniques, & Tools) and I'm kind of stuck at the lexical analysis chapter, which uses DFAs (Deterministic finite automata).
A DFA can be represented as a two-dimensional array: the first dimension is indexed by the state and the second by the transition symbol. This means that every DFA state has an entry for every symbol of the alphabet. The examples in the book use a small alphabet (usually two symbols), and they make the following note at the end of the chapter: "since a typical lexical analyzer has several hundred states in its DFA and involves the ASCII alphabet of 128 input characters, the array consumes less than a megabyte".
However, for matching strings I want to match all characters, i.e. the entire character set, and a lot of input files use UTF-8 encoding. This causes the alphabet, and thus the size of the DFA, to grow enormously.
This is the point where I'm stuck. How do lexical analyzers, or regular expression simulators in general, handle this?
Thanks!
I've had an epiphany on this problem. In lexical analysis, about the only time you want to match characters beyond the ASCII range is while doing wildcard matching, like in strings or comments. Because these characters are only used in wildcards, and never individually, all characters with a value of 128 or higher can be represented by a single 'other' value. The alphabet and the DFA remain small this way, while I am still able to use transition tables and match the entire Unicode character set.
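One way to picture that trick (my own naming, not any particular lexer generator's tables): run every input byte through a small classifier table first, so the DFA's second dimension is indexed by a handful of character classes rather than the whole Unicode range.

    # The DFA columns are character classes; a 256-entry table maps each
    # input byte to its class. Every byte >= 128 (UTF-8 lead/continuation
    # bytes) collapses into the single OTHER class.
    LETTER, DIGIT, SPACE, OTHER = range(4)

    byte_class = [OTHER] * 256
    for b in range(ord('a'), ord('z') + 1):
        byte_class[b] = LETTER
    for b in range(ord('A'), ord('Z') + 1):
        byte_class[b] = LETTER
    for b in range(ord('0'), ord('9') + 1):
        byte_class[b] = DIGIT
    for b in b" \t\r\n":
        byte_class[b] = SPACE

    # transition[state][class] -> next state: 4 columns, not 1,114,112.
    transition = [
        #  LETTER DIGIT SPACE OTHER
        [1,     2,    0,    3],  # 0: start
        [1,     1,    0,    3],  # 1: inside an identifier
        [3,     2,    0,    3],  # 2: inside a number
        [3,     3,    3,    3],  # 3: catch-all / error
    ]

    state = 0
    for byte in "abc 123".encode("utf-8"):
        state = transition[state][byte_class[byte]]
    print(state)  # 2: we ended inside a number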
Here's an interesting tool that converts a regular expression to a non-deterministic finite automaton.

Formula for checking the probability of a character appearing multiple times consecutively in an encrypted string

My question today is fairly specific and not so much about programming, more about statistics.
I asked myself whether there is a formula for how often a character is likely to appear multiple times in a row. I made the assumption that every printable character on the keyboard (95 of them) is equally likely to appear, so the formula would be something like:
(1/95)^n × 95 = (1/95)^(n-1)
(the × 95 factor applies if you do not fix which character it is and are happy with just any of them)
I am sorry for the eye-hurting formatting, but I did not know how to format it more clearly
Now that is kind of nice as a formula, but it is based on too many assumptions, and I am sure somebody has made more of it than an educated guess. Could you point me to a paper, a person, or just the formula?
EDIT: This may be different for different encryption algorithms. Up until now I have not delved into the realm of statistics in cryptography. If someone could provide a paper on that (specifically on character appearance probability), that would be nice as well.
Ideally, a cipher should produce ciphertext that is indistinguishable from random data. In fact, any cipher that does not meet this criterion is fundamentally weak.
In random data, each byte value is equally likely. An 8-bit byte can have 256 different values, so the probability of n consecutive bytes with the same value is (1/256)^(n-1).
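You can sanity-check that figure empirically; a rough simulation sketch (parameters picked arbitrarily):

    import random

    # Empirical check: in uniformly random bytes, the chance that the next
    # n-1 bytes all repeat the current one should be about (1/256)**(n-1).
    random.seed(0)
    n = 2                      # length of the run we look for
    trials = 1_000_000
    hits = 0
    for _ in range(trials):
        first = random.randrange(256)
        if all(random.randrange(256) == first for _ in range(n - 1)):
            hits += 1
    print(hits / trials, (1 / 256) ** (n - 1))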

What is the typical alphabet size of Finite State Machines?

Not quite sure if this is the correct forum, but it was suggested at Theoretical Computer Science that I move it here...
What is the typical alphabet size of Finite State Machines?
I am currently busy implementing a high-performance FA library and need to make some design decisions before continuing. My state space will be on the order of 2 147 483 647 (Integer.MAX_VALUE), which I feel is more than enough, even for non-general use. Now all that remains is the alphabet space.
Is there any merit in assuming that the alphabet would usually consist only of displayable characters (in which case a symbol could be stored as a byte, which would give really good performance)? Or should alphabet symbols rather be translated into Strings, so that you have alphabet labels? In that case I would need to keep a Map that translates a String into an int, short or byte, depending on how large I want to make it.
Really, the alphabet of a finite state machine is a mathematical 'set' of any type. There is nothing restricting the contents of the set; it could be 1's and 0's, A-Z, or apples-oranges. There is no 'typical' FSM alphabet size per se. Do you have a user in mind for your library?
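If you do go the labelled-alphabet route, the usual trick is to intern each label into a dense integer index once, so the transition table itself only ever sees small integers; a minimal sketch of the idea (in Python for brevity, not your library's API):

    # Intern arbitrary symbol labels ("apple", "orange", ...) into dense
    # integer indices so the transition table can stay a plain 2-D array.
    class Alphabet:
        def __init__(self):
            self._index = {}     # label -> small int
            self._labels = []    # small int -> label

        def intern(self, label):
            if label not in self._index:
                self._index[label] = len(self._labels)
                self._labels.append(label)
            return self._index[label]

    alphabet = Alphabet()
    print(alphabet.intern("apple"), alphabet.intern("orange"),
          alphabet.intern("apple"))  # 0 1 0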

Problem 98 - Project Euler

The problem is as follows:
By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 36^(2). What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 96^(2). We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter.
Using words.txt (right click and 'Save Link/Target As...'), a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself).
What is the largest square number formed by any member of such a pair?
NOTE: All anagrams formed must be contained in the given text file.
I don't understand the mapping of CARE to 1296. How does that work? Or are all permutation mappings meant to be tried, i.e. all letters to 1-9?
All assignments of digits to letters are allowed. So C=1, A=2, R=3, E=4 would be a possible assignment ... except that 1234 is not a square, so that would be no good.
Maybe another example would help make it clear? If we assign A=6, E=5, T=2, then TEA = 256 = 16² and ATE = 625 = 25². So (TEA=256, ATE=625) is a square anagram word pair.
(Just because all assignments of digits to letters are allowed, does not mean that actually trying out all such assignments is the best way to solve the problem. There may be some other, cleverer, way to do it.)
In short: yes, all permutations need to be tried.
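A tiny check of that example under one fixed assignment, just to make the idea concrete:

    import math

    def value(word, assignment):
        """Read a word as a number under a letter -> digit assignment."""
        return int("".join(str(assignment[c]) for c in word))

    assignment = {'A': 6, 'E': 5, 'T': 2}
    for word in ("TEA", "ATE"):
        v = value(word, assignment)
        print(word, v, math.isqrt(v) ** 2 == v)  # both are perfect squares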
If you test all substitutions of a digit for each letter, then you are looking for pairs of squares with these properties:
have the same length
have the same digits, with the same numbers of occurrences as the letters in the input string
It is faster to find all such pairs of squares first. There are 68 squares of length 4, 217 squares of length 5, ... Filtering all squares of the same length by the properties above yields a 'small' number of pairs, which are the solutions you are looking for.
This data is 'static' and doesn't depend on the input strings. It can be computed once and used for all input strings.
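A sketch of that precomputation (names are my own): group all squares of a given length by their multiset of digits, so that candidate square pairs can be looked up by signature instead of trying every assignment.

    from collections import defaultdict

    def squares_by_digit_signature(length):
        """Group all squares with `length` digits by their sorted digit string."""
        groups = defaultdict(list)
        lo, hi = 10 ** (length - 1), 10 ** length
        k = 1
        while k * k < lo:
            k += 1
        while k * k < hi:
            groups["".join(sorted(str(k * k)))].append(k * k)
            k += 1
        return groups

    groups = squares_by_digit_signature(4)
    # Signatures shared by more than one square give the candidate pairs.
    print(sum(1 for g in groups.values() if len(g) > 1))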
Hmm. How to put this. The people who put together Project Euler promise that there is a solution that runs in under one minute for every problem, and there is only one problem that I think might fail this promise, but this is not it.
Yes, you could permute the digits and try all permutations against all squares, but that would be a very large search space, and not at all likely to be the Right Thing(TM). In general, when you see that your "look" at the problem is going to generate a search that will take too long, you need to search over something else.
For example, suppose you were asked to determine which numbers are the product of two primes between 1 and a zillion. You could factor every number between 1 and a zillion, but it would be much faster to take all combinations of two primes and multiply them. Since you are looking at combinations, you can fix the smaller prime at 2 and multiply by larger and larger primes until the results are too large, then do the same with 3, and so on. By comparison, this should be much faster. You don't even have to multiply all the numbers out: you could take the logs of all the primes and just add them, checking against the limit for each smaller prime, which turns the multiplications into additions.
There are a bunch of innovative solutions, but the first one you think of, especially the one you think of as Project Euler describes the problem, is likely to be wrong.
So, how can you approach this problem? There are probably too many permutations to look at, but maybe you can figure out something with mappings and comparing mappings?
(Trying to avoid giving it all away.)