Simulating regular expressions with deterministic finite automata and the size of the alphabet - finite-automata

I'm currently working my way through the "Dragon Book" (Compilers: Principles, Techniques, & Tools) and I'm kind of stuck at the lexical analysis chapter, which uses DFAs (Deterministic finite automata).
A DFA is a two-dimensional array, the first dimension contains the state and the second the transition symbols. This means that every DFA state contains all the symbols of the language. The examples in the book use a small language (usually two symbols), and they make the following note at the end of the chapter: "since a typical lexical analyzer has several hundred states in its DFA and involves the ASCII alphabet of 128 input characters, the array consumes less than a megabyte".
However, for matching strings I want to match all characters, which means the entire character set, and a lot of input files use UTF-8 encoding. This causes the alphabet, and thus the size of the DFA, to rise enormously.
This is the point where I'm stuck. How are lexical analyzers, or regular expression simulators in general handling this?
Thanks!

I've had an epiphany on this problem. In lexical analysis, about the only time you want to match characters beyond the ASCII range is while doing wildcard matching, like in strings or comments. Because these are only used in wildcards, and not individually, all the characters with a value of 128 or higher can be represented as a single 'other' value. The alphabet and DFA remain small this way, while I am still able to use transition tables and match the entire unicode charset.

Here's an interesting tool that converts a regular expression to non-deterministic finite automata.

Related

How can one define a language which does not fit in the Chomsky Hierarchy?

I'm asking this question because I've stumbled across the accepted answer of Chomsky Language Types
This quote is referring to Type-0 Grammars:
This means that if you have a language that is more expressive than
this type (e.g. English), you cannot write an algorithm that can list
each an every (and only these) words of the language
As far as I know:
There is no mathematical description for what English is so it is meaningless to argue about where it lands in the hierarchy of formal languages.
If there was, then English would certainly be recognizable by some Type-0 Grammar by virtue of it being defined by a finite amount of reasoning - where it be axioms, a grammar, anything. (If not - how could've someone define it if not by a finite amount of steps?)
Hence:
We can't start talking about how 'expressive' a grammar needs to be to generate precisely an unknown mathematical object
Therefore my problem:
How can one define a language which does not fit in the Chomsky Hierarchy?
If (?) it takes a finite amount of steps for mathematicians to define
sets with cardinalities that do not make them recursively enumerable - then grammars must exist which are more expressive than Type-0 since they (mathematicians) have followed a finite amount of rules (production rules if you will) to produce a non-RE set. Where are they?
A language is a possibly-infinite set of finite words written with some finite alphabet. Since the alphabet is finite and the length of each word is finite, the words of any language are enumerable, in the sense that there exists an enumeration. In other words, the size of any language is at most countably infinite.
However, since any subset of the Kleene closure of the alphabet is a language, the number of languages is not countably infinite. Hence, there is no enumeration of languages.
The Chomsky hierarchy is based on a formalism which can be expressed as a finite sentence with a finite alphabet (the same alphabet as the language being described, plus a couple of extra symbols). [Note 1] So the number of possible Type 0 grammars is countably infinite, and there cannot be a correspondence between the set of grammars and the set of languages.
However. The existence of languages (i.e. sets) for which no generative grammar exists does not necessarily mean that there is some other way of describing these languages which is "more expressive" than generative grammars. Any description which can be written as a finite string using a finite alphabet can only describe a countable infinity of sets. Whether or not it is the same countable infinity will depend on the formalisms, and in general there will be no algorithm which can demonstrate homomorphism. But some equivalences are known (such as the equivalence with Turing machines, which is a particularly interesting equivalence).
So, we have an interesting little conundrum, which is (of course) related to Gödel's Incompleteness Theorems. That is, there are more languages than ways of describing a language, no matter what system we use to describe a language. So the question "How do we describe a language for which no description is available?" does not have a good answer (and if we answer it, by calling some set "Sue", then there will still be an uncountable infinitude of possible sets for which no name exists).
While all this foraging into infinitudes is interesting, it has a few issues:
It has very little (if anything) to do with programming, so it's questionable whether it's on topic for StackOverflow.
Kurt Gödel and Georg Cantor, the two mathematicians responsible for most of the concepts in this answer, both suffered from severe depression. Just saying.
Notes
Although at first glance it might appear that the alphabet for a Type 0 grammar might be arbitrarily larger than the alphabet of the language being described, that is not actually the case. The grammar's alphabet consists of the target alphabet plus a finite set of non-terminals plus an → symbol; the non-terminals can be written using numbers in any convenient base, say binary. So only three additional symbols are required (and you could reduce that to two by arbitrarily designating one of the non-terminal numbers to be the arrow). (It might seem like you need a third symbol to delimit the names of non-terminals, but you can use a fibonacci encoding to produce codes which always start with a 1 and never include two 1s, so that you can use an extra 1 at the beginning to unambiguously mark the start of the symbol.)

OpenNLP splits our sentences in half along special characters

We're facing an issue while processing text extracted from PDF documents (the content of which we do not have control over).
Most of our text data happen to have sections which pose a challenge for OpenNLP which we use for detecting sentences for further processing. We are using the en-sent.bin model file from the OpenNLP website.
For example, one can often encounter GPS coordinates like 40° 43.554’ N, 73° 59.814’ W in these texts, where OpenNLP believes anything after a ’ character must belong to a new sentence.
This results in unwanted splitting of some of our sentences, for which we'd like to a find a solution or workaround.
The above character turns out not to be a regular single quote (U+0027), but one called 'RIGHT SINGLE QUOTATION MARK' (U+2019 or 0xE2 0x80 0x99 in hex). It looks like the sentence that contains the coordinates is split exactly along these.
We don't know how the en-sent.bin Sentence Detector model is trained or what character encoding it is working with (our input is UTF-8), as we found no such information in the documentation of OpenNLP (despite it mentioning that the character encoding to be used is specified during training of the model).
Filtering out such characters (i.e. all of those along which the splits happen) as a solution was dismissed, since we can't know for sure which ones are affected and it might also introduce the very similar problem of accidentally joining two sentences.
Since our team is highly inexperienced with OpenNLP, we're struggling to fix this. We have so far identified what we believe to be two candidate causes for the unwanted split, which I'd rather not post unless absolutely necessary, in order not to affect your thinking.
Please note that I'm obliged not to include our source code or the exact data we're feeding as those are highly confidential and the latter may contain personal or otherwise protected information.

Big O Notation - input size

I am reading a blog abt big O notation on topcoder.
https://www.topcoder.com/community/data-science/data-science-tutorials/computational-complexity-section-1/
I have come across the below paragraph
Formal notes on the input size
What exactly is this "input size" we started to talk about? In the
formal definitions this is the size of the input written in some
fixed finite alphabet (with at least 2 "letters"). For our needs, we
may consider this alphabet to be the numbers 0..255. Then the "input
size" turns out to be exactly the size of the input file in bytes.
can anyone please explain what does this statement say?
it is the size of the input written in some
fixed finite alphabet (with at least 2 "letters"). For our needs, we
may consider this alphabet to be the numbers 0..255.
The statement is about the fundamental representation of information using symbols. The more symbols you use (the bigger the alphabet is), the more information you can represent with less characters although you can represent everything with just two "letters", i.e. one bit of information per character. Using the numbers 0..255 is equivalent to using 8 bit, i.e. one byte (2^8=256).
In computer programming, you normally use bytes but in theoretical computer science bits are used as they have the same capabilities (you just need more of them) and it makes proofs easier to write.
This statement means the following. You have to represent the input to process it by the algorithm, i.e. you have to "write it down". You can write the input down with letters (=symbols). The number of symbols have to be finite (or else you or the algorithm can not understand it), i.e. they comes from a fixed finite alphabet (=set of possible symbols). The size of input is that how many letters did you used to write down the input.
In the example mentioned in the text there is written that the alphabet contains the numbers between 0 and 255. This means that each letter can be written with an ASCII character. So, you can write down your input with ASCII characters. Each ASCII character can be stored in one byte, i.e. the size of input (=number of ASCII characters) is the number of bytes.
Let me explain by example.
Let's take, say, the factorization (sub)problem: given number n (not prime), find any of its divisors different from 1 and n. Clearly, we need to check at most sqrt(n) numbers to find one. Thus it seems to be a subpolynomial problem. Why it is considered a hard nut to crack then? That's because we usually need only log(n) digits to write down n, and we naturally want to resolve the problems which are "easy to write down". But although sqrt(n) may seem a little when compared to n, it's too much for us when compared to log(n).
That is the point why we need to say a word about "input alphabet" before talking of problem's complexity.

Impossible Finite State Automata?

I fully believe this non deterministic FSA is not possible from all of my attempts. The FSA (Non deterministic): A language is made up of an alphabet of only 2's and 3's within strings that have only an odd number of digits (223, 32232) and the sum of the digits must be divisible by 5. (Final inclusion examples: 22222, 33333, 2222322).
Would someone be able to construct this non deterministic FSA with acceptance states graphically? I would be very impressed because from all of both my attempts and also a colleague of mine, the only result is that it cannot be done.
First, I think it is called NFA not FSA. Any regular expression can be converted to an NFA. But not all languages can be specified by REs. Yours may be such an example. Here two simple examples: REs cannot be used to check whether parentheses are balanced or to check whether a string has an equal number of A's and B's. So if you can find an RE for your problem, you are done.

What is the typical alphabet size of Finite State Machines?

Not quite sure if this is the correct forum, but it was suggested at Theoretical Computer Science that I move it here...
What is the typical alphabet size of Finite State Machines?
I am currently busy implementing a high-performance FA library and need to make some design considerations before continuing. My state space will be in the order of 2 147 483 647 (Integer.MAX_VALUE) which I feel is more than enough, even for non-general use. Now, all that remains is the alphabet space.
Is there any merit in assuming that the alphabet would usually only consist of all displayable characters (in which case it can be stored as a byte which would result in really good performance)? Or should alphabet symbols rather be translated into Strings so that you rather have alphabet labels? In this case I would need to keep a Map that translates a String into either a int, short or byte, depending on how large I want to make it.
Really the alphabet of a finite state machine is a mathematical 'set' of any type. There is nothing restricting the content of the set, it could be 1's and 0's, A-Z, or apples-oranges. There is no 'typical' FSM alphabet size as per se. Do you have a user in mind for your library?