I've run across several blogs, such as this one: http://mortoray.com/2012/07/20/why-i-dont-use-a-parser-generator/, in which "recursive descent parsing" is used to refer to handwritten parsers as opposed to parser generators like ANTLR.
To me "recursive descent parsing" and ANTLR are 2 different things, one is a general parsing theory while the other is an exact technology. But I am wondering why, it seems quite popular, people are mixing/comparing them together?
Recursive descent parsers are a specific subset of top-down (LL) parsers. They are what programmers typically build by hand, because that is the natural way to express a grammar when writing the code manually. Tools can generate all sorts of funny machines. ANTLR's goal for the last 25 years has been to generate what programmers build by hand, which means that it generates recursive descent parsers. Necessarily the generated parsers are more complicated, because they are not hand-optimized by a human.
I would assume it's just because handwritten parsers tend to be recursive descent: that form follows closely from the [E]BNF definition and is very easy to verify manually and, if necessary, to debug. Conversely, most parser generators (Bison, Yacc, and the rest) produce table-driven parsers rather than recursive descent ones.
So you're right that the comparison is strictly an approach to parsing versus a tool for parser generation, but somewhere along the way "recursive descent" and "handwritten" have become idiomatic synonyms.
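To make the handwritten/recursive-descent connection concrete, here is a tiny sketch of my own (an illustrative expression grammar, not anything from the blog post): each [E]BNF rule maps directly onto one function that calls the others recursively.

```python
import re

# Grammar (roughly in EBNF):
#   expr := term { ("+" | "-") term }
#   term := NUMBER | "(" expr ")"
# Each rule becomes one method below.

def tokenize(src):
    return re.findall(r"\d+|[()+\-]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.tokens[self.pos]
            self.pos += 1
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        if self.peek() == "(":
            self.pos += 1
            value = self.expr()
            assert self.peek() == ")", "expected ')'"
            self.pos += 1
            return value
        tok = self.tokens[self.pos]
        self.pos += 1
        return int(tok)

print(Parser(tokenize("1 + (2 - 3) + 4")).expr())  # -> 4
```

The point is that the code structure mirrors the grammar one-to-one, which is exactly why this is the style people reach for when writing a parser by hand.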
Chunking, or shallow parsing, segments a sentence into a sequence of syntactic constituents or chunks, i.e. sequences of adjacent words grouped on the basis of linguistic properties. It is often described as an efficient and robust approach to parsing natural language and as a popular alternative to full parsing. But in which scenarios would chunking be the more appropriate technique than full parsing?
This is nothing more than my own personal bias, but if for some reason you only need to detect noun and/or verb phrases, you might often be better off with chunking. E.g., for document clustering, topic tagging, or simply identifying keywords, NP or VP chunking can be more than sufficient. Also, if you need to work with a language for which no treebanks exist, you might want to fall back to chunking.
Chunking typically has the advantage of being orders of magnitude faster than deep parsing, although modern (perceptron/neural) parsers are much faster than the deep parsers of five or ten years ago. Even today, though, deep parsing can choke on long sentences. And, obviously, annotating treebanks to train a deep parser is more costly than annotating NP/VP phrases or even just building a rule-based chunker, particularly if you need to detect phrases in non-English texts.
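To make the rule-based chunking option concrete, here is a minimal sketch using NLTK's RegexpParser; the grammar and sentence are purely illustrative, and it assumes the NLTK tokenizer and tagger data have already been downloaded.

```python
import nltk

# A toy noun-phrase grammar: optional determiner, any adjectives,
# then one or more nouns. Requires nltk.download() of the punkt
# tokenizer and the default POS tagger models.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

sentence = "The quick brown fox jumps over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunker.parse(tagged)

# Print only the NP chunks found by the shallow parser.
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
```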
As I understand it, the latest ANTLR 4 moved away from building static DFA tables for the lexer and the parser, and now does this work at runtime. Is this correct? Could someone please explain, in general terms, how ANTLR 4 works?
Our recent paper describes the mechanism in excruciating, academic detail. Section 3 provides a high-level overview:
Instead of relying on static grammar analysis, an ALL(*) parser adapts to the input sentences presented to it at parse time. The parser analyzes the current decision point (nonterminal with multiple productions) using a GLR-like mechanism to explore all possible decision paths with respect to the current “call” stack of in-process nonterminals and the remaining input on-demand. The parser incrementally and dynamically builds a lookahead DFA per decision that records a mapping from lookahead sequence to predicted production number. If the DFA constructed to date matches the current lookahead, the parser can skip analysis and immediately expand the predicted alternative.
We use an augmented transition network (ATN) to represent the grammar but build DFA using an algorithm very similar to the subset-construction algorithm of NFA-to-DFA conversion fame.
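As a rough, hypothetical illustration of the caching idea described above (this is not ANTLR's actual code or data structures, and it simplifies the variable-depth lookahead to a fixed window): each decision point remembers which lookahead sequences led to which predicted alternative, and falls back to the expensive analysis only on a cache miss.

```python
# Hypothetical sketch of per-decision lookahead caching, loosely in the
# spirit of ALL(*). NOT ANTLR's real algorithm; real ALL(*) lookahead is
# variable-depth, here it is fixed at max_k tokens for simplicity.

def full_prediction_analysis(tokens, pos, max_k):
    """Stand-in for the expensive GLR-like exploration of all
    alternatives; returns (lookahead_sequence, predicted_alternative)."""
    lookahead = tuple(tokens[pos:pos + max_k])
    predicted = 1  # pretend the analysis chose alternative 1
    return lookahead, predicted

class Decision:
    """One decision point: a nonterminal with multiple productions."""

    def __init__(self):
        # Incrementally built map: lookahead sequence -> predicted alternative.
        self.lookahead_cache = {}

    def adaptive_predict(self, tokens, pos, max_k=3):
        key = tuple(tokens[pos:pos + max_k])
        if key in self.lookahead_cache:        # "DFA" hit: skip analysis
            return self.lookahead_cache[key]
        lookahead, alt = full_prediction_analysis(tokens, pos, max_k)
        self.lookahead_cache[lookahead] = alt  # remember for later sentences
        return alt

d = Decision()
print(d.adaptive_predict(["id", "=", "expr", ";"], 0))  # miss: runs analysis
print(d.adaptive_predict(["id", "=", "expr", ";"], 0))  # hit: uses the cache
```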
Hope this helps.
Every now and then, I hear someone saying things like "functional programming languages are more mathematical". Is it so? If so, why and how? Is, for instance, Scheme more mathematical than Java or C? Or Haskell?
I cannot define precisely what "mathematical" means here, but I believe you get the idea.
Thanks!
There are two common(*) models of computation: the Lambda Calculus (LC) model and the Turing Machine (TM) model.
Lambda Calculus approaches computation by representing it using a mathematical formalism in which results are produced through the composition of functions over a domain of types. LC is also related to Combinatory Logic, which is considered a more generalized approach to the same topic.
The Turing Machine model approaches computation by representing it as the manipulation of symbols stored on idealized storage using a body of basic operations (like addition, mutation, etc).
These different models of computation are the basis for different families of programming languages. Lambda Calculus has given rise to languages like ML, Scheme, and Haskell. The Turing Model has given rise to C, C++, Pascal, and others. As a generalization, most functional programming languages have a theoretical basis in lambda calculus.
Due to the nature of Lambda Calculus, certain proofs are possible about the behavior of systems built on its principles. In fact, provability (i.e. correctness) is an important concept in LC, and makes possible certain kinds of reasoning and conclusions about LC systems. LC is also related to (and relies on) type theory and category theory.
By contrast, Turing models rely less on type theory and more on structuring computation as a series of state transitions in the underlying model. Turing Machine models of computation are more difficult to make assertions about and do not lend themselves to the same kinds of mathematical proofs and manipulation that LC-based programs do. However, this does not mean that no such analysis is possible - some important aspects of TM models are used when studying virtualization and static analysis of programs.
Because functional programming relies on careful selection of types and transformation between types, FP can be perceived as more "mathematical".
(*) Other models of computation exist as well, but they are less relevant to this discussion.
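A small toy illustration of the two styles (my own sketch, not taken from either formal model): the same computation expressed as a composition of pure functions versus as repeated mutation of a piece of state.

```python
# "Lambda calculus flavoured": the result is a composition of pure functions.
def square(x):
    return x * x

def sum_of_squares(xs):
    return sum(map(square, xs))

# "Turing machine flavoured": the result is produced by repeatedly
# mutating an accumulator.
def sum_of_squares_imperative(xs):
    acc = 0
    for x in xs:
        acc += x * x
    return acc

assert sum_of_squares([1, 2, 3]) == sum_of_squares_imperative([1, 2, 3]) == 14
```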
Pure functional programming languages are examples of a functional calculus and so in theory programs written in a functional language can be reasoned about in a mathematical sense. Ideally you'd like to be able to 'prove' the program is correct.
In practice such reasoning is very hard except in trivial cases, but it's still possible to some degree. You might be able to prove certain properties of the program, for example you might be able to prove that given all numeric inputs to the program, the output is always constrained within a certain range.
In non-functional languages with mutable state and side effects attempts to reason about a program and 'prove' correctness are all but impossible, at the moment at least. With non-functional programs you can think through the program and convince yourself parts of it are correct, and you can run unit tests that test certain inputs, but it's usually not possible to construct rigorous mathematical proofs about the behaviour of the program.
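As a toy illustration of the kind of property mentioned above (my own sketch, not a formal proof): a pure function whose output can be argued, and spot-checked, to always lie in a fixed range.

```python
def clamp01(x):
    """A pure function: for any numeric input, the result is in [0, 1].
    With no hidden state or side effects, that property can be argued
    directly from the definition and holds for every call."""
    return max(0.0, min(1.0, x))

# A quick empirical spot check of the range property (a real proof
# would reason from the definitions of min and max instead).
for x in (-1e9, -0.5, 0.0, 0.3, 1.0, 1e9):
    assert 0.0 <= clamp01(x) <= 1.0
```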
I think one major reason is that pure functional languages have no side effects, i.e. no mutable state; they only map input parameters to result values, which is exactly what a mathematical function does.
The logical structure of functional programming is heavily based on the lambda calculus. While it may not appear mathematical if you only have algebraic forms of math in mind, it translates very naturally from discrete mathematics.
In comparison to imperative programming, it doesn't prescribe exactly how to do something, but rather what must be done. This reflects topology.
The mathematical feel of functional programming languages comes from a few different features. The most obvious is the name: "functional", i.e. built on functions, which are fundamental to math. The other significant reason is that functional programming involves defining a collection of things that will always be true, and their interactions achieve the desired computation; this is similar to how mathematical proofs are done.
I've been out of the modeling biz, so to speak, for a while now. When I was in college, most of the models I worked with were written in FORTRAN, which I never liked. I'm looking to get back into science, so I'm wondering if there are modern languages with feature sets suited for this kind of application. What would you consider to be an optimal language for simulating complex physics systems?
Fortran was certainly the absolute ruler here, but Python is being used more and more for exactly this purpose. While it is very hard to say which language is the BEST for this, I've found Python pretty useful for physics simulations and physics education.
It depends on the task
C++ is good at complicated data structures, but it is bad at slicing and multiplying matrices. (These tasks require you to spend a lot of time writing for loops.)
FORTRAN has a nice notation for slicing and multiplying matrices, but it is clumsy for creating complicated data structures such as graphs and linked lists.
Python/SciPy has a nice notation for everything, but Python is an interpreted language, so it is slow at certain tasks.
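For example, the array notation that makes this convenient looks roughly like the following in Python with NumPy (a small illustrative sketch):

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)   # a 3x4 matrix
b = np.ones((4, 2))                 # a 4x2 matrix

sub = a[:, 1:3]            # slice columns 1-2 of every row, no explicit loops
prod = a @ b               # matrix multiplication in one expression
col_means = a.mean(axis=0)          # reduce along an axis

print(sub.shape, prod.shape, col_means)
```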
Some people are interested in languages like CUDA that allow you to use your GPU to speed up your simulations.
In the molecular dynamics community C++ seems to be popular, because you need somewhat complicated data structures to represent the shapes of the molecules.
I think it's arguable that FORTRAN is still dominant when it comes to solving large-scale problems in physics, as long as we're talking about serial calculations.
I know that parallelization is changing the game. I'm less certain about whether or not parallelized versions of LINPACK and other linear algebra packages are still written in FORTRAN.
A lot of engineers are using MATLAB and Mathematica these days, because they combine numerical and graphics capabilities.
I'd also point out that there's a difference between calculation and display engines. The former might still be written in FORTRAN, but the latter may be using more modern languages and OpenGL.
I'm also unsure about how much modeling has crept into biology. Physical chemistry might be a very different animal altogether.
If you write a terrific parallel linear algebra package in Scala or F# or Haskell that performs well, the world will beat a path to your door.
Python + Matplotlib + NumPy + α
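As a minimal sketch of what that stack looks like in practice (illustrative only, with made-up parameter values): integrating a damped harmonic oscillator with NumPy and plotting the result with Matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

# Damped harmonic oscillator x'' + 2*zeta*omega*x' + omega**2 * x = 0,
# integrated with a simple semi-implicit Euler scheme.
omega, zeta, dt, steps = 2.0 * np.pi, 0.05, 1e-3, 5000
x, v = 1.0, 0.0
xs = np.empty(steps)
for i in range(steps):
    v += (-2.0 * zeta * omega * v - omega**2 * x) * dt
    x += v * dt
    xs[i] = x

plt.plot(np.arange(steps) * dt, xs)
plt.xlabel("time [s]")
plt.ylabel("displacement")
plt.show()
```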
The nuclear/particle/high-energy physics community has moved heavily toward C++ (in part due to ROOT and Geant4), with some interest in Python (because it has ROOT bindings).
But you'll note that this is sub-discipline dependent..."physics" and "modeling" are big, broad topics, so there is no one answer.
Modelica is a specialized language for modeling (and simulating) physical systems. OpenModelica is an open source implementation of Modelica.
Python is very popular among science-oriented people, as is Matlab. The issue with these is that they are both VERY slow (to run). If you want to do large simulations that may take minutes/hours/days, you're going to have to pick another language.
As long as you are picking a language for speed, suck it up and use C/C++, maybe with CUDA depending on your needs.
Final thought, though: if it takes you two days longer to write and debug your model in C than in Python, and the resulting code takes 10 minutes to run instead of an hour, have you really saved any time?
There's also a lot of capability in MATLAB, especially when interfacing your simulations with hardware or if you need your results visualised.
I'll chime in with Python but you should also look to R for any statistical work you may need to do. You should really be asking more about what packages for which languages to use rather than the language itself.
Do Lex and Yacc produce optimized code, or do we need to write our own code by hand for higher performance?
The code you write has a substantial effect on the speed, especially on the lexer side. Some versions of Flex come with half a dozen (or so) different word counters, most written with Flex, and a few written by hand -- the code gives a pretty good idea of how to optimize scanning speed (and a fairly reasonable comparison of what you can expect in hand-written vs. machine generated lexers).
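To make "hand-written lexer" concrete, here is a tiny hand-rolled word counter in Python (purely illustrative; the examples shipped with Flex that are referred to above are C programs):

```python
def count_words(text):
    """A hand-written scanner in the spirit of the classic word-count
    examples: walk the input once, tracking word/line/char counts."""
    in_word = False
    words = lines = chars = 0
    for ch in text:
        chars += 1
        if ch == "\n":
            lines += 1
        if ch.isspace():
            in_word = False
        elif not in_word:
            in_word = True
            words += 1
    return words, lines, chars

print(count_words("two words\nand two more\n"))  # -> (5, 2, 23)
```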
On the parser side, you're generally a bit more constrained -- you can't make as many changes without affecting the semantics. Here, a great deal depends on the parser generator you use -- for example, some algorithms for less constrained grammars consistently produce relatively slow parsers, while other algorithms only slow down for the less constrained constructs, and run relatively fast as long as the input doesn't use the more complex constructs.
Yacc produces a table-driven parser, which can never be as fast as a well-written hand-coded one. I don't know if the same applies to lex.