Tools for generating a grammar from examples? - grammar

This answer shows a nice example of using a parser generator to scan text for patterns of interest. In that example, the patterns are product prices.
Does anyone know of tools that generate a grammar from training examples (a document plus the info I want from it)? I found a couple of papers, but no tools. I looked through the ANTLR docs a bit, but ANTLR deals with grammars: a "recognizer" takes a grammar as input, not training examples.

This is a machine learning problem. You can at best get an approximation. But I don't think anybody has done this well, let alone released a tool. (I actively track what people do to build grammars for computer languages, and this idea has been proposed many times, but I have yet to see a useful implementation).
The problem is that for any fixed set of examples, there's a huge number of possible grammars. It is easy to construct a naive one: for the fixed set of examples, simply propose a grammar that has one rule to recognize each example. That works, but is hardly helpful. Now the question is, how many ways can you generalize this, and which one is the best? In fact you can't know, because your next new example may be a total surprise in terms of structure. (Theory definition: A language is the set of sentences that comprise it).
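To make the "one rule per example" point concrete, here is a minimal sketch. The names `naive_grammar` and `accepts` and the price strings are made up for illustration, and tokens are just whitespace-separated words. It accepts exactly the training examples and nothing else, which is why it is not much use:

```python
def naive_grammar(examples):
    """Build a grammar with one production S -> tok1 tok2 ... per example."""
    rules = []
    for example in examples:
        rhs = tuple(example.split())
        if rhs not in rules:  # skip duplicate examples
            rules.append(rhs)
    return {"S": rules}


def accepts(grammar, sentence):
    """A sentence is accepted only if it matches some rule token for token."""
    return tuple(sentence.split()) in grammar["S"]


examples = ["price : $ 10", "price : $ 25"]
g = naive_grammar(examples)

print(accepts(g, "price : $ 10"))  # a training example
print(accepts(g, "price : $ 99"))  # same structure, but never seen
```

Any useful generalization (say, replacing the last token with a NUMBER rule) is a guess about structure that the examples alone cannot confirm.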
We haven't even talked about the simpler problem of learning the lexemes of the language. How would you propose to learn what legal strings for floating point numbers are?

One tool that does this is NLTK. I highly recommend it, and the O'Reilly book that covers it is available free online. There are tools for parsing, learning grammars, and so on. The only downside is that it is mainly a research tool rather than a production tool, so the emphasis isn't on performance.
NLTK is able to construct a grammar from labeled training samples, which is exactly what you are asking for. Have a look at the great docs and the book. (In my last experience with it, it also worked on the JVM through Jython without any issues.)
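The idea behind learning a grammar from labeled samples can be sketched without NLTK itself: read the productions off each labeled tree and estimate rule probabilities by relative frequency. This is only an illustration of the principle. The nested-tuple tree format, the labels, and the function names below are assumptions of mine, not NLTK's API:

```python
from collections import Counter, defaultdict


def productions(tree):
    """Yield (lhs, rhs) pairs from a tree written as (label, child, ...).

    Children are either sub-trees (tuples) or terminal strings; terminals
    are quoted in the rhs so they cannot be confused with labels.
    """
    label, *children = tree
    rhs = tuple(f"'{c}'" if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def induce_grammar(trees):
    """Map each observed production to its relative frequency for its lhs."""
    counts = Counter(p for t in trees for p in productions(t))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {(lhs, rhs): n / lhs_totals[lhs] for (lhs, rhs), n in counts.items()}


# Two hand-labeled training samples (hypothetical data).
trees = [
    ("S", ("ITEM", "widget"), ("PRICE", "$", ("NUM", "10"))),
    ("S", ("ITEM", "gadget"), ("PRICE", ("NUM", "25"), "USD")),
]
grammar = induce_grammar(trees)
for (lhs, rhs), prob in sorted(grammar.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{prob:.2f}]")
```

Note that this only works because the samples are labeled with structure. It does not discover the structure from raw text, which is the hard part described in the other answer.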

Related

Information about Evolutionary Art

I have begun to do some research on Evolutionary Art algorithms. I have read a lot of documents about them, but they are not easy to understand.
The website http://picbreeder.com is a great example of this. But I don't need something like that at the beginning because it is too complex.
Where can I find some simple code for this in Java? I think reading code would help me a lot.
Thanks!!!
"Evolutionary Design by Computers" by Peter Bentley has a couple of chapters on evolutionary art. However, I don't think it includes any code or pseudocode. The canonical GA should do all you need, but the termination condition could be tricky, as art is subjective (not objective, and therefore not enumerable).
Hope this helps...
It looks like the ECJ library might help you out, and a number of open source projects and tools come up if you Google for "java evolutionary computing."
I don't know how simple it is, and believe me it needs to be cleaned up a bit, but I have something that might get you started at https://github.com/murmux/Evo/tree/master/assignment2c It doesn't deal with art, rather game theory, but you can use it under the terms of the GPLv3 if you'd like. That uses Genetic Programming... I have another example using a more vanilla EA I might put up later.
Instead of evolving programs to play "Iterated Prisoner's Dilemma," you would evolve programs to generate an art work. The fun part is coming up with a way to "score" an image for its fitness. (Although Picbreeder seems to skip scoring by having you pick the mating pool directly...)
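For a feel of how the pieces fit together, here is a minimal sketch of a vanilla EA (in Python rather than Java, to keep it short). The big assumption is the fitness function: it scores an image by closeness to a fixed target, which sidesteps the subjective part entirely. The 16-pixel grayscale "image", the parameter values, and all function names are made up for illustration:

```python
import random


def fitness(image, target):
    """Higher is better: negative sum of absolute pixel differences."""
    return -sum(abs(a - b) for a, b in zip(image, target))


def mutate(image, rng, rate=0.1):
    """Replace each pixel with a random value with probability `rate`."""
    return [rng.randrange(256) if rng.random() < rate else p for p in image]


def crossover(a, b, rng):
    """One-point crossover: head of one parent, tail of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(target, pop_size=30, generations=200, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randrange(256) for _ in target] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        pop.sort(key=lambda im: fitness(im, target), reverse=True)
        history.append(fitness(pop[0], target))
        parents = pop[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children  # parents survive, so the best is never lost
    best = max(pop, key=lambda im: fitness(im, target))
    return best, history


target = [0, 64, 128, 255] * 4  # a tiny 4x4 grayscale "image", flattened
best, history = evolve(target)
print("first generation best:", history[0])
print("final best:", fitness(best, target))
```

To turn this into something closer to Picbreeder, you would replace `fitness` with a human picking the parents each round.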
Check this app: EvoPic. It's an evolutionary picture creator that uses a steady-state genetic algorithm to produce evolutionary pictures by drawing Unicode characters in a picture box.

Part-Of-Speech tagging and Named Entity Recognition for C/C++/Obj-C

need some help!
I'm trying to write some code in Objective-C that requires part-of-speech tagging, and ideally also named entity recognition. I don't have much interest in "rolling my own", so I'm looking for a decent library to use for this purpose. Obviously the more accurate the better, but we're not talking anything critical here -- so as long as it's generally pretty accurate that's good enough.
It's going to be English-only, at least for the time being, but I don't want to have to do any training of models myself. So whatever the solution, it has to have an English language model already built.
And finally, it has to be available via a commercial-friendly license (e.g. BSD/Berkeley, LGPL). Can't do GPL or anything restrictive like that, though I'm open to paying a small amount for a commercial license if that's the only option.
C, C++ or Obj-C code is all fine.
So: Anyone familiar with something that'd do the trick here? Thanks!!
I suggest you check out the iOS 5 beta release notes.
As you've probably figured out, most of the NLP code that's freely available is in Python, Perl or Java. However, a quick look at Stanford's NLP tools page shows a few things in C/C++ that are available. Another list of tools can be found at a blog post.
Of the POS taggers, YamCha is well-known, though I have not used it myself (being a Java/Python/Perl guy).
Unfortunately, I cannot suggest any NER tools. However, I bet there's a maxent or SVM implementation in C/C++ that you can work with:
1) create your training data and annotate it
2) define your features
3) use the ml library
Sorry I can't be of more help, but if anything else comes to mind I'll add it.
Maybe once I figure out Objective-C to a respectable degree I'll write an NLP library for it!

Applying Darwinian evolution to programming

A while back I recall reading a magazine article (in Wired I believe) about applying Darwinian evolution to programs to create better programs. Essentially multiple mutations of a program would be spawned, and the one that performed the best would be selected for the next round of mutations.
Unfortunately I can't make the subject sound nearly as interesting as it sounded in the article, and I can't find the article.
Since this sounds like just the coolest thing ever to me, I was wondering what mutations one could have inside of a program.
Yes. It is called Genetic Programming, where a master program writes programs itself, and the programs it writes can evolve toward a given criterion.
E.g. the 8 queens problem could be solved by GP.
I think you're referring to Genetic Algorithms. I want to work on this topic for my dissertation. I can't stop reading about it :-)
Found this article/paper - is this what you're referring to? Also found this PDF. Quite an interesting topic.
What it sounds like is that you could use self-modifying code that reproduces the program itself based on self-monitoring optimizations. This would currently point at interpreted-language programs.
I read an article on Coding Horror about something like that the other day: Go That Way, Really Fast. Basically, the idea I got from it was that software should constantly be improved which means constantly pushing out new versions/releases. This seems to match the idea of evolution in that your software is always improving into something better.
As said before it's called Genetic Programming (GP).
The interesting thing is that GP is a systematic, domain-independent method for getting computers to solve problems automatically starting from a high-level statement of what needs to be done.
Using ideas from natural evolution, GP starts from a population of random computer programs and progressively refines them through processes of mutation and crossover (recombination), until solutions emerge.
All this without the user having to know or specify the form or structure of solutions in advance.
GP has generated a plethora of human-competitive results and applications, including novel scientific discoveries and patentable inventions (see also What are good examples of genetic algorithms/genetic programming solutions?).
"I was wondering what mutations one could have inside of a program"
There are many genetic operators (not only mutation) and many implementations. The fundamental property they are required to have is closure (they must maintain the structural integrity of the genetic program).
In general, mutation replaces a symbol of the program with a compatible terminal or function chosen from a group of available symbols. The crossover operator mixes the information of two or more programs.
Probably the best free introduction to the subject is A Field Guide to Genetic Programming.
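As a rough sketch of what these two operators can look like, here are point mutation and subtree crossover on small arithmetic expression trees. The representation (nested tuples), the function and terminal sets, and all names are assumptions for illustration, not taken from any particular GP library. Closure holds here because every function takes two numeric arguments and every subtree evaluates to a number, so any swap still gives a valid program:

```python
import random

# Function set (all binary) and terminal set; "x" is the program input.
FUNCTIONS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}
TERMINALS = ["x", 1, 2]


def evaluate(node, x):
    """Evaluate a tree: (op, left, right), the variable "x", or a constant."""
    if isinstance(node, tuple):
        op, left, right = node
        return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))
    return x if node == "x" else node


def paths(node, prefix=()):
    """Yield every position in the tree as a tuple of child indices."""
    yield prefix
    if isinstance(node, tuple):
        for i in (1, 2):
            yield from paths(node[i], prefix + (i,))


def get(node, path):
    for i in path:
        node = node[i]
    return node


def replace(node, path, new):
    """Return a copy of the tree with the subtree at `path` replaced."""
    if not path:
        return new
    i = path[0]
    return node[:i] + (replace(node[i], path[1:], new),) + node[i + 1:]


def point_mutation(tree, rng):
    """Swap one symbol for a compatible one (function or terminal)."""
    path = rng.choice(list(paths(tree)))
    old = get(tree, path)
    if isinstance(old, tuple):
        new = (rng.choice(list(FUNCTIONS)),) + old[1:]  # same arity, new op
    else:
        new = rng.choice(TERMINALS)
    return replace(tree, path, new)


def subtree_crossover(a, b, rng):
    """Replace a random subtree of `a` with a random subtree of `b`."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return replace(a, pa, get(b, pb))


rng = random.Random(0)
p1 = ("add", "x", ("mul", "x", 2))  # x + x*2
p2 = ("sub", ("mul", "x", "x"), 1)  # x*x - 1
child = subtree_crossover(p1, p2, rng)
mutant = point_mutation(p1, rng)
print(child, "->", evaluate(child, 3))
print(mutant, "->", evaluate(mutant, 3))
```

A full GP run would wrap these in a loop with a fitness function and selection, the same way as any other evolutionary algorithm.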
Some nice links are:
Genetic Programming: Evolution of Mona Lisa
Genetic Cars
Smart rockets

Correct Approach for mastering SAP R3 and ABAP

I have been working on SAP technology for the last 2.5 years.
As there were so many technical concepts, I couldn't get a single source where I can learn about everything related to it. I didn't get the confidence of mastering all the technical concepts.
Please help me out if you have faced such an experience and how you overcame it.
Suggest some books or a methodology you followed which may be helpful.
Note: I have already worked in Java/J2EE. I am confident enough in mastering the concepts.
Obviously the (or rather a) correct approach would be one that works for you, the conclusion being that you don't know whether the approach was correct until you've tried it. :-)
Enough philosophy - my suggestions would be:
Learn how to read the SAP Online documentation. That's a bit different from reading other documentation - the SAP docs are littered with information about legacy techniques that you don't really need. Learn to identify and skip these parts - you can always come back later.
Develop a knowledge of the data dictionary. It's really at the heart of all things, and if you can't navigate there and read the structures, you're lost. Start reading the chapter in the online docs at http://help.sap.com/saphelp_webas620/helpdata/en/cf/21ea0b446011d189700000e8322d00/frameset.htm.
Read http://help.sap.com/saphelp_webas620/helpdata/EN/fc/eb3138358411d1829f0000e829fbfe/frameset.htm. I know that's a lot of stuff, but it's available for free, and almost everything is in there. Again, do a "fast indexing run" first to get a feeling of what's inside there, then dig into the basic concepts. ABAP at its lowest level isn't fundamentally different from other imperative / procedural languages.
Follow the example programs (transaction code ABAPDOCU). Learn how to use the debugger (vital!) and understand what's going on in the demo programs.
Once you've got a mental model of the basic language, take a look at ABAP Objects. If you already know Java, there should be no problem with the basic concepts, but there are a few specialties.
Feel free to ask if you run into something you don't understand.
There is no single source of information that will provide you with everything you have to know, especially since some of the knowledge is very specific to the context (FI, MM, IS-H, ...).

Getting started with ANTLR and avoiding common mistakes

I have started to learn ANTLR and have both the 2007 book "The Definitive ANTLR Reference" and ANTLRWorks (an interactive tool for creating grammars). And, being that sort of person, I started at Chapter 3. ("A quick tour for the impatient").
It's a fairly painful process especially as some errors are rather impenetrable (e.g. ANTLR: "missing attribute access on rule scope" problem which just means to me "you got something wrong"). Also I have some very simple grammars (3-4 productions only) and simple input (2 lines) which when run give "OutOfMemory" error.
The ANTLR site is useful but somewhat fragmented and some SO users have commented (https://stackoverflow.com/questions/278480/good-tutorial-for-antlr) that the book and the tutorials expect a high entry level. I've been reluctant to approach the ANTLR discussion list because of this.
LATER We are beginning to get to grips with it. It would be useful to have simple reliable examples that could be gently expanded. It's certainly worth mastering as we have remodelled quite a lot of our thinking based on ANTLR.
One problem is that ANTLR V3 has significant changes from V2. One answer on SO (and on the ANTLR pages) referred to a V2 syntax that is no longer available.
Some of the ANTLR questions on SO have helped me a lot, but finding them is a bit ad hoc. So I'd like to know how SO users can help to make the learning process less painful. (If you refer to the reference book it would be useful to point to particular pages).
EDIT. #duffymo and #JamesAnderson have confirmed that ANTLR is hard work - largely because parsers are difficult. (FWIW I have been through LEX/YACC, etc. and there's no doubt that ANTLR is more powerful and easier to work with.) I think it would still be useful to have areas where it's possible to avoid fouling up such as:
ensure correct capitalisation of variable names
add package name to lexer as well as parser
take care over order of rules as it affects precedence
More tips of this sort would be useful.
I agree - ANTLR is not for the faint of heart. It does expect a high entry level, because grammars and parsers are not trivial.
With that said, here are a few suggestions:
Forget about v2. Version 3 is the standard; don't even waste time considering the earlier version or its documentation.
OutOfMemoryError is telling you that there's something circular in the grammar you've defined.
IntelliJ has a wonderful IDE for working with ANTLR v3. It'll give you a graphical representation of your grammar, step-through debugging, etc. If you're going to be doing a lot of work with ANTLR it'd be worth a few dollars to buy a license.
ANTLR won't be easy to master. The book is good, but dense. The error messages are cryptic, as you've noted. I'd be surprised if anyone here could make it easy.
Sorry, but my experience of ANTLR (indeed JavaCC, Bison or any full-function parser generator) is that most of your learning will be by fixing your own mistakes!
Getting good examples of other people's code will cut this down somewhat. The best examples look really simple, but you are missing all the sweat and hair pulling it took to get them looking that easy.
Even if you prefer the command line, it is worth using ANTLRWorks when you have problems. The diagrammatic representation can make it easier to see what is going wrong.
A picture is worth a thousand error messages.