Is there a free library for morphological analysis of the German language? - morphological-analysis

I'm looking for a library which can perform a morphological analysis on German words, i.e. one that converts any word into its root form and provides meta information about the analysed word.
For example:
gegessen -> essen
wurde [...] gefasst -> fassen
Häuser -> Haus
Hunde -> Hund
My wishlist:
It has to work with both nouns and verbs.
I'm aware that this is a very hard task given the complexity of the German language, so I'm also looking for libraries which provide only approximations or may only be 80% accurate.
I'd prefer libraries which don't work with dictionaries, but again I'm open to compromise given the circumstances.
I'd also prefer C/C++/Delphi Windows libraries, because that would make them easier to integrate, but .NET, Java, ... will also do.
It has to be a free library. (L)GPL, MPL, ...
EDIT: I'm aware that there is no way to perform a morphological analysis without any dictionary at all, because of the irregular words.
When I say I prefer a library without a dictionary, I mean those full-blown dictionaries which map each and every word:
arbeite -> arbeiten
arbeitest -> arbeiten
arbeitet -> arbeiten
arbeitete -> arbeiten
arbeitetest -> arbeiten
arbeiteten -> arbeiten
arbeitetet -> arbeiten
gearbeitet -> arbeiten
...
Those dictionaries have several drawbacks, including the huge size and the inability to process unknown words.
Of course all exceptions can only be handled with a dictionary:
esse -> essen
isst -> essen
eßt -> essen
aß -> essen
aßt -> essen
aßen -> essen
...
(My mind is spinning right now :) )

I think you are looking for a "stemming algorithm".
Martin Porter's approach is well known among linguists. The Porter stemmer is basically an affix stripping algorithm, combined with a few substitution rules for those special cases.
Most stemmers deliver stems that are linguistically "incorrect". For example: both "beautiful" and "beauty" can result in the stem "beauti", which, of course, is not a real word. This doesn't matter, though, if you're using those stems to improve search results in information retrieval systems. Lucene comes with support for the Porter stemmer, for instance.
Porter also devised a simple programming language for developing stemmers, called Snowball.
There are also stemmers for German available in Snowball. A C version, generated from the Snowball source, is also available on the website, along with a plain text explanation of the algorithm.
Here's the German stemmer in Snowball: http://snowball.tartarus.org/algorithms/german/stemmer.html
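If you just want to try this quickly from Java, Lucene bundles the Snowball-generated stemmer classes. Below is a minimal sketch; the class and method names (GermanStemmer, setCurrent/stem/getCurrent) are taken from the Snowball-generated sources shipped with Lucene and may differ slightly between versions, so treat it as a sketch rather than copy-paste code:

import org.tartarus.snowball.ext.GermanStemmer;

public class StemDemo {
    public static void main(String[] args) {
        GermanStemmer stemmer = new GermanStemmer();
        for (String word : new String[] {"Häuser", "Hunde", "gegessen", "arbeitete"}) {
            stemmer.setCurrent(word);
            stemmer.stem();
            // The result is a stem, not a lemma: irregular forms like "gegessen"
            // will not come back as "essen".
            System.out.println(word + " -> " + stemmer.getCurrent());
        }
    }
}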
If you're looking for the corresponding stem of a word as you would find it in a dictionary, along with information on the part of speech, you should Google for "lemmatization".

(Disclaimer: I'm linking my own Open Source projects here)
This data is available in the form of a word list at http://www.danielnaber.de/morphologie/. It could be combined with a word splitter library (like jwordsplitter) to cover compound nouns not in the list.
Or just use LanguageTool from Java, which has the word list embedded in the form of a compact finite-state machine (plus it also includes compound splitting).
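As an illustration, here is a rough sketch of how one might query LanguageTool's German tagger for lemmas from Java. The class and method names used here (GermanyGerman, getTagger(), AnalyzedTokenReadings) are written from memory and may differ between LanguageTool releases, so check them against the version you actually use:

import java.util.Arrays;
import org.languagetool.AnalyzedToken;
import org.languagetool.AnalyzedTokenReadings;
import org.languagetool.language.GermanyGerman;

public class LemmaDemo {
    public static void main(String[] args) throws Exception {
        GermanyGerman german = new GermanyGerman();
        // Tag a few word forms; each form may come back with several readings.
        for (AnalyzedTokenReadings readings :
                german.getTagger().tag(Arrays.asList("Häuser", "gegessen", "arbeitete"))) {
            for (AnalyzedToken token : readings.getReadings()) {
                System.out.println(readings.getToken() + " -> "
                        + token.getLemma() + " (" + token.getPOSTag() + ")");
            }
        }
    }
}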

You asked this a while ago, but you might still want to give morphisto a try.
Here's an example of how to do it on Ubuntu:
Install the Stuttgart finite-state transducer tools
$ sudo apt-get install sfst
Download the morphisto morphology, e.g. morphisto-02022011.a
Compact it, e.g.
$ fst-compact morphisto-02022011.a morphisto-02022011.ac
Use it! Here are some examples:
$ echo Hochzeit | fst-proc morphisto-02022011.ac
^Hochzeit/hohZeit<+NN>/hohZeit<+NN>/hohZeit<+NN>/hohZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/HochZeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>/Hochzeit<+NN>$
$ echo gearbeitet | fst-proc morphisto-02022011.ac
^gearbeitet/arbeiten<+ADJ>/arbeiten<+ADJ>/arbeiten<+V>$

Have a look at LemmaGen (http://lemmatise.ijs.si/), a project that aims to provide a standardized open-source multilingual platform for lemmatisation. It does exactly what you want.

I don't think that this can be done without a dictionary.
Rules-based approaches will invariably trip over things like
gegessen -> essen
gegangen -> angen
(note to people who don't speak German: the correct solution in the second case is "gehen").

Have a look at Leo.
They offer the data which you are after; maybe it gives you some ideas.

One can use morphisto with ParZu (https://github.com/rsennrich/parzu). ParZu is a dependency parser for German.
This means that ParZu also disambiguates the output from morphisto.

There are some tools out there which you could use, like the morphological component in the Matetools, Morphisto etc. But the pain is integrating them into your tool chain. A very good wrapper around quite a lot of these linguistic tools is DKPro (https://dkpro.github.io/dkpro-core/), a framework using UIMA. It allows you to write your own preprocessing pipeline using different linguistic tools from different sources, which are all downloaded automatically to your computer and talk to each other. You can use Java, Groovy or even Jython to drive it. DKPro gives you easy access to two morphological analyzers, MateMorphTagger and SfstAnnotator.
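To give an idea of what such a pipeline looks like, here is a hedged sketch in Java using uimaFIT. The component and parameter names (TextReader, OpenNlpSegmenter, SfstAnnotator, PARAM_SOURCE_LOCATION) follow DKPro Core conventions but should be verified against the release you use:

import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;
import de.tudarmstadt.ukp.dkpro.core.sfst.SfstAnnotator;

public class MorphPipeline {
    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
                // Read German plain-text files from a directory.
                createReaderDescription(TextReader.class,
                        TextReader.PARAM_SOURCE_LOCATION, "input/*.txt",
                        TextReader.PARAM_LANGUAGE, "de"),
                // Tokenize and split sentences first.
                createEngineDescription(OpenNlpSegmenter.class),
                // SfstAnnotator wraps SFST-based morphologies such as Morphisto;
                // the model is downloaded automatically on first use.
                createEngineDescription(SfstAnnotator.class));
    }
}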
You don't want to use a stemmer like Porter: it reduces the word form in a way that makes no sense linguistically and does not have the behaviour you describe. If you only want to find the basic form (for a verb that would be the infinitive, for a noun the nominative singular), then you should use a lemmatizer. You can find a list of German lemmatizers here. TreeTagger is widely used. You can also use the more complex analysis provided by a morphological analyzer like SMORS. It will give you something like this (example from the SMORS website):
And here is the analysis of "unübersetzbarstes" showing prefixation, suffixation and gradation:
un<PREF>übersetzen<V>bar<SUFF><+ADJ><Sup><Neut><Nom><Sg><St>

Related

How to create a refactor tool?

I want to learn how to create refactor tools.
As an example: I want to create migration scripts for when some library removes a deprecated function and we want to transform code to use the newer, adequate functionality.
My idea was to use ANTLR to parse the code into an AST, use some pattern matching on this tree to modify its contents, and output the modified contents.
However, from what I've read, ANTLR doesn't preserve formatting in the AST, so it would be hard to get unbroken content back.
Do you have a solution that would comply with the following?
allows me to modify code while preserving formatting
(optionally) allows me to use AST transformations for code transformation
(optionally) can transform a variety of languages, like ANTLR
The question is not limited to one particular language; I'd be happy to hear about solutions created for different languages.
If you want a
general purpose tool to parse source code from arbitrary languages producing ASTs
apply procedural or preferably source-to-source pattern-directed rewrite rules to manipulate the ASTs
regenerate valid source code retaining formatting and comments
I know of only two systems at present that can do this robustly.
RASCAL Metaprogramming language, a research platform
Semantic Designs' (my company) DMS Software Reengineering Toolkit (DMS)
You probably don't want to try building frameworks like this yourself; these tools both have decades of PhD level investment to make them practical.
One issue that occurs repeatedly is the mistake of thinking that having a parser (e.g., ANTLR) solves most of the problem. See my essay on Life After Parsing. A key insight is that you can't transform "just the syntax (ASTs)" without context; you have to take into account the language semantics, and whatever framework you choose to use had better help you do semantic analysis to support the rewrite rules.
There are other (general purpose) program transformation systems out there. Many are research. Few have been used to do serious software reengineering in practice. RASCAL has been applied to some quite interesting tasks but not in any commercial context that I know. DMS has been used in production for over 20 years to carry out massive code base changes including refactoring, API revision, and fully automated language migrations.
ANTLR has a TokenStreamRewriter class that is very good at preserving your source input.
It has very robust capabilities. It allows you to delete, insert or replace text in the input stream. It actually stores up a series of pending changes, and then applies them when you ask for the modified input stream (it even allows for rolling back changes, as well as multiple sets of changes).
A couple of examples from a recent presentation I did that touched on the Rewriter:
private void plus0(RefactorUtilContext ctx, String pName) {
    for (var match : plus0PatternA.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (AddSubExprContext) (match.getTree());
        rewriter.delete(pName, matchCtx.op, matchCtx.rhs.getStop());
    }
    for (var match : plus0PatternB.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (AddSubExprContext) (match.getTree());
        rewriter.delete(pName, matchCtx.lhs.getStart(), matchCtx.op);
    }
}

private void times1(RefactorUtilContext ctx, String pName) {
    for (var match : times1PatternA.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (MulDivExprContext) (match.getTree());
        // rewriter.delete(pName, matchCtx.op, matchCtx.rhs.getStop());
        rewriter.insertBefore(pName, matchCtx.op, "/* ");
        rewriter.insertAfter(pName, matchCtx.rhs.getStart(), " */");
    }
    for (var match : times1PatternB.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (MulDivExprContext) (match.getTree());
        // rewriter.delete(pName, matchCtx.lhs.getStart(), matchCtx.op);
        rewriter.insertBefore(pName, matchCtx.lhs.getStart(), "/* ");
        rewriter.insertAfter(pName, matchCtx.op, " */");
    }
}
TokenStreamRewriter basically just stores a set of instructions about how to modify your input stream, so everything about your input stream that you don't modify is, ummmm, unmodified :).
You may also wish to look into the XPath capabilities that ANTLR has. These allow you to find very specific patterns in the parse tree to locate the portions you would like to refactor. As the name suggests, the syntax is very similar to XPath for XML documents, but works on the parse tree instead of an XML DOM.
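To make that concrete, here is a minimal, self-contained sketch combining XPath matching with a TokenStreamRewriter. "MyLexer"/"MyParser", the entry rule expr, and the rule name addSubExpr in the XPath expression are placeholders for whatever your generated grammar actually provides:

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.xpath.XPath;

public class RewriteDemo {
    public static void main(String[] args) {
        CharStream input = CharStreams.fromString("1 + 0");   // ANTLR 4.7+; older versions use ANTLRInputStream
        MyLexer lexer = new MyLexer(input);                   // placeholder lexer
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        MyParser parser = new MyParser(tokens);               // placeholder parser
        ParseTree tree = parser.expr();                       // your grammar's entry rule

        TokenStreamRewriter rewriter = new TokenStreamRewriter(tokens);
        // Find every subtree matching an XPath over the parser's rule names.
        for (ParseTree match : XPath.findAll(tree, "//addSubExpr", parser)) {
            ParserRuleContext ctx = (ParserRuleContext) match;
            // Queue an edit; the underlying token stream itself is never changed.
            rewriter.insertBefore(ctx.getStart(), "/* reviewed */ ");
        }
        // Tokens you didn't touch (including whitespace and comments, provided the
        // lexer puts them on a hidden channel rather than skipping them) come back verbatim.
        System.out.println(rewriter.getText());
    }
}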
Note: all of these operate on the parse tree, not an AST (which would necessarily be of your own design, so ANTLR wouldn't know its structure). Also, it's in the nature of ASTs to drop irrelevant information (like comments, whitespace, etc.), so they're just not a good starting point for anything where you want to preserve formatting.
I put together a fairly small project for this presentation; it's on GitHub at LittleCalc. The most relevant file is LittleCalcExecutionVisitor.
You can start a REPL and test things out by running LittleCalcRepl.java

How (if possible) to use PostgreSQL's parser (in C) independently?

I need a parser (mainly for the "select" type of queries) and want to avoid the hassle of writing one from scratch. Does anybody know how to use the scan.l/gram.y of pgsql for this purpose? I've looked at pgpool too, but it seems similar. Right now, it would be very helpful if someone could give instructions for compiling the parser (maybe using the provided makefile) without errors, so that it can be fed (valid?) queries and output the parse tree (in whatever form)!
You probably cannot take any file from the postgres source tarball and compile it separately. The parser uses internal OOP structures (implemented in C). But there is a possibility (not a simple one): the ecpg preprocessor tries to transform the PostgreSQL gram file into a secondary gram file, and you can use the same mechanism. It uses a small utility, parse.pl, which is part of the PostgreSQL source code (src/postgresql/src/interfaces/ecpg/preproc).
PostgreSQL compiles the language parser using yacc. Presumably you could take the yacc files and create a compatible parser with very little effort. Note you must have flex and yacc installed to do this.
Note this is not taking a .c file from source and transplanting it into your system. All you are getting is the parser, not the planner or anything else.
Given the level of detail in the question, no more detail is possible. Perhaps you could start there and post another question when you get stuck.

Autodocumentation type functionality for Fortran?

In the past I've used Doxygen for C and C++, but now I've been thrown onto a Fortran project and I would like to get a quick, all-encompassing look at the architecture.
In the past I've found reverse engineering tools to be useful where no documentation of the architecture exists.
So, is there a tool out there that will reverse engineer Fortran code?
I tried to use Doxygen, but didn't have any luck. I will be working with two different projects - one is Fortran 90 and the other, I think, is in Fortran 77.
Thanks for any insights and feedback.
Tools which may help with reverse engineering:
SciTools Understand
Link with some more tools (search "fortran")
Also, maybe some of these unit testing frameworks will be helpful (I haven't used them, so I cannot comment on the pros and cons of any of them):
FUnit
FRUIT
Ftnunit
(these links point to the Fortran Wiki, where you can find a tidbit on each of them, along with links to their home sites).
Doxygen 1.6.1 will generate documentation, call graphs, etc. for Fortran source code in free-format (F90) format. You are out of luck for auto-documenting fixed-format (F77) code with doxygen.
All is not lost, however. The conversion from fixed to free format is straightforward and can be automated to a great degree - change comment characters to '!', change continuation characters to '&', and append '&' to lines to be continued. In fact, if the appended continuation character is placed in column 73, it should be ignored by standard F77 compilers (which still only recognize code in columns 1 through 72) but will be recognized by F9x/F2003/F2008 compilers. This allows the same code to be recognized as both in fixed and free format, which lets you gracefully migrate from one format to the other.
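As a rough illustration of how little is involved, here is a sketch of those three rules in Java. It deliberately ignores edge cases (tab-formatted sources, comment characters inside string literals, continuations following a comment line), so treat it as a starting point rather than a finished converter:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class FixedToFree {
    public static void main(String[] args) throws Exception {
        List<String> lines = new ArrayList<>(
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1));

        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            // Rule 1: 'C', 'c' or '*' in column 1 marks a comment line.
            if (!line.isEmpty() && "Cc*".indexOf(line.charAt(0)) >= 0) {
                lines.set(i, "!" + line.substring(1));
                continue;
            }
            // Rule 2: a non-blank, non-zero character in column 6 marks a continuation line.
            if (i > 0 && line.length() >= 6 && line.charAt(5) != ' ' && line.charAt(5) != '0') {
                lines.set(i, line.substring(0, 5) + "&" + line.substring(6));
                // Rule 3: the previous line gets an '&' appended in column 73,
                // where standard F77 compilers no longer look.
                StringBuilder prev = new StringBuilder(lines.get(i - 1));
                while (prev.length() < 72) {
                    prev.append(' ');
                }
                lines.set(i - 1, prev.substring(0, 72) + "&");
            }
        }
        Files.write(Paths.get(args[0] + ".f90"), lines, StandardCharsets.ISO_8859_1);
    }
}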
Conveniently, there are about a thousand small programs that will do this format adjustment to some degree or another. Realistically, if you're going to be maintaining the code, you might as well move it away from the 1928 spec for Hollerith (IBM) punched cards. :)

What language is to binary, as Perl is to text?

I am looking for a scripting (or higher level programming) language (or e.g. modules for Python or similar languages) for effortlessly analyzing and manipulating binary data in files (e.g. core dumps), much like Perl allows manipulating text files very smoothly.
Things I want to do include presenting arbitrary chunks of the data in various forms (binary, decimal, hex), convert data from one endianess to another, etc. That is, things you normally would use C or assembly for, but I'm looking for a language which allows for writing tiny pieces of code for highly specific, one-time purposes very quickly.
Any suggestions?
Well, while it may seem counter-intuitive, I found Erlang extremely well suited for this, namely due to its powerful support for pattern matching, even on bytes and bits (called "Erlang Bit Syntax"), which makes it very easy to create even very advanced programs that deal with inspecting and manipulating data at the byte and even the bit level:
Since 2001, the functional language Erlang comes with a byte-oriented datatype (called binary) and with constructs to do pattern matching on a binary.
And to quote informIT.com:
(Erlang) Pattern matching really starts to get fun when combined with the binary type. Consider an application that receives packets from a network and then processes them. The four bytes in a packet might be a network byte-order packet type identifier. In Erlang, you would just need a single processPacket function that could convert this into a data structure for internal processing. It would look something like this:
processPacket(<<1:32/big,RestOfPacket>>) ->
    % Process type one packets
    ...
    ;
processPacket(<<2:32/big,RestOfPacket>>) ->
    % Process type two packets
    ...
So Erlang, with its built-in support for pattern matching and being a functional language, is pretty expressive; see for example the implementation of uuencode in Erlang:
uuencode(BitStr) ->
    << (X+32):8 || <<X:6>> <= BitStr >>.
uudecode(Text) ->
    << (X-32):6 || <<X:8>> <= Text >>.
For an introduction, see Bit-level Binaries and Generalized Comprehensions in Erlang. You may also want to check out some of the following pointers:
Parsing Binaries with erlang, lamers inside
More File Processing with Erlang
Learning Erlang and Adobe Flash format same time
Large Binary Data is (not) a Weakness of Erlang
Programming Efficiently with Binaries and Bit Strings
Erlang bit syntax and network programming
erlang, the language for network programming (1)
Erlang, the language for network programming Issue 2: binary pattern matching
An Erlang MIDI File Reader/Writer
Erlang Bit Syntax
Comprehending endianness
Playing with Erlang
Erlang: Pattern Matching Declarations vs Case Statements/Other
A Stream Library using Erlang Binaries
Bit-level Binaries and Generalized Comprehensions in Erlang
Applications, Implementation and Performance Evaluation of Bit Stream Programming in Erlang
Perl's pack and unpack?
Take a look at the Python bitstring module; it looks like exactly what you want :)
The Python bitstring module was written for this purpose. It lets you take arbitrary slices of binary data and offers a number of different interpretations through Python properties. It also gives you plenty of tools for constructing and modifying binary data.
For example:
>>> from bitstring import BitArray, ConstBitStream
>>> s = BitArray('0x00cf') # 16 bits long
>>> print(s.hex, s.bin, s.int) # Some different views
00cf 0000000011001111 207
>>> s[2:5] = '0b001100001' # slice assignment
>>> s.replace('0b110', '0x345') # find and replace
2 # 2 replacements made
>>> s.prepend([1]) # Add 1 bit to the start
>>> s.byteswap() # Byte reversal
>>> ordinary_string = s.bytes # Back to Python string
There are also functions for bit-wise reading and navigation in the bitstring, much like in files; in fact this can be done straight from a file without reading it into memory:
>>> s = ConstBitStream(filename='somefile.ext')
>>> hex_code, a, b = s.readlist('hex:32, uint:7, uint:13')
>>> s.find('0x0001') # Seek to next occurence, if found
True
There are also views with different endiannesses as well as the ability to swap endianness and much more - take a look at the manual.
I'm using 010 Editor all the time to view binary files.
It's especially geared to work with binary files.
It has an easy-to-use C-like scripting language to parse binary files and present them in a very readable way (as a tree, fields coded by color, stuff like that).
There are some example scripts to parse zip files and bmp files.
Whenever I create a binary file format, I always make a little script for 010 editor to view the files. If you've got some header files with some structs, making a reader for binary files is a matter of minutes.
Any high-level programming language with pack/unpack functions will do. Perl, Python and Ruby can all do it. It's a matter of personal preference. I wrote a bit of binary parsing in each of these and felt that Ruby was the easiest/most elegant for this task.
Why not use a C interpreter? I always used them to experiment with snippets, but you could use one to script something like you describe without too much trouble.
I have always liked EiC. It was dead, but the project has been resurrected lately. EiC is surprisingly capable and reasonably quick. There is also CINT. Both can be compiled for different platforms, though I think CINT needs Cygwin on Windows.
Python's standard library has some of what you require -- the array module in particular lets you easily read parts of binary files, swap endianness, etc; the struct module allows for finer-grained interpretation of binary strings. However, neither is quite as rich as you require: for example, to present the same data as bytes or halfwords, you need to copy it between two arrays (the numpy third-party add-on is much more powerful for interpreting the same area of memory in several different ways), and, for example, to display some bytes in hex there's nothing much "bundled" beyond a simple loop or list comprehension such as [hex(b) for b in thebytes[start:stop]]. I suspect there are reusable third-party modules to facilitate such tasks yet further, but I can't point you to one...
Forth can also be pretty good at this, but it's a bit arcane.
Well, if speed is not a consideration and you want Perl, then translate each line of binary into a line of chars - 0's and 1's. Yes, I know there are no linefeeds in binary :) but presumably you have some fixed size -- e.g. by byte or some other unit, with which you can break up the binary blob.
Then just use the perl string processing on that data :)
If you're doing binary level processing, it is very low level and likely needs to be very efficient and have minimal dependencies/install requirements.
So I would go with C - handles bytes well - and you can probably google for some library packages that handle bytes.
Going with something like Erlang introduces inefficiencies, dependencies, and other baggage you probably don't want with a low-level library.

Batch source-code aware spell check

What is a tool or technique that can be used to perform spell checks upon a whole source code base and its associated resource files?
The spell check should be source-code aware, meaning that it would stick to checking string literals in the code and not the code itself. Bonus points if the spell checker understands common resource file formats, for example text files containing name-value pairs (only checking the values). Super-bonus points if you can tell it which parts of an XML DTD or Schema should be checked and which should be ignored.
Many IDEs can do this for the file you are currently working with. The difference in what I am looking for is something that can operate upon a whole source code base at once.
Something like a FindBugs- or PMD-type tool for misspellings would be ideal.
As you mentioned, many IDEs have this functionality already, and one such IDE is Eclipse. However, unlike many other IDEs, Eclipse is:
A) open source
B) designed to be programmable
For instance, here's an article on using Eclipse's code formatting functionality from the command line:
http://www.peterfriese.de/formatting-your-code-using-the-eclipse-code-formatter/
In theory, you should be able to do something similar with its spell-checking mechanism. I know this isn't exactly what you're looking for, and if there is a program for doing spell checking in code then obviously that'd be better, but if not, then Eclipse may be the next best thing.
This seems a little old, but it appears to do a good job:
Source Code Spell Checker