Can PMD be customized to fully support a new language? - code-analysis

Can PMD be customized to fully support a new language in a reasonable amount of time? I know that technically almost anything can be done, but I'm wondering whether this is feasible in, say, under 2 weeks.
This page explains how to write a CPD parser: http://pmd.sourceforge.net/cpd-parser-howto.html
But is this just for copy/paste detection? Does writing a CPD parser give me full PMD support in terms of rule sets?

I would guess not, but I'm not a PMD expert (and I have my own bias, check my bio).
The issues are:
Can you define a syntax for your language quickly? (Maybe, depending on how good you are, how messy the language is, and the strength of the parsing machinery offered by PMD.)
Can you define the semantics of your language so that the "semantic checks" provided by PMD work? You have to do this, because syntax tells you (and a tool) literally nothing about the semantics of that syntax. I would guess that PMD's "semantic checks" are pretty wired into the precise details of Java; if your language matched Java perfectly, this would be zero work. But it doesn't, or you wouldn't be asking the question. And language semantics differences, even minor ones, cause discontinuous changes to the interpretation of the code. Before you get to doing even "serious" semantics, you're likely to have to build a symbol table mapping identifiers in the code to declarations (and the "semantic" type) for those symbols. Based on the tool infrastructure I work with, this step alone takes 1-2 months for a real language.
Lastly, you are likely to have to code special PMD checks that are specific to your language. That takes time and energy, too.
I build generic compiler-type machinery (parsers, flow analyzers, style/error checkers) and get asked the equivalent of this question all the time with respect to our machinery. We try to have a lot of machinery available, try to make it easy to integrate new languages, and we've been working on trying to make this "convenient and fast" for 15+ years. It's still not convenient, and there's no way to do this with our tools in a few weeks. I doubt PMD is better.
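To give a feel for what "a symbol table mapping identifiers to declarations" means, here is a minimal sketch in Python. It is not PMD code and not how any particular tool does it; the names Symbol and Scope are hypothetical, and a real language needs much more (overloading, imports, visibility rules, and so on).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Symbol:
    name: str
    type_name: str      # the "semantic" type attached to the declaration
    decl_line: int      # where the declaration appears in the source


@dataclass
class Scope:
    parent: Optional["Scope"] = None
    symbols: Dict[str, Symbol] = field(default_factory=dict)

    def declare(self, name: str, type_name: str, decl_line: int) -> Symbol:
        """Record a declaration in this scope; reject duplicates."""
        if name in self.symbols:
            raise KeyError(f"duplicate declaration of {name!r}")
        sym = Symbol(name, type_name, decl_line)
        self.symbols[name] = sym
        return sym

    def resolve(self, name: str) -> Optional[Symbol]:
        """Look a name up in this scope, then in the enclosing scopes."""
        scope = self
        while scope is not None:
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None


global_scope = Scope()
global_scope.declare("count", "int", decl_line=1)
inner = Scope(parent=global_scope)
inner.declare("count", "string", decl_line=5)   # shadows the outer "count"
inner.declare("total", "int", decl_line=6)
```

Even this toy has to take a position on shadowing and duplicate declarations, and those are exactly the details that differ from language to language.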

Related

Look for a VBA/VB parser/compiler written in OCaml

I am planning to write a compiler (including a parser) in OCaml to parse and run VBA and/or VB programs. I have done this for simple imperative languages, but I am not sure how to handle the "object" features of VBA and/or VB...
Does anyone know of any existing work that I can draw inspiration from?
Not an OCaml solution (but OP asked):
Our DMS Software Reengineering Toolkit is general purpose program analysis and transformation machinery. It is intended to be a convenient foundation for custom software engineering tools for computer languages, with the goal being to help the tool engineer get his job done, rather than spending his time reinventing the wheel. In particular, many people think that getting a parser is the big part of the job. This is simply false. See Life After Parsing.
DMS has production front ends for many languages, both modern and legacy, including Visual Basic in its variety of dialects (VB6, VBA [essentially the same as VB6]) and VB.net.
By production I mean that they have been applied to real code systems of significant size and handle all the corresponding parsing issues. This is pretty hard for legacy languages, e.g., VB, especially the older dialects, because such languages are generally poorly documented (VB6 and VBA especially so). The only way to get this right is to build a draft parser, run it against reality, and revise until lots of code goes through sensibly. This often takes longer than doing the draft parser because it isn't easy to understand the errors (they're undocumented!): you have to decide whether they are real or the code base just has junk (more often than you'd think), guess what it means for the grammar, and try it all again.
These front ends as a minimum parse source code and build ASTs; they can also invert this process to regenerate legal compilable code with the comments back as source text files. The VisualBasic front ends do this. Some of our other front ends (C, C++, Java, COBOL) go further: name/type resolution, flow analysis, etc.; they do that by collecting key program facts from the language-specific AST and then apply DMS-supplied machinery to compute the results. This would be possible for VisualBasic, too, if such facts were useful.
For an example of a tiny OO language written in OCaml check out the source code for boa at: http://andrej.com/plzoo/.
The OO flavour is not class based though so I'm not sure how useful it will be.

What tool to use for finding duplicated Ada code due to copy&paste

I'm looking for a tool for finding duplicated code due to copy&paste programming to be run over a large Ada codebase. I suppose that Ada support in the tool is important for detecting more than the trivial text similarities, that is, ignore layout or identifier difference, etc.
The tools that I have found with Ada support are the following:
Clone Doctor, commercial product with support for several languages, including Ada. http://www.semdesigns.com/Products/Clone/index.html
ConQAT: commercially supported open source product that includes a CloneDetection tool with Ada support since September 2011 http://conqat.cs.tum.edu/index.php/CloneDetectionTutorial
Have you tried these tools? Am I missing any other one of interest? Is language support really significant, or would a general text tool be enough? What is your experience with code duplication detection?
Thanks in advance.
I'm the author of CloneDR. Read the following with my bias in mind.
It is important to understand the differences in the detection methods of clone detection tools, and the quality of the results as a consequence.
ConQAT is a representative of what are called "token based" detectors. They match sequences of language tokens (operators, identifiers, brackets, keywords, etc.). The good news is they are pretty fast (though that isn't a big issue; you don't run clone detection every 30 seconds, once a week is enough). They will find some clones that are near-misses, in the sense that another identifier or constant is substituted for an identifier in a clone. The bad news is that they don't understand the structure of your code and consequently want to report things like
} void ID ( ID
as clones. This is defeated by making the detectors only hunt for very long sequences of tokens (typically 30 or more), which means token-based detectors cannot find small but interesting clones without also drowning you in false positives like the above.
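A rough sketch of the token-window idea, in Python, shows where that false positive comes from. This is not ConQAT's algorithm, just the general technique under simplifying assumptions: the tokenizer, the keyword set, and the function names normalize and find_clones are all made up for illustration.

```python
import re
from collections import defaultdict

KEYWORDS = {"void", "int", "return", "if", "else", "while"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source):
    """Split into tokens; replace identifiers with ID and numbers with NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def find_clones(tokens, window):
    """Map each token window that occurs more than once to its start positions."""
    seen = defaultdict(list)
    for i in range(len(tokens) - window + 1):
        seen[tuple(tokens[i:i + window])].append(i)
    return {seq: pos for seq, pos in seen.items() if len(pos) > 1}


SOURCE = """
void a ( Foo x ) { return ; }
void b ( Bar y ) { return ; }
void c ( Baz z ) { return ; }
"""

tokens = normalize(SOURCE)
short_clones = find_clones(tokens, window=5)
long_clones = find_clones(tokens, window=25)
```

With a 5-token window, short_clones contains the sequence } void ID ( ID, which straddles the end of one function and the start of the next. With a 25-token window, long_clones is empty: the junk is gone, but so are the three near-identical function bodies, which are only 10 tokens each.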
CloneDR operates by parsing the code (even for Ada) just like a compiler, building abstract syntax trees, and matching the trees up to a point of difference. It cannot propose a clone that crosses structure boundaries in silly ways. It will find near misses of the same kind as the token based detectors, but it goes beyond this. CloneDR will find consistent substitutions ("anti unifiers") which means clones can be explained by a small number of parameters that have been used in many places in the clone, and it will find variations in the code in which the mismatches are larger than a single token, e.g., expressions, statements, declarations, even blocks. So it produces fewer false positives and better answers. Independent research reports that compare types of clone detectors, specifically including CloneDR, agree with this analysis.
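The "consistent substitution" idea can be illustrated with a small anti-unification sketch in Python. This is not CloneDR's implementation; the nested-tuple tree encoding, the ?P1-style parameter names, and the function anti_unify are assumptions made for the example.

```python
def anti_unify(a, b, params):
    """Return the most specific common pattern of trees a and b.

    Trees are non-empty nested tuples: (node_type, child, child, ...);
    leaves are plain strings or numbers. Mismatching subtrees become
    parameters, and the same pair of mismatching subtrees always maps
    to the same parameter (a consistent substitution).
    """
    if a == b:
        return a
    if (isinstance(a, tuple) and isinstance(b, tuple)
            and len(a) == len(b) and a[0] == b[0]):
        return (a[0],) + tuple(anti_unify(x, y, params)
                               for x, y in zip(a[1:], b[1:]))
    if (a, b) not in params:
        params[(a, b)] = f"?P{len(params) + 1}"
    return params[(a, b)]


# total = total + price(item)
first = ("assign", ("var", "total"),
         ("add", ("var", "total"), ("call", "price", ("var", "item"))))
# sum = sum + qty * 2
second = ("assign", ("var", "sum"),
          ("add", ("var", "sum"), ("mul", ("var", "qty"), ("num", 2))))

params = {}
pattern = anti_unify(first, second, params)
```

Here the two statements collapse to one pattern with two parameters: ?P1 stands for the variable name used in both places (total vs. sum), and ?P2 stands for a whole mismatching expression, which a pure token matcher would not treat as a single difference.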
There is more detailed discussion at the Clone Doctor link you listed above. You can see examples of detected clones for many languages (but we don't have an Ada report on the web site).
EDIT March 19, 2012:
Now you can download an eval copy of an Ada95 CloneDR.
Ira Baxter has a good description.
Token-based clone detection tools tend to be good enough for our purpose, which is usually to get a quick overview of how bad code duplication is in a body of source code we haven't seen before, and how duplication is distributed across that code.
In particular, we are happy with CCFinderX, because it has a nice visualization frontend.
However, it's buggy, unmaintained, and the code has been released but without any license statement.
It has language specific preprocessors for some languages, but we often just disable them (they are buggy as well).
If you need better accuracy, you know exactly the language you need to parse (e.g. with C or C++, this is not always the case), and you can find a tool that parses exactly that language (which is also an issue with C and C++), a parsing-based approach may be better, as Ira writes.

Creating a simple Domain Specific Language

I am curious to learn about creating a domain specific language. For now the domain is quite basic, just have some variables and run some loops, if statements.
Edit: The language will be non-English based, with a very simple syntax.
I am thinking of targeting the Java Virtual Machine, i.e., compiling to Java bytecode.
Currently I know how to write some simple grammars using ANTLR.
I know that ANTLR creates a lexer and parser but how do I go forward from here?
About semantic analysis: does it have to be written manually, or are there tools to create it?
How can the output from the lexer and parser be converted to Java bytecode?
I know that there are libraries like ASM or BCEL, but what is the exact procedure?
Are there any frameworks for doing this? And if there are, which is the simplest one?
You should try Xtext, an Eclipse-based DSL toolkit. Version 2 is quite powerful and stable. From its home page you have plenty of resources to get you started, including some video tutorials. Because the Eclipse ecosystem runs around Java, it seems the best choice for you.
You can also try MPS, but this is a projectional editor, and beginners may find it more difficult. It is nevertheless not less powerful than Xtext.
If your goal is to learn as much as possible about compilers, then indeed you have to go the hard way: write an ad hoc parser (no ANTLR or the like), write your own semantic passes, and write your own code generation.
Otherwise, you'd better extend an existing extensible language with your DSL, reusing its parser, its semantics and its code generation functionality. For example, you can easily implement an almost arbitrarily complex DSL on top of Clojure macros (and since Clojure itself is compiled to JVM bytecode, you get that for free).
A DSL with simple syntax may or may not mean simple semantics.
Simple semantics may or may not mean easy translation to a target language; such translations are "technically easy" only if the DSL and the target language share a lot of common data types and execution models. (Constraint systems have simple semantics, but translating them to Fortran is really hard!) (You gotta wonder: if translating your DSL is easy, why do you have it?)
If you want to build a DSL (in your case you stick with easy because you are learning), you want DSL compiler infrastructure that has whatever you need in it, including support for difficult translations. "What is needed" to handle translating all DSLs to all possible target languages is clearly an impossibly large set of machinery.
However, there is a lot that is clearly helpful:
Strong parsing machinery (who wants to diddle with grammars whose structure is forced by the weakness of the parsing machinery? If you don't know what this means, go read about LL(1) grammars as an example)
Automatic construction of a representation (e.g., an abstract syntax tree) of the parsed DSL
Ability to access/modify/build new ASTs
Ability to capture information about symbols and their meaning (symbol tables)
Ability to build analyses of the AST for the DSL, to support translations that require information from "far away" in the tree to influence the translation at a particular point in the tree
Ability to reorganize the AST easily to achieve local optimizations
Ability to construct/analyze control and dataflow information if the DSL has some procedural aspects, and the code generation requires deep reasoning or optimization
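To make the first few items in that list concrete, here is a toy end-to-end sketch in Python: tokenizing, parsing into an AST, a trivial symbol check, and code generation for an imaginary stack machine. Everything in it is an assumption made for illustration (the let/print mini-language, the instruction names PUSH/LOAD/STORE/ADD/MUL/PRINT, and the names Parser and compile_program); it is not how DMS, ANTLR, or any JVM bytecode library works.

```python
import re

TOKEN_RE = re.compile(r"\d+|[A-Za-z_]\w*|\S")


class Parser:
    def __init__(self, line):
        self.tokens, self.pos = TOKEN_RE.findall(line), 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def statement(self):       # stmt := 'let' NAME '=' expr | 'print' expr
        if self.peek() == "let":
            self.take("let")
            name = self.take()
            self.take("=")
            node = ("let", name, self.expr())
        else:
            self.take("print")
            node = ("print", self.expr())
        if self.peek() is not None:
            raise SyntaxError(f"unexpected {self.peek()!r}")
        return node

    def expr(self):            # expr := term ('+' term)*
        node = self.term()
        while self.peek() == "+":
            self.take()
            node = ("add", node, self.term())
        return node

    def term(self):            # term := atom ('*' atom)*
        node = self.atom()
        while self.peek() == "*":
            self.take()
            node = ("mul", node, self.atom())
        return node

    def atom(self):            # atom := NUMBER | NAME | '(' expr ')'
        tok = self.take()
        if tok.isdecimal():
            return ("num", int(tok))
        if tok == "(":
            node = self.expr()
            self.take(")")
            return node
        if tok[0].isalpha() or tok[0] == "_":
            return ("var", tok)
        raise SyntaxError(f"unexpected {tok!r}")


def compile_program(source):
    """Return stack-machine instructions; reject uses of undeclared names."""
    symbols, code = set(), []

    def emit(node):
        kind = node[0]
        if kind == "num":
            code.append(("PUSH", node[1]))
        elif kind == "var":
            if node[1] not in symbols:          # the (tiny) symbol table check
                raise NameError(f"undeclared variable {node[1]!r}")
            code.append(("LOAD", node[1]))
        else:                                   # "add" or "mul"
            emit(node[1])
            emit(node[2])
            code.append((kind.upper(),))

    for line in source.strip().splitlines():
        stmt = Parser(line).statement()
        emit(stmt[-1])                          # the expression is the last child
        if stmt[0] == "let":
            symbols.add(stmt[1])
            code.append(("STORE", stmt[1]))
        else:
            code.append(("PRINT",))
    return code


PROGRAM = """
let x = 2 + 3 * 4
let y = x * (x + 1)
print y
"""
instructions = compile_program(PROGRAM)
```

The first statement compiles to PUSH 2, PUSH 3, PUSH 4, MUL, ADD, STORE x. Note how little of the list this covers: there is no tree rewriting, no far-away analysis, and no flow analysis, and the toy is already around a hundred lines.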
Most of the tools available for "building DSL generators" provide some kind of parsing, perhaps tree building, and then leave you to fill in all the rest. This puts you in the position of having a small, clean DSL but taking forever to implement it. That's not good. You really want all that infrastructure.
Our DMS Software Reengineering Toolkit has all the infrastructure sketched above and more. (It clearly doesn't, and can't, have the moon.) You can see a complete, all-in-one-"page", simple DSL example that exercises some interesting parts of this machinery.

decompilation resources and theory

There must be a million books and papers on the theory and techniques of building compilers. Are there any resources on doing the reverse? I'm not interested in any particular hardware platform. I'm looking for good books or research papers that examine the subject and its difficulties in depth.
I've worked on an AS3 and Java decompiler and I can assure you that everything I've learned in regards to decompilation is straight from compiler theory. Intermediate representations, data flow analysis, term rewriting, and other related concepts can all be found in the dragon book.
I've written about decompilers for dynamic languages here and for Python specifically.
Note though this is for dynamic languages with custom (high-level) VMs.
Decompilation is really a misnomer. Decompilers compile object code into a source representation. In many ways they are easier to write than traditional compilers - the 'source' code is already syntax checked and usually very precisely formatted.
They build up a symbol table (of addresses) and construct a target language representation of the application. The usual difficulty is that the original compiler has to a greater or lesser degree optimised the original application by removing common sub-expressions, hoisting constant code out of loops and many other similar techniques. These are often not possible to represent in the target language.
In cases where the source is for a well defined VM, then often this optimisation is left to the JIT compiler and the resulting decompiled code is very readable - in many cases almost identical to the original. Compilers of this type often leave some or all of the symbols in the object code allowing these to be recovered. Others include line numbers to help with debugging and troubleshooting. These all help to recover the original code.
As a counter, there are code obfuscators that deliberately transform the code to prevent simple restoration of the original source: scrambling names, changing the sequence in which code is generated (without changing its resulting meaning), and introducing constructs for which there is no source language equivalent.
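As a small illustration of symbols and line numbers surviving in VM object code, the sketch below inspects a compiled CPython function. CPython is only a stand-in chosen for convenience (the answers above talk about Java, AS3, and native code); the function area is a made-up example.

```python
import dis


def area(width, height):
    scale = 2
    return width * height * scale


code = area.__code__
local_names = code.co_varnames        # parameter and local variable names
first_line = code.co_firstlineno      # line where the function was defined
opnames = [ins.opname for ins in dis.get_instructions(area)]
```

The names width, height, and scale are all still present in the compiled code object, along with line information, which is a large part of why decompiled output for such VMs can look so close to the original.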

Moving from static language to dynamic

There are a lot of discussions all over the internet and on SO, e.g. here and here, about static vs dynamic languages.
I'm not going to ask again about one vs. the other. Instead, my question is for those who moved (or at least tried to move) from a statically typed language to a dynamic one.
I'm not talking about moderate usage of JS on your web page, some other scripting language embedded into statically typed software, or small personal scripts. I mean moving to a dynamic language as your primary general-purpose language for developing production-quality software in a team.
Was that easy? What was the biggest advantage and the biggest challenge? Was it fun? :)
UPD: Did you find IDE support good enough? Did you find that you need less IDE support?
Was that easy?
Moderately. Some Java-isms are hard habits to break. My first six months, I wrote Python with ;'s. Icky. Once I was over it, though, I haven't looked back.
What was the biggest advantage?
Moving from the "write -> compile -> build -> run -> break -> debug -> write" cycle to a "write -> run -> break -> write" cycle. It takes time to get used to immediate gratification from the Python command-line interpreter. I was soooo used to endless design and planning before attempting to write (much less compile) any code.
At first I considered the python command line to be a kind of "education-only" interface. Then reading docstrings, doctests, and user guides where the application is being typed at the >>> prompt, I started to realize that the truly great Python software boils complexity and nuance down to stuff you can type interactively.
[I wish I could design stuff that worked that cleanly.]
What was the biggest challenge?
Multiple inheritance. I use it very rarely.
Was it fun?
So far.
It's also amazingly productive. More time with user requirements and real data. Less time planning an inheritance hierarchy with proper interfaces to capture meaning and compile correctly and be extensible enough to last at least to the next revision.
If I were you, I would try Scala!
Scala has some really interesting aspects that let you feel like you are working dynamically while staying static.
Scala is a statically typed language with a dynamically typed smell, because the compiler makes you less repetitive by inferring the types of your assignments.
A compiled language with a warm and wonderful script flavor. Because you can use the Scala console, or even write scripts just like in Ruby or Python, you can choose between "write -> compile -> build -> run -> break -> debug -> write" and "write -> run -> break -> write", as S.Lott said.
Scala is a complete functional language with full support for OO. So you don't lose many important OO aspects like inheritance, encapsulation, polymorphism, etc.
Why answer your question by suggesting Scala? Because I tried scripting languages before, mainly Ruby. It was just as S.Lott said, but not so easy for me and my team. Most of the time static is safer, less error-prone, and even faster if you have the right language.
Answering your three questions with Scala in mind:
Was that easy?
Yes. Sometimes you need to concentrate to leave your old concepts aside and go deep.
What was the biggest advantage?
You feel at home because you don't need to change your environment or rewrite existing applications to migrate to Scala (talking about Java). If you come from Java, you can start playing with Scala after reading some articles. Not too much effort. Another important advantage is the use of a functional language and its embedded power.
Was it fun?
Sure! Changing your mind, and changing the way you solve problems for the better, is certainly fun.
This is my view. You don't need to leave static typing behind to gain the advantages of dynamic.
Nice question.
I am now working in Ruby, PHP and ActionScript (the least dynamic of the three) instead of languages that I would prefer, like Java and C#. But beggars, I mean, workers in this economy, can't be choosers. Or rather, you have to choose your battles and your master.
It's hard to compare Ruby and Java because they've got more than one difference, and you only asked about the dynamic/static thing (and not even about the strongly vs. weakly-typed thing!). But on that front, what affects me most is always the IDE. I was always horrified when other Java programmers used Notepad or Textpad to write code, and nowadays there are just too many advantages of a good IDE for that madness. Not true with Ruby! I use Netbeans and it does really well, but one of the main differences is that I have to actually type code. Autocomplete, for me, was/is a way of life (I write SMS messages in full English/Spanish with the predictive dictionary, for instance, and never use abbreviations) and writing Ruby code does require more work.
So at first it was painful and I was constantly looking at, for instance, function names of classes that I had written (or that are part of Ruby) just to get the spelling right! So that sucked, I thought, and I continued to think that until...
I moved back to ActionScript the other day, and to get my IDE autocompleting (FlashDevelop or FlexBuilder) I declare all variables with types (strongly-typed by choice, if you will)... and suddenly I thought what a friggin' hassle!
And then today I had to do some feature additions on a Ruby project and it felt free and cool. The code is clean, and why would I be informing the IDE of what I'm trying to write anyway?
So I would say that 1) the biggest challenges are learning the language and the framework you're working in, like always 2) it's been amazingly fun and deeply eye-opening. New languages always carry new things with them, but dynamic languages just feel different. And that's just the kind of thing that gets you to wake up at 7am and do some coding on a Sunday morning before falling asleep again.
I like programming and, like most of you, I've spent some time with stored procedures, XSL, static, dynamic, whatever... it's all fun, and they all feel totally different. In the end, the framework you are working in will be the thing that convinces you to stay or not (if you have a choice), I think, but languages are to be learned, studied and experienced, not compared.
I can't qualify myself fully under that handle, but I did spend a while writing an interesting Python mini-game after having spent many years writing Java. So, I might be mixing a little bit of moving from compiled to interpreted along with it.
I found myself using notation to mimic static typing. :)
However, I did find myself cranking code out at a slightly better clip. Having an interpreter is a godsend as far as learning a new language or writing new code goes. The shorter the time between finishing a line of code and seeing it work, the faster you can write, and I think that is probably the best thing about most dynamic and interpreted languages.
My code didn't look too different, all things considered. Though, Python has a lot of fun data structures. :)
I'm also interested in this topic.
I tried to dive into Ruby and Rails a while ago, and it really helped me to grasp the ASP.NET MVC stuff, which I think is a bit too challenging at first for the average .NET developer.
If you're interested in moving in this direction, or curious about how some developers moved from static to dynamic languages in their full-time jobs, I highly recommend this Alt.Net podcast.