What are some common data augmentation techniques for code? - data-augmentation

I understand that there are data augmentation techniques for natural language such as word/sentence shuffling, word replacement with synonyms and syntax-tree manipulation.
However, I have a hard time finding data augmentation techniques for code snippets (e.g. Java, C++, etc.).
For example, natural language (e.g. "I am a human") and a code snippet (e.g. def function(foo): print(foo)) have quite different syntactic and semantic characteristics. Thus, I don't think the data augmentation techniques for natural language (e.g. word replacement with synonyms) can be applied to code snippets.
Could someone tell me what some common data augmentation techniques used for code are? Thank you.

Related

How to build short sentences with a small letter set restriction?

I'm looking for a way to write a program that creates short German sentences with a restricted letter set. The sentences can be nonsense but should be grammatically correct. The following examples only contain the letters "aeilmnost":
"Antonia ist mit Tina im Tal."
"Tamina malt mit lila Tinte Enten."
"Tina nimmt alle Tomaten mit."
For this task I need a dictionary like this one (found in the answer to "Where can I find a parsable list of German words?"). The research area for programmatically creating text is NLG - Natural Language Generation. On the NLG-Wiki I found a large table of NLG systems. I picked two from the list which could be appropriate:
SimpleNLG - a Java API, which also has an adaptation for the German language
KOMET - multilingual generation, from the University of Bremen
Have you worked with an NLG library, and do you have any advice on which one to use for building short sentences with a letter-set restriction?
Can you recommend a paper on this topic?
"Grammatically correct" is a pretty fuzzy area, since grammar is not as strictly defined as one might think. What you really want here, though, is a part-of-speech tagger and a Markov chain.
Specifically, a Markov chain says that given a certain state (the first word, for instance), there is a certain chance of moving on to another state (the next word). They are relatively easy to write from scratch, but I've got a gist here in Python that shows how they work if you want an example.
Once you've got that, I would suggest a part-of-speech-based Markov chain, combined with simply checking whether words are constructed from your desired character set. In general, the algorithm goes something like this (a rough Python sketch follows the list):
1. Pick the first word at random, checking that it is constructed solely from your desired set of characters.
2. Use the Markov chain to predict the next word.
3. Check if that word is an appropriate part of speech, and that it conforms to the desired character set.
4. If not, predict another word until it is the case.
5. If so, repeat from step 2 until the sentence is complete.
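A minimal word-level sketch of that loop in Python, skipping the part-of-speech check for brevity (the ALLOWED set and the function names are just placeholders, not code from the gist):

import random
from collections import defaultdict

ALLOWED = set("aeilmnost")  # the restricted letter set from the question

def fits(word):
    # True if the word uses only the allowed letters (case-insensitive).
    return bool(word) and set(word.lower()) <= ALLOWED

def build_chain(sentences):
    # Word-level Markov chain: map each word to the words observed after it.
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain

def generate(chain, max_len=8):
    # Step 1: pick a first word that fits the letter restriction.
    word = random.choice([w for w in chain if fits(w)])
    sentence = [word]
    for _ in range(max_len - 1):
        # Steps 2-4: predict followers and keep only those that fit.
        candidates = [w for w in chain.get(word, []) if fits(w)]
        if not candidates:
            break
        word = random.choice(candidates)
        sentence.append(word)
    return " ".join(sentence)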
Hope that's what you're looking for. Let me know if you have any more questions.
As Slater Tyranus already said, Markov chains certainly form the basis of this task. I am going to suggest a more heavy-duty approach. It is considerably more work, but is likely to give much better results in terms of grammatical correctness.
Language Model based on PCFG parse trees: A language model works by assigning a probability to a sequence of words. However, it first has to be built from training data. In your case, the training process should disregard words containing letters outside the limited set.
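A rough sketch of that filtering step, assuming a plain-text corpus with one sentence per line (the path argument and function names are made up for illustration):

ALLOWED = set("aeilmnost")

def usable(sentence):
    # Keep a sentence only if every word uses just the allowed letters.
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return bool(words) and all(set(w) <= ALLOWED for w in words)

def training_sentences(corpus_path):
    # Yield the corpus sentences that survive the letter restriction.
    with open(corpus_path, encoding="utf-8") as corpus:
        for line in corpus:
            if usable(line):
                yield line.strip()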
While theoretically a language model based on parse trees is much more likely to serve your purpose, there is one caveat: due to the kind of letter-based restriction you have, data sparsity will certainly raise its ugly head. Backoff techniques (e.g. Katz's backoff model) can help a bit, but it will essentially depend on whether or not you can train on enough data.
As far as readily available parsers are concerned, the Stanford NLP group provides a German parser based on the Negra corpus, as mentioned on their home page.

Suitability of Naive Bayes classifier in Mahout to classifying websites

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.
In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5 minute job, so I'm doing plenty of research.
From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.
I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroup test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.
Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.
Alternatively, if I'm barking up entirely the wrong tree please let me know!
You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.
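To make that concrete, here is a generic Python sketch, not Mahout code, of how contextual attributes such as URL tokens and hyperlink text could be injected alongside plain word features before the result is turned into a vector (the function name and feature prefixes are invented for the example):

from urllib.parse import urlparse

def feature_vector(url, page_text, link_texts):
    # Bag-of-features dictionary: words plus extra contextual attributes.
    features = {}
    for word in page_text.lower().split():
        features["word:" + word] = features.get("word:" + word, 0) + 1
    # URL tokens (e.g. "cnn", "com") become features of their own.
    for token in urlparse(url).netloc.split("."):
        features["url:" + token] = 1
    # Anchor text of hyperlinks on the page.
    for text in link_texts:
        for word in text.lower().split():
            features["link:" + word] = features.get("link:" + word, 0) + 1
    return features

print(feature_vector("http://cnn.com/world", "breaking news today", ["more news"]))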

How to extract semantic relatedness from a text corpus

The goal is to assess semantic relatedness between terms in a large text corpus, e.g. 'police' and 'crime' should have a stronger semantic relatedness than 'police' and 'mountain' as they tend to co-occur in the same context.
The simplest approach I've read about consists of extracting TF-IDF information from the corpus.
A lot of people use Latent Semantic Analysis to find semantic correlations.
I've come across the Lucene search engine: http://lucene.apache.org/
Do you think it is suitable to extract TF-IDF?
What would you recommend to do what I'm trying to do, both in terms of technique and software tools (with a preference for Java)?
Thanks in advance!
Mulone
Yes, Lucene gets TF-IDF data. The Carrot^2 algorithm is an example of a semantic extraction program built on Lucene. I mention it since, as a first step, they create a correlation matrix. Of course, you probably can build this matrix yourself easily.
If you deal with a ton of data, you may want to use Mahout for the harder linear algebra parts.
It is very easy if you have a Lucene index. For example, to get a correlation measure you can use the simple formula count(term1 AND term2) / (count(term1) * count(term2)), where count is the number of hits from your search results. Moreover, you can easily calculate other semantic metrics such as chi^2 and information gain. All you need is to take a formula and express it in terms of hit counts obtained from queries.
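As a minimal illustration of that formula in plain Python, with an in-memory toy corpus standing in for the Lucene index (document counts play the role of hit counts; the function name is made up):

def relatedness(docs, term1, term2):
    # count(term1 AND term2) / (count(term1) * count(term2)), with docs as token lists.
    has1 = sum(1 for d in docs if term1 in d)
    has2 = sum(1 for d in docs if term2 in d)
    both = sum(1 for d in docs if term1 in d and term2 in d)
    if has1 == 0 or has2 == 0:
        return 0.0
    return both / (has1 * has2)

corpus = [
    ["police", "arrested", "a", "suspect", "for", "the", "crime"],
    ["police", "investigate", "the", "crime", "scene"],
    ["a", "mountain", "hiking", "trail"],
]
print(relatedness(corpus, "police", "crime"))     # co-occurring terms: > 0
print(relatedness(corpus, "police", "mountain"))  # unrelated terms: 0.0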

Is functional programming considered more "mathematical"? If so, why?

Every now and then, I hear someone saying things like "functional programming languages are more mathematical". Is it so? If so, why and how? Is, for instance, Scheme more mathematical than Java or C? Or Haskell?
I cannot define precisely what "mathematical" means, but I believe you can get the feeling.
Thanks!
There are two common(*) models of computation: the Lambda Calculus (LC) model and the Turing Machine (TM) model.
Lambda Calculus approaches computation by representing it using a mathematical formalism in which results are produced through the composition of functions over a domain of types. LC is also related to Combinatory Logic, which is considered a more generalized approach to the same topic.
The Turing Machine model approaches computation by representing it as the manipulation of symbols stored on idealized storage using a body of basic operations (like addition, mutation, etc).
These different models of computation are the basis for different families of programming languages. Lambda Calculus has given rise to languages like ML, Scheme, and Haskell. The Turing Model has given rise to C, C++, Pascal, and others. As a generalization, most functional programming languages have a theoretical basis in lambda calculus.
Due to the nature of Lambda Calculus, certain proofs are possible about the behavior of systems built on its principles. In fact, provability (i.e. correctness) is an important concept in LC, and makes possible certain kinds of reasoning and conclusions about LC systems. LC is also related to (and relies on) type theory and category theory.
By contrast, Turing models rely less on type theory and more on structuring computation as a series of state transitions in the underlying model. Turing Machine models of computation are more difficult to make assertions about and do not lend themselves to the same kinds of mathematical proofs and manipulation that LC-based programs do. However, this does not mean that no such analysis is possible - some important aspects of TM models are used when studying virtualization and static analysis of programs.
Because functional programming relies on careful selection of types and transformation between types, FP can be perceived as more "mathematical".
(*) Other models of computation exist as well, but they are less relevant to this discussion.
Pure functional programming languages are examples of a functional calculus and so in theory programs written in a functional language can be reasoned about in a mathematical sense. Ideally you'd like to be able to 'prove' the program is correct.
In practice such reasoning is very hard except in trivial cases, but it's still possible to some degree. You might be able to prove certain properties of the program, for example you might be able to prove that given all numeric inputs to the program, the output is always constrained within a certain range.
In non-functional languages with mutable state and side effects, attempts to reason about a program and 'prove' its correctness are all but impossible, at the moment at least. With non-functional programs you can think through the program and convince yourself that parts of it are correct, and you can run unit tests that check certain inputs, but it's usually not possible to construct rigorous mathematical proofs about the behaviour of the program.
I think one major reason is that pure functional languages have no side effects, i.e. no mutable state, they only map input parameters to result values, which is just what a mathematical function does.
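A tiny Python illustration of that contrast (both function names are invented for the example):

total = 0

def add_impure(x):
    # Not a mathematical function: the result depends on hidden mutable state.
    global total
    total += x
    return total

def add_pure(accumulator, x):
    # A mathematical function: the result depends only on its inputs.
    return accumulator + x

print(add_impure(2), add_impure(2))    # 2 4 -- same input, different results
print(add_pure(0, 2), add_pure(0, 2))  # 2 2 -- always the same for the same inputs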
The logical structure of functional programming is heavily based on lambda calculus. While it may not appear mathematical if you think only of algebraic forms of math, it translates very naturally from discrete mathematics.
In comparison to imperative programming, it doesn't prescribe exactly how to do something, but rather what must be done. This reflects topology.
The mathematical feel of functional programming languages comes from a few different features. The most obvious is the name: "functional", i.e. using functions, which are fundamental to math. The other significant reason is that functional programming involves defining a collection of things that will always be true, which through their interactions achieve the desired computation -- this is similar to how mathematical proofs are done.

Non-Speech Noise or Sound Recognition Software?

I'm working on some software for children, and looking to add the ability for the software to respond to a number of non-speech sounds. For instance, clapping, barking, whistling, fart noises, etc.
I've used CMU Sphinx and the Windows Speech API in the past, however, as far as I can tell neither of these have any support for non-speech noises, and in fact I believe actively filter them out.
In general I'm looking for "How do I get this functionality" but I suspect it may help if I break it down into three questions that are my guesses for what to search for next:
1. Is there a way to use one of the main speech recognition engines to recognize non-word sounds by changing an acoustic model or pronunciation lexicon?
2. (or) Is there already an existing library to do non-word noise recognition?
3. (or) I have a bit of familiarity with Hidden Markov Models and the underlying tech of voice recognition from college, but no good estimate on how difficult it would be to create a very small noise/sound recognizer from scratch (suppose <20 noises to be recognized). If 1) and 2) fail, any estimation on how long it would take to roll my own?
Thanks
Yes, you can use speech recognition software like CMU Sphinx for recognition of non-speech sounds. For this, you need to create your own acoustic and language models and define a lexicon restricted to your task. But to train the corresponding acoustic model, you must have enough training data with the sounds of interest annotated.
In short, the sequence of steps is the following:
First, prepare resources for training: lexicon, dictionary, etc. The process is described here: http://cmusphinx.sourceforge.net/wiki/tutorialam. But in your case, you need to redefine the phoneme set and the lexicon. Namely, you should model fillers as real words (so, no ++ around them) and you don't need to define the full phoneme set. There are many possibilities, but probably the simplest one is to have a single model for all speech phonemes. Thus, your lexicon will look like:
CLAP CLAP
BARK BARK
WHISTLE WHISTLE
FART FART
SPEECH SPEECH
Second, prepare training data with labels: something similar to VoxForge, but the text annotations must contain only labels from your lexicon. Of course, non-speech sounds must be labeled correctly as well. A good question here is where to get a large enough amount of such data, but I guess it should be possible.
Having that, you can train your model. The task is simpler compared to speech recognition; for instance, you don't need to use triphones, just monophones.
Assuming equal prior probability of any sound/speech, the simplest language model can be a loop-like grammar (http://cmusphinx.sourceforge.net/wiki/tutoriallm):
#JSGF V1.0;
/**
* JSGF grammar for non-speech sound recognition
*/
grammar foo;
public <foo> = (CLAP | BARK | WHISTLE | FART | SPEECH)+ ;
This is the very basic approach to using an ASR toolkit for your task. It can be further improved by fine-tuning the HMM configurations, using statistical language models, and using more fine-grained phoneme modeling (e.g. distinguishing vowels and consonants instead of having a single SPEECH model; it depends on the nature of your training data).
Outside the framework of speech recognition, you can build a simple static classifier that will analyze the input data frame by frame. Convolutional neural networks that operate over spectrograms perform quite well for this task.
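For instance, a small PyTorch sketch of such a classifier might look like this (not code from any particular library example; the class name, input size, and layer sizes are arbitrary, and spectrogram extraction and training are omitted):

import torch
import torch.nn as nn

class SoundClassifier(nn.Module):
    # Small CNN over fixed-size log-mel spectrogram patches (1 x 64 x 64).
    def __init__(self, n_classes=20):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):         # x: (batch, 1, 64, 64)
        x = self.features(x)      # -> (batch, 32, 16, 16)
        return self.classifier(x.flatten(1))

model = SoundClassifier(n_classes=20)          # e.g. <20 target noises
scores = model(torch.randn(8, 1, 64, 64))      # batch of 8 spectrogram patches
print(scores.shape)                            # torch.Size([8, 20])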
I don't know of any existing libraries you can use; I suspect you may have to roll your own.
Would this paper be of interest? It has some technical detail; they seem to be able to recognise claps and differentiate them from whistles.
http://www.cs.bham.ac.uk/internal/courses/robotics/halloffame/2001/team14/sound.htm