Does it make sense to perform lemmatization and bigram tokenization?

I'm new to R and I've been learning a lot from forums like this one. I'm having an issue with my analysis and would like your help to understand and solve it.
In my work I'm using a collection of texts.
I now have a corpus (VCorpus) which I've cleaned: numbers, punctuation, extra whitespace and stopwords removed, and everything converted to lowercase.
After this basic preprocessing, I applied lemmatization, and judging by the output it seemed to work:
myCorpus <- tm_map(myCorpus, lemmatize_strings)
In my study I'm doing topic modelling, and I think it would be interesting to use not only unigrams but also bigrams (many financial concepts consist of two words). So I found code to do bigram tokenization and used it:
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=2))
The problem is when I tried to create my DTM matrix:
DTM <- DocumentTermMatrix(myCorpus, control = list(tokenize = BigramTokenizer))
this error occurred:
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character".
First of all, does it make sense to create bigrams after lemmatization? I couldn't find anything to read about it anywhere. I'm fairly certain the problem is related to this, since I tried creating the DTM after every corpus-cleaning step, and also with the bigram tokenizer, and no error like this one came up.
Could anyone please help me?

length of 'dimnames' [1] not equal to array extent

I am new to using R, and actually to most programming languages, so I am a bit lost here. Hope you can help. I am using RCMap, for which I have four CSV documents, and I get the following error:
Error in dimnames(x) <- dn :
length of 'dimnames' [1] not equal to array extent
I am sure it has something to do with my own data, because I get normal output if I use other people's data. However, I don't know where the problem is (not even in which of the four documents). I do have a lot of missing data, but changing the missing values to either blank spaces or NA does not change the error.
The documents from other people that I am able to run also contain missing data, although to a lesser extent.
Hope you can help,
best wishes, Doriene
I had a similar problem, and it helped when I put a space in front of c__Bacilli.
Ex: test <- subset_taxa(phylo, Class==" c__Bacilli")

How does the spaCy tokenizer split sentences?

I am finding the tokenization code quite complicated and I still couldn't find where in the code the sentences are split.
For example, how does the tokenizer know that
Mr. Smitt stayed at home. He was tired
should not be split at "Mr." but should be split before "He"? And where in the code does the split before "He" happen?
(In fact, I am unsure whether I am even looking in the right place: if I search for sents in tokenizer.pyx I don't find any occurrence.)
You access the splits via the doc object, with the generator:
doc.sents
The output of the generator is a series of spans.
As for how the splits are chosen, the document is parsed for dependency relationships. Understanding the parser is not trivial, and you'll have to read into it if you want the details: it uses a neural network to inform its decisions about how to construct the dependency trees. The splits, however, are simply those gaps between tokens which are not crossed by any dependency. This is not the same as splitting wherever a full stop appears, and the method is more robust as a result.
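To make the "gaps not crossed by dependencies" idea concrete, here is a toy Python sketch that does not use spaCy at all and is not spaCy's actual implementation. The `heads` list is a hand-written, assumed parse of the example sentence (each entry is the index of that token's head, with a root pointing at itself), and `split_on_uncrossed_gaps` is a name made up for this illustration.

```python
def split_on_uncrossed_gaps(tokens, heads):
    """Group tokens into sentences, splitting at every gap that no arc spans."""
    sentences, start = [], 0
    for gap in range(len(tokens) - 1):  # the gap between token `gap` and `gap + 1`
        crossed = any(
            min(child, head) <= gap < max(child, head)
            for child, head in enumerate(heads)
        )
        if not crossed:
            sentences.append(tokens[start:gap + 1])
            start = gap + 1
    if tokens:
        sentences.append(tokens[start:])
    return sentences


tokens = ["Mr.", "Smitt", "stayed", "at", "home", ".", "He", "was", "tired"]
# Hypothetical head indices (an assumed parse): "stayed" and "was" are the roots.
heads = [1, 2, 2, 2, 3, 2, 7, 7, 7]

for sentence in split_on_uncrossed_gaps(tokens, heads):
    print(" ".join(sentence))
```

With this assumed parse, the gap after "Mr." is crossed by the arc between "Mr." and "Smitt", so no split happens there, while the gap after the full stop is crossed by nothing.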

wit.ai 'Only if..' not working

I am new to wit.ai and confused by it. I have a few questions:
First, how do the Actions 'Only if..' and 'Always if...' work?
Put simply, I have 2 entities, 'Hi' and 'Botname', and 2 stories: when I say 'Hi', wit answers 'answer1', and when I say 'Botname', wit answers 'answer2'. That works, but when the two are combined as 'Hi Botname', I want wit to answer 'answer1', and I can't achieve that without adding a story. I tried adding, under Actions -> 'Answer2' -> 'Only if..', the condition 'doesn't have' -> 'Hi', but it still answers 'Answer2' and I don't understand why :)
Second, I sometimes don't get an adequate answer from wit and I don't know how to avoid such cases. For example, with an entity 'constitution', when I write 'station' in 'Understanding', wit returns 'constitution', even though these are two different words. What should I do? Please help.
To the first question, I'd suggest that rather than trying to use the keyword and free-text format of entities, you define and assign a trait entity which will not necessarily try to match the exact word, but the feeling of the sentence.
For example
Given the situation above, if you were to train an intent
called "greeting" to recognize all sentences with "Hi" in it as
greetings, then the result of "Hi Botname" will continue to be the
result of Hi. Also, if you're going to be using branching, entities
will have to be defined as trait entities in either case.
To the second (And this will help with the first), you just have to spend some time training the bot to understand. You can't rush the brush. You'll have to feed it some examples before it can understand the difference in the words, and start to pick those differences up in future words.
The Wit Bot engine was released only a little while ago, so we're all learning it now, but I hope I could help you with the little knowledge I've gained.

Add spaces between words in spaceless string

I'm on OS X, and in Objective-C I'm trying to convert,
for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes misformatted text (for example, text copy-pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long... well, I'm just trying to figure out whether this is feasible before I attempt to write a system, only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe in a non-OS-specific manner, is to search through all the possible words that can be formed from the collection of letters.
Basically, you chop off the first letter of your letter collection and add it to the word you are currently forming. If that makes a word (e.g. according to a dictionary lookup), you add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, then you have a full sentence. But you don't have to stop there. Instead, you keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    if (l.empty())
        return;
    add first letter from l to w;
    if w in dictionary
    {
        add w to s;
        FindWords(sentences, s, empty word, l)
        remove w from s
    }
    FindWords(sentences, s, w, l)
    put last letter from w back onto l
}
There are, of course, a number of optimizations you could perform to make it faster. For instance, you could check whether the partial word is the start of any word in the dictionary and abandon the branch if it is not. But this is the basic approach that will give you all possible sentences.
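The search described above can be sketched in Python as follows. This is a minimal illustration, not the answerer's code: the function name `segment`, the tiny dictionary, and the use of memoization (so each suffix is solved only once) are all assumptions made for the example.

```python
from functools import lru_cache


def segment(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    words = frozenset(dictionary)
    max_len = max(map(len, words), default=0)

    @lru_cache(maxsize=None)
    def split_from(start):
        # All word lists that exactly cover text[start:].
        if start == len(text):
            return [[]]
        results = []
        # No candidate word can be longer than the longest dictionary word.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in words:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return [" ".join(word_list) for word_list in split_from(0)]


# Toy dictionary, assumed for illustration only.
toy_dictionary = {"bob", "ate", "a", "tea", "green", "apple"}
for sentence in segment("bobateagreenapple", toy_dictionary):
    print(sentence)
```

With this toy dictionary the search finds both "bob a tea green apple" and "bob ate a green apple", which shows why some ranking of the candidates is still needed afterwards.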
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool on a PDF and correct its output than it would just to correct what this system might give you, let alone program it.
I implemented a solution; the code is available on CodeProject:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up most of the characters (preferably all of them), and then to favor the ones with the longest words, because 2-, 3- or 4-character words can often come up by chance from leftover characters. Most of the time this provides the correct solution.
To find all possible permutations I used recursion. The code is quite fast even with big dictionaries (tested with 50 000 words).
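The prioritization described above could look roughly like this in Python. This is not the CodeProject code: `rank_candidates` is a hypothetical helper, the candidate list is made up, and "favor the longest words" is approximated here by preferring fewer words for the same number of covered characters.

```python
def rank_candidates(candidates):
    """Sort candidate word lists: most characters covered first, then fewest words."""
    def sort_key(words):
        covered = sum(len(word) for word in words)
        return (-covered, len(words))

    return sorted(candidates, key=sort_key)


# Made-up candidates for the input "nowhere"; the first one leaves a letter unused.
candidates = [["now", "her"], ["no", "where"], ["nowhere"], ["now", "here"]]
for words in rank_candidates(candidates):
    print(" ".join(words))
```

Candidates that tie on both criteria keep their input order, so a real implementation would probably want a further tie-breaker such as word frequency.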

What is readable code? What are the best practices to follow while naming variables?

Do you think x, y, z are good variable names? How would you explain to a new programmer how to write readable code?
Readable code means some combination of comments and variable and function naming that allows me to read the code once and understand it. If I have to read it more than once, or spend my time working through complicated loops or functions, there's room for improvement.
Good summary descriptions at the top of files and classes are useful to give the reader context and background information.
Clear names are important. Verbose names make it much easier to write readable code with far fewer comments.
Writing readable code is a skill that takes some time to learn. I personally like overly verbose names because they create self documenting code.
As already stated x, y, and z are good variables for 3D coordinates but probably bad for anything else...
If someone does not believe that names are important, just use a code obfuscator on some code then ask them to debug it :-).
(BTW that's the only situation where a code obfuscator can be useful IMHO)
There seem to be slightly different conventions per programming language; however, the consensus these days is to...
use Pascal case
make the name meaningful
end with a noun
Here is a decent recap of what Microsoft publishes as standard naming conventions for .NET
The inventor of Python has published a style guide which includes naming conventions.
There was a time when Microsoft VC++ developers (myself included) actually rallied around what was known as Hungarian Notation.
Certainly there are multiple schools of thought on this, but I would only use these for counters, and advise far more descriptive names for any other variables.
x, y and z can be perfectly good variable names. For example you might be writing code that refers to them in reference to a 3D cartesian coordinate system. These names are often used for the three axes in such a system and as such they would be well suited.
I would give them some maintenance work on some code with variables called x, y, z and let them realise for themselves that readability is vital...
95% of code viewing is not by the author, but by the customer that everyone forgets about - the next programmer. You owe it to her to make her life easy.
Good variable names describe exactly what they are without being overly complex. I always use descriptive names, even in loops (for instance, index instead of i). It helps keep track of what's going on, especially when I'm working on rewriting a particularly complex piece of code.
Well, give them a chunk of bad code and ask them to debug it.
Take the following code (a simple example):
<?php $a = fopen('/path/to/file.ext', 'w');$b = "NEW LINE\n";fwrite($a, $b);fclose($a);?>
The bug is: File only ever contains 1 line when it should be a log
Problem: 'w' in fopen should be 'a'
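For contrast, here is a sketch of the same logging logic written with descriptive names, in Python rather than PHP (Python's open() uses the same "w" and "a" mode letters). The function names and the temporary file path are assumptions made for this example.

```python
import os
import tempfile


def append_log_line(log_path, message):
    """Append one line to the log file; mode "a" keeps the existing lines."""
    # Mode "w" here would truncate the file on every call, which is the bug above.
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(message + "\n")


def read_log_lines(log_path):
    """Return the log file's lines without trailing newlines."""
    with open(log_path, encoding="utf-8") as log_file:
        return log_file.read().splitlines()


# Throwaway location, used only for this demonstration.
log_path = os.path.join(tempfile.mkdtemp(), "example.log")
append_log_line(log_path, "NEW LINE")
append_log_line(log_path, "NEW LINE")
print(len(read_log_lines(log_path)))
```

With a name like append_log_line, a "w" mode in the body stands out immediately as contradicting the stated intent.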
This is obviously a super easy example. If you want to give them a bigger, more complicated one, give them the WMD source and ask them to give you readable code in 2 hours; it will get your point across.
As long as x, y and z are (3D) Cartesian co-ordinates, then they're great names.
In a similar vein, i, j and k would be OK for loop variables.
In all cases, the variable names should relate to the data.
x, y and z are acceptable variable names if they represent 3D coordinates, or if they're used for iterating over 2- or 3-dimensional arrays.
This code is fine as far as I'm concerned:
for(int x = 0; x < xsize; x++)
{
    for(int y = 0; y < ysize; y++)
    {
        for(int z = 0; z < zsize; z++)
        {
            DoSomething(data[x][y][z]);
            ...
This one is a short answer, but it works very well for me:
If it would need a code comment to describe it, then rethink the variable name.
So if it's obvious why "x" was chosen, then it is a good name. E.g. "i" as a variable name in a loop is (often) pretty obvious.
An ideal variable name is both short (to make the code more compact) and descriptive (to help in understanding the code).
Opinions differ on which of the two is more important. Personally, I'd say it depends on the scope of the variable. A variable used only inside a 3 line loop can get away with being single letter. A class field in a 500 line class better be pretty damn descriptive. The Spartan Programming philosophy says that as far as possible, all units of code should be small enough that variable names can be very short.
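A small Python sketch of that scope-based rule of thumb follows. The class, method and function names are invented for illustration: the long-lived field gets a descriptive name, while the variable that lives for two lines gets a single letter.

```python
class InvoiceLedger:
    """Long-lived state that is read far from where it is set: descriptive names."""

    def __init__(self):
        self.unpaid_totals_by_customer = {}

    def add_invoice(self, customer_name, amount_due):
        totals = self.unpaid_totals_by_customer
        totals[customer_name] = totals.get(customer_name, 0) + amount_due


def sum_of_squares(values):
    total = 0
    for v in values:  # a single letter is fine in a loop this small
        total += v * v
    return total


ledger = InvoiceLedger()
ledger.add_invoice("Acme", 120)
ledger.add_invoice("Acme", 30)
print(ledger.unpaid_totals_by_customer["Acme"], sum_of_squares([1, 2, 3]))
```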
Readable code and good naming conventions are not the same thing!
A good name for a variable is one that allows you to understand (or reasonably guess) the purpose and type of the variable WITHOUT seeing the context in which it is used. Thus "x", "y" and "z" suggest coordinates, because that is a reasonable guess. Conversely, a bad name is one that leads you to a likely but wrong guess, for example if "x", "y" and "z" represent people.
A good name for a function is one that conveys everything you would need to know about it without having to consult its documentation. That is not always possible.
Readable code is first of all code whose structure could be understood even if you obfuscated all variable and function names. If you do that and can't figure out the control structure easily, you're screwed.
Once you have readable code and good naming, then maybe you'll have truly readable code.