Number of vowels, consonants and syllables in a text document with RapidMiner - text-mining

I have a very short question: how do I calculate the number of vowels, consonants and syllables in a text document with RapidMiner?

Replace the vowels with an empty string and compare the length of the text before and after; the difference is the number of vowels. Do the same for consonants. Syllables are more challenging.
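The length-difference idea can be sketched outside RapidMiner, here in Python purely to illustrate the logic. The function name and the vowel set are assumptions made for this example (English vowels only, with "y" treated as a consonant); syllables are not attempted.

```python
import re

# Assumption: English vowels only; 'y' is treated as a consonant.
VOWELS = "aeiouAEIOU"


def count_vowels_and_consonants(text):
    """Count vowels and consonants by comparing lengths before and after removal."""
    # Keep letters only, so digits, spaces and punctuation are not counted.
    letters = re.sub(r"[^A-Za-z]", "", text)
    # Remove the vowels: the drop in length is the number of vowels.
    without_vowels = re.sub(f"[{VOWELS}]", "", letters)
    vowels = len(letters) - len(without_vowels)
    # Remove the consonants: the drop in length is the number of consonants.
    without_consonants = re.sub(f"[^{VOWELS}]", "", letters)
    consonants = len(letters) - len(without_consonants)
    return vowels, consonants


print(count_vowels_and_consonants("A walk in the park."))
```

In RapidMiner the same two steps would be a replace followed by a length calculation on the text attribute.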

Related

Is there a SQL regex function that can separate Emojis from other non-ASCII characters?

I have a table in SQL that contains text data, and sometimes that text data contains emojis. See below for sample table output:
CommentID | Comment_Text
1 | A walk in the park.
2 | A lovely day in the park
3 | A sunny day in the park 😛
What I'd like to do is separate out the emoji as a column. Ultimately what I'd like to end up with is a binary column indicating whether the comment text contains an emoji.
After some research, I was able to find the following solution:
REGEXP_SUBSTR(Comment_Text, '[^\x00-\x7F]+', 1, 1)
This separates out the first emoji the code finds. Strictly speaking, this regex doesn't find emojis; it finds non-ASCII characters, and emojis just happen to fall into that category. While this solution does work sometimes, it does not work when there are emojis and other non-ASCII characters in the same comment. So, for example, if the comment was 'A lovely walk in the πaƞk 😛', the code would output not only the emoji but also the π and the ƞ.
What I need is a way to separate out the emojis from the other non-ASCII characters.
Good day sir.
Can you try this pattern on the Regex101 site?
I think it will work.
[^\x00-\x7F]+[^x00-\x7F]
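For comparison, here is a minimal Python sketch of a different technique from the patterns above: instead of matching everything outside ASCII, it matches an explicit list of Unicode blocks that contain most emoji. The block list is an approximation chosen for this example, not the full Unicode emoji definition, and the names NON_ASCII, EMOJI and has_emoji are hypothetical. Whether a given SQL dialect accepts code points beyond the Basic Multilingual Plane in a character class depends on its regex engine.

```python
import re

# The pattern from the question: any run of non-ASCII characters.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

# Assumption: an approximate emoji class built from a few Unicode blocks.
EMOJI = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols and Pictographs
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\u2600-\u27BF"          # Miscellaneous Symbols and Dingbats
    "]"
)


def has_emoji(text):
    """Return True if the text contains a character from the emoji class."""
    return EMOJI.search(text) is not None


# 'A lovely walk in the park' with two non-ASCII letters, then an emoji.
comment = "A lovely walk in the \u03c0a\u019ek \U0001F61B"
print(NON_ASCII.findall(comment))  # non-ASCII letters and the emoji
print(EMOJI.findall(comment))      # the emoji only
print(has_emoji(comment))
```

The result of has_emoji maps directly onto the binary column described in the question.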

In SQL how to find two words after a searched word in string

I'm trying to write a SQL query on a table of strings that does the following:
First, it should find all strings in the table that contain the word "poor".
Then, in those strings, it should identify the word that is two places to the right of "poor" and copy that word to a new column.
You did not provide code, so I will not either; instead, I will try to guide you in the right direction.
First, find every string that contains the word "poor" using a wildcard match, so that the word can be at the beginning, middle, or end of the string. Next, find the position of the word "poor" in the string. Finally, locate the second space to the right of the end of the word "poor" and take everything after that space up to the next space: that is the word you're looking for (if your strings follow traditional sentence structure).
You'll also need to decide what to do if the word "poor" is one of the last words in the string, because there may not be a second word after it.
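The steps above can be sketched in Python to make the logic concrete before translating it to a particular SQL dialect. This is an illustration under stated assumptions: the function name word_two_after is hypothetical, words are taken to be separated by whitespace, and "poor" is matched as a whole word rather than with a wildcard.

```python
# Assumption: punctuation to strip from the edges of a word.
PUNCTUATION = ".,;:!?\"'"


def word_two_after(text, keyword="poor"):
    """Return the word two places to the right of keyword, or None."""
    words = text.split()
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if word.strip(PUNCTUATION).lower() == keyword:
            target = i + 2
            # The keyword may be one of the last words in the string.
            if target < len(words):
                return words[target].strip(PUNCTUATION)
            return None
    # The keyword does not occur in the text.
    return None


print(word_two_after("The poor old man sat down"))
print(word_two_after("He is poor indeed"))
```

Returning None for the edge cases corresponds to leaving the new column NULL when there is no second word after "poor".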

Postgres select rows that have a percentage match, and sort by percentage

I have a table of sentences in a postgres database. The table is called sentences and the column that stores the sentence for each row is called sentence.
How can I compare the sentences to a given sentence and return the ones in which, say, 60% of the words (or even better the roots of the words) match and then sort the results by the quality of the match?
Ideally a 90% match would come before 70% match and a 50% match wouldn't show at all.
Ideally it would exclude punctuation as well, but that's not a necessity.
Check out the fuzzystrmatch module, especially the levenshtein function. This calculates the "distance" between two strings, with lower values meaning they are more similar. It's generally used between two words, but as long as the sentences aren't too long (the maximum length of each argument is 255 characters), you could use it with sentences as well.
Then you would sort by the output of the levenshtein function ascending, with the results going from most to least similar.
If you want to exclude punctuation, call regexp_replace on each string with a regex that matches all the characters you want to remove, replace them with the empty string, and use the return values as the arguments to levenshtein.
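To show what this ordering does, here is a small Python sketch of the same idea: a textbook Levenshtein distance, a punctuation-stripping step, and a sort by ascending distance. It is not the fuzzystrmatch implementation; the function names and sample sentences are made up for illustration. Note that the distance is an edit count rather than a percentage, so a cutoff such as "60% of words" would need a separate normalization step.

```python
import re


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalize(sentence):
    """Lower-case the sentence and drop punctuation, like a regexp_replace step."""
    return re.sub(r"[^\w\s]", "", sentence).lower()


def rank_by_similarity(query, sentences):
    """Sort sentences from most to least similar (ascending distance)."""
    q = normalize(query)
    return sorted(sentences, key=lambda s: levenshtein(q, normalize(s)))


sentences = ["the dog ran away", "The cat sat!", "the cat sat down"]
print(rank_by_similarity("the cat sat", sentences))
```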

How to find all possible words that can be created from a list of letters

I have a list of letters and I'm trying to find all the possible words that can be created with those letters. I haven't found any implementations in Objective-C or anything close to it.
What I have found is a nice Boggle solver, which is good, but not what I want. I don't need the selected letters to be adjacent to each other. I want to find out how many words can be formed by combining any letters in a 25-letter list.
One way to do it is to read in a dictionary, and for each word, store an alphabetical list of letters the word contains. (If you're using ASCII, you can use a single 32-bit int to store the list for a given word. Just assign each letter of the alphabet a bit and turn it on if that letter exists in the word.)
Once you have the dictionary read in, you can scan through it to pull out the words that contain only letters from your set of 25. If you followed the suggestion above and stored the letters of each word as bits in an int, you may get some false positives, where the word in question contains two of a letter but your list of 25 contains that letter only once. Discard those words.
The remaining set will be words that can be spelled using the 25 letters you have.
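A minimal Python sketch of this approach follows, under stated assumptions: words and letters are lower-case a to z, the function names are hypothetical, and the tiny word list stands in for a real dictionary file. The bitmask acts as a fast first filter and a letter count removes the false positives. For brevity the masks are computed on the fly; with a real dictionary you would compute and store them once.

```python
from collections import Counter


def letter_mask(word):
    """Set one bit per distinct letter (assumes lower-case a-z only)."""
    mask = 0
    for ch in word:
        mask |= 1 << (ord(ch) - ord("a"))
    return mask


def spellable_words(letters, dictionary):
    """Return the dictionary words that can be spelled from the given letters."""
    available = Counter(letters)
    available_mask = letter_mask(letters)
    result = []
    for word in dictionary:
        # Fast filter: skip words that use a letter we do not have at all.
        if letter_mask(word) & ~available_mask:
            continue
        # Exact check: discard false positives that need a letter too often.
        if not (Counter(word) - available):
            result.append(word)
    return result


print(spellable_words("tac", ["cat", "act", "tact", "dog", "at"]))
```

Here "tact" passes the bitmask filter but is rejected by the count check, because only one "t" is available.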

Distinguishing words in a sentence

I'm looking for a way to distinguish compound words in a sentence.
Although this is pretty easy in English because there are dashes between the words of a compound word (e.g. daughter-in-law), it's not the same in other languages like Persian. To detect the words in a sentence, we look for the spaces between words. Imagine there isn't a dash to connect these words together, but instead there is a space between them. Fortunately, we already have different records for "daughter" and "daughter in law" in the database. Now I'm looking for an algorithm or SQL query that would first look at bigger chunks of words like "daughter in law" and check whether they exist. If nothing is found, it should then look up each word on its own.
Another example would be with digits. Imagine we have a string like "1 2 3 4 5 6". Each digit has a record in the database which corresponds to a value. However, there are extra records for combinations such as "2 3". I want to first get the records for bigger chunks and if there is no record, then check each single digit. Once again, please note that the algorithm must automatically distinguish compounds from singulars.
You can build a Directed Acyclic Word Graph (DAWG) from your dictionary. Basically, it's a compact trie that you can search very quickly. Once built, you can search for words or compound words pretty easily.
To search, you take the first letter of the word and, starting at the root node of the tree, see if there's a transition to that letter. As you match each letter, you get the next letter and see if there's a transition from the current node of the tree for that letter. If you reach the end of the string and the current node is marked as the end of a word, then you know that you've found a word.
If you get to a point where there is no transition from the current node, then:
- If the current node is not marked as the end of a word, then the string you're working with is neither a word in the dictionary nor a compound word.
- If the current node is marked as the end of a word, then you have a potential compound word. You take the next letter and start again at the root of the tree.
Note that you probably don't want to implement a DAWG as records in a database.
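The search can be sketched in memory with a plain trie, which is simpler than a minimized DAWG but is walked the same way. To match the question, this illustration follows transitions on whole space-separated tokens instead of single letters and always prefers the longest matching entry. The function names and sample entries are assumptions for the example.

```python
# Unique marker for "an entry ends here"; it can never equal a string token.
END = object()


def build_trie(entries):
    """Build a nested-dict trie keyed on the space-separated tokens of each entry."""
    root = {}
    for entry in entries:
        node = root
        for token in entry.split():
            node = node.setdefault(token, {})
        node[END] = True
    return root


def segment(sentence, root):
    """Split a sentence into the longest known entries, left to right."""
    tokens = sentence.split()
    result = []
    i = 0
    while i < len(tokens):
        node = root
        longest = 0
        j = i
        # Follow transitions as far as possible, remembering the last entry end.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                longest = j - i
        if longest == 0:
            longest = 1  # unknown token: pass it through on its own
        result.append(" ".join(tokens[i:i + longest]))
        i += longest
    return result


digits = build_trie(["1", "2", "3", "4", "5", "6", "2 3"])
print(segment("1 2 3 4 5 6", digits))

family = build_trie(["daughter", "daughter in law"])
print(segment("my daughter in law", family))
```

Each returned chunk can then be looked up in the database as a single record.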
For English, this problem is solved using full-text search binary trees (Huffman encoding trees), which take advantage of frequency analysis to put the most frequently used words or letters near the top of the tree.
For Persian, implementing such an algorithm is much more difficult, because the letters of the Persian alphabet join together in writing and are not discrete as in English. So, to answer your question about the algorithm: you have to build a Huffman encoding tree based on frequency to be able to search for words.