I'm looking for a way to distinguish compound words in a sentence.
Although this is fairly easy in English, because the parts of a compound word are joined by hyphens (e.g. daughter-in-law), it's not the same in other languages like Persian. To detect the words in a sentence we look for the spaces between them. Imagine there is no hyphen connecting the parts of a compound, only a space between them. Fortunately, we already have separate records for "daughter" and "daughter in law" in the database. Now I'm looking for an algorithm or SQL query that would first look at the bigger chunks, like "daughter in law", and check whether they exist; if nothing is found, it should then look up each word individually.
Another example would be with digits. Imagine we have a string like "1 2 3 4 5 6". Each digit has a record in the database which corresponds to a value, but there are also extra records for combinations such as "2 3". I want to first fetch the records for the bigger chunks and, if there is no such record, check each single digit. Once again, please note that the algorithm must distinguish compounds from singulars automatically.
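To illustrate what I mean, a longest-match-first lookup might look roughly like this in Python (in practice it would run against the database; the dictionary contents below are just examples):

    # A minimal sketch of longest-match-first segmentation, assuming the
    # dictionary entries ("daughter in law", "2 3", ...) have already been
    # loaded from the database into a Python set.
    def segment(sentence, entries, max_words=4):
        words = sentence.split()
        result, i = [], 0
        while i < len(words):
            # Try the longest candidate first, then shrink toward a single word.
            for n in range(min(max_words, len(words) - i), 0, -1):
                chunk = " ".join(words[i:i + n])
                if chunk in entries or n == 1:
                    # If no compound matches, fall back to the single word itself.
                    result.append(chunk)
                    i += n
                    break
        return result

    entries = {"daughter in law", "daughter", "law", "2 3"}
    print(segment("my daughter in law arrived", entries))
    # ['my', 'daughter in law', 'arrived']
    print(segment("1 2 3 4", entries))
    # ['1', '2 3', '4']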
You can build a Directed Acyclic Word Graph (DAWG) from your dictionary. Essentially, it's a trie with identical subtrees merged, and it can be searched very quickly. Once built, you can look up words or compound words quite easily.
To search, you take the first letter of the word and, starting at the root node of the graph, see if there's a transition for that letter. As you match each letter, you take the next letter and check whether there's a transition for it from the current node. If you reach the end of the string and the current node is marked as the end of a word, then you've found a word.
If you get to a point where there is not a transition from the current node, then:
if the current node is not marked as the end of a word, then the word you're working with is not a word in the dictionary or a compound word.
if the current node is marked as the end of a word, then you have a potential compound word. You take the next letter and start at the root of the tree.
Note that you probably don't want to implement a DAWG as records in a database.
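Here is a rough in-memory sketch of that lookup in Python (a plain trie rather than a fully minimized DAWG, and with a little bookkeeping added so that a split that fails part-way can fall back to an earlier word boundary):

    # A plain trie; a real DAWG additionally merges identical subtrees,
    # but the lookup works the same way.
    class TrieNode:
        def __init__(self):
            self.children = {}        # letter -> TrieNode
            self.is_word_end = False

    def build_trie(words):
        root = TrieNode()
        for word in words:
            node = root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word_end = True
        return root

    def is_word_or_compound(s, root):
        """True if s is a dictionary word or a concatenation of dictionary words."""
        n = len(s)
        reachable = [False] * (n + 1)   # reachable[i]: s[:i] splits cleanly into words
        reachable[0] = True
        for start in range(n):
            if not reachable[start]:
                continue
            node = root
            for i in range(start, n):
                node = node.children.get(s[i])
                if node is None:        # no transition from the current node
                    break
                if node.is_word_end:    # a word ends here; we may restart at the root
                    reachable[i + 1] = True
        return reachable[n]

    root = build_trie(["daughter", "in", "law"])
    print(is_word_or_compound("daughterinlaw", root))   # True
    print(is_word_or_compound("daughterinlawx", root))  # False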
For English, this problem is often approached with full-text-search binary trees (Huffman encoding trees), which use frequency analysis to put the most frequently used words/letters near the top of the tree.
For Persian, implementing such an algorithm is more difficult because Persian letters join to one another and are not written as discrete, separate characters the way English letters are. So, to answer your question about the algorithm: you have to build a Huffman encoding tree based on frequency in order to search against the words.
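If you want to experiment with that frequency-based tree, a minimal Huffman tree construction looks roughly like this (the word frequencies below are made up):

    import heapq

    def build_huffman_tree(frequencies):
        """Build a Huffman tree from {symbol: frequency}; returns the root node."""
        # Each heap entry: (frequency, tie_breaker, node); node is a symbol or (left, right)
        heap = [(freq, i, sym) for i, (sym, freq) in enumerate(frequencies.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, counter, (left, right)))
            counter += 1
        return heap[0][2]

    def codes(node, prefix=""):
        """Walk the tree; frequent symbols end up nearer the root, with shorter codes."""
        if isinstance(node, tuple):
            left, right = node
            yield from codes(left, prefix + "0")
            yield from codes(right, prefix + "1")
        else:
            yield node, prefix

    # Hypothetical word frequencies
    freqs = {"the": 50, "of": 30, "text": 12, "paratext": 2}
    tree = build_huffman_tree(freqs)
    print(dict(codes(tree)))
    # {'paratext': '000', 'text': '001', 'of': '01', 'the': '1'}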
I'm trying to write a SQL query on a table of strings that does the following:
Find all strings in the table where the word "poor" is present.
In those strings, identify the word that is two places to the right of "poor" and copy it to a new column.
You did not provide code, so I will not either; instead, I will help guide you in the right direction.
Firstly, you will want to find any string containing the word "poor" using a wildcard, allowing it to appear at the beginning, middle, or end of the string. Next, you'll need to find the location of the word "poor" within the string. Finally, do something along the lines of locating the second space to the right of the end of the word "poor", then take everything after that space up to the next space, because that will contain the word you're looking for (if your strings follow traditional sentence structure).
You'll also need to consider what to do if the word "poor" is one of the last words in the string; in that case there may not be another word to find.
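To make those steps easier to follow, here is the same logic sketched outside of SQL, in Python (the function name is made up; translating it into CHARINDEX/SUBSTRING calls is the part left to you):

    def word_two_after(text, target="poor"):
        """Return the word two positions to the right of `target`, or None."""
        words = text.split()
        for i, w in enumerate(words):
            # Strip punctuation so "poor," still matches "poor".
            if w.strip('.,;:!?"').lower() == target:
                if i + 2 < len(words):
                    return words[i + 2]
                return None        # "poor" is too close to the end of the string
        return None

    print(word_two_after("The poor old dog slept."))   # "dog"
    print(word_two_after("They called him poor."))     # None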
I have a list of letters and I'm trying to find all the possible words that can be created from those letters. I haven't found any implementations in Objective-C or anything close to it.
What I have found is a nice Boggle solver, which is good, but not what I want. I don't need the selected letters to be adjacent to each other. I want to find out how many words can be found by combining any letters in a 25 letter list.
One way to do it is to read in a dictionary, and for each word, store an alphabetical list of letters the word contains. (If you're using ASCII, you can use a single 32-bit int to store the list for a given word. Just assign each letter of the alphabet a bit and turn it on if that letter exists in the word.)
Once you have the dictionary read in, you can scan through it to pull out words that contain the letters in your set of 25. If you followed the suggestion above to store the list of letters associated with each word in an int, you may get some false positives, where the word in question contains 2 of a letter, but you only had 1 letter in your list of 25. Discard those values.
The remaining set will be words that can be spelled using the 25 letters you have.
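A compact sketch of this approach in Python (the tiny dictionary below is just for illustration; in practice you would read in a real word list):

    from collections import Counter

    def letter_mask(word):
        """32-bit mask with one bit per distinct letter (a-z) in the word."""
        mask = 0
        for ch in word.lower():
            if 'a' <= ch <= 'z':
                mask |= 1 << (ord(ch) - ord('a'))
        return mask

    def can_spell(word, available):
        """True if `word` can be spelled from the multiset of `available` letters."""
        need, have = Counter(word.lower()), Counter(available.lower())
        return all(have[ch] >= n for ch, n in need.items())

    def words_from_letters(dictionary, letters):
        letters_mask = letter_mask(letters)
        results = []
        for word in dictionary:
            # Fast filter: every distinct letter of the word must be in our set...
            if letter_mask(word) & ~letters_mask:
                continue
            # ...then discard false positives where a letter is needed more times
            # than it appears in the 25-letter list.
            if can_spell(word, letters):
                results.append(word)
        return results

    dictionary = ["cat", "act", "attic", "zebra"]   # in practice, read from a word list
    print(words_from_letters(dictionary, "catticrlmnopqesuvwxydfghi"))
    # ['cat', 'act', 'attic']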
How do we convert upper-case text like this:
WITHIN THE FIELD OF LITERARY CRITICISM, "TEXT" ALSO REFERS TO THE
ORIGINAL INFORMATION CONTENT OF A PARTICULAR PIECE OF WRITING; THAT
IS, THE "TEXT" OF A WORK IS THAT PRIMAL SYMBOLIC ARRANGEMENT OF
LETTERS AS ORIGINALLY COMPOSED, APART FROM LATER ALTERATIONS,
DETERIORATION, COMMENTARY, TRANSLATIONS, PARATEXT, ETC. THEREFORE,
WHEN LITERARY CRITICISM IS CONCERNED WITH THE DETERMINATION OF A
"TEXT," IT IS CONCERNED WITH THE DISTINGUISHING OF THE ORIGINAL
INFORMATION CONTENT FROM WHATEVER HAS BEEN ADDED TO OR SUBTRACTED FROM
THAT CONTENT AS IT APPEARS IN A GIVEN TEXTUAL DOCUMENT (THAT IS, A
PHYSICAL REPRESENTATION OF TEXT).
Into normal sentence case, like this:
Within the field of literary criticism, "text" also refers to the
original information content of a particular piece of writing; that
is, the "text" of a work is that primal symbolic arrangement of
letters as originally composed, apart from later alterations,
deterioration, commentary, translations, paratext, etc. Therefore,
when literary criticism is concerned with the determination of a
"text," it is concerned with the distinguishing of the original
information content from whatever has been added to or subtracted from
that content as it appears in a given textual document (that is, a
physical representation of text).
The base answer is just to use the LOWER() function.
It's easy enough to separate the sentences by CHARINDEX()ing for the period (and then using UPPER() on the first letter of each sentence...).
But even then, you'll end up leaving proper names, acronyms, etc. in lower case.
Distinguishing proper names, acronyms, etc. from the rest is beyond anything that can be done in T-SQL. I've seen people attempt it in code using the dictionary from MS Word, etc., but even then, Word doesn't always get it right either.
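The mechanical part (lower-case everything, then re-capitalize the first letter after each sentence terminator) can be sketched outside T-SQL like this; as noted, proper names and acronyms would still come out wrong:

    import re

    def sentence_case(text):
        """Lower-case the text, then capitalize the first letter of each sentence."""
        lowered = text.lower()
        # Capitalize the very first letter and any letter following ., ! or ? plus whitespace.
        return re.sub(r'(^|[.!?]\s+)([a-z])',
                      lambda m: m.group(1) + m.group(2).upper(),
                      lowered)

    print(sentence_case('WITHIN THE FIELD OF LITERARY CRITICISM, "TEXT" ALSO REFERS TO '
                        'THE ORIGINAL INFORMATION CONTENT. THAT IS, THE "TEXT" OF A WORK.'))
    # Within the field of literary criticism, "text" also refers to the original
    # information content. That is, the "text" of a work.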
I found a simple solution was to use INITCAP().
I have a SQLite database containing a word list. One of the tables includes the word "você", which has the Unicode representation "voc\U00ea".
I've found out that the same word can have the following representation with the same visual output:
"voc\U00ea",
"voce\U0302"
When I query my db using the second representation, it returns nothing. Does anyone know a way to make the query work with both representations without duplicating the records in the table?
Thanks,
Miguel
These two forms are known as NFC (normal form composed) and NFD (normal form decomposed). The character \U0302 is a combining circumflex accent, which modifies the preceding letter.
To cope with this situation, do the following:
Pick a normalization. Usually choosing NFC is a good idea. (Although the iOS/OS X file system uses NFD.)
Before putting a string into the database, always normalize it. On iOS, you can use precomposedStringWithCanonicalMapping or precomposedStringWithCompatibilityMapping. To understand the difference between canonical and compatibility mappings, see this description.
Before performing a query, always normalize the query to the same normal form.
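In Python terms, the same idea looks like this; unicodedata.normalize is the standard-library counterpart of the iOS methods mentioned above (SQLite itself will not normalize for you):

    import unicodedata

    nfc = "voc\u00ea"        # "você" as a single precomposed character
    nfd = "voce\u0302"       # "voce" + combining circumflex accent

    print(nfc == nfd)                                          # False: different code points
    print(unicodedata.normalize("NFC", nfc) ==
          unicodedata.normalize("NFC", nfd))                   # True: same normal form

    def normalize_for_db(s):
        # Normalize before INSERTing and before building query parameters,
        # so both representations match the stored row.
        return unicodedata.normalize("NFC", s)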
I'm designing the architecture of a text parser. Example sentence: Content here, content here.
The whole sentence is a... sentence, that's obvious. "Content" and "here" are words; "," and "." are punctuation marks. But what are words and punctuation marks all together, in general? Are they just symbols? I simply don't know how to name, in the most reasonable abstract way, what a single sentence consists of (since one could also say it consists of letters, vowels, etc.).
Thanks for any help :)
What you're doing is technically lexical analysis ("lexing"), which takes a sequence of input symbols and generates a series of tokens or lexemes. So word, punctuation and white-space are all tokens.
In (E)BNF terms, lexemes or tokens are synonymous with "terminal symbols". If you think of the set of parsing rules as a tree, the terminal symbols are the leaves of the tree.
So what's the atom of your input? Is it a word or a sentence? If it's words (and white-space) then a sentence is more akin to a parsing rule. In fact the term "sentence" can itself be misleading. It's not uncommon to refer to the entire input sequence as a sentence.
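A tiny illustration of that token/lexeme view, splitting an input sentence into word, punctuation and white-space tokens (the token names are arbitrary):

    import re

    TOKEN_SPEC = [
        ("WORD",        r"[A-Za-z]+"),
        ("PUNCTUATION", r"[.,;:!?]"),
        ("WHITESPACE",  r"\s+"),
    ]
    TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

    def tokenize(sentence):
        """Yield (token_type, lexeme) pairs - the terminal symbols of the grammar."""
        for match in TOKEN_RE.finditer(sentence):
            yield match.lastgroup, match.group()

    print(list(tokenize("Content here, content here.")))
    # [('WORD', 'Content'), ('WHITESPACE', ' '), ('WORD', 'here'), ('PUNCTUATION', ','), ...]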
A semi-common term for a sequence of non-white-space characters is a "textrun".
A common term comprising the two sub-categories "words" and "punctuation", often used when talking about parsing, is "tokens".
Depending on what stage of your lexical analysis of input text you are looking at, these would be either "lexemes" or "tokens."