What does a single sentence consist of? How to name its parts? - oop

I'm designing the architecture of a text parser. Example sentence: Content here, content here.
The whole sentence is a... sentence, that's obvious. "Content", "here" etc. are words; "," and "." are punctuation marks. But what are words and punctuation marks all together, in general? Are they just symbols? I simply don't know how to name what a single sentence consists of in the most reasonably abstract way (because one could also say it consists of letters/vowels etc.).
Thanks for any help :)

What you're doing is technically lexical analysis ("lexing"), which takes a sequence of input symbols and generates a series of tokens or lexemes. So words, punctuation and white-space are all tokens.
In (E)BNF terms, lexemes or tokens are synonymous with "terminal symbols". If you think of the set of parsing rules as a tree, the terminal symbols are the leaves of the tree.
So what's the atom of your input? Is it a word or a sentence? If it's words (and white-space), then a sentence is more akin to a parsing rule. In fact, the term "sentence" can itself be misleading: it's not uncommon to refer to the entire input sequence as a sentence.
A semi-common term for a sequence of non-white-space characters is a "textrun".
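A minimal sketch of such a lexer in Python (the token names and the regular expression are my own illustrative choices, not anything standard):

```python
import re

# Hypothetical token classes: words, punctuation marks and white-space.
TOKEN_RE = re.compile(r"""
      (?P<WORD>\w+)             # runs of letters/digits
    | (?P<PUNCT>[.,;:!?'"-]+)   # punctuation marks
    | (?P<SPACE>\s+)            # white-space
""", re.VERBOSE)

def tokenize(text):
    """Yield (token_type, lexeme) pairs for a sentence."""
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()

print(list(tokenize("Content here, content here.")))
# [('WORD', 'Content'), ('SPACE', ' '), ('WORD', 'here'), ('PUNCT', ','), ...]
```

Each token here (word, punctuation mark, white-space run) is what a grammar would treat as a terminal symbol; the sentence itself is closer to a parsing rule built from them.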

A common term covering the two sub-categories "words" and "punctuation", often used when talking about parsing, is "tokens".

Depending on what stage of your lexical analysis of input text you are looking at, these would be either "lexemes" or "tokens."

Related

Characterizing the data format of the inputs that an AWK tool processes?

AWK newbie here.
I am trying to characterize (for myself) the data format that an AWK tool expects of the input it processes. (Terminology question: Would such a "data format characterization" be called "AWK's data format model"?) Below is my attempt at a characterization. Is it correct? Is it complete? Is it easy to read and understand? What changes/additions are needed to make it correct, complete, and easy to read/understand?
As an aside: One of the things that I really like about AWK is that the data format of its input is readily described in a few short sentences. That's powerful! Contrast with other common data formats (e.g., XML, JSON, CSV) which require many pages of dense prose.
The data format consists of lines (lines are strings that are
typically separated by newlines, although the user may use a symbol
other than newline, if desired). Each line contains fields. Fields are
ASCII strings. Fields are separated by a delimiter (common delimiters
include the tab, space, or comma symbol, although the user is free to
use another symbol if desired). Fields may contain the field delimiter
symbol provided the symbol is preceded by a backslash symbol (this is
called "escaping the symbol"). Fields may be empty. Each line has zero
or more fields. Lines do not need to have the same number of fields.
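To make the characterization concrete, here is a rough Python sketch of the record/field model described above (including the backslash-escaping rule, which is part of my characterization rather than standard awk behaviour):

```python
import re

def split_records(text, record_sep="\n", field_sep=" "):
    """Split input into lines (records), then split each line into fields,
    treating a backslash-escaped field separator as part of the field."""
    records = []
    for line in text.split(record_sep):
        # Split on the separator only when it is not preceded by a backslash.
        fields = re.split(r"(?<!\\)" + re.escape(field_sep), line)
        records.append([f.replace("\\" + field_sep, field_sep) for f in fields])
    return records

print(split_records("a b c\nd e\\ f"))
# [['a', 'b', 'c'], ['d', 'e f']]
```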
CSV (...) which require many pages of dense prose.
I must protest: CSV is defined by RFC 4180, and the prose in its "Definition of the CSV Format" section is seven points spanning at most two pages, so I cannot say that is "many".
Is it complete?
I would say not, because you are using terms without defining them. For example, what is an "ASCII string", and what is a "symbol"?

Distinguishing words in a sentence

I'm looking for a way to distinguish compound words in a sentence.
Although this is pretty easy in English because there are dashes between the words of a compound word (e.g. daughter-in-law), it's not the same in other languages like Persian. To detect the words in a sentence we look for the spaces between them. Imagine there isn't a dash to connect these words together, but instead there is a space between them. Fortunately, we already have separate records for "daughter" and "daughter in law" in the database. Now I'm looking for an algorithm or SQL query which would first look at bigger chunks of words like "daughter in law" and check whether they exist; if nothing is found, it should then look for each word on its own.
Another example would be with digits. Imagine we have a string like "1 2 3 4 5 6". Each digit has a record in the database which corresponds to a value. However, there are extra records for combinations such as "2 3". I want to first get the records for bigger chunks and if there is no record, then check each single digit. Once again, please note that the algorithm must automatically distinguish compounds from singulars.
You can build a Directed Acyclic Word Graph (DAWG) from your dictionary. Basically, it's a trie that you can search very quickly. Once built, you can search for words or compound words pretty easily.
To search, you take the first letter of the word and, starting at the root node of the tree, see if there's a transition to that letter. As you match each letter, you get the next letter and see if there's a transition from the current node of the tree for that letter. If you reach the end of the string, then you know that you've found a word.
If you get to a point where there is not a transition from the current node, then:
if the current node is not marked as the end of a word, then the word you're working with is not a word in the dictionary or a compound word.
if the current node is marked as the end of a word, then you have a potential compound word. You take the next letter and start at the root of the tree.
Note that you probably don't want to implement a DAWG as records in a database.
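A rough Python sketch of the idea, using a plain trie rather than a minimised DAWG (the dictionary entries and the greedy longest-match search below are illustrative assumptions, not a complete implementation):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def longest_match(root, text, start=0):
    """Return the end index of the longest dictionary entry starting at
    `start`, or None if no entry matches there."""
    node, best_end = root, None
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.is_word:
            best_end = i + 1
    return best_end

# Entries may contain spaces, so the "bigger chunk" wins over the single word.
trie = build_trie(["daughter", "daughter in law", "in", "law"])
text = "daughter in law"
print(text[:longest_match(trie, text)])   # 'daughter in law', not just 'daughter'
```

The same greedy longest-match-first scan also covers the digits example: "2 3" is matched before falling back to "2" and "3".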
For English this problem is solved using full-text-search binary trees (Huffman encoding trees), which take advantage of frequency analysis to put the most-used words/letters near the top of the tree.
But for Persian, implementing such an algorithm is much more difficult, because the Persian alphabet is joined-up rather than discrete like the English one. So, to answer your question about the algorithm: you would have to build a Huffman encoding tree based on frequency to be able to search against words.

SQL2008 fulltext index search without word breakers

I am trying to search an FTI using CONTAINS for Twitter-style usernames, e.g. #username, but the word breakers ignore the # symbol. Is there any way to disable word breakers? From research, there is a way to create a custom word-breaker DLL, install it and assign it, but that all seems a bit intensive and, frankly, over my head. I disabled stop words so that dashes are not ignored, but I need that # symbol. Any ideas?
You're not going to like this answer, but full-text indexes only keep the characters _ and ` while indexing. All the other characters are ignored and the words get split where those characters occur. This is mainly because full-text indexes are designed to index large documents, where only proper words are considered in order to make the search more refined.
We faced a similar problem. To solve it we used a translation table, where characters like #, - and / were replaced with special sequences like '`at`', '`dash`', '`slash`' etc. When searching the full-text index you have to apply the same replacements to the search string and then search. This takes care of the special characters.
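A sketch of that translation step (the mapping below is made up for illustration; the important part is that the same mapping is applied both to the stored text and to the CONTAINS search string):

```python
# Hypothetical mapping of characters that the full-text word breakers discard.
TRANSLATION = {
    "#": "`at`",
    "-": "`dash`",
    "/": "`slash`",
}

def encode_for_fulltext(text):
    """Replace special characters with indexable placeholder sequences."""
    for ch, placeholder in TRANSLATION.items():
        text = text.replace(ch, placeholder)
    return text

stored_value = encode_for_fulltext("#username")   # '`at`username'
search_term  = encode_for_fulltext("#user")       # '`at`user'
```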

Convert upper case into sentence case

How do we convert upper-case text like this:
WITHIN THE FIELD OF LITERARY CRITICISM, "TEXT" ALSO REFERS TO THE
ORIGINAL INFORMATION CONTENT OF A PARTICULAR PIECE OF WRITING; THAT
IS, THE "TEXT" OF A WORK IS THAT PRIMAL SYMBOLIC ARRANGEMENT OF
LETTERS AS ORIGINALLY COMPOSED, APART FROM LATER ALTERATIONS,
DETERIORATION, COMMENTARY, TRANSLATIONS, PARATEXT, ETC. THEREFORE,
WHEN LITERARY CRITICISM IS CONCERNED WITH THE DETERMINATION OF A
"TEXT," IT IS CONCERNED WITH THE DISTINGUISHING OF THE ORIGINAL
INFORMATION CONTENT FROM WHATEVER HAS BEEN ADDED TO OR SUBTRACTED FROM
THAT CONTENT AS IT APPEARS IN A GIVEN TEXTUAL DOCUMENT (THAT IS, A
PHYSICAL REPRESENTATION OF TEXT).
Into usual sentence case like this:
Within the field of literary criticism, "text" also refers to the
original information content of a particular piece of writing; that
is, the "text" of a work is that primal symbolic arrangement of
letters as originally composed, apart from later alterations,
deterioration, commentary, translations, paratext, etc. Therefore,
when literary criticism is concerned with the determination of a
"text," it is concerned with the distinguishing of the original
information content from whatever has been added to or subtracted from
that content as it appears in a given textual document (that is, a
physical representation of text).
The base answer is just to use the LOWER() function.
It's easy enough to separate the sentences by CHARINDEX()ing for the period (and then using UPPER() on the first letter of each sentence...).
But even then, you'll end up leaving proper names, acronyms, etc. in lower case.
Distinguishing proper names and the like from the rest is beyond anything that can be done in T-SQL. I've seen people attempt it in code using the dictionary from MS Word, but even then Word doesn't always get it right either.
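If doing the conversion outside of SQL is an option, here is a minimal Python sketch of the same naive approach (lower-case everything, then capitalise the first letter after each sentence-ending mark); it has exactly the same limitation of losing proper names and acronyms:

```python
import re

def sentence_case(text):
    """Lower-case the text, then upper-case the first letter of each sentence."""
    text = text.lower()
    # Capitalise the first letter of the string and any letter following
    # '.', '!' or '?' plus white-space.
    return re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(),
                  text)

print(sentence_case("WITHIN THE FIELD OF LITERARY CRITICISM. THEREFORE, WHEN..."))
# 'Within the field of literary criticism. Therefore, when...'
```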
I found a simple solution was to use INITCAP() (note that this produces title case, capitalising the first letter of every word, rather than true sentence case).

How to convert foreign characters to English characters in SQL Query?

I have to create a SQL function that converts special characters and international characters (French, Chinese, ...) to English.
Is there a special function in SQL that I can use?
Thanks for your help.
If you are after English names for the characters, that is an achievable goal, as they all have published names as part of the Unicode standard.
See for example:
http://www.unicode.org/ucd/
http://www.unicode.org/Public/UNIDATA/
Your task then is simply to turn the list of Unicode characters into a table with 100,000 or so rows. Unfortunately, the names you get will be things like ARABIC LIGATURE LAM WITH MEEM MEDIAL FORM.
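For illustration, Python's standard unicodedata module exposes exactly those published names, so building such a table is largely mechanical (the sample characters here are arbitrary):

```python
import unicodedata

# Print the published Unicode name for a few sample characters.
for ch in ["é", "ß", "ف"]:
    print(f"U+{ord(ch):04X} {ch}: {unicodedata.name(ch)}")
# U+00E9 é: LATIN SMALL LETTER E WITH ACUTE
# U+00DF ß: LATIN SMALL LETTER SHARP S
# U+0641 ف: ARABIC LETTER FEH
```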
On the other hand, if you want to actually translate the meaning, you need to be looking at machine translation software. Both Microsoft and Google have well-known cloud translation offerings, and there are several other well-regarded products too.
I think the short answer is that you can't, unless you narrow your requirements a lot. It seems you want to take a text sample, A, and convert it into romanized text B.
There are a few problems to tackle:
Languages are typically not romanized on a single character basis. The correct pronunciation of a character is often dependent on the characters and words around it, and can even have special rules for just one word (learning English can be tough because it is filled with these, having borrowed words from many languages without normalizing the spelling).
Even if you code rules for every language you want to support you still have homographs, words that are spelled using exactly the same characters, but that have different pronunciations (and thus romanization) depending on what was meant - for example "sow" meaning a pig, or "sow" (where the w is silent) meaning to plant seeds.
And then you get into the problem of which language you are romanizing: characters and even words are not unique to one language, but the actual meaning and romanization can vary. The fact that many languages include loan words from the languages they share characters with complicates any attempt to automatically determine which language you are trying to romanize.
Given all these difficulties, what it is you actually want to achieve (what problem are you solving)?
You mention French among the languages you want to "convert" into English, yet French (with its accented characters) is already written in the Roman alphabet. Even everyday words used in English occasionally make use of accented characters, though these are rare enough that the meaning and pronunciation are understood even if they are omitted (e.g. résumé).
Is your problem really that you can't store unicode/extended ASCII? There are numerous ways to correct or work around that.
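If the underlying problem really is storing or displaying accented Latin characters, one common (and lossy) workaround is Unicode decomposition followed by dropping the combining marks; a Python sketch (this only helps for Latin-based scripts such as French, not for Chinese):

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD), then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("résumé"))        # 'resume'
print(strip_accents("Crème brûlée"))  # 'Creme brulee'
```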