Data Cleanup, post conversion from ALL CAPS to Title Case - SQL

Converting a database of people and addresses from ALL CAPS to Title Case will create a number of improperly capitalized words and names; some examples follow:
MacDonald, PhD, CPA, III
Does anyone know of an existing script that will clean up all the common problem words? Certainly it will still leave some mistakes behind (less common names with CamelCase-like spellings, e.g. "MacDonalz").
I don't think it matters much, but the data currently resides in MSSQL. Since this is a one-time job, I'd export to text if a solution requires it.
There is a thread that posed a related question, touching on this problem but not addressing it specifically. You can see it here:
SQL Server: Make all UPPER case to Proper Case/Title Case

Don't know if this is of any help
private static function ucNames($surname) {
    $replaceValue = ucwords($surname);
    // preg_replace_callback replaces the /e modifier, which was removed in PHP 7
    return preg_replace_callback('/
        (?: ^ | \b )               # assertion: beginning of string or a word boundary
        ( O\' | \- | Ma?c | Fitz ) # attempt to match Irish, Scottish and double-barrelled surnames
        ( [^\W\d_] )               # match next char; we exclude digits and _ from \w
        /x',
        function ($m) { return $m[1] . strtoupper($m[2]); },
        $replaceValue);
}
It's a simple PHP function that I use to set surnames to the correct case. It works for names like O'Connor, McDonald, MacBeth and FitzPatrick, and for double-barrelled names like Hedley-Smythe.
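The function above covers the name prefixes but not the title/suffix words from the question (PhD, CPA, III). Those are easiest to repair with a lookup of known exceptions applied after the title-casing pass. A minimal sketch in Python, with a hypothetical (and certainly incomplete) exception list:
import re

# Hypothetical exception list; extend it with whatever titles and
# suffixes actually occur in your data (MD, DDS, II, IV, ...).
EXCEPTIONS = {"Phd": "PhD", "Cpa": "CPA", "Iii": "III",
              "Ii": "II", "Md": "MD", "Dds": "DDS"}

def fix_suffixes(name):
    # re-capitalize any word that the naive title-casing got wrong
    return re.sub(r"\w+", lambda m: EXCEPTIONS.get(m.group(), m.group()), name)

print(fix_suffixes("John Macdonald Phd, Cpa, Iii"))
# -> John Macdonald PhD, CPA, III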

Here is the answer I was looking for:
There is a data company, Melissa Data, which publishes APIs and applications for database cleanup -- geared mostly toward the direct marketing industry.
I was able to use two applications to solve my problem.
StyleList: this app, among other things, converts ALL CAPS to mixed case, and in the process it does not dirty up the data, leaving titles such as CPA, MD, III, etc. intact, as well as natural, common camel-case names such as McDonald.
Personator: I used Personator to break the Full Name fields into Prefix, First Name, Middle Name, Last Name, and Suffix. To be honest, it was far from perfect, but the data I gave it was pretty challenging (often no space separating a middle name and a suffix). This app does a number of other useful things as well, including assigning gender to most names. It's available as an API you can call, too.
Here is a link to the solutions offered by Melissa Data:
http://www.melissadata.com/dqt/index.htm
For me, the Melissa Data apps did much of the heavy lifting, and the remaining dirty data was identifiable and fixable in SQL by reporting on LEFT(x) or RIGHT(x) counts -- the dirt typically has the least uniqueness, with patterns easily discovered and fixed.
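To illustrate that reporting step: grouping names on their first or last few characters makes the recurring leftovers ("Iii", "Phd") stand out. A quick Python sketch over a hypothetical sample:
from collections import Counter

# hypothetical sample of already title-cased names
names = ["John Macdonald Iii", "Jane Smith Phd",
         "Amy Jones III", "Bob Brown PhD", "Tom Mcdonald"]

right3 = Counter(name[-3:] for name in names)  # like RIGHT(name, 3)
left3  = Counter(name[:3]  for name in names)  # like LEFT(name, 3)

# frequent, suspicious endings bubble up for review
print(right3.most_common())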

CHM/HHP: maximum length of variable names in [ALIAS] section

What is the maximum length of variable names in the [ALIAS] section of HHP files?
I_AM_WONDERING_ABOUT_THE_MAXIMUM_LENGTH_OF_THIS_STRING_RIGHT_HERE=this-is-some-really-helpful-html-file.html
I have found a CHM/HHP specification right here:
https://www-user.tu-chemnitz.de/~heha/viewchm.php/hs/chmspec.chm/hhp.html
That page only talks about the length of the overall line, though (and not about the length of the variable name). Very specific question, I know. Still, someone may be able to point me somewhere.
As far as I know this has never been asked before, and I have never heard about any limitation. I think that is because nobody has used long variable names in this place so far.
The purpose of the two files, e.g. alias.h and map.h, is to ease the coordination between developer and help author. The mapping file links an ID to the map number - typically this can easily be created by the developer and passed to the help author. Then the help author creates an alias file linking the IDs to the topic names. That was the idea behind it years (decades) ago, conceived by Ralph Walden (ex-Microsoft).
Please note that HTMLHelp is about 20 years old, and these context ID strings inside an alias.h file were derived from WinHelp, the predecessor of HTMLHelp.
You'll find some further information at Creating Context-Sensitive Help for Applications.
In general I'd recommend using IDs with a fixed format for better legibility, as shown below:
;-------------------------------------------------------------
; alias.h file example for HTMLHelp (CHM)
; www.help-info.de
;
; All IDH's > 10000 for better format
; last edited: 2006-07-09
;-------------------------------------------------------------
IDH_90001=index.htm
IDH_10000=Context-sensitive_example\contextID-10000.htm
IDH_10010=Context-sensitive_example\contextID-10010.htm
IDH_20000=Context-sensitive_example\contextID-20000.htm
IDH_20010=Context-sensitive_example\contextID-20010.htm
I'd recommend using less than 1024 bytes per line.

R - find name in string that matches a lookup field using regex

I have a data frame of ad listings for pets:
ID  Ad_title
1   1 year old ball python
2   Young red Blood python. - For Sale
3   1 Year Old Male Bearded Dragon - For Sale
I would like to take the common name in the Ad_title (e.g. ball python) and create a new field with the Latin name for the species. To assist, I have another data frame with the Latin names and common names:
ID  Latin_name          Common_name
1   Python regius       E: Ball Python, Royal Python G: Königspython
2   Python brongersmai  E: Red Blood Python, Malaysian Blood Python
3   Pogona barbata      E: Eastern Bearded Dragon, Bearded Dragon
How can I go about doing this? The tricky part is that the common names are hidden inside surrounding text, both in the ad listing and in Common_name. If that were not the case, I could just use %in%. If there is a way/function to use regex for this, I think that would be helpful.
The other answer does a good job outlining the general logic, so here are a few thoughts on a simple (though not optimized!) way to do this:
First, you'll want to make a big table of two columns: every 'common name' (each name gets its own row) alongside its Latin name. You could also make a dictionary here, but I like tables.
reference_table <- data.frame(common = c("cat", "kitty", "dog"),
                              technical = c("feline", "feline", "canine"))

  common technical
1    cat    feline
2  kitty    feline
3    dog    canine
From here, just loop through every element of "ad_title" (use apply() or a for loop, depending on your preference). Now use something like this:
apply(reference_table, 1, function(X) {
  if (length(grep(X["common"], ad_title)) > 0) {  # if the common name was found in the ad_title
    # [code to replace the string]
  }
})
For inserting the new string, play with your regular regex tools. Alternatively, play with strsplit(ad_title, X["common"]). You'll be able to rebuild the ad_title using paste() and the parts that come out of the strsplit.
Again, this is NOT the best way to do this, but hopefully the logic is simple.
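The same logic is easy to see in a language-neutral sketch. In Python, for example, scanning each ad against a one-name-per-row reference table (data taken from the question, structure hypothetical) might look like this:
import re

# one row per common name, as suggested above
reference = [("Ball Python", "Python regius"),
             ("Red Blood Python", "Python brongersmai"),
             ("Bearded Dragon", "Pogona barbata")]

ads = ["1 year old ball python",
       "Young red Blood python. - For Sale",
       "1 Year Old Male Bearded Dragon - For Sale"]

for ad in ads:
    for common, latin in reference:
        if re.search(re.escape(common), ad, flags=re.IGNORECASE):
            print(f"{ad!r} -> {latin}")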
Well, I tried to create a workable solution for your requirement. There could be better ways to execute it, probably using packages such as data.table and/or stringr, but this snippet could be a working starting point. Oh, and I modified the Ad_title data a bit so that the species names are in title case.
# Re-create data
Ad_title <- c("1 year old Ball Python", "Young Red Blood Python. - For Sale",
              "1 Year Old Male Bearded Dragon - For Sale")
df2 <- data.frame(Latin_name = c("Python regius", "Python brongersmai", "Pogona barbata"),
                  Common_name = c("E: Ball Python, Royal Python G: Königspython",
                                  "E: Red Blood Python, Malaysian Blood Python",
                                  "E: Eastern Bearded Dragon, Bearded Dragon"),
                  stringsAsFactors = F)

# Aggregate common names
Common_name <- paste(df2$Common_name, collapse = ", ")
Common_name <- unlist(strsplit(Common_name, "(E: )|( G: )|(, )"))
Common_name <- Common_name[Common_name != ""]

# Data frame: Latin names vs common names
df3 <- data.frame(Common_name,
                  Latin_name = sapply(Common_name, grep, df2$Common_name),
                  row.names = NULL, stringsAsFactors = F)
df3$Latin_name <- df2$Latin_name[df3$Latin_name]

# Data frame: Ad vs common names
Ad_Common_name <- unlist(sapply(Common_name, grep, Ad_title))
df4 <- data.frame(Ad_title,
                  Common_name = sapply(1:3, function(i) names(Ad_Common_name[Ad_Common_name == i])),
                  stringsAsFactors = F)
Obviously you need a loop structure over your common-name lookup table, and another loop that splits the compound field on commas, before doing a simple regex. There is no sane regex that will do it all.
In future, avoid packed/compound structures that require packing and unpacking. It looks fine for human consumption, but semantically, and for program consumption, you have multiple data values packed into a single field: it's not a "common name", it's "common names" delimited by commas, that you have there.
Sorry if I haven't provided an R-specific answer. I'm a technology veteran and use many languages/technologies depending on the problem and available resources. You will need to iterate over every record of your Latin-name lookup table, and within that, iterate over the comma-delimited packed field of common names, so that you're working with one common name at a time. With that single common name you search/replace, using regex or whatever means are available to you, over the whole input file. It's plain and simple that you need to start at it from that end, i.e. the lookup table. Iteration/looping should be familiar to you, as it's a basic building block of any program/script; this kind of procedural logic is not part of the capability (or desired functionality) of regex itself. I assume you know how to create an iterative construct in R or whatever you're using for this.
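For the unpacking step specifically, here is a minimal Python sketch (the marker-stripping regex is an assumption about the "E:"/"G:" prefixes in the sample data); each resulting name then feeds the inner search/replace loop:
import re

packed = "E: Ball Python, Royal Python G: Königspython"
# drop the language markers, then split on commas, yielding
# one common name at a time for the inner loop
names = [n.strip() for n in re.split(r"[EG]:|,", packed) if n.strip()]
print(names)  # ['Ball Python', 'Royal Python', 'Königspython']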

SQL fuzzy search and Google-like improvements

Interesting challenge; my client enters some product information into a SQL database. The product is a painting of a famous old Russian composer called Rachmaninoff, so that name is in the description field. Now, only a few of their customers searching for products know exactly how to spell this name; most of the time it's misspelled. Besides misspellings, there are also a lot of international customers who just write this name completely differently, like Rachmaninow, Rahmaninov, or Рахманінаў.
If I put any of these misspellings or translations into Google, it (almost) always knows how to correct it and redirects me straight to the right page.
Does anyone know what my possibilities are to get some of this magic into my product search? Are there any APIs I can use? Some super free-text option that I don't know of? Or ...
We solved a similar problem with quite some success: searching for people (German names) by a name given over the phone.
E.g.: the very common German last names "Schmidt", "Schmitt", "Schmied", "Schmid", "Schmit" and "Schmiedt" are all but impossible to tell apart when spoken. Combine this with a first name of "Sylvia", "Silvia" or "Sylvya", and a caller saying "Hi, I'm Sylvia Schmidt, I have forgotten my customer number" has no chance of being found quickly.
Our solution was to put up a list of synophones, e.g. (in pseudocode, for German):
{consonant}+ := {consonant}
ie := i
ii := i
dt* := t
y|j := i
{vowel}v := {vowel}f
etc., you get the drift. We then stored the synophone-translated strings together with the original strings to make searching possible. This works really well.
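A minimal sketch of that approach in Python; the rule set below is a simplified, hypothetical rendering of the rules above, plus an assumed final-d rule so the Schmidt variants actually collapse to one key:
import re

# simplified synophone rules, applied in order
RULES = [
    (r"([bcdfghjklmnpqrstvwxz])\1+", r"\1"),  # {consonant}+ := {consonant}
    (r"ie|ii", "i"),                          # ie := i, ii := i
    (r"dt", "t"),                             # dt* := t
    (r"[yj]", "i"),                           # y|j := i
    (r"([aeiou])v", r"\1f"),                  # {vowel}v := {vowel}f
    (r"d$", "t"),                             # assumed extra rule: final d := t
]

def synophone(name):
    key = name.lower()
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key

# all variants collapse to the same stored key
print({synophone(n) for n in
       ["Schmidt", "Schmitt", "Schmied", "Schmid", "Schmit", "Schmiedt"]})
# -> {'schmit'}
Storing the key next to each original string, and applying the same transform to the search input, gives the matching behavior described.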
I understand that MySQL has the Soundex() function for English strings; MSSQL has SOUNDEX() and its companion DIFFERENCE() built in as well.

Handling Grammar / Spelling Issues in Translation Strings

We are currently implementing a Zend Framework project that needs to be translated into 6 different languages. We already have a pretty sophisticated translation system, based on Zend_Translate, which also handles variables in translation keys.
Our project has a new Turkish translator, and we are facing a new issue: grammar, especially Turkish grammar. I noticed that this problem is probably evident in every translation system and in most languages, so I posted a question here.
Question: Any ideas how to handle translations like:
Key: I have a[n] {fruit}
Variables: apple, banana
Result: I have an apple. I have a banana.
Key: Stimme für {user}[s] Einsendung (German: "Vote for {user}['s] submission")
Variables: Paul, Markus
Result: Stimme für Pauls Einsendung
Result: Stimme für Markus Einsendung
Does anybody have a solution or an idea for this? My only guess would be to avoid these constructions by not using translations where such issues occur.
How do other platforms handle this?
Of course the translation system has no idea which type of word it is placing where, in which type of sentence. It only does some string replacements...
PS: Turkish is even more complicated:
For example, on a profile page, we have "Annie's Network". This should translate as "Annie'nin Aği".
If the first name ends in a vowel, the suffix will start with an n and look like "Annie'nin"
If the first name ends in a consonant, it will not have the first n, and look like "Kris'in"
If the last vowel is an a or ı, it will look like "Dan'ın"; or Seyma'nın"
If the last vowel is an o or u, it will look like "Davud'un"; or "Burcu'nun"
If the last vowel is an e or i, it will look like "Erin'in"; or "Efe'nin"
If the last vowel is an ö or ü, it will look like "Göz'ün"; or "Eminönü'nün"
If the last letter is a k (like the name "Basak"), it will look like "Basağın"; or "Eriğin"
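Mechanical as these rules are, they can be encoded directly. A rough Python sketch of the genitive suffix (a hypothetical helper; it deliberately ignores the k → ğ consonant softening of the last rule):
def turkish_genitive(name):
    # pick the genitive suffix using the vowel-harmony rules above
    vowels = "aeiouıöü"
    last_vowel = next((c for c in reversed(name.lower()) if c in vowels), "e")
    harmony = {"a": "ı", "ı": "ı", "o": "u", "u": "u",
               "e": "i", "i": "i", "ö": "ü", "ü": "ü"}[last_vowel]
    buffer = "n" if name[-1].lower() in vowels else ""  # vowel-final names take a buffer n
    return f"{name}'{buffer}{harmony}n"

for name in ["Annie", "Kris", "Dan", "Davud", "Burcu", "Efe", "Göz"]:
    print(turkish_genitive(name))
# Annie'nin, Kris'in, Dan'ın, Davud'un, Burcu'nun, Efe'nin, Göz'ün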
It is actually a very hard problem, as grammar rules differ even among languages from the same family. I don't think you could easily do anything for, let's say, Slavic languages...
However, if you want to solve this problem (because this is extra challenging) and you are looking for creative (cross-inspiring) ways to do that, you might want to look into something called ChoiceFormat (an example would be the one from the ICU Project), or you can look up GNU gettext's solution to the plural forms problem.
ICU (mentioned above) has a SelectFormat, http://site.icu-project.org/design/formatting/select , that may be of help - it's like a choice format, but with arbitrary keywords. It also has a PluralFormat, which already has rules for many languages' plurals.

How to generate (book) indexes?

I need to create an index for a book. While the task looks easy at first glance -- group words by their first letter, then sort them -- this obvious solution works only for English. The real world is, however, more complex. See http://en.wikipedia.org/wiki/Collation :
The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
I tried to find an existing solution.
The DocBook stylesheets do not address the problem.
The best match I found is xindy ( http://xindy.sourceforge.net/ ), but this tool is too closely tied to LaTeX.
Any other suggestions?
Naively, you could examine every word in the text and create a hash, using the words as keys and building up an array of locations (page numbers?) as values.
But indexes are generally a bit more focused than that.
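A minimal Python sketch of that naive approach, with hypothetical page texts, before collation even enters the picture:
from collections import defaultdict

def build_index(pages):
    # map each word to the list of page numbers it appears on
    index = defaultdict(list)
    for page_number, text in enumerate(pages, start=1):
        for word in set(text.lower().split()):
            index[word].append(page_number)
    return index

index = build_index(["The quick brown fox", "the lazy dog"])
for word in sorted(index):  # naive sort; fine for plain English only
    print(word, index[word])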
Well, after answering the comments, I realized that I don't need a tool to generate indexes, but a library which can sort according to cultures. First experiments show that I'm going to use ICU and its Python bindings, PyICU. For example:
import icu

words = ["liche", "lichée", "lichen", "lichénoïde", "licher", "lichoter"]
collator = icu.Collator.createInstance(icu.Locale.getFrance())
for word in sorted(words, key=collator.getSortKey):
    print(word)