I have a data frame of ad listings for pets:
ID Ad_title
1 1 year old ball python
2 Young red Blood python. - For Sale
3 1 Year Old Male Bearded Dragon - For Sale
I would like to take the common name in the Ad_title (e.g. ball python) and create a new field with the Latin name for the species. To assist, I have another data frame that has the Latin names and common names:
ID Latin_name Common_name
1 Python regius E: Ball Python, Royal Python G: Königspython
2 Python brongersmai E: Red Blood Python, Malaysian Blood Python
3 Pogona barbata E: Eastern Bearded Dragon, Bearded Dragon
How can I go about doing this? The tricky part is that the common names are buried in surrounding text, both in the ad listing and in Common_name. If that were not the case I could just use %in%. If there were a way/function to use regex, I think that would be helpful.
The other answer does a good job outlining the general logic, so here are a few thoughts on a simple (though not optimized!) way to do this:
First, you'll want to make a big table: two columns of all 'common names' (each name gets its own row) alongside its Latin name. You could also make a dictionary here, but I like tables.
reference_table <- data.frame(common = c("cat", "kitty", "dog"), technical = c("feline", "feline", "canine"))
common technical
1 cat feline
2 kitty feline
3 dog canine
From here, just loop over the rows of reference_table (with apply()) or over every element of ad_title (with a for loop), depending on your preference. Then use something like this:
apply(reference_table, 1, function(X) {
  if (length(grep(X[["common"]], ad_title)) > 0) {  # the common name was found in ad_title
    # [code to replace the string]
  }
})
For inserting the new string, play with your regular regex tools. Alternatively, play with strsplit(ad_title, X[["common"]]). You'll be able to rebuild the ad_title using paste() and the pieces returned by strsplit().
Again, this is NOT the best way to do this, but hopefully the logic is simple.
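To make that concrete, here is a minimal runnable sketch of the loop-and-grep idea; the ad_title vector below is invented purely for illustration. Note that grepl() already gives a logical vector over all titles at once, so the inner grep/if check above isn't strictly needed:
# Toy data, for illustration only
reference_table <- data.frame(common = c("cat", "kitty", "dog"),
                              technical = c("feline", "feline", "canine"),
                              stringsAsFactors = FALSE)
ad_title <- c("2 year old cat - for sale", "friendly dog needs a home")

latin <- rep(NA_character_, length(ad_title))
for (i in seq_len(nrow(reference_table))) {
  # which ads mention this common name?
  hits <- grepl(reference_table$common[i], ad_title, ignore.case = TRUE)
  latin[hits] <- reference_table$technical[i]
}
data.frame(ad_title, latin)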
Well, I tried to create a workable solution for your requirement. There could be better ways to do it, probably using packages such as data.table and/or stringr, but this snippet could be a working starting point. Oh, and I modified the Ad_title data a bit so that the species names are in title case.
# Re-create data
Ad_title <- c("1 year old Ball Python", "Young Red Blood Python. - For Sale",
"1 Year Old Male Bearded Dragon - For Sale")
df2 <- data.frame(Latin_name = c("Python regius", "Python brongersmai", "Pogona barbata"),
Common_name = c("E: Ball Python, Royal Python G: Königspython",
"E: Red Blood Python, Malaysian Blood Python",
"E: Eastern Bearded Dragon, Bearded Dragon"),
stringsAsFactors = F)
# Aggregate common names
Common_name <- paste(df2$Common_name, collapse = ", ")
Common_name <- unlist(strsplit(Common_name, "(E: )|( G: )|(, )"))
Common_name <- Common_name[Common_name != ""]
# Data frame latin names vs common names
df3 <- data.frame(Common_name, Latin_name = sapply(Common_name, grep, df2$Common_name),
row.names = NULL, stringsAsFactors = F)
df3$Latin_name <- df2$Latin_name[df3$Latin_name]
# Data frame Ad vs common names
Ad_Common_name <- unlist(sapply(Common_name, grep, Ad_title))
df4 <- data.frame(Ad_title, Common_name = sapply(1:3, function(i) names(Ad_Common_name[Ad_Common_name==i])),
stringsAsFactors = F)
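If you then want the Latin name as a new field on the ads (which is what the question asks for), one possible final step -- not part of the snippet above, and assuming each ad matched exactly one common name, as it does with this demo data -- would be:
# Look up the Latin name for the common name found in each ad
df4$Latin_name <- df3$Latin_name[match(df4$Common_name, df3$Common_name)]
df4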
Obviously you need a loop over your whole common-name lookup table, and within it another loop that splits the compound field on commas, before doing a simple regex. There's no sane single regex that will do it all.
In future, avoid packed/compound structures that require packing and unpacking. They look fine for human consumption, but semantically, and for program consumption, you have multiple data values packed into a single field: what you have there is not a "common name", it's "common names", delimited by commas.
Sorry if I haven't provided an R-specific answer; I'm a technology veteran and use many languages/technologies depending on the problem and available resources. You will need to iterate over every record of your Latin-name lookup table, and within that, iterate over the comma-delimited packed field of "common names", so that you're working with one common name at a time. With that single common name you search/replace, using regex or whatever means are available to you, over the whole input file. It's plain and simple that you need to start from that end, i.e. the lookup table; you need to iterate/loop through it. Iteration/looping should be familiar to you, as it's a basic building block of any program/script; this kind of procedural logic is not part of the capability (or desired functionality) of regex itself. I assume you know how to create an iterative construct in R or whatever you're using for this.
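For what it's worth, here is a sketch of that nested iteration in R, reusing the df2 and Ad_title objects defined in the answer above (an assumption about your actual data; adjust the names as needed):
Latin_for_ad <- rep(NA_character_, length(Ad_title))
for (i in seq_len(nrow(df2))) {
  # unpack the compound "common names" field into individual names
  common_i <- unlist(strsplit(df2$Common_name[i], "(E: )|( G: )|(, )"))
  common_i <- common_i[common_i != ""]
  for (cn in common_i) {
    hits <- grepl(cn, Ad_title, fixed = TRUE)   # one common name at a time
    Latin_for_ad[hits] <- df2$Latin_name[i]
  }
}
data.frame(Ad_title, Latin_for_ad)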
Fairly new to programming in R,
I have a dataframe from which I am trying to create a more concise table by pulling the entire row only if it contains a certain name in the "name" column. The names are all in a separate text document. Any suggestions?
I tried:
refGenestable <- dbGetQuery(con, "select row_names, name, chrom, strand, txStart, txEnd from refGene where name in c_Gene")
where c_Gene is the list of names I need to test, which I have turned into a data frame. I also tried turning it into a list of strings and iterating through that, but had problems with that as well.
Edit:
Sorry for the confusion, I'm still learning! I created the data frame ("refGenestable") in R (but yes, it is from a SQL database), and I now want to narrow it down to include only the rows whose name matches one of the names I have in a text file, c_Genes, where each name is separated by \n. I created a list out of this file.
You may have a few issues here. It's hard to know exactly what you need because it's unclear what the structure of your data is.
The general question is easy to answer.
Provided you have a data frame, and you want a new one with only the names that are in a vector, you can use DF[DF$name %in% <some vector>, ] or, with dplyr, filter(DF, name %in% <some vector>). You can't use %in% to test whether something is in a data frame, though; you have to actually extract the variable from the other data frame.
If the names you want to keep are lines in a text file, then you're also asking a question about how to get the text file into R, in which case it's my_vector <- readLines("path to file"). The actual code will depend on the structure of the file, but if each element is on a new line, that will do what you want.
If the names you want to keep are in another data frame, then you need to extract them as a vector in order to use %in%, i.e., filter(DF, name %in% OTHERDF$name)
EDIT:
From your edit to the question, my answer should likely work for you. Though, again, we don't know for sure what the structure of your data is without seeing it (you can provide it by pasting the output of dput(<your object>)). Here's the answer above, using the names for the objects you've described.
gene_names <- readLines("c_Genes")
# is that really the name? No extension? Is it in your working directory?
# if not, you need to use a relative or absolute path for c_Genes
genes_you_want <- refGenestable[refGenestable$name %in% gene_names,]
# is the column with the gene name called name?
# don't forget the comma at the end
# or with dplyr
install.packages("dplyr")
library(dplyr)
genes_you_want <- filter(refGenestable, name %in% gene_names)
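One small aside: if the text file happens to contain stray whitespace or blank lines (we can't tell without seeing it), cleaning the vector before either of the filtering calls above avoids silent non-matches:
gene_names <- trimws(gene_names)              # strip stray spaces around each name
gene_names <- gene_names[gene_names != ""]    # drop any empty lines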
A typical parser in Anorm looks a bit like this:
val idSeqParser: RowParser[IDAndSequence] = {
  long("ID") ~
    int("sequence") map {
      case id ~ sequence => IDAndSequence(id, sequence)
    }
}
Assuming of course that you had a case class like so:
case class IDAndSequence(id: Long, sequence: Int = 0)
All handy-dandy when you know this up front, but what if you want to run ad-hoc queries (raw SQL) that you write at run time? (Hint: an on-the-fly reporting engine.)
How does one tackle that problem?
Can you create a series of generic parsers for various numbers of fields? (I see Scala itself had to resort to this when processing tuples on Forms, meaning you can only go to 22 elements in a form, and I'm unsure what the heck you do after that...)
You can assume that "everything is a string" for the purpose of reporting so Option[String] should cut it.
Can a parser be created on the fly, though? If so, what would that look like?
Is there a more elegant way to address this "problem"?
EDIT (to help clarify what I'm after)
As in, I could "ask" using aliases
Select f1 as 'a', f2 as 'b', f3 as 'c' from sometable
Then I could collect that with a pre-written parser like so
val genericParser: RowParser[GenericCase] = {
  get[Option[String]]("a") ~
    get[Option[String]]("b") ~
    get[Option[String]]("c") map {
      case a ~ b ~ c => GenericCase(a, b, c)
    }
}
However, that means I would need to de-alias the columns for the actual report output. The suggestion of SqlParser.flatten already puts me ahead there, as it handles up to 22 (there's that "literal" kludge!) columns.
As I've written reports with more than 22 columns in times past -- mostly as inputs to spreadsheets for further manual data mining -- I would like to escape that limit if possible. Hard to tell a client they can't have that urgent 27-column report for 5 days, but this 21-column one they can have in 5 minutes...
Going to try an experiment today to see if I can't find my own workable solution.
I have a really hard time wrapping my head around arrays and associative arrays in awk.
Say you want to compare two different columns in two different files using associative arrays, how would you do it? Let's say column 1 in file 1 with column 2 in file 2, then print the matching, corresponding values of file 1 in a new column in file 2. Please explain each step really simply, as if talking to your grandmother; I mean, super-thoroughly and super-simply.
Cheers
Simple explanation of associative arrays (aka maps), not specifically for awk:
Unlike a normal array, where each element has a numeric index, an associative array uses a "key" instead of an index. You can think of it as being like a simple flat-file database, where each record has a key and a value. So if you have, e.g. some salary data:
Fred 10000
John 12000
Sara 11000
you could store it in an associative array, a, like this:
a["Fred"] = 10000
a["John"] = 12000
a["Sara"] = 11000
and then when you wanted to retrieve a salary for a person you would just look it up using their name as the key, e.g.
salary = a[person]
You can of course modify the values too, so if you wanted to give Fred a 10% pay rise you could do it like this:
a["Fred"] = a["Fred"] * 1.1
And if you wanted to set Sara's salary to be the same as John's you could write:
a["Sara"] = a["John"]
So an associative array is just an array that maps keys to values. Note that the keys do not need to be strings, and the values do not need to be numeric, but the basic concept is the same regardless of the data types. Note also that one obvious constraint is that keys need to be unique.
Grandma - let's say you want to make jam out of strawberries, raspberries, and blueberries, one jar of each.
You have a shelf on your wall with room/openings for 3 jars on it. That shelf is an associative array: shelf[]
Stick a label named "strawberry" beneath any one of the 3 openings. That is the index of an associative array: shelf["strawberry"]
Now place a jar of strawberry jam in the opening above that label. That is the contents of the associative array indexed by the word "strawberry": shelf["strawberry"] = "the jar of strawberry jam"
Repeat for raspberry and blueberry.
When you feel like making yourself a delicious sandwich, go to your shelf (array), look for the label (index) named "strawberry" and pick up the jar sitting above it (contents/value), open and apply liberally to bread (preferably Mothers Pride end slices).
Now - if a wolf comes to the door, do not open it in case he steals your sandwich or worse!
I need to create an index for a book. While the task looks easy at first glance -- group words by their first letter, then sort them -- this obvious solution works only for English. The real world is, however, more complex. See http://en.wikipedia.org/wiki/Collation :
The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
I tried to find an existing solution.
The DocBook stylesheets do not address the problem.
The best match I found is xindy ( http://xindy.sourceforge.net/ ), but this tool is too closely tied to LaTeX.
Any other suggestions?
Naively, you could examine every word in the text and create a hash, using each word as a key and building up an array of locations (page numbers?) as values.
But indexes are generally a bit more focused than that.
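To make the naive approach concrete, here is a rough sketch (in R, though any language with a map/dictionary type works the same way; the page text below is invented purely for illustration):
# Invented page text, keyed by page number
pages <- list("1" = "alpha beta gamma", "2" = "beta delta")

index <- list()
for (p in names(pages)) {
  for (w in unique(strsplit(pages[[p]], "\\s+")[[1]])) {
    index[[w]] <- c(index[[w]], as.integer(p))  # word -> vector of page numbers
  }
}
index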
Well, after answering to comments, I realized that I don't need a tool to generate indexes, but a library which can sort according to cultures. First experiments shows that I'm going to use ICU and its Python bindings PyICU. For example:
Well, after answering the comments, I realized that I don't need a tool to generate indexes, but a library which can sort according to cultures. First experiments show that I'm going to use ICU and its Python bindings, PyICU. For example:
import icu
words = ["liche", "lichée", "lichen", "lichénoïde", "licher", "lichoter"]
collator = icu.Collator.createInstance(icu.Locale.getFrance())
for word in sorted(words, cmp=collator.compare):
    print word.decode("string-escape")
Converting a database of people and addresses from ALL CAPS to Title Case will create a number of improperly capitalized words/names, some examples follow:
MacDonald, PhD, CPA, III
Does anyone know of an existing script that will clean up all the common problem words? Certainly, it will still leave some mistakes behind (less common names with CamelCase-like spellings, e.g. "MacDonalz").
I don't think it matters much, but the data currently resides in MSSQL. Since this is a one-time job, I'd export out to text if a solution requires it.
There is a thread that posed a related question, sometimes touching on this problem, but not addressing this problem specifically. You can see it here:
SQL Server: Make all UPPER case to Proper Case/Title Case
I don't know if this is of any help:
private static function ucNames($surname) {
    $replaceValue = ucwords($surname);
    // preg_replace_callback replaces the old /e modifier, which is removed in PHP 7+
    return preg_replace_callback('/
        (?: ^ | \\b )              # assertion: beginning of string or a word boundary
        ( O\' | \- | Ma?c | Fitz ) # attempt to match Irish, Scottish and double-barrelled surnames
        ( [^\W\d_] )               # match next char; we exclude digits and _ from \w
        /x',
        function ($matches) { return $matches[1] . strtoupper($matches[2]); },
        $replaceValue);
}
It's a simple PHP function that I use to set surnames to the correct case; it works for names like O'Connor, McDonald, MacBeth and FitzPatrick, and for double-barrelled names like Hedley-Smythe.
Here is the answer I was looking for:
There is a data company, Melissa Data, which publishes APIs and applications for database cleanup -- geared mostly toward the direct marketing industry.
I was able to use two applications to solve my problem.
StyleList: this app, among other things, converts ALL CAPS to mixed case, and in the process it does not dirty up the data, leaving titles such as CPA, MD, III, etc. intact, as well as natural, common camel-case names such as McDonalds.
Personator: I used Personator to break the Full Name fields into Prefix, First Name, Middle Name, Last Name, and Suffix. To be honest, it was far from perfect, but the data I gave it was pretty challenging (often no space separating a middle name and a suffix). This app does a number of other useful things as well, including assigning gender to most names. It's available as an API you can call, too.
Here is a link to the solutions offered by Melissa Data:
http://www.melissadata.com/dqt/index.htm
For me, the Melissa Data apps did much of the heavy lifting, and the remaining dirty data was identifiable and fixable in SQL by reporting on LEFT x or RIGHT x counts -- the dirt typically has the least uniqueness, with patterns easily discovered and fixed.