RegEx for parsing chemical formulas - objective-c

I need a way to separate a chemical formula into its components. The result should look like
this:
Ag3PO4 -> [Ag3, P, O4]
H2O -> [H2, O]
CH3OOH -> [C, H3, O, O, H]
Ca3(PO4)2 -> [Ca3, (PO4)2]
I don't know regex syntax, but I know I need something like this:
[An optional parenthesis][A capital letter][0 or more lowercase letters][0 or more numbers][An optional parenthesis][0 or more numbers]
This worked:
NSRegularExpression *regex = [NSRegularExpression
    regularExpressionWithPattern:@"[A-Z][a-z]*\\d*|\\([^)]+\\)\\d*"
    options:0
    error:nil];
NSArray *tests = [[NSArray alloc] initWithObjects:@"Ca3(PO4)2", @"HCl", @"CaCO3", @"ZnCl2", @"C7H6O2", @"BaSO4", nil];
for (NSString *testString in tests)
{
    NSLog(@"Testing: %@", testString);
    NSArray *myArray = [regex matchesInString:testString options:0 range:NSMakeRange(0, [testString length])];
    NSMutableArray *matches = [NSMutableArray arrayWithCapacity:[myArray count]];
    for (NSTextCheckingResult *match in myArray) {
        NSRange matchRange = [match rangeAtIndex:0];
        [matches addObject:[testString substringWithRange:matchRange]];
        NSLog(@"%@", [matches lastObject]);
    }
}

(PO4)2 really stands apart from the rest.
Let's start simple and match items without parentheses:
[A-Z][a-z]?\d*
Using regex above we can successfully parse Ag3PO4, H2O, CH3OOH.
Then we need to add an expression for the group. A group by itself can be matched using:
\(.*?\)\d+
So we add an alternation (an "or" condition):
[A-Z][a-z]?\d*|\(.*?\)\d+
This works for the given cases, but you may have more samples to test.
Note: It will have problems with nested parentheses, e.g. Co3(Fe(CN)6)2.
If you want to handle that case, you can use the following regex:
[A-Z][a-z]?\d*|(?<!\([^)]*)\(.*\)\d+(?![^(]*\))
For Objective-C you can use the expression without lookarounds:
[A-Z][a-z]?\d*|\([^()]*(?:\(.*\))?[^()]*\)\d+
Or a regex with repetitions (I don't know of any such formulas, but in case there is anything like A(B(CD)3E(FG)4)5, with multiple parenthesized blocks inside one):
[A-Z][a-z]?\d*|\((?:[^()]*(?:\(.*\))?[^()]*)+\)\d+
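For arbitrarily deep nesting, an alternative to growing the regex is a small hand-written scanner that tracks parenthesis depth. This is a swapped-in technique, not one of the regex answers above, and `SplitFormula` is a hypothetical helper name. A minimal sketch, assuming ASCII formulas with balanced parentheses:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the answers above): splits a formula into
// top-level tokens, keeping a parenthesized group and its count together.
// Assumes ASCII input with balanced parentheses.
static NSArray<NSString *> *SplitFormula(NSString *formula) {
    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    NSCharacterSet *digits =
        [NSCharacterSet characterSetWithCharactersInString:@"0123456789"];
    NSCharacterSet *lower =
        [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyz"];
    NSUInteger length = formula.length;
    NSUInteger i = 0;
    while (i < length) {
        NSUInteger start = i;
        if ([formula characterAtIndex:i] == '(') {
            // Walk to the matching ')' by tracking nesting depth.
            NSInteger depth = 0;
            while (i < length) {
                unichar d = [formula characterAtIndex:i++];
                if (d == '(') {
                    depth++;
                } else if (d == ')' && --depth == 0) {
                    break;
                }
            }
        } else {
            // One element symbol: first character plus any lowercase letters.
            i++;
            while (i < length && [lower characterIsMember:[formula characterAtIndex:i]]) i++;
        }
        // Trailing count, if any.
        while (i < length && [digits characterIsMember:[formula characterAtIndex:i]]) i++;
        [tokens addObject:[formula substringWithRange:NSMakeRange(start, i - start)]];
    }
    return tokens;
}
```

Calling `SplitFormula(@"Co3(Fe(CN)6)2")` keeps the outer group whole; to break a group down further, strip its outer parentheses and trailing count and call the function again on the inside.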

When you encounter a parenthesis group, you don't want to parse what's inside, right?
If there are no nested parenthesis groups you can simply use
[A-Z][a-z]*\d*|\([^)]+\)\d*
\d is a shortcut for [0-9]; [^)] means anything but a closing parenthesis.

This should just about work:
/(\(?)([A-Z])([a-z]*)([0-9]*)(\))?([0-9]*)/g
Play around with it here: http://refiddle.com/

This pattern should work, depending on your regex engine (it relies on recursion via (?R), as supported by PCRE):
([A-Z][a-z]*\d*)|(\((?:[^()]+|(?R))*\)\d*) with the gm options

It is better to limit the set of characters to valid element symbols. In simple form:
^((Ac|Ag|Al|Am|Ar|As|At|Au|B|Ba|Be|Bh|Bi|Bk|Br|C|Ca|Cd|Ce|Cf|Cl|Cm|Co|Cr|Cs|Cu|Ds|Db|Dy|Er|Es|Eu|F|Fe|Fm|Fr|Ga|Gd|Ge|H|He|Hf|Hg|Ho|Hs|I|In|Ir|K|Kr|La|Li|Lr|Lu|Md|Mg|Mn|Mo|Mt|N|Na|Nb|Nd|Ne|Ni|No|Np|O|Os|P|Pa|Pb|Pd|Pm|Po|Pr|Pt|Pu|Ra|Rb|Re|Rf|Rg|Rh|Rn|Ru|S|Sb|Sc|Se|Sg|Si|Sm|Sn|Sr|Ta|Tb|Tc|Te|Th|Ti|Tl|Tm|U|V|W|Xe|Y|Yb|Zn|Zr)\d*)+$
This doesn't deal with the parenthesized groups.
We worked this out during the San Diego Python Users Group meeting.

Related

Apostrophes (') is not recognised in regular expression

I want a regular expression for a first name that can contain
1) Letters
2) Spaces
3) Apostrophes
Examples: Raja, Raja reddy, Raja's
I used ^([a-z]+[,.]?[ ]?|[a-z]+[']?)+$ but it fails to recognise apostrophes (').
- (BOOL)validateFirstNameOrLastNameOrCity:(NSString *)inputCanditate {
    NSString *firstNameRegex = @"^([a-z]+[,.]?[ ]?|[a-z]+[']?)+$";
    NSPredicate *firstNamePredicate = [NSPredicate predicateWithFormat:@"SELF MATCHES[c] %@", firstNameRegex];
    return [firstNamePredicate evaluateWithObject:inputCanditate];
}
May I recommend ^[A-Z][a-zA-Z ']* ?
// NSRegularExpression is available in the Foundation framework from iOS 4.0 onward
NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"^[A-Z][a-zA-Z ']*" options:NSRegularExpressionAnchorsMatchLines error:&error];
NSUInteger numberOfMatches = [regex numberOfMatchesInString:searchText options:0 range:NSMakeRange(0, [searchText length])];
return numberOfMatches > 0;
^[A-Z] : forces the string to start with a capital letter from A to Z
[a-zA-Z ']* : followed by any number of characters that can be 'a' to 'z', 'A' to 'Z', a space, or a single quote
I think you are looking for a pattern like this: ^[a-zA-Z ']+$
However, this is pretty bad. What about umlauts, accents, and a whole lot of other letters that are not part of the ASCII alphabet?
A better solution would be to allow any kind of letter from any language.
To do so you can use the Unicode "letter" category \p{L}, e.g. ^[\p{L}]+$.
Or you could just drop that rule altogether, as reasonably suggested.
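Combining the two suggestions above (the \p{L} category plus spaces and apostrophes) could look like the sketch below. `IsPlausibleName` is a hypothetical helper name, and the exact set of allowed characters is an assumption based on the question's three rules.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when the whole string consists only of letters
// (from any script), spaces, and apostrophes. The character set is an
// assumption taken from the question's requirements.
static BOOL IsPlausibleName(NSString *candidate) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"^[\\p{L} ']+$"
                                                  options:0
                                                    error:NULL];
    NSRange whole = NSMakeRange(0, candidate.length);
    return [regex firstMatchInString:candidate options:0 range:whole] != nil;
}
```

Unlike the ASCII-only class, this accepts names such as "Jürgen" while still rejecting digits and punctuation other than the apostrophe.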

Objective-C: Parsing String into an Array under Special Circumstances

I have a string:
[{"id":1,"gameName":"arizona","cost":"0.5E1","email":"hi@gmail.com","requests":0},{"id":2,"gameName":"arizona","cost":"0.5E1","email":"hi@gmail.com","requests":0},{"id":3,"gameName":"arizona","cost":"0.5E1","email":"hi@gmail.com","requests":0}]
However, I would like to parse this string into an array such as:
[{"id":1,"gameName":"arizona","cost":"0.5E1","email":"hi@gmail.com","requests":0},
{"id":2,"gameName":"arizona","cost":"0.5E1","email":"hi@gmail.com","requests":0},
{"id":3,"gameName":"arizona","cost":"0.5E1","email":"hi@gmail.com","requests":0}]
This array is delimited by the comma in between the curly braces: },{
I tried using the command
NSArray *responseArray = [response componentsSeparatedByString:@","];
but this separates the string into values at EVERY comma, which is not desirable.
Then I tried using regex:
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"\\{.*\\}" options:NSRegularExpressionCaseInsensitive error:&error];
NSArray *matches = [regex matchesInString:response options:0 range:NSMakeRange(0, [response length])];
which found one match: starting at the first curly brace to the last curly brace.
I was wondering if anyone knew how to solve this problem efficiently?
This string seems to be valid JSON. Try a JSON parser: NSJSONSerialization
I agree with H2CO3's suggestion to use a parser where possible.
But looking at your attempted regex, it looks like you just need to make it non-greedy, i.e.
@"\\{.*?\\}"
The question mark after .* makes the match non-greedy.
Of course, this will fail if you have deeper levels of (what I assume to be) nested arrays. Go with the JSON parser!
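The NSJSONSerialization route suggested above might look like this. `ParseGames` is a hypothetical helper name, and the sketch assumes the response is UTF-8 text whose top-level value is a JSON array:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: parses a JSON array of objects. Returns nil if the
// text is not valid JSON or the top-level value is not an array.
static NSArray *ParseGames(NSString *response) {
    NSData *data = [response dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil) {
        return nil;
    }
    NSError *error = nil;
    id parsed = [NSJSONSerialization JSONObjectWithData:data options:0 error:&error];
    if (![parsed isKindOfClass:[NSArray class]]) {
        return nil;
    }
    return parsed;
}
```

Each element of the returned array is an NSDictionary, so a field can be read with, for example, `games[0][@"gameName"]`, and nested structures are handled without any regex.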

Objective C. Regular expression to eliminate anything after 3 dots

I wrote the following code to eliminate anything after 3 dots
currentItem.summary = @"I am just testing. I am ... the second part should be eliminated";
NSError * error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(.)*(/././.)(.)*" options:0 error:&error];
if(nil != regex){
    currentItem.summary = [regex stringByReplacingMatchesInString:currentItem.summary
                                                          options:0 range:NSMakeRange(0, [currentItem.summary length])
                                                     withTemplate:@"$1"];
}
However, my input and output are the same. The correct output should be "I am just testing. I am".
I was trying to do this using regular expression because I have a database of other regular expressions that I run on the string. I know the performance might not be as good as a plain text find or replace but the strings involved are short. I also tried using "\" to escape the dots in the regex, but I was getting a warning.
There is another question with a similar topic but the match strings are not for objective c.
This is much easier and will accomplish what you want:
NSRange range = [currentItem.summary rangeOfString:@"..."];
if (range.location != NSNotFound) {
    currentItem.summary = [currentItem.summary substringToIndex:range.location];
}
You have forward slashes, /, instead of backward slashes, \, in your pattern. Also if you wish to match everything before the three dots you should use (.*) - tag everything matched by the enclosed .*. (The other parentheses in the pattern are redundant.)
Nice alternative:
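NSScanner *scanner = [NSScanner scannerWithString:currentItem.summary];
NSString *summary = nil; // a property cannot be passed by reference, so scan into a local
if ([scanner scanUpToString:@"..." intoString:&summary]) currentItem.summary = summary;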
My recommended regex for your problem:
regularExpressionWithPattern:@"^(.*?)\\s*\\.{3}.*$"
Main differences between this one and yours:
uses backslashes to escape special chars
uses ^ and $ to anchor at the beginning and end of the string
only captures the interesting section with ()
strips whitespace before the ... by matching any number of whitespace chars (\s*) outside the capture; the capture is lazy (.*?) so that it does not swallow that whitespace itself.
After correcting the slashes and other improvements, my final expression is:
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"^(.*)\\.{3}.*$"
                                                                       options:0
                                                                         error:&error];
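Putting the pieces together, the replacement could be wrapped as below. `TrimAfterEllipsis` is a hypothetical helper name. The sketch uses a lazy capture, (.*?), so that \s* can absorb the whitespace before the dots; this is an assumption that the trailing space should be dropped, as in the expected output in the question.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text before the first "...", without the
// whitespace directly in front of the dots. Input that contains no "..."
// comes back unchanged because the pattern does not match.
static NSString *TrimAfterEllipsis(NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"^(.*?)\\s*\\.{3}.*$"
                                                  options:0
                                                    error:NULL];
    return [regex stringByReplacingMatchesInString:text
                                           options:0
                                             range:NSMakeRange(0, text.length)
                                      withTemplate:@"$1"];
}
```

With a greedy (.*) the capture would run up to the last "..." in the string and keep the space before it, which is why the lazy form is used here.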

Finding 2 Capitalized Words in a Row NSString

I'm writing a Mac app that goes through an NSString and adds all its words to an NSArray (separating them on whitespace). I've got the whole system down, but I still have one little problem: names (first + last) are added as two different words, and that's bothersome to me.
I thought of a couple solutions to fix this. My best idea was to, before actually adding the words to the array, join two words in a row that are capitalized. Then, through an if statement, determine if a word has two capitals in it, and then split the word and add it as one word. However, I can't find a way to find 2 words in a row with capitals.
Should I be using RegexKitLite (which I'm not familiar with), for example, to find two capitalized words in a row? I've seen this question: Regexp to pull capitalized words not at the beginning of sentence and two adjacent words
which seems somewhat related, but due to my lack of understanding of regular expressions, I don't really know if this is exactly what I need.
I've also seen this: Separating NSString into NSArray, but allowing quotes to group words
which is also similar, yet not exactly adapted to my needs.
So, to conclude, does anyone know how to either join capitalized words in an NSString, or even better, how to find two capitalized words in a row in an NSString ?
If you're targeting iOS 4.0 or later, or OS X 10.7, you can use NSRegularExpression:
NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression
    regularExpressionWithPattern:@"[A-Z]\\w*\\s[A-Z]\\w*"
    options:0
    error:&error];
NSString *inputString = @"One two Three Four five six Seven Eight";
NSArray *stringsWithTwoCapitalizedWordsInARow = [regex
    matchesInString:inputString
    options:0
    range:NSMakeRange(0, [inputString length])];
The ranges of the returned matches will cover something like this:
["Three Four", "Seven Eight"]
You could just do a second pass on the resulting array after it has been loaded to append entries together that need to be joined.
Names are notoriously difficult to match with regular expressions alone, as it is not unheard of for names (first or last) to contain spaces themselves.
NSMutableArray* words = ...;
NSMutableArray* joinedWords = [NSMutableArray array];
for (NSInteger i = 0; i < (NSInteger)[words count]; i++)
{
    NSString* currentLine = [words objectAtIndex:i];
    bool capitalized = false;
    bool capitalizedNext = false;
    // Check if first letter is uppercase
    capitalized = isCap(currentLine); // Up to your discretion here
    NSString* nextLine = nil;
    // Skip the lookahead for the last entry
    if (i+1 < (NSInteger)[words count])
    {
        nextLine = [words objectAtIndex:i+1];
        capitalizedNext = isCap(nextLine);
    }
    if (capitalized == true && capitalizedNext == true)
    {
        // Merge the two words into one entry
        [words replaceObjectAtIndex:i withObject:[NSString stringWithFormat:@"%@ %@", currentLine, nextLine]];
        [words removeObjectAtIndex:i+1];
        // Run test again on new version of the line
        i--;
    }
    else
    {
        [joinedWords addObject:currentLine];
    }
}
[A-Z][A-Za-z]* [A-Z][A-Za-z]*|[\S]*
http://rubular.com/r/DrOabOAfBr
I've written a regular expression for you. This regex tries to match a name first, then falls back to a single word, so your job is as simple as feeding it into NSRegularExpression and taking all the matches as your words, with names already joined.
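Feeding that regex into NSRegularExpression could look like the sketch below. `WordsJoiningNames` is a hypothetical helper name, and the fallback branch is written as \S+ rather than [\S]* so that it cannot produce empty matches; that change is an assumption on my part, not part of the original answer.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: splits text on whitespace but keeps two adjacent
// capitalized words together as one entry. The name branch is tried first,
// then the single-word fallback (\S+, adjusted to avoid empty matches).
static NSArray<NSString *> *WordsJoiningNames(NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"[A-Z][A-Za-z]* [A-Z][A-Za-z]*|\\S+"
                                                  options:0
                                                    error:NULL];
    NSMutableArray<NSString *> *words = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:text options:0 range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        [words addObject:[text substringWithRange:match.range]];
    }
    return words;
}
```

Like any capitalization heuristic, this also joins pairs that are not names (for example a sentence-initial word followed by a proper noun), so a second filtering pass may still be needed.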

Why NSRegularExpression says that there are two matches of ".*" in the "a" string?

I'm very happy that Lion introduced NSRegularExpression, but I can't understand why the pattern .* matches two occurrences in a string like "a" (text can be longer).
I was using the following code:
NSError *anError = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@".*"
                                                                       options:0
                                                                         error:&anError];
NSString *text = @"a";
NSUInteger counter = [regex numberOfMatchesInString:text
                                            options:0
                                              range:NSMakeRange(0, [text length])];
NSLog(@"counter = %lu", (unsigned long)counter);
Output from the console is:
2011-07-27 22:03:27.689 Regex[1930:707] counter = 2
Can anyone explain why that is?
The regular expression .* matches zero or more characters. Thus, it will match the empty string as well as "a", and as such there are two matches.
Mildly surprised that it didn't match 3 times. One for the "" before the "a", one for the "a" and one for the "" after the "a".
As has been noted, use a more precise pattern; including anchors (^ and/or $) might also change the behaviour.
No-one has asked, but why would you want to do this anyway?
The documentation for NSRegularExpression says the following:
Some regular expressions [...] can
successfully match a zero-length range, so the comparison of the
resulting range with {NSNotFound, 0} is the most reliable way to
determine whether there was a match or not.
A more reliable way to get just one match would be to change the expression to .+
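The difference between the two patterns can be checked with a small counting helper. `CountMatches` is a hypothetical name, and the count of two for .* is the one reported in the question rather than something guaranteed by this sketch:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: counts how many matches a pattern yields in a string.
// An invalid pattern gives a nil regex, so the result is 0 in that case.
static NSUInteger CountMatches(NSString *pattern, NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:NULL];
    return [regex numberOfMatchesInString:text
                                  options:0
                                    range:NSMakeRange(0, text.length)];
}
```

Comparing `CountMatches(@".*", @"a")` with `CountMatches(@".+", @"a")` shows the extra zero-length match disappearing once at least one character is required.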