Deleting comments from a string using NSScanner? - objective-c

I've a string that may or may not include C++ comments on it (multi-lin and single line) and I need to strip those comments out before being able to use that string. My current idea is to use an NSScanner to do that - find the position of opening and closing multi-line comments and delete that portion of the string and find the position of opening one line comments and EOL characters and also delete that portion of the string.
Would you do it differently? What would be your approach? If it matters, the string can be several megabytes in size so performance is an issue.

From this thread, I thought the best suggestion was to run the string through the C++ preprocessor.

Don't forget to keep track of quote marks, too. Test cases:
"/*Ceci n'est pas une commentaire*/"
'/**/' (Mac OS/Mac OS X OSType literal)
'//!\n'
"This string does not contain a // comment"
In all of these cases, you should not detect a comment.
The converse is also true:
//Ceci n'est pas une "string"
/*This comment does not contain an OS'Type' literal*/

My solution:
Go through the string using NSScanner and mark the position of each multi-line and single comment and all strings (anything between single and double quotes). Store the positions in an array of NSValues that represent the ranges of each item.
Then iterate through the array of comments, making sure that each comment is not inside a string. The way to check for this is to make sure that the location of each comment is not bigger than the location of each string and the location of the comment does not fall into location + length of each string.
And voila. Anything that doesn't fall into can be safely deleted as it is a valid comment.

Related

VBA replace certain carriage

All.
I am used to programming VBA in Excel, but am new to the structures in Word.
I am working through a library of text files to update them. Many of them are either OCR documents, or were manually entered.
Each has a recurring pattern, the most common of which is unnecessary carriage returns.
For example, I am looking at several text files where there is a double return after each line. A search and replace of all double carriage returns removes all paragraph distinctions.
However, each line is approximately 30 characters long, and if I manually perform the following logic, it gives me a functional document.
If there is a double carriage return after 30+ characters, I replace them with a space.
If there were less than 30 characters prior to the double return, I replace them with a single return.
Can anyone help me with some rudimentary code that would help me get started on that? I could then modify it for each "pattern" of text documents I have.
e.g.
In this case, there are more than
thirty characters per line. And I
will keep going to illustrate this
example.
This would be a new paragraph, and
would be separated by another of
the single returns.
I want code that would return:
In this case, there are more than thirty character returns. And I will keep going to illustrate this example.
This would be a new paragraph, and would be separated by another of the single returns.
Let me know if anyone can throw something out that I can play with!
You can do this without code (which RegEx requires), simply using Word's own wildcard Find/Replace tools, where:
Find = ([!^13]{30,})[^13]{1,}
Replace = \1^32
and, to clean up the residual multi-paragraph breaks:
Find = [^13]{2,}
Replace = ^p
You could, of course, record the above as a macro...
Here is a RegEx that might work for you:
(\n\n)(?<!\.(\n\n))
The substitution is just a plain space, you can try it out (and modify / tweak it) here: https://regex101.com/r/zG9GPw/4
This 'pattern' tells the RegEx engine to look for the newline character \n which occurs x2 like this \n\n (worth noting this is from your question and might be different in your files, e.g. could be \r\n) and it assumes that a valid line break will be proceeded by a full stop: \..
In RegEx the full stop symbol is a single character wild card so it needs to be escaped with the '\' (n and r are normal characters, escaping them tells the RegEx engine they represent newline and return characters).
So... the expression is looking for a group of x2 newline characters but then uses a negative look-behind to exclude any matches where the previous character was a full stop.
Anyway, it's all explained on the site:
Here is how you could do a RegEx find and replace using NotePad++ (I'm not sure if it comes with RegEx or if a plugin is needed, either way it is easy). But you can set a location, filters (to target specific file types), and other options (such as search in sub-directories).
Other than that, as #MacroPod pointed out you could also do this with MS Word, document by document, not using any code :)

Regular expression to extract a number of steps

I have a localized string that looks something like this in English:
"
5 Mile(s)
5,252 Step(s)
"
My app is localized both in left-to-right and right-to-left languages so I don't want to make assumptions either about the ordering of the step(s) or about the formatting of the number (e.g. 5,252 can be 5.252 depending on user locale). So I need to account for possibilities that can include things like
Step(s) 5.252
as well as what's above.
A few other caveats
All I know is that if the Step(s) line is in there, it will be on its own line (hence in my regex I require \n at each end of the string)
No guarantee that the Mile(s) information will be in the string at all, let alone whether it will be before or after Step(s)
Here's my attempt at pattern extraction:
NSString *patternString = [NSString stringWithFormat:#"\\n(([0-9,\\.]*)\s*%#|%#\s*([0-9,\\.]*))\\n",
NSLocalizedString(#"Step(s)",nil), NSLocalizedString(#"Step(s)",nil)];
There appear to be two problems with this:
XCode is indicating Unknown escape sequence '\s' for the second \s in the pattern string above
No matches are being found even for strings like the following:
0.2 Mile(s)
1,482 Step(s)
Ideally I would extract the 1,482 out of this string in a way that is localization friendly. How should I modify my regex?
as far as the regex, perhaps this approach might work - it simply matches (with named groups) each couplet of numbers in sequence, with the assumption the first is miles and the second is steps. Decimals in the . or , form are optional:
(?<miles>\d+(?:[.,]\d+)?).*?(?<steps>\d+(?:[.,]\d+)?)
(and i think it should be \\s) - i'm not an ios guy, but if you can use a regex literal it would be way more readable.
regular expression demo
First I'd like to ask - Why is Mile(s) mentioned in the question at all?
And now to my two bits - you could simply use a positive look-ahead:
^(?=.*Step\(s\))[^\d]*(\d+(?:[.,]\d+)?)
It makes sure the expected word is present on the line, and then captures the number on it, allowing for localized, optional, decimal separator and decimals. This way it doesn't matter if the numer is before, or after, the "word".
It doesn't take localization of the "word" into account, but that you seem to have handled by yourself ;)
See it here at regex101.
Your regex is close, although in Obj-C you need to double-escape the \s and (s):
^(([0-9,.]*)\\s*%#|%#\\s*([0-9,.]*))$
In your NSLocalizedString you likely also need to escape the parentheses enclosing (s):
NSString *patternString = [NSString stringWithFormat:#"^(([\\d,.]+)\\s%#|%#\\s([\\d,.]+))$",
NSLocalizedString(#"Step\\(s\\)",nil), NSLocalizedString(#"Step\\(s\\)",nil)];
If you don't escape (s) then the regex engine is probably going to interpret it as a capture group.
Looking at NSLog you can see what the pattern actually reads like:
NSLog(#"patternString: %#", patternString);
Output:
patternString: ^(([\d,.]+)\sStep\(s\)|Step\(s\)\s([\d,.]+))$
Since you mentioned the Mile(s) part may not be in the string at all I'm assuming it isn't relevant to the regular expression. As I understand from the question, you just need to capture the number of steps and nothing else. On this basis, here's a modified version of your existing regex:
NSString *patternString =
[NSString stringWithFormat:#"^(?:([0-9,.]*)\\s*%#|%#\\s*([0-9,.]*))$",
NSLocalizedString(#"Step\\(s\\)",nil), NSLocalizedString(#"Step\\(s\\)",nil)];
Demo:
https://www.regex101.com/r/Q6ff1b/1
This is based on the following tips/modifications:
Use the m (= UREGEX_MULTILINE) flag option when creating the regex to specify that ^ and $ match the start and end of each line. This is more sophisticated than using \n as it will also handle the start and end of the string where this might not be present. See here.
Always use a double backslash (\\) for regex escaping - otherwise NSString will interpret the single backslash to be escaping the next character and convert it before it gets to the regex.
Literal parentheses need to be escaped - e.g. Step\\(s\\) instead of Step(s).
Characters within a character class (i.e. anything within the [] square brackets) don't need to be escaped - so it would be . rather than \\. - the latter.
If you are using (x|y|...) as a choice and don't need it to be a capturing group, use ?: after the first parenthesis to ensure it doesn't get captured - i.e. (?:x|y|...).

Approximate search with openldap

I am trying to write a search that queries our directory server running openldap.
The users are going to be searching using the first or last name of the person they're interested in.
I found a problem with accented characters (like áéíóú), because first and last names are written in Spanish, so while the proper way is Pérez it can be written for the sake of the search as Perez, without the accent.
If I use '(cn=*Perez*)' I get only the non-accented results.
If I use '(cn=*Pérez*)' I get only accented results.
If I use '(cn=~Perez)' I get weird results (or at least nothing I can use, because while the results contain both Perez and Pérez ocurrences, I also get some results that apparently have nothing to do with the query...
In Spanish this happens quite a lot... be it lazyness, be it whatever you want to call it, the fact is that for this kind of thing people tend NOT to write the accents because it's assumend all these searches work with both options (I guess since Google allowes it, everybody assumes it's supposed to work that way).
Other than updating the database and removing all accents and trimming them on the query... can you think of another solution?
You have your ~ and = swapped above. It should be (cn~=Perez). I still don't know how well that will work. Soundex has always been strange. Since many attributes are multi-valued including cn you could store a second value on the attribute that has the extended characters converted to their base versions. You would at least have the original value to still go off of when you needed it. You could also get real fancy and prefix the converted value with something and use the valuesReturnFilter to filter it out from your results.
#Sample object
dn:cn=Pérez,ou=x,dc=y
cn:Pérez
cn:{stripped}Perez
sn:Pérez
#etc.
Then modify your query to use an or expression.
(|(cn=Pérez)(cn={stripped}Perez))
And you would include a valuesReturnFilter that looked like
(!(cn={stripped}*))
See RFC3876 http://www.networksorcery.com/enp/rfc/rfc3876.txt for details. The method for adding a request control varies by what platform/library you are using to access the directory.
Search filters ("queries") are specified by RFC2254.
Encoding:
RFC2254
actually requires filters (indirectly defined) to be an
OCTET STRING, i.e. ASCII 8-byte String:
AttributeValue is OCTET STRING,
MatchingRuleId
and AttributeDescription
are LDAPString, LDAPString is an OCTET STRING.
The standard on escaping: Use "<ASCII HEX NUMBER>" to replace special characters
(https://www.rfc-editor.org/rfc/rfc4515#page-4, examples https://www.rfc-editor.org/rfc/rfc4515#page-5).
Quote:
The <valueencoding> rule ensures that the entire filter string is a
valid UTF-8 string and provides that the octets that represent the
ASCII characters "*" (ASCII 0x2a), "(" (ASCII 0x28), ")" (ASCII
0x29), "\" (ASCII 0x5c), and NUL (ASCII 0x00) are
represented as a backslash "\" (ASCII 0x5c) followed by the two hexadecimal digits
representing the value of the encoded octet.
Additionally, you should probably replace all characters that semantically modify the filter (RFC 4515's grammar gives a list), and do a Regex replace of non-ASCII characters with wildcards (*) to be sure. This will also help you with characters like "é".

define a long complex literal string

I have a string that is quite long and complicated, with special characters inside. I want to define this string as a variable, but don't want to escape each of them (because there are so many). I remember that in XML they have a special syntax for that, is there something similar for Objective-C?
Edit: I know I can save the thing in a file and load it easily, but is it possible to do so without a new file? I'm having quite some of them...
No, you have to escape the characters (though you only have to escape ", \, and control characters... is the string mostly control characters and quotes?)
A better idea might be to put the string in a file. Load it using +[NSString stringWithContentsOfFile:encoding:error:].

How can I write special character in VB code

I have a Sql statament using special character (ex: ('), (/), (&)) and I don't know how to write them in my VB.NET code. Please help me. Thanks.
Find out the Unicode code point for the character (from http://www.unicode.org) and then use ChrW to convert from the code point to the character. (To put this in another string, use concatenation. I'm somewhat surprised that VB doesn't have an escape sequence, but there we go.)
For example, for the Euro sign (U+20AC) you'd write:
Dim euro as Char = ChrW(&H20AC)
The advantage of this over putting the character directly into source code is that your source code stays "just pure ASCII" - which means you won't have any strange issues with any other program trying to read it, diff it, etc. The disadvantage is that it's harder to see the symbol in the code, of course.
The most common way seems to be to append a character of the form Chr(34)... 34 represents a double quote character. The character codes can be found from the windows program "charmap"... just windows/Run... and type charmap
If you are passing strings to be processed as SQL statement try doubling the characters for example.
"SELECT * FROM MyRecords WHERE MyRecords.MyKeyField = ""With a "" Quote"" "
The '' double works with the other special characters as well.
The ' character can be doubled up to allow it into a string e.g
lSQLSTatement = "Select * from temp where name = 'fred''s'"
Will search for all records where name = fred's
Three points:
1) The example characters you've given are not special characters. They're directly available on your keyboard. Just press the corresponding key.
2) To type characters that don't have a corresponding key on the keyboard, use this:
Alt + (the ASCII code number of the special character)
For example, to type ¿, press Alt and key in 168, which is the ASCII code for that special character.
You can use this method to type a special character in practically any program not just a VB.Net text editor.
3) What you probably looking for is what is called 'escaping' characters in a string. In your SQL query string, just place a \ before each of those characters. That should do.
Chr() is probably the most popular.
ChrW() can be used if you want to generate unicode characters
The ControlChars class contains some special and 'invisible' characters, plus the quote - for example, ControlChars.Quote