I know there are a lot of resources with regex for it. But I could not find the one I want.
My problem is:
I want to remove one line comments (//) from obj-c sources, but I don't want to break the code in it. For instance, with this regex: #"//.*" I can remove all comments, but it also corrupts string literal:
#"bsdv//sdfsdf"
I played with non-capturing parentheses (?:(\"*\")*+), but without success.
Also I found this expression for Python:
r'(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)'
It should cover my case, but I've not figure out how to make it work with obj-c.
Please, help me to build proper regex.
UPDATE: Yeah, that's a tough one, I know there're a lot of caveats, other than the one I described. I would appreciate if someone post regex that only fix my issue. Anyway, I gonna post my solution, without regex soon, I hope it will be helpful for anyone who struggling with such problem too.
Try this regex:
(?:^|.*;(?!.*")|#(?:define|endif|ifn?def|import|undef|...).*)\s*(//[^\r\n]+$)
Demo
http://regex101.com/r/jT4xC8
Description
Discussion
Besides all the warnings expressed in the comments, I assume that a single line can appear in two distinct cases:
Case 1: Alone on its line preceded or not by blank chars
Case 2: Not Alone on its line preceded or not by blank chars, and other chars.
In the first case, we match the beginning of the line (^ with /m flag). Then we search zero or more blank chars (\s*) and finally the single line comment: //[$\r\n]+$.
In the second case, if there are other chars on the line, they form statements. Any statement is ended by a semicolon ;. So we search the last statement and its corresponding semicolon .*;(?!.*"). Then we search the single line comment. Those other chars can be also preprocessor statements. In this case, they are introduced by a sharp #.
One important keypoint is that I assume the code passed to the regex is a code that compiles.
There is more
Don't forget also to add some other pre-processor directives that may apply in your case. Check this SO answer: https://stackoverflow.com/a/18014883/363573
Related
The IBM i implementation of regex uses apostrophes (instead of e.g. slashes) to delimit a regex string, i.e.:
... where REGEXP_SUBSTR(MYFIELD,'myregex_expression')
If I try to use an apostrophe inside a [group] within the expression, it always errors - presumably thinking I am giving a closing quote. I have tried:
- escaping it: \'
- doubling it: '' (and tripling)
No joy. I cannot find anything relevant in the IBM SQL manual or by google search.
I really need this to, for instance, allow names like O'Leary.
Thanks to Wiktor Stribizew for the answer in his comment.
There are a couple of "gotchas" for anyone who might land on this question with the same problem. The first is that you have to give the (presumably Unicode) hex value rather than the EBCDIC value that you would use, e.g. in ordinary interactive SQL on the IBM i. So in this case it really is \x27 and not \x7D for an apostrophe. Presumably this is because the REGEXP_ ... functions are working through Unicode even for EBCDIC data.
The second thing is that it would seem that the hex value cannot be the last one in the set. So this works:
^[A-Z0-9_\+\x27-]+ ... etc.
But this doesn't
^[A-Z0-9_\+-\x27]+ ... etc.
I don't know how to highlight text within a code sample, so I draw your attention to the fact that the hyphen is last in the first sample and second-to-last in the second sample.
If anyone knows why it has to not be last, I'd be interested to know. [edit: see Wiktor's answer for the reason]
btw, using double quotes as the string delimiter with an apostrophe in the set didn't work in this context.
A single quote can be defined with the \x27 notation:
^[A-Z0-9_+\x27-]+
^^^^
Note that when you use a hyphen in the character class/bracket expression, when used in between some chars it forms a range between those symbols. When you used ^[A-Z0-9_\+-\x27]+ you defined a range between + and ', which is an invalid range as the + comes after ' in the Unicode table.
All.
I am used to programming VBA in Excel, but am new to the structures in Word.
I am working through a library of text files to update them. Many of them are either OCR documents, or were manually entered.
Each has a recurring pattern, the most common of which is unnecessary carriage returns.
For example, I am looking at several text files where there is a double return after each line. A search and replace of all double carriage returns removes all paragraph distinctions.
However, each line is approximately 30 characters long, and if I manually perform the following logic, it gives me a functional document.
If there is a double carriage return after 30+ characters, I replace them with a space.
If there were less than 30 characters prior to the double return, I replace them with a single return.
Can anyone help me with some rudimentary code that would help me get started on that? I could then modify it for each "pattern" of text documents I have.
e.g.
In this case, there are more than
thirty characters per line. And I
will keep going to illustrate this
example.
This would be a new paragraph, and
would be separated by another of
the single returns.
I want code that would return:
In this case, there are more than thirty character returns. And I will keep going to illustrate this example.
This would be a new paragraph, and would be separated by another of the single returns.
Let me know if anyone can throw something out that I can play with!
You can do this without code (which RegEx requires), simply using Word's own wildcard Find/Replace tools, where:
Find = ([!^13]{30,})[^13]{1,}
Replace = \1^32
and, to clean up the residual multi-paragraph breaks:
Find = [^13]{2,}
Replace = ^p
You could, of course, record the above as a macro...
Here is a RegEx that might work for you:
(\n\n)(?<!\.(\n\n))
The substitution is just a plain space, you can try it out (and modify / tweak it) here: https://regex101.com/r/zG9GPw/4
This 'pattern' tells the RegEx engine to look for the newline character \n which occurs x2 like this \n\n (worth noting this is from your question and might be different in your files, e.g. could be \r\n) and it assumes that a valid line break will be proceeded by a full stop: \..
In RegEx the full stop symbol is a single character wild card so it needs to be escaped with the '\' (n and r are normal characters, escaping them tells the RegEx engine they represent newline and return characters).
So... the expression is looking for a group of x2 newline characters but then uses a negative look-behind to exclude any matches where the previous character was a full stop.
Anyway, it's all explained on the site:
Here is how you could do a RegEx find and replace using NotePad++ (I'm not sure if it comes with RegEx or if a plugin is needed, either way it is easy). But you can set a location, filters (to target specific file types), and other options (such as search in sub-directories).
Other than that, as #MacroPod pointed out you could also do this with MS Word, document by document, not using any code :)
This is a bit of a puzzler for me. I have a string that looks like:
fanspd<fanspd>3</fanspd>
doorinprocess<doorinprocess>0</doorinprocess>
timeremaining<timeremaining>0</timeremaining>
macaddr<macaddr>60:CB:FB:99:99:C1</macaddr>
ipaddr<ipaddr>10.0.0.6</ipaddr>
model<model>4.4eWHF</model>
softver: <softver>2.14.2</softver>
interlock1: <interlock1>0</interlock1>
interlock2: <interlock2>0</interlock2>
cfm: <cfm>2200</cfm>
power: <power>120</power>
inside: <house_temp>-99</house_temp>
<DNS1>10.0.0.1</DNS1>
attic: <attic_temp>76</attic_temp>
OA: <oa_temp>-99</oa_temp>
server response: <server_response>Ó£àêEE²ç©þ]kõ «jsÐ</server_response>
DIP Switches: <DIPS>11100</DIPS>
Remote Switch: <switch2>1111</switch2>
Setpoint:<Setpoint>0</Setpoint>
The string includes the "/n" so I have split it into corrisponding lines that look like
fanspd<fanspd>0</fanspd>
All I really want is the char(s) in the middle of the line. In the above example it would be 0.
I can match everything with regular expressions but by doing the following:
(.*)(<[a-z]+>)(.*)(</[a-z]+>)
But what I'd like is something more that would exclude or strip away or remove all the junk and grab the middle chars.
(!(.*)(!<[a-z]+>))(.*)(!(</[a-z]+>))
I've tried this and it does not work. I've also thought of doing another [NSstring componentsSeparatedByString:#"(with either < or or >"] but that would leave be with more parsing yet to do and I think there should be a way to get just the chars inbetween the tags with either regular expressions or string compare or some such way to parse out the
Any suggestions or help would be greatly appreciated.
Thanks
Two things.
Your regular expression does not escape the forward slash.
Your regular expression seems overly complicated for what you are trying to do.
If all you want is that lone middle character with regular expressions,
Try this:
<[a-z]+>(.*)<\/[a-z]+>
Here's a great tool to play around with:
http://rubular.com
Heck you could probably even get away with:
<[a-z]+>(.*)<\/
EDIT:
I figured out your problem partially, some of the tags part way down contain characters other than a through z. So here you go:
<.+>(.*)<\/.+>
I'm using IDEA 13.0.1. A unit test is failing because of some line separator stuff. But when I try to compare the two results, IDEA says "Contents have differences only in line separators".
And I can't find a setting where I can say show me these differences. Is there one?
I ran into the same issue, I couldn't find a way to show the difference by clicking show differences, but if you put a break point and look at the strings, they have the line separator written out. For me one string had \n and one had \r\n. \r is also a possible line separator.
I ran into the same problem recently. I found a workaround for it, is not that pretty but did the job:
yourString.replaceAll("\n", "").replaceAll("\r", "");
The key is in what #user1633977 said.
To solve this problem, always use System.lineSeparator() to create your separators. That way they will always be the same and will be OS-independant.
(Remember that different OS can use different symbols as line separators).
Just a question of interest: Does anyone know why there's no block comment capability in VB .NET? (Unless there really is - but I've never yet come across it.)
It is a side-effect of the Visual Basic syntax, a new-line terminates a statement. That makes a multi-line comment pretty incompatible with the basic way the compiler parses the language. Not an issue in the curly brace languages, new-lines are just white space.
It has never been a real problem, Visual Basic has had strong IDE support for a very long time. Commenting out multiple lines is an IDE feature, Edit + Advanced + Comment Selection.
Totally abusing compiler directives here... but:
#If False Then
Comments
go
here
#End If
You don't get the benefits of proper code coloration (it doesn't show in green when using the default color scheme) and the implicit line-continuation system automatically indents lines in a paragraph starting at the second line. But the compiler will ignore the text.
As can be read in “Comments in Code“ there isn't any other way:
If your comment requires more than one line, use the comment symbol on each line, as the following example illustrates.
' This comment is too long to fit on a single line, so we break
' it into two lines. Some comments might need three or more lines.
Similarly, the help on the REM statement states:
Note:
You cannot continue a REM statement by using a line-continuation sequence (_). Once a comment begins, the compiler does not examine the characters for special meaning. For a multiple-line comment, use another REM statement or a comment symbol (') on each line.
Depending on how many lines are to be ignored, one can use compiler directives instead. It may not be technically equivalent to comments (you don't get the syntax coloring of comments, for example), but it gets the job done without commenting many lines individually. So you just add 3 more lines of code.
#Const COMMENT = "C"
'basically a false statement
#If COMMENT = "Y" Then
'code to be commented goes between #If and #End If
MsgBox('Commenting failed!')
#End If
This is assuming the purpose is for ignoring blocks of code instead of adding documentation (what "comments" are actually used for, but I also wouldn't mind using compiler directives for that).
The effort required however, makes this method inconvenient when there are just around 10 lines to comment.
Reference: http://msdn.microsoft.com/en-us/library/tx6yas69.aspx