Smart search and replace [closed] - awk

I had a few thousand lines of code containing pieces like this
opencanmanager.GetObjectDict()->ReadDataFrom(0x1234, 1).toInt()
that I needed to convert to some other library that uses syntax like this
ReadFromOD<int>(0x1234, 1)
Basically I need to search for
[whatever1]opencanmanager.GetObjectDict()->ReadDataFrom([whatever2]).toInt()[whatever3]
across all the lines of a text file and to replace every occurrence of it with
[whatever1]ReadFromOD<int>([whatever2])[whatever3]
and then do the same for a few other data types.
Doing that manually would have taken a few days of terribly dull work, and none of the editors I know of offer a search-and-replace smart enough for this kind of refactoring.
Now I have solved the problem using GNU AWK with the script below
#!/usr/bin/awk -f
BEGIN {
spl1 = "opencanmanager.GetObjectDict()->ReadDataFrom("
spl2 = ").to"
spl2_1 = ").toString()"
spl2_2 = ").toUInt()"
spl2_3 = ").toInt()"
min_spl2_len = length(spl2_3)
repl_start = "ReadFromOD<"
repl_mid1 = "QString"
repl_mid2 = "uint"
repl_mid3 = "int"
repl_end = ">("
repl_after = ")"
}
function replacer(str)
{
pos1 = index(str, spl1)
pos2 = index(str, spl2)
if (!pos1 || !pos2) {
return str
}
# awk strings are 1-indexed, so substr() starts at position 1
strbegin = substr(str, 1, pos1-1)
mid_start_pos = pos1+length(spl1)
strkey = substr(str, pos2, min_spl2_len)
key1 = substr(spl2_1, 1, min_spl2_len)
key2 = substr(spl2_2, 1, min_spl2_len)
key3 = substr(spl2_3, 1, min_spl2_len)
strmid = substr(str, mid_start_pos, pos2-mid_start_pos)
if (strkey == key1) {
repl_mid = repl_mid1; spl2_fact = spl2_1;
} else if (strkey == key2) {
repl_mid = repl_mid2; spl2_fact = spl2_2;
} else if (strkey == key3) {
repl_mid = repl_mid3; spl2_fact = spl2_3;
} else {
print "ERROR!!! Found", spl1, "but not any of", spl2_1, spl2_2, spl2_3 "!" > "/dev/stderr"
exit 1
}
str_remainder = substr(str, pos2+length(spl2_fact))
return strbegin repl_start repl_mid repl_end strmid repl_after str_remainder
}
{
resultstr = $0
do {
resultstr = replacer(resultstr)
# replacer() only changes the line when both markers are present, so loop only in that case
more_spl = index(resultstr, spl1) && index(resultstr, spl2)
} while (more_spl)
print(resultstr)
}
and everything works fine, but the thing still bugs me somewhat. My solution feels a bit too complicated for a job that must be very common and must have an easy standard solution that I just don't know about for some reason.
I am prepared to just let it go, but if you know a more elegant and quick one-liner solution, or some specific tool for this kind of smart code modification, I would definitely like to know.

If sed is an option, you can try this solution which should match both output examples from input such as this.
$ cat input_file
opencanmanager.GetObjectDict()->ReadDataFrom(0x1234, 1).toInt()
power1 = opencanmanager.GetObjectDict()->ReadDataFrom(0x1234, 1).toInt() * opencanmanager.GetObjectDict()->ReadDataFrom(0x5678, 1).toUInt() * FACTOR1;
power2 = opencanmanager.GetObjectDict()->ReadDataFrom(0x5678, 1).toUInt() / 2;
$ sed -E 's/ReadDataFrom/ReadFromOD<int>/g;s/int/uint/2;s/(.*= )?[^>]*>([^\.]*)[^\*|/]*?(\*|\/.{2,})?[^\.]*?[^>]*?>?([^\.]*)?[^\*]*?(.*)?/\1\2 \3 \4 \5/' input_file
ReadFromOD<int>(0x1234, 1)
power1 = ReadFromOD<int>(0x1234, 1) * ReadFromOD<uint>(0x5678, 1) * FACTOR1;
power2 = ReadFromOD<int>(0x5678, 1) / 2;
Explanation
s/ReadDataFrom/ReadFromOD<int>/g - The first part of the command does a simple global substitution, replacing all occurrences of ReadDataFrom with ReadFromOD<int>
s/int/uint/2 - The second part substitutes only the second occurrence of int with uint, if there is one
s/(.*= )?[^>]*>([^\.]*)[^\*|/]*?(\*|\/.{2,})?[^\.]*?[^>]*?>?([^\.]*)?[^\*]*?(.*)?/\1\2 \3 \4 \5/ - The third part utilizes sed grouping and back referencing.
(.*= )? - Group one returned with back reference \1 captures everything up to an = character, ? makes it conditional meaning it does not have to exist for the remaining grouping to match.
[^>]*> - This is an excluded match as it is not within parentheses (). It matches everything continuing from the space after the = character up to the >, a literal > is then included to exclude that also. This is not conditional and must match.
([^\.]*) - Continuing from the excluded match, this will continue to match everything up to the first . and can be returned with back reference \2. This is not conditional and must match.
[^\*|/]*? - This is an excluded match and will match everything up to the first literal *, | or / (inside the brackets, | is an ordinary character, not an alternation). It is conditional ? so does not have to match.
(\*|\/.{2,})? - Continuing from the excluded match, this group matches either a literal * or a / followed by at least two characters {2,}. It can be returned with back reference \3 and is conditional ?
[^\.]*?[^>]*?>? - Conditional excluded matches. Match everything up to a literal ., then everything up to > and include >
([^\.]*)? - Conditional group matching up to a full stop .. It can be returned with back reference \4.
[^\*]*? - Excluded. Continue matching up to *
(.*)? - Everything else after the final * is grouped and returned with back reference \5, if it exists ?
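For comparison, the same rewrite can be written as a single capture-and-replace substitution that looks at the conversion suffix to pick the template type. The following is only a minimal Python sketch and is not taken from either post above; it assumes the ReadDataFrom arguments never contain nested parentheses, and the names TYPE_MAP, PATTERN and convert are made up for illustration.

```python
import re

# Maps the conversion suffix of the old API to the template argument of the new one.
TYPE_MAP = {"toInt": "int", "toUInt": "uint", "toString": "QString"}

# Group 1: the argument list (assumed to contain no parentheses).
# Group 2: the conversion method that decides the template type.
PATTERN = re.compile(
    r"opencanmanager\.GetObjectDict\(\)->ReadDataFrom\(([^()]*)\)"
    r"\.(toInt|toUInt|toString)\(\)"
)


def convert(line):
    """Rewrite every old-style read in `line` to the ReadFromOD<T>(...) form."""
    return PATTERN.sub(
        lambda m: "ReadFromOD<%s>(%s)" % (TYPE_MAP[m.group(2)], m.group(1)),
        line,
    )


if __name__ == "__main__":
    print(convert(
        "power2 = opencanmanager.GetObjectDict()->ReadDataFrom(0x5678, 1).toUInt() / 2;"
    ))
```

Because the type is taken from the matched suffix, several reads on one line with different conversions are each mapped independently. Lines that do not match are returned unchanged.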


Split by delimiter which is contained in a record

I have a column which I am splitting in Snowflake.
The format is as follows:
I have been using split_to_table(A, ',') inside of my query, but as you can probably tell, this incorrectly also splits the Scooter > Sprinting, Jogging and Walking record.
Perhaps the delimiter could apply only if there is no space on either side of it? I cannot see a different condition that could work.
I have been researching online but haven't found a suitable work around yet, is there anyone that encountered a similar problem in the past?
Thanks
This is a custom rule for the split to table, so we can use a UDTF to apply a custom rule:
create or replace function split_to_table2(STR string, DELIM string, ROW_MUST_CONTAIN string)
returns table (VALUE string)
language javascript
strict immutable
as
$$
{
initialize: function (argumentInfo, context) {
},
processRow: function (row, rowWriter, context) {
var buffer = "";
var i;
const s = row.STR.split(row.DELIM);
for(i=0; i<s.length-1; i++) {
buffer += s[i];
if(s[i+1].includes(row.ROW_MUST_CONTAIN)) {
rowWriter.writeRow({VALUE: buffer});
buffer = "";
} else {
buffer += row.DELIM
}
}
// include anything still buffered, in case the last piece lacks ROW_MUST_CONTAIN
rowWriter.writeRow({VALUE: buffer + s[i]})
},
}
$$;
select VALUE from
table(split_to_table2('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ',', '>'))
;
Output:
VALUE
Car > Bike
Bike > Scooter
Scooter > Sprinting, Jogging and Walking
Walking > Flying
This UDTF takes one more parameter than the built-in table function split_to_table. The third parameter, ROW_MUST_CONTAIN, is the string a row must contain. The function splits the string on DELIM, but if a piece does not contain the ROW_MUST_CONTAIN string, it is concatenated onto the preceding piece to form a complete row. In this case we just specify , for the delimiter and > for ROW_MUST_CONTAIN.
We can get a little clever with regexp_replace by replacing the actual delimiters with something else before the table split. I am using double pipes '||' but you can change that to something else. The '\|\|\\1' replacement uses back-referencing, which lets us include the captured group (\\1) right after the new delimiter (\|\|).
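The merging rule itself is small enough to check outside Snowflake. Here is a rough Python sketch of the same idea (split on the delimiter, then glue any piece that lacks the required string back onto the previous row); split_rows is a hypothetical helper name, not a Snowflake function.

```python
def split_rows(text, delim, must_contain):
    """Split `text` on `delim`, merging pieces that lack `must_contain`
    back onto the preceding row."""
    rows = []
    for piece in text.split(delim):
        if rows and must_contain not in piece:
            # Not a row of its own: the delimiter was part of the previous record.
            rows[-1] += delim + piece
        else:
            rows.append(piece)
    return rows


if __name__ == "__main__":
    sample = ("Car > Bike,Bike > Scooter,"
              "Scooter > Sprinting, Jogging and Walking,Walking > Flying")
    for row in split_rows(sample, ",", ">"):
        print(row)
```

Appending to the last row, rather than buffering ahead, also covers the case where the final piece of the input is a continuation.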
set str='car>bike,bike>car,truck, and jeep,horse>cat,truck>car,truck, and jeep';
select $str, *
from table(split_to_table(regexp_replace($str,',([^>,]+>)','\|\|\\1'),'||'))
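To see what the substitution does to the sample string, here is a small Python sketch of the same two steps. It assumes Python's re module handles this simple pattern the same way as Snowflake's regexp_replace; split_on_real_delimiters is an illustrative name only.

```python
import re


def split_on_real_delimiters(text):
    """Mark commas that start a new 'x>y' record with '||', then split on '||'."""
    # A comma is a real delimiter when the text after it reaches a '>'
    # before hitting another comma.
    marked = re.sub(r",([^>,]+>)", r"||\1", text)
    return marked.split("||")


if __name__ == "__main__":
    s = "car>bike,bike>car,truck, and jeep,horse>cat,truck>car,truck, and jeep"
    for row in split_on_real_delimiters(s):
        print(row)
```

As with the SQL version, this breaks if the data itself can contain the temporary '||' marker.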
Yes, you are right. The only pattern, which I can see, is the one with the whitespace after the comma.
It's a small workaround, but we can make use of this pattern. In the code below I am replacing the commas that are followed by whitespace. Then I apply the split_to_table function and convert the replacement back.
It's not super pretty and would break if your string contains "my_replacement" or any other new pattern, but it's working for me:
select replace(t.value, 'my_replacement', ', ')
from table(
split_to_table(replace('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ', ', 'my_replacement'),',')) t

Is there a way to get back source code from antlr4ts parse tree after modifications ctx.removeLastChild/ctx.addChild? [duplicate]

I want to keep white space when I call text attribute of token, is there any way to do it?
Here is the situation:
We have the following code
IF L > 40 THEN;
ELSE
IF A = 20 THEN
PUT "HELLO";
In this case, I want to transform it into:
if (!(L>40)){
if (A=20)
put "hello";
}
The rule in Antlr is that:
stmt_if_block: IF expr
THEN x=stmt
(ELSE y=stmt)?
{
if ($x.text.equalsIgnoreCase(";"))
{
WriteLn("if(!(" + $expr.text +")){");
WriteLn($y.text);
WriteLn("}");
}
}
But the result looks like:
if(!(L>40))
{
ifA=20put"hello";
}
The reason is that the white space in $stmt was removed. I was wondering if there is any way to keep this white space.
Thank you so much
Update: If I add
SPACE: [ ] -> channel(HIDDEN);
The space will be preserved, and the result would look like below, many spaces between tokens:
IF SUBSTR(WNAME3,M-1,1) = ')' THEN M = L; ELSE M = L - 1;
This is the C# extension method I use for exactly this purpose:
public static string GetFullText(this ParserRuleContext context)
{
if (context.Start == null || context.Stop == null || context.Start.StartIndex < 0 || context.Stop.StopIndex < 0)
return context.GetText(); // Fallback
return context.Start.InputStream.GetText(Interval.Of(context.Start.StartIndex, context.Stop.StopIndex));
}
Since you're using java, you'll have to translate it, but it should be straightforward - the API is the same.
Explanation: Get the first token, get the last token, and get the text from the input stream between the first char of the first token and the last char of the last token.
@Lucas's solution, but in Java, in case you have trouble translating:
private String getFullText(ParserRuleContext context) {
if (context.start == null || context.stop == null || context.start.getStartIndex() < 0 || context.stop.getStopIndex() < 0)
return context.getText();
return context.start.getInputStream().getText(Interval.of(context.start.getStartIndex(), context.stop.getStopIndex()));
}
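The reason this keeps the white space can be shown without ANTLR at all. In the toy Python sketch below, Token is a hypothetical stand-in that only carries inclusive start and stop character offsets (it is not the ANTLR class); concatenating token texts loses everything between tokens, while slicing the original input from the first token's start to the last token's stop keeps it.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a parser token: inclusive offsets into the source."""
    start: int
    stop: int


def tokenize(source):
    # Treat every run of non-space characters as one token (toy lexer).
    return [Token(m.start(), m.end() - 1) for m in re.finditer(r"\S+", source)]


def joined_text(source, tokens):
    # Concatenating token texts drops whatever sat between the tokens.
    return "".join(source[t.start:t.stop + 1] for t in tokens)


def full_text(source, first, last):
    # Slice the original input between the first and last token instead.
    return source[first.start:last.stop + 1]


if __name__ == "__main__":
    src = 'IF A = 20 THEN\n  PUT "HELLO";'
    toks = tokenize(src)
    print(joined_text(src, toks))
    print(full_text(src, toks[0], toks[-1]))
```

This also hints at the limitation: the slice always comes from the original input, so it cannot reflect nodes that were added to or removed from the tree afterwards.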
Looks like InputStream is not always updated after removeLastChild/addChild operations. This solution helped me for one grammar, but it doesn't work for another.
Works for this grammar.
Doesn't work for modern groovy grammar (for some reason inputStream.getText contains old text).
I am trying to implement function name replacement like this:
enterPostfixExpression(ctx: PostfixExpressionContext) {
// Get identifierContext from ctx
...
const token = CommonTokenFactory.DEFAULT.createSimple(GroovyParser.Identifier, 'someNewFnName');
const node = new TerminalNode(token);
identifierContext.removeLastChild();
identifierContext.addChild(node);
UPD: I used visitor pattern for the first implementation

Lucene Highlighter class: highlight different words in different colors

Probably most people reading the title who know a bit about Lucene won't need much further explanation. NB I use Jython but I think most Java users will understand the Java equivalent...
It's a classic thing to want to do: you have more than one term in your search string... in Lucene terms this returns a BooleanQuery. Then you use something like this code to highlight (NB I am a Lucene newbie, this is all closely tweaked from Net examples):
yellow_highlight = SimpleHTMLFormatter( '<b style="background-color:yellow">', '</b>' )
green_highlight = SimpleHTMLFormatter( '<b style="background-color:green">', '</b>' )
...
stream = FrenchAnalyzer( Version.LUCENE_46 ).tokenStream( "both", StringReader( both ) )
scorer = QueryScorer( fr_query, "both" )
fragmenter = SimpleSpanFragmenter(scorer)
highlighter = Highlighter( yellow_highlight, scorer )
highlighter.setTextFragmenter(fragmenter)
best_fragments = highlighter.getBestTextFragments( stream, both, True, 5 )
if best_fragments:
for best_frag in best_fragments:
print "=== best frag: %s, type %s" % ( best_frag, type( best_frag ))
html_text += "&bull; %s<br>\n" % unicode( best_frag )
... and then the html_text is put in a JTextPane for example.
But how would you make the first word in your query highlight with a yellow background and the second word highlight with a green background? I have tried to understand the various classes in org.apache.lucene.search... to no avail. So my only way of learning was googling. I couldn't find any clues...
I asked this question four years ago... At the time I did manage to implement a solution using javax.swing.text.html.HTMLDocument. There's also the interface org.w3c.dom.html.HTMLDocument in the standard Java library. This way is hard work.
But for anyone interested there's a far simpler solution. Taking advantage of the fact that Lucene's SimpleHTMLFormatter returns about the simplest imaginable "marked up" piece of text: chosen words are highlighted with the HTML B tag. That's it. It's not even a "proper" HTML fragment, just a String with <B>s and </B>s in it.
A multi-word query generates a BooleanQuery... from which you can extract multiple TermQuerys by going booleanQuery.clauses() ... getQuery()
I'm working in Groovy. The colouring I want to apply is console codes, as per BASH (or Cygwin). Other types of colouring can be worked out on this model.
So you set up a map before to hold your "markup details":
def markupDetails = [:]
Then for each TermQuery, you call this, with the same text param each time, stipulating a different colour param for each term. NB I'm using Lucene 6.
def createHighlightAndAnalyseMarkup( TermQuery tq, String text, String colour ) {
def termQueryScorer = new QueryScorer( tq )
def termQueryHighlighter = new Highlighter( formatter, termQueryScorer )
TokenStream stream = TokenSources.getTokenStream( fieldName, null, text, analyser, -1 )
String[] frags = termQueryHighlighter.getBestFragments( stream, text, 999999 )
// not sure under what circs you get > 1 fragment...
assert frags.size() <= 1
// NB you don't always get all terms in all returned LDocuments...
if( frags.size() ) {
String highlightedFrag = frags[ 0 ]
Matcher boldTagMatcher = highlightedFrag =~ /<\/?B>/
def pos = 0
def previousEnd = 0
while( boldTagMatcher.find()) {
pos += boldTagMatcher.start() - previousEnd
previousEnd = boldTagMatcher.end()
markupDetails[ pos ] = boldTagMatcher.group() == '<B>'? colour : ConsoleColors.RESET
}
}
}
As I said, I wanted to colourise console output. The colour parameter in the method here is per the console colour codes as found here, for example. E.g. yellow is \033[033m. ConsoleColors.RESET is \033[0m and marks the place where each coloured bit of text stops.
... after you've finished doing this with all TermQuerys you will have a nice map telling you where individual colours begin and end. You work backwards from the end of the text so as to insert the "markup" at the right position in the String. NB here text is your original unmarked-up String:
markupDetails.sort().reverseEach{ pos, markup ->
String firstPart = text.substring( 0, pos )
String secondPart = text.substring( pos )
text = firstPart + markup + secondPart
}
... at the end of which text contains your marked-up String: print to console. Lovely.
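The position bookkeeping is independent of Lucene, so here is a rough Python sketch of just that part: it takes strings that already contain <B>...</B> markers (as a highlighter might return them), records where each tag falls in the unmarked text, and inserts the colour codes from the end backwards. The function names and the sample strings are made up for illustration.

```python
import re

RESET = "\033[0m"
YELLOW = "\033[33m"
GREEN = "\033[32m"


def collect_markup(highlighted, colour, markup):
    """Record, per tag, its offset in the text with all <B>/</B> tags removed."""
    pos = 0
    previous_end = 0
    for m in re.finditer(r"</?B>", highlighted):
        pos += m.start() - previous_end  # plain characters since the last tag
        previous_end = m.end()
        markup[pos] = colour if m.group() == "<B>" else RESET


def apply_markup(text, markup):
    """Insert markup from the highest offset down so earlier offsets stay valid."""
    for pos in sorted(markup, reverse=True):
        text = text[:pos] + markup[pos] + text[pos:]
    return text


if __name__ == "__main__":
    original = "the quick brown fox"
    details = {}
    collect_markup("the <B>quick</B> brown fox", YELLOW, details)
    collect_markup("the quick brown <B>fox</B>", GREEN, details)
    print(apply_markup(original, details))
```

One caveat of keeping a single entry per offset: if one term ends exactly where another begins, the later entry overwrites the earlier one.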

Is there an algorithm for encoding two text and the result will be the same even if change their position?

Maybe the question is hard to understand; I mean this:
Given two sample text
Text1 = "abc" and Text2 = "def"
Which algorithm can do like
encoding(Text1, Text2) == encoding(Text2, Text1)
And I wish the result of the function to be unique (not colliding with encoding(Text3, Text1)), like in other checksum algorithms.
Actually, the root of this is that I want to search my database for rows answering "Who is a friend of B" or "B is a friend of whom" by searching only one column, like
SELECT * FROM relationship WHERE hash = "a039813"
not
SELECT *
FROM relationship
WHERE (personColumn1 = "B" and verb = "friend") OR
(personColumn2 = "B" and verb = "friend")
You can adapt any encoding to ensure encoding(Text1, Text2) == encoding(Text2, Text1) by simply enforcing a particular ordering of the arguments. Since you're dealing with text, maybe use a basic lexical order:
encoding_adapter(t1, t2)
{
if (t1 < t2)
return encoding(t1, t2)
else
return encoding(t2, t1)
}
If you use a simple single-input hash function you're probably tempted to write:
encoding(t1, t2)
{
return hash(t1 + t2)
}
But this can cause collisions: encoding("AA", "B") == encoding("A", "AB"). There are a couple easy solutions:
if you have a character or string that never appears in your input strings then use it as a delimiter:
return hash(t1 + delimiter + t2)
hash the hashes:
return hash(hash(t1) + hash(t2))
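Putting the two ideas together (fixed argument order plus an unambiguous way of joining the inputs), a minimal Python sketch could look like the following. It swaps the delimiter for a length prefix, so no reserved character is needed; SHA-256 and the name symmetric_key are choices made for this example, not something the question requires. As with any hash, the result is collision-resistant rather than strictly unique.

```python
import hashlib


def symmetric_key(t1, t2):
    """Return the same hex digest for (t1, t2) and (t2, t1)."""
    # Enforce an ordering so the argument order no longer matters.
    first, second = sorted((t1, t2))
    h = hashlib.sha256()
    for part in (first, second):
        data = part.encode("utf-8")
        # Length prefix: ("AA", "B") and ("A", "AB") feed different bytes.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


if __name__ == "__main__":
    print(symmetric_key("abc", "def") == symmetric_key("def", "abc"))
    print(symmetric_key("AA", "B") == symmetric_key("A", "AB"))
```

The digest could then be stored in the single hash column and compared with a plain equality test.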

Regular Expression to Match All Comments in a T-SQL Script

I need a Regular Expression to capture ALL comments in a block of T-SQL. The Expression will need to work with the .Net Regex Class.
Let's say I have the following T-SQL:
-- This is Comment 1
SELECT Foo FROM Bar
GO
-- This is
-- Comment 2
UPDATE Bar SET Foo == 'Foo'
GO
/* This is Comment 3 */
DELETE FROM Bar WHERE Foo = 'Foo'
/* This is a
multi-line comment */
DROP TABLE Bar
I need to capture all of the comments, including the multi-line ones, so that I can strip them out.
EDIT: It would serve the same purpose to have an expression that takes everything BUT the comments.
This should work:
(--.*)|(((/\*)+?[\w\W]+?(\*/)+))
In PHP, I'm using this code to strip comments from SQL (this is the commented version, using the x modifier):
trim( preg_replace( '#
(([\'"]).*?[^\\\]\2) # $1 : Skip single & double quoted expressions
|( # $3 : Match comments
(?:\#|--).*?$ # - Single line comment
| # - Multi line (nested) comments
/\* # . comment open marker
(?: [^/*] # . non comment-marker characters
|/(?!\*) # . not a comment open
|\*(?!/) # . not a comment close
|(?R) # . recursive case
)* # . repeat eventually
\*\/ # . comment close marker
)\s* # Trim after comments
|(?<=;)\s+ # Trim after semi-colon
#msx', '$1', $sql ) );
Short version:
trim( preg_replace( '#(([\'"]).*?[^\\\]\2)|((?:\#|--).*?$|/\*(?:[^/*]|/(?!\*)|\*(?!/)|(?R))*\*\/)\s*|(?<=;)\s+#ms', '$1', $sql ) );
Using this code :
StringCollection resultList = new StringCollection();
try {
Regex regexObj = new Regex(@"/\*(?>(?:(?!\*/|/\*).)*)(?>(?:/\*(?>(?:(?!\*/|/\*).)*)\*/(?>(?:(?!\*/|/\*).)*))*).*?\*/|--.*?\r?[\n]", RegexOptions.Singleline);
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
resultList.Add(matchResult.Value);
matchResult = matchResult.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
With the following input :
-- This is Comment 1
SELECT Foo FROM Bar
GO
-- This is
-- Comment 2
UPDATE Bar SET Foo == 'Foo'
GO
/* This is Comment 3 */
DELETE FROM Bar WHERE Foo = 'Foo'
/* This is a
multi-line comment */
DROP TABLE Bar
/* comment /* nesting */ of /* two */ levels supported */
foo...
Produces these matches :
-- This is Comment 1
-- This is
-- Comment 2
/* This is Comment 3 */
/* This is a
multi-line comment */
/* comment /* nesting */ of /* two */ levels supported */
Note that this will only match 2 levels of nested comments, although in my life I have never seen more than one level being used. Ever.
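If the nesting depth limit is a concern, a small hand-written scanner with a depth counter sidesteps it. The following Python sketch is a different technique from the regex above, not a translation of it: it removes -- line comments and /* */ block comments nested to any depth, and copies single-quoted literals untouched. It deliberately ignores bracketed and double-quoted identifiers, and strip_sql_comments is just an illustrative name.

```python
def strip_sql_comments(sql):
    """Remove -- and (nested) /* */ comments, leaving '...' literals intact."""
    out = []
    i, n = 0, len(sql)
    while i < n:
        if sql[i] == "'":
            # Copy the whole string literal; '' inside it is an escaped quote.
            j = i + 1
            while j < n:
                if sql[j] == "'":
                    if j + 1 < n and sql[j + 1] == "'":
                        j += 2
                        continue
                    break
                j += 1
            out.append(sql[i:j + 1])
            i = j + 1
        elif sql.startswith("--", i):
            # Skip to the end of the line but keep the line break itself.
            j = sql.find("\n", i)
            i = n if j == -1 else j
        elif sql.startswith("/*", i):
            depth = 1
            i += 2
            while i < n and depth > 0:
                if sql.startswith("/*", i):
                    depth += 1
                    i += 2
                elif sql.startswith("*/", i):
                    depth -= 1
                    i += 2
                else:
                    i += 1
        else:
            out.append(sql[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(strip_sql_comments("SELECT 1 /* a /* b /* c */ */ */ FROM t -- x\nGO"))
```

The same loop could be extended to treat [ ... ] and " ... " spans like literals if the scripts use comment markers inside identifiers.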
I made this function that removes all SQL comments, using plain regular expressions. It removes both line comments (even when there is no line break after them) and block comments (even if there are nested block comments). This function can also replace literals (useful if you are searching for something inside SQL procedures but you want to ignore strings).
My code was based on this answer (which is about C# comments), so I had to change line comments from "//" to "--", but more importantly I had to rewrite the block comments regex (using balancing groups) because SQL allows nested block comments, while C# doesn't.
Also, I have this "preservePositions" argument, which instead of stripping out the comments it just fills comments with whitespace. That's useful if you want to preserve the original position of each SQL command, in case you need to manipulate the original script while preserving original comments.
Regex everythingExceptNewLines = new Regex("[^\r\n]");
public string RemoveComments(string input, bool preservePositions, bool removeLiterals=false)
{
//based on https://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689
var lineComments = @"--(.*?)\r?\n";
var lineCommentsOnLastLine = @"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
// literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
// there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
var literals = @"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
var bracketedIdentifiers = @"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
var quotedIdentifiers = @"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
//var blockComments = @"/\*(.*?)\*/"; //the original code was for C#, but Microsoft SQL allows a nested block comments // //https://msdn.microsoft.com/en-us/library/ms178623.aspx
//so we should use balancing groups // http://weblogs.asp.net/whaggard/377025
var nestedBlockComments = @"/\*
(?>
/\* (?<LEVEL>) # On opening push level
|
\*/ (?<-LEVEL>) # On closing pop level
|
(?! /\* | \*/ ) . # Match any char unless the opening and closing strings
)+ # /* or */ in the lookahead string
(?(LEVEL)(?!)) # If level exists then fail
\*/";
string noComments = Regex.Replace(input,
nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
me => {
if (me.Value.StartsWith("/*") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
else if (me.Value.StartsWith("/*") && !preservePositions)
return "";
else if (me.Value.StartsWith("--") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
else if (me.Value.StartsWith("--") && !preservePositions)
return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
return me.Value; // do not remove object identifiers ever
else if (!removeLiterals) // Keep the literal strings
return me.Value;
else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
{
var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
}
else if (removeLiterals && !preservePositions) // wrap completely all literals
return "''";
else
throw new NotImplementedException();
},
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
return noComments;
}
Test 1 (first original, then removing comments, last removing comments/literals)
[select /* block comment */ top 1 'a' /* block comment /* nested block comment */*/ from sys.tables --LineComment
union
select top 1 '/* literal with */-- lots of comments symbols' from sys.tables --FinalLineComment]
[select top 1 'a' from sys.tables
union
select top 1 '/* literal with */-- lots of comments symbols' from sys.tables ]
[select top 1 ' ' from sys.tables
union
select top 1 ' ' from sys.tables ]
Test 2 (first original, then removing comments, last removing comments/literals)
Original:
[create table [/*] /*
-- huh? */
(
"--
--" integer identity, -- /*
[*/] varchar(20) /* -- */
default '*/ /* -- */' /* /* /* */ */ */
);
go]
[create table [/*]
(
"--
--" integer identity,
[*/] varchar(20)
default '*/ /* -- */'
);
go]
[create table [/*]
(
"--
--" integer identity,
[*/] varchar(20)
default ' '
);
go]
This works for me:
(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))
It matches all comments starting with -- or enclosed within /* .. */ blocks
I see you're using Microsoft's SQL Server (as opposed to Oracle or MySQL).
If you relax the regex requirement, it's now possible (since 2012) to use Microsoft's own parser:
using System.Collections.Generic; // IList<T>
using System.Linq; // Where / Select
using Microsoft.SqlServer.Management.TransactSql.ScriptDom;
...
public string StripCommentsFromSQL( string SQL ) {
TSql110Parser parser = new TSql110Parser( true );
IList<ParseError> errors;
var fragments = parser.Parse( new System.IO.StringReader( SQL ), out errors );
// clear comments
string result = string.Join (
string.Empty,
fragments.ScriptTokenStream
.Where( x => x.TokenType != TSqlTokenType.MultilineComment )
.Where( x => x.TokenType != TSqlTokenType.SingleLineComment )
.Select( x => x.Text ) );
return result;
}
See Removing Comments From SQL
The following works fine - pg-minify, and not only for PostgreSQL, but for MS-SQL also.
Presumably, if we remove comments, that means the script is no longer for reading, and minifying it at the same time is a good idea.
That library deletes all comments as part of the script minification.
I am using this java code to remove all sql comments from text. It supports comments like /* ... */ , --..., nested comments, ignores comments inside quoted strings
public static String stripComments(String sqlCommand) {
StringBuilder result = new StringBuilder();
//group 1 must be quoted string
Pattern pattern = Pattern.compile("('(''|[^'])*')|(/\\*(.|[\\r\\n])*?\\*/)|(--(.*|[\\r\\n]))");
Matcher matcher = pattern.matcher(sqlCommand);
int prevIndex = 0;
while(matcher.find()) {
// add previous portion of string that was not found by regexp - meaning this is not a quoted string and not a comment
result.append(sqlCommand, prevIndex, matcher.start());
prevIndex = matcher.end();
// add the quoted string
if (matcher.group(1) != null) {
result.append(sqlCommand, matcher.start(), matcher.end());
}
}
result.append(sqlCommand.substring(prevIndex));
return result.toString();
}
Following up from Jeremy's answer and inspired by Adrien Gibrat's answer.
This is my version that supports comment characters inside single-quoted strings.
.NET C# note: you need to enable RegexOptions.IgnorePatternWhitespace; in most other languages this is the x option
(?: (?:'[^']*?') | (?<singleline>--[^\n]*) | (?<multiline>(?:\/\*)+?[\w\W]+?(?:\*\/)+) )
Example
https://regex101.com/r/GMUAnc/3