Regular Expression to Match All Comments in a T-SQL Script - sql

I need a Regular Expression to capture ALL comments in a block of T-SQL. The Expression will need to work with the .Net Regex Class.
Let's say I have the following T-SQL:
-- This is Comment 1
SELECT Foo FROM Bar
GO
-- This is
-- Comment 2
UPDATE Bar SET Foo == 'Foo'
GO
/* This is Comment 3 */
DELETE FROM Bar WHERE Foo = 'Foo'
/* This is a
multi-line comment */
DROP TABLE Bar
I need to capture all of the comments, including the multi-line ones, so that I can strip them out.
EDIT: It would serve the same purpose to have an expression that takes everything BUT the comments.

This should work:
(--.*)|(((/\*)+?[\w\W]+?(\*/)+))

In PHP, I'm using this code to strip comments from SQL (this is the annotated version, which relies on the x modifier):
trim( preg_replace( '#
(([\'"]).*?[^\\\]\2) # $1 : Skip single & double quoted expressions
|( # $3 : Match comments
(?:\#|--).*?$ # - Single line comment
| # - Multi line (nested) comments
/\* # . comment open marker
(?: [^/*] # . non comment-marker characters
|/(?!\*) # . not a comment open
|\*(?!/) # . not a comment close
|(?R) # . recursive case
)* # . repeat eventually
\*\/ # . comment close marker
)\s* # Trim after comments
|(?<=;)\s+ # Trim after semi-colon
#msx', '$1', $sql ) );
Short version:
trim( preg_replace( '#(([\'"]).*?[^\\\]\2)|((?:\#|--).*?$|/\*(?:[^/*]|/(?!\*)|\*(?!/)|(?R))*\*\/)\s*|(?<=;)\s+#ms', '$1', $sql ) );

Using this code :
StringCollection resultList = new StringCollection();
try {
Regex regexObj = new Regex(@"/\*(?>(?:(?!\*/|/\*).)*)(?>(?:/\*(?>(?:(?!\*/|/\*).)*)\*/(?>(?:(?!\*/|/\*).)*))*).*?\*/|--.*?\r?[\n]", RegexOptions.Singleline);
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
resultList.Add(matchResult.Value);
matchResult = matchResult.NextMatch();
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
With the following input :
-- This is Comment 1
SELECT Foo FROM Bar
GO
-- This is
-- Comment 2
UPDATE Bar SET Foo == 'Foo'
GO
/* This is Comment 3 */
DELETE FROM Bar WHERE Foo = 'Foo'
/* This is a
multi-line comment */
DROP TABLE Bar
/* comment /* nesting */ of /* two */ levels supported */
foo...
Produces these matches :
-- This is Comment 1
-- This is
-- Comment 2
/* This is Comment 3 */
/* This is a
multi-line comment */
/* comment /* nesting */ of /* two */ levels supported */
Note that this will only match 2 levels of nested comments, although in my life I have never seen more than one level being used. Ever.

I made this function that removes all SQL comments, using plain regular expressions. It removes both line comments (even when there is no line break after them) and block comments (even if there are nested block comments). This function can also replace literals (useful if you are searching for something inside SQL procedures but you want to ignore strings).
My code was based on this answer (which is about C# comments), so I had to change line comments from "//" to "--", but more importantly I had to rewrite the block comments regex (using balancing groups) because SQL allows nested block comments, while C# doesn't.
Also, I have this "preservePositions" argument, which fills the comments with whitespace instead of stripping them out. That's useful if you want to preserve the original position of each SQL command, in case you need to manipulate the original script while preserving original comments.
Regex everythingExceptNewLines = new Regex("[^\r\n]");
public string RemoveComments(string input, bool preservePositions, bool removeLiterals=false)
{
//based on https://stackoverflow.com/questions/3524317/regex-to-strip-line-comments-from-c-sharp/3524689#3524689
var lineComments = @"--(.*?)\r?\n";
var lineCommentsOnLastLine = @"--(.*?)$"; // because it's possible that there's no \r\n after the last line comment
// literals ('literals'), bracketedIdentifiers ([object]) and quotedIdentifiers ("object"), they follow the same structure:
// there's the start character, any consecutive pairs of closing characters are considered part of the literal/identifier, and then comes the closing character
var literals = @"('(('')|[^'])*')"; // 'John', 'O''malley''s', etc
var bracketedIdentifiers = @"\[((\]\])|[^\]])* \]"; // [object], [ % object]] ], etc
var quotedIdentifiers = @"(\""((\""\"")|[^""])*\"")"; // "object", "object[]", etc - when QUOTED_IDENTIFIER is set to ON, they are identifiers, else they are literals
//var blockComments = @"/\*(.*?)\*/"; //the original code was for C#, but Microsoft SQL allows a nested block comments // //https://msdn.microsoft.com/en-us/library/ms178623.aspx
//so we should use balancing groups // http://weblogs.asp.net/whaggard/377025
var nestedBlockComments = @"/\*
(?>
/\* (?<LEVEL>) # On opening push level
|
\*/ (?<-LEVEL>) # On closing pop level
|
(?! /\* | \*/ ) . # Match any char unless the opening and closing strings
)+ # /* or */ in the lookahead string
(?(LEVEL)(?!)) # If level exists then fail
\*/";
string noComments = Regex.Replace(input,
nestedBlockComments + "|" + lineComments + "|" + lineCommentsOnLastLine + "|" + literals + "|" + bracketedIdentifiers + "|" + quotedIdentifiers,
me => {
if (me.Value.StartsWith("/*") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks // return new string(' ', me.Value.Length);
else if (me.Value.StartsWith("/*") && !preservePositions)
return "";
else if (me.Value.StartsWith("--") && preservePositions)
return everythingExceptNewLines.Replace(me.Value, " "); // preserve positions and keep line-breaks
else if (me.Value.StartsWith("--") && !preservePositions)
return everythingExceptNewLines.Replace(me.Value, ""); // preserve only line-breaks // Environment.NewLine;
else if (me.Value.StartsWith("[") || me.Value.StartsWith("\""))
return me.Value; // do not remove object identifiers ever
else if (!removeLiterals) // Keep the literal strings
return me.Value;
else if (removeLiterals && preservePositions) // remove literals, but preserving positions and line-breaks
{
var literalWithLineBreaks = everythingExceptNewLines.Replace(me.Value, " ");
return "'" + literalWithLineBreaks.Substring(1, literalWithLineBreaks.Length - 2) + "'";
}
else if (removeLiterals && !preservePositions) // wrap completely all literals
return "''";
else
throw new NotImplementedException();
},
RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
return noComments;
}
Test 1 (first original, then removing comments, last removing comments/literals)
[select /* block comment */ top 1 'a' /* block comment /* nested block comment */*/ from sys.tables --LineComment
union
select top 1 '/* literal with */-- lots of comments symbols' from sys.tables --FinalLineComment]
[select top 1 'a' from sys.tables
union
select top 1 '/* literal with */-- lots of comments symbols' from sys.tables ]
[select top 1 ' ' from sys.tables
union
select top 1 ' ' from sys.tables ]
Test 2 (first original, then removing comments, last removing comments/literals)
Original:
[create table [/*] /*
-- huh? */
(
"--
--" integer identity, -- /*
[*/] varchar(20) /* -- */
default '*/ /* -- */' /* /* /* */ */ */
);
go]
[create table [/*]
(
"--
--" integer identity,
[*/] varchar(20)
default '*/ /* -- */'
);
go]
[create table [/*]
(
"--
--" integer identity,
[*/] varchar(20)
default ' '
);
go]
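The balancing-group idea above (push a level on every /*, pop one on every */) can also be written without a regex engine. What follows is a minimal Python sketch, not the author's method: it swaps the .NET regex for a hand-written single-pass scanner with a depth counter, since Python's re module has no balancing groups. The function name strip_sql_comments is an assumption made for illustration, and the sketch only removes comments (no preservePositions or removeLiterals options).

```python
def strip_sql_comments(sql: str) -> str:
    """Remove -- line comments and (possibly nested) /* */ block comments.

    Text inside '...' literals, "..." identifiers and [...] identifiers
    is copied through unchanged.
    """
    closers = {"'": "'", '"': '"', "[": "]"}
    out = []
    i, n = 0, len(sql)
    while i < n:
        ch = sql[i]
        if ch in closers:
            # Copy a literal/identifier verbatim; a doubled closing
            # character ('' or "" or ]]) is an escape, not the end.
            close = closers[ch]
            j = i + 1
            while j < n:
                if sql[j] == close:
                    if j + 1 < n and sql[j + 1] == close:
                        j += 2
                        continue
                    break
                j += 1
            out.append(sql[i:j + 1])
            i = j + 1
        elif sql.startswith("--", i):
            # Skip to the end of the line but keep the line break itself.
            while i < n and sql[i] not in "\r\n":
                i += 1
        elif sql.startswith("/*", i):
            # Track nesting depth: +1 on every /*, -1 on every */.
            depth = 1
            i += 2
            while i < n and depth > 0:
                if sql.startswith("/*", i):
                    depth += 1
                    i += 2
                elif sql.startswith("*/", i):
                    depth -= 1
                    i += 2
                else:
                    i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Because the depth counter is an integer rather than a fixed pattern, the sketch is not limited to a particular number of nesting levels.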

This works for me:
(/\*(.|[\r\n])*?\*/)|(--(.*|[\r\n]))
It matches all comments starting with -- or enclosed within /* .. */ blocks

I see you're using Microsoft's SQL Server (as opposed to Oracle or MySQL).
If you relax the regex requirement, it's now possible (since 2012) to use Microsoft's own parser:
using Microsoft.SqlServer.Management.TransactSql.ScriptDom;
...
public string StripCommentsFromSQL( string SQL ) {
TSql110Parser parser = new TSql110Parser( true );
IList<ParseError> errors;
var fragments = parser.Parse( new System.IO.StringReader( SQL ), out errors );
// clear comments
string result = string.Join (
string.Empty,
fragments.ScriptTokenStream
.Where( x => x.TokenType != TSqlTokenType.MultilineComment )
.Where( x => x.TokenType != TSqlTokenType.SingleLineComment )
.Select( x => x.Text ) );
return result;
}
See Removing Comments From SQL

The following works fine - pg-minify, and not only for PostgreSQL, but for MS-SQL also.
Presumably, if we remove comments, that means the script is no longer for reading, and minifying it at the same time is a good idea.
That library deletes all comments as part of the script minification.

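I am using this Java code to remove all SQL comments from text. It supports comments like /* ... */ and --..., and ignores comment markers inside quoted strings. Note that the lazy block-comment pattern stops at the first */, so the outer remainder of a nested block comment is left behind.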
public static String stripComments(String sqlCommand) {
StringBuilder result = new StringBuilder();
//group 1 must be quoted string
Pattern pattern = Pattern.compile("('(''|[^'])*')|(/\\*(.|[\\r\\n])*?\\*/)|(--(.*|[\\r\\n]))");
Matcher matcher = pattern.matcher(sqlCommand);
int prevIndex = 0;
while(matcher.find()) {
// add previous portion of string that was not found by regexp - meaning this is not a quoted string and not a comment
result.append(sqlCommand, prevIndex, matcher.start());
prevIndex = matcher.end();
// add the quoted string
if (matcher.group(1) != null) {
result.append(sqlCommand, matcher.start(), matcher.end());
}
}
result.append(sqlCommand.substring(prevIndex));
return result.toString();
}

Following up from Jeremy's answer and inspired by Adrien Gibrat's answer.
This is my version that supports comment characters inside single-quoted strings.
Note for .NET/C#: you need to enable RegexOptions.IgnorePatternWhitespace; in most other languages this is the x option.
(?: (?:'[^']*?') | (?<singleline>--[^\n]*) | (?<multiline>(?:\/\*)+?[\w\W]+?(?:\*\/)+) )
Example
https://regex101.com/r/GMUAnc/3
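To illustrate why the quoted-string alternative comes first in such a pattern, here is a small Python sketch of the same "match literals, then comments" idea. It is an adaptation rather than the pattern above: the group name literal and the function name strip_comments are assumptions, the literal branch also accepts '' as an escaped quote, and block comments are matched lazily without nesting support.

```python
import re

COMMENT_OR_LITERAL = re.compile(
    r"""
      (?P<literal>'(?:''|[^'])*')    # single-quoted string; '' is an escaped quote
    | (?P<singleline>--[^\r\n]*)     # line comment, up to (not including) the line break
    | (?P<multiline>/\*.*?\*/)       # block comment, lazy, no nesting
    """,
    re.VERBOSE | re.DOTALL,
)


def strip_comments(sql: str) -> str:
    # Literals are matched so the regex consumes them before a comment
    # branch can start inside them; they are put back unchanged.
    return COMMENT_OR_LITERAL.sub(lambda m: m.group("literal") or "", sql)
```

The regex engine tries alternatives at the leftmost position first, so once a literal has started, any -- or /* inside it is consumed as part of the literal and never seen as a comment opener.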

Related

Split by delimiter which is contained in a record

I have a column which I am splitting in Snowflake.
The format is as follows:
I have been using split_to_table(A, ',') inside of my query, but as you can probably tell this incorrectly also splits the Scooter > Sprinting, Jogging and Walking record.
Perhaps the delimiter could be made to work only if there is no space on either side of it? I cannot see a different condition that could work.
I have been researching online but haven't found a suitable work around yet, is there anyone that encountered a similar problem in the past?
Thanks
This is a custom rule for the split to table, so we can use a UDTF to apply a custom rule:
create or replace function split_to_table2(STR string, DELIM string, ROW_MUST_CONTAIN string)
returns table (VALUE string)
language javascript
strict immutable
as
$$
{
initialize: function (argumentInfo, context) {
},
processRow: function (row, rowWriter, context) {
var buffer = "";
var i;
const s = row.STR.split(row.DELIM);
for(i=0; i<s.length-1; i++) {
buffer += s[i];
if(s[i+1].includes(row.ROW_MUST_CONTAIN)) {
rowWriter.writeRow({VALUE: buffer});
buffer = "";
} else {
buffer += row.DELIM
}
}
rowWriter.writeRow({VALUE: buffer + s[i]}); // include any pieces still buffered
},
}
$$;
select VALUE from
table(split_to_table2('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ',', '>'))
;
Output:
VALUE
Car > Bike
Bike > Scooter
Scooter > Sprinting, Jogging and Walking
Walking > Flying
This UDTF takes one more parameter than the built-in table function split_to_table. The third parameter, ROW_MUST_CONTAIN, is the string every row must contain. The function splits the string on DELIM, but when a piece does not contain the ROW_MUST_CONTAIN string, it is concatenated (together with the delimiter) onto the preceding piece to form a complete row. In this case we just specify , for the delimiter and > for ROW_MUST_CONTAIN.
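The buffering rule described here can be tried outside Snowflake. The following is a minimal Python sketch of the same idea, with the hypothetical function name split_rows; it is an illustration under that assumption, not the UDTF itself. A row is only emitted when the next piece contains the marker, and whatever is still buffered is flushed with the final piece.

```python
def split_rows(text: str, delim: str, row_must_contain: str) -> list[str]:
    """Split on delim, but glue a piece back onto the previous one
    when the piece does not contain row_must_contain."""
    parts = text.split(delim)
    rows: list[str] = []
    buffer = ""
    for i in range(len(parts) - 1):
        buffer += parts[i]
        if row_must_contain in parts[i + 1]:
            # The next piece starts a new row, so the buffer is complete.
            rows.append(buffer)
            buffer = ""
        else:
            # The next piece belongs to this row; restore the delimiter.
            buffer += delim
    # Flush whatever is buffered together with the final piece.
    rows.append(buffer + parts[-1])
    return rows
```

Called with the sample string, ',' and '>', it yields the four rows listed in the output above.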
We can get a little clever with regexp_replace by replacing the actual delimiters with something else before the table split. I am using double pipes '||', but you can change that to something else. The '\|\|\\1' trick is called back-referencing: it lets us include the captured group (\\1) in the replacement, right after the new delimiter (\|\|).
set str='car>bike,bike>car,truck, and jeep,horse>cat,truck>car,truck, and jeep';
select $str, *
from table(split_to_table(regexp_replace($str,',([^>,]+>)','\|\|\\1'),'||'))
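For readers who want to experiment with the back-reference outside Snowflake, here is a rough Python equivalent. It is only a sketch: split_on_real_delimiters is a made-up name, and Python's re.sub stands in for regexp_replace. The pattern is the same one used above: a comma counts as a real delimiter only when the text after it reaches a > before the next comma.

```python
import re


def split_on_real_delimiters(text: str, new_delim: str = "||") -> list[str]:
    # ,([^>,]+>) matches a comma followed by a run of non-comma,
    # non-> characters and then a '>'. \1 puts that captured run back,
    # so only the comma itself is swapped for the new delimiter.
    marked = re.sub(r",([^>,]+>)", lambda m: new_delim + m.group(1), text)
    return marked.split(new_delim)
```

As with the SQL version, the substitute delimiter must not occur in the data itself.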
Yes, you are right. The only pattern, which I can see, is the one with the whitespace after the comma.
It's a small workaround, but we can make use of this pattern. In the code below I replace the commas that are followed by whitespace, then apply the split_to_table function, and finally convert the replacement back.
It's not super pretty and would break if your string contains "my_replacement" or if a new pattern appears, but it's working for me:
select replace(t.value, 'my_replacement', ', ')
from table(
split_to_table(replace('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ', ', 'my_replacement'),',')) t
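The same "comma followed by a space is not a delimiter" rule can be expressed without a placeholder string when the regex flavor has lookahead. The Python sketch below shows the idea under that assumption; split_unspaced_commas is a hypothetical name, and whether a given SQL engine's regex functions accept lookahead is a separate question that this sketch does not answer.

```python
import re


def split_unspaced_commas(text: str) -> list[str]:
    # Split on a comma only when the next character is not a space.
    # The lookahead (?! ) checks the following character without
    # consuming it, so no sentinel value is needed.
    return re.split(r",(?! )", text)
```

Since nothing is substituted into the string, the risk of the placeholder colliding with real data goes away.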

Is there a way to get back source code from antlr4ts parse tree after modifications ctx.removeLastChild/ctx.addChild? [duplicate]

I want to keep white space when I call text attribute of token, is there any way to do it?
Here is the situation:
We have the following code
IF L > 40 THEN;
ELSE
IF A = 20 THEN
PUT "HELLO";
In this case, I want to transform it into:
if (!(L>40)){
if (A=20)
put "hello";
}
The rule in Antlr is that:
stmt_if_block: IF expr
THEN x=stmt
(ELSE y=stmt)?
{
if ($x.text.equalsIgnoreCase(";"))
{
WriteLn("if(!(" + $expr.text +")){");
WriteLn($stmt.text);
Writeln("}");
}
}
But the result looks like:
if(!(L>40))
{
ifA=20put"hello";
}
The reason is that the white space in $stmt was removed. I was wondering if there is any way to keep this white space.
Thank you so much
Update: If I add
SPACE: [ ] -> channel(HIDDEN);
The space will be preserved, and the result would look like below, many spaces between tokens:
IF SUBSTR(WNAME3,M-1,1) = ')' THEN M = L; ELSE M = L - 1;
This is the C# extension method I use for exactly this purpose:
public static string GetFullText(this ParserRuleContext context)
{
if (context.Start == null || context.Stop == null || context.Start.StartIndex < 0 || context.Stop.StopIndex < 0)
return context.GetText(); // Fallback
return context.Start.InputStream.GetText(Interval.Of(context.Start.StartIndex, context.Stop.StopIndex));
}
Since you're using java, you'll have to translate it, but it should be straightforward - the API is the same.
Explanation: Get the first token, get the last token, and get the text from the input stream between the first char of the first token and the last char of the last token.
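The reason this keeps the white space is that it slices the original input instead of concatenating token texts. A toy Python sketch of that principle follows; the Token class here is a hypothetical stand-in with only start and stop character offsets, not the ANTLR runtime type, and the offsets are written by hand for the example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a parser token (not the ANTLR class)."""
    text: str
    start: int  # offset of the token's first character in the source
    stop: int   # offset of the token's last character (inclusive)


def joined_text(tokens: list[Token]) -> str:
    # What concatenating token texts gives: skipped white space is lost.
    return "".join(t.text for t in tokens)


def full_text(source: str, tokens: list[Token]) -> str:
    # Slice the original source from the first token's first character
    # to the last token's last character, white space included.
    return source[tokens[0].start:tokens[-1].stop + 1]


source = 'IF A = 20 THEN PUT "HELLO";'
# Hand-written offsets for the expression A = 20 inside the source.
expr_tokens = [Token("A", 3, 3), Token("=", 5, 5), Token("20", 7, 8)]
```

Here joined_text(expr_tokens) gives 'A=20', while full_text(source, expr_tokens) gives 'A = 20'.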
@Lucas's solution, but in Java, in case you have trouble translating it:
private String getFullText(ParserRuleContext context) {
if (context.start == null || context.stop == null || context.start.getStartIndex() < 0 || context.stop.getStopIndex() < 0)
return context.getText();
return context.start.getInputStream().getText(Interval.of(context.start.getStartIndex(), context.stop.getStopIndex()));
}
Looks like InputStream is not always updated after removeLastChild/addChild operations. This solution helped me for one grammar, but it doesn't work for another.
Works for this grammar.
Doesn't work for modern groovy grammar (for some reason inputStream.getText contains old text).
I am trying to implement function name replacement like this:
enterPostfixExpression(ctx: PostfixExpressionContext) {
// Get identifierContext from ctx
...
const token = CommonTokenFactory.DEFAULT.createSimple(GroovyParser.Identifier, 'someNewFnName');
const node = new TerminalNode(token);
identifierContext.removeLastChild();
identifierContext.addChild(node);
UPD: I used visitor pattern for the first implementation

how to remove all html characters in snowflake, dont want to include all html special characters in query (no hardcoding)

I want to remove the following kinds of characters from a string. Please help.
'
&
You may try this one to remove any HTML special characters:
select REGEXP_REPLACE( 'abc&amp;def&sup3;&raquo;ghi', '&[^&]+;', '!' );
Explanation:
REGEXP_REPLACE uses regular expression to search and replace. I search for "&[^&]+;" and replace it with "!" for demonstration. You can of course use '' to remove them. More info about the function:
https://docs.snowflake.com/en/sql-reference/functions/regexp_replace.html
About the regular expression string:
& is the first character of an HTML special character
[^&] means any character except &. This prevents the regex from replacing everything between the first '&' and the last ';': a match cannot run past the next '&'
+ means match 1 or more of the preceding token (any character except &)
; is the last character of an HTML special character
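The pattern can be checked quickly in Python, whose re.sub behaves like REGEXP_REPLACE for this expression. This is only a sketch: strip_entities and strip_entities_strict are hypothetical names, and the stricter variant is a suggested tightening rather than part of the answer above.

```python
import re

ENTITY = re.compile(r"&[^&]+;")
# Assumed stricter variant: also forbids ';' and white space inside the
# entity name, so a match ends at the first ';' after the '&'.
STRICT_ENTITY = re.compile(r"&[^&;\s]+;")


def strip_entities(text: str, replacement: str = "") -> str:
    return ENTITY.sub(replacement, text)


def strip_entities_strict(text: str, replacement: str = "") -> str:
    return STRICT_ENTITY.sub(replacement, text)
```

One caveat worth knowing: because [^&]+ is greedy and only stops at the next '&', the original pattern can swallow ordinary text up to a later ';'. For the input 'Tom &amp; Jerry; friends' it removes '&amp; Jerry;', whereas the stricter variant removes only '&amp;'.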
CREATE or REPLACE FUNCTION UDF_StripHTML(str varchar)
returns varchar
language javascript
strict
as
'var HTMLParsedText=""
var resultSet = STR.split(''>'')
var resultSetLength =resultSet.length
var counter=0
while(resultSetLength>0)
{
if(resultSet[counter].indexOf(''<'')>0)
{
var value = resultSet[counter]
value=value.substring(0, resultSet[counter].indexOf(''<''))
if (resultSet[counter].indexOf(''&'')>=0 && resultSet[counter].indexOf('';'')>=0)
{
value=value.replace(value.substring(resultSet[counter].indexOf(''&''), resultSet[counter].indexOf('';'')+1),'''')
}
}
if (value)
{
value = value.trim();
if(HTMLParsedText === "")
{
HTMLParsedText = value
}
else
{
if (value) {
HTMLParsedText = HTMLParsedText + '' '' + value
}
}
value=''''
}
counter= counter+1
resultSetLength=resultSetLength-1
}
return HTMLParsedText';
to call this UDF :
Select UDF_StripHTML(text)

Extra quotes doing SQL insert from Perl to CSV. If I try to remove them, I get no quotes. I need single quotes as in "0123"

When I write to a csv file from Perl using SQL Insert, I get either 0123 or """0123""", but I need "0123". Neither concatenation nor regex seem to resolve the issue.
Here's my code:
my $dbh = DBI->connect(qq{DBI:CSV:csv_eol=\n;csv_sep_char=\\,;});
$dbh->{'csv_tables'}->{'Table'} = {'file' => 'data.csv','col_names' =>
["num","id"]};
#Setup error variables
$dbh->{'RaiseError'} = 1;
$@ = '';
## Attempts to change $num
##$num = '"'.$num.'"';## this causes """0123"""
##$num = "\"$num\"";## this causes """0123""" even if I additionally do this:
## $num=~s/"""/"/g;
##$num = " ".$num;## causes " 0123" VERY CLOSE. Try next line:
##$num=~s/ //g;## This causes 0123
##$num = "".$num; ## causes 0123
##$num = "'".$num."'";## causes 123
my $value = "\'$num\',\'$id\'";
my $insert = "INSERT INTO Table VALUES ($value)";
my $sth = $dbh->prepare($insert);
$sth->execute();
$sth->finish();
$dbh->disconnect();
I would like to have the output of $num to end up being "0123" in the CSV, but instead I get 0123 or """0123""" or " 0123"
Thanks to poj at PerlMonks (https://www.perlmonks.org/?node_id=11107315), there is a solution:
Here it is for everyone's convenience:
Use placeholders and always_quote option
#!/usr/bin/perl
use strict;
use DBI;
my $dbh = DBI->connect("dbi:CSV:", "", "",{
'RaiseError' => 1 }
);
$dbh->{'csv_tables'}->{'MyTable'} = {
'file' => 'data1.csv',
'col_names' => ["num","id"],
'always_quote' => 1,
};
my $num = '0123';
my $id = '0124';
my $sql = "INSERT INTO MyTable VALUES (?,?)";
my $sth = $dbh->do($sql,undef,$num,$id);
poj
Interpolating variables into SQL statements is a bad idea. You leave yourself open to SQL injection attacks (see Bobby Tables). So don't do that.
Instead, use bind parameters and pass extra values to the execute() method.
my $insert = 'INSERT INTO Table VALUES (?, ?)';
my $sth = $dbh->prepare($insert);
$sth->execute($num, $id);
The problem is that a CSV entry of 123 and of "123" are identical. A single pair of double quotes is part of the CSV format (at least, the most common variant). So when you try to insert a value containing double quotes, it doubles them up to escape them. DBD::CSV is only surrounding the cell in double quotes when it contains special characters like a comma, space, or escaped double quotes.
foo," foo ",",","""a""b"
# parses to: 'foo' -- ' foo ' -- ',' -- '"a"b'
The parser should not care whether the cell is quoted or not (unless it is needed because the cell contains these special characters). See http://tburette.github.io/blog/2014/05/25/so-you-want-to-write-your-own-CSV-code/
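The quoting rules described above are not specific to Perl. As an illustration, here is a short sketch using Python's standard csv module as a stand-in for the CSV writer behind DBD::CSV (an assumption made purely for demonstration; the helper names csv_line and parse_line are made up). It shows minimal quoting, the doubling of embedded quotes, and an "always quote" mode.

```python
import csv
import io


def csv_line(row, quoting):
    """Serialize one row to a CSV line using the given quoting mode."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(row)
    return buf.getvalue()


def parse_line(line):
    """Parse one CSV line back into a list of cell values."""
    return next(csv.reader([line]))


# Minimal quoting: plain values are written bare.
plain = csv_line(["0123", "0124"], csv.QUOTE_MINIMAL)
# A value that already contains quotes gets wrapped and its quotes doubled.
doubled = csv_line(['"0123"', "0124"], csv.QUOTE_MINIMAL)
# Quoting every cell is a writer option, not something done to the data.
quoted = csv_line(["0123", "0124"], csv.QUOTE_ALL)
```

In this sketch plain is 0123,0124, doubled is """0123""",0124 and quoted is "0123","0124" (each followed by a newline), which mirrors the three outcomes discussed in the question.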

Lex Yacc, should i tokenize character literals?

I know, poorly worded question not sure how else to ask though.
I always seem to end up in the error branch regardless of what I'm entering and can't figure out where I'm screwing this up. I'm using a particular flavor of Lex/YACC called GPPG, which just sets this all up for use with C#.
Here is my Y
method : L_METHOD L_VALUE ')' { System.Diagnostics.Debug.WriteLine("Found a method: Name:" + $1.Data ); }
| error { System.Diagnostics.Debug.WriteLine("Not valid in this statement context ");/*Throw new exception*/ }
;
here's my Lex
\'[^']*\' {this.yylval.Data = yytext.Replace("'",""); return (int)Tokens.L_VALUE;}
[a-zA-Z0-9]+\( {this.yylval.Data = yytext; return (int)Tokens.L_METHOD;}
The idea is that I should be able to pass
Method('value') to it and have it properly recognize that this is correct syntax.
Ultimately the plan is to execute the Method, passing the various parameters as values.
I've also tried several variations. For example:
method : L_METHOD '(' L_VALUE ')' { System.Diagnostics.Debug.WriteLine("Found a method: Name:" + $1.Data ); }
| error { System.Diagnostics.Debug.WriteLine("Not valid in this statement context: ");/*Throw new exception*/ }
;
\'[^']*\' {this.yylval.Data = yytext.Replace("'",""); return (int)Tokens.L_VALUE;}
[a-zA-Z0-9]+ {this.yylval.Data = yytext; return (int)Tokens.L_METHOD;}
You need a lex rule to return the punctuation tokens 'as-is' so that the yacc grammar can recognize them. Something like:
[()] { return *yytext; }
added to your second example should do the trick.
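To see why the extra rule matters, it can help to look at the token stream the grammar receives. The following Python sketch imitates the three lexer rules with one regular expression; it is an illustration only, not GPPG/GPLEX code, and the tokenize function and the (kind, value) tuple shape are assumptions.

```python
import re

TOKEN = re.compile(
    r"""
      (?P<L_VALUE>'[^']*')
    | (?P<L_METHOD>[a-zA-Z0-9]+)
    | (?P<PUNCT>[()])
    | (?P<SKIP>\s+)
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return (kind, value) pairs; punctuation uses the character as its kind."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            # Without a matching rule the character never reaches the
            # parser as a token; here that is reported as an error.
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind = m.lastgroup
        if kind == "L_VALUE":
            tokens.append((kind, m.group()[1:-1]))  # drop the quotes
        elif kind == "L_METHOD":
            tokens.append((kind, m.group()))
        elif kind == "PUNCT":
            tokens.append((m.group(), m.group()))   # returned as-is
        pos = m.end()
    return tokens
```

For the input Method('value') the result is L_METHOD, '(', L_VALUE, ')', which is the sequence the second grammar (L_METHOD '(' L_VALUE ')') expects.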