ANTLR4 Grammar mathematical operator associativity and order of operations - antlr

I am attempting to create an ANTLR4 grammar to carry out mathematical operations. However, I'm having difficulty balancing the order of operations with operator associativity (e.g. x+y+z is right-associative and should be understood as x+(y+z), while subtraction is left-associative, so x-y-z should be understood as (x-y)-z).
This example defines the operations with correct associativity.
expression
: value = INTEGER #Integer
| value = VARIABLE #Variable
| <assoc=right> left = expression op = '^' right = expression #Operation
| <assoc=left> left = expression op = '/' right = expression #Operation
| <assoc=right> left = expression op = '*' right = expression #Operation
| <assoc=right> left = expression op = '+' right = expression #Operation
| <assoc=left> left = expression op = '-' right = expression #Operation
;
However, the order of operations dictates that multiplication and division are evaluated with equal precedence, from left to right. The same goes for addition and subtraction.
This example maintains the order of operations, but gives no way to specify associativity per operator.
expression
: value = INTEGER #Integer
| value = VARIABLE #Variable
| <assoc=right> left = expression op = OP_POW right = expression #Operation
| left = expression op = ('/' | '*') right = expression #Operation
| left = expression op = ('+' | '-') right = expression #Operation
;
Any attempts I've made to fix this end up using left-recursive rules. Is there a way to maintain operator associativity AND the correct order of operations using ANTLR4? Any help is appreciated. Thanks in advance.
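For reference, the ANTLR4 documentation describes binary operators in a left-recursive rule as left-associative unless an alternative is marked <assoc=right>, and operators grouped in one alternative, such as ('/' | '*'), share a precedence level. The following is a minimal Python sketch (not ANTLR code) of precedence climbing, the general technique of giving each precedence level its own associativity. The OPS table and the function names are assumptions made for this illustration.

```python
import re

# Hypothetical operator table: (precedence level, associativity).
# Operators on the same level share precedence, like ('/' | '*') in one alternative.
OPS = {
    '^': (3, 'right'),
    '*': (2, 'left'),
    '/': (2, 'left'),
    '+': (1, 'left'),
    '-': (1, 'left'),
}

def tokenize(text):
    """Split the input into identifiers, integers, operators and parentheses."""
    return re.findall(r'[A-Za-z_]\w*|\d+|[-+*/^()]', text)

def parse_atom(tokens):
    """Parse a variable, an integer, or a parenthesized sub-expression."""
    tok = tokens.pop(0)
    if tok == '(':
        inner = parse(tokens, 1)
        tokens.pop(0)  # drop the closing ')'
        return inner
    return tok

def parse(tokens, min_prec=1):
    """Precedence climbing; returns the expression fully parenthesized."""
    left = parse_atom(tokens)
    while tokens and tokens[0] in OPS and OPS[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, assoc = OPS[op]
        # Left-associative: the right operand may only contain tighter operators.
        # Right-associative: the right operand may contain the same level again.
        next_min = prec + 1 if assoc == 'left' else prec
        right = parse(tokens, next_min)
        left = f'({left}{op}{right})'
    return left

def group(text):
    return parse(tokenize(text))

print(group('x-y-z'))    # ((x-y)-z)
print(group('a^b^c'))    # (a^(b^c))
print(group('a/b*c'))    # ((a/b)*c)
```

Because addition is mathematically associative, grouping x+y+z as (x+y)+z gives the same value as x+(y+z), so treating '+' as left-associative on the same level as '-' does not change results.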

Related

MS SQL case statement and cast/covert data type issues

I'm having an issue with the MS SQL case statement that has a cast inside. Here is the example I came up with.
DECLARE @bla as varchar(10) = '001234'
DECLARE @vb AS varchar(20) = 'bla'
SELECT CASE when (@vb <> 'bla') THEN CAST(@bla AS int) ELSE @bla END vbla
The result is very strange. It should be 001234. What am I missing?
+------+
| vbla |
+------+
| 1234 |
+------+
A case EXPRESSION (not statement) returns a single type. When one of the branches is a number, then the return value is a number.
The value you are seeing is the number that the string converts to. If the string could not be converted to a number (for example, if it contained letters), SQL Server would raise a conversion error instead.
If you want to see the leading zeros, leave the value as a string.
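The rule can be modelled in a few lines of Python. This is only a sketch of the idea, not SQL Server's implementation: the two-entry TYPE_PRECEDENCE table and the helper names are assumptions made for this illustration.

```python
# Hypothetical, heavily abbreviated type ranking: the higher number wins.
TYPE_PRECEDENCE = {'varchar': 1, 'int': 2}

def case_result_type(branch_types):
    """A CASE expression has one result type: the highest-ranked branch type."""
    return max(branch_types, key=TYPE_PRECEDENCE.__getitem__)

def convert(value, target_type):
    """Convert the chosen branch's value to the result type of the whole CASE."""
    return int(value) if target_type == 'int' else str(value)

# THEN branch is int, ELSE branch is varchar; the ELSE branch is the one taken.
result_type = case_result_type(['int', 'varchar'])
print(result_type)                     # int
print(convert('001234', result_type))  # 1234
```

Even though the ELSE branch is the one that runs, its value is converted to the type chosen for the whole expression, which is why the leading zeros disappear.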

A strange operation problem in SQL Server: -100/-100*10 = 0

If you execute SELECT -100/-100*10 the result is 0.
If you execute SELECT (-100/-100)*10 the result is 10.
If you execute SELECT -100/(-100*10) the result is 0.
If you execute SELECT 100/100*10 the result is 10.
BOL states:
When two operators in an expression have the same operator precedence level, they are evaluated left to right based on their position in the expression.
And
Level Operators
1 ~ (Bitwise NOT)
2 * (Multiplication), / (Division), % (Modulus)
3 + (Positive), - (Negative), + (Addition), + (Concatenation), - (Subtraction), & (Bitwise AND), ^ (Bitwise Exclusive OR), | (Bitwise OR)
Is BOL wrong, or am I missing something? It seems the - is throwing the (expected) precedence off.
According to the precedence table, this is the expected behavior. The operators with higher precedence (/ and *) are evaluated before the operator with lower precedence (unary -). So this:
-100 / -100 * 10
is evaluated as:
-(100 / -(100 * 10))
Note that this behavior is different from most programming languages where unary negation has higher precedence than multiplication and division e.g. VB, JavaScript.
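The two groupings can be checked by hand. Here is a small Python sketch that does the arithmetic, assuming integer division truncates toward zero as it does for T-SQL int operands; the helper name tsql_int_div is made up for this example.

```python
def tsql_int_div(a, b):
    """Integer division that truncates toward zero (Python's // floors instead)."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q

# Grouping described above: -(100 / -(100 * 10))
as_parsed = -tsql_int_div(100, -(100 * 10))

# Grouping most readers expect: ((-100) / (-100)) * 10
as_expected = tsql_int_div(-100, -100) * 10

print(as_parsed)    # 0
print(as_expected)  # 10
```

The inner division 100 / -1000 truncates to 0, and negating 0 still gives 0, which matches the surprising result.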
BOL is correct. - has lower precedence than *, so
-A * B
is parsed as
-(A * B)
Multiplication being what it is, you don't typically notice this, except when mixing in the two other binary operators with equal precedence: / and % (and % is rarely used in compound expressions like this). So
C / -A * B
Is parsed as
C / -(A * B)
explaining the results. This is counter-intuitive because in most other languages, unary minus has higher precedence than * and /, but not in T-SQL, and this is documented correctly.
A nice (?) way to illustrate it:
SELECT -1073741824 * 2
produces an arithmetic overflow, because -(1073741824 * 2) produces 2147483648 as an intermediate, which does not fit in an INT, but
SELECT (-1073741824) * 2
produces the expected result -2147483648, which does.
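The overflow can be reproduced on paper with the 32-bit INT range. A minimal Python sketch, where fits_int is a hypothetical helper and Python's unbounded integers stand in for the intermediate values:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def fits_int(value):
    """True if the value fits in a 32-bit signed INT."""
    return INT_MIN <= value <= INT_MAX

# -1073741824 * 2 is grouped as -(1073741824 * 2): the product comes first.
intermediate = 1073741824 * 2
print(intermediate, fits_int(intermediate))  # 2147483648 False

# (-1073741824) * 2 negates first, so no out-of-range intermediate appears.
direct = (-1073741824) * 2
print(direct, fits_int(direct))              # -2147483648 True
```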
Notice in the documentation that (perhaps counter-intuitively) the order of precedence for - (Negative) is third.
So you effectively get:
-(100/-(100*10)) = 0
If you place them into variables you won't see this happening, as there is no unary operation that occurs after the multiplication.
So here A and B are the same, whereas C, D, E show the result you are seeing (with E having the complete bracketing)
DECLARE @i1 int, @i2 int, @i3 int;
SELECT @i1 = -100,
@i2 = -100,
@i3 = 10;
SELECT @i1/@i2*@i3 [A],
-100/(-100)*10 [B],
-100/-100*10 [C],
-100/-(100*10) [D],
-(100/-(100*10)) [E];
A = 10
B = 10
C = 0
D = 0
E = 0

SQL assign variable with subquery

I have a question for following 2 SQL:
declare @i1 bit, @b1 bit
declare @i2 bit, @b2 bit
declare @t table (Seq int)
insert into @t values (1)
-- verify data
select case when (select count(1) from @t n2 where 1 = 2) > 0 then 1 else 0 end
-- result 0
select @i1 = 1, @b1 = case when @i1 = 1 or ((select count(1) from @t n2 where 1 = 2) > 0) then 1 else 0 end from @t n where n.Seq = 1
select @i1, @b1
-- result 1, 0
select @i2 = 1, @b2 = case when @i2 = 1 or (0 > 0) then 1 else 0 end from @t n where n.Seq = 1
select @i2, @b2
-- result 1, 1
Before executing it, I thought the case condition would be null = 1 or (0 > 0), and that it would return 0.
But now I'm wondering why the 2nd SQL returns 1.
Just to extend @Giorgi's answer:
The execution plan shows that @i2 is evaluated first (@i2 = 1), so case when @i2 = 1 or anything returns 1.
See also the Caution section of this MSDN entry: https://msdn.microsoft.com/en-us/library/ms187953.aspx
If there are multiple assignment clauses in a single SELECT statement,
SQL Server does not guarantee the order of evaluation of the
expressions. Note that effects are only visible if there are
references among the assignments.
It's all related to internal optimization.
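Both observed results are consistent with SQL's three-valued logic once the evaluation order is allowed to vary. The following Python sketch models this, with None standing in for NULL/UNKNOWN; the function names are assumptions made for this illustration.

```python
def sql_eq(a, b):
    """SQL equality: comparing with NULL yields UNKNOWN (None)."""
    return None if a is None or b is None else a == b

def sql_or(a, b):
    """Three-valued OR: TRUE wins; otherwise UNKNOWN if either side is UNKNOWN."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def case_flag(i, subquery_count):
    """CASE WHEN @i = 1 OR (count > 0) THEN 1 ELSE 0 END; UNKNOWN falls to ELSE."""
    return 1 if sql_or(sql_eq(i, 1), subquery_count > 0) is True else 0

# Assignment @i = 1 happened before the CASE was evaluated.
print(case_flag(1, 0))     # 1

# The CASE was evaluated while @i was still NULL.
print(case_flag(None, 0))  # 0
```

Since the order of the assignments is not guaranteed, either outcome is possible for a statement that reads a variable it also assigns.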
I will post this as an answer as it is quite large text from Training Kit (70-461):
WHERE propertytype = 'INT' AND CAST(propertyval AS INT) > 10
Some assume that unless precedence rules dictate otherwise, predicates
will be evaluated from left to right, and that short circuiting will
take place when possible. In other words, if the first predicate
propertytype = 'INT' evaluates to false, SQL Server won’t evaluate the
second predicate CAST(propertyval AS INT) > 10 because the result is
already known. Based on this assumption, the expectation is that the
query should never fail trying to convert something that isn’t
convertible.
The reality, though, is different. SQL Server does
internally support a short-circuit concept; however, due to the
all-at-once concept in the language, it is not necessarily going to
evaluate the expressions in left-to-right order. It could decide,
based on cost-related reasons, to start with the second expression,
and then if the second expression evaluates to true, to evaluate the
first expression as well. This means that if there are rows in the
table where propertytype is different than 'INT', and in those rows
propertyval isn’t convertible to INT, the query can fail due to a
conversion error.
Just to extend both answers.
From Dirty Secrets of the CASE Expression:
CASE will not always short circuit
The official documentation implies that the entire expression will short-circuit, meaning it will evaluate the expression from left-to-right, and stop evaluating when it hits a match:
The CASE statement evaluates its conditions sequentially and stops with the
first condition whose condition is satisfied.
And MS Connect:
CASE / COALESCE won't always evaluate in textual order
Aggregates Don't Follow the Semantics Of CASE
CASE Transact-SQL
The CASE statement evaluates its conditions sequentially and stops with the first condition whose condition is satisfied. In some situations, an expression is evaluated before a CASE statement receives the results of the expression as its input. Errors in evaluating these expressions are possible.

create sql view from comma separated values

T-sql question:
I need help to build a join from 2 tables, where on one of the tables I have aggregated data (comma separated values).
I have a table - Users where I have 3 columns: UserId, DefaultLanguage and OtherLanguages.
The table looks like this:
UserId | DefaultLanguage | OtherLanguages
---------------------------------------------
1 | en | NULL
2 | en | it, fr
3 | fr | en, it
4 | en | sp
and so on.
I have another table where I have the association between language code (en, fr, ro, it, sp) and language name:
LangCode | LanguageName
-------------------------
en | English
fr | French
it | Italian
sp | Spanish
and so on.
I want to create a view like this:
UserId | DefaultLanguage | OtherLanguages
---------------------------------------------
1 | English | NULL
2 | English | Italian, French
3 | French | English, Italian
4 | English | Spanish
and so on.
In short, I need a view where the language code is replaced by language name.
Any help, please?
There are several solutions; of course, you could also recreate the tables and change the data structure.
1. If all the language codes are 2 characters long:
select t1.UserId, t2.LanguageName,
ISNULL( t3.LanguageName, '') + ISNULL(', '+t4.LanguageName, '') + ISNULL( ', '+t5.LanguageName, '') OtherLanguages
from Table1 t1
inner join Table2 t2 on t1.DefaultLanguage = t2.LangCode
left join Table2 t3 on Left(t1.OtherLanguages,2) = t3.LangCode
left join Table2 t4 on CASE WHEN len(Replace(t1.OtherLanguages, ' ', '')) > 3 THEN
SUBSTRING( Replace(t1.OtherLanguages, ' ', ''), 4, 2) ELSE null END = t4.LangCode
left join Table2 t5 on CASE WHEN len(Replace(t1.OtherLanguages, ' ', '')) > 6 THEN
SUBSTRING( Replace(t1.OtherLanguages, ' ', ''), 7, 2) ELSE null END = t5.LangCode
2. Use a user-defined function:
CREATE FUNCTION [dbo].[func_GetLanguageName] (@pLanguageList varchar(max))
RETURNS varchar(max) AS
BEGIN
Declare @aLanguageList varchar(max) = @pLanguageList
Declare @aLangCode varchar(max) = null
Declare @aReturnName varchar(max) = null
WHILE LEN(@aLanguageList) > 0
BEGIN
IF PATINDEX('%,%',@aLanguageList) > 0
BEGIN
SET @aLangCode = RTRIM(LTRIM(SUBSTRING(@aLanguageList, 0, PATINDEX('%,%',@aLanguageList))))
SET @aLanguageList = LTRIM(SUBSTRING(@aLanguageList, LEN(@aLangCode + ',') + 1,LEN(@aLanguageList)))
END
ELSE
BEGIN
SET @aLangCode = @aLanguageList
SET @aLanguageList = NULL
END
Select @aReturnName = ISNULL( @aReturnName + ', ' , '') + LanguageName from Table2 where LangCode=@aLangCode
END
RETURN(@aReturnName)
END
Then use it in a select:
select UserId, dbo.func_GetLanguageName(DefaultLanguage)DefaultLanguage, dbo.func_GetLanguageName(OtherLanguages) OtherLanguages from table1
Best practice would dictate not to have this type of comma delimited
data in a column...
Since you stated in comments that the schema cannot be changed, the next best thing is a function. This can be used in a select query in-line.
SQL is notoriously slow with string manipulation. Here is an interesting article on the topic. There are many SQL "string split" functions out there. They all generally split a comma delimited string and return a table.
For this specific use-case, you actually need a scalar-valued
function (a function which returns one value) rather than a
table-valued function (one which returns a table of values).
Below is a modified such function, which returns a scalar value in place of the original comma delimited string of language codes.
The comments explain what is happening line by line.
The gist is that you must loop through the input string keeping track of the last comma location, extract each code, lookup the full language from the languages table, and then return the output as a comma-delimited string.
Language codes to languages function:
Create Function [dbo].fn_languageCodeToFull
( @Input Varchar(100) )
Returns Varchar(1000)
As
Begin
-- To address null input, based on the example you provided, we set the output to NULL if there is no input
If @Input = '' Or @Input Is Null
Return Null
Declare
@CodeLength int, -- constant for code length to avoid hardcoded "magic numbers"
@Output varchar(1000), -- will contain the final comma delimited string of full languages
@LastIndex int, -- tracks the location of the input we are searching as we loop over the string
@CurrentCode varchar(2), -- for code readability, we extract each language code to this variable
@CurrentLanguage varchar(50), -- for code readability, we store the full language in this variable
@IndexIncrement int -- constant to increment the search index by 1 at each iteration
-- ensuring the loop moves forward
Set @LastIndex = 0 -- seed the index, so we begin to search at 0 index
Set @CodeLength = 2 -- ISO language codes are always 2 characters in length
Set @Output = '' -- seed with empty string to avoid NULL when concatenating
Set @IndexIncrement = 1 -- again avoiding hardcoded values...
-- We will loop until we have gone to or beyond the length of the input string
While @LastIndex < len(@Input)
Begin
-- Set the index of each comma (charindex is 1-based)
Set @LastIndex = CHARINDEX(',', @Input, @LastIndex)
-- When we get to the last item, CharIndex will return 0 when it does not find a comma.
-- To pull the last item, we will artificially set @LastIndex to be 1 greater than the input string
-- This will allow the code following this line to be unaltered for this scenario
If @LastIndex = 0 set @LastIndex = len(@Input) + 1 -- account for 1-based index of substring
-- Extract the code prior to the current comma that charindex has identified
Set @CurrentCode = substring(@Input, @LastIndex - @CodeLength, @CodeLength)
-- Do a lookup to get the language for the current code
Set @CurrentLanguage = (Select LanguageName From languages Where code = @CurrentCode)
-- Only add comma after first language to ensure no extra comma will be present in Output
If @LastIndex > 3 Set @Output = @Output + ','
-- Here we build the Output string with the language
Set @Output = @Output + @CurrentLanguage
-- Finally, we increment @LastIndex by 1 to avoid loop on first instance of comma
Set @LastIndex = @LastIndex + @IndexIncrement
End
Return @Output
End
Then your view would simply do something like:
Sample view using the function:
Create View vw_UserLanguages
As
Select
UserId,
dbo.fn_languageCodeToFull(DefaultLanguage) as DefaultLanguage,
dbo.fn_languageCodeToFull(OtherLanguages) as OtherLanguages
From UserLanguageCodes -- you do not provide a name so I made one up
Note that the function will work whether there are commas or not, so there is no need to join the Languages table here as you can just have the function do all the work in this case.
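The logic of the function (split on commas, look each code up, join the names back together) can be sketched outside the database as well. The Python version below is only an illustration of that logic; the LANGUAGES dictionary stands in for the lookup table, and its name and the function name are assumptions.

```python
# Stand-in for the language lookup table from the question.
LANGUAGES = {'en': 'English', 'fr': 'French', 'it': 'Italian', 'sp': 'Spanish'}

def codes_to_names(codes):
    """Turn 'it, fr' into 'Italian, French'; NULL or empty input stays NULL."""
    if codes is None or codes.strip() == '':
        return None
    # Split on commas, trim the spaces around each code, then look it up.
    names = [LANGUAGES[code.strip()] for code in codes.split(',')]
    return ', '.join(names)

print(codes_to_names('it, fr'))  # Italian, French
print(codes_to_names('en'))      # English
print(codes_to_names(None))      # None
```

An unknown code raises a KeyError in this sketch; how a missing language should be handled in the real function is a separate design decision.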
One quick and dirty solution would be to use nested REPLACE calls, but that can result in a complex, long-winded statement, especially if you have more than five languages.
As an example:
SELECT [UserId],[DefaultLanguage],
CASE
WHEN [OtherLanguages] IS NULL THEN ''
ELSE REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE([OtherLanguages],
'en','English'),
'fr','French'),
'it','Italian'),
'ro','Romulan'), --Probably not the intended language ;-)
'sp','Spanish')
END as [OtherLanguages]
FROM YourTable
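One thing to watch with chained replacements is that each step sees the output of the previous one, so the order of the pairs matters. A short Python sketch of that effect follows; chained_replace is a hypothetical helper, and Python's str.replace is case-sensitive, whereas REPLACE in SQL Server follows the collation in use.

```python
from functools import reduce

def chained_replace(text, pairs):
    """Apply each (code, name) replacement in order, like nested REPLACE calls."""
    return reduce(lambda acc, pair: acc.replace(pair[0], pair[1]), pairs, text)

# Same order as the statement above (innermost REPLACE first).
same_order = [('en', 'English'), ('fr', 'French'), ('it', 'Italian'), ('sp', 'Spanish')]
print(chained_replace('en, fr', same_order))   # English, French

# Replacing 'fr' before 'en' corrupts the result: 'French' contains 'en'.
other_order = [('fr', 'French'), ('en', 'English')]
print(chained_replace('en, fr', other_order))  # English, FrEnglishch
```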
Personally, I'd create a scalar function, again using the REPLACE command, but you can then check the number of languages present and add a counter so that you're not doing unnecessary lookups.
SELECT [UserId],[DefaultLanguage],
CASE
WHEN [OtherLanguages] IS NULL THEN ''
WHEN [OtherLanguages] = '' THEN ''
ELSE dbo.function_name([OtherLanguages])
END as [OtherLanguages]
FROM YourTable
It might not be good practice, but there are times when it is more efficient to store multiple values in a single field. Just accept that when you do, it will slow down the way you handle that data.

ANTLR is taking the wrong branch

I have this very simple grammar:
grammar LispExp;
expression : LITERAL #LiteralExp
| '(' '-' expression ')' #UnaryMinusExp
| '(' OP expression expression ')' #OpExp
| '(' 'if' expression expression expression ')' #IfExp;
OP : '+' | '-' | '*' | '/' | '==' | '<';
LITERAL : '0'|('1'..'9')('0'..'9')*;
WS : ('\t' | '\n' | '\r' | ' ') -> skip;
It should be able to parse a "lisp-like" expression, but when I try to parse this:
(+ (+ 5 (* 7 (/ 5 (- 2 (- 9) ) ) ) ) 8)
ANTLR fails to recognize the last unary minus, and generates the following (with antlr v4) :
(expression ( + (expression ( + (expression 5) (expression ( * (expression 7) (expression ( / (expression 5) (expression ( - (expression 2))) ( -) 9 )) expression ))
So, how can I make ANTLR understand the priority of unary minus over binary expression?
You are using a combined grammar LispExp, as opposed to separate lexer grammar LispExpLexer and parser grammar LispExpParser. When working with combined grammars, if you use a string literal in a parser rule the code generator will create anonymous tokens according to those string literals, and silently override the lexer.
In this case, your expression rule includes the string literal '-'. All instances of - in your input will be assigned this token type, which means they will not ever have the token type OP. Your input contains a subexpression (- 2 (- 9) ) which can only be parsed if the first - is an OP token, so according to the parser you have a syntax error in your input.
If you update your code to use separate lexer and parser grammars, any attempt to use a string literal in the parser grammar which is not defined in the lexer grammar will produce an error when you attempt to generate your lexer and parser.
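The effect on tokenization can be imitated with a small Python lexer. This is a simplified model rather than ANTLR's actual algorithm: it assumes the longest match wins and that ties go to the earliest rule, with the literals from the parser rule listed ahead of OP. The RULES list and the quoted token names are made up for this sketch.

```python
import re

# Ordered token rules. The string literals used in the parser rule come first,
# modelling the implicit tokens that take priority over OP for the same text.
RULES = [
    ("'('", r'\('),
    ("')'", r'\)'),
    ("'-'", r'-'),
    ("'if'", r'if'),
    ('OP', r'==|[+\-*/<]'),
    ('LITERAL', r'0|[1-9][0-9]*'),
]

def tokenize(text):
    """Return (token_type, text) pairs; longest match wins, ties go to the first rule."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # WS -> skip
            pos += 1
            continue
        best = None
        for name, pattern in RULES:
            match = re.compile(pattern).match(text, pos)
            if match and (best is None or len(match.group()) > len(best[1])):
                best = (name, match.group())
        if best is None:
            raise ValueError(f'unexpected character at position {pos}')
        tokens.append(best)
        pos += len(best[1])
    return tokens

for token in tokenize('(- 2 (- 9))'):
    print(token)
```

Every '-' comes out with the type '-' and never OP, so the alternative '(' OP expression expression ')' cannot match (- 2 (- 9)), while '+' in (+ 1 2) is still an OP token.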