Expression optimization

I have an expression involving REAL variables:
xf=w1*x1 + w2*x2 + w3*x3 + w1*y1 + w2*y2 + w3*y3
I want to know whether the (Intel Fortran) compiler optimizes it to:
xf=w1*(x1+y1) + w2*(x2+y2) + w3*(x3+y3)
How do I see the expression tree which was generated for this expression?

Standard common-subexpression-elimination schemes would not perform the above transformation: it is a factoring that relies on distributivity, not the reuse of a repeated subexpression. Some languages also regard such a transformation as illegal, because floating-point arithmetic is neither associative nor distributive, so the rewritten expression can produce different results.
But high-performance FORTRAN compilers (which probably excludes Intel FORTRAN) might do it.

How do I see the expression tree which was generated for this expression?
If your compiler has an option to display the tree after a given optimization phase, you can use that option. To find out whether that's the case, consult your compiler's documentation. (gfortran, for example, has the -fdump-tree-* family of options for dumping its internal representation after each pass; I am not aware of an equivalent for Intel Fortran.)
If your compiler does not have such an option, you won't be able to see which trees (or other internal representations) the compiler generated during its run. In that case your best bet is simply to look at the generated assembly (most compilers, Intel Fortran included, will emit an assembly listing with the -S option) and count the arithmetic operations: the original form needs six multiplications, the factored form only three.

Related

Lucene operator precedence for boolean operators

What is the order of operations for boolean operators? Left to right? Right to left? Specific operators have higher priority?
For example, if I search for:
jakarta OR apache AND website
What do I get? Is it
Anything with "jakarta" as well as anything with both "apache" and "website"?
Anything with "website" that also has either "jakarta" or "apache"?
Something else?
Short answer:
In Lucene, the AND operator takes precedence over the OR operator. So, you are effectively doing this:
jakarta OR (apache AND website)
You can verify this for yourself by parsing your query string and seeing how it converts AND and OR to the "required" and "optional" operators.
And since we are discussing precedence: the NOT operator takes precedence over the AND operator.
But you need to be very careful when dealing with Lucene's so-called "boolean" operators, as they do not behave the way you may expect based on their collective name ("boolean").
(Unfortunately I have never seen any official documentation which provides a citation for these precedence rules - but instead I am relying on empirical observations. See below for more about that. If the documentation for this does exist, that would be great to see.)
Longer answer:
One key thing to understand is that Lucene boolean operators are not really "boolean" in the sense that you may think, based on Boolean algebra, where you use parentheses to help avoid ambiguity (or where you need to know what rules a programming language may be applying) - and where everything evaluates to TRUE or FALSE.
Lucene boolean operators serve a subtly different purpose.
They are not purely concerned with TRUE/FALSE inclusion/exclusion, but also concerned with how to score results so that the more relevant results have higher scores than less relevant results.
The Lucene query jakarta OR apache AND website is equivalent to the following:
jakarta +apache +website
This means the document's field must contain apache and website, but may also include jakarta (for a higher relevance score).
You can see this for yourself by taking your original query string and parsing it:
Query query = parser.parse(queryString);
...and then printing the resulting string representation of the query. The + operator is the "required" operator: it requires that the term after the + exists somewhere in the field.
The absence of a + operator means the default of "may", as in "may contain": the term is optional and does not need to be present, as long as some other clause in the query matches a document.
The use of AND forces the terms on either side of the AND to be required.
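A minimal, self-contained sketch of that check (the field name "content", the analyzer, and the class name are arbitrary choices here; the two-argument QueryParser constructor is the one found in recent Lucene versions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class PrecedenceDemo {
    public static void main(String[] args) throws Exception {
        // "content" is an arbitrary field name, used only for this demo.
        QueryParser parser = new QueryParser("content", new StandardAnalyzer());
        Query query = parser.parse("jakarta OR apache AND website");
        // Prints: jakarta +apache +website
        // showing that AND bound more tightly than OR.
        System.out.println(query.toString("content"));
    }
}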
You can encounter some potentially surprising situations.
Consider this:
foo AND bar OR baz AND bat
This parses to the following:
+foo +bar +baz +bat
This is because the AND operators are transformed to + operators for every term, rendering the OR redundant.
It's the same result as if you had written this:
foo AND bar AND baz AND bat
But not the same as this:
(foo AND bar) OR (baz AND bat)
which is parsed to this, where the parentheses are retained:
(+foo +bar) (+baz +bat)
Bottom line:
Use parentheses to make your intentions explicit whenever you combine AND, OR and NOT.
Regarding NOT, since we mentioned it: NOT takes precedence over AND.
The query:
foo AND bar NOT baz AND bat
Is parsed as:
+foo +bar -baz +bat
So, a document field must contain foo, bar and bat - and must not contain baz.
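All three of the parses quoted above can be confirmed the same way (reusing the parser and field name from the earlier sketch):

System.out.println(parser.parse("foo AND bar OR baz AND bat").toString("content"));
// prints: +foo +bar +baz +bat
System.out.println(parser.parse("(foo AND bar) OR (baz AND bat)").toString("content"));
// prints: (+foo +bar) (+baz +bat)
System.out.println(parser.parse("foo AND bar NOT baz AND bat").toString("content"));
// prints: +foo +bar -baz +bat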
Why does this situation exist?
I don't know, but I think Lucene originally did not include AND, OR and NOT - but instead used + (must include), - (must not include) and "nothing" (may include). The so-called boolean operators AND, OR, NOT were added later on, as a kind of "syntactic sugar" for these original operators - introduced for people who were more familiar with AND, OR and NOT from other contexts. I'm basing this on the following thread:
Getting a Better Understanding of Lucene's Search Operators
A summary of that thread is included in this answer about the NOT operator.

Disambiguating predicates vs gated predicates in ANTLR v3

Here's an excerpt from an ANTLR grammar I'm working with:
expression: // ... some other stuff ...
    ( { switch_expression_enabled() }?=> switch_expression
    | { complex_expression_enabled() }? complex_expression
    | simple_expression
    )
The functions switch_expression_enabled() and complex_expression_enabled() check compiler flags to figure out whether the corresponding language features should be enabled. As you can see, the first alternative uses a gated predicate (which seems to be the correct one to use according to the documentation), while the second one uses a disambiguating predicate.
Judging from the descriptions in the official documentation as well as here and here, I'd expect the definition of the second alternative to be incorrect. However, it turns out that it works in exactly the same way: If complex_expression_enabled() returns false, then I get a syntax error if I use a complex_expression, even if the input is not ambiguous, so the term "disambiguating predicate" seems to be a bit misleading. The only difference I can see in the generated code is that in case of gated predicates, the condition is checked twice (before and after choosing alternative 1), while the "disambiguating" predicate is only checked after choosing alternative 2.
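In sketch form, the difference in the generated code looks roughly like this (heavily simplified; the predicts...() helpers and the other identifiers are invented stand-ins for what ANTLR actually emits):

int alt;
if (predictsSwitchExpression() && switch_expression_enabled()) {
    alt = 1;   // gated predicate is consulted during prediction
} else if (predictsComplexExpression()) {
    alt = 2;   // disambiguating predicate is NOT consulted here
} else {
    alt = 3;
}
switch (alt) {
    case 1:
        // gated predicate is checked a second time inside the alternative
        if (!switch_expression_enabled())
            throw new FailedPredicateException(input, "expression", "switch_expression_enabled()");
        switch_expression();
        break;
    case 2:
        // disambiguating predicate is checked only here, after the choice
        if (!complex_expression_enabled())
            throw new FailedPredicateException(input, "expression", "complex_expression_enabled()");
        complex_expression();
        break;
    default:
        simple_expression();
}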
So my question is: Is there any practical difference between using gated and disambiguating predicates for disabling grammar based on compiler flags?

Why not have operators as both keywords and functions?

I saw this question and it got me wondering.
Ignoring the fact that pretty much all languages have to be backwards compatible, is there any reason we cannot use operators as both keywords and functions, depending on whether the operator is immediately followed by a parenthesis? Would it make the grammar harder?
I'm thinking mostly of Python, but also of C-like languages.
Perl does something very similar to this, and the results are sometimes surprising. You'll find warnings about this in many Perl texts; for example, this one comes from the standard distributed Perl documentation (man perlfunc):
Any function in the list below may be used either with or without parentheses around its arguments. (The syntax descriptions omit the parentheses.) If you use parentheses, the simple but occasionally surprising rule is this: It looks like a function, therefore it is a function, and precedence doesn't matter. Otherwise it's a list operator or unary operator, and precedence does matter. Whitespace between the function and left parenthesis doesn't count, so sometimes you need to be careful:
print 1+2+4; # Prints 7.
print(1+2) + 4; # Prints 3.
print (1+2)+4; # Also prints 3!
print +(1+2)+4; # Prints 7.
print ((1+2)+4); # Prints 7.
An even more surprising case, which often bites newcomers:
print
($a % 7 == 0 || $a % 7 == 1) ? "good" : "bad";
will print the value of the boolean expression (1 or the empty string) rather than "good" or "bad", because the parenthesized expression binds to print.
In short, it depends on your theory of parsing. Many people believe that parsing should be precise and predictable, even when that results in surprising parses (as in the Python example in the linked question, or even more famously, C++'s most vexing parse). Others lean towards Perl's "Do What I Mean" philosophy, even though the result -- as above -- is sometimes rather different from what the programmer actually meant.
C, C++ and Python all tend towards the "precise and predictable" philosophy, and they are unlikely to change now.
Whether not() is defined at all depends on the language. If not() is not defined in a given language, you cannot use it. And why would a language leave not() undefined? Probably because its creator saw no need for that construction: it is better to keep things simple.

Preferentially match shorter token in ANTLR4

I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard test cases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (e.g. recursive descent, packrat, PEG and probably various others... Coco/R is one reasonable package to do this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to this new generator), I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
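(For completeness, the Java-target spelling of the same hack, per the note above, would be:)
MONTH: 'mo' { _input.LA(1) != 's' && _input.LA(1) != 'l' && _input.LA(1) != '_' }? ;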
but this isn't really a very extensible or maintainable solution, and as likely as not it will have introduced other subtle issues I've not yet come across.
Your problem does not actually require shorter tokens to be preferred (in that case MONTH would never be matched at all). What you need is backtracking behaviour that depends on whether the rest of the input can be matched. Right?
ANTLR separates tokenization and parsing strictly. Consequently every solution to your problem will seem like a hack.
However, other parser generators are specialized for problems like yours. Packrat parsers (PEGs) backtrack and tokenize on the fly. Try out parboiled for this purpose.
It appears that the question is not framed correctly.
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
But, according to the UCUM:
The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units.
The most one can expect from the lexer is an unambiguous identification of the measurement string, without regard to its semantic value. Similarly, a parser alone will be unable to choose correctly between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, either structural (explicitly delimiting the problem space) or contextual (considering the other units present in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. Antlr could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.

Handling expressions in GMP

I recently introduced myself to the GMP library for high-precision arithmetic. It seems easy enough to use, but in my first program I am running into practical problems. How are expressions to be evaluated? For instance, if I have "1+8*z^2" and z is an mpz_t "large integer" variable, how am I to quickly evaluate this? (I have larger expressions in the program that I am writing.) Currently, I am doing every single operation manually and storing the results in temporary variables, like this for the "1+8*z^2" expression:
1) first do mpz_mul(z,z,z) to square z
2) then define an mpz_t variable called "eight" with the value 8.
3) multiply the result from step one by this 8 and store in temp variable.
4) define mpz_t variable called "one" with value 1.
5) add this to the result in step 3 to find final answer.
Is this what I am supposed to be doing? Or is there a better way? It would really help if there was a user's manual for GMP to get people started but there's only the reference manual.
GMP comes with a C++ class interface which provides a more straightforward way of expressing arithmetic expressions. This interface uses C++ operator overloading to allow you to write:
mpz_class z, result;
result = 1 + 8 * z * z;   // C++ has no "**" power operator, so the square is spelled z*z
This is, of course, assuming you're using C++. If you are using C only, you may need to use the C interface to GMP which does not provide operator overloading.
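If you do stay with the C interface, the five manual steps in the question can be collapsed considerably: GMP provides _ui variants of most functions that accept an ordinary unsigned long directly, so no mpz_t temporaries are needed for the constants 8 and 1. A minimal C sketch (the starting value for z is an arbitrary example):

#include <gmp.h>

int main(void) {
    mpz_t z, result;
    mpz_init_set_ui(z, 12345);      /* arbitrary example value for z */
    mpz_init(result);

    mpz_mul(result, z, z);          /* result = z^2       */
    mpz_mul_ui(result, result, 8);  /* result = 8*z^2     */
    mpz_add_ui(result, result, 1);  /* result = 1 + 8*z^2 */

    gmp_printf("%Zd\n", result);    /* print the big integer */

    mpz_clear(z);
    mpz_clear(result);
    return 0;
}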
It turns out that there's an unsupported expression parser distributed with GMP in an "expr" subdirectory. It's not part of GMP proper and is subject to change, but it is discussed in a README file in that directory. It isn't guaranteed to do the calculation in the fastest way possible, so buyer beware.
So the user must evaluate all expressions manually when using GMP, unless they wish to use this unsupported parser or write an expression parser of their own.