Why is a line break allowed before `.` but not before `+` in Kotlin?

I am trying to understand this answer: https://stackoverflow.com/a/44180583/481061 and particularly this part:
if the first line of the statement is a valid statement, it won't work:
val text = "This " + "is " + "a "
    + "long " + "long " + "line" // syntax error
This does not seem to be the case for the dot operator:
val text = obj
    .getString()
How does this work? I'm looking at the grammar (https://kotlinlang.org/docs/reference/grammar.html) but am not sure what to look for to understand the difference. Is it built into the language outside of the grammar rules, or is it a grammar rule?

It is a grammar rule, but I was looking at an incomplete grammar.
In the full grammar https://github.com/Kotlin/kotlin-spec/blob/release/grammar/src/main/antlr/KotlinParser.g4 it's made clear in the rules for memberAccessOperator and identifier.
The DOT can always be preceded by NL* while the other operators cannot, except in parenthesized contexts which are defined separately.
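In practice this means a long expression can be split either by leaving the operator at the end of the line or by starting the next line with a dot. A minimal sketch (uppercase and trim are just stand-ins for any chained calls):
fun main() {
    // A trailing '+' keeps the expression open across the line break:
    val text = "This " + "is " + "a " +
            "long " + "long " + "line"

    // A leading '.' also parses, because the grammar allows NL* before DOT:
    val result = text
        .uppercase()
        .trim()

    println(result)
}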

Related

BNF parentheses and pipe

In Backus–Naur Form would:
print_stmt : "print" (string | expr)+
match to:
print string
print expr
or
print (string)
print (expr)
I'm not sure whether the parentheses have to be there or not.
It would match either of the first two options, and a number of other possibilities.
In this dialect of BNF, it appears that the parentheses are metacharacters. The + probably means 'one or more' of the previous units, but if the ) was repeated one or more times, it would be a very unusual language. If the + was absent, then either interpretation would be reasonable and I couldn't give as confident an answer; you would have to go back and find the specification for the dialect of BNF you're interpreting.
Because of the +, this should also be valid:
print string string expr string
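If you wanted the parentheses to be literal characters of the language being defined, rather than grouping metacharacters, this dialect would presumably require quoting them, along the lines of this sketch:
print_stmt : "print" "(" (string | expr)+ ")"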

Why don't parser combinators backtrack in case of failure?

I looked through the Artima guide on parser combinators, which says that we need to append failure(msg) to our grammar rules to make error reporting meaningful for the user:
def value: Parser[Any] =
obj | stringLit | num | "true" | "false" | failure("illegal start of value")
This breaks my understanding of the recursive mechanism used in these parsers. On one hand, the Artima guide makes sense: if all productions fail, the parser will arrive at the failure("illegal start of value") alternative, which is returned to the user. On the other hand, it stops making sense once we understand that a grammar is not a flat list of alternatives but a tree. That is, the value parser is a node that is invoked when a value is expected in the input. This means that the calling parser, which is its parent, detects the failure of value and proceeds with value's sibling alternative. Suppose all the alternatives to value also fail. The grandparent parser will then try its alternatives and, failing in turn, the process unwinds upward until the start symbol parser fails. So, what will the error message be? It seems that the last alternative of the topmost parser would be the one reported as erroneous.
To figure out who is right, I created a demo where program is the topmost (start symbol) parser:
import scala.util.parsing.combinator._

object ExprParserTest extends App with JavaTokenParsers {
  // Grammar
  val declaration = wholeNumber ~ "to" ~ wholeNumber | ident | failure("declaration not found")
  val term = wholeNumber | ident
  lazy val expr: Parser[_] = term ~ rep("+" ~ expr)
  lazy val statement: Parser[_] =
    ident ~ " = " ~ expr | "if" ~ expr ~ "then" ~ rep(statement) ~ "else" ~ rep(statement)
  val program = rep(declaration) ~ rep(statement)

  // Test
  println(parseAll(program, "1 to 2"))  // OK
  println(parseAll(program, "1 to '2")) // failure, regex `-?\d+' expected but `'' found at '2
  println(parseAll(program, "abc"))     // OK
}
It fails on 1 to '2 due to the extra ' tick. It seems to get stuck in the program -> declaration -> num "to" num rule and does not even try the ident and failure("declaration not found") alternatives! It does not backtrack to the statements either, for the same reason. So neither my guess nor the Artima guide seems right about what parser combinators actually do. I wonder: what is the real logic behind rule selection, backtracking, and error reporting in parser combinators? Why does the error message suggest that no backtracking to declaration -> ident | failure(), nor to the statements, occurred? What is the point of the Artima guide suggesting to place failure() at the end if, as we see, it is either never reached or its result is ignored?
Isn't a parser combinator just a plain PEG? It behaves like a predictive parser. I expected it to be a PEG and, thus, that the start symbol parser would return all the failed branches; I wonder why/how the actual parser manages to select the most appropriate failure to report.
Many parser combinators backtrack, unless they're in an 'or' block. As a speed optimization, they'll commit to the 1st successful 'or' item and not backtrack. So 1) try to avoid '|' as much as possible in your grammar, and 2) if using '|' is unavoidable, place the longest or least-likely-to-match items first.
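To see the commit-to-first-success behaviour of '|' concretely, here is a small sketch using the same scala.util.parsing.combinator library (the bad/good names and the toy rules are mine): once an alternative of '|' succeeds, it is never reconsidered, even if the overall parse fails later, so order matters.
import scala.util.parsing.combinator._

object OrderDemo extends App with JavaTokenParsers {
  // ident also matches the keyword "if", so putting it first
  // shadows the keyword alternative:
  val bad  = ident | "if" ~ ident
  val good = "if" ~ ident | ident

  // '|' committed to ident once it matched "if", so the keyword
  // alternative is never retried when the overall parse fails:
  println(parseAll(bad, "if x"))  // fails: end of input expected
  println(parseAll(good, "if x")) // succeeds: (if~x)
}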

How to avoid WHERE being seen as an attribute when parsing SQL in Haskell

I am parsing SQL in Haskell using Parsec. How can I ensure that a statement with a where clause will not treat the WHERE as a table name? Part of my code is below. p_Combination works, but it sees the WHERE as part of the list of attributes instead of as the where clause.
-- from clause
data Table_clause = Table { table_name :: String, alias :: Maybe String } deriving Show

p_Table_clause :: Parser Table_clause
p_Table_clause = do
  t <- word
  skipMany (space <?> "require space at the Table clause")
  a <- optionMaybe (many1 alphaNum) <?> "alias for table or nothing"
  return $ Table t a

newtype From_clause = From [Table_clause] deriving Show

p_From_clause :: Parser From_clause
p_From_clause = do
  string "FROM" <?> "From"
  skipMany1 (space <?> "space in the from clause")
  x <- sepBy p_Table_clause (many1 (char ',' <|> space))
  return $ From x

-- where clause condition elements
data WhereClause = WhereFCondi String deriving Show

p_WhereClause :: Parser WhereClause
p_WhereClause = do
  string "WHERE"
  skipMany1 space
  x <- word
  return $ WhereFCondi x

data Combination = FromWhere From_clause (Maybe WhereClause) deriving Show

p_Combination :: Parser Combination
p_Combination = do
  x <- p_From_clause
  skipMany1 space
  y <- optionMaybe p_WhereClause
  return $ FromWhere x y
Normal SQL parsers have a number of reserved words, and they’re often not context-sensitive. That is, even where a where might be unambiguous, it is not allowed simply because it is reserved. I’d guess most implementations do this by first lexing the source in a conceptually separate stage from parsing the lexed tokens, but we do not need to do that with Parsec.
Usually the way we do this with Parsec is by using Text.Parsec.Token. To use it, you first create a LanguageDef defining some basic characteristics about the language you intend to parse: how comments work, the reserved words, whether it’s case sensitive or not, etc. Then you use makeTokenParser to get a record full of functions tailored to that language. For example, identifier will not match any reserved word, they are all careful to require whitespace where necessary, and when they are skipping whitespace, comments are also skipped.
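For example, a token parser for this fragment of SQL might be set up like this (a sketch: the field names come from Text.Parsec.Token, while the reserved-word list and character classes are assumptions):
import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok
import Text.Parsec.Language (emptyDef)

sqlDef :: Tok.LanguageDef ()
sqlDef = emptyDef
  { Tok.reservedNames = ["SELECT", "FROM", "WHERE"]
  , Tok.identStart    = letter
  , Tok.identLetter   = alphaNum
  , Tok.caseSensitive = False  -- SQL keywords are case-insensitive
  }

lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser sqlDef

identifier :: Parser String
identifier = Tok.identifier lexer  -- never matches a reserved word

reserved :: String -> Parser ()
reserved = Tok.reserved lexer      -- matches one keyword, then skips whitespace
With that in place, table names and aliases can be read with identifier and the where clause can start with reserved "WHERE", so a WHERE can never be mistaken for an alias.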
If you want to stay with your current approach, using only Parsec primitives, you’ll probably want to look into notFollowedBy. This doesn’t do exactly what your code does, but it should provide some inspiration about how to use it:
string "FROM" >> many1 space
tableName <- many1 alphaNum <* many1 space
aliasName <- optionMaybe $ notFollowedBy (string "WHERE" >> many1 space)
>> many1 alphaNum <* many1 space
Essentially:
Parse a FROM, then whitespace.
Parse a table name, then whitespace.
If WHERE followed by whitespace is not next, parse an alias name then whitespace.
I guess the problem is that p_Table_clause accepts "WHERE". To fix this, check for "WHERE" and fail the parser:
p_Table_clause = do
  t <- try (do w <- word
               if w == "WHERE"
                 then unexpected "keyword WHERE"
                 else return w)
  ...
I guess there might be a try missing in sepBy p_Table_clause (many1 (char ',' <|> space)). I would try sepBy p_Table_clause (try (many1 (char ',' <|> space))).
(Or actually, I would follow the advice from the Parsec documentation and define a lexeme combinator to handle whitespace).
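Such a lexeme combinator is tiny; a sketch (the symbol helper is mine):
-- Run a token parser, then consume any trailing whitespace.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol = lexeme . string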
I don't really see the combinator you need right away, if there is one. Basically, you need p_Combination to try (string "WHERE" >> skipMany1 space) and if it succeeds, parse a WHERE "body" and be done. If it fails, try p_Table_clause, if it fails be done. If p_Table_clause succeeds, read the separator and loop. After the loop is done prepend the Table_clause to the results.
There are some other problems with your parser, too. many1 (char ',' <|> space) matches " ,,, , ,, ", which is not a valid separator between tables in a from clause, for example. Also, SQL keywords are case-insensitive, IIRC.
In general, you want to exclude keywords from matching identifiers, with something like:
-- assumes a Keyword type deriving Show, identifierStart / identifierPart
-- parsers for your identifier characters, and liftA2 from Control.Applicative
keyword :: Parser Keyword
keyword =     (string "WHERE"  >> return KW_Where)
          <|> (string "FROM"   >> return KW_From)
          <|> (string "SELECT" >> return KW_Select)

identifier :: Parser String
identifier =
      (try (keyword <* notFollowedBy identifierPart)
         >>= \kw -> fail ("Expected identifier; got: " ++ show kw))
  <|> liftA2 (:) identifierStart (many identifierPart)
If two (or more) of your keywords have common prefixes, you'll want to factor the prefixes out for more efficiency (less backtracking), like:
keyword :: Parser Keyword
keyword = (char 'D' >> (    (string "ROP"   >> return KW_Drop)
                        <|> (string "ELETE" >> return KW_Delete)))
      <|> (string "INSERT" >> return KW_Insert)

Lucene QueryParser interprets 'AND OR' as a command?

I am calling Lucene using the following code (PyLucene, to be precise):
analyzer = StandardAnalyzer(Version.LUCENE_30)
queryparser = QueryParser(Version.LUCENE_30, "text", analyzer)
query = queryparser.parse(queryparser.escape(querytext))
But consider if this is the content of querytext:
querytext = "THE FOOD WAS HONESTLY NOT WORTH THE PRICE. MUCH TOO PRICY WOULD NOT GO BACK AND OR RECOMMEND IT"
In that case, the "AND OR" trips up the queryparser, even though I am using queryparser.escape. How do I avoid the following error message?
Java stacktrace:
org.apache.lucene.queryParser.ParseException: Cannot parse 'THE FOOD WAS HONESTLY NOT WORTH THE PRICE. MUCH TOO PRICY WOULD NOT GO BACK AND OR RECOMMEND IT': Encountered " <OR> "OR "" at line 1, column 80.
Was expecting one of:
<NOT> ...
"+" ...
"-" ...
"(" ...
"*" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...
<TERM> ...
"*" ...
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:187)
....
at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1759)
at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:1641)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1268)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1207)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1167)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:182)
It's not just OR, it's AND OR.
I use the following workaround:
query = queryparser.parse(queryparser.escape(querytext.replace("AND OR", "AND or")))
queryparser.escape only escapes special characters (as shown on this page) and leaves "AND OR" unchanged, so it does not help in your case. Since presumably you also used StandardAnalyzer to analyze your text, the terms in your index are already lowercase. So you can lowercase the whole query string before giving it to the queryparser. Lowercase "and" and "or" are not treated as operators, so "and or" will not trip up the queryparser.
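In PyLucene terms, the lowercasing workaround is then simply (a sketch):
query = queryparser.parse(queryparser.escape(querytext.lower()))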
I realise I'm rather late to the party here, but putting quotes round the search string is a better option:
querytext = "\"THE FOOD WAS ... \""

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSLs for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
[Screenshot: "Decision Can Match NL Using Multiple Alternatives" (http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png)]
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser (see the lexer sketch after these two points).
Always generate an NL token at the end of the file, whether or not the last line ends with a '\n' character; then you don't have to worry about the special case of a statement without an NL. Statements always end with an NL.
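A sketch of the first point in ANTLR 3 syntax, with Java actions (the rule names are mine; brackets and braces would get the same treatment as parentheses):
@lexer::members { int nesting = 0; }

LPAREN : '(' { nesting++; } ;
RPAREN : ')' { nesting--; } ;

NL
    : ('\r'? '\n')+
      { if (nesting > 0) $channel = HIDDEN; }  // swallow NL inside unclosed groups
    ;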
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aids for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.
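The appending can happen entirely outside the grammar; with the ANTLR 3 Java runtime it is one line when you build the input stream (a sketch: MyLexer and MyParser stand for your generated classes, and exprlist for your entry rule):
import org.antlr.runtime.*;

// Guarantee the trailing NL by appending it to the raw input:
ANTLRStringStream in = new ANTLRStringStream(script + "\n");
MyLexer lexer = new MyLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
MyParser parser = new MyParser(tokens);
parser.exprlist();  // invoke the entry rule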