I am making a program that loads a CFG from a file and uses it to parse programming languages into a syntax tree.
What is the proper way to define an identifier in a context-free grammar? For now, I have a format like this:
IdentifierStart => $l$ | _;
IdentifierChar => "$l$$IdentifierChar$" | "_$IdentifierChar$" | "$i$$IdentifierChar$" | $e$;
Identifier => "$IdentifierStart$$IdentifierChar$" $w$;
Format:
$l$ = any letter
$e$ = epsilon
$i$ = any integer
$o$ = any operator
$n$ = new line
$w$ = whitespace
$a$ = any atom
Quotes mean the whitespace needs to match the inside of the quotes
While this does work, it is inefficient because it creates a deep tree when each letter could justifiably just be listed next to each other. For example,
Pragma => $n$ "#direct" $w$ $String$;
with the rules:
IdentifierStart => $l$ | _;
IdentifierChar => "$l$$IdentifierChar$" | "_$IdentifierChar$" | "$i$$IdentifierChar$" | $e$;
Identifier => "$IdentifierStart$$IdentifierChar$";
Symbol => $l$ | $i$ | $o$ | \$$Identifier$\$;
Def => $Symbol$ $Def$ | $Symbol$ | "Def";
Assignment => $Def$ \| $Assignment$ | $Def$;
Definition => $Identifier$ "=>" $Assignment$\;;
creates the following tree (where each space represents a level in the tree):
Definition:Pragma => $n$ "#pragma" $w$ $String$;
 Identifier:Pragma
  IdentifierStart:P
   Terminal:P
  IdentifierChar:ragma
   Terminal:r
   IdentifierChar:agma
    Terminal:a
    IdentifierChar:gma
     Terminal:g
     IdentifierChar:ma
      Terminal:m
      IdentifierChar:a
       Terminal:a
 Terminal:=
 Terminal:>
 Assignment:$n$ "#direct" $w$ $String$
 ...
While this is fine in the case of an identifier, I noticed there was a problem when I realized I had to define the file format in the same recursive manner:
File => $ValidDirective$ $File$;
ValidDirective => $Comment$ | $Include$ | $Define$ | $Undef$ | $IfPart$ | $Error$ | $Pragma$ | $String$;
Each element of the file will be stored in a sub-tree of the previous element! I don't think this is acceptable because in a program with millions of lines, it will be incredibly inefficient.
Is there any way I can fix this problem while staying true to the conventions of a CFG?
A true CFG does indeed define repetition via recursion, which leads to the nested parse tree you observed.
A programming language would typically use regular expressions (or something similar) to define the syntax of symbols like identifiers. In that case, parsing an identifier would result in a single token, rather than a tree, which might answer your concerns about inefficiency.
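As a rough illustration of the token idea, here is a minimal Python sketch. The pattern and the helper name read_identifier are my own assumptions, chosen to mirror the IdentifierStart/IdentifierChar rules in the question; it is not taken from any particular lexer.

```python
import re

# Hypothetical token pattern mirroring IdentifierStart/IdentifierChar:
# a letter or underscore, followed by letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def read_identifier(text, pos=0):
    """Return (token, next_pos) if an identifier starts at pos, else None."""
    match = IDENTIFIER.match(text, pos)
    if match is None:
        return None
    return match.group(0), match.end()

print(read_identifier("Pragma => $n$"))  # ('Pragma', 6)
```

The whole identifier comes back as one string, so the parser sees a single leaf instead of one tree level per character.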
However, that approach doesn't apply to higher-level repetitive constructs, e.g. a StatementList or ArgumentList: for those, regular expressions are insufficient, and you need something at least as 'powerful' as a CFG. It's unclear if you think that storing a StatementList or ArgumentList as a deeply-nested tree is inefficient.
If you're obliged to use a true CFG, and you don't have control of the data structures created during the parsing process, you could run a post-process that converts recursive structures into non-recursive structures, but you may find that the efficiency gains are not that much.
Most programming language grammars don't confine themselves to pure CFGs, though.
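The post-processing step mentioned above could look something like the following Python sketch. FileNode and flatten are hypothetical names, and I am assuming the File rule also has an empty alternative so that the chain ends somewhere.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FileNode:
    """Hypothetical node for the rule File => ValidDirective File | epsilon."""
    directive: str
    rest: Optional["FileNode"] = None

def flatten(node):
    """Walk the right-recursive chain iteratively and collect directives."""
    items = []
    while node is not None:
        items.append(node.directive)
        node = node.rest
    return items

tree = FileNode("Include", FileNode("Define", FileNode("Pragma")))
print(flatten(tree))  # ['Include', 'Define', 'Pragma']
```

The loop is deliberately iterative, so a file with millions of directives does not run into a recursion limit while being flattened.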
I have a 300-line jq program that runs for literally hours on the files I deal with (plain lists of 200K-2.5M JSON objects, 500MB-6GB in size).
At first glance the code looks linear in complexity, but I could easily be missing something.
Are there common traps to be aware of in terms of code complexity in jq? Or are there tools to identify the key bottlenecks in my code?
I'm a bit reluctant to make my code public, because of its size and complexity on the one hand, and its somewhat proprietary nature on the other.
PS. Trimming the input file to keep only the most relevant objects AND pre-deflating it to keep only the fields I need are obvious steps towards optimizing my processing flow. I'm wondering what can be done specifically on the query complexity side.
Often, a program that takes longer than expected is also producing incorrect results, so perhaps the first thing to check is that the results are correct. If they are, then the following might be worth checking:
avoid slurping (i.e., use input and/or inputs in preference);
beware of functions with arity greater than 0 that call themselves;
avoid recomputing intermediate results unnecessarily, e.g. by storing them in $-variables, or by including them in a filter's input;
use functions with "short-circuit" semantics when possible, notably any and all;
use limit/2, first/1, and/or foreach as appropriate;
the implementation of index/1 on arrays can be a problem for large arrays, as it first computes all the indices;
remember that unique and group_by should be used carefully, since both involve a sort;
use bsearch for insertion and for binary search for an item in a sorted array;
using JSON objects as dictionaries is generally a good idea.
Note also that the streaming parser (invoked with the --stream option) is designed to make the tradeoff between time and space in favor of the latter. It succeeds!
Finally, jq is stream-oriented, and using streams is sometimes more efficient than using arrays.
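Purely as an analogy (this is Python, not jq), the difference between slurping and streaming can be sketched as follows. The function name stream_objects and the sample records are made up for illustration.

```python
import io
import json

def stream_objects(lines):
    """Yield one parsed JSON object per line, without building a list."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical input: one JSON object per line.
data = io.StringIO(
    '{"id": 1, "size": 10}\n{"id": 2, "size": 25}\n{"id": 3, "size": 7}\n'
)

# Only one object is held in memory at a time while the total accumulates.
total = sum(obj["size"] for obj in stream_objects(data))
print(total)  # 42
```

In jq terms, this corresponds roughly to reducing over inputs instead of slurping the whole file into one array first.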
Since you are evidently not a beginner, the likelihood of your making beginners' mistakes seems small, so if you cannot figure out a way to share some details about your program and data, you might try breaking up the program so you can see where the computing resources are being consumed. Well-placed debug statements can be helpful in that regard.
The following filters for computing the elapsed clock time might also be helpful:
def time(f):
  now as $start | f as $out | (now - $start | stderr) | "", $out;
def time(f; $msg):
  now as $start | f as $out | ("\(now - $start): \($msg)" | stderr) | "", $out;
Example
def ack(m;n):
  m as $m | n as $n
  | if $m == 0 then $n + 1
    elif $n == 0 then ack($m-1; 1)
    else ack($m-1; ack($m; $n-1))
    end ;
time( ack(3;7) | debug)
Output:
["DEBUG:",1021]
0.7642250061035156
1021
I am writing a simple compiler as a school assignment. I am looking for an automated approach to generate both positive and negative test data for my compiler, given the formal grammar and other specifications. The language I am dealing with is of medium size, with 38 or so non-terminals. For the sake of illustration, here is a snapshot of the grammar:
program: const_decl* declaration* ENDMARKER
# statement
stmt: flow_stmt | '{' stmt* '}' | NAME [stmt_trailer] ';' | ';'
stmt_trailer: arglist | ['[' expr ']'] '=' expr
flow_stmt: if_stmt | for_stmt | while_stmt | read_stmt ';' | write_stmt ';' | return_stmt ';'
return_stmt: 'return' ['(' expr ')']
if_stmt: 'if' '(' condition ')' stmt ['else' stmt]
condition: expr ('<'|'<='|'>'|'>='|'!='|'==') expr | expr
for_stmt: ('for' '(' NAME '=' expr ';' condition ';'
NAME '=' NAME ('+'|'-') NUMBER ')' stmt)
Are there any tools that generate input files with the help of the grammar? Hand-written tests are either too tedious to write or too weak to discover problems. Here is an example program in this language:
void main() {
    int N;
    int temp;
    int i, j;
    int array_size;
    reset_heap;
    scanf(N);
    for (i = 0; i < N; i = i + 1) {
        scanf(array_size);
        if (array_size > max_heap_size) {
            printf("array_size exceeds max_heap_size");
        } else {
            for (j = 0; j < array_size; j = j + 1) {
                scanf(temp);
                heap[j] = temp;
            }
            heap_sort(array_size);
            print_heap(array_size);
        }
    }
}
Generating controllable test data automatically can save days. Given the simplicity of the language, there must be some way to do this effectively. Any pointers and insights are greatly appreciated.
This should have the subtopic of How to avoid combinatorial explosion when generating test data.
While I would not be surprised if there are tools to do this, having had the same need to generate test data for grammars, I have created a few one-off applications.
One of the best series of articles I have found on this is by Eric Lippert, Every Binary Tree There Is. When you read "tree", think of BNF converted to binary operators and then converted to an AST. However, he uses Catalan trees (every branch has two leaves), and when I wrote my app I preferred Motzkin trees (a branch can have one or two leaves).
Also he did his in C# with LINQ and I did mine in Prolog using DCG.
Generating the data based on the BNF or DCG is not hard; the real trick is to limit the area of expansion and the size of the expansion, and to inject bad data.
By area of expansion, let's say you want to test nested if statements three levels deep, but have to have valid code that compiles. Obviously you need the boilerplate code to make it compile, and then you start changing the deeply nested if by adding or removing the else clause. So you need to put in constraints so that the boilerplate code is constant and the testing part is variable.
By size of expansion, let's say that you want to test conditional expressions. You can easily calculate that if you have many operators and you want to test them all in combinations, you soon run into combinatorial explosion. The trick is to ensure you test deep enough and with enough breadth, but not every combination. Again, the judicious use of constraints helps.
So the point of all of this is that you start with a tool that takes in the BNF and generates valid code. Then you modify the BNF to add constraints and modify the generator to understand the constraints to generate the code examples.
Then you modify the BNF for invalid data and likewise the generator to understand those rules.
After that is working you can then start layering on levels of automation.
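To make the idea of a constrained generator a bit more concrete, here is a small Python sketch. It is not my Prolog code; the toy grammar, the FALLBACK table and the depth budget are all assumptions made for illustration, loosely modeled on the statement rules in the question.

```python
import random

# Hypothetical toy grammar in the spirit of the question's statement rules.
# Nonterminals are dict keys; anything else is emitted as a terminal.
GRAMMAR = {
    "stmt": [["if_stmt"], ["NAME", "=", "expr", ";"], [";"]],
    "if_stmt": [["if", "(", "expr", ")", "stmt"],
                ["if", "(", "expr", ")", "stmt", "else", "stmt"]],
    "expr": [["NAME"], ["NUMBER"], ["expr", "+", "expr"]],
}
TERMINAL_SAMPLES = {"NAME": ["i", "temp"], "NUMBER": ["0", "1"]}

# Shortest alternative per nonterminal, used once the depth budget runs out.
FALLBACK = {
    "stmt": [";"],
    "if_stmt": ["if", "(", "expr", ")", "stmt"],
    "expr": ["NUMBER"],
}

def generate(symbol, depth, rng):
    """Expand symbol into a list of tokens, bounding recursion by depth."""
    if symbol in TERMINAL_SAMPLES:
        return [rng.choice(TERMINAL_SAMPLES[symbol])]
    if symbol not in GRAMMAR:
        return [symbol]  # literal terminal such as 'if' or ';'
    alternative = FALLBACK[symbol] if depth <= 0 else rng.choice(GRAMMAR[symbol])
    tokens = []
    for part in alternative:
        tokens.extend(generate(part, depth - 1, rng))
    return tokens

rng = random.Random(0)  # fixed seed so a failing case can be reproduced
for _ in range(3):
    print(" ".join(generate("stmt", 4, rng)))
```

The depth budget is the simplest possible constraint on the size of expansion. Pinning the boilerplate and varying only one nonterminal would be a constraint on the area of expansion.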
If you do go this route and decide that you will have to learn Prolog, take a look at Mercury first. I have not done this with Mercury, but if I do it again Mercury is high on the list.
While my actual code is not public, this and this is the closest to it that is public.
Along the way I had some fun with it in Code Golf.
When generating terminals such as reserved words or values for types, you can use predefined lists with both valid and invalid data. E.g., for the keyword if, if the language is case sensitive, I would include if, If, IF, iF, etc. in the list. For value types such as unsigned byte I would include -1, 0, 255 and 256.
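Such lists are easy to produce mechanically. A minimal Python sketch, with helper names (case_variants, boundary_values) that I made up for this example:

```python
import itertools

def case_variants(word):
    """All upper/lower-case spellings of word, e.g. if, iF, If, IF."""
    choices = [(ch.lower(), ch.upper()) for ch in word]
    return ["".join(combo) for combo in itertools.product(*choices)]

def boundary_values(low, high):
    """Values just outside and on each end of an inclusive range."""
    return [low - 1, low, high, high + 1]

print(case_variants("if"))      # ['if', 'iF', 'If', 'IF']
print(boundary_values(0, 255))  # [-1, 0, 255, 256]
```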
When I was testing basic binary math expressions with +, -, * and ^, I generated all the tests with the basic numbers -2, -1, 0, 1, and 2. I thought it would be useless since I already had hundreds of test cases, but it only took a few minutes to generate all of the test cases and several hours to run them, and to my surprise it found a pattern I had not covered. The point here is that, contrary to what most people say about having too many test cases, it is only computer time, and changing a few constraints is cheap, so do run the large number of tests.
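For a sense of scale, exhaustively pairing those numbers with those operators is tiny. A Python sketch (the output format "a op b" is an assumption; your language's syntax may differ):

```python
import itertools

NUMBERS = [-2, -1, 0, 1, 2]
OPERATORS = ["+", "-", "*", "^"]

def binary_expressions():
    """Yield every 'a op b' string over NUMBERS and OPERATORS."""
    for left, op, right in itertools.product(NUMBERS, OPERATORS, NUMBERS):
        yield f"{left} {op} {right}"

cases = list(binary_expressions())
print(len(cases))  # 100
print(cases[0])    # -2 + -2
```

Nesting one level deeper multiplies the count again, which is where the constraints discussed earlier start to matter.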
I'm trying to remove elements from a Clojure vector:
Note that I'm using Clojure's operations from Kotlin
val set = PersistentHashSet.create("foo")
val vec = PersistentVector.create("foo", "bar")
val seq = clojure.`core$remove`.invokeStatic(set, vec) as ISeq
val resultVec = clojure.`core$vec`.invokeStatic(seq) as PersistentVector
This is the equivalent of the following Clojure code:
(remove #{"foo"} ["foo" "bar"])
The code works fine, but I've noticed that creating a vector from the seq is extremely slow. I've written a benchmark and these were the results:
| Item count | Remove (ms) | Remove with conversion back to vector (ms) |
|------------|-------------|--------------------------------------------|
| 1000       | 51          | 1355                                       |
| 10000      | 71          | 5123                                       |
Do you know how I can convert the seq resulting from the remove operation back to a vector without the harsh performance penalty?
If it is not possible is there an alternative way to perform the remove operation?
You could try the complementary operation to remove that returns a vector:
(filterv (complement #{"foo"})
["foo" "bar"])
Note the use of filterv. The v indicates that it uses a vector from the start, and returns a vector, so no conversion is required. It uses a transient vector behind the scenes, so it should be pretty fast.
I'm negating the predicate using complement so I can use filterv, since there is no removev. remove is just defined as the complement of filter anyway though, so it's basically what you were already doing, just strict.
What you are trying to do fundamentally performs badly. Vectors are for fast indexed read/write, and O(1) access to the right end. To do anything else you must tear the vector apart and rebuild it again, an O(N) operation. If you need an operation like this to be efficient, you must use a different data structure.
Why not a PersistentHashSet? Fast removal, though not ordered. I do vaguely recall Clojure also having a sorted set in case that’s needed.
You have made an error of accepting the lazy result of remove as equivalent to the concrete result of converting back to a vector. Compare the lazy result of (remove ...) with the concrete result implied by (count (remove ...)). You will see that it is slightly slower than just doing (vec (remove ...)). Also, for real speed-critical applications, there is nothing like using a native Java ArrayList:
(ns tst.demo.core
  (:require
    [criterium.core :as crit])
  (:import [java.util ArrayList]))
(def N 1000)
(def tgt-item (/ N 2))
(def pred-set #{(long tgt-item)})
(def data-vec (vec (range N)))
(def data-al (ArrayList. data-vec))
(def tgt-items (ArrayList. [tgt-item]))
(println :lazy)
(crit/quick-bench
  (remove pred-set data-vec))
(println :lazy-count)
(crit/quick-bench
  (count (remove pred-set data-vec)))
(println :vec)
(crit/quick-bench
  (vec (remove pred-set data-vec)))
(println :ArrayList)
(crit/quick-bench
  (let [changed? (.removeAll data-al tgt-items)]
    data-al))
with results:
:lazy Evaluation count : 35819946 time mean : 10.856 ns
:lazy-count Evaluation count : 8496 time mean : 69941.171 ns
:vec Evaluation count : 9492 time mean : 62965.632 ns
:ArrayList Evaluation count : 167490 time mean : 3594.586 ns
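The lazy-versus-realized distinction is not specific to Clojure. As an analogy only, here is the same effect in Python, where filter also returns a lazy iterator. The counting predicate is a device I added to make the deferred work visible.

```python
calls = [0]  # mutable counter so the predicate can record its invocations

def keep(item):
    """Predicate that counts how often it is invoked."""
    calls[0] += 1
    return item != 500

data = list(range(1000))

lazy = filter(keep, data)  # builds a lazy iterator; no element is examined
calls_after_building = calls[0]

result = list(lazy)        # realizing the result runs the predicate on every element
calls_after_realizing = calls[0]

print(calls_after_building, calls_after_realizing, len(result))  # 0 1000 999
```

Timing only the first step measures the cost of creating the lazy object, not the cost of the removal itself.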
I am using Menhir to parse a DSL. My parser builds an AST using an elaborate collection of nested types. During later typechecking and other passes, in error reports generated for the user, I would like to refer to the source file position where the error occurred. These are not parsing errors; they are generated after parsing is completed.
A naive solution would be to equip all AST types with additional location information, but that would make working with them (e.g. constructing or matching) unnecessarily clumsy. What are the established practices for doing that?
I don't know if it's a best practice, but I like the approach taken in the abstract syntax tree of the Frama-C system; see https://github.com/Frama-C/Frama-C-snapshot/blob/master/src/kernel_services/ast_data/cil_types.mli
This approach uses "layers" of records and algebraic types nested in each other. The records hold meta-information like source locations, as well as the algebraic "node" you can match on.
For example, here is a part of the representation of expressions:
type ...
and exp = {
  eid: int; (** unique identifier *)
  enode: exp_node; (** the expression itself *)
  eloc: location; (** location of the expression. *)
}
and exp_node =
  | Const of constant (** Constant *)
  | Lval of lval (** Lvalue *)
  | UnOp of unop * exp * typ
  | BinOp of binop * exp * exp * typ
  ...
So given a variable e of type exp, you can access its source location with e.eloc, and pattern match on its abstract syntax tree in e.enode.
So simple, "top-level" matches on syntax are very easy:
let rec is_const_expr e =
  match e.enode with
  | Const _ -> true
  | Lval _ -> false
  | UnOp (_op, e', _typ) -> is_const_expr e'
  | BinOp (_op, l, r, _typ) -> is_const_expr l && is_const_expr r
To match deeper in an expression, you have to go through a record at each level. This adds some syntactic clutter, but not too much, as you can pattern match on only the one record field that interests you:
let optimize_double_negation e =
  match e.enode with
  | UnOp (Neg, { enode = UnOp (Neg, e', _) }, _) -> e'
  | _ -> e
For comparison, on a pure AST without metadata, this would be something like:
let optimize_double_negation e =
  match e with (* no record layer, so we match on e directly *)
  | UnOp (Neg, UnOp (Neg, e', _), _) -> e'
  | _ -> e
I find that Frama-C's approach works well in practice.
You need somehow to attach the location information to your nodes. The usual solution is to encode your AST node as a record, e.g.,
type node =
| Typedef of typdef
| Typeexp of typeexp
| Literal of string
| Constant of int
| ...
type annotated_node = { node : node; loc : loc}
Since you're using records, you can still pattern match without too much syntactic overhead, e.g.,
match node with
| {node=Typedef t} -> pp_typedef t
| ...
Depending on your representation, you may choose between wrapping each branch of your type individually, wrapping the whole type, or wrapping recursively, as in the Frama-C example by @Isabelle Newbie.
A similar but more general approach is to extend a node not with the location, but just with a unique identifier, and to use a finite map to add arbitrary data to nodes. The benefit of this approach is that you can extend your nodes with arbitrary data, as you actually externalize node attributes. The drawback is that you can't actually guarantee the totality of an attribute, since finite maps are not total. Thus it is harder to preserve an invariant that, for example, all nodes have a location.
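The identifier-plus-map idea is language independent. Here is a small sketch in Python rather than OCaml, with hypothetical names (Node, node_id, locations), that also shows the totality problem:

```python
import itertools
from dataclasses import dataclass, field

_next_id = itertools.count(1)  # source of fresh identifiers

@dataclass(frozen=True)
class Node:
    """Hypothetical AST node carrying only a payload and a unique id."""
    kind: str
    value: object
    node_id: int = field(default_factory=lambda: next(_next_id))

locations = {}  # node_id -> location; attributes live outside the node

c1 = Node("Literal", "hello")
c2 = Node("Constant", 42)
locations[c1.node_id] = "hello.ml:12:32"

print(locations.get(c1.node_id))  # hello.ml:12:32
print(locations.get(c2.node_id))  # None: the map is not total
```

Nothing in the types forces every node to have an entry, which is exactly the invariant that is hard to preserve.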
Since every heap-allocated object already has an implicit unique identifier, the address, it is possible to attach data to heap-allocated objects without actually wrapping them in another type. For example, we can still use type node as it is and use finite maps to attach arbitrary pieces of information to nodes, as long as each node is a heap object, i.e., the node definition doesn't contain constant constructors (if it does, you can work around it by adding a bogus unit value, e.g., | End can be represented as | End of unit).
Of course, by saying an address, I do not literally mean the physical or virtual address of an object. OCaml uses a moving GC so an actual address of an OCaml object may change during a program execution. Moreover, an address, in general, is not unique, as once an object is deallocated its address can be grabbed by a completely different entity.
Fortunately, after ephemera were added to the recent version of OCaml it is no longer a problem. Moreover, an ephemeron will play nicely with the GC, so that if a node is no longer reachable its attributes (like file locations) will be collected by the GC. So, let's ground this with a concrete example. Suppose we have two nodes c1 and c2:
let c1 = Literal "hello"
let c2 = Constant 42
Now we can create a location mapping from nodes to locations (we will represent the latter as just strings)
module Locations = Ephemeron.K1.Make(struct
  type t = node
  let hash = Hashtbl.hash (* or your own hash if you have one *)
  let equal = (=) (* or a specialized equal operator *)
end)
The Locations module provides an interface of a typical imperative hash table. So let's use it. In the parser, whenever you create a new node you should register its locations in the global locations value, e.g.,
let locations = Locations.create 1337
(* somewhere in the semantic actions, where c1 and c2 are created *)
let () = Locations.add locations c1 "hello.ml:12:32"
let () = Locations.add locations c2 "hello.ml:13:56"
And later, you can extract the location:
# Locations.find locations c1;;
- : string = "hello.ml:12:32"
As you see, although the solution is nice in the sense that it doesn't touch the node data type, so the rest of your code can pattern match on it nice and easy, it is still a little bit dirty, as it requires global mutable state, which is hard to maintain. Also, since we are using an object address as a key, every newly created object, even if it was logically derived from the original object, will have a different identity. For example, suppose you have a function that normalizes all literals:
let normalize = function
| Literal str -> Literal (normalize_literal str)
| node -> node
It will create a new Literal node from the original node, so all the location information will be lost. That means that you need to update the location information every time you derive one node from another.
Another issue with ephemera is that they can't survive marshaling or serialization. I.e., if you store your AST somewhere in a file and then restore it, all nodes will lose their identity, and the location table will become empty.
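For readers who don't know ephemerons, a rough Python analog is a weak-keyed dictionary. This is only an analogy: the Literal class is hypothetical, the keys here compare by identity, and the immediate cleanup shown relies on CPython's reference counting.

```python
import gc
import weakref

class Literal:
    """Hypothetical node; instances hash and compare by identity by default."""
    def __init__(self, text):
        self.text = text

locations = weakref.WeakKeyDictionary()  # entries vanish with their keys

c1 = Literal("hello")
locations[c1] = "hello.ml:12:32"

# A derived node is a new object, so it has no location unless we copy it.
normalized = Literal(c1.text.upper())
print(normalized in locations)  # False
locations[normalized] = locations[c1]

del c1
gc.collect()
print(len(locations))  # 1: only the entry for `normalized` remains
```

Both properties from the text show up: attributes are collected together with their node, and every derived node needs its location carried over by hand.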
Speaking of the "monadic approach" that you mentioned in the comments: though monads are magic, they still can't magically solve all the problems. They are not silver bullets :) In order to attach something to a node, we still need to extend it with an extra attribute, either the location information directly or an identity through which we can attach properties indirectly. The monad can be useful for the latter, though, as instead of having a global reference to the last assigned identifier, we can use a state monad to encapsulate our id generator. And for the sake of completeness, instead of using a state monad or a global reference to generate unique identifiers, you can use UUIDs and get identifiers that are not only unique in a program run, but are also universally unique, in the sense that there are no other objects in the world with the same identifier, no matter how often you run your program (in a sane world). And although it looks as if generating a UUID doesn't use any state, under the hood it still uses an imperative random number generator, so it is sort of cheating, but it can still be seen as purely functional, as it has no observable effects.
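The two ways of generating identifiers can be sketched in Python as follows. This is an illustration only: a closure stands in for the encapsulated state that a state monad would thread through, and make_id_generator is a name I invented.

```python
import itertools
import uuid

def make_id_generator():
    """Return a function that hands out 1, 2, 3, ... without a global counter."""
    counter = itertools.count(1)
    return lambda: next(counter)

fresh_id = make_id_generator()
print(fresh_id(), fresh_id())  # 1 2

# Alternative: a random UUID needs no shared counter at all.
node_id = uuid.uuid4()
print(node_id.version)  # 4
```

The counter gives small, ordered ids that are unique within one run. The UUID gives ids that can be merged across runs and files without coordination.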
I am using ANTLR 3 to parse and rewrite Answer-Set Programs (ASP). What I want to do is parse an ASP program and output an AST with some rewriting. I can easily add and remove nodes to/from the AST but what I need to do is add nodes to the root dynamically (effectively, adding new rules to the ASP program). Which nodes to add and how many is based on the input ASP program.
Below I have an example from my lexer and parser which outputs an AST. r_rule returns a LinkedHashMap that is filled based on what it matches. For each member of the LinkedHashMap, in the rewrite for r_program I want to add a new node to the root node PROGRAM. However, I cannot seem to find a way to iterate through the LinkedHashMap and add new nodes.
@members {
int rID = 0;
}
r_program
: (a=r_rule)* -> ^(PROGRAM r_rule*);
r_rule returns [LinkedHashMap<String, String> somehm]
@init {
$somehm = new LinkedHashMap<String, String>();
String strrID = Integer.toString(++rID);
}
: (head = r_head) ':-'
body=r_body[strrID] {$vartypes.putAll($body.vartypes); } -> ^(LIMPL $head ^(EXTENSION ^(NUMBER[strrID] $head)) $body);
I can use a semantic predicate but only to check a property of the LinkedHashMap. I can arbitrarily loop through the HashMap with inserted code, but I can't then, for each iteration, add child nodes or trigger a rewrite. The code generated is in fact put in the wrong place to even do this in an ugly way using Java (I can't access the root node PARENT).
What can I do about this? A completely different approach is also welcome. Many thanks!
Update 1
An example input is:
head_pred(X, Y, Z) :- body_1(X), body_1(Y), body_1(Z).
An example AST is, apologies for the drawing (n.b. strictly an example used for readability, in reality many more nodes are used in the rewrite)...
PROGRAM
|
|____:-
| |____head_pred(X, Y, X)
| |____body_1(X)
| |____body_1(Y)
| |____body_1(Z)
| |____X == Y
|
|____:-
| |____head_pred(X, Y, X)
| |____body_1(X)
| |____body_1(Y)
| |____body_1(Z)
| |____X == Z
I could go on, the idea is that each rule binds the variables differently, if they can be bound. Different inputs change the number and content of the children of PROGRAM.