Should Raku grammars or regexes be used for parsing wiki markup?

I need to convert a few hundred text files using TWiki markup to a more standard markup language (e.g. Markdown or AsciiDoc). As there doesn't seem to be any usable tool for doing this (pandoc supports TWiki, but very poorly, i.e. it irretrievably loses not only much of the markup but also a lot of the content), I'd like to write my own converter in Raku.
At first sight, it looks like it would be appropriate to use Raku grammars for this task, but I'm a bit worried that I could have problems with some broken markup -- which I'd still need to handle in some way. So would it be wiser to stick to manual regexes to have more flexibility?
If anybody has any experience with using Raku grammars for parsing markup, could you please share it? TIA!
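For what it's worth, one way to keep a grammar from choking on broken markup is to give it a last-resort fallback token that swallows anything the real rules don't match, so the parse never fails outright. A rough sketch of the idea (the rule names and the tiny subset of TWiki-like syntax are made up for illustration, not a real converter):

    grammar TWikiish {
        token TOP      { <chunk>+ }
        token chunk    { <heading> || <bold> || <plain> }
        token heading  { ^^ '---' '+'+ \h* <( \N* )> \n }   # ---+ Heading
        token bold     { '*' <( <-[*\n]>+ )> '*' }          # *bold*
        token plain    { .+? <?before '---' | '*' | $> }    # fallback: anything else
    }

    my $m = TWikiish.parse("---+ Title\nSome *bold* text, plus %BROKEN{markup\n");
    say $m<chunk>.elems;   # broken markup simply ends up in plain chunks

With an actions class you can then emit Markdown per rule, and anything captured by the fallback gets passed through (or logged) instead of killing the whole conversion. Regexes would work too, but you would end up hand-rolling much the same dispatch logic.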

Related

Is it possible to use Perl6 grammars on raster data? (Use case: Cloud Optimized GeoTIFF validation)

A few questions scratch an itch around Perl6 grammars and raster (binary, in general) data. From what I understand, the textual approach is to work at the grapheme level through grammars; can we approach raster data the same way? Can we define a custom grapheme, or some other basic unit of binary data, so that grammars can parse it?
Seeing that Perl6 itself is defined by Perl6 grammars, can we write similar grammars as a kind of validation test, the basic case being: if the grammar can parse the data, the data is well-formed and structurally valid? For text data this is fairly obvious, since the basic units of a grammar are text-oriented, but can we customize those underlying definitions (for example, it is currently possible to override :sigspace so that rules and tokens use a different separator) to bring the power of grammars into binary-data territory?
Thanks!
As background:
Over the past few weeks I have begun to learn Perl6 out of personal interest. After seeing this talk at FOSDEM 2019, I started to ask myself (and the people around me) about using grammars to inspect and parse binary data. My use case would be, for example, to replicate the Cloud Optimized GeoTIFF validator without relying on a GDAL binding (I haven't seen one for Perl6 yet). It is clearly a learning project for me.
The spec for Cloud Optimized GeoTIFF
For now, the basic idea is to parse the binary structure with the help of Perl6 grammars, if that is possible, as a first step, with the main goal of being able to inspect the data and metadata.
Note: I am not a native speaker; if some parts need rewriting or clarification, feel free to point that out.
As only comments were posted, I will summarize here the answers I got from the comments, from my further research, and from the #perl6 IRC channel.
Concerning bindings for some library X (in this case, GDAL), the strategies in the Perl6 community are to:
Use an Inline::Foo module, which launches and gives access to the ecosystem of the Foo language (for example Inline::Perl5, Inline::Python, and so on); see the list of Inline::X modules in the Perl6 Module Directory;
Use or write a binding with NativeCall to bind to dynamic libraries that follow the C calling convention (see the sketch after this list);
Use or write a native Perl6 module.
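As an illustration of the NativeCall option, here is roughly what a minimal binding to a single function of the GDAL C API could look like. This is only a sketch: it assumes a libgdal shared library is installed and findable, and it binds nothing beyond GDALVersionInfo.

    use NativeCall;

    # Minimal, illustrative binding to one function of the GDAL C API.
    # Assumes a libgdal shared library is installed on the system.
    sub GDALVersionInfo(Str --> Str) is native('gdal') { * }

    say GDALVersionInfo('RELEASE_NAME');   # e.g. "3.4.1", depending on the installed GDAL

A real binding would wrap many more functions (and probably pin a library version in the is native trait), but this is the basic shape.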
Concerning the parsing of binary data, I'll split the subject into two parts:
Generally speaking;
Leveraging grammars.
1. Generally speaking
Leveraging the P5pack module, or using Inline::Perl5 to get Perl 5's pack/unpack, is currently (as of perl6.c) the best way to parse binary data structures (the former seems favoured, as it is a native module).
See the first comment from #raiph, pointing to a SO answer that shows a basic use case.
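Not as a replacement for pack/unpack, but just to show the flavour of decoding a fixed binary layout in Perl6, here is a sketch that reads the 8-byte TIFF header (byte order, the magic number 42, and the offset of the first IFD) using the Buf reading methods that arrived in Rakudo releases newer than perl6.c; the file name is made up.

    # Sketch: decode the fixed 8-byte TIFF header of a (hypothetical) cog.tif
    my $fh     = 'cog.tif'.IO.open(:bin);
    my $header = $fh.read(8);
    $fh.close;

    my $order  = $header.subbuf(0, 2).decode('ascii');     # 'II' (little-endian) or 'MM' (big-endian)
    my $endian = $order eq 'II' ?? LittleEndian !! BigEndian;
    my $magic  = $header.read-uint16(2, $endian);           # must be 42 for a TIFF
    my $ifd    = $header.read-uint32(4, $endian);           # byte offset of the first IFD

    say "magic = $magic, first IFD at byte $ifd";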
2. Leveraging grammars
With perl6.c, grammars can only parse text.
However, the question of parsing binary data seems to be a moderately hot topic (based on feedback seen on the #perl6 IRC channel), and a few documented but not yet implemented ideas seem to pave the way, with the hope of seeing it happen at some point (near or distant?) in the future.
The last part of #raiph's answer lists a lot of resources pointing in that direction. Moreover, in Synopsis 05 - Regexes and Rules, line 432, a :bytes modifier is mentioned. We will have to see when those modifiers get implemented and what is missing to bring them into the language.
On the #perl6 IRC channel, MasterDuke said: "also, i think the nqp binary reading/writing ops that jnthn recently specced and nine implemented were a prerequisite for anything further". I still have to investigate what exactly he is talking about, but it seems to be going in the right direction.
One of the main points, IMO, relates to the grapheme definition, which comes from Unicode. If we could override the grapheme definition with a custom one for a specialized grammar, as we can currently override the :sigspace modifier to change what separates rules and tokens, we would gain a new way to work with data structures and grammars. For now, the grapheme is defined at the string level, not at the grammar or meta level. See #timotimo's comments linking to the Unicode document describing the Grapheme Cluster Boundary Rules.
A way to bend the rules was linked by #jjmerelo: Parsing GFX3 format with Perl6 grammars.
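The usual way to bend those rules is to decode the raw bytes with a single-byte encoding such as ISO-8859-1 (latin-1), so that each byte maps to exactly one character, and then let an ordinary grammar match those characters. A minimal, illustrative sketch (again with a made-up file name) that only checks the TIFF signature:

    grammar TiffSignature {
        token TOP        { <byte-order> <magic> }
        token byte-order { 'II' | 'MM' }                    # little- or big-endian marker
        token magic      { \x[2A] \x[00] | \x[00] \x[2A] }  # the number 42, in either byte order
    }

    my $bytes = 'cog.tif'.IO.slurp(:bin);
    my $text  = $bytes.decode('iso-8859-1');                # 1 byte => 1 character
    say TiffSignature.subparse($text) ?? 'looks like a TIFF' !! 'not a TIFF';

This works for signatures and simple records, but it quickly gets awkward for multi-byte integers and offsets, which is why the pack/unpack route above is usually preferred for now.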

Difference between properties file, YAML & JSON?

I'm a beginner in software testing. I'm working with Selenium using the page object design pattern. I want to keep the test data separate, but I'm confused about how to do it.
I want to know the difference between using a properties file, YAML, and JSON: which is most useful in software testing?
I need to keep the test data separate in one of them, so which should I choose? Which do more people use nowadays, and which is easier to work with as a tester? What is your suggestion?
XML (Extensible Markup Language) has flexible and powerful markup capabilities. It is often used in configuration and preference files, like those used by the Eclipse IDE. Most web browsers have XML viewers, although XML is designed for structured data, so reading it is a bit like looking at the internals of a database.
JavaScript Object Notation (JSON) is used with JavaScript, of course. It will be familiar to web developers who use it for client/server communication.
YAML stands for YAML Ain't Markup Language. It uses line and whitespace delimiters instead of the explicitly marked blocks, possibly spanning multiple lines, used by XML and JSON. This indentation-based approach is also used by programming languages such as Python.
So it comes down to YAML or JSON:
Technically YAML is a superset of JSON. That is, in theory at least, a YAML parser can understand JSON, but not necessarily the other way around.
In general, there are certain things I like about YAML that are not available in JSON.
1) YAML is visually easier to look at. In fact the YAML homepage is itself valid YAML, yet it is easy for a human to read.
2) YAML can reference other items within the same file using "anchors", so it can represent relational information of the kind one might find in a MySQL database (see the snippet after this list).
3) YAML is more robust about embedding other serialization formats, such as JSON or XML, within a YAML file.
4) YAML, depending on how you use it, can be more readable than JSON.
5) JSON is often faster to parse and is probably still interoperable with more systems.
6) Duplicate keys, which are potentially valid JSON, are definitely invalid YAML.
7) YAML has a ton of features, including comments and relational anchors. Its syntax is accordingly quite complex and can be hard to understand.
8) YAML can be used directly for complex tasks like grammar definitions, and is often a better choice than inventing a new language.
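To make point 2 concrete, here is a small, made-up test-data snippet showing an anchor and an alias, together with the << merge key (a YAML 1.1 convention that most, but not all, parsers support); JSON has no equivalent:

    default_env: &defaults          # anchor: define the shared settings once
      base_url: https://example.test
      timeout: 30

    staging:
      <<: *defaults                 # alias + merge key: reuse everything from defaults
      base_url: https://staging.example.test

    production:
      <<: *defaults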
If you don't need any features that YAML has and JSON doesn't, I would prefer JSON, because it is very simple and widely supported (there are a lot of libraries for many languages). YAML is more complex and has less support. Parsing speed and memory use are unlikely to differ much, and probably won't be a big part of your program's performance, but JSON is the winner for performance (if that matters) and interoperability, while YAML is better for human-maintained files. So basically, choose based on your requirements, not on what most people are using.

Creating a simple Domain Specific Language

I am curious to learn about creating a domain-specific language. For now the domain is quite basic: just some variables, some loops, and if statements.
Edit: The language will be non-English-based, with a very simple syntax.
I am thinking of targeting the Java Virtual Machine, i.e. compiling to Java bytecode.
Currently I know how to write some simple grammars using ANTLR.
I know that ANTLR creates a lexer and parser but how do I go forward from here?
About semantic analysis: does it have to be written manually, or are there tools to generate it?
How can the output from the lexer and parser be converted to Java bytecode?
I know there are libraries like ASM and BCEL, but what is the exact procedure?
Are there any frameworks for doing this? And if so, what is the simplest one?
You should try Xtext, an Eclipse-based DSL toolkit. Version 2 is quite powerful and stable. Its home page offers plenty of resources to get you started, including some video tutorials. Because the Eclipse ecosystem is built around Java, it seems the best choice for you.
You can also try MPS, but it is a projectional editor, and beginners may find it more difficult. It is nevertheless no less powerful than Xtext.
If your goal is to learn as much as possible about compilers, then you do indeed have to go the hard way: write an ad hoc parser (no ANTLR and the like), write your own semantic passes, and write your own code generation.
Otherwise, you'd be better off extending an existing extensible language with your DSL, reusing its parser, its semantics, and its code generation. For example, you can easily implement an almost arbitrarily complex DSL on top of Clojure macros (and since Clojure itself is compiled for the JVM, you get that part for free).
A DSL with simple syntax may or may not mean simple semantics.
Simple semantics may or may not mean easy translation to a target language; such translations are "technically easy" only if the DSL and the target language share a lot of common data types and execution models. (Constraint systems have simple semantics, but translating them to Fortran is really hard!) (You have to wonder: if translating your DSL is easy, why do you have it at all?)
If you want to build a DSL (in your case, sticking with an easy one because you are learning), you want DSL compiler infrastructure that has whatever you need in it, including support for difficult translations. "What is needed" to translate every DSL to every possible target language is clearly an impossibly large set of machinery.
However, a lot of clearly helpful machinery can be identified:
Strong parsing machinery (who wants to fiddle with grammars whose structure is forced by the weakness of the parsing machinery? If you don't know what this means, go read about LL(1) grammars as an example)
Automatic construction of a representation (e.g., an abstract syntax tree) of the parsed DSL
Ability to access/modify/build new ASTs
Ability to capture information about symbols and their meaning (symbol tables)
Ability to build analyses of the AST for the DSL, to support translations that require information from "far away" in the tree to influence the translation at a particular point in the tree
Ability to reorganize the AST easily to achieve local optimizations
Ability to construct/analyze control-flow and data-flow information, if the DSL has procedural aspects and the code generation requires deep reasoning or optimization
Most of the tools available for "building DSL generators" provide some kind of parsing, perhaps tree building, and then leave you to fill in all the rest. This puts you in the position of having a small, clean DSL but taking forever to implement it. That's not good. You really want all that infrastructure.
Our DMS Software Reengineering Toolkit has all the infrastructure sketched above, and more. (It clearly doesn't, and can't, have the moon.) You can see a complete, all-in-one-"page", simple DSL example that exercises some interesting parts of this machinery.

Doxygen for procedural programs

I have some large, mostly procedural codes that need to be well documented. This generally involves repeated use of a number of functions that must be executed in a certain order.
Doxygen is a great product, but it seems very oriented towards documenting OOP codes. Does anyone have any tips on how to use doxygen in a natural way to document procedural work?
There's nothing inherently OOP about the way doxygen works. It's just able to extract more information about OO code because it has more information in it (e.g., inheritance graphs).
We use doxygen for plain C code and it works just as well, minus the information that plain C doesn't provide compared to C++. Just use doxygen's grouping features (@addtogroup et al.) to organize the generated documentation and you're good to go.
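For example (the function names below are invented, not taken from the question), grouping a handful of plain C functions so they appear together, in order, in the generated documentation can look roughly like this:

    /**
     * @defgroup startup Startup sequence
     * @brief Functions that must be called in this order before anything else.
     * @{
     */

    /** Step 1: read the configuration file. */
    void load_config(const char *path);

    /** Step 2: open the data files named in the configuration. */
    void open_inputs(void);

    /** @} */

    /** @addtogroup startup
     *  @{
     */

    /** Step 3: run the main processing loop. */
    void run_pipeline(void);

    /** @} */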

Should DocBook be used for publishing technical documentation in English & Arabic?

I'm looking for the ideal tool for publishing technical documentation in English and Arabic (in the same document). Should I use DocBook, or is it better to stick with TeX/LaTeX? I am a complete beginner with both systems, so there's no legacy to worry about. The two most important factors for me are ease of use and support for Arabic. By ease of use I mean that I don't mind setting up XML documents and so on, but for day-to-day writing I'd rather not hand-code XML; a good editor that gives a feel for how the document will look would be ideal. I would like the output to be print-ready PDF as well as HTML.
Well, TeX/LaTeX as shipped in the TeX Live CD/DVD/bundle, in its XeTeX incarnation, is certainly able to deal with Arabic; see these examples. I'm not sure whether all the DocBook utilities (the editors, and things like fop) are up to this.
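If you go the XeTeX route, a minimal mixed English/Arabic document with the polyglossia package looks roughly like this (compiled with xelatex; the Arabic font is whatever suitable OpenType font you have installed, Amiri here is just an example):

    \documentclass{article}
    \usepackage{polyglossia}
    \setmainlanguage{english}
    \setotherlanguage{arabic}
    \newfontfamily\arabicfont[Script=Arabic]{Amiri}  % any installed Arabic OpenType font

    \begin{document}
    Some English text, then an Arabic phrase inline: \textarabic{مرحبا بالعالم}.

    \begin{Arabic}
    هذه فقرة كاملة باللغة العربية.  % "This is a complete paragraph in Arabic."
    \end{Arabic}
    \end{document}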