Get all the strings using a VS Code language server - vscode-extensions

Apropos of How do I parse TS to symbols using a Language Server Protocol?, I already did this and I did indeed use the TS compiler as suggested in one of the answers to that question. That, however, is only good for TS and JS. There are many languages, and VS Code has language servers for most of them.
In my case I want the strings, all the strings, and nothing but the strings because I'm building a toolchain to support localisation. The problem is doing this in a language agnostic way. It occurred to me that VS Code identifies strings for the purposes of syntax colouring and validation, and this is how I found the abovementioned question. But that looks at it from an ad hoc perspective of "what is the thing at this position in some source code" rather than processing the entire file and emitting a traversable AST, which is what the TS compiler does.
Does anyone have experience using a VS Code language server like this (find all the string/method names/whatever) ? If so would you mind sharing a sketch of the best approach you found, and key points to read up on with respect to LSP and language server in general?
I really can't see any other practical way to do this that avoids wrapping a compiler for every target language. (But if you can I'm keen to read about it.)

Since revision 3.16, the language server protocol supports semantic tokens
I.e. the language server will tokenize your file and you (i.e. the client) can ask for all tokens with a certain semantic. In your case the semantic string.
The method that the client should call at the server is
textDocument/semanticTokens/full (or any other variant than full)
which will get you back a list of tokens.
A full example can be found here

Related

It is possible to use Perl6 grammar on raster data ? (Use Case : Cloud Optimized GeoTIFF Validation)

A few questions scratch an itch around perl6 grammars and raster (binary in general) data. For what I understand, the text approach is to work at the grapheme-level trhough grammars, may we approach raster data that way ? Can we make custom grapheme definition to approach raster data or a basic unit of binary data to parse them using Grammars ?
Seeing that perl6 is defined by perl6 grammars, can we define similar grammars as kind of "validation" test with a basic case being if the grammar can parse the data, the data is well-formed and is structurally validated ? Using this approach for text data, it is kind of obvious with grammars as the basic unit are text-oriented but can we customize those back-end definition (by example, it's possible to overwrite the :sigspace to make rules and tokens parse with a another separator for grapheme) to enable the power of grammars in the binary data territory ?
Thanks!
For the background part:
During the past few weeks, I begin to learn-ish Perl6 by personal interest. After seeing this talk at FOSDEM 2019 and I begin to ask myself (and the people around me) about using using grammars to inspect/parse binary data. My usecase will be for example to replicate the Cloud Optimized Geotiff validator without the support of a GDAL binding (I didn't see one yet in perl6). It's clearly a learning project for me.
The Spec for Cloud Optimized Geotiff
For now, the basic idea is to parse the binary structure with the help of perl6 grammars if it possible as a first basic step, hoping to be able to inspect the data and metadata as a main goal.
Note : Not native speaker, if some parts need rewriting/precisions feel free to point out.
As only comments where posted, I will summarize all the answers I got from the comments here, my further research and the #perl6 IRC chatroom.
Concerning the support of binding for X library (in the test case, it was GDAL), the strategy inside the perl6 community is to either leverage :
Use the Inline::Foo modules aiming at launching and accessing the ecosystem of the Foo language (by example : Inline::Perl5, Inline::Python and so on). List of Inline::X modules from the Perl6 Module Directory ;
Use or write a binding using NativeCall to bind to dynamic libraries who follow the C Calling Covention ;
Use or write a native perl6 module.
Concerning the parsing of binary data, I'll split the subject in two parts :
Generally speaking ;
Leveraging Grammars ;
1. Generally speaking
Leveraging the P5pack module or using Inline::Perl5 to use the unpack/pack is actually (with perl6.c) the best to parse binary data structure (the former seems favoraed as it's native module).
Go to see first comment from #raiph to a SO anwser showing a basic use case.
2. Leveraging the grammars
With perl6.c, grammars can only parse text.
However, the question about parsing binary data seems to be moderatly hot (based on feedbacks seen on the #perl6 irc channel) and a few to document, yet not implemetend, seems to pave the way with a hope to see it happens in a future (near or distant?).
The last part of the #raiph's anwser list a lot of ressources aiming at that direction. Moreover, in the Synopses 05 - Regexes and Rules : line 432 a :bytes modifier is evoked. We will have to see at which point those modifiers will be implemented and what is missing to bring them to the language.
On the #perl6 irc channel, MasterDuke said « also, i think the nqp binary reading/writing ops that jnthn recently specced and nine implemented were a prerequisite for anything further ». I still have to investigated what exactly he is talking about but it seems to go to the good direction.
One of the main point, IMO, is the related to the grapheme definition based on the UTF-8 one. If we were able to overwrite the grapheme definition to a custom one for specialized grammar as we can for now overwrite the :sigspace modifier to affect what is the separators for rulesand tokens, we will access a new way to operate around data structure and grammars. For now, the grapheme is defined in the string-level not the grammar-level or meta. See #timotimo comments linking to the UTF-8 document describing the Grapheme Cluster Boundary Rules.
A way to bend the rules was linked by #jjmererlo : Parsing GFX3 format with perl6 grammars.

How to Write a Source to Source Compiler API

I am doing a little research on source to source compilation but now that I am getting an understanding of Source to Source compilation. I am wondering are there any examples of API's for these source to source compilers.
I mean an Interface Descriptor to pass the source code of one programming language to another compiler to be compile? Please if so can you point me to these examples or could you give me tips (Just pure explanation) on writing one am still in research okay.
Oh I should note I am researching this for several days an I have came across things such as ROSE, DMS and LLVM. As said its purely research so I dont know whats the best approach I know I wouldn't use ROSE for it is only for C/C++. LLVMs' seems promising but I am new to LLVM. Oh my aim is to create a transpiler for 4 language support (Is that feasible). Which is why I just need expert Advice :)
Yes, you can have a procedural API for doing source-to-source translation. These are pretty straightforward in the abstract: define a core data structure to represent AST nodes, then define APIs to "parse file to AST", "visit tree nodes", "inspect tree nodes", "modify tree nodes", "spit out text". They get messy in the concrete, especially if the API is specific the language being translated; too much of the details of that language get wound into the APIs. While traditional, this is really a rather clumsy way to define source-to-source translators, because you then have to write tons of procedural code invoking the APIs to do the translation.
You can instead define them using a program transformation system (PTS) using source to source transformations based on surface syntax; these are patterns written using the notation of your to-be-compiled language, and your target-language, in the form of "if you see this, then replace it by that", operating on syntax trees not text strings. This means you can inspect the transforms simply by staring at them. So can your fellow programmer.
One such translation rule might look like:
rule tranlate_add_to(t: access_path, u: access_path):COBOL -> Java
" add \t to \u "
-> " \object_for\(\u\).\u += \object_for\(\t\).\t; ";
with a left-hand side "add \t to \u " specifying a COBOL fragment (this) to be replaced by the right-hand side " \object_for... " representing corresponding Java code (that). This rule uses a helper function "object_for" to decide where in a target Java program, a global variable in a the source COBOL program will be placed. (There's no avoiding writing such a function if you are translating Java to COBOL. You can argue about how sophisticated). In practice, the way such a rule works is the pattern ASTs of each side are constructed, and then the patterns are matched against a parsed AST; a match causes the corresponding subtree to be spliced into place where the match was found. (All this low level tree matching and splicing has to be done... procedurally, but somebody else has already implemented that in a PTS).
In our experience, you need one to two thousand such rules to translate one language to another. The plethora of rules comes from the combinatorics of language syntax constructs for the source language (and their perhaps different interpretations according to types; "a+b" means different things when a is an int vs when a is a string) and the target language opportunities. A nice plus of such rewrites is that one can build a somewhat simpler base translation, and apply additional rewrites from the target language to itself to clean up and optimize the translated result.
Many PTS are purely based on source-to-source surface syntax rewrites. We have found that combining both PTS and a procedural API, and making it possible to segue between them makes for very nice tool: you can use the rewrites where convenient, and procedural APIs where they don't work so well (the "object_for" function suggested above is easier to code as a procedure).
See lot more detail on how our DMS Software Reengineering Toolkit encodes such transformation rules (the one above is code in DMS style), in a language agnostic (well, parameterized) fashion. DMS offers a "pure" procedural API as OP requested with some 400 functions, but DMS encourages its users to lean heavily on the rewrites and only code as a little as necessary agains the procedural API. It would be "straightforward" (at least as straightforward as practical) to build your "4 language support" this way.
Don't underestimate the amount of effort to build such translators, even with a lot of good technical machinery as a foundation. Langauges tend to be complex beasts, and their translations doubly so. And you have to decide if you want a truly crummy translation or a good one.
I have been using ROSE compiler framework to write a source to source translator. ROSE can parse a language that it supports and create an AST from it. It provides different APIs (found in SageInterface) to perform transformation and analysis on the AST. After the transformation, you can unparse the transformed AST to produce your target source code.
If ROSE does not support parsing your input language, you can write your own parser while utilizing ROSE's SageBuilder API to build the AST. If your target language is one of the languages which ROSE supports, then you can rely on ROSE's unparser to get the target code. But if ROSE does not support your target language, then you can write your own unparser as well using different AST traversal mechanism provided by ROSE.

Creating a simple Domain Specific Language

I am curious to learn about creating a domain specific language. For now the domain is quite basic, just have some variables and run some loops, if statements.
Edit :The language will be Non-English based with a very simple syntax .
I am thinking of targeting the Java Virtual Machine, ie compile to Java byte code.
Currently I know how to write some simple grammars using ANTLR.
I know that ANTLR creates a lexer and parser but how do I go forward from here?
about semantic analysis: does it have to be manually written or are there some tools to create it?
how can the output from the lexer and parser be converted to Java byte code?
I know that there are libraries like ASM or BCEL but what is the exact procedure?
are there any frameworks for doing this? And if there is, what is the simplest one?
You should try Xtext, an Eclipse-based DSL toolkit. Version 2 is quite powerful and stable. From its home page you have plenty of resources to get you started, including some video tutorials. Because the Eclipse ecosystem runs around Java, it seems the best choice for you.
You can also try MPS, but this is a projectional editor, and beginners may find it more difficult. It is nevertheless not less powerful than Xtext.
If your goal is to learn as much as possible about compilers, then indeed you have to go the hard way - write an ad hoc parser (no antlr and alike), write your own semantic passes and your own code generation.
Otherwise, you'd better extend an existing extensible language with your DSL, reusing its parser, its semantics and its code generation functionality. For example, you can easily implement an almost arbitrary complex DSL on top of Clojure macros (and Clojure itself is then translated into JVM, you'll get it for free).
A DSL with simple syntax may or may not mean simple semantics.
Simple semantics may or may not mean easy translation to a target language; such translations are "technically easy" only if the DSL and the target languate share a lot of common data types and execution models. (Constraint systems have simple semantics, but translating them to Fortran is really hard!). (You gotta wonder: if translating your DSL is easy, why do you have it?)
If you want to build a DSL (in your case you stick with easy because you are learning), you want DSL compiler infrastructure that has whatever you need in it, including support for difficult translations. "What is needed" to handle translating all DSLs to all possible target languages is clearly an impossibly large set of machinery.
However, there is a lot which is clear that can be helpful:
Strong parsing machinery (who wants to diddle with grammars whose structure is forced
by the weakness of the parsing machinery? (If you don't know what this is, go read about LL(1) grammmars as an example).
Automatic construction of a representation (e.g, an abstract syntax tree) of the parsed DSL
Ability to access/modify/build new ASTs
Ability to capture information about symbols and their meaning (symbol tables)
Ability to build analyses of the AST for the DSL, to support translations that require
informatoin from "far away" in the tree, to influence the translation at a particular point in the tree
Ability to reogranize the AST easily to achieve local optimizations
Ability to consturct/analysis control and dataflow information if the DSL has some procedural aspects, and the code generation requires deep reasoning or optimization
Most of the tools available for "building DSL generators" provide some kind of parsing, perhaps tree building, and then leave you to fill in all the rest. This puts you in the position of having a small, clean DSL but taking forever to implement it. That's not good. You really want all that infrastructure.
Our DMS Software Reengineering Toolkit has all the infrastructure sketched above and more. (It clearly doesn't, and can't have the moon). You can see a complete, all-in-one-"page", simple DSL example that exercises some ineresting parts of this machinery.

Can PMD be customized to fully support a new language?

Can PMD be customized to fully support a new language, in a reasonable amount of time. I mean I know that technically almost anything can be done, but im wondering if this can be done in a reasonable amount of time? E.g. < 2 weeks
This page mentions how to write a CPD parser http://pmd.sourceforge.net/cpd-parser-howto.html
But is this just for copy / paste detection? Does writing a CPD parser give me full support of PMD in terms of rile sets?
I would guess not, but I'm not a PMD expert (and I have my own bias, check my bio).
The issues are:
Can you define a syntax for my langauge quickly (maybe, depending on how good you are, how messy the language is, and the strength of the parsing machinery offered by PMD)
Can you define the semantics of my language so that "semantic checks" provided by PMD work. You have to do this, because syntax tells you (and a tool) literally nothing about semantic of the syntax. I would guess that the PMD tool 'semantic checks' are pretty wired into the precise details of Java; if you language matched java perfectly, this would be zero work. But it doesn't, or you wouldn't be asking the question. And langauge semantics differences, even minor ones, cause discontinuous changes to the interpreation of the code. Before you get to doing even "serious" semantics, you're likely to have to build a symbol table mapping identifiers in the code to declarations (and the "semantic" type) for those symbols. Based on tool infrastructure I work with, this step alone takes 1-2 months for a real language.
Lastly, you are likely to have to code special PMD checks that are specific to your langauge. That takes time and energy, too.
I build generic compiler-type machinery (parsers, flow analyzers, style/error checkers) and get asked the equivalent of this question all the time WRT to our machinery. We try to have a lot of machinery available, try to make it easy to integrate new langauges, and we've been working on trying to make this "convenient and fast" for 15+ years. Its still not convenient, and there's no way to do this with our tools in a few weeks. I doubt PMD is better.

Language-Portable Example Programs

At the moment I am learning Objective-C 2. I'm aware that it's used heavily by Mac developers, but I'm more interested in learning the language at this point in time than the frameworks for developing on Mac OS X/iPhone (except for Foundation). In order to do this I want to write a few intermediate* console applications, but I'm stuck for ideas.
Most examples are something along the lines of "Write a Fraction class that has getters/setters and a print function", which isn't very challenging coming from a C++ background. I'd like some generic examples of programs, but I don't want them to include any Objective-C implementation details. I want to figure out the program structure/write my own interfaces and learn the language from there.
In summary: I am curious as to what example programs Objective-C programmers would recommend for exploring the language.
An example of an "intermediate" application would be something along the lines of "Write a program that takes a URL from the command line and returns the number of occurrences of a certain word in data returned:
example -url www.google.com -word search
"Project Euler" is a standard response for this kind of thing, but I get the feeling that you're less interested in being told to implement algorithmic stuff (since that knowledge is easier to port between languages) and more interested in miniprojects that will familiarize you with core libraries. Is this fair?
If so, IMO, you ought to know the basics of how to do the following with the standard libraries of language you hope to use for serious work:
Standard IO
Network IO
Disk IO and navigating the filesystem
Regexp utilities
Structured data (XML libraries and CSV libraries if they exist)
Programming problems I would recommend for those:
It sounds like you've already done this.
A very simple proxy - something like what you described in your post, but that listens on a port for a message containing a URL rather than taking it on the command line, and likewise returns the results to whatever contacted it over the network rather than outputting to stdio. [Obviously you need to have the machine behind an appropriate firewall for this!]
Something which takes a directory path and recursively tallies the number of lines its children contain. (So, get the directory's listing, open each child file and count the number of line breaks. Then open each of its child directories, get their listings, ...) Record any errors encountered (e.g., no read privileges) in a reasonable way. Write out the final results to file in the directory supplied.
Usually if I tool around in a language enough, I'll run across some problem which I just naturally find myself using regexps for. I'll assume the same is true for you and punt this element for now.
Fetch StackOverflow.com, and [by putting it into a DOM model and navigating that] determine whether this question is still on the front page.
I got the most out of Objective-C by exploring it with a testing framework. I have written a short blog post about it. You should also wrap your head around the memory management conventions employed by Objective-C, reference counting takes a little time to get used to but works very well if responsibilities are clearly segregated (I have written about that on my blog too).
By getting my hands dirty on a testing framework (GHUnit for that matter), I was able to learn far more about the language than I could have in a "traditional" way. Of course you'll need a little pet project, otherwise this approach doesn't make sense.
I don't think your example is a very good idea as it requires you to mess with http connections, resources etc. which is a little framework specific after all. Parsing a text file would be a little easier in this regard. Using a unit testing framework has the following advantages for you:
learn about platform specific build systems and deployment details
forced to develop components in a loosely coupled fashion from the ground up
thereby exploring unique mechanisms of the language, that might require new or make known patterns redundant (e.g. categories make dependency injection obsolete etc.)
fast compile-test cycle, less time spent in front of the debugger
combined with source control: painless experiments
You should also look into the testing framework implementation, as testing frameworks always require to work with metadata to some extend. Testing frameworks are often used together with isolation frameworks. They basically create objects at runtime that comply to certain interfaces and act as stand-ins for concrete objects. Looking at their implementation will teach you about the runtime manipulations that can be done in Objective-C (keyword: Method-Swizzling)