Can I index source code using Lucene?

Can I index source code using Lucene? - lucene

I would like to index source code using Lucene. The source code has already been pre-analysed using a compiler plugin. The output of the compiler is a list of IDs that appear in the source code. Each ID includes information about
the module the ID was defined in (as opposed to used in),
the source span where the ID appears (i.e. line:col-line:col), and
whether the ID is defined at this location or merely used here.
For example, given this source code module (in pseudo-code)
module MyModule
from MyOtherModule import bar
foo = ...
print bar
here's what the compiler might output when compiling MyModule:
MyModule.foo,3:1-3:3,definition
MyOtherModule.bar,4:7-4:9,use
Note how all IDs that appear in the output are fully qualified, even though they might not appear that way in the source. This is why we use a compiler, it allows us to do more exact code search than just purely text-based search.
Question: Is it possible to write a custom tokenizer and analyzer that indexes the compiler output shown above in a way that the metadata (i.e. the fully qualified ID and whether the ID was defined or used at the given location) is kept an available when scoring the documents?
To be more precise, I'd like each term to be associated with the module where it was defined (e.g. foo would have associated metadata: defining module=MyModule). I want each posting in the posting list to store whether this particular appearance of an ID was a definition or a use of that ID.
In addition, I'd like to have Lucene store the non-qualified ID as synonyms for the qualified ID. This would allow users to search for "foo" and retrieve all documents that contain the IDs "Module1.foo" and "Module2.foo".

It's probably easier to put the various attributes into Lucene fields, so that you can query like:
parse module:MyModule use:yes
which would return only hits on 'parse' in 'MyModule' where 'parse' was used rather than defined.

Related

ASSIGN fails with variable from debugger path

I am trying to assign the value of this stucture path to a fieldsymbol, but this path does not work because it has a table in it's path.
But with in the debugger this value of this path is shown correctly.
Is there a way to dynamically assign a component of a table line to a fieldsymbol, by passing one path?
If not then I will just read the table line and then use the path to get the wanted value.
ls_struct (Struct)
- SUPPLYCHAINTRADETRANSACTION (Struct)
- INCL_SUPP_CHAIN_ITEM (Table)
- ASSOCIATEDDOCUMENTLINEDOCUMENT (Element)
i_component_path = |IG_DDIC-SUPPLYCHAINTRADETRANSACTION-INCL_SUPP_CHAIN_ITEM[1]-ASSOCIATEDDOCUMENTLINEDOCUMENT|.
ASSIGN (i_component_path) TO FIELD-SYMBOL(<lg_value>).
IF <lg_value> IS NOT ASSIGNED.
return.
ENDIF.
<lg_value> won't be assigned

Solution by Sandra Rossi
The debugger has its own syntax and own logic, it doesn't apply the ASSIGN algorithm at all. With ABAP source code, you have to use ASSIGN twice, the first one to reach the internal table, then you select the first line, and the second one to reach the component of the line.
The debugger works completely differently, the debugger code works only in debug mode, you can't call the code from the debugger (i.e. if you call it, the kernel code used by the debugger will fail). No, there's no "abappath". There are the XSL transformation objects (xpath), but it's slow for what you ask.
Thank you very much

This seems to be a rather unexpected limitation of the ASSIGN statement. Probably worth a ticket to SAP's ABAP language group to clarify whether it's even a bug.
While this works:
ASSIGN data-some_table[ 1 ]-some_field TO FIELD-SYMBOL(<lv_source>).
the same expressed as a string doesn't:
ASSIGN (`data-some_table[ 1 ]-some_field`) TO FIELD-SYMBOL(<lv_source>).
Alternative 1 for (name) of the ABAP keyword documentation for the ASSIGN statement says that "[t]he name in name is structured in the same way as if specified directly".
However, this declaration is immediately followed by "the content of name must be the name of a data object which may contain offsets and lengths, structure component selectors, and component selectors for assigning structured data objects and attributes in classes or objects", a list that does not include the table expressions we would need here.

Can I use filters in Intellij structural search to reference variable counts?

I am trying to create a custom inspection in IntelliJ using structural search. The idea is to find all methods that have one or more parameters, of which at least one is not annotated. Bonus: Only hit non-primitive types of parameters.
So far, I have created the following Search Template:
$MethodType$ $Method$(#$ParamAnnotation$ $ParameterType$ $Parameter$);
using these filters and the search target "complete match":
$Parameters$: count[1,∞]
$ParamAnnotation$: count[0,0]
However, this only hits methods without any parameters annotated. I want it to also match methods where only some parameters have an annotation but others don't.
Is it possible to reference the count of one variable in the filter of another, e.g. by using script filters? If so, how?

You can do this by creating a Search Template like this:
$MethodType$ $Method$($TypeBefore$ $before$,
#$ParamAnnotation$ $ParameterType$ $Parameter$,
$TypeAfter$ $after$);
Filters:
$Parameters$: count=[1,1] // i.e. no filter
$ParamAnnotation$: count=[0,0]
$before$: count=[0,∞]
$after$: count=[0,∞]
This will find all method with at least one parameter without annotation.

ABAP type pool: program with type code TYPP but with name longer than five characters

We are writing a tool in Java that parses and transforms ABAP code. We therefore have no intention to write new ABAP code but our tool has to handle all of ABAP, even obsolete statements. Furthermore, I'm not an ABAP expert.
ABAP programs can use type groups, introduced by key word TYPE-POOL. Names of type groups have a maximal length of five (internally eight, if you count the prefix "% C"), their type code is TYPP. In the past, relying on these assumptions worked well for us.
Recently, we see ABAP programs with type code TYPP but with name longer than 5, e.g., 'OIA===========================P'. Furthermore, for each of those, there is another, empty object with same name but type code INCL. These new objects are referenced only if a regular type group is, too.
These new objects may be internal ones and irrelevant for us - I haven't seen any reference to them in the ABAP Keyword Documentation. On the other hand, they are confusing us because we see them.
Can someone explain to me the meaning of these objects and point me to some documentation?
Edit: Here examples from an EHP7 for SAP ERP 6.0 system
An example object. Entries in D010INC look fine:
The same object now using type pool mrm. Where do the additional includes come from?

These objects are introduced through inclusions, extensions and switched objects. To read along:
Check type pool MRM, type mrm_idoc_data_ers - that type contains a statement to include rmrm_idoc_data_ers_sbo. A similar include statement pulls rmrm_upd_arseg_nfm into mrm_upd_arseg. That explains the last two lines. Your parser should have caught that.
RMRM_IDOC_DATA_ERS_SBO contains an enhancement point named RMRM_IDOC_DATA_ERS_SBO_02 that belongs to an enhancement spot ES_RMRM_IDOC_DATA_ERS_SBO. Similarly, RMRM_UPD_ARSEG_NFM contains an enhancement point RMRM_UPD_ARSEG_NFM_01 that belongs to the enhancement spot ES_RMRM_UPD_ARSEG_NFM.
For ES_RMRM_IDOC_DATA_ERS_SBO, an enhancement implementation named ISAUTO_MRM_RMRM_IDOC_DATA_ERS exists. For ES_RMRM_UPD_ARSEG_NFM, an implementation named /NFM/MM_RMRM_UPD_ARSEG_NFM exists. That explains the references ending with =E
The implementation ISAUTO_MRM_RMRM_IDOC_DATA_ERS is located in the package ISAUTO_MRM. The implementation /NFM/MM_RMRM_UPD_ARSEG_NFM is located in the package /NFM/MM. That explains the references ending with =P. Obviously, these references are not generated for every package:
The package ISAUTO_MRM is controlled by the switch AM_ERS, the package /NFM/MM is controlled by the switch /NFM/MM. That explains the references ending in =S.
Ultimately, these references can be used to determine which programs need to be re-generated when the state of a switch is changed.

Ocaml naming convention

I am wondering if there exists already some naming conventions for Ocaml, especially for names of constructors, names of variables, names of functions, and names for labels of record.
For instance, if I want to define a type condition, do you suggest to annote its constructors explicitly (for example Condition_None) so as to know directly it is a constructor of condition?
Also how would you name a variable of this type? c or a_condition? I always hesitate to use a, an or the.
To declare a function, is it necessary to give it a name which allows to infer the types of arguments from its name, for example remove_condition_from_list: condition -> condition list -> condition list?
In addition, I use record a lot in my programs. How do you name a record so that it looks different from a normal variable?
There are really thousands of ways to name something, I would like to find a conventional one with a good taste, stick to it, so that I do not need to think before naming. This is an open discussion, any suggestion will be welcome. Thank you!

You may be interested in the Caml programming guidelines. They cover variable naming, but do not answer your precise questions.
Regarding constructor namespacing : in theory, you should be able to use modules as namespaces rather than adding prefixes to your constructor names. You could have, say, a Constructor module and use Constructor.None to avoid confusion with the standard None constructor of the option type. You could then use open or the local open syntax of ocaml 3.12, or use module aliasing module C = Constructor then C.None when useful, to avoid long names.
In practice, people still tend to use a short prefix, such as the first letter of the type name capitalized, CNone, to avoid any confusion when you manipulate two modules with the same constructor names; this often happen, for example, when you are writing a compiler and have several passes manipulating different AST types with similar types: after-parsing Let form, after-typing Let form, etc.
Regarding your second question, I would favor concision. Inference mean the type information can most of the time stay implicit, you don't need to enforce explicit annotation in your naming conventions. It will often be obvious from the context -- or unimportant -- what types are manipulated, eg. remove cond (l1 # l2). It's even less useful if your remove value is defined inside a Condition submodule.
Edit: record labels have the same scoping behavior than sum type constructors. If you have defined a {x: int; y : int} record in a Coord submodule, you access fields with foo.Coord.x outside the module, or with an alias foo.C.x, or Coord.(foo.x) using the "local open" feature of 3.12. That's basically the same thing as sum constructors.
Before 3.12, you had to write that module on each field of a record, eg. {Coord.x = 2; Coord.y = 3}. Since 3.12 you can just qualify the first field: {Coord.x = 2; y = 3}. This also works in pattern position.

If you want naming convention suggestions, look at the standard library. Beyond that you'll find many people with their own naming conventions, and it's up to you to decide who to trust (just be consistent, i.e. pick one, not many). The standard library is the only thing that's shared by all Ocaml programmers.
Often you would define a single type, or a single bunch of closely related types, in a module. So rather than having a type called condition, you'd have a module called Condition with a type t. (You should give your module some other name though, because there is already a module called Condition in the standard library!). A function to remove a condition from a list would be Condition.remove_from_list or ConditionList.remove. See for example the modules List, Array, Hashtbl,Map.Make`, etc. in the standard library.
For an example of a module that defines many types, look at Unix. This is a bit of a special case because the names are mostly taken from the preexisting C API. Many constructors have a short prefix, e.g. O_ for open_flag, SEEK_ for seek_command, etc.; this is a reasonable convention.
There's no reason to encode the type of a variable in its name. The compiler won't use the name to deduce the type. If the type of a variable isn't clear to a casual reader from the context, put a type annotation when you define it; that way the information provided to the reader is validated by the compiler.

Description format for an embedded structure

I have a C structure that allow users to configure options in an embedded system. Currently the GUI we use for this is custom written for every different version of this configuration structure. What I'd like for is to be able to describe the structure members in some format that can be read by the client configuration application, making it universal across all of our systems.
I've experimented with describing the structure in XML and having the client read the file; this works in most cases except those where some of the fields have inter-dependencies. So the format that I use needs to have a way to specify these; for instance, member A must always be less than or equal to half of member B.
Thanks in advance for your thoughts and suggestions.
EDIT:
After reading the first reply I realized that my question is indeed a little too vague, so here's another attempt:
The embedded system needs to have access to the data as a C struct, running any other language on the processor is not an option. Basically, all I need is a way to define metadata with the structure, this metadata will be downloaded onto flash along with firmware. The client configuration utility will then read the metadata file over RS-232, CAN etc. and populate a window (a tree-view) that the user can then use to edit options.
The XML file that I mentioned tinkering with was doing exactly that, it contained the structure member name, data type, number of elements etc. The location of the member within the XML file implicitly defined its position in the C struct. This file resides on flash and is read by the configuration program; the only thing lacking is a way to define dependencies between structure fields.
The code is generated automatically using MATLAB / Simulink so I do have access to a scripting language to help with the structure creation. For example, if I do end up using XML the structure will only be defined in the XML format and I'll use a script to create the C structure during code generation.
Hope this is clearer.

For the simple case where there is either no relationship or a relationship with a single other field, you could add two fields to the structure: the "other" field number and a pointer to a function that compares the two. Then you'd need to create functions that compared two values and return true or false depending upon whether or not the relationship is met. Well, guess you'd need to create two functions that tested the relationship and the inverse of the relationship (i.e. if field 1 needs to be greater than field 2, then field 2 needs to be less than or equal to field 1). If you need to place more than one restriction on the range, you can store a pointer to a list of function/field pairs.
An alternative is to create a validation function for every field and call it when the field is changed. Obviously this function could be as complex as you wanted but might require more hand coding.
In theory you could generate the validation functions for either of the above techniques from the XML description that you described.

I would have expected you to get some answers by now, but let me see what I can do.
Your question is a bit vague, but it sounds like you want one of
Code generation
An embedded extension language
A hand coded run-time mini language
Code Generation
You say that you are currently hand tooling the configuration code each time you change this. I'm willing to bet that this is a highly repetitive task, so there is no reason that you can't write program to do it for you. Your generator should consume some domain specific language and emit c code and header files which you subsequently build into you application. An example of what I'm talking about here would be GNU gengetopt. There is nothing wrong with the idea of using xml for the input language.
Advantages:
the resulting code can be both fast and compact
there is no need for an interpreter running on the target platform
Disadvantages:
you have to write the generator
changing things requires a recompile
Extension Language
Tcl, python and other languages work well in conjunction with c code, and will allow you to specify the configuration behavior in a dynamic language rather than mucking around with c typing and strings and and and...
Advantages:
dynamic language probably means the configuration code is simpler
change configuration options without recompiling
Disadvantages:
you need the dynamic language running on the target platform
Mini language
You could write your own embedded mini-language.
Advantages:
No need to recompile
Because you write it it will run on your target
Disadvantages:
You have to write it yourself

How much does the struct change from version to version? When I did this kind of thing I hardcoded it into the PC app, which then worked out what the packet meant from the firmware version - but the only changes were usually an extra field added onto the end every couple of months.
I suppose I would use something like the following if I wanted to go down the metadata route.
typedef struct
{
unsigned char field1;
unsigned short field2;
unsigned char a_string[4];
} data;
typedef struct
{
unsigned char name[16];
unsigned char type;
unsigned char min;
unsigned char max;
} field_info;
field_info fields[3];
void init_meta(void)
{
strcpy(fields[0].name, "field1");
fields[0].type = TYPE_UCHAR;
fields[0].min = 1;
fields[0].max = 250;
strcpy(fields[1].name, "field2");
fields[1].type = TYPE_USHORT;
fields[1].min = 0;
fields[1].max = 0xffff;
strcpy(fields[2].name, "a_string");
fields[2].type = TYPE_STRING;
fields[2].min = 0 // n/a
fields[2].max = 0 // n/a
}
void send_meta(void)
{
rs232_packet packet;
memcpy(packet.payload, fields, sizeof(fields));
packet.length = sizeof(fields);
send_packet(packet);
}

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas