The right Mallet class for a Topic Model - text-mining

I'm working with the Mallet library for a project in Java.
I have 15,000 documents with 400 tokens each. I tried using ParallelTopicModel. But I would like to have a set of topics that contain both single tokens and sequences of tokens (e.g. "Java" as well as "Java Developer").
I am considering using LDA-HMM. What class of Mallet can I use?
Then I'll turn every topic into nodes of a Bayesian network, to receive as evidence a token or sequence of tokens, and make inferences. Which Java library can I use for that?
Thanks in advance.
Francesco

Related

ArchUnit to test actual layered architecture

Currently in our project we have layered architecture implemented in following way where Controller, Service, Repository are placed in the same package for each feature, for instance:
feature1:
Feature1Controller
Feature1Service
Feature1Repository
feature2:
Feature2Controller
Feature2Service
Feature2Repository
I've found following example of arch unit test where such classes are placed in dedicated packages https://github.com/TNG/ArchUnit-Examples/blob/master/example-junit5/src/test/java/com/tngtech/archunit/exampletest/junit5/LayeredArchitectureTest.java
Please suggest whether there is possibility to test layered architecture when all layers are in single package
If the file name conventions are followed properly across your project, how about you write custom test cases instead of using layeredArchitecture().
For Example:
classes().that().haveSimpleNameEndingWith("Service")
.should().onlyBeAccessed().byClassesThat().haveSimpleNameEndingWith("Controller")
noClasses().that().haveSimpleNameEndingWith("Service")
.should().accessClassesThat().haveSimpleNameEndingWith("Controller")
I know this question is rather old. But for the record, this has been possible for a while using predicates for the layers, e.g.
layeredArchitecture().consideringAllDependencies()
.layer("Controllers").definedBy(HasName.Predicates.nameEndingWith("Controller"))
.layer("Services").definedBy(HasName.Predicates.nameEndingWith("Service"))
.layer("Repository").definedBy(HasName.Predicates.nameEndingWith("Repository"))
.whereLayer("Controllers").mayNotBeAccessedByAnyLayer()
.whereLayer("Services").mayOnlyBeAccessedByLayers("Controllers")
.whereLayer("Repository").mayOnlyBeAccessedByLayers("Services")
However, I'm not sure how well this works in practice. Because usually you don't just have classes following this naming pattern and that's it. A service might also have some POJO as method parameter type (e.g. MyInput) and that should maybe for example not be used by repositories as well. Also, using forward dependency rules (mayOnlyAccessLayers(..)) this might then cause unwanted violations.

OpenTest custom test actors

I'm really impressed with OpenTest project. Found it highly intriguing how many ideas this project is sharing with some projects I created and worked on. Like your epic architecture with actors pulling tasks.. and many others :)
Did you think about including other automation technologies to base Actors on?
I could see two main groups:
1 Established test automation tooling like TestCafe (support for non-selenium gui testing could leverage the whole solution a lot)
2 Custom tooling needed for specific tasks. Would be great to have an actor with some domain-specific capabilities. Now as I can see this could be achieved by introducing another layer of execution worker called by an actor using rest api. What I mean is the possibility of using/including them as new 'actor types' with custom keywords releted.
Thank you for your nice words. We spent a lot of time thinking through the architecture and implementation of OpenTest and it's very rewarding to see that people understand and appreciate the design.
Implementing new keywords (test actions) can be done without creating custom test actors, by creating a new Java class that inherits from the TestAction base class and override its run method. For a simple example, you can take a look at the implementation of the Delay test action. You can then package the new test action in a JAR and drop it (along with any dependencies) in the user-jars subdirectory in your test actor's working directory. The test actor will dynamically load all the JARs it finds in there and will find the new test action class (using reflection) so you can make use of it in your tests. Some useful info and things to look out for:
Your Java project is going to have to define a dependency on the opentest-base project (which is where the TestAction base class is implemented).
When you copy the JAR to where your test actor is, make sure to copy any dependency JARs along with it. Please note that a lot of the dependencies that you might need are already included with the core test actor binaries (you can have a look at the POM.xml to see what they are).
If you happen to have any dependencies that conflict with the other JARs that included with the core test actor binaries, you can apply a technique called shading to "hide" the conflicting classes under a different package name. Most of the times you're not going to need this, but if you do and you get stuck let me know and I'll give you some pointers.
Here's sample project that demonstrates how to build an OpenTest extension that creates a couple of custom keywords: https://github.com/adrianth/opentest-extension-sample
And here's an extensive video tutorial about creating custom OpenTest keywords: https://getopentest.org/tutorials/custom-keywords.html

Is specific Lucene classes are intended to be consumed by applications?

I'm new to the Apache Lucene library. I'd like to directly consume a class in this library called: LevenshteinDistance to calculated similarity search between strings. Would that be correct for my own application to directly consume it, or should I go thru the Lucene api?
Just using that single class is totally ok, but if you just need that you should take the soure code of that class, remove unneded Lucene dependencies, and use it. Lucene is a huge thing and you don't want to have it in your project if you only needs to compute a string distance.
One thing: In the source code for LevenshteinDistance.java there's a comment mentioning that the code was taken from Apache Commons "StringUtils" clas. Maybe you should just add that. It's here: https://commons.apache.org/proper/commons-lang/

Wrapping ROS messages using a backward compatible format?

ROS message definitions are neither backward nor forward compatible. This is a problem when building largish systems in largish organizations, for all the same reasons that are familiar to those working with RPC-style messaging in distributed systems. In that world the problem has been solved for a long time using backward-compatible message formats (e.g. protobuffers, thrift, flatbuffers, etc.)
Question: does anyone have any real-life/production-tested experience/code/links to share that use similar schemes with ROS? I have done the obvious things already (sticking a serialized flatbuffer inside a byte array in a ROS message), but want to see if people have already done something better.
I think this is a feature that is missing in ROS and ROS2. A workaround could be to wrap a JSON or XML file as a string message inside your ROS messages. Inside the JSON you can have a protocol version string to archive backward compatible messages.

Implementing Java serialization independent encoding

I want to use several encodings in the presentation layer to encode a object/structure in the application layeri independenty from encoding scheme (such as binary, XML, etc) and programming language (Java, Javascript, PHP, C).
An example would be to transfer an object from a producer to a consumer in a byte stream. The Java client would encode it using something like this:
Object var = new Dog();
output.writeObject(var);
The server would share the Dog class definitions and could regenerate the object doing something like this:
Object var = input.readObject();
assertTrue(var instanceof Dog); // passes
It is important to note that producer and consumer would not share the type of var, and, therefore, the consumer would not need the type to decode var. They only would share data type definitions, if ever:
public interface Pojo {}
public class Dog implements Pojo { int i; String s; } // Generated by framework from a spec
What I found:
Java Serialization: It is language dependent. Cannot be used with for example javascript.
Protobuf library: It is limited to a specific binary format. It is not possible to support additional binary formats. Need name of class ("class" of message).
XStream, Simple, etc. They are rather limited to text/XML and require name of the class.
ASN.1: The standards are there and could be used with OBJECT IDENTIFIER and type definitions but they lack on documentation and tutorials.
I prefer 4th option because, among others, it is a standard. Is there any active project that support such requirements (specially something based on ASN.1)? Any usage example? Does the project include codecs (DER, BER, XER, etc.) that can be selected at runtime?
Thanks
You can find several open source and commercial implementation of tools for ASN.1. These usually include:
a compiler for the schema, which will generate code in your desired programming language
a runtime library which is used together with the generated code for encoding and decoding
ASN.1 is mainly used with the standardized communication protocols for telecom industry, so the commercial tools have very good support for the ASN.1 standard and various encoding rules.
Here are some starter tutorials and even free e-books:
http://www.oss.com/asn1/resources/asn1-made-simple/introduction.html
http://www.oss.com/asn1/resources/reference/asn1-reference-card.html
http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html
I know that the OSS ASN.1 commercial tools (http://www.oss.com/asn1/products/asn1-products.html) will support switching the encoding rules at runtime.
To add to bosonix's answer, there's also Objective System's tools at http://www.obj-sys.com/. The documentation from both OSS and Objective Systems includes many example uses.
ASN.1 is pretty much perfect for what you're looking for. I know of no other serialisation system that does this quite so thoroughly.
As well as a whole array of different binary encodings (ranging from the comprehensively tagged BER all the way down to the very packed-together PER), it does XML and now also JSON encodings too. These are well standardised by the ITU, so it is in theory fully inter operable between tool vendors, programming languages, OSes, etc.
There are other significant benefits to ASN.1. The schema language lets you define constraints on the value of message fields, or the sizes of arrays. These then get checked for you by the generated code. This is far more complete than many other serialisations. For instance, Google Protocol Buffers doesn't let you do this, meaning that you have to check the range of message fields (where applicable) in hand written code. That's tedious, error prone, and hard to maintain.
The only other ones that do this are XSD and JSON schemas. However with those you're at the mercy of the varying quality of tools used to turn those into source code - I've not yet seen any decent ones for JSON schemas. I'm not aware of whether or not Microsoft's xsd.exe honours such constraints either.