Lucene index replication - indexing

In a load-balanced environment I have a standalone Java process (essentially a Spring Boot jar; for simplicity let's call it Project 1) which reads some metadata and updates Lucene indexes at a certain location.
Then there is an actual web application (Project 2) through which I want to query these indexes (which Project 1 has created). What are the available options for sharing the index files?
Copy the index files periodically to the web application's Lucene location, which would probably not be workable as we may have to restart the application, I believe.
Maintain both projects as one package in a WAR, so that a single Lucene instance is available to both.
Any other replication strategy?
Any help on the above would be highly appreciated.
Best,
- Vaibhav

This really depends on your application's non-functional requirements and the architectural decisions driven by them.
But here are some thoughts:
Copying an index from folderA to folderB sounds like a pretty bad idea, especially if both applications have to run all the time.
You don't want a direct dependency between these two applications, so you should build your own Lucene component that serves the API functionality you need.
I would recommend building a component with a proper API. This component uses Lucene as a library, and in cases where multiple systems or instances want to use it, I would suggest a proper NRT (Near Real Time) implementation of Lucene.
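As a rough illustration of that NRT idea, the component can own the IndexWriter and hand out near-real-time searchers via SearcherManager. This is only a minimal sketch, assuming a recent Lucene (8.x/9.x); the class name, index path and field names are made up, not taken from the question:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;

/** One component owns the writer; searchers see changes near real time. */
public class SearchComponent implements AutoCloseable {

    private final FSDirectory dir;
    private final IndexWriter writer;
    private final SearcherManager searcherManager;

    public SearchComponent(String indexPath) throws IOException {
        dir = FSDirectory.open(Paths.get(indexPath));
        writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        searcherManager = new SearcherManager(writer, null);
    }

    public void index(String id, String body) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.YES));
        writer.updateDocument(new Term("id", id), doc);
        // Make the change visible to new searchers without a full commit.
        searcherManager.maybeRefresh();
    }

    public long count(String term) throws IOException {
        IndexSearcher searcher = searcherManager.acquire();
        try {
            return searcher.search(new TermQuery(new Term("body", term)), 10)
                    .totalHits.value;
        } finally {
            searcherManager.release(searcher);
        }
    }

    @Override
    public void close() throws IOException {
        searcherManager.close();
        writer.close();
        dir.close();
    }
}
```

Project 1 and Project 2 would then call this component's API (for example over HTTP) instead of sharing index files on disk.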

Related

Best way to share logic between teams (Web + Native)

We are 3 teams:
Website front-end (React)
Website back-end (Node.js)
Native app (React Native, Node.js)
We want to share logic (e.g. Validations).
As of now I found articles on 3 ways to do so:
An NPM package we will create for our own needs
A micro-service with endpoints that carry the relevant logic
Serverless functions that carry the relevant logic
Any other real-life, production suggestions?
Any other real-life, production suggestions?
Kind of - in no specific order:
You could specify the rules in a language/technology-agnostic way, and then have your app load them at runtime (or have them compiled in during the build). The rules could then exist as a config file, or even be fetched from a remote location (a variation on your options 2 & 3) - a minimal sketch follows the trade-off list below.
Of course, designing a language agnostic rules engine / approach is non-trivial, and depends on what you need the rules to do (how complex, etc). You might find a pre-built open source solution that does that.
I have seen people try this, but the projects never succeeded (for unrelated reasons). One team specified the rules in an Excel sheet.
But there are trade-offs:
Performance hit - how to take language agnostic rules and be able to execute them? This will probably take some translation. Native code is almost always going to be faster and more efficient.
Higher development effort.
Added complexity - harder to debug (even if you compensate by developing more mechanisms to assist you with that - which is more development effort).
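To make the "rules as data" option concrete, here is a minimal sketch, not a production rules engine: a shared JSON document (in a hypothetical format) that any of the three stacks could load, with a tiny interpreter shown in Java for brevity (the same shape works in Node/React Native). It assumes Java 17+ and Jackson for JSON parsing:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SharedValidationRules {

    // In practice this would come from a shared package, a config file or a
    // remote endpoint; the rule names and fields here are made up.
    private static final String RULES_JSON = """
        [
          {"field": "email",    "rule": "required"},
          {"field": "email",    "rule": "matches",   "value": "[^@]+@[^@]+"},
          {"field": "username", "rule": "minLength", "value": "3"}
        ]
        """;

    public static List<String> validate(Map<String, String> form) throws Exception {
        JsonNode rules = new ObjectMapper().readTree(RULES_JSON);
        List<String> errors = new ArrayList<>();
        for (JsonNode rule : rules) {
            String field = rule.get("field").asText();
            String input = form.getOrDefault(field, "");
            switch (rule.get("rule").asText()) {
                case "required"  -> { if (input.isEmpty()) errors.add(field + " is required"); }
                case "matches"   -> { if (!input.matches(rule.get("value").asText()))
                                          errors.add(field + " has an invalid format"); }
                case "minLength" -> { if (input.length() < Integer.parseInt(rule.get("value").asText()))
                                          errors.add(field + " is too short"); }
            }
        }
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Prints: [email has an invalid format, username is too short]
        System.out.println(validate(Map.of("email", "not-an-email", "username", "al")));
    }
}
```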
Regarding Your Options
For what it's worth, code / design-time sharing is an obvious approach, which I guess is sufficiently covered by NPM. I don't know enough about React and Node to know if they have any better ways of doing that. Normally, if I have logic I want to share, I'll use a purpose-built component (as lean as possible, minimal dependencies, intended to be reused across multiple projects) and ingest it (in C# / .NET) at compile/design time.
As an alternative to NPM you could look at dependency injection. This would allow you to do things like update the logic even after the app was deployed, as long as it can access wherever the newer set of rules lives. So it's a bit like your option 1 (NPM, code-level loading) but at runtime, and just once; and a bit like your options 2 & 3 - fetched remotely at runtime - the difference being that you're ingesting the logic rather than firing off questions and receiving answers (less chatty).
Service-based rules are good in that they are totally separated, but the obvious trade-offs are availability and performance at runtime.
I don't see a difference between your options 2 & 3 from the standpoint of creating, managing and sharing logic. The only material impact is on whoever implements and supports that service.

Testing reusable components / services across multiple systems

I'm currently starting a new project where we are hoping to develop a new system using reusable components and services.
We currently have 30+ systems that all have common elements, but at the moment we develop each system in isolation so it feels like we are often duplicating code and then of course we have 30+ separate code bases to maintain and support.
What we would like to do is create a generic platform using shared components to enable quick development of new collections, reusing code and reusing automated tests and reduce the code base that needs to be maintained.
Our thoughts so far are that we would have a common code base for specific modules for example User Management and Secure System Access, these modules could consist of their own generic web module, API and Context. This would create a generic package of code.
We could then deploy these different components/packages to build up a new system to save coding the same modules over and over again, so if the new system needed to manage users, you could get the User Management package and boom it does what you need. However, because we have 30+ systems we will deploy the components multiple times for each collection. Also we appreciate that some of the systems will need unique functionality so there would be the potential to add extensions to the generic modules for system specific needs OR to choose not to use one of the generic modules and create a new one, but use the rest of the generic components.
For example if we have 4 generic components that make up the system A, B, C and D. These could be deployed to create the following system set ups:
System 1 - A, B, C and D (Happy with all generic components)
System 2 - Aa, B, C and D (extended component A to include specific functionality)
System 3 - A, E, C and F (Can't reuse components B and D so create specific ones, but still reuse components A and C)
This is throwing up a few issues for me, as I need to be able to test this platform and each system to ensure they work, and this is the first time I've come across having to test a set-up like this.
I've done some reading around microservices and how to test them, but these articles often approach the problem for just one system built from microservices, whereas we are looking at multiple systems with different configurations.
My thoughts so far lead me to believe that, for the generic components that will be utilised by the different collections, I can create automated tests at the base code level; those tests will confirm the generic functionality, so it will not be necessary to retest those functions for each component, other than perhaps a manual sense check after deployment. Then, at each system level, additional automated tests can be added to check whatever specific functionality is created.
Ideally what I'd like is to have some sort of testing platform set up so that if a change is made to a core component such as User Management, it would be possible to trigger all the automated tests at the core level and then all of the specific system tests for every system that shares the component, to ensure that any changes don't affect core functionality or create a knock-on effect in the specific systems, followed by a quick manual check. I'm keen to try and remove the massive manual test overhead of checking 30+ systems each time a shared component is changed. A sketch of the kind of reusable test suite I have in mind follows this paragraph.
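The shape I'm imagining is a shared "test kit" that ships with each generic component and is extended by every system that uses it. This is illustrative only - our real stack is Visual Studio / SpecFlow / Selenium, but the pattern looks the same in Java/JUnit, and all the names below are made up:

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.util.HashMap;
import java.util.Map;

import org.junit.Test;

// A stand-in for the generic User Management module (hypothetical API).
interface UserManagementService {
    void register(String user, String password);
    boolean authenticate(String user, String password);
}

// The generic implementation shipped with the shared component.
class DefaultUserManagementService implements UserManagementService {
    private final Map<String, String> users = new HashMap<>();
    public void register(String user, String password) { users.put(user, password); }
    public boolean authenticate(String user, String password) {
        return password != null && password.equals(users.get(user));
    }
}

// The reusable "test kit": shipped with the component, extended by each system.
public abstract class UserManagementContractTest {

    /** Each system supplies its own configured (or extended, e.g. "Aa") instance. */
    protected abstract UserManagementService createService();

    @Test
    public void registeredUserCanAuthenticate() {
        UserManagementService service = createService();
        service.register("alice", "s3cret");
        assertTrue(service.authenticate("alice", "s3cret"));
        assertFalse(service.authenticate("alice", "wrong"));
    }
}

// In a specific system's test project: the same core checks run against that
// system's own wiring; system-specific tests live in separate classes.
class System1UserManagementTest extends UserManagementContractTest {
    @Override
    protected UserManagementService createService() {
        return new DefaultUserManagementService();
    }
}
```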
We work in an agile way, and for our current projects we have a strong continuous integration process set up: when a developer checks in some code (Visual Studio), this triggers a CI build (TeamCity / Octopus) that runs all of the unit tests; provided that all of these tests pass, it then triggers an integration build that runs my QA automated tests, which are a mixture of tests run at an API level and web tests using SpecFlow with PhantomJS or Selenium WebDriver. We would like to keep this sort of framework in place to keep the quick feedback loops.
It all sounds great in theory, but where I'm struggling is trying to put something into practice and create a sound testing strategy to cover this kind of system set up.
So really what I'm hoping is that there is someone out there who has encountered something similar in the past and has thoughts on ways to tackle this that have proven to work.
I'm keen to get a better understanding of how I could set up a testing platform / rig to aid the continuous integration for all systems considering that each system could potentially look different, yet have shared code.
Any thoughts or links to blogs / whitepapers etc. that you think might help would be much appreciated!!
Your approach is quite good, and since I'll soon have to face the same issues as you, I can give you my ideas so far. I'm pretty sure that to
create a sound testing strategy to cover this kind of system set up
can't be squeezed into one post. So the big picture looks like this (to me): you're in the middle of an Enterprise Application Integration process, and the fundamental area to be covered by tests will be the data migration. Maybe you also need to consider the concept of Service-Oriented Architecture for your
generic platform using shared components
since it will enable you to provide application functionality as services to other applications. An indirect benefit is that SOA dramatically simplifies testing: services are autonomous, stateless, with fully documented interfaces, and separate from the cross-cutting concerns of the implementation. There are a lot of resources on this, such as guides to E2E testing or to efficiently testing SOA.

Lucene in Java, C#.Net and C++. Which is the best version for long-term use on Windows server?

I am going to implement Lucene search in my project and I want to make the best start.
So I am choosing between the 3 versions of Lucene (Java / C#.NET / C++). Which is the best version according to these criteria:
1. Performance
2. Ease of implementation
3. Plenty of documentation?
Assume the system is a Windows server, and that I am asking with long-term use in mind.
Thanks
I would say Java. Lucene was initially developed in Java, and I would think there is a bigger community, more documentation and bigger deployments using Java.
Granted, Windows is not usually considered a primary platform for deploying Java services, but it would still work with flying colors. Many people use Windows for Java development and even deployment, so I don't expect any major issues.
Unless you've got a specific feature you need, I would look at best being:
a) Whatever platform you are developing the program in -- there are lots of advantages to not having to switch tools/contexts/platforms to muck around with the search internals.
b) Whatever platform your ops guys want to deal with -- lots of Windows ops guys hate dealing with Java, for example, as it is a strange foreign language to them.
c) All of the above being equal, Java is the real flagship Lucene project that everyone else is keeping up with, and the one that has the most tools & resources. It is the way to go if you don't have any reason not to use Java. Solr is another advantage here -- you can pretty easily use a pre-wrapped, fully functional Lucene HTTP server.
In any case, keep in mind that, at least in theory, a Lucene index written on one platform is readable by the others, so you don't necessarily have to fully commit to a single platform.
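As a rough illustration of that portability, a reader only needs the index directory and a port that supports the index-format version that wrote it. A minimal read-only sketch in Java (the path and field names are made up):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class ReadOnlySearch {
    public static void main(String[] args) throws Exception {
        // Open an index directory that some other process (or port) wrote.
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("/data/shared-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            ScoreDoc[] hits = searcher.search(
                    new TermQuery(new Term("body", "lucene")), 10).scoreDocs;
            for (ScoreDoc hit : hits) {
                Document doc = searcher.doc(hit.doc);
                System.out.println(doc.get("id"));
            }
        }
    }
}
```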

Which one is better for efficient free text search, Hibernate Search or Lucene?

We are developing a web application using Spring MVC, Spring and Hibernate.
We need to add efficient free-text search capabilities to our application. For this we are thinking of using either Hibernate Search (it uses Lucene under the hood) or Lucene directly.
What is the best option for us, given that we are already using Hibernate in our application? What are the pros and cons of one over the other?
Thanks.
You said it yourself - you'll be using Lucene one way or the other.
The raw Lucene API isn't very easy to use. It's much more low-level than Hibernate Search. If you're already using Hibernate, then it's a no-brainer - use Hibernate Search to implement your text search functionality.
Disclaimer: I'm one of the developers of Hibernate Search.
The goal of the project is not to compete with Lucene or Solr, but to make integration with Hibernate applications as easy as possible, so you avoid having to keep the two worlds in sync and duplicate all mapping and CRUD operations.
While we provide some common helpers and a nice encapsulation, Hibernate Search can also hand you a direct reference to the Lucene API, so in case you find yourself needing the "raw" Lucene API you will never be stuck. Also, for writing to the index, Hibernate Search provides a common pattern that covers most known requirements, but in case you have very non-standard requirements you can take full control of the written Documents.
Solr is a good alternative, but as it is a separate server you have to interact with it via REST APIs, which is quite different, with its own pros and cons. Having a second service to manage is not always wanted, and of course the remote invocations will never be as efficient as direct references to Lucene and all its internal filters and caches.
Not all functionality of Lucene can be exposed via a remote API, and if you need to do some "low level" operation that is not implemented in Solr, you won't be able to do it (without patching Solr). Still, Solr is very cute, especially when you want to share the index with other non-Java applications, so we might add a Solr backend for Hibernate Search to eventually keep a Solr server in sync (especially if there's interest in it, and possibly some help).
Finally, the Lucene API is really hard-core stuff. We spend a lot of effort making the best use of it to provide top performance while exposing a stable API to people using Hibernate Search; basically, all releases so far have been backwards compatible, providing a "drop-in" performance boost from the latest and greatest tricks in Lucene - which actually changes its API quite often. These changes are always exciting, but be prepared to maintain that in your application if you don't use a proper abstraction.
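To give a sense of what the Hibernate Search route looks like in practice, here is a minimal sketch using the Hibernate Search 5.x-style annotations and query DSL; the entity and field names are made up, not from the question:

```java
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;

import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.Search;
import org.hibernate.search.query.dsl.QueryBuilder;

import java.util.List;

@Entity
@Indexed // this mapped entity is also indexed by Lucene under the hood
public class Book {

    @Id
    private Long id;

    @Field
    private String title;

    @Field
    private String description;

    // getters/setters omitted for brevity
}

class BookSearch {

    @SuppressWarnings("unchecked")
    static List<Book> search(EntityManager em, String terms) {
        FullTextEntityManager ftem = Search.getFullTextEntityManager(em);
        QueryBuilder qb = ftem.getSearchFactory()
                .buildQueryBuilder().forEntity(Book.class).get();
        org.apache.lucene.search.Query luceneQuery = qb.keyword()
                .onFields("title", "description")
                .matching(terms)
                .createQuery();
        // The Lucene query is wrapped so results come back as managed entities.
        return ftem.createFullTextQuery(luceneQuery, Book.class).getResultList();
    }
}
```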
The other way of using Lucene is to go through the middleman API known as Solr. Solr uses Lucene internally and lets you perform searches via HTTP calls. Please note that you will need to build and parse the XML that Solr consumes and produces. Most of the functionality of Lucene is exposed via Solr, which should be really helpful.

Where do I begin learning Lucene.NET Solr Hadoop and MapReduce?

I'm a .NET developer and I need to learn Lucene so we can run a very large-scale search service that removes entries the end user doesn't have access to (i.e. a user can search all documents with clearance level 3 or higher, but not clearance level 2 or 1).
Where do I start learning, which products should I consider? To be honest, I'm a little overwhelmed, but I'm determined to figure it all out... eventually.
If you want a book that covers all the basics of Lucene, consider "Lucene in Action". Even though the code samples are Java, you can easily port them to .NET. Of course, there also are tonnes of resources on the web, such as SO and the Lucene mailing lists which should help you along.
For the project you describe, you should look at Solr, since it abstracts away a lot of the scalability issues etc. and, via SolrNet, can easily be integrated into your .NET app. To restrict access by level, your index documents should contain a field called "Level" (say), and in the background of your user query you append a clause on that field (e.g. a range such as Level:[3 TO *] for a level-3 user), using a boolean query construct.
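For illustration, the appended filter could look like the following. Your .NET app would do this with SolrNet, but the query shape is the same; this sketch uses Java's SolrJ client, and the core URL and field names are made up:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ClearanceSearch {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build()) {
            int userClearance = 3;

            // The user's own search terms.
            SolrQuery query = new SolrQuery("contents:report");
            // Appended in the background: only documents at or above the
            // user's clearance level are returned.
            query.addFilterQuery("Level:[" + userClearance + " TO *]");

            QueryResponse response = solr.query(query);
            response.getResults().forEach(doc ->
                    System.out.println(doc.getFieldValue("id")));
        }
    }
}
```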
At this stage, my recommendation would be to stay away from Hadoop (the Apache MapReduce implementation) for your project and stick with Solr. If you are keen to learn about it, however, it too has a very useful book - you guessed it, "Hadoop in Action" (also from Manning Publications).
You seem to be confused about what exactly each project (Lucene/Solr/Hadoop/etc) does. So the first thing to do would be understanding the purpose of each project. Read the docs and blogs about them. If possible, buy and read books about them.
For example, MapReduce and Hadoop have nothing to do with your security requirements. Hadoop is a platform for distributed, scalable computing. But Solr is scalable on its own. You might want to use Hadoop to distribute a crawler though (e.g. Nutch).