Use Hibernate-search in multiple database instances - lucene

I am working on a Hibernate Search application. So far I have indexed a Postgres database and written some Lucene queries, and everything is working fine.
My concern is whether it is possible to use Hibernate Search with multiple database instances (where all instances have the same schema). I want to split the information across several instances (not only local ones) and avoid storing all the data in a single one.
What is the best approach to do this?
Can I use Hibernate-search to do this?
I would expect there is a way to have several instances feed the same Lucene index, so I could query that index and retrieve information no matter which instance the data comes from. But I suppose the indexing process would have to be different from the usual one.

Currently the best approach would be using Hibernate Search 6 and its Elasticsearch integration. The idea is that instead of indexing locally, Hibernate Search delegates to a cluster of Elasticsearch nodes (which can be only one node if that's what you want). Clustering will be completely transparent to your application.
If you're currently on Hibernate Search 5, upgrading to 6 will require some work, since the artifact IDs, internal architecture and APIs changed. Hibernate Search 6 is also currently in Beta, so the APIs are reasonably stable but the migration guide is not ready yet, for example. I'd recommend having a look at the getting started guide for Search 6 before you make the jump.
If you want to stay on Search 5, or if you want to avoid Elasticsearch for some reason, then your only remaining option is to use Search 5's "clustering" backends relying on JMS or JGroups. However, these architectures are complex to set up, which is the very reason we gave up on them in Search 6 (at least for now) and focused on the Elasticsearch integration. See here for JMS and here for JGroups.
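To give you an idea of what querying looks like once that is set up, here is a minimal sketch of a Hibernate Search 6 query (a hypothetical Book entity with an indexed title field is assumed; the fluent method names shifted a bit between the 6.x betas, this is the current shape):

```java
import java.util.List;
import javax.persistence.EntityManager;
import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.session.SearchSession;

public class BookSearch {

    // Minimal Hibernate Search 6 query: the SearchSession wraps the JPA
    // EntityManager, the query runs against the (possibly clustered)
    // Elasticsearch index, and the hits come back as managed Book entities.
    public static List<Book> searchByTitle(EntityManager entityManager, String terms) {
        SearchSession searchSession = Search.session(entityManager);
        return searchSession.search(Book.class)
                .where(f -> f.match().field("title").matching(terms))
                .fetchHits(20); // at most 20 hits
    }
}
```

Which Elasticsearch node(s) Hibernate Search talks to is purely configuration (host list, protocol, credentials), so clustering really is invisible to the code above.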

Related

Proper way to full text search an sql database using Elasticsearch in a Spring boot application

I want to add full-text search functionality to my Spring Boot application. The data should be stored in an SQL database, and I have also read that using ES as a primary database is not recommended.
One approach I thought of: perform create, update and delete operations on both the primary SQL database and on ES (which we can do using the Java High Level REST Client). For example, when inserting a row in SQL, we index it in ES as well, and then we perform searches using Elasticsearch.
I think we could also use Hibernate Search.
Is this the right approach? If not, any other suggestions?
The main difference is that Hibernate Search provides integration between JPA and your index of choice (Lucene or Elasticsearch):
Hibernate Search will automatically add/update/delete documents in your full-text index according to changes in your JPA entities (as soon as you commit a transaction).
Hibernate Search will allow you to build a full-text query (full-text world) and retrieve the results as managed entities (JPA world).
As far as I understand, Spring-Data-Elasticsearch is focused on accessing Elasticsearch and has no JPA integration whatsoever. That is to say, you can use Spring-Data-JPA, and you can use Spring-Data-Elasticsearch, but they won't communicate with each other. You will have two separate models, which you will update and query separately.
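To make that integration concrete, here is a rough sketch of the mapping side using Hibernate Search 6 annotations (the entity and fields are invented; Hibernate Search 5 uses @Indexed/@Field instead):

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;

@Entity
@Indexed // tells Hibernate Search to maintain an index (Lucene or Elasticsearch) for this entity
public class Article {

    @Id
    @GeneratedValue
    private Long id;

    @FullTextField // analyzed full-text field in the index
    private String title;

    @FullTextField
    private String body;

    // getters and setters omitted
}
```

With that in place, a plain entityManager.persist(...) inside a transaction is enough: on commit, Hibernate Search adds or updates the corresponding document in Lucene or Elasticsearch, without any explicit indexing call in your code.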
Some other elements:
If you don't need a distributed index, Hibernate Search can run in embedded Lucene mode, without all the Elasticsearch stack. It will probably be more lightweight.
Hibernate Search is currently not very flexible when it comes to customizing your Elasticsearch mapping or using advanced Elasticsearch features, because of the abstraction layer. That will change in the future, though (Hibernate Search 6).
A Spring-Data-HibernateSearch module is in the works, allowing you to benefit from the best of both worlds. It hasn't been released yet and isn't well documented yet, though: https://github.com/snowdrop/spring-boot-hibernate-search-booster
If you only need simple full-text search, consider PostgreSQL; I'm using it to index and search document content: https://www.postgresql.org/docs/9.1/textsearch-controls.html .
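For illustration, a small sketch of what that looks like from Java over JDBC, using the to_tsvector/plainto_tsquery functions from the linked docs (table and column names are invented; in practice you would store a precomputed tsvector column with a GIN index rather than calling to_tsvector per row at query time):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PgFullTextSearch {

    public static void search(String jdbcUrl, String terms) throws Exception {
        // Rank documents whose text matches the search terms, best matches first.
        String sql =
            "SELECT id, title, ts_rank(to_tsvector('english', content), query) AS rank " +
            "FROM documents, plainto_tsquery('english', ?) AS query " +
            "WHERE to_tsvector('english', content) @@ query " +
            "ORDER BY rank DESC LIMIT 10";
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, terms);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("title") + " (" + rs.getFloat("rank") + ")");
                }
            }
        }
    }
}
```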

Backend solution for fetching and transforming data from various third-party APIs

We are building new feature sets for one of our financial applications. We have our own SQL Server database, and we will be calling multiple RESTful APIs that return JSON responses. For example, some return news data, some return stock info, some return financial data, and our own SQL Server database has employee data. So they all come in their own different data formats. The new app we are building is going to aggregate all of this data and transform it into a meaningful display on the web, like mint.com does.
Web application will display analytical reports based on these data
There will be an option to download reports through various templates
We are completely open in terms of technology stack for our backend and middle tier. As a first thought, NoSQL options like MongoDB and Elasticsearch for search and reporting come to mind. There will be a web application built on top of this data (stored or retrieved from the APIs), most likely in ASP.NET MVC.
We need your input, especially if you have experience building similar enterprise solutions.
Can you please share your opinions on,
What are some good tech stacks you would pick for this app?
How would that scale, now and in the future when the APIs' data formats change?
Performance is also important, since the data will be displayed in a web UI.
We have a similar setup to what you are mentioning, using ASP.Net MVC with ElasticSearch (SQL server for relational data, periodically updating ES), aggregating data (XML/JSON) from multiple sources, although with the purpose of improving searching and filtering results instead of reporting. However, I would expect that the scenario you are looking at would also be a suitable match for ElasticSearch, depending on your specific requirements.
1) Since you are already using SQL Server (and I expect are familiar with that), I would suggest combining that with ElasticSearch - the additional mongodb layer seems unnecessary, in terms of maintenance of another technology and development to fit that integration. There is a very good C# library (two actually, ElasticSearch.Net and NEST, used together) that exposes most of the ES functionality.
2) We chose ElasticSearch for its scalability combined with flexibility and ease of use. A challenge you may face is mapping documents from C# classes to ElasticSearch documents. It is incredibly easy to set up, but you do need to do some planning to index data the way you want to search and retrieve it. So if you choose ES as a platform, spend some time on the structure of the documents: by default, dynamic mapping is enabled, so you can pretty much throw any JSON into a document. For a production environment, however, it's better to turn that off and set up one or more explicit mappings, so documents can be queried in a standardized way (see the mapping sketch below).
3) Performance is a key factor for us as well, which is why we were looking at Lucene-based engines like Solr and ElasticSearch when doing research, along with NoSQL databases. ElasticSearch outperforms SQL Server by 10 to 1 or better in most scenarios. Solr vs. ElasticSearch performance depends on the scenario; benchmarks and comparisons are available if you Google for them. The exception may be if many documents must be retrieved in one query: ES (or Lucene) is not made for that use case; it's best at fast retrieval of a smaller page of results (similar to Google's per-page result count). If you need 1000 documents per page/result, a NoSQL database may be a better option.
ElasticSearch is fast to get up and running - install it on a local development box and try it out, you'll get a feel for if it fits.
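On the explicit-mapping point in 2) above: the mapping itself is just JSON that you PUT when creating the index, whatever client you use (NEST exposes the same concept through its mapping API). Below is a rough sketch using the Elasticsearch low-level Java REST client with an ES 7.x-style mapping and made-up index and field names:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class CreateStrictIndex {

    public static void main(String[] args) throws Exception {
        // "dynamic": "strict" makes Elasticsearch reject documents containing
        // fields that were not declared in the mapping, instead of silently
        // guessing a type for them.
        String mapping =
            "{" +
            "  \"mappings\": {" +
            "    \"dynamic\": \"strict\"," +
            "    \"properties\": {" +
            "      \"title\":  { \"type\": \"text\" }," +
            "      \"symbol\": { \"type\": \"keyword\" }," +
            "      \"price\":  { \"type\": \"double\" }," +
            "      \"asOf\":   { \"type\": \"date\" }" +
            "    }" +
            "  }" +
            "}";

        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("PUT", "/financial-data");
            request.setJsonEntity(mapping);
            client.performRequest(request);
        }
    }
}
```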
From my experience, MongoDB is the worst choice for reporting, especially for aggregation. It lacks good aggregation functionality, has some data type conflicts (such as decimals being stored as strings, which you cannot use in its built-in aggregation framework API), and you'll probably have to maintain map-reduce functions in JavaScript for most scenarios.
If your application's true nature is only reports, and they do not have to be updated in real time, I would drop the on-demand RPC calls to external APIs. I would consider copying as much data as possible ahead of time, storing it under a schema that is the most convenient for you to work with, and synchronising it afterwards at scheduled, predictable intervals.
I wouldn't be in a hurry to assume that data will always be available, nor that it will always be in the format you expect. You also gain optimisation benefits by running your own copy of it, indexed the way you want, instead of trying to figure out which of the RPCs is your bottleneck.
As for your questions:
1) If you don't mind using Python, I would pick Django on top of a PostgreSQL database. Django is a fully featured, sturdy ORM + web framework which is excellent for this kind of work. If not, just stick to a relational SQL database. I've heard wonders about Cassandra but haven't tried it yet.
2 + 3) As I mentioned before, replicate as much of the data as possible for your own good. Once everything is "in house" you can cluster it and tweak it freely. Using a distributed cache such as Redis against heavy client requests is also a good idea, instead of generating those reports on demand each time.
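To illustrate the caching idea from 2 + 3) above, a minimal sketch using the Jedis client in Java (key scheme, TTL and the report-building method are all made up; the same pattern applies with a .NET Redis client if you stay on ASP.NET):

```java
import redis.clients.jedis.Jedis;

public class ReportCache {

    private static final int TTL_SECONDS = 3600; // regenerate at most once per hour

    // Return the cached report JSON if present, otherwise build it once and cache it.
    public static String getDailyReport(Jedis jedis, String day) {
        String key = "report:daily:" + day;            // hypothetical key scheme
        String cached = jedis.get(key);
        if (cached != null) {
            return cached;
        }
        String report = buildReportFromWarehouse(day); // expensive aggregation, runs rarely
        jedis.setex(key, TTL_SECONDS, report);
        return report;
    }

    private static String buildReportFromWarehouse(String day) {
        // Placeholder for the real aggregation against your own copy of the data.
        return "{\"day\":\"" + day + "\",\"items\":[]}";
    }
}
```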
I've been using Jasper Reports and the Jasper Reports Server to integrate into our web app. Jasper accepts many different datasource types, including JSON and SQL Server. The core version is free and allows you to produce HTML and PDF reports of high complexity. The paid version with the server allows you to easily integrate it into your web app. The core is Java Spring (partially open source) running on Tomcat/JBoss, and you can interact with it using REST web services or the visualize.js library for your web front end. It uses Highcharts, which can produce some beautiful results, and has options for ad hoc reporting and dashboards built from many reports.
See demos here: http://www.jaspersoft.com/
This has an assumed stack of your backend db's and data sources, tomcat with Java Spring, web front end HTML/Javascript.
The tool is used by many large enterprises, including Amazon, so scalability shouldn't be an issue.
If the format of your data changes you'll need to change the report. Report definitions are XML, edited through a WYSIWYG GUI.

I need advice choosing a NoSQL database for a project with a lot of minute related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about hundreds of public transit agencies: their routes, stations, times, and other related information. I will be getting my information from here and from the Google Code wiki page with similar info. There is a lot of data, and it's partitioned into multiple CSV-formatted text files. These can be huge, some ranging from 80 to 100 MB.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their GeoSpatial support that can really optimize getting small datasets. But I also need to be sure to link all the stops on a route because I will be propagating information along a transit route for that line. In this case I have found that I can benefit from a Graph DB like Neo4j and OrientDB, but from what I know, neither has GeoSpatial support nor am I 100% sure that a Graph DB would be what I need.
The perfect solution might not exist, but I come here asking for help on finding the best possible one for my situation. I know I will possibly have to work around limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.
I have also been suggested to splinter the data into multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits your problem 100%. My advice here is to go for a geospatial module on top of Neo4j or OrientDB, although there are other free and open source implementations.
I think the best one right now, with all the geospatial features implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
BTW, talking about splitting: if the amount of data/queries will be high, I strongly recommend spreading the load and designing your model with that in mind from the start. You can surely do something there.
I've used Mongo's GeoSpatial features and can offer some guidance if you need help with a C# or javascript implementation - I would recommend it to start because it's super easy to use. I'm learning all about Neo4j right now and I am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node go out and grab a list of movies from the related profile.
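Since your immediate need is finding surrounding stops from a GPS position, here is a minimal sketch of what that looks like with Mongo's geospatial support, using the current MongoDB Java driver (database, collection and field names are made up, and the stops would need a GeoJSON location field):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.geojson.Point;
import com.mongodb.client.model.geojson.Position;
import org.bson.Document;

public class NearbyStops {

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost")) {
            MongoCollection<Document> stops =
                client.getDatabase("transit").getCollection("stops");

            // One-time setup: a 2dsphere index on the GeoJSON location field.
            stops.createIndex(Indexes.geo2dsphere("location"));

            // All stops within 500 meters of the given GPS position, nearest first.
            Point here = new Point(new Position(-73.985, 40.748)); // longitude, latitude
            for (Document stop : stops.find(Filters.nearSphere("location", here, 500.0, 0.0))) {
                System.out.println(stop.getString("name"));
            }
        }
    }
}
```

The documents returned here could carry the Neo4j node id (or the Mongo object id could be stored on the node, as described above), so the route/graph traversal and the geospatial lookup stay cross-referenced.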
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important. For instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo, which can be combined into larger aggregate reports as needed (this glosses over a few special cases, but they are not relevant to the point). On the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL Server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if pulling an entity from the database (in the relational context: querying the data required to generate the entity or list of entities that fulfils the requested parameters) does not require significant processing (multiple joins, for instance)
Do you require ACID compliance (aside: if you have a graph problem, you can leverage Neo4j for this)? There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention - Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user based customization in terms of targeting offers, advertising, etc..
NoSQL only exists because MySQL users assume that all databases have MySQL's performance problems once the database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.
http://postgis.refractions.net/
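For a rough idea of what the nearby-stops lookup would look like (table and column names invented, query issued via plain JDBC): ST_DWithin over a geography column does the radius filtering, and a GiST index on that column keeps it fast.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PostgisNearbyStops {

    // Stops within radiusMeters of (lon, lat), assuming a table
    //   stops(id, name, geom geography(Point, 4326))
    // with a GiST index on geom.
    public static void findNearby(String jdbcUrl, double lon, double lat, double radiusMeters)
            throws Exception {
        String sql =
            "SELECT id, name FROM stops " +
            "WHERE ST_DWithin(geom, ST_SetSRID(ST_MakePoint(?, ?), 4326)::geography, ?) " +
            "ORDER BY ST_Distance(geom, ST_SetSRID(ST_MakePoint(?, ?), 4326)::geography)";
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setDouble(1, lon);
            ps.setDouble(2, lat);
            ps.setDouble(3, radiusMeters);
            ps.setDouble(4, lon);
            ps.setDouble(5, lat);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}
```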

Solr on a .NET site

I've got an ASP.NET site backed by a SQL Server database. I've been using Lucene.NET to index and search the database. I'm adding faceted search navigation to the results page (the facets are a hierarchical category tree). I asked yesterday to make sure I was using the right technique for faceting. All I've gotten so far is a suggestion to use Solr, but Solr does a lot of things I don't need.
I would really like to know from anyone who is familiar with Solr's source code whether Solr's facet processing is terribly different from the one described here by Bert Willems. Basically you have a Lucene filter for each facet, you get the bit array from it, and you count the set bits in the array.
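In code, the idea I have in mind looks roughly like this (sketched in Java against a recent Lucene; the Lucene.NET API of that era uses Filter and OpenBitSet instead, but the principle is the same): build one cached bit set per facet value, collect the current query's matches into a bit set, and each facet count is simply the size of the intersection.

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;
import org.apache.lucene.util.FixedBitSet;

public class BitSetFacetCounter {

    // Collect the documents matching a query into a single bit set over the whole index.
    static FixedBitSet collectBits(IndexSearcher searcher, Query query) throws IOException {
        FixedBitSet bits = new FixedBitSet(searcher.getIndexReader().maxDoc());
        searcher.search(query, new SimpleCollector() {
            private int docBase;
            @Override protected void doSetNextReader(LeafReaderContext context) { docBase = context.docBase; }
            @Override public void collect(int doc) { bits.set(docBase + doc); }
            @Override public ScoreMode scoreMode() { return ScoreMode.COMPLETE_NO_SCORES; }
        });
        return bits;
    }

    // facetBits: one cached bit set per facet value (built once, e.g. from a TermQuery per category).
    // The count for each facet is the number of bits it shares with the user's query.
    static void printFacetCounts(IndexSearcher searcher, Query userQuery,
                                 Map<String, FixedBitSet> facetBits) throws IOException {
        FixedBitSet queryBits = collectBits(searcher, userQuery);
        for (Map.Entry<String, FixedBitSet> entry : facetBits.entrySet()) {
            long count = FixedBitSet.intersectionCount(queryBits, entry.getValue());
            System.out.println(entry.getKey() + ": " + count);
        }
    }
}
```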
I'm thinking that since mine is hierarchical to begin with I should be able to optimize this pretty well, but I'm afraid I might be grossly underestimating the impact of this design on search performance. If Solr is no quicker, I'm not going to gain anything by using it.
I'd recommend creating a prototype project modeling your faceting needs with Solr and benchmark it against Lucene.net.
Even though faceting in Solr is very optimized (and gets new optimizations all the time, like the parallel per-segment faceting method), when using Solr there is some overhead, for example network roundtrips and response parsing.
If your code already implements Lucene.NET, performs adequately and you don't need any of Solr's additional features, then there is no need to switch to Solr. But also consider that if you choose Solr you will get faceting performance boosts for free with each new version.

Is there a set of best practices for building a Lucene index from a relational DB?

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
Another point to consider is whether you want one index that covers all of your tables, or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
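As a sketch of that single-index approach (plain Java Lucene, with invented table and field names): each row becomes a document, and a stored "table" field records where it came from so hits can be filtered or resolved back to the right table.

```java
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class DbIndexer {

    public static void index(String jdbcUrl) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             IndexWriter writer = new IndexWriter(
                     FSDirectory.open(Paths.get("lucene-index")),
                     new IndexWriterConfig(new StandardAnalyzer()))) {

            for (String table : List.of("products", "articles")) { // table names are illustrative
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT id, title, body FROM " + table)) {
                    while (rs.next()) {
                        Document doc = new Document();
                        // Record which table this document came from, so one index covers them all.
                        doc.add(new StringField("table", table, Field.Store.YES));
                        doc.add(new StringField("id", rs.getString("id"), Field.Store.YES));
                        doc.add(new TextField("title", rs.getString("title"), Field.Store.YES));
                        doc.add(new TextField("body", rs.getString("body"), Field.Store.NO));
                        writer.addDocument(doc);
                    }
                }
            }
            writer.commit();
        }
    }
}
```

At query time you can add a term query on the "table" field to restrict results to one source, or read the stored "table" and "id" fields to resolve a hit back to its original row.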
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach: I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well but there'll be some DRY-violations with the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. With Solr 1.3, they've included the DataImportHandler, which allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
As an introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.