JSR 283 Query and Searching - repository

I am in the early stages of building a tagged document repository using Apache Jackrabbit.
My initial exploration was with JSR 170. Upon learning that JSR 283 deprecates features such as XPath queries, I began looking more closely at it. My (still quite ignorant) impression is that JSR 283 moves to a more SQL-centric query API, although the underlying node structure remains the same.
For my application, I would like to have scored tag searches and be able to search tags using regular expressions (although scoring becomes more complex in that case).
My questions are these:
Should I pursue using non-deprecated API features in JSR 283 for this project?
If so, where (or how) would I find information on the JSR 283 API for such an application?
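For concreteness, this is the kind of query I imagine, as a sketch based on my reading so far; the [nt:unstructured] node type and the "tags" property are just placeholders from my planned schema, and I'm assuming an embedded Jackrabbit 2.x repository:

```java
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;
import javax.jcr.query.Row;
import javax.jcr.query.RowIterator;
import org.apache.jackrabbit.core.TransientRepository;

public class TagSearch {
    public static void main(String[] args) throws Exception {
        // Embedded repository for experimenting; credentials are Jackrabbit's conventional defaults.
        Repository repository = new TransientRepository();
        Session session = repository.login(new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            QueryManager qm = session.getWorkspace().getQueryManager();
            // JCR-SQL2 (the non-deprecated JSR 283 language): scored full-text
            // search over a hypothetical "tags" property.
            Query query = qm.createQuery(
                "SELECT * FROM [nt:unstructured] AS doc "
                    + "WHERE CONTAINS(doc.[tags], 'lucene') "
                    + "ORDER BY SCORE(doc) DESC",
                Query.JCR_SQL2);
            QueryResult result = query.execute();
            for (RowIterator rows = result.getRows(); rows.hasNext(); ) {
                Row row = rows.nextRow();
                System.out.println(row.getPath() + "  score=" + row.getScore());
            }
            // For pattern matching, LIKE supports % and _ wildcards, though not
            // full regular expressions:
            //   SELECT * FROM [nt:unstructured] AS doc WHERE doc.[tags] LIKE 'luc%'
        } finally {
            session.logout();
        }
    }
}
```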

Related

Vaadin 14 Flow and SEO, PWA

I'm about to start working on a brand-new web application with Vaadin 14 Flow (pure Java). So far I have been unable to find clear information on whether my Vaadin 14 Flow (pure Java) web application will be SEO (Search Engine Optimization) friendly out of the box, and if not, what additional steps I should take to achieve this. Also, is it worth adding PWA support to the application, and how complex is that with Vaadin 14 Flow (pure Java)?
This question is on the fringe of StackOverflow policy, as it is potentially quite broad, so I am answering briefly at a top level. It is seldom asked in the context of Vaadin applications, as they tend to be implemented mostly for company-internal use, where SEO is not a requirement. Vaadin's stance regarding SEO is neutral, so this is mostly a question of application implementation.
What you need to know is that Vaadin's components are implemented as web components, so their internals are encapsulated by shadow DOM. This means that if you have, say, a ComboBox with a label set, that label is not necessarily exposed to search crawlers; in most cases that does not even matter. Vaadin's layout components, however, place their children in the light DOM, so if you use the native HTML component representations such as Span, Div, H1, H2, etc. for the text you want exposed to SEO, you will be fine: that text content will be visible to indexing crawlers. The rest is just proper SEO copywriting, which is naturally out of scope for Vaadin. You may also be interested in GoogleAnalyticsTracker.
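As a minimal sketch (assuming a standard Vaadin 14 starter project; the view name and texts are made up), text placed in the light DOM via the HTML wrapper components stays crawlable, and basic PWA support is a single @PWA annotation on the root view or layout:

```java
import com.vaadin.flow.component.html.H1;
import com.vaadin.flow.component.html.Span;
import com.vaadin.flow.component.orderedlayout.VerticalLayout;
import com.vaadin.flow.router.Route;
import com.vaadin.flow.server.PWA;

@Route("")
@PWA(name = "Example Application", shortName = "Example") // generates manifest and service worker
public class MainView extends VerticalLayout {

    public MainView() {
        // These render as plain <h1>/<span> elements in the light DOM,
        // so search crawlers can index their text.
        add(new H1("Welcome to the Example Application"));
        add(new Span("This descriptive text is visible to search engines."));
        // A ComboBox added here would keep its label inside its shadow DOM.
    }
}
```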
A web app with Vaadin 14 is a good choice for companies that need quick, agile, low-memory solutions.
Where will your code be running? On a central server with a Java enterprise architecture. The main benefit of this setup is scalability: a single instance of the app can meet most customers' needs, with no additional cost beyond storage and bandwidth, and resources (RAM, CPU) can be added to the system as demand grows.
Another benefit is portability: an organization can take its application, data, and settings wherever it wants, without proprietary reliance on software licenses or third-party services.

Pure HTML vs frameworks to define HATEOAS API?

When should one develop a HATEOAS RESTful server API instead of using HTML (resource links, forms, etc.)?
Isn't HTML and a browser good enough as hypermedia engine?
HTML + HTTP + URI + Browser === The world wide web. So it's pretty good, no joke.
It's not without fault.
HTML's understanding of links is disappointingly limited. No support for idempotent writes (forms can only submit via GET and POST, so no PUT or DELETE). URI Template support for GET only. I'm not super keen on how many different spellings there are for "link".
It's kind of verbose for a hypermedia format; don't get me wrong, the built-in text markup is brilliant when you are trying to document what is going on for a human being. But my impression thus far is that the same structure starts to get in the way when, as a human being, you want to quickly review the semantic content that your automated agent is consuming.
I call your attention to this quote from RFC 4287:
The primary use case that Atom addresses is the syndication of Web content such as weblogs and news headlines to Web sites as well as directly to user agents.
So a bunch of really smart guys, specifically trying to address use cases directly related to the web, decided to invest a bunch of effort into standardizing a new hypermedia format rather than using the one that was already ubiquitous in their problem domain.
And over the past 10+ years, that format has been widely adopted.
Without adoption, I'm not sure that HATEOAS has much benefit. You don't need a hypermedia API if you control both sides of the conversation (for example, JavaScript on the web: hypermedia with code-on-demand capability, downloading a client that has learned the protocol of a web API via some out-of-band channel).
Evidence would seem to suggest that HTML is not nearly as convenient a format as, for example, any of the JSON-based hypermedia formats.
In conclusion: no, it's not good enough. It might be an acceptable placeholder for the moment, but the JSON hypermedia toolsets will soon be sufficiently mature that HTML will be seen as a giant step in the wrong direction.

Enterprise search platform vs General purpose search

I have a question about Solr. It is described as an enterprise search platform. Are there enterprise-oriented search platforms and general-purpose search platforms? Can't you just use Solr, for example, to build a general-purpose search engine? If there is such a distinction, what are the major differences between them?
Enterprise is a vague term tacked on to things to say "Yes, you can totally use this in professional projects, it's super good". It's baloney, in short. When reading the front page of a software product (or any product really), I find it useful to ignore all adjectives and adverbs, which makes that first sentence on the Solr page read: "Solr is the search platform from the Apache Lucene project."
Don't know why I don't get hired to write ad copy.
I think it would be fair to say that Solr is a general-purpose search server, sure (depending on what "general purpose" entails for you, of course). It indexes data, allows you to search it, and provides a lot of tools to do that in the way that best suits your data and users.
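To make that concrete, here is a minimal, hedged sketch using the SolrJ client that ships with Solr; the core name, URL, and fields are made up for illustration:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrBasics {
    public static void main(String[] args) throws Exception {
        // "documents" is a hypothetical core/collection.
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build()) {

            // Index a document.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Getting started with Solr");
            solr.add(doc);
            solr.commit();

            // Search it back.
            QueryResponse response = solr.query(new SolrQuery("title:solr"));
            for (SolrDocument d : response.getResults()) {
                System.out.println(d.getFieldValue("title"));
            }
        }
    }
}
```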
The term "search" is overloaded with semantics: it is often used to denote an action, a function, or a technology. More important with respect to the question, though, is the fact that there are two common kinds of "search projects": Web Search and Enterprise Search projects.
Web Search is typically about indexing content from one kind of content source (web servers) serving content in HTML format. Most often it involves only public content, and document-level security is not an issue. A typical example of this kind of solution is Google's Web Search, but most full-text site-search solutions can also be seen as good examples of this category. For a basic solution, a crawler, an HTML markup removal tool, an indexing library, and some "glue" are sufficient. Apache Nutch, or Apache Solr or Elasticsearch in combination with a web crawler, are good candidates for implementing this kind of solution.
Enterprise Search is typically about integrating content in various formats from multiple content sources. Typical examples of this kind of solution are corporate intranets, but Search Based Applications often fall into this category as well. Those solutions usually come with additional requirements such as support for document-level security, advanced linguistics, metadata extraction, data mappings and enrichments, synonyms, etc. The projects are more complex, and a more complex technology stack is needed. While Apache Solr or Elasticsearch can both be used, a lot of the required functionality is not part of the standard download and needs to be developed or integrated as part of the project. For both Apache Solr and Elasticsearch, however, there are commercial distributions available that already extend the standard download in the direction of Enterprise Search. Commercial search engines are another good alternative.
I agree with @femtoRgon that Solr is a good general-purpose search platform and not an Enterprise Search platform, but that an Enterprise Search platform can be built with Solr.
Solr is a search platform that can be customized either for general-purpose search or for Enterprise Search solutions. As Daniel suggested above, an Enterprise Search application is used specifically by an enterprise/organization to search the organization's internal data, and in some cases external content as well, but only content related to the organization. Enterprises generally use various systems, either internally developed or supplied by a vendor, and the Enterprise Search application should be able to connect to those internal systems and index their content, including the different file types, metadata, and, importantly, the security associated with each and every document.
To conclude, Solr is a search system that can be used to index and search content either as a general-purpose search engine or as an Enterprise Search application for an organization.

What is the status of HTML5 Database?

This spec http://www.w3.org/TR/webdatabase/ says:
This document was on the W3C Recommendation track but specification work has stopped. The specification reached an impasse: all interested implementors have used the same SQL backend (Sqlite), but we need multiple independent implementations to proceed along a standardisation path.
Does this mean that HTML5 database is going away, and for some time we will have a de-facto standard using SQLite, possibly with browser differences? Or has the W3C published a plan of attack for finishing the standard?
According to this article:
[...] we think it is worth explaining our design choices, and why we think IndexedDB is a better solution for the web than Web SQL Database.
In another article, we compare IndexedDB with Web SQL Database, and note that the former provides much syntactic simplicity over the latter. IndexedDB leaves room for a third-party JavaScript library to straddle the underlying primitives with a BTree API, and we look forward to seeing initiatives like BrowserCouch built on top of IndexedDB. Intrepid web developers can even build a SQL API on top of IndexedDB. We’d particularly welcome an implementation of the Web SQL Database API on top of IndexedDB, since we think that this is technically feasible. Starting with a SQL-based API for use with browser primitives wasn’t the right first step, but certainly there’s room for SQL-based APIs on top of IndexedDB.
I'm not personally swayed by the arguments put forth in the article, but it seems clear that (for the time being) Mozilla has decided that Web SQL Database is dead.
Further interesting comments about this article may be found on Hacker News.
My understanding is that this work has effectively been superseded by "IndexedDB":
http://www.w3.org/TR/IndexedDB/
Apparently the Firefox team has started implementing this:
http://hacks.mozilla.org/2011/01/indexeddb-in-firefox-4/
I don't know if anyone knows the answer. Mozilla doesn't like the dependence on SQLite and has decided to go a different way. However, all WebKit-based browsers already have it implemented, and I don't see them removing it, as any websites built to take advantage of the spec would break.
This means that, at least in certain contexts (mostly the mobile sphere, where most browsers are WebKit-based), it can still make sense to use the HTML5 Web SQL spec. I see this as especially true for developers looking to create mobile applications using a framework like PhoneGap.
There are times when, as an application developer, you want to give users access to data even if they aren't connected to the internet, or if the connection is slow, and some types of data are simply stored more efficiently in a database than in a cookie or a JSON cache. For example, if your data has relationships, it is much easier and quicker to run a join query to pull what you need than to search through a JSON map.
I don't think the spec is dead, and I actually hope that Mozilla will reverse their stance so that developers can use it to solve problems outside of the mobile webkit world.

Where do I begin learning Lucene.NET, Solr, Hadoop and MapReduce?

I'm a .NET developer and I need to learn Lucene so we can run a very large-scale search service that removes entries the end user doesn't have access to (i.e. a user can search for all documents with clearance level 3 or higher, but not clearance level 2 or 1).
Where do I start learning, which products should I consider? To be honest, I'm a little overwhelmed, but I'm determined to figure it all out... eventually.
If you want a book that covers all the basics of Lucene, consider "Lucene in Action". Even though the code samples are in Java, you can easily port them to .NET. Of course, there are also tonnes of resources on the web, such as SO and the Lucene mailing lists, which should help you along.
For the project you describe, you should look at Solr, since it abstracts away many of the scalability issues and, via SolrNet, integrates easily into your .NET app. To restrict access by level, your indexed documents should contain a field called "Level" (say), and behind the scenes you append a clause on that field to the user's query, using a boolean query construct, so that only documents at the user's clearance level or above are returned.
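A hedged sketch of that query-side trimming with the Java SolrJ client (the SolrNet equivalent is analogous); the core URL and the "level" field are made up for illustration:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SecureSearch {
    // Runs the user's query, silently restricted to documents at or above
    // the user's clearance level ("level" is a hypothetical integer field).
    static QueryResponse search(SolrClient solr, String userQuery, int clearance)
            throws Exception {
        SolrQuery query = new SolrQuery(userQuery);
        // A filter query is ANDed onto the user's query and cached by Solr.
        query.addFilterQuery("level:[" + clearance + " TO *]");
        return solr.query(query);
    }

    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build()) {
            // A clearance-3 user sees documents with level 3 or higher only.
            long hits = search(solr, "contract renewal", 3).getResults().getNumFound();
            System.out.println(hits + " visible documents");
        }
    }
}
```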
At this stage, my recommendation would be to stay away from Hadoop (the Apache MapReduce implementation) for your project and stick with Solr. If you are keen to learn about it anyway, it too has a very useful book: you guessed it, "Hadoop in Action" (also from Manning Publications).
You seem to be confused about what exactly each project (Lucene/Solr/Hadoop/etc.) does, so the first thing to do is understand the purpose of each. Read the docs and blogs about them. If possible, buy and read books about them.
For example, MapReduce and Hadoop have nothing to do with your security requirements. Hadoop is a platform for distributed, scalable computing. But Solr is scalable on its own. You might want to use Hadoop to distribute a crawler though (e.g. Nutch).