How to store custom token attribute in Lucene Index - lucene

I want to create a Lucene analyzer for RDF nodes. RDF nodes can have multiple types (URI, bnode, plain literal, plain literal with language, typed literal with datatype). While analyzing a term, I want to create an RDFNodeTypeAttribute, a LanguageAttribute and a DatatypeAttribute to store, respectively, the type of the RDF node, the language of the literal and the datatype of the literal. My question is how these attributes can be stored in the Lucene index. Do I have to write a custom Codec? Do I have to use the PayloadAttribute? How can I leverage these attributes once stored in the index for my search?
Thank you for your help

I could not exactly work out your requirements, but you would use Codecs if you are not happy with the way a Lucene index is encoded and decoded. Codecs give you the flexibility to have your own PostingsFormat, SegmentInfosFormat, LiveDocsFormat, etc. So let us say you want a different postings format from the default Lucene codec, which roughly says: for every term, store all docIds it occurs in, how many times it occurs in each doc, at which positions, and so on, in a particular layout. If you want this information to be stored in a different format, you would need a codec.
I do not think you need to write any Codec or PostingsFormat for this. Writing your own Analyzer and Similarity classes should be sufficient. If you give more information about your problem, I can think further.
Payloads are at the term level, and the typical use case is to store metadata for every term. A use case like "this term is written in bold" or "this term is a noun" is metadata for the term and should be stored in a payload. You typically use payloads for scoring: they matter when giving a term some extra weight.
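If you do go the payload route, here is a minimal sketch (for illustration only; the class name RdfNodeTypePayloadFilter and the one-byte node-type encoding are my own assumptions, not an established pattern) of how a custom TokenFilter could attach a payload to every token:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical filter that tags every token with a one-byte "RDF node type" payload.
public final class RdfNodeTypePayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final byte nodeType;

  public RdfNodeTypePayloadFilter(TokenStream input, byte nodeType) {
    super(input);
    this.nodeType = nodeType;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // the payload is stored with each posting and can be read back at search time
    payloadAtt.setPayload(new BytesRef(new byte[] { nodeType }));
    return true;
  }
}

At search time the payload can be read back from the postings and, if you extend TFIDFSimilarity, factored into scoring via scorePayload.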
Though RDF is metadata for a web resource, you are probably talking about indexing the RDF itself. Even if it is part of the web document you are indexing, putting the RDF info on every term in the document will not be a viable approach, as there are better ways to allocate weights to a document than that.

Related

Getting the document length during query evaluation (Apache Lucene 5.3)

I am trying to change the scoring in Apache Lucene 5.3, and for my formula I need the document length (the number of tokens in the document). I understood from answers to similar questions that there is no easy way to do it, because Lucene doesn't keep it in the index. So I thought that maybe while indexing I would build a map from docID to document length and then use it during query evaluation. But I have no idea where I should put this map and where I would update it.
You are exactly right, storing this when the document is indexed is the best approach. The place to store it is in the norm (not to be confused with the queryNorm, that's something different). Norms provide a single value stored with the field, which is made available at query time for scoring.
In your Similarity implementation, this should go into the computeNorm method, which exposes the information you need through the FieldInvertState, particularly FieldInvertState.getLength(). Norms are made available at search time through LeafReader.getNormValues.
If you are extending TFIDFSimilarity instead, you just need to implement the encodeNormValue, decodeNormValue and lengthNorm methods.
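As a rough sketch only (written against the Lucene 5.3 API and extending DefaultSimilarity rather than TFIDFSimilarity directly; DocLengthSimilarity is an illustrative name), storing the length in the norm and reading it back could look like this:

import java.io.IOException;
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Stores the field's token count in the norm instead of the usual 1/sqrt(length) value.
// DefaultSimilarity compresses norms to a single byte, so the value read back is
// approximate; for exact counts you would need your own norm encoding.
public class DocLengthSimilarity extends DefaultSimilarity {

  @Override
  public float lengthNorm(FieldInvertState state) {
    // getLength() is the number of tokens indexed for this field in this document
    return state.getLength();
  }

  // Query-time helper: recover the (approximate) length stored in the norm.
  public float documentLength(LeafReader reader, String field, int docID) throws IOException {
    NumericDocValues norms = reader.getNormValues(field);
    return norms == null ? 0f : decodeNormValue(norms.get(docID));
  }
}

Keep in mind that overriding lengthNorm this way also changes how the stock TF-IDF scoring uses the norm, which is presumably acceptable if you are replacing the scoring formula anyway.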

REST API filter operator best practice

I am building a REST API that uses a filter parameter to control search results. E.g., one could search for a user by calling:
GET /users/?filter=name%3Dfoo
Now, my API should allow many different filter operators: numeric operators such as equals, greater than, or less than; string operators like contains, begins with, or ends with; and date operators such as year of or timediff. Moreover, AND and OR combinations should be possible.
Basically, I want to support a subset of the underlying MySQL database operators.
I found a lot of different implementations (two good examples are Google Analytics and LongJump) that seem to use custom syntax.
Looking at my requirements, I would probably design a custom syntax pretty similar to the MySQL operator syntax.
However, I was wondering if there are any best practices established that I should follow and whether I should consider anything else. Thanks!
Use an already existing query language, don't try to reinvent the wheel! With REST this is a complicated and not fully solved issue. There are some REST constraints your application must fulfill:
uniform interface / hypermedia as the engine of application state:
You have to send hypermedia responses to your clients, and they have to follow the hyperlinks given in those responses, instead of building the requests on their own. So you can decouple the clients from the structure of the URI.
uniform interface / self-descriptive messages:
You have to send messages annotated with semantics, so you can decouple the clients from the data structure. The best way to do this is RDF with, for example, linked open data vocabularies. If you don't want to use RDF, the second best option is a vendor-specific MIME type; your messages will then be self-descriptive, but clients will need to know how to parse your custom MIME type.
To describe simple search links, you can use URI templates; for example GET /users/{?name} expects a name parameter in the query string. You can use hydra:IriTemplateMapping from the Hydra vocab to add semantics to parameters like name.
Describing ad-hoc queries is a hard task. You have to describe somehow what your query can contain.
You can choose a URI query language and stick with URI templates and probably Hydra annotations. There are many existing URI query languages, like HTSQL, the OData query syntax (people don't like that one), etc.
You can choose an existing query language and send it in a single URI parameter. This can be anything you want, for example SQL, SPARQL, etc. You have to teach your client to generate that parameter. You can create your own vocab to describe the constraints of the actual query. If you don't need complicated things, this should not be a problem. I don't know of any existing vocabs for describing query structures, but I never looked for them...
You can choose an existing query language and send it in the body of a SEARCH request. AFAIK SEARCH is not cached or widely supported by recent HTTP clients; it was defined by WebDAV. You can describe your query with the proper MIME type, and you can use the same vocab as in the previous solution.
You can use an RDF query solution, for example a SPARQL endpoint or triple pattern fragments, etc., so your queries will carry semantic metadata rather than relying on link descriptions. With SPARQL you don't necessarily need a triple store; you can translate the queries on the server side to SQL or whatever you use. You can probably use SPIN to describe query constraints and query templates, but that is new to me too. There might be other solutions for describing SPARQL query structures...
So, to summarize: if you want a real REST solution, you have to describe to your clients how they can construct the queries and which parameters and logical operators they can use. Without query descriptions they won't be able to generate, for example, an HTML form for the user. If you don't want a REST solution, then pick a query language, write a builder on the client, write a parser on the server, and that's all.
The Open Data Protocol (OData)
You can check BreezeJS too and see how this protocol is implemented for Node.js + MongoDB with the breeze-mongodb module, and for a .NET project using Web API and Entity Framework with the Breeze.ContextProvider DLL.
By embracing a set of common, accepted delimiters, equality comparison can be implemented in a straightforward fashion. Setting the value of the filter query-string parameter to a string using those delimiters creates a list of name/value pairs which can be parsed easily on the server side and used to enhance database queries as needed. You can use the delimiters of your choice, say a pipe ("|") to separate individual filter phrases for OR, an ampersand ("&") to separate individual filter phrases for AND, and a double colon ("::") to separate names from values.
This provides a unique-enough set of delimiters to support the majority of use cases and creates a user-readable query-string parameter. A simple example will serve to clarify the technique. Suppose we want to request users with the name "Todd" who live in "Denver" and have the title of "Grand Poobah". The request URI, complete with query string, might look like this:
GET http://www.example.com/users?filter="name::todd&city::denver&title::grand poobah"
The double-colon ("::") delimiter separates the property name from the comparison value, enabling the comparison value to contain spaces, which makes it easier to parse the delimiter from the value on the server.
Note that the property names in the name/value pairs match the names of the properties that would be returned by the service in the payload.
Case sensitivity is certainly up for debate on a case-by-case basis, but in general filtering works best when case is ignored. You can also offer wildcards as needed, using the asterisk ("*") as the value portion of the name/value pair.
For queries that require more than simple equality or wildcard comparisons, introducing operators is necessary. In this case, the operators themselves should be part of the value and parsed on the server side, rather than part of the property name. When complex query-language-style functionality is needed, consider introducing query concepts from the Open Data Protocol (OData) Filter System Query Option specification (http://www.odata.org/documentation/odata-version-4-0/).
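To make the server-side parsing concrete, here is a minimal sketch (the FilterParser class is hypothetical and only handles the AND/"::" case; OR handling via "|" would layer on top of it):

import java.util.LinkedHashMap;
import java.util.Map;

// Parses a filter string such as "name::todd&city::denver&title::grand poobah"
// into ordered name/value pairs.
public class FilterParser {

  public static Map<String, String> parse(String filter) {
    Map<String, String> clauses = new LinkedHashMap<>();
    for (String phrase : filter.split("&")) {
      String[] parts = phrase.split("::", 2); // limit 2 so the value may itself contain spaces or colons
      if (parts.length == 2) {
        clauses.put(parts[0].trim(), parts[1].trim());
      }
    }
    return clauses;
  }

  public static void main(String[] args) {
    System.out.println(parse("name::todd&city::denver&title::grand poobah"));
    // -> {name=todd, city=denver, title=grand poobah}
  }
}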
There seem to be a lot of standards (like OData), but many are quite complicated in that they introduce new syntax.
For simple multi-filtering, the following format avoids polluting the parameter namespace while still standing on top of existing web technology:
GET /users?filter[name]=John&filter[title]=Manager
It's easily readable, and on the backend, languages like PHP will receive it as an array of filters to apply.
A possible standard would be SCIM, which is adopted by some commercial products, but it is not distinguished by brevity. For a pet project I used this:
= equal
! not equal
* like
< smaller
> greater
& bitwise and 
| bitwise or
^ bitwise xor
~ in comma separated value list
Examples
So GET /user?name=*An* would get all users whose name contains "An", and GET /user?name=~Anna,Bertha would get those two users.
Not yet a standard but who knows...

Lucene: Getting terms from an unstored field

Is there any way to retrieve all the terms in a particular field which is, unfortunately, not stored? I cannot rebuild the index. Position-based information is not necessary; I just need the list of terms.
UPDATE
I've built a sample index with one stored field and another unstored field and tested it with Luke. I was wondering whether I could get access to all the terms just like Luke does. This may not be the brightest idea, but it might work.
Lucene uses two different concepts: indexing and storing. If you want to extract the terms, you don't need to store anything. You can use Luke, as well as iterate over the terms through the API. For the Java API, see [1]: How can I get the list of unique terms from a specific field in Lucene?
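For reference, a minimal sketch of iterating over the terms of a field through the API (written against the Lucene 5.x API; the index path and field name are placeholders):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

// Prints every term indexed in a field, whether or not the field was stored.
public class ListTerms {
  public static void main(String[] args) throws Exception {
    try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
      Terms terms = MultiFields.getTerms(reader, "myField");
      if (terms != null) {
        TermsEnum it = terms.iterator();
        BytesRef term;
        while ((term = it.next()) != null) {
          System.out.println(term.utf8ToString());
        }
      }
    }
  }
}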
Luke is Open Source, so just look at how Luke does it.

What is the easiest way to implement terms association mining in Solr?

Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a co-occurrence matrix of docs × terms and find the terms that occur in the same documents most often. In my previous projects I implemented it directly in Lucene by iterating over TermDocs (obtained by calling IndexReader.termDocs(Term)). But I can't see anything similar in Solr.
So, my needs are:
To retrieve the most associated terms within a particular field.
To retrieve the term that is closest to a specified one within a particular field.
I will rate answers in the following way:
Ideally I would like to find a Solr component that directly covers the specified needs, that is, something to get associated terms directly.
If this is not possible, I'm looking for a way to get co-occurrence matrix information for a specified field.
If this is not an option either, I would like to know the most straightforward way to 1) get all terms and 2) get the ids (numbers) of the documents these terms occur in.
You can export a Lucene (or Solr) index to Mahout, and then use Latent Dirichlet Allocation. If LDA is not close enough to LSA for your needs, you can just take the correlation matrix from Mahout, and then use Mahout to take the singular value decomposition.
I don't know of any LSA components for Solr.
Since there are still no answers to my question, I have to write up my own thoughts and accept them. Nevertheless, if someone proposes a better solution, I'll happily accept it instead of mine.
I'll go with the co-occurrence matrix, since it is the central part of association mining. In general, Solr provides all the functions needed to build this matrix in some way, though they are not as efficient as direct access with Lucene. To construct the matrix we need:
All terms, or at least the most frequent ones, because rare terms won't affect the result of association mining by their nature.
Documents where these terms occur; again, at least the top documents.
Both these tasks may be easily done with standard Solr components.
To retrieve terms, the TermsComponent or faceted search may be used. We can get only the top terms (by default) or all terms (by setting the maximum number of terms to retrieve; see the documentation of the particular feature for details).
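For example, assuming the stock /terms request handler is enabled in solrconfig.xml, a request along these lines returns the indexed terms of a field (terms.limit=-1 asks for all of them):
http://ip:port/solr/someinstance/terms?terms=true&terms.fl=field&terms.limit=-1&wt=json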
Getting the documents in which a term occurs is simply a search for that term. The weak point here is that we need one request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides information about the number of occurrences of the term in each found document.
Having this, it is easy to build the co-occurrence matrix. To mine the associations it is possible to use other software like Weka, or to write your own implementation of, say, the Apriori algorithm.
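A minimal sketch of the matrix-building step, assuming you have already collected a document-id set per term by the means described above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Builds term co-occurrence counts from term -> document-id sets,
// however those sets were obtained (e.g. TermsComponent plus one query per term).
public class CooccurrenceMatrix {

  public static Map<String, Map<String, Integer>> build(Map<String, Set<Integer>> termDocs) {
    Map<String, Map<String, Integer>> matrix = new HashMap<>();
    List<String> terms = new ArrayList<>(termDocs.keySet());
    for (int i = 0; i < terms.size(); i++) {
      for (int j = i + 1; j < terms.size(); j++) {
        String a = terms.get(i);
        String b = terms.get(j);
        // count documents shared by both terms
        Set<Integer> shared = new HashSet<>(termDocs.get(a));
        shared.retainAll(termDocs.get(b));
        if (!shared.isEmpty()) {
          matrix.computeIfAbsent(a, k -> new HashMap<>()).put(b, shared.size());
          matrix.computeIfAbsent(b, k -> new HashMap<>()).put(a, shared.size());
        }
      }
    }
    return matrix;
  }
}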
You can get the count of occurrences of the current term in a found document with the following query:
http://ip:port/solr/someinstance/select?defType=func&fl=termfreq(field,xxx),*&fq={!frange l=1}termfreq(field,xxx)&indent=on&q=termfreq(field,xxx)&sort=termfreq(field,xxx) desc&wt=json

Searching Natural Language Sentence Structure

What's the best way to store and search a database of natural language sentence structure trees?
Using OpenNLP's English Treebank Parser, I can get fairly reliable sentence structure parses for arbitrary sentences. What I'd like to do is create a tool that can extract all the doc strings from my source code, generate these trees for all sentences in the doc strings, store these trees and their associated function names in a database, and then allow a user to search the database using natural language queries.
So, given the sentence "This uploads files to a remote machine." for the function upload_files(), I'd have the tree:
(TOP
(S
(NP (DT This))
(VP
(VBZ uploads)
(NP (NNS files))
(PP (TO to) (NP (DT a) (JJ remote) (NN machine))))
(. .)))
If someone entered the query "How can I upload files?", equating to the tree:
(TOP
(SBARQ
(WHADVP (WRB How))
(SQ (MD can) (NP (PRP I)) (VP (VB upload) (NP (NNS files))))
(. ?)))
how would I store and query these trees in a SQL database?
I've written a simple proof-of-concept script that can perform this search using a mix of regular expressions and network graph parsing, but I'm not sure how I'd implement this in a scalable way.
And yes, I realize my example would be trivial to retrieve using a simple keyword search. The idea I'm trying to test is how I might take advantage of grammatical structure, so I can weed out entries with similar keywords but a different sentence structure. For example, with the above query, I wouldn't want to retrieve the entry associated with the sentence "Checks a remote machine to find a user that uploads files.", which has similar keywords but obviously describes a completely different behavior.
Relational databases cannot store knowledge in a natural way; what you actually need is a knowledge base or ontology (though it may be built on top of a relational database). It holds data as triples <subject, predicate, object>, so your phrase will be stored as <upload_file(), upload, file>. There are a lot of tools and methods to search inside such KBs (for example, Prolog is a language that was designed to do it). So all you have to do is translate sentences from natural language into KB triples/an ontology graph, translate the user query into incomplete triples (your question will look like <?, upload, file>) or conjunctive queries, and then search your KB. OpenNLP will help you with the translation, and the rest depends on the concrete techniques and technologies you decide to use.
I agree with ffriend that you need to take a different approach that builds on existing work on knowledge bases and natural language search. Storing context-free parse trees in a relational database isn't the problem, but doing a meaningful comparison of parse trees as part of a search is going to be very difficult. When you are just interested in taking advantage of a little knowledge about grammatical relations, full parse trees are really too complicated. If you simplify the parse into dependency triples, you can make the search problem much easier and get at the grammatical relations you were interested in in the first place. For instance, you could use the Stanford dependency parser, which generates a context-free parse and then extracts dependency triples from it. It produces output like this for "This function uploads files to a remote machine":
det(function-2, This-1)
nsubj(uploads-3, function-2)
dobj(uploads-3, files-4)
det(machine-8, a-6)
amod(machine-8, remote-7)
prep_to(uploads-3, machine-8)
In your database, you could store a simplified subset of these triples associated with the function, e.g.:
upload_file(): subj(uploads, function)
upload_file(): obj(uploads, file)
upload_file(): prep(uploads, machine)
When people search, you can find the function that has the most overlapping triples, or something along those lines, where you probably also want to weight the different dependency relations or allow partial matches, etc. You probably also want to reduce the words in the triples to lemmas, and maybe to POS tags, depending on what you need.
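Here is a minimal, unweighted sketch of that ranking idea, with the simplified triples stored as plain strings (the function names and triples are illustrative; weighting, partial matches and lemmatization are left out):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Ranks functions by how many simplified dependency triples they share with the query.
public class TripleSearch {

  public static List<String> rank(Map<String, Set<String>> functionTriples, Set<String> queryTriples) {
    List<String> functions = new ArrayList<>(functionTriples.keySet());
    // sort by descending number of triples shared with the query
    functions.sort(Comparator.comparingInt((String f) -> -overlap(functionTriples.get(f), queryTriples)));
    return functions;
  }

  private static int overlap(Set<String> a, Set<String> b) {
    Set<String> shared = new HashSet<>(a);
    shared.retainAll(b);
    return shared.size();
  }

  public static void main(String[] args) {
    Map<String, Set<String>> db = new HashMap<>();
    db.put("upload_files()", new HashSet<>(Arrays.asList(
        "subj(upload, function)", "obj(upload, file)", "prep(upload, machine)")));
    db.put("delete_files()", new HashSet<>(Arrays.asList(
        "subj(delete, function)", "obj(delete, file)", "prep(delete, machine)")));

    // "How can I upload files?" reduced to the same triple form
    Set<String> query = new HashSet<>(Arrays.asList("subj(upload, i)", "obj(upload, file)"));
    System.out.println(rank(db, query)); // [upload_files(), delete_files()]
  }
}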
There are plenty of people who have worked on natural language search (like Powerset), so be sure to search for existing approaches. My proposed approach here is really minimal and I can think of tons of examples where it will have problems, but I think something along these lines could work reasonably well for a restricted domain.
This is not a complete answer, but if you want to perform linguistically sophisticated queries on your trees, the best bet is to pre-process your parser output and search it with tgrep2:
http://www.stanford.edu/dept/linguistics/corpora/cas-tut-tgrep.html
Tgrep/tgrep2 are, as far as I know, the most flexible and full-featured packages for searching parse trees. This is not a MySQL-based solution as you requested, but I thought you might be interested to know about this option.
Tgrep2 allows you to ask questions about parents, descendants and siblings, whereas other solutions would not retain the full tree structure of the parse or allow such sophisticated queries.