Lucene - Custom Analyzer/Parser for JSON objects?

I have a requirement for a very specific Lucene implementation which stores multiple "Properties" fields containing serialized JSON strings.
Example:
Document:
ID: "99"
Text: "Lorepsum Ipsum"
Properties: "{
"lastModified": "1/2/2015",
"user": "johndoe",
"modifiedChars": 2,
"before": "text a",
"after": "text b",
}"
Properties:"{
"lastModified": "1/2/2013",
"user": "johncotton",
"modifiedChars": 6,
"before": "text aa",
"after": "text bbb",
}"
Properties: "{
"lastModified": "1/3/2015",
"user": "johnmajor",
"modifiedChars": 3,
"before": "text aa",
"after": "text b",
}"
I'm aware that ElasticSearch and Solr have implementations for looking up values within JSON objects, but I'm using Lucene's core API (3.0.5).
My goal is to use Lucene's API, with some added implementation, to search within the JSON strings. For example:
Building a type of BooleanQuery where at least one "Properties" field MUST match all the values in the query (e.g. the query +user:tom +modifiedChars:3 +before:"text A", etc.).
I have some ideas but no clue where to begin. What I'm asking for is some high-level ideas on how to achieve such an implementation. A custom analyzer, maybe, to use with a query parser?
Consider it an open ended question. All suggestions are welcome.

If you will always search for the complete set of values...
Create a "property" field for each set. The value would just be the concatenated set of values ie "1/2/2015:johndoe:2:text a:text b".
Alternatively... create a separate doc for each set. This would allow you to search for different combinations of values without conflating the different sets.
Yes that might mean duplicating the Text field. If it's not big then I wouldn't care too much (especially if you're not using a "stored" field).
Do you need to combine text and property in your queries? ("text:ipsum AND property:xxx")
If not then put the text in yet another doc.
If the idea is to search in order to get the "ID" field, then some combination of the above ought to work.
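If you go the separate-document route, a rough, untested sketch against the Lucene 3.x Field/BooleanQuery API might look like the following (the class, helper, and field names are made up for illustration):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class PropertiesSketch {
    // One Lucene document per "Properties" set, duplicating the parent ID
    // so a hit can be mapped back to the original record.
    static Document buildPropertiesDoc(String id, String user, int modifiedChars,
                                       String before, String after) {
        Document doc = new Document();
        doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("user", user, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("modifiedChars", Integer.toString(modifiedChars),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("before", before, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("after", after, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }

    // "+user:... +modifiedChars:..." style query: every clause MUST match
    // within the same Properties document.
    static BooleanQuery buildPropertiesQuery(String user, int modifiedChars) {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("user", user)), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("modifiedChars", Integer.toString(modifiedChars))),
                BooleanClause.Occur.MUST);
        return query;
    }
}
Because every clause is evaluated against a single document, matches are guaranteed to come from one Properties set rather than being spread across several of them.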

Related

How to sort a Redis list of objects using the properties of the object

I have JSON data (see the example below) which I'm storing in a Redis list using 'rpush' with the key 'person'.
Ex data:
[
{ "name": "john", "age": 30, "role": "developer" },
{ "name": "smith", "age": 45, "role": "manager" },
{ "name": "ram", "age": 35, "role": "tester" },
]
Now when I get this data using lrange person 0 -1, it gives me results as '[object object]'.
So, to actually get them with property names I'm storing them by stringifying them and parsing them back to objects to use the object properties.
But the issue with converting to a string is that I'm not able to sort them using any property, say name, age or role.
My question is, how do I store this JSON in a Redis list and sort the objects using any of the properties?
Thanks.
Very recently I posted an answer for a very similar question.
The easiest approach is to use the Redis Search module (which makes the approach portable to many clients / languages):
Store each needed object as a separate key, following a prefixed key pattern (keys named prefix:something) and a standard schema (all such keys are JSON, and all contain the fields you want to sort by).
Make a search index with FT.CREATE, with the ON JSON parameter to search JSON-type keys, and likely a PREFIX parameter to index just the needed keys, plus a SCHEMA entry of the form path AS alias TYPE for each needed search field, where TYPE is the type of field (TEXT, TAG, NUMERIC, etc. -- see the documentation), optionally adding SORTABLE to the fields that need to be sorted.
Use the FT.SEARCH command with any combination of "@field:value" search parameters, and optionally SORTBY (see the example commands below).
Otherwise, it's possible to just get all keys that follow a pattern using the KEYS command and use manual language-specific sorting code. That is of course more involved, depends on the language and available libraries, and is therefore less portable.
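For illustration, putting those steps together (the key names, index name, and field choices here are made up, not something Redis requires), the commands could look roughly like this:
JSON.SET person:1 $ '{"name": "john", "age": 30, "role": "developer"}'
JSON.SET person:2 $ '{"name": "smith", "age": 45, "role": "manager"}'
JSON.SET person:3 $ '{"name": "ram", "age": 35, "role": "tester"}'
FT.CREATE idx:person ON JSON PREFIX 1 person: SCHEMA $.name AS name TEXT SORTABLE $.age AS age NUMERIC SORTABLE $.role AS role TAG SORTABLE
FT.SEARCH idx:person "*" SORTBY age ASC
FT.SEARCH idx:person "@role:{developer}" SORTBY name ASC
The first search returns all persons sorted by age; the second restricts the results to one role and sorts them by name.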

Is there a way to add a default to a json schema array

I just want to understand if there is a way to add a default set of values to an array. (I don't think there is.)
So ideally I would like something like the following to work, i.e. the fileTypes element defaults to an array of ["jpg", "png"]:
"fileTypes": {
"description": "The accepted file types.",
"type": "array",
"minItems": 1,
"items": {
"type": "string",
"enum": ["jpg", "png", "pdf"]
},
"default": ["jpg", "png"]
},
Of course, all that being said... the above actually does seem to validate as JSON Schema; however, in VS Code, for example, this default value does not populate when creating documents the way other defaults (such as for strings) do.
It appears to be valid based on the spec.
9.2. "default"
There are no restrictions placed on the value of this keyword. When multiple occurrences of this keyword are applicable to a single sub-instance, implementations SHOULD remove duplicates.
This keyword can be used to supply a default JSON value associated with a particular schema. It is RECOMMENDED that a default value be valid against the associated schema.
See https://json-schema.org/draft/2020-12/json-schema-validation.html#rfc.section.9.2
It's up to the tooling to take advantage of that keyword in the JSON Schema, and it sounds like VS Code is not doing so.

Cloudant json index vs text index

Hi, I am trying to understand json index vs text index in Cloudant. Now I know that using
{ "index": {}, "type": "text" }
will make the entire document searchable. But what is the difference between, say,
{
  "index": {
    "fields": [
      "title"
    ]
  },
  "type": "json"
}
and
{
  "index": {
    "fields": [
      {
        "name": "title",
        "type": "string"
      }
    ]
  },
  "name": "title-text",
  "type": "text"
}
Thanks.
the json type:
leverages the Map phase of MapReduce
will build and query faster than a text type for a fixed key
no bookmark field
cannot use combination or array logical operators such as $regex as the basis of a query
only equality operators such as $eq, $gt, $gte, $lt, and $lte (but not $ne) can be used as the basis of a query
might end up doing more work in memory for complex queries
sorting fields must be indexed
the text type:
leverages a Lucene search index
permits indexing all fields in documents automatically with a single simple command
provides more flexibility to perform ad hoc queries and sort across multiple keys
permits you to use any operator as a basis for query in a selector
a type (:string, :number) sometimes needs to be appended to the sort field (see the example queries after the quote below)
from: https://docs.cloudant.com/cloudant_query.html
If you know exactly what data you want to look for, or you want to
keep storage and processing requirements to a minimum, you can specify
how the index is created, by making it of type json.
But for maximum possible flexibility when looking for data, you would
typically create an index of type text.
additional information:
https://developer.ibm.com/clouddataservices/docs/cloudant/get-started/use-cloudant-query/
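For illustration only (the document field and values are invented), a query posted to the database's _find endpoint might look roughly like this with the json index (equality operator, sort on the indexed field):
{
  "selector": { "title": { "$eq": "Ulysses" } },
  "sort": [{ "title": "asc" }]
}
and like this with the text index (an operator such as $regex, and the type appended to the sort field):
{
  "selector": { "title": { "$regex": "^Uly" } },
  "sort": [{ "title:string": "asc" }]
}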

Find relation between two entities in Freebase

I am new to Freebase and I have a simple question. I would like to use the Freebase KB to find the relation between two entities. For example, if I have the named entities "Washington" and "United States", I would like to send a query to Freebase and get:
Location/Location/Capital, or null if there is no relation.
Thank you very much.
If you only want to go one ply out (i.e. nearest neighbors), this is pretty simple to do using the reflection API if you're using the online version of Freebase. If you're using the bulk downloads, you'll need to work with whatever query engine you're using (probably SPARQL, unless you converted the RDF to something else).
If you want to find the shortest path(s) regardless of how far apart the entities are, it becomes a graph search algorithm.
EDIT: If you only want to find capitals, you can fill in your IDs in this query:
[{
  "type": "/location/administrative_division_capital_relationship",
  "capital": [{
    "id": null
  }],
  "administrative_division": [{
    "id": null
  }],
  "limit": 1
}]
Note that for Washington, D.C., this will return null because the data isn't in Freebase.
If you need to handle arbitrary properties, you'll need to use reflection. See https://developers.google.com/freebase/mql/ch03#reflection

not_indexed field is stored in index

I'm trying to optimize my elasticsearch scheme.
I have a field which is a URL - I do not want to be able to query or filter on it, just retrieve it.
My understanding is that a field that is defined as "index":"no" is not indexed, but is still stored in the index.
(see slide 5 in http://www.slideshare.net/nitin_stephens/lucene-basics)
This should map to Lucene's UnIndexed, right?
This confuses me: is there a way to store some fields without them taking up more storage than simply their content, and without encumbering the index for the other fields?
What am I missing?
I'm new to posting on stack exchange but believe I can help a bit!
There are a few considerations here:
Analyzing
As you don't want to do extra work you should set "index": "no". This will mean the field will not be run through any tokenizers and filters.
Furthermore it will not be searchable when directing a query at the specific field: (no hits)
"query": {
"term": {
"url": "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"
}
}
*here "url" is the field name.
However the field will still be searchable in the _all field: (might have a hit)
"query": {
"term": {
"_all": "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"
}
}
_all field
By default every field gets put in the _all field. Set "include_in_all": "false" to stop that. This might not be an issue for you, as you are unlikely to search against the _all field with a URL by mistake.
I was working with a schema where countries were stored as 2-letter codes, e.g. "NO" means Norway, and it is possible someone might do a search against the _all field with "NO", so I made sure to set "include_in_all": "false".
Note: Any query where you don't specify a field explicitly will be executed against the _all field.
Storing
By default, elasticsearch will store your entire document (unanalyzed, as you sent it) and this will be returned to you in a hit's _source field. If you turn this off (if your elasticsearch db is getting huge perhaps?) then you need to explicitly set "store": "yes" to store fields individually. (One thing to notice is that store takes yes or no and not true or false - it tripped me up!)
Note: if you do this you will need to request the fields you want returned to you explicitly. e.g.:
curl -XGET http://path/index_name/type_name/id?fields=url,another_field
finally...
I would leave elasticsearch to store your whole document (as the default) and use the following mapping.
"type_name": {
"properties": {
"url": {
"type": "string",
"index": "no",
"include_in_all": "false"
},
// other fields' mappings
}
}
Source: elasticsearch documentation
There are two ways to input data into the index: indexing and storing. Indexing a piece of data means that it is tokenized and placed in the inverted index, and can be searched. Storing data means it is not tokenized or analyzed at all, and is not added to the inverted index. It is stored in an entirely separate area, in its full-text form. It cannot be searched against, but can be retrieved, in its original form, by its document ID.
The typical Lucene query process is to query against the indexed data, get back the document IDs of matching documents, then use those document IDs to retrieve the stored data for those documents and display it to the user.
Data which is indexed but not stored is searchable, but cannot be retrieved in its original form.
Data which is stored but not indexed can be retrieved once you have found a hit, but is not searchable.
Data which is indexed and stored can be searched or retrieved.
Data which is neither cannot be added to the index at all.
This is covered a bit in the Lucene FAQ.
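To make those four combinations concrete in Lucene terms, here is a small sketch against the Lucene 3.x Field API (the field names and values are arbitrary):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class StoreVsIndexSketch {
    static Document example() {
        Document doc = new Document();
        // stored but not indexed: retrievable with a hit, never searchable
        doc.add(new Field("url", "http://example.com/page",
                Field.Store.YES, Field.Index.NO));
        // indexed but not stored: searchable, but not returned in its original form
        doc.add(new Field("body", "some full text",
                Field.Store.NO, Field.Index.ANALYZED));
        // indexed and stored: both searchable and retrievable
        doc.add(new Field("title", "some title",
                Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }
}
In the elasticsearch mapping above, "index": "no" corresponds to the Field.Index.NO case, while retrieval is covered either per field ("store": "yes") or by the whole-document _source.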
You are looking for the 'index' => 'not_analyzed' mapping option.
Also, if you use the _source, you do not have to specify the store => false option.