Problem: indexing a nested property in ArangoDB
Context: ArangoDB, Community Edition, running locally
Description:
I have a collection CONFIGURATIONS_NODES hosting documents with the following structure
{
  "_sectionName": "MySection",
  "_configurationName": "abc",
  "nodeData": {
    "_id": "61cc20793b83b2001c24a9ad",
    "configuration_name": "xyz"
  }
}
If I run the query

for v in CONFIGURATIONS_NODES
  filter v.nodeData._id == "61cc20793b83b2001c24a9ad"
  return v

it finds the document as expected, but alas it's a full collection scan.
If I try to create an index on the nested nodeData._id property, the index creation fails. I was able to create an index on the nested property nodeData.configuration_name, so it seems the issue relates only to the nested _id.
The documentation states:
You cannot use the _id system attribute in user-defined indexes, but indexing _key, _rev, _from, and _to is possible.
The name _id conflicts with the internal _id attribute and is treated specially. Even though it is not mentioned in the documentation, this means that attributes named _id cannot be indexed, not even in nested subobjects.
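If you control how the documents are written, one possible workaround (a sketch, not something the ArangoDB documentation prescribes) is to duplicate the value into an attribute with a non-reserved name, for example nodeData.externalId (the name externalId is purely illustrative), and index that copy instead:

// arangosh sketch: index the duplicated attribute instead of the reserved _id name
db.CONFIGURATIONS_NODES.ensureIndex({
  type: "persistent",
  fields: ["nodeData.externalId"]
});

A filter on v.nodeData.externalId == "61cc20793b83b2001c24a9ad" should then be able to use this index instead of falling back to a full scan.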
Related
I've been doing some testing with Laravel Scout and, according to the documentation (https://laravel.com/docs/8.x/scout#configuring-searchable-data), I've mapped my User model as such:
/**
 * Get the indexable data array for the model.
 *
 * @return array
 */
public function toSearchableArray()
{
    $data = $this->toArray();
    return array_merge($data, [
        'entity' => 'An entity'
    ]);
}
Just for the sake of testing, this is literally what I ended up with while debugging.
After importing the User model with this mapping, I can see on the Meilisearch dashboard that it is indeed showing the user data plus entity = 'An entity'.
However, when applying this:
User::search('something')->where('entity', 'An entity')->get()
It produces the following error:
"message": " --> 1:1\n |\n1 | entity=\"An entity\"\n | ^----^\n |\n = attribute `entity` is not filterable, available filterable attributes are: ",
"exception": "MeiliSearch\\Exceptions\\ApiException",
"file": "/var/www/api/vendor/meilisearch/meilisearch-php/src/Http/Client.php",
Tracing back to view the 'filterable attributes', I've come to the conclusion that:
$client = app(\MeiliSearch\Client::class);
dump($client->index('users')->getFilterableAttributes()); // Returns []
$client->index('users')->updateFilterableAttributes(['entity']);
dump($client->index('users')->getFilterableAttributes()); // Returns ['entity']
Forcing updateFilterableAttributes now allows me to complete the search as intended, but I don't feel this should be the regular behaviour. If it's mapped in the searchableArray, shouldn't it be searchable? What am I not understanding, and what other approaches are there to achieve this goal?
This is actually not an issue but a requirement of Meilisearch in particular. Under the hood, Scout uses different drivers for indexing - "algolia", "meilisearch", "database", "collection" and even "null" - and all of them have different indexing methods, the unification of which would be troublesome and inefficient for Scout, I believe.
So filtering - or faceted search, as Meilisearch refers to it - requires us to establish the filtering criteria first, and that list is empty by default for document (model, in Laravel terms) fields.
Quoting from the docs:
This step is mandatory and cannot be done at search time. Filters need to be properly processed and prepared by Meilisearch before they can be used.

Updating filterableAttributes requires recreating the entire index. This may take a significant amount of time depending on your dataset size.
For more info, please refer to the official Meilisearch docs: https://docs.meilisearch.com/learn/advanced/filtering_and_faceted_search.html
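In practice, that means declaring the filterable attributes once as part of your setup or deployment step rather than at search time. A minimal sketch of how the call from the question could be wrapped in an artisan command (the command class, signature, and index name here are illustrative, not part of Scout itself):

<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use MeiliSearch\Client;

class SyncMeilisearchFilters extends Command
{
    protected $signature = 'meilisearch:sync-filters';

    protected $description = 'Declare the filterable attributes on the users index';

    public function handle(Client $client): int
    {
        // Same call as in the question, centralised so it can be re-run after every import.
        $client->index('users')->updateFilterableAttributes(['entity']);

        $this->info('Filterable attributes updated; Meilisearch reindexes asynchronously.');

        return 0;
    }
}

Run it after (re)importing your models, and the where('entity', ...) clause should then be accepted.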
I'm really new to FaunaDB, and I currently have a collection of Users and an index over that collection (users_waitlist) that returns fewer fields.
When a new User is created, the "waitlist_meta" property is an empty array initially, and when that User gets updated to join the waitlist, a new item is added to the User's waitlist_meta array.
Now, I'm trying to get only the documents that contain an item added to the waitlist_meta array (which, by the way, is a ref to another index (products)). In other words: if the array contains items, then return the document.
How can I achieve this? By running this query:
Paginate(Match(Index('users_waitlist')))
Obviously, I'm still getting all documents, including those with an empty array (waitlist_meta: []).
Thanks in advance
You need to add terms to your index; these are explained briefly in the Fauna documentation on indexes.
The way I find it useful to conceptualise this is that when you add terms to an index, it is partitioned into separate buckets, so that later, when you match that index with a specific term, the results from that particular bucket are returned.
It's a slightly more complicated case here because you need to transform your actual field (the actual value of waitlist_meta) into something else (is waitlist_meta defined or not?) - in Fauna this is called a binding. You need something along the lines of:
CreateIndex({
  "name": "users_by_is_on_waitlist",
  "source": [{
    "collection": Collection("users"),
    "fields": {
      "isOnWaitlist": Query(Lambda("doc", ContainsPath(["data", "waitlist_meta"], Var("doc"))))
    }
  }],
  "terms": [{
    "binding": "isOnWaitlist"
  }]
})
What this binding does is run a Lambda for each document in the collection to compute a property based on the document's fields - in our case isOnWaitlist, which is defined by whether or not the document contains the field waitlist_meta. We then add this binding as a term to the index, meaning we can later query the index with:
Paginate(Match(Index("users_by_is_on_waitlist"), true))
where true here is the single term for our index (it could be an array if our index had multiple terms). This query should now return all the users that have been added to the waitlist!
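One caveat, going by the question: if waitlist_meta always exists and simply starts out as an empty array, the ContainsPath check will evaluate to true for every user. In that case a binding that tests whether the array is non-empty may be closer to what you want - a sketch of just the changed fields block, with everything else as above:

"fields": {
  "isOnWaitlist": Query(Lambda("doc",
    IsNonEmpty(Select(["data", "waitlist_meta"], Var("doc"), []))
  ))
}

Select with a default of [] keeps the binding from erroring on documents that don't have the field at all.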
I'm trying to query RavenDB using the HTTP client for all documents by type.
I would like a collection of the documents with a given type.
I understand that there might be a limitation where only the first 1024 documents are returned. I am well under that number, and besides, it's for a proof of concept.
I am able to obtain all the documents using the following syntax:
http://localhost:8080/databases/{database name}/docs/
I see that I could use the #metadata field to get the documents of the type I want, but I don't know the syntax.
Since the HTTP api allows you to query indexes, I attempted to write a static index.
When I wrote the index from Raven Studio, the index was not returning the documents of the type I wanted. It was giving zero results.
from doc in docs.MyType
select new { doc };
I also tried this:
from doc in docs
let Tag = doc["#metadata"]["Raven-Entity-Name"]
where Tag == "MyType"
select new { doc };
You can do it using:
http://localhost:8080/databases/{database name}/indexes/dynamic/CollectionName
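For example, with curl (substituting your own database name, and using the MyType collection from the question):

curl "http://localhost:8080/databases/{database name}/indexes/dynamic/MyType"

This relies on RavenDB's dynamic indexes, so no static index has to be written just to fetch all documents of one collection.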
I'm trying to optimize my elasticsearch scheme.
I have a field which is a URL - I do not want to be able to query or filter on it, just retrieve it.
My understanding is that a field that is defined as "index":"no" is not indexed, but is still stored in the index.
(see slide 5 in http://www.slideshare.net/nitin_stephens/lucene-basics)
This should correspond to Lucene's UnIndexed field, right?
This confuses me: is there a way to store some fields without them taking up more storage than simply their content, and without encumbering the index for the other fields?
What am I missing?
I'm new to posting on Stack Exchange but believe I can help a bit!
There are a few considerations here:
Analyzing
As you don't want to do extra work, you should set "index": "no". This means the field will not be run through any tokenizers or filters.
Furthermore, it will not be searchable when directing a query at that specific field (no hits):
"query": {
"term": {
"url": "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"
}
}
(Here "url" is the field name.)
However, the field will still be searchable via the _all field (might have a hit):
"query": {
"term": {
"_all": "http://www.domain.com/exact/url/that/was/sent/to/elasticsearch"
}
}
_all field
By default, every field gets put into the _all field. Set "include_in_all": "false" to stop that. This might not be an issue for you, as you are unlikely to search against the _all field with a URL by mistake.
I was working with a schema where countries were stored as 2-letter codes (e.g. "NO" means Norway), and it is possible someone might search the _all field with "NO", so I made sure to set "include_in_all": "false".
Note: Any query where you don't specify a field explicitly will be executed against the _all field.
Storing
By default, elasticsearch will store your entire document (unanalyzed, as you sent it), and this will be returned to you in a hit's _source field. If you turn that off (if your elasticsearch db is getting huge, perhaps?), then you need to explicitly set "store": "yes" to store fields individually. (One thing to notice is that store takes yes or no and not true or false - it tripped me up!)
Note: if you do this you will need to request the fields you want returned to you explicitly. e.g.:
curl -XGET http://path/index_name/type_name/id?fields=url,another_field
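Put together, a disabled _source with an individually stored field would look something like this in the old mapping syntax (a sketch for illustration, not a recommendation):

"type_name": {
  "_source": { "enabled": false },
  "properties": {
    "url": {
      "type": "string",
      "index": "no",
      "store": "yes"
    }
  }
}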
Finally...
I would let elasticsearch store your whole document (the default) and use the following mapping:
"type_name": {
"properties": {
"url": {
"type": "string",
"index": "no",
"include_in_all": "false"
},
// other fields' mappings
}
}
Source: elasticsearch documentation
There are two ways to put data into the index: indexing and storing. Indexing a piece of data means it is tokenized and placed in the inverted index, and can be searched. Storing data means it is not tokenized or analyzed at all, and is not added to the inverted index; it is kept in an entirely separate area, in its full text form. It cannot be searched against, but can be retrieved, in its original form, by its document ID.
The typical Lucene query process is to query against the indexed data, get back the document IDs of the matching documents, then use those document IDs to retrieve the stored data for those documents and display it to the user.
Data which is indexed but not stored is searchable, but cannot be retrieved in its original form.
Data which is stored but not indexed can be retrieved once you have found a hit, but is not searchable.
Data which is both indexed and stored can be searched and retrieved.
Data which is neither indexed nor stored is not added to the index at all.
This is covered a bit in the Lucene FAQ.
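To make the four combinations concrete, here is a minimal sketch using the modern Lucene Java field classes (the field names and values are illustrative; older Lucene versions use different class names, such as Field with explicit Store/Index flags):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class StoredVsIndexed {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                new ByteBuffersDirectory(),
                new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();

            // Indexed and stored: searchable, and returned with the hit.
            doc.add(new TextField("title", "Some title", Field.Store.YES));

            // Indexed but not stored: searchable, but not retrievable in its original form.
            doc.add(new TextField("body", "Long body text ...", Field.Store.NO));

            // Stored but not indexed: retrievable by document ID, but not searchable.
            doc.add(new StoredField("url", "http://www.domain.com/some/page"));

            writer.addDocument(doc);
        }
    }
}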
You are looking for the 'index' => 'not_analyzed' mapping option.
Also, if you use the _source, you do not have to specify the store => false option.
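For reference, in the old (pre-5.x) mapping syntax that option looks roughly like this (the url field name is taken from the question):

"url": {
  "type": "string",
  "index": "not_analyzed"
}

Note that not_analyzed still indexes the value as a single untokenized term, whereas "index": "no" skips indexing the field entirely.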
I have a document in RavenDB that looks like:
{
  "ItemId": 1,
  "Title": "Villa"
}
With the following metadata:
Raven-Clr-Type: MyNamespace.Item, MyNamespace
Raven-Entity-Name: Doelkaarten
So I serialized it with the type MyNamespace.Item, but gave it my own Raven-Entity-Name so that it gets its own collection.
In my code I define an index:
public class DoelkaartenIndex : AbstractIndexCreationTask<Item>
{
    public DoelkaartenIndex()
    {
        // MetadataFor(doc)["Raven-Entity-Name"].ToString() == "Doelkaarten"
        Map = items => from item in items
                       where MetadataFor(item)["Raven-Entity-Name"].ToString() == "Doelkaarten"
                       select new { Id = item.ItemId, Name = item.Title };
    }
}
In the index definition, this is translated in the "Maps" field to:
docs.Items
    .Where(item => item["#metadata"]["Raven-Entity-Name"].ToString() == "Doelkaarten")
    .Select(item => new { Id = item.ItemId, Name = item.Title })
A query on the index never gives results.
If the Maps field is manually changed to the code below it works...
from doc in docs
where doc["#metadata"]["Raven-Entity-Name"] == "Doelkaarten"
select new { Id = doc.ItemId, Name = doc.Title };
How is it possible to define in code the index that gives the required result?
RavenDB used: RavenHQ, Build #961
UPDATE:
What I'm doing is the following: I want to use SharePoint as a CMS and use RavenDB as a read-only replication of the SharePoint list data. I created a tool to sync from SharePoint lists to RavenDB. I have a generic type Item that I create from a SharePoint list item and that I serialize into RavenDB, so all my docs are of type Item. But they come from different lists with different properties, so I want to be able to differentiate. You propose to differentiate on an additional property; this would work perfectly. But then I would see all list items from all lists in one big Items collection... What do you think would be the best approach to this problem? Or should I just live with it? I want to use the indexes to create projections from all the data in an Item to the actual data that I need.
You can't easily change the name of a collection this way. The server-side will use the Raven-Entity-Name metadata, but the client side will determine the collection name via the conventions registered with the document store. The default convention being to use the type name of the entity.
You can provide your own custom convention by assigning a new function to DocumentStore.Conventions.FindTypeTagName - but it would probably be cumbersome to do that for every entity. You could create a custom attribute to apply to your entities and then write the function to look for and understand that attribute.
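A sketch of what such a convention could look like with the client API from that era (documentStore is assumed to be your initialized DocumentStore; verify the exact member names against your client version):

// Register the convention before storing documents or creating indexes.
documentStore.Conventions.FindTypeTagName = type =>
    type == typeof(Item)
        ? "Doelkaarten"
        : Raven.Client.Document.DocumentConvention.DefaultTypeTagName(type);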
Really the simplest way is just to call your entity Doelkaarten instead of Item.
Regarding why the change in indexing works - it's not because of the switch in LINQ syntax; it's because you said from doc in docs instead of from doc in docs.Items. You probably could have done from doc in docs.Doelkaarten instead of using the where clause - they are equivalent. See this page in the docs for further examples.
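If you do want to keep the entity named Item but still define the Doelkaarten index from code, one option (a sketch against the old DatabaseCommands API available around build #961; verify against your client version) is to put the index definition with the map written as a string:

// documentStore is assumed to be an initialized DocumentStore.
documentStore.DatabaseCommands.PutIndex(
    "DoelkaartenIndex",
    new Raven.Abstractions.Indexing.IndexDefinition
    {
        Map = @"from doc in docs.Doelkaarten
                select new { Id = doc.ItemId, Name = doc.Title }"
    });

This bypasses the client-side collection-name convention entirely, since the map string references the collection by its Raven-Entity-Name.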