How to add a custom field in the significant terms aggregation bucket

I am using the significant terms aggregation in Elasticsearch and I am wondering whether it is possible to add a custom field to each element in the bucket. Currently, a bucket element looks like this:
{
"key" : "Q",
"doc_count" : 4,
"score" : 4.731818103557571,
"bg_count" : 22
},
By default each bucket element has 4 fields, and I want to add another one that is computed from the documents belonging to that bucket (4 documents in the example above).
I found that there is a way to customize the score itself, but that's not what I want. What I want is to add a new custom field that is calculated from the documents having the same key.
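One possible direction, offered as a sketch rather than a confirmed answer: significant_terms is a bucket aggregation, so a sub-aggregation can be nested under it, and its result is attached to each bucket next to key, doc_count, score and bg_count. The field names "tags", "description" and "price" and the aggregation names below are assumptions made up for illustration; only the request body is built here, nothing is sent to a cluster.

```python
import json

# Hypothetical fields: "tags" is analysed for significant terms,
# "price" is a numeric field on the same documents.
request_body = {
    "query": {"match": {"description": "example"}},
    "aggs": {
        "significant_tags": {
            "significant_terms": {"field": "tags"},
            # Sub-aggregation: computed per bucket, from the documents
            # that share the bucket's key.
            "aggs": {
                "avg_price": {"avg": {"field": "price"}}
            },
        }
    },
}

print(json.dumps(request_body, indent=2))
```

With a body of this shape, each returned bucket would be expected to carry an extra "avg_price" entry computed from its own documents.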

How can I query data from FaunaDb if only some collections have a specific property which I need to filter out

I'm really new to FaunaDb, and I currently have a collection of Users and an Index from that collection: (users_waitlist) that has fewer fields.
When a new User is created, the "waitlist_meta" property is an empty array initially, and when that User gets updated to join the waitlist, a new field is added to the User's waitlist_meta array.
Now, I'm trying to get only the collections whose waitlist_meta array contains an added item (which, by the way, is a ref to another index (products)). In other words: if the array contains items, then return the collection/index.
How can I achieve this? By running this query:
Paginate(Match(Index('users_waitlist')))
Obviously, I'm still getting all collections with the empty array (waitlist_meta: [])
Thanks in advance
you need to add terms to your index, which are explained briefly here.
the way I find it useful to conceptualise this is that when you add terms to an index, it's partitioned into separate buckets so that later when you match that index with a specific term, the results from that particular bucket are returned.
it's a slightly more complicated case here because you need to transform your actual field (the actual value of waitlist_meta) into something else (is waitlist_meta defined or not?) - in fauna this is called a binding. you need something along the lines of:
CreateIndex({
"name": "users_by_is_on_waitlist",
"source": [{
"collection": Collection("users"),
"fields": {
"isOnWaitlist": Query(Lambda("doc", ContainsPath(["data", "waitlist_meta"], Var("doc"))))
}
}],
"terms": [{
"binding": "isOnWaitlist"
}]
})
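what this binding does is run a Lambda for each document in the collection to compute a property based on the document's fields. in our case it's isOnWaitlist, which is defined by whether or not the document contains the field waitlist_meta. note that ContainsPath only checks that the path exists, so if every user starts out with an empty waitlist_meta array, the Lambda would need to test whether the array is non-empty instead. we then add this binding as a term to the index, meaning we can later query the index with: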
Paginate(Match(Index("users_by_is_on_waitlist"), true))
where true here is the single term for our index (it could be an array if our index had multiple terms). this query should now return all the users that have been added to the waitlist!
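The bucket idea can be sketched in plain Python. This is only a model of the concept, not Fauna's API: build_index, match and is_on_waitlist are hypothetical helpers, and the binding here assumes the non-empty check (rather than mere presence of the field) is what is wanted.

```python
from collections import defaultdict


def is_on_waitlist(doc):
    """Binding: True when data.waitlist_meta exists and is non-empty."""
    return bool(doc.get("data", {}).get("waitlist_meta"))


def build_index(docs, binding):
    """Partition document ids into buckets keyed by the computed term."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[binding(doc)].append(doc["id"])
    return buckets


def match(index, term):
    """Return the ids stored in the bucket for one term value."""
    return index.get(term, [])


users = [
    {"id": "u1", "data": {"waitlist_meta": []}},
    {"id": "u2", "data": {"waitlist_meta": ["products/1"]}},
    {"id": "u3", "data": {}},
]

index = build_index(users, is_on_waitlist)
print(match(index, True))   # ['u2']
print(match(index, False))  # ['u1', 'u3']
```

The binding runs once per document when the index is built, so a later match is just a lookup into the right bucket.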

How to construct intersection in REST Hypermedia API?

This question is language independent. Let's not worry about frameworks or implementation, let's just say everything can be implemented and let's look at REST API in an abstract way. In other words: I'm building a framework right now and I didn't see any solution to this problem anywhere.
Question
How can one construct a REST URL endpoint for the intersection of two independent REST paths that return collections? Short example: How to intersect /users/1/comments and /companies/6/comments?
Constraint
All endpoints should return single data model entity or collection of entities.
Imho this is a very reasonable constraint and all examples of Hypermedia APIs look like this, even in draft-kelly-json-hal-07.
If you think this is an invalid constraint or you know a better way please let me know.
Example
So let's say we have an application which has three data types: products, categories and companies. Each company can add some products to their profile page. While adding the product they must attach a category to the product. For example we can access this kind of data like this:
GET /categories will return collection of all categories
GET /categories/9 will return category of id 9
GET /categories/9/products will return all products inside category of id 9
GET /companies/7/products will return all products added to profile page of company of id 7
I've omitted _links hypermedia part on purpose because it is straightforward, for example / gives _links to /categories and /companies etc. We just need to remember that by using hypermedia we are traversing relations graph.
How to write a URL that will return all products that are from company(7) and are of category(9)? In other words, how to intersect /categories/9/products and /companies/7/products?
Assuming that all endpoints should represent data model resource or collection of them I believe this is a fundamental problem of REST Hypermedia API, because in traversing hypermedia api we are traversing relational graph going down one path so it is impossible to describe such intersection because it is a cross-section of two independent graph paths.
In other words I think we cannot represent two independent paths with only one path. Normally we traverse one path like A->B->C, but if we have X->Y and Z->Y and we want all Ys that come from X and Z then we have a problem.
So far my proposal is to use query strings: /categories/9/products?intersect=/companies/7, but can we do better?
Why do I want this?
Because I'm building a framework which will auto-generate a REST Hypermedia API based on SQL database relations. You could think of it as a transcompiler of URLs to SELECT ... JOIN ... WHERE queries, but the client of the API only sees Hypermedia and the client would like to have a nice way of doing intersections, like in the example.
I don't think you should always look at REST as database representation, this case looks more of a kind of specific functionality to me. I think I'd go with something like this:
/intersection/comments?company=9&product=5
I've been digging after I wrote it and this is what I've found (http://www.vinaysahni.com/best-practices-for-a-pragmatic-restful-api):
Sometimes you really have no way to map the action to a sensible RESTful structure. For example, a multi-resource search doesn't really make sense to be applied to a specific resource's endpoint. In this case, /search would make the most sense even though it isn't a resource. This is OK - just do what's right from the perspective of the API consumer and make sure it's documented clearly to avoid confusion.
What you want to do is to filter products in one of the categories, so following your example, if we have:
GET /categories/9/products
The above will return all products in category 9, so to restrict them to the products of company 7 I would use something like this:
GET /categories/9/products?company=7
You should treat the URI as a link to fetch all data (just like a simple SELECT query in SQL) and the query parameters as WHERE, LIMIT, DESC etc.
Using this approach you can build complex and readable queries, e.g.:
GET /categories/9/products?company=7&order=name,asc&offset=10&limit=20
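The mapping from path and query parameters to a SELECT can be sketched as follows. This is a minimal illustration under stated assumptions: the table name products, the columns category_id, company_id and name, and the helper url_to_sql are all hypothetical, and only the /categories/<id>/products path shape is handled.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical schema: products(category_id, company_id, name, ...).
FILTER_COLUMNS = {"company": "company_id"}  # query parameter -> column
SORTABLE = {"name"}                         # columns allowed in ORDER BY


def url_to_sql(url):
    """Translate /categories/<id>/products?... into (sql, args)."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if len(segments) != 3 or segments[0] != "categories" or segments[2] != "products":
        raise ValueError("unsupported path: " + parts.path)
    params = parse_qs(parts.query)

    # The path selects the base collection, the parameters narrow it.
    where, args = ["category_id = ?"], [int(segments[1])]
    for name, column in FILTER_COLUMNS.items():
        if name in params:
            where.append(column + " = ?")
            args.append(int(params[name][0]))
    sql = "SELECT * FROM products WHERE " + " AND ".join(where)

    if "order" in params:
        column, _, direction = params["order"][0].partition(",")
        if column not in SORTABLE:
            raise ValueError("cannot sort by " + column)
        sql += " ORDER BY " + column + (" DESC" if direction.lower() == "desc" else " ASC")
    if "limit" in params:
        sql += " LIMIT ?"
        args.append(int(params["limit"][0]))
        if "offset" in params:  # offset is only applied together with limit
            sql += " OFFSET ?"
            args.append(int(params["offset"][0]))
    return sql, args


sql, args = url_to_sql("/categories/9/products?company=7&order=name,asc&offset=10&limit=20")
print(sql)
print(args)  # [9, 7, 20, 10]
```

Values are passed as bound parameters and sortable columns come from a whitelist, so nothing from the URL is spliced into the SQL text unchecked.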
"All endpoints should return single data model entity or collection of entities."
This is NOT a REST constraint. If you want to read about REST constraints, then read the Fielding dissertation.
"Because I'm building a framework which will auto-generate REST Hypermedia API based on SQL database relations."
This is a wrong approach and has nothing to do with REST.
With REST you describe possible resource state transitions (or operation call templates) by sending hyperlinks in the response. These hyperlinks consist of an HTTP method and a URI (and other data which is not relevant now) if you build the uniform interface using the HTTP and URI standards, and we usually do so. The URIs are not (necessarily) database entity and collection identifiers, and if you apply such a constraint you will end up with a CRUD API, not with a REST API.
If you cannot describe an operation with the combination of HTTP methods and already existing resources, then you need a new resource.
In your case you want to aggregate the GET /users/1/comments and GET /companies/6/comments responses, so you need to define a link with GET and a third resource:
GET /comments/?users=1&companies=6
GET /intersection/users:1/companies:6/comments
GET /intersection/users/1/companies/6/comments
etc...
RESTful architecture is about returning resources that contain hypermedia controls that offer state transitions. What I see here is a multistep process of state transitions. Let's assume you have a root resource and somehow navigate over to /categories/9/products using the available hypermedia controls. I'd bet the results would look something like this in HAL:
{
_links : {
self : { href : "/categories/9/products"}
},
_embedded : {
item : [
{json of prod 1},
{json of prod 2}
]
}
}
If you want your client to be able to intersect this with another collection, you need to provide them with the mechanism to do so. You have to give them a hypermedia control. HAL only has links, templated links, and embedded resources as control types. Let's go with links and change the response to:
{
_links : {
self : { href : "/categories/9/products"},
x:intersect-with : [
{
href : "URL IS ABSOLUTELY IRRELEVANT!!! but unique 1",
title : "Company 6 products"
},
{
href : "URL IS ABSOLUTELY IRRELEVANT!!! but unique 2",
title : "Company 5 products"
},
{
href : "URL IS ABSOLUTELY IRRELEVANT!!! but unique 3",
title : "Company 7 products"
}
]
},
_embedded : {
item : [
{json of prod 1},
{json of prod 2}
]
}
}
Now the client just picks the right hypermedia control (aka link) based on the title field of the link.
That's the simplest solution. But you'll probably say there are thousands of companies and you don't want thousands of links. Well, OK, if that's REALLY the case, you just offer a state transition in the middle of the two we have:
{
_links : {
self : { href : "/categories/9/products"},
x:intersect-options : { href : "URL to a Paged collection of all intersect options"},
x:intersect-with : [
{
href : "URL IS ABSOLUTELY IRRELEVANT!!! but unique 1",
title : "Company 6 products"
},
{
href : "URL IS ABSOLUTELY IRRELEVANT!!! but unique 2",
title : "Company 5 products"
},
{
href : "URL IS ABSOLUTELY IRRELEVANT!!! but unique 3",
title : "Company 7 products"
}
]
},
_embedded : {
item : [
{json of prod 1},
{json of prod 2}
]
}
}
See what I did there? An extra control for an extra state transition. JUST LIKE YOU WOULD DO IF YOU HAD A WEBPAGE. You'd probably put it in a pop-up; well, that's what the client of your app can do too with the result of that control.
It's really that simple: just think how you'd do it in HTML and do the same.
The big benefit here is that the client NEVER EVER needed to know a company or category ID or plug one into some template. The IDs are implementation details; the client never knows they exist, it just executes hypermedia controls, and that is RESTful.
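On the client side, picking a control by relation and title could look like the sketch below. The helper find_link and the href values are hypothetical; the hrefs are deliberately opaque strings, since the point above is that the client never parses or constructs them.

```python
def find_link(resource, rel, title=None):
    """Return the href of the first link under `rel` (optionally by title)."""
    links = resource.get("_links", {}).get(rel, [])
    if isinstance(links, dict):  # HAL allows one link object or an array
        links = [links]
    for link in links:
        if title is None or link.get("title") == title:
            return link["href"]
    return None


response = {
    "_links": {
        "self": {"href": "/categories/9/products"},
        "x:intersect-with": [
            {"href": "/opaque/1", "title": "Company 6 products"},
            {"href": "/opaque/3", "title": "Company 7 products"},
        ],
    },
    "_embedded": {"item": []},
}

print(find_link(response, "x:intersect-with", "Company 7 products"))  # /opaque/3
print(find_link(response, "self"))  # /categories/9/products
```

The client only knows the relation name and the human-readable title; whatever URL the server put in href is followed as-is.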

Optimizing 4 sorted sets

I'm currently using Redis as a time sorted index for my Mongo database. My mongo document looks something like this:
{
_id : MongoId,
title : "Title",
publishDate : new Date().getTime(),
attachments : [
{
title : "Title",
type : "VIDEO"
},
{
title : "Title",
type : "PHOTO"
},
{
title : "Title",
type : "LINK"
}
]
}
Attachments can have 0 to 3 items in it. (No attachment, only link / only video / only photo, 2 of different types, 3 of different types)
Each time a document is added to Mongo it's automatically added to a sorted set in Redis.
The publishDate (unix timestamp) is used as the score and MongoId is used as the sorted set member. The key has the name "monitor:latest:feed"
If the attachments array contains a link, I add the same member to a sorted set with the key "monitor:latest:feed:links". Same goes for videos/photos.
The result is that I have 4 Redis sorted sets that are ordered by time.
The sorted sets are called:
"monitor:latest:feed" that contains all of the document ids in mongodb.
"monitor:latest:feed:links" contains all the document ids that have an attachment with the type link.
"monitor:latest:feed:photos" contains all the document ids that have an attachment with the type photo.
"monitor:latest:feed:video" contains all the document ids that have an attachment with the type video.
Is there a way I could remove the last 3 sorted sets or use some other data structure that takes less memory in Redis? Keeping in mind that I would still need the members sorted by time somehow.
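For reference, the write path described above can be modelled in a few lines. This is an in-memory stand-in, not a Redis client: the dict sorted_sets plays the role of the server, and index_document and latest are hypothetical helpers that correspond roughly to ZADD and a newest-first range read.

```python
from collections import defaultdict

BASE_KEY = "monitor:latest:feed"
SUFFIX = {"LINK": "links", "PHOTO": "photos", "VIDEO": "video"}

# Stand-in for Redis: key -> {member: score}.
sorted_sets = defaultdict(dict)


def index_document(doc):
    """Add the document id to the main set and to one set per attachment type."""
    member, score = str(doc["_id"]), doc["publishDate"]
    sorted_sets[BASE_KEY][member] = score
    for kind in {a["type"] for a in doc.get("attachments", [])}:
        sorted_sets[BASE_KEY + ":" + SUFFIX[kind]][member] = score


def latest(key, count):
    """Newest-first member ids for one key."""
    items = sorted(sorted_sets.get(key, {}).items(), key=lambda kv: kv[1], reverse=True)
    return [member for member, _ in items[:count]]


index_document({"_id": "a1", "publishDate": 100,
                "attachments": [{"title": "Title", "type": "VIDEO"}]})
index_document({"_id": "b2", "publishDate": 200,
                "attachments": [{"title": "Title", "type": "LINK"},
                                {"title": "Title", "type": "VIDEO"}]})
index_document({"_id": "c3", "publishDate": 300, "attachments": []})

print(latest("monitor:latest:feed", 2))         # ['c3', 'b2']
print(latest("monitor:latest:feed:video", 10))  # ['b2', 'a1']
```

The sketch makes the memory cost visible: a document with attachments of all three types is stored four times, once per sorted set.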

The RESTful way to include or not include children of a resource?

Say I have a team object that has a name property, a city property and a players property, where the players property is an array of possibly many players. This is represented in an SQL database with a teams table and a players table, where each player has a name and a team_id.
Building a RESTful API based on this simple data structure, I'm unsure whether there is a clear rule about whether the returned object should or could include a list of players when hitting /teams/:id.
I have a view, that needs to show a team, and its players with their names, so:
1: Should /teams/:id join the two tables behind the scenes and return the full team object, with a players property that is an array of names and IDs?
2: Should /teams/:id join the two tables behind the scenes and return the team object, with a players property that is an array of just IDs, which will then have to be queried one by one at /players/:id ?
3: Should two calls be made, one to /teams/:id and one to /teams/:id/players ?
4: Should a query string be used like this /teams/:id?fields=name,city,players ?
If either 2 or 3 is the way to go, how would one approach the situation where a team could also have multiple cities, resulting in another cities table in the DB to keep it normalized? Should a new endpoint then be created at /teams/:id/cities?
When creating RESTful APIs, is it the normalized data structure in the DB that dictates the endpoints in the API?
Usually with a RESTful API, it is best that the use-cases dictate the endpoints of the API, not necessarily the data structure.
If you sometimes need just the teams, sometimes need just the players of a team, and sometimes need both together, I would have 3 distinct calls, probably something like /teams/:id, /players/:teamid and player-teams/:teamid (or something similar).
The reason you want to do it this way is because it minimizes the number of HTTP requests that need to be made for any given page. Of all of the typical performance issues, an inflated number of HTTP requests is usually one of the most common performance hits, and usually one of the easiest to avoid.
That being said, you also don't want to go so crazy that you create an over-inflated API. Think through the typical use cases and make calls for those. Don't just implement every possible combination you can think of just for the sake of it. Remember You Aren't Gonna Need It.
I'd suggest something like:
GET /teams
{
"id" : 12,
"name" : "MyTeam",
"players" :
{
"self" : "http://my.server/players?teamName=MyTeam"
},
"city" :
{
"self" : "http://my.server/cities/MyCity"
}
}
GET /cities
GET /cities/{cityId}
GET /players
GET /players/{playerId}
You can then use URIs to call out to get whatever other related resources you need. If you want the flexibility to embed values, you can use ?expand, such as:
GET /teams?expand=players
{
"id" : 12,
"name" : "MyTeam",
"players" :
{
"self" : "http://my.server/players?teamName=MyTeam",
"items" :
[
{
"name" : "Mary",
"number" : "12"
},
{
"name" : "Sally",
"number" : "15"
}
]
},
"city" :
{
"self" : "http://my.server/cities/MyCity"
}
}
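A server-side sketch of the ?expand idea, under stated assumptions: the in-memory TEAMS and PLAYERS data, the helper team_representation and the "items" key that holds the embedded players are all hypothetical and only illustrate the shape of the response.

```python
BASE = "http://my.server"

# Hypothetical in-memory data standing in for the teams and players tables.
TEAMS = {12: {"id": 12, "name": "MyTeam", "city": "MyCity"}}
PLAYERS = [
    {"name": "Mary", "number": "12", "team_id": 12},
    {"name": "Sally", "number": "15", "team_id": 12},
]


def team_representation(team_id, expand=()):
    """Build the team resource; embed related data only when asked to."""
    team = TEAMS[team_id]
    players = {"self": BASE + "/players?teamName=" + team["name"]}
    if "players" in expand:
        # "items" is an assumed key name for the embedded collection.
        players["items"] = [
            {"name": p["name"], "number": p["number"]}
            for p in PLAYERS
            if p["team_id"] == team_id
        ]
    return {
        "id": team["id"],
        "name": team["name"],
        "players": players,
        "city": {"self": BASE + "/cities/" + team["city"]},
    }


plain = team_representation(12)
expanded = team_representation(12, expand=("players",))
print("items" in plain["players"])                      # False
print([p["name"] for p in expanded["players"]["items"]])  # ['Mary', 'Sally']
```

The link is always present, so a client that did not ask for expansion can still follow "self" to fetch the players in a second request.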

How to implement our own UID in Lucene?

I wish to create an index with, let's say, the following fields:
UID
title
owner
content
out of which, I don't want UID to be searchable. [ like meta data ]
I want the UID to behave like docID so that when I want to delete or update, I'll use this.
Is this possible ? How to do this ?
You could mark it as non-searchable by adding it with Store.YES and Index.NO, but that won't allow easy updating or removal by that field. You'll need to index the field to allow replacing the document (using IndexWriter.UpdateDocument(Term, Document) where term = new Term("UID", "...")), so you need to use either Index.ANALYZED with a KeywordAnalyzer, or Index.NOT_ANALYZED. You can also use the FieldCache if you have a single-valued field, which a primary key usually is. However, this makes it searchable.
Summary:
Store.NO (It can be retrieved using the FieldCache or a TermsEnum)
Index.NOT_ANALYZED (The complete value will be indexed as a single term, including any whitespace)
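To illustrate why the UID has to be indexed as one exact term for update-by-term to work, here is a toy model in Python. It is not Lucene's API: the class TinyIndex and its methods are invented for illustration and only mimic the delete-then-add behaviour of updating a document by term.

```python
class TinyIndex:
    """Toy stand-in for an index that keeps some fields as exact terms."""

    def __init__(self):
        # Each entry: (stored document, set of (field, value) exact terms).
        self._docs = []

    def add_document(self, doc, not_analyzed=("UID",)):
        # The complete field value becomes a single term, whitespace included.
        terms = {(field, doc[field]) for field in not_analyzed}
        self._docs.append((doc, terms))

    def delete_documents(self, field, value):
        self._docs = [(d, t) for d, t in self._docs if (field, value) not in t]

    def update_document(self, field, value, doc):
        # Update = delete everything matching the term, then add the new doc.
        self.delete_documents(field, value)
        self.add_document(doc)

    def search(self, field, value):
        return [d for d, t in self._docs if (field, value) in t]


index = TinyIndex()
index.add_document({"UID": "doc 42", "title": "old title"})
index.update_document("UID", "doc 42", {"UID": "doc 42", "title": "new title"})

print([d["title"] for d in index.search("UID", "doc 42")])  # ['new title']
print(index.search("UID", "doc"))  # []
```

Because the term lookup that finds the document to replace is the same lookup a search uses, a field that supports update by term is necessarily searchable as well.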