MongoDB Aggregate to Fill an Array Where Not In Results - mongodb-query

Simplifying the case, I have the following collections:
db.emails.find()
{ "_id" : ObjectId("5b59db643fd217eb78b1eb6d"), "title" : "abc" }
{ "_id" : ObjectId("5b59db643fd217eb78b1eb6e"), "title" : "def" }
and
db.users.find()
{ "_id" : ObjectId("5b59dbab3fd217eb78b1eb70"), "name" : "john", "forwarded" : [ObjectId("5b59db643fd217eb78b1eb6d")] }
{ "_id" : ObjectId("5b59dbac3fd217eb78b1eb71"), "name" : "mary", "forwarded" : [ObjectId("5b59db643fd217eb78b1eb6e")] }
where the forwarded field lists all the emails that have already been forwarded to that user.
Using aggregate, I need to produce something like this:
{ "_id" : ObjectId("5b59dbab3fd217eb78b1eb70"), "name" : "john", "forwarded" : [ { _id: ObjectId("5b59db643fd217eb78b1eb6d"), title: "abc" } ], "not_forwarded" : [ { _id: ObjectId("5b59db643fd217eb78b1eb6e"), title: "def" } ] }
{ "_id" : ObjectId("5b59dbac3fd217eb78b1eb71"), "name" : "mary", "forwarded" : [ { _id: ObjectId("5b59db643fd217eb78b1eb6e"), title: "def" } ], "not_forwarded" : [ { _id: ObjectId("5b59db643fd217eb78b1eb6d"), title: "abc" } ] }
So far, I have been able to create the "forwarded" part mappings, but I'm still trying to create the "not_forwarded" mapping.
I'm too much of a noob in the noSQL world to make this simple "outer join".
As I said, for the "forwarded" mapping, it was easy. I created something like this:
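db.users.aggregate([
  // Replace each ObjectId in "forwarded" with the matching email document.
  { $lookup: { from: "emails", localField: "forwarded", foreignField: "_id", as: "forwarded" } },
  // No "email_ids" field exists at this point, so this exclusion has no effect.
  { $project: { "email_ids._id": 0, "email_ids.title": 0 } }
]);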
Now I'm trying to add the "not_forwarded" part with some $match stage, where (I guess) the clause would use $nin, so that the "not_forwarded" array is filled with all the items not yet sent.
...and that's where I'm miserably failing.
Or, in much simpler terms, I'm trying to come up with a query that returns all the emails that haven't yet been forwarded to each user in the DB.
Until now, I did this with JavaScript loops: first over the emails recordset, then, nested inside, over all the users, building a new recordset with all the emails still pending.
This is ugly and performs poorly. I would very much like to take full advantage of the aggregate mechanism to get these results.
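One possible way to build both arrays in a single aggregate is sketched below. It is not a confirmed solution from this thread, and it rests on a few assumptions: MongoDB 3.6 or later (for the pipeline form of $lookup), an emails collection small enough to be attached in full to every user, and a made-up temporary field name, all_emails. The idea is to join every email to every user and then split that list with $filter, depending on whether the email's _id appears in the user's forwarded array.

```javascript
// Sketch only: assumes MongoDB 3.6+ and a reasonably small "emails" collection.
const notForwardedPipeline = [
  // Uncorrelated $lookup: an empty sub-pipeline attaches every email to each user.
  // "all_emails" is a hypothetical temporary field name.
  { $lookup: { from: "emails", pipeline: [], as: "all_emails" } },
  {
    $project: {
      name: 1,
      // Emails whose _id is listed in the user's original "forwarded" array.
      forwarded: {
        $filter: {
          input: "$all_emails",
          as: "email",
          cond: { $in: ["$$email._id", { $ifNull: ["$forwarded", []] }] }
        }
      },
      // Every other email, that is, the ones not yet forwarded.
      not_forwarded: {
        $filter: {
          input: "$all_emails",
          as: "email",
          cond: { $not: [{ $in: ["$$email._id", { $ifNull: ["$forwarded", []] }] }] }
        }
      }
    }
  }
];

// Inside the mongo shell "db" is defined, so the pipeline can run directly.
if (typeof db !== "undefined") {
  db.users.aggregate(notForwardedPipeline).forEach(printjson);
}
```

The $ifNull guard is there because $in expects an array as its second argument, so a user without a forwarded field would otherwise make the stage fail. With a large emails collection, attaching all emails to every user may be too expensive, and a different design would be worth considering.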

Related

How to look for more than one element in an embedded array in MongoDb

I have a MongoDB query (give me the settings where account = 'test1'):
db.collection_name.find({"account" : "test1"}, {settings : 1}).pretty();
where I get the following output:
{
  "_id" : ObjectId("49830ede4bz08bc0b495f123"),
  "settings" : {
    "clusterData" : {
      "us-south-1" : "cluster1",
      "us-east-1" : "cluster2"
    }
  }
}
What I'm looking for now is the accounts where clusterData has more than one element.
I'm only interested in listing the accounts with two or more elements.
I've tried this:
db.collection_name.find({'settings.clusterData.1': {$exists: true}}, {account : 1}).pretty();
It's not returning any results. Is my query correct? Is there another way to do this?
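The reason it isn't working is that your clusterData is an object, not an array. The path 'settings.clusterData.1' matches either the second element of an array or a key literally named "1", and your object has neither. I would suggest changing your data to an array of clusters with two properties each, as below; then your query will work.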
{
"_id" : ObjectId("49830ede4bz08bc0b495f123"),
"settings" : {
"clusterData" : [
{
name : "cluster1",
location : "us-south-1"
},
{
name : "cluster2",
location : "us-east-1"
}
]
}
}
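If changing the schema is not an option, a possible alternative is to count the keys of the embedded object inside an aggregation. This is only a sketch and not part of the answer above. It assumes a MongoDB version that provides $objectToArray (3.4.4 or later, as far as I know), and the clusterCount field name is invented for the example.

```javascript
// Sketch only: counts the keys of settings.clusterData when it is an object.
const clusterCountPipeline = [
  {
    $addFields: {
      // $objectToArray turns { k1: v1, k2: v2 } into [ { k, v }, { k, v } ],
      // so $size yields the number of keys. A missing field counts as zero.
      clusterCount: {
        $size: {
          $objectToArray: { $ifNull: ["$settings.clusterData", { $literal: {} }] }
        }
      }
    }
  },
  // Keep only the documents with two or more clusters.
  { $match: { clusterCount: { $gte: 2 } } },
  { $project: { account: 1 } }
];

// Inside the mongo shell "db" is defined, so the pipeline can run directly.
if (typeof db !== "undefined") {
  db.collection_name.aggregate(clusterCountPipeline).forEach(printjson);
}
```

This computes the count for every document and cannot use an index, so on a large collection the array layout suggested above is likely the better long-term choice.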

MongoDB is slower than SQL Server

I have the same data, around 30 million records, saved in a SQL Server table and in a MongoDB collection. A sample record is shown below, and I have set up the same indexes on both. Below are the queries that return the same data, one in SQL and the other in Mongo. The SQL query takes 2 seconds to compute and return; Mongo, on the other hand, takes 50. Any ideas why Mongo is so much slower than SQL?
SQL
SELECT
COUNT(DISTINCT IP) AS Count,
DATEPART(dy, datetime)
FROM
collection
GROUP BY
DATEPART(dy, datetime)
MONGO
db.collection.aggregate([
  { $group: { _id: { $dayOfYear: "$datetime" }, IP: { $addToSet: "$IP" } } },
  { $unwind: "$IP" },
  { $group: { _id: "$_id", count: { $sum: 1 } } }
])
Sample document (there are around 30 million records of exactly this shape in both stores):
{
"_id" : ObjectId("57968ebc7391bb1f7c2f4801"),
"IP" : "127.0.0.1",
"userAgent" : "Mozilla/5.0+(Windows+NT+10.0;+WOW64;+Trident/7.0;+LCTE;+rv:11.0)+like+Gecko",
"Country" : null,
"datetime" : ISODate("2016-07-25T16:50:18-05:00"),
"proxy" : null,
"url" : "/records/archives/archivesdb/deathcertificates/",
"HTTPStatus" : "302",
"HTTPResponseTime" : "218"
}
EDIT: added the explanation of both queries
MONGO
{
"waitedMS" : NumberLong(0),
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"IP" : 1,
"datetime" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "IISLogs.pubprdweb01",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [ ]
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : {
"$dayOfYear" : [
"$datetime"
]
},
"IP" : {
"$addToSet" : "$IP"
}
}
},
{
"$unwind" : {
"path" : "$IP"
}
},
{
"$group" : {
"_id" : "$_id",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
}
],
"ok" : 1
}
For SQL Server, I don't have permission to view the execution plan since I'm not a DBA, but it runs fast enough that I'm not too concerned about it. What troubles me is that Mongo is using FETCH.
The MongoDB version is slow because $group can't use an index (as evidenced by the "COLLSCAN" in the query plan), so all 30 million docs must be read into memory and run through the pipeline.
This type of real-time query (computing summary data from all docs) is simply not a good fit for MongoDB. It would be better to periodically run your aggregate with an $out stage (or use a map-reduce) to generate the summary data from the main collection and then query the resulting summary collection instead.
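A minimal sketch of the periodic $out approach might look like the following. The summary collection name daily_unique_ips is hypothetical, and the pipeline is reshaped rather than copied from the question: it groups on the (day, IP) pair first, which de-duplicates IPs without building one large array per day. Whether that is faster on this data set is an assumption that would need measuring.

```javascript
// Sketch only: "daily_unique_ips" is a made-up name for the summary collection.
const dailyUniqueIpPipeline = [
  // One group per (day of year, IP) pair, which de-duplicates IPs within a day.
  { $group: { _id: { day: { $dayOfYear: "$datetime" }, ip: "$IP" } } },
  // Count the distinct IPs that remain for each day.
  { $group: { _id: "$_id.day", count: { $sum: 1 } } },
  // Write the result to the summary collection, replacing its previous contents.
  { $out: "daily_unique_ips" }
];

// Inside the mongo shell "db" is defined, so the pipeline can run directly.
if (typeof db !== "undefined") {
  // allowDiskUse lets the $group stages spill to disk on a large collection.
  db.collection.aggregate(dailyUniqueIpPipeline, { allowDiskUse: true }).toArray();
  // Later reads hit the small summary collection instead of 30 million docs.
  db.daily_unique_ips.find().sort({ _id: 1 }).forEach(printjson);
}
```

The aggregate would be run on a schedule (for example from a cron job), and the application would query only the summary collection.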

What is the default doc sequence of the result from an Elasticsearch filter request?

I recently ran the following Elasticsearch filter request:
{
"from" : 0,
"size" : 10,
"query" : {
"filtered" : {
"filter" : {
"bool" : {
"must" : {
"terms" : {
"a_id" : [ 257793, 257798, 257844 ]
}
}
}
}
}
},
"explain" : false,
"fields" : "a_id"
}
This finds all docs whose a_id is 257793, 257798, or 257844, and the results come back as 257844, 257798, 257793. So far so good.
Then I found that whatever order the term values are in, the returned docs always come back in the same a_id order. That is, even if I run
"terms" : {
"a_id" : [257798, 257844, 257793 ]
}
the result docs are again in the order 257844, 257798, 257793.
So I am curious about the mechanism behind Elasticsearch filtering. Can anyone give me a hint?
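By default, ES returns hits in descending order of _score. A query made only of filters gives every hit the same score, so the order you see does not depend on the order of the values in terms. To control the order, provide a sort option that states which field to sort on and in which direction. For example, to sort on a date field: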
{
  "sort": { "date": { "order": "desc" } },
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
You can get more information:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_sorting.html
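A sort clause gives a stable order, but it cannot make the hits follow the order of the values in the terms list. If that is what is needed, one option is to reorder the hits on the client. The helper below is a hypothetical sketch, not an Elasticsearch API. The getId accessor is supplied by the caller because the layout of a hit depends on the Elasticsearch version and on the fields requested, and the sample hits are simplified stand-ins.

```javascript
// Reorder hits so that they follow the order of the requested ids.
// Hits whose id is not in the list are dropped.
function orderHitsByIds(hits, ids, getId) {
  // Map each id to its position in the requested list.
  const position = new Map(ids.map((id, index) => [id, index]));
  return hits
    .filter((hit) => position.has(getId(hit)))
    .sort((a, b) => position.get(getId(a)) - position.get(getId(b)));
}

// Simplified stand-ins for the hits returned by the search.
const requestedIds = [257798, 257844, 257793];
const sampleHits = [{ a_id: 257844 }, { a_id: 257798 }, { a_id: 257793 }];

const orderedHits = orderHitsByIds(sampleHits, requestedIds, (hit) => hit.a_id);
console.log(orderedHits.map((hit) => hit.a_id)); // [ 257798, 257844, 257793 ]
```

Because filter returns a new array, the original hits array is left untouched.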

ElasticSearch for Attribute(Key) value data set

I am using Elasticsearch with Haystack and Django and want to search the following structure:
[
  {
    "title": "book1",
    "category": ["Cat_1", "Cat_2"],
    "key_values": [
      { "key_name": "key_1", "value": "sample_value_1" },
      { "key_name": "key_2", "value": "sample_value_12" }
    ]
  },
  {
    "title": "book2",
    "category": ["Cat_3", "Cat_2"],
    "key_values": [
      { "key_name": "key_1", "value": "sample_value_1" },
      { "key_name": "key_3", "value": "sample_value_6" },
      { "key_name": "key_4", "value": "sample_value_5" }
    ]
  }
]
Right now I have set up an index model using Haystack with a "text" field that puts all the data together and runs a full-text search. In my opinion this is not a well-designed search, because it ignores the structure of my data set, which makes it rather odd.
As an example if for an object I have a key-value
{
"key_name": "key_1",
"value": "sample_value_1"
}
and for another object I have
{
"key_name": "key_2",
"value": "sample_value_1"
}
and a query like "Key_1 sample_value_1" comes in, I get a thoroughly mixed result: objects that have these words anywhere in their fields, rather than objects matched according to their structure.
P.S. I am totally new to Elasticsearch, and really to search technologies and their challenges in general. I have searched the web and SO but didn't find anything satisfying. Please let me know if there is something wrong with my thinking and my expectations of these search engines, or if there is a duplicate question on SO, and also if there is a better approach to designing a database for this kind of search.
Read the ES docs on nested mappings and do something like this:
"book_type" : {
  "properties" : {
    // title, cat mappings
    "key_values" : {
      "type" : "nested",
      "properties": {
        "key_name": {
          "type": "string", "index": "not_analyzed"
        },
        "value": {
          "type": "string"
        }
      }
    }
  }
}
Then query using a nested query, so that both conditions must hold within the same key_values entry:
"nested" : {
"path" : "key_values",
"query" : {
"bool" : {
"must" : [
{
"term" : {"key_values.key_name" : "key_1"}
},
{
"match" : {"key_values.value" : "sample_value_1"}
}
]
}
}
}
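To reuse the nested query above for arbitrary key/value pairs, the request body can be built by a small helper. This is an illustrative sketch: the function name keyValueQuery is made up, and the result is just the JSON body to send to the search endpoint by whatever client is in use.

```javascript
// Build a search body that matches documents having one key_values entry
// with the given key name AND the given value.
function keyValueQuery(keyName, value) {
  return {
    query: {
      nested: {
        path: "key_values",
        query: {
          bool: {
            must: [
              // key_name is not_analyzed, so the term must match exactly.
              { term: { "key_values.key_name": keyName } },
              // value is analyzed, so a full-text match is used.
              { match: { "key_values.value": value } }
            ]
          }
        }
      }
    }
  };
}

const body = keyValueQuery("key_1", "sample_value_1");
console.log(JSON.stringify(body, null, 2));
```

Since key_name is mapped as not_analyzed, the term comparison is case-sensitive: a user typing "Key_1" would not match the stored "key_1" unless the input is normalized first.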

Obtaining Object IDs for Schedule States in Rally

I have set up a "checkbox group" with the five schedule states in our organization's workspace. I would like to query using the Lookback API with the selected schedule states as filters. Since the LBAPI is driven by ObjectIDs, I need to pass in the ID representations of the schedule states, rather than their names. Is there a quick way to get these IDs so I can relate them to the checkbox entries?
Lookback API will accept string-valued ScheduleStates as query arguments. Thus the following query:
{
find: {
_TypeHierarchy: "HierarchicalRequirement",
"ScheduleState": "In-Progress",
__At:"current"
}
}
works correctly for me. If you want or need the OIDs, though, add &fields=true to the end of your REST query URL and you'll notice the following information coming back:
GeneratedQuery: {
{ "fields" : true,
"find" : { "$and" : [ { "_ValidFrom" : { "$lte" : "2013-04-18T20:00:25.751Z" },
"_ValidTo" : { "$gt" : "2013-04-18T20:00:25.751Z" }
} ],
"ScheduleState" : { "$in" : [ 2890498684 ] },
"_TypeHierarchy" : { "$in" : [ -51038,
2890498773,
10487547445
] },
"_ValidFrom" : { "$lte" : "2013-04-18T20:00:25.751Z" }
},
"limit" : 10,
"skip" : 0
}
}
You'll notice the ScheduleState OID here:
"ScheduleState" : { "$in" : [ 2890498684 ] }
So you could run a couple of sample queries on different ScheduleStates and find their corresponding OIDs.