How to remove collection from model [duplicate] - express

I have a query, which selects documents to be removed. Right now, I remove them manually, like this (using python):
for id in mycoll.find(query, fields={}):
    mycoll.remove(id)
This does not seem to be very efficient. Is there a better way?
EDIT
OK, I owe an apology for forgetting to mention the query details, because it matters. Here is the complete python code:
def reduce_duplicates(mydb, max_group_size):
    # 1. Count the group sizes
    res = mydb.static.map_reduce(jstrMeasureGroupMap, jstrMeasureGroupReduce, 'filter_scratch', full_response=True)
    # 2. For each entry from the filter scratch collection having count > max_group_size
    deleteFindArgs = {'fields': {}, 'sort': [('test_date', ASCENDING)]}
    for entry in mydb.filter_scratch.find({'value': {'$gt': max_group_size}}):
        key = entry['_id']
        group_size = int(entry['value'])
        # 2b. Query the original collection by the entry key, order it by test_date ascending,
        #     limit to the group size minus max_group_size.
        for id in mydb.static.find(key, limit=group_size - max_group_size, **deleteFindArgs):
            mydb.static.remove(id)
    return res['counts']['input']
So, what does it do? It reduces the number of duplicate keys to at most max_group_size per key value, leaving only the newest records. It works like this:
Map-reduce the data to (key, count) pairs.
Iterate over all the pairs with count > max_group_size
Query the data by key, while sorting it ascending by the timestamp (the oldest first) and limiting the result to the count - max_group_size oldest records
Delete each and every found record.
As you can see, this accomplishes the task of reducing the duplicates to at most the N newest records. So, the last two steps are foreach-found-remove, and this is the important detail of my question that changes everything and that I should have been more specific about - sorry.
Now, about the collection remove command. It does accept a query, but mine includes sorting and limiting. Can I do it with remove? Well, I have tried:
mydb.static.find(key, limit = group_size - max_group_size, sort=[('test_date', ASCENDING)])
This attempt fails miserably. Moreover, it seems to screw up mongo. Observe:
C:\dev\poc\SDR>python FilterOoklaData.py
bad offset:0 accessing file: /data/db/ookla.0 - consider repairing database
Needless to say, the foreach-found-remove approach works and yields the expected results.
Now, I hope I have provided enough context and (hopefully) have restored my lost honour.

You can use a query to remove all matching documents
var query = {name: 'John'};
db.collection.remove(query);
Be wary, though: if the number of matching documents is high, your database might become less responsive. It is often advised to delete documents in smaller chunks.
Let's say you have 100k documents to delete from a collection. It is better to execute 100 queries that delete 1k documents each than one query that deletes all 100k documents.
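Since neither remove nor deleteMany accepts a limit (remove only has a justOne flag), one way to delete in chunks is to page over matching _id values. A minimal pymongo sketch, with hypothetical connection, collection, and filter names:
from pymongo import MongoClient

# Hypothetical names, for illustration only.
coll = MongoClient()["mydb"]["mycoll"]
query = {"name": "John"}

BATCH = 1000
while True:
    # Grab a batch of matching _ids, then delete exactly those documents.
    ids = [doc["_id"] for doc in coll.find(query, {"_id": 1}).limit(BATCH)]
    if not ids:
        break
    coll.delete_many({"_id": {"$in": ids}})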

You can remove it directly using the MongoDB scripting language:
db.mycoll.remove({_id:'your_id_here'});

Would deleteMany() be more efficient? I've recently found that remove() is quite slow for 6m documents in a 100m document collection. Documentation: https://docs.mongodb.com/manual/reference/method/db.collection.deleteMany
db.collection.deleteMany(
    <filter>,
    {
        writeConcern: <document>,
        collation: <document>
    }
)
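In pymongo the counterpart is delete_many (pymongo 3.x and later); a minimal sketch with hypothetical database, collection, and filter:
from datetime import datetime
from pymongo import MongoClient

# Hypothetical names and filter, for illustration only.
mydb = MongoClient()["mydb"]
cutoff = datetime(2019, 1, 1)
result = mydb.static.delete_many({"test_date": {"$lt": cutoff}})
print("deleted:", result.deleted_count)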

I would recommend paging if you have a large number of records.
First: Get the count of data you want to delete:
-------------------------- COUNT --------------------------
var query = {"FEILD": "XYZ", 'DATE': {$lt: new ISODate("2019-11-10")}};
db.COL.aggregate([
    {$match: query},
    {$count: "all"}
])
Second: Start deleting chunk by chunk:
-------------------------- DELETE --------------------------
var query = {"FEILD": "XYZ", 'date': {$lt: new ISODate("2019-11-10")}};
var cursor = db.COL.aggregate([
    {$match: query},
    {$limit: 5}
])
cursor.forEach(function (doc){
    db.COL.remove({"_id": doc._id});
});
and this should be faster:
var query = {"FEILD": "XYZ", 'date': {$lt: new ISODate("2019-11-10")}};
var ids = db.COL.find(query, {_id: 1}).limit(5);
db.COL.deleteMany({"_id": {"$in": ids.map(r => r._id)}});

Run this query in the mongo shell:
db.users.remove( {"_id": ObjectId("5a5f1c472ce1070e11fde4af")});
If you are using Node.js, write this code:
User.remove({ _id: req.body.id }, function(err){...});
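And since the original question uses Python, a hypothetical pymongo equivalent for deleting a single document by its _id:
from bson import ObjectId
from pymongo import MongoClient

# Hypothetical connection; delete_one removes at most one matching document.
db = MongoClient()["mydb"]
db.users.delete_one({"_id": ObjectId("5a5f1c472ce1070e11fde4af")})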

Related

Results DataSet from DynamoDB Query using GSI is not returning correct results

I have a DynamoDB table where I am currently storing all the events that happen in my system with respect to every product. On the main table there is a primary key with a Hash Key that is a combination of productid, eventtype and eventcategory, and Creation Time as the Sort Key. The table was created and data was added into it.
Later I added a new GSI on the table with the Hash Key being SecondaryHash (which is just the combination of eventcategory and eventtype, excluding productid) and CreationTime as the Sort Key. This was added so that I can query for multiple products at once.
The GSI seems to work fine; however, only later did I realize that the data being returned is incorrect.
Here is the scenario. (I am running all these queries against the newly created index)
I was querying for products within the last 30 days and the query returned 312 records. However, when I run the same query for the last 90 days, it returns only 128 records (which is wrong; it should be at least equal to or greater than the 30-day count).
I already have pagination logic embedded in my code, so that LastEvaluatedKey is checked every time in order to loop and fetch the next set of records, and after the loop all the results are combined.
Not sure if I am missing something.
Any suggestions would be appreciated.
var limitPtr *int64
if limit > 0 {
    limit64 := int64(limit)
    limitPtr = &limit64
}
input := dynamodb.QueryInput{
    ExpressionAttributeNames: map[string]*string{
        "#sch": aws.String("SecondaryHash"),
        "#pkr": aws.String("CreationTime"),
    },
    ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
        ":sch": {
            S: aws.String(eventHash),
        },
        ":pkr1": {
            N: aws.String(strconv.FormatInt(startTime, 10)),
        },
        ":pkr2": {
            N: aws.String(strconv.FormatInt(endTime, 10)),
        },
    },
    KeyConditionExpression: aws.String("#sch = :sch AND #pkr BETWEEN :pkr1 AND :pkr2"),
    ScanIndexForward:       &scanForward,
    Limit:                  limitPtr,
    TableName:              aws.String(ddbTableName),
    IndexName:              aws.String(ddbIndexName),
}
You reached the maximum amount of data a single Query call will evaluate (not necessarily the number of matching items); the limit is 1 MB of data read per request.
The response will contain a LastEvaluatedKey parameter; it is the key of the last evaluated item. You have to perform a new query with an extra ExclusiveStartKey parameter (ExclusiveStartKey should be set to LastEvaluatedKey's value).
When LastEvaluatedKey is empty, you have reached the end of the result set.
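For illustration, a minimal version of that pagination loop in Python with boto3 (the table name, index name, and key values are placeholders; the same pattern applies to the Go SDK used in the question):
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table/index names and key values, for illustration only.
table = boto3.resource("dynamodb").Table("events")
key_cond = Key("SecondaryHash").eq("category#type") & Key("CreationTime").between(1573000000, 1575000000)

items = []
kwargs = {"IndexName": "SecondaryHash-CreationTime-index", "KeyConditionExpression": key_cond}
while True:
    resp = table.query(**kwargs)
    items.extend(resp["Items"])
    # Keep querying until DynamoDB stops returning a LastEvaluatedKey.
    last_key = resp.get("LastEvaluatedKey")
    if not last_key:
        break
    kwargs["ExclusiveStartKey"] = last_key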

Maintaining auto ranking as a column in MongoDB

I am using MongoDB as my database.
I have data which contains rank and name as columns. Now a new row can be inserted with a rank different from the ranks already existing, or with the same rank.
If it is the same, then the ranks of the other rows must be adjusted.
Rows having a rank equal to or lesser than the one being inserted (i.e. a numerically equal or greater rank value) must have their rank incremented by one, while higher-ranked rows can remain as they are.
The feature is something like a numbered list in MS Word-type applications, where inserting a row in between adjusts the numbering of the rows below it.
Rank 1 is the highest rank.
For example, there are 3 rows:
Name Rank
A 1
B 2
C 3
Now I want to insert a row with D as the name and 2 as the rank. So after the row insert, the DB should look like below:
Name Rank
A 1
B 3
C 4
D 2
Probably using database triggers I can achieve this by updating the other rows.
I have a couple of questions:
(a) Is there any better way than using a database trigger for this kind of scenario? Updating all the rows might be a time-consuming job.
(b) Does MongoDB support database triggers natively?
Best Regards,
Saurav
No, MongoDB does not provide triggers (yet). Also, I don't think a trigger is really a great way to achieve this.
So I would just like to throw out some ideas and see if they make sense.
Approach 1
Maybe instead of disturbing that many documents, you can create a collection with only one document (let's call the collection ranking). In that document, have an array field called ranks. Since it's an array, it already maintains a sequence.
{
    _id : "RANK",
    "ranks" : ["A","B","C"]
}
Now if you want to add D to this ranking at the 2nd position:
db.ranking.update({_id:"RANK"},{$push : {"ranks":{$each : ["D"],$position:1}}});
it would add D at index 1, which is the 2nd position considering the index starts at 0.
{
    _id : "RANK",
    "ranks" : ["A","D","B","C"]
}
But there is a catch: what if you want to move C from 4th position to 1st? You need to remove it from the end and put it at the beginning. I am sure both operations can't be achieved in a single update (I didn't dig into the options much), so we can run two queries:
db.ranking.update({_id:"RANK"},{$pull : {"ranks": "C"}});
db.ranking.update({_id:"RANK"},{$push : {"ranks":{$each : ["C"],$position:0}}});
Then it would look like:
{
    _id : "RANK",
    "ranks" : ["C","A","D","B"]
}
maintaining the rest of the sequence.
Now you would probably want to store ids instead of A, B, C, etc. One document can be 16MB, so this ranks array can store more than 1.3 million ids, if each id is a 12-byte MongoDB ObjectId. If that is not enough, we still have the option of follow-up document(s) with further ranking.
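(Rough arithmetic behind that figure: 16 * 1024 * 1024 bytes / 12 bytes per ObjectId ≈ 1.4 million entries, ignoring BSON per-element overhead, which lowers the practical count somewhat.)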
Approach 2
You can also, instead of having rank as a number, just have two fields like followedBy and precededBy.
So your user documents would look like:
{
    _id: "A",
    "followedBy": "B"
}
{
    _id: "B",
    "followedBy": "C",
    "precededBy": "A"
}
{
    _id: "C",
    "precededBy": "B"
}
If you want to add D at the second position, then you need to change the current 2nd-position document and insert the new one, so it would be a change to only two documents:
{
    _id: "A",
    "followedBy": "B"
}
{
    _id: "B",
    "followedBy": "C",
    "precededBy": "D" // changed from A to D
}
{
    _id: "C",
    "precededBy": "B"
}
{
    _id: "D",
    "followedBy": "B",
    "precededBy": "A"
}
The downside of this approach is that you cannot sort by ranking in a query unless you pull all of these into the application and build a linked-list sort of structure.
This approach just preserves the ranking with minimal DB changes.
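A hypothetical pymongo sketch of that two-document change (the collection name is a placeholder):
from pymongo import MongoClient

# Hypothetical collection; mirrors the two changes described above.
users = MongoClient()["mydb"]["users"]

# 1. Point the current 2nd element (B) back at the new document.
users.update_one({"_id": "B"}, {"$set": {"precededBy": "D"}})
# 2. Insert D between A and B.
users.insert_one({"_id": "D", "followedBy": "B", "precededBy": "A"})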

Returning the first X records in a postgresql query with a unique field

OK, so I'm having a bit of a learning moment here, and after figuring out A way to get this to work, I'm curious if anyone with a bit more Postgres experience could help me figure out a way to do this without doing a whole lotta behind-the-scenes Rails stuff (or doing a single query for each item I'm trying to get)... now for an explanation:
Say I have 1000 records, we'll call them "Instances", in the database that have these fields:
id
user_id
other_id
I want to create a method that I can call that pulls in 10 instances that all have a unique other_id field, in plain English (I realize this won't work :) ):
Select * from instances where user_id = 3 and other_id is unique limit 10
So instead of pulling in an array of 10 instances where user_id is 3 and possibly getting multiple instances where other_id is 5, I want to be able to run a map function over those 10 instances' other_ids and get back something like [1,2,3,4,5,6,7,8,9,10].
In theory, I can probably do one of two things currently, though I'm trying to avoid them:
Store an array of ids and do individual calls, making sure the next call says "not in this array". The problem here is that I'm doing 10 individual DB queries.
Pull in a large chunk of, say, 50 instances and sort through them in Ruby-land to find 10 unique ones. This wouldn't let me take advantage of any optimizations already done in the database, and I'd also run the risk of doing a query for 50 items that don't contain 10 unique other_ids, and I'd be stuck with those unless I did another query.
Anyways, hoping someone may be able to tell me I'm overlooking an easy option :) I know this is kind of optimizing before it's really needed but this function is going to be run over and over and over again so I figure it's not a waste of time right now.
For the record, I'm using Ruby 1.9.3, Rails 3.2.13, and Postgresql (Heroku)
Thanks!
EDIT: Just wanted to give an example of a function that technically DOES work (and is number 1 above)
def getInstances(limit, user)
  out_of_instances = false
  available = []
  other_ids = [-1] # added -1 to avoid submitting a NULL query
  until other_ids.length == limit || out_of_instances == true
    instance = Instance.where("user_id IS ? AND other_id <> ALL (ARRAY[?])", user.id, other_ids).limit(1)
    if instance != []
      available << instance.first
      other_ids << instance.first.other_id
    else
      out_of_instances = true
    end
  end
end
And you would run:
getInstances(10, current_user)
While this works, it's not ideal because it leads to 10 separate queries every time it's called :(
In a single SQL query, it can be achieved easily with SELECT DISTINCT ON... which is a PostgreSQL-specific feature.
See http://www.postgresql.org/docs/current/static/sql-select.html
SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
With your example:
SELECT DISTINCT ON (other_id) *
FROM instances
WHERE user_id = 3
ORDER BY other_id LIMIT 10

PDO SQL issue displaying multiple rows when using COUNT()

To display my results from PDO, I always use the following PHP code, for example:
$STH = $DBH->prepare("SELECT logo_id, guess_count, guessed, count(id) AS counter FROM guess WHERE user_id=:id");
$STH->bindParam(":id",$loginuser['id']);
$STH->execute();
while($row = $STH->fetch()){
    print_r($row);
}
Now the issue is that I only get one result. I used to use $STH->rowCount() to check the number of rows returned, but this method isn't really advised for SELECT statements because in some databases it doesn't behave correctly. So I used count(id) AS counter, but now I only get one result every time, even though the value of $row['counter'] is larger than one.
What is the correct way to count the number of results in one query?
If you want to check the number of rows that are returned by a query, there are a couple of options.
You could do a ->fetchAll to get an array of all rows. (This isn't advisable for large result sets, i.e. a lot of rows returned by the query; you could add a LIMIT clause to your query to avoid returning more than a certain number of rows. If what you are checking is whether you get more than one row back, you would only need to retrieve two rows.) Checking the length of the array is trivial.
Another option is to run another, separate query to get the count, e.g.
SELECT COUNT(1) AS counter FROM guess WHERE user_id=:id
But, that approach requires another round trip to the database.
And the old standby SQL_CALC_FOUND_ROWS (MySQL-specific) is another option, though that too can have problematic performance with large sets.
You could also just add a loop counter in your existing code:
$i = 0;
while($row = $STH->fetch()){
    $i++;
    print_r($row);
}
print "fetched row count: ".$i;
If what you need is an exact count of the number of rows that satisfy a particular predicate PRIOR to running a query to return the rows, then the separate COUNT(1) query is likely the most suitable approach. Yes, it's extra code in your app; I recommend you preface the code with a comment that indicates the purpose of the code... to get an exact count of rows that satisfy a set of predicates, prior to running a query that will retrieve the rows.
If I had to process the rows anyway, and adding LIMIT 0,100 to the query was acceptable, I would go for the ->fetchAll(), get the count from the length of the array, and process the rows from the array.
You have to use GROUP BY. Your query should look like:
SELECT logo_id, guess_count, guessed, COUNT(id) AS counter
FROM guess
WHERE user_id=:id
GROUP BY logo_id, guess_count, guessed

What's the least expensive way to get the number of rows (data) in a SQLite DB?

When I need to get the number of rows (data) inside a SQLite database, I run the following pseudocode.
cmd = "SELECT Count(*) FROM benchmark"
res = runcommand(cmd)
read res to get result.
But I'm not sure if it's the best way to go. What would be the optimal way to get the number of rows in a SQLite DB? I use Python for accessing SQLite.
Your query is correct but I would add an alias to make it easier to refer to the result:
SELECT COUNT(*) AS cnt FROM benchmark
Regarding this line:
count size of res
You don't want to count the number of rows in the result set - there will always be only one row. Just read the result out from the column cnt of the first row.
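Since the question mentions Python, a minimal sqlite3 sketch (the database file name is a placeholder):
import sqlite3

# Hypothetical database file; COUNT(*) always returns exactly one row.
conn = sqlite3.connect("benchmark.db")
(cnt,) = conn.execute("SELECT COUNT(*) AS cnt FROM benchmark").fetchone()
print(cnt)
conn.close()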