I have a database with an agent that periodically deletes (via a Java agent, using the "removePermanently" method) all documents in a view and re-creates them.
After some months, I've noticed that the database size has increased considerably.
The database information reported by this command
sh database <dbpath>
shows that I have a lot of deleted documents (I suppose they are deletion stubs):
Document Type    Live     Deleted
Documents        1,922    817,378
After compacting the database, 80% of the space was recovered.
Is there a way to programmatically and permanently delete the stubs to avoid this "database explosion"? Or is there a correct way to manage this scenario (deletion and re-creation of documents)?
Don't delete the documents! Re-use them. That's the best answer. Seriously. Take the existing documents, clear the fields and set Form := "Obsolete". Modify the selection formula for all your views by appending & Form != "Obsolete". Create a new hidden view called "Obsolete" with the selection formula Form = "Obsolete", and instead of creating new documents, change your code to go to the Obsolete view, grab an available document and set new field values (including changing the Form field). Only create new documents if there are not enough available in the Obsolete view. Any performance that you lose by doing this, which really should be minimal with the number of documents that you seem to have, will be more than offset by what you gain by avoiding the growth and fragmentation of the NSF file that you are causing by doing all the deletions and creating new documents.
If, however, there's no possible way for you to do that -- maybe some third party tool that is outside of your control is creating the documents -- then it's important to know if the database you are talking about is replicated. If it is replicated, then you must be very careful because purging deletion stubs before all replicas are brought up to date will cause deleted documents to "come back to life" if a replica that has been off-line since before the delete occurred comes back on-line.
If the database is not replicated at all, or is reliably replicated across all replicas quickly, then you can reduce the purge interval. Go to the Replication Settings dialog and find the checkbox labeled "Remove documents not modified in the last __ days". Do not check the box, but enter a small number in the number-of-days field. The purge interval for deletion stubs will be set to 1/3 of this number. So if you set it to 3, the effect will be that stubs are kept for 1 day and then purged, giving you 24 hours to ensure that all replicas are up to date. If you need more, set the interval higher, maintaining the 3x multiple as needed. If a server is down for an extended period of time (longer than your purge interval), then adjust your operations procedures so that you are sure to disable replication of the database to that server before it comes back on-line, so that the replica can be deleted and recreated. Be aware, though, that user replicas pose the same problem, and it's not really possible to control or be aware of user replicas that might go off-line for longer than the purge interval. In any case, remember: do not check the box. To reduce the purge interval for deletion stubs only, just reduce the number.
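The 3x relationship between the number entered in the dialog and the stub purge interval can be written down as a tiny calculation. This is only arithmetic based on the rule stated above, not a Notes API call; the function names are made up for illustration.

```python
def stub_purge_interval_days(remove_after_days):
    """Days a deletion stub is kept, given the number entered in the
    'Remove documents not modified in the last __ days' field."""
    return remove_after_days / 3


def setting_for_stub_lifetime(desired_stub_days):
    """Number to enter in the dialog so stubs live for desired_stub_days."""
    return 3 * desired_stub_days


one_day = stub_purge_interval_days(3)        # stubs kept for 1 day
for_two_days = setting_for_stub_lifetime(2)  # enter 6 to keep stubs 2 days
```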
Apart from this, the only way to programmatically delete deletion stubs requires use of the Notes C API. It's possible to call the required routines from LotusScript, but in my experience once the total number of stubs plus documents gets too high you will likely run into an error and may have to create and deploy a new non-replica copy of the database to get past it. You can find code along with my explanation in the answer to this previous question.
I have to second Richard's recommendation to reuse documents. I recently had a similar project, and started the way you did with deleting everything and importing half a million records every night. Deletion stubs and the growth of the FT index quickly became problems, eating up huge amounts of disk space and slowing performance significantly. I tried to manage the deletion stubs, but I was clearly going against the grain of Domino's architecture.
I read Richard's suggestion here, and adopted that approach. Here's what I did:
1) create 2 views based on form - one for 'active' records, and another for 'inactive' records
2) start the agent by setting autoupdate = false for both views
3) use stampall("form", "inactive") to change all of the active records to inactive
4) manually refresh the 2 views using notesview.refresh()
5) start importing data. for each record, pull a document out of the pool of inactive records (by walking the 'inactive' view)
6) if I run out of inactive records in the pool, create new ones
7) when import is complete, manually refresh the views again
8) use db.createftindex(0, true) to re-create the FT index
The code is really not that complex, and it runs in about the same amount of time as my original approach, if not faster.
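The recycle-instead-of-delete steps can be sketched in a few lines. This is not Notes/Domino code: it is a minimal Python simulation, assuming documents are plain dicts and the 'active' and 'inactive' views are just filters on the Form field. The names `import_records` and `documents` are hypothetical.

```python
def import_records(documents, records):
    """Refresh `documents` in place from `records`, recycling old documents.

    documents: list of dicts with a "Form" field (stand-in for the database)
    records:   list of dicts holding the new field values to import
    Returns the number of brand-new documents that had to be created.
    """
    # Mark every existing document as inactive (the stampall step).
    for doc in documents:
        doc["Form"] = "inactive"

    # The pool of reusable documents (walking the 'inactive' view).
    pool = [doc for doc in documents if doc["Form"] == "inactive"]
    created = 0

    for record in records:
        if pool:
            doc = pool.pop()
            doc.clear()              # drop the stale field values
        else:
            doc = {}                 # pool exhausted: create a new document
            documents.append(doc)
            created += 1
        doc.update(record)
        doc["Form"] = "active"
    return created


documents = [{"Form": "active", "name": "old-%d" % i} for i in range(3)]
created = import_records(documents, [{"name": "a"}, {"name": "b"}])
active = [d for d in documents if d["Form"] == "active"]
inactive = [d for d in documents if d["Form"] == "inactive"]
```

Nothing is ever deleted here, so in the real database no deletion stubs would be produced; leftover documents simply stay in the inactive pool for the next run.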
Thanks Richard!
Also, look at the advanced db properties - several things there that will help optimize the db.
It sounds like you are "refreshing" the contents of the database by periodically deleting all the documents and creating new ones from some other source. Cut that out. If the data are in the Notes database already, leave the document alone. What you're doing is very inefficient.
We have a table with a datetime stamp field recording when each record was created. How can we create a trigger or procedure to delete a record after 30 days?
Is there any advice on how we can schedule this deletion?
Firebird doesn't have a scheduler. You will need to create an application that executes a clean up routine on a schedule yourself. You could do this as part of the normal application, or you could write a small application specifically for this purpose, and execute it with the scheduler of your OS (e.g. Windows Scheduled Tasks, or Linux Cron).
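A clean-up routine of this kind can be very small. The sketch below uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; with Firebird you would connect through a Firebird driver instead and the SQL would be much the same. The table name `records` and column `created_at` are assumptions, not taken from the question.

```python
import sqlite3
from datetime import datetime, timedelta


def delete_expired(conn, now, max_age_days=30):
    """Delete rows created more than max_age_days before `now`.

    Timestamps are stored as ISO text, so string comparison orders them
    correctly. Returns the number of rows removed.
    """
    cutoff = (now - timedelta(days=max_age_days)).isoformat(sep=" ")
    cursor = conn.execute("DELETE FROM records WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount


# Demo data: rows that are 1, 29, 31 and 45 days old.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, created_at TEXT)")
now = datetime(2024, 3, 1, 12, 0, 0)
conn.executemany(
    "INSERT INTO records (created_at) VALUES (?)",
    [((now - timedelta(days=age)).isoformat(sep=" "),) for age in (1, 29, 31, 45)],
)
removed = delete_expired(conn, now)
remaining = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

A script like this would then be run once a day by Windows Scheduled Tasks or cron.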
Firebird 2.1 introduced global triggers fired on database connection/disconnection and on transaction starting/ending.
https://www.firebirdsql.org/file/documentation/chunk/en/refdocs/fblangref30/fblangref30-ddl-trigger.html
While it is not exactly what you need it can be used to achieve similar results. Whether that similarity is good enough for you or not is for you to evaluate.
You ask how "to delete a record after 30 days".
The question is what exactly you mean by that. Would it still be okay if the row were deleted after 31 days, or after 40 days?
In our case, for a client-server office application, there was no time pressure and additionally there was no safe deletion as long as the programs had "documents" open.
We had to delete some global data, and while there were some marks in the database showing which documents used that data and which documents were currently open, they were not very reliable. This also meant that the existing method of immediate deletion could occasionally lead to application crashes.
So we reformulated a problem similar to yours the following way:
We need rows not to be deleted immediately but to stay pending for deletion for 30 days or more. Those records would be rendered in the application in a special way, as a warning to users and also providing a way for them to cancel the deletion if they changed their mind (or if other users had different ideas).
The deletion would happen, in logical terms, "when there is no connected application". In technical terms it could mean either "when the first application is connecting, but before it has started actual (business-related) work" or "when the last application is disconnecting, after it has finished doing actual work". We settled on the latter and used an ON DISCONNECT database trigger.
We had not only the main business-domain application, but also a number of technical helper utilities. From the Firebird point of view there is no difference between them. So we had to modify the "login sequence" in our main application: right after a successful login it registered its own CURRENT_CONNECTION in a special table. This is potentially slightly fragile.
The ON DISCONNECT trigger performed three actions:
It checked whether CURRENT_CONNECTION was in the table, and if it was, it called a special stored procedure, SP_LOCAL_CLEANUP.
It removed CURRENT_CONNECTION from the table (a BEFORE DELETE trigger could have called the procedure instead, but we decided our helper utilities should have a way to hook in if they needed to, so the call was put in the ON DISCONNECT trigger).
It checked whether that table (the list of known connected business-domain applications) had become empty, and if it had, it called another special stored procedure, SP_GLOBAL_CLEANUP.
Those stored procedures were "umbrella" procedures, consisting solely of calls to other procedures, which did the actual work of checking for inconsistencies and fixing them: for example, removing "this document is opened for editing" marks if an application (or computer, or network) had crashed without releasing the lock in the normal way. This way we could add or remove functionality without breaking Firebird object dependency chains.
In particular, one of the global sub-procedures looked at the "deletion pending" records and deleted those that had been "kept in the recycle bin" for more than 30 days. Actually, the records just had a column with the planned deletion date, which could be more or less than 30 days away, but that is a technicality.
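The control flow of that ON DISCONNECT trigger can be illustrated outside the database. The following is a plain Python simulation, not Firebird PSQL: the class, the callbacks and the sample rows are all hypothetical, and it assumes that connections which never registered (the helper utilities) skip all three actions.

```python
from datetime import date


class ConnectionRegistry:
    """Stand-in for the table of known business-application connections."""

    def __init__(self, local_cleanup, global_cleanup):
        self.connected = set()
        self.local_cleanup = local_cleanup
        self.global_cleanup = global_cleanup

    def register(self, connection_id):
        # Done by the main application right after a successful login.
        self.connected.add(connection_id)

    def on_disconnect(self, connection_id):
        # Assumption: unregistered helper utilities trigger nothing.
        if connection_id not in self.connected:
            return
        self.local_cleanup(connection_id)      # action 1
        self.connected.remove(connection_id)   # action 2
        if not self.connected:                 # action 3
            self.global_cleanup()


today = date(2024, 2, 1)
pending = [
    {"id": 1, "delete_on": date(2024, 1, 10)},  # due
    {"id": 2, "delete_on": date(2024, 3, 1)},   # not due yet
]
log = []


def local_cleanup(connection_id):
    log.append(("local", connection_id))


def global_cleanup():
    # Remove rows whose planned deletion date has passed.
    pending[:] = [row for row in pending if row["delete_on"] > today]
    log.append(("global", None))


registry = ConnectionRegistry(local_cleanup, global_cleanup)
registry.register(1)
registry.register(2)
registry.on_disconnect(99)  # helper utility, never registered
registry.on_disconnect(1)
registry.on_disconnect(2)   # last application leaves: global cleanup runs
```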
This meant that the actual deletion happened "some time after 30 days", and only when all main apps were shut down. When those apps were later run again, they re-read the global dictionary tables in their updated, pruned state. The applications were never again in an inconsistent state, using records that had been removed from the database.
Potential fragile point: if users do not shut down the application at night but just go home, there may never be a "last application disconnected" state. This, however, would be a maintenance nightmare for their network admins (Windows updates and reboots, antivirus updates and reboots), so we documented the recommendation that those admins make sure that at least once a week all users are out of the database at the same time.
Potential fragile point: if the Firebird server crashes (not the applications, but the server engine), the "known connections" table would hold stale values. We did not consider this a practical problem, as CURRENT_CONNECTION would then restart at 1 and count upward, eventually cleaning the table. But we also added a function to a helper app that uses SYSDBA and the monitoring tables to clear the table of non-existent connections.
You can re-use this framework if you do not have time pressure and you are okay if the actual deletion is deferred for a few days.
You can also use ON TRANSACTION START trigger instead, to shorten the delay to mere minutes, but I expect this would slow down your application badly, so would suggest against it.
Basically, does anyone know how to ask for delta changes that happened after a certain time? I am saving all the changes that users make to Planner objects to the database, but I know the delta changes for hundreds of plans will eventually grow insanely large. The endpoints are GET /me/planner/all/delta and GET /users/{id}/planner/all/delta. Does anyone know how to filter the delta response by a given time? My plan is to query the delta after a certain time.
It could be for any object that delta works with. Right now I can retrieve all the delta changes, but I do not see how I can ask for changes that happened after a certain time.
Delta only works with the tokens presented in the links; it is not time-based (we do not store it based on time internally). It is also best-effort, which means that at some point the delta changes will be cleared and clients will be forced to read the objects again to be in sync. So even if there were a time-based query, there would be no guarantee that you could access older data.
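Working with the link tokens rather than timestamps looks roughly like this. The sketch follows the `@odata.nextLink` and `@odata.deltaLink` properties that Microsoft Graph delta responses carry; the HTTP call itself is injected as a `fetch` function (an assumption, so that authentication stays out of the example), and the URLs and ids below are canned, hypothetical data.

```python
def collect_delta(fetch, start_url):
    """Follow nextLink pages until a deltaLink appears.

    fetch: callable taking a URL and returning the decoded JSON page (dict).
    Returns (changes, delta_link). Store delta_link and pass it as
    start_url on the next sync to receive only the changes since this call.
    """
    changes = []
    url = start_url
    while True:
        page = fetch(url)
        changes.extend(page.get("value", []))
        next_link = page.get("@odata.nextLink")
        if next_link is None:
            return changes, page.get("@odata.deltaLink")
        url = next_link


# Canned pages standing in for real HTTP responses.
PAGES = {
    "https://example.invalid/delta": {
        "value": [{"id": "task-1"}],
        "@odata.nextLink": "https://example.invalid/delta?page=2",
    },
    "https://example.invalid/delta?page=2": {
        "value": [{"id": "task-2"}],
        "@odata.deltaLink": "https://example.invalid/delta?token=abc",
    },
}

changes, delta_link = collect_delta(PAGES.__getitem__, "https://example.invalid/delta")
```

If you need "changes since time T", one option is to record your own timestamp next to each stored delta link when you save it.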
What is your scenario? Some kind of history tracking or auditing?
As far as I know, nope. I have to cycle through all Planner plans and the tasks in them to get the details. I am currently saving the Planner task details to SharePoint, and instead of updating them I am just deleting all old records and recreating them.
That makes sense. I was saving the deltas so that in future I could tell which user modified which Planner objects, since Microsoft has not implemented an audit trail for Planner objects yet. Storing the deltaLink was just for my possible future rollback processes.
I realized the deltaLink does not expire; it just uses the delta token to find the changes made since the time the delta was queried. Basically, I am requesting that Microsoft provide some kind of audit trail for Planner object changes (at least who changed what and when) so we can query those activities and hold the individuals responsible for unwanted changes accountable, for instance changing the due date of Planner tasks.
I'm building a UI in Excel whose goal is to show "live" information on Orders and Order Status to three users. I'll name them DataUser, DashboardOne, and DashboardTwo for example's sake.
The process is that the DataUser fills in the Orders data, and that data is used to populate information on two dashboards. The dashboards are updated live with changes from the DataUser (order increases/decreases) and with changes in order status between DashboardOne and DashboardTwo. For the live updates I'm thinking of using Application.OnTime calls to refresh the views/dashboards. The two dashboards will be active about 8 hours a day.
Where I'm struggling is how/where to store the data. I've thought about a couple of options, but I don't know the implications of one over the other, especially considering that I intend the dashboards to refresh every 30 seconds with Application.OnTime, which could prove expensive.
The options I thought about were:
A Master Workbook that would create separate Workbooks for DashboardOne and DashboardTwo and act as the database and main UI for DataUser.
Three separate workbooks that would all refer to one DataWorkbook or another flat data file (perhaps an XML or JSON file).
Using an actual database for the data, although this would bring other implications (don't currently have one).
I'm not considering a shared workbook as I've tried something similar in the past (and this time ^^, early steps) and it went rather poorly, nightmare to sync and poor data integrity.
In short:
Which would be the best data storage strategy for Excel that wouldn't jeopardise the integrity of the data nor be so expensive as to interfere with the uptime of the rest of the code? Are there better options that I should be considering?
There are quite a number of alternatives, depending on the time you want to invest and the tools at hand. I'll give you a couple of options here.
But first, the basic assumptions:
The number of data items that you need to share (this being a dashboard) is a few tens (let's say, fewer than 100),
You have at least basic programming skills,
From your description, you have one client with READ-WRITE capabilities while there are two clients with READ-ONLY capability.
OPTION 1:
You can have Excel save the data in CSV format (the amount of data is very small, so it would take a small fraction of a second to save and to read).
The two clients would then open the file in read-only mode, load the data and update the display. You would need to include exception handling at both types of client:
At the one writing, handle the condition of error when it attempts to write at the same time one of the clients attempts to read,
At the two reading, handle the condition of error when attempting to open the file (for read only) while the other process is writing.
Since the write and read operations are going to take a very, VERY short time (as stated, a small fraction of a second), these conditions will be very rare. Additionally, since both dashboard clients would open the file read-only, they will not disturb each other if they make their attempts at the same moment.
If you wish to drastically reduce the chances of collision, you may set the timers (of the update process on one hand and of the reading processes on the other) to prime numbers of seconds. For instance, the timer of the updating process would fire every 11 seconds while that of the reading processes would fire every 7 seconds.
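Here is a minimal sketch of Option 1, written in Python rather than VBA just to show the shape of the logic: both the writer and the readers retry briefly when the file is busy, and a one-line calculation shows how often 11-second and 7-second timers fire at the same instant. The function names and the file name are hypothetical, and whether a concurrent open actually raises an error depends on the operating system and how the file is opened.

```python
import csv
import math
import os
import tempfile
import time


def with_retry(action, attempts=5, delay=0.1):
    """Run action(); on OSError (e.g. file busy) wait and try again."""
    for attempt in range(attempts):
        try:
            return action()
        except OSError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


def write_rows(path, rows):
    # Used by the single read-write client.
    def action():
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerows(rows)
    with_retry(action)


def read_rows(path):
    # Used by the two read-only dashboard clients.
    def action():
        with open(path, "r", newline="") as handle:
            return list(csv.reader(handle))
    return with_retry(action)


def seconds_between_coincidences(writer_period, reader_period):
    # Timers started together fire simultaneously every lcm(a, b) seconds.
    return writer_period * reader_period // math.gcd(writer_period, reader_period)


path = os.path.join(tempfile.mkdtemp(), "orders.csv")
write_rows(path, [["order", "status"], ["1001", "open"]])
rows = read_rows(path)
coincide = seconds_between_coincidences(11, 7)  # 77 seconds
```

With periods of 10 and 30 seconds, by comparison, every reader tick at a multiple of 30 seconds would line up with a writer tick.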
OPTION 2:
Establish a TCP/IP channel between the processes, where the main process (meaning the one that would have WRITE privilege) would send a triggering message to the other two requesting to start an update whenever a new version of the data had been saved. Upon reception of the trigger, both READ-ONLY processes would approach the file and fetch the data.
In this case, the chances of collision would be close to zero.
How are people coping with changes to redis object schemas - adding or removing properties from objects?
Sharing from my own experience (one year old project with thousands of user requests per second).
Usually, there were three scenarios for me:
1. Add new information to existing structures (like an "email" field to a user)
2. Remove or change existing values in existing structures (like changing the format of some field)
3. Drop stuff from the database
For 1, I keep to a simple strategy: degrade gracefully, e.g. if a user doesn't have an email record, treat it as an empty email. It has worked every time.
For 2 and 3, it depends on whether the data can be changed/calculated/fixed before releasing or only after. I run a job on the database that does all the work for me; for a few million keys it takes considerable time (minutes). If that job can only be run after I release the new code, then degrading gracefully helps a lot: I simply release and then run the job.
PS: If you affect a lot of keys in redis, then it is very important to use pipelining (http://redis.io/topics/pipelining). It saves a lot of time.
Fetch all the affected keys or records (i.e. those you want to fix in any way) in a pipeline.
Do whatever you want with them. If possible, queue the write operations into a pipeline too.
Send the queued operations to redis.
It is also very important for you to make indexes of your structures. I keep sets with ids. Then I simply iterate over SMEMBERS(set_with_ids).
It is much, much better than iterating over KEYS command.
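Put together, graceful degradation plus a pipelined fix-up job over an index set might look like the sketch below. The method names (`smembers`, `pipeline`, `hget`, `hset`, `execute`) are the ones redis-py provides, but the key layout (`user:ids`, `user:<id>`), the function names and the tiny in-memory stand-in classes are assumptions made so the example runs without a server. With a real server you would pass `redis.Redis(decode_responses=True)` instead of `FakeRedis()`.

```python
def user_email(user):
    # Graceful degradation: old records without "email" read as empty.
    return user.get("email", "")


def lowercase_emails(client, index_key="user:ids"):
    """Normalise the email field of every indexed user via two pipelines."""
    ids = sorted(client.smembers(index_key))  # iterate the index, not KEYS

    read_pipe = client.pipeline()
    for user_id in ids:
        read_pipe.hget("user:%s" % user_id, "email")
    emails = read_pipe.execute()              # one round trip for all reads

    write_pipe = client.pipeline()
    changed = 0
    for user_id, email in zip(ids, emails):
        if email is not None and email != email.lower():
            write_pipe.hset("user:%s" % user_id, "email", email.lower())
            changed += 1
    write_pipe.execute()                      # one round trip for all writes
    return changed


class FakePipeline:
    """In-memory stand-in: queues commands and runs them on execute()."""

    def __init__(self, store):
        self.store = store
        self.queue = []

    def hget(self, name, key):
        self.queue.append(lambda: self.store.get(name, {}).get(key))

    def hset(self, name, key, value):
        self.queue.append(
            lambda: self.store.setdefault(name, {}).__setitem__(key, value)
        )

    def execute(self):
        results = [command() for command in self.queue]
        self.queue = []
        return results


class FakeRedis:
    def __init__(self):
        self.store = {}  # hashes
        self.sets = {}   # sets

    def smembers(self, name):
        return set(self.sets.get(name, set()))

    def pipeline(self):
        return FakePipeline(self.store)


client = FakeRedis()
client.sets["user:ids"] = {"1", "2", "3"}
client.store["user:1"] = {"email": "Alice@Example.com"}
client.store["user:2"] = {"email": "bob@example.com"}
client.store["user:3"] = {"name": "carol"}  # predates the email field
changed = lowercase_emails(client)
```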
For extremely simple versioning, you could use different database numbers. This could be quite limiting in cases where almost everything is the same between two versions but it's also a very clean way to do it if it will work for you.
Say there is a database with 100+ tables and a major feature is added, which requires 20 of existing tables to be modified and 30 more added. The changes were done over a long time (6 months) by multiple developers on the development database. Let's assume the changes do not make any existing production data invalid (e.g. there are default values/nulls allowed on added columns, there are no new relations or constraints that could not be fulfilled).
What is the easiest way to publish these changes in schema to the production database? Preferably, without shutting the database down for an extended amount of time.
Write a T-SQL script that performs the needed changes. Test it on a copy of your production database (restore from a recent backup to get the copy). Fix the inevitable mistakes that the test will discover. Repeat until script works perfectly.
Then, when it's time for the actual migration: lock the DB so only admins can log in. Take a backup. Run the script. Verify results. Put DB back online.
The longest part will be the backup, but you'd be crazy not to do it. You should know how long backups take, the overall process won't take much longer than that, so that's how long your downtime will need to be. The middle of the night works well for most businesses.
There is no generic answer on how to make 'changes' without downtime. The answer really varies from case to case, depending on exactly what the changes are. Some changes have no impact on downtime (e.g. adding new tables), some changes have minimal impact (e.g. adding columns to existing tables with no data size change, like a new nullable column that does not increase the null bitmap size) and other changes will wreak havoc on downtime (any operation that changes the data size will force an index rebuild and lock the table for the duration). Some changes are impossible to apply without *significant* downtime. I know of cases where the changes were applied in parallel: a copy of the database is created, replication is set up to keep it current, then the copy is changed and kept in sync, and finally operations are moved to the modified copy, which becomes the master database. There is a presentation from PASS 2009 given by Michelle Ufford that mentions how GoDaddy went through such a change that lasted weeks.
But, at a lesser scale, you must apply the changes through a well tested script, and measure the impact on the test evaluation.
But the real question is: are these going to be the last changes you ever make to the schema? Have you finally discovered the perfect schema for the application, so that the production database will never change? Congratulations, once you pull this off, you can rest. But realistically, you will face the very same problem in 6 months. The real problem is your development process, with developers making changes from SSMS or from VS Server Explorer straight into the database. Your development process must make a conscious effort to adopt a schema change strategy based on versioning and T-SQL scripts, like the one described in Version Control and your Database.
Use a tool to create a diff script and run it during a maintenance window. I use RedGate SQL Compare for this and have been very happy with it.
I've been using dbdeploy successfully for a couple of years now. It allows your devs to create small SQL change deltas which can then be applied against your database. The changes are tracked by a changelog table within the database, so that it knows what still needs to be applied.
http://dbdeploy.com/
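The changelog-table idea is simple enough to illustrate in a few lines. This is not dbdeploy's own implementation, only a sketch in the same spirit, using Python's built-in sqlite3 as a stand-in database; the table name `changelog`, its column and the sample deltas are assumptions.

```python
import sqlite3

# Numbered change deltas, normally kept as small script files in version control.
DELTAS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
]


def apply_pending(conn, deltas):
    """Apply every delta not yet recorded in the changelog table.

    Returns the list of change numbers applied by this call.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS changelog (change_number INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT change_number FROM changelog")}
    newly_applied = []
    for number, sql in sorted(deltas):
        if number in applied:
            continue  # already run against this database
        conn.execute(sql)
        conn.execute("INSERT INTO changelog (change_number) VALUES (?)", (number,))
        conn.commit()
        newly_applied.append(number)
    return newly_applied


conn = sqlite3.connect(":memory:")
first_run = apply_pending(conn, DELTAS)    # applies 1 and 2
second_run = apply_pending(conn, DELTAS)   # nothing left to do
columns = [row[1] for row in conn.execute("PRAGMA table_info(customer)")]
```

Because the same delta list can be replayed safely, the script that was tested against a restored copy of production is exactly the one run during the maintenance window.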