High availability of scheduled agents in a clustered environment - lotus-domino

One of my applications has been deemed business-critical and I'm trying to figure out a way to make my scheduled agents behave correctly in cases of failover. It doesn't need to be automatic, but an admin should be able to 'transfer' the running of the agents from one server to the other.
One solution I was considering is to record the 'active' server in a profile document and have the agents (there are 4: 1 in Java and 3 in LotusScript) check whether they are currently running on the 'active' server, and if not, stop immediately.
Then there is IBM's workaround suggestion: http://www-01.ibm.com/support/docview.wss?uid=swg21098304 of making three agents: one 'core' agent that gets called by a 'main agent' running on the main server, and by a 'failover agent' running on the failover server only when the 'main server' is unavailable.
But that solution seems a bit clunky to me; that's going to be a lot of agents that need to be set up in a fiddly fashion.
What would you recommend?

My logic is similar to yours, but I don't use profile documents (caching is a bad thing for such an important task); I use a central configuration document instead.
Agents are scheduled to run on every server.
First they read the "MasterAppServer" field from the config document. If it names another server, they try to open the database (or names.nsf, depending on what you want) on that master server. If the database can be opened, everything is OK and the agent stops its work. If it cannot be opened, the agent assumes that the other server is down, changes the MasterAppServer field in the config document to its own server, and runs.
Of course I write a log entry in the config document whenever "MasterAppServer" changes.
That works quite well and does not need any admin intervention when one server is down.

Related

ColdFusion 11 to 2018 Upgrade -- Server Locking Up, How to Test Better?

We are currently testing an upgrade from CF11 to CF2018 for my company's intranet. To give you an idea how long this site has been running, our first version of CF was 3.1! It is still using application.cfm, and there is code from 1998, when I started writing this thing. Yes, 21 years -- I'm astonished, too. It is a hodgepodge of all kinds of older frameworks, too, including Fusebox.
Anyway, we're running a Win 2012 VM connected to a SQL 2016 farm. Everything looked OK initially, but in the week I've been testing, the server has slowed down once (a page that usually takes 100 ms with no DB involvement took more than 5 seconds to run), and another time, the server came to a grinding halt. The only way I could restart the CF App service was by connecting to the server from another server via Services, because doing it via Remote Desktop was so slow.
Now keep in mind -- it's just me testing. This is a site that doesn't have a ton of users, but still, having 5 concurrent connections is normal and there are upwards of 200-400 users hitting this thing every day.
I have FusionReactor running on this thing now, so the next time a lockup happens, I will be able to take a closer look, but what do you think is the best way I can test this? Our site is mostly transactional, users going and filling out forms to put internal orders through. We also connect to XML web services and REST services; we also provide REST services, too. Obviously there's no way to completely replicate a production server's requests onto a test server, but I need to do more thorough testing. Any advice would be hugely appreciated.
I realize your focus for now is trying to recreate the problem on test. That may not be as easy as hoped. Instead, you should be able to understand and resolve it in production. FusionReactor can help, but the answer may well be in the cf logs.
You don't mention assessing the logs at the time of the hangup. See especially the coldfusion-error log, for OutOfMemory conditions.
You mention raising the heap, but the problem may be with the metaspace instead. If so, consider simply removing the maxmetaspace setting in the jvm args. That may be the sole and likely cause of such new and unexpected outages.
Or if it's not, and there's nothing in the logs at the time, THEN do consider FR. Does IT show anything happening at the time?
If not, then consider a need to tune the CF/web server connector. I assume you're using IIS. How many sites do you have? And how many connectors (folders in the cf config/wsconfig folder)? What are the settings in their workers.properties files? Are they optimized for the number of sites using that connector?
Also, have you updated cf2018? Are there any errors in the update error log? Did you update the web server connector also?
Are you running the cf2018 pmt (performance monitoring tool set)? Have you updated it?
There could be still more to consider, but let's see how it goes with those. I have blog posts on these and many more topics that would elaborate on things, both at my site (carehart.org) and the Adobe cf portal (coldfusion.adobe.com).
But let's hear if any of this gets you going.

Periodic Email Notifications (Windows Azure .Net)

I have an application written in C# ASP.Net MVC4 and running as a Windows Azure Website. I would like to write a service / job to perform the following:
1. Read the user information from the website database
2. Build a user-wise site activity summary
3. Generate an HTML email message that includes the summary for each user account
4. Periodically send such emails to each user
I am new to Windows Azure Cloud Services and would like to know the best approach / solution to achieve the above.
Based on my study so far, I see that an independent Worker Role in Cloud Services, along with SendGrid and Postal, would be the best fit. Please suggest.
You're on the right track, but... Remember that a Worker Role (or Web Role) is basically a blueprint for a Windows Server VM, and you run one or more instances of that role definition. And that VM, just like Windows Server running locally, can perform a bunch of tasks simultaneously. So... there's no need to create a separate worker role just for sending hourly emails. Think about it: for nearly the whole hour it'll be sitting idle, and you'll be paying for it (for however many instances of the role you launch, and you cannot drop that to zero - you'll always need a minimum of one instance).
If, however, you create a thread on an existing worker or web role, which simply sleeps for an hour and then does the email updates, you basically get this ability at no extra cost (and you should hopefully cause minimal impact to the other tasks running on that web/worker role's instances).
One thing you'll need to do, independent of separate role or reused role: be prepared for multiple instances. That is: if you have two role instances, they'll both be running the code to check every hour, so you'll need a scheme to prevent both instances doing the same task. This can be solved in several ways. For example: use a queue message that stays invisible for an hour and then appears, and have your code check maybe every minute for a queue message (the first instance that gets it does the hourly work). Or maybe run Quartz.NET.
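To make that concrete, here's a rough sketch of the reused-role plus queue-trigger idea. It assumes the Microsoft.WindowsAzure.ServiceRuntime and Microsoft.WindowsAzure.Storage client libraries; the queue name, the "StorageConnectionString" setting, and SendSummaryEmails() are placeholders you'd replace with your own code:

```csharp
// Rough sketch only: reuse an existing worker role's Run() loop for the hourly
// mail-out, coordinated across instances with a delayed queue message.
using System;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public class WorkerRole : RoleEntryPoint
{
    public override void Run()
    {
        var account = CloudStorageAccount.Parse(
            RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));
        CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("hourly-email");
        queue.CreateIfNotExists();
        // Bootstrap: drop a single "send-summaries" message into the queue once
        // (e.g. from a deployment step) to start the hourly cycle.

        while (true)
        {
            // Every instance polls roughly once a minute, but only the instance
            // that actually receives the message does the hourly work.
            CloudQueueMessage msg = queue.GetMessage(visibilityTimeout: TimeSpan.FromMinutes(10));
            if (msg != null)
            {
                SendSummaryEmails();  // placeholder: build summaries, render with Postal, send via SendGrid
                queue.DeleteMessage(msg);

                // Re-queue the trigger so it becomes visible again in an hour.
                queue.AddMessage(new CloudQueueMessage("send-summaries"),
                                 timeToLive: null,
                                 initialVisibilityDelay: TimeSpan.FromHours(1));
            }
            Thread.Sleep(TimeSpan.FromMinutes(1));
        }
    }

    private void SendSummaryEmails()
    {
        // Placeholder for steps 1-4 of the question: read users, build the activity
        // summary, generate the HTML email, and hand it off to SendGrid.
    }
}
```

The 10-minute visibility timeout just keeps a second instance from grabbing the same message while the mail-out is running; if the instance crashes mid-send, the message reappears and another instance picks it up.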
I didn't know Postal, but it seems like the right combination to use.

AX 2009 Code Propagation with Load Balancing

I'm curious how AX 2009 handles code propagation when operating in a load balanced environment.
We have recently converted our AX server infrastructure from a single AOS instance to 3 AOS instances, one of which is a dedicated load balancer (effectively 2 user-facing servers). All share the same application files and database. Since then, we have had one user who has been having trouble receiving code updates made to the system. The changes generally take a few days before they can see them, and the changes don't all seem to appear at once.
For example, a value was added to an enum field, and they were not able to see it on a form where it was used (though others connected to the same instance were). Now, this user can see the value in the dropdown as expected, but when connected to one of the instances it will not flow onto a report as it should. When connected to the other instance it works fine, and for any other user connected to either instance it works properly.
I'm not certain if this is related to the infrastructure changes, but it does seem odd that only one user is experiencing it. My understanding was that with this setup, code changes would propagate across the servers either immediately (due to sharing the Application Files), or at least in a reasonable amount of time (<1 day). Is this correct or have I been misinformed?
As your cache problem seems to be per user, go learn about AUC files.
The files are stored on the client computer and can be tricky to keep in sync. There are other problems as well.
Start AX from a script that deletes the AUC file before launching the client, along the lines of the sketch below.
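Something like this (the AUC location in the user's local application data folder and the Ax32.exe path are assumptions - check them against your own installation; a plain batch file doing the same two steps works just as well):

```csharp
// Minimal "clean start" launcher sketch: delete the user's AUC cache files,
// then start the AX 2009 client. Paths below are assumptions to verify.
using System;
using System.Diagnostics;
using System.IO;

class CleanAxStart
{
    static void Main()
    {
        string localAppData = Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData);
        foreach (string auc in Directory.GetFiles(localAppData, "*.auc"))
        {
            try { File.Delete(auc); }
            catch (IOException) { /* file locked by a running client - skip it */ }
        }

        // Assumed default AX 2009 client path; add your usual command-line options if any.
        Process.Start(@"C:\Program Files (x86)\Microsoft Dynamics AX\50\Client\Bin\Ax32.exe");
    }
}
```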
There is no cache coherency between AOS instances: import an XPO on one AOS server, and it is not visible on the other. You will either have to flush the cache manually or restart the other AOS. The simplest thing is to import on each server; this is especially true for labels, as to my knowledge that is the only way to bring labels into sync.
I am sort of curious about this as well, but what I do know is that if a user has access to the AOT (is a member of admin or of a group with developer access), the client will cache AOT elements more aggressively than it does for users without developer access.
Elements (like an enum) might be cached at the client level, but also at the AOS level. Restarting the AOS (service) flushes that service's memory, forcing it to reload elements upon restart.
I guess what I am suggesting is that you make sure the element is not cached client side. Either restart the client, or run "Refresh AOD" from the developer tools menu. If that doesn't help, try restarting the AOS the client connects to, and see if that helps.
I think it is safe to say that if you want to be absolutely sure every user has the most recent "copy" of any element, you should not develop on the application files shared by all of these services, but rather develop in an environment with one AOS. And when you need to move things to production, you need to take down all AOSes in production and move the changes over while the system is down.
In situations like this it is often difficult to find the exact cause for a specific case.
I try to follow some best practices to avoid such situations:
- Use a separate environment for development
- Deploy code changes using layer files, not XPOs
- When deploying, stop all AOSs, deploy the files, delete the index files in the application directory, start one AOS, compile, sync the DB, then start the other AOSs (or even shut them all down and start again)
- Try to keep the latest kernel versions on the AOSs and clients

SyncFramework 2.1 updates & deletes do not seem to apply properly

I'm synchronizing SQL Server 2008 with ~6 SQL Server 2008 Express clients (everything R2 I believe), using the SyncOrchestrator or specifically using http://code.msdn.microsoft.com/windowsdesktop/Database-SyncSQL-Server-e97d1208 as a base with slight modifications. To my knowledge this means all connections are peers or nodes.
I have 2 scopes. One is download only and the other is upload only. The download-only scope is riddled with identity columns, primarily because I didn't know any better and still couldn't wrap my head around introducing Guids as the PK on the client side. It doesn't totally matter, as all clients should have exact replicas of about 8 or so tables and these machines don't touch this data in any way, only read it.
The upload only scope uses Guids as fortunately I can control that portion of the database and there would be no way 10 clients all using the same identity seed could sync back to the server properly. Both scopes use the default provisioning with bulk inserts and the whole 9 yards so there shouldn't be anything I'm doing on the provisioning end to screw this up.
I initially set everything up not using PerformPostRestoreFixup AND the initial database would be manually synchronized with insert statements from the host. This seemed fine, but no updates or deletes seemed to ever be applied. You can safely ignore this (only used for historical accuracy and to prove my ineptness) as I then used VS2010 Database Projects to rebuild the database down to schema only & synchronized. I then used the steps outlined here (http://social.microsoft.com/Forums/br/syncdevdiscussions/thread/9ac6d1a1-1565-4b82-a8d8-3d4a9ff5d07b) (sync, backup, restore, call PerformPostRestoreFixup, sync on x clients) and on my dev box where I'm setting all this up I could see updates and deletes just fine. It's when I deploy this to the x clients that I'm not seeing a mirror of the database as I think I should.
The initial sync will complain and try to synchronize all records again. I believe this is expected. During the ApplyChangeFailed event on the client I set everything other than DbConflictType.ErrorsOccurred to ApplyAction.RetryWithForceWrite. This may be a source of problems, as I initially thought this should be done to force the change down to the client. I want the server to always win in this scenario, but during tracing I always see the phrase "Local wins" during the bulk insert/update calls. It's possible I'm seeing the error before the re-apply happens, but it's awkward to look at.
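For reference, the kind of handler I mean looks roughly like this (a simplified sketch, not the exact code; wiring it onto the client-side SqlSyncProvider is how the MSDN sample I started from does it):

```csharp
// Sketch of the conflict handling described above: force the incoming change
// for every conflict type except real errors, so the server should win.
using Microsoft.Synchronization;
using Microsoft.Synchronization.Data;
using Microsoft.Synchronization.Data.SqlServer;

static class ConflictHandling
{
    public static void AttachConflictHandler(SqlSyncProvider clientProvider)
    {
        clientProvider.ApplyChangeFailed += (sender, e) =>
        {
            // Let genuine errors surface; force-write everything else so the
            // incoming (server) row overwrites the local one on retry.
            if (e.Conflict.Type != DbConflictType.ErrorsOccurred)
            {
                e.Action = ApplyAction.RetryWithForceWrite;
            }
        };
    }
}
```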
The only problem I seem to be having is with the download-only scope. The initial client database is about a week old now, and if I use the PerformPostRestoreFixup steps I don't see any of the updates that have been applied between then and now as I think I should. It's as if SyncFx almost prefers a blank database on the client side to kick off the initial sync; then all the updates seem to apply just fine with no ApplyChangeFailed events kicking off.
If anyone has seen this before or has a clue where to go, I would greatly appreciate it. My brain is fried trying to determine what's going on. My last-ditch effort will be to deploy blank databases to all the clients and have them start the sync. I've had no issues with this on the dev side, but I can only test one other client to know if that'll do anything different. Aside from that I don't know what to do other than to keep doing manual syncs, which would defeat this purpose entirely. I thought PerformPostRestoreFixup would alleviate the issue entirely, but I seem to be having the same problems with or without it, or perhaps I'm not looking at what I need to be.
Thanks
I wanted to report and close the entry with my findings.
When I would deploy a previously configured client database, I'd often get ApplyChangeFailed events in the form of this log:
"[05:30:41 PM] - ApplyChange Failed: TableName: , Stage: ApplyingInserts, ConflictType: LocalInsertRemoteInsert, Action: RetryWithForceWrite"
This is what I thought would be expected, as it tried to reinsert data that is already there. What should have happened is that it was changed to an update statement during RetryWithForceWrite, but I found the data was not being updated with what was being sent down.
Once I started each client with a completely blank database and provisioned locally, all of these errors went away. It's as if every client expects some unique id that only it sets. I'm also using x64 builds versus x86, which may or may not have a bearing on the results. I wish I could determine what exactly happened, but it seems that when in doubt, and whenever possible, starting from absolute zero and letting sync fill in the data is your safest option.
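For anyone following the same route, by "provisioned locally" I mean something along these lines (a generic sketch only; the scope name and connection strings are placeholders):

```csharp
// Generic sketch of provisioning a scope directly against a freshly created,
// empty client database. Scope name and connection strings are placeholders.
using System.Data.SqlClient;
using Microsoft.Synchronization.Data;
using Microsoft.Synchronization.Data.SqlServer;

static class LocalProvisioning
{
    public static void ProvisionClient(string clientConnectionString, string serverConnectionString)
    {
        using (var serverConn = new SqlConnection(serverConnectionString))
        using (var clientConn = new SqlConnection(clientConnectionString))
        {
            // Describe the scope from the already-provisioned server...
            DbSyncScopeDescription scopeDesc =
                SqlSyncDescriptionBuilder.GetDescriptionForScope("DownloadOnlyScope", serverConn);

            // ...and apply it to the blank client database.
            var clientProvisioning = new SqlSyncScopeProvisioning(clientConn, scopeDesc);
            if (!clientProvisioning.ScopeExists("DownloadOnlyScope"))
            {
                clientProvisioning.Apply();
            }
        }
    }
}
```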

Deploying on EC2

This question is for anyone who has actually used Amazon EC2. I'm looking into what it would take to deploy a server there.
It looks like I can start in VirtualBox, set up my server and then export the image using the provided ec2-tools.
What gets tricky is that if I actually want to make configuration changes to my running server, they will not be persistent.
I have some PHP code that I need to be able to deploy (and redeploy) to the system, so I was thinking that EBS would be a good choice there.
I have a massive amount of data that I need stored, but it just so happens that latency is not an issue, so I was thinking something like s3fs might work.
So my question is... What would you do? What does your configuration look like? What have been particular challenges that perhaps you didn't see coming?
We have deployed a large-scale commercial app in the AWS environment.
There are three basic approaches to keeping your changes under control once the server is running, all of which we use in different situations:
Keep the changes in source control. Have a script that is part of your original image that can pull down the latest and greatest. You can pull down PHP code, Apache settings, whatever you need. If you need to restart your instance from your AMI (Amazon Machine Image), just run your script to get the latest code and configuration, and you're good to go.
Use EBS (Elastic Block Storage). EBS is like a big external hard drive that you can attach to your instance. Even if your instance goes away, EBS survives. If you later need two (or more) identical instances, you can give each one of them access to what you save in EBS. See https://stackoverflow.com/a/3630707/141172
Burn a new AMI after each change. There's a tool to create a new AMI from a running instance. If EBS is like having an external hard drive, creating a new AMI is like having a DVD-R. You can save the current state of your machine to it. Next time you have to start a new instance, base it on that new AMI. Good to go.
I recommend storing your PHP code in a repository such as SVN, and writing a script that checks the latest code out of the repository and redeploys it when you want to upgrade. You could also have this script run on instance startup so that you get the latest code whenever you spin up a new instance; saves on having to create a new AMI every time.
The main challenge that I didn't see coming with EC2 is instance startup time - especially with Windows. Linux instances take 5 to 10 minutes to launch, but I've seen Windows instances take up to 40 minutes; this can be an issue if you want to do dynamic load balancing and start up new instances when your load increases.
I'd suggest the best bet is to simply 'try it'. The charges to run a small instance are not high and data transfer rates are very low - I have moved quite a few GB and my data fees are still less than a dollar(!) in my first month. You will likely end up paying mostly for system time rather than data I suspect.
I haven't deployed yet but have run up an instance, migrated it from Ubuntu 8.04 to 8.10, tried different port security settings, seen what sort of access attempts unknown people have tried (mostly looking for phpadmin), run some testing against it and generally experimented with the config and restart of the components I'm deploying. It has been a good prelude to my end deployment. I won't be starting with a big DB so will be initially sticking with the standard EC2 instance space.
The only negative I have heard is that some spammers have made some of the IP ranges subject to spam-blocking - but I have not yet confirmed that.
I suggest you take your VirtualBox approach only after you are more familiar with the EC2 infrastructure. Go to EC2, open an account, and follow Amazon's EC2 getting-started guide. The guide will give you enough of an overview of everything (EBS, IPs, connections, and the rest) to get you started. We are currently using EC2 for production, and the way we started is just as I am explaining here.
I hope you become a cloud expert soon.
Per timbo's concern, I was able to nab an IP that, so far, hasn't legitimately shown up on any spam lists. You will have a few hiccups, since many blacklists are technically whitelists and will have every IP on their list until otherwise notified that a mail server is running on that IP. It's really easy to get removed: most of them have automated removal-request forms, and every one that doesn't has been very cooperative in removing me from their lists. Just be professional, and ask if they can give a time and reason for the block and what steps you should take to remove your IP. None of the services I emailed asked me to jump through any hoops; within two or three business days they all informed me my IP had been removed.
Still, if you plan on running a mail server I would recommend reserving IPs now. They're 1 cent for every hour they are not bound to an instance, so it works out to about $7 a month. I went ahead and reserved an extra one, as I plan on starting up another instance soon.
I have deployed some simple stuff to EC2 Win2k3 instances. Here's my advice:
Find a tutorial. Sign up for the service. Just spend an afternoon setting up your first server. It's pretty darned easy, though there will be obstacles to overcome. It's not too tough.
When I was fooling with EC2 I think I spent like $2.00 setting up a server and playing with it for a while.
Some of your data will be persistent, but you can connect S3 to EC2 as well.
Just go for it!
With regards to the concerns about blacklisting of mail servers, you can also use Amazon's Simple Email Service (SES), which obviates the need to run the mail server on the EC2 instances.
I had trouble with this as well, but posted a note here in their forums - https://forums.aws.amazon.com/thread.jspa?threadID=80158&tstart=0