Yarn Fair Scheduler queue - hadoop-yarn

I am seeking your help please in Hadoop Yarn Fair Scheduler.
My understanding is as follows:
With the default YARN settings, when an application does not explicitly specify a queue, the Fair Scheduler creates a queue on the fly named after the submitting user.
Q1) Is my understanding correct?
Q2) If yes, what will the weight of this on-the-fly queue be?
yarn.scheduler.fair.user-as-default-queue When set to true, the Fair Scheduler uses the username as the default pool name when a pool name is not specified. When set to false, all applications are run in a shared pool, called default.
Default: true.
yarn.scheduler.fair.allow-undeclared-pools When set to true, pools specified in applications but not explicitly configured are created at runtime. When set to false, applications specifying pools not explicitly configured run in a pool named default. This setting applies both when an application explicitly specifies a pool and when the application runs in a pool named with the username associated with the application.
Default: true.
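
For reference, here is how those two settings would look in yarn-site.xml; a minimal sketch, with both flipped to false so that every application that doesn't name a queue lands in the shared default queue:
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <!-- default is true; false sends apps with no queue to "default" rather than a per-user queue -->
  <value>false</value>
</property>
<property>
  <name>yarn.scheduler.fair.allow-undeclared-pools</name>
  <!-- default is true; false stops unconfigured queues from being created at runtime -->
  <value>false</value>
</property>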

You can create a new queue where all applications that didn't specify a queue will go, and then give it the weight you want (see the allocations-file sketch after the placement policy below).
<queuePlacementPolicy>
  <rule name="specified" create="false" />
  <rule name="primaryGroup" create="false" />
  <rule name="default" queue="yourDefaultQueueName" />
</queuePlacementPolicy>
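
To give that fallback queue a specific weight, declare it in the Fair Scheduler allocations file (fair-scheduler.xml); a minimal sketch, where yourDefaultQueueName matches the placement rule above and the weight of 2.0 is just an example:
<allocations>
  <queue name="yourDefaultQueueName">
    <!-- relative share of the cluster compared to sibling queues -->
    <weight>2.0</weight>
  </queue>
</allocations>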

Wix - ServiceControl start takes four minutes to fail, should be 30 sec

My service automatically starts during install...
<ServiceControl Id="StartService" Start="install" Stop="both" Remove="uninstall" Name="HeskaGateway" Wait="yes" />
And it works fine if I provide the service with a valid connection string. If I provide a bad connection string, the service starts and stops very quickly; I see this when I go to Services and do a manual start. According to the documentation on the MSI ServiceControl Table, a Wait value of "yes" turns into a 1, which means it should wait for 30 seconds and then fail. It actually takes 4 minutes and 7 seconds. Why so long?
MSI (s) (6C:78) [16:36:41:932]: Executing op: ServiceControl(,Name=HeskaGateway,Action=1,Wait=1,)
StartServices: Service: Heska Gateway
MSI (s) (6C:78) [16:40:48:862]: Note: 1: 2205 2: 3: Error
MSI (s) (6C:78) [16:40:48:862]: Note: 1: 2228 2: 3: Error 4: SELECT `Message` FROM `Error` WHERE `Error` = 1920
Error 1920. Service 'Heska Gateway' (HeskaGateway) failed to start. Verify that you have sufficient privileges to start system services.
EDIT: I never got to find out what my real problem was. I also had an installation sequencing error because my CustomAction (deferred) which would edit the connection string in the JSON file was triggering AFTER the ServiceStart. Trying to move the ServiceStart after that deferred custom action was awful. So I killed off the start from the ServiceControl entry and then added another custom action which silently ran "SC.EXE start HeskaGateway". I'll document that below as a solution.
The installer has a custom UI which asks the user to copy-paste the connection string given to them by the support department. The installer edits a JSON file in the app folder using a deferred CustomAction. It is deferred because it needs to run after the files are written to disk, and it also needs elevated permissions. This all worked great until I decided to have the service start itself "at the end" of the installation. My first attempt was to use a
<ServiceControl Id="StartService" Start="install" ...>
But that was taking 4 minutes to fail. Troubleshooting showed that the service was being started BEFORE the custom action which writes the connection string into the JSON file. I needed the service start to be delayed till after the custom action. I looked at adding a second ServiceControl entry into its own component that could be scheduled much later but that gave me an uncomfortable feeling that I was going to break uninstall and repair installs. So, I just added another deferred custom action sequenced right after the JSON file edit.
That new action executes "SC.EXE start MyServiceName". SC.EXE starts services in a non-blocking way, so succeed or fail, it finishes quickly.
My final solution:
<Component Id="MyCloudSync.exe" Guid="{generate-your-own-guid}">
  <File Id="MyCloudSync.exe.file" KeyPath="yes" Source="$(var.RELEASEBINARIES)\MyCloudSync.exe" />
  <ServiceInstall Id="MyCloudSync.exe"
                  Type="ownProcess"
                  Name="MyGateway"
                  DisplayName="My Gateway"
                  Description="Synchronizes laboratory data with Cloud"
                  Start="auto"
                  ErrorControl="normal" />
  <!-- Start is performed by a custom action that calls SC.EXE, so it can be delayed until after the custom action that writes the JSON file -->
  <ServiceControl Id="StartService" Stop="both" Remove="uninstall" Name="MyGateway" Wait="yes" />
</Component>
Note that the ServiceControl entry above does not have a Start attribute.
The DLL that the GetConnectionString and SetConnectionString actions call is one of my own making. WiX has its own custom action for running command lines quietly: WixQuietExec.
<CustomAction Id="GetConnectionString"
              BinaryKey="MyCustomActions"
              DllEntry="GetConnectionString"
              Execute="immediate" />
<CustomAction Id="SetConnectionString"
              BinaryKey="MyCustomActions"
              Impersonate="no"
              DllEntry="SetConnectionString"
              Execute="deferred" />
<CustomAction Id="SetConnectionStringDeferredParams"
              Property="SetConnectionString"
              Value="&quot;[INSTALLFOLDER]&quot;&quot;[CONNECTIONSTRING]&quot;" />
<Property Id="QtExecStartService" Value="&quot;SC.EXE&quot; start MyGateway" />
<CustomAction Id="QtExecStartService"
              BinaryKey="WixCA"
              DllEntry="WixQuietExec"
              Impersonate="no"
              Execute="deferred"
              Return="ignore" />
Starting the service is just a convenience, saving the user from having to go to Services.msc to perform the start or reboot, so I used Return="ignore". Also, SC.EXE just puts the service into "Start Pending", so it probably can't return much of an error unless your service doesn't exist.
NOTE: WixQuietExec is documented here. Make sure to quote the EXE and give your property the same Id as your CustomAction that uses WixQuietExec. That info is under "Deferred Execution" but I still got it wrong on my first try.
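For completeness, the scheduling that puts the SC.EXE start right after the JSON edit looks roughly like this; a sketch assuming the custom action Ids above (the conditions are illustrative, adjust them to your install/repair needs):
<InstallExecuteSequence>
  <!-- type 51 action: stages CustomActionData for the deferred SetConnectionString -->
  <Custom Action="SetConnectionStringDeferredParams" Before="SetConnectionString" />
  <!-- deferred, elevated: runs once the files are on disk -->
  <Custom Action="SetConnectionString" After="InstallFiles">NOT Installed</Custom>
  <!-- deferred, fire-and-forget service start via SC.EXE, right after the JSON edit -->
  <Custom Action="QtExecStartService" After="SetConnectionString">NOT Installed</Custom>
</InstallExecuteSequence>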
The service itself might be doing something. The service control protocol includes a service status that's returned from the service itself, and this tells Windows what's going on. One of the items in there is a wait hint. Knowing nothing about the service, it's possible that the service is aware that it might have a slow startup and tells Windows (with a wait hint) that it should wait longer. 30 seconds is really a default, not a fixed value. This post refers to the wait hint for a managed code service:
How to choose value for serviceStatus.dwWaitHint?
You didn't show the ServiceControl used to install the service, but if it's shared with another service in the same process things can get complicated because the process itself can't terminate while it's also hosting another service.
Error 1920: this error would seem to indicate a missing privilege (logon as a service) or access right, perhaps some sort of interference from an external cause, or perhaps an MSI package that is not running elevated (unlikely, I think; then you could only write to per-user paths).
Are you running this service with the LocalSystem account, NetworkService, LocalService, or a regular user account? (See the linked overview of the above service accounts.)
Or with the newer concepts of managed service accounts, group managed service accounts or virtual accounts (see the step-by-step info; concepts that I do not know enough about).
If you use a regular user account, does it have the "logon as a service" privilege set?
If you create (or define) the user in your WiX MSI, you can set LogonAsService="yes" on the User element in question. I believe this adds the privilege (SeServiceLogonRight) for the account.
A privilege is different from access rights (ACLs). For the record: a privilege is a pervasive, system-wide access to some sort of function / feature - for example changing the system time, starting / stopping services, logging on as a service, etc. (see section 13 here for more).
Is there a security software / antivirus on your test box? It could be interfering with your setup's API calls. If so, try to disable it (if possible) during the installation process. I'll mention firewalls too.
Is your MSI set to run elevated? (Package Element => InstallPrivileges).
UPDATE: Just adding the issue it turned out to be: the custom action that updates the config data needed to run with elevated rights; otherwise faulty service configuration data resulted, which in turn caused the generic 1920 error message. In this case the configuration was in JSON format, but it can obviously be in several formats: XML, registry, INI, etc. See the OP's comment above for more details.
Timeout: As to the long timeout, a few possibilities:
Security software: I have seen this sometimes when security software locks the whole setup, so the service timeout runs only after some scanning delay.
System restore: setups can trigger the creation of a system restore point before the installation kicks off (one of the possible reasons why some MSI installations suddenly take a long time when they ordinarily complete quickly); it is possible to prevent this restore point creation with MSIFASTINSTALL.
Service start timeout policy: maybe check this answer from serverfault and the registry value described, for whether there is a policy on your network to change the default service start timeout: How do I increase windows service startup timeout (ServicesPipeTimeout). Frankly I am not sure whether MSI uses its own timeout or the system default one - maybe someone can illuminate?
Database timeout: one could also speculate that the database connection you initiate has its own timeout (though that doesn't match your interactive test experience?). Maybe you can check your code and your call and let us know?
Dependencies: does your service depend on another service? (See the Symantec link below.) Any dependencies on files installed to the GAC or to WinSxS, as mentioned by Chris in the first link below?
A lot of speculation. Let's hope some of it helps or that it inspires other ideas that solve the problem. Below are some links for safekeeping (trying to write a generic answer that may also help others with the same or similar problems - which makes the answer too long, sorry about that).
Links:
Error 1920 service failed to start. Verify that you have sufficient privileges to start system services
http://blog.iswix.com/2008/09/different-year-same-problem.html
https://support.symantec.com/en_US/article.TECH103676.html (adds issues such as: dependency on other services)
How to debug Windows services (service timeout and more)
windows service startup timeout (I would check this)
Service failed to start error 1920 (generatePublisherEvidence)

Azure Cache not persisting Session State across VIP swaps?

As a follow-up to this post: Enabling co-located Session Caching in an Azure Cloud Service - I'm trying to store session state in Azure Cache to persist sessions between VIP swaps. Quoted from the answer:
To fix this problem, I'd like you to try the new Cache Service (Preview). In this way you create a dedicated cache for your subscription so that you can use it across cloud service deployments, virtual machines and websites.
I've set up an Azure Cache (Preview) instance, used its endpoint and primary access key in my web.config, and deployed to my Azure Cloud Service Staging slot.
I then logged in using Forms auth, and redeployed to the same slot. My credentials were persisted! This was great to see.
But then I VIP swapped to Production, logged in the same way to the production instance, redeployed to Staging, VIP swapped again, and then refreshed, expecting to remain logged in, but it didn't work - my session was lost on both production and staging.
I've followed the instructions found here:
http://www.windowsazure.com/en-us/manage/services/cache/net/how-to-in-role-cache/#getting-started-cache-role-instance
What could be causing this? No exceptions are thrown - my access key works (tested by giving it a bogus one and getting an exception)... I'm not sure what's going on. Config sections in web.config:
<sessionState mode="Custom" customProvider="AFCacheSessionStateProvider" xdt:Transform="Insert">
  <providers>
    <add name="AFCacheSessionStateProvider"
         type="Microsoft.Web.DistributedCache.DistributedCacheSessionStateStoreProvider, Microsoft.Web.DistributedCache"
         cacheName="default" dataCacheClientName="default" applicationName="AFCacheSessionState" />
  </providers>
</sessionState>
And:
<dataCacheClient name="default">
  <autoDiscover isEnabled="true" identifier="mysite.cache.windows.net" />
  <securityProperties mode="Message" sslEnabled="false">
    <messageSecurity authorizationInfo="{my key}" />
  </securityProperties>
</dataCacheClient>
As for timeout policy - I have it set to never expire with eviction enabled. I'm one of a handful of users and the cache is storing cookies in 128MB of space, so I don't think it's related to expiry.
I also noticed that in the docs, there is no entry for applicationName as I have above. I tried removing it and re-testing, to no avail - my Prod session is still lost upon VIP swap.
What am I doing wrong?
Update:
From a Microsoft forum post:
I was able to reproduce the issue. I am investigating.
Forms authentication is not based on session state. It relies only on client-side cookies. Cookies are encrypted and validated with keys specified in machineKey section of web.config.
Default config is:
<machineKey validationKey="AutoGenerate,IsolateApps"
            decryptionKey="AutoGenerate,IsolateApps"
            validation="SHA1" decryption="Auto" />
AutoGenerate means that each physical machine gets its own decryptionKey. Cookies generated by production VM will not be accepted by staging VM.
After VIP swap all cookies set by old production VM will be rejected by new production VM (ex-Staging VM), causing all users to be logged out.
You need to specify machineKey values explicitly to force Forms Auth to generate cookies that will be valid for both new and old production VMs (see How To: Configure MachineKey, Web Farm Deployment Considerations section).
Check this online tool for machineKey section generation: http://aspnetresources.com/tools/machineKey.
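For illustration, an explicit machineKey element looks like this; a sketch with placeholder key values (generate your own, for example with the tool above, and deploy the same values to both slots):
<machineKey
  validationKey="[128-hex-character validation key goes here]"
  decryptionKey="[64-hex-character decryption key goes here]"
  validation="SHA1"
  decryption="AES" />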
UPD: There is a related note in Manage Deployments in Windows Azure / Managing ASP.NET machine keys for IIS:
Windows Azure automatically manages the ASP.NET machineKey for services deployed using IIS. If you routinely use the VIP Swap deployment strategy, you should manually configure the ASP.NET machine keys.

IBM Worklight 5.0.6 - Usage of testWebResourcesChecksum

According to Worklight 5.0.6 Information Center, for the attribute testWebResourcesChecksum in application-descriptor.xml:
The element controls whether the application verifies the integrity of its web resources each time it starts running on the mobile device. If its enabled attribute is set to true, the application calculates the checksum of its web resources and compares it with a value stored when it was first run. Checksum calculation can take a few seconds, depending on the size of the web resources. To make it faster, you can provide a list of file extensions to be ignored in this calculation.
<security>
  <encryptWebResources enabled="false"/>
  <testWebResourcesChecksum enabled="false" ignoreFileExtensions="png, jpg, jpeg, gif, mp4, mp3"/>
  <publicSigningKey> value </publicSigningKey>
</security>
1) If the attribute is set to true, the web resources checksum will be compared with a value stored when the app was first run. What happens if the checksum is different? Will an error message be displayed, forcing the application to close?
2) By default this attribute is set to false. However, to my understanding Direct Update also requires calculating the checksum of the web resources. What is the underlying meaning of setting it to true or false?
Environment: Worklight 5.0.6 Developer Edition
Thanks!
1) If the attribute is set to true, the web resources checksum will be compared with a value stored when the app was first run. What happens if the checksum is different? Will an error message be displayed, forcing the application to close?
Yes. An error message will be displayed and the user will be forced to quit the app.
2) By default this attribute is set to false. However, to my understanding Direct Update also requires calculating the checksum of the web resources. What is the underlying meaning of setting it to true or false?
Direct Update is a valid path for Worklight to change the web resources of an application, and will happen after the application contacts the Worklight Server upon launch or return to the foreground.
The idea here is that if someone managed to get into the filesystem of the device and alter the web resources, the app will detect this and prevent use of the application.
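For example, enabling the check in application-descriptor.xml is just a matter of flipping the attribute from the snippet quoted in the question:
<testWebResourcesChecksum enabled="true" ignoreFileExtensions="png, jpg, jpeg, gif, mp4, mp3"/>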

Can't get service to pull from (dead letter) queue

I have a queue named log on a remote machine. When I call that queue locally, I specify a custom dead-letter queue by modifying my NetMsmqBinding:
// _binding is the client-side NetMsmqBinding instance mentioned above
_binding.DeadLetterQueue = DeadLetterQueue.Custom;
_binding.CustomDeadLetterQueue = new Uri(
    "net.msmq://localhost/private/Services/Logging/LogDeadLetterService.svc");
This works fine; when I force my message to fail to get to its destination, it appears in this queue.
Now, I have a service hosted in IIS/WAS to read the dead-letter queue. It is hosted in a site called Services, at Services/Logging/LogDeadLetterService.svc. Here's the service in my config:
<service name="Me.Logging.Service.LoggingDeadLetterService">
  <endpoint binding="netMsmqBinding"
            bindingNamespace="http://me.logging/services/2012/11"
            contract="Me.Logging.Service.Shared.Service.Contracts.ILog" />
</service>
And here's my activation:
<add relativeAddress="LogDeadLetterService.svc"
     service="Me.Logging.Service.LoggingDeadLetterService" />
My actual service is basically this:
[ServiceBehavior(AddressFilterMode = AddressFilterMode.Any, // Pick up any messages, regardless of To address.
                 InstanceContextMode = InstanceContextMode.Single, // Singleton instance of this class.
                 ConcurrencyMode = ConcurrencyMode.Multiple, // Multiple callers at a time.
                 Namespace = "http://me.logging/services/2012/11")]
public class LoggingDeadLetterService : ILog
{
    public void LogApplication(ApplicationLog entry)
    {
        LogToEventLog(entry);
    }
}
My queue is transactional and authenticated. I have net.msmq included in the enabled protocols on both the Services site and the Logging application, and I added a net.msmq binding to the Services site. If I have the binding information as appdev.me.com, I get the following error when browsing to http://appdev.me.com/Logging/LogDeadLetterService.svc (appdev.me.com is set up in my HOSTS file):
An error occurred while opening the queue:Access is denied. (-1072824283, 0xc00e0025).
If I have the binding information as localhost, I get the following error:
An error occurred while opening the queue:The queue does not exist or you do not have sufficient permissions to perform the operation. (-1072824317, 0xc00e0003).
No matter which way I have it set up, the service isn't picking up the dead letter, as it's still in the queue and not in my event log.
Now, I realize that both of these reference a permissions issue. However, in the interest of getting the code part of this tested before figuring out the authentication piece, I have given Full Control to everyone I could think of - including Everyone, Authenticated Users, NETWORK SERVICE, IIS_USERS, ANONYMOUS LOGON, and myself. (The app pool is running as me.)
Any help as to how to get my service to be able to pull from this queue would be phenomenal. Thanks!
EDIT: According to this MSDN blog entry, 0xC00E0003 corresponds to MQ_ERROR_QUEUE_NOT_FOUND, and 0xc00e0025 corresponds to MQ_ERROR_ACCESS_DENIED, so it looks like I want to have the binding information as appdev.me.com. However, that still doesn't resolve the apparent permissions issue occurring.
EDIT2: It works if I host the service in a console app and provide the following endpoint:
<endpoint address="net.msmq://localhost/private/Services/Logging/LogDeadLetterService.svc"
          binding="netMsmqBinding"
          bindingNamespace="http://me.logging/services/2012/11"
          contract="Me.Logging.Service.Shared.Service.Contracts.ILog" />
So what's going on differently in the console app than is going on in IIS? I'm pretty confident, due to EDIT above, that I'm hitting the queue. So why can't I get into it?
EDIT3: Changed Services/Logging/LogDeadLetterService.svc to Logging/LogDeadLetterService.svc per the advice given here, but no change.
[Bonus question: Do I need to handle poison messages in my dead letter queue?]
So, three things needed to be changed:
The binding does have to be localhost.
The queue has to be named Logging/LogDeadLetterService.svc to be found - it's the application and the service, not the site, application, and service.
I had something messed up with the application pool - I have no idea what it was, but using a different app pool worked, so I backed out all of my service-related changes and then recreated everything, and it works.
Well, that was a lot of banging my head against my desk for something as simple as "don't mess up your app pool."
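
For the record, combining points 1 and 2 above, the IIS-hosted endpoint ends up looking roughly like this; a sketch based on the names used in the question (with WAS hosting the address may also be derived automatically from the queue name):
<endpoint address="net.msmq://localhost/private/Logging/LogDeadLetterService.svc"
          binding="netMsmqBinding"
          bindingNamespace="http://me.logging/services/2012/11"
          contract="Me.Logging.Service.Shared.Service.Contracts.ILog" />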

What can cause IIS app pool to recycle?

I am currently experiencing some instability in my session variables and believe the app pool is where the error is coming from. What I cannot find is a list of possible culprits for the issue. What can cause the app pool to recycle on its own, other than a scheduled recycle?
Common reasons why your application pool may unexpectedly recycle
EDIT: Full Text in the event that the link goes 404:
If your application crashes, hangs or deadlocks, it will cause/require the application pool to recycle in order to be resolved, but sometimes your application pool inexplicably recycles for no obvious reason. This is usually a configuration issue or due to the fact that you're performing file system operations in the application directory.
For the sake of elimination I thought I'd list the most common reasons.
Application pool settings
If you check the properties for the application pool you'll see a number of settings for recycling the application pool. In IIS6 they are:
Recycle worker processes (in minutes)
Recycle worker process (in requests)
Recycle worker processes at the following times
Maximum virtual memory
Maximum used memory
These settings should be pretty self explanatory, but if you want to read more, please take a look at this MSDN article
The processModel element of machine.config
If you're running IIS5 or the IIS5 isolation mode you'll have to look at the processModel element. The properties you should pay the closest attention to are:
memoryLimit
requestLimit
timeout
memoryLimit
The default value of memoryLimit is 60. This value is only of interest if you have fairly little memory on a 32 bit machine. 60 stands for 60% of total system memory. So if you have 1 GB of memory the worker process will automatically restart once it reaches a memory usage of 600 MB. If you have 8 GB, on the other hand, the process would theoretically restart when it reaches 4.8 GB, but since it is a 32 bit process it will never grow that big. See my post on 32 bit processes for more information on why.
requestLimit
This setting is "infinite" by default, but if it is set to 5000 for example, then ASP.NET will launch a new worker process once it's served 5000 requests.
timeout
The default timeout is "infinite", but here you can set the lifetime of the worker process. Once the timeout is reached ASP.NET will launch a new worker process, so setting this to "00:05:00" would recycle the application every five minutes.
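As a sketch, all three settings live on the processModel element in machine.config; the values below are illustrative, not recommendations:
<processModel
  memoryLimit="60"
  requestLimit="5000"
  timeout="00:05:00" />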
Other properties
There are other properties within the processModel element that will cause your application pool to recycle, like responseDeadlockInterval. But these other settings usually depend on something going wrong or being out of the ordinary to trigger. If you have a deadlock then that's your main concern. Changing the responseDeadlockInterval setting wouldn't do much to resolve the situation. You'd need to deal with the deadlock itself.
Editing and updating
ASP.NET 2.0 depends on File Change Notifications (FCN) to see if the application has been updated. Depending on the change, the application pool will recycle. If you or your application are adding and removing directories in the application folder, you will be restarting your application pool every time, so be careful with those temporary files.
Altering the following files will also trigger an immediate restart of the application pool:
web.config
machine.config
global.asax
Anything in the bin directory or its sub-directories
Updating the .aspx files, etc., causing a recompile, will eventually trigger a restart of the application pool as well. There is a property of the compilation element under system.web called numRecompilesBeforeAppRestart. The default value is 20, meaning that after 20 recompiles the application pool will recycle.
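Raising that limit is a one-attribute change in web.config; a sketch with an illustrative value:
<system.web>
  <compilation numRecompilesBeforeAppRestart="50" />
</system.web>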
A workaround to the sub-directory issue
If your application really depends on adding and removing sub-directories you can use linkd to create a directory junction. Here's how:
Create a directory you'd like to exclude from the FCN, E.g. c:\inetpub\wwwroot\WebApp\MyDir
Create a separate folder somewhere outside the wwwroot. E.g. c:\MyExcludedDir
Use linkd to link the two: linkd c:\inetpub\wwwroot\WebApp\MyDir c:\MyExcludedDir
Any changes made in the c:\inetpub\wwwroot\WebApp\MyDir will actually occur in c:\MyExcludedDir so they will go unnoticed by the FCN.
Is recycling the application pool really that bad?
You really shouldn't have to recycle the application pool, but if you're dealing with a memory leak in your application and need to buy time to fix it, then by all means recycling the application pool could be a good idea.
What about session state?
Well, if you're running in-process session state, then obviously it's going to be reset each and every time the application pool is recycled. If you need to brush up on your state server options, then I recommend taking a look at this entry.