I'm trying to figure out how to configure SLURM so that a user is required to specify --account when using the SLURM commands (salloc, sbatch, srun). Effectively I want to disable the default account behavior.
Has anyone out there found a simple way to do this?
I had the same requirement to force users to specify accounts and, after finding several ways to fulfill it with slurm, I decided to revive this post with the shortest/easiest solution.
The Slurm Lua job submit plugin sees the job description before the default account is applied. Hence, you can install the slurm-lua package, add "JobSubmitPlugins=lua" to slurm.conf, restart slurmctld, and test directly in the job_submit.lua script whether an account was specified (create the script wherever you keep your slurm.conf, typically /etc/slurm/):
-- /etc/slurm/job_submit.lua to reject jobs with no account specified
function slurm_job_submit(job_desc, part_list, submit_uid)
   if job_desc.account == nil then
      slurm.log_error("User %s did not specify an account.", job_desc.user_id)
      slurm.log_user("You must specify an account!")
      return slurm.ERROR
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end

return slurm.SUCCESS
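For reference, a minimal sketch of the configuration side (the file location and the service manager command are assumptions that depend on your installation):

# /etc/slurm/slurm.conf (excerpt)
JobSubmitPlugins=lua

# then restart the controller, for example:
systemctl restart slurmctld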
Errors resulting from not specifying an account appear as follows:
# srun --pty bash
srun: error: You must specify an account!
srun: error: Unable to allocate resources: Unspecified error
# sbatch submit.slurm
sbatch: error: You must specify an account!
sbatch: error: Batch job submission failed: Unspecified error
These errors are also printed out to the slurmctld log so that you know what the resource allocation issue was for the particular job:
[2017-09-12T08:32:00.697] error: job_submit.lua: User 0 did not specify an account.
[2017-09-12T08:32:00.697] _slurm_rpc_submit_batch_job: Unspecified error
As an addendum, the Slurm Submit Plugins Guide is only moderately useful and you will probably be much better off simply examining the Lua job_submit plugin implementation for guidance.
One option is to set the AccountingStorageEnforce parameter to associations in slurm.conf.
AccountingStorageEnforce
This controls what level of association-based enforcement to impose on job submissions. Valid options are any combination of associations, limits, nojobs, nosteps, qos, safe, and wckeys, or all for all of them (except nojobs and nosteps, which must be requested explicitly).

By enforcing associations, no new job is allowed to run unless a corresponding association exists in the system. If limits are enforced, users can be limited by association to whatever job size or run time limits are defined.
Then, with the sacctmgr command, make sure the default account has no access to the defined partitions. Effectively, the users will be denied submission if they do not specify a valid account.
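For example, a rough sketch of the two pieces (the user and account names below are placeholders, and the exact sacctmgr syntax may vary with your Slurm version):

# slurm.conf: refuse jobs that have no valid association
AccountingStorageEnforce=associations

# remove the association that lets a user run under the default account,
# so only explicitly specified, valid accounts get through (placeholder names):
sacctmgr remove user where name=alice account=default_acct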
Another option is to write a custom job submit plugin in Lua. In that script, you can check whether the --account parameter was set and reject the submission with a custom message if it was not.
Related
After making updates to my AWS app, I'm suddenly getting this error when trying to log in.
2 validation errors detected: Value '_______#gmail.com' at 'userName' failed to satisfy constraint: Member must satisfy regular expression pattern: [\p{L}\p{M}\p{S}\p{N}\p{P}]+; Value '______#gmail.com' at 'userAlias' failed to satisfy constraint: Member must satisfy regular expression pattern: [\p{L}\p{M}\p{S}\p{N}\p{P}]+
This doesn't seem to occur on all accounts, only certain accounts in our pool. I can't find any information on this. Has anyone encountered this or has ideas on what might be causing it?
Image of the error
I got an error when calling a package. The error is:
Error starting at line : 1 in command -
PKG_Generate_GRNo.GenerateGR(TO_NUMBER(:P164_APP_ID,
'9999999'),:APP_USER,:P164_FIRST_NAME,:P164_LAST_NAME,:P164_EMAIL,:P164_SKYPE_ID,:P164_COUNTRY,:P164_DATE_OF_BIRTH)
Error report - Unknown Command
PKG_Generate_GRNo.GenerateGR(TO_NUMBER(:P164_APP_ID,
'9999999'),:APP_USER,:P164_FIRST_NAME,:P164_LAST_NAME,:P164_EMAIL,
:P164_SKYPE_ID,:P164_COUNTRY,:P164_DATE_OF_BIRTH);
Session state protection violation is definitely an Apex error, relating to your page settings. It seems your package is trying to change the state of a read-only page. See this other question.
The item identifier in the error message P164_COURSECOUNT has the same prefix as the parameters you pass to the package (:P164_APP_ID) so presumably they relate to the same page. We know nothing about your application or its architecture, so it's hard to offer concrete advice. Maybe you need to change the page or item settings, maybe you need to change what the package does. Only you can tell the right course of action.
As you didn't post the whole command, a note: you have to enclose the call in a BEGIN-END block, e.g.
BEGIN
PKG_Generate_GRNo.GenerateGR (TO_NUMBER ( :P164_APP_ID, '9999999'),
:APP_USER,
:P164_FIRST_NAME,
:P164_LAST_NAME,
:P164_EMAIL,
:P164_SKYPE_ID,
:P164_COUNTRY,
:P164_DATE_OF_BIRTH);
END;
/
Using the Java SDK I am creating a load job for just a single record with a fairly complicated schema. When monitoring the status of the load job, it takes a surprisingly long time (but perhaps this is due to working out the schema), but then says:
11:21:06.975 [main] INFO xxx.GoogleBigQuery - Job status (21694ms) create_scans_1384744805079_172221126: DONE
11:24:50.618 [main] ERROR xxx.GoogleBigQuery - Job create_scans_1384744805079_172221126 caused error (invalid) with message
Too many errors encountered. Limit is: 0.
11:24:50.810 [main] ERROR xxx.GoogleBigQuery - {
"message" : "Too many errors encountered. Limit is: 0.",
"reason" : "invalid"
}
BTW - how do I tell the job that it can have more than zero errors using Java?
This load job does not appear in the list of recent jobs in the console, and as far as I can see, none of the Java objects contains any more details about the actual errors encountered. So how can I programmatically find out what is going wrong? All I can find is:
if (err != null) {
log.error("Job {} caused error ({}) with message\n{}", jobID, err.getReason(), err.getMessage());
try {
log.error(err.toPrettyString());
}
...
In general I am having a difficult time finding good documentation for some of these things and am working it out by trial and error and short snippets of code found on here and older groups. If there is a better source of information than the getting started guides, then I would appreciate any pointers to that information. The Javadoc does not really help and I cannot find any complete examples of loading, querying, testing for errors, cataloging errors and so on.
This job is submitted as a NEWLINE_DELIMITED_JSON record, supplied to the job via:
InputStream dummy = getClass().getResourceAsStream("/googlebigquery/xxx.record");
final InputStreamContent jsonIn = new InputStreamContent("application/octet-stream", dummy);
createTableJob = bigQuery.jobs().insert(projectId, loadJob, jsonIn).execute();
My authentication and so on seems to work correctly, as separate Java code to list the projects and the datasets in the project works fine. So I just need help working out what the actual error is: does it not like the schema (I have records nested within records, for instance), or does it think that there is an error in the data I am submitting?
Thanks in advance for any help. The job number cited above is an actual failed load job if that helps any Google staffers who might read this.
It sounds like you have a couple of questions, so I'll try to address them all.
First, the way to get the status of the job that failed is to call jobs().get(jobId), which returns a job object with an errorResult object containing the error that caused the job to fail (e.g. "too many errors"). The errors list holds all of the errors on the job, which should tell you which lines hit errors.
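A minimal sketch of that lookup with the Java client, assuming a configured Bigquery client called bigQuery and the project and job ids from your log (classes are from com.google.api.services.bigquery.model):

// Fetch the failed job and inspect its status.
Job failed = bigQuery.jobs().get(projectId, jobId).execute();

// The fatal error, e.g. "Too many errors encountered. Limit is: 0."
ErrorProto errorResult = failed.getStatus().getErrorResult();
log.error("Job failed ({}): {}", errorResult.getReason(), errorResult.getMessage());

// The full error list, which points at the offending lines/fields.
for (ErrorProto e : failed.getStatus().getErrors()) {
    log.error("{} at {}: {}", e.getReason(), e.getLocation(), e.getMessage());
}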
Note that if you have the job id, it may be easier to use bq to look up the job: run bq show <job_id> to get the job error information. If you add --format=prettyjson, it will print out all of the information in the job.
A hint: you might also want to supply your own job id when you create the job. Then, even if there is an error starting the job (i.e. the insert() call fails, perhaps due to a network error), you can look up the job to see what actually happened.
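A sketch of that, assuming loadJob is the Job you pass to jobs().insert() as in your snippet (the id string is just an example):

// Give the job a predictable id so it can be looked up even if insert() fails mid-flight.
loadJob.setJobReference(new JobReference()
        .setProjectId(projectId)
        .setJobId("create_scans_" + System.currentTimeMillis()));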
To tell BigQuery that some errors are allowed during import, you can use the maxBadRecords setting in the load job. See https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#getMaxBadRecords().
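On the load configuration that looks roughly like this (a sketch; the threshold of 10 is just an example):

// Allow up to 10 bad records before BigQuery fails the load job.
JobConfigurationLoad load = loadJob.getConfiguration().getLoad();
load.setMaxBadRecords(10);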
When I try to change the queue as such:
set queue standard total_jobs=16
I get the following error:
qmgr obj=standard svr=default: Cannot set attribute, read only or insufficient permission total_jobs
I am issuing the command as root.
total_jobs does not appear as a valid qmgr parameter in the documentation:
http://docs.adaptivecomputing.com/torque/help.htm#topics/12-appendices/serverParameters.htm
I'm not sure what total_jobs is supposed to do. Maybe you are looking for max_user_queuable
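If that's what you're after, something along these lines should work (the limit of 16 is just an example):

qmgr -c "set queue standard max_user_queuable = 16"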
I have a CDash configured to accept posts for automatic builds and tests. However, when any system attempts to post results to the CDash, the following error is produced. The result is that each result gets posted four times (presumably the original posting attempt plus the three retries).
Can anyone give me a hint as to what sets this mysterious build ID? I found some code that seems to produce a similar error, but still no lead on what might be happening.
Build::GetNumberOfErrors(): BuildId not set
Build::GetNumberOfWarnings(): BuildId not set
Submit failed, waiting 5 seconds...
Retry submission: Attempt 1 of 3
Server Response:
The buildid for CDash is computed based on the site name, the build name and the build stamp of the submission. You should have a Build.xml file in a Testing/20110311-* directory in your build tree. Open that up and see if any of those fields (near the top) is empty. If so, you need to set BUILDNAME and SITE with -D args when configuring with CMake. Or, set CTEST_BUILD_NAME and CTEST_SITE in your ctest -S script.
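For example, a sketch for a ctest -S script (the host and build names are placeholders):

set(CTEST_SITE "my.build.host")
set(CTEST_BUILD_NAME "Linux-gcc-Release")

# or, when configuring the build by hand:
# cmake -DSITE=my.build.host -DBUILDNAME=Linux-gcc-Release <source-dir>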
If that's not it, then this is a mystery. I've not seen this error occur before...
I'm having the same issue, even though Site and BuildName are available in Test.xml and are visible on CDash (4 times). I can see the jobs increment by refreshing between retries, so it seems that the submission succeeds yet reports a timeout.
Update: this seems to have started when I added the -j <nprocs> switch to the ctest command. Changing CtestSubmitRetryDelay to 20 (it was 5) allowed a server response through, which indicates the CDash version may not be able to handle the multi-proc option; I'll have to look into that for my issue. Perhaps setting CtestSubmitRetryDelay to a larger number will get you back a server response, as it did for me. Good luck!