Slurm - job runs, gets data but gives TIMEOUT error - error-handling

So I'm running some code which takes about 2 hours to run on the cluster. I configured the batch file with
# Set maximum wallclock time limit for this job
#Time Format = days-hours:minutes:seconds
#SBATCH --time=0-02:15:00
Just to give some overhead if the job slows for whatever reason. I checked the directory that the generated files are stored in and the simulation completes successfully every time. Despite this, slurm keeps the job running until it hits the max time. The .out file keeps saying
slurmstepd: *** JOB CANCELLED AT 2022-03-05T10:38:26 DUE TO TIME LIMIT ***
Any ideas why it doesn't show as complete instead?

In my opinion, this error is not related to Slurm rather about your application. Your application is somehow not sending the exit signal to the slurm.
You can use sstat -j jobid to see the status of the job, may be after 2 hours to see how the cpu consumption etc going and figure out what happens in your application (where it hangs after completion or so).

Related

Why is the query for the scheduled query is running but not as Scheduled query in Bigquery?

I have a query which will run if I simply run it through console or from code.
When I created Scheduled Query for the Query, it would not run. The Scheduled Query is successfully created, and the interval I set (every 2 hours) is correctly implemented but only the jobs are not created (I can see in Scheduled query that the time to run is being incremented by 2 hours every time it is supposed to run).
These are the properties when running query from Scheduled query:
Overwrite table, Processing location: US, Allow large results, Batch priority
If I do a Schedule Backfill, it creates 12 jobs which fails with an error messages similar to the following:
Exceeded CPU limit 125%
Exceeded memory
If I cancel all the created jobs and leave one to run, it would run successfully. The Scheduled Query itself would not create any jobs.
I started the Scheduled query at 12:00 and made it to run for every 2 hours in repeats.
I assumed the jobs would run at the start time but apparently it is not the case. Scheduled Query ran perfectly as intended from 14:00 following with 16:00 and so on.
The errors regarding maximum CPU/memory usage is because the query I wrote had ORDER BY statement which was causing this issue. Removing that cleared the issue.

Run a SQL Server job until it succeeds

I have a SQL Server job that has run for almost 2 years.
It's connecting to a bad Oracle database that keeps disconnecting, it always fails due to that. And when I run it again after 10 or 15 minutes, it works successfully. I'm getting bored of checking it every day...
Is there a way that make the job run to connect to that Oracle source until it succeeds, or another job that looks over this job status and if it failed, then it runs it again until it succeeds?
A solution we are using is something like this:
Wrap your Oracle query in an SSIS package, and after the query, have the package update a SQL table that keeps either a history of executions, or just a single row that tracks the last time the job ran successfully. In short, if the Oracle query was successful, then put something in a table saying the query ran successfully today. If it was not successful, then don't put anything in the table for today.
Then at the beginning of the package, BEFORE the Oracle query, check to see if the query has been run successfully today. If it has already run successfully, then do nothing and exit the package. If it has not run successfully today, then go ahead and try to run it, following the post-query steps described above. If you have any other conditions about when the package should run (like "only after 10 am" or anything like that) you would include that logic here.
Finally, schedule the job to call the package, and schedule to run every 15 minutes, or however often you like. It will try every 15 minutes until it runs successfully, and after that it will stop doing anything until the next day.
As a bonus, you can use this same package and job to initiate all tasks that you want handled the same way. You just need to keep meta data about all these tasks in your history/metadata table.
an alternative is to create the job step and leave it unscheduled, and create an ssis job that acts as the master to all your jobs and it runs every minute checking all job steps from a config table that have yet to succeed today and any it finds execute using sp_start_job.
if they do run successfully log the stats to a log table and this prevents them ever being launched again until the next day. This prevents all yours jobs needing to be scheduled every 15 minutes etc, they launch asap, and you can add extra logic to handle dependencies, number parallel running, importance level etc, start time, latest start time, max number to retty etc

How to execute X times a Job Executor step

Introduction
To keep it simple, let's imagine a simple transformation.
This transformation gets an input of 4 rows, from a Data Grid step.
The stream passes through a Job Executor, referencing to a simple job, with a Write Log component.
Expectations
I would like the simple job executes 4 times, that means 4 log messages.
Results
It turns out that the Job Executor step launches the simple job only once, instead of 4 times : I only have one log message.
Hints
The documentation of the Job Executor component specifies the following :
By default the specified job will be executed once for each input row.
This is parametrized in the "Row grouping" tab, with the following field :
The number of rows to send to the job: after every X rows the job will be executed and these X rows will be passed to the job.
Answer
The step actually works well : an input of X rows will execute a "Job Executor" step X times. The fact is I wasn't able to see it with the logs.
To verify it, I have added a simple transformation inside the "Job Executor" step, which writes into a text file. After I have checked this file, it appeared that the "Job Executor" was perfectly executed X times.
Research
Trying to understand why I didn't have X log messages after the X times execution of "Job Executor", I have added a "Wait for" component inside the initial simple job. Finally, adding two seconds allowed me to see X log messages appearing during the execution.
Hope this helps because it's pretty tricky. Please feel free to provide further details.
A little late to the party, as a side note:
Pentaho is a set of programs (Spoon, Kettle, Chef, Pan, Kitchen), The engine is Kettle, and everything inside transformations is started in parallel. This makes log retrieving a challenging task for Spoon (the UI), you don't actually need a Wait for entry, try outputting the logs into a file (specifying a log file on the Job executor entry properties) and you'll see everything in place.
Sometimes we need to give Spoon a little bit of time to get everything in place, personally that's why I recommend not relying on Spoon Execution Results logging tab, it is better to output the logs to a DB or files.

BigQuery Error: Destination deleted/expired during execution

I've a batch script to load data from google cloud bucket to a table in big query. A scheduled SSIS job executes this batch file daily.
bq load -F "\t" --encoding=UTF-8 --replace=true db_name.tbl_name gs://GSCloudBucket/file.txt "column1:string, column2:string, column3:string"
Weirdly, the execution is successful some days and not some other time. Here is what I have on the log.
Waiting on bqjob_r790a43a4_00000155a65559c2_1 ... (0s) Current status: RUNNING ......
Waiting on bqjob_r790a43a4_00000155a65559c2_1 ... (7s) Current status: DONE
BigQuery error in load operation: Error processing job: Destination
deleted/expired during execution
one option is if you have 1 day (or multiple of days) expiration on that table (either on table directly or via default expiration on dataset). In this case - because actual time of load very you can get to situation when destination table has expired by that time.
You can use configuration.load.createDisposition attribute to address this.
Or/and you can make sure you have proper expiration set - for daily process it would be let's say - 26 hours - so you have extra 2 hours for your SSIS job to complete before table can expire

bq command line tool momentarily hangs eating memory with --format=none

I'm running a BigQuery query using bq that selects a subset of rows from one table into a destination table.
Our command looks like:
bq --format=none query --destination_table=dpm_legacy.unique_test [query]
On the command line I get:
Waiting on job_cda83335e0a4416ea9d4a2a0262d1ec7 ... (0s) Current status: RUNNING
Waiting on job_cda83335e0a4416ea9d4a2a0262d1ec7 ... (10s) Current status: DONE
But then the process hangs for awhile and it's CPU and memory usage begin to creep up until it finally exists with no output.
Empirically, it seems like the amount of time the tool hangs is directly proportional to how large the destination table is so is it possible that even with the --format=none flag it is still returning data?
Thanks!
bq does try to read the whole table on reply, even if the format is set to none. One way to prevent this is to use --nosync which will exit immediately and not wait for the query to complete. I'm in the process of adding a --max_rows flag that will allow you to specify how many rows you want in the result (so if you want none you can just specify 0).