step-functions-local: Can't start state machine within state machine - serverless-framework

I've got step-functions-local and serverless-offline configured to test a state machine (let's call it #1) that triggers another state machine (#2) defined within the project.
Both show as created when I fire up the local server with sls offline start --stage dev:
[Serverless Step Functions Local] 2022-07-29 11:03:59.867: [200] CreateStateMachine <=
{"sdkResponseMetadata":null,"sdkHttpMetadata":null,"stateMachineArn":"arn:aws:states:us-east-1:123:stateMachine:Foo",
"creationDate":1659117839863}
[Serverless Step Functions Local] 2022-07-29 11:03:59.883: [200] CreateStateMachine <=
{"sdkResponseMetadata":null,"sdkHttpMetadata":null,"stateMachineArn":
"arn:aws:states:us-east-1:123:stateMachine:Bar","creationDate":1659117839882}
I then test #1 with the following command:
aws stepfunctions --endpoint http://localhost:8083 start-execution --state-machine \
arn:aws:states:us-east-1:123:stateMachine:Foo --name local-test-$RANDOM --input <JSON string payload>
#1 executes several steps successfully, including read/write S3 operations, until it reaches the step to trigger #2; at that point, it fails with an exception that reads in part:
"Error":"StepFunctions- StateMachineDoesNotExistException",
"Cause":"State Machine Does Not Exist: 'arn:aws:states:us-east-1:123:stateMachine:Bar'
(Service: AWSStepFunctions; Status Code: 400; Error Code: StateMachineDoesNotExist
Here's how the step to start state machine #1 is defined in the #1 .yml file:
BarStateMachine:
Type: Task
Resource: "arn:aws:states:::states:startExecution.sync:2"
Parameters:
StateMachineArn:
arn:aws:states:us-east-1:123:stateMachine:Bar
I can get #1 to work if, instead of pointing to the arn for the locally-created #2, I point it to the arn of the deployed version. However, this deployed version is of course a remote resource, which sort of defeats the purpose of local testing. Any ideas on how to get the local version of #2 executed properly?

Try something like export STEP_FUNCTIONS_ENDPOINT=http://localhost:8083 && serverless offline start - that should cause step function local to use itself for the step function service integration.

Related

Configure allowed_pull_policies on shared GitLab runner

I'm using GitLab.com's managed CI runners, and I'd like to run my CI jobs using the if-not-present pull policy to avoid the extra minutes it takes to pull the image for each job. Trying to set that value in the .gitlab-ci.yml file gives me this error:
pull_policy ([if-not-present]) defined in GitLab pipeline config is not one of the allowed_pull_policies ([always])
This led me to the config.toml settings for restricting Docker pull policies, so I created a config.toml file at the root of my repository and tried that. However, I still get the same error.
Is config.toml only available for manual/self-hosted runners? Is there any other way to get past this?
Context
Image selection in .gitlab-ci.yml:
default:
image:
name: registry.gitlab.com/myorg/myrepo/ci/builder:latest
pull_policy: if-not-present
Contents of config.toml:
[[runners]]
executor = "docker"
[runners.docker]
pull_policy = ["if-not-present"]
allowed_pull_policies = ["always", "if-not-present"]
First of all, the config.toml file is not meant to be in your repo but on the runner machine (or container).
But anyways, the always pull policy should not cause image pulls to last minutes if the layers are already cached locally: it just ensures you have the latest version by checking the metadata. If the pulls take minutes, it means that either the layers are not available locally, or the image was actually updated (or that the connection to your container registry is so incredibly slow that just checking the metadata takes minutes, but that is unlikely).
It is very possible that Gitlab's managed runners do not have a way to locally cache layers, and thus there would be no practical difference between the always and if-not-present policies. For instance if you use Gitlab Saas:
A dedicated temporary runner VM hosts and runs each CI job.
(see https://docs.gitlab.com/ee/ci/runners/index.html)
Thus the downloaded layers are discarded as soon as the job finishes.

Batch job submission failed: I/O error writing script/environment to file

I installed slurm on a workstation and it seemed to work, i can use the slurm commands, srun is working too.
But when i try to launch a job from a script using sbatch test.sh i get the following error : Batch job submission failed: I/O error writing script/environment to file even if the script is the simplest like
#!/bin/bash
srun hostname
Make sure slurmd is running as root. See the SlurmdUser parameter in slurm.conf. Its default value is root and it should be so.
Note this is different from the SlurmUser parameter, that defines the user which runs the controller processes ; this one is preferably not root.
If the configuration is correct, then you might have a faulty filesystem at the location referred to in the SlurmdSpoolDir parameter, where slurmd writes the submission script and environment for jobs assigned to the node.

Flink job started from another program on YARN fails with "JobClientActor seems to have died"

I'm new flink user and I have the following problem.
I use flink on YARN cluster to transfer related data extracted from RDBMS to HBase.
I write flink batch application on java with multiple ExecutionEnvironments (one per RDB table to transfer table rows in parrallel) to transfer table by table sequentially (because call of env.execute() is blocking).
I start YARN session like this
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/yarn-session.sh -n 1 -s 4 -d -jm 2048 -tm 8096
Then I run my application on YARN session started via shell script transfer.sh. Its content is here
#!/bin/bash
export YARN_CONF_DIR=/etc/hadoop/conf
export FLINK_HOME=/opt/flink-1.3.1
export FLINK_CONF_DIR=$FLINK_HOME/conf
$FLINK_HOME/bin/flink run -p 4 transfer.jar
When I start this script from command line manually it works fine - jobs are submitted to YARN session one by one without errors.
Now I should be able to run this script from another java program.
For this aim I use
Runtime.exec("transfer.sh");
(maybe are there better ways to do this? I have seen at REST API but there are some difficulties because job manager is proxied by YARN).
At the beginning is works as usually - first several jobs are submitted to session and finished successfully. But the following jobs are not submitted to YARN session.
In /opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log I see error (and no another errors found in DEBUG level)
The program execution failed: JobClientActor seems to have died before the JobExecutionResult could be retrieved.
I have tried to analyse this problem by myself and found out that this error has occurred in JobClient class while sending ping request with timeout to JobClientActor (i.e. YARN cluster).
I tried to increase multiple heartbeat and timeout options like akka.*.timeout, akka.watch.heartbeat.* and yarn.heartbeat-delay options but it doesn't solve the problem - new jobs are not submit to YARN session from CliFrontend.
The environment for both case (manual call and call from another program) is the same. When I call
$ ps axu | grep transfer
it will give me output
/usr/lib/jvm/java-8-oracle/bin/java -Dlog.file=/opt/flink-1.3.1/log/flink-tsvetkoff-client-hadoop-dev1.log -Dlog4j.configuration=file:/opt/flink-1.3.1/conf/log4j-cli.properties -Dlogback.configurationFile=file:/opt/flink-1.3.1/conf/logback.xml -classpath /opt/flink-1.3.1/lib/flink-metrics-graphite-1.3.1.jar:/opt/flink-1.3.1/lib/flink-python_2.11-1.3.1.jar:/opt/flink-1.3.1/lib/flink-shaded-hadoop2-uber-1.3.1.jar:/opt/flink-1.3.1/lib/log4j-1.2.17.jar:/opt/flink-1.3.1/lib/slf4j-log4j12-1.7.7.jar:/opt/flink-1.3.1/lib/flink-dist_2.11-1.3.1.jar:::/etc/hadoop/conf org.apache.flink.client.CliFrontend run -p 4 transfer.jar
I also tried to update flink to 1.4.0 release or change parallelism of job (even to -p 1) but error has still occurred.
I have no idea what could be different? Is any workaround by the way?
Thank you for any help.
Finally I find out how to resolve that error
Just replace Runtime.exec(...) with new ProcessBuilder(...).inheritIO().start().
I really don't know why the call of inheritIO helps in that case because as I understand it just redirects IO streams from child process to parent process.
But I have checked that if I comment out this line of code the program begins to fall again.

Jenkins SSH remote process is getting killed as soon as the Jenkins SSH plugin returns back

Jenkins version: 1.574
I created a simple job which performs the following:
Using "Execute shell script on remote host using SSH" as one of the BUILD steps, I'm just calling a shell script. This shell script performs stop and start operations on Tomcat to restart an application on the target machine.
I have a valid username, password, port defined for the target SSH server in Jenkins Global settings.
I saw this behavior that when I run a Jenkins job and call the restart script (which gets the application name as parameter $1), it works fine, but as soon as "Execute shell script on remote host using SSH" step completes, I see the new process dies on the remote/target application server.
If I run the script from the target/remote server itself, everything works fine and the new process/PID remains live forever, but running the same script from Jenkins, though I don't see any errors and everything works as expected, the new process dies as soon as the above mentioned SSH step is complete and control comes back to the next BUILD step in Jenkins job OR the Jenkins job is complete.
I saw a few posts/blogs and tried setting: BUILD_ID=dontKillMe in the Jenkins job (in various places i.e. Prepare Environment variables and also using Inject Environment variables...). When the job's particular build# is complete, I can see Environment Variables for that build# does say BUILD_ID=dontKillMe as its value (instead of the default Timestamp tag value).
I tried putting nohup before calling the restart script, i.e.,
nohup restart_tomcat.sh "${app}"
I also tried:
BUILD_ID=dontKillMe nohup restart_tomcat.sh "${app}"
This doesn't give any error and creates a nohup.out file on the remote server (but I'm not worried about it as the restart_tomcat.sh script itself creates its own LOG file which I'm "cat"ing after the restart_tomcat.sh script is complete. cat'ing on the log file is performed using another "Execute shell script on remote host using SSH" build step, and it successfully shows the log file created by the restart script).
I don't know what I'm missing at this point, but as soon as the restart_tomcat.sh step is complete, the new PID/process on the remote/target server dies.
How can I fix this?
I've been through this myself.
On my first iteration, before I knew about Jenkins ProcessTreeKiller, I ended up just daemonizing Tomcat. The Apache Tomcat documentation includes a section on running as a daemon.
You can also try disabling the ProcessTreeKiller for your whole Jenkins instance, if it's relatively small (read the first link for information).
The BUILD_ID=dontKillMe should be passed to the shell, and therefore it should be in your command line, not in Jenkins global configuration or job parameters.
BUILD_ID=dontKillMe restart_tomcat.sh "${app}" should have worked without problems.
You can also try nohup restart_tomcat.sh "${app}" & with the & at the end.
My solution (it worked after trying everything else) in Ubuntu 14.04 (Trusty Tahr) (Amazon AWS - Amazon EC2), Jenkins 1.601:
Exec command: (setsid COMMAND < /dev/null > /dev/null 2>&1 &);
Exec in PTY: DISABLED
// Example COMMAND=socat TCP4-LISTEN:1337,fork TCP4:127.0.0.1:1338
I created this Transfer as my last one.
#!/bin/ksh
export BUILD_ID=dontKillMe
I added the above line to the start of my script and the issue was resolved.

Unable to start Active MQ on Linux

I am trying to get ActiveMQ server running on a RaspberryPI Debian Squeeze box and all appears to be installed correctly but when I try and start the service I am getting the following...
root#raspberrypi:/var/www/activemq/apache-activemq-5.7.0# bin/activemq
INFO: Loading '/etc/default/activemq'
INFO: Using java '/usr/jre1.7.0_07/bin/java'
/usr/jre1.7.0_07/bin/java: 1: /usr/jre1.7.0_07/bin/java:ELF0
4°: not found
/usr/jre1.7.0_07/bin/java: 2: /usr/jre1.7.0_07/bin/java: Syntax error: "(" unexpected
Tasks provided by the sysv init script:
restart - stop running instance (if there is one), start new instance
console - start broker in foreground, useful for debugging purposes
status - check if activemq process is running
setup - create the specified configuration file for this init script
(see next usage section)
Configuration of this script:
The configuration of this script can be placed on /etc/default/activemq or /root/.activemqrc.
To use additional configurations for running multiple instances on the same operating system
rename or symlink script to a name matching to activemq-instance-<INSTANCENAME>.
This changes the configuration location to /etc/default/activemq-instance-<INSTANCENAME> and
$HOME/.activemqrc-instance-<INSTANCENAME>. Configuration files in /etc have higher precedence.
root#raspberrypi:/var/www/activemq/apache-activemq-5.7.0#
It looks like there is an error somewhere but I am a fairly newbie at this and don't know where to look.
Just adding an answer since,becoz as per the documentation , the command is wrong
to start activemqm, use
Navigate to [installation directory/bin] and run ./activemq start or simple bin/activemq start
if you want to see it live in a window use bin/activemq console
To stop, you have to kill the process
The default ActiveMQ "getting started" link sends u here : http://activemq.apache.org/getting-started.html which is the "Getting Started Guide for ActiveMQ 4.x".
If you scroll down main documentation page, you will find this link with the proper commands :
http://activemq.apache.org/version-5-getting-started.html