CodeBuild builds are not being queued when the concurrent build limit is 1 - aws-codebuild

We are using AWS CodeBuild along with GitHub webhooks to trigger a build process. When a PR is created for a branch that starts with a Jira ticket prefix, e.g. oscs-278, we build a new environment with Terraform. When we make commits to the PR, it triggers the build process to update that environment.
This flow works well for us, especially since, as of February 2021, AWS CodeBuild allows you to set concurrent builds to 1. This is important for us, as we should only ever have one deployment at a time; the rest should be queued.
However, our current build process takes up to 15 minutes, and if we commit to the branch within this time frame, the new build is not queued while another build is in progress.
Is this likely to be an issue with the GitHub webhooks, or something to do with AWS CodeBuild?
From the AWS docs:
The maximum number of builds in a queue is five times the concurrent build limit.
So in theory, I should be able to have up to 5 builds in the queue.

CodeBuild won't queue new builds if the number of currently running builds is at your limit (which is 1). Attempts to start more builds in this condition will fail with an error. The AWS Docs say:
If the build project has a concurrent build limit set, builds return an error if the number of running builds reaches the concurrent build limit for the project. For more information, see Enable concurrent build limit.
This applies for webhooks and attempts to start them manually. The same docs also say:
If the build project does not have a concurrent build limit set, builds are queued if the number of running builds reaches the concurrent build limit for the platform and compute type. The maximum number of builds in a queue is five times the concurrent build limit. For more information, see Quotas for AWS CodeBuild.
That section sort of hints that you could get queuing behavior by raising your project concurrency limit to a high number (say, 60) or removing it, and instead setting the "platform and compute type" concurrency limit to 1, but this isn't possible because that limit isn't user-adjustable (and it would probably apply across all your projects anyway).
In short, I don't think you can make CodeBuild queue builds once a configured project concurrency limit is reached. A (rather complex) alternative is to do your own locking inside your buildspec.yml.
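To sketch what that locking could look like (a minimal illustration, not CodeBuild functionality: the DynamoDB table name, key schema, lock id, and polling interval are all assumptions of mine), a small helper called from the buildspec could acquire a lock with a conditional write and poll until the current holder releases it:

import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
LOCK_TABLE = "codebuild-deploy-locks"   # assumed table with partition key "LockId"
LOCK_ID = "terraform-env-deploy"        # one lock item shared by all builds of this project

def acquire_lock(build_id, poll_seconds=30):
    """Block until this build owns the lock item."""
    while True:
        try:
            dynamodb.put_item(
                TableName=LOCK_TABLE,
                Item={"LockId": {"S": LOCK_ID}, "Owner": {"S": build_id}},
                # The write only succeeds if no other build currently holds the lock.
                ConditionExpression="attribute_not_exists(LockId)",
            )
            return
        except ClientError as err:
            if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
                raise
            time.sleep(poll_seconds)  # another build is deploying; wait and retry

def release_lock():
    dynamodb.delete_item(TableName=LOCK_TABLE, Key={"LockId": {"S": LOCK_ID}})

In the buildspec, a pre_build command would call acquire_lock with $CODEBUILD_BUILD_ID and a post_build command would call release_lock. The project's concurrency limit would then need to be raised so that later builds can start and sit waiting, and a production version would also want a TTL or timeout on the lock item so a crashed build can't block everything forever.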

Related

Snakemake: how to best discern between the various types of Node failures in cluster-mode?

In Snakemake, as far as I know, we can only adapt job resources dynamically based on the number of attempts a job has made. When trying to re-adjust resources after a failure, it would be useful to distinguish between the following types of cluster failures:
Program error
Transient node failure
Out of memory
Timeout
The last 3 cases, in particular, are exposed to the SLURM user via different job completion status codes. The Snakemake interface to the status script merges all types of failures into a single "failed" status.
Is there any way to do so? Or is this a planned feature? Keeping a list of previous failure reasons, instead of just the attempt count, would be most useful.
e.g. goal:
rule foo:
    resources:
        mem_gb=lambda wildcards, attempts: 100 + (20 * attempts.OOM),
        time_s=lambda wildcards, attempts: 3600 + (3600 * attempts.TIMEOUT)
    ...
The cluster I have access to has heterogeneous machines where each node is configured with various walltime and memory limits, and it would minimize scheduling times if I didn't have to conservatively bump all resources at once.
Possible workaround: I thought of keeping track of that extra info between the job status script and the cluster submission script (e.g. keeping a history of status codes for each jobid). Is the attempt number available to the cluster submission and cluster status commands?
I think it would be best to handle such cluster-specific functionality in a profile, e.g. in the slurm profile. When an error is detected, the status script could simply silently resubmit with updated resources based on what SLURM reports. I don't think the Snakefile has to be cluttered with such platform details.
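To make that a little more concrete, here is a minimal sketch of a SLURM status script for such a profile. The history directory, the per-jobid files, and how a submission script would read them back are my own illustration (not existing Snakemake behaviour), and wiring the recorded reasons into per-rule resource callables is still the open question above:

#!/usr/bin/env python3
# Maps the detailed SLURM state onto the success/running/failed vocabulary that
# Snakemake's --cluster-status expects, and records the failure reason per jobid.
import subprocess
import sys
from pathlib import Path

HISTORY_DIR = Path(".slurm_status_history")   # assumed scratch directory

jobid = sys.argv[1]
state = subprocess.run(
    ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
    capture_output=True, text=True, check=True,
).stdout.split("\n")[0].strip()

if state.startswith("COMPLETED"):
    print("success")
elif state in ("", "PENDING", "RUNNING", "SUSPENDED"):
    print("running")
else:
    # FAILED, OUT_OF_MEMORY, TIMEOUT, NODE_FAIL, CANCELLED, ...
    HISTORY_DIR.mkdir(exist_ok=True)
    with open(HISTORY_DIR / jobid, "a") as fh:
        fh.write(state + "\n")   # history a submission script could consult
    print("failed")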

Is it possible to limit number of oozie workflows running at the same time?

This is not clear to me from the docs. Here's our scenario and why we need this, as succinctly as I can put it:
We have 60 coordinators running, launching workflows usually hourly, some of which have sub-workflows (some with multiple in parallel). This works out to around 40 workflows running at any given time. However, when the cluster is under load or some underlying service is slow (e.g. Impala or HBase), workflows run longer than usual and back up, so we can end up with 80+ workflows (including sub-workflows) running.
This sometimes results in ALL workflows hanging indefinitely, because we only have enough memory and cores allocated to this pool for Oozie to start the launcher jobs (e.g. oozie:launcher:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ), but not their corresponding actions (e.g. oozie:action:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ).
We could simply allocate enough resources to the pool to accommodate these spikes, but that would be a massive waste (hundreds of cores and GBs that other pools/tenants could never use).
So I'm trying to enforce some limit on the number of workflows running, even if that means some will run behind sometimes. BTW, all our coordinators are configured with execution=LAST_ONLY, and any delayed workflow will simply catch up fully on the next run. We are on CDH 5.13 with Oozie 4.1; pools are set up with the DRF scheduler.
Thanks in advance for your ideas.
AFAIK there isn't a configuration parameter that lets you control the number of workflows running at a given time.
If your coordinators are scheduled to run in approximately the same time window, you could consider collapsing them into just one coordinator/workflow and using the fork/join control nodes to control the degree of parallelism. You can distribute your actions across K queues in your workflow, which ensures that you never have more than K actions running at the same time, limiting the load on the cluster.
We use a script to automatically generate the fork queues inside the workflow and distribute the actions (of course, this only works for actions that can run in parallel, i.e. there are no data dependencies, etc.).
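For illustration only (this is not the actual script; the action names are placeholders and the emitted XML is deliberately simplified, omitting the action bodies and ok/error transitions a real Oozie workflow needs), the core of such a generator can be as small as:

# Split a flat list of independent action names into K queues, so the generated
# workflow runs at most K actions in parallel via one fork/join pair, with the
# actions inside each queue chained sequentially.
from typing import Dict, List

def build_queues(actions: List[str], k: int) -> Dict[int, List[str]]:
    """Round-robin the actions into k queues."""
    queues: Dict[int, List[str]] = {i: [] for i in range(k)}
    for idx, action in enumerate(actions):
        queues[idx % k].append(action)
    return queues

def fork_skeleton(queues: Dict[int, List[str]]) -> str:
    """Emit a simplified fork/join skeleton for the generated workflow."""
    paths = "\n".join(f'    <path start="{q[0]}"/>' for q in queues.values() if q)
    chains = "\n".join("  " + " -> ".join(q) + " -> join-all" for q in queues.values() if q)
    return (
        '<fork name="fork-all">\n' + paths + "\n</fork>\n"
        "<!-- each queue runs sequentially: -->\n" + chains + "\n"
        '<join name="join-all" to="end"/>'
    )

print(fork_skeleton(build_queues([f"action-{i}" for i in range(10)], k=3)))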
Hope this helps

Dataflow reading using PubSubIO is really slow

I'm having some trouble with a Dataflow pipeline that reads from PubSub and writes to BigQuery.
I had to drain it to perform some more complex updates. When I reran the pipeline, it started reading from PubSub at a normal rate, but after some minutes it stopped, and now it is not reading messages from PubSub anymore! The data watermark is almost one week behind and not progressing. There are more than 300k messages in the subscription to be read, according to Stackdriver.
It was running normally before the update, and now even if I downgrade my pipeline to the previous version (the one running before the update), I still can't get it to work.
I tried several configurations:
1) We use Dataflow autoscaling, and I tried starting the pipeline with more powerful workers (n1-standard-64) and limiting it to ten workers (see the sketch after this list), but it neither improves performance nor autoscales (it keeps only the initial worker).
2) I tried providing more disk through diskSizeGb (2048) and diskType (pd-ssd), but still no improvement.
3) I checked PubSub quotas and pull/push rates, but they're absolutely normal.
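For reference, here is roughly how those worker settings are passed, sketched with the Beam Python SDK's Dataflow options (the question's pipeline presumably uses the Java SDK, where the equivalent options are diskSizeGb/diskType; the project, bucket, and subscription names below are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    region="us-central1",                  # placeholder
    temp_location="gs://my-bucket/temp",   # placeholder
    streaming=True,                        # Pub/Sub sources require streaming mode
    machine_type="n1-standard-64",         # the larger workers tried in (1)
    max_num_workers=10,                    # cap autoscaling at ten workers
    disk_size_gb=2048,                     # the larger disk tried in (2)
    # pd-ssd would be requested via worker_disk_type with the full diskTypes resource path
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Read" >> beam.io.ReadFromPubSub(subscription="projects/my-project/subscriptions/my-sub")
     | "Process" >> beam.Map(lambda msg: msg))   # transforms and the BigQuery sink go here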
Pipeline shows no errors or warnings, and just won't progress.
I checked the instances' resources, and CPU, RAM, and disk read/write rates are all okay compared to other pipelines. The only thing a little higher is network rates: about 400k bytes/sec (2000 packets/sec) outgoing and 300k bytes/sec (1800 packets/sec) incoming.
What would you suggest I do?
The Dataflow SDK 2.x for Java and the Dataflow SDK for Python are based on Apache Beam. Make sure you are following the documentation as a reference when you update. Quotas can be an issue for a slow-running pipeline and a lack of output, but you mentioned those are fine.
It seems there is a need to look at the job. I recommend opening an issue on the PIT here and we'll take a look. Make sure to provide your project ID, job ID, and all the necessary details.

What is the practice for scheduling multiple inter-dependent SQL Server Agent jobs?

The way my team currently schedules jobs is through the SQL Server Job Agent. Many of these jobs have dependencies on other internal servers which in turn have their own SQL Server Jobs that need to be run to keep their data up to date.
This has created dependencies in the start time and length of each of our SQL Server jobs. Job A might depend on Job B finishing, so we schedule Job B a certain estimated amount of time ahead of Job A. All of this process is very subjective and not scalable as we add more jobs and servers, which create more dependencies.
I would love to get out of the business of subjectively scheduling these jobs and hoping that the dominos fall in the right order. I am wondering what the accepted practices for scheduling SQL Server jobs are. Do people use SSIS to chain jobs together? Is there tooling already built into the SQL Server Job Agent to handle this?
What is the accepted way to handle the scheduling of multiple SQL Server jobs with dependencies on each other?
I have used Control-M before to schedule multiple inter-dependent jobs in different environments. Control-M generally works by using batch files (from what I remember) to execute SSIS packages.
We had a complicated environment hosting 2 data warehouses side by side (1 International and 1 US Local). There were jobs that depended on other jobs, and those jobs on others, and so on, but by using Control-M we could easily define the dependencies (it has a really nice and intuitive GUI). Another tool that comes to mind is Tidal Scheduler.
There is no set standard for job scheduling, but I think it's safe to say that job schedules depend entirely on what an organization needs. For example, Finance jobs might depend on Sales, and Sales on Inventory, and so on. But the point is, if you need job interdependency, using third-party software such as Control-M is a safe bet. It can control jobs in different environments and give you a real sense of company-wide job control.
We too had the requirement to manage dependencies between multiple agent jobs - after looking at various 3rd party tools and discounting them for various reasons (mainly down to the internal constraints relating to the use of 3rd party software) we decided to create our own solution.
The solution centres around a configuration database that holds details about processes (jobs) that need to run and how they are grouped (batches), along with the dependencies between processes.
Summary of configuration tables used:
Batch - high-level definition of a group of related processes; includes metadata such as max concurrent processes, current batch instance, etc.
Process - metadata relating to a process (job), such as name, max wait time, earliest run time, status (enabled/disabled), batch (which batch the process belongs to), process job name, etc.
Batch Instance - the active instance of a given batch
Process Instance - active instances of processes for a given batch
Process Dependency - dependency matrix
Batch Instance Status - lookup for batch instance status
Process Instance Status - lookup for process instance status
Each batch has 2 control jobs - START BATCH and UPDATE BATCH. The 1st deals with starting all processes that belong to it and the 2nd is the last to run in any given batch and deals with updating the outcome statuses.
Each process has an agent job associated with it that gets executed by the START BATCH job - processes have a capped concurrency (defined in the batch configuration) so processes are started up to a max of x at a time and then START BATCH waits until a free slot becomes available before starting the next process.
The process agent job steps call a templated SSIS package that deals with the actual ETL work and with the decision-making around whether the process needs to run, has to wait for dependencies, etc.
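To illustrate the shape of that decision-making, here is a logic-only sketch in Python of the dependency wait; the real implementation lives in SQL/SSIS against the Process Instance and Process Dependency tables, and the status values and names here are illustrative:

import time
from typing import Callable, Dict, List

def wait_for_dependencies(
    process: str,
    dependencies: Dict[str, List[str]],   # process -> prerequisite processes
    get_status: Callable[[str], str],     # looks up a process instance status
    max_wait_seconds: int = 3600,         # the "max wait time" held in the Process table
    poll_seconds: int = 60,
) -> bool:
    """Return True when every prerequisite has succeeded, False on failure or timeout."""
    deadline = time.monotonic() + max_wait_seconds
    prereqs = dependencies.get(process, [])
    while time.monotonic() < deadline:
        statuses = {p: get_status(p) for p in prereqs}
        if any(s == "Failed" for s in statuses.values()):
            return False                  # a prerequisite failed: don't run this process
        if all(s == "Succeeded" for s in statuses.values()):
            return True                   # safe to start the ETL work
        time.sleep(poll_seconds)
    return False                          # exceeded the max wait time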
We are currently looking to move to a Service Broker solution for greater flexibility and control.
Anyway, that's probably too much detail and not enough example here, so the VS2010 project is available on request.
I'm not sure how much this will help, but we ended up creating an email solution for scheduling.
We built an email reader that accesses an Exchange mailbox. As jobs finish, they send an email to the mail reader to start another job. The other nice part is that most applications have email notifications built in, so there really isn't much in the way of custom programming.
We really only built it in the first place to handle data files coming in from lots of other partners. It was much easier to give them an email address rather than setting them up with an ftp site, etc.
The mail reader app now has grown to include basic filtering, time of day scheduling, use of semaphores to prevent concurrent jobs, etc. It really works great.

Can TeamCity tests be run asynchronously

In our environment we have quite a few long-running functional tests which currently tie up build agents and force other builds to queue. Since these agents are only waiting on test results they could theoretically just be handing off the tests to other machines (test agents) and then run queued builds until the test results are available.
For CI builds (including unit tests) this should remain inline as we want instant feedback on failures, but it would be great to get a better balance between the time taken to run functional tests, the lead time of their results, and the throughput of our collective builds.
As far as I can tell, TeamCity does not natively support this scenario so I'm thinking there are a few options:
Spin up more agents and assign them to a 'Test' pool. Trigger functional build configs to run on these agents (triggered by successful CI builds). While this seems the cleanest, it doesn't scale very well, as we then have a lead time for purchasing licenses and will often need to run tests in alternate environments, which would temporarily double (or more) the required number of test agents.
Add builds or build steps to launch tests on external machines, then immediately mark the build as successful so queued builds can be processed; then, when the tests are complete, mark the build as succeeded/failed. This relies on being able to update the results of a previous build (REST API perhaps?). It also feels ugly to mark something as successful and then update it as failed later, but we could always be selective in what we monitor so we only see the final result.
Just keep spinning up agents until we no longer have builds queueing. The problem with this is that it's a moving target. If we knew where the plateau was (or whether it existed) this would be the way to go, but our usage pattern means this isn't viable.
Has anyone had success with a similar scenario, or knows pros/cons of any of the above I haven't thought of?
Your description of the available options seems to be pretty accurate.
If you want live updates of the builds' progress, you will need to have one TeamCity agent "busy" for each running build.
The only downside here seems to be the agent licenses cost.
If the testing builds just launch processes on other machines, the TeamCity agent processes themselves can be run on a low-end machine and even many agents on a single computer.
An extension to your second scenario could be two build configurations instead of a single one: one would start the external process, and the other would be triggered on the external process's completion and then publish all the external process results as its own. It could also have a snapshot dependency on the starting build to maintain the relation.
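As a rough illustration of the second half of that setup (the server URL, credentials, and build configuration ID are placeholders), the external test machine could queue the publishing configuration through TeamCity's REST API once the tests finish:

import requests

TEAMCITY_URL = "https://teamcity.example.com"            # placeholder server
BUILD_TYPE_ID = "Project_PublishFunctionalTestResults"   # hypothetical build configuration id

# Queue the second build configuration; it would then import the external test
# results (e.g. as artifacts or test reports) as its own.
response = requests.post(
    f"{TEAMCITY_URL}/app/rest/buildQueue",
    data=f'<build><buildType id="{BUILD_TYPE_ID}"/></build>',
    headers={"Content-Type": "application/xml"},
    auth=("tc_user", "tc_password"),                      # a service account with trigger rights
    timeout=30,
)
response.raise_for_status()
print("Queued build, HTTP status:", response.status_code)

Publishing into a second build rather than rewriting the first one is also consistent with the finding below that build results can't be updated after the fact.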
For anyone curious, we ended up buying more agents and assigning them to a test pool. Investigation proved that it isn't possible to update build results (I can definitely understand why this ugliness wouldn't be supported out of the box).