Grokking the parallel matrix - gitlab-ci

This seems to be a misuse of the new Parallel Matrix feature in GitLab 13.3 (https://docs.gitlab.com/ee/ci/yaml/#parallel-matrix-jobs).
I have a collection of parallel jobs for a set of services: build (Docker image), test, release, delete, and so on. The codebase is organized so that each service lives in a separate subdirectory.
This way I can have a common template:
variables:
  IMAGE_NAME: $CI_REGISTRY_IMAGE/$LOCATION

.build-template:
  script:
    - docker build --tag $IMAGE_NAME:$CI_PIPELINE_ID-$CI_COMMIT_REF_SLUG --tag $IMAGE_NAME:latest $LOCATION
  stage: build
  when: manual
then multiple build jobs:
build-alpha:
  extends: .build-template
  variables:
    LOCATION: alpha

build-beta:
  extends: .build-template
  variables:
    LOCATION: beta
.... and repeat as needed.
I can then do the same thing for test, release, and delete jobs: a common template that takes just the one variable to distinguish the service.
Matrix to the rescue?
It would seem to me that
variables:
  IMAGE_NAME: $CI_REGISTRY_IMAGE/$LOCATION

build-services:
  parallel:
    matrix:
      - LOCATION: alpha
      - LOCATION: beta
  script:
    - docker build --tag $IMAGE_NAME:$CI_PIPELINE_ID-$CI_COMMIT_REF_SLUG --tag $IMAGE_NAME:latest $LOCATION
  stage: build
  when: manual
would be an ideal candidate for this matrix form.... but apparently matrix requires two variables.
Has anyone got a good solution for this multiple-parallel jobs problem?

"would be an ideal candidate for this matrix form... but apparently matrix requires two variables."
Not anymore.
See GitLab 13.5 (October 2020)
Allow one-dimensional parallel matrices
Previously, the parallel: matrix keyword, which runs a matrix of jobs in parallel, only accepted two-dimensional matrix arrays. This was limiting if you wanted to specify your own array of values for certain jobs.
In this release, you now have more flexibility to run your jobs the way that works best for your development workflow.
You can run a parallel matrix of jobs in a one-dimensional array, making your pipeline configuration much simpler. Thanks Turo Soisenniemi for your amazing contribution!
Here’s a basic example of this in practice that will run 3 test jobs for different versions of Node.js, but you can apply this approach to your specific use cases and easily add or remove jobs in your pipeline as well:
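(The release post's example isn't reproduced here; a minimal sketch in the same spirit, with illustrative Node.js versions and scripts, would be:)

test-node:
  image: node:$NODE_VERSION
  script:
    - npm ci
    - npm test
  parallel:
    matrix:
      - NODE_VERSION: ["10", "12", "14"]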
See Documentation and Issue.
And with GitLab 13.10 (March 2021)
Use 'parallel: matrix' with trigger jobs
You can use the parallel: matrix keywords to run jobs multiple times in parallel, using different variable values for each instance of the job.
Unfortunately, you could not use it with trigger jobs.
In this release, we’ve expanded the parallel matrix feature to support trigger jobs as well, so you can now run multiple downstream pipelines (child or multi-project pipelines) in parallel, using different variable values for each downstream pipeline.
This lets you configure CI/CD pipelines that are faster and more flexible.
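A minimal sketch of the idea (the child-pipeline path and the variable values here are illustrative):

deploy-stacks:
  stage: deploy
  trigger:
    include: path/to/child-pipeline.yml
  parallel:
    matrix:
      - STACK: [app1, app2, monitoring]

Each matrix entry triggers its own child pipeline with the corresponding STACK value.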
See Documentation and Issue.

This is a known issue in GitLab.
There is a workaround using a "dummy" value for the second variable (I'm working with GitLab 13.3.1).
For your example, it gives this:
parallel:
  matrix:
    - LOCATION: ['alpha', 'beta']
      DUMMY: 'dummy'

Did you try:
parallel:
  matrix:
    - LOCATION: [alpha, beta]

Related

Running dbt tests as a pre-hook, prior to running a model

I want to prevent the scenario where my model runs even though one or more of its source tables are (erroneously) empty. The phrase coming to mind is a "pre-hook," although I'm not sure that's the right terminology.
Ideally I'd run dbt run --select MY_MODEL and, as part of that, these tests for non-emptiness in the source tables would run. However, I'm not sure dbt works like that. Currently I'm thinking I'll have to apply these tests to the sources and run those tests (as described in this document) prior to executing dbt run.
Is there a more direct way of having dbt run fail if any of these sources are empty?
Personally, the way I'd go about this would be to define your my_source.yml to have not_null tests on every column, using something like this example from the docs:
version: 2

sources:
  - name: jaffle_shop
    database: raw
    schema: public
    loader: emr # informational only (free text)
    loaded_at_field: _loaded_at # configure for all sources

    tables:
      - name: orders
        identifier: Orders_
        loaded_at_field: updated_at # override source defaults
        columns:
          - name: id
            tests:
              - not_null
          - name: price_in_usd
            tests:
              - not_null
And then in your run / build, use the following order of operations:
dbt test --select source:*
dbt build
In this circumstance, I'd highly recommend making your own variation on the generate_source macro from dbt-codegen, which automatically defines your sources with columns and not_null tests included.
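For reference, an invocation of that macro usually looks something like the following (the schema and database names here are placeholders):

dbt run-operation generate_source --args '{"schema_name": "public", "database_name": "raw", "generate_columns": true}'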

Pylint: same pylint and pandas version on 2 machines, 1 fails

I have 2 places running the same linting job:
Machine 1: Ubuntu over SSH
pandas==1.2.3
pylint==2.7.4
Python 3.8.10
Machine 2: GitLab CI Docker image, python:3.8.12-buster
pandas==1.2.3
pylint==2.7.4
Python 3.8.12
The Ubuntu machine is able to lint all the code fine, and it has for many months. Same for the CI job, except it had been running Python 3.7.8. Now that I upgraded the Docker image to Python 3.8.12, it throws several no-member linting errors on some Pandas objects. I've tried clearing CI caches etc.
I wish I could provide something more reproducible. But, to check my understanding of what a linter is doing, is it theoretically possible that a small version difference in python messes up pylint like this? For something like a no-member error on Pandas objects, I would think the dominant factor is the pandas version, but those are equal, so I'm confused!
Update:
I've looked at the Pandas code for pd.read_sql_query, which is what's causing the no-member error. It says:
def read_sql_query(
    sql,
    con,
    index_col=None,
    coerce_float=True,
    params=None,
    parse_dates=None,
    chunksize: Optional[int] = None,
) -> Union[DataFrame, Iterator[DataFrame]]:
In Docker, I get E1101: Generator 'generator' has no 'query' member (no-member) (because I'm running .query on the returned dataframe). So it seems Pylint thinks that this function returns a generator. But it does not make this assumption in my other setup. (I've also verified the SHA sum of pandas/io/sql.py matches). This seems similar to this issue, but I am still baffled by the discrepancy in environments.
A fix that worked was to bump a limit like:
init-hook = "import astroid; astroid.context.InferenceContext.max_inferred = 500"
in my .pylintrc file, as explained here.
I'm unsure why/if this is connected to my change in Python version, but I'm happy to use this and move on for now. It's probably complex.
(Another hack was to write a function that returns the passed arg if it is a DataFrame, and returns a single DataFrame if it is an iterable of DataFrames. The ambiguous-type object could then be passed through this wrapper to clarify things for Pylint. While this was more intrusive on our codebase, we had dozens of calls to pd.read_csv and pd.read_sql_query, and only about 3 calls confused Pylint, so we almost used this solution.)
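A minimal sketch of that wrapper idea (the function name is made up, and collapsing the chunked case with pd.concat is just one reasonable choice):

from typing import Iterator, Union

import pandas as pd


def as_dataframe(result: Union[pd.DataFrame, Iterator[pd.DataFrame]]) -> pd.DataFrame:
    # A plain DataFrame passes straight through, so Pylint can infer a concrete type.
    if isinstance(result, pd.DataFrame):
        return result
    # Otherwise assume an iterator of chunks (e.g. read_sql_query called with chunksize)
    # and collapse it into a single DataFrame.
    return pd.concat(result, ignore_index=True)


# Usage (assuming an existing DB-API connection `con`):
# df = as_dataframe(pd.read_sql_query("SELECT * FROM my_table", con)).query("id > 0")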

How to use an "or" statement to make a dependent job in GitLab CI?

I have a pipeline I'm putting together that has three different "deploy" steps, each with its own unique deployment, but all of which should be able to trigger the same follow-up job. Ideally, I would like to find a way to "or" the items inside of the needs section to make the job automatically run after one of those previous jobs completes.
I know I could create a separate "run" job for each deploy job, but I'd like to avoid repeating myself if possible.
Int (Dry Run):
  extends: .stageBatchDryRunJob
  stage: Deploy Non-Prod
  except:
    variables:
      - $CI_COMMIT_TAG =~ /^v[0-9]+\.[0-9]+\.[0-9]+$/
  variables:
    <<: *hubNonVars
    kube_cluster_id: HubInt
    kube_env: int
  environment: int
Int (Rollback):
  extends: .stageBatchRollbackJob
  stage: Deploy Non-Prod
  except:
    variables:
      - $CI_COMMIT_TAG =~ /^v[0-9]+\.[0-9]+\.[0-9]+$/
  variables:
    <<: *hubNonVars
    kube_cluster_id: HubInt
    kube_env: int
  environment: int
I would like to have a single "run" job that requires one of these two above jobs to be completed.
Int (run):
  extends: .run-batch
  variables:
    TOWER_JOB_TEMPLATE: $TOWER_JOB_TEMPLATE_INT
    kube_cluster_id: HubInt
    kube_env: int
  needs: [Int (Dry Run), Int (Rollback)] # can this be "or"-ed? i.e. needs Int (Dry Run) OR Int (Rollback)
If both jobs actually exist in the pipeline, no. There is no way to do what you describe with a single job and have the downstream job start immediately.
However, if only one of the two jobs will exist in the pipeline (for example one is excluded by a rules configuration), you can leverage the needs optional feature.
needs:
  - job: "Int (Dry Run)"
    optional: true
  - job: "Int (Rollback)"
    optional: true
Of course, this introduces the possibility that the job could run when neither of those jobs exists. So you may want to run a check in your job to see whether expected artifacts or variables exist, as sketched below.
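For example, a rough sketch of such a guard (UPSTREAM_READY is a hypothetical variable that the Dry Run / Rollback jobs would export via a dotenv artifact):

script:
  - |
    # Fail fast if neither upstream job ran and exported UPSTREAM_READY.
    if [ -z "$UPSTREAM_READY" ]; then
      echo "No upstream deploy job ran in this pipeline" >&2
      exit 1
    fi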
You can also make the two-job approach a bit more DRY: you can use !reference tags or other YAML features like anchors/extends (which you're already using in your provided code).
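For instance, a sketch of the two-job variant kept DRY with a hidden template (the template and job names are illustrative):

.int-run-common:
  extends: .run-batch
  variables:
    TOWER_JOB_TEMPLATE: $TOWER_JOB_TEMPLATE_INT
    kube_cluster_id: HubInt
    kube_env: int

Int (run after dry run):
  extends: .int-run-common
  needs: ["Int (Dry Run)"]

Int (run after rollback):
  extends: .int-run-common
  needs: ["Int (Rollback)"]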

How to create a negative test case in GitHub

I am working on a repository in GitHub and learning to use their Workflows and Actions to execute CI tests. I have created a simple workflow that runs against a shell script to test a simple mathematical expression y-x=expected_val. This workflow isn't that different from other automatic tests I have set up on code in the past, but I cannot figure out how to perform negative test cases.
on:
  push:
    branches:
      - 'Math-Test-Pass*'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - name: T1. Successful math test
        uses: ./.github/actions/mathTest
        with:
          OPERAND1: 3
          OPERAND2: 5
          ANSWER: 2
      - name: T2. Mismatch answer math test
        if: ${{ always() }}
        uses: ./.github/actions/mathTest
        with:
          OPERAND1: -3
          OPERAND2: 2
          ANSWER: 1
      - name: T3. Missing operand math test
        if: ${{ always() }}
        uses: ./.github/actions/mathTest
        with:
          OPERAND1: -3
          ANSWER: 5
      - name: T4. Another test should pass
        if: ${{ always() }}
        uses: ./.github/actions/mathTest
        with:
          OPERAND1: 6
          OPERAND2: 9
          ANSWER: 3
      - name: T5. Another test should pass
        uses: ./.github/actions/mathTest
        with:
          OPERAND1: 1
          OPERAND2: 9
          ANSWER: 8
Now, I expected tests T.2 and T.3 to fail, but I ran into two problems. First, I want all the steps to execute, and the errors thrown by T.2 and T.3 make the job status a failure. GitHub's default behavior is to not run any additional steps unless I force it with something like if: ${{ always() }}. This means that T.3 and T.4 only run because of that logic, and T.5 doesn't run at all.
The second problem is that while the mathTest action failed on T.2 and T.3, that was the intended behavior. It did exactly what it was supposed to do by failing. I wanted to show that improperly configuring the parameters makes the script fail. These negative tests shouldn't show up as failures, but as successes. The whole math test suite should pass to show that the script in question raises the right errors as well as giving the right answers.
There is a third case that doesn't show up here. I definitely don't want to use continue-on-error: if the script failed to throw an error, I want the test case to fail. There should be a failure, and then the rest of the tests should continue. My ideal solution would show a pass on T.2 and T.3 and run T.4 and T.5. The same solution would also fail on T.2 or T.3 if they didn't generate an error, and still run T.4 and T.5. I just don't know how to achieve that.
I have considered a couple of options, but I don't know what is usually done. I expect that while I could jury-rig something (e.g. put the expected failure into the script as another parameter, nest the testing in a second script that passes the parameters and catches the error, etc.), there is some standard way of doing this that I haven't considered. I'm looking for anyone who can tell me how it should be done.
I obtained an answer from the GitHub community that I want to share here.
https://github.community/t/negative-testing-with-workflows/116559
The answer is that the workflow should kick off a testing tool instead of a series of separate action steps, and let that tool handle positive/negative testing on its own. The example given by the respondent is https://github.com/lee-dohm/close-matching-issues/blob/c65bd332c8d7b63cc77e463d0103eed2ad6497d2/.github/workflows/test.yaml#L16, which uses npm for testing.
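As a rough illustration of that idea (the runner script and the mathTest command-line interface below are hypothetical; the linked example uses npm instead), a single workflow step such as run: ./tests/run_math_tests.sh could execute a script that treats an expected failure as a pass:

#!/usr/bin/env bash
# Hypothetical runner: negative cases pass when mathTest exits non-zero.
failures=0

expect_pass() {
  ./mathTest "$@" || { echo "FAIL (expected success): $*"; failures=$((failures + 1)); }
}

expect_fail() {
  if ./mathTest "$@"; then
    echo "FAIL (expected an error): $*"
    failures=$((failures + 1))
  fi
}

expect_pass 3 5 2     # T1
expect_fail -3 2 1    # T2: mismatched answer
expect_fail -3 '' 5   # T3: missing operand
expect_pass 6 9 3     # T4
expect_pass 1 9 8     # T5

exit "$failures"

The job fails only when a positive case errors out or a negative case unexpectedly succeeds, and every case always runs.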

How do you use the benchmark flags for the go (golang) gocheck testing framework?

How does one use the flag options for benchmarks with the gocheck testing framework? In the link that I provided, it seems the only example they give is running go test -check.b, but they do not provide additional comments on how it works, so it's hard to use. I could not even find the -check flags in the Go documentation when I ran go help test or go help testflag. In particular, I want to know how to use the benchmark framework better and control how long benchmarks run for, or for how many iterations they run, etc. For example, in the example they provide:
func (s *MySuite) BenchmarkLogic(c *C) {
    for i := 0; i < c.N; i++ {
        // Logic to benchmark
    }
}
There is the variable c.N. How does one specify that variable? Is it through the actual program itself or is it through go test and its flags or the command line?
On a side note, the documentation from go help testflag does talk about the -bench regexp, -benchmem, and -benchtime t options; however, it does not mention the -check.b option. I did try the options described there, but they didn't do anything I could notice. Does gocheck work with the original go test options?
The main problem I see is that there is no clear documentation for how to use the gocheck tool or its flags. I accidentally gave it a wrong flag and it threw an error message listing the flags I need (with limited descriptions):
-check.b=false: Run benchmarks
-check.btime=1s: approximate run time for each benchmark
-check.f="": Regular expression selecting which tests and/or suites to run
-check.list=false: List the names of all tests that will be run
-check.v=false: Verbose mode
-check.vv=false: Super verbose mode (disables output caching)
-check.work=false: Display and do not remove the test working directory
-gocheck.b=false: Run benchmarks
-gocheck.btime=1s: approximate run time for each benchmark
-gocheck.f="": Regular expression selecting which tests and/or suites to run
-gocheck.list=false: List the names of all tests that will be run
-gocheck.v=false: Verbose mode
-gocheck.vv=false: Super verbose mode (disables output caching)
-gocheck.work=false: Display and do not remove the test working directory
-test.bench="": regular expression to select benchmarks to run
-test.benchmem=false: print memory allocations for benchmarks
-test.benchtime=1s: approximate run time for each benchmark
-test.blockprofile="": write a goroutine blocking profile to the named file after execution
-test.blockprofilerate=1: if >= 0, calls runtime.SetBlockProfileRate()
-test.coverprofile="": write a coverage profile to the named file after execution
-test.cpu="": comma-separated list of number of CPUs to use for each test
-test.cpuprofile="": write a cpu profile to the named file during execution
-test.memprofile="": write a memory profile to the named file after execution
-test.memprofilerate=0: if >=0, sets runtime.MemProfileRate
-test.outputdir="": directory in which to write profiles
-test.parallel=1: maximum test parallelism
-test.run="": regular expression to select tests and examples to run
-test.short=false: run smaller test suite to save time
-test.timeout=0: if positive, sets an aggregate time limit for all tests
-test.v=false: verbose: print additional output
Is writing wrong commands the only way to get some help with this tool? Doesn't it have a help flag or something?
I'm 5 years late, but to specify how many iterations (N) to run, use the option -benchtime Nx.
Example:
go test -bench=. -benchtime 100x
BenchmarkTest 100 ... ns/op
Please read more about all go testing flags here.
See the Description of testing flags:
-bench regexp
Run benchmarks matching the regular expression.
By default, no benchmarks run. To run all benchmarks,
use '-bench .' or '-bench=.'.
-check.b enables benchmarks; note from the flag list above that, unlike -test.bench, it is a boolean rather than a regular expression.
E.g. to run all benchmarks:
go test -check.b
To run a specific benchmark, combine it with the -check.f filter:
go test -check.b -check.f BenchmarkLogic
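Since the question also asks about controlling how long benchmarks run, note that (per the flag list above) you can combine this with -check.btime, for example:
go test -check.b -check.btime 10s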
More information about testing in Go can be found here.