Delimit BigQuery REGEXP_EXTRACT strings in Google Cloud Build YAML script

I have a complex query that creates a View within the BigQuery console.
I have simplified it to the following to illustrate the issue:
SELECT
REGEXP_EXTRACT(FIELD1, r"[\d]*") as F1,
REGEXP_REPLACE(FIELD2, r"\'", "") AS F2,
FROM `project.mydataset.mytable`
Now I am trying to automate the creation of the view with Cloud Build.
I cannot work out how to delimit the strings inside the regex so that they work with both YAML and SQL.
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bq'
  args: [
    'mk',
    '--use_legacy_sql=false',
    '--project_id=${_PROJECT_ID}',
    '--expiration=0',
    '--view=
    REGEXP_EXTRACT(FIELD1, r"[\d]*") as F1 ,
    REGEXP_REPLACE(FIELD2, r"\'", "") AS F2,
    REGEXP_EXTRACT(FIELD3, r"\[(\d{3,12}).*\]") AS F3
    FROM `project.mydataset.mytable`"
    '${_TARGET_DATASET}.${_TARGET_VIEW}'
  ]
I get the following error
Failed to trigger build: failed unmarshalling build config
cloudbuild/build-views.yaml: json: cannot unmarshal number into Go
value of type string
I have tried using Cloud Build substitution parameters, and as many combinations of SQL and YAML escape sequences as I can think of to find a working solution.

Generally, you want to use block scalars in such cases, as they do not process any special characters inside them and are terminated via indentation.
I have no idea how the command is supposed to look, but here's something that's at least valid YAML:
- name: 'gcr.io/cloud-builders/gcloud'
  entrypoint: 'bq'
  args:
  - 'mk'
  - '--use_legacy_sql=false'
  - '--project_id=${_PROJECT_ID}'
  - '--expiration=0'
  - >- # folded block scalar; newlines are folded into spaces
    --view=
    REGEXP_EXTRACT(FIELD1, r"[\d]*") as F1,
    REGEXP_REPLACE(FIELD2, r"\'", "") AS F2,
    REGEXP_EXTRACT(FIELD3, r"\[(\d{3,12}).*\]") AS F3
    FROM `project.mydataset.mytable`"
    '${_TARGET_DATASET}.${_TARGET_VIEW}'
  - dummy value to show that the scalar ends here
A folded block scalar is started with >; the trailing - tells YAML not to append a final newline to its value.
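As a side-by-side illustration (not specific to Cloud Build), here is a minimal sketch of how the two block scalar styles resolve:
folded: >-     # newlines are folded into spaces, no trailing newline
  SELECT 1
  FROM `project.mydataset.mytable`
# value: "SELECT 1 FROM `project.mydataset.mytable`"
literal: |-    # newlines are kept, no trailing newline
  SELECT 1
  FROM `project.mydataset.mytable`
# value: "SELECT 1\nFROM `project.mydataset.mytable`"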

Related

Deploy sql workflow with DBX

I am developing deployment via DBX to Azure Databricks. In this regard, I need a data job written in SQL to run every day. The job is located in the file data.sql. I know how to do it with a Python file. Here I would do the following:
build:
  python: "pip"
environments:
  default:
    workflows:
      - name: "workflow-name"
        #schedule:
        quartz_cron_expression: "0 0 9 * * ?" # every day at 9.00
        timezone_id: "Europe"
        format: MULTI_TASK #
        job_clusters:
          - job_cluster_key: "basic-job-cluster"
            <<: *base-job-cluster
        tasks:
          - task_key: "task-name"
            job_cluster_key: "basic-job-cluster"
            spark_python_task:
              python_file: "file://filename.py"
But how can I change it so I can run a SQL job instead? I imagine it is the last two lines of code (spark_python_task: and python_file: "file://filename.py") which need to be changed.
There are various ways to do that.
(1) One of the simplest is to add a SQL query in the Databricks SQL lens, and then reference this query via sql_task as described here (a sketch is shown after option (3) below).
(2) If you want to have a Python project that re-uses SQL statements from a static file, you can add this file to your Python Package and then call it from your package, e.g.:
sql_statement = ... # code to read from the file
spark.sql(sql_statement)
(3) A third option is to use the DBT framework with Databricks. In this case you probably would like to use dbt_task as described here.
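For option (1), here is a rough, untested sketch of what such a task entry might look like in the dbx deployment file; the field names (sql_task, warehouse_id, query_id) are taken from the Databricks Jobs API and the placeholder values are hypothetical, so verify them against the current dbx documentation:
tasks:
  - task_key: "sql-task-name"
    sql_task:
      warehouse_id: "<sql-warehouse-id>"   # hypothetical placeholder
      query:
        query_id: "<saved-query-id>"       # hypothetical placeholder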
I found a simple workaround (although it might not be the prettiest): change data.sql to a Python file and run the queries using Spark. This way I could keep using the same spark_python_task.

Gitlab CI only variables that come from extends

I have the following setup (simplified version), which doesn't run the expected merge::my job when I use a tag that includes the string "TEST". I can't figure out why this is happening - I know that only doesn't support variable expansion, but here the variable is just a string that is being set up in a different extend - is that a problem? Would using YAML anchors be better? Are there different suggestions?
The reason I check only:variables in .merge_builds is that I have many languages; in this case I used en, but I have many others, and I don't want to repeat the only:variables block for each (the real matching is more complex - I stripped it to the bare minimum for the example).
.merge_builds:
  script:
    - echo 'test'
  only:
    variables:
      - $CI_COMMIT_TAG =~ $VARIABLEMATCH

.en_variables:
  variables:
    VARIABLEMATCH: /^$|(?i)EN/

merge::en:
  extends:
    - .en_variables
    - .merge_builds
Based on GitLab issue 35438, I'd say that it is not currently possible to use a variable (as opposed to a literal) as the regular expression pattern.
Within issue 35438, @furkanayhan explains in a comment titled "Introduction" from 2021-09-06 (sorry, I wasn't able to get a permalink to it) that GitLab will make a simple string comparison between a value and a pattern given as a variable:
variables:
  teststring: 'abcde'
  pattern: '/^ab.*/'

test1:
  script: exit 0
  rules:
    - if: '$teststring =~ $pattern'

test2:
  script: exit 0
  rules:
    - if: '$teststring =~ /^ab.*/'
The test1 job is not created because the backend makes a string comparison between "abcde" and "/^ab.*/".
The test2 job is created because the backend makes a regexp comparison between "abcde" and /^ab.*/.
I believe that you are encountering the same behavior that caused "test1 job" not to be created.
However, issue 35438 shows that GitLab is planning on offering a fix in version 15.0, scheduled for 2022-05-22.
One other thing you might want to check on is the regular expression itself. GitLab's regexp doc (here) states that GitLab uses the re2 regular expression syntax for these kinds of comparison. To achieve case insensitivity, I believe one appends the "i" flag as in:
/pattern/i
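Until that fix lands, one possible workaround (a sketch only, not something I have tested) is to inline the literal pattern in each job instead of passing it through a variable, since literal patterns are evaluated as regular expressions:
merge::en:
  extends:
    - .merge_builds
  only:
    variables:
      - $CI_COMMIT_TAG =~ /^$|(?i)EN/   # literal pattern; replaces the extended only:variables entry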

Running command in perl6, commands that work in shell produce failure when run inside perl6

I'm trying to run a series of shell commands with Perl 6, each assigned to the variable $cmd, which look like:
databricks jobs run-now --job-id 35 --notebook-params '{"directory": "s3://bucket", "output": "s3://bucket/extension", "sampleID_to_canonical_id_map": "s3://somefile.csv"}'
Splitting the command by everything after notebook-params
my $cmd0 = 'databricks jobs run-now --job-id 35 --notebook-params ';
my $args = "'{\"directory\": \"$in-dir\", \"output\": \"$out-dir\", \"sampleID_to_canonical_id_map\": \"$map\"}'";
my $run = run $cmd0, $args, :err, :out;
Fails. No answer given either by Databricks or the shell. Stdout and stderr are empty.
Splitting the entire command by white space
my @cmd = $cmd.split(/\s+/);
my $run = run $cmd, :err, :out
Error: Got unexpected extra arguments ("s3://bucket", "output":
"s3://bucket/extension",
"sampleID_to_canonical_id_map":
"s3://somefile.csv"}'
Submitting the command as a string
my $cmd = "$cmd0\"$in-dir\", \"output\": \"$out-dir\", \"sampleID_to_canonical_id_map\": \"$map\"}'";
again, stdout and stderr are empty. Exit code 1.
this is something about how run can only accept arrays, and not strings (I'm curious why)
If I copy and paste the command that was given to Perl6's run, it works when given from the shell. It doesn't work when given through perl6. This isn't good, because I have to execute this command hundreds of times.
Perhaps Perl6's shell https://docs.perl6.org/routine/shell would be better? I didn't use that, because the manual suggests that run is safer. I want to capture both stdout and stderr inside a Proc class
EDIT: I've gotten this running with shell but have encountered other problems not related to what I originally posted. I'm not sure if this qualifies as being answered then. I just decided to use backticks with perl5. Yes, backticks are deprecated, but they get the job done.
I'm trying to run a series of shell commands
To run shell commands, call the shell routine. It passes the positional argument you provide it, coerced to a single string, to the shell of the system you're running the P6 program on.
For running commands without involving a shell, call the run routine. The first positional argument is coerced to a string and passed to the operating system as the filename of the program you want run. The remaining arguments are concatenated together with a space in between each argument to form a single string that is passed as a command line to the program being run.
my $cmd0 = 'databricks jobs run-now --job-id 35 --notebook-params ';
That's wrong for both shell and run:
shell only accepts one argument and $cmd0 is incomplete.
The first argument for run is a string interpreted by the OS as the filename of a program to be run and $cmd0 isn't a filename.
So in both cases you'll get either no result or nonsense results.
Your other two experiments are also invalid in their own ways as you discovered.
this is something about how run can only accept arrays, and not strings (I'm curious why)
run can accept a single argument. It would be passed to the OS as the name of the program to be run.
It can accept two arguments. The first would be the program name, the second the command line passed to the program.
It can accept three or more arguments. The first would be the program name, the rest would be concatenated to form the command line passed to the program. (There are cases where this is more convenient coding wise than the two argument form.)
run can also accept a single array. The first element would be the program name and the rest the command line passed to it. (There are cases where this is more convenient.)
I just decided to use backticks with perl5. Yes, backticks are deprecated, but they get the job done.
Backticks are subject to code injection and shell interpolation attacks and errors. But yes, if they work, they work.
P6 has direct equivalents of most P5 features. This includes backticks. P6 has two variants:
The safer P6 alternative to backticks is qx. The qx quoting construct calls the shell but does not interpolate P6 variables, so it carries about the same level of danger as using shell with a single quoted string.
The qqx variant is the direct equivalent of P5 backticks or using shell with a double quoted string so it suffers from the same security dangers.
Two mistakes:
the simplistic split cuts up the last, single parameter into multiple arguments
you are passing $cmd to run, not @cmd
use strict;
my @cmd = ('/tmp/dummy.sh', '--param1', 'param2 with spaces');
my $run = run @cmd, :err, :out;
print(@cmd ~ "\n");
print("EXIT_CODE:\t" ~ $run.exitcode ~ "\n");
print("STDOUT:\t" ~ $run.out.slurp ~ "\n");
print("STDERR:\t" ~ $run.err.slurp ~ "\n");
output:
$ cat /tmp/dummy.sh
#!/bin/bash
echo "prog: '$0'"
echo "arg1: '$1'"
echo "arg2: '$2'"
exit 0
$ perl6 dummy.pl
/tmp/dummy.sh --param1 param2 with spaces
EXIT_CODE: 0
STDOUT: prog: '/tmp/dummy.sh'
arg1: '--param1'
arg2: 'param2 with spaces'
STDERR:
If you can avoid generating $cmd as a single string, I would generate it into @cmd directly. Otherwise you'll have to implement a complex split operation that handles quoting.

ElastAlert flatline not finding results

Creating a flatline alert type using the ElastAlert framework.
When I use the query in the Kibana UI with the exact same syntax it returns results, but ElastAlert isn't returning any results.
Here's my elastalert-rule-file.yaml
name: Test Flatline
type: flatline
run_every:
  seconds: 15
realert:
  minutes: 0
es_host: localhost
es_port: 9200
threshold: 1
timeframe:
  minutes: 5
index: my-index-*
filter:
- query:
    query_string:
      query: "_type:metric" # this returns results in both kibana and elastalert
      #query: "_type:metric AND _exists_:My\ Field\ With\ Spaces.value" # this returns results in kibana but not in elastalert
timestamp_type: unix_ms
alert:
- command
command: ["my-bash-script.sh"]
So I tried playing around with the query, and if I just specify _type:metric then the search results in Kibana seem to match those in ElastAlert.
However, when I attempt to use the _exists_ Lucene syntax in the second query, ElastAlert doesn't return anything, while Kibana seems to be fine with the syntax.
Any ideas?
I got it...just forgot to post an answer.
Apparently for the field with spaces you need to escape the backslashes so the line in question would look like this:
query: "_type:metric AND _exists_:My\\ Field\\ With\\ Spaces.value"
Furthermore, in the special case where you are using Ansible (YAML) configuration you need to add a backslash to escape each backslash.
So the entry in a YAML file would look something like this:
query: "My\\\\ field\\\\ With\\\\ Spaces.value"
You can avoid escaping by using double quotes for the field data:
query: '_type:metric AND _exists_:"My Field With Spaces.value"'
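In the context of the rule file above, the corrected filter block would then look something like this (an untested sketch):
filter:
- query:
    query_string:
      query: '_type:metric AND _exists_:"My Field With Spaces.value"'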

Unable to extract data with double pipe delimiter in Pig Script

I am trying to extract data which is pipe delimited in Pig. Following is my command
L = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('||');
I am getting the following error:
2016-08-04 23:58:21,122 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 1, column 4> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'PigStorage' with arguments '[||]'
My sample input file has exactly 5 lines, as follows:
POS_TIBCO||HDFS||POS_LOG||1||7806||2016-07-18||1||993||0
POS_TIBCO||HDFS||POS_LOG||2||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||3||7806||2016-07-18||1||0||5
POS_TIBCO||HDFS||POS_LOG||4||7806||2016-07-18||1||0||0
POS_TIBCO||HDFS||POS_LOG||5||7806||2016-07-18||1||0||19.99
I tried several options like using a backslash before the delimiter (\||, \|\|) but everything failed. Also, I tried with a schema but got the same error. I am using Hortonworks (HDP 2.2.4) and Pig (0.14.0).
Any help is appreciated. Please let me know if you need any further details.
I have faced this case, and by checking the PigStorage source code, I think the PigStorage argument is parsed as only one character.
So we can use this code instead:
L0 = LOAD 'entirepath_in_HDFS/b.txt/part-m*' USING PigStorage('|');
L = FOREACH L0 GENERATE $0,$2,$4,$6,$8,$10,$12,$14,$16;
It's helpful if you know how many columns you have, and it will not affect performance because it's map-side.
When you load data using PigStorage, it expects only a single character as the delimiter.
However, if you still want to achieve this you can use MyRegExLoader:
REGISTER '/path/to/piggybank.jar'
A = LOAD '/path/to/dataset' USING org.apache.pig.piggybank.storage.MyRegExLoader('||')
as (movieid:int, title:chararray, genre:chararray);