What is the structure of the executable transformation script for transform_script of GCSFileTransformOperator?

I'm currently working on a task in Airflow that requires pre-processing a large CSV file using GCSFileTransformOperator. I've been reading the documentation on the class and its implementation, but I don't quite understand how the executable transformation script for transform_script should be structured.
For example, is the following script structure correct? If so, does that mean that with GCSFileTransformOperator, Airflow calls the executable transformation script and passes it arguments on the command line?
# Import the required modules
# (import whatever preprocessing modules you need here)
import sys

# Define the function that takes the source_file and destination_file params
def preprocess_file(source_file, destination_file):
    # (1) code that processes the source_file
    # (2) code that then writes to destination_file
    ...

# Extract source_file and destination_file from the list of command-line arguments
source_file = sys.argv[1]
destination_file = sys.argv[2]
preprocess_file(source_file, destination_file)

GCSFileTransformOperator passes the script to subprocess.Popen, so your script will work, but you will need to add a shebang line such as #!/usr/bin/python (or wherever Python is on your path in Airflow).
Your arguments are correct, and the format of your script can be anything you want. Airflow passes in the path of the downloaded file and a temporary new file:
cmd = (
    [self.transform_script]
    if isinstance(self.transform_script, str)
    else self.transform_script
)
cmd += [source_file.name, destination_file.name]
with subprocess.Popen(
    args=cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, close_fds=True
) as process:
    # ...
    process.wait()
(you can see this in the GCSFileTransformOperator source code)
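For context, here is a minimal sketch of how the operator might be wired into a DAG; the bucket names, object paths, and script location are placeholders, not values from your setup:
from airflow.providers.google.cloud.operators.gcs import GCSFileTransformOperator

# Hypothetical task: all bucket/object names and the script path are placeholders
preprocess = GCSFileTransformOperator(
    task_id="preprocess_csv",
    source_bucket="my-source-bucket",
    source_object="data/input.csv",
    destination_bucket="my-dest-bucket",
    destination_object="data/output.csv",
    # The list form lets you invoke the interpreter explicitly instead of
    # relying on a shebang line in the script itself
    transform_script=["python", "/opt/airflow/scripts/preprocess.py"],
)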

Related

How to use multiple URLs in a Python program

I am using a tool named https://github.com/epinna/tplmap which tests for Template Injection in a site. This is how it tests each URL:
python tplmap.py -u https://leadform.microsoft.com/?lang=FUZZ
The developer has not included an option for multiple URLs.
How can I use awk/sed/cat to feed in multiple URLs from a text file?
I have looked at tplmap.py and it seems amenable to altering so it would work as you wish. The thing which needs to change is the function named main, which looks as follows:
def main():
    args = vars(cliparser.options)
    if not args.get('url'):
        cliparser.parser.error('URL is required. Run with -h for help.')
    # Add version
    args['version'] = version
    checks.check_template_injection(Channel(args))
What is happening here: the args dict is created from the command-line arguments, the presence of url is checked, the version is added, and the checking function is called. I would alter that function to:
def main():
    args = vars(cliparser.options)
    if not args.get('url'):
        cliparser.parser.error('URL is required. Run with -h for help.')
    # Add version
    args['version'] = version
    # Treat the -u argument as a path to a file with one URL per line
    url_filename = args['url']
    with open(url_filename, 'r') as f:
        for line in f:
            args['url'] = line.rstrip()
            checks.check_template_injection(Channel(args))
So now it opens the file whose path is provided where a URL was provided earlier; each line is rstrip-ped to jettison the trailing newline and supplied as the url in args, then the check is run. Note: I did not test that code, so please apply the described change, first try it with a file holding a few addresses, and check that it does what you want.
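With that change in place, you would invoke the tool as, for example, python tplmap.py -u urls.txt, where urls.txt is a hypothetical text file holding one URL per line.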

Generate data file at install time

My python package depends on a static data file which is automatically generated from a smaller seed file using a function that is part of the package.
It makes sense to me to do this generation at the time of running setup.py install. Is there a standard way in setup() to describe “run this function before installing this package's additional files” (the options in the docs are all static)? If not, where should I place the call to that function?
Best done in two steps using the cmdclass mechanism:
add a custom command to generate the data file
override install to call that before proceeding
from distutils.cmd import Command
from setuptools import setup
from setuptools.command.install import install

class GenerateDataFileCommand(Command):
    description = 'generate data file'
    user_options = []

    # Command is abstract; these two must be overridden even if they do nothing
    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        pass  # Do something here...

class InstallCommand(install):
    def run(self):
        self.run_command('generate_data_file')
        return super().run()

setup(
    cmdclass={
        'generate_data_file': GenerateDataFileCommand,
        'install': InstallCommand,
    },
    # ...
)
This way you can call python setup.py generate_data_file to generate the data file as a stand-alone step, but the usual setup procedure (python setup.py install) will also ensure it's called.
(However, I'd recommend including the built file in the distribution archive, so end users don't have to build it themselves – that is, override build_py (class setuptools.command.build_py.build_py) instead of install.)
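A minimal sketch of that build_py variant, reusing the GenerateDataFileCommand from above:
from setuptools.command.build_py import build_py as _build_py

class BuildPyCommand(_build_py):
    def run(self):
        # Generate the data file before the build step copies package files
        self.run_command('generate_data_file')
        super().run()

# In setup(), register it alongside the custom command:
#   cmdclass={'generate_data_file': GenerateDataFileCommand, 'build_py': BuildPyCommand}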

How to access cluster_config dict within rule?

I'm working on writing a benchmarking report as part of a workflow, and one of the things I'd like to include is information about the amount of resources requested for each job.
Right now, I can manually require the cluster config file ('cluster.json') as a hardcoded input. Ideally, though, I would like to be able to access the per-rule cluster config information that is passed through the --cluster-config arg. In __init__.py, this is accessed as a dict called cluster_config.
Is there any way of importing or copying this dict directly into the rule?
From the documentation, it looks like you can now use a custom wrapper script to access the job properties (including the cluster config data) when submitting the script to the cluster. Here is an example from the documentation:
#!/usr/bin/env python3
import os
import sys
from snakemake.utils import read_job_properties

jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)
# do something useful with the threads
threads = job_properties["threads"]
# access a property defined in the cluster configuration file (Snakemake >=3.6.0)
time = job_properties["cluster"]["time"]
os.system("qsub -t {threads} {script}".format(threads=threads, script=jobscript))
During submission (last line of the previous example) you could either pass the arguments you want from the cluster.json to the script or dump the dict into a JSON file, pass the location of that file to the script during submission, and parse the json file inside your script. Here is an example of how I would change the submission script to do the latter (untested code):
#!/usr/bin/env python3
import os
import sys
import tempfile
import json
from snakemake.utils import read_job_properties

jobscript = sys.argv[1]
job_properties = read_job_properties(jobscript)
threads = job_properties["threads"]
# mkstemp returns a file descriptor and a path; dump the dict to that file
fd, job_json = tempfile.mkstemp(suffix='.json')
with os.fdopen(fd, 'w') as f:
    json.dump(job_properties, f)
os.system("qsub -t {threads} {script} -- {job_json}".format(threads=threads, script=jobscript, job_json=job_json))
The path to job_json should now be passed as an extra argument to the job script. Make sure to delete the job_json at the end of the job.
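Inside the job script, the dict could then be recovered with something like this (a sketch; it assumes the JSON path arrives as the last argument, as in the qsub line above):
import json
import sys

# The submission script appended the JSON path after '--',
# so it shows up as the final command-line argument here
with open(sys.argv[-1]) as f:
    job_properties = json.load(f)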
From a comment on another answer, it appears that you are just looking to store the job_json somewhere along with the job's output. In that case, it might not be necessary to pass job_json to the job script at all. Just store it in a place of your choosing.
You can easily manage resources for the cluster on a per-rule basis.
Indeed, you have the resources: keyword to use like this:
rule one:
    input: ...
    output: ...
    resources:
        gpu=1,
        time="HH:MM:SS"
    threads: 4
    shell: "..."
You can fill in the resources from the YAML cluster configuration file given with the --cluster-config parameter, like this:
rule one:
    input: ...
    output: ...
    resources:
        time=cluster_config["one"]["time"]
    threads: 4
    shell: "..."
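For reference, a matching cluster.yml might look like this; the values are hypothetical, and the one key corresponds to the rule name:
one:
    time: "01:00:00"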
When you call snakemake, you just have to reference those resources like this (example for a Slurm cluster):
snakemake --cluster "sbatch -c {threads} -t {resources.time}" --cluster-config cluster.yml
It will then submit each rule to the cluster with its specific resources.
For more information, you can check the documentation at this link: http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html

rpm spec file skeleton to real spec file

The aim is to have a skeleton spec file, fun.spec.skel, which contains placeholders for Version, Release and that kind of thing.
For the sake of simplicity I try to make a build target which updates those variables, transforming fun.spec.skel into fun.spec, which I can then commit to my GitHub repo. This is done so that rpmbuild -ta fun.tar works nicely and no manual modification of fun.spec.skel is required (people tend to forget to bump the version in the spec file, but not in the build system).
Assuming the implied question is "How would I do this?", the common answer is to put placeholders in the file like ##VERSION## and then sed the file, or get more complicated and have autotools do it.
We place a version.mk file in our project directories which defines these variables. Sample content includes:
RELPKG=foopackage
RELFULLVERS=1.0.0
As part of a script which builds the RPM, we can source this file:
#!/bin/bash
. $(pwd)/version.mk
export RELPKG RELFULLVERS
if [ -z "${RELPKG}" ]; then exit 1; fi
if [ -z "${RELFULLVERS}" ]; then exit 1; fi
This leaves us a couple of options to access the values which were set:
We can define macros on the rpmbuild command line:
% rpmbuild -ba --define "relpkg ${RELPKG}" --define "relfullvers ${RELFULLVERS}" foopackage.spec
We can access the environment variables using %{getenv:...} in the spec file itself (though this can make error handling harder):
%define relpkg %{getenv:RELPKG}
%define relfullvers %{getenv:RELFULLVERS}
From here, you simply use the macros in your spec file:
Name: %{relpkg}
Version: %{relfullvers}
We have similar values (provided by environment variables set through Jenkins) for the build number, which plugs into the "Release" tag.
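For example, a line like Release: %{getenv:BUILD_NUMBER}%{?dist} would pull a CI-provided BUILD_NUMBER environment variable straight into the Release tag (the variable name is an assumption here; use whatever your build system exports).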
I found two ways:
a) use something like
Version: %(./waf version)
where version is a custom waf target
from waflib.Context import Context

def version_fun(ctx):
    # VERSION is assumed to be defined at the top level of the wscript
    print(VERSION)

class version(Context):
    """Printout the version and only the version"""
    cmd = 'version'
    fun = 'version_fun'
This checks the version at RPM build time.
b) create a target that modifies the spec file itself:
from waflib.Context import Context
from waflib import Logs
import re

def bumprpmver_fun(ctx):
    # VERSION is assumed to be defined at the top level of the wscript
    spec = ctx.path.find_node('oregano.spec')
    data = None
    with open(spec.abspath()) as f:
        data = f.read()
    if data:
        # \g<1> avoids ambiguity when the version starts with a digit
        data = re.sub(r'^(\s*Version\s*:\s*)[\w.]+\s*', r'\g<1>{0}\n'.format(VERSION), data, flags=re.MULTILINE)
        with open(spec.abspath(), 'w') as f:
            f.write(data)
    else:
        Logs.warn("Didn't find that spec file: '{0}'".format(spec.abspath()))

class bumprpmver(Context):
    """Bump version"""
    cmd = 'bumprpmver'
    fun = 'bumprpmver_fun'
The latter is used in my pet project oregano on GitHub.

How to fetch data from a file location and run it using Tcl code

In Tcl, how do I get data from a file at one location and run that data using Tcl code?
For example:
In folder 1 there is a config file; I want to get the information from the config file and execute it, checking whether the information is present or not.
If the configuration file contains Tcl code, it's just:
# Put the filename in quotes if you want, or in a variable, or ...
source /the/path/to/the/file.tcl
If the file contains Tcl code but you don't trust it, you can use a “safe interpreter” context. This disables many commands, giving a much more restricted set of capabilities that you can then add specific exceptions to (with interp alias):
# Make the context
set i [interp create -safe]
# Set up a way for the context to let the master find out about what to
# really set
interp alias $i configure {} recordConfiguration
proc recordConfiguration args {
    puts "configured with $args"
}
# Evaluate the script (note that [source] is hidden by default) in the context
$i invokehidden source /the/path/to/the/file.tcl
# Dispose of the context
interp delete $i
If the file isn't Tcl code, you have to parse it. That's a substantially more complex matter, so much so that we'll need to know the format of the file before we can answer.
If you are trying to read data (like text strings) from a file then you'll have to open a channel for that particular file like this:
set fileid [open "path/to/your/file.txt" r]
Read the open manual page.
Then you can use the gets command to read data from the file through the channel fileid.
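A minimal sketch of that read loop (the file path is a placeholder):
set fileid [open "path/to/your/file.txt" r]
# gets returns -1 at end of file
while {[gets $fileid line] >= 0} {
    puts "read line: $line"
}
close $fileid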