I would like to set up tests for my transforms in Foundry, passing test inputs and checking that the output matches what I expect. Is it possible to call a transform with dummy datasets (a .csv file in the repo), or should I create functions inside the transform to be called by the tests (with the data created in code)?
If you check your platform documentation under Code Repositories -> Python Transforms -> Python Unit Tests, you'll find quite a few resources there that will be helpful.
The sections on writing and running tests in particular are what you're looking for.
// START DOCUMENTATION
Writing a Test
Full documentation can be found at https://docs.pytest.org
Pytest finds tests in any Python file that begins with test_.
It is recommended to put all your tests into a test package under the src directory of your project.
Tests are simply Python functions that are also named with the test_ prefix and assertions are made using Python’s assert statement.
PyTest will also run tests written using Python’s builtin unittest module.
For example, in transforms-python/src/test/test_increment.py a simple test would look like this:
def increment(num):
    return num + 1

def test_increment():
    assert increment(3) == 5
Running this test will cause checks to fail with a message that looks like this:
============================= test session starts =============================
collected 1 item
test_increment.py F [100%]
================================== FAILURES ===================================
_______________________________ test_increment ________________________________
    def test_increment():
>       assert increment(3) == 5
E       assert 4 == 5
E        +  where 4 = increment(3)

test_increment.py:5: AssertionError
========================== 1 failed in 0.08 seconds ===========================
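The text above notes that pytest also runs tests written with the builtin unittest module. As an illustration only (the class name IncrementTest is made up here and does not come from the quoted documentation), the same check in unittest style, with the expected value corrected so that it passes, could look like this:

```python
import unittest


def increment(num):
    return num + 1


class IncrementTest(unittest.TestCase):
    # pytest collects unittest.TestCase subclasses and runs their test_* methods
    def test_increment(self):
        self.assertEqual(increment(3), 4)
```

Placed in a file whose name starts with test_, this class should be picked up by the same checks as a plain test function.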
Testing with PySpark
PyTest fixtures are a powerful feature that enables injecting values into test functions simply by adding a parameter of the same name. This feature is used to provide a spark_session fixture for use in your test functions. For example:
def test_dataframe(spark_session):
    df = spark_session.createDataFrame([['a', 1], ['b', 2]], ['letter', 'number'])
    assert df.schema.names == ['letter', 'number']
// END DOCUMENTATION
If you don't want to specify your schemas in code, you can also read in a file from your repository by following the instructions in the documentation under How To -> Read file in Python repository.
// START DOCUMENTATION
Read file in Python repository
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, edit setup.py in your Python repository:
setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.yaml', '*.csv']
    }
)
This tells Python to bundle the yaml and csv files into the package. Then place a config file (for example config.yaml, but it can also be csv or txt) next to your Python transform (e.g. read_yml.py, see below):
- name: tbl1
  primaryKey:
    - col1
    - col2
  update:
    - column: col3
      with: 'XXX'
You can read it in your transform read_yml.py with the code below:
from transforms.api import transform_df, Input, Output
from pkg_resources import resource_stream
import yaml
import json


@transform_df(
    Output("/Demo/read_yml")
)
def my_compute_function(ctx):
    # config.yaml sits next to this module and is bundled via package_data
    stream = resource_stream(__name__, "config.yaml")
    # safe_load parses plain YAML without constructing arbitrary Python objects
    docs = yaml.safe_load(stream)
    return ctx.spark_session.createDataFrame([{'result': json.dumps(docs)}])
So your project structure would be:
some_folder
    config.yaml
    read_yml.py
This will output in your dataset a single row with one column "result" with content:
[{"primaryKey": ["col1", "col2"], "update": [{"column": "col3", "with": "XXX"}], "name": "tbl1"}]
// END DOCUMENTATION
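Putting the two excerpts together, one possible approach is to keep the logic under test in a plain function and feed it rows parsed from a CSV file. The sketch below is only an illustration: read_csv_rows and add_greeting are hypothetical names, and an in-memory stream stands in for a CSV file bundled with the repository so that the example is self-contained.

```python
import csv
import io


def read_csv_rows(stream):
    """Return (column_names, rows) parsed from a CSV text stream."""
    reader = csv.reader(stream)
    header = next(reader)
    rows = [tuple(row) for row in reader]
    return header, rows


def add_greeting(rows):
    """Hypothetical stand-in for the logic under test: append a greeting column."""
    return [row + ("hello " + row[0],) for row in rows]


def test_add_greeting():
    # In a repository the stream could instead wrap a bundled file, e.g.
    # io.TextIOWrapper(resource_stream(__name__, "input.csv"), encoding="utf-8")
    stream = io.StringIO("name,age\nalice,30\nbob,25\n")
    header, rows = read_csv_rows(stream)
    assert header == ["name", "age"]
    assert add_greeting(rows) == [
        ("alice", "30", "hello alice"),
        ("bob", "25", "hello bob"),
    ]
    # With the spark_session fixture, the same rows could seed a DataFrame:
    # df = spark_session.createDataFrame(rows, header)
```

Note that csv.reader yields every value as a string, so numeric columns would need an explicit cast (or an explicit schema) before being compared with typed output.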
There is a small project which produces a binary application. The source code is C, and I'm using autotools to create the Makefile and build the binary, which works well.
I would like to run test cases with that binary. Here is what I did:
SUBDIRS = src
dist_doc_DATA = README
TESTS=
TESTS+=tests/config1.conf
TESTS+=tests/config2.conf
TESTS+=tests/config3.conf
TESTS+=tests/config4.conf
TESTS+=tests/config5.conf
TESTS+=tests/config6.conf
TESTS+=tests/config7.conf
TESTS+=tests/config8.conf
TESTS+=tests/config9.conf
TESTS+=tests/config10.conf
TESTS+=tests/config11.conf
I would like to pass each of these files as an argument to the tool. When I run make check, I get:
make[3]: Entering directory '/home/airween/src/mytool'
FAIL: tests/config1.conf
FAIL: tests/config2.conf
FAIL: tests/config3.conf
which is correct, because those files are simple configuration files.
How can I solve that make check runs my tool with the scripts above, and finally I get a list with number of success, failed, ... tests, like in that case:
============================================================================
Testsuite summary for mytool 0.1
============================================================================
# TOTAL: 11
# PASS: 0
# SKIP: 0
# XFAIL: 0
# FAIL: 11
# XPASS: 0
# ERROR: 0
Edit: so I would like to emulate these runs:
for f in `ls -1 tests/*.conf`; do src/mytool ${f}; done
but - of course - I want to see the summary at the end.
Thanks.
The Autotools' built-in test runner expects you to specify the names of executable tests via the make variable TESTS. You cannot just put random filenames in there and expect make or Automake to know what to do with them.
The tests can be built programs, generated scripts, static scripts distributed with the project, or any combination of the above.
How can I solve that make check runs my tool with the scripts above, and finally I get a [test summary report]?
You have acknowledged that your configuration files are not scripts, so stop calling them that! This is in fact the crux of the problem. The easiest solution is probably to create actual executable scripts, one for each case, and name those in your TESTS variable. Each one would run the binary under test with the appropriate configuration file (that is, you're responsible for making them do that if those are the tests you want to perform).
See also the Automake Manual's chapter on tests.
Okay, the solution from here:
tests/Makefile.am:
==================
TEST_EXTENSIONS = .conf
CONF_LOG_COMPILER = ./test-suit.sh
TESTS=
TESTS+=config1.conf
TESTS+=config2.conf
TESTS+=config3.conf
TESTS+=config4.conf
TESTS+=config5.conf
TESTS+=config6.conf
TESTS+=config7.conf
TESTS+=config8.conf
TESTS+=config9.conf
TESTS+=config10.conf
TESTS+=config11.conf
tests/test-suit.sh:
==================
#!/bin/sh
CONF="$1"
exec ../src/mytool "$CONF"
And the result:
make check
...
PASS: config1.conf
PASS: config2.conf
PASS: config3.conf
PASS: config4.conf
PASS: config5.conf
PASS: config6.conf
PASS: config7.conf
PASS: config8.conf
PASS: config9.conf
PASS: config10.conf
PASS: config11.conf
============================================================================
Testsuite summary for mytool 0.1
============================================================================
# TOTAL: 11
# PASS: 11
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0
============================================================================
This is what I expected.
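The same wrapper idea can be sketched in Python. This is only an illustration under assumptions: run_case is a hypothetical helper, and the only harness convention it relies on is that Automake treats exit status 99 as a hard error (any other non-zero status except 77, which means skip, counts as a failure).

```python
import subprocess

# Exit status the Automake test harness reports as ERROR rather than FAIL
EXIT_HARD_ERROR = 99


def run_case(tool_cmd, conf_path):
    """Run the tool on one configuration file and return its exit status."""
    try:
        completed = subprocess.run(list(tool_cmd) + [conf_path])
    except FileNotFoundError:
        # The binary under test is missing: report a hard error, not a plain failure
        return EXIT_HARD_ERROR
    return completed.returncode
```

A small script that ends with sys.exit(run_case(["../src/mytool"], sys.argv[1])) could then be named in CONF_LOG_COMPILER in place of the shell script.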
Snakemake allows creating a log for each rule with the log parameter, which specifies the name of the log file. It is relatively straightforward to pipe the output of a shell directive to this log, but I cannot figure out a way to log the output of a run directive (i.e. Python code).
One workaround is to save the python code in a script and then run it from the shell, but I wonder if there is another way?
I have some rules that use both the log and run directives. In the run directive, I "manually" open and write the log file.
For instance:
rule compute_RPM:
    input:
        counts_table = source_small_RNA_counts,
        summary_table = rules.gather_read_counts_summaries.output.summary_table,
        tags_table = rules.associate_small_type.output.tags_table,
    output:
        RPM_table = OPJ(
            annot_counts_dir,
            "all_{mapped_type}_on_%s" % genome, "{small_type}_RPM.txt"),
    log:
        log = OPJ(log_dir, "compute_RPM_{mapped_type}", "{small_type}.log"),
    benchmark:
        OPJ(log_dir, "compute_RPM_{mapped_type}", "{small_type}_benchmark.txt"),
    run:
        with open(log.log, "w") as logfile:
            logfile.write(f"Reading column counts from {input.counts_table}\n")
            counts_data = pd.read_table(
                input.counts_table,
                index_col="gene")
            logfile.write(f"Reading number of non-structural mappers from {input.summary_table}\n")
            norm = pd.read_table(input.summary_table, index_col=0).loc["non_structural"]
            logfile.write(str(norm))
            logfile.write("Computing counts per million non-structural mappers\n")
            RPM = 1000000 * counts_data / norm
            add_tags_column(RPM, input.tags_table, "small_type").to_csv(output.RPM_table, sep="\t")
For third-party code that writes to stdout, maybe the redirect_stdout context manager could be helpful (found in https://stackoverflow.com/a/40417352/1878788, documented at https://docs.python.org/3/library/contextlib.html#contextlib.redirect_stdout).
Test snakefile, test_run_log.snakefile:
from contextlib import redirect_stdout

rule all:
    input:
        "test_run_log.txt"

rule test_run_log:
    output:
        "test_run_log.txt"
    log:
        "test_run_log.log"
    run:
        with open(log[0], "w") as log_file:
            with redirect_stdout(log_file):
                print(f"Writing result to {output[0]}")
                with open(output[0], "w") as out_file:
                    out_file.write("result\n")
Running it:
$ snakemake -s test_run_log.snakefile
Results:
$ cat test_run_log.log
Writing result to test_run_log.txt
$ cat test_run_log.txt
result
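If the code also writes to stderr, contextlib offers redirect_stderr alongside redirect_stdout. A minimal sketch outside of Snakemake, where noisy_step and run_with_log are hypothetical names used only for illustration:

```python
from contextlib import redirect_stderr, redirect_stdout
import sys


def noisy_step():
    """Hypothetical third-party function that prints to both streams."""
    print("progress message")
    print("warning message", file=sys.stderr)


def run_with_log(func, log_path):
    """Call func() with stdout and stderr redirected to log_path."""
    with open(log_path, "w") as log_file:
        with redirect_stdout(log_file), redirect_stderr(log_file):
            return func()
```

These context managers only swap sys.stdout and sys.stderr inside the Python process, so output written by child processes is not captured this way.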
My solution was the following. This is useful both for normal logging and for logging exceptions with a traceback. You can then wrap the logger setup in a function to make it more organized. It's not very pretty though. It would be much nicer if Snakemake could do it by itself.
import logging

# some stuff

rule logging_test:
    input: 'input.json'
    output: 'output.json'
    log: 'rules_logs/logging_test.log'
    run:
        logger = logging.getLogger('logging_test')
        # the logger itself must allow INFO, otherwise info() calls are dropped
        logger.setLevel(logging.INFO)
        fh = logging.FileHandler(str(log))
        fh.setLevel(logging.INFO)
        formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
        fh.setFormatter(formatter)
        logger.addHandler(fh)
        try:
            logger.info('Starting operation!')
            # do something
            with open(str(output), 'w') as f:
                f.write('success!')
            logger.info('Ended!')
        except Exception as e:
            logger.error(e, exc_info=True)
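Wrapping the logger setup in a function, as suggested above, might look like the following sketch. The helper name make_rule_logger is an assumption for illustration; it uses only the standard logging module.

```python
import logging


def make_rule_logger(name, log_path, level=logging.INFO):
    """Return a logger that writes timestamped records to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Avoid stacking duplicate handlers if the helper is called twice for one name
    if not logger.handlers:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
        logger.addHandler(handler)
    return logger
```

Inside a run directive this would reduce the setup to a single line such as logger = make_rule_logger('logging_test', str(log)).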
How to auto generate ID of changeset with liquibase?
I don't want to set the ID of every changeset manually, is there a way to do it automatically?
I don't think that generated ids are a good idea. The reason is that Liquibase uses the changeSet id to calculate the checksum (in addition to the author and fileName). So if you ever insert a changeSet between others, the checksums of all subsequent changeSets will change and you will get tons of warnings/errors.
Anyway, I can think of these solutions if you still want to generate ids:
Create your own ChangeLogParser
If you parse the ChangeLog on your own you are free to generate the ids as you want.
The downside is that you will have to provide a custom Xml Schema for the changeLog. The schema from Liquibase has a constraint on changeSet ids (required). With a new schema you'll probably have to do a significant amount of tweaking on the parser.
Alternatively you may choose another changeLog Format (YAML, JSON, Groovy). Their parsers may be easier to customize as they do not need that schema definition.
Do some preprocessing
You may write a simple xslt (Xml transformation) that generates a changeLog with changeSet ids, from a file that has none.
Use timestamps as ids
This would be my advice. It does not solve the question the way you asked, but it is simple, consistent, provides additional information and is a good practice for other database migration tools as well: http://www.jeremyjarrell.com/using-flyway-db-with-distributed-version-control/
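A timestamp id can be produced with a one-line helper at the moment a changeSet is written. The function name and the exact format below are assumptions chosen for illustration; any format that sorts chronologically would do.

```python
from datetime import datetime


def timestamp_id(moment=None):
    """Format a moment as a sortable changeSet id, e.g. 20240131-093000."""
    moment = moment or datetime.now()
    return moment.strftime("%Y%m%d-%H%M%S")
```

The id is generated once and then stored in the changelog, so it stays stable across later runs.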
I wrote a Python script to generate unique IDs into Liquibase changelogs.
Be careful!
DO generate IDs
- when the changelog is in development or ready for release
- or when you control the checksums of the target database
DON'T generate IDs
- when the changelog(s) are deployed already
"""
###############################################################################
Purpose: Generate unique subsequent IDs into Liquibase changelogs
###############################################################################
Args:
param1: Full Windows path changelog directory (optional)
OR
--inplace: directly process changelogs (optional)
By default, XML files in the current directory are processed.
Returns:
In case of success, the output path is returned to stdout.
Otherwise, we crash and drag the system into mordor.
If you feel like wasting time you can:
a) port path handling to *nix
b) handle any obscure exceptions
c) add Unicode support (for better entertainment)
Dependencies:
Besides Python 3, in order to preserve XML comments, I had to use lxml
instead of the stock ElementTree parser.
Install lxml:
$ pip install lxml
Proxy clusterfuck? Don't panic! Simply download a .whl package from:
https://pypi.org/project/lxml/#files and install with pip.
Bugs:
Changesets having id="0" are ignored. Usually, these do not occur.
Author:
Tobias Bräutigam
Versions:
0.0.1 - re based, deprecated
0.0.2 - parse XML with lxml, CURRENT
"""
import datetime
import sys
import os
from pathlib import Path, PureWindowsPath

try:
    import lxml.etree as ET
except ImportError as error:
    print ('''
Error: module lxml is missing.
Please install it:
pip install lxml
''')
    sys.exit(1)

# Process arguments
prefix = '' # hold separator, if needed
outdir = 'out'
try: sys.argv[1]
except IndexError: pass
else:
    if sys.argv[1] == '--inplace':
        outdir = ''
    else:
        prefix = outdir + '//'
        # accept Windows path syntax
        inpath = PureWindowsPath(sys.argv[1])
        # convert path format
        inpath = Path(inpath)
        os.chdir(inpath)
        # create the output directory, or empty it if it already exists
        try: os.mkdir(outdir)
        except OSError: pass
        filelist = [ f for f in os.listdir(outdir) ]
        for f in filelist: os.remove(os.path.join(outdir, f))

# Parse XML, generate IDs, write file
def parseX(filename,prefix):
    cnt = 0
    print (filename)
    tree = ET.parse(filename)
    # iter() replaces the deprecated getiterator()
    for node in tree.iter():
        # only nodes with a non-zero numeric id get a new id
        if int(node.attrib.get('id', 0)):
            now = datetime.datetime.now()
            node.attrib['id'] = str(int(now.strftime("%H%M%S%f"))+cnt*37)
            cnt = cnt + 1
    root = tree.getroot()
    # NS URL element name is '' for Etree, lxml requires at least one character
    ET.register_namespace('x', u'http://www.liquibase.org/xml/ns/dbchangelog')
    tree = ET.ElementTree(root)
    tree.write(prefix + filename, encoding='utf-8', xml_declaration=True)
    print(str(cnt) +' ID(s) generated.')

# Process files
print('\n')
items = 0
for infile in os.listdir('.'):
    if infile.lower().endswith('.xml'):
        parseX(infile,prefix)
        items=items+1

# Message
print('\n' + str(items) + ' file(s) processed.\n\n')
if items > 0:
    print('Output was written to: \n\n')
    print(str(os.getcwd()) + '\\' + outdir + '\n')
There are different lists available in pelicanconf.py such as
SOCIAL = (('Facebook','www.facebook.com'),)
LINKS =
etc.
I want to manage this content and create my own lists by loading the values from an external file which can be edited independently. I tried importing the data from a text file using Python, but it doesn't work. Is there any other way?
What exactly did not work? Can you provide code?
You can execute arbitrary python code in your pelicanconf.py.
Example for a very simple CSV reader:
# in pelicanconf.py
def fn_to_list(fn):
    with open(fn, 'r') as res:
        # rstrip('\n') also handles a last line without a trailing newline
        return tuple(tuple(line.rstrip('\n').split(';')) for line in res)

print(fn_to_list("data"))
CSV file data:
A;1
B;2
C;3
D;4
E;5
F;6
Together, this yields the following when running pelican:
# ...
((u'A', u'1'), (u'B', u'2'), (u'C', u'3'), (u'D', u'4'), (u'E', u'5'), (u'F', u'6'))
# ...
Instead of printing you can also assign this list to a variable, say LINKS.
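As a variation on the reader above, the standard csv module can do the splitting. This is a sketch under assumptions: load_pairs is a made-up helper name, and links.csv and social.csv are hypothetical file names.

```python
import csv


def load_pairs(path, delimiter=';'):
    """Read a delimited file into a tuple of tuples, skipping blank lines."""
    with open(path, newline='', encoding='utf-8') as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        return tuple(tuple(row) for row in reader if row)


# In pelicanconf.py (file names are hypothetical):
# LINKS = load_pairs('links.csv')
# SOCIAL = load_pairs('social.csv')
```

Using csv.reader also copes with quoted fields that contain the delimiter, which a plain split would break apart.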
Suppose I have 2 test suites in the local directory, foo and bar, and I want to run the test suite in the order of foo then bar.
I tried to run pybot -s foo -s bar ., but then it just goes and runs bar then foo (i.e. in alphabetical order).
Is there a way to get pybot to execute Robot Framework suites in the order that I define?
Robot Framework supports argument files, which can be used to specify the order of execution.
This is from the older docs (no longer online):
Another important usage for argument files is specifying input files or directories in certain order. This can be very useful if the alphabetical default execution order is not suitable:
Basically, you create something similar to a start-up script:
--name My Example Tests
tests/some_tests.html
tests/second.html
tests/more/tests.html
tests/more/another.html
tests/even_more_tests.html
There is a neat feature: from an argument file you can call another argument file, which can override previously set parameters. Execution is recursive, so you can nest as many argument files as you need.
Another option would be to use a start-up script. Then you have to deal with other aspects, such as which operating system you are running the tests on. You could also write the start-up script in Python so that it works on multiple platforms. There is more on this in the docs.
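If the order changes often, the argument file itself could be generated by a small Python helper. The sketch below is an illustration only: write_argument_file is a hypothetical name, and it simply writes one option or path per line in the order given.

```python
def write_argument_file(path, suites, name=None):
    """Write a Robot Framework argument file listing suites in the given order."""
    lines = []
    if name:
        lines.append("--name " + name)
    lines.extend(suites)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return path
```

For example, write_argument_file("order.txt", ["foo", "bar"]) followed by pybot --argumentfile order.txt should run foo before bar.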
If there are multiple test case files in an RF directory, the execution order can be specified by adding numeric prefixes to the file names, like this:
01__my_suite.html -> My Suite
02__another_suite.html -> Another Suite
Such prefixes are not included in the generated test suite name if they are separated from the base name of the suite with two underscores.
More details are here:
http://robotframework.org/robotframework/latest/RobotFrameworkUserGuide.html#execution-order
You can use tagging.
Tag the tests as foo and bar so you can run each test separately:
pybot -i foo tests
or
pybot -i bar tests
and decide the order
pybot -i bar tests; pybot -i foo tests
or in a script.
The drawback is that you have to run the setup for each test.
Would something like this be of any use?
pybot tests/test1.txt tests/test2.txt
So, to reverse:
pybot tests/test2.txt tests/test1.txt
I had success using a listener:
Listener.py:
class Listener(object):
    ROBOT_LISTENER_API_VERSION = 3

    def __init__(self):
        self.priorities = ['foo', 'bar']

    def start_suite(self, data, suite):
        #data.suites is a list of <TestSuite> instances
        data.suites = self.rearrange(data.suites)

    def rearrange(self, suites=[]):
        #Do some sorting of suites based on self.priorities e.g. using bubblesort
        n = len(suites)
        if n > 1:
            for i in range(0, n):
                for j in range(0, n-i-1):
                    #Initialize the compared suites with lowest priority
                    priorityA = 0
                    priorityB = 0
                    #If suite[j] is prioritized, get the priority of it
                    if str(suites[j]) in self.priorities:
                        priorityA = len(self.priorities)-self.priorities.index(str(suites[j]))
                    #If suite[j+1] is prioritized, get the priority of it
                    if str(suites[j+1]) in self.priorities:
                        priorityB = len(self.priorities)-self.priorities.index(str(suites[j+1]))
                    #Compare and swap if suite[j] is of lower priority than suite[j+1]
                    if priorityA < priorityB:
                        suites[j], suites[j+1] = suites[j+1], suites[j]
        return suites
Assuming foo.robot and bar.robot are contained in a toplevel suite called 'tests', you can run it like this:
pybot --listener Listener.py tests/
This will rearrange childsuites on the fly. It's possible you can modify it upfront using a prerunmodifier instead.
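The ordering logic itself does not need a hand-written bubble sort. A sketch of the same idea with sorted and a key function follows; order_by_priority and FakeSuite are hypothetical names, and FakeSuite only stands in for suite objects that expose a name attribute.

```python
from collections import namedtuple

# Stand-in for a suite object; only the name attribute is used here
FakeSuite = namedtuple("FakeSuite", "name")


def order_by_priority(suites, priorities):
    """Return suites with prioritized names first; the rest keep their order."""
    rank = {name: index for index, name in enumerate(priorities)}
    # sorted() is stable, so suites with equal rank keep their relative order
    return sorted(suites, key=lambda suite: rank.get(suite.name, len(priorities)))
```

Suites that are not listed in priorities end up after the prioritized ones, in their original order, which mirrors what the bubble sort in the listener does.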