My data is structured as samples that are run in batches. So I have a directory hierarchy like this:
/path/to/dir/batch_1/sample_1
/path/to/dir/batch_1/sample_2
/path/to/dir/batch_1/...
/path/to/dir/batch_2/sample_1
/path/to/dir/batch_2/sample_2
/path/to/dir/batch_2/...
/path/to/dir/...
I want to apply a process to every sample for a given subset of batches. One approach that works is to generate a channel listing the samples:
path_to_samples = Channel
    .fromPath(['/path/to/dir/batch_2/sample_*',
               '/path/to/dir/batch_322/sample_*'], type: 'dir')
process my_process {
    input:
    path(sample) from path_to_samples
    """
    do stuff
    """
}
Now, I'd like to provide the batch names separately and have the script find the corresponding samples. Something like this:
params.root_dir = '/path/to/dir/'
params.batch_names = Channel.from('batch_2', 'batch_322')
// make samples channel: incorrect
path_to_samples = params.batch_names
    .map { params.root_dir + it + 'sample_*' }
    .toPath()
process my_process {
    input:
    path(sample) from path_to_samples
    """
    do stuff
    """
}
So, am I thinking incorrectly about channels? Is there a way to "flatten" the sample list through channel operations? Or is the correct approach to write a more complex Groovy closure that lists the files in each batch directory and returns them as a tuple or list?
Not sure how you'd like to provide your input batch names, but you could create your list of glob patterns using a simple closure then use them to create your input channel:
params.root_dir = '/path/to/dir'
params.batch_names = '/path/to/batch_names.txt'
batch_names = file(params.batch_names)
sample_dirs = batch_names.readLines().collect { "${params.root_dir}/${it}/sample_*" }
samples = Channel.fromPath( sample_dirs, type: 'dir' )
process my_process {
    input:
    path(sample) from samples
    """
    ls -l "${sample}"
    """
}
I would be inclined to just leave the input glob pattern as a param, though. This approach offers the most flexibility, but may not suit your use case:
params.samples = '/path/to/dir/batch_{2,322}/sample_*'
samples = Channel.fromPath( params.samples, type: 'dir' )
I have a @transform_pandas transform which loads the input file for computing.
Inside the compute function I have a for loop which has to read the complete input data and filter accordingly for every iteration.
@transform_pandas(
    Output("/FCA_Foundry/dataset1"),
    source_df=Input(sample),
)
I have the below code where I'm trying to read source_df dataset for every iteration in for loop and filter the dataset specifically to the year and family and do the computation.
def compute(source_df):
    for entire_row in vhcl_df.itertuples():
        modyr = entire_row[1]
        fam = str(entire_row[2])
        # source_df should be read again here.
        source_df = source_df.loc[source_df['i_yr']==modyr]
        source_df = source_df.loc[source_df['fam']==fam]
        ...
Is there a way to achieve this? Thank you for your support.
As already suggested by @nicornk in the comments, you should create a new .copy() of your source_df for each iteration instead of overwriting source_df itself.
The two filtering steps can also be merged into one if you don't need to work on the source_df filtered by modyr alone.
Assuming modyr and fam are actual column names of vhcl_df, it is sufficient to write
@transform_pandas(
    Output("/FCA_Foundry/dataset1"),
    source_df=Input(sample),
    vhcl_df=Input(path)
)
def compute(source_df, vhcl_df):
    # Iterate over the (modyr, fam) pairs row by row; this assumes
    # vhcl_df has columns named 'modyr' and 'fam'
    for modyr, fam in vhcl_df[['modyr', 'fam']].itertuples(index=False):
        temp_df = source_df.copy()
        temp_df = temp_df.loc[temp_df['i_yr']==modyr]
        temp_df = temp_df.loc[temp_df['fam']==str(fam)]
which can be written more concisely as
def compute(source_df, vhcl_df):
    for modyr, fam in vhcl_df[['modyr', 'fam']].itertuples(index=False):
        temp_df = source_df.copy()
        filtered_temp_df = temp_df[(temp_df.i_yr==modyr) & (temp_df.fam==str(fam))]
PS: Remember that if source_df is big, you should proceed with PySpark (see foundry docs)
Note that transform_pandas should only be used on datasets that can fit into memory. If you have larger datasets that you wish to filter down first before converting to Pandas, you should write your transformation using the transform_df() decorator and the pyspark.sql.SparkSession.createDataFrame() method.
I am trying to use SpaCy for entity context recognition in the world of ontologies. I'm a novice at using SpaCy and just playing around for starters.
I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.
Screenshot of my sample data:
The following is the workflow of my initial code:
Creating patterns and terms
# Set terms and patterns
terms = {}
patterns = []
for curie, name, category in envoTerms.to_records(index=False):
    if name is not None:
        terms[name.lower()] = {'id': curie, 'category': category}
        patterns.append(nlp(name))
Setup a custom pipeline
@Language.component('envo_extractor')
def envo_extractor(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label = 'ENVO') for matchId, start, end in matches]
    doc.ents = spans
    for i, span in enumerate(spans):
        span._.set("has_envo_ids", True)
        for token in span:
            token._.set("is_envo_term", True)
            token._.set("envo_id", terms[span.text.lower()]["id"])
            token._.set("category", terms[span.text.lower()]["category"])
    return doc
# Getter function for doc level
def has_envo_ids(self, tokens):
    return any([t._.get("is_envo_term") for t in tokens])
##EDIT: #################################################################
def resolve_substrings(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="ENVO")
    doc.ents += (entity,)
    print(entity.text)
#########################################################################
Implement the custom pipeline
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
#### EDIT: Added 'on_match' rule ################################
matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
nlp.add_pipe('envo_extractor', after='ner')
and the pipeline looks like this
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
('envo_extractor', <function __main__.envo_extractor(doc)>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]
Set extensions
# Set extensions to tokens, spans and docs
Token.set_extension('is_envo_term', default=False, force=True)
Token.set_extension("envo_id", default=False, force=True)
Token.set_extension("category", default=False, force=True)
Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Doc.set_extension("envo_ids", default=[], force=True)
Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Now when I run the text 'tissue culture', it throws me an error:
nlp('tissue culture')
ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.
I know why the error occurred. It is because there are 2 entries for the 'tissue culture' phrase in the ENVO database as shown below:
Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?
My SpaCy Info:
============================== Info about spaCy ==============================
spaCy version 3.0.5
Location *irrelevant*
Platform macOS-10.15.7-x86_64-i386-64bit
Python version 3.9.2
Pipelines en_core_web_sm (3.0.0)
It might be a little late nowadays but, complementing Sofie VL's answer a little bit, and to anyone who might be still interested in it, what I (another spaCy newbie, lol) have done to get rid of overlapping spans, goes as follows:
import spacy
from spacy.util import filter_spans
# [Code to obtain 'entity']...
# 'entity' should be a list of Span objects, e.g. the spans for
# "Carolina" and "North Carolina"
pat_orig = len(entity)
filtered = filter_spans(entity) # THIS DOES THE TRICK
pat_filt = len(filtered)
doc.ents = filtered
print("\nCONVERSION REPORT:")
print("Original number of patterns:", pat_orig)
print("Number of patterns after overlapping removal:", pat_filt)
Important to mention that I am using the most recent version of spaCy at this date, v3.1.1. Additionally, it will work only if you actually do not mind about overlapping spans being removed, but if you do, then you might want to give this thread a look. More info regarding 'filter_spans' here.
Best regards.
Since spaCy v3, you can use doc.spans to store entities that may be overlapping. This functionality is not supported by doc.ents.
So you have two options:
1. Implement an on_match callback that will filter out the results of the matcher before you use the result to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents, and decide which of them to keep.
2. Use doc.spans instead of doc.ents if that works for your use case.
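As a rough illustration of the conflict-resolution step in option 1, here is a minimal plain-Python sketch that works on (start, end) token offsets rather than spaCy Span objects. It follows the rule that spacy.util.filter_spans documents (prefer the longest span, then the earliest one). The function name filter_overlaps and the sample offsets are made up for this example.

```python
def filter_overlaps(spans):
    """Keep non-overlapping (start, end) spans, longest first, then earliest."""
    # Sort by length (longest first); ties are broken by the smaller start offset.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # Accept a span only if none of its tokens is already covered.
        if all(i not in seen_tokens for i in range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)


# Hypothetical matcher output: the same two-token phrase matched twice,
# plus a shorter match nested inside it and an unrelated match.
matches = [(0, 2), (0, 2), (1, 2), (3, 4)]
print(filter_overlaps(matches))  # [(0, 2), (3, 4)]
```

With real spaCy spans the equivalent call would be filter_spans(list_of_spans) before assigning to doc.ents. Note that this only removes the duplicate span; choosing between two CURIEs for the same phrase still needs logic of your own.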
There are the following attributes in client_output
weights_delta = attr.ib()
client_weight = attr.ib()
model_output = attr.ib()
client_loss = attr.ib()
After that, I turned client_output into a sequence with a = tff.federated_collect(client_output), passed it on with round_model_delta = tff.federated_map(selecting_fn, a), and declared selecting_fn as:
@tff.tf_computation() # append
def selecting_fn(a):
    #TODO
    return round_model_delta
When averaging on the server, I want to average weights_delta over only those clients that have a small loss value. I tried to access it via a.weights_delta, but it doesn't work.
tff.federated_collect returns a tff.SequenceType placed at tff.SERVER, which you can manipulate the same way a client dataset is usually handled inside a method decorated by tff.tf_computation.
Note that you have to use the tff.federated_collect operator in the scope of a tff.federated_computation. What you probably want to do[*] is pass it into a tff.tf_computation, using the tff.federated_map operator. Once inside the tff.tf_computation, you can think of it as a tf.data.Dataset object and everything in the tf.data module is available.
[*] I am guessing. More detailed explanation of what you would like to achieve would be helpful.
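Since the goal is not fully specified, the following is only a plain-Python sketch of the selection logic itself, not TFF code. It picks the k clients with the smallest loss and computes a weighted average of their weights_delta. The ClientOutput dataclass is a stand-in carrying three of the fields from the question, and average_lowest_loss and k are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClientOutput:
    # Stand-in for the attrs class in the question (flat list instead of tensors).
    weights_delta: List[float]
    client_weight: float
    client_loss: float


def average_lowest_loss(outputs, k):
    """Weighted mean of weights_delta over the k clients with the smallest loss."""
    chosen = sorted(outputs, key=lambda o: o.client_loss)[:k]
    total_weight = sum(o.client_weight for o in chosen)
    dim = len(chosen[0].weights_delta)
    return [
        sum(o.client_weight * o.weights_delta[i] for o in chosen) / total_weight
        for i in range(dim)
    ]


outputs = [
    ClientOutput([1.0, 2.0], client_weight=1.0, client_loss=0.9),
    ClientOutput([3.0, 4.0], client_weight=1.0, client_loss=0.1),
    ClientOutput([5.0, 6.0], client_weight=3.0, client_loss=0.2),
]
print(average_lowest_loss(outputs, k=2))  # [4.5, 5.5]
```

Inside a tff.tf_computation the same steps would have to be expressed with tf.data and TensorFlow ops on the collected sequence.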
I am using Sphinx for documentation and pytest for testing.
I need to generate a test plan but I really don't want to generate it by hand.
It occurred to me that a neat solution would be to embed test metadata in the tests themselves, within their respective docstrings. This metadata would include things like % complete, time remaining etc. I could then run through all of the tests (which would at this point include mostly placeholders) and generate a test plan from them. This would guarantee that the test plan and the tests stay in sync.
I was thinking of making either a pytest plugin or a sphinx plugin to handle this.
Using pytest, the closest hook I can see looks like pytest_collection_modifyitems which gets called after all of the tests are collected.
Alternatively, I was thinking of using Sphinx and perhaps copying/modifying the todolist plugin, as it seems like the closest match to this idea. Its output would be more useful because it would slot nicely into the existing Sphinx-based docs I have, but there is a lot going on in this plugin and I don't really have the time to invest in understanding it.
The docstrings could have something like this within it:
:plan_complete: 50 #% indicator of how complete this test is
:plan_remaining: 2 #the number of hours estimated to complete this test
:plan_focus: something #what is the test focused on testing
The idea is to then generate a simple markdown/rst or similar table based on the function's name, docstring and embedded plan info and use that as the test plan.
Does something like this already exist?
In the end I went with a pytest based plugin as it was just so much simpler to code.
If anyone else is interested, below is the plugin:
"""Module to generate a test plan table based upon metadata extracted from test
docstrings. The test description is extracted from the first sentence or up to
the first blank line. The data which is extracted from the docstrings are of the
format:
    :plan_remaining: 10 #number of hours remaining for this test to be complete. If
                        not present, assumed to be 0
    :plan_complete:     #the percentage of the test that is complete. If not
                        present, assumed to be 100
    :plan_focus:        The item the test is focusing on such as a DLL call.
"""
import pytest
import re
from functools import partial
from operator import itemgetter
from pathlib import Path

whitespace_re = re.compile(r'\s+')
cut_whitespace = partial(whitespace_re.sub, ' ')
plan_re = re.compile(r':plan_(\w+?):')
plan_handlers = {
    'remaining': lambda x:int(x.split('#')[0]),
    'complete': lambda x:int(x.strip().split('#')[0]),
    'focus': lambda x:x.strip().split('#')[0]
}
csv_template = """.. csv-table:: Test Plan
   :header: "Name", "Focus", "% Complete", "Hours remaining", "description", "path"
   :widths: 20, 20, 10, 10, 60, 100

{tests}

Overall hours remaining: {hours_remaining:.2f}

Overall % complete: {complete:.2f}

"""

class GeneratePlan:
    def __init__(self, output_file=Path('test_plan.rst')):
        self.output_file = output_file

    def pytest_collection_modifyitems(self, session, config, items):
        #breakpoint()
        items_to_parse = {i.nodeid.split('[')[0]:i for i in self.item_filter(items)}
        #parsed = map(parse_item, items_to_parse.items())
        parsed = [self.parse_item(n,i) for (n,i) in items_to_parse.items()]
        complete, hours_remaining = self.get_summary_data(parsed)
        self.output_file.write_text(csv_template.format(
            tests = '\n'.join(self.generate_rst_table(parsed)),
            complete=complete,
            hours_remaining=hours_remaining))

    def item_filter(self, items):
        return items #override me

    def get_summary_data(self, parsed):
        completes = [p['complete'] for p in parsed]
        overall_complete = sum(completes)/len(completes)
        overall_hours_remaining = sum(p['remaining'] for p in parsed)
        return overall_complete, overall_hours_remaining

    def generate_rst_table(self, items):
        "Use CSV type for simplicity"
        sorted_items = sorted(items, key=lambda x:x['name'])
        quoter = lambda x:'"{}"'.format(x)
        getter = itemgetter(*'name focus complete remaining description path'.split())
        for item in sorted_items:
            yield 3*' ' + ', '.join(map(quoter, getter(item)))

    def parse_item(self, path, item):
        "Process a pytest provided item"
        data = {
            'name': item.name.split('[')[0],
            'path': path.split('::')[0],
            'description': '',
            'remaining': 0,
            'complete': 100,
            'focus': ''
        }
        doc = item.function.__doc__
        if doc:
            desc = self.extract_description(doc)
            data['description'] = desc
            plan_info = self.extract_info(doc)
            data.update(plan_info)
        return data

    def extract_description(self, doc):
        first_sentence = doc.split('\n\n')[0].replace('\n',' ')
        return cut_whitespace(first_sentence)

    def extract_info(self, doc):
        plan_info = {}
        for sub_str in doc.split('\n\n'):
            cleaned = cut_whitespace(sub_str.replace('\n', ' '))
            splitted = plan_re.split(cleaned)
            if len(splitted) > 1:
                i = iter(splitted[1:]) #splitter starts at index 1
                while True:
                    try:
                        key = next(i)
                        val = next(i)
                    except StopIteration:
                        break
                    assert key
                    if key in plan_handlers:
                        plan_info[key] = plan_handlers[key](val)
        return plan_info
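To make the docstring format concrete, here is a small standalone sketch of how the :plan_...: fields come out of a test docstring. It reuses the same regular expression as the plugin but is a simplified version of extract_info (no per-paragraph split, values kept as strings). The test function test_dll_version and the helper parse_plan are hypothetical names used only for this illustration.

```python
import re

plan_re = re.compile(r':plan_(\w+?):')


def test_dll_version():
    """Check that the DLL reports its version.

    :plan_complete: 50 #% indicator of how complete this test is
    :plan_remaining: 2 #hours estimated to complete this test
    :plan_focus: version call #what the test focuses on
    """


def parse_plan(doc):
    # Collapse all whitespace so multi-line fields become a single line.
    flat = re.sub(r'\s+', ' ', doc)
    # With one capture group, split() returns [prefix, key1, val1, key2, val2, ...].
    parts = plan_re.split(flat)[1:]
    fields = dict(zip(parts[0::2], parts[1::2]))
    # Drop the trailing '#...' comment from each value.
    return {key: val.split('#')[0].strip() for key, val in fields.items()}


print(parse_plan(test_dll_version.__doc__))
# {'complete': '50', 'remaining': '2', 'focus': 'version call'}
```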
From my conftest.py file, I have a command line argument configured within a pytest_addoption function: parser.addoption('--generate_test_plan', action='store_true', default=False, help="Generate test plan")
And I then configure the plugin within this function:
def pytest_configure(config):
    output_test_plan_file = Path('docs/source/test_plan.rst')
    class CustomPlan(GeneratePlan):
        def item_filter(self, items):
            return (i for i in items if 'tests/hw_regression_tests' in i.nodeid)
    if config.getoption('generate_test_plan'):
        config.pluginmanager.register(CustomPlan(output_file=output_test_plan_file))
        #config.pluginmanager.register(GeneratePlan())
Finally, in one of my sphinx documentation source files I then just include the output rst file:
Autogenerated test_plan
=======================
The below test_data is extracted from the individual tests in the suite.
.. include:: test_plan.rst
We have done something similar in our company by using Sphinx-needs and Sphinx-Test-Reports.
Inside a test file we use the docstring to store our test-case incl meta-data:
def my_test():
    """
    .. test:: My test case
       :id: TEST_001
       :status: in progress
       :author: me

       This test case checks for **awesome** stuff.
    """
    a = 2
    b = 5
    # ToDo: check if a+b = 7
Then we document the test cases by using autodoc.
My tests
========

.. automodule:: test.my_tests
   :members:
This results in some nice test-case objects in sphinx, which we can filter, link and present in table and flowcharts. See Sphinx-Needs.
With Sphinx-Test-Reports we are loading the results into the docs as well:
.. test-report:: My Test report
   :id: REPORT_1
   :file: ../pytest_junit_results.xml
   :links: [[tr_link('case_name', 'signature')]]
This will create objects for each test case, which we also can filter and link.
Thanks to tr_link, the result objects get automatically linked to the test case objects.
After that we have all needed information in sphinx and can use e.g. .. needtable:: to get custom views on it.
Apologies if this is a straightforward question, I couldn't find anything in the docs.
Currently my workflow looks something like this. I'm taking a number of input files created as part of this workflow, and summarizing them.
Is there a way to avoid this manual regex step to parse the wildcards in the filenames?
I thought about an "expand" of cross_ids and config["chromosomes"], but I'm unsure how to guarantee a consistent order.
rule report:
    output:
        table="output/mendel_errors.txt"
    input:
        files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
    params:
        req="h_vmem=4G",
    run:
        df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])
        for i, fn in enumerate(input.files):
            # open fn / make calculations etc // stat =
            # manual regex of filename to get chrom cross // chrom, cross =
            df.loc[i] = stat, chrom, cross
This seems a bit awkward when this information must be in the environment somewhere.
(via Johannes Köster on the google group)
To answer your question:
Expand uses itertools.product from the standard library. Hence, you could write
from itertools import product
product(config["chromosomes"], cross_ids)
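As a minimal sketch of how this removes the regex step: iterating over the same product alongside the file list gives you chrom and cross for each file directly. The chromosome names and cross IDs below are made-up stand-ins for config["chromosomes"] and cross_ids, and the file list is built by hand here on the assumption that expand walks its keyword arguments in the order given.

```python
from itertools import product

chromosomes = ["2L", "2R"]        # stand-in for config["chromosomes"]
cross_ids = ["c1", "c2", "c3"]    # hypothetical cross IDs

# Same nesting order as the keyword arguments: chrom varies slowest.
combos = list(product(chromosomes, cross_ids))
files = ["output/{}/{}.in".format(chrom, cross) for chrom, cross in combos]

# Pair each file with the wildcard values that produced it.
for fn, (chrom, cross) in zip(files, combos):
    print(fn, chrom, cross)
```

Inside the run block, the loop would become something like `for i, (fn, (chrom, cross)) in enumerate(zip(input.files, product(config["chromosomes"], cross_ids)))`.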