I understand that we have to use collect() when we run a process that takes as input two channels, where the first channel has one element and the second one has more than one element:
#!/usr/bin/env nextflow

nextflow.enable.dsl=2

process A {
    input:
    val(input1)

    output:
    path 'index.txt', emit: foo

    script:
    """
    echo 'This is an index' > index.txt
    """
}

process B {
    input:
    val(input1)
    path(input2)

    output:
    path("${input1}.txt")

    script:
    """
    cat <(echo ${input1}) ${input2} > \"${input1}.txt\"
    """
}

workflow {
    A( Channel.from( 'A' ) )

    // This would only run for one element of the first channel:
    B( Channel.from( 1, 2, 3 ), A.out.foo )

    // and this for all of them as intended:
    B( Channel.from( 1, 2, 3 ), A.out.foo.collect() )
}
Now the question: Why can this line in the example workflow from nextflow-io (https://github.com/nextflow-io/rnaseq-nf/blob/master/modules/rnaseq.nf#L15) work without using collect() or toList()?
It is the same situation: a channel with one element (the index) and a channel with more than one element (the fastq pairs) are used by the same process (quant), and it runs on all fastq files. What am I missing compared to my dummy example?
You need to create the first channel with the value factory, which produces a value channel that is never exhausted.
Your linked example implicitly creates a value channel, which is why it works. The same happens when you call .collect() on A.out.foo.
Channel.from (or the more modern Channel.of) creates a queue channel, which can be exhausted; that is why B only runs once even though its first input channel has three elements.
So
A( Channel.value('A') )
is all you need.
I have a scenario where I need to do the following:
Read data from Pub/Sub
Apply multiple transformations to the data.
Persist the PCollection into multiple Google BigQuery tables based on some config.
My question is: how can I write data to multiple BigQuery tables?
I searched for multiple BigQuery writes using Apache Beam but could not find a solution.
You can do that with 3 sinks. Example with Beam Python:
import logging

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


def map1(element):
    ...

def map2(element):
    ...

def map3(element):
    ...

def main() -> None:
    logging.getLogger().setLevel(logging.INFO)

    # your_options = PipelineOptions().view_as(YourOptions)  # your custom options, if any
    pipeline_options = PipelineOptions()

    with beam.Pipeline(options=pipeline_options) as p:
        result_pcollection = (
            p
            | 'Read from pub sub' >> ReadFromPubSub(subscription='input_subscription')
            | 'Map 1' >> beam.Map(map1)
            | 'Map 2' >> beam.Map(map2)
            | 'Map 3' >> beam.Map(map3)
        )

        (result_pcollection
         | 'Write to BQ table 1' >> beam.io.WriteToBigQuery(
             project='project_id',
             dataset='dataset',
             table='table1',
             method='STREAMING_INSERTS',
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection
         | 'Write to BQ table 2' >> beam.io.WriteToBigQuery(
             project='project_id',
             dataset='dataset',
             table='table2',
             method='STREAMING_INSERTS',
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

        (result_pcollection
         | 'Write to BQ table 3' >> beam.io.WriteToBigQuery(
             project='project_id',
             dataset='dataset',
             table='table3',
             method='STREAMING_INSERTS',
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))


if __name__ == "__main__":
    main()
The first PCollection is the result of the input from Pub/Sub.
I applied 3 transformations to the input PCollection.
The result is then sunk to the 3 different BigQuery tables:
res = Flow
=> Map 1
=> Map 2
=> Map 3
res => Sink result to BQ table 1 with `BigqueryIO`
res => Sink result to BQ table 2 with `BigqueryIO`
res => Sink result to BQ table 3 with `BigqueryIO`
In this example I used STREAMING_INSERTS for ingestion into the BigQuery tables, but you can adapt and change the method if needed in your case.
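For instance, if low-latency streaming inserts are not required, the same sink could be configured for batch file loads instead. This is a minimal sketch, not part of the original answer; the project, dataset and table names are placeholders, and triggering_frequency (in seconds) is only needed when FILE_LOADS is used in a streaming pipeline:

import apache_beam as beam

# Sketch: the same BigQuery sink as above, switched from streaming inserts
# to periodic batch file loads (placeholder project/dataset/table names).
write_to_table1 = beam.io.WriteToBigQuery(
    table='table1',
    dataset='dataset',
    project='project_id',
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=300,  # load a batch of files every 5 minutes
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)

It would be applied exactly like the streaming sinks above, e.g. result_pcollection | 'Write to BQ table 1' >> write_to_table1.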
I see the previous answers satisfy your requirement of writing the same result to multiple tables. However, I assume the scenario below, which leads to a slightly different pipeline:
Read data from PubSub
Filter the data based on configs (from event message keys)
Apply the different/same transformation to the filtered collections
Write results from previous collections to different BigQuery Sinks
Here, we filter the events at an early stage in the pipeline, which helps to:
Avoid processing the same event messages multiple times.
Skip the messages which are not needed.
Apply only the relevant transformations to the event messages.
Build an overall efficient and cost-effective system.
For example, you are processing messages from all around the world and you need to process and store the data with respect to geography - storing Europe messages in the Europe region.
Also, you need to apply transformations which are relevant to the country-specific data - add an Aadhar number to messages generated from India and Social Security number to messages generated from the USA.
And you don't want to process/store any events from specific countries - say, data from oceanic countries is irrelevant and does not need to be processed/stored in our use case.
So, in this made-up example, by filtering the data (based on the config) at an early stage, you will be able to store country-specific data (multiple sinks), you don't have to process all events generated from the USA/any other region just to add an Aadhar number (event-specific transformations), and you will be able to skip/drop irrelevant records or simply store them in BigQuery without applying any transformations.
If the above made-up example resembles your scenario, the sample pipeline design may look like this:
import apache_beam as beam
from apache_beam import pvalue
from apache_beam.options.pipeline_options import PipelineOptions, ...
from apache_beam.io.gcp.internal.clients import bigquery


class TaggedData(beam.DoFn):
    def process(self, element):
        try:
            # filter here
            if element["country"] == "in":
                yield pvalue.TaggedOutput("india", element)
            if element["country"] == "usa":
                yield pvalue.TaggedOutput("usa", element)
            ...
        except Exception:
            yield pvalue.TaggedOutput("unprocessed", element)


def addAadhar(element):
    "Filtered messages - only India"
    yield "elementwithAadhar"


def addSSN(element):
    "Filtered messages - only USA"
    yield "elementwithSSN"


p = beam.Pipeline(options=options)

messages = (
    p
    | "ReadFromPubSub" >> ...
    | "Tagging" >> beam.ParDo(TaggedData()).with_outputs('usa', 'india', 'oceania', ...)
)

india_messages = (
    messages.india
    | "AddAadhar" >> ...
    | "WriteIndiamsgToBQ" >> streaming inserts
)

usa_messages = (
    messages.usa
    | "AddSSN" >> ...
    | "WriteUSAmsgToBQ" >> streaming inserts
)

oceania_messages = (
    messages.oceania
    | "DoNothing&WriteOceaniamsgToBQ" >> streaming inserts
)

deadletter = (
    (messages.unprocessed, stage1.failed, stage2.failed)
    | "CombineAllFailed" >> Flatn...
    | "WriteUnprocessed/InvalidMessagesToBQ" >> streaminginserts...
)
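Since the design above is intentionally a sketch, here is a minimal, runnable example (not part of the original answer; the tags and elements are made up) of the tagged-output mechanism it relies on, i.e. beam.ParDo(...).with_outputs together with pvalue.TaggedOutput:

import apache_beam as beam
from apache_beam import pvalue


class TagByCountry(beam.DoFn):
    # Illustrative DoFn: route each element to a named output based on a key.
    def process(self, element):
        if element.get("country") == "in":
            yield pvalue.TaggedOutput("india", element)
        elif element.get("country") == "usa":
            yield pvalue.TaggedOutput("usa", element)
        else:
            yield element  # main (untagged) output


with beam.Pipeline() as p:
    tagged = (
        p
        | beam.Create([{"country": "in"}, {"country": "usa"}, {"country": "nz"}])
        | beam.ParDo(TagByCountry()).with_outputs("india", "usa", main="other")
    )
    # Each tagged output is an ordinary PCollection with its own downstream steps.
    tagged.india | "PrintIndia" >> beam.Map(print)
    tagged.usa | "PrintUSA" >> beam.Map(print)
    tagged.other | "PrintOther" >> beam.Map(print)

Each of these tagged PCollections can then get its own transformations and its own WriteToBigQuery sink, as in the sketch above.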
I am reading in a file (see below). The example file has 13 rows.
A|doe|chemistry|100|A|
B|shea|maths|90|A|
C|baba|physics|80|B|
D|doe|chemistry|100|A|
E|shea|maths|90|A|
F|baba|physics|80|B|
G|doe|chemistry|100|A|
H|shea|maths|90|A|
I|baba|physics|80|B|
J|doe|chemistry|100|A|
K|shea|maths|90|A|
L|baba|physics|80|B|
M|doe|chemistry|100|A|
Then I iterate over these rows using a For Each scope (batch size 5) and call a REST API for each batch.
Depending on the REST API response (success or failure) I write the payloads to the respective success/error files.
I have mocked the called API so that the first batch of 5 records will fail and the rest of the records will succeed.
While writing to the success/error files I am using the following transformation:
output application/csv quoteValues=true,header=false,separator="|"
---
payload
All of this works fine.
Success log file:
"F"|"baba"|"physics"|"80"|"B"
"G"|"doe"|"chemistry"|"100"|"A"
"H"|"shea"|"maths"|"90"|"A"
"I"|"baba"|"physics"|"80"|"B"
"J"|"doe"|"chemistry"|"100"|"A"
"K"|"shea"|"maths"|"90"|"A"
"L"|"baba"|"physics"|"80"|"B"
"M"|"doe"|"chemistry"|"100"|"A"
Error log file:
"A"|"doe"|"chemistry"|"100"|"A"
"B"|"shea"|"maths"|"90"|"A"
"C"|"baba"|"physics"|"80"|"B"
"D"|"doe"|"chemistry"|"100"|"A"
"E"|"shea"|"maths"|"90"|"A"
Now what I want to do is add the row/line number to each of these files, so that when this goes to production, whoever is monitoring these files can easily understand them and correlate them with the original file.
So, as an example, in the case of the error log file (the first batch failed, which is rows 1 to 5) I want to prepend these numbers to each of the rows:
"1"|"A"|"doe"|"chemistry"|"100"|"A"
"2"|"B"|"shea"|"maths"|"90"|"A"
"3"|"C"|"baba"|"physics"|"80"|"B"
"4"|"D"|"doe"|"chemistry"|"100"|"A"
"5"|"E"|"shea"|"maths"|"90"|"A"
I am not sure what I should write in DataWeave to achieve this.
Inside the For Each scope you have access to the counter vars.counter (or whatever name you have chosen, since it is configurable).
You will need to iterate over each chunk of records and add the position to each one. You can use something like:
%dw 2.0
output application/csv quoteValues=true,header=false,separator="|"
var batchSize = 5
---
payload map (
    {counter: batchSize * (vars.counter - 1) + ($$ + 1)} ++ $
)
Or if you prefer to use the update function (this will add the record counter at the last column instead though):
%dw 2.0
output application/csv quoteValues=true,header=false,separator="|"
var batchSize = 5
---
payload map (
    $ update {
        case .counter! -> batchSize * (vars.counter - 1) + ($$ + 1)
    }
)
Remember to replace the batchSize variable in this code with the same value you're using in the For Each scope (if it's parameterised, even better).
Edit 1 -
Clarification: the - 1 and + 1 reconcile the two indexes: vars.counter from the For Each scope starts at 1 (so - 1 turns it into a zero-based batch index), while $$ from map is zero-based (so + 1 produces row numbers starting at 1). For example, the first record of the second batch (vars.counter = 2, $$ = 0) gets 5 * (2 - 1) + (0 + 1) = 6.
Just another workaround, to simplify things without using any external variables. The script is split into two parts: the first is for the error group and the second is for the success group.
%dw 2.0
output application/csv quoteValues=true,header=false,separator="|"
// Will be used for creating a counter for Error group
var errorIdx = 1
// Will be used for creating a counter for Success group
var successIdx = 6
---
//errorItems for the first 5 rows
(payload[0 to 4] map (items,idx) -> (({"0":(idx) + errorIdx} ++ items)))
++
//successItems from 6 and remaining items.
(payload[5 to -1] map (items,idx) -> (({"0":(idx) + successIdx} ++ items)))
DataWeave Inline Variables:
errorIdx is the starting value for the error counter
successIdx is the starting value for the success counter
This extracts the elements from index 0 to 4:
payload[0 to 4]
This extracts from index 5 to the last element:
payload[5 to -1]
I have a particular use case for which I have not found the solution in the Snakemake documentation.
Let's say in a given pipeline I have a portion with 3 rules a, b and c which will run for N samples.
Those rules handle large amounts of data, and because of local storage limits I do not want these rules to execute for several samples at the same time. For instance, rule a produces the large amount of data and then rule c compresses and exports the results.
So what I am looking for is a way to chain those 3 rules for one sample/wildcard, and only then execute those 3 rules for the next sample, all of this to make sure that local space stays available.
Thanks
I agree that this is a problem that Snakemake still has no solution for. However, you may use a workaround:
rule all:
    input: expand("a{sample}", sample=[1, 2, 3])

rule a:
    input: "b{sample}"
    output: "a{sample}"

rule b:
    input: "c{sample}"
    output: "b{sample}"

rule c:
    input:
        lambda wildcards: f"a{int(wildcards.sample) - 1}"
    output: "c{sample}"
That means that rule c for sample 2 won't start before the output of rule a for sample 1 is ready. You do need to add a pseudo output a0 though, or make the lambda more complicated.
So, building on Dmitry Kuzminov's answer, the following can work (both with numbers and with strings as samples).
The execution order will be a3 > b3 > a1 > b1 > a2 > b2.
I used a different sample order to show that it can differ from the sample list.
samples = [1, 2, 3]
sample_order = [3, 1, 2]

def get_previous(wildcards):
    order = [str(s) for s in sample_order]  # wildcard values are strings
    if wildcards.sample != order[0]:  # if different from sample 3 in this case
        previous_sample = order[order.index(wildcards.sample) - 1]
        return f"b_out_{previous_sample}"
    else:  # if it is the first sample in the order, i.e. 3
        # no extra dependency; alternatively return a dummy file that is
        # always present, e.g. the Snakefile containing these rules
        return []

rule all:
    input:
        expand("b_out_{S}", S=samples)

rule a:
    input:
        "a_in_{sample}",
        get_previous
    output:
        "a_out_{sample}"

rule b:
    input:
        "a_out_{sample}"
    output:
        "b_out_{sample}"
I'm new to Snakemake and I can't figure out this problem.
I've got my rule which has two inputs:
rule test:
    input:
        input_file1 = f1,
        input_file2 = f2
f1 is in [A{1}$, A{2}£, B{1}€, B{2}¥]
f2 is in [C{1}, C{2}]
The numbers are wildcards that come from an expand call. I need to find a way to pass to input_file1 and input_file2 a pair of files whose numbers match exactly. For example:
f1 = A1
f2 = C1
or
f1 = B1
f2 = C1
I have to avoid combinations such as:
f1 = A1
f2 = C2
I would create a function that makes this kind of match between the files, but it would have to manage input_file1 and input_file2 at the same time. I thought of making a function that creates a dictionary with the different allowed combinations, but how would I "iterate" over it during the expand?
Thanks
Assuming rule test gives as output a file named {f1}.{f2}.txt, you need some mechanism that correctly pairs f1 and f2 and creates a list of {f1}.{f2}.txt files.
How you create this list is up to you; expand is just a convenience function for that, and in this case you may want to avoid it.
Here's a super simple example:
import re

fin1 = ['A1$', 'A2£', 'B1€', 'B2¥']
fin2 = ['C1', 'C2']

outfiles = []
for x in fin1:
    for y in fin2:
        ## Here you pair f1 and f2. This is a very trivial way of doing it:
        if y[1] in x:
            outfiles.append('%s.%s.txt' % (x, y))

wildcard_constraints:
    f1 = '|'.join([re.escape(x) for x in fin1]),
    f2 = '|'.join([re.escape(x) for x in fin2]),

rule all:
    input:
        outfiles,

rule test:
    input:
        input_f1 = '{f1}.txt',
        input_f2 = '{f2}.txt',
    output:
        '{f1}.{f2}.txt',
    shell:
        r"""
        cat {input} > {output}
        """
This pipeline will execute the following commands:
cat A2£.txt C2.txt > A2£.C2.txt
cat A1$.txt C1.txt > A1$.C1.txt
cat B1€.txt C1.txt > B1€.C1.txt
cat B2¥.txt C2.txt > B2¥.C2.txt
If you touch the starting input files with touch 'A1$.txt' 'A2£.txt' 'B1€.txt' 'B2¥.txt' 'C1.txt' 'C2.txt' you should be able to run this example.
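As an aside, if you prefer the dictionary idea from the question, expand can also consume pre-paired lists via its zip combinator instead of taking the product of the wildcards. A minimal sketch, assuming the same file naming as above:

# Hypothetical pairing dictionary; expand(..., zip, ...) pairs the lists element-wise.
pairs = {'A1$': 'C1', 'A2£': 'C2', 'B1€': 'C1', 'B2¥': 'C2'}
outfiles = expand('{f1}.{f2}.txt', zip, f1=list(pairs.keys()), f2=list(pairs.values()))

The resulting outfiles list can then be used in rule all exactly as in the example above.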
I have some JSON stored in a database column that looks like this:
pokeapi=# SELECT height FROM pokeapi_pokedex WHERE species = 'Ninetales';
-[ RECORD 1 ]------------------------------------------
height | {"default": {"feet": "6'07\"", "meters": 2.0}}
As part of a 'generation' algorithm I'm working on, I'd like to take this value into a %hash, multiply it by (0.9..1.1).rand (to allow for a 'natural' 10% variance in the height), and then create a new %hash with the same structure. My select-height method looks like this:
method select-height(:$species, :$form = 'default') {
    my %heights = $.data-source.get-height(:$species, :$form);
    my %height = %heights * (0.9..1.1).rand;
    say %height;
}
Which actually calls my get-height routine to get the 'average' heights (in both metric and imperial) for that species.
method get-height (:$species, :$form) {
    my $query = dbh.prepare(qq:to/STATEMENT/);
        SELECT height FROM pokeapi_pokedex WHERE species = ?;
        STATEMENT

    $query.execute($species);
    my %height = from-json($query.row);
    my %heights = self.values-or-defaults(%height, $form);
    return %heights;
}
However, I'm given the following error on execution (I assume because I'm trying to multiply the hash as a whole rather than the individual elements of the hash):
$ perl6 -I lib/ examples/height-weight.p6
{feet => 6'07", meters => 2}
Odd number of elements found where hash initializer expected:
Only saw: 1.8693857987465123e0
in method select-height at /home/kane/Projects/kawaii/p6-pokeapi/lib/Pokeapi/Pokemon/Generator.pm6 (Pokeapi::Pokemon::Generator) line 22
in block <unit> at examples/height-weight.p6 line 7
Is there an easier (and working) way of doing this without duplicating my code for each element? :)
Firstly, there is an issue with the logic of your code. Initially, you are getting a hash of values, "feet": "6'07\"", "meters": 2.0, parsed out of JSON, with meters being a number and feet being a string. Next, you are trying to multiply it by a random value... And while that will work for a number, it won't for a string. Perl 6 allomorphs actually allow you to do that: say "5" * 3 will return 15, but the X'Y" pattern is complex enough that Perl 6 won't naturally understand it.
So you likely need to convert it before processing, and to convert it back afterwards.
The second thing is the exact line that leads to the error you are observing.
Consider this:
my %a = a => 5;
%a = %a * 10 => 5; # %a becomes a hash with a single value of 10 => 5
# It happens because when a Hash is used in math ops, its size is used as a value
# Thus, if you have a single value, it'll become 1 * 10, thus 10
# And for %a = a => 1, b => 2; %a * 5 will be evaluated to 10
%a = %a * 10; # error, the key is passed, but not a value
To work directly on the hash values, you want to use the map method and process every pair, for example: %a .= map({ .key => .value * (0.9..1.1).rand }).
Of course, it can be golfed or written in another manner, but the main issue is resolved this way.
You've accepted @Takao's answer. That solution requires manually digging into %hash to get to leaf hashes/lists and then applying map.
Given that your question's title mentions "return ... same structure" and the body includes what looks like a nested structure, I think it's important to have an answer providing some idiomatic solutions for automatically descending into and duplicating a nested structure:
my %hash = :a{:b{:c,:d}}
say my %new-hash = %hash».&{ (0.9 .. 1.1) .rand }
# {a => {b => {c => 1.0476391741359872, d => 0.963626602773474}}}
# Update leaf values of original `%hash` in-place:
%hash».&{ $_ = (0.9 .. 1.1) .rand }
# Same effect:
%hash »*=» (0.9..1.1).rand;
# Same effect:
%hash.deepmap: { $_ = (0.9..1.1).rand }
Hyperops (eg ») iterate one or two data structures to get to their leaves and then apply the op being hypered:
say %hash».++ # in-place increment leaf values of `%hash` even if nested
.&{ ... } calls the closure in braces using method call syntax. Combining this with a hyperop one can write:
%hash».&{ $_ = (0.9 .. 1.1) .rand }
Another option is .deepmap:
%hash.deepmap: { $_ = (0.9..1.1).rand }
A key difference between hyperops and deepmap is that the compiler is allowed to iterate data structures and run hyperoperations in parallel in any order whereas deepmap iteration always occurs sequentially.