Snakemake: Can you expand on two dependent variables?

I'm running associations for a list of genes and markers. I have a list of genes genes = ['gene1', 'gene2', ...] and a dictionary where the keys are gene names and the values are lists of markers that I want to associate with that gene, i.e. markers = {'gene1': ['marker1.1', 'marker1.2', ...], 'gene2': ['marker2.1', 'marker2.2', ...], ...}. I have a rule that outputs a file gene/assoc/marker for a given gene and a marker.
Is it possible to expand on the genes list and the markers dictionary simultaneously, such that the gene being expanded on works as a key into the dict? Something akin to the following:
markers = {
    'gene1': ['marker1.1', 'marker1.2', ...],
    'gene2': ['marker2.1', 'marker2.2', ...],
    ...
}
genes = markers.keys()

rule all:
    input:
        expand('{gene}/assoc/{marker}', gene=genes, marker=markers[current_gene])

You can build the list of targets with expand in advance:
gimme_files = []
for gene in markers:
    _gimme_per_gene = expand('{gene}/assoc/{marker}', gene=gene, marker=markers[gene])
    gimme_files.extend(_gimme_per_gene)

rule all:
    input:
        gimme_files
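For the example markers dictionary above, this yields one target per gene/marker pair. A quick sketch of what gimme_files would then contain (paths are illustrative, following the placeholder names):
# Hypothetical contents of gimme_files for the two-gene example
['gene1/assoc/marker1.1', 'gene1/assoc/marker1.2',
 'gene2/assoc/marker2.1', 'gene2/assoc/marker2.2']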

Related

Nextflow DSL2 output from different processes mixed up as input in later processes

I have a DSL2 Nextflow pipeline that branches out into two FILTER processes. Then, in the CONCAT process, I reuse the two previous process outputs as input, and in the SUMMARY process I again reuse earlier process outputs as input.
I am finding that when I run the pipeline with 2 or more pairs of fastq samples, the inputs get mixed up.
For example, at the CONCAT step I end up concatenating the bwa_2_ch output of one pair of fastq samples with the filter_1_ch output of another pair of fastq samples, instead of outputs with the same pair_id.
I believe I am not writing the workflow { } channels and inputs correctly enough for the workflow to run through the steps without mixing samples, but I am not sure how to define the inputs so that there is no mix-up.
//trimmomatic read trimming
process TRIM {
    tag "trim ${pair_id}"
    publishDir "${params.outdir}/$pair_id/trim_results"

    input:
    tuple val(pair_id), path(reads)

    output:
    tuple val(pair_id), path("trimmed_${pair_id}_...")

    script:
    """
    """
}

//bwa alignment
process BWA_1 {
    tag "align-1 ${pair_id}f"
    publishDir "${params.outdir}/$pair_id/..."

    input:
    tuple val(pair_id), path(reads)
    path index

    output:
    tuple val(pair_id), path("${pair_id}_...}")

    script:
    """
    """
}

process FILTER_1 {
    tag "filter ${pair_id}"
    publishDir "${params.outdir}/$pair_id/filter_results"

    input:
    tuple val(pair_id), path(reads)

    output:
    tuple val(pair_id), path("${pair_id}_...")

    script:
    """
    """
}

process FILTER_2 {
    tag "filter ${pair_id}"
    publishDir "${params.outdir}/$pair_id/filter_results"

    input:
    tuple val(pair_id), path(reads)

    output:
    tuple val(pair_id), path("${pair_id}_...")

    script:
    """
    """
}

//bwa alignment
process BWA_2 {
    tag "align-2 ${pair_id}"
    publishDir "${params.outdir}/$pair_id/bwa_2_results"

    input:
    tuple val(pair_id), path(reads)
    path index

    output:
    tuple val(pair_id), path("${pair_id}_...}")

    script:
    """
    """
}

//concatenate pf and non_human reads
process CONCAT {
    tag "concat ${pair_id}"
    publishDir "${params.outdir}/$pair_id"

    input:
    tuple val(pair_id), path(program_reads)
    tuple val(pair_id), path(pf_reads)

    output:
    tuple val(pair_id), path("${pair_id}_...")

    script:
    """
    """
}

//summary
process SUMMARY {
    tag "summary ${pair_id}"
    publishDir "${params.outdir}/$pair_id"

    input:
    tuple val(pair_id), path(trim_reads)
    tuple val(pair_id), path(non_human_reads)

    output:
    file("summary_${pair_id}.csv")

    script:
    """
    """
}

workflow {
    Channel
        .fromFilePairs(params.reads, checkIfExists: true)
        .set { read_pairs_ch }

    // trim reads
    trim_ch = TRIM(read_pairs_ch)
    // map to pf genome
    bwa_1_ch = BWA_1(trim_ch, params.pf_index)
    // filter mapped reads
    filter_1_ch = FILTER_1(bwa_1_ch)
    filter_2_ch = FILTER_2(bwa_1_ch)
    // map to pf and human genome
    bwa_2_ch = BWA_2(filter_2_ch, params.index)
    // concatenate non human reads
    concat_ch = CONCAT(bwa_2_ch, filter_1_ch)
    // summarize
    summary_ch = SUMMARY(trim_ch, concat_ch)
}
Mix-ups like this usually occur when a process erroneously receives two or more queue channels. Most of the time, what you want is one queue channel and one or more value channels when you require multiple input channels. Here, I'm not sure exactly what pair_id would be bound to, but it likely won't be what you expect:
input:
tuple val(pair_id), path(program_reads)
tuple val(pair_id), path(pf_reads)
What you want to do is replace the above with:
input:
tuple val(pair_id), path(program_reads), path(pf_reads)
And then use the join operator to create the required inputs. The join operator matches items emitted by two channels that share the same key, which by default is the first element of each tuple (here, pair_id), so each sample's outputs stay together. For example:
workflow {
    Channel
        .fromFilePairs( params.reads, checkIfExists: true )
        .set { read_pairs_ch }

    pf_index = file( params.pf_index )
    bwa_index = file( params.bwa_index )

    // trim reads
    trim_ch = TRIM( read_pairs_ch )

    // map to pf genome
    bwa_1_ch = BWA_1( trim_ch, pf_index )

    // filter mapped reads
    filter_1_ch = FILTER_1( bwa_1_ch )
    filter_2_ch = FILTER_2( bwa_1_ch )

    // map to pf and human genome
    bwa_2_ch = BWA_2( filter_2_ch, bwa_index )

    // concatenate non human reads
    concat_ch = bwa_2_ch \
        | join( filter_1_ch ) \
        | CONCAT

    // summarize
    summary_ch = trim_ch \
        | join( concat_ch ) \
        | SUMMARY
}

pandas: how to remove duplicates from a deeply nested list of lists

I have a panda dataframe like the following:
import pandas as pd

df = pd.DataFrame({'text': ['the weather is nice though', 'How are you today', 'the beautiful girl and the nice boy']})
df['sentence_number'] = df.index + 1
df['token'] = df['text'].str.split().tolist()
df = df.explode('token').reset_index(drop=True)
I have to have a column for tokens as I need it for another project. I have applied the following to my dataframe.
import spacy

nlp = spacy.load("en_core_web_sm")

dep_children_sm = []

def dep_children_tagger(txt):
    children = [[[child for child in n.children] for n in doc] for doc in nlp.pipe(txt)]
    dep_children_sm.append(children)

dep_children_tagger(df.text)
Since one has to apply n.children at the sentence level, I have to use the text column and not the token column, so the output contains repetitions. I would now like to remove these repetitions from my list dep_children_sm, and I have done the following:
children_flattened = [item for sublist in dep_children_sm for item in sublist]
list(k for k, _ in itertools.groupby(children_flattened))
but nothing happens, and I still have the repeated lists. I have also tried adding drop_duplicates() to the text column when calling the function, but the problem is that I have duplicate sentences in my original dataframe, so unfortunately I cannot do that.
desired output = [[[], [the], [weather, nice, though], [], []], [[], [How, you, today], [], []], [[], [], [the, beautiful, and, boy], [], [], [], [the, nice]]]
It seems that you want to apply your function to the unique texts only. You can first use the pandas.Series.unique method on df.text:
>>> df['text'].unique()
array(['the weather is nice though', 'How are you today',
'the beautiful girl and the nice boy'], dtype=object)
Then I would simplify your function to directly output the result. There is no need for a global list. Also, your function was adding an extra level of list, which seems unwanted.
def dep_children_tagger(txt):
    return [[[child for child in n.children] for n in doc] for doc in nlp.pipe(txt)]
Finally, apply your function on the unique texts:
dep_children_sm = dep_children_tagger(df['text'].unique())
This gives:
>>> dep_children_sm
[[[], [the], [weather, nice, though], [], []],
[[], [How, you, today], [], []],
[[], [], [the, beautiful, and, boy], [], [], [], [the, nice]]]
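If you later need to line these per-sentence results up with the (possibly duplicated) rows of df, one option is to key them by the unique text. A minimal sketch, assuming the simplified dep_children_tagger above (names are illustrative):
# Compute the children lists once per unique sentence, then look them up as needed
unique_texts = df['text'].unique()
children_by_text = dict(zip(unique_texts, dep_children_tagger(unique_texts)))

# Any sentence, duplicated in the dataframe or not, maps to a single result
children_by_text['How are you today']
# [[], [How, you, today], [], []]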
OK, I figured out how to solve this issue. The problem was that the function outputs a list of lists of lists of spaCy tokens, and since at no point is there any string in this nested structure, the itertools approach does not work.
Since I cannot remove the duplicates from the text column in my analysis, I have done the following instead:
d = [' '.join([str(c) for c in lst]) for lst in children_flattened]
list(set(d))
This outputs a list of strings excluding duplicates:
# ['[] [How, you, today] [] []',
# '[] [the] [weather, nice, though] [] []',
# '[] [] [the, beautiful, and, boy] [] [] [] [the, nice]']

combine two lists to PCollection

I'm using Apache Beam. When writing to TFRecord, I need to include the ID of each item along with its text and embedding.
The tutorial works with just one list of texts, but I also have a list of IDs matching the list of texts, so I was wondering how I could pass the IDs to the following function:
def to_tf_example(entries):
    examples = []
    text_list, embedding_list = entries
    for i in range(len(text_list)):
        text = text_list[i]
        embedding = embedding_list[i]
        features = {
            # need to pass in ID here like so:
            'id': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[ids.encode('utf-8')])),
            'text': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
            'embedding': tf.train.Feature(
                float_list=tf.train.FloatList(value=embedding.tolist()))
        }
        example = tf.train.Example(
            features=tf.train.Features(
                feature=features)).SerializeToString(deterministic=True)
        examples.append(example)
    return examples
My first thought was just to include the ids in the text column of my database and then extract them via slicing or regex, but I was wondering if there is a better way. I assume it involves combining them into one PCollection, but I don't know where to start. Here is the pipeline:
with beam.Pipeline(args.runner, options=options) as pipeline:
    query_data = pipeline | 'Read data from BigQuery' >> beam.io.Read(
        beam.io.BigQuerySource(project='my-project', query=get_data(args.limit), use_standard_sql=True))

    # list of texts
    text = query_data | 'get list of text' >> beam.Map(lambda x: x['text'])
    # list of ids
    ids = query_data | 'get list of ids' >> beam.Map(lambda x: x['id'])

    ( text
        | 'Batch elements' >> util.BatchElements(
            min_batch_size=args.batch_size, max_batch_size=args.batch_size)
        | 'Generate embeddings' >> beam.Map(
            generate_embeddings, args.module_url, args.random_projection_matrix)
        | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
        | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
            file_path_prefix='{0}'.format(args.output_dir),
            file_name_suffix='.tfrecords')
    )

    query_data | 'Convert to entity and write to datastore' >> beam.Map(
        lambda input_features: create_entity(input_features, args.kind))
I altered generate_embeddings to return List[int], List[str], List[List[float]] and then used the following function to pass in the lists of ids and text:
def generate_embeddings_for_batch(batch, module_url, random_projection_matrix):
    embeddings = generate_embeddings(
        [x['id'] for x in batch], [x['text'] for x in batch],
        module_url, random_projection_matrix)
    return embeddings
Here I'll assume generate_embeddings has the signature List[str], ... -> (List[str], List[List[float]])
What you want to do is avoid splitting your texts and ids into separate PCollections. So you might want to write something like:
from typing import Iterable, List, Tuple

def generate_embeddings_for_batch(
        batch,
        module_url,
        random_projection_matrix) -> Iterable[Tuple[int, str, List[float]]]:
    # generate_embeddings is assumed to return (List[str], List[List[float]])
    text_list, embedding_list = generate_embeddings(
        [x['text'] for x in batch], module_url, random_projection_matrix)
    text_to_embedding = dict(zip(text_list, embedding_list))
    for x in batch:
        yield x['id'], x['text'], text_to_embedding[x['text']]
From there you should be able to write to_tf_example.
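A minimal sketch of what that to_tf_example could then look like, assuming each element is an (id, text, embedding) tuple as yielded above (the names and the final beam.Map usage are illustrative, not the tutorial's exact code):
import tensorflow as tf

def to_tf_example(element):
    # Unpack one (id, text, embedding) tuple and build a serialized tf.train.Example
    item_id, text, embedding = element
    features = {
        'id': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[str(item_id).encode('utf-8')])),
        'text': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
        'embedding': tf.train.Feature(
            float_list=tf.train.FloatList(value=list(embedding)))
    }
    return tf.train.Example(
        features=tf.train.Features(feature=features)).SerializeToString(deterministic=True)
Since each element now produces exactly one serialized example, the pipeline step would use beam.Map(to_tf_example) rather than beam.FlatMap(to_tf_example).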
It would probably make sense to look at using TFX.

Change Shape of Array in Numpy

I have an array with shape (144,).
I have a data array with shape (2, 144): for example, readings from two sensors, where every reading has 144 values.
I would like to attach a time stamp to each sensor reading, in order to obtain an array of shape (2, 144, 2): the first axis is the number of sensors, the second the number of readings, and the third the number of entries in each record, in this case 2 because I attached the time axis.
I first tried to reshape the time axis vector to match the right shape, with:
np.broadcast_to(time_axis,(144,2))
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (144,) and requested shape (144,2)
I tried also with:
numOfVec = 2
num = 144
time_axis = np.broadcast_to(time_axis,(numOfVec,num)).T
# Add time axis
out = np.vstack((time_axis,synthetic.T))
UPDATE
I tried the hint given in a comment:
time_axis = self.datetime_range(10)
time_axis = np.reshape(time_axis,(1,num))
time_axis = np.repeat(time_axis,numOfVec,axis=0)
# Add time axis
out = np.stack((time_axis,synthetic))
It works but since I have to jsonify the data, the result is not correct:
"data": [
[
[
"00:00:00",
"00:10:00",
"00:20:00",
"00:30:00",
...
]
]
I would like to obtain something like this:
"data": [
[
[
"00:00:00",
"19.2"
],
[
"00:10:00",
"29.1"
]
]
]
I found the solution:
# Convert to a 2D array
time_axis = np.reshape(time_axis, (num, 1))
# Add a third, leading dimension
time_axis = np.expand_dims(time_axis, axis=0)
# Repeat the time axis once per sensor along the first dimension
time_axis = np.repeat(time_axis, numOfVec, axis=0)
# Attach the time axis to the sensor readings along the last dimension (axis=2)
out = np.concatenate((time_axis, synthetic), axis=2)
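A self-contained sketch of the same steps with the shapes written out; note it assumes the readings array also needs a trailing length-1 axis before concatenating (synthetic and time_axis here are illustrative stand-ins):
import numpy as np

numOfVec, num = 2, 144
synthetic = np.random.rand(numOfVec, num)           # sensor readings, shape (2, 144)
time_axis = np.arange(num, dtype=float)             # timestamps, shape (144,)

time_axis = np.reshape(time_axis, (num, 1))         # (144, 1)
time_axis = np.expand_dims(time_axis, axis=0)       # (1, 144, 1)
time_axis = np.repeat(time_axis, numOfVec, axis=0)  # (2, 144, 1)

readings = synthetic[..., np.newaxis]               # (2, 144, 1)
out = np.concatenate((time_axis, readings), axis=2)
print(out.shape)                                    # (2, 144, 2)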

N-grams based on POS tags : Spacy

I have a list of 20 rules for extracting trigram chunks from a sentence with spaCy.
The chunks are defined by POS-tag trigrams:
Rule 1: [VERB, ADJ, NOUN]
Rule 2: [NOUN, VERB, ADV]
Rule 3: [NOUN, ADP, NOUN] etc.
Example Input:
"Education of children was our revenue earning secondary business."
Desired Output:
["Education of children","earning secondary business"]
I have already tried spaCy's Matcher and need something more optimised than running a for loop, as the dataset is very large.
I think you are looking for rule-based matching. Your code will look something like:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
list_of_rules = [
    ["VERB", "ADJ", "NOUN"],
    ["NOUN", "VERB", "ADV"],
    ["NOUN", "ADP", "NOUN"],
    # more rules here...
]
rules = [[{"POS": i} for i in j] for j in list_of_rules]
matcher = Matcher(nlp.vocab)
matcher.add("rules", None, *rules)
doc = nlp("Education of children was our revenue earning secondary business.")
matches = matcher(doc)
print([doc[start:end].text for _, start, end in matches])
which will print
['Education of children', 'earning secondary business']
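Note that matcher.add("rules", None, *rules) is the spaCy v2 calling convention. On spaCy v3, the patterns are passed as a single list and the on_match callback becomes a keyword argument, so that line would instead be:
# spaCy v3.x: pass the list of patterns directly (on_match is a keyword argument)
matcher.add("rules", rules)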