Nextflow: split or subset a channel containing tuples

I have joined two channels as a way of filtering out items that do not have all the necessary files. The resulting items of the joined channel look like:
[sample1, [sample1.csv], [sample1_1.fastq, sample1_2.fastq]]
I now wish to remove the csv entry so that the items of the channel have the form:
[sample1, [sample1_1.fastq, sample1_2.fastq]]
for use in existing downstream processes.
I've been looking at multiMap and branch but can't seem to find anything that does what I want. What am I missing?

You can use the map operator for this:
joined_ch
    .map { sample, csv_files, fastq_files ->
        tuple( sample, fastq_files )
    }
    .view()
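For the joined item shown above, this emits [sample1, [sample1_1.fastq, sample1_2.fastq]], which is exactly the shape your downstream processes expect: the closure destructures each three-element tuple and re-emits only the sample ID and the FASTQ list, dropping the CSV files. multiMap and branch are for splitting one channel into several output channels; for a simple per-item transformation like this, map is the tool you want.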

Related

Extract key value pair from JSON string dynamically (Redshift)

I am working with a column which stores the data for camera effects usage in a JSON string.
The values inside it look something like this:
{"camera": {"GIFs": ["floss_dance", "kermit"], "filters": ["blur"], "GIF_count": 2, "Filter_count": 1}}
If I want to extract data for GIFs, I use this code:
json_extract_path_text(camera_effects, 'camera', 'GIFs') which will yield the result as ["floss_dance", "kermit"]
If I want to extract particular GIF names, I use json_extract_array_element_text(camera_effects, 'camera', 'GIFs') which will give me floss_dance and kermit in separate rows.
My issue is that we keep adding new effects to our camera, which means the number of keys under 'camera' will keep increasing, so json_extract_path_text(camera_effects, 'camera', 'name_of_effect') is not dynamic.
Is there a way to extract a list of all key:value pairs that exist for 'camera' and then use the keys as a column, and values as rows?
Note: I am using Redshift SQL

Read specific file names in ADF pipeline

I have a requirement where blob storage has multiple files named file_1.csv, file_2.csv, file_3.csv, file_4.csv, file_5.csv, file_6.csv, file_7.csv. From these I have to read only files 5 to 7.
How can we achieve this in an ADF/Synapse pipeline?
I have reproduced this in my lab; please see the repro steps below.
ADF:
Using the Get Metadata activity, get a list of all files.
(Parameterize the source file name in the source dataset and pass '*' in the dataset parameters to get all files.)
Pass the Get Metadata output child items to ForEach activity.
@activity('Get Metadata1').output.childItems
Add an If Condition activity inside the ForEach and set the True case expression so that only the required files are copied to the sink (the digit in file_N.csv sits at zero-based index 5):
@and(greater(int(substring(item().name,5,1)),4),lessOrEquals(int(substring(item().name,5,1)),7))
When the If Condition is True, add a Copy data activity to copy the current item (file) to the sink.
I took a slightly different approach, using a Filter activity and the endsWith function:
The filter expression is:
@or(or(endsWith(item().name, '_5.csv'),endsWith(item().name, '_6.csv')),endsWith(item().name, '_7.csv'))
Slightly different approaches, similar results, it depends what you need.
You can always do what @NiharikaMoola-MT suggested. But since you already know the range of the files (5-7), I suggest the following:
Declare two parameters for the lower and upper limits of the range.
Create a ForEach loop and pass it a range built from the parameters; note that ADF's range(startIndex, count) takes a count rather than an end index: @range(lowerlimit, add(sub(upperlimit, lowerlimit), 1))
Create a parameterized dataset for the source.
Use the file number from the ForEach loop to build a dynamic file name such as:
@concat('file_', string(item()), '.csv')

How to specify which key/value pairs to exclude in spaCy's Doc.to_disk(path, exclude=['user_data'])?

My nlp pipeline has some doc extensions that store 3 items (a string for file name and two dicts which map non-serializable objects). I'd like only to exclude the non-serializable key/value pairs in the user data, but keep the filename.
doc.to_disk(path, exclude=['user_data'])
works as expected, excluding all user data. There are apparently options to instead exclude either 'user_data_keys' or 'user_data_values' but I find no explanation of their usage, and furthermore I can't think of any good reason to store either all the keys without the values or all the values without the keys!
I would like to exclude both keys and values of only certain fields in the doc.user_data. If this is possible, how is it done?
You will need to specify which keys or values you want to exclude.
https://spacy.io/api/doc#serialization-fields
data = doc.to_bytes(exclude=["text", "tensor"])
doc.from_disk("./doc.bin", exclude=["user_data"])
Per this thread, you can try the following workaround:
# Workaround: blank out all user data and all extension attributes before serializing
def remove_unserializable_results(doc):
    doc.user_data = {}
    for x in dir(doc._):
        if x in ['get', 'set', 'has']:
            continue
        setattr(doc._, x, None)
    for token in doc:
        for x in dir(token._):
            if x in ['get', 'set', 'has']:
                continue
            setattr(token._, x, None)
    return doc

# spaCy v2-style registration of the function as the last pipeline component
nlp.add_pipe(remove_unserializable_results, last=True)
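If you want to keep some user data (such as the file name) and drop only the non-serializable entries, note that extension attributes are stored in doc.user_data under tuple keys of the form ('._.', attr_name, start, end). A minimal sketch along those lines, assuming hypothetical extension names obj_map1 and obj_map2 for the two non-serializable dicts:
def drop_unserializable_extensions(doc):
    # Extension data lives in doc.user_data; delete only the offending attributes
    for key in list(doc.user_data):
        if isinstance(key, tuple) and len(key) == 4 and key[1] in ('obj_map1', 'obj_map2'):
            del doc.user_data[key]
    return doc

doc = drop_unserializable_extensions(doc)
doc.to_disk(path)  # the file-name extension is kept and serialized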

How to show Feature Names in Graphviz?

I'm building a tree in Graphviz and I can't seem to get the feature names to show up. I have defined a list with the feature names like so:
names = list(df.columns.values)
Which prints:
['Gender',
'SuperStrength',
'Mask',
'Cape',
'Tie',
'Bald',
'Pointy Ears',
'Smokes']
So the list is being created, later I build the tree like so:
export_graphviz(tree, out_file=ddata, filled=True, rounded=True, special_characters=False, impurity=False, feature_names=names)
But the final image still has the features labeled like X[3]:
How can I get the actual feature names to show up? (Cape instead of X[3], etc.)
I can only imagine this has to do with passing the names as an array of the values. It works fine if you pass the columns directly:
export_graphviz(tree, out_file=ddata, filled=True, rounded=True, special_characters=False, impurity=False, feature_names=df.columns)
If needed, you can also slice the columns:
export_graphviz(tree, out_file=ddata, filled=True, rounded=True, special_characters=False, impurity=False, feature_names=df.columns[5:])
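For reference, here is a minimal end-to-end sketch (with a hypothetical toy dataset built from a few of the columns in the question) showing the feature names appearing in the exported DOT output:
from io import StringIO

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical toy data using a few of the columns from the question
df = pd.DataFrame({
    'Gender':        [0, 1, 0, 1],
    'SuperStrength': [1, 0, 0, 1],
    'Mask':          [1, 1, 0, 0],
    'Cape':          [1, 0, 1, 0],
})
y = [1, 0, 1, 0]  # hypothetical labels

tree = DecisionTreeClassifier().fit(df, y)

ddata = StringIO()
export_graphviz(tree, out_file=ddata, filled=True, rounded=True,
                special_characters=False, impurity=False,
                feature_names=df.columns)
print(ddata.getvalue())  # node labels now read e.g. 'Cape' instead of X[3]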

Need help on using Spark Filter

I am new to Apache Spark and need help forming either a SQL query or a Spark filter on a DataFrame.
Below is how my data is formed, i.e. I have a large number of users, each with data like the following:
{ "User1":"Joey", "Department": ["History","Maths","Geography"] }
I have multiple search conditions like the one below, where I need to search the array of data based on an operator defined by the user, for example and / or.
{
  "SearchCondition": "1",
  "Operator": "and",
  "Department": ["Maths", "Geography"]
}
Can you point me to a way of achieving this in Spark?
Thanks,
-Jack
I assume you are using Scala and have parsed the data into a DataFrame:
val df = spark.read.json(pathToFile)
I would use Datasets for this because they provide type safety:
import spark.implicits._

case class User(department: Array[String], user1: String)

val ds = df.as[User]

def pred(user: User): Boolean =
  Set("Geography", "Maths").subsetOf(user.department.toSet)

ds.filter(pred _)
You can read more about Datasets in the Spark documentation.
If you prefer to use DataFrames, you can do it with a user-defined function:
import org.apache.spark.sql.functions._

val pred = udf((arr: Seq[String]) => Set("Geography", "Maths").subsetOf(arr.toSet))

df.filter(pred($"Department"))
In the same package you can find a Spark built-in function for this. You can do:
df.filter(array_contains($"Department", "Maths")).filter(array_contains($"Department", "Geography"))
but someone could argue that this is not as efficient and that the optimizer can't improve it much.
Note that for each search condition you need a different predicate.
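If you happen to be on PySpark instead, here is a minimal sketch of the same built-in approach (the file path is a placeholder, and both operator variants are shown for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains

spark = SparkSession.builder.getOrCreate()
df = spark.read.json('path/to/users.json')  # placeholder path

# 'and' semantics: both departments must be present
both = df.filter(array_contains('Department', 'Maths') &
                 array_contains('Department', 'Geography'))

# 'or' semantics: either department matches
either = df.filter(array_contains('Department', 'Maths') |
                   array_contains('Department', 'Geography'))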